Programming Languages
SQL

SQL

Structured Query Language (SQL) is the standard language for managing and manipulating relational databases. It's essential for data querying, transformation, and database operations in data engineering.

Core Concepts

Data Definition Language (DDL)

  • CREATE: Create tables, indexes, and schemas
  • ALTER: Modify existing database structures
  • DROP: Remove database objects
  • TRUNCATE: Remove all data from tables

Data Manipulation Language (DML)

  • SELECT: Query and retrieve data
  • INSERT: Add new records
  • UPDATE: Modify existing records
  • DELETE: Remove records

Data Control Language (DCL)

  • GRANT: Assign permissions
  • REVOKE: Remove permissions
  • COMMIT: Save transaction changes
  • ROLLBACK: Undo transaction changes

Essential SQL Operations

Data Querying

-- Basic querying with filtering and aggregation
SELECT 
    department,
    COUNT(*) as employee_count,
    AVG(salary) as avg_salary,
    MAX(hire_date) as latest_hire
FROM employees 
WHERE status = 'active'
GROUP BY department
HAVING COUNT(*) > 5
ORDER BY avg_salary DESC;

Window Functions

-- Advanced analytics with window functions
SELECT 
    employee_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as rank_in_dept,
    LAG(salary) OVER (ORDER BY hire_date) as previous_salary,
    SUM(salary) OVER (PARTITION BY department) as dept_total_salary
FROM employees;

Common Table Expressions (CTEs)

-- Complex queries with CTEs for readability
WITH monthly_sales AS (
    SELECT 
        DATE_TRUNC('month', order_date) as month,
        SUM(amount) as total_sales
    FROM orders
    GROUP BY DATE_TRUNC('month', order_date)
),
moving_averages AS (
    SELECT 
        month,
        total_sales,
        AVG(total_sales) OVER (
            ORDER BY month 
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) as three_month_avg
    FROM monthly_sales
)
SELECT * FROM moving_averages ORDER BY month;

SQL Variants & Dialects

PostgreSQL

  • JSON support: Native JSON and JSONB data types
  • Array operations: Advanced array functions
  • Full-text search: Built-in text search capabilities
  • Extensions: PostGIS for geographic data

MySQL

  • Performance: Optimized for web applications
  • Replication: Master-slave and master-master setups
  • Storage engines: InnoDB, MyISAM options
  • JSON functions: JSON data type support

BigQuery SQL

  • Scalability: Petabyte-scale analytics
  • Serverless: No infrastructure management
  • Machine learning: ML functions in SQL
  • Nested data: Support for arrays and structs

Spark SQL

  • Distributed: Process large datasets across clusters
  • DataFrame API: Integration with Spark DataFrames
  • Streaming: Real-time data processing
  • Multiple formats: Parquet, JSON, CSV support

Data Engineering Applications

ETL Operations

-- Extract, Transform, Load example
INSERT INTO fact_sales (
    date_key,
    product_key,
    customer_key,
    quantity,
    revenue
)
SELECT 
    d.date_key,
    p.product_key,
    c.customer_key,
    s.quantity,
    s.quantity * s.unit_price as revenue
FROM staging_sales s
JOIN dim_date d ON s.order_date = d.date
JOIN dim_product p ON s.product_id = p.product_id
JOIN dim_customer c ON s.customer_id = c.customer_id
WHERE s.processed_flag = FALSE;

Data Quality Checks

-- Data validation and quality assessment
SELECT 
    'customers' as table_name,
    COUNT(*) as total_rows,
    COUNT(DISTINCT customer_id) as unique_customers,
    COUNT(*) - COUNT(email) as missing_emails,
    COUNT(*) - COUNT(phone) as missing_phones,
    COUNT(*) FILTER (WHERE created_date > CURRENT_DATE) as future_dates
FROM customers;

Performance Monitoring

-- Query performance analysis
SELECT 
    query,
    calls,
    total_time,
    mean_time,
    max_time,
    rows
FROM pg_stat_statements
WHERE calls > 100
ORDER BY total_time DESC
LIMIT 10;

Advanced Techniques

Query Optimization

  1. Index usage: Create appropriate indexes
  2. Query planning: Understand execution plans
  3. Statistics: Keep table statistics updated
  4. Partitioning: Partition large tables

Data Warehousing Patterns

  1. Star schema: Fact and dimension tables
  2. Snowflake schema: Normalized dimensions
  3. Data vault: Scalable modeling approach
  4. Temporal tables: Track historical changes

Modern SQL Features

  1. JSON/XML: Semi-structured data support
  2. Recursive queries: Hierarchical data processing
  3. Materialized views: Pre-computed results
  4. Stored procedures: Reusable database logic

Best Practices

Query Writing

  1. Readable code: Use proper formatting and comments
  2. Explicit joins: Avoid implicit joins
  3. Column specificity: Select only needed columns
  4. Filter early: Apply WHERE clauses early

Performance

  1. Index strategy: Create indexes on frequently queried columns
  2. Query caching: Leverage query result caching
  3. Batch operations: Process data in batches
  4. Connection pooling: Manage database connections efficiently

Security

  1. Parameterized queries: Prevent SQL injection
  2. Least privilege: Grant minimal necessary permissions
  3. Data masking: Protect sensitive information
  4. Audit logging: Track database access and changes

Learning Resources

Fundamentals

  • SQL tutorials: Interactive learning platforms
  • Database design: Normalization and modeling
  • Query optimization: Performance tuning guides
  • Vendor documentation: Platform-specific features

Advanced Topics

  • Data warehousing: Dimensional modeling
  • Analytics: Statistical functions and window operations
  • BigQuery: Google Cloud analytics platform
  • Spark SQL: Distributed data processing

When to Use SQL

Ideal For

  • Structured data analysis
  • Data transformation and cleaning
  • Reporting and dashboards
  • Data warehousing operations
  • Business intelligence queries

Consider Alternatives When

  • Unstructured data processing
  • Real-time stream processing
  • Complex machine learning pipelines
  • Graph database operations

SQL remains the cornerstone of data engineering, providing powerful capabilities for data manipulation, analysis, and reporting across various database systems and platforms.