SQL
Structured Query Language (SQL) is the standard language for managing and manipulating relational databases. It's essential for data querying, transformation, and database operations in data engineering.
Core Concepts
Data Definition Language (DDL)
- CREATE: Create tables, indexes, and schemas
- ALTER: Modify existing database structures
- DROP: Remove database objects
- TRUNCATE: Remove all data from tables
Data Manipulation Language (DML)
- SELECT: Query and retrieve data
- INSERT: Add new records
- UPDATE: Modify existing records
- DELETE: Remove records
Data Control Language (DCL)
- GRANT: Assign permissions
- REVOKE: Remove permissions
- COMMIT: Save transaction changes
- ROLLBACK: Undo transaction changes
Essential SQL Operations
Data Querying
-- Basic querying with filtering and aggregation
SELECT
department,
COUNT(*) as employee_count,
AVG(salary) as avg_salary,
MAX(hire_date) as latest_hire
FROM employees
WHERE status = 'active'
GROUP BY department
HAVING COUNT(*) > 5
ORDER BY avg_salary DESC;
Window Functions
-- Advanced analytics with window functions
SELECT
employee_id,
salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as rank_in_dept,
LAG(salary) OVER (ORDER BY hire_date) as previous_salary,
SUM(salary) OVER (PARTITION BY department) as dept_total_salary
FROM employees;
Common Table Expressions (CTEs)
-- Complex queries with CTEs for readability
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(amount) as total_sales
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
),
moving_averages AS (
SELECT
month,
total_sales,
AVG(total_sales) OVER (
ORDER BY month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) as three_month_avg
FROM monthly_sales
)
SELECT * FROM moving_averages ORDER BY month;
SQL Variants & Dialects
PostgreSQL
- JSON support: Native JSON and JSONB data types
- Array operations: Advanced array functions
- Full-text search: Built-in text search capabilities
- Extensions: PostGIS for geographic data
MySQL
- Performance: Optimized for web applications
- Replication: Master-slave and master-master setups
- Storage engines: InnoDB, MyISAM options
- JSON functions: JSON data type support
BigQuery SQL
- Scalability: Petabyte-scale analytics
- Serverless: No infrastructure management
- Machine learning: ML functions in SQL
- Nested data: Support for arrays and structs
Spark SQL
- Distributed: Process large datasets across clusters
- DataFrame API: Integration with Spark DataFrames
- Streaming: Real-time data processing
- Multiple formats: Parquet, JSON, CSV support
Data Engineering Applications
ETL Operations
-- Extract, Transform, Load example
INSERT INTO fact_sales (
date_key,
product_key,
customer_key,
quantity,
revenue
)
SELECT
d.date_key,
p.product_key,
c.customer_key,
s.quantity,
s.quantity * s.unit_price as revenue
FROM staging_sales s
JOIN dim_date d ON s.order_date = d.date
JOIN dim_product p ON s.product_id = p.product_id
JOIN dim_customer c ON s.customer_id = c.customer_id
WHERE s.processed_flag = FALSE;
Data Quality Checks
-- Data validation and quality assessment
SELECT
'customers' as table_name,
COUNT(*) as total_rows,
COUNT(DISTINCT customer_id) as unique_customers,
COUNT(*) - COUNT(email) as missing_emails,
COUNT(*) - COUNT(phone) as missing_phones,
COUNT(*) FILTER (WHERE created_date > CURRENT_DATE) as future_dates
FROM customers;
Performance Monitoring
-- Query performance analysis
SELECT
query,
calls,
total_time,
mean_time,
max_time,
rows
FROM pg_stat_statements
WHERE calls > 100
ORDER BY total_time DESC
LIMIT 10;
Advanced Techniques
Query Optimization
- Index usage: Create appropriate indexes
- Query planning: Understand execution plans
- Statistics: Keep table statistics updated
- Partitioning: Partition large tables
Data Warehousing Patterns
- Star schema: Fact and dimension tables
- Snowflake schema: Normalized dimensions
- Data vault: Scalable modeling approach
- Temporal tables: Track historical changes
Modern SQL Features
- JSON/XML: Semi-structured data support
- Recursive queries: Hierarchical data processing
- Materialized views: Pre-computed results
- Stored procedures: Reusable database logic
Best Practices
Query Writing
- Readable code: Use proper formatting and comments
- Explicit joins: Avoid implicit joins
- Column specificity: Select only needed columns
- Filter early: Apply WHERE clauses early
Performance
- Index strategy: Create indexes on frequently queried columns
- Query caching: Leverage query result caching
- Batch operations: Process data in batches
- Connection pooling: Manage database connections efficiently
Security
- Parameterized queries: Prevent SQL injection
- Least privilege: Grant minimal necessary permissions
- Data masking: Protect sensitive information
- Audit logging: Track database access and changes
Learning Resources
Fundamentals
- SQL tutorials: Interactive learning platforms
- Database design: Normalization and modeling
- Query optimization: Performance tuning guides
- Vendor documentation: Platform-specific features
Advanced Topics
- Data warehousing: Dimensional modeling
- Analytics: Statistical functions and window operations
- BigQuery: Google Cloud analytics platform
- Spark SQL: Distributed data processing
When to Use SQL
Ideal For
- Structured data analysis
- Data transformation and cleaning
- Reporting and dashboards
- Data warehousing operations
- Business intelligence queries
Consider Alternatives When
- Unstructured data processing
- Real-time stream processing
- Complex machine learning pipelines
- Graph database operations
SQL remains the cornerstone of data engineering, providing powerful capabilities for data manipulation, analysis, and reporting across various database systems and platforms.