Programming Languages
SQL

SQL

Structured Query Language (SQL) is the standard language for managing and manipulating relational databases. It's essential for data querying, transformation, and database operations in data engineering.

Core Concepts

Data Definition Language (DDL)

  • CREATE: Create tables, indexes, and schemas
  • ALTER: Modify existing database structures
  • DROP: Remove database objects
  • TRUNCATE: Remove all data from tables

Data Manipulation Language (DML)

  • SELECT: Query and retrieve data
  • INSERT: Add new records
  • UPDATE: Modify existing records
  • DELETE: Remove records

Data Control Language (DCL)

  • GRANT: Assign permissions
  • REVOKE: Remove permissions
  • COMMIT: Save transaction changes
  • ROLLBACK: Undo transaction changes

SQL Joins

SQL joins are fundamental operations that combine data from multiple tables based on relationships between columns. Understanding different join types is crucial for effective data retrieval and analysis.

Sample Tables for Examples

-- Employees table
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name VARCHAR(100),
    department_id INT,
    salary DECIMAL(10,2),
    hire_date DATE
);
 
-- Departments table
CREATE TABLE departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(100),
    location VARCHAR(100)
);
 
-- Projects table
CREATE TABLE projects (
    project_id INT PRIMARY KEY,
    project_name VARCHAR(100),
    employee_id INT,
    budget DECIMAL(12,2)
);

INNER JOIN

Returns only rows that have matching values in both tables. This is the most common join type.

-- Get employee names with their department information
SELECT 
    e.name,
    e.salary,
    d.department_name,
    d.location
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;
 
-- Result: Only employees who belong to existing departments
-- Employees without departments and departments without employees are excluded

Use Cases:

  • Customer orders with product details
  • Employee information with department data
  • Sales records with customer information

LEFT JOIN (LEFT OUTER JOIN)

Returns all rows from the left table and matching rows from the right table. Non-matching rows from the right table show NULL values.

-- Get all employees with their department info (including employees without departments)
SELECT 
    e.name,
    e.salary,
    d.department_name,
    d.location
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id;
 
-- Result: All employees are included
-- Employees without departments will have NULL for department fields

Use Cases:

  • All customers with their recent orders (including customers who haven't ordered)
  • All products with their sales data (including products never sold)
  • Complete employee list with optional project assignments

RIGHT JOIN (RIGHT OUTER JOIN)

Returns all rows from the right table and matching rows from the left table. Non-matching rows from the left table show NULL values.

-- Get all departments with their employees (including empty departments)
SELECT 
    e.name,
    e.salary,
    d.department_name,
    d.location
FROM employees e
RIGHT JOIN departments d ON e.department_id = d.department_id;
 
-- Result: All departments are included
-- Departments without employees will have NULL for employee fields

Use Cases:

  • All product categories with their products (including empty categories)
  • All regions with their sales representatives (including uncovered regions)
  • Complete department structure with current staffing

FULL OUTER JOIN

Returns all rows from both tables, with NULL values for non-matching rows on either side.

-- Get all employees and all departments, showing relationships where they exist
SELECT 
    e.name,
    e.salary,
    d.department_name,
    d.location
FROM employees e
FULL OUTER JOIN departments d ON e.department_id = d.department_id;
 
-- Result: All employees AND all departments are included
-- Unmatched records from both sides will have NULL values

Use Cases:

  • Complete data reconciliation between two systems
  • Comprehensive audit of master data relationships
  • Data migration validation scenarios

CROSS JOIN

Returns the Cartesian product of both tables - every row from the first table combined with every row from the second table.

-- Generate all possible employee-project combinations
SELECT 
    e.name,
    p.project_name,
    p.budget
FROM employees e
CROSS JOIN projects p;
 
-- Result: Every employee paired with every project
-- If 5 employees and 3 projects, result has 15 rows

Use Cases:

  • Generating test data combinations
  • Creating comprehensive comparison matrices
  • Building configuration tables with all possible combinations

SELF JOIN

Joins a table with itself, useful for hierarchical or comparative analysis.

-- Find employees and their managers (assuming manager_id column exists)
ALTER TABLE employees ADD COLUMN manager_id INT;
 
SELECT 
    emp.name as employee_name,
    mgr.name as manager_name,
    emp.salary as employee_salary,
    mgr.salary as manager_salary
FROM employees emp
LEFT JOIN employees mgr ON emp.manager_id = mgr.employee_id;
 
-- Find employees earning more than their department colleagues
SELECT 
    e1.name,
    e1.salary,
    e1.department_id
FROM employees e1
JOIN employees e2 ON e1.department_id = e2.department_id
WHERE e1.salary > e2.salary;

Use Cases:

  • Organizational hierarchies (employee-manager relationships)
  • Comparing records within the same table
  • Finding duplicates or similar records

Multiple Table Joins

Complex queries often require joining multiple tables together.

-- Get comprehensive employee project information
SELECT 
    e.name as employee_name,
    d.department_name,
    p.project_name,
    p.budget,
    e.salary
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id
LEFT JOIN projects p ON e.employee_id = p.employee_id
ORDER BY d.department_name, e.name;
 
-- Multi-step relationship joins
SELECT 
    e.name,
    d.department_name,
    COUNT(p.project_id) as project_count,
    SUM(p.budget) as total_budget
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
LEFT JOIN projects p ON e.employee_id = p.employee_id
GROUP BY e.employee_id, e.name, d.department_name
HAVING COUNT(p.project_id) > 0;

Advanced Join Patterns

Conditional Joins

-- Join with additional conditions
SELECT 
    e.name,
    d.department_name,
    p.project_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id
JOIN projects p ON e.employee_id = p.employee_id 
    AND p.budget > 100000; -- Additional join condition

Join with Aggregations

-- Join with subqueries and aggregations
SELECT 
    d.department_name,
    dept_stats.avg_salary,
    dept_stats.employee_count,
    dept_stats.total_budget
FROM departments d
JOIN (
    SELECT 
        e.department_id,
        AVG(e.salary) as avg_salary,
        COUNT(e.employee_id) as employee_count,
        SUM(p.budget) as total_budget
    FROM employees e
    LEFT JOIN projects p ON e.employee_id = p.employee_id
    GROUP BY e.department_id
) dept_stats ON d.department_id = dept_stats.department_id;

EXISTS vs JOIN

-- Using EXISTS (often more efficient for existence checks)
SELECT e.name, e.salary
FROM employees e
WHERE EXISTS (
    SELECT 1 
    FROM projects p 
    WHERE p.employee_id = e.employee_id
);
 
-- Equivalent JOIN approach
SELECT DISTINCT e.name, e.salary
FROM employees e
INNER JOIN projects p ON e.employee_id = p.employee_id;

Join Performance Optimization

Index Optimization

-- Create indexes on join columns
CREATE INDEX idx_employee_dept ON employees(department_id);
CREATE INDEX idx_project_emp ON projects(employee_id);
CREATE INDEX idx_dept_id ON departments(department_id);

Query Hints and Optimization

-- Use appropriate join order for better performance
SELECT e.name, d.department_name
FROM departments d  -- Start with smaller table
INNER JOIN employees e ON d.department_id = e.department_id
WHERE d.location = 'New York';  -- Apply filters early

Common Join Pitfalls

Cartesian Products

-- Dangerous: Missing join condition creates Cartesian product
SELECT e.name, d.department_name
FROM employees e, departments d; -- Avoid comma joins without WHERE
 
-- Correct approach
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;

Data Type Mismatches

-- Ensure compatible data types in join conditions
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON CAST(e.department_id AS VARCHAR) = d.dept_code;

Industry Examples

E-commerce Order Analysis

-- Customer order analysis with product details
SELECT 
    c.customer_name,
    o.order_date,
    p.product_name,
    oi.quantity,
    p.price * oi.quantity as line_total
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= '2024-01-01';

Financial Transaction Monitoring

-- Account transaction analysis with customer information
SELECT 
    acc.account_number,
    cust.customer_name,
    trans.transaction_date,
    trans.amount,
    trans.transaction_type,
    acc.current_balance
FROM accounts acc
INNER JOIN customers cust ON acc.customer_id = cust.customer_id
LEFT JOIN transactions trans ON acc.account_id = trans.account_id
WHERE trans.transaction_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY trans.transaction_date DESC;

Essential SQL Operations

Data Querying

-- Basic querying with filtering and aggregation
SELECT 
    department,
    COUNT(*) as employee_count,
    AVG(salary) as avg_salary,
    MAX(hire_date) as latest_hire
FROM employees 
WHERE status = 'active'
GROUP BY department
HAVING COUNT(*) > 5
ORDER BY avg_salary DESC;

Window Functions

-- Advanced analytics with window functions
SELECT 
    employee_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as rank_in_dept,
    LAG(salary) OVER (ORDER BY hire_date) as previous_salary,
    SUM(salary) OVER (PARTITION BY department) as dept_total_salary
FROM employees;

Common Table Expressions (CTEs)

-- Complex queries with CTEs for readability
WITH monthly_sales AS (
    SELECT 
        DATE_TRUNC('month', order_date) as month,
        SUM(amount) as total_sales
    FROM orders
    GROUP BY DATE_TRUNC('month', order_date)
),
moving_averages AS (
    SELECT 
        month,
        total_sales,
        AVG(total_sales) OVER (
            ORDER BY month 
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) as three_month_avg
    FROM monthly_sales
)
SELECT * FROM moving_averages ORDER BY month;

SQL Variants & Dialects

PostgreSQL

  • JSON support: Native JSON and JSONB data types
  • Array operations: Advanced array functions
  • Full-text search: Built-in text search capabilities
  • Extensions: PostGIS for geographic data

MySQL

  • Performance: Optimized for web applications
  • Replication: Master-slave and master-master setups
  • Storage engines: InnoDB, MyISAM options
  • JSON functions: JSON data type support

BigQuery SQL

  • Scalability: Petabyte-scale analytics
  • Serverless: No infrastructure management
  • Machine learning: ML functions in SQL
  • Nested data: Support for arrays and structs

Spark SQL

  • Distributed: Process large datasets across clusters
  • DataFrame API: Integration with Spark DataFrames
  • Streaming: Real-time data processing
  • Multiple formats: Parquet, JSON, CSV support

Data Engineering Applications

ETL Operations

-- Extract, Transform, Load example
INSERT INTO fact_sales (
    date_key,
    product_key,
    customer_key,
    quantity,
    revenue
)
SELECT 
    d.date_key,
    p.product_key,
    c.customer_key,
    s.quantity,
    s.quantity * s.unit_price as revenue
FROM staging_sales s
JOIN dim_date d ON s.order_date = d.date
JOIN dim_product p ON s.product_id = p.product_id
JOIN dim_customer c ON s.customer_id = c.customer_id
WHERE s.processed_flag = FALSE;

Data Quality Checks

-- Data validation and quality assessment
SELECT 
    'customers' as table_name,
    COUNT(*) as total_rows,
    COUNT(DISTINCT customer_id) as unique_customers,
    COUNT(*) - COUNT(email) as missing_emails,
    COUNT(*) - COUNT(phone) as missing_phones,
    COUNT(*) FILTER (WHERE created_date > CURRENT_DATE) as future_dates
FROM customers;

Performance Monitoring

-- Query performance analysis
SELECT 
    query,
    calls,
    total_time,
    mean_time,
    max_time,
    rows
FROM pg_stat_statements
WHERE calls > 100
ORDER BY total_time DESC
LIMIT 10;

Advanced Techniques

Query Optimization

  1. Index usage: Create appropriate indexes
  2. Query planning: Understand execution plans
  3. Statistics: Keep table statistics updated
  4. Partitioning: Partition large tables

Data Warehousing Patterns

  1. Star schema: Fact and dimension tables
  2. Snowflake schema: Normalized dimensions
  3. Data vault: Scalable modeling approach
  4. Temporal tables: Track historical changes

Modern SQL Features

  1. JSON/XML: Semi-structured data support
  2. Recursive queries: Hierarchical data processing
  3. Materialized views: Pre-computed results
  4. Stored procedures: Reusable database logic

Best Practices

Query Writing

  1. Readable code: Use proper formatting and comments
  2. Explicit joins: Avoid implicit joins
  3. Column specificity: Select only needed columns
  4. Filter early: Apply WHERE clauses early

Performance

  1. Index strategy: Create indexes on frequently queried columns
  2. Query caching: Leverage query result caching
  3. Batch operations: Process data in batches
  4. Connection pooling: Manage database connections efficiently

Security

  1. Parameterized queries: Prevent SQL injection
  2. Least privilege: Grant minimal necessary permissions
  3. Data masking: Protect sensitive information
  4. Audit logging: Track database access and changes

Learning Resources

Fundamentals

  • SQL tutorials: Interactive learning platforms
  • Database design: Normalization and modeling
  • Query optimization: Performance tuning guides
  • Vendor documentation: Platform-specific features

Advanced Topics

  • Data warehousing: Dimensional modeling
  • Analytics: Statistical functions and window operations
  • BigQuery: Google Cloud analytics platform
  • Spark SQL: Distributed data processing

When to Use SQL

Ideal For

  • Structured data analysis
  • Data transformation and cleaning
  • Reporting and dashboards
  • Data warehousing operations
  • Business intelligence queries

Consider Alternatives When

  • Unstructured data processing
  • Real-time stream processing
  • Complex machine learning pipelines
  • Graph database operations

SQL remains the cornerstone of data engineering, providing powerful capabilities for data manipulation, analysis, and reporting across various database systems and platforms.


© 2025 Praba Siva. Personal Documentation Site.