API Fundamentals
Application Programming Interfaces (APIs) are the backbone of modern data engineering systems, enabling communication between different services, applications, and data sources. In data engineering contexts, APIs facilitate seamless integration between data pipelines, analytics platforms, and business applications, making them critical for building scalable, maintainable data architectures.
Core Philosophy
API design is fundamentally about building sustainable integration points that evolve with business needs while maintaining reliability and performance. Unlike point-to-point integrations, well-designed APIs create reusable interfaces that scale across the organization.
1. Contract-First Design
APIs must establish clear contracts before implementation:
- Define data models and schemas upfront
- Establish versioning strategies for backward compatibility
- Document expected behavior and error scenarios
- Plan for future extensibility without breaking changes
2. Data-Centric Integration
APIs in data engineering focus on data flow optimization:
- Minimize network round-trips for bulk operations
- Support streaming for real-time data processing
- Provide pagination for large datasets
- Enable efficient filtering and querying at the API level
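As a rough illustration of the pagination point above, the sketch below walks an offset/limit endpoint until every record has been read. The `iter_pages` and `fake_fetch` names, and the `data`/`meta.total` response envelope, are assumptions made for this example; `fake_fetch` stands in for a real HTTP call.

```python
from typing import Callable, Dict, Iterator


def iter_pages(fetch_page: Callable[[int, int], Dict], limit: int = 100) -> Iterator[dict]:
    """Yield every record from an offset/limit paginated endpoint."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        records = page["data"]
        yield from records
        offset += len(records)
        # Stop on an empty page or once the reported total has been read.
        if not records or offset >= page["meta"]["total"]:
            break


# Hypothetical in-memory stand-in for the remote API (assumption for this sketch).
_ROWS = [{"id": i} for i in range(250)]


def fake_fetch(limit: int, offset: int) -> Dict:
    return {
        "data": _ROWS[offset:offset + limit],
        "meta": {"total": len(_ROWS), "limit": limit, "offset": offset},
    }


all_rows = list(iter_pages(fake_fetch, limit=100))
```

Because `iter_pages` is a generator, a pipeline can process records as they arrive instead of holding the full dataset in memory.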
3. Observability by Design
Production APIs require comprehensive monitoring:
- Built-in metrics for latency, throughput, and error rates
- Distributed tracing for complex data pipelines
- Structured logging for debugging and audit trails
- Health checks and dependency monitoring
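One lightweight way to get call counts, error counts, and latency is to wrap handlers in a decorator. The sketch below keeps counters in a plain dictionary; the `METRICS` store and the `instrumented` decorator are hypothetical names, and a production service would more likely export these values to a metrics backend.

```python
import time
from collections import defaultdict
from functools import wraps

# In-process counters keyed by operation name (assumption: single process, no exporter).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Record call count, error count, and cumulative latency for a function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[name]["errors"] += 1
                raise
            finally:
                # Runs for both successful and failed calls.
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("get_user")
def get_user(user_id):
    if user_id < 0:
        raise ValueError("invalid id")
    return {"id": user_id}


get_user(1)
try:
    get_user(-1)
except ValueError:
    pass
```

Error rate can then be derived as `errors / calls`, and average latency as `total_seconds / calls`.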
4. Security as Foundation
Data APIs handle sensitive information requiring robust security:
- Authentication and authorization at multiple levels
- Data encryption in transit and at rest
- Rate limiting and DDoS protection
- Audit logging for compliance requirements
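Rate limiting is often implemented with a token bucket. The following is a minimal single-process sketch, not a production limiter; the `TokenBucket` class and its injectable `clock` parameter are assumptions introduced here so the behavior can be shown deterministically.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Deterministic demonstration with a fake clock (assumption for this sketch).
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]  # third call exceeds the burst
fake_now[0] = 1.0  # one second later, one token has been refilled
after_refill = bucket.allow()
```

A request rejected by `allow()` would typically be answered with `429 Too Many Requests`.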
API Architecture Patterns
The main API styles differ in how they define contracts, move data, and fit particular use cases:
Types of APIs
REST APIs
Representational State Transfer (REST) is the most common architectural style for web APIs.
# RESTful API examples
GET /api/users/123 # Retrieve user
POST /api/users # Create new user
PUT /api/users/123 # Update user completely
PATCH /api/users/123 # Update user partially
DELETE /api/users/123 # Delete user
# Query parameters
GET /api/users?limit=10&offset=20&sort=created_at
REST Principles:
- Stateless: Each request contains all necessary information
- Client-Server: Clear separation of concerns
- Cacheable: Responses should indicate if they can be cached
- Uniform Interface: Consistent interaction patterns
- Layered System: Architecture can have multiple layers
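The cacheable constraint is commonly realized with validators such as `ETag`. The sketch below is framework-independent and uses only the standard library; `make_etag` and `conditional_get` are hypothetical helpers, and the `max-age=60` value is an arbitrary assumption.

```python
import hashlib
import json


def make_etag(payload: dict) -> str:
    """Derive a stable validator from the serialized representation."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def conditional_get(payload: dict, if_none_match=None):
    """Return (status, headers, body); 304 when the client's copy is still current."""
    etag = make_etag(payload)
    headers = {"ETag": etag, "Cache-Control": "max-age=60"}
    if if_none_match == etag:
        return 304, headers, None
    return 200, headers, payload


user = {"id": "123", "name": "Ada"}
status, headers, body = conditional_get(user)
# The client repeats the request with the ETag it received.
status_again, _, body_again = conditional_get(user, if_none_match=headers["ETag"])
```

The second request carries no body, which saves bandwidth when clients poll for data that rarely changes.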
GraphQL APIs
GraphQL provides a query language for APIs and a runtime for executing those queries.
# GraphQL query example
query GetUserWithPosts($userId: ID!) {
  user(id: $userId) {
    id
    name
    email
    posts {
      id
      title
      content
      createdAt
    }
  }
}
# GraphQL mutation example
mutation CreatePost($input: CreatePostInput!) {
  createPost(input: $input) {
    id
    title
    author {
      name
    }
  }
}
GraphQL Benefits:
- Single endpoint for all operations
- Client specifies exactly what data to fetch
- Strong type system
- Real-time subscriptions
- Excellent tooling and introspection
gRPC APIs
gRPC is a high-performance, language-neutral RPC framework originally developed by Google.
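// user.proto
syntax = "proto3";

service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc CreateUser(CreateUserRequest) returns (User);
  rpc StreamUsers(StreamUsersRequest) returns (stream User);
}

message User {
  int32 id = 1;
  string name = 2;
  string email = 3;
  int64 created_at = 4;
}

message GetUserRequest {
  int32 id = 1;
}

// Illustrative request messages (fields are assumed) so the service compiles
message CreateUserRequest {
  string name = 1;
  string email = 2;
}

message StreamUsersRequest {}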
gRPC Advantages:
- High performance with Protocol Buffers
- Strongly typed contracts
- Bidirectional streaming
- Built-in load balancing and health checking
- Multi-language support
API Design Principles
RESTful Resource Design
# Good RESTful design
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone

from flask import Flask, request, jsonify

app = Flask(__name__)

# NOTE: user_service is assumed to be the application's service layer
# (get_users, count_users, create_user, ...); it is not defined in this snippet.


@dataclass
class User:
    id: str
    name: str
    email: str
    created_at: str


# Resource collection
@app.route('/api/v1/users', methods=['GET'])
def get_users():
    # Query parameters for filtering, sorting, pagination
    limit = request.args.get('limit', 10, type=int)
    offset = request.args.get('offset', 0, type=int)
    sort_by = request.args.get('sort', 'created_at')
    users = user_service.get_users(
        limit=limit,
        offset=offset,
        sort_by=sort_by
    )
    return jsonify({
        'data': [user.__dict__ for user in users],
        'meta': {
            'total': user_service.count_users(),
            'limit': limit,
            'offset': offset
        }
    })


@app.route('/api/v1/users', methods=['POST'])
def create_user():
    # silent=True returns None for a missing or malformed JSON body
    data = request.get_json(silent=True)
    # Validate input
    if not data or 'name' not in data or 'email' not in data:
        return jsonify({'error': 'Name and email are required'}), 400
    # Create user
    user = User(
        id=str(uuid.uuid4()),
        name=data['name'],
        email=data['email'],
        created_at=datetime.now(timezone.utc).isoformat()
    )
    created_user = user_service.create_user(user)
    return jsonify(created_user.__dict__), 201


# Individual resource
@app.route('/api/v1/users/<user_id>', methods=['GET'])
def get_user(user_id):
    user = user_service.get_user_by_id(user_id)
    if not user:
        return jsonify({'error': 'User not found'}), 404
    return jsonify(user.__dict__)


@app.route('/api/v1/users/<user_id>', methods=['PUT'])
def update_user(user_id):
    data = request.get_json(silent=True)
    if not data:
        return jsonify({'error': 'Request body is required'}), 400
    user = user_service.get_user_by_id(user_id)
    if not user:
        return jsonify({'error': 'User not found'}), 404
    # Update user
    updated_user = user_service.update_user(user_id, data)
    return jsonify(updated_user.__dict__)


@app.route('/api/v1/users/<user_id>', methods=['DELETE'])
def delete_user(user_id):
    success = user_service.delete_user(user_id)
    if not success:
        return jsonify({'error': 'User not found'}), 404
    return '', 204
API Versioning Strategies
URL Path Versioning
GET /api/v1/users
GET /api/v2/users
Header Versioning
GET /api/users
Accept: application/vnd.api+json;version=1
Query Parameter Versioning
GET /api/users?version=1
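A server that accepts more than one of these strategies needs a precedence rule. The sketch below checks the URL path first, then the `Accept` header, then the query string; this ordering, the `resolve_version` name, and the default of `1` are assumptions for illustration rather than an established convention.

```python
import re
from typing import Mapping


def resolve_version(path: str, headers: Mapping[str, str],
                    query: Mapping[str, str], default: int = 1) -> int:
    """Pick the requested API version: path first, then Accept header, then query."""
    match = re.match(r"^/api/v(\d+)(?:/|$)", path)
    if match:
        return int(match.group(1))
    match = re.search(r"version=(\d+)", headers.get("Accept", ""))
    if match:
        return int(match.group(1))
    value = query.get("version", "")
    if value.isdigit():
        return int(value)
    return default


from_path = resolve_version("/api/v2/users", {}, {})
from_header = resolve_version(
    "/api/users", {"Accept": "application/vnd.api+json;version=1"}, {}
)
from_query = resolve_version("/api/users", {}, {"version": "3"})
fallback = resolve_version("/api/users", {}, {})
```

In practice most APIs pick a single strategy; URL path versioning is the easiest to route and cache, while header versioning keeps URLs stable across versions.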
Error Handling Standards
# Standardized error response format
from typing import Optional


class APIError:
    def __init__(self, code: str, message: str, details: Optional[dict] = None):
        self.code = code
        self.message = message
        self.details = details or {}

    def to_dict(self):
        return {
            'error': {
                'code': self.code,
                'message': self.message,
                'details': self.details
            }
        }


# Error handlers (registered on the Flask app defined earlier)
@app.errorhandler(400)
def bad_request(error):
    return jsonify(
        APIError(
            code='INVALID_REQUEST',
            message='The request is invalid',
            details={'validation_errors': error.description}
        ).to_dict()
    ), 400


@app.errorhandler(401)
def unauthorized(error):
    return jsonify(
        APIError(
            code='AUTHENTICATION_REQUIRED',
            message='Authentication is required to access this resource'
        ).to_dict()
    ), 401


@app.errorhandler(403)
def forbidden(error):
    return jsonify(
        APIError(
            code='INSUFFICIENT_PERMISSIONS',
            message='Insufficient permissions to access this resource'
        ).to_dict()
    ), 403


@app.errorhandler(404)
def not_found(error):
    return jsonify(
        APIError(
            code='RESOURCE_NOT_FOUND',
            message='The requested resource was not found'
        ).to_dict()
    ), 404


@app.errorhandler(500)
def internal_error(error):
    return jsonify(
        APIError(
            code='INTERNAL_ERROR',
            message='An internal server error occurred'
        ).to_dict()
    ), 500
HTTP Methods and Status Codes
HTTP Methods
Method | Purpose | Idempotent | Safe |
---|---|---|---|
GET | Retrieve resource | ✅ | ✅ |
POST | Create resource | ❌ | ❌ |
PUT | Replace resource | ✅ | ❌ |
PATCH | Update resource | ❌ | ❌ |
DELETE | Remove resource | ✅ | ❌ |
HEAD | Get headers only | ✅ | ✅ |
OPTIONS | Get allowed methods | ✅ | ✅ |
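The idempotency column can be made concrete with a small in-memory model: repeating a PUT or DELETE leaves the same state as doing it once, whereas repeating a POST creates another resource. The `UserStore` class below is a hypothetical stand-in for a real backend.

```python
import uuid


class UserStore:
    """Toy resource store that mirrors the semantics of POST, PUT, and DELETE."""

    def __init__(self):
        self.users = {}

    def post(self, body):
        # The server assigns a new identifier on every call: not idempotent.
        user_id = str(uuid.uuid4())
        self.users[user_id] = dict(body)
        return user_id

    def put(self, user_id, body):
        # Full replacement at a client-chosen key: repeating it changes nothing.
        self.users[user_id] = dict(body)

    def delete(self, user_id):
        # Deleting an absent resource leaves the state unchanged.
        self.users.pop(user_id, None)


store = UserStore()
store.put("123", {"name": "Ada"})
store.put("123", {"name": "Ada"})
count_after_put = len(store.users)

store.post({"name": "Ada"})
store.post({"name": "Ada"})
count_after_post = len(store.users)

store.delete("123")
store.delete("123")
count_after_delete = len(store.users)
```

This is why clients can safely retry PUT and DELETE after a timeout, while retried POST requests usually need an idempotency key or deduplication on the server.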
HTTP Status Codes
Success (2xx)
- 200 OK: Request successful
- 201 Created: Resource created successfully
- 202 Accepted: Request accepted for processing
- 204 No Content: Successful, no content to return
Client Error (4xx)
- 400 Bad Request: Invalid request format
- 401 Unauthorized: Authentication required
- 403 Forbidden: Access denied
- 404 Not Found: Resource not found
- 409 Conflict: Resource conflict
- 422 Unprocessable Entity: Validation errors
- 429 Too Many Requests: Rate limit exceeded
Server Error (5xx)
- 500 Internal Server Error: Server error
- 502 Bad Gateway: Invalid response from upstream
- 503 Service Unavailable: Service temporarily unavailable
- 504 Gateway Timeout: Upstream timeout
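Status codes also tell clients whether a retry is worthwhile. The sketch below retries only on 429, 502, 503, and 504 with exponential backoff; that set of codes, the `call_with_retry` helper, and the delay values are assumptions, and the right policy depends on the API and on whether the request is idempotent.

```python
import time

# Assumed set of transient failures; 4xx validation errors are never retried.
RETRYABLE_STATUSES = {429, 502, 503, 504}


def call_with_retry(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `send()` until it returns a non-retryable status or attempts run out.

    `send` is any callable returning a (status_code, body) tuple.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status not in RETRYABLE_STATUSES or attempt == max_attempts:
            return status, body
        # Exponential backoff: base_delay, 2x, 4x, ...
        sleep(base_delay * 2 ** (attempt - 1))


# Demonstration with scripted responses and a recording sleep (no real waiting).
responses = iter([(503, None), (503, None), (200, "ok")])
delays = []
result = call_with_retry(lambda: next(responses), sleep=delays.append)
```

A fuller client would also honor a `Retry-After` header when the server sends one and add jitter to the delays.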
Content Negotiation
Accept Headers
import csv
import io
import xml.etree.ElementTree as ET

from flask import request, jsonify


@app.route('/api/users/<user_id>')
def get_user_with_content_negotiation(user_id):
    user = user_service.get_user_by_id(user_id)
    if not user:
        return jsonify({'error': 'User not found'}), 404
    # Simple substring matching; q-values in the Accept header are ignored here
    accept_header = request.headers.get('Accept', 'application/json')
    if 'application/json' in accept_header:
        return jsonify(user.__dict__)
    elif 'application/xml' in accept_header:
        root = ET.Element('user')
        ET.SubElement(root, 'id').text = user.id
        ET.SubElement(root, 'name').text = user.name
        ET.SubElement(root, 'email').text = user.email
        response = app.response_class(
            ET.tostring(root, encoding='unicode'),
            mimetype='application/xml'
        )
        return response
    elif 'text/csv' in accept_header:
        output = io.StringIO()
        writer = csv.writer(output)
        writer.writerow(['id', 'name', 'email'])
        writer.writerow([user.id, user.name, user.email])
        response = app.response_class(
            output.getvalue(),
            mimetype='text/csv'
        )
        return response
    else:
        # 406 Not Acceptable: none of the requested representations is available
        return jsonify({'error': 'Not acceptable'}), 406
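Substring checks on the `Accept` header ignore client preferences expressed as q-values. The following standard-library sketch parses q-values and picks the best supported type; `parse_accept` and `choose_media_type` are hypothetical helpers, and the matching is deliberately simplified (it does not apply the specificity precedence rules of full HTTP content negotiation).

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs."""
    entries = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in pieces[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        entries.append((media_type, q))
    return entries


def _matches(requested, candidate):
    if requested in ("*/*", candidate):
        return True
    # Type wildcard such as "text/*"
    return requested.endswith("/*") and candidate.startswith(requested[:-1])


def choose_media_type(header, supported):
    """Return the supported type with the highest q-value, or None (respond 406)."""
    best, best_q = None, 0.0
    for requested, q in parse_accept(header or "*/*"):
        for candidate in supported:
            if q > best_q and _matches(requested, candidate):
                best, best_q = candidate, q
    return best


SUPPORTED = ["application/json", "application/xml", "text/csv"]
preferred = choose_media_type("text/csv;q=0.9, application/json;q=0.5", SUPPORTED)
wildcard = choose_media_type("*/*", SUPPORTED)
unsupported = choose_media_type("image/png", SUPPORTED)
```

A handler can dispatch on the returned type and answer `406 Not Acceptable` when the result is `None`.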
API Documentation
OpenAPI/Swagger Specification
# openapi.yaml
openapi: 3.0.3
info:
  title: User Management API
  description: API for managing users in the system
  version: 1.0.0
  contact:
    name: API Support
    email: api-support@example.com
servers:
  - url: https://api.example.com/v1
    description: Production server
  - url: https://staging-api.example.com/v1
    description: Staging server
paths:
  /users:
    get:
      summary: List users
      description: Retrieve a list of users with optional filtering and pagination
      parameters:
        - name: limit
          in: query
          description: Maximum number of users to return
          schema:
            type: integer
            minimum: 1
            maximum: 100
            default: 10
        - name: offset
          in: query
          description: Number of users to skip
          schema:
            type: integer
            minimum: 0
            default: 0
      responses:
        '200':
          description: List of users
          content:
            application/json:
              schema:
                type: object
                properties:
                  data:
                    type: array
                    items:
                      $ref: '#/components/schemas/User'
                  meta:
                    $ref: '#/components/schemas/PaginationMeta'
        '400':
          $ref: '#/components/responses/BadRequest'
        '500':
          $ref: '#/components/responses/InternalError'
    post:
      summary: Create user
      description: Create a new user
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateUserRequest'
      responses:
        '201':
          description: User created successfully
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/User'
        '400':
          $ref: '#/components/responses/BadRequest'
        '409':
          $ref: '#/components/responses/Conflict'
components:
  schemas:
    User:
      type: object
      required:
        - id
        - name
        - email
      properties:
        id:
          type: string
          format: uuid
          description: Unique user identifier
        name:
          type: string
          minLength: 1
          maxLength: 100
          description: User's full name
        email:
          type: string
          format: email
          description: User's email address
        created_at:
          type: string
          format: date-time
          description: User creation timestamp
    CreateUserRequest:
      type: object
      required:
        - name
        - email
      properties:
        name:
          type: string
          minLength: 1
          maxLength: 100
        email:
          type: string
          format: email
    PaginationMeta:
      type: object
      properties:
        total:
          type: integer
          description: Total number of resources
        limit:
          type: integer
          description: Maximum number of resources per page
        offset:
          type: integer
          description: Number of resources skipped
    Error:
      type: object
      properties:
        error:
          type: object
          properties:
            code:
              type: string
              description: Error code
            message:
              type: string
              description: Human-readable error message
            details:
              type: object
              description: Additional error details
  responses:
    BadRequest:
      description: Bad request
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'
    Conflict:
      description: Resource conflict
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'
    InternalError:
      description: Internal server error
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'
Testing APIs
Unit Testing
import json
import unittest
from unittest.mock import patch

# your_api is a placeholder module name; it must expose the Flask app and the User dataclass
from your_api import app, User


class TestUserAPI(unittest.TestCase):
    def setUp(self):
        app.testing = True
        self.app = app.test_client()

    @patch('your_api.user_service')
    def test_get_users_success(self, mock_service):
        # Mock service response
        mock_users = [
            User(id='1', name='John', email='john@example.com', created_at='2024-01-01T00:00:00Z'),
            User(id='2', name='Jane', email='jane@example.com', created_at='2024-01-01T01:00:00Z')
        ]
        mock_service.get_users.return_value = mock_users
        mock_service.count_users.return_value = 2
        # Make request
        response = self.app.get('/api/v1/users')
        # Assertions
        self.assertEqual(response.status_code, 200)
        data = json.loads(response.data)
        self.assertEqual(len(data['data']), 2)
        self.assertEqual(data['meta']['total'], 2)

    @patch('your_api.user_service')
    def test_create_user_success(self, mock_service):
        # Mock service response
        created_user = User(id='123', name='John', email='john@example.com', created_at='2024-01-01T00:00:00Z')
        mock_service.create_user.return_value = created_user
        # Make request
        response = self.app.post(
            '/api/v1/users',
            data=json.dumps({'name': 'John', 'email': 'john@example.com'}),
            content_type='application/json'
        )
        # Assertions
        self.assertEqual(response.status_code, 201)
        data = json.loads(response.data)
        self.assertEqual(data['name'], 'John')
        self.assertEqual(data['email'], 'john@example.com')

    def test_create_user_invalid_data(self):
        # Make request with invalid data; validation fails before the service is called
        response = self.app.post(
            '/api/v1/users',
            data=json.dumps({'name': 'John'}),  # Missing email
            content_type='application/json'
        )
        # Assertions
        self.assertEqual(response.status_code, 400)
        data = json.loads(response.data)
        self.assertIn('error', data)
Integration Testing
import time

import pytest
import requests
from testcontainers.compose import DockerCompose


@pytest.fixture(scope="module")
def api_service():
    """Start API service with dependencies for integration testing."""
    with DockerCompose(".", compose_file_name="docker-compose.test.yml") as compose:
        api_url = f"http://localhost:{compose.get_service_port('api', 8000)}"
        # Health check: poll up to 30 times, one second apart
        for _ in range(30):
            try:
                response = requests.get(f"{api_url}/health", timeout=2)
                if response.status_code == 200:
                    break
            except requests.exceptions.RequestException:
                pass
            time.sleep(1)
        else:
            # The loop finished without a successful health check
            raise RuntimeError("API service failed to start")
        yield api_url


def test_full_user_lifecycle(api_service):
    """Test complete user lifecycle: create, read, update, delete."""
    base_url = f"{api_service}/api/v1"
    # Create user
    create_data = {"name": "Integration Test User", "email": "test@example.com"}
    response = requests.post(f"{base_url}/users", json=create_data)
    assert response.status_code == 201
    user = response.json()
    user_id = user['id']
    assert user['name'] == create_data['name']
    assert user['email'] == create_data['email']
    # Read user
    response = requests.get(f"{base_url}/users/{user_id}")
    assert response.status_code == 200
    retrieved_user = response.json()
    assert retrieved_user['id'] == user_id
    # Update user
    update_data = {"name": "Updated Name"}
    response = requests.put(f"{base_url}/users/{user_id}", json=update_data)
    assert response.status_code == 200
    updated_user = response.json()
    assert updated_user['name'] == "Updated Name"
    # Delete user
    response = requests.delete(f"{base_url}/users/{user_id}")
    assert response.status_code == 204
    # Verify deletion
    response = requests.get(f"{base_url}/users/{user_id}")
    assert response.status_code == 404
## Related Topics
**Core Infrastructure**:
- **[Data Engineering Pipelines](/data-engineering/pipelines)**: Integrate APIs within data processing workflows
- **[Data Processing](/data-engineering/processing)**: Use APIs for distributed data processing coordination
- **[Data Engineering Monitoring](/data-engineering/monitoring)**: Monitor API performance and reliability
**Advanced API Management**:
- **[API Authentication](/api-management/authentication)**: Secure API access with OAuth2, JWT, and RBAC
- **[API Monitoring](/api-management/monitoring)**: Observe API performance, errors, and usage patterns
- **[API Documentation](/api-management/documentation)**: Create comprehensive API documentation and SDKs
- **[API Lifecycle](/api-management/lifecycle)**: Manage API versioning, deprecation, and evolution
**Technology Integration**:
- **[Rust Programming](/programming-languages/rust)**: Build high-performance, safe API services
- **[Data Technologies](/data-technologies)**: Connect APIs to databases and processing systems
**Analytics and ML Applications**:
- **[Analytics](/analytics)**: Serve analytical results through API endpoints
- **[Machine Learning](/machine-learning)**: Deploy ML models via REST/GraphQL APIs
Understanding API fundamentals is crucial for building robust data engineering systems. Well-designed APIs enable seamless integration between services, improve system maintainability, and provide clear contracts for data exchange. These principles form the foundation for more advanced API management practices.