Analytics Fundamentals
Analytics fundamentals encompass the core mathematical, statistical, and computational concepts that form the foundation of all analytical methods. Understanding these principles is crucial for data engineers who need to build systems that support robust analytical workflows.
Core Philosophy
Analytics is fundamentally about extracting meaningful insights from data. Unlike traditional reporting that shows what happened, analytics focuses on understanding why it happened and what might happen next. This requires:
1. Statistical Rigor
All analytical conclusions must be statistically sound:
- Understanding sampling distributions and their implications
- Proper hypothesis testing with appropriate significance levels
- Controlling for confounding variables and bias
- Validating assumptions underlying statistical tests
2. Computational Efficiency
Analytics at scale requires optimized computation:
- Algorithm complexity considerations for large datasets
- Distributed computing for parallel processing
- Memory-efficient algorithms for streaming data (see the sketch after this list)
- Approximation algorithms for real-time insights
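A minimal sketch of the memory-efficient streaming point above: Welford's online algorithm maintains a running mean and variance in O(1) memory, updating one observation at a time instead of materializing the whole dataset. The StreamingStats class and the sample values are illustrative, not part of any particular library.

class StreamingStats:
    """Running mean and variance via Welford's online algorithm (O(1) memory)."""

    def __init__(self):
        self.n = 0          # number of observations seen so far
        self.mean = 0.0     # running mean
        self.m2 = 0.0       # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # Sample variance with Bessel's correction; undefined for n < 2
        return self.m2 / (self.n - 1) if self.n > 1 else float('nan')

# Usage: feed values one at a time as they arrive from the stream
running = StreamingStats()
for value in [1200.0, 1350.0, 980.0, 1100.0, 1450.0]:
    running.update(value)
print(running.mean, running.variance())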
3. Data Quality Assurance
Analytics is only as good as the underlying data:
- Data validation and cleansing pipelines
- Outlier detection and treatment strategies
- Missing data handling methodologies (see the sketch after this list)
- Data lineage and provenance tracking
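As a brief, hedged illustration of the missing-data point above, the pandas sketch below profiles missingness and applies two common strategies; the column names (revenue, region) and the chosen imputations are assumptions for the example only, not a general prescription.

import numpy as np
import pandas as pd

# Illustrative frame with gaps; column names are hypothetical
df = pd.DataFrame({
    'revenue': [1200.0, np.nan, 980.0, 1100.0, np.nan],
    'region': ['east', 'west', None, 'east', 'west'],
})

# Quantify missingness per column before choosing a strategy
missing_share = df.isna().mean()

# Common strategies: impute numeric columns with the median and
# categorical columns with an explicit "unknown" marker
cleaned = df.copy()
cleaned['revenue'] = cleaned['revenue'].fillna(cleaned['revenue'].median())
cleaned['region'] = cleaned['region'].fillna('unknown')
print(missing_share)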
Mathematical Foundations
Descriptive Statistics
Dataset Definition: $D = \{(x_i, y_i)\}_{i=1}^{n}$
Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Population Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Population Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
Standard Deviation: $s = \sqrt{s^2}$ (sample), $\sigma = \sqrt{\sigma^2}$ (population)
Symbol Definitions:
- $D$ = Dataset containing pairs of input-output values
- $x_i$ = Individual data points or feature values
- $\bar{x}$ = Sample mean (arithmetic average)
- $\mu$ = Population mean (true mean)
- $s^2$ = Sample variance (with Bessel's correction)
- $\sigma^2$ = Population variance (true variance)
- $n$ = Sample size, $N$ = Population size
Moments and Shape
Skewness (Third Moment): $\text{Skewness} = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$
Kurtosis (Fourth Moment, reported as excess kurtosis): $\text{Kurtosis} = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - 3$
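These formulas map directly onto standard library calls. The sketch below uses NumPy and SciPy; the revenue values are the illustrative figures reused from the Rust example later in this section.

import numpy as np
from scipy import stats

data = np.array([1200.0, 1350.0, 980.0, 1100.0, 1450.0,
                 1300.0, 1180.0, 1420.0, 1250.0, 1380.0])

mean = data.mean()
sample_var = data.var(ddof=1)            # Bessel's correction (n - 1 denominator)
sample_std = data.std(ddof=1)
skewness = stats.skew(data)              # third standardized moment
excess_kurtosis = stats.kurtosis(data)   # fourth standardized moment minus 3
print(mean, sample_var, sample_std, skewness, excess_kurtosis)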
Measures of Association
Pearson Correlation Coefficient: $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
Spearman Rank Correlation: $\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$
where $d_i$ is the difference between the ranks of $x_i$ and $y_i$.
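A short sketch of both association measures; the two series reuse the illustrative advertising-spend and revenue figures from the Rust example later in this section.

import numpy as np
from scipy import stats

x = np.array([50.0, 75.0, 40.0, 45.0, 85.0, 70.0, 55.0, 80.0, 60.0, 78.0])
y = np.array([1200.0, 1350.0, 980.0, 1100.0, 1450.0,
              1300.0, 1180.0, 1420.0, 1250.0, 1380.0])

pearson_r = np.corrcoef(x, y)[0, 1]            # linear association
spearman_rho, p_value = stats.spearmanr(x, y)  # rank-based (monotonic) association
print(pearson_r, spearman_rho, p_value)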
Statistical Inference
Confidence Intervals
For Population Mean (known variance): $\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$
For Population Mean (unknown variance): $\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$
For Population Proportion: $\hat{p} \pm z_{\alpha/2}\,\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
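A hedged sketch of the unknown-variance case using SciPy's t distribution; the data and the 95% level are illustrative.

import numpy as np
from scipy import stats

data = np.array([1200.0, 1350.0, 980.0, 1100.0, 1450.0,
                 1300.0, 1180.0, 1420.0, 1250.0, 1380.0])
confidence_level = 0.95

mean = data.mean()
std_err = data.std(ddof=1) / np.sqrt(len(data))
t_crit = stats.t.ppf((1 + confidence_level) / 2, df=len(data) - 1)

lower, upper = mean - t_crit * std_err, mean + t_crit * std_err
print(f"{confidence_level:.0%} CI: [{lower:.2f}, {upper:.2f}]")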
Hypothesis Testing
Test Statistic for Mean: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
Chi-Square Test for Independence: $\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
where $O_{ij}$ are the observed frequencies and $E_{ij}$ are the expected frequencies.
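Both test statistics are available directly in SciPy. The sketch below assumes an illustrative null mean of 1200 and a small, made-up 2x2 contingency table.

import numpy as np
from scipy import stats

data = np.array([1200.0, 1350.0, 980.0, 1100.0, 1450.0,
                 1300.0, 1180.0, 1420.0, 1250.0, 1380.0])

# One-sample t-test: H0: mu = 1200 (two-tailed)
t_stat, p_value = stats.ttest_1samp(data, popmean=1200.0)

# Chi-square test for independence on an observed frequency table
observed = np.array([[30, 20],
                     [25, 35]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(observed)
print(t_stat, p_value, chi2, chi2_p, dof)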
Probability Distributions
Discrete Distributions
Binomial Distribution: $P(X = k) = \binom{n}{k}\, p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n$
Poisson Distribution: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$
Continuous Distributions
Normal Distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
Exponential Distribution: $f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$
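Each of these distributions is exposed in scipy.stats; the parameters in the sketch below are illustrative.

from scipy import stats

# Discrete: probability mass functions
p_binomial = stats.binom.pmf(k=3, n=10, p=0.2)   # P(X = 3) with 10 trials, success prob 0.2
p_poisson = stats.poisson.pmf(k=3, mu=2.5)       # P(X = 3) with rate lambda = 2.5

# Continuous: probability density functions
f_normal = stats.norm.pdf(x=1.0, loc=0.0, scale=1.0)      # standard normal density at x = 1
f_exponential = stats.expon.pdf(x=1.0, scale=1.0 / 0.5)   # rate lambda = 0.5 via scale = 1/lambda
print(p_binomial, p_poisson, f_normal, f_exponential)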
Practical Implementation
Statistical Computing in SQL
-- Descriptive statistics
SELECT
COUNT(*) as n,
AVG(revenue) as mean_revenue,
STDDEV(revenue) as std_revenue,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) as median_revenue,
MIN(revenue) as min_revenue,
MAX(revenue) as max_revenue
FROM sales_data
WHERE date_created >= '2024-01-01';
-- Correlation analysis
SELECT
CORR(advertising_spend, revenue) as correlation_coefficient,
REGR_SLOPE(revenue, advertising_spend) as slope,
REGR_INTERCEPT(revenue, advertising_spend) as intercept,
REGR_R2(revenue, advertising_spend) as r_squared
FROM marketing_performance;
-- Moving averages for trend analysis
SELECT
date_created,
daily_sales,
AVG(daily_sales) OVER (
ORDER BY date_created
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as seven_day_moving_avg,
STDDEV(daily_sales) OVER (
ORDER BY date_created
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) as thirty_day_volatility
FROM daily_sales_summary
ORDER BY date_created;
Rust Implementation Example
// Standard library imports (only the f64 PI constant is needed)
use std::f64::consts::PI;
/// Comprehensive descriptive statistics for a dataset.
///
/// Contains all key statistical measures including central tendency, dispersion, and shape.
#[derive(Debug, Clone)]
pub struct DescriptiveStats {
/// Sample size
pub n: usize,
/// Arithmetic mean
pub mean: f64,
/// Median value (50th percentile)
pub median: f64,
/// Sample standard deviation
pub std_dev: f64,
/// Sample variance
pub variance: f64,
/// Skewness (measure of asymmetry)
pub skewness: f64,
/// Excess kurtosis (measure of tail heaviness)
pub kurtosis: f64,
/// Minimum value
pub min: f64,
/// Maximum value
pub max: f64,
/// 25th percentile (first quartile)
pub q25: f64,
/// 75th percentile (third quartile)
pub q75: f64,
}
/// Confidence interval for a population parameter.
///
/// Represents a range of values with associated confidence level.
#[derive(Debug, Clone)]
pub struct ConfidenceInterval {
/// Lower bound of the interval
pub lower_bound: f64,
/// Upper bound of the interval
pub upper_bound: f64,
/// Confidence level (e.g., 0.95 for 95%)
pub confidence_level: f64,
}
/// Results from a statistical hypothesis test.
///
/// Contains test statistic, p-value, and decision criteria.
#[derive(Debug, Clone)]
pub struct HypothesisTestResult {
/// Test statistic value
pub t_statistic: f64,
/// P-value for the test
pub p_value: f64,
/// Whether to reject null hypothesis (p < 0.05)
pub reject_null: bool,
/// Degrees of freedom for the test
pub degrees_of_freedom: usize,
}
#[derive(Debug, Clone)]
pub struct NormalityTestResult {
pub statistic: f64,
pub p_value: f64,
pub is_normal: bool,
}
pub struct AnalyticsFundamentals {
data: Vec<f64>,
}
impl AnalyticsFundamentals {
pub fn new(data: Vec<f64>) -> Self {
Self { data }
}
pub fn descriptive_stats(&self) -> DescriptiveStats {
let n = self.data.len();
if n == 0 {
panic!("Cannot calculate statistics for empty dataset");
}
// Calculate mean
let mean = self.data.iter().sum::<f64>() / n as f64;
// Calculate median
let mut sorted_data = self.data.clone();
sorted_data.sort_by(|a, b| a.partial_cmp(b).unwrap());
let median = if n % 2 == 0 {
(sorted_data[n / 2 - 1] + sorted_data[n / 2]) / 2.0
} else {
sorted_data[n / 2]
};
// Calculate variance (sample variance with Bessel's correction)
let variance = self.data.iter()
.map(|&x| (x - mean).powi(2))
.sum::<f64>() / (n - 1) as f64;
let std_dev = variance.sqrt();
// Calculate skewness
let skewness = if std_dev > 0.0 {
let third_moment = self.data.iter()
.map(|&x| ((x - mean) / std_dev).powi(3))
.sum::<f64>() / n as f64;
third_moment
} else {
0.0
};
// Calculate kurtosis
let kurtosis = if std_dev > 0.0 {
let fourth_moment = self.data.iter()
.map(|&x| ((x - mean) / std_dev).powi(4))
.sum::<f64>() / n as f64;
fourth_moment - 3.0 // Excess kurtosis
} else {
0.0
};
// Calculate quartiles
let q25 = self.percentile(&sorted_data, 0.25);
let q75 = self.percentile(&sorted_data, 0.75);
let min = sorted_data[0];
let max = sorted_data[n - 1];
DescriptiveStats {
n,
mean,
median,
std_dev,
variance,
skewness,
kurtosis,
min,
max,
q25,
q75,
}
}
fn percentile(&self, sorted_data: &[f64], p: f64) -> f64 {
let n = sorted_data.len();
let index = p * (n - 1) as f64;
let lower_index = index.floor() as usize;
let upper_index = index.ceil() as usize;
if lower_index == upper_index {
sorted_data[lower_index]
} else {
let weight = index - lower_index as f64;
sorted_data[lower_index] * (1.0 - weight) + sorted_data[upper_index] * weight
}
}
pub fn confidence_interval(&self, confidence_level: f64) -> ConfidenceInterval {
let n = self.data.len();
if n < 2 {
panic!("Need at least 2 data points for confidence interval");
}
let mean = self.data.iter().sum::<f64>() / n as f64;
let variance = self.data.iter()
.map(|&x| (x - mean).powi(2))
.sum::<f64>() / (n - 1) as f64;
let std_err = (variance / n as f64).sqrt();
let degrees_of_freedom = n - 1;
let alpha = 1.0 - confidence_level;
let t_critical = self.t_distribution_inverse(1.0 - alpha / 2.0, degrees_of_freedom);
let margin_of_error = t_critical * std_err;
ConfidenceInterval {
lower_bound: mean - margin_of_error,
upper_bound: mean + margin_of_error,
confidence_level,
}
}
pub fn hypothesis_test(&self, null_hypothesis: f64) -> HypothesisTestResult {
let n = self.data.len();
if n < 2 {
panic!("Need at least 2 data points for hypothesis test");
}
let mean = self.data.iter().sum::<f64>() / n as f64;
let variance = self.data.iter()
.map(|&x| (x - mean).powi(2))
.sum::<f64>() / (n - 1) as f64;
let std_err = (variance / n as f64).sqrt();
let t_statistic = (mean - null_hypothesis) / std_err;
let degrees_of_freedom = n - 1;
// Two-tailed p-value calculation
let p_value = 2.0 * (1.0 - self.t_distribution_cdf(t_statistic.abs(), degrees_of_freedom));
HypothesisTestResult {
t_statistic,
p_value,
reject_null: p_value < 0.05,
degrees_of_freedom,
}
}
pub fn normality_test(&self) -> NormalityTestResult {
// Simplified Shapiro-Wilk test approximation
let n = self.data.len();
if n < 3 {
return NormalityTestResult {
statistic: 0.0,
p_value: 1.0,
is_normal: false,
};
}
let mut sorted_data = self.data.clone();
sorted_data.sort_by(|a, b| a.partial_cmp(b).unwrap());
let mean = sorted_data.iter().sum::<f64>() / n as f64;
let variance = sorted_data.iter()
.map(|&x| (x - mean).powi(2))
.sum::<f64>() / (n - 1) as f64;
// Calculate W statistic (simplified version)
let mut numerator = 0.0;
let mut denominator = 0.0;
for (i, &value) in sorted_data.iter().enumerate() {
let expected = self.normal_quantile((i as f64 + 0.5) / n as f64);
numerator += expected * value;
denominator += expected * expected;
}
let w_statistic = (numerator * numerator) / (denominator * variance * (n - 1) as f64);
// Approximate p-value (simplified)
let p_value = if w_statistic > 0.95 { 0.1 } else { 0.01 };
NormalityTestResult {
statistic: w_statistic,
p_value,
is_normal: p_value > 0.05,
}
}
// Approximation of normal distribution quantile function
fn normal_quantile(&self, p: f64) -> f64 {
// Beasley-Springer-Moro algorithm approximation
if p <= 0.0 { return f64::NEG_INFINITY; }
if p >= 1.0 { return f64::INFINITY; }
if p == 0.5 { return 0.0; }
let q = p - 0.5;
if q.abs() <= 0.425 {
let r = 0.180625 - q * q;
return q * (((((((2.5090809287301226727e3 * r + 3.3430575583588128105e4) * r +
6.7265770927008700853e4) * r + 4.5921953931549871457e4) * r +
1.3731693765509461125e4) * r + 1.9715909503065514427e3) * r +
1.3314166789178437745e2) * r + 3.3871328727963666080e0) /
(((((((5.2264952788528545610e3 * r + 2.8729085735721942674e4) * r +
3.9307895800092710610e4) * r + 2.1213794301586595867e4) * r +
5.3941960214247511077e3) * r + 6.8718700749205790830e2) * r +
4.2313330701600911252e1) * r + 1.0);
}
// For values further from center, use different approximation
let r = if q < 0.0 { p } else { 1.0 - p };
let s = (-2.0 * r.ln()).sqrt();
let t = s - (2.515517 + 0.802853 * s + 0.010328 * s * s) /
(1.0 + 1.432788 * s + 0.189269 * s * s + 0.001308 * s * s * s);
if q < 0.0 { -t } else { t }
}
// Approximation of t-distribution CDF
fn t_distribution_cdf(&self, x: f64, df: usize) -> f64 {
if df == 1 {
// With 1 degree of freedom the t distribution is the Cauchy distribution: F(x) = 1/2 + atan(x)/pi
return 0.5 + x.atan() / PI;
}
// For larger degrees of freedom, approximate with normal distribution
if df >= 30 {
return self.normal_cdf(x);
}
// Simplified approximation for intermediate df
let a = 4.0 * df as f64;
let b = a + x * x - 1.0;
let c = (a / b).sqrt();
let d = x * c;
self.normal_cdf(d)
}
// Approximation of normal CDF
fn normal_cdf(&self, x: f64) -> f64 {
0.5 * (1.0 + self.erf(x / 2.0_f64.sqrt()))
}
// Approximation of error function
fn erf(&self, x: f64) -> f64 {
let a1 = 0.254829592;
let a2 = -0.284496736;
let a3 = 1.421413741;
let a4 = -1.453152027;
let a5 = 1.061405429;
let p = 0.3275911;
let sign = if x < 0.0 { -1.0 } else { 1.0 };
let x = x.abs();
let t = 1.0 / (1.0 + p * x);
let y = 1.0 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * (-x * x).exp();
sign * y
}
// Approximation of t-distribution inverse
fn t_distribution_inverse(&self, p: f64, df: usize) -> f64 {
if df >= 30 {
return self.normal_quantile(p);
}
// Simplified approximation
let z = self.normal_quantile(p);
let correction = (z * z * z + z) / (4.0 * df as f64) +
(5.0 * z.powi(5) + 16.0 * z.powi(3) + 3.0 * z) / (96.0 * (df as f64).powi(2));
z + correction
}
pub fn correlation(&self, other: &[f64]) -> f64 {
if self.data.len() != other.len() || self.data.is_empty() {
return 0.0;
}
let n = self.data.len() as f64;
let mean_x = self.data.iter().sum::<f64>() / n;
let mean_y = other.iter().sum::<f64>() / n;
let numerator: f64 = self.data.iter()
.zip(other.iter())
.map(|(&x, &y)| (x - mean_x) * (y - mean_y))
.sum();
let sum_sq_x: f64 = self.data.iter()
.map(|&x| (x - mean_x).powi(2))
.sum();
let sum_sq_y: f64 = other.iter()
.map(|&y| (y - mean_y).powi(2))
.sum();
let denominator = (sum_sq_x * sum_sq_y).sqrt();
if denominator == 0.0 { 0.0 } else { numerator / denominator }
}
}
// Example usage
fn main() -> Result<(), Box<dyn std::error::Error>> {
let revenue_data = vec![1200.0, 1350.0, 980.0, 1100.0, 1450.0, 1300.0, 1180.0, 1420.0, 1250.0, 1380.0];
let analytics = AnalyticsFundamentals::new(revenue_data);
// Get descriptive statistics
let stats = analytics.descriptive_stats();
println!("Mean Revenue: ${:.2}", stats.mean);
println!("Standard Deviation: ${:.2}", stats.std_dev);
println!("Median: ${:.2}", stats.median);
println!("Skewness: {:.4}", stats.skewness);
println!("Kurtosis: {:.4}", stats.kurtosis);
// Calculate 95% confidence interval
let ci = analytics.confidence_interval(0.95);
println!("95% CI: [${:.2}, ${:.2}]", ci.lower_bound, ci.upper_bound);
// Test hypothesis that true mean is $1200
let test_result = analytics.hypothesis_test(1200.0);
println!("H0: μ = $1200");
println!("t-statistic: {:.4}", test_result.t_statistic);
println!("p-value: {:.4}", test_result.p_value);
println!("Reject null hypothesis: {}", test_result.reject_null);
// Test for normality
let normality = analytics.normality_test();
println!("Normality test statistic: {:.4}", normality.statistic);
println!("Data appears normal: {}", normality.is_normal);
// Example correlation with advertising spend
let advertising_spend = vec![50.0, 75.0, 40.0, 45.0, 85.0, 70.0, 55.0, 80.0, 60.0, 78.0];
let correlation = analytics.correlation(&advertising_spend);
println!("Correlation with advertising spend: {:.4}", correlation);
Ok(())
}
/*
Cargo.toml dependencies:
None required -- the example above uses only the Rust standard library.
*/
Data Quality Assessment
Completeness Metrics
import numpy as np
import pandas as pd
from scipy import stats

def assess_data_quality(df):
    """Comprehensive data quality assessment for a pandas DataFrame.

    check_data_consistency and validate_data_format are project-specific
    helpers assumed to be defined elsewhere.
    """
    quality_report = {}
    for column in df.columns:
        quality_report[column] = {
            'completeness': (df[column].notna().sum() / len(df)) * 100,
            'uniqueness': (df[column].nunique() / len(df)) * 100,
            'consistency': check_data_consistency(df[column]),
            'validity': validate_data_format(df[column])
        }
    return quality_report

def detect_outliers(data, method='iqr'):
    """Detect outliers using the IQR or Z-score method."""
    if method == 'iqr':
        Q1 = np.percentile(data, 25)
        Q3 = np.percentile(data, 75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = (data < lower_bound) | (data > upper_bound)
    elif method == 'zscore':
        z_scores = np.abs(stats.zscore(data))
        outliers = z_scores > 3
    else:
        raise ValueError(f"Unknown outlier detection method: {method}")
    return outliers
Advanced Analytics Concepts
Experimental Design
A/B Testing Framework (two-sample z-test for proportions): $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\hat{p}$ is the pooled proportion.
Sample Size Calculation (per group, two-sided test with significance $\alpha$ and power $1 - \beta$): $n = \frac{2\sigma^2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}$, where $\delta$ is the minimum detectable effect.
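A small sketch of the sample-size formula above for comparing two means; the values of sigma, the minimum detectable effect, alpha, and power are illustrative assumptions.

import math
from scipy import stats

def sample_size_two_means(sigma, delta, alpha=0.05, power=0.8):
    """Required sample size per group to detect a mean difference delta."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = 2 * (sigma ** 2) * (z_alpha + z_beta) ** 2 / (delta ** 2)
    return math.ceil(n)

# Example: sigma = 150, detect a 50-unit lift at 5% significance with 80% power
print(sample_size_two_means(sigma=150.0, delta=50.0))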
Time Series Fundamentals
Autocovariance Function: $\gamma(k) = \mathrm{Cov}(X_t, X_{t+k}) = E\left[(X_t - \mu)(X_{t+k} - \mu)\right]$
Autocorrelation Function: $\rho(k) = \frac{\gamma(k)}{\gamma(0)}$
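A minimal NumPy sketch of the sample autocovariance and autocorrelation defined above; the series values are illustrative.

import numpy as np

def autocovariance(x, k):
    """Sample autocovariance at lag k (1/n normalization)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.mean()
    return np.sum((x[:n - k] - mean) * (x[k:] - mean)) / n

def autocorrelation(x, k):
    """Sample autocorrelation at lag k: gamma(k) / gamma(0)."""
    return autocovariance(x, k) / autocovariance(x, 0)

series = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0, 136.0, 119.0]
print([round(autocorrelation(series, k), 3) for k in range(4)])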
Dimensionality Reduction
Principal Component Analysis:
- Eigenvalue decomposition of covariance matrix
- Variance explained by each component
- Dimensionality reduction while preserving information
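The list above corresponds to only a few lines of NumPy. This hedged sketch performs PCA via eigendecomposition of the covariance matrix on a small random dataset; the data shape and the choice to keep two components are illustrative.

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))  # 200 observations, 5 features (illustrative data)

# Center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvalue decomposition; eigh is appropriate for symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # sort components by variance explained
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_variance_ratio = eigenvalues / eigenvalues.sum()

# Keep the top 2 components and project the data onto them
components = eigenvectors[:, :2]
X_reduced = X_centered @ components
print(explained_variance_ratio, X_reduced.shape)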
Real-World Applications
Customer Analytics
# Customer lifetime value calculation
import pandas as pd

def calculate_clv(df):
    """Calculate Customer Lifetime Value using a simple analytical approach."""
    avg_order_value = df['order_value'].mean()
    # Average orders per customer; for a consistent CLV this should be expressed
    # per year so that it matches the units of customer_lifespan
    purchase_frequency = len(df) / df['customer_id'].nunique()
    customer_lifespan = df['customer_tenure_days'].mean() / 365
    clv = avg_order_value * purchase_frequency * customer_lifespan
    return clv

# RFM Analysis
def rfm_analysis(df):
    """Recency, Frequency, Monetary analysis."""
    current_date = df['order_date'].max()
    rfm = df.groupby('customer_id').agg({
        'order_date': lambda x: (current_date - x.max()).days,  # Recency
        'order_id': 'count',                                    # Frequency
        'order_value': 'sum'                                    # Monetary
    }).rename(columns={
        'order_date': 'recency',
        'order_id': 'frequency',
        'order_value': 'monetary'
    })
    # Create RFM scores (1-5 scale); lower recency is better, hence the reversed labels
    rfm['r_score'] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
    rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
    rfm['m_score'] = pd.qcut(rfm['monetary'], 5, labels=[1, 2, 3, 4, 5])
    return rfm
Financial Analytics
# Risk metrics calculation
import numpy as np

def calculate_var(returns, confidence_level=0.05):
    """Calculate historical Value at Risk as the return quantile."""
    return np.percentile(returns, confidence_level * 100)

def calculate_sharpe_ratio(returns, risk_free_rate=0.02):
    """Calculate the annualized Sharpe Ratio from daily returns."""
    excess_returns = returns - risk_free_rate / 252  # Daily risk-free rate
    return np.mean(excess_returns) / np.std(excess_returns) * np.sqrt(252)

# Portfolio optimization
def optimize_portfolio(expected_returns, covariance_matrix):
    """Mean-variance optimization: minimum-variance, fully invested, long-only portfolio."""
    from scipy.optimize import minimize

    def portfolio_variance(weights, covariance_matrix):
        return np.dot(weights.T, np.dot(covariance_matrix, weights))

    def portfolio_return(weights, expected_returns):
        # Kept for extending the optimization with a target-return constraint
        return np.dot(weights, expected_returns)

    # Constraints and bounds: weights sum to 1, no short selling
    constraints = {'type': 'eq', 'fun': lambda x: np.sum(x) - 1}
    bounds = tuple((0, 1) for _ in range(len(expected_returns)))

    # Minimize portfolio variance subject to the constraints
    result = minimize(portfolio_variance,
                      x0=np.array([1 / len(expected_returns)] * len(expected_returns)),
                      args=(covariance_matrix,),
                      method='SLSQP',
                      bounds=bounds,
                      constraints=constraints)
    return result.x
Best Practices
1. Statistical Validation
- Always check assumptions before applying statistical tests
- Use appropriate sample sizes for reliable inference
- Account for multiple testing corrections (see the sketch after this list)
- Validate models on out-of-sample data
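As a concrete example of the multiple-testing point above, the sketch below applies a Bonferroni correction to a set of p-values; the p-values are illustrative, and libraries such as statsmodels also offer ready-made corrections.

import numpy as np

def bonferroni_correction(p_values, alpha=0.05):
    """Reject H0 only where the Bonferroni-adjusted p-value stays below alpha."""
    p_values = np.asarray(p_values, dtype=float)
    adjusted = np.minimum(p_values * len(p_values), 1.0)
    return adjusted, adjusted < alpha

# Illustrative p-values from five independent tests
p_vals = [0.003, 0.021, 0.048, 0.120, 0.650]
adjusted, reject = bonferroni_correction(p_vals)
print(adjusted, reject)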
2. Computational Efficiency
- Use vectorized operations instead of loops
- Implement incremental algorithms for streaming data
- Cache intermediate results for repeated calculations
- Profile code to identify bottlenecks
3. Reproducibility
- Set random seeds for stochastic processes
- Document data preprocessing steps
- Version control analytical code and datasets
- Use container technologies for environment consistency
4. Communication
- Present uncertainty alongside point estimates
- Use appropriate visualizations for data types
- Provide business context for statistical findings
- Document assumptions and limitations
These mathematical and statistical fundamentals provide the rigorous foundation upon which all analytical methods are built. Mastery of these concepts enables data engineers to design systems that support sophisticated analytical workflows while maintaining statistical rigor and computational efficiency.
Related Topics
For deeper exploration of analytics applications:
- Classification: Apply these statistical concepts to build predictive models
- Machine Learning Fundamentals: Extend statistical foundations into ML algorithms
- Data Engineering Pipelines: Implement analytics within scalable data systems
- API Management: Serve analytical results through robust APIs
For practical implementations:
- Rust Programming: High-performance statistical computing implementations
- Data Processing: Scale analytical computations across distributed systems