
Classification

Classification is a fundamental supervised learning technique that predicts discrete categorical outcomes. In data engineering contexts, classification models power recommendation systems, fraud detection, customer segmentation, and automated decision-making processes at scale.

Core Philosophy

Classification is fundamentally about learning decision boundaries that separate different classes in feature space. Unlike regression, which predicts continuous values, classification outputs discrete class labels with associated probabilities. This makes it ideal for:

1. Business Decision Automation

Classification models enable automated decision-making:

  • Customer credit approval (approve/deny/manual review)
  • Email filtering (spam/not spam/promotional)
  • Quality control (pass/fail/inspect)
  • Risk assessment (high/medium/low risk)

2. Probabilistic Reasoning

Modern classification provides probability estimates:

  • Confidence scores for predictions
  • Uncertainty quantification for critical decisions
  • Risk assessment in high-stakes applications
  • A/B testing and experimentation frameworks

3. Multi-class and Multi-label Support

Handles complex taxonomies:

  • Multi-class: One label per instance (e.g., product category)
  • Multi-label: Multiple labels per instance (e.g., image tags)
  • Hierarchical: Nested class structures
  • Imbalanced classes with different importance weights

Mathematical Foundation

Decision Theory Framework

Classification seeks to find a decision boundary function that maps input features to discrete class labels:

$$\boxed{f(\mathbf{x}) = \arg\max_{y \in \mathcal{Y}} P(Y = y \mid \mathbf{X} = \mathbf{x})}$$

Where:

  • $\mathbf{x} \in \mathcal{X}$ is the input feature vector
  • $\mathcal{Y} = \{1, 2, \ldots, K\}$ is the set of class labels
  • $P(Y = y \mid \mathbf{X} = \mathbf{x})$ is the posterior probability

Bayes' Theorem

The foundation of probabilistic classification:

$$P(Y = k \mid \mathbf{X} = \mathbf{x}) = \frac{P(\mathbf{X} = \mathbf{x} \mid Y = k) \cdot P(Y = k)}{P(\mathbf{X} = \mathbf{x})}$$

Where:

  • $P(Y = k)$ is the prior probability of class $k$
  • $P(\mathbf{X} = \mathbf{x} \mid Y = k)$ is the class-conditional likelihood
  • $P(\mathbf{X} = \mathbf{x})$ is the marginal likelihood (normalizing constant)
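
As a concrete numeric illustration of the decision rule and Bayes' theorem above, the sketch below (plain Rust, with made-up priors and likelihoods) normalizes the joint terms into posteriors and returns the arg-max class.

/// Posterior computation and arg-max decision for K classes.
/// `priors[k]` = P(Y = k); `likelihoods[k]` = P(X = x | Y = k) for the observed x.
fn bayes_classify(priors: &[f64], likelihoods: &[f64]) -> (usize, Vec<f64>) {
    // Unnormalized posteriors: P(X = x | Y = k) * P(Y = k)
    let joint: Vec<f64> = priors.iter().zip(likelihoods).map(|(p, l)| p * l).collect();
    // Marginal likelihood P(X = x) is the normalizing constant
    let evidence: f64 = joint.iter().sum();
    let posteriors: Vec<f64> = joint.iter().map(|j| j / evidence).collect();
    // f(x) = argmax_k P(Y = k | X = x)
    let best = posteriors
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(k, _)| k)
        .unwrap();
    (best, posteriors)
}

fn main() {
    // Two classes with priors 0.7 / 0.3 and class-conditional likelihoods 0.2 / 0.9
    let (label, posteriors) = bayes_classify(&[0.7, 0.3], &[0.2, 0.9]);
    println!("predicted class: {}, posteriors: {:?}", label, posteriors);
}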

Loss Functions

0-1 Loss (Misclassification Error): $L_{0\text{-}1}(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases}$

Log-Loss (Cross-Entropy): $L_{\text{log}}(y, \hat{p}) = -\sum_{k=1}^{K} y_k \log(\hat{p}_k)$

Hinge Loss (SVM): $L_{\text{hinge}}(y, f(\mathbf{x})) = \max(0, 1 - y \cdot f(\mathbf{x}))$
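
These three losses translate directly into small helper functions; the sketch below is plain Rust with illustrative names, not tied to any crate.

/// 0-1 loss: 1 if the predicted label differs from the true label, else 0.
fn zero_one_loss(y_true: usize, y_pred: usize) -> f64 {
    if y_true == y_pred { 0.0 } else { 1.0 }
}

/// Log-loss (cross-entropy) for a one-hot target and predicted class probabilities.
fn log_loss(y_one_hot: &[f64], p_hat: &[f64]) -> f64 {
    y_one_hot
        .iter()
        .zip(p_hat)
        .map(|(y, p)| -y * p.max(1e-12).ln()) // clamp to avoid ln(0)
        .sum()
}

/// Hinge loss for labels y in {-1, +1} and a real-valued score f(x).
fn hinge_loss(y: f64, score: f64) -> f64 {
    (1.0 - y * score).max(0.0)
}

fn main() {
    println!("{}", zero_one_loss(1, 0));                    // 1.0
    println!("{:.4}", log_loss(&[0.0, 1.0], &[0.3, 0.7]));  // -ln(0.7) ≈ 0.3567
    println!("{:.4}", hinge_loss(1.0, 0.4));                // 0.6
}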

Core Algorithms

Logistic Regression

Sigmoid Function: $\sigma(z) = \frac{1}{1 + e^{-z}}$

Binary Classification: $P(Y = 1 \mid \mathbf{x}) = \sigma(\boldsymbol{\beta}^T \mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\beta}^T \mathbf{x}}}$

Multi-class (Softmax): $P(Y = k \mid \mathbf{x}) = \frac{e^{\boldsymbol{\beta}_k^T \mathbf{x}}}{\sum_{j=1}^{K} e^{\boldsymbol{\beta}_j^T \mathbf{x}}}$
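
The LogisticRegression implemented later in this section is binary-only, so as a complement, here is a minimal standalone sketch of the softmax mapping from class scores $z_k = \boldsymbol{\beta}_k^T \mathbf{x}$ to probabilities.

/// Softmax over class scores, with max-subtraction for numerical stability.
fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|z| (z - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // Hypothetical scores for three classes
    let probs = softmax(&[2.0, 1.0, 0.1]);
    println!("{:?}", probs); // sums to 1.0; the first class gets the largest probability
}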

Decision Trees

Information Gain: $IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$

Entropy: $H(S) = -\sum_{i=1}^{K} p_i \log_2(p_i)$

Gini Impurity: $\text{Gini}(S) = 1 - \sum_{i=1}^{K} p_i^2$

Splitting Criterion: Choose the attribute $A$ that maximizes information gain or minimizes weighted impurity.
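
The DecisionTree implementation later in this section splits on entropy; for comparison, a small sketch of Gini impurity computed directly from class labels (plain Rust, illustrative only):

use std::collections::HashMap;

/// Gini impurity of a set of class labels: 1 - sum_k p_k^2.
fn gini_impurity(labels: &[usize]) -> f64 {
    if labels.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for &label in labels {
        *counts.entry(label).or_insert(0) += 1;
    }
    let n = labels.len() as f64;
    1.0 - counts.values().map(|&c| (c as f64 / n).powi(2)).sum::<f64>()
}

fn main() {
    println!("{:.4}", gini_impurity(&[0, 0, 1, 1])); // 0.5 (maximally mixed, two classes)
    println!("{:.4}", gini_impurity(&[1, 1, 1, 1])); // 0.0 (pure node)
}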

Support Vector Machines (SVM)

Optimization Problem: $\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i$

Subject to: $y_i(\mathbf{w}^T \phi(\mathbf{x}_i) + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$

Kernel Trick:

  • Linear: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$
  • RBF: $K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\gamma\|\mathbf{x}_i - \mathbf{x}_j\|^2}$
  • Polynomial: $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T \mathbf{x}_j + r)^d$
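
The three kernels above as small standalone Rust functions; gamma, r, and d are free hyperparameters, and the code is a sketch rather than part of any SVM crate.

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Linear kernel: x_i^T x_j
fn linear_kernel(a: &[f64], b: &[f64]) -> f64 {
    dot(a, b)
}

/// RBF kernel: exp(-gamma * ||x_i - x_j||^2)
fn rbf_kernel(a: &[f64], b: &[f64], gamma: f64) -> f64 {
    let sq_dist: f64 = a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum();
    (-gamma * sq_dist).exp()
}

/// Polynomial kernel: (x_i^T x_j + r)^d
fn polynomial_kernel(a: &[f64], b: &[f64], r: f64, d: i32) -> f64 {
    (dot(a, b) + r).powi(d)
}

fn main() {
    let (x, y) = ([1.0, 2.0], [0.5, 1.5]);
    println!("{:.4}", linear_kernel(&x, &y));             // 3.5
    println!("{:.4}", rbf_kernel(&x, &y, 0.5));           // exp(-0.25) ≈ 0.7788
    println!("{:.4}", polynomial_kernel(&x, &y, 1.0, 2)); // 20.25
}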

Naive Bayes

Assumption: Features are conditionally independent given the class.

$$P(Y = k \mid \mathbf{x}) \propto P(Y = k) \prod_{i=1}^{d} P(X_i = x_i \mid Y = k)$$

Gaussian Naive Bayes: $P(X_i = x_i \mid Y = k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \, e^{-\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}}$
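
Naive Bayes is not part of the pipeline implemented below, so here is a minimal Gaussian Naive Bayes scoring sketch: per-class priors, means, and variances are assumed to have been estimated already, and prediction sums the per-feature Gaussian log-likelihoods with the log-prior.

use std::f64::consts::PI;

/// Log-density of a univariate Gaussian N(mu, var) at x.
fn gaussian_log_pdf(x: f64, mu: f64, var: f64) -> f64 {
    -0.5 * ((2.0 * PI * var).ln() + (x - mu).powi(2) / var)
}

/// Per-class parameters of a (hypothetical, already fitted) Gaussian Naive Bayes model.
struct ClassParams {
    prior: f64,
    means: Vec<f64>,
    vars: Vec<f64>,
}

/// Predict the class with the highest log-posterior (the shared evidence term cancels).
fn predict(classes: &[ClassParams], x: &[f64]) -> usize {
    classes
        .iter()
        .enumerate()
        .map(|(k, c)| {
            let log_likelihood: f64 = x
                .iter()
                .enumerate()
                .map(|(i, &xi)| gaussian_log_pdf(xi, c.means[i], c.vars[i]))
                .sum();
            (k, c.prior.ln() + log_likelihood)
        })
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(k, _)| k)
        .unwrap()
}

fn main() {
    // Two classes, two features; the parameters are for illustration only
    let classes = vec![
        ClassParams { prior: 0.6, means: vec![0.0, 0.0], vars: vec![1.0, 1.0] },
        ClassParams { prior: 0.4, means: vec![3.0, 3.0], vars: vec![1.0, 1.0] },
    ];
    println!("predicted class: {}", predict(&classes, &[2.5, 2.8]));
}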

Random Forest

Bootstrap Aggregating (Bagging):

  1. Sample $n$ training examples with replacement
  2. Train a decision tree on the bootstrap sample
  3. Repeat $B$ times
  4. Average predictions: $\hat{f}(\mathbf{x}) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(\mathbf{x})$

Feature Randomness: At each split, consider only $m = \sqrt{p}$ randomly selected features.
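
The two sources of randomness above, bootstrap sampling of rows and a √p feature subset per split, are easy to isolate; the sketch below uses the rand crate already listed in this section's dependencies, and the function names are illustrative.

use rand::seq::SliceRandom;
use rand::Rng;

/// Draw n row indices with replacement (one bootstrap sample).
fn bootstrap_indices(n: usize, rng: &mut impl Rng) -> Vec<usize> {
    (0..n).map(|_| rng.gen_range(0..n)).collect()
}

/// Pick m = sqrt(p) candidate feature indices for a split, without replacement.
fn random_feature_subset(p: usize, rng: &mut impl Rng) -> Vec<usize> {
    let m = (p as f64).sqrt().round().max(1.0) as usize;
    let mut features: Vec<usize> = (0..p).collect();
    features.shuffle(rng);
    features.truncate(m);
    features
}

fn main() {
    let mut rng = rand::thread_rng();
    println!("rows:     {:?}", bootstrap_indices(10, &mut rng));
    println!("features: {:?}", random_feature_subset(9, &mut rng)); // 3 of 9 features
}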

Practical Implementation

Rust Implementation

// Standard library imports
use std::collections::HashMap;
 
// External crate imports
use nalgebra::{DMatrix, DVector};
use rand::Rng;
use serde::{Deserialize, Serialize};
 
/// Represents a labeled dataset for classification tasks.
/// 
/// Contains feature matrix, labels, and metadata about the dataset structure.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Dataset {
    /// Feature matrix where rows are samples and columns are features
    features: DMatrix<f64>,
    /// Label vector with class indices for each sample
    labels: DVector<usize>,
    /// Names of features for interpretability
    feature_names: Vec<String>,
    /// Names of classes for each label index
    class_names: Vec<String>,
}
 
/// Performance metrics for classification models.
/// 
/// Provides comprehensive evaluation metrics including accuracy, precision, recall, and confusion matrix.
#[derive(Debug, Clone)]
pub struct ClassificationMetrics {
    /// Overall classification accuracy (0.0 to 1.0)
    accuracy: f64,
    /// Precision score for each class
    precision: Vec<f64>,
    /// Recall score for each class  
    recall: Vec<f64>,
    /// F1-score for each class
    f1_score: Vec<f64>,
    /// Confusion matrix showing true vs predicted labels
    confusion_matrix: DMatrix<usize>,
}
 
/// Common interface for all classification algorithms.
/// 
/// Provides standardized methods for training, prediction, and probability estimation.
pub trait Classifier {
    /// Train the classifier on labeled data.
    /// 
    /// # Arguments
    /// * `X` - Feature matrix (samples × features)
    /// * `y` - Label vector with class indices
    fn fit(&mut self, X: &DMatrix<f64>, y: &DVector<usize>);
    
    /// Predict class labels for new samples.
    /// 
    /// # Arguments
    /// * `X` - Feature matrix for prediction
    /// 
    /// # Returns
    /// * Vector of predicted class indices
    fn predict(&self, X: &DMatrix<f64>) -> DVector<usize>;
    
    /// Predict class probabilities for new samples.
    /// 
    /// # Arguments  
    /// * `X` - Feature matrix for prediction
    /// 
    /// # Returns
    /// * Matrix where each row contains class probabilities for a sample
    fn predict_proba(&self, X: &DMatrix<f64>) -> DMatrix<f64>;
}
 
// Logistic Regression Implementation
#[derive(Debug, Clone)]
pub struct LogisticRegression {
    weights: DVector<f64>,
    intercept: f64,
    learning_rate: f64,
    max_iterations: usize,
}
 
impl LogisticRegression {
    pub fn new(learning_rate: f64, max_iterations: usize) -> Self {
        Self {
            weights: DVector::zeros(0),
            intercept: 0.0,
            learning_rate,
            max_iterations,
        }
    }
    
    fn sigmoid(&self, z: f64) -> f64 {
        1.0 / (1.0 + (-z).exp())
    }
}
 
impl Classifier for LogisticRegression {
    fn fit(&mut self, X: &DMatrix<f64>, y: &DVector<usize>) {
        let (n_samples, n_features) = (X.nrows(), X.ncols());
        self.weights = DVector::zeros(n_features);
        self.intercept = 0.0;
        
        for _ in 0..self.max_iterations {
            // Forward pass
            let linear_pred: DVector<f64> = X * &self.weights + DVector::repeat(n_samples, self.intercept);
            let predictions: DVector<f64> = linear_pred.map(|z| self.sigmoid(z));
            
            // Convert labels to f64 for computation
            let y_float: DVector<f64> = y.map(|label| label as f64);
            
            // Compute gradients
            let errors = &predictions - &y_float;
            let weight_gradient = X.transpose() * &errors / n_samples as f64;
            let intercept_gradient = errors.sum() / n_samples as f64;
            
            // Update parameters
            self.weights -= &weight_gradient * self.learning_rate;
            self.intercept -= intercept_gradient * self.learning_rate;
        }
    }
    
    fn predict(&self, X: &DMatrix<f64>) -> DVector<usize> {
        let probabilities = self.predict_proba(X);
        probabilities.column(1).map(|prob| if prob > 0.5 { 1 } else { 0 })
    }
    
    fn predict_proba(&self, X: &DMatrix<f64>) -> DMatrix<f64> {
        let linear_pred = X * &self.weights + DVector::repeat(X.nrows(), self.intercept);
        let prob_class_1: DVector<f64> = linear_pred.map(|z| self.sigmoid(z));
        let prob_class_0: DVector<f64> = prob_class_1.map(|p| 1.0 - p);
        
        DMatrix::from_columns(&[prob_class_0, prob_class_1])
    }
}
 
// Decision Tree Node
#[derive(Debug, Clone)]
pub struct TreeNode {
    feature_index: Option<usize>,
    threshold: Option<f64>,
    left: Option<Box<TreeNode>>,
    right: Option<Box<TreeNode>>,
    class_prediction: Option<usize>,
}
 
#[derive(Debug, Clone)]
pub struct DecisionTree {
    root: Option<TreeNode>,
    max_depth: usize,
    min_samples_split: usize,
}
 
impl DecisionTree {
    pub fn new(max_depth: usize, min_samples_split: usize) -> Self {
        Self {
            root: None,
            max_depth,
            min_samples_split,
        }
    }
    
    fn calculate_entropy(&self, y: &[usize]) -> f64 {
        if y.is_empty() {
            return 0.0;
        }
        
        let mut class_counts = HashMap::new();
        for &class in y {
            *class_counts.entry(class).or_insert(0) += 1;
        }
        
        let n = y.len() as f64;
        class_counts.values()
            .map(|&count| {
                let p = count as f64 / n;
                if p > 0.0 { -p * p.log2() } else { 0.0 }
            })
            .sum()
    }
    
    fn find_best_split(&self, X: &DMatrix<f64>, y: &[usize]) -> (usize, f64, f64) {
        let mut best_gain = 0.0;
        let mut best_feature = 0;
        let mut best_threshold = 0.0;
        
        let parent_entropy = self.calculate_entropy(y);
        
        for feature_idx in 0..X.ncols() {
            let feature_values: Vec<f64> = X.column(feature_idx).iter().cloned().collect();
            let mut sorted_values = feature_values.clone();
            sorted_values.sort_by(|a, b| a.partial_cmp(b).unwrap());
            
            for i in 0..sorted_values.len() - 1 {
                let threshold = (sorted_values[i] + sorted_values[i + 1]) / 2.0;
                
                let (left_indices, right_indices): (Vec<usize>, Vec<usize>) = 
                    (0..y.len()).partition(|&idx| feature_values[idx] <= threshold);
                
                if left_indices.is_empty() || right_indices.is_empty() {
                    continue;
                }
                
                let left_y: Vec<usize> = left_indices.iter().map(|&idx| y[idx]).collect();
                let right_y: Vec<usize> = right_indices.iter().map(|&idx| y[idx]).collect();
                
                let left_entropy = self.calculate_entropy(&left_y);
                let right_entropy = self.calculate_entropy(&right_y);
                
                let n_left = left_y.len() as f64;
                let n_right = right_y.len() as f64;
                let n_total = y.len() as f64;
                
                let information_gain = parent_entropy 
                    - (n_left / n_total * left_entropy + n_right / n_total * right_entropy);
                
                if information_gain > best_gain {
                    best_gain = information_gain;
                    best_feature = feature_idx;
                    best_threshold = threshold;
                }
            }
        }
        
        (best_feature, best_threshold, best_gain)
    }
    
    fn build_tree(&self, X: &DMatrix<f64>, y: &[usize], depth: usize) -> TreeNode {
        // Base cases
        if depth >= self.max_depth || y.len() < self.min_samples_split {
            let mut class_counts = HashMap::new();
            for &class in y {
                *class_counts.entry(class).or_insert(0) += 1;
            }
            let majority_class = *class_counts.iter()
                .max_by_key(|(_, &count)| count)
                .map(|(class, _)| class)
                .unwrap_or(&0);
            
            return TreeNode {
                feature_index: None,
                threshold: None,
                left: None,
                right: None,
                class_prediction: Some(majority_class),
            };
        }
        
        let (best_feature, best_threshold, best_gain) = self.find_best_split(X, y);
        
        if best_gain == 0.0 {
            let mut class_counts = HashMap::new();
            for &class in y {
                *class_counts.entry(class).or_insert(0) += 1;
            }
            let majority_class = *class_counts.iter()
                .max_by_key(|(_, &count)| count)
                .map(|(class, _)| class)
                .unwrap_or(&0);
            
            return TreeNode {
                feature_index: None,
                threshold: None,
                left: None,
                right: None,
                class_prediction: Some(majority_class),
            };
        }
        
        // Split data
        let feature_values: Vec<f64> = X.column(best_feature).iter().cloned().collect();
        let (left_indices, right_indices): (Vec<usize>, Vec<usize>) = 
            (0..y.len()).partition(|&idx| feature_values[idx] <= best_threshold);
        
        // Build child matrices row by row; from_fn indexes by (row, col), which keeps
        // the original row layout intact (from_iterator would fill column-major).
        let left_X = DMatrix::from_fn(left_indices.len(), X.ncols(), |r, c| X[(left_indices[r], c)]);
        let left_y: Vec<usize> = left_indices.iter().map(|&idx| y[idx]).collect();
        
        let right_X = DMatrix::from_fn(right_indices.len(), X.ncols(), |r, c| X[(right_indices[r], c)]);
        let right_y: Vec<usize> = right_indices.iter().map(|&idx| y[idx]).collect();
        
        TreeNode {
            feature_index: Some(best_feature),
            threshold: Some(best_threshold),
            left: Some(Box::new(self.build_tree(&left_X, &left_y, depth + 1))),
            right: Some(Box::new(self.build_tree(&right_X, &right_y, depth + 1))),
            class_prediction: None,
        }
    }
    
    fn predict_single(&self, node: &TreeNode, sample: &DVector<f64>) -> usize {
        if let Some(class) = node.class_prediction {
            return class;
        }
        
        let feature_value = sample[node.feature_index.unwrap()];
        let threshold = node.threshold.unwrap();
        
        if feature_value <= threshold {
            self.predict_single(node.left.as_ref().unwrap(), sample)
        } else {
            self.predict_single(node.right.as_ref().unwrap(), sample)
        }
    }
}
 
impl Classifier for DecisionTree {
    fn fit(&mut self, X: &DMatrix<f64>, y: &DVector<usize>) {
        let y_slice: Vec<usize> = y.iter().cloned().collect();
        self.root = Some(self.build_tree(X, &y_slice, 0));
    }
    
    fn predict(&self, X: &DMatrix<f64>) -> DVector<usize> {
        if let Some(ref root) = self.root {
            let predictions: Vec<usize> = X.row_iter()
                .map(|row| {
                    let sample = DVector::from_vec(row.iter().cloned().collect());
                    self.predict_single(root, &sample)
                })
                .collect();
            DVector::from_vec(predictions)
        } else {
            DVector::zeros(X.nrows())
        }
    }
    
    fn predict_proba(&self, X: &DMatrix<f64>) -> DMatrix<f64> {
        // Simplified probability estimation for decision trees
        let predictions = self.predict(X);
        let mut probabilities = DMatrix::zeros(X.nrows(), 2);
        
        for (i, &pred) in predictions.iter().enumerate() {
            if pred == 1 {
                probabilities[(i, 1)] = 1.0;
            } else {
                probabilities[(i, 0)] = 1.0;
            }
        }
        
        probabilities
    }
}
 
// Main Classification Pipeline
pub struct ClassificationPipeline {
    models: HashMap<String, Box<dyn Classifier>>,
    best_model_name: Option<String>,
}
 
impl ClassificationPipeline {
    pub fn new() -> Self {
        let mut models: HashMap<String, Box<dyn Classifier>> = HashMap::new();
        models.insert("Logistic Regression".to_string(), 
                     Box::new(LogisticRegression::new(0.01, 1000)));
        models.insert("Decision Tree".to_string(),
                     Box::new(DecisionTree::new(10, 2)));
        
        Self {
            models,
            best_model_name: None,
        }
    }
    
    pub fn standardize_features(&self, X: &DMatrix<f64>) -> DMatrix<f64> {
        // Per-column statistics: row_mean/row_variance reduce over the rows of X
        let mean = X.row_mean();
        let std = X.row_variance().map(|v| v.sqrt());
        
        let mut standardized = X.clone();
        for (col_idx, mut column) in standardized.column_iter_mut().enumerate() {
            let col_mean = mean[col_idx];
            let col_std = std[col_idx];
            
            if col_std > 1e-8 {
                for value in column.iter_mut() {
                    *value = (*value - col_mean) / col_std;
                }
            }
        }
        
        standardized
    }
    
    pub fn train_and_evaluate(&mut self, dataset: &Dataset) -> ClassificationMetrics {
        let X = self.standardize_features(&dataset.features);
        let y = &dataset.labels;
        
        // Simple train-test split (80-20)
        let n_train = (X.nrows() as f64 * 0.8) as usize;
        let train_X = X.rows(0, n_train).into_owned();
        let test_X = X.rows(n_train, X.nrows() - n_train).into_owned();
        let train_y = y.rows(0, n_train).into_owned();
        let test_y = y.rows(n_train, y.nrows() - n_train).into_owned();
        
        let mut best_accuracy = 0.0;
        let mut best_predictions = DVector::zeros(test_y.nrows());
        
        for (name, model) in self.models.iter_mut() {
            println!("Training {}...", name);
            
            // Train model
            model.fit(&train_X, &train_y);
            
            // Make predictions
            let predictions = model.predict(&test_X);
            
            // Calculate accuracy
            let correct: usize = predictions.iter()
                .zip(test_y.iter())
                .map(|(&pred, &actual)| if pred == actual { 1 } else { 0 })
                .sum();
            let accuracy = correct as f64 / test_y.nrows() as f64;
            
            println!("Accuracy: {:.4}", accuracy);
            
            if accuracy > best_accuracy {
                best_accuracy = accuracy;
                best_predictions = predictions;
                self.best_model_name = Some(name.clone());
            }
        }
        
        // Calculate comprehensive metrics
        self.calculate_metrics(&test_y, &best_predictions)
    }
    
    fn calculate_metrics(&self, y_true: &DVector<usize>, y_pred: &DVector<usize>) -> ClassificationMetrics {
        let n_classes = 2; // Binary classification for simplicity
        let mut confusion_matrix = DMatrix::zeros(n_classes, n_classes);
        
        // Build confusion matrix
        for (&true_label, &pred_label) in y_true.iter().zip(y_pred.iter()) {
            confusion_matrix[(true_label, pred_label)] += 1;
        }
        
        // Calculate metrics per class
        let mut precision = Vec::with_capacity(n_classes);
        let mut recall = Vec::with_capacity(n_classes);
        let mut f1_score = Vec::with_capacity(n_classes);
        
        for class in 0..n_classes {
            let tp = confusion_matrix[(class, class)] as f64;
            let fp: f64 = (0..n_classes).map(|i| confusion_matrix[(i, class)] as f64).sum::<f64>() - tp;
            let fn_val: f64 = (0..n_classes).map(|j| confusion_matrix[(class, j)] as f64).sum::<f64>() - tp;
            
            let prec = if tp + fp > 0.0 { tp / (tp + fp) } else { 0.0 };
            let rec = if tp + fn_val > 0.0 { tp / (tp + fn_val) } else { 0.0 };
            let f1 = if prec + rec > 0.0 { 2.0 * prec * rec / (prec + rec) } else { 0.0 };
            
            precision.push(prec);
            recall.push(rec);
            f1_score.push(f1);
        }
        
        let accuracy = (0..n_classes).map(|i| confusion_matrix[(i, i)]).sum::<usize>() as f64 
                      / y_true.nrows() as f64;
        
        ClassificationMetrics {
            accuracy,
            precision,
            recall,
            f1_score,
            confusion_matrix,
        }
    }
    
    pub fn generate_report(&self, metrics: &ClassificationMetrics) {
        if let Some(ref best_model) = self.best_model_name {
            println!("\nBest Model: {}", best_model);
        }
        
        println!("Classification Report:");
        println!("Accuracy: {:.4}", metrics.accuracy);
        
        for (i, (&prec, (&rec, &f1))) in metrics.precision.iter()
            .zip(metrics.recall.iter().zip(metrics.f1_score.iter()))
            .enumerate() {
            println!("Class {}: Precision: {:.4}, Recall: {:.4}, F1: {:.4}", i, prec, rec, f1);
        }
        
        println!("\nConfusion Matrix:");
        for i in 0..metrics.confusion_matrix.nrows() {
            for j in 0..metrics.confusion_matrix.ncols() {
                print!("{:>4}", metrics.confusion_matrix[(i, j)]);
            }
            println!();
        }
    }
}
 
// Example usage
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create sample dataset
    let features = DMatrix::from_row_slice(4, 2, &[
        1.0, 2.0,
        2.0, 3.0,
        3.0, 4.0,
        4.0, 5.0,
    ]);
    let labels = DVector::from_vec(vec![0, 0, 1, 1]);
    let feature_names = vec!["feature1".to_string(), "feature2".to_string()];
    let class_names = vec!["class0".to_string(), "class1".to_string()];
    
    let dataset = Dataset {
        features,
        labels,
        feature_names,
        class_names,
    };
    
    // Initialize and run pipeline
    let mut pipeline = ClassificationPipeline::new();
    let metrics = pipeline.train_and_evaluate(&dataset);
    pipeline.generate_report(&metrics);
    
    Ok(())
}
 
/*
Cargo.toml dependencies:
 
[dependencies]
nalgebra = "0.32"
rand = "0.8" 
serde = { version = "1.0", features = ["derive"] }
*/

SQL-Based Classification

-- Feature engineering for classification
WITH customer_features AS (
    SELECT 
        customer_id,
        -- Behavioral features
        COUNT(DISTINCT order_id) as order_count,
        AVG(order_value) as avg_order_value,
        MAX(order_date) as last_order_date,
        DATEDIFF(CURRENT_DATE, MAX(order_date)) as days_since_last_order,
        
        -- Temporal features
        AVG(EXTRACT(DOW FROM order_date)) as avg_order_dow,
        AVG(EXTRACT(HOUR FROM order_timestamp)) as avg_order_hour,
        
        -- Product diversity
        COUNT(DISTINCT product_category) as category_diversity,
        
        -- Customer lifetime metrics
        DATEDIFF(MAX(order_date), MIN(order_date)) as customer_tenure_days,
        
        -- Labels (example: churn prediction)
        CASE 
            WHEN DATEDIFF(CURRENT_DATE, MAX(order_date)) > 90 THEN 1 
            ELSE 0 
        END as is_churned
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '2 years'
    GROUP BY customer_id
),
 
-- Feature scaling and binning
scaled_features AS (
    SELECT 
        customer_id,
        -- Standardized features
        (order_count - AVG(order_count) OVER()) / STDDEV(order_count) OVER() as order_count_scaled,
        (avg_order_value - AVG(avg_order_value) OVER()) / STDDEV(avg_order_value) OVER() as aov_scaled,
        
        -- Categorical features
        CASE 
            WHEN days_since_last_order <= 30 THEN 'active'
            WHEN days_since_last_order <= 90 THEN 'at_risk'
            ELSE 'inactive'
        END as recency_segment,
        
        CASE 
            WHEN order_count >= PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY order_count) OVER() THEN 'high_frequency'
            WHEN order_count >= PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY order_count) OVER() THEN 'medium_frequency'
            ELSE 'low_frequency'
        END as frequency_segment,
        
        is_churned
    FROM customer_features
)
 
-- Simple decision tree logic in SQL
SELECT 
    customer_id,
    recency_segment,
    frequency_segment,
    CASE 
        WHEN recency_segment = 'inactive' THEN 'high_churn_risk'
        WHEN recency_segment = 'at_risk' AND frequency_segment = 'low_frequency' THEN 'medium_churn_risk'
        WHEN recency_segment = 'active' AND frequency_segment IN ('high_frequency', 'medium_frequency') THEN 'low_churn_risk'
        ELSE 'medium_churn_risk'
    END as predicted_churn_risk,
    is_churned as actual_churn
FROM scaled_features;

Model Evaluation

Performance Metrics

Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Precision: $\text{Precision} = \frac{TP}{TP + FP}$

Recall (Sensitivity): $\text{Recall} = \frac{TP}{TP + FN}$

F1-Score: $\text{F1} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

Specificity: $\text{Specificity} = \frac{TN}{TN + FP}$

ROC and AUC

ROC Curve: Plot of the True Positive Rate against the False Positive Rate, where $\text{TPR} = \frac{TP}{TP + FN}$ and $\text{FPR} = \frac{FP}{FP + TN}$

AUC (Area Under Curve): Measures discriminative ability

  • AUC = 0.5: Random classifier
  • AUC = 1.0: Perfect classifier
  • AUC > 0.7: Generally acceptable performance
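
AUC can also be computed directly from its rank interpretation, the probability that a randomly chosen positive is scored above a randomly chosen negative; for small score sets the pairwise form below is simpler than tracing the full ROC curve (a sketch, O(n²), meant for illustration).

/// AUC via the pairwise (Mann-Whitney) formulation: the fraction of (positive, negative)
/// pairs in which the positive receives the higher score; ties count as 0.5.
fn roc_auc(labels: &[usize], scores: &[f64]) -> f64 {
    let mut pairs = 0.0;
    let mut wins = 0.0;
    for (i, &yi) in labels.iter().enumerate() {
        for (j, &yj) in labels.iter().enumerate() {
            if yi == 1 && yj == 0 {
                pairs += 1.0;
                if scores[i] > scores[j] {
                    wins += 1.0;
                } else if (scores[i] - scores[j]).abs() < f64::EPSILON {
                    wins += 0.5;
                }
            }
        }
    }
    if pairs > 0.0 { wins / pairs } else { 0.5 }
}

fn main() {
    let labels = [1, 0, 1, 0, 1];
    let scores = [0.9, 0.3, 0.8, 0.6, 0.4];
    println!("AUC = {:.4}", roc_auc(&labels, &scores)); // 0.8333: one of six pairs is misordered
}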

Cross-Validation Strategies

K-Fold Cross-Validation:

  1. Divide the data into $k$ folds
  2. Train on $k-1$ folds, test on the remaining fold
  3. Repeat $k$ times and average the results

Stratified K-Fold: Maintains class distribution across folds

Time Series Split: For temporal data, ensure no data leakage
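
A sketch of plain k-fold index generation (shuffle, then deal indices into k folds); the stratified and time-series variants would change only how indices are assigned to folds.

use rand::seq::SliceRandom;

/// Split sample indices 0..n into k shuffled folds; fold i is the test set in round i,
/// and the remaining folds form the training set.
fn k_fold_indices(n: usize, k: usize) -> Vec<Vec<usize>> {
    let mut indices: Vec<usize> = (0..n).collect();
    indices.shuffle(&mut rand::thread_rng());
    let mut folds = vec![Vec::new(); k];
    for (pos, idx) in indices.into_iter().enumerate() {
        folds[pos % k].push(idx);
    }
    folds
}

fn main() {
    let folds = k_fold_indices(10, 3);
    for (i, test_fold) in folds.iter().enumerate() {
        let train: Vec<usize> = folds
            .iter()
            .enumerate()
            .filter(|(j, _)| *j != i)
            .flat_map(|(_, fold)| fold.iter().copied())
            .collect();
        println!("round {}: test {:?}, train {:?}", i, test_fold, train);
    }
}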

Real-World Applications

Fraud Detection

// Standard library imports
use std::collections::HashMap;
 
// External crate imports  
use chrono::{DateTime, Datelike, Timelike, Utc, Weekday};
use nalgebra::{DMatrix, DVector};
use serde::{Deserialize, Serialize};
 
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Transaction {
    pub transaction_id: String,
    pub user_id: String,
    pub amount: f64,
    pub merchant_id: String,
    pub merchant_category: String,
    pub timestamp: DateTime<Utc>,
    pub location: String,
    pub is_fraud: bool,
}
 
#[derive(Debug, Clone)]
pub struct UserStats {
    pub avg_amount: f64,
    pub std_amount: f64,
    pub txn_count: usize,
    pub merchant_diversity: usize,
    pub avg_daily_count: f64,
    pub weekly_spend_avg: f64,
}
 
#[derive(Debug, Clone)]
pub struct FraudFeatures {
    pub transaction_amount: f64,
    pub merchant_category_encoded: f64,
    pub time_since_last_transaction: f64,
    pub daily_transaction_count: f64,
    pub weekly_spend_average: f64,
    pub is_weekend: f64,
    pub is_night: f64,
    pub unusual_location: f64,
    pub new_merchant: f64,
    pub amount_z_score: f64,
    pub is_amount_outlier: f64,
}
 
#[derive(Debug)]
pub enum RiskLevel {
    Low,
    Medium,
    High,
}
 
#[derive(Debug)]
pub enum Recommendation {
    Approve,
    Review,
    Block,
}
 
#[derive(Debug)]
pub struct FraudPrediction {
    pub fraud_probability: f64,
    pub risk_level: RiskLevel,
    pub recommendation: Recommendation,
}
 
pub struct FraudDetectionSystem {
    classifier: Option<Box<dyn Classifier>>,
    user_profiles: HashMap<String, UserStats>,
    merchant_categories: HashMap<String, f64>,
    location_profiles: HashMap<String, Vec<String>>,
}
 
impl FraudDetectionSystem {
    pub fn new() -> Self {
        Self {
            classifier: None,
            user_profiles: HashMap::new(),
            merchant_categories: HashMap::new(),
            location_profiles: HashMap::new(),
        }
    }
    
    pub fn build_user_profiles(&mut self, transactions: &[Transaction]) {
        let mut user_data: HashMap<String, Vec<&Transaction>> = HashMap::new();
        
        // Group transactions by user
        for transaction in transactions {
            user_data.entry(transaction.user_id.clone())
                     .or_insert_with(Vec::new)
                     .push(transaction);
        }
        
        // Calculate user statistics
        for (user_id, user_transactions) in user_data {
            let amounts: Vec<f64> = user_transactions.iter()
                                                   .map(|t| t.amount)
                                                   .collect();
            
            let avg_amount = amounts.iter().sum::<f64>() / amounts.len() as f64;
            let variance = amounts.iter()
                                 .map(|&x| (x - avg_amount).powi(2))
                                 .sum::<f64>() / amounts.len() as f64;
            let std_amount = variance.sqrt();
            
            let merchant_diversity = user_transactions.iter()
                                                    .map(|t| &t.merchant_id)
                                                    .collect::<std::collections::HashSet<_>>()
                                                    .len();
            
            // Calculate daily transaction counts
            let mut daily_counts: HashMap<String, usize> = HashMap::new();
            for transaction in &user_transactions {
                let date = transaction.timestamp.format("%Y-%m-%d").to_string();
                *daily_counts.entry(date).or_insert(0) += 1;
            }
            let avg_daily_count = daily_counts.values().sum::<usize>() as f64 / 
                                daily_counts.len() as f64;
            
            // Calculate weekly spending
            let total_spend: f64 = amounts.iter().sum();
            let weeks = user_transactions.len() as f64 / 7.0;
            let weekly_spend_avg = total_spend / weeks.max(1.0);
            
            let stats = UserStats {
                avg_amount,
                std_amount,
                txn_count: user_transactions.len(),
                merchant_diversity,
                avg_daily_count,
                weekly_spend_avg,
            };
            
            self.user_profiles.insert(user_id, stats);
        }
    }
    
    pub fn engineer_features(&self, transaction: &Transaction) -> FraudFeatures {
        let user_stats = self.user_profiles.get(&transaction.user_id)
                                          .cloned()
                                          .unwrap_or_else(|| UserStats {
                                              avg_amount: 0.0,
                                              std_amount: 1.0,
                                              txn_count: 0,
                                              merchant_diversity: 0,
                                              avg_daily_count: 0.0,
                                              weekly_spend_avg: 0.0,
                                          });
        
        // Temporal features
        let is_weekend = matches!(transaction.timestamp.weekday(), 
                                Weekday::Sat | Weekday::Sun) as u8 as f64;
        let hour = transaction.timestamp.hour();
        let is_night = (hour >= 22 || hour <= 6) as u8 as f64;
        
        // Merchant category encoding (simplified)
        let merchant_category_encoded = self.merchant_categories
                                           .get(&transaction.merchant_category)
                                           .copied()
                                           .unwrap_or(0.0);
        
        // Amount anomaly detection
        let amount_z_score = if user_stats.std_amount > 0.0 {
            ((transaction.amount - user_stats.avg_amount) / user_stats.std_amount).abs()
        } else {
            0.0
        };
        let is_amount_outlier = (amount_z_score > 3.0) as u8 as f64;
        
        // Location anomaly (simplified - check if location is in user's typical locations)
        let user_locations = self.location_profiles.get(&transaction.user_id);
        let unusual_location = match user_locations {
            Some(locations) => (!locations.contains(&transaction.location)) as u8 as f64,
            None => 1.0, // Unknown user, treat as unusual
        };
        
        FraudFeatures {
            transaction_amount: transaction.amount,
            merchant_category_encoded,
            time_since_last_transaction: 0.0, // Would need transaction history
            daily_transaction_count: user_stats.avg_daily_count,
            weekly_spend_average: user_stats.weekly_spend_avg,
            is_weekend,
            is_night,
            unusual_location,
            new_merchant: 0.0, // Would need merchant history
            amount_z_score,
            is_amount_outlier,
        }
    }
    
    pub fn features_to_vector(&self, features: &FraudFeatures) -> DVector<f64> {
        DVector::from_vec(vec![
            features.transaction_amount,
            features.merchant_category_encoded,
            features.time_since_last_transaction,
            features.daily_transaction_count,
            features.weekly_spend_average,
            features.is_weekend,
            features.is_night,
            features.unusual_location,
            features.new_merchant,
            features.amount_z_score,
            features.is_amount_outlier,
        ])
    }
    
    pub fn train_model(&mut self, transactions: &[Transaction]) -> Result<(), Box<dyn std::error::Error>> {
        // Build user profiles first
        self.build_user_profiles(transactions);
        
        // Engineer features for all transactions
        let features: Vec<FraudFeatures> = transactions.iter()
                                                      .map(|t| self.engineer_features(t))
                                                      .collect();
        
        // Convert to matrices
        let feature_vectors: Vec<DVector<f64>> = features.iter()
                                                        .map(|f| self.features_to_vector(f))
                                                        .collect();
        
        // Each sample's feature vector becomes one column, giving an (n_features x n_samples) matrix
        let X = DMatrix::from_columns(&feature_vectors);
        
        let y: DVector<usize> = DVector::from_vec(
            transactions.iter().map(|t| t.is_fraud as usize).collect()
        );
        
        // Train logistic regression model
        let mut model = LogisticRegression::new(0.001, 2000);
        model.fit(&X.transpose(), &y); // Transpose because we built X column-wise
        
        self.classifier = Some(Box::new(model));
        
        Ok(())
    }
    
    pub fn predict_fraud(&self, transaction: &Transaction) -> Option<FraudPrediction> {
        if let Some(ref classifier) = self.classifier {
            let features = self.engineer_features(transaction);
            let feature_vector = self.features_to_vector(&features);
            
            // Reshape to matrix form (1 x n_features)
            let X = DMatrix::from_row_slice(1, feature_vector.len(), feature_vector.as_slice());
            
            let probabilities = classifier.predict_proba(&X);
            let fraud_probability = probabilities[(0, 1)]; // Probability of fraud (class 1)
            
            let risk_level = if fraud_probability > 0.8 {
                RiskLevel::High
            } else if fraud_probability > 0.5 {
                RiskLevel::Medium
            } else {
                RiskLevel::Low
            };
            
            let recommendation = match risk_level {
                RiskLevel::High => Recommendation::Block,
                RiskLevel::Medium => Recommendation::Review,
                RiskLevel::Low => Recommendation::Approve,
            };
            
            Some(FraudPrediction {
                fraud_probability,
                risk_level,
                recommendation,
            })
        } else {
            None
        }
    }
    
    pub fn update_model_online(&mut self, new_transaction: &Transaction) {
        // Update user profiles with new transaction
        let user_id = &new_transaction.user_id;
        
        // This would implement online learning updates
        // For now, we'll just update the user profile
        if let Some(stats) = self.user_profiles.get_mut(user_id) {
            // Update running averages (simplified approach)
            let n = stats.txn_count as f64;
            let new_avg = (stats.avg_amount * n + new_transaction.amount) / (n + 1.0);
            stats.avg_amount = new_avg;
            stats.txn_count += 1;
        }
    }
}
 
// Example usage
fn example_fraud_detection() -> Result<(), Box<dyn std::error::Error>> {
    use chrono::{TimeZone, Utc};
    
    let mut fraud_detector = FraudDetectionSystem::new();
    
    // Sample training data
    let training_data = vec![
        Transaction {
            transaction_id: "txn1".to_string(),
            user_id: "user1".to_string(),
            amount: 25.50,
            merchant_id: "merchant_a".to_string(),
            merchant_category: "grocery".to_string(),
            timestamp: Utc.with_ymd_and_hms(2024, 1, 15, 10, 30, 0).unwrap(),
            location: "NYC".to_string(),
            is_fraud: false,
        },
        Transaction {
            transaction_id: "txn2".to_string(),
            user_id: "user1".to_string(),
            amount: 5000.00,
            merchant_id: "merchant_x".to_string(),
            merchant_category: "electronics".to_string(),
            timestamp: Utc.with_ymd_and_hms(2024, 1, 15, 23, 45, 0).unwrap(),
            location: "Unknown".to_string(),
            is_fraud: true,
        },
    ];
    
    // Train the model
    fraud_detector.train_model(&training_data)?;
    
    // Make prediction on new transaction
    let new_transaction = Transaction {
        transaction_id: "txn_new".to_string(),
        user_id: "user1".to_string(),
        amount: 150.00,
        merchant_id: "merchant_b".to_string(),
        merchant_category: "restaurant".to_string(),
        timestamp: Utc::now(),
        location: "NYC".to_string(),
        is_fraud: false, // Unknown in practice
    };
    
    if let Some(prediction) = fraud_detector.predict_fraud(&new_transaction) {
        println!("Fraud Probability: {:.4}", prediction.fraud_probability);
        println!("Risk Level: {:?}", prediction.risk_level);
        println!("Recommendation: {:?}", prediction.recommendation);
    }
    
    Ok(())
}

Customer Segmentation

// Standard library imports
use std::collections::HashMap;
 
// External crate imports
use chrono::{DateTime, Utc};
use nalgebra::{DMatrix, DVector};
use rand::prelude::*;
use serde::{Deserialize, Serialize};
 
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Order {
    pub order_id: String,
    pub customer_id: String,
    pub order_date: DateTime<Utc>,
    pub order_value: f64,
    pub product_category: String,
}
 
#[derive(Debug, Clone)]
pub struct CustomerMetrics {
    pub customer_id: String,
    pub recency: f64,        // Days since last order
    pub frequency: f64,      // Total number of orders
    pub monetary: f64,       // Total spending
    pub tenure: f64,         // Days since first order
    pub diversity: f64,      // Number of unique categories
}
 
#[derive(Debug, Clone)]
pub enum CustomerSegment {
    Champions,
    LoyalCustomers,
    AtRisk,
    NewCustomers,
}
 
impl CustomerSegment {
    pub fn from_id(id: usize) -> Self {
        match id {
            0 => Self::Champions,
            1 => Self::LoyalCustomers,
            2 => Self::AtRisk,
            _ => Self::NewCustomers,
        }
    }
    
    pub fn to_id(&self) -> usize {
        match self {
            Self::Champions => 0,
            Self::LoyalCustomers => 1,
            Self::AtRisk => 2,
            Self::NewCustomers => 3,
        }
    }
}
 
#[derive(Debug, Clone)]
pub struct SegmentProfile {
    pub segment: CustomerSegment,
    pub avg_recency: f64,
    pub avg_frequency: f64,
    pub avg_monetary: f64,
    pub avg_tenure: f64,
    pub avg_diversity: f64,
    pub customer_count: usize,
}
 
pub struct CustomerSegmentation {
    classifier: Option<Box<dyn Classifier>>,
    feature_scaler_params: Option<(DVector<f64>, DVector<f64>)>, // (mean, std)
    cluster_centers: Option<DMatrix<f64>>,
}
 
impl CustomerSegmentation {
    pub fn new() -> Self {
        Self {
            classifier: None,
            feature_scaler_params: None,
            cluster_centers: None,
        }
    }
    
    pub fn create_customer_features(&self, orders: &[Order]) -> Vec<CustomerMetrics> {
        let current_date = Utc::now();
        let mut customer_data: HashMap<String, Vec<&Order>> = HashMap::new();
        
        // Group orders by customer
        for order in orders {
            customer_data.entry(order.customer_id.clone())
                         .or_insert_with(Vec::new)
                         .push(order);
        }
        
        let mut customer_metrics = Vec::new();
        
        for (customer_id, customer_orders) in customer_data {
            // Calculate recency (days since last order)
            let last_order_date = customer_orders.iter()
                                                 .map(|o| o.order_date)
                                                 .max()
                                                 .unwrap();
            let recency = (current_date - last_order_date).num_days() as f64;
            
            // Calculate frequency (total number of orders)
            let frequency = customer_orders.len() as f64;
            
            // Calculate monetary (total spending)
            let monetary: f64 = customer_orders.iter()
                                              .map(|o| o.order_value)
                                              .sum();
            
            // Calculate tenure (days since first order)
            let first_order_date = customer_orders.iter()
                                                  .map(|o| o.order_date)
                                                  .min()
                                                  .unwrap();
            let tenure = (current_date - first_order_date).num_days() as f64;
            
            // Calculate diversity (unique product categories)
            let unique_categories: std::collections::HashSet<_> = customer_orders.iter()
                                                                                .map(|o| &o.product_category)
                                                                                .collect();
            let diversity = unique_categories.len() as f64;
            
            customer_metrics.push(CustomerMetrics {
                customer_id,
                recency,
                frequency,
                monetary,
                tenure,
                diversity,
            });
        }
        
        customer_metrics
    }
    
    fn metrics_to_feature_matrix(&self, metrics: &[CustomerMetrics]) -> DMatrix<f64> {
        let n_samples = metrics.len();
        let features: Vec<f64> = metrics.iter()
            .flat_map(|m| vec![m.recency, m.frequency, m.monetary, m.tenure, m.diversity])
            .collect();
        
        DMatrix::from_vec(n_samples, 5, features)
    }
    
    fn standardize_features(&mut self, X: &DMatrix<f64>) -> DMatrix<f64> {
        // Per-column statistics: row_mean/row_variance reduce over the rows of X;
        // transpose to store them as column vectors in feature_scaler_params
        let mean = X.row_mean().transpose();
        let std = X.row_variance().transpose().map(|v| v.sqrt().max(1e-8));
        
        // Store parameters for future use
        self.feature_scaler_params = Some((mean.clone(), std.clone()));
        
        let mut standardized = X.clone();
        for (col_idx, mut column) in standardized.column_iter_mut().enumerate() {
            let col_mean = mean[col_idx];
            let col_std = std[col_idx];
            
            for value in column.iter_mut() {
                *value = (*value - col_mean) / col_std;
            }
        }
        
        standardized
    }
    
    fn apply_standardization(&self, X: &DMatrix<f64>) -> DMatrix<f64> {
        if let Some((ref mean, ref std)) = self.feature_scaler_params {
            let mut standardized = X.clone();
            for (col_idx, mut column) in standardized.column_iter_mut().enumerate() {
                let col_mean = mean[col_idx];
                let col_std = std[col_idx];
                
                for value in column.iter_mut() {
                    *value = (*value - col_mean) / col_std;
                }
            }
            standardized
        } else {
            X.clone()
        }
    }
    
    pub fn k_means_clustering(&mut self, X: &DMatrix<f64>, k: usize, max_iterations: usize) -> DVector<usize> {
        let mut rng = rand::thread_rng();
        let (n_samples, n_features) = (X.nrows(), X.ncols());
        
        // Initialize cluster centers randomly
        let mut centers = DMatrix::zeros(k, n_features);
        for i in 0..k {
            for j in 0..n_features {
                let min_val = X.column(j).iter().fold(f64::INFINITY, |a, &b| a.min(b));
                let max_val = X.column(j).iter().fold(f64::NEG_INFINITY, |a, &b| a.max(b));
                centers[(i, j)] = rng.gen_range(min_val..max_val);
            }
        }
        
        let mut labels = DVector::zeros(n_samples);
        
        for _iteration in 0..max_iterations {
            let mut changed = false;
            
            // Assign points to clusters
            for i in 0..n_samples {
                let point = X.row(i);
                let mut min_distance = f64::INFINITY;
                let mut best_cluster = 0;
                
                for j in 0..k {
                    let center = centers.row(j);
                    let distance: f64 = point.iter()
                                           .zip(center.iter())
                                           .map(|(&a, &b)| (a - b).powi(2))
                                           .sum::<f64>()
                                           .sqrt();
                    
                    if distance < min_distance {
                        min_distance = distance;
                        best_cluster = j;
                    }
                }
                
                if labels[i] != best_cluster {
                    labels[i] = best_cluster;
                    changed = true;
                }
            }
            
            if !changed {
                break;
            }
            
            // Update cluster centers
            for j in 0..k {
                let cluster_points: Vec<usize> = labels.iter()
                                                      .enumerate()
                                                      .filter(|(_, &label)| label == j)
                                                      .map(|(idx, _)| idx)
                                                      .collect();
                
                if !cluster_points.is_empty() {
                    for dim in 0..n_features {
                        let sum: f64 = cluster_points.iter()
                                                    .map(|&idx| X[(idx, dim)])
                                                    .sum();
                        centers[(j, dim)] = sum / cluster_points.len() as f64;
                    }
                }
            }
        }
        
        self.cluster_centers = Some(centers);
        labels
    }
    
    pub fn segment_customers(&mut self, orders: &[Order], n_segments: usize) -> (Vec<CustomerMetrics>, Vec<SegmentProfile>) {
        // Create customer features
        let customer_metrics = self.create_customer_features(orders);
        
        // Convert to feature matrix
        let feature_matrix = self.metrics_to_feature_matrix(&customer_metrics);
        
        // Standardize features
        let scaled_features = self.standardize_features(&feature_matrix);
        
        // Perform K-means clustering
        let cluster_labels = self.k_means_clustering(&scaled_features, n_segments, 100);
        
        // Train classification model to predict segments
        let mut decision_tree = DecisionTree::new(10, 2);
        decision_tree.fit(&scaled_features, &cluster_labels);
        self.classifier = Some(Box::new(decision_tree));
        
        // Create segment profiles
        let segment_profiles = self.create_segment_profiles(&customer_metrics, &cluster_labels, n_segments);
        
        segment_profiles
    }
    
    fn create_segment_profiles(&self, metrics: &[CustomerMetrics], labels: &DVector<usize>, n_segments: usize) -> (Vec<CustomerMetrics>, Vec<SegmentProfile>) {
        let mut segment_profiles = Vec::new();
        let mut customer_metrics_with_segments = Vec::new();
        
        for segment_id in 0..n_segments {
            let segment_indices: Vec<usize> = labels.iter()
                                                   .enumerate()
                                                   .filter(|(_, &label)| label == segment_id)
                                                   .map(|(idx, _)| idx)
                                                   .collect();
            
            if segment_indices.is_empty() {
                continue;
            }
            
            let segment_metrics: Vec<&CustomerMetrics> = segment_indices.iter()
                                                                       .map(|&idx| &metrics[idx])
                                                                       .collect();
            
            let avg_recency = segment_metrics.iter().map(|m| m.recency).sum::<f64>() / segment_metrics.len() as f64;
            let avg_frequency = segment_metrics.iter().map(|m| m.frequency).sum::<f64>() / segment_metrics.len() as f64;
            let avg_monetary = segment_metrics.iter().map(|m| m.monetary).sum::<f64>() / segment_metrics.len() as f64;
            let avg_tenure = segment_metrics.iter().map(|m| m.tenure).sum::<f64>() / segment_metrics.len() as f64;
            let avg_diversity = segment_metrics.iter().map(|m| m.diversity).sum::<f64>() / segment_metrics.len() as f64;
            
            let profile = SegmentProfile {
                segment: CustomerSegment::from_id(segment_id),
                avg_recency,
                avg_frequency,
                avg_monetary,
                avg_tenure,
                avg_diversity,
                customer_count: segment_metrics.len(),
            };
            
            segment_profiles.push(profile);
        }
        
        // Return the customer metrics alongside the segment profiles
        for metric in metrics.iter() {
            customer_metrics_with_segments.push(metric.clone());
        }
        
        (customer_metrics_with_segments, segment_profiles)
    }
    
    pub fn predict_segment(&self, customer_metrics: &CustomerMetrics) -> Option<CustomerSegment> {
        if let Some(ref classifier) = self.classifier {
            let features = vec![
                customer_metrics.recency,
                customer_metrics.frequency,
                customer_metrics.monetary,
                customer_metrics.tenure,
                customer_metrics.diversity,
            ];
            
            let feature_matrix = DMatrix::from_vec(1, 5, features);
            let scaled_features = self.apply_standardization(&feature_matrix);
            
            let predictions = classifier.predict(&scaled_features);
            Some(CustomerSegment::from_id(predictions[0]))
        } else {
            None
        }
    }
    
    pub fn print_segment_report(&self, profiles: &[SegmentProfile]) {
        println!("\nCustomer Segmentation Report");
        println!("=" * 50);
        
        for profile in profiles {
            println!("\nSegment: {:?}", profile.segment);
            println!("Customer Count: {}", profile.customer_count);
            println!("Average Recency: {:.2} days", profile.avg_recency);
            println!("Average Frequency: {:.2} orders", profile.avg_frequency);
            println!("Average Monetary: ${:.2}", profile.avg_monetary);
            println!("Average Tenure: {:.2} days", profile.avg_tenure);
            println!("Average Diversity: {:.2} categories", profile.avg_diversity);
        }
    }
}
 
// Example usage
fn example_customer_segmentation() -> Result<(), Box<dyn std::error::Error>> {
    use chrono::{TimeZone, Utc};
    
    let mut segmentation = CustomerSegmentation::new();
    
    // Sample order data
    let orders = vec![
        Order {
            order_id: "order1".to_string(),
            customer_id: "customer1".to_string(),
            order_date: Utc.with_ymd_and_hms(2024, 1, 1, 0, 0, 0).unwrap(),
            order_value: 100.0,
            product_category: "electronics".to_string(),
        },
        Order {
            order_id: "order2".to_string(),
            customer_id: "customer1".to_string(),
            order_date: Utc.with_ymd_and_hms(2024, 1, 15, 0, 0, 0).unwrap(),
            order_value: 50.0,
            product_category: "books".to_string(),
        },
        Order {
            order_id: "order3".to_string(),
            customer_id: "customer2".to_string(),
            order_date: Utc.with_ymd_and_hms(2023, 12, 1, 0, 0, 0).unwrap(),
            order_value: 500.0,
            product_category: "electronics".to_string(),
        },
    ];
    
    // Perform segmentation
    let (customer_metrics, segment_profiles) = segmentation.segment_customers(&orders, 4);
    
    // Print report
    segmentation.print_segment_report(&segment_profiles);
    
    // Predict segment for new customer metrics
    let new_customer = CustomerMetrics {
        customer_id: "new_customer".to_string(),
        recency: 30.0,
        frequency: 5.0,
        monetary: 250.0,
        tenure: 180.0,
        diversity: 3.0,
    };
    
    if let Some(predicted_segment) = segmentation.predict_segment(&new_customer) {
        println!("\nPredicted segment for new customer: {:?}", predicted_segment);
    }
    
    Ok(())
}

Advanced Topics

Ensemble Methods

Voting Classifier: $\hat{y} = \arg\max_{c} \sum_{i=1}^{n} w_i \cdot \mathbb{1}[\hat{y}_i = c]$

Stacking:

  1. Train base models on training data
  2. Use base model predictions as features for meta-model
  3. Meta-model learns optimal combination
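
Returning to the voting classifier above, a hard (uniform-weight) majority vote over already-computed per-model predictions reduces to a small counting function; the sketch below is illustrative, and weighted voting would tally $w_i$ per vote instead of 1.

use std::cmp::Reverse;
use std::collections::HashMap;

/// Hard majority vote across the per-model predictions for one sample.
/// `votes[m]` is the class predicted by model m; ties go to the lower class index.
fn majority_vote(votes: &[usize]) -> usize {
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for &v in votes {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by_key(|&(class, count)| (count, Reverse(class)))
        .map(|(class, _)| class)
        .unwrap()
}

fn main() {
    // Three models vote on one sample: two predict class 1, one predicts class 0
    println!("{}", majority_vote(&[1, 0, 1])); // 1
}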

Handling Imbalanced Classes

Sampling Techniques:

  • SMOTE: Synthetic Minority Oversampling Technique
  • ADASYN: Adaptive Synthetic Sampling
  • Undersampling: Random or informed minority class removal

Cost-Sensitive Learning: $J(\boldsymbol{\theta}) = \sum_{i=1}^{n} C(y_i) \cdot L(y_i, f(\mathbf{x}_i; \boldsymbol{\theta}))$

Where $C(y_i)$ is the cost of misclassifying class $y_i$.
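
As a sketch of how the cost term enters, the weighted binary log-loss below scales each example's loss by the cost of its true class; the same weights could multiply the per-example errors in the gradient step of the LogisticRegression shown earlier (the costs here are illustrative).

/// Cost-weighted binary log-loss: each example's loss is scaled by the cost of
/// misclassifying its true class, e.g. class_cost[1] > class_cost[0] when missing a
/// positive (say, a fraudulent transaction) is the expensive mistake.
fn weighted_log_loss(y_true: &[usize], p_hat: &[f64], class_cost: &[f64; 2]) -> f64 {
    let n = y_true.len() as f64;
    y_true
        .iter()
        .zip(p_hat)
        .map(|(&y, &p)| {
            let p = p.clamp(1e-12, 1.0 - 1e-12);
            let per_example = if y == 1 { -p.ln() } else { -(1.0 - p).ln() };
            class_cost[y] * per_example
        })
        .sum::<f64>()
        / n
}

fn main() {
    let y = [1, 0, 1, 0];
    let p = [0.6, 0.2, 0.3, 0.1];
    // Missing a positive costs five times as much as a false alarm
    println!("{:.4}", weighted_log_loss(&y, &p, &[1.0, 5.0]));
}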

Multi-label Classification

  • Binary Relevance: Train a separate binary classifier for each label (sketched below)
  • Classifier Chains: Model label dependencies sequentially
  • Label Powerset: Treat each unique label combination as a class
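
Binary relevance maps directly onto the Classifier trait defined in the Practical Implementation section: one binary model per label column. The sketch below assumes those types (Classifier, LogisticRegression) are in scope and that the label matrix holds 0/1 indicators.

use nalgebra::{DMatrix, DVector};

/// Binary relevance: one independent binary classifier per label.
/// `y` is an n_samples x n_labels 0/1 indicator matrix.
struct BinaryRelevance {
    models: Vec<LogisticRegression>,
}

impl BinaryRelevance {
    fn fit(x: &DMatrix<f64>, y: &DMatrix<usize>) -> Self {
        let models = (0..y.ncols())
            .map(|label| {
                // Train one binary classifier on this label's column
                let target: DVector<usize> = y.column(label).into_owned();
                let mut model = LogisticRegression::new(0.01, 1000);
                model.fit(x, &target);
                model
            })
            .collect();
        Self { models }
    }

    /// One predicted 0/1 vector per label, in label order.
    fn predict(&self, x: &DMatrix<f64>) -> Vec<DVector<usize>> {
        self.models.iter().map(|m| m.predict(x)).collect()
    }
}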

Best Practices

1. Data Quality

  • Handle missing values appropriately
  • Detect and address outliers
  • Ensure representative sampling
  • Validate data consistency across sources

2. Feature Engineering

  • Domain knowledge integration
  • Feature scaling and normalization
  • Categorical variable encoding
  • Feature selection and dimensionality reduction

3. Model Selection

  • Start with simple baseline models
  • Compare multiple algorithms systematically
  • Use appropriate cross-validation strategies
  • Consider computational constraints

4. Evaluation Strategy

  • Use appropriate metrics for business context
  • Account for class imbalance in evaluation
  • Perform statistical significance testing
  • Monitor model performance over time

5. Production Deployment

  • Model versioning and rollback procedures
  • A/B testing for model updates
  • Real-time monitoring and alerting
  • Automated retraining pipelines

Classification provides data engineering organizations with powerful frameworks for automated decision-making, pattern recognition, and predictive analytics. Mastery of these techniques enables the construction of robust, scalable systems that transform raw data into actionable business insights.
