Classification
Classification is a fundamental supervised learning technique that predicts discrete categorical outcomes. In data engineering contexts, classification models power recommendation systems, fraud detection, customer segmentation, and automated decision-making processes at scale.
Core Philosophy
Classification is fundamentally about learning decision boundaries that separate different classes in feature space. Unlike regression, which predicts continuous values, classification outputs discrete class labels with associated probabilities. This makes it ideal for:
1. Business Decision Automation
Classification models enable automated decision-making:
- Customer credit approval (approve/deny/manual review)
- Email filtering (spam/not spam/promotional)
- Quality control (pass/fail/inspect)
- Risk assessment (high/medium/low risk)
2. Probabilistic Reasoning
Modern classification provides probability estimates:
- Confidence scores for predictions
- Uncertainty quantification for critical decisions
- Risk assessment in high-stakes applications
- A/B testing and experimentation frameworks
3. Multi-class and Multi-label Support
Handles complex taxonomies:
- Multi-class: One label per instance (e.g., product category)
- Multi-label: Multiple labels per instance (e.g., image tags)
- Hierarchical: Nested class structures
- Imbalanced classes with different importance weights
Mathematical Foundation
Decision Theory Framework
Classification seeks to find a decision boundary function that maps input features to discrete class labels:
$$\hat{y} = f(\mathbf{x}) = \arg\max_{c \in \mathcal{C}} P(c \mid \mathbf{x})$$
Where:
- $\mathbf{x}$ is the input feature vector
- $\mathcal{C}$ is the set of class labels
- $P(c \mid \mathbf{x})$ is the posterior probability
Bayes' Theorem
The foundation of probabilistic classification:
$$P(c \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid c)\,P(c)}{P(\mathbf{x})}$$
Where:
- $P(c)$ is the prior probability of class $c$
- $P(\mathbf{x} \mid c)$ is the class-conditional likelihood
- $P(\mathbf{x})$ is the marginal likelihood (normalizing constant)
Loss Functions
0-1 Loss (Misclassification Error): $L(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}]$
Log-Loss (Cross-Entropy): $L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\,\log \hat{p}_{ic}$
Hinge Loss (SVM): $L(y, f(\mathbf{x})) = \max(0,\; 1 - y\,f(\mathbf{x}))$, with labels $y \in \{-1, +1\}$
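For concreteness, here is a minimal, standard-library Rust sketch of the binary log-loss and hinge loss for a single prediction (standalone, separate from the pipeline implemented later in this section):
/// Binary cross-entropy for a single prediction; probabilities are clamped to avoid log(0).
fn log_loss(y_true: f64, p_pred: f64) -> f64 {
    let p = p_pred.clamp(1e-15, 1.0 - 1e-15);
    -(y_true * p.ln() + (1.0 - y_true) * (1.0 - p).ln())
}
/// Hinge loss for a single prediction; labels are expected in {-1.0, +1.0}.
fn hinge_loss(y_true: f64, decision_value: f64) -> f64 {
    (1.0 - y_true * decision_value).max(0.0)
}
fn main() {
    println!("log-loss:   {:.4}", log_loss(1.0, 0.9));    // confident and correct -> small loss
    println!("hinge loss: {:.4}", hinge_loss(-1.0, 0.3)); // wrong side of the margin -> positive loss
}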
Core Algorithms
Logistic Regression
Sigmoid Function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
Binary Classification: $P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x} + b)$
Multi-class (Softmax): $P(y = k \mid \mathbf{x}) = \dfrac{e^{\mathbf{w}_k^\top\mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top\mathbf{x}}}$
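Both functions are small enough to sketch directly. The snippet below is a standalone, numerically stable version (it subtracts the maximum score before exponentiating), independent of the LogisticRegression struct implemented later:
/// Logistic sigmoid: the two-class special case of softmax.
fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}
/// Numerically stable softmax over raw scores (logits).
fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|&s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
fn main() {
    println!("sigmoid(0.5)         = {:.4}", sigmoid(0.5));
    println!("softmax([2, 1, 0.1]) = {:?}", softmax(&[2.0, 1.0, 0.1]));
}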
Decision Trees
Information Gain: $IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\,H(S_v)$
Entropy: $H(S) = -\sum_{c=1}^{C} p_c \log_2 p_c$
Gini Impurity: $\mathrm{Gini}(S) = 1 - \sum_{c=1}^{C} p_c^2$
Splitting Criterion: Choose attribute that maximizes information gain or minimizes weighted impurity.
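Both impurity measures reduce to a few lines over class proportions. The standalone sketch below computes them from raw labels; the full DecisionTree implementation later in this section applies the same entropy formula when searching for splits:
use std::collections::HashMap;

/// Class proportions from raw labels.
fn class_probabilities(labels: &[usize]) -> Vec<f64> {
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for &c in labels {
        *counts.entry(c).or_insert(0) += 1;
    }
    let n = labels.len() as f64;
    counts.values().map(|&count| count as f64 / n).collect()
}

/// Entropy: -sum(p * log2(p)) over class proportions.
fn entropy(labels: &[usize]) -> f64 {
    class_probabilities(labels)
        .iter()
        .map(|&p| if p > 0.0 { -p * p.log2() } else { 0.0 })
        .sum()
}

/// Gini impurity: 1 - sum(p^2) over class proportions.
fn gini(labels: &[usize]) -> f64 {
    1.0 - class_probabilities(labels).iter().map(|&p| p * p).sum::<f64>()
}

fn main() {
    let labels = [0usize, 0, 1, 1, 1, 2];
    println!("entropy = {:.4}, gini = {:.4}", entropy(&labels), gini(&labels));
}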
Support Vector Machines (SVM)
Optimization Problem: $\min_{\mathbf{w}, b, \xi}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i$
Subject to: $y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i,\quad \xi_i \geq 0$
Kernel Trick:
- Linear: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top\mathbf{x}_j$
- RBF: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma\,\|\mathbf{x}_i - \mathbf{x}_j\|^2)$
- Polynomial: $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top\mathbf{x}_j + c)^d$
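As a reference point, the three kernels can be written directly over feature slices. This is a minimal standalone sketch using plain f64 slices rather than the nalgebra types used later:
/// Linear kernel: plain dot product.
fn linear_kernel(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

/// RBF (Gaussian) kernel with bandwidth parameter gamma.
fn rbf_kernel(x: &[f64], y: &[f64], gamma: f64) -> f64 {
    let sq_dist: f64 = x.iter().zip(y).map(|(a, b)| (a - b).powi(2)).sum();
    (-gamma * sq_dist).exp()
}

/// Polynomial kernel with offset c and degree d.
fn poly_kernel(x: &[f64], y: &[f64], c: f64, degree: i32) -> f64 {
    (linear_kernel(x, y) + c).powi(degree)
}

fn main() {
    let (x, y) = ([1.0, 2.0], [2.0, 0.5]);
    println!("linear = {:.4}", linear_kernel(&x, &y));
    println!("rbf    = {:.4}", rbf_kernel(&x, &y, 0.5));
    println!("poly   = {:.4}", poly_kernel(&x, &y, 1.0, 3));
}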
Naive Bayes
Assumption: Features are conditionally independent given the class.
Gaussian Naive Bayes: $P(x_i \mid c) = \dfrac{1}{\sqrt{2\pi\sigma_{c,i}^2}}\exp\!\left(-\dfrac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right)$, with prediction $\hat{y} = \arg\max_c P(c)\prod_{i=1}^{d} P(x_i \mid c)$
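A minimal standalone sketch of the per-feature Gaussian likelihood and the resulting log-posterior score; the per-class means, variances, and priors in main are hypothetical values used only for illustration:
/// Gaussian likelihood of one feature value under a class's per-feature mean/variance.
fn gaussian_likelihood(x: f64, mean: f64, variance: f64) -> f64 {
    let var = variance.max(1e-9); // guard against zero variance
    (1.0 / (2.0 * std::f64::consts::PI * var).sqrt()) * (-(x - mean).powi(2) / (2.0 * var)).exp()
}

/// Unnormalized log-posterior: log prior plus the sum of per-feature log-likelihoods.
fn log_posterior(x: &[f64], prior: f64, means: &[f64], variances: &[f64]) -> f64 {
    prior.ln()
        + x.iter()
            .zip(means.iter().zip(variances))
            .map(|(&xi, (&m, &v))| gaussian_likelihood(xi, m, v).ln())
            .sum::<f64>()
}

fn main() {
    // Hypothetical two-feature example: pick the class with the larger log-posterior.
    let sample = [1.2, 0.7];
    let class_a = log_posterior(&sample, 0.6, &[1.0, 0.5], &[0.2, 0.1]);
    let class_b = log_posterior(&sample, 0.4, &[3.0, 2.0], &[0.5, 0.5]);
    println!("predicted class: {}", if class_a > class_b { "A" } else { "B" });
}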
Random Forest
Bootstrap Aggregating (Bagging):
- Sample $N$ training examples with replacement
- Train a decision tree $T_b$ on the bootstrap sample
- Repeat $B$ times
- Aggregate predictions by majority vote (or by averaging class probabilities): $\hat{y} = \mathrm{mode}\{T_1(\mathbf{x}), \dots, T_B(\mathbf{x})\}$
Feature Randomness: At each split, consider only $m$ randomly selected features (commonly $m \approx \sqrt{d}$ for classification).
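Bootstrap sampling and vote aggregation are the two moving parts of bagging. A minimal standalone sketch using the rand crate (already listed in the Cargo.toml dependencies later in this section):
use rand::Rng;
use std::collections::HashMap;

/// Draw a bootstrap sample of row indices (sampling with replacement).
fn bootstrap_indices(n_samples: usize, rng: &mut impl Rng) -> Vec<usize> {
    (0..n_samples).map(|_| rng.gen_range(0..n_samples)).collect()
}

/// Majority vote over the per-tree predictions for a single sample.
fn majority_vote(tree_predictions: &[usize]) -> usize {
    let mut counts: HashMap<usize, usize> = HashMap::new();
    for &p in tree_predictions {
        *counts.entry(p).or_insert(0) += 1;
    }
    counts.into_iter().max_by_key(|&(_, c)| c).map(|(label, _)| label).unwrap_or(0)
}

fn main() {
    let mut rng = rand::thread_rng();
    println!("bootstrap sample: {:?}", bootstrap_indices(10, &mut rng));
    println!("ensemble vote:    {}", majority_vote(&[1, 0, 1, 1, 0]));
}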
Practical Implementation
Rust Implementation
// Standard library imports
use std::collections::HashMap;
// External crate imports
use nalgebra::{DMatrix, DVector};
use rand::Rng;
use serde::{Deserialize, Serialize};
/// Represents a labeled dataset for classification tasks.
///
/// Contains feature matrix, labels, and metadata about the dataset structure.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Dataset {
/// Feature matrix where rows are samples and columns are features
features: DMatrix<f64>,
/// Label vector with class indices for each sample
labels: DVector<usize>,
/// Names of features for interpretability
feature_names: Vec<String>,
/// Names of classes for each label index
class_names: Vec<String>,
}
/// Performance metrics for classification models.
///
/// Provides comprehensive evaluation metrics including accuracy, precision, recall, and confusion matrix.
#[derive(Debug, Clone)]
pub struct ClassificationMetrics {
/// Overall classification accuracy (0.0 to 1.0)
accuracy: f64,
/// Precision score for each class
precision: Vec<f64>,
/// Recall score for each class
recall: Vec<f64>,
/// F1-score for each class
f1_score: Vec<f64>,
/// Confusion matrix showing true vs predicted labels
confusion_matrix: DMatrix<usize>,
}
/// Common interface for all classification algorithms.
///
/// Provides standardized methods for training, prediction, and probability estimation.
pub trait Classifier {
/// Train the classifier on labeled data.
///
/// # Arguments
/// * `X` - Feature matrix (samples × features)
/// * `y` - Label vector with class indices
fn fit(&mut self, X: &DMatrix<f64>, y: &DVector<usize>);
/// Predict class labels for new samples.
///
/// # Arguments
/// * `X` - Feature matrix for prediction
///
/// # Returns
/// * Vector of predicted class indices
fn predict(&self, X: &DMatrix<f64>) -> DVector<usize>;
/// Predict class probabilities for new samples.
///
/// # Arguments
/// * `X` - Feature matrix for prediction
///
/// # Returns
/// * Matrix where each row contains class probabilities for a sample
fn predict_proba(&self, X: &DMatrix<f64>) -> DMatrix<f64>;
}
// Logistic Regression Implementation
#[derive(Debug, Clone)]
pub struct LogisticRegression {
weights: DVector<f64>,
intercept: f64,
learning_rate: f64,
max_iterations: usize,
}
impl LogisticRegression {
pub fn new(learning_rate: f64, max_iterations: usize) -> Self {
Self {
weights: DVector::zeros(0),
intercept: 0.0,
learning_rate,
max_iterations,
}
}
fn sigmoid(&self, z: f64) -> f64 {
1.0 / (1.0 + (-z).exp())
}
}
impl Classifier for LogisticRegression {
fn fit(&mut self, X: &DMatrix<f64>, y: &DVector<usize>) {
let (n_samples, n_features) = (X.nrows(), X.ncols());
self.weights = DVector::zeros(n_features);
self.intercept = 0.0;
for _ in 0..self.max_iterations {
// Forward pass
let linear_pred: DVector<f64> = X * &self.weights + DVector::repeat(n_samples, self.intercept);
let predictions: DVector<f64> = linear_pred.map(|z| self.sigmoid(z));
// Convert labels to f64 for computation
let y_float: DVector<f64> = y.map(|label| label as f64);
// Compute gradients
let errors = &predictions - &y_float;
let weight_gradient = X.transpose() * &errors / n_samples as f64;
let intercept_gradient = errors.sum() / n_samples as f64;
// Update parameters
self.weights -= &weight_gradient * self.learning_rate;
self.intercept -= intercept_gradient * self.learning_rate;
}
}
fn predict(&self, X: &DMatrix<f64>) -> DVector<usize> {
let probabilities = self.predict_proba(X);
probabilities.column(1).map(|prob| if prob > 0.5 { 1 } else { 0 })
}
fn predict_proba(&self, X: &DMatrix<f64>) -> DMatrix<f64> {
let linear_pred = X * &self.weights + DVector::repeat(X.nrows(), self.intercept);
let prob_class_1: DVector<f64> = linear_pred.map(|z| self.sigmoid(z));
let prob_class_0: DVector<f64> = prob_class_1.map(|p| 1.0 - p);
DMatrix::from_columns(&[prob_class_0, prob_class_1])
}
}
// Decision Tree Node
#[derive(Debug, Clone)]
pub struct TreeNode {
feature_index: Option<usize>,
threshold: Option<f64>,
left: Option<Box<TreeNode>>,
right: Option<Box<TreeNode>>,
class_prediction: Option<usize>,
}
#[derive(Debug, Clone)]
pub struct DecisionTree {
root: Option<TreeNode>,
max_depth: usize,
min_samples_split: usize,
}
impl DecisionTree {
pub fn new(max_depth: usize, min_samples_split: usize) -> Self {
Self {
root: None,
max_depth,
min_samples_split,
}
}
fn calculate_entropy(&self, y: &[usize]) -> f64 {
if y.is_empty() {
return 0.0;
}
let mut class_counts = HashMap::new();
for &class in y {
*class_counts.entry(class).or_insert(0) += 1;
}
let n = y.len() as f64;
class_counts.values()
.map(|&count| {
let p = count as f64 / n;
if p > 0.0 { -p * p.log2() } else { 0.0 }
})
.sum()
}
fn find_best_split(&self, X: &DMatrix<f64>, y: &[usize]) -> (usize, f64, f64) {
let mut best_gain = 0.0;
let mut best_feature = 0;
let mut best_threshold = 0.0;
let parent_entropy = self.calculate_entropy(y);
for feature_idx in 0..X.ncols() {
let feature_values: Vec<f64> = X.column(feature_idx).iter().cloned().collect();
let mut sorted_values = feature_values.clone();
sorted_values.sort_by(|a, b| a.partial_cmp(b).unwrap());
for i in 0..sorted_values.len() - 1 {
let threshold = (sorted_values[i] + sorted_values[i + 1]) / 2.0;
let (left_indices, right_indices): (Vec<usize>, Vec<usize>) =
(0..y.len()).partition(|&idx| feature_values[idx] <= threshold);
if left_indices.is_empty() || right_indices.is_empty() {
continue;
}
let left_y: Vec<usize> = left_indices.iter().map(|&idx| y[idx]).collect();
let right_y: Vec<usize> = right_indices.iter().map(|&idx| y[idx]).collect();
let left_entropy = self.calculate_entropy(&left_y);
let right_entropy = self.calculate_entropy(&right_y);
let n_left = left_y.len() as f64;
let n_right = right_y.len() as f64;
let n_total = y.len() as f64;
let information_gain = parent_entropy
- (n_left / n_total * left_entropy + n_right / n_total * right_entropy);
if information_gain > best_gain {
best_gain = information_gain;
best_feature = feature_idx;
best_threshold = threshold;
}
}
}
(best_feature, best_threshold, best_gain)
}
fn build_tree(&self, X: &DMatrix<f64>, y: &[usize], depth: usize) -> TreeNode {
// Base cases
if depth >= self.max_depth || y.len() < self.min_samples_split {
let mut class_counts = HashMap::new();
for &class in y {
*class_counts.entry(class).or_insert(0) += 1;
}
let majority_class = *class_counts.iter()
.max_by_key(|(_, &count)| count)
.map(|(class, _)| class)
.unwrap_or(&0);
return TreeNode {
feature_index: None,
threshold: None,
left: None,
right: None,
class_prediction: Some(majority_class),
};
}
let (best_feature, best_threshold, best_gain) = self.find_best_split(X, y);
if best_gain == 0.0 {
let mut class_counts = HashMap::new();
for &class in y {
*class_counts.entry(class).or_insert(0) += 1;
}
let majority_class = *class_counts.iter()
.max_by_key(|(_, &count)| count)
.map(|(class, _)| class)
.unwrap_or(&0);
return TreeNode {
feature_index: None,
threshold: None,
left: None,
right: None,
class_prediction: Some(majority_class),
};
}
// Split data
let feature_values: Vec<f64> = X.column(best_feature).iter().cloned().collect();
let (left_indices, right_indices): (Vec<usize>, Vec<usize>) =
(0..y.len()).partition(|&idx| feature_values[idx] <= best_threshold);
// Build child matrices by copying the selected rows; from_fn avoids
// column-major/row-major mix-ups when gathering rows by index.
let left_X = DMatrix::from_fn(left_indices.len(), X.ncols(), |r, c| X[(left_indices[r], c)]);
let left_y: Vec<usize> = left_indices.iter().map(|&idx| y[idx]).collect();
let right_X = DMatrix::from_fn(right_indices.len(), X.ncols(), |r, c| X[(right_indices[r], c)]);
let right_y: Vec<usize> = right_indices.iter().map(|&idx| y[idx]).collect();
TreeNode {
feature_index: Some(best_feature),
threshold: Some(best_threshold),
left: Some(Box::new(self.build_tree(&left_X, &left_y, depth + 1))),
right: Some(Box::new(self.build_tree(&right_X, &right_y, depth + 1))),
class_prediction: None,
}
}
fn predict_single(&self, node: &TreeNode, sample: &DVector<f64>) -> usize {
if let Some(class) = node.class_prediction {
return class;
}
let feature_value = sample[node.feature_index.unwrap()];
let threshold = node.threshold.unwrap();
if feature_value <= threshold {
self.predict_single(node.left.as_ref().unwrap(), sample)
} else {
self.predict_single(node.right.as_ref().unwrap(), sample)
}
}
}
impl Classifier for DecisionTree {
fn fit(&mut self, X: &DMatrix<f64>, y: &DVector<usize>) {
let y_slice: Vec<usize> = y.iter().cloned().collect();
self.root = Some(self.build_tree(X, &y_slice, 0));
}
fn predict(&self, X: &DMatrix<f64>) -> DVector<usize> {
if let Some(ref root) = self.root {
let predictions: Vec<usize> = X.row_iter()
.map(|row| {
let sample = DVector::from_vec(row.iter().cloned().collect());
self.predict_single(root, &sample)
})
.collect();
DVector::from_vec(predictions)
} else {
DVector::zeros(X.nrows())
}
}
fn predict_proba(&self, X: &DMatrix<f64>) -> DMatrix<f64> {
// Simplified probability estimation for decision trees
let predictions = self.predict(X);
let mut probabilities = DMatrix::zeros(X.nrows(), 2);
for (i, &pred) in predictions.iter().enumerate() {
if pred == 1 {
probabilities[(i, 1)] = 1.0;
} else {
probabilities[(i, 0)] = 1.0;
}
}
probabilities
}
}
// Main Classification Pipeline
pub struct ClassificationPipeline {
models: HashMap<String, Box<dyn Classifier>>,
best_model_name: Option<String>,
}
impl ClassificationPipeline {
pub fn new() -> Self {
let mut models: HashMap<String, Box<dyn Classifier>> = HashMap::new();
models.insert("Logistic Regression".to_string(),
Box::new(LogisticRegression::new(0.01, 1000)));
models.insert("Decision Tree".to_string(),
Box::new(DecisionTree::new(10, 2)));
Self {
models,
best_model_name: None,
}
}
pub fn standardize_features(&self, X: &DMatrix<f64>) -> DMatrix<f64> {
// Per-feature (column) statistics: row_mean/row_variance aggregate over rows.
let mean = X.row_mean();
let std = X.row_variance().map(|v| v.sqrt());
let mut standardized = X.clone();
for (col_idx, mut column) in standardized.column_iter_mut().enumerate() {
let col_mean = mean[col_idx];
let col_std = std[col_idx];
if col_std > 1e-8 {
for value in column.iter_mut() {
*value = (*value - col_mean) / col_std;
}
}
}
standardized
}
pub fn train_and_evaluate(&mut self, dataset: &Dataset) -> ClassificationMetrics {
let X = self.standardize_features(&dataset.features);
let y = &dataset.labels;
// Simple train-test split (80-20)
let n_train = (X.nrows() as f64 * 0.8) as usize;
let train_X = X.rows(0, n_train).into_owned();
let test_X = X.rows(n_train, X.nrows() - n_train).into_owned();
let train_y = y.rows(0, n_train).into_owned();
let test_y = y.rows(n_train, y.nrows() - n_train).into_owned();
let mut best_accuracy = 0.0;
let mut best_predictions = DVector::zeros(test_y.nrows());
for (name, model) in self.models.iter_mut() {
println!("Training {}...", name);
// Train model
model.fit(&train_X, &train_y);
// Make predictions
let predictions = model.predict(&test_X);
// Calculate accuracy
let correct: usize = predictions.iter()
.zip(test_y.iter())
.map(|(&pred, &actual)| if pred == actual { 1 } else { 0 })
.sum();
let accuracy = correct as f64 / test_y.nrows() as f64;
println!("Accuracy: {:.4}", accuracy);
if accuracy > best_accuracy {
best_accuracy = accuracy;
best_predictions = predictions;
self.best_model_name = Some(name.clone());
}
}
// Calculate comprehensive metrics
self.calculate_metrics(&test_y, &best_predictions)
}
fn calculate_metrics(&self, y_true: &DVector<usize>, y_pred: &DVector<usize>) -> ClassificationMetrics {
let n_classes = 2; // Binary classification for simplicity
let mut confusion_matrix = DMatrix::zeros(n_classes, n_classes);
// Build confusion matrix
for (&true_label, &pred_label) in y_true.iter().zip(y_pred.iter()) {
confusion_matrix[(true_label, pred_label)] += 1;
}
// Calculate metrics per class
let mut precision = Vec::with_capacity(n_classes);
let mut recall = Vec::with_capacity(n_classes);
let mut f1_score = Vec::with_capacity(n_classes);
for class in 0..n_classes {
let tp = confusion_matrix[(class, class)] as f64;
let fp: f64 = (0..n_classes).map(|i| confusion_matrix[(i, class)] as f64).sum::<f64>() - tp;
let fn_val: f64 = (0..n_classes).map(|j| confusion_matrix[(class, j)] as f64).sum::<f64>() - tp;
let prec = if tp + fp > 0.0 { tp / (tp + fp) } else { 0.0 };
let rec = if tp + fn_val > 0.0 { tp / (tp + fn_val) } else { 0.0 };
let f1 = if prec + rec > 0.0 { 2.0 * prec * rec / (prec + rec) } else { 0.0 };
precision.push(prec);
recall.push(rec);
f1_score.push(f1);
}
let accuracy = (0..n_classes).map(|i| confusion_matrix[(i, i)]).sum::<usize>() as f64
/ y_true.nrows() as f64;
ClassificationMetrics {
accuracy,
precision,
recall,
f1_score,
confusion_matrix,
}
}
pub fn generate_report(&self, metrics: &ClassificationMetrics) {
if let Some(ref best_model) = self.best_model_name {
println!("\nBest Model: {}", best_model);
}
println!("Classification Report:");
println!("Accuracy: {:.4}", metrics.accuracy);
for (i, (&prec, (&rec, &f1))) in metrics.precision.iter()
.zip(metrics.recall.iter().zip(metrics.f1_score.iter()))
.enumerate() {
println!("Class {}: Precision: {:.4}, Recall: {:.4}, F1: {:.4}", i, prec, rec, f1);
}
println!("\nConfusion Matrix:");
for i in 0..metrics.confusion_matrix.nrows() {
for j in 0..metrics.confusion_matrix.ncols() {
print!("{:>4}", metrics.confusion_matrix[(i, j)]);
}
println!();
}
}
}
// Example usage
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create sample dataset
// from_row_slice keeps the sample-per-row layout shown here.
let features = DMatrix::from_row_slice(4, 2, &[
1.0, 2.0,
2.0, 3.0,
3.0, 4.0,
4.0, 5.0,
]);
let labels = DVector::from_vec(vec![0, 0, 1, 1]);
let feature_names = vec!["feature1".to_string(), "feature2".to_string()];
let class_names = vec!["class0".to_string(), "class1".to_string()];
let dataset = Dataset {
features,
labels,
feature_names,
class_names,
};
// Initialize and run pipeline
let mut pipeline = ClassificationPipeline::new();
let metrics = pipeline.train_and_evaluate(&dataset);
pipeline.generate_report(&metrics);
Ok(())
}
/*
Cargo.toml dependencies:
[dependencies]
nalgebra = "0.32"
rand = "0.8"
serde = { version = "1.0", features = ["derive"] }
*/
SQL-Based Classification
-- Feature engineering for classification
WITH customer_features AS (
SELECT
customer_id,
-- Behavioral features
COUNT(DISTINCT order_id) as order_count,
AVG(order_value) as avg_order_value,
MAX(order_date) as last_order_date,
DATEDIFF(CURRENT_DATE, MAX(order_date)) as days_since_last_order,
-- Temporal features
AVG(EXTRACT(DOW FROM order_date)) as avg_order_dow,
AVG(EXTRACT(HOUR FROM order_timestamp)) as avg_order_hour,
-- Product diversity
COUNT(DISTINCT product_category) as category_diversity,
-- Customer lifetime metrics
DATEDIFF(MAX(order_date), MIN(order_date)) as customer_tenure_days,
-- Labels (example: churn prediction)
CASE
WHEN DATEDIFF(CURRENT_DATE, MAX(order_date)) > 90 THEN 1
ELSE 0
END as is_churned
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '2 years'
GROUP BY customer_id
),
-- Feature scaling and binning
scaled_features AS (
SELECT
customer_id,
-- Standardized features
(order_count - AVG(order_count) OVER()) / STDDEV(order_count) OVER() as order_count_scaled,
(avg_order_value - AVG(avg_order_value) OVER()) / STDDEV(avg_order_value) OVER() as aov_scaled,
-- Categorical features
CASE
WHEN days_since_last_order <= 30 THEN 'active'
WHEN days_since_last_order <= 90 THEN 'at_risk'
ELSE 'inactive'
END as recency_segment,
CASE
WHEN order_count >= PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY order_count) OVER() THEN 'high_frequency'
WHEN order_count >= PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY order_count) OVER() THEN 'medium_frequency'
ELSE 'low_frequency'
END as frequency_segment,
is_churned
FROM customer_features
)
-- Simple decision tree logic in SQL
SELECT
customer_id,
recency_segment,
frequency_segment,
CASE
WHEN recency_segment = 'inactive' THEN 'high_churn_risk'
WHEN recency_segment = 'at_risk' AND frequency_segment = 'low_frequency' THEN 'medium_churn_risk'
WHEN recency_segment = 'active' AND frequency_segment IN ('high_frequency', 'medium_frequency') THEN 'low_churn_risk'
ELSE 'medium_churn_risk'
END as predicted_churn_risk,
is_churned as actual_churn
FROM scaled_features;
Model Evaluation
Performance Metrics
Accuracy: $\dfrac{TP + TN}{TP + TN + FP + FN}$
Precision: $\dfrac{TP}{TP + FP}$
Recall (Sensitivity): $\dfrac{TP}{TP + FN}$
F1-Score: $2 \cdot \dfrac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
Specificity: $\dfrac{TN}{TN + FP}$
ROC and AUC
ROC Curve: Plot of True Positive Rate vs False Positive Rate
AUC (Area Under Curve): Measures discriminative ability
- AUC = 0.5: Random classifier
- AUC = 1.0: Perfect classifier
- AUC > 0.7: Generally acceptable performance
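AUC can also be computed directly from scores via its rank interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A minimal O(n²) standalone sketch:
/// AUC via the rank (Mann-Whitney U) formulation over binary labels (0/1).
/// Ties in score are counted as half.
fn roc_auc(labels: &[usize], scores: &[f64]) -> f64 {
    let mut num = 0.0;
    let mut pairs = 0.0;
    for (i, &li) in labels.iter().enumerate() {
        for (j, &lj) in labels.iter().enumerate() {
            if li == 1 && lj == 0 {
                pairs += 1.0;
                if scores[i] > scores[j] {
                    num += 1.0;
                } else if scores[i] == scores[j] {
                    num += 0.5;
                }
            }
        }
    }
    if pairs > 0.0 { num / pairs } else { 0.5 }
}

fn main() {
    let labels = [0usize, 0, 1, 1];
    let scores = [0.1, 0.4, 0.35, 0.8];
    println!("AUC = {:.4}", roc_auc(&labels, &scores)); // 0.75 for this example
}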
Cross-Validation Strategies
K-Fold Cross-Validation:
- Divide the data into $k$ folds
- Train on $k-1$ folds, test on the remaining fold
- Repeat $k$ times and average the results
Stratified K-Fold: Maintains class distribution across folds
Time Series Split: For temporal data, ensure no data leakage
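A minimal standalone sketch of stratified fold assignment: group sample indices by class and deal them round-robin across folds so each fold roughly preserves the class distribution (in practice, shuffle each class's indices first):
use std::collections::HashMap;

/// Build stratified k-fold test index sets.
fn stratified_k_fold(labels: &[usize], k: usize) -> Vec<Vec<usize>> {
    let mut by_class: HashMap<usize, Vec<usize>> = HashMap::new();
    for (idx, &label) in labels.iter().enumerate() {
        by_class.entry(label).or_default().push(idx);
    }
    let mut folds = vec![Vec::new(); k];
    for indices in by_class.values() {
        for (i, &idx) in indices.iter().enumerate() {
            folds[i % k].push(idx);
        }
    }
    folds
}

fn main() {
    let labels = [0usize, 0, 0, 0, 1, 1, 1, 1];
    for (i, fold) in stratified_k_fold(&labels, 2).iter().enumerate() {
        println!("fold {} test indices: {:?}", i, fold);
    }
}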
Real-World Applications
Fraud Detection
// Standard library imports
use std::collections::HashMap;
// External crate imports
use chrono::{DateTime, Datelike, Timelike, Utc, Weekday};
use nalgebra::{DMatrix, DVector};
use serde::{Deserialize, Serialize};
// Assumes the `Classifier` trait and `LogisticRegression` from the implementation above are in scope.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Transaction {
pub transaction_id: String,
pub user_id: String,
pub amount: f64,
pub merchant_id: String,
pub merchant_category: String,
pub timestamp: DateTime<Utc>,
pub location: String,
pub is_fraud: bool,
}
#[derive(Debug, Clone)]
pub struct UserStats {
pub avg_amount: f64,
pub std_amount: f64,
pub txn_count: usize,
pub merchant_diversity: usize,
pub avg_daily_count: f64,
pub weekly_spend_avg: f64,
}
#[derive(Debug, Clone)]
pub struct FraudFeatures {
pub transaction_amount: f64,
pub merchant_category_encoded: f64,
pub time_since_last_transaction: f64,
pub daily_transaction_count: f64,
pub weekly_spend_average: f64,
pub is_weekend: f64,
pub is_night: f64,
pub unusual_location: f64,
pub new_merchant: f64,
pub amount_z_score: f64,
pub is_amount_outlier: f64,
}
#[derive(Debug)]
pub enum RiskLevel {
Low,
Medium,
High,
}
#[derive(Debug)]
pub enum Recommendation {
Approve,
Review,
Block,
}
#[derive(Debug)]
pub struct FraudPrediction {
pub fraud_probability: f64,
pub risk_level: RiskLevel,
pub recommendation: Recommendation,
}
pub struct FraudDetectionSystem {
classifier: Option<Box<dyn Classifier>>,
user_profiles: HashMap<String, UserStats>,
merchant_categories: HashMap<String, f64>,
location_profiles: HashMap<String, Vec<String>>,
}
impl FraudDetectionSystem {
pub fn new() -> Self {
Self {
classifier: None,
user_profiles: HashMap::new(),
merchant_categories: HashMap::new(),
location_profiles: HashMap::new(),
}
}
pub fn build_user_profiles(&mut self, transactions: &[Transaction]) {
let mut user_data: HashMap<String, Vec<&Transaction>> = HashMap::new();
// Group transactions by user
for transaction in transactions {
user_data.entry(transaction.user_id.clone())
.or_insert_with(Vec::new)
.push(transaction);
}
// Calculate user statistics
for (user_id, user_transactions) in user_data {
let amounts: Vec<f64> = user_transactions.iter()
.map(|t| t.amount)
.collect();
let avg_amount = amounts.iter().sum::<f64>() / amounts.len() as f64;
let variance = amounts.iter()
.map(|&x| (x - avg_amount).powi(2))
.sum::<f64>() / amounts.len() as f64;
let std_amount = variance.sqrt();
let merchant_diversity = user_transactions.iter()
.map(|t| &t.merchant_id)
.collect::<std::collections::HashSet<_>>()
.len();
// Calculate daily transaction counts
let mut daily_counts: HashMap<String, usize> = HashMap::new();
for transaction in &user_transactions {
let date = transaction.timestamp.format("%Y-%m-%d").to_string();
*daily_counts.entry(date).or_insert(0) += 1;
}
let avg_daily_count = daily_counts.values().sum::<usize>() as f64 /
daily_counts.len() as f64;
// Calculate weekly spending
let total_spend: f64 = amounts.iter().sum();
let weeks = user_transactions.len() as f64 / 7.0;
let weekly_spend_avg = total_spend / weeks.max(1.0);
let stats = UserStats {
avg_amount,
std_amount,
txn_count: user_transactions.len(),
merchant_diversity,
avg_daily_count,
weekly_spend_avg,
};
self.user_profiles.insert(user_id, stats);
}
}
pub fn engineer_features(&self, transaction: &Transaction) -> FraudFeatures {
let user_stats = self.user_profiles.get(&transaction.user_id)
.cloned()
.unwrap_or_else(|| UserStats {
avg_amount: 0.0,
std_amount: 1.0,
txn_count: 0,
merchant_diversity: 0,
avg_daily_count: 0.0,
weekly_spend_avg: 0.0,
});
// Temporal features
let is_weekend = matches!(transaction.timestamp.weekday(),
Weekday::Sat | Weekday::Sun) as u8 as f64;
let hour = transaction.timestamp.hour();
let is_night = (hour >= 22 || hour <= 6) as u8 as f64;
// Merchant category encoding (simplified)
let merchant_category_encoded = self.merchant_categories
.get(&transaction.merchant_category)
.copied()
.unwrap_or(0.0);
// Amount anomaly detection
let amount_z_score = if user_stats.std_amount > 0.0 {
((transaction.amount - user_stats.avg_amount) / user_stats.std_amount).abs()
} else {
0.0
};
let is_amount_outlier = (amount_z_score > 3.0) as u8 as f64;
// Location anomaly (simplified - check if location is in user's typical locations)
let user_locations = self.location_profiles.get(&transaction.user_id);
let unusual_location = match user_locations {
Some(locations) => (!locations.contains(&transaction.location)) as u8 as f64,
None => 1.0, // Unknown user, treat as unusual
};
FraudFeatures {
transaction_amount: transaction.amount,
merchant_category_encoded,
time_since_last_transaction: 0.0, // Would need transaction history
daily_transaction_count: user_stats.avg_daily_count,
weekly_spend_average: user_stats.weekly_spend_avg,
is_weekend,
is_night,
unusual_location,
new_merchant: 0.0, // Would need merchant history
amount_z_score,
is_amount_outlier,
}
}
pub fn features_to_vector(&self, features: &FraudFeatures) -> DVector<f64> {
DVector::from_vec(vec![
features.transaction_amount,
features.merchant_category_encoded,
features.time_since_last_transaction,
features.daily_transaction_count,
features.weekly_spend_average,
features.is_weekend,
features.is_night,
features.unusual_location,
features.new_merchant,
features.amount_z_score,
features.is_amount_outlier,
])
}
pub fn train_model(&mut self, transactions: &[Transaction]) -> Result<(), Box<dyn std::error::Error>> {
// Build user profiles first
self.build_user_profiles(transactions);
// Engineer features for all transactions
let features: Vec<FraudFeatures> = transactions.iter()
.map(|t| self.engineer_features(t))
.collect();
// Convert to matrices
let feature_vectors: Vec<DVector<f64>> = features.iter()
.map(|f| self.features_to_vector(f))
.collect();
// Each feature vector becomes one column (n_features x n_samples).
let X = DMatrix::from_columns(&feature_vectors);
let y: DVector<usize> = DVector::from_vec(
transactions.iter().map(|t| t.is_fraud as usize).collect()
);
// Train logistic regression model
let mut model = LogisticRegression::new(0.001, 2000);
model.fit(&X.transpose(), &y); // Transpose because we built X column-wise
self.classifier = Some(Box::new(model));
Ok(())
}
pub fn predict_fraud(&self, transaction: &Transaction) -> Option<FraudPrediction> {
if let Some(ref classifier) = self.classifier {
let features = self.engineer_features(transaction);
let feature_vector = self.features_to_vector(&features);
// Reshape to matrix form (1 x n_features)
let X = DMatrix::from_row_slice(1, feature_vector.len(), feature_vector.as_slice());
let probabilities = classifier.predict_proba(&X);
let fraud_probability = probabilities[(0, 1)]; // Probability of fraud (class 1)
let risk_level = if fraud_probability > 0.8 {
RiskLevel::High
} else if fraud_probability > 0.5 {
RiskLevel::Medium
} else {
RiskLevel::Low
};
let recommendation = match risk_level {
RiskLevel::High => Recommendation::Block,
RiskLevel::Medium => Recommendation::Review,
RiskLevel::Low => Recommendation::Approve,
};
Some(FraudPrediction {
fraud_probability,
risk_level,
recommendation,
})
} else {
None
}
}
pub fn update_model_online(&mut self, new_transaction: &Transaction) {
// Update user profiles with new transaction
let user_id = &new_transaction.user_id;
// This would implement online learning updates
// For now, we'll just update the user profile
if let Some(stats) = self.user_profiles.get_mut(user_id) {
// Update running averages (simplified approach)
let n = stats.txn_count as f64;
let new_avg = (stats.avg_amount * n + new_transaction.amount) / (n + 1.0);
stats.avg_amount = new_avg;
stats.txn_count += 1;
}
}
}
// Example usage
fn example_fraud_detection() -> Result<(), Box<dyn std::error::Error>> {
use chrono::{TimeZone, Utc};
let mut fraud_detector = FraudDetectionSystem::new();
// Sample training data
let training_data = vec![
Transaction {
transaction_id: "txn1".to_string(),
user_id: "user1".to_string(),
amount: 25.50,
merchant_id: "merchant_a".to_string(),
merchant_category: "grocery".to_string(),
timestamp: Utc.with_ymd_and_hms(2024, 1, 15, 10, 30, 0).unwrap(),
location: "NYC".to_string(),
is_fraud: false,
},
Transaction {
transaction_id: "txn2".to_string(),
user_id: "user1".to_string(),
amount: 5000.00,
merchant_id: "merchant_x".to_string(),
merchant_category: "electronics".to_string(),
timestamp: Utc.with_ymd_and_hms(2024, 1, 15, 23, 45, 0).unwrap(),
location: "Unknown".to_string(),
is_fraud: true,
},
];
// Train the model
fraud_detector.train_model(&training_data)?;
// Make prediction on new transaction
let new_transaction = Transaction {
transaction_id: "txn_new".to_string(),
user_id: "user1".to_string(),
amount: 150.00,
merchant_id: "merchant_b".to_string(),
merchant_category: "restaurant".to_string(),
timestamp: Utc::now(),
location: "NYC".to_string(),
is_fraud: false, // Unknown in practice
};
if let Some(prediction) = fraud_detector.predict_fraud(&new_transaction) {
println!("Fraud Probability: {:.4}", prediction.fraud_probability);
println!("Risk Level: {:?}", prediction.risk_level);
println!("Recommendation: {:?}", prediction.recommendation);
}
Ok(())
}
Customer Segmentation
// Standard library imports
use std::collections::HashMap;
// External crate imports
use chrono::{DateTime, Utc};
use nalgebra::{DMatrix, DVector};
use rand::prelude::*;
use serde::{Deserialize, Serialize};
// Assumes the `Classifier` trait and `DecisionTree` from the implementation above are in scope.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Order {
pub order_id: String,
pub customer_id: String,
pub order_date: DateTime<Utc>,
pub order_value: f64,
pub product_category: String,
}
#[derive(Debug, Clone)]
pub struct CustomerMetrics {
pub customer_id: String,
pub recency: f64, // Days since last order
pub frequency: f64, // Total number of orders
pub monetary: f64, // Total spending
pub tenure: f64, // Days since first order
pub diversity: f64, // Number of unique categories
}
#[derive(Debug, Clone)]
pub enum CustomerSegment {
Champions,
LoyalCustomers,
AtRisk,
NewCustomers,
}
impl CustomerSegment {
pub fn from_id(id: usize) -> Self {
match id {
0 => Self::Champions,
1 => Self::LoyalCustomers,
2 => Self::AtRisk,
_ => Self::NewCustomers,
}
}
pub fn to_id(&self) -> usize {
match self {
Self::Champions => 0,
Self::LoyalCustomers => 1,
Self::AtRisk => 2,
Self::NewCustomers => 3,
}
}
}
#[derive(Debug, Clone)]
pub struct SegmentProfile {
pub segment: CustomerSegment,
pub avg_recency: f64,
pub avg_frequency: f64,
pub avg_monetary: f64,
pub avg_tenure: f64,
pub avg_diversity: f64,
pub customer_count: usize,
}
pub struct CustomerSegmentation {
classifier: Option<Box<dyn Classifier>>,
feature_scaler_params: Option<(DVector<f64>, DVector<f64>)>, // (mean, std)
cluster_centers: Option<DMatrix<f64>>,
}
impl CustomerSegmentation {
pub fn new() -> Self {
Self {
classifier: None,
feature_scaler_params: None,
cluster_centers: None,
}
}
pub fn create_customer_features(&self, orders: &[Order]) -> Vec<CustomerMetrics> {
let current_date = Utc::now();
let mut customer_data: HashMap<String, Vec<&Order>> = HashMap::new();
// Group orders by customer
for order in orders {
customer_data.entry(order.customer_id.clone())
.or_insert_with(Vec::new)
.push(order);
}
let mut customer_metrics = Vec::new();
for (customer_id, customer_orders) in customer_data {
// Calculate recency (days since last order)
let last_order_date = customer_orders.iter()
.map(|o| o.order_date)
.max()
.unwrap();
let recency = (current_date - last_order_date).num_days() as f64;
// Calculate frequency (total number of orders)
let frequency = customer_orders.len() as f64;
// Calculate monetary (total spending)
let monetary: f64 = customer_orders.iter()
.map(|o| o.order_value)
.sum();
// Calculate tenure (days since first order)
let first_order_date = customer_orders.iter()
.map(|o| o.order_date)
.min()
.unwrap();
let tenure = (current_date - first_order_date).num_days() as f64;
// Calculate diversity (unique product categories)
let unique_categories: std::collections::HashSet<_> = customer_orders.iter()
.map(|o| &o.product_category)
.collect();
let diversity = unique_categories.len() as f64;
customer_metrics.push(CustomerMetrics {
customer_id,
recency,
frequency,
monetary,
tenure,
diversity,
});
}
customer_metrics
}
fn metrics_to_feature_matrix(&self, metrics: &[CustomerMetrics]) -> DMatrix<f64> {
let n_samples = metrics.len();
let features: Vec<f64> = metrics.iter()
.flat_map(|m| vec![m.recency, m.frequency, m.monetary, m.tenure, m.diversity])
.collect();
DMatrix::from_vec(n_samples, 5, features)
}
fn standardize_features(&mut self, X: &DMatrix<f64>) -> DMatrix<f64> {
// Per-feature (column) statistics: row_mean/row_variance aggregate over rows.
let mean = X.row_mean();
let std = X.row_variance().map(|v| v.sqrt().max(1e-8));
// Store parameters for future use
self.feature_scaler_params = Some((mean.clone(), std.clone()));
let mut standardized = X.clone();
for (col_idx, mut column) in standardized.column_iter_mut().enumerate() {
let col_mean = mean[col_idx];
let col_std = std[col_idx];
for value in column.iter_mut() {
*value = (*value - col_mean) / col_std;
}
}
standardized
}
fn apply_standardization(&self, X: &DMatrix<f64>) -> DMatrix<f64> {
if let Some((ref mean, ref std)) = self.feature_scaler_params {
let mut standardized = X.clone();
for (col_idx, mut column) in standardized.column_iter_mut().enumerate() {
let col_mean = mean[col_idx];
let col_std = std[col_idx];
for value in column.iter_mut() {
*value = (*value - col_mean) / col_std;
}
}
standardized
} else {
X.clone()
}
}
pub fn k_means_clustering(&mut self, X: &DMatrix<f64>, k: usize, max_iterations: usize) -> DVector<usize> {
let mut rng = rand::thread_rng();
let (n_samples, n_features) = (X.nrows(), X.ncols());
// Initialize cluster centers randomly
let mut centers = DMatrix::zeros(k, n_features);
for i in 0..k {
for j in 0..n_features {
let min_val = X.column(j).iter().fold(f64::INFINITY, |a, &b| a.min(b));
let max_val = X.column(j).iter().fold(f64::NEG_INFINITY, |a, &b| a.max(b));
// Guard against constant features, where the random range would be empty.
centers[(i, j)] = if max_val > min_val { rng.gen_range(min_val..max_val) } else { min_val };
}
}
let mut labels = DVector::zeros(n_samples);
for _iteration in 0..max_iterations {
let mut changed = false;
// Assign points to clusters
for i in 0..n_samples {
let point = X.row(i);
let mut min_distance = f64::INFINITY;
let mut best_cluster = 0;
for j in 0..k {
let center = centers.row(j);
let distance: f64 = point.iter()
.zip(center.iter())
.map(|(&a, &b)| (a - b).powi(2))
.sum::<f64>()
.sqrt();
if distance < min_distance {
min_distance = distance;
best_cluster = j;
}
}
if labels[i] != best_cluster {
labels[i] = best_cluster;
changed = true;
}
}
if !changed {
break;
}
// Update cluster centers
for j in 0..k {
let cluster_points: Vec<usize> = labels.iter()
.enumerate()
.filter(|(_, &label)| label == j)
.map(|(idx, _)| idx)
.collect();
if !cluster_points.is_empty() {
for dim in 0..n_features {
let sum: f64 = cluster_points.iter()
.map(|&idx| X[(idx, dim)])
.sum();
centers[(j, dim)] = sum / cluster_points.len() as f64;
}
}
}
}
self.cluster_centers = Some(centers);
labels
}
pub fn segment_customers(&mut self, orders: &[Order], n_segments: usize) -> (Vec<CustomerMetrics>, Vec<SegmentProfile>) {
// Create customer features
let customer_metrics = self.create_customer_features(orders);
// Convert to feature matrix
let feature_matrix = self.metrics_to_feature_matrix(&customer_metrics);
// Standardize features
let scaled_features = self.standardize_features(&feature_matrix);
// Perform K-means clustering
let cluster_labels = self.k_means_clustering(&scaled_features, n_segments, 100);
// Train classification model to predict segments
let mut decision_tree = DecisionTree::new(10, 2);
decision_tree.fit(&scaled_features, &cluster_labels);
self.classifier = Some(Box::new(decision_tree));
// Create segment profiles
let segment_profiles = self.create_segment_profiles(&customer_metrics, &cluster_labels, n_segments);
segment_profiles
}
fn create_segment_profiles(&self, metrics: &[CustomerMetrics], labels: &DVector<usize>, n_segments: usize) -> (Vec<CustomerMetrics>, Vec<SegmentProfile>) {
let mut segment_profiles = Vec::new();
let mut customer_metrics_with_segments = Vec::new();
for segment_id in 0..n_segments {
let segment_indices: Vec<usize> = labels.iter()
.enumerate()
.filter(|(_, &label)| label == segment_id)
.map(|(idx, _)| idx)
.collect();
if segment_indices.is_empty() {
continue;
}
let segment_metrics: Vec<&CustomerMetrics> = segment_indices.iter()
.map(|&idx| &metrics[idx])
.collect();
let avg_recency = segment_metrics.iter().map(|m| m.recency).sum::<f64>() / segment_metrics.len() as f64;
let avg_frequency = segment_metrics.iter().map(|m| m.frequency).sum::<f64>() / segment_metrics.len() as f64;
let avg_monetary = segment_metrics.iter().map(|m| m.monetary).sum::<f64>() / segment_metrics.len() as f64;
let avg_tenure = segment_metrics.iter().map(|m| m.tenure).sum::<f64>() / segment_metrics.len() as f64;
let avg_diversity = segment_metrics.iter().map(|m| m.diversity).sum::<f64>() / segment_metrics.len() as f64;
let profile = SegmentProfile {
segment: CustomerSegment::from_id(segment_id),
avg_recency,
avg_frequency,
avg_monetary,
avg_tenure,
avg_diversity,
customer_count: segment_metrics.len(),
};
segment_profiles.push(profile);
}
// Return the customer metrics alongside the profiles (CustomerMetrics carries
// no segment field; use `predict_segment` to label individual customers).
for metric in metrics.iter() {
customer_metrics_with_segments.push(metric.clone());
}
(customer_metrics_with_segments, segment_profiles)
}
pub fn predict_segment(&self, customer_metrics: &CustomerMetrics) -> Option<CustomerSegment> {
if let Some(ref classifier) = self.classifier {
let features = vec![
customer_metrics.recency,
customer_metrics.frequency,
customer_metrics.monetary,
customer_metrics.tenure,
customer_metrics.diversity,
];
let feature_matrix = DMatrix::from_vec(1, 5, features);
let scaled_features = self.apply_standardization(&feature_matrix);
let predictions = classifier.predict(&scaled_features);
Some(CustomerSegment::from_id(predictions[0]))
} else {
None
}
}
pub fn print_segment_report(&self, profiles: &[SegmentProfile]) {
println!("\nCustomer Segmentation Report");
println!("=" * 50);
for profile in profiles {
println!("\nSegment: {:?}", profile.segment);
println!("Customer Count: {}", profile.customer_count);
println!("Average Recency: {:.2} days", profile.avg_recency);
println!("Average Frequency: {:.2} orders", profile.avg_frequency);
println!("Average Monetary: ${:.2}", profile.avg_monetary);
println!("Average Tenure: {:.2} days", profile.avg_tenure);
println!("Average Diversity: {:.2} categories", profile.avg_diversity);
}
}
}
// Example usage
fn example_customer_segmentation() -> Result<(), Box<dyn std::error::Error>> {
use chrono::{TimeZone, Utc};
let mut segmentation = CustomerSegmentation::new();
// Sample order data
let orders = vec![
Order {
order_id: "order1".to_string(),
customer_id: "customer1".to_string(),
order_date: Utc.with_ymd_and_hms(2024, 1, 1, 0, 0, 0).unwrap(),
order_value: 100.0,
product_category: "electronics".to_string(),
},
Order {
order_id: "order2".to_string(),
customer_id: "customer1".to_string(),
order_date: Utc.with_ymd_and_hms(2024, 1, 15, 0, 0, 0).unwrap(),
order_value: 50.0,
product_category: "books".to_string(),
},
Order {
order_id: "order3".to_string(),
customer_id: "customer2".to_string(),
order_date: Utc.with_ymd_and_hms(2023, 12, 1, 0, 0, 0).unwrap(),
order_value: 500.0,
product_category: "electronics".to_string(),
},
];
// Perform segmentation
let (customer_metrics, segment_profiles) = segmentation.segment_customers(&orders, 4);
// Print report
segmentation.print_segment_report(&segment_profiles);
// Predict segment for new customer metrics
let new_customer = CustomerMetrics {
customer_id: "new_customer".to_string(),
recency: 30.0,
frequency: 5.0,
monetary: 250.0,
tenure: 180.0,
diversity: 3.0,
};
if let Some(predicted_segment) = segmentation.predict_segment(&new_customer) {
println!("\nPredicted segment for new customer: {:?}", predicted_segment);
}
Ok(())
}
Advanced Topics
Ensemble Methods
Voting Classifier: hard voting predicts the majority label across base models, $\hat{y} = \mathrm{mode}\{h_1(\mathbf{x}), \dots, h_M(\mathbf{x})\}$; soft voting averages the predicted class probabilities and takes the argmax.
Stacking:
- Train base models on training data
- Use base model predictions as features for meta-model
- Meta-model learns optimal combination
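A minimal standalone sketch of hard and soft voting over base-model outputs that have already been computed (plain vectors are used here instead of the Classifier trait above, to keep the example self-contained):
use std::collections::HashMap;

/// Hard voting: combine per-model predictions (one Vec per base model, aligned by sample).
fn hard_vote(model_predictions: &[Vec<usize>]) -> Vec<usize> {
    let n_samples = model_predictions[0].len();
    (0..n_samples)
        .map(|i| {
            let mut counts: HashMap<usize, usize> = HashMap::new();
            for preds in model_predictions {
                *counts.entry(preds[i]).or_insert(0) += 1;
            }
            counts.into_iter().max_by_key(|&(_, c)| c).map(|(label, _)| label).unwrap()
        })
        .collect()
}

/// Soft voting: average per-class probabilities across models, then take the argmax.
fn soft_vote(model_probas: &[Vec<Vec<f64>>]) -> Vec<usize> {
    let n_samples = model_probas[0].len();
    let n_classes = model_probas[0][0].len();
    (0..n_samples)
        .map(|i| {
            let avg: Vec<f64> = (0..n_classes)
                .map(|c| model_probas.iter().map(|m| m[i][c]).sum::<f64>() / model_probas.len() as f64)
                .collect();
            avg.iter()
                .enumerate()
                .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
                .map(|(c, _)| c)
                .unwrap()
        })
        .collect()
}

fn main() {
    let preds: Vec<Vec<usize>> = vec![vec![0, 1, 1], vec![1, 1, 0], vec![1, 1, 1]];
    println!("hard vote: {:?}", hard_vote(&preds)); // [1, 1, 1]
    let probas = vec![
        vec![vec![0.8, 0.2], vec![0.4, 0.6]],
        vec![vec![0.6, 0.4], vec![0.3, 0.7]],
    ];
    println!("soft vote: {:?}", soft_vote(&probas)); // [0, 1]
}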
Handling Imbalanced Classes
Sampling Techniques:
- SMOTE: Synthetic Minority Oversampling Technique
- ADASYN: Adaptive Synthetic Sampling
- Undersampling: Random or informed removal of majority class samples
Cost-Sensitive Learning: weight the loss by class, $L = \sum_{i=1}^{N} w_{y_i}\,\ell(y_i, \hat{y}_i)$
Where $w_c$ is the cost of misclassifying class $c$.
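A minimal standalone sketch of a class-weighted binary log-loss; the weights in main are hypothetical and simply make positive-class errors three times as costly:
/// Class-weighted binary log-loss: errors on the more expensive class
/// contribute proportionally more to the objective.
fn weighted_log_loss(y_true: &[usize], p_pred: &[f64], class_weights: &[f64; 2]) -> f64 {
    let n = y_true.len() as f64;
    y_true
        .iter()
        .zip(p_pred)
        .map(|(&y, &p)| {
            let p = p.clamp(1e-15, 1.0 - 1e-15);
            let loss = if y == 1 { -p.ln() } else { -(1.0 - p).ln() };
            class_weights[y] * loss
        })
        .sum::<f64>()
        / n
}

fn main() {
    let y = [0usize, 0, 0, 1];        // imbalanced: one positive
    let p = [0.1, 0.2, 0.1, 0.4];     // the positive is under-predicted
    // Hypothetical weights: class 1 errors cost 3x as much as class 0 errors.
    println!("unweighted: {:.4}", weighted_log_loss(&y, &p, &[1.0, 1.0]));
    println!("weighted:   {:.4}", weighted_log_loss(&y, &p, &[1.0, 3.0]));
}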
Multi-label Classification
- Binary Relevance: train a separate binary classifier for each label
- Classifier Chains: model label dependencies sequentially
- Label Powerset: treat each unique label combination as a single class
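A minimal standalone sketch of the binary relevance decomposition, which turns multi-label targets into one binary target vector per label so that any binary classifier can be trained per label:
/// Binary relevance: for each label in the vocabulary, produce a 0/1 target
/// vector indicating whether each sample carries that label.
fn binary_relevance_targets(label_sets: &[Vec<usize>], n_labels: usize) -> Vec<Vec<usize>> {
    (0..n_labels)
        .map(|label| {
            label_sets
                .iter()
                .map(|labels| labels.contains(&label) as usize)
                .collect()
        })
        .collect()
}

fn main() {
    // Three samples, label vocabulary {0, 1, 2}: e.g. image tags.
    let samples: Vec<Vec<usize>> = vec![vec![0, 2], vec![1], vec![0, 1, 2]];
    for (label, targets) in binary_relevance_targets(&samples, 3).iter().enumerate() {
        println!("label {} binary targets: {:?}", label, targets);
    }
}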
Best Practices
1. Data Quality
- Handle missing values appropriately
- Detect and address outliers
- Ensure representative sampling
- Validate data consistency across sources
2. Feature Engineering
- Domain knowledge integration
- Feature scaling and normalization
- Categorical variable encoding
- Feature selection and dimensionality reduction
3. Model Selection
- Start with simple baseline models
- Compare multiple algorithms systematically
- Use appropriate cross-validation strategies
- Consider computational constraints
4. Evaluation Strategy
- Use appropriate metrics for business context
- Account for class imbalance in evaluation
- Perform statistical significance testing
- Monitor model performance over time
5. Production Deployment
- Model versioning and rollback procedures
- A/B testing for model updates
- Real-time monitoring and alerting
- Automated retraining pipelines
Classification provides data engineering organizations with powerful frameworks for automated decision-making, pattern recognition, and predictive analytics. Mastery of these techniques enables the construction of robust, scalable systems that transform raw data into actionable business insights.
Related Topics
For foundational concepts:
- Analytics Fundamentals: Statistical foundations underlying classification algorithms
- Machine Learning Fundamentals: Broader ML context and supervised learning principles
For implementation and deployment:
- Data Engineering Pipelines: Deploy classification models within scalable data systems
- API Management: Serve classification predictions through production APIs
- Data Processing: Scale feature engineering and model training across distributed systems
For advanced techniques:
- Unsupervised Learning: Complement classification with clustering and dimensionality reduction
- Deep Learning: Neural network approaches to classification
- Feature Engineering: Optimize input features for classification performance
For practical development:
- Rust Programming: High-performance classification implementations
- Data Technologies: Storage and processing systems for classification workflows