Model Evaluation
Model evaluation provides systematic methods to assess supervised learning performance, ensuring models generalize well to unseen data. In financial services, proper evaluation prevents costly deployment failures in credit scoring and fraud detection. In retail, it ensures reliable demand forecasting and customer analytics.
Evaluation Methodology
Train-Validation-Test Split
Proper data partitioning prevents information leakage and provides an unbiased estimate of how well a model generalizes:
Typical Split Ratios:
- Training: 60-70% (model learning)
- Validation: 15-20% (hyperparameter tuning)
- Test: 15-20% (final evaluation)
Symbol Definitions:
- $D$ = Complete dataset, with $D = D_{\text{train}} \cup D_{\text{val}} \cup D_{\text{test}}$ and the three subsets disjoint
- $D_{\text{train}}$ = Training set for parameter learning
- $D_{\text{val}}$ = Validation set for model selection
- $D_{\text{test}}$ = Test set for unbiased performance estimation
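The following sketch shows one way to produce a 60/20/20 split with scikit-learn; the synthetic data, variable names, and 6% positive rate are illustrative assumptions, not the chapter's actual dataset.

```python
# Minimal sketch: 60/20/20 train/validation/test split with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 25))            # 25 features, e.g. financial variables
y = (rng.random(1000) < 0.06).astype(int)  # ~6% positive class (default)

# First carve out the 20% test set, then split the remainder 75/25
# so the final proportions are 60/20/20. Stratify to preserve class balance.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```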
Cross-Validation
More robust performance estimation:
K-Fold Cross-Validation:
$$\text{CV}_{K} = \frac{1}{K} \sum_{k=1}^{K} \text{Error}\!\left(f^{(-k)}, D_k\right)$$
Symbol Definitions:
- $K$ = Number of folds (typically 5 or 10)
- $f^{(-k)}$ = Model trained on all folds except $k$
- $D_k$ = $k$-th fold used for validation
Stratified K-Fold: Maintains class distribution in each fold:
$$\frac{n_{k,c}}{n_k} \approx \frac{n_c}{n} \quad \text{for every fold } k \text{ and class } c$$
Symbol Definitions:
- $n_{k,c}$ = Number of samples of class $c$ in fold $k$
- $n_k$ = Total samples in fold $k$
- $n_c$ = Total samples of class $c$
- $n$ = Total samples
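A minimal sketch of stratified 5-fold cross-validation with scikit-learn; the logistic regression model, AUC scoring, and synthetic imbalanced data are assumptions for illustration.

```python
# Minimal sketch: stratified 5-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=25,
                           weights=[0.94, 0.06], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")

# Each fold preserves the ~94/6 class ratio; report mean +/- std across folds.
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```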
Classification Metrics
Confusion Matrix
Foundation for binary classification metrics:

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | $TP$ | $FN$ |
| Actual Negative | $FP$ | $TN$ |

Symbol Definitions:
- $TP$ = True Positives (correctly predicted positive)
- $TN$ = True Negatives (correctly predicted negative)
- $FP$ = False Positives (Type I error)
- $FN$ = False Negatives (Type II error)
Core Classification Metrics
Accuracy:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision (Positive Predictive Value):
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall (Sensitivity/True Positive Rate):
$$\text{Recall} = \frac{TP}{TP + FN}$$
Specificity (True Negative Rate):
$$\text{Specificity} = \frac{TN}{TN + FP}$$
F1-Score (Harmonic Mean):
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
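A brief sketch computing these core metrics with scikit-learn; the small label arrays are illustrative, and specificity is derived from the confusion matrix because scikit-learn has no dedicated function for it.

```python
# Minimal sketch: core classification metrics from predicted labels.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)   # TN / (TN + FP)

print("accuracy   ", accuracy_score(y_true, y_pred))
print("precision  ", precision_score(y_true, y_pred))
print("recall     ", recall_score(y_true, y_pred))
print("specificity", specificity)
print("f1         ", f1_score(y_true, y_pred))
```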
Advanced Classification Metrics
Matthews Correlation Coefficient (MCC):
$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
Balanced Accuracy:
$$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$
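A short sketch of MCC and balanced accuracy using scikit-learn's built-in functions; the toy label arrays are assumptions. Both metrics remain informative when plain accuracy is inflated by class imbalance.

```python
# Minimal sketch: MCC and balanced accuracy for imbalance-aware evaluation.
from sklearn.metrics import matthews_corrcoef, balanced_accuracy_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

print("MCC              ", matthews_corrcoef(y_true, y_pred))
print("balanced accuracy", balanced_accuracy_score(y_true, y_pred))
```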
Financial Services Example: Credit Scorecard Validation
Business Context: Bank validates credit scoring model for regulatory compliance, requiring comprehensive evaluation across multiple performance dimensions and fairness metrics.
Model Details:
- Algorithm: Logistic Regression
- Target: Binary classification (default vs. non-default)
- Dataset: 100,000 loan applications over 3 years
- Features: 25 financial and demographic variables
Evaluation Setup:
Temporal Split (Realistic for Credit Models):
- Training: 2020-2021 data (60,000 applications)
- Validation: First half 2022 (20,000 applications)
- Test: Second half 2022 (20,000 applications)
Class Distribution:
- Non-default: 94% (94,000 applications)
- Default: 6% (6,000 applications)
Performance Metrics:
Overall Classification Results (test set, 20,000 applications):

| | Predicted Default | Predicted Non-Default |
|---|---|---|
| Actual Default (1,180) | $TP$ = 700 | $FN$ = 480 |
| Actual Non-Default (18,820) | $FP$ = 200 | $TN$ = 18,620 |

Calculated Metrics:
- Accuracy: 96.6% (excellent overall performance)
- Precision: 77.8% (700 of 900 predicted defaults are actual defaults)
- Recall: 59.3% (700 of 1,180 actual defaults detected)
- Specificity: 98.9% (18,620 of 18,820 non-defaults correctly identified)
- F1-Score: 67.3% (balanced precision-recall performance)
Business-Critical Metrics:
- Default Capture Rate (Recall): share of actual defaults flagged before approval (59.3%)
- False Positive Cost: revenue forgone when creditworthy applicants are incorrectly declined
- False Negative Cost: charge-off losses on defaults the model fails to flag
- Economic Value: net benefit of the model after weighing both error costs
Regulatory Compliance:
- Adverse Action Rate: 4.5% (within acceptable range)
- Disparate Impact Ratio: 0.89 (meets 80% rule)
- Model Stability: Population Stability Index = 0.08 (stable)
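The Population Stability Index cited above can be computed from binned score distributions; the sketch below assumes decile bins, scores in [0, 1], and synthetic development and recent samples.

```python
# Minimal sketch: Population Stability Index (PSI) for score-drift monitoring.
import numpy as np

def psi(expected_scores, actual_scores, n_bins=10):
    """PSI = sum((a - e) * ln(a / e)) over score bins (decile bins assumed)."""
    edges = np.quantile(expected_scores, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0                      # scores assumed in [0, 1]
    e = np.histogram(expected_scores, bins=edges)[0] / len(expected_scores)
    a = np.histogram(actual_scores, bins=edges)[0] / len(actual_scores)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
dev_scores = rng.beta(2, 8, 60_000)        # development-sample scores (assumed)
recent_scores = rng.beta(2.1, 8, 20_000)   # recent applications (assumed)
print(f"PSI = {psi(dev_scores, recent_scores):.3f}")  # < 0.10 is typically read as stable
```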
ROC Curve and AUC
Receiver Operating Characteristic (ROC) Curve
Plots sensitivity vs. (1 - specificity) at various thresholds:
$$\text{TPR}(\tau) = \frac{TP(\tau)}{TP(\tau) + FN(\tau)}, \qquad \text{FPR}(\tau) = \frac{FP(\tau)}{FP(\tau) + TN(\tau)}$$
Area Under Curve (AUC):
$$\text{AUC} = \int_0^1 \text{TPR} \; d(\text{FPR})$$
Symbol Definitions:
- $\tau$ = Classification threshold
- $\text{TPR}$ = True Positive Rate (Sensitivity)
- $\text{FPR}$ = False Positive Rate (1 - Specificity)
AUC Interpretation:
- 0.5: Random classifier
- 0.5-0.7: Poor discrimination
- 0.7-0.8: Acceptable discrimination
- 0.8-0.9: Excellent discrimination
- 0.9-1.0: Outstanding discrimination
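A minimal sketch that traces the ROC curve and computes AUC from predicted probabilities; the logistic regression model and synthetic imbalanced data are assumptions.

```python
# Minimal sketch: ROC curve points and AUC from predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.94, 0.06], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)   # points tracing the ROC curve

print(f"AUC = {roc_auc_score(y_te, probs):.3f}")
```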
Precision-Recall Curve
More informative for imbalanced datasets:
Average Precision (AP):
$$\text{AP} = \sum_{n} (R_n - R_{n-1}) \, P_n$$
where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold.
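A matching sketch for the precision-recall curve and average precision; the synthetic imbalanced data and model are the same kind of illustrative assumptions as in the ROC example.

```python
# Minimal sketch: precision-recall curve and average precision.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.94, 0.06], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

precision, recall, _ = precision_recall_curve(y_te, probs)
print(f"Average precision = {average_precision_score(y_te, probs):.3f}")
```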
Regression Metrics
Error-Based Metrics
Mean Squared Error (MSE):
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Root Mean Squared Error (RMSE):
$$\text{RMSE} = \sqrt{\text{MSE}}$$
Mean Absolute Error (MAE):
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
Mean Absolute Percentage Error (MAPE):
$$\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$
Goodness-of-Fit Metrics
R-squared (Coefficient of Determination):
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Adjusted R-squared:
$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
Symbol Definitions:
- $y_i$ = Actual value for sample $i$
- $\hat{y}_i$ = Predicted value for sample $i$
- $\bar{y}$ = Mean of actual values
- $n$ = Number of samples
- $p$ = Number of predictors
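A short sketch computing these regression metrics with scikit-learn; the toy demand values and the assumed predictor count for adjusted R-squared are illustrative.

```python
# Minimal sketch: regression error and goodness-of-fit metrics.
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

y_true = np.array([120.0, 95.0, 210.0, 150.0, 80.0, 175.0])
y_pred = np.array([132.0, 90.0, 198.0, 160.0, 85.0, 170.0])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print("MSE  ", mse)
print("RMSE ", np.sqrt(mse))
print("MAE  ", mean_absolute_error(y_true, y_pred))
print("MAPE ", 100 * mean_absolute_percentage_error(y_true, y_pred), "%")
print("R^2  ", r2)

# Adjusted R^2 is not built in; p = 3 predictors is an assumed count.
n, p = len(y_true), 3
print("Adj R^2", 1 - (1 - r2) * (n - 1) / (n - p - 1))
```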
Retail Example: Demand Forecasting Evaluation
Business Context: Fashion retailer evaluates demand forecasting models for inventory optimization across 10,000 SKUs and seasonal patterns, requiring accurate error quantification and business impact assessment.
Problem Setup:
- Target: Weekly demand (units sold)
- Horizon: 12-week ahead forecasts
- Models Compared: Linear Regression, Random Forest, LSTM, Ensemble
- Evaluation Period: 52 weeks of out-of-sample data
Time Series Cross-Validation:
Expanding-window (walk-forward) validation preserves temporal order: each fold trains on data through week $t_k$ and is evaluated on the following $h$ weeks:
$$\text{Train}_k = [t_0, t_k], \qquad \text{Validate}_k = [t_k + 1, \, t_k + h], \qquad k = 1, \dots, K$$
Symbol Definitions:
- $t_0$ = Training window start
- $t_k$ = Training window end for fold $k$
- $h$ = Forecast horizon (12 weeks)
- $K$ = Number of validation folds
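A minimal sketch of expanding-window validation using scikit-learn's TimeSeriesSplit; the synthetic weekly demand series and the 12-week test size are assumptions.

```python
# Minimal sketch: expanding-window validation with TimeSeriesSplit.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
weeks = np.arange(156)                                   # three years of weekly data
demand = 200 + 40 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 10, size=156)

tscv = TimeSeriesSplit(n_splits=5, test_size=12)         # 12-week forecast horizon
for fold, (train_idx, test_idx) in enumerate(tscv.split(demand.reshape(-1, 1))):
    # Each fold trains on all weeks up to t_k and validates on the next 12 weeks.
    print(f"fold {fold}: train weeks 0-{train_idx[-1]}, "
          f"validate weeks {test_idx[0]}-{test_idx[-1]}")
```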
Evaluation Metrics by Model:
| Model | RMSE | MAE | MAPE | R² | Business Impact |
|---|---|---|---|---|---|
| Linear Regression | 147.3 | 89.2 | 23.4% | 0.72 | Baseline |
| Random Forest | 132.1 | 78.9 | 19.8% | 0.78 | +$2.1M |
| LSTM | 128.6 | 76.1 | 18.5% | 0.81 | +$2.8M |
| Ensemble | 121.4 | 71.3 | 17.2% | 0.84 | +$3.4M |
Seasonal Performance Analysis:
Peak Season (Q4) Metrics:
Off-Season Performance:
SKU-Level Analysis:
High-Volume SKUs (Top 20%):
- MAPE: 12.8% (better predictability)
- R²: 0.91 (strong correlation)
Low-Volume SKUs (Bottom 20%):
- MAPE: 34.7% (higher uncertainty)
- R²: 0.52 (weaker correlation)
Business Impact Calculation:
Inventory Holding Cost Reduction:
Stockout Cost Reduction:
Symbol Definitions:
- $S$ = Number of SKUs
- $h$ = Holding cost rate (15% annually)
- $m_i$ = Profit margin for SKU $i$
Total Business Value:
- Inventory Reduction: $1.8M (18% lower safety stock)
- Stockout Prevention: $1.6M (12% fewer lost sales)
- Total Annual Benefit: $3.4M
Statistical Significance Testing
Paired t-Test
Compare two models evaluated on the same dataset:
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$
Symbol Definitions:
- $\bar{d}$ = Mean difference in performance
- $s_d$ = Standard deviation of the differences
- $n$ = Number of test samples
Null Hypothesis: there is no difference between the models. Alternative: one model performs significantly better.
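A brief sketch of the paired t-test with SciPy; the per-fold score arrays for the two models are illustrative assumptions.

```python
# Minimal sketch: paired t-test on matched per-fold scores of two models.
from scipy import stats

model_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.84, 0.80, 0.79]
model_b = [0.83, 0.80, 0.85, 0.82, 0.83, 0.80, 0.82, 0.86, 0.81, 0.80]

t_stat, p_value = stats.ttest_rel(model_b, model_a)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value rejects the null hypothesis of equal mean performance.
```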
McNemar's Test
For comparing two classifiers on the same test set:
$$\chi^2 = \frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}}$$
Symbol Definitions:
- $n_{01}$ = Cases where model 1 is wrong and model 2 is correct
- $n_{10}$ = Cases where model 1 is correct and model 2 is wrong
Under the null hypothesis of equal error rates, the statistic follows a $\chi^2$ distribution with 1 degree of freedom.
Wilcoxon Signed-Rank Test
Non-parametric alternative to paired t-test.
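A combined sketch of McNemar's test (computed directly from the discordant counts, with continuity correction) and the Wilcoxon signed-rank test via SciPy; the counts and score arrays are assumptions.

```python
# Minimal sketch: McNemar's test and the Wilcoxon signed-rank test.
from scipy import stats

# McNemar: n01 = model 1 wrong / model 2 right, n10 = model 1 right / model 2 wrong
n01, n10 = 45, 22
chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
p_mcnemar = stats.chi2.sf(chi2, df=1)          # survival function of chi-square(1)
print(f"McNemar chi2 = {chi2:.2f}, p = {p_mcnemar:.4f}")

# Wilcoxon signed-rank test on paired fold-level scores
model_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.84, 0.80, 0.79]
model_b = [0.83, 0.80, 0.85, 0.82, 0.83, 0.80, 0.82, 0.86, 0.81, 0.80]
w_stat, p_wilcoxon = stats.wilcoxon(model_a, model_b)
print(f"Wilcoxon W = {w_stat:.1f}, p = {p_wilcoxon:.4f}")
```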
Model Calibration
Calibration Assessment
Evaluate if predicted probabilities match observed frequencies:
Reliability Diagram: Plot predicted probabilities vs. actual frequencies in bins.
Brier Score:
$$\text{BS} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2$$
Symbol Definitions:
- $\hat{p}_i$ = Predicted probability for sample $i$
- $y_i$ = Actual binary outcome
Hosmer-Lemeshow Test:
$$H = \sum_{g=1}^{G} \sum_{c \in \{0, 1\}} \frac{(O_{cg} - E_{cg})^2}{E_{cg}}$$
Symbol Definitions:
- $G$ = Number of bins (typically 10)
- $O_{cg}$ = Observed events in bin $g$ for outcome $c$
- $E_{cg}$ = Expected events in bin $g$ for outcome $c$
Calibration Methods
Platt Scaling: Fits a sigmoid to the uncalibrated model scores:
$$P(y = 1 \mid f) = \frac{1}{1 + \exp(A f + B)}$$
where $f$ is the raw model score and the parameters $A$ and $B$ are fitted on held-out data.
Isotonic Regression: Non-parametric calibration preserving ranking.
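A minimal sketch of calibration assessment and Platt-style calibration with scikit-learn; the naive Bayes base model and synthetic data are assumptions, chosen because raw naive Bayes probabilities are often poorly calibrated.

```python
# Minimal sketch: Brier score, reliability-diagram points, and sigmoid (Platt) calibration.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)
platt = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5).fit(X_tr, y_tr)

for name, model in [("raw", raw), ("Platt", platt)]:
    p = model.predict_proba(X_te)[:, 1]
    frac_pos, mean_pred = calibration_curve(y_te, p, n_bins=10)  # reliability diagram points
    print(f"{name:5s} Brier = {brier_score_loss(y_te, p):.4f}")
```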
Business-Oriented Evaluation
Cost-Sensitive Metrics
Incorporate business costs into evaluation:
Expected Cost:
$$\text{Expected Cost} = \frac{C_{FP} \cdot FP + C_{FN} \cdot FN}{n}$$
where $C_{FP}$ and $C_{FN}$ are the business costs of a false positive and a false negative.
Cost-Sensitive Accuracy: Weights correct and incorrect predictions by their class-specific business costs instead of treating every sample equally.
Profit-Based Evaluation
Direct business value assessment:
Profit Curve: Plots expected profit against the classification threshold (or the fraction of the population targeted) to identify the value-maximizing operating point.
Lift and Gain Charts: Measure improvement over random selection.
Return on Investment:
$$\text{ROI} = \frac{\text{Total Benefit} - \text{Total Cost}}{\text{Total Cost}} \times 100\%$$
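A short sketch of cost-sensitive threshold selection via an expected-cost curve; the cost figures and synthetic scores are illustrative assumptions, not the chapter's actual economics.

```python
# Minimal sketch: expected cost across thresholds and the cost-minimizing cutoff.
import numpy as np

def expected_cost(y_true, probs, threshold, c_fp=500.0, c_fn=8000.0):
    """Average cost per case under assumed false-positive and false-negative costs."""
    pred = (probs >= threshold).astype(int)
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    return (c_fp * fp + c_fn * fn) / len(y_true)

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.06).astype(int)                       # ~6% positives
probs = np.clip(0.06 + 0.5 * y_true + rng.normal(0, 0.2, 10_000), 0, 1)  # toy scores

thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(y_true, probs, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"cost-minimizing threshold ~ {best:.2f}, expected cost per case = {min(costs):.1f}")
```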
Model evaluation provides the foundation for confident deployment of supervised learning systems, ensuring reliable performance measurement, statistical validation, and business value quantification across financial services and retail applications.