Support Vector Machines
Support Vector Machines (SVMs) find optimal decision boundaries by maximizing margins between classes. In financial services, they excel at binary classification tasks like fraud detection and credit approval. In retail, they power customer segmentation and demand classification.
Mathematical Foundation
Linear SVM
Find hyperplane that maximizes margin between classes:

$$w^T x + b = 0$$

Symbol Definitions:
- $w$ = Weight vector (perpendicular to decision boundary)
- $x$ = Input feature vector
- $b$ = Bias term (offset from origin)

Classification Rule:

$$\hat{y} = \text{sign}(w^T x + b)$$

Distance from Point to Hyperplane:

$$d(x) = \frac{|w^T x + b|}{\|w\|}$$

Symbol Definitions:
- $|\cdot|$ = Absolute value
- $\|w\|$ = L2 norm of weight vector
- $\text{sign}(\cdot)$ = Sign function (-1 or +1)
Margin Maximization
Maximize minimum distance to closest points (support vectors):
Margin:

$$\gamma = \frac{2}{\|w\|}$$

Optimization Problem:

$$\min_{w,\,b} \; \frac{1}{2}\|w\|^2$$

Subject to:

$$y_i (w^T x_i + b) \geq 1, \quad i = 1, \dots, n$$

Symbol Definitions:
- $y_i \in \{-1, +1\}$ = True class label for sample $i$
- $n$ = Number of training samples
Support Vectors
Points that lie on the margin boundary:

$$y_i (w^T x_i + b) = 1$$

Only support vectors determine the decision boundary; all other points can be removed without changing the solution.
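As a concrete illustration, here is a minimal scikit-learn sketch that fits a (near) hard-margin linear SVM on toy, linearly separable points and reads off $w$, $b$, the support vectors, and the margin width; the data values and the very large C are illustrative assumptions, not taken from the examples below.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D, linearly separable data (illustrative values only)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin formulation
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", clf.coef_[0])           # weight vector w
print("b =", clf.intercept_[0])      # bias term b
print("support vectors:\n", clf.support_vectors_)
print("margin =", 2 / np.linalg.norm(clf.coef_))  # gamma = 2 / ||w||
```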
Soft Margin SVM
Handling Non-Separable Data
Introduce slack variables $\xi_i$ to allow misclassification:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$$

Subject to:

$$y_i (w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, n$$

Symbol Definitions:
- $\xi_i$ = Slack variable for sample $i$ (penalty for violation)
- $C$ = Regularization parameter (trade-off between margin and violations)
- $\sum_{i=1}^{n} \xi_i$ = Total penalty for margin violations
Interpretation of C (illustrated in the sketch after this list):
- Large C: Small margin, fewer misclassifications (high variance)
- Small C: Large margin, more misclassifications (high bias)
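The trend is easy to verify empirically. A small sketch, assuming scikit-learn and synthetic overlapping blobs; the C values are arbitrary choices to show the trade-off, not recommended settings:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs, so some slack is unavoidable (synthetic data)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_)  # wider margin for smaller C
    print(f"C={C:>6}: margin={margin:.3f}, support vectors={clf.support_.size}")
```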
Dual Formulation
Lagrangian Dual Problem
Transform to dual form using Lagrange multipliers:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$$

Subject to:

$$0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

Symbol Definitions:
- $\alpha_i$ = Lagrange multiplier for sample $i$
- $x_i^T x_j$ = Dot product between samples $i$ and $j$

Solution:

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i$$

Prediction:

$$\hat{y} = \text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i x_i^T x + b\right)$$

Support Vectors: Samples with $\alpha_i > 0$.
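The dual solution is directly inspectable in scikit-learn, whose SVC stores $\alpha_i y_i$ for the support vectors in dual_coef_. A small sketch on synthetic data, recovering $w$ from the dual form:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]        # alpha_i * y_i, support vectors only
sv = clf.support_vectors_

# Recover w from the dual solution: w = sum_i alpha_i y_i x_i
w_dual = alpha_y @ sv
print(np.allclose(w_dual, clf.coef_[0]))  # True: primal and dual agree
```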
Financial Services Example: Credit Card Fraud Detection
Business Context: Credit card company uses SVM to detect fraudulent transactions in real-time, minimizing false positives while maintaining high fraud detection rates.
Feature Engineering:
- $x_1$ = Transaction amount (log-scaled)
- $x_2$ = Time since last transaction (normalized)
- $x_3$ = Merchant category risk score
- $x_4$ = Geographic distance from home (miles)
- $x_5$ = Velocity of transactions (frequency score)
- $x_6$ = Card-present/not-present indicator
- $x_7$ = Amount deviation from user's spending pattern
- $x_8$ = Hour of day (encoded cyclically)
Class Imbalance:
- Legitimate transactions: 99.85%
- Fraudulent transactions: 0.15%
SVM Configuration: a class-weighted soft-margin SVM, so that the rare fraud class is not overwhelmed by the legitimate majority.

Class Weights (inverse-frequency weighting, one standard scheme):

$$c_i = \begin{cases} n / (2\, n_{\text{fraud}}) & \text{if sample } i \text{ is fraud} \\ n / (2\, n_{\text{legit}}) & \text{if sample } i \text{ is legitimate} \end{cases}$$

Weighted Loss Function:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} c_i \xi_i$$

Symbol Definitions:
- $c_i$ = Class weight for sample $i$
- $n_{\text{fraud}}, n_{\text{legit}}$ = Number of fraud/legitimate samples
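A hedged sketch of this configuration with scikit-learn, using simulated data in place of the eight transaction features above; class_weight="balanced" applies exactly the inverse-frequency weights $c_i$:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Simulated imbalanced data: ~0.15% positives, 8 stand-in features
X, y = make_classification(n_samples=100_000, n_features=8,
                           weights=[0.9985], random_state=0)

# 'balanced' sets the class weight to n / (2 * n_class), as above
clf = LinearSVC(class_weight="balanced", C=1.0, max_iter=5000).fit(X, y)
print("training fraud recall:", (clf.predict(X[y == 1]) == 1).mean())
```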
Decision Function:

$$f(x) = w^T x + b$$

A transaction is flagged as fraud when $f(x) > \tau$ for a tuned threshold $\tau$.

Threshold Optimization: sweep $\tau$ over the validation-set decision scores and measure precision and recall at each value.

Optimal Threshold (Maximizing F1-Score):

$$\tau^* = \arg\max_{\tau} F_1(\tau)$$
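A minimal threshold-sweep sketch, assuming scikit-learn and simulated data in place of the real transaction features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Simulated imbalanced data standing in for the transaction features
X, y = make_classification(n_samples=20_000, n_features=8,
                           weights=[0.97], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LinearSVC(class_weight="balanced", max_iter=5000).fit(X_tr, y_tr)
scores = clf.decision_function(X_val)  # f(x) for each validation sample

# One precision/recall pair per candidate threshold tau
precision, recall, thresholds = precision_recall_curve(y_val, scores)
f1 = 2 * precision * recall / (precision + recall + 1e-12)

best = np.argmax(f1[:-1])  # the final PR point has no threshold
print("tau* =", thresholds[best], " F1 =", f1[best])
```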
Business Performance:
- Precision: 0.94 (share of fraud predictions that are correct)
- Recall: 0.87 (share of actual frauds detected)
- F1-Score: 0.90 (harmonic mean of precision and recall)
- False Positive Rate: 0.8% (legitimate transactions flagged)
- Processing Speed: 2ms per transaction
- Annual Fraud Prevention: $28M in blocked fraudulent transactions
Support Vector Analysis:
- Number of Support Vectors: 12,847 out of 100,000 samples (12.8%)
- Fraud Support Vectors: 89% of fraud cases become support vectors
- Legitimate Support Vectors: 11.2% of legitimate cases
Kernel Methods
Non-Linear SVM
Use kernel trick to handle non-linear decision boundaries:
Kernel Function:

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

Symbol Definitions:
- $K(x_i, x_j)$ = Kernel function (similarity measure)
- $\phi(\cdot)$ = Feature mapping to higher-dimensional space

Decision Function with Kernels:

$$\hat{y} = \text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\right)$$
Common Kernel Functions
Polynomial Kernel:

$$K(x_i, x_j) = (x_i^T x_j + c)^d$$

Symbol Definitions:
- $c$ = Constant term (typically 1)
- $d$ = Polynomial degree

Radial Basis Function (RBF/Gaussian) Kernel:

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$$

Symbol Definitions:
- $\gamma$ = Kernel parameter (controls locality)
- $\|x_i - x_j\|^2$ = Squared Euclidean distance

Sigmoid Kernel:

$$K(x_i, x_j) = \tanh(\gamma\, x_i^T x_j + c)$$
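For concreteness, a short sketch implementing these three kernels directly and cross-checking the RBF case against scikit-learn's rbf_kernel; the $\gamma$, $c$, and $d$ values are arbitrary:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def poly_kernel(X, Y, c=1.0, d=3):
    """Polynomial kernel: (x^T y + c)^d."""
    return (X @ Y.T + c) ** d

def rbf(X, Y, gamma=0.5):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def sigmoid_kernel(X, Y, gamma=0.01, c=0.0):
    """Sigmoid kernel: tanh(gamma * x^T y + c)."""
    return np.tanh(gamma * (X @ Y.T) + c)

X = np.random.RandomState(0).randn(5, 3)
print(np.allclose(rbf(X, X), rbf_kernel(X, X, gamma=0.5)))  # True
```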
Retail Example: Customer Segmentation for Marketing
Business Context: Fashion retailer uses SVM to segment customers into distinct groups for targeted marketing campaigns based on purchase behavior and demographics.
Multi-Class Classification: Segment customers into 4 categories:
- High-Value Loyalists
- Frequent Bargain Hunters
- Occasional Buyers
- Seasonal Shoppers
Customer Features:
- $x_1$ = Average order value ($, log-scaled)
- $x_2$ = Purchase frequency (orders per year)
- $x_3$ = Brand loyalty score (0-1)
- $x_4$ = Price sensitivity index
- $x_5$ = Seasonal purchasing pattern
- $x_6$ = Product category diversity
- $x_7$ = Return rate (%)
- $x_8$ = Email engagement score
- $x_9$ = Social media activity level
- $x_{10}$ = Customer tenure (months)
One-vs-Rest (OvR) Multi-Class Strategy: Train a separate binary SVM for each class:

$$f_k(x) = \sum_{i=1}^{n} \alpha_i^{(k)} y_i^{(k)} K(x_i, x) + b_k, \quad k = 1, \dots, 4$$

Final Prediction:

$$\hat{y} = \arg\max_{k} f_k(x)$$
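A sketch of the OvR setup with scikit-learn, using placeholder data in place of the ten customer features; it confirms that the final prediction is the argmax over the four per-class decision scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Placeholder data: 10 features, 4 customer segments
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale", C=1.0)).fit(X, y)

# f_k(x) for each of the four binary SVMs, then argmax over k
scores = np.column_stack([est.decision_function(X[:5])
                          for est in ovr.estimators_])
print(scores.argmax(axis=1))  # matches ovr.predict(X[:5])
print(ovr.predict(X[:5]))
```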
RBF Kernel Configuration:

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$$

with $(C, \gamma)$ chosen by the grid search below.

Hyperparameter Tuning:

Grid Search over $(C, \gamma)$ pairs on a logarithmic grid, keeping the combination with the best cross-validated score.

Cross-Validation Performance:

$$\text{CV}(C, \gamma) = \frac{1}{K} \sum_{k=1}^{K} \text{Accuracy}_k(C, \gamma)$$
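A minimal grid-search sketch with scikit-learn; the data and grid values are illustrative assumptions, not the retailer's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data: 10 features, 4 segments
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100],          # illustrative grid
              "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```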
Segment Characteristics:
High-Value Loyalists (25% of customers):
- Average Order Value: $185+
- Purchase Frequency: 8+ times/year
- Brand Loyalty Score: 0.8+
- Marketing Strategy: VIP treatment, early access
Frequent Bargain Hunters (35% of customers):
- High frequency, low average order value
- High price sensitivity
- Marketing Strategy: Sale notifications, discount codes
Business Applications:
Personalized Marketing Budget: allocate campaign spend across segments in proportion to each segment's predicted value.
Campaign Targeting: route each customer to the campaign designed for their predicted segment.
Business Results:
- Segmentation Accuracy: 84.7% correct classification
- Campaign Response Rates:
- High-Value: 42% (vs. 18% mass marketing)
- Bargain Hunters: 28% (vs. 12% mass marketing)
- ROI Improvement: 235% vs. untargeted campaigns
- Customer Satisfaction: 4.3/5 rating for personalized offers
- Revenue Impact: $1.8M quarterly increase from targeted campaigns
Model Selection and Validation
Cross-Validation for SVM
Select optimal hyperparameters by maximizing the cross-validated score:

$$(C^*, \gamma^*) = \arg\max_{C,\,\gamma} \text{CV}(C, \gamma)$$

K-Fold Cross-Validation: split the training data into $K$ folds, train on $K-1$ folds, evaluate on the held-out fold, and average over all $K$ rotations:

$$\text{CV} = \frac{1}{K} \sum_{k=1}^{K} \text{metric}_k$$
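A compact sketch of the averaging with scikit-learn on synthetic data (stratified folds preserve class proportions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=cv)
print(scores)         # per-fold accuracies: metric_1 ... metric_K
print(scores.mean())  # the CV score: their average
```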
Performance Metrics
For Binary Classification:
Precision:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall (Sensitivity):

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1-Score:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Symbol Definitions:
- $TP$ = True Positives
- $FP$ = False Positives
- $FN$ = False Negatives
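A tiny worked check of these formulas on toy labels (illustrative values): here $TP = 3$, $FP = 1$, $FN = 1$, so precision, recall, and F1 all equal 0.75.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))         # harmonic mean = 0.75
```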
Computational Considerations
Sequential Minimal Optimization (SMO)
Efficient algorithm for solving SVM dual problem:
Working Set Selection: Select two variables to optimize at each iteration
Coordinate Ascent Update (for the selected pair $(i, j)$):

$$\alpha_j^{\text{new}} = \alpha_j + \frac{y_j (E_i - E_j)}{\eta}, \quad \eta = K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j)$$

The result is then clipped to the box constraints, and $\alpha_i$ is adjusted to keep $\sum_i \alpha_i y_i = 0$.

Symbol Definitions:
- $E_i = f(x_i) - y_i$ = Prediction error for sample $i$
- $\eta$ = Second derivative of the objective function along the update direction
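A simplified, self-contained sketch of a single SMO pair update (omitting the working-set heuristics and the bias update); it illustrates the formula above, not a production solver:

```python
import numpy as np

def smo_pair_update(alpha, b, i, j, y, K, C):
    """One SMO step on the pair (i, j); returns the updated alphas."""
    f = lambda k: (alpha * y) @ K[:, k] + b   # current decision values
    E_i, E_j = f(i) - y[i], f(j) - y[j]       # prediction errors

    eta = K[i, i] + K[j, j] - 2 * K[i, j]     # second derivative
    if eta <= 0:
        return alpha                           # skip degenerate pair

    # Box bounds keeping 0 <= alpha <= C and sum(alpha * y) = 0
    if y[i] != y[j]:
        L, H = max(0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    a_j = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)

    alpha = alpha.copy()
    alpha[i], alpha[j] = a_i, a_j
    return alpha

# Tiny demo on random data (illustrative only)
rng = np.random.default_rng(0)
Xd = rng.normal(size=(6, 2))
yd = np.array([1, 1, 1, -1, -1, -1])
Kd = Xd @ Xd.T                                # linear kernel matrix
print(smo_pair_update(np.zeros(6), 0.0, 0, 3, yd, Kd, C=1.0))
```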
Scaling and Memory
Time Complexity: O(n²) to O(n³), depending on the algorithm.
Space Complexity: O(n²) for kernel matrix storage.
For Large Datasets:
- Use linear kernels when possible
- Implement stochastic/online SVM variants
- Use approximation methods (Nyström, random features), as sketched below
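A minimal sketch of the Nyström route with scikit-learn: approximate the RBF kernel with a small set of landmark components, then fit a linear SVM on the resulting features. The data, component count, and C are arbitrary assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)

# 300 landmark components: O(n * 300) features, not an O(n^2) kernel matrix
model = make_pipeline(Nystroem(kernel="rbf", n_components=300, random_state=0),
                      LinearSVC(C=1.0, max_iter=5000))
model.fit(X, y)
print(model.score(X, y))
```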
Advantages and Limitations
Advantages:
- Optimal Margin: Mathematically principled approach
- Kernel Trick: Handles non-linear relationships elegantly
- Sparse Solution: Only support vectors matter
- Regularization: Built-in overfitting prevention
- Global Optimum: Convex optimization problem
Limitations:
- Computational Cost: Up to O(n³) training complexity
- Memory Requirements: Kernel matrix storage
- Parameter Selection: Requires careful hyperparameter tuning
- No Native Probabilistic Output: Does not directly provide class probabilities (requires calibration, e.g., Platt scaling)
- Multiclass Extension: Not naturally multiclass
Support Vector Machines provide powerful, theoretically grounded solutions for classification problems. They excel where optimal decision boundaries and robust performance with limited training data matter, which makes them particularly valuable in financial risk assessment and retail customer analytics.