Support Vector Machines

Support Vector Machines (SVMs) find optimal decision boundaries by maximizing margins between classes. In financial services, they excel at binary classification tasks like fraud detection and credit approval. In retail, they power customer segmentation and demand classification.

Mathematical Foundation

Linear SVM

Find the hyperplane that maximizes the margin between classes:

Hyperplane: $w^T x + b = 0$

Symbol Definitions:

  • $w$ = Weight vector (perpendicular to decision boundary)
  • $x$ = Input feature vector
  • $b$ = Bias term (offset from origin)

Classification Rule: $\hat{y} = \text{sign}(w^T x + b)$

Distance from Point to Hyperplane: $d(x) = \frac{|w^T x + b|}{\|w\|}$

Symbol Definitions:

  • $|\cdot|$ = Absolute value
  • $\|w\|$ = L2 norm of the weight vector
  • $\text{sign}(\cdot)$ = Sign function (returns -1 or +1)

Margin Maximization

Maximize the minimum distance from the hyperplane to the closest points (the support vectors):

Margin: $\gamma = \frac{2}{\|w\|}$

Optimization Problem: $\min_{w,b} \; \frac{1}{2}\|w\|^2$

Subject to: $y_i(w^T x_i + b) \geq 1, \quad i = 1, \dots, n$

Symbol Definitions:

  • $y_i \in \{-1, +1\}$ = True class label for sample $i$
  • $n$ = Number of training samples

Support Vectors

Points that lie exactly on the margin boundary satisfy:

$y_i(w^T x_i + b) = 1$

Only support vectors determine the decision boundary; all other points can be removed without changing the solution.
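
A minimal sketch of this property with scikit-learn (the synthetic dataset and parameter values below are illustrative assumptions, not this chapter's own example): fit a linear SVC, then refit on its support vectors alone and observe that the hyperplane is unchanged.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, well-separated two-class data (illustrative only)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=42)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("Support vectors:", clf.support_vectors_.shape[0], "of", len(X))

# Refit on the support vectors alone: the hyperplane (w, b) is unchanged
sv_idx = clf.support_
clf_sv = SVC(kernel="linear", C=1.0).fit(X[sv_idx], y[sv_idx])
print("w:", clf.coef_, "vs", clf_sv.coef_)
print("b:", clf.intercept_, "vs", clf_sv.intercept_)
```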

Soft Margin SVM

Handling Non-Separable Data

Introduce slack variables $\xi_i$ to allow misclassification:

$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$

Subject to: $y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, n$

Symbol Definitions:

  • $\xi_i$ = Slack variable for sample $i$ (penalty for violating the margin)
  • $C$ = Regularization parameter (trade-off between margin width and violations)
  • $\sum_{i=1}^{n} \xi_i$ = Total penalty for margin violations

Interpretation of C:

  • Large C: Small margin, fewer misclassifications (high variance)
  • Small C: Large margin, more misclassifications (high bias)
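
The trade-off is easy to see empirically. A short sketch (the synthetic data and the specific grid of C values are assumptions for illustration): as C grows, the geometric margin $2/\|w\|$ shrinks and fewer training points sit on or inside it.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes so the soft margin actually matters (illustrative data)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])  # geometric margin width
    n_sv = clf.support_vectors_.shape[0]         # points on or inside the margin
    print(f"C={C:>6}: margin={margin:.3f}, support vectors={n_sv}")
```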

Dual Formulation

Lagrangian Dual Problem

Transform to the dual form using Lagrange multipliers:

$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$

Subject to: $0 \leq \alpha_i \leq C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$

Symbol Definitions:

  • $\alpha_i$ = Lagrange multiplier for sample $i$
  • $x_i^T x_j$ = Dot product between samples $i$ and $j$

Solution: $w = \sum_{i=1}^{n} \alpha_i y_i x_i$

Prediction: $\hat{y} = \text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i \, x_i^T x + b\right)$

Support Vectors: Samples with $\alpha_i > 0$
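
scikit-learn exposes the dual solution directly: dual_coef_ stores the products $\alpha_i y_i$ for the support vectors. A quick sketch (again on assumed synthetic data) verifying that $w = \sum_i \alpha_i y_i x_i$ reproduces the fitted coefficients:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=42)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_[0, k] holds alpha_k * y_k for the k-th support vector
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_[0]))  # True
```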

Financial Services Example: Credit Card Fraud Detection

Business Context: Credit card company uses SVM to detect fraudulent transactions in real-time, minimizing false positives while maintaining high fraud detection rates.

Feature Engineering:

  • $x_1$ = Transaction amount (log-scaled)
  • $x_2$ = Time since last transaction (normalized)
  • $x_3$ = Merchant category risk score
  • $x_4$ = Geographic distance from home (miles)
  • $x_5$ = Transaction velocity (frequency score)
  • $x_6$ = Card-present/not-present indicator
  • $x_7$ = Amount deviation from user's spending pattern
  • $x_8$ = Hour of day (encoded cyclically)

Class Imbalance:

  • Legitimate transactions: 99.85%
  • Fraudulent transactions: 0.15%

SVM Configuration:

Class Weights: $w_{\text{fraud}} = \frac{n_{\text{legit}}}{n_{\text{fraud}}}, \quad w_{\text{legit}} = 1$

Weighted Loss Function: $\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} w_{y_i} \xi_i$

Symbol Definitions:

  • $w_{y_i}$ = Class weight for sample $i$
  • $n_{\text{fraud}}, n_{\text{legit}}$ = Number of fraud/legitimate samples

Decision Function: $f(x) = w^T x + b$

Threshold Optimization: Flag a transaction as fraud when $f(x) > \tau$ instead of using the default threshold $\tau = 0$.

Optimal Threshold (Maximizing F1-Score): $\tau^* = \arg\max_{\tau} F_1(\tau)$
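
A hedged sketch of this configuration (the data is simulated at the stated 0.15% fraud rate; the threshold grid and all parameter values are illustrative): class_weight="balanced" reweights classes inversely to their frequency, and the decision threshold is then tuned on held-out margin scores to maximize F1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Simulated, heavily imbalanced data: ~0.15% positive (fraud) class
X, y = make_classification(n_samples=100_000, n_features=8,
                           weights=[0.9985], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

# "balanced" sets each class weight to n / (n_classes * n_class)
clf = LinearSVC(class_weight="balanced", C=1.0).fit(X_tr, y_tr)

# Tune the threshold tau on raw margin scores f(x) to maximize F1
scores = clf.decision_function(X_te)
taus = np.quantile(scores, np.linspace(0.90, 0.9999, 200))
best_tau = max(taus, key=lambda t: f1_score(y_te, scores > t))
print(f"tau*={best_tau:.3f}, F1={f1_score(y_te, scores > best_tau):.3f}")
```

In practice the threshold would be tuned on a separate validation split rather than the final test data.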

Business Performance:

  • Precision: 0.94 (fraud predictions that are correct)
  • Recall: 0.87 (actual frauds detected)
  • F1-Score: 0.90 (harmonic mean of precision and recall)
  • False Positive Rate: 0.8% (legitimate transactions flagged)
  • Processing Speed: 2ms per transaction
  • Annual Fraud Prevention: $28M in blocked fraudulent transactions

Support Vector Analysis:

  • Number of Support Vectors: 12,847 out of 100,000 samples (12.8%)
  • Fraud Support Vectors: 89% of fraud cases become support vectors
  • Legitimate Support Vectors: 11.2% of legitimate cases

Kernel Methods

Non-Linear SVM

Use the kernel trick to handle non-linear decision boundaries:

Kernel Function: $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

Symbol Definitions:

  • $K(x_i, x_j)$ = Kernel function (similarity measure)
  • $\phi(\cdot)$ = Feature mapping to a higher-dimensional space

Decision Function with Kernels: $\hat{y} = \text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\right)$

Common Kernel Functions

Polynomial Kernel: $K(x_i, x_j) = (x_i^T x_j + c)^d$

Symbol Definitions:

  • $c$ = Constant term (typically 1)
  • $d$ = Polynomial degree

Radial Basis Function (RBF/Gaussian) Kernel: $K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$

Symbol Definitions:

  • $\gamma$ = Kernel parameter (controls locality)
  • $\|x_i - x_j\|^2$ = Squared Euclidean distance

Sigmoid Kernel: $K(x_i, x_j) = \tanh(\kappa \, x_i^T x_j + c)$
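
A brief sketch comparing these kernels on the same non-linear dataset (the half-moons data and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

kernels = [
    SVC(kernel="linear"),
    SVC(kernel="poly", degree=3, coef0=1),  # coef0 is the constant term c
    SVC(kernel="rbf", gamma=1.0),
    SVC(kernel="sigmoid", coef0=0.0),
]
for clf in kernels:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{clf.kernel:>8}: CV accuracy = {acc:.3f}")
```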

Retail Example: Customer Segmentation for Marketing

Business Context: Fashion retailer uses SVM to segment customers into distinct groups for targeted marketing campaigns based on purchase behavior and demographics.

Multi-Class Classification: Segment customers into 4 categories:

  1. High-Value Loyalists
  2. Frequent Bargain Hunters
  3. Occasional Buyers
  4. Seasonal Shoppers

Customer Features:

  • $x_1$ = Average order value ($, log-scaled)
  • $x_2$ = Purchase frequency (orders per year)
  • $x_3$ = Brand loyalty score (0-1)
  • $x_4$ = Price sensitivity index
  • $x_5$ = Seasonal purchasing pattern
  • $x_6$ = Product category diversity
  • $x_7$ = Return rate (%)
  • $x_8$ = Email engagement score
  • $x_9$ = Social media activity level
  • $x_{10}$ = Customer tenure (months)

One-vs-Rest (OvR) Multi-Class Strategy: Train a separate binary SVM for each class:

$f_k(x) = \sum_{i=1}^{n} \alpha_i^{(k)} y_i^{(k)} K(x_i, x) + b_k, \quad k = 1, \dots, 4$

Final Prediction: $\hat{y} = \arg\max_{k} f_k(x)$

RBF Kernel Configuration: $K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$

Hyperparameter Tuning: Grid search over the regularization parameter $C$ and kernel width $\gamma$, selecting the combination with the best mean K-fold cross-validation score.
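
A compact sketch of this pipeline (the segment labels are simulated, and the C/γ grid shown is a common default, not the retailer's actual grid):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Simulated 4-segment customer data with 10 behavioral features
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=4, random_state=1)

pipe = make_pipeline(StandardScaler(),
                     OneVsRestClassifier(SVC(kernel="rbf")))
grid = {
    "onevsrestclassifier__estimator__C": [0.1, 1, 10, 100],
    "onevsrestclassifier__estimator__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
print(search.best_params_, f"CV accuracy={search.best_score_:.3f}")
```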

Segment Characteristics:

High-Value Loyalists (25% of customers):

  • Average Order Value: $185+
  • Purchase Frequency: 8+ times/year
  • Brand Loyalty Score: 0.8+
  • Marketing Strategy: VIP treatment, early access

Frequent Bargain Hunters (35% of customers):

  • High frequency, low average order value
  • High price sensitivity
  • Marketing Strategy: Sale notifications, discount codes

Business Applications:

Personalized Marketing Budget: Allocate campaign spend across segments in proportion to each segment's predicted value.

Campaign Targeting: Route each campaign to the segment it was designed for (e.g., early-access offers to High-Value Loyalists, discount codes to Bargain Hunters).

Business Results:

  • Segmentation Accuracy: 84.7% correct classification
  • Campaign Response Rates:
    • High-Value: 42% (vs. 18% mass marketing)
    • Bargain Hunters: 28% (vs. 12% mass marketing)
  • ROI Improvement: 235% vs. untargeted campaigns
  • Customer Satisfaction: 4.3/5 rating for personalized offers
  • Revenue Impact: $1.8M quarterly increase from targeted campaigns

Model Selection and Validation

Cross-Validation for SVM

Select optimal hyperparameters $(C, \gamma, \text{kernel})$:

K-Fold Cross-Validation: $\text{CV}(\theta) = \frac{1}{K} \sum_{k=1}^{K} \text{score}_k(\theta)$, where the model is trained on $K-1$ folds and $\text{score}_k$ is measured on the held-out fold $k$.

Performance Metrics

For Binary Classification:

Precision: $\text{Precision} = \frac{TP}{TP + FP}$

Recall (Sensitivity): $\text{Recall} = \frac{TP}{TP + FN}$

F1-Score: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Symbol Definitions:

  • $TP$ = True Positives
  • $FP$ = False Positives
  • $FN$ = False Negatives
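
Plugging in the fraud-detection numbers from earlier as a sanity check (the TP/FP/FN counts below are hypothetical values chosen to be consistent with precision 0.94 and recall 0.87):

```python
# Hypothetical confusion counts consistent with the fraud example above
TP, FP, FN = 870, 56, 130   # 870/(870+56) ~= 0.94, 870/(870+130) = 0.87

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
# -> precision=0.94, recall=0.87, F1=0.90 (the harmonic mean)
```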

Computational Considerations

Sequential Minimal Optimization (SMO)

An efficient algorithm for solving the SVM dual problem:

Working Set Selection: Select two multipliers $(\alpha_i, \alpha_j)$ to optimize jointly at each iteration; the constraint $\sum_i \alpha_i y_i = 0$ means a single multiplier cannot change alone.

Coordinate Ascent: $\alpha_j^{\text{new}} = \alpha_j + \frac{y_j (E_i - E_j)}{\eta}$, followed by clipping to the box constraints and a compensating update to $\alpha_i$.

Symbol Definitions:

  • $E_i = f(x_i) - y_i$ = Prediction error for sample $i$
  • $\eta = K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j)$ = Second derivative of the objective function along the pair's direction
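
A minimal sketch of one SMO update for a linear kernel (simplified: no heuristic working-set selection, the box-constraint bounds are reduced to a plain clip, and the bias update is omitted):

```python
import numpy as np

def smo_step(i, j, X, y, alpha, b, C):
    """One simplified SMO update on the pair (alpha_i, alpha_j), linear kernel."""
    K = X @ X.T                                # kernel (Gram) matrix
    f = (alpha * y) @ K + b                    # current decision values f(x_m)
    E = f - y                                  # prediction errors E_m
    eta = K[i, i] + K[j, j] - 2 * K[i, j]      # curvature along the pair
    if eta <= 0:
        return alpha, b                        # skip degenerate direction
    a_j = alpha[j] + y[j] * (E[i] - E[j]) / eta
    a_j = np.clip(a_j, 0.0, C)                 # box constraint (bounds simplified)
    # Keep sum(alpha * y) = 0 by moving alpha_i in the compensating direction
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)
    alpha = alpha.copy()
    alpha[i], alpha[j] = a_i, a_j
    return alpha, b
```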

Scaling and Memory

Time Complexity: O(n²) to O(n³) depending on the algorithm

Space Complexity: O(n²) for kernel matrix storage

For Large Datasets:

  • Use linear kernels when possible
  • Implement stochastic/online SVM variants
  • Use approximation methods (Nyström, random features)
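
For large n, a linear model trained with stochastic gradient descent on the hinge loss, or an explicit kernel approximation, sidesteps the O(n²) kernel matrix entirely. A sketch of both options (dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=50_000, n_features=20, random_state=3)

# Option 1: linear SVM via SGD on the hinge loss -- O(n) per epoch, no kernel matrix
linear_svm = SGDClassifier(loss="hinge", alpha=1e-4).fit(X, y)

# Option 2: Nystroem approximation of the RBF kernel feeding a linear SVM
approx = make_pipeline(Nystroem(kernel="rbf", gamma=0.1, n_components=300),
                       SGDClassifier(loss="hinge")).fit(X, y)
print(linear_svm.score(X, y), approx.score(X, y))
```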

Advantages and Limitations

Advantages:

  • Optimal Margin: Mathematically principled approach
  • Kernel Trick: Handles non-linear relationships elegantly
  • Sparse Solution: Only support vectors matter
  • Regularization: Built-in overfitting prevention
  • Global Optimum: Convex optimization problem

Limitations:

  • Computational Cost: O(n³) training complexity
  • Memory Requirements: Kernel matrix storage
  • Parameter Selection: Requires careful hyperparameter tuning
  • No Probabilistic Output: Does not naturally provide class probabilities (requires calibration such as Platt scaling)
  • Multiclass Extension: Not inherently multiclass; requires one-vs-rest or one-vs-one decomposition

Support Vector Machines provide powerful, theoretically grounded solutions for classification. They excel where optimal decision boundaries and robust performance with limited training data are required, making them particularly valuable in financial risk assessment and retail customer analytics.

