Neural Networks in Supervised Learning

This section focuses on traditional feedforward architectures for classification and regression tasks. Unlike deep learning, which covers more complex architectures, the emphasis here is on foundational neural network concepts, training algorithms, and practical applications in financial services and retail.

Perceptron Foundation

Single Perceptron

Basic linear classifier with threshold activation:

$y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i + b > 0 \\ 0 & \text{otherwise} \end{cases}$

Symbol Definitions:

  • $y$ = Binary output (0 or 1)
  • $w_i$ = Weight for input feature $i$
  • $x_i$ = Input feature $i$
  • $b$ = Bias term (threshold)
  • $n$ = Number of input features

Perceptron Learning Rule:

$w_i(t+1) = w_i(t) + \eta \,(y - \hat{y})\, x_i$

Symbol Definitions:

  • $w_i(t)$ = Weight $i$ at iteration $t$
  • $\eta$ = Learning rate
  • $y$ = True label
  • $\hat{y}$ = Predicted label

Limitation: Can only solve linearly separable problems
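
A minimal NumPy sketch of this update rule on a toy linearly separable problem (the learning rate, epoch count, and AND-gate data are illustrative):

    import numpy as np

    def train_perceptron(X, y, eta=0.1, epochs=20):
        """Train a single perceptron with the update rule above."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                y_hat = 1 if np.dot(w, xi) + b > 0 else 0  # threshold activation
                error = yi - y_hat                          # (y - y_hat)
                w += eta * error * xi                       # w_i <- w_i + eta*(y - y_hat)*x_i
                b += eta * error
        return w, b

    # Linearly separable toy problem: AND gate
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 0, 0, 1])
    w, b = train_perceptron(X, y)
    print(w, b)  # learned weights separate the two classes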

Multi-Layer Perceptron (MLP)

Overcomes the linear-separability limitation through hidden layers:

$\mathbf{h} = g(W^{(1)} \mathbf{x} + \mathbf{b}^{(1)})$
$\mathbf{y} = g(W^{(2)} \mathbf{h} + \mathbf{b}^{(2)})$

Symbol Definitions:

  • $\mathbf{h}$ = Hidden layer activations
  • $g(\cdot)$ = Activation function
  • $W^{(1)}, W^{(2)}$ = Weight matrices for layers 1 and 2
  • $\mathbf{b}^{(1)}, \mathbf{b}^{(2)}$ = Bias vectors for layers 1 and 2
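
The same two equations in NumPy (layer sizes and random weights are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)        # 4 input features

    W1 = rng.normal(size=(8, 4))  # layer-1 weight matrix (8 hidden units)
    b1 = np.zeros(8)              # layer-1 bias vector
    W2 = rng.normal(size=(1, 8))  # layer-2 weight matrix (1 output)
    b2 = np.zeros(1)              # layer-2 bias vector

    g = np.tanh                   # activation function g
    h = g(W1 @ x + b1)            # h = g(W1 x + b1)
    y = g(W2 @ h + b2)            # y = g(W2 h + b2)
    print(y)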

Supervised Learning Applications

Classification Networks

Output layer configuration for multi-class problems:

Softmax Output:

$P(y = k \mid \mathbf{x}) = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$

Symbol Definitions:

  • $P(y = k \mid \mathbf{x})$ = Probability of class $k$ given input $\mathbf{x}$
  • $z_k$ = Raw output (logit) for class $k$
  • $K$ = Number of classes

Cross-Entropy Loss:

$L = -\dfrac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}$

Symbol Definitions:

  • $y_{ik}$ = True label (1 if sample $i$ belongs to class $k$, 0 otherwise)
  • $\hat{y}_{ik}$ = Predicted probability for sample $i$, class $k$
  • $N$ = Number of training samples
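
Both computations in NumPy, with the standard max-subtraction trick for numerical stability (the logits and labels are toy values):

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)  # stabilizes exp for large logits
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def cross_entropy(y_true, y_prob, eps=1e-12):
        # y_true: one-hot labels (N, K); y_prob: predicted probabilities (N, K)
        return -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))

    logits = np.array([[2.0, 0.5, -1.0],
                       [0.1, 1.5, 0.3]])
    y_true = np.array([[1, 0, 0],
                       [0, 1, 0]])
    probs = softmax(logits)
    print(probs, cross_entropy(y_true, probs))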

Regression Networks

Linear output for continuous predictions:

$\hat{y} = \mathbf{w}^{\top} \mathbf{h} + b$

Mean Squared Error Loss:

$L = \dfrac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
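
The same loss in NumPy:

    import numpy as np

    def mse(y_true, y_pred):
        # Mean squared error: average of squared residuals
        return np.mean((y_true - y_pred) ** 2)

    print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))  # 0.25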

Financial Services Example: Credit Risk Assessment

Business Context: Regional bank uses multi-layer perceptron to assess credit risk for small business loans, requiring fast decisions with interpretable risk scores.

Network Architecture:

  • Input Layer: 25 financial and business features
  • Hidden Layer 1: 50 neurons with ReLU activation
  • Hidden Layer 2: 30 neurons with ReLU activation
  • Output Layer: 1 neuron with sigmoid activation (default probability)

Input Features:

  • $x_1$ = Annual revenue (thousands, normalized)
  • $x_2$ = Business age (years)
  • $x_3$ = Cash flow ratio
  • $x_4$ = Debt service coverage ratio
  • $x_5$ = Industry risk score (1-10)
  • $x_6$ = Owner credit score
  • $x_7$ = Collateral value ratio
  • ... (18 additional features)

Network Equations:

Hidden Layer 1:

$\mathbf{h}^{(1)} = \text{ReLU}(W^{(1)} \mathbf{x} + \mathbf{b}^{(1)})$

Hidden Layer 2:

$\mathbf{h}^{(2)} = \text{ReLU}(W^{(2)} \mathbf{h}^{(1)} + \mathbf{b}^{(2)})$

Output (Default Probability):

$p_{\text{default}} = \sigma(\mathbf{w}^{(3)\top} \mathbf{h}^{(2)} + b^{(3)})$

Risk Score Calibration:

Business Rules Integration:

Training Details:

  • Loss Function: Binary cross-entropy with class weighting
  • Optimizer: Adam with learning rate scheduling
  • Regularization: L2 weight decay (λ = 0.001) + Dropout (0.3)
  • Training Data: 50,000 historical loan applications
  • Validation: 5-fold cross-validation
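
A PyTorch sketch of the architecture and training setup described above; the positive-class weight, scheduler settings, and inference note are illustrative assumptions rather than the bank's actual configuration:

    import torch
    import torch.nn as nn

    class CreditRiskNet(nn.Module):
        """25 -> 50 -> 30 -> 1 MLP with ReLU hidden layers."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(25, 50), nn.ReLU(), nn.Dropout(0.3),
                nn.Linear(50, 30), nn.ReLU(), nn.Dropout(0.3),
                nn.Linear(30, 1),  # raw logit; sigmoid is folded into the loss
            )

        def forward(self, x):
            return self.net(x).squeeze(-1)

    model = CreditRiskNet()
    # pos_weight > 1 upweights the rare default class (illustrative value)
    loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.0))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

    def train_step(x_batch, y_batch):
        optimizer.zero_grad()
        loss = loss_fn(model(x_batch), y_batch.float())
        loss.backward()
        optimizer.step()
        return loss.item()

    # Default probability at inference time: torch.sigmoid(model(x))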

Performance Metrics:

  • AUC-ROC: 0.87 (excellent discrimination)
  • Precision: 0.91 (approved loans that don't default)
  • Recall: 0.83 (actual defaults caught)
  • Calibration: Brier Score = 0.089 (well-calibrated probabilities)
  • Processing Speed: 500 applications/second
  • Business Impact: $8.2M annual reduction in default losses

Backpropagation Algorithm

Forward Pass

Compute activations layer by layer:

$\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = g(\mathbf{z}^{(l)})$

Symbol Definitions:

  • $\mathbf{z}^{(l)}$ = Pre-activation at layer $l$
  • $\mathbf{a}^{(l)}$ = Post-activation at layer $l$
  • $\mathbf{x}$ = Input features, with $\mathbf{a}^{(0)} = \mathbf{x}$

Backward Pass

Compute gradients using chain rule:

Output Layer Error:

$\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} L \odot g'(\mathbf{z}^{(L)})$

Hidden Layer Error:

$\boldsymbol{\delta}^{(l)} = \left( (W^{(l+1)})^{\top} \boldsymbol{\delta}^{(l+1)} \right) \odot g'(\mathbf{z}^{(l)})$

Weight Gradients:

$\dfrac{\partial L}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^{\top}$

Bias Gradients:

$\dfrac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$

Symbol Definitions:

  • $\boldsymbol{\delta}^{(l)}$ = Error signal at layer $l$
  • $\nabla_{\mathbf{a}} L$ = Gradient of loss w.r.t. output activations
  • $\odot$ = Element-wise multiplication
  • $g'(\cdot)$ = Derivative of activation function
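
A compact NumPy version of one forward and backward pass for a two-layer network with sigmoid activations and squared-error loss (shapes and data are toy values; sigmoid's derivative is $a(1-a)$):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 1))               # a^(0): input column vector
    y = np.array([[1.0]])                     # target

    W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
    W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

    # Forward pass
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

    # Backward pass (chain rule)
    dL_da2 = a2 - y                           # gradient of 0.5*(a2 - y)^2
    delta2 = dL_da2 * a2 * (1 - a2)           # output error: grad (*) g'(z2)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # hidden error

    dW2 = delta2 @ a1.T; db2 = delta2         # layer-2 weight/bias gradients
    dW1 = delta1 @ x.T;  db1 = delta1         # layer-1 weight/bias gradients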

Retail Example: Customer Lifetime Value Prediction

Business Context: E-commerce retailer uses neural network to predict customer lifetime value (CLV) for personalized marketing budget allocation and retention strategies.

Problem Formulation: Regression task to predict 24-month CLV based on early customer behavior patterns.

Network Architecture:

  • Input Layer: 35 customer behavior features
  • Hidden Layer 1: 64 neurons with ReLU
  • Hidden Layer 2: 32 neurons with ReLU
  • Hidden Layer 3: 16 neurons with ReLU
  • Output Layer: 1 neuron with linear activation (CLV in dollars)

Feature Categories:

Transactional Features (12 features):

  • Average order value, purchase frequency, total spent, etc.

Behavioral Features (15 features):

  • Website engagement, email open rates, product views, etc.

Demographic Features (8 features):

  • Age group, location, acquisition channel, tenure, etc.

Network Implementation:

Input Standardization:

$x'_j = \dfrac{x_j - \mu_j}{\sigma_j}$

Hidden Layer Computations:

$\mathbf{h}^{(l)} = \text{ReLU}(W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}), \quad l = 1, 2, 3, \quad \mathbf{h}^{(0)} = \mathbf{x}'$

CLV Prediction:

$\hat{y} = \mathbf{w}^{(4)\top} \mathbf{h}^{(3)} + b^{(4)}$

Loss Function with Regularization:

$L = \dfrac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{l} \|W^{(l)}\|_F^2$

Symbol Definitions:

  • $\lambda$ = L2 regularization coefficient
  • $\|W^{(l)}\|_F^2$ = Frobenius norm squared (sum of squared weights)
  • $\mu_j, \sigma_j$ = Training-set mean and standard deviation of feature $j$

Training Configuration:

  • Optimizer: Adam (β₁=0.9, β₂=0.999, ε=1e-8)
  • Learning Rate: 0.001 with exponential decay
  • Batch Size: 256 customers
  • Epochs: 200 with early stopping
  • Validation: 20% holdout set
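
A PyTorch sketch matching the stated architecture and optimizer settings; the weight-decay value, decay rate, and data loader are illustrative assumptions:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(35, 64), nn.ReLU(),
        nn.Linear(64, 32), nn.ReLU(),
        nn.Linear(32, 16), nn.ReLU(),
        nn.Linear(16, 1),                  # linear output: CLV in dollars
    )
    loss_fn = nn.MSELoss()
    # weight_decay supplies the L2 penalty from the regularized loss above
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), eps=1e-8,
                                 weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)

    def run_epoch(loader):                 # loader yields batches of 256 customers
        for x_batch, y_batch in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x_batch).squeeze(-1), y_batch)
            loss.backward()
            optimizer.step()
        scheduler.step()                   # exponential learning-rate decay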

Model Performance:

  • RMSE: improved on the $189 linear regression baseline
  • MAE: improved on the $134 baseline
  • R²: 0.78 (vs. 0.52 baseline)
  • Business Validation: 85% accuracy in predicting high-value customers

Business Applications:

Marketing Budget Allocation:

Customer Segmentation:

Business Impact:

  • Marketing ROI: 42% improvement through targeted spending
  • Customer Retention: 28% increase in high-value segment retention
  • Revenue Growth: $3.2M additional quarterly revenue
  • Cost Efficiency: 35% reduction in marketing waste

Activation Functions in Supervised Learning

Sigmoid

Good for binary classification outputs:

$\sigma(z) = \dfrac{1}{1 + e^{-z}}$

Properties:

  • Range: (0, 1) - interpretable as probabilities
  • Smooth and differentiable
  • Suffers from vanishing gradient for large |z|

Tanh

Alternative to sigmoid with zero-centered output:

$\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

Properties:

  • Range: (-1, 1) - zero-centered
  • Stronger gradients than sigmoid
  • Still suffers from vanishing gradient

ReLU (Rectified Linear Unit)

Most popular activation for hidden layers:

$\text{ReLU}(z) = \max(0, z)$

Advantages:

  • Computationally efficient
  • Helps mitigate vanishing gradient
  • Sparse activation (biological plausibility)

Disadvantage:

  • Dead neurons (neurons that never activate)
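
All three activations side by side in NumPy:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))   # range (0, 1)

    def tanh(z):
        return np.tanh(z)                 # range (-1, 1), zero-centered

    def relu(z):
        return np.maximum(0.0, z)         # zero for negative inputs

    z = np.linspace(-3, 3, 7)
    print(sigmoid(z), tanh(z), relu(z), sep="\n")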

Regularization Techniques

L1 and L2 Regularization

Add penalty terms to the loss to discourage large weights and prevent overfitting:

L2 (Ridge) Regularization:

$L_{\text{total}} = L + \lambda \sum_{j} w_j^2$

L1 (Lasso) Regularization:

$L_{\text{total}} = L + \lambda \sum_{j} |w_j|$
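
Both penalties in NumPy (the λ value and toy weight matrices are illustrative):

    import numpy as np

    def regularized_loss(base_loss, weights, lam=1e-3, kind="l2"):
        # weights: list of weight arrays, one per layer
        if kind == "l2":
            penalty = sum(np.sum(W ** 2) for W in weights)     # ridge: sum of squares
        else:
            penalty = sum(np.sum(np.abs(W)) for W in weights)  # lasso: sum of |w|
        return base_loss + lam * penalty

    Ws = [np.ones((2, 2)), np.full((3,), 0.5)]
    print(regularized_loss(1.0, Ws, lam=1e-3, kind="l2"))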

Dropout

Randomly set neurons to zero during training:

$\tilde{\mathbf{h}} = \mathbf{m} \odot \mathbf{h}, \quad m_i \sim \text{Bernoulli}(1 - p)$

Symbol Definitions:

  • $\mathbf{m}$ = Random binary mask
  • $p$ = Dropout probability (typically 0.2-0.5)
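
An inverted-dropout sketch in NumPy, a common formulation that rescales activations at training time so inference needs no change (p = 0.3 is illustrative):

    import numpy as np

    def dropout(h, p=0.3, training=True, rng=None):
        """Inverted dropout: rescale at train time, identity at inference."""
        if not training:
            return h                      # use the full network at inference
        rng = rng or np.random.default_rng()
        m = rng.random(h.shape) >= p      # keep each unit with probability 1 - p
        return (m * h) / (1.0 - p)        # rescale to preserve expected activation

    h = np.ones(10)
    print(dropout(h, p=0.3, rng=np.random.default_rng(0)))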

Early Stopping

Monitor validation error during training and stop once it begins to rise while training error continues to fall.
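
A framework-agnostic patience loop; `train_step` and `validate` are caller-supplied callbacks (hypothetical names), and the patience value is illustrative:

    import copy

    def fit_with_early_stopping(model, train_step, validate,
                                max_epochs=200, patience=10):
        """Stop when validation loss fails to improve for `patience` epochs."""
        best_val, best_state, wait = float("inf"), None, 0
        for epoch in range(max_epochs):
            train_step(model)                     # one epoch of training
            val_loss = validate(model)            # loss on the held-out set
            if val_loss < best_val:
                best_val, wait = val_loss, 0
                best_state = copy.deepcopy(model) # checkpoint the best weights
            else:
                wait += 1
                if wait >= patience:
                    break                         # validation stopped improving
        return best_state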

Model Selection and Hyperparameter Tuning

Grid Search

Exhaustive evaluation of every combination in a predefined hyperparameter grid.

Common Hyperparameters:

  • Number of hidden layers and neurons
  • Learning rate and decay schedule
  • Regularization coefficients
  • Batch size and training epochs
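
A plain grid search over two of these hyperparameters; the grid values and the toy evaluation function are illustrative stand-ins for cross-validated model performance:

    from itertools import product

    def grid_search(evaluate, grid):
        """Evaluate every combination; return the best score and settings."""
        best_score, best_params = float("-inf"), None
        keys = list(grid)
        for values in product(*(grid[k] for k in keys)):
            params = dict(zip(keys, values))
            score = evaluate(**params)      # e.g. cross-validated accuracy
            if score > best_score:
                best_score, best_params = score, params
        return best_score, best_params

    def toy_eval(hidden_units, learning_rate):
        # stand-in for real cross-validated performance
        return -abs(hidden_units - 64) - abs(learning_rate - 1e-3)

    grid = {"hidden_units": [32, 64, 128], "learning_rate": [1e-2, 1e-3, 1e-4]}
    print(grid_search(toy_eval, grid))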

Random Search

Often more efficient than grid search: for the same budget it tries more distinct values of each hyperparameter, which helps when only a few hyperparameters strongly affect performance.
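
A sampling sketch in NumPy (the ranges are illustrative; sampling the learning rate log-uniformly spreads trials evenly across orders of magnitude):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_config():
        return {
            "hidden_units": int(rng.integers(16, 257)),
            "learning_rate": float(10 ** rng.uniform(-4, -1)),  # log-uniform
            "dropout": float(rng.uniform(0.2, 0.5)),
        }

    configs = [sample_config() for _ in range(20)]  # 20 random trials
    print(configs[0])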

Bayesian Optimization

Use a Gaussian process surrogate to model hyperparameter performance, then choose the next configuration by maximizing an acquisition function:

$\theta_{\text{next}} = \arg\max_{\theta} \, \text{EI}(\theta \mid \mathcal{D})$

Symbol Definitions:

  • $\text{EI}(\theta \mid \mathcal{D})$ = Expected improvement acquisition function
  • $\mathcal{D}$ = Previous hyperparameter evaluations $\{(\theta_i, \text{score}_i)\}$
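
A minimal sketch of expected-improvement search using scikit-learn's Gaussian process regressor; the toy objective and one-dimensional candidate grid stand in for real cross-validated scores:

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor

    def expected_improvement(gp, X_cand, y_best):
        mu, sigma = gp.predict(X_cand, return_std=True)
        sigma = np.maximum(sigma, 1e-9)
        imp = y_best - mu                      # improvement when minimizing
        z = imp / sigma
        return imp * norm.cdf(z) + sigma * norm.pdf(z)

    def objective(lr):                         # toy validation loss to minimize
        return (np.log10(lr) + 3) ** 2

    X = np.array([[1e-4], [1e-2]])             # D: learning rates tried so far
    y = np.array([objective(x[0]) for x in X])
    for _ in range(10):
        gp = GaussianProcessRegressor().fit(np.log10(X), y)
        cand = np.linspace(-5, -1, 100)[:, None]   # log10(lr) candidates
        ei = expected_improvement(gp, cand, y.min())
        x_next = 10 ** cand[np.argmax(ei)]     # most promising configuration
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next[0]))
    print(X[np.argmin(y)])                     # best learning rate found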

Neural networks in supervised learning provide powerful function approximators for classification and regression tasks, flexible enough to model complex non-linear relationships, while careful architecture design and regularization keep training stable and overfitting in check.