Regularization Overview
Regularization prevents overfitting by constraining model complexity, which improves generalization and helps balance bias against variance. Different techniques suit different kinds of models.
Overfitting occurs when a model memorizes the training data: it performs well on the examples it has seen but poorly on new data. Regularization pushes the model toward general patterns instead.
The diagram illustrates overfitting: training accuracy is high while test accuracy is low, because the model has memorized training patterns that do not transfer to new data.
Bias-Variance Tradeoff
Bias measures systematic error from overly simple assumptions; high bias leads to underfitting, where the model misses important patterns. Variance measures sensitivity to the particular training set; high variance leads to overfitting, where the model learns noise.
The tradeoff is governed by model complexity: simple models tend to have high bias and low variance, while complex models have low bias and high variance. A well-chosen model balances the two.
# Bias-Variance Tradeoff
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Generate noisy linear data
X = np.linspace(0, 10, 20).reshape(-1, 1)
y = 2 * X.flatten() + np.random.randn(20) * 2

# Simple model (high bias, low variance)
model_simple = LinearRegression()
model_simple.fit(X, y)
y_pred_simple = model_simple.predict(X)
mse_simple = mean_squared_error(y, y_pred_simple)

# Complex model (low bias, high variance)
poly = PolynomialFeatures(degree=15)
X_poly = poly.fit_transform(X)
model_complex = LinearRegression()
model_complex.fit(X_poly, y)
y_pred_complex = model_complex.predict(X_poly)
mse_complex = mean_squared_error(y, y_pred_complex)

print("Simple model MSE: " + str(mse_simple))
print("Complex model MSE: " + str(mse_complex))
# The complex model reaches a lower training MSE, but the simpler model,
# with its higher bias and lower variance, generalizes better to new data.
Understanding the bias-variance tradeoff guides how much regularization to apply: models suffering from high bias need less, while models suffering from high variance need more.
The diagram shows the tradeoff: simple models sit in the high-bias region, complex models in the high-variance region, and the optimal model balances both.
L1 and L2 Regularization
L1 regularization penalizes the absolute values of the weights, which encourages sparsity and effectively performs feature selection. L2 regularization penalizes the squared weights, which shrinks all weights and discourages large values.
The L1 cost is J = loss + λΣ|w|, where λ controls the strength; a larger λ drives more weights exactly to zero. The L2 cost is J = loss + λΣw², which shrinks weights toward zero but rarely eliminates features entirely.
# L1 and L2 Regularization
from sklearn.linear_model import Lasso, Ridge
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([6, 15, 24])

# L1 regularization (Lasso)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Lasso weights: " + str(lasso.coef_))
# Some weights become exactly zero

# L2 regularization (Ridge)
ridge = Ridge(alpha=0.1)
ridge.fit(X, y)
print("Ridge weights: " + str(ridge.coef_))
# All weights shrink but remain non-zero
Choose the penalty based on your goal: use L1 for feature selection, L2 for smooth weight shrinkage, or both together (Elastic Net) for a combination of the two, as in the short sketch below.
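As a minimal sketch of the combined penalty, scikit-learn's ElasticNet exposes an overall strength (alpha) and a mix between L1 and L2 (l1_ratio); the values below are illustrative, not tuned settings.

# Elastic Net: combined L1 and L2 penalties
from sklearn.linear_model import ElasticNet
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([6, 15, 24])

# l1_ratio=0.5 weights the L1 and L2 terms equally;
# l1_ratio=1.0 recovers Lasso, l1_ratio=0.0 recovers Ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print("Elastic Net weights: " + str(enet.coef_))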
The diagram compares L1 and L2: L1 produces sparse solutions, while L2 produces smooth ones.
Dropout
Dropout randomly disables neurons during training, which prevents co-adaptation, forces the network to learn redundant representations, and reduces overfitting in neural networks.
The dropout rate p is the probability of disabling a neuron; common values are 0.2 to 0.5, and higher rates regularize more strongly. During inference all neurons stay active, and activations are rescaled by the keep probability (1 - p), or equivalently by 1/(1 - p) during training in the "inverted" formulation, so the expected output magnitude matches training.
Detailed Dropout Mechanisms
Dropout prevents co-adaptation between neurons: because random neurons are disabled during training, the network must learn redundant representations, no single neuron becomes critical, and the model becomes more robust.
Mathematical formulation: during training the output is y = f(Wx + b) ⊙ m, where each element of m is drawn from Bernoulli(1 - p). At inference the output is y = f(Wx + b) × (1 - p); scaling by the keep probability (1 - p) preserves the expected output magnitude. In practice most implementations use inverted dropout, dividing by (1 - p) during training instead so that no scaling is needed at inference, as in the code below.
Spatial dropout drops entire feature maps rather than individual activations, which suits convolutional layers: nearby pixels are strongly correlated, so dropping whole channels prevents spatial co-adaptation more effectively than element-wise dropout.
# Detailed Dropout Implementation
import numpy as np
import torch
import torch.nn as nn

class DropoutDetailed:
    def __init__(self, p=0.5):
        self.p = p
        self.mask = None
        self.training = True

    def forward(self, x):
        if self.training:
            # Generate inverted-dropout mask: keep with probability (1 - p), scale by 1 / (1 - p)
            self.mask = (np.random.rand(*x.shape) > self.p) / (1 - self.p)
            return x * self.mask
        else:
            # During inference, no dropout and no rescaling needed
            return x

    def backward(self, grad_output):
        if self.training:
            return grad_output * self.mask
        return grad_output

# PyTorch dropout comparison
class DropoutComparison:
    def __init__(self):
        self.dropout1 = nn.Dropout(p=0.5)       # Standard dropout
        self.dropout2 = nn.Dropout2d(p=0.5)     # Spatial dropout for 2D feature maps
        self.dropout3 = nn.AlphaDropout(p=0.5)  # Alpha dropout for SELU networks

    def compare(self, x):
        # Standard dropout
        out1 = self.dropout1(x)
        # Spatial dropout (requires 4D input: batch, channels, height, width)
        if len(x.shape) == 4:
            out2 = self.dropout2(x)
        else:
            out2 = None
        # Alpha dropout (maintains self-normalizing properties)
        out3 = self.dropout3(x)
        return out1, out2, out3

# Example
x = torch.randn(32, 100)  # batch_size=32, features=100
comparison = DropoutComparison()
out1, out2, out3 = comparison.compare(x)
print("Standard dropout output shape: " + str(out1.shape))
print("Alpha dropout output shape: " + str(out3.shape))
Advanced Regularization Techniques
Batch normalization normalizes each layer's inputs over the mini-batch, reducing internal covariate shift, enabling higher learning rates, and improving training stability; the noise from batch statistics also acts as a mild regularizer.
Layer normalization normalizes across the feature dimension of each individual example, so it does not depend on batch statistics; this makes it well suited to variable-length sequences, RNNs, and transformers. A minimal comparison appears below.
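A small sketch contrasting the two normalizations with PyTorch's nn.BatchNorm1d and nn.LayerNorm; the tensor shapes here are illustrative only.

# Batch norm vs. layer norm on a (batch, features) tensor
import torch
import torch.nn as nn

x = torch.randn(8, 16)           # batch of 8 examples, 16 features each

batch_norm = nn.BatchNorm1d(16)  # normalizes each feature across the batch
layer_norm = nn.LayerNorm(16)    # normalizes each example across its features

print(batch_norm(x).shape)       # torch.Size([8, 16])
print(layer_norm(x).shape)       # torch.Size([8, 16])
# LayerNorm gives the same result regardless of batch size,
# which is why it is preferred for variable-length sequence models.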
Weight decay shrinks the weights by a small factor at every update, which discourages large weight values; for plain SGD it is equivalent to adding an L2 penalty to the loss.
# Advanced Regularization Techniques
import torch
import torch.nn as nn

class RegularizedNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # Layers with batch normalization and dropout
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.bn1 = nn.BatchNorm1d(hidden_size)
        self.dropout1 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.bn2 = nn.BatchNorm1d(hidden_size)
        self.dropout2 = nn.Dropout(0.5)
        self.fc3 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)
        x = torch.relu(x)
        x = self.dropout1(x)
        x = self.fc2(x)
        x = self.bn2(x)
        x = torch.relu(x)
        x = self.dropout2(x)
        x = self.fc3(x)
        return x

# Training with weight decay
model = RegularizedNetwork(10, 20, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)
# Weight decay adds an L2-style penalty proportional to the sum of squared weights
Regularization Hyperparameter Tuning
Tune regularization strength on validation data: start from default values, increase the strength if the model overfits, decrease it if the model underfits, and search with grid search or random search.
For L1/L2 regularization, test alpha values from about 0.0001 to 10 on a logarithmic scale (see the sketch below). For dropout, test rates between 0.1 and 0.7, using higher rates for larger networks and lower rates for smaller ones.
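A minimal sketch of building a logarithmically spaced alpha grid with NumPy; the range and number of points are illustrative.

# Logarithmically spaced alpha values from 1e-4 to 10
import numpy as np

alphas = np.logspace(-4, 1, num=6)  # [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]
print(alphas)
# This array can serve as the 'alpha' grid in GridSearchCV, as in the example further below.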
Cross-validation helps find good values: split the data into folds, evaluate each candidate value on every fold, average the results, and choose the value with the best validation performance.
# Regularization Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge, Lasso
from sklearn.neural_network import MLPClassifier
import numpy as np

# Generate data: a continuous target for regression, a binary target for classification
X = np.random.randn(1000, 20)
y_reg = X @ np.random.randn(20) + np.random.randn(1000) * 0.5
y_clf = np.random.randint(0, 2, 1000)

# Tune Ridge regression
ridge_params = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
ridge_grid = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='neg_mean_squared_error')
ridge_grid.fit(X, y_reg)
print("Best Ridge alpha: " + str(ridge_grid.best_params_['alpha']))

# Tune Lasso regression
lasso_params = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
lasso_grid = GridSearchCV(Lasso(max_iter=10000), lasso_params, cv=5, scoring='neg_mean_squared_error')
lasso_grid.fit(X, y_reg)
print("Best Lasso alpha: " + str(lasso_grid.best_params_['alpha']))

# Tune an MLP classifier (its alpha parameter is an L2 penalty)
mlp_params = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50)],
    'alpha': [0.0001, 0.001, 0.01],        # L2 regularization strength
    'learning_rate_init': [0.001, 0.01]
}
mlp_grid = GridSearchCV(MLPClassifier(max_iter=1000), mlp_params, cv=3, scoring='accuracy')
mlp_grid.fit(X, y_clf)
print("Best MLP params: " + str(mlp_grid.best_params_))
# Dropout Implementation
import numpy as np

class Dropout:
    def __init__(self, rate=0.5):
        self.rate = rate
        self.mask = None

    def forward(self, x, training=True):
        if training:
            # Inverted dropout: scale the kept activations by 1 / (1 - rate)
            self.mask = np.random.binomial(1, 1 - self.rate, x.shape) / (1 - self.rate)
            return x * self.mask
        return x

    def backward(self, grad):
        return grad * self.mask

# Example
dropout = Dropout(rate=0.5)
x = np.array([[1.0, 2.0, 3.0, 4.0]])

# Training mode
x_dropped = dropout.forward(x, training=True)
print("With dropout: " + str(x_dropped))

# Inference mode
x_inference = dropout.forward(x, training=False)
print("Without dropout: " + str(x_inference))
Dropout is effective for neural networks: it is simple to implement, combines well with other techniques, and significantly reduces overfitting.
The diagram shows dropout: some neurons are disabled during training, while all neurons are active during inference.
Cross-Validation
Cross-validation evaluates model performance robustly: the data is split into k folds, the model is trained on k - 1 folds and tested on the remaining one, the process repeats so every fold serves as the test set once, and the results are averaged.
K-fold cross-validation typically uses k = 5 or k = 10. Stratified cross-validation preserves the class distribution within each fold, which makes it well suited to imbalanced classification data (see the stratified example after the code below).
# Cross-Validation
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.random.randn(100, 10)
y = np.random.randn(100)

model = LinearRegression()
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')

print("Cross-validation scores: " + str(-scores))
print("Mean score: " + str(-scores.mean()))
print("Std score: " + str(scores.std()))
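For imbalanced classification data, a minimal sketch using StratifiedKFold, which keeps class proportions roughly equal in every fold; the synthetic data and choice of classifier here are illustrative.

# Stratified cross-validation for an imbalanced binary problem
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.random.randn(200, 10)
y = (np.random.rand(200) < 0.1).astype(int)   # roughly 10% positive class

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf, scoring='accuracy')
print("Stratified CV accuracy per fold: " + str(scores))
print("Mean accuracy: " + str(scores.mean()))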
Cross-validation provides robust performance estimates: every example is used for both training and testing, the variance of the estimate is reduced, and overfitting becomes easier to detect.
The diagram shows 5-fold cross-validation: the data splits into five folds, each fold serves as the test set once, and the results are averaged across folds.
Early Stopping
Early stopping monitors validation performance and halts training when it stops improving, preventing overfitting automatically and saving computation.
The procedure tracks the validation loss, stops once it has stopped improving, and keeps the weights of the best model seen so far. A patience parameter controls how many epochs to wait without improvement before stopping; larger patience waits longer.
# Early Stopping
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def check(self, val_loss, model_weights):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_weights = model_weights
            self.counter = 0
            return False
        else:
            self.counter += 1
            return self.counter >= self.patience

# Example usage with a simulated loss curve
early_stopping = EarlyStopping(patience=5)
for epoch in range(100):
    train_loss = 0.5 * (1 - epoch / 100)  # keeps decreasing
    val_loss = 0.5 * (1 - epoch / 50) if epoch < 50 else 0.5 + (epoch - 50) / 100  # rises after epoch 50
    if early_stopping.check(val_loss, None):
        print(f"Early stopping at epoch {epoch}")
        break
Early stopping is simple and effective: it prevents overfitting automatically, keeps the best model, and reduces training time.
The diagram shows early stopping: training loss keeps decreasing, validation loss decreases and then rises, and training stops once validation performance degrades.
Data Augmentation
Data augmentation creates additional training examples by applying transformations to existing data, increasing dataset diversity; it acts as a form of regularization and improves generalization.
Common augmentations include rotation, flipping, scaling, and added noise. For images, use geometric and color transformations; for text, paraphrasing and synonym replacement; for audio, time stretching and pitch shifting. An image example and a small text sketch follow.
# Data Augmentation
import numpy as np
from scipy.ndimage import rotate, zoom

def augment_image(image, angle_range=15, zoom_range=0.1):
    # Random rotation in the image plane (channel axis is left untouched)
    angle = np.random.uniform(-angle_range, angle_range)
    rotated = rotate(image, angle, axes=(1, 0), reshape=False)

    # Random zoom on the spatial axes only, not the channel axis
    zoom_factor = 1 + np.random.uniform(-zoom_range, zoom_range)
    zoomed = zoom(rotated, (zoom_factor, zoom_factor, 1))

    # Random noise
    noise = np.random.normal(0, 0.01, zoomed.shape)
    noisy = zoomed + noise
    return np.clip(noisy, 0, 1)

# Example
image = np.random.rand(32, 32, 3)
augmented = augment_image(image)
print("Original shape: " + str(image.shape))
print("Augmented shape: " + str(augmented.shape))  # spatial size changes slightly with the zoom
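As a minimal sketch of text augmentation by synonym replacement, assuming a small hand-written synonym table for illustration (a real pipeline would use a thesaurus or a paraphrasing model):

# Simple synonym-replacement augmentation for text
import random

# Hypothetical synonym table, for illustration only
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}

def augment_text(sentence, p=0.5):
    words = sentence.split()
    out = []
    for w in words:
        # Replace a word with a random synonym with probability p
        if w.lower() in SYNONYMS and random.random() < p:
            out.append(random.choice(SYNONYMS[w.lower()]))
        else:
            out.append(w)
    return " ".join(out)

print(augment_text("the quick big dog looks happy"))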
Data augmentation is a powerful regularizer: it increases the effective dataset size, improves robustness, and is especially valuable when data is limited.
Summary
Regularization prevents overfitting, and the bias-variance tradeoff guides how much of it to apply. L1 regularization encourages sparsity, while L2 shrinks weights. Dropout disables neurons at random, cross-validation evaluates performance robustly, early stopping halts training before overfitting sets in, and data augmentation increases dataset diversity. Combining these techniques provides strong regularization.