Regularization Overview
Regularization prevents overfitting by constraining model complexity, which improves generalization and helps balance bias against variance. Different techniques suit different kinds of models.
Overfitting occurs when a model memorizes the training data: it performs well on the examples it has seen but poorly on new data. Regularization pushes the model toward general patterns instead.
The diagram illustrates overfitting: training accuracy is high while test accuracy is low, because the model has memorized training patterns that do not transfer to new data.
Bias-Variance Tradeoff
Bias measures systematic error from overly simple assumptions; high bias leads to underfitting, where the model misses important patterns. Variance measures sensitivity to the particular training set; high variance leads to overfitting, where the model learns noise.
The tradeoff is governed by model complexity: simple models tend to have high bias and low variance, while complex models have low bias and high variance. A well-chosen model balances the two.
# Bias-Variance Tradeoff
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Generate noisy linear data
X = np.linspace(0, 10, 20).reshape(-1, 1)
y = 2 * X.flatten() + np.random.randn(20) * 2

# Simple model (high bias, low variance)
model_simple = LinearRegression()
model_simple.fit(X, y)
y_pred_simple = model_simple.predict(X)
mse_simple = mean_squared_error(y, y_pred_simple)

# Complex model (low bias, high variance)
poly = PolynomialFeatures(degree=15)
X_poly = poly.fit_transform(X)
model_complex = LinearRegression()
model_complex.fit(X_poly, y)
y_pred_complex = model_complex.predict(X_poly)
mse_complex = mean_squared_error(y, y_pred_complex)

print("Simple model MSE: " + str(mse_simple))
print("Complex model MSE: " + str(mse_complex))
# The complex model reaches a lower training MSE, but the simpler model,
# with its higher bias and lower variance, generalizes better to new data.
Understanding the bias-variance tradeoff guides how much regularization to apply: models suffering from high bias need less, while models suffering from high variance need more.
The diagram shows the tradeoff: simple models sit in the high-bias region, complex models in the high-variance region, and the optimal model balances both.
L1 and L2 Regularization
L1 regularization penalizes the absolute values of the weights, which encourages sparsity and effectively performs feature selection. L2 regularization penalizes the squared weights, which shrinks all weights and discourages large values.
The L1 cost is J = loss + λΣ|w|, where λ controls the strength; a larger λ drives more weights exactly to zero. The L2 cost is J = loss + λΣw², which shrinks weights toward zero but rarely eliminates features entirely.
# L1 and L2 Regularization
from sklearn.linear_model import Lasso, Ridge
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([6, 15, 24])

# L1 regularization (Lasso)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Lasso weights: " + str(lasso.coef_))
# Some weights become exactly zero

# L2 regularization (Ridge)
ridge = Ridge(alpha=0.1)
ridge.fit(X, y)
print("Ridge weights: " + str(ridge.coef_))
# All weights shrink but remain non-zero
Choose the penalty based on your goal: use L1 for feature selection, L2 for smooth weight shrinkage, or both together (Elastic Net) for a combination of the two, as in the short sketch below.
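As a minimal sketch of the combined penalty, scikit-learn's ElasticNet exposes an overall strength (alpha) and a mix between L1 and L2 (l1_ratio); the values below are illustrative, not tuned settings.

# Elastic Net: combined L1 and L2 penalties
from sklearn.linear_model import ElasticNet
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([6, 15, 24])

# l1_ratio=0.5 weights the L1 and L2 terms equally;
# l1_ratio=1.0 recovers Lasso, l1_ratio=0.0 recovers Ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print("Elastic Net weights: " + str(enet.coef_))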
The diagram compares L1 and L2: L1 produces sparse solutions, while L2 produces smooth ones.
Dropout
Dropout randomly disables neurons during training, which prevents co-adaptation, forces the network to learn redundant representations, and reduces overfitting in neural networks.
The dropout rate p is the probability of disabling a neuron; common values are 0.2 to 0.5, and higher rates regularize more strongly. During inference all neurons stay active, and activations are rescaled by the keep probability (1 - p), or equivalently by 1/(1 - p) during training in the "inverted" formulation, so the expected output magnitude matches training.
Detailed Dropout Mechanisms
Dropout prevents co-adaptation between neurons: because random neurons are disabled during training, the network must learn redundant representations, no single neuron becomes critical, and the model becomes more robust.
Mathematical formulation: during training the output is y = f(Wx + b) ⊙ m, where each element of m is drawn from Bernoulli(1 - p). At inference the output is y = f(Wx + b) × (1 - p); scaling by the keep probability (1 - p) preserves the expected output magnitude. In practice most implementations use inverted dropout, dividing by (1 - p) during training instead so that no scaling is needed at inference, as in the code below.
Spatial dropout drops entire feature maps rather than individual activations, which suits convolutional layers: nearby pixels are strongly correlated, so dropping whole channels prevents spatial co-adaptation more effectively than element-wise dropout.
# Detailed Dropout Implementation
import numpy as np
import torch
import torch.nn as nn

class DropoutDetailed:
    def __init__(self, p=0.5):
        self.p = p
        self.mask = None
        self.training = True

    def forward(self, x):
        if self.training:
            # Generate inverted-dropout mask: keep with probability (1 - p), scale by 1 / (1 - p)
            self.mask = (np.random.rand(*x.shape) > self.p) / (1 - self.p)
            return x * self.mask
        else:
            # During inference, no dropout and no rescaling needed
            return x

    def backward(self, grad_output):
        if self.training:
            return grad_output * self.mask
        return grad_output

# PyTorch dropout comparison
class DropoutComparison:
    def __init__(self):
        self.dropout1 = nn.Dropout(p=0.5)       # Standard dropout
        self.dropout2 = nn.Dropout2d(p=0.5)     # Spatial dropout for 2D feature maps
        self.dropout3 = nn.AlphaDropout(p=0.5)  # Alpha dropout for SELU networks

    def compare(self, x):
        # Standard dropout
        out1 = self.dropout1(x)
        # Spatial dropout (requires 4D input: batch, channels, height, width)
        if len(x.shape) == 4:
            out2 = self.dropout2(x)
        else:
            out2 = None
        # Alpha dropout (maintains self-normalizing properties)
        out3 = self.dropout3(x)
        return out1, out2, out3

# Example
x = torch.randn(32, 100)  # batch_size=32, features=100
comparison = DropoutComparison()
out1, out2, out3 = comparison.compare(x)
print("Standard dropout output shape: " + str(out1.shape))
print("Alpha dropout output shape: " + str(out3.shape))
Advanced Regularization Techniques
Batch normalization normalizes each layer's inputs over the mini-batch, reducing internal covariate shift, enabling higher learning rates, and improving training stability; the noise from batch statistics also acts as a mild regularizer.
Layer normalization normalizes across the feature dimension of each individual example, so it does not depend on batch statistics; this makes it well suited to variable-length sequences, RNNs, and transformers. A minimal comparison appears below.
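A small sketch contrasting the two normalizations with PyTorch's nn.BatchNorm1d and nn.LayerNorm; the tensor shapes here are illustrative only.

# Batch norm vs. layer norm on a (batch, features) tensor
import torch
import torch.nn as nn

x = torch.randn(8, 16)           # batch of 8 examples, 16 features each

batch_norm = nn.BatchNorm1d(16)  # normalizes each feature across the batch
layer_norm = nn.LayerNorm(16)    # normalizes each example across its features

print(batch_norm(x).shape)       # torch.Size([8, 16])
print(layer_norm(x).shape)       # torch.Size([8, 16])
# LayerNorm gives the same result regardless of batch size,
# which is why it is preferred for variable-length sequence models.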
Weight decay shrinks the weights by a small factor at every update, which discourages large weight values; for plain SGD it is equivalent to adding an L2 penalty to the loss.
# Advanced Regularization Techniques
import torch
import torch.nn as nn

class RegularizedNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # Layers with batch normalization and dropout
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.bn1 = nn.BatchNorm1d(hidden_size)
        self.dropout1 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.bn2 = nn.BatchNorm1d(hidden_size)
        self.dropout2 = nn.Dropout(0.5)
        self.fc3 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)
        x = torch.relu(x)
        x = self.dropout1(x)
        x = self.fc2(x)
        x = self.bn2(x)
        x = torch.relu(x)
        x = self.dropout2(x)
        x = self.fc3(x)
        return x

# Training with weight decay
model = RegularizedNetwork(10, 20, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)
# Weight decay adds an L2-style penalty proportional to the sum of squared weights
Regularization Hyperparameter Tuning
Tune regularization strength on validation data: start from default values, increase the strength if the model overfits, decrease it if the model underfits, and search with grid search or random search.
For L1/L2 regularization, test alpha values from about 0.0001 to 10 on a logarithmic scale (see the sketch below). For dropout, test rates between 0.1 and 0.7, using higher rates for larger networks and lower rates for smaller ones.
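A minimal sketch of building a logarithmically spaced alpha grid with NumPy; the range and number of points are illustrative.

# Logarithmically spaced alpha values from 1e-4 to 10
import numpy as np

alphas = np.logspace(-4, 1, num=6)  # [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]
print(alphas)
# This array can serve as the 'alpha' grid in GridSearchCV, as in the example further below.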
Cross-validation helps find good values: split the data into folds, evaluate each candidate value on every fold, average the results, and choose the value with the best validation performance.
# Regularization Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge, Lasso
from sklearn.neural_network import MLPClassifier
import numpy as np

# Generate data: a continuous target for regression, a binary target for classification
X = np.random.randn(1000, 20)
y_reg = X @ np.random.randn(20) + np.random.randn(1000) * 0.5
y_clf = np.random.randint(0, 2, 1000)

# Tune Ridge regression
ridge_params = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
ridge_grid = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='neg_mean_squared_error')
ridge_grid.fit(X, y_reg)
print("Best Ridge alpha: " + str(ridge_grid.best_params_['alpha']))

# Tune Lasso regression
lasso_params = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
lasso_grid = GridSearchCV(Lasso(max_iter=10000), lasso_params, cv=5, scoring='neg_mean_squared_error')
lasso_grid.fit(X, y_reg)
print("Best Lasso alpha: " + str(lasso_grid.best_params_['alpha']))

# Tune an MLP classifier (its alpha parameter is an L2 penalty)
mlp_params = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50)],
    'alpha': [0.0001, 0.001, 0.01],        # L2 regularization strength
    'learning_rate_init': [0.001, 0.01]
}
mlp_grid = GridSearchCV(MLPClassifier(max_iter=1000), mlp_params, cv=3, scoring='accuracy')
mlp_grid.fit(X, y_clf)
print("Best MLP params: " + str(mlp_grid.best_params_))
# Dropout Implementation
import numpy as np

class Dropout:
    def __init__(self, rate=0.5):
        self.rate = rate
        self.mask = None

    def forward(self, x, training=True):
        if training:
            # Inverted dropout: scale the kept activations by 1 / (1 - rate)
            self.mask = np.random.binomial(1, 1 - self.rate, x.shape) / (1 - self.rate)
            return x * self.mask
        return x

    def backward(self, grad):
        return grad * self.mask

# Example
dropout = Dropout(rate=0.5)
x = np.array([[1.0, 2.0, 3.0, 4.0]])

# Training mode
x_dropped = dropout.forward(x, training=True)
print("With dropout: " + str(x_dropped))

# Inference mode
x_inference = dropout.forward(x, training=False)
print("Without dropout: " + str(x_inference))
Dropout is effective for neural networks: it is simple to implement, combines well with other techniques, and significantly reduces overfitting.
The diagram shows dropout: some neurons are disabled during training, while all neurons are active during inference.
Cross-Validation
Cross-validation evaluates model performance robustly: the data is split into k folds, the model is trained on k - 1 folds and tested on the remaining one, the process repeats so every fold serves as the test set once, and the results are averaged.
K-fold cross-validation typically uses k = 5 or k = 10. Stratified cross-validation preserves the class distribution within each fold, which makes it well suited to imbalanced classification data (see the stratified example after the code below).
# Cross-Validation
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.random.randn(100, 10)
y = np.random.randn(100)

model = LinearRegression()
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')

print("Cross-validation scores: " + str(-scores))
print("Mean score: " + str(-scores.mean()))
print("Std score: " + str(scores.std()))
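For imbalanced classification data, a minimal sketch using StratifiedKFold, which keeps class proportions roughly equal in every fold; the synthetic data and choice of classifier here are illustrative.

# Stratified cross-validation for an imbalanced binary problem
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.random.randn(200, 10)
y = (np.random.rand(200) < 0.1).astype(int)   # roughly 10% positive class

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf, scoring='accuracy')
print("Stratified CV accuracy per fold: " + str(scores))
print("Mean accuracy: " + str(scores.mean()))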
Cross-validation provides robust performance estimates: every example is used for both training and testing, the variance of the estimate is reduced, and overfitting becomes easier to detect.
The diagram shows 5-fold cross-validation: the data splits into five folds, each fold serves as the test set once, and the results are averaged across folds.
Early Stopping
Early stopping monitors validation performance and halts training when it stops improving, preventing overfitting automatically and saving computation.
The procedure tracks the validation loss, stops once it has stopped improving, and keeps the weights of the best model seen so far. A patience parameter controls how many epochs to wait without improvement before stopping; larger patience waits longer.
# Early Stopping
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def check(self, val_loss, model_weights):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_weights = model_weights
            self.counter = 0
            return False
        else:
            self.counter += 1
            return self.counter >= self.patience

# Example usage with a simulated loss curve
early_stopping = EarlyStopping(patience=5)
for epoch in range(100):
    train_loss = 0.5 * (1 - epoch / 100)  # keeps decreasing
    val_loss = 0.5 * (1 - epoch / 50) if epoch < 50 else 0.5 + (epoch - 50) / 100  # rises after epoch 50
    if early_stopping.check(val_loss, None):
        print(f"Early stopping at epoch {epoch}")
        break
Early stopping is simple and effective: it prevents overfitting automatically, keeps the best model, and reduces training time.
The diagram shows early stopping: training loss keeps decreasing, validation loss decreases and then rises, and training stops once validation performance degrades.
Data Augmentation
Data augmentation creates additional training examples by applying transformations to existing data, increasing dataset diversity; it acts as a form of regularization and improves generalization.
Common augmentations include rotation, flipping, scaling, and added noise. For images, use geometric and color transformations; for text, paraphrasing and synonym replacement; for audio, time stretching and pitch shifting. An image example and a small text sketch follow.
# Data Augmentation
import numpy as np
from scipy.ndimage import rotate, zoom

def augment_image(image, angle_range=15, zoom_range=0.1):
    # Random rotation in the image plane (channel axis is left untouched)
    angle = np.random.uniform(-angle_range, angle_range)
    rotated = rotate(image, angle, axes=(1, 0), reshape=False)

    # Random zoom on the spatial axes only, not the channel axis
    zoom_factor = 1 + np.random.uniform(-zoom_range, zoom_range)
    zoomed = zoom(rotated, (zoom_factor, zoom_factor, 1))

    # Random noise
    noise = np.random.normal(0, 0.01, zoomed.shape)
    noisy = zoomed + noise
    return np.clip(noisy, 0, 1)

# Example
image = np.random.rand(32, 32, 3)
augmented = augment_image(image)
print("Original shape: " + str(image.shape))
print("Augmented shape: " + str(augmented.shape))  # spatial size changes slightly with the zoom
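As a minimal sketch of text augmentation by synonym replacement, assuming a small hand-written synonym table for illustration (a real pipeline would use a thesaurus or a paraphrasing model):

# Simple synonym-replacement augmentation for text
import random

# Hypothetical synonym table, for illustration only
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}

def augment_text(sentence, p=0.5):
    words = sentence.split()
    out = []
    for w in words:
        # Replace a word with a random synonym with probability p
        if w.lower() in SYNONYMS and random.random() < p:
            out.append(random.choice(SYNONYMS[w.lower()]))
        else:
            out.append(w)
    return " ".join(out)

print(augment_text("the quick big dog looks happy"))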
Data augmentation is a powerful regularizer: it increases the effective dataset size, improves robustness, and is especially valuable when data is limited.
Summary
Regularization prevents overfitting, and the bias-variance tradeoff guides how much of it to apply. L1 regularization encourages sparsity, while L2 shrinks weights. Dropout disables neurons at random, cross-validation evaluates performance robustly, early stopping halts training before overfitting sets in, and data augmentation increases dataset diversity. Combining these techniques provides strong regularization.