Beginner | Tutorial 5

Training Neural Networks

NeuronDB Team
2/24/2025
30 min read

Training Overview

Training adjusts network weights to minimize errors. It uses loss functions to measure errors. It uses backpropagation to compute gradients. It uses optimizers to update weights. Training continues until convergence or maximum iterations.

The training loop processes data in batches. Each batch updates weights once. Multiple epochs process all data multiple times. Early stopping prevents overfitting. Learning rate schedules adjust step sizes.
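
As a sketch of how these pieces fit together, the loop below uses hypothetical model, loss_fn, and optimizer-style interfaces; model.forward, model.backward, model.update, and loss_fn stand in for the components built in the sections that follow.

# Training Loop Sketch (hypothetical interfaces, defined later in this tutorial)
import numpy as np
def train_loop(model, loss_fn, X, y, epochs=100, batch_size=32, patience=5):
    best_loss = np.inf
    stale_epochs = 0
    for epoch in range(epochs):
        for start in range(0, X.shape[0], batch_size):
            X_batch = X[start:start + batch_size]
            y_batch = y[start:start + batch_size]
            model.forward(X_batch)                    # forward pass
            grad_w, grad_b = model.backward(y_batch)  # backpropagation
            model.update(grad_w, grad_b)              # optimizer step
        epoch_loss = loss_fn(model.forward(X), y)
        # Early stopping: halt when the loss stops improving
        if epoch_loss < best_loss:
            best_loss, stale_epochs = epoch_loss, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break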

Figure: Training Process

The diagram shows training workflow. Data flows through forward pass. Loss computes errors. Backpropagation computes gradients. Optimizer updates weights. Process repeats until convergence.

Loss Functions

Loss functions measure prediction errors. They guide weight updates. Different problems use different losses. Regression uses MSE or MAE. Classification uses cross-entropy.

Mean squared error is MSE = (1/n) Σ(y_pred - y_true)². It emphasizes large errors. It works well for regression. Mean absolute error is MAE = (1/n) Σ|y_pred - y_true|. It treats all errors equally. It is robust to outliers.

Cross-entropy loss is CE = -Σ y_true × log(y_pred). It measures probability differences. It works well for classification. It penalizes confident wrong predictions.

# Loss Functions
import numpy as np

def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true)**2)

def mae_loss(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))

def cross_entropy_loss(y_pred, y_true):
    # Binary cross-entropy; clip predictions to avoid log(0)
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Example
y_pred_reg = np.array([100, 200, 300])
y_true_reg = np.array([110, 190, 310])
print("MSE: " + str(mse_loss(y_pred_reg, y_true_reg)))
print("MAE: " + str(mae_loss(y_pred_reg, y_true_reg)))
y_pred_clf = np.array([0.1, 0.9, 0.8])
y_true_clf = np.array([0, 1, 1])
print("Cross-entropy: " + str(cross_entropy_loss(y_pred_clf, y_true_clf)))
# Result:
# MSE: 100.0
# MAE: 10.0
# Cross-entropy: 0.1446

Choose loss functions matching your problem. Regression problems use MSE or MAE. Classification problems use cross-entropy. Custom losses can encode domain knowledge.
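
As one illustration, the sketch below is a hypothetical custom loss (not part of the tutorial code) that encodes a domain rule: under-predicting is twice as costly as over-predicting, as might hold in demand forecasting.

# Custom Loss Sketch (hypothetical asymmetric MSE)
import numpy as np
def asymmetric_mse(y_pred, y_true, under_weight=2.0):
    errors = y_pred - y_true
    # Negative error means under-prediction; weight it more heavily
    weights = np.where(errors < 0, under_weight, 1.0)
    return np.mean(weights * errors**2)
print(asymmetric_mse(np.array([90.0, 110.0]), np.array([100.0, 100.0])))
# Under-predicting by 10 counts double compared to over-predicting by 10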

Figure: Loss Functions

The diagram compares loss functions. MSE is quadratic. MAE is linear. Cross-entropy is logarithmic.

Backpropagation

Backpropagation computes gradients efficiently. It uses chain rule to propagate errors backward. It computes gradients for all weights in one pass. It enables training deep networks.

The process starts at output layer. It computes output error. It propagates error backward through layers. Each layer computes its gradient. Gradients accumulate using chain rule.

# Backpropagation
import numpy as np

def backward_propagation(y_pred, y_true, activations, weights, activation_derivative):
    m = y_true.shape[0]
    gradients_w = []
    gradients_b = []
    # Output layer error (assumes a sigmoid/softmax output with cross-entropy,
    # where the derivative of the loss w.r.t. pre-activation is y_pred - y_true)
    error = y_pred - y_true
    gradients_w.append(np.dot(activations[-2].T, error) / m)
    gradients_b.append(np.sum(error, axis=0, keepdims=True) / m)
    # Backpropagate through hidden layers
    for i in range(len(weights) - 2, -1, -1):
        error = np.dot(error, weights[i+1].T) * activation_derivative(activations[i+1])
        gradients_w.insert(0, np.dot(activations[i].T, error) / m)
        gradients_b.insert(0, np.sum(error, axis=0, keepdims=True) / m)
    return gradients_w, gradients_b

# Example usage: a batch of 3 samples through a 2-2-1 network
y_pred = np.array([[0.9], [0.1], [0.8]])
y_true = np.array([[1.0], [0.0], [1.0]])
activations = [np.array([[1, 2], [0, 1], [2, 1]]),
               np.array([[0.5, 0.7], [0.3, 0.6], [0.8, 0.4]]),
               np.array([[0.9], [0.1], [0.8]])]
weights = [np.array([[0.5, 0.3], [0.2, 0.4]]), np.array([[0.1], [0.6]])]

def relu_derivative(x):
    return (x > 0).astype(float)

grad_w, grad_b = backward_propagation(y_pred, y_true, activations, weights, relu_derivative)
print("Gradient shapes: " + str([g.shape for g in grad_w]))

Backpropagation is the core of neural network training. It enables efficient gradient computation. It makes deep learning practical.
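
A common way to verify a backpropagation implementation is numerical gradient checking: compare the analytical gradients against finite differences. The sketch below assumes a generic scalar loss f of a weight array; it is a verification aid, not part of the training loop.

# Numerical Gradient Check (sketch)
import numpy as np
def numerical_gradient(f, w, eps=1e-5):
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus = w.copy()
        w_plus.flat[i] += eps
        w_minus = w.copy()
        w_minus.flat[i] -= eps
        # Central difference approximates the partial derivative
        grad.flat[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad
# Example: for f(w) = sum(w^2) the analytical gradient is 2w
w = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda w: np.sum(w**2), w))
# Result: approximately [2.0, -4.0, 6.0]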

Detailed Backpropagation Mathematics

Backpropagation uses chain rule from calculus. For output layer, error is δᴸ = ∇ₐC ⊙ σ'(zᴸ). C is cost function. σ' is activation derivative. ⊙ is element-wise multiplication.

For hidden layer l, the error propagates backward: δˡ = ((wˡ⁺¹)ᵀ δˡ⁺¹) ⊙ σ'(zˡ). The transposed weights of the next layer carry the next layer's error back, and the result is scaled element-wise by the activation derivative of the current layer.

The gradient for weight wˡᵢⱼ is ∂C/∂wˡᵢⱼ = aˡ⁻¹ⱼ × δˡᵢ: the activation from the previous layer times the error of the current layer. The gradient for bias bˡᵢ is ∂C/∂bˡᵢ = δˡᵢ; the bias gradient equals the error itself.

# Detailed Backpropagation Implementation
import numpy as np

class NeuralNetworkDetailed:
    def __init__(self, layers, learning_rate=0.01):
        self.layers = layers
        self.lr = learning_rate
        self.weights = []
        self.biases = []
        self.activations = []
        self.z_values = []
        # Initialize weights and biases
        for i in range(len(layers) - 1):
            w = np.random.randn(layers[i+1], layers[i]) * 0.1
            b = np.zeros((layers[i+1], 1))
            self.weights.append(w)
            self.biases.append(b)

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, z):
        s = self.sigmoid(z)
        return s * (1 - s)

    def forward(self, X):
        self.activations = [X.T]
        self.z_values = []
        for i in range(len(self.weights)):
            z = self.weights[i] @ self.activations[-1] + self.biases[i]
            self.z_values.append(z)
            a = self.sigmoid(z)
            self.activations.append(a)
        return self.activations[-1]

    def backward(self, y):
        m = y.shape[0]
        y = y.reshape(-1, 1).T
        # Output layer error
        delta = (self.activations[-1] - y) * self.sigmoid_derivative(self.z_values[-1])
        gradients_w = []
        gradients_b = []
        # Backpropagate through layers
        for i in range(len(self.weights) - 1, -1, -1):
            # Gradient for weights
            grad_w = (1/m) * delta @ self.activations[i].T
            gradients_w.insert(0, grad_w)
            # Gradient for biases
            grad_b = (1/m) * np.sum(delta, axis=1, keepdims=True)
            gradients_b.insert(0, grad_b)
            # Propagate error backward
            if i > 0:
                delta = (self.weights[i].T @ delta) * self.sigmoid_derivative(self.z_values[i-1])
        return gradients_w, gradients_b

    def update(self, gradients_w, gradients_b):
        for i in range(len(self.weights)):
            self.weights[i] -= self.lr * gradients_w[i]
            self.biases[i] -= self.lr * gradients_b[i]

    def train(self, X, y, epochs=1000):
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)
            # Backward pass
            grad_w, grad_b = self.backward(y)
            # Update weights
            self.update(grad_w, grad_b)
            if epoch % 100 == 0:
                cost = np.mean((output - y.reshape(-1, 1).T)**2) / 2
                print(f"Epoch {epoch}, Cost: {cost:.4f}")

# Example
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR problem
nn = NeuralNetworkDetailed([2, 4, 1], learning_rate=0.5)
nn.train(X, y, epochs=1000)
predictions = nn.forward(X)
print("Predictions: " + str(predictions.T.flatten()))

Gradient Vanishing and Exploding

Gradient vanishing occurs when gradients become very small. It happens in deep networks. It prevents early layers from learning. It occurs with sigmoid and tanh activations. Their derivatives are bounded and small.

Gradient exploding occurs when gradients become very large. It causes unstable training. It happens with large weights. It causes NaN values. It occurs in recurrent networks.

Solutions include proper weight initialization. Xavier initialization uses variance 1/n, where n is the number of inputs (fan-in) to the layer; He initialization uses variance 2/n. Batch normalization normalizes activations. Residual connections provide gradient highways. Gradient clipping limits gradient magnitude.
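
A minimal sketch of the two initializations mentioned above:

# Weight Initialization Sketch (Xavier for sigmoid/tanh, He for ReLU)
import numpy as np
def xavier_init(n_in, n_out):
    # Variance 1/n_in keeps activation variance roughly constant across layers
    return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)
def he_init(n_in, n_out):
    # Variance 2/n_in compensates for ReLU zeroing half its inputs
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
print("Xavier std: " + str(xavier_init(100, 50).std()))
print("He std: " + str(he_init(100, 50).std()))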

# Gradient Vanishing Demonstration
import numpy as np
import matplotlib.pyplot as plt

def demonstrate_gradient_vanishing():
    # Simulate deep network
    depth = 10
    weights = [np.random.randn(10, 10) * 0.5 for _ in range(depth)]
    # Initial gradient
    initial_grad = np.ones((10, 1))
    # Propagate through layers
    grad = initial_grad
    grad_magnitudes = [np.linalg.norm(grad)]
    for i in range(depth):
        # Simulate sigmoid derivative (0.25 is its maximum value)
        sigmoid_deriv = 0.25
        grad = weights[i].T @ grad * sigmoid_deriv
        grad_magnitudes.append(np.linalg.norm(grad))
    plt.plot(range(len(grad_magnitudes)), grad_magnitudes)
    plt.xlabel('Layer')
    plt.ylabel('Gradient Magnitude')
    plt.title('Gradient Vanishing in Deep Network')
    plt.yscale('log')
    plt.show()
    print("Initial gradient magnitude: " + str(grad_magnitudes[0]))
    print("Final gradient magnitude: " + str(grad_magnitudes[-1]))
    print("Reduction factor: " + str(grad_magnitudes[0] / grad_magnitudes[-1]))

demonstrate_gradient_vanishing()

Figure: Backpropagation

The diagram shows gradient flow. Errors propagate backward. Each layer computes its contribution. Gradients accumulate through chain rule.

Optimizers

Optimizers update weights using gradients. Different optimizers have different update rules. SGD uses simple gradient descent. Momentum adds velocity term. Adam combines momentum and adaptive learning rates.

Stochastic gradient descent is w = w - α × ∇w. The learning rate α controls step size. It is simple but can be slow. Momentum is v = βv + ∇w, w = w - αv. The velocity v accumulates past gradients, with β controlling how quickly old gradients decay. It accelerates convergence.

Adam adapts learning rates per parameter. It maintains momentum and variance estimates. It works well for most problems. It is the default choice for many applications.

# Optimizers
import numpy as np

class SGD:
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def update(self, weights, gradients):
        return [w - self.learning_rate * g for w, g in zip(weights, gradients)]

class Adam:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.m = None  # first-moment (mean) estimates
        self.v = None  # second-moment (variance) estimates
        self.t = 0

    def update(self, weights, gradients):
        if self.m is None:
            self.m = [np.zeros_like(g) for g in gradients]
            self.v = [np.zeros_like(g) for g in gradients]
        self.t += 1
        updated_weights = []
        for i, (w, g) in enumerate(zip(weights, gradients)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g**2
            # Bias correction for the zero-initialized moment estimates
            m_hat = self.m[i] / (1 - self.beta1**self.t)
            v_hat = self.v[i] / (1 - self.beta2**self.t)
            updated_weights.append(w - self.learning_rate * m_hat / (np.sqrt(v_hat) + 1e-8))
        return updated_weights
# Example
weights = [np.array([[0.5], [0.3]])]
gradients = [np.array([[0.1], [0.2]])]
sgd = SGD(learning_rate=0.01)
adam = Adam(learning_rate=0.001)
w_sgd = sgd.update(weights, gradients)
w_adam = adam.update(weights, gradients)
print("SGD update: " + str(w_sgd[0].flatten()))
print("Adam update: " + str(w_adam[0].flatten()))

Choose optimizers based on problem characteristics. SGD works for simple problems. Adam works for most problems. Experiment to find the best optimizer.

Figure: Optimizers

The diagram compares optimizer paths. SGD follows gradients directly. Momentum follows smoothed gradients. Adam adapts step sizes per parameter.

Learning Rate Schedules

Learning rate schedules adjust step sizes during training. Fixed rates can be too large or too small. Adaptive rates improve convergence. Common schedules include step decay, exponential decay, and cosine annealing.

Step decay reduces rate at fixed intervals. Exponential decay reduces rate continuously. Cosine annealing follows cosine curve. Warmup starts with small rates. It prevents early instability.

# Learning Rate Schedules
import numpy as np

def step_decay(epoch, initial_lr=0.01, drop=0.5, epochs_drop=10):
    return initial_lr * (drop ** (epoch // epochs_drop))

def exponential_decay(epoch, initial_lr=0.01, decay_rate=0.96):
    return initial_lr * (decay_rate ** epoch)

def cosine_annealing(epoch, initial_lr=0.01, max_epochs=100):
    return initial_lr * 0.5 * (1 + np.cos(np.pi * epoch / max_epochs))

# Example
for epoch in range(0, 50, 10):
    lr_step = step_decay(epoch)
    lr_exp = exponential_decay(epoch)
    lr_cos = cosine_annealing(epoch, max_epochs=50)
    print(f"Epoch {epoch}: Step={lr_step:.4f}, Exp={lr_exp:.4f}, Cos={lr_cos:.4f}")

Learning rate schedules improve training. They enable faster initial learning. They enable fine-tuning later. They prevent overshooting minima.

Figure: Learning Rate Schedules

The diagram shows different learning rate schedules. Fixed rate stays constant. Step decay reduces at intervals. Exponential decay reduces continuously. Each schedule has different convergence characteristics.

Detailed Learning Rate Schedule Strategies

Fixed learning rate uses constant value throughout training. It is simple to implement. It requires careful tuning. Too high causes instability. Too low causes slow convergence. It works for simple problems with stable loss landscapes.

Step decay reduces learning rate at fixed intervals. lr(epoch) = initial_lr × drop_factor^(epoch // drop_interval). Drop interval is typically 10-30 epochs. Drop factor is typically 0.1-0.5. It provides controlled reduction. It works well for many problems.

Exponential decay reduces learning rate continuously. lr(epoch) = initial_lr × decay_rate^epoch. Decay rate is typically 0.9-0.99. It provides smooth reduction. It requires tuning decay rate carefully. It works for problems needing gradual reduction.

Cosine annealing follows cosine curve. lr(epoch) = initial_lr × 0.5 × (1 + cos(π × epoch / max_epochs)). It starts high and ends low. It provides smooth transition. It works well for long training runs. It often improves final performance.

Warmup starts with small learning rate. It gradually increases to target rate. It prevents early instability. It helps with large batch training. Typical warmup is 5-10% of total epochs.

# Detailed Learning Rate Schedules
import numpy as np
import matplotlib.pyplot as plt

class LearningRateSchedules:
    def fixed(self, epoch, initial_lr=0.01):
        return initial_lr

    def step_decay(self, epoch, initial_lr=0.01, drop_factor=0.5, drop_interval=10):
        return initial_lr * (drop_factor ** (epoch // drop_interval))

    def exponential_decay(self, epoch, initial_lr=0.01, decay_rate=0.96):
        return initial_lr * (decay_rate ** epoch)

    def cosine_annealing(self, epoch, initial_lr=0.01, max_epochs=100):
        return initial_lr * 0.5 * (1 + np.cos(np.pi * epoch / max_epochs))

    def warmup_cosine(self, epoch, initial_lr=0.01, warmup_epochs=10, max_epochs=100):
        if epoch < warmup_epochs:
            # Linear warmup from 0 to the target rate
            return initial_lr * (epoch / warmup_epochs)
        else:
            cosine_epoch = epoch - warmup_epochs
            cosine_max = max_epochs - warmup_epochs
            return initial_lr * 0.5 * (1 + np.cos(np.pi * cosine_epoch / cosine_max))

    def polynomial_decay(self, epoch, initial_lr=0.01, max_epochs=100, power=0.9):
        return initial_lr * ((1 - epoch / max_epochs) ** power)

schedules = LearningRateSchedules()
epochs = np.arange(0, 100)
# Compare schedules
lr_fixed = [schedules.fixed(e) for e in epochs]
lr_step = [schedules.step_decay(e) for e in epochs]
lr_exp = [schedules.exponential_decay(e) for e in epochs]
lr_cosine = [schedules.cosine_annealing(e, max_epochs=100) for e in epochs]
lr_warmup = [schedules.warmup_cosine(e, max_epochs=100) for e in epochs]
plt.figure(figsize=(12, 6))
plt.plot(epochs, lr_fixed, label='Fixed')
plt.plot(epochs, lr_step, label='Step Decay')
plt.plot(epochs, lr_exp, label='Exponential')
plt.plot(epochs, lr_cosine, label='Cosine Annealing')
plt.plot(epochs, lr_warmup, label='Warmup + Cosine')
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules Comparison')
plt.legend()
plt.grid(True)
plt.show()
# Compare final learning rates
print("Final learning rates:")
print("Fixed: " + str(lr_fixed[-1]))
print("Step: " + str(lr_step[-1]))
print("Exponential: " + str(lr_exp[-1]))
print("Cosine: " + str(lr_cosine[-1]))
print("Warmup+Cosine: " + str(lr_warmup[-1]))

Learning Rate Finder

Learning rate finder identifies optimal learning rate range. It trains with exponentially increasing rates. It plots loss versus learning rate. Optimal range is where loss decreases fastest. It helps choose initial learning rate.

Process starts with very small learning rate. It increases exponentially each batch. It records loss for each rate. It stops when loss diverges. Plot shows loss curve. Steepest descent indicates good range.

# Learning Rate Finder
import numpy as np
import matplotlib.pyplot as plt

def find_learning_rate(model, X_train, y_train, start_lr=1e-7, end_lr=1, num_iterations=100):
    learning_rates = []
    losses = []
    # Save initial weights and biases so training state can be restored
    initial_weights = [w.copy() for w in model.weights]
    initial_biases = [b.copy() for b in model.biases]
    # Exponential range
    lr_mult = (end_lr / start_lr) ** (1 / num_iterations)
    current_lr = start_lr
    for i in range(num_iterations):
        # Set learning rate
        model.lr = current_lr
        # Forward and backward pass
        output = model.forward(X_train)
        grad_w, grad_b = model.backward(y_train)
        model.update(grad_w, grad_b)
        # Record
        loss = np.mean((output - y_train.reshape(-1, 1).T)**2) / 2
        learning_rates.append(current_lr)
        losses.append(loss)
        # Increase learning rate
        current_lr *= lr_mult
        # Stop if loss explodes
        if loss > 10 * losses[0] or np.isnan(loss):
            break
    # Restore initial weights and biases
    model.weights = initial_weights
    model.biases = initial_biases
    # Plot
    plt.figure(figsize=(10, 6))
    plt.semilogx(learning_rates, losses)
    plt.xlabel('Learning Rate')
    plt.ylabel('Loss')
    plt.title('Learning Rate Finder')
    plt.grid(True)
    plt.show()
    # Find steepest descent region
    loss_diff = np.diff(losses)
    steepest_idx = np.argmin(loss_diff)
    optimal_lr = learning_rates[steepest_idx]
    print("Suggested learning rate: " + str(optimal_lr))
    return optimal_lr, learning_rates, losses

# Example usage would be:
# optimal_lr, lrs, losses = find_learning_rate(nn, X, y)

Batch Processing

Batch processing groups examples for efficiency. Large batches provide stable gradients. Small batches provide frequent updates. Mini-batches balance stability and speed.

Batch size affects training dynamics. Large batches converge smoothly but slowly. Small batches converge quickly but noisily. Typical batch sizes are 32, 64, or 128.

# Batch Processing
import numpy as np

def create_batches(X, y, batch_size=32):
    n_samples = X.shape[0]
    batches = []
    for i in range(0, n_samples, batch_size):
        end_idx = min(i + batch_size, n_samples)
        batches.append((X[i:end_idx], y[i:end_idx]))
    return batches

# Example
X = np.random.randn(100, 10)
y = np.random.randn(100, 1)
batches = create_batches(X, y, batch_size=32)
print("Number of batches: " + str(len(batches)))
print("Batch sizes: " + str([b[0].shape[0] for b in batches]))
# Result:
# Number of batches: 4
# Batch sizes: [32, 32, 32, 4]

Batch processing enables efficient training. It uses parallel computation. It provides stable gradient estimates. It is essential for large datasets.
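
One caveat: create_batches above slices the data in its original order, so every epoch sees identical batches. In practice the indices are usually shuffled each epoch; a minimal sketch:

# Shuffled Batches Sketch (permute indices before slicing)
import numpy as np
def create_shuffled_batches(X, y, batch_size=32, seed=None):
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])
    return [(X[order[i:i + batch_size]], y[order[i:i + batch_size]])
            for i in range(0, X.shape[0], batch_size)]
X = np.random.randn(100, 10)
y = np.random.randn(100, 1)
print("Batches: " + str(len(create_shuffled_batches(X, y, batch_size=32, seed=0))))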

Gradient Clipping

Gradient clipping prevents exploding gradients. It limits gradient magnitudes. It stabilizes training. It is essential for recurrent networks.

Clipping methods include norm clipping and value clipping. Norm clipping scales gradients to maximum norm. Value clipping clamps gradient values. Both prevent extreme updates.

# Gradient Clipping
import numpy as np

def clip_gradients_norm(gradients, max_norm=1.0):
    # Scale all gradients so their global L2 norm is at most max_norm
    total_norm = np.sqrt(sum(np.sum(g**2) for g in gradients))
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        return [g * clip_coef for g in gradients]
    return gradients

def clip_gradients_value(gradients, min_val=-1.0, max_val=1.0):
    # Clamp each gradient element into [min_val, max_val]
    return [np.clip(g, min_val, max_val) for g in gradients]

# Example
gradients = [np.array([[10.0], [5.0]]), np.array([[-8.0]])]
clipped_norm = clip_gradients_norm(gradients, max_norm=1.0)
clipped_val = clip_gradients_value(gradients, min_val=-1.0, max_val=1.0)
print("Original norm: " + str(np.sqrt(sum(np.sum(g**2) for g in gradients))))
print("Clipped norm: " + str(np.sqrt(sum(np.sum(g**2) for g in clipped_norm))))

Gradient clipping stabilizes training. It prevents weight explosions. It enables training deeper networks. It is especially important for RNNs.

Summary

Training adjusts weights to minimize errors. Loss functions measure prediction errors. Backpropagation computes gradients efficiently. Optimizers update weights using gradients. Learning rate schedules adjust step sizes. Batch processing enables efficient training. Gradient clipping prevents exploding gradients. Proper training setup enables learning complex patterns.
