Beginner · Tutorial 4

Neural Networks: From Perceptrons to Deep Learning

NeuronDB Team
2/24/2025
28 min read

Neural Networks Overview

Neural networks learn complex patterns from data. They consist of layers of connected nodes, and each connection carries a weight. Training adjusts these weights to minimize prediction error. Because they can model non-linear relationships, neural networks work well for images, text, and other complex data.

A neural network has an input layer, one or more hidden layers, and an output layer. The input layer receives the features, the hidden layers transform them step by step, and the output layer produces the predictions. Adding layers lets the network learn more complex patterns.

Figure: Neural Network Architecture

The diagram shows the network structure: input features flow through the hidden layers, each layer applies its weights and activation function, and the output layer produces the final predictions.

Perceptron Model

A perceptron is the simplest neural network: a single node with weighted inputs. It multiplies each input by its weight, sums the results, applies an activation function, and produces a binary output.

The perceptron equation is y = f(Σ wᵢxᵢ + b). The weights wᵢ multiply the inputs xᵢ, the bias b shifts the decision boundary, and the activation function f produces the output: a step function gives 0 or 1, while a sigmoid gives a probability.

# Perceptron Example
import numpy as np

class Perceptron:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations

    def fit(self, X, y):
        self.weights = np.zeros(X.shape[1])
        self.bias = 0
        for _ in range(self.n_iterations):
            for i in range(X.shape[0]):
                linear_output = np.dot(X[i], self.weights) + self.bias
                prediction = self.activation(linear_output)
                # Perceptron learning rule: update weights when the prediction is wrong
                update = self.learning_rate * (y[i] - prediction)
                self.weights += update * X[i]
                self.bias += update

    def activation(self, x):
        # Step function: 1 if the weighted sum is non-negative, else 0
        return 1 if x >= 0 else 0

    def predict(self, X):
        return [self.activation(np.dot(x, self.weights) + self.bias) for x in X]

# Example: AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
perceptron = Perceptron()
perceptron.fit(X, y)
predictions = perceptron.predict(X)
print("Predictions: " + str(predictions))
# Result: Predictions: [0, 0, 0, 1]

Perceptrons can learn linearly separable patterns, but they fail on non-linearly separable patterns like XOR. Multi-layer networks remove this limitation, as the quick check below shows.
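As a quick check (a minimal sketch that reuses the Perceptron class from the example above), training it on the XOR truth table shows the failure: XOR is not linearly separable, so no single weight vector and bias classifies all four points correctly, and at least one prediction is always wrong.

# Perceptron on XOR (not linearly separable) -- reuses the Perceptron class above
import numpy as np

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

p = Perceptron()
p.fit(X_xor, y_xor)
print("XOR targets:     " + str(y_xor.tolist()))
print("XOR predictions: " + str(p.predict(X_xor)))
# The exact predictions depend on the final weights, but they never match
# all four targets: no single line separates the XOR classes.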

Figure: Perceptron

The diagram shows the perceptron structure: the inputs connect to a single node, which sums the weighted inputs, applies the activation function, and produces the output.

Multi-Layer Networks

Multi-layer networks have one or more hidden layers, and each layer processes the output of the previous one. Deeper networks learn more complex patterns; given enough neurons they can approximate any continuous function.

Forward propagation computes predictions: the input flows through the layers, each applying its weights and activation, until the output layer produces the predictions. Backward propagation computes gradients: the error flows backward through the layers, and the gradients are used to update the weights and reduce the error.

# Multi-Layer Neural Network
import numpy as np

class NeuralNetwork:
    def __init__(self, layers, learning_rate=0.01):
        self.layers = layers
        self.learning_rate = learning_rate
        self.weights = []
        self.biases = []
        # Initialize weights with small random values, biases with zeros
        for i in range(len(layers) - 1):
            w = np.random.randn(layers[i], layers[i+1]) * 0.1
            b = np.zeros((1, layers[i+1]))
            self.weights.append(w)
            self.biases.append(b)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -250, 250)))

    def forward(self, X):
        # Store every layer's activations for use in backpropagation
        self.activations = [X]
        for i in range(len(self.weights)):
            z = np.dot(self.activations[-1], self.weights[i]) + self.biases[i]
            a = self.sigmoid(z)
            self.activations.append(a)
        return self.activations[-1]

    def backward(self, X, y, output):
        m = X.shape[0]
        dW = []
        dB = []
        # Output layer error
        error = output - y
        dW.append(np.dot(self.activations[-2].T, error) / m)
        dB.append(np.sum(error, axis=0, keepdims=True) / m)
        # Backpropagate through hidden layers (sigmoid derivative = a * (1 - a))
        for i in range(len(self.weights) - 2, -1, -1):
            error = np.dot(error, self.weights[i+1].T) * self.activations[i+1] * (1 - self.activations[i+1])
            dW.insert(0, np.dot(self.activations[i].T, error) / m)
            dB.insert(0, np.sum(error, axis=0, keepdims=True) / m)
        return dW, dB

    def fit(self, X, y, epochs=1000):
        for epoch in range(epochs):
            output = self.forward(X)
            dW, dB = self.backward(X, y, output)
            # Gradient descent update
            for i in range(len(self.weights)):
                self.weights[i] -= self.learning_rate * dW[i]
                self.biases[i] -= self.learning_rate * dB[i]

    def predict(self, X):
        return (self.forward(X) > 0.5).astype(int)

# Example: XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
nn = NeuralNetwork([2, 4, 1], learning_rate=0.5)
nn.fit(X, y, epochs=10000)
predictions = nn.predict(X)
print("Predictions: " + str(predictions.flatten()))
# Result: Predictions: [0 1 1 0]

Multi-layer networks solve non-linearly separable problems by learning hierarchical features: early layers detect simple patterns, and later layers combine them into more complex ones.

Figure: Multi-Layer Network

The diagram shows the multi-layer structure: the input flows through the hidden layers, each layer transforms its input, and the output layer produces the predictions.

Activation Functions

Activation functions introduce non-linearity. Without them, a stack of layers collapses into a single linear transformation, no matter how deep it is. Non-linearity is what enables learning complex patterns, and different functions suit different problems.

Sigmoid maps inputs to the range 0 to 1 and works well for binary classification outputs, but it suffers from vanishing gradients in deep networks. Tanh maps inputs to the range -1 to 1 and centers its outputs around zero, but it also suffers from vanishing gradients.

ReLU is f(x) = max(0, x): it outputs zero for negative inputs and passes positive inputs through unchanged. Because its gradient does not saturate for positive inputs, it largely avoids the vanishing-gradient problem, makes deep networks trainable, and is the default choice for hidden layers.

# Activation Functions
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -250, 250)))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

# Evaluate each function across an input range (useful for plotting comparisons)
x = np.linspace(-5, 5, 100)
y_sigmoid = sigmoid(x)
y_tanh = tanh(x)
y_relu = relu(x)

print("Sigmoid(0): " + str(sigmoid(0.0)))
print("Tanh(0): " + str(tanh(0.0)))
print("ReLU(0): " + str(relu(0.0)))
# Result:
# Sigmoid(0): 0.5
# Tanh(0): 0.0
# ReLU(0): 0.0
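The vanishing-gradient claim above is easy to check numerically. The sigmoid derivative is σ'(x) = σ(x)(1 − σ(x)); it peaks at 0.25 at x = 0 and decays toward zero as |x| grows, while the ReLU derivative stays at 1 for every positive input. A minimal sketch (the gradient helpers here are illustrative additions, not part of the earlier code):

# Comparing sigmoid and ReLU gradients
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)            # derivative of sigmoid

def relu_grad(x):
    return (x > 0).astype(float)  # derivative of ReLU (taken as 0 at x = 0)

xs = np.array([0.0, 2.0, 5.0, 10.0])
print("Sigmoid gradients: " + str(sigmoid_grad(xs)))
print("ReLU gradients:    " + str(relu_grad(xs)))
# Sigmoid gradients fall from 0.25 at x = 0 to about 5e-5 at x = 10;
# ReLU gradients are exactly 1 for every positive input.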

Choose activation functions based on the problem type: sigmoid for a binary classification output, softmax for a multi-class classification output, ReLU for hidden layers, and a linear (identity) output for regression.
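As a minimal sketch of the two classification outputs (the scores below are made up purely for illustration): sigmoid squashes a single score into one probability, while softmax exponentiates a vector of class scores and normalizes them so they sum to 1.

# Output activations: sigmoid for binary, softmax for multi-class
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / np.sum(e)

binary_score = 1.2                         # hypothetical single output score
class_scores = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes

print("Binary probability: " + str(sigmoid(binary_score)))
print("Class probabilities: " + str(softmax(class_scores)))
print("Sum of class probabilities: " + str(round(float(np.sum(softmax(class_scores))), 3)))
# Sigmoid gives a single probability (about 0.77);
# softmax gives one probability per class, and they sum to 1.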

Figure: Activation Functions

The diagram compares the activation functions: sigmoid is S-shaped, tanh is an S-shape centered at zero, and ReLU is zero for negative inputs and linear for positive inputs.

Forward Propagation

Forward propagation computes predictions: it passes the inputs through all layers, each applying its weights and activation, until the final output is produced.

The process starts with the input features. The first hidden layer computes a weighted sum and applies its activation function; the result becomes the input to the next layer. This repeats for every layer until the final layer produces the predictions.

# Forward Propagation
import numpy as np

def forward_propagation(X, weights, biases, activation):
    # Store each layer's activations, starting with the input itself
    activations = [X]
    for i in range(len(weights)):
        z = np.dot(activations[-1], weights[i]) + biases[i]
        a = activation(z)
        activations.append(a)
    return activations

# Example network: 2 inputs -> 2 hidden units -> 1 output
X = np.array([[1, 2]])
weights = [
    np.array([[0.5, 0.3], [0.2, 0.4]]),
    np.array([[0.1], [0.6]])
]
biases = [
    np.array([[0.1, 0.2]]),
    np.array([[0.3]])
]

def relu(x):
    return np.maximum(0, x)

activations = forward_propagation(X, weights, biases, relu)
print("Output: " + str(activations[-1]))
# Result: Output: [[1.18]]

Forward propagation is efficient: a single pass through the network computes every layer's output, and the stored activations are reused during backpropagation.

Network Architecture Design

Architecture design strongly affects performance. More layers let the network learn more complex patterns, and more neurons per layer increase capacity, but too many parameters cause overfitting while too few cause underfitting.

The input layer size matches the feature count and the output layer size matches the target count; the hidden layer sizes are hyperparameters. Common patterns use decreasing or constant layer sizes. Start simple and increase complexity as needed.

# Architecture Design Example
import numpy as np

def create_network(input_size, hidden_sizes, output_size):
    # Layer sizes: input layer, hidden layers, output layer
    layers = [input_size] + hidden_sizes + [output_size]
    weights = []
    biases = []
    for i in range(len(layers) - 1):
        w = np.random.randn(layers[i], layers[i+1]) * 0.1
        b = np.zeros((1, layers[i+1]))
        weights.append(w)
        biases.append(b)
    return weights, biases

# Different architectures
arch1 = create_network(10, [5], 1)           # Simple
arch2 = create_network(10, [20, 10], 1)      # Medium
arch3 = create_network(10, [50, 30, 20], 1)  # Complex
print("Architecture 1 layers: 10 -> 5 -> 1")
print("Architecture 2 layers: 10 -> 20 -> 10 -> 1")
print("Architecture 3 layers: 10 -> 50 -> 30 -> 20 -> 1")

Choose the architecture based on the complexity of the data: simple problems need simple networks, while complex problems need deeper ones. Use validation data to guide the selection, as in the sketch below.
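A minimal sketch of that selection loop, assuming the NeuralNetwork class defined earlier and a synthetic binary-classification dataset (both the dataset and the candidate layer sizes are made up for illustration): each candidate architecture is trained on the training split and compared on a held-out validation split.

# Validation-based architecture selection -- reuses the NeuralNetwork class above
import numpy as np

np.random.seed(0)
X = np.random.randn(300, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int).reshape(-1, 1)  # synthetic binary target

# Hold out 20% of the data for validation
split = int(0.8 * len(X))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

candidates = [[5], [20, 10], [50, 30, 20]]
for hidden in candidates:
    nn = NeuralNetwork([10] + hidden + [1], learning_rate=0.5)
    nn.fit(X_train, y_train, epochs=2000)
    accuracy = np.mean(nn.predict(X_val) == y_val)
    print("Hidden layers " + str(hidden) + ": validation accuracy " + str(round(float(accuracy), 3)))
# Pick the smallest architecture whose validation accuracy is acceptable.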

Figure: Network Architecture

The diagram shows different architectures: a simple network with few layers and a complex network with many. Each suits a different problem complexity.

Weight Initialization

Weight initialization affects training: poor initialization causes slow convergence or outright failure, while good initialization speeds training up. Common methods include small random values, Xavier initialization, and He initialization.

Random initialization uses small random values to break the symmetry between neurons, but values that are too small cause vanishing gradients and values that are too large cause exploding gradients. Xavier initialization scales the weights by the combined fan-in and fan-out and works well with sigmoid and tanh. He initialization scales by the fan-in alone and works well with ReLU.

# Weight Initialization
import numpy as np

def xavier_init(fan_in, fan_out):
    # Uniform in [-limit, limit], scaled by fan-in and fan-out
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, (fan_in, fan_out))

def he_init(fan_in, fan_out):
    # Gaussian scaled by fan-in, suited to ReLU layers
    std = np.sqrt(2.0 / fan_in)
    return np.random.normal(0, std, (fan_in, fan_out))

# Example
w_xavier = xavier_init(10, 20)
w_he = he_init(10, 20)
print("Xavier std: " + str(np.std(w_xavier)))
print("He std: " + str(np.std(w_he)))
# Result (varies with the random draw, roughly):
# Xavier std: 0.26
# He std: 0.45

Proper initialization is critical: it sets training on the right path. Poor initialization can prevent convergence altogether, while good initialization accelerates it.
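One way to see this numerically is to push a batch of random inputs through a deep stack of ReLU layers and track the standard deviation of the activations: with weights drawn at a fixed small scale the signal shrinks toward zero layer after layer, while He-scaled weights keep it roughly constant. A minimal sketch, reusing the he_init helper above (the small_init helper and the layer sizes are made up for illustration):

# Signal propagation through 10 ReLU layers under two initializations
import numpy as np

np.random.seed(0)

def relu(x):
    return np.maximum(0, x)

def small_init(fan_in, fan_out):
    return np.random.randn(fan_in, fan_out) * 0.01   # fixed small scale

def activation_std(init_fn, n_layers=10, width=100):
    a = np.random.randn(64, width)                   # batch of random inputs
    for _ in range(n_layers):
        a = relu(np.dot(a, init_fn(width, width)))
    return np.std(a)

print("Std after 10 layers, small random init: " + str(activation_std(small_init)))
print("Std after 10 layers, He init:           " + str(activation_std(he_init)))
# With the fixed 0.01 scale the activations collapse toward zero;
# with He scaling the spread stays roughly constant across layers.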

Summary

Neural networks learn complex patterns from data. Perceptrons are single-layer networks limited to linearly separable problems, while multi-layer networks solve non-linear ones. Activation functions introduce non-linearity, with ReLU as the default for hidden layers. Forward propagation computes predictions, architecture design balances capacity against overfitting, and weight initialization affects training success. Proper setup enables learning complex patterns.
