Intermediate · Tutorial 8

Transformers: Architecture and Self-Attention

NeuronDB Team
2/24/2025
32 min read

Transformers Overview

Transformers process sequences using attention mechanisms instead of recurrence. They have replaced RNNs for many NLP tasks because they process all positions in parallel and capture long-range dependencies effectively, and they form the basis of modern LLMs.

Transformers use self-attention to relate every position in a sequence to every other position. Processing the whole sequence in parallel lets them learn complex relationships and scale to large models and datasets.

Figure: Transformer Architecture

The diagram shows transformer structure. Inputs flow through encoder and decoder stacks. Attention mechanisms relate positions. Feed-forward networks process information.

Self-Attention Mechanism

Self-attention computes relationships between all positions in a sequence. Each position is projected into query, key, and value vectors; attention scores computed between queries and keys determine how strongly each value contributes, so dependencies are captured regardless of distance.

The attention formula is Attention(Q, K, V) = softmax(QK^T / √d_k) V. Queries Q are matched against keys K, values V carry the information that is aggregated, and scaling by √d_k stabilizes gradients.

# Self-Attention Implementation
import numpy as np
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        # linear projections that produce queries, keys, and values
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

    def forward(self, x):
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        # scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
        scores = torch.matmul(Q, K.transpose(-2, -1)) / np.sqrt(self.d_model)
        attention_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

# Example
attention = SelfAttention(d_model=512)
x = torch.randn(1, 10, 512)  # batch, seq_len, d_model
output, weights = attention(x)
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

Self-attention captures all pairwise relationships and enables parallel processing across the sequence. Its compute and memory cost grow quadratically with sequence length, since every position attends to every other position.

Figure: Self-Attention

The diagram shows attention computation. Each position attends to all positions. Attention weights show relationships.

Multi-Head Attention

Multi-head attention runs several attention heads in parallel. Each head learns different relationships, and the head outputs are concatenated and projected, which increases model capacity.

The multi-head formula is MultiHead(Q, K, V) = Concat(head₁, ..., headₕ)W^O, where headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V). Each head has its own weight matrices, so the heads learn complementary patterns.

# Multi-Head Attention
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # dimension per head

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size = x.size(0)
        # project, then split into heads: (batch, num_heads, seq_len, d_k)
        Q = self.W_q(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # scaled dot-product attention within each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / np.sqrt(self.d_k)
        attention_weights = torch.softmax(scores, dim=-1)
        attention_output = torch.matmul(attention_weights, V)
        # concatenate heads and apply the output projection
        attention_output = attention_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        return self.W_o(attention_output)

# Example
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(1, 10, 512)
output = mha(x)
print(f"Multi-head output shape: {output.shape}")

Multi-head attention increases model capacity. Different heads learn different patterns. They capture diverse relationships.

Figure: Multi-Head Attention

The diagram shows multi-head structure. Multiple attention heads process in parallel. They are concatenated for final output.

Encoder-Decoder Structure

Encoder-decoder transformers process input-output pairs: the encoder processes the input sequence, the decoder generates the output sequence, and attention connects the two.

The encoder stacks self-attention and feed-forward layers and processes the input in parallel. The decoder stacks masked self-attention, encoder-decoder attention, and feed-forward layers; the masking prevents each position from attending to later positions.

# Encoder-Decoder Transformer
# TransformerEncoderLayer and TransformerDecoderLayer are assumed to be defined
# by composing the attention, feed-forward, and normalization modules from this
# tutorial (an encoder-layer sketch is given in the Feed-Forward Networks section;
# a decoder layer additionally needs masked self-attention and cross-attention).
class TransformerEncoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

class TransformerDecoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerDecoderLayer(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])

    def forward(self, x, encoder_output):
        for layer in self.layers:
            x = layer(x, encoder_output)
        return x

The encoder-decoder structure enables sequence-to-sequence tasks such as translation, summarization, and generation.
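
The masked self-attention used in the decoder can be illustrated with a causal mask. The sketch below is an illustrative addition, not part of the classes above: it hides future positions by setting their scores to negative infinity before the softmax, so each position attends only to itself and earlier positions.

# Causal (look-ahead) mask for decoder self-attention, illustrative sketch
seq_len = 5
scores = torch.randn(1, seq_len, seq_len)  # raw attention scores (batch, seq, seq)
# True above the diagonal marks future positions that must be hidden
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
# Masked positions become -inf, so softmax assigns them zero weight
masked_scores = scores.masked_fill(causal_mask, float('-inf'))
masked_weights = torch.softmax(masked_scores, dim=-1)
print(masked_weights[0])  # upper triangle is all zeros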

Figure: Encoder-Decoder

The diagram shows encoder-decoder flow. Encoder processes input. Decoder generates output using encoder information.

Positional Encoding

Positional encoding adds position information to the input embeddings. Attention itself is order-agnostic, so transformers have no inherent notion of sequence order; positional encodings inject it, using either sinusoidal functions or learned embeddings.

Sinusoidal encoding uses sine and cosine functions of different frequencies; it lets the model reason about relative positions and generalizes to sequence lengths not seen in training. Learned embeddings instead learn a representation for each position and work well up to a fixed maximum length.

# Positional Encoding
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             -(np.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

# Example
pos_encoding = PositionalEncoding(d_model=512)
x = torch.randn(1, 10, 512)
x_pos = pos_encoding(x)
print(f"Positional encoded shape: {x_pos.shape}")

Positional encoding enables sequence processing. It provides order information. It works with attention mechanisms.

Figure: Positional Encoding

The diagram shows positional encoding. Each position gets unique encoding. Encodings are added to embeddings.

Layer Normalization

Layer normalization stabilizes training. It normalizes activations within each layer, reduces internal covariate shift, and enables deeper networks.

The layer normalization formula is LN(x) = γ * (x - μ) / (σ + ε) + β. The mean μ and standard deviation σ are computed over the feature dimension of each position, the scale γ and shift β are learnable, and ε prevents division by zero.

# Layer Normalization
class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

# Example
layer_norm = LayerNorm(d_model=512)
x = torch.randn(1, 10, 512)
x_norm = layer_norm(x)
print(f"Normalized shape: {x_norm.shape}")

Layer normalization stabilizes training. It enables deeper networks. It works well with residual connections.

Feed-Forward Networks

Feed-forward networks process the attention outputs position-wise. They apply two linear transformations with a ReLU in between, which adds capacity and non-linear transformations.

The FFN formula is FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. The first linear layer expands the dimension (d_ff = 2048 for d_model = 512 in the example below), ReLU introduces non-linearity, and the second linear layer projects back to d_model.

# Feed-Forward Network
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

# Example
ffn = FeedForward(d_model=512, d_ff=2048)
x = torch.randn(1, 10, 512)
x_ffn = ffn(x)
print(f"FFN output shape: {x_ffn.shape}")

Feed-forward networks increase model capacity. They enable complex transformations. They work with attention mechanisms.
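
With attention, feed-forward, and normalization modules in place, the pieces can be combined into a full encoder layer. The sketch below is one possible composition, using residual connections around each sub-layer followed by layer normalization; it is an illustrative example of the TransformerEncoderLayer referenced in the encoder-decoder code earlier, not a definitive implementation (dropout is omitted for brevity).

# Transformer Encoder Layer, illustrative composition of the modules above
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)

    def forward(self, x):
        # residual connection and layer norm around self-attention
        x = self.norm1(x + self.attention(x))
        # residual connection and layer norm around the feed-forward network
        x = self.norm2(x + self.ffn(x))
        return x

# Example
encoder_layer = TransformerEncoderLayer(d_model=512, num_heads=8, d_ff=2048)
x = torch.randn(1, 10, 512)
print(f"Encoder layer output shape: {encoder_layer(x).shape}")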

Summary

Transformers use attention mechanisms to process sequences. Self-attention relates all positions. Multi-head attention increases capacity. Encoder-decoder structure enables sequence-to-sequence tasks. Positional encoding provides order information. Layer normalization stabilizes training. Feed-forward networks increase capacity. Transformers form the basis of modern language models.
