Transformers Overview
Transformers process sequences using attention mechanisms. They have largely replaced RNNs for NLP tasks because they process whole sequences in parallel and capture long-range dependencies effectively, and they form the basis of modern large language models (LLMs).
Self-attention lets every position relate to every other position, so the model can learn complex relationships while scaling to large models and datasets.
The diagram shows the transformer architecture: inputs flow through stacks of encoder and decoder layers, attention mechanisms relate positions, and feed-forward networks transform the resulting representations.
Self-Attention Mechanism
Self-attention computes relationships between every pair of positions in a sequence. Each position is projected into query, key, and value vectors; attention scores between queries and keys determine how the values are weighted, so dependencies are captured regardless of how far apart the positions are.
The attention operation is Attention(Q, K, V) = softmax(QK^T / √d_k) V. Queries Q are matched against keys K to produce scores, values V carry the information that gets aggregated, and scaling by √d_k keeps the softmax inputs in a range where gradients stay stable.
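As a concrete illustration, the following is a minimal NumPy sketch of scaled dot-product attention. The helper names, random inputs, and toy dimensions are assumptions made for the example, not part of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of values, plus the weights

# Toy example: a sequence of 4 positions with d_k = d_v = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```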
Because it captures all pairwise relationships in one parallel operation, self-attention handles long-range dependencies directly; the trade-off is that its compute and memory cost grow quadratically with sequence length.
The diagram shows the attention computation: each position attends to every position, and the attention weights indicate which relationships the model has learned.
Multi-Head Attention
Multi-head attention runs several attention heads in parallel. Each head applies attention with its own learned projections and can focus on different relationships; the head outputs are concatenated and projected back, which increases model capacity.
The multi-head formula is MultiHead(Q, K, V) = Concat(head₁, ..., headₕ)W^O, where headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V). Each head has its own weight matrices, so the heads learn complementary patterns.
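The sketch below implements this computation in NumPy by slicing per-head projections out of larger matrices, which is equivalent to giving each head its own Wᵢ matrices. The head count, dimensions, and initialization are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """MultiHead(X) = Concat(head_1, ..., head_h) W_O with per-head projections."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    heads = []
    for i in range(num_heads):
        # Columns i*d_head : (i+1)*d_head act as this head's W_i^Q, W_i^K, W_i^V.
        s = slice(i * d_head, (i + 1) * d_head)
        Qi, Ki, Vi = X @ Wq[:, s], X @ Wk[:, s], X @ Wv[:, s]
        heads.append(attention(Qi, Ki, Vi))
    return np.concatenate(heads, axis=-1) @ Wo  # concatenate heads, then project

# Toy example: 4 positions, d_model = 16, 4 heads of size 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))
Wq = rng.normal(size=(16, 16)) * 0.1
Wk = rng.normal(size=(16, 16)) * 0.1
Wv = rng.normal(size=(16, 16)) * 0.1
Wo = rng.normal(size=(16, 16)) * 0.1
print(multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=4).shape)  # (4, 16)
```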
Because the heads attend independently, different heads can specialize in different patterns, and together they capture a more diverse set of relationships than a single head could.
The diagram shows the multi-head structure: several attention heads run in parallel, and their outputs are concatenated to form the final output.
Encoder-Decoder Structure
Encoder-decoder transformers map an input sequence to an output sequence. The encoder builds a representation of the input, the decoder generates the output, and attention connects the two.
Each encoder layer stacks self-attention and a feed-forward network and can attend over the entire input at once. Each decoder layer stacks masked self-attention, encoder-decoder attention, and a feed-forward network; the mask prevents a position from attending to later positions, so the decoder cannot look ahead.
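To make the masking concrete, here is a small NumPy sketch of masked (causal) self-attention. Adding -inf above the diagonal before the softmax is one common way to implement the mask; the sizes and names are illustrative assumptions.

```python
import numpy as np

def causal_mask(seq_len):
    # Future positions (strictly above the diagonal) get -inf,
    # so they receive zero weight after the softmax.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: each row i of the weights is zero for columns j > i,
# so position i can only look at positions 0..i.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 8))
out, w = masked_attention(Q, K, V)
print(np.round(w, 2))
```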
The encoder-decoder structure suits sequence-to-sequence tasks such as translation, summarization, and conditional generation.
The diagram shows the encoder-decoder flow: the encoder processes the input, and the decoder generates the output while attending to the encoder's representations.
Positional Encoding
Attention on its own has no notion of sequence order, so transformers add positional encodings to inject that information. The encodings are either fixed sinusoidal functions or learned position embeddings, and they are added to the token embeddings.
Sinusoidal encodings use sine and cosine functions at different frequencies; a fixed offset between positions corresponds to a linear transformation of the encoding, and the scheme extends to sequence lengths not seen during training. Learned embeddings instead train a vector for each position and work well up to a fixed maximum length.
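Here is a minimal NumPy sketch of the sinusoidal scheme; the sequence length and model dimension are arbitrary choices for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even feature indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices get sine
    pe[:, 1::2] = np.cos(angles)   # odd indices get cosine
    return pe

# The encoding is simply added to the token embeddings.
pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)                 # (50, 64)
print(np.round(pe[0, :4], 2))   # position 0: sine terms are 0, cosine terms are 1
```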
Either way, the positional signal supplies the order information that attention otherwise lacks.
The diagram shows positional encoding: each position receives a unique encoding, which is added to the corresponding token embedding.
Layer Normalization
Layer normalization stabilizes training by normalizing the activations of each example across the feature dimension. Keeping activation statistics consistent reduces internal covariate shift and makes deeper networks trainable.
The normalization formula is LN(x) = γ * (x - μ) / (σ + ε) + β. The mean μ and standard deviation σ are computed over the features of each example, the scale γ and shift β are learnable parameters, and the small constant ε prevents division by zero.
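A sketch of this formula in NumPy, normalizing each position over its feature dimension; the shapes, ε value, and identity choices for γ and β are assumptions for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LN(x) = gamma * (x - mu) / (sigma + eps) + beta, per example over features."""
    mu = x.mean(axis=-1, keepdims=True)      # mean over the feature dimension
    sigma = x.std(axis=-1, keepdims=True)    # standard deviation over features
    return gamma * (x - mu) / (sigma + eps) + beta

# Toy example: 4 positions with 8 features each.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8))
gamma, beta = np.ones(8), np.zeros(8)        # identity scale and shift
y = layer_norm(x, gamma, beta)
print(np.round(y.mean(axis=-1), 6))  # ~0 per position
print(np.round(y.std(axis=-1), 6))   # ~1 per position
```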
In combination with residual connections, layer normalization keeps gradients well behaved and lets transformers stack many layers.
Feed-Forward Networks
A position-wise feed-forward network processes the attention output at each position independently. It applies two linear transformations with a ReLU in between, adding non-linear capacity on top of attention.
The FFN formula is FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. The first linear layer expands the hidden dimension, ReLU introduces non-linearity, and the second layer projects back to the model dimension.
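A sketch of this position-wise network in NumPy; the 4× expansion factor and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # expand and apply ReLU
    return hidden @ W2 + b2                 # project back to d_model

# Toy example: d_model = 16 expanded to d_ff = 64 and back.
rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
x = rng.normal(size=(4, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4, 16)
```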
Applied after every attention block, these layers supply much of the model's capacity for complex transformations.
Summary
Transformers process sequences with attention: self-attention relates every position to every other, multi-head attention widens the set of patterns that can be learned, the encoder-decoder structure supports sequence-to-sequence tasks, positional encoding supplies order information, layer normalization stabilizes training, and feed-forward networks add capacity. Together these components form the basis of modern language models.