Transformer Architecture: The Engine of LLMs
"The Transformer is the first sequence transduction model relying entirely on attention." — Vaswani et al. (2017)
To pass an LLM interview, simply knowing "it uses attention" is not enough. You must understand why specific design choices were made (Pre-Norm vs Post-Norm, SwiGLU vs ReLU, GQA vs MHA, MoE vs Dense) and the mathematical operations inside the block.
1. The High-Level View
A modern Decoder-Only Transformer (like GPT-4 or Llama 3) consists of a stack of identical blocks. Each block has two main sub-layers:
- Multi-Head Self-Attention (MHA): Mixing information between tokens.
- Feed-Forward Network (FFN): Processing information within each token independently.
Critically, these are wrapped in Residual Connections and Layer Normalization.
2025 Evolution:
- FFN → MoE: Many models now use Mixture-of-Experts for the feed-forward layer
- MHA → GQA: Grouped-Query Attention reduces KV cache memory
- Standard → Hybrid: Some models mix Transformer with State Space Models (Mamba)
2. Self-Attention: The "Routing" Layer
Attention allows tokens to "talk" to each other. It asks specific questions to build context.
The Query, Key, Value Intuition
Every token produces three vectors:
- Query (Q): "What am I looking for?" (e.g., a noun looking for its adjective).
- Key (K): "What do I contain?" (e.g., I am an adjective).
- Value (V): "If you attend to me, here is my information."
The Engineering Perspective
Self-attention computes relationships between tokens through these steps (a minimal sketch follows the list):
- Similarity: Compute a score for every pair of tokens (QK^T)
- Scaling: Divide by the square root of the key dimension so the dot products do not grow large and push softmax into its saturated, near-zero-gradient region
- Normalization: Convert scores to probabilities (softmax)
- Aggregation: Take the weighted sum of value vectors
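A minimal sketch of these four steps (plain PyTorch, single head, no masking; tensor shapes and names are illustrative):
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # similarity + scaling
    weights = F.softmax(scores, dim=-1)            # normalization to probabilities
    return weights @ v                             # aggregation of value vectors

q = k = v = torch.randn(1, 4, 64)
out = scaled_dot_product_attention(q, k, v)        # (1, 4, 64)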
Why Multi-Head?
One head might focus on syntax (noun-verb agreement). Another might focus on semantics (synonyms). Another might look at position (previous word).
| Model | Heads | Head Dimension | Total Dimension |
|---|---|---|---|
| Llama 3 8B | 32 | 128 | 4,096 |
| Llama 3 70B | 64 | 128 | 8,192 |
| GPT-4 | 96+ (est.) | 128 | 12,288 |
3. Grouped Query Attention (GQA) - The 2025 Standard
As context windows grew (8k → 128k → 1M+), the KV Cache became a memory bottleneck. Storing Key and Value matrices for every head is expensive.
The Spectrum: MHA → GQA → MQA
| Mechanism | Query Heads | KV Heads | KV Cache Size | Quality | Speed |
|---|---|---|---|---|---|
| MHA (Multi-Head) | H | H | 100% | Best | Slowest |
| GQA (Grouped-Query) | H | G (where G < H) | G/H (e.g., 1/4 to 1/8) | Near-best | Faster |
| MQA (Multi-Query) | H | 1 | 1/H | Lower | Fastest |
How GQA Works
Instead of each head having its own K/V projections, groups of query heads share K/V:
# MHA: 32 heads, 32 KV pairs
q_heads = 32
kv_heads = 32
# GQA: 32 query heads, 8 KV pairs (groups of 4)
q_heads = 32
kv_heads = 8 # Each KV pair serves 4 query heads
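A minimal runnable sketch of the sharing mechanism (PyTorch; shapes mirror the numbers above, and repeat_interleave is one of several ways real implementations broadcast the shared KV heads):
import torch

batch, seq, q_heads, kv_heads, head_dim = 1, 16, 32, 8, 128
group_size = q_heads // kv_heads                   # 4 query heads per KV head

q = torch.randn(batch, q_heads, seq, head_dim)
k = torch.randn(batch, kv_heads, seq, head_dim)    # only 8 K heads are cached
v = torch.randn(batch, kv_heads, seq, head_dim)    # only 8 V heads are cached

# Broadcast each KV head to its group of query heads at attention time
k = k.repeat_interleave(group_size, dim=1)         # (batch, 32, seq, head_dim)
v = v.repeat_interleave(group_size, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v            # same output shape as MHA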
Benefits:
- Memory reduction: e.g., 8x smaller KV cache for GQA-8 with 64 query heads (4x with 32 query heads)
- Bandwidth reduction: Less memory transfer during inference
- Quality retention: GQA-8 achieves ~98-99% of MHA quality
Adoption:
- Llama 3 70B: Uses GQA for efficient inference
- T5-XXL: GQA-8 for production deployment
- Gemini 2.5: Uses GQA variants for long context
2025: Weighted GQA (WGQA)
Innovation: Learnable parameters for each K/V head enable weighted averaging during fine-tuning.
Benefits:
- 0.53% average improvement over standard GQA
- Converges to MHA quality with no inference overhead
- Model learns optimal grouping during training
4. Mixture-of-Experts (MoE) - The Scaling Revolution
Instead of one monolithic feed-forward network, MoE uses multiple specialized "expert" networks. Each token is routed to the most relevant experts.
Key Components (a routing sketch follows this list)
- Router: Gating network that selects top-k experts for each token
- Experts: Specialized FFN networks (typically 8-64 per layer)
- Load Balancing: Auxiliary loss ensures all experts are utilized
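A minimal sketch of top-k routing over a handful of experts (PyTorch; the expert MLPs, sizes, and the simple loop-based dispatch are illustrative — production kernels batch this, enforce capacity limits, and add the auxiliary losses mentioned above):
import torch
import torch.nn as nn

d_model, n_experts, top_k = 512, 8, 2
router = nn.Linear(d_model, n_experts)             # gating network
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model))
    for _ in range(n_experts))                     # specialized FFNs

def moe_forward(x):                                # x: (tokens, d_model)
    gate = torch.softmax(router(x), dim=-1)        # (tokens, n_experts)
    weights, idx = gate.topk(top_k, dim=-1)        # keep the top-k experts per token
    weights = weights / weights.sum(-1, keepdim=True)
    out = torch.zeros_like(x)
    for e in range(n_experts):                     # dispatch each token to its selected experts
        token_ids, slot = (idx == e).nonzero(as_tuple=True)
        if token_ids.numel():
            out[token_ids] += weights[token_ids, slot, None] * experts[e](x[token_ids])
    return out

y = moe_forward(torch.randn(10, d_model))          # (10, d_model); only 2 of 8 experts run per token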
2025 MoE Models
| Model | Total Params | Active Params | Experts | Top-K | Notes |
|---|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 13B | 8 | 2 | Open-source, matches Llama 2 70B |
| Llama 4 | TBD | TBD | TBD | TBD | MoE variant rumored |
| DeepSeek-V3 | 671B | 37B | 256 | 8 | Shared experts, diverse routing |
| GPT-4 | ~1.7T (est.) | ~220B (est.) | ~128 (est.) | TBD | MoE widely suspected |
| Switch Transformer | 1.6T | TBD | 2048 | 1 | Research milestone |
| GLaM | 1.2T | TBD | 64 | 2 | Google's trillion-parameter model |
Why MoE Matters
Training Efficiency:
- Same quality as dense model with 1/3 the compute (GLaM result)
- Allows scaling to trillions of parameters
- Carbon footprint: Up to 10x reduction vs dense models
Inference Efficiency:
- Only activates relevant experts per token
- Example: Mixtral's ~47B total parameters with ~13B active run at roughly the inference cost of a 13B dense model while competing with far larger dense models on quality
- Enables massive models on consumer hardware (with quantization)
Training Stability (2025 Advances):
- Router Z-loss: Penalizes large router logits, stabilizing training
- Shared experts: Reduces redundancy, increases diversity
- Sigmoid gating: More stable than softmax for expert selection
MoE vs Dense FFN
| Aspect | Dense FFN | MoE |
|---|---|---|
| Parameters | Fixed per layer | Scales with experts |
| Compute | Always active | Sparse activation |
| Quality | Baseline | Same or better |
| Inference | Predictable | Variable (depends on routing) |
| Training | Stable | Requires tricks (Z-loss, aux loss) |
5. Pre-Norm vs. Post-Norm
Post-Norm (Original Transformer, BERT)
LayerNorm is applied after the residual connection.
- Issue: Gradients can explode near the output layers during initialization, requiring a "warm-up" stage.
Pre-Norm (GPT-2, Llama, PaLM)
LayerNorm is applied before the sublayer.
- Benefit: Gradients flow through the "residual highway" (the addition path) untouched. Training is much more stable at scale.
- Trade-off: Potentially slightly less expressive (theoretical debate), but stability wins for LLMs.
2025 Consensus: Pre-Norm is universal for decoder-only LLMs. Post-Norm still used in some encoder-decoder models (T5).
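The difference is only where normalization sits relative to the residual addition. A schematic sketch (Python; "sublayer" stands for attention or the FFN, "norm" for LayerNorm/RMSNorm):
# Post-Norm (original Transformer, BERT): normalize after the residual add
def post_norm_block(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-Norm (GPT-2, Llama): normalize the input; the residual path stays untouched
def pre_norm_block(x, sublayer, norm):
    return x + sublayer(norm(x))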
6. Feed-Forward Networks (FFN) & MoE: The "Knowledge" Layer
If Attention is "routing" information, the FFN (or MoE) is "processing" it. Some researchers posit that FFNs act as Key-Value Memories storing factual knowledge.
Evolution of Activations
- ReLU (Original): Rectified Linear Unit. Problem: "Dead neurons" (zero gradient).
- GELU (GPT-2/3): Gaussian Error Linear Unit. Smoother, probabilistic.
- SwiGLU (PaLM, Llama): Swish-Gated Linear Unit.
What is SwiGLU?
It adds a "gate" to the FFN. Instead of just passing data through, we compute two paths and multiply them. This requires 3 matrix multiplications instead of 2, but consistently yields better performance for the same compute budget.
MoE as FFN Replacement
Standard Transformer Block:
Attention → Dense FFN → Output
MoE Transformer Block:
Attention → Router → Selected Experts → Combined Output
Each expert is a specialized FFN. A simplified picture:
- Expert 1: Specializes in coding patterns
- Expert 2: Specializes in mathematical reasoning
- Expert 3: Specializes in factual knowledge
- Experts 4-8: Other specializations
(In practice, learned specialization is often less cleanly interpretable than this, but the routing principle is the same.)
7. Linear Attention & Hybrid Architectures
Linear Attention (2020+)
Problem: Standard attention has O(N²) complexity due to the QK^T matrix computation.
Solution: Use kernel functions to approximate attention without materializing the explicit N × N matrix (a sketch follows the benefits list).
Benefits:
- O(N) complexity instead of O(N²)
- Enables truly massive context windows (1M+ tokens)
- Trade-off: Slight quality degradation
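A minimal sketch of the kernel trick (PyTorch, non-causal, using the common elu(x) + 1 feature map from the linear-attention literature; this illustrates the idea rather than any specific production model):
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, seq, d_k), v: (batch, seq, d_v)
    q, k = F.elu(q) + 1, F.elu(k) + 1              # positive feature map phi(.)
    kv = torch.einsum('bnd,bne->bde', k, v)        # d_k x d_v summary: O(N), not O(N^2)
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)

q = k = v = torch.randn(1, 1024, 64)
out = linear_attention(q, k, v)                    # never materializes the 1024 x 1024 score matrix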
Adoption:
- RWKV: Recurrent architecture with linear attention
- Mamba/State Space Models: Linear complexity by design
- Hybrid models: Mix Transformer and linear attention layers
2025: Higher-Order Attention (Nexus)
Innovation: Query and Key vectors are outputs of nested self-attention loops.
Benefits:
- Captures multi-hop relationships in single layer
- More expressive than standard first-order attention
- Enables complex reasoning without deep stacks
Status: Research stage, not yet production in major LLMs.
8. Positional Encodings Revisited
RoPE (Rotary Positional Embeddings) - Gold Standard
Used by Llama 2/3/4, PaLM, Mistral, GPT-NeoX.
- Intuition: Encode position by rotating the vector in space.
- Mechanism:
- A token at position m has its query and key vectors rotated by angle m·θ_i (a different θ_i for each pair of dimensions).
- The dot product (similarity) between two tokens then depends only on their relative distance (m − n).
- Why it wins (a minimal code sketch follows this list):
- Decay: Attention naturally decays as tokens get further apart (long-term dependency management).
- Extrapolation: It handles context lengths longer than training data better than absolute embeddings.
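A minimal sketch of the rotation (PyTorch; each pair of dimensions is rotated by a position-dependent angle — real implementations cache the sines/cosines and fuse this into the attention kernel):
import torch

def rope(x, positions, base=10000.0):
    # x: (seq, dim) with even dim; rotate each (even, odd) pair by position * theta_i
    dim = x.size(-1)
    theta = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    angles = positions[:, None].float() * theta[None, :]                   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = rope(torch.randn(8, 64), torch.arange(8))   # dot products now depend only on relative distance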
2025: PaTH Attention
Innovation: Treats in-between words as a path of data-dependent transformations (Householder reflections).
Benefits:
- Positional memory: Tracks state changes across sequences
- Better sequential reasoning: Improved code execution tracking
- Selective forgetting: Combined with Forgetting Transformers (FoX) to down-weight old info
Status: Cutting-edge research, not yet in production models.
9. Interview FAQ
Q: What is the computational complexity of Self-Attention?
A: O(N²·d), where N is the sequence length and d the head dimension.
- Computing QK^T produces an N × N score matrix.
- This is why long context (100k+) is hard: doubling the context quadruples compute (a quick size calculation follows the solutions list below).
- 2025 Solutions:
- FlashAttention-2: Optimizes memory IO, but the math is still O(N²)
- Linear Attention: O(N) complexity, slight quality trade-off
- Ring Attention: Distributed across GPUs, enables 1M+ context
- Sliding Window: Only attend to nearby tokens + global cache
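A quick back-of-the-envelope illustration of the quadratic growth (per head, per layer, fp16 scores; FlashAttention avoids materializing this matrix, but the FLOPs stay quadratic):
# The score matrix is N x N per head: doubling N quadruples it
for n in (8192, 32768, 131072):
    gib = n * n * 2 / 2**30          # 2 bytes per fp16 entry
    print(f"N={n}: {gib:.1f} GiB per head per layer")
# N=8192: 0.1 GiB, N=32768: 2.0 GiB, N=131072: 32.0 GiB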
Q: Why do we need Layer Normalization?
A: To stabilize the distribution of activations across deep networks and to ensure that no single feature dominates in magnitude. Without it, gradients would explode or vanish in a network with 100+ layers.
2025 Update: RMSNorm (Root Mean Square Normalization) is replacing LayerNorm in many models (Llama, Gemma) because it's simpler and faster:
- Normalizes by root mean square instead of mean and variance
- More computationally efficient than LayerNorm
- Better stability for very deep networks
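A minimal sketch of RMSNorm (PyTorch; the epsilon and learned scale follow common open-source implementations such as Llama's, but treat the details as illustrative):
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))   # learned scale, no bias
        self.eps = eps

    def forward(self, x):
        # Divide by the root mean square only; no mean subtraction, no variance
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

out = RMSNorm(512)(torch.randn(2, 16, 512))   # same shape, per-token unit RMS (up to the scale)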
Q: How does a Decoder-only model prevent "cheating" during training?
A: Through Causal Masking. In the self-attention step, we set the attention scores for all future tokens (positions j > i) to −∞. When passed through softmax, these weights become 0, ensuring token i can only attend to tokens 1 through i.
Implementation:
import torch

seq_len = 8
scores = torch.randn(seq_len, seq_len)  # dummy attention logits
# Create causal mask: 1s above the diagonal mark future positions
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
# Apply to attention scores before softmax
scores = scores.masked_fill(mask.bool(), float('-inf'))
Q: What is the purpose of the Residual (Skip) Connection?
A: It mitigates the vanishing gradient problem. By letting gradients flow directly through the addition path (output = x + Sublayer(x)), errors can backpropagate from the last layer to the first without being diminished by repeated multiplications.
2025 Insight: Residual blocks are also the natural unit for gradient checkpointing, which trades compute for memory during training.
Q: When should I use GQA vs MHA vs MQA?
A:
Use MHA when:
- Quality is paramount (research, benchmarks)
- Context window is short (< 8k tokens)
- Memory is not a constraint
Use GQA when:
- You want the default choice for production LLMs in 2025
- Long context (32k-128k tokens)
- Memory-constrained deployment
- Want near-MHA quality with faster inference
Use MQA when:
- Maximal throughput is required
- Can accept 5-10% quality degradation
- Very large batch inference (e.g., API serving)
2025 Verdict: GQA-8 or GQA-4 is the sweet spot for most applications.
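To make the trade-off concrete, a rough fp16 KV-cache size calculation (the formula 2 tensors × layers × KV heads × head_dim × seq_len × 2 bytes is standard; the Llama-3-70B-like shape below is illustrative):
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # K and V tensors, per layer, per KV head, per position
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 2**30

# 80 layers, head_dim 128, 128k context
print(kv_cache_gib(80, 64, 128, 131072))   # MHA (64 KV heads): 320 GiB
print(kv_cache_gib(80, 8, 128, 131072))    # GQA-8:              40 GiB
print(kv_cache_gib(80, 1, 128, 131072))    # MQA:                 5 GiB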
Q: What causes training instability in MoE models?
A: Three main issues:
1. Router collapse: All tokens route to the same expert, leaving others unused
- Fix: Auxiliary load-balancing loss, expert capacity factor
2. Expert overflow: An expert receives more tokens than its capacity factor allows
- Fix: Drop tokens or route to the next layer
3. Gradient imbalance: Some experts receive much larger gradients than others
- Fix: Router Z-loss, normalized expert losses
2025 Solutions (the standard auxiliary losses are sketched after this list):
- Shared experts: Reduces redundancy, improves load balancing
- Sigmoid gating: More stable than softmax for expert selection
- Stable MoE training: Warm-up periods, gradual expert activation
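A minimal sketch of the two standard stabilizers, the Switch-style load-balancing loss and the router Z-loss (PyTorch, top-1 routing for simplicity; the 0.01 and 0.001 coefficients are typical values from the literature, not tuned):
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, n_experts):
    # Switch-style: n_experts * sum_i (fraction of tokens sent to expert i) * (mean router prob of i)
    probs = torch.softmax(router_logits, dim=-1)                      # (tokens, n_experts)
    routed = F.one_hot(router_logits.argmax(dim=-1), n_experts).float()
    return n_experts * (routed.mean(dim=0) * probs.mean(dim=0)).sum()

def router_z_loss(router_logits):
    # Penalizes large router logits so the softmax does not saturate (ST-MoE)
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()

logits = torch.randn(100, 8)                                          # router logits: 100 tokens, 8 experts
aux = 0.01 * load_balancing_loss(logits, 8) + 0.001 * router_z_loss(logits)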
Q: How does RoPE differ from absolute positional embeddings?
A: Absolute embeddings add a fixed vector to each token based on its position. Position is encoded as a fixed property of the token.
RoPE rotates the query and key vectors based on position using rotation matrices. The dot product between queries and keys depends only on their relative distance, not absolute positions.
Benefits:
- Better extrapolation to longer sequences
- Natural decay of attention with distance
- No learned positional parameters
2025 Dominance: RoPE is used in nearly all modern decoder-only LLMs (Llama, PaLM, Mistral, GPT-NeoX); closed models such as GPT-4 do not publish architectural details.
Spring AI Model Configuration
Spring AI provides unified configuration for different LLM providers with consistent parameter tuning options.
Basic Model Configuration
// Spring AI Model Configuration
@Configuration
public class LLMConfiguration {
@Bean
public ChatModel chatModel(OpenAiApi openAiApi) {
return OpenAiChatModel.builder()
.openAiApi(openAiApi)
.options(OpenAiChatOptions.builder()
.model("gpt-4")
.temperature(0.7)
.maxTokens(2000)
// Understanding these parameters:
// - temperature: Controls randomness (0.0 = deterministic, 1.0 = creative)
// - maxTokens: Limits response length
// - topP: Nucleus sampling (0.9 = keep 90% probability mass)
// - presencePenalty: Reduces repetition
.build())
.build();
}
// For models requiring specific attention settings
@Bean
public ChatModel longContextModel() {
return OpenAiChatModel.builder()
.options(OpenAiChatOptions.builder()
.model("gpt-4-turbo") // 128K context
// When to use long context:
// - Document analysis > 50 pages
// - Codebase reviews
// - Multi-document synthesis
.build())
.build();
}
}
Parameter Tuning Guide
Different tasks require different parameter settings for optimal results:
// Effect of sampling parameters
@Service
public class ParameterTuningService {
private final ChatClient chatClient;
public ParameterTuningService(ChatClient.Builder builder) {
this.chatClient = builder.build();
}
// Code generation: Low temperature for consistency
public String generateCode(String description) {
return chatClient.prompt()
.user("Write code to: " + description)
.options(OpenAiChatOptions.builder()
.temperature(0.2) // Low = more deterministic
.maxTokens(1500)
.topP(0.95)
.build())
.call()
.content();
}
// Creative writing: Higher temperature
public String generateStory(String prompt) {
return chatClient.prompt()
.user("Write a story about: " + prompt)
.options(OpenAiChatOptions.builder()
.temperature(0.9) // High = more creative
.maxTokens(2000)
.topP(0.9)
.build())
.call()
.content();
}
// Technical documentation: Balanced settings
public String generateDocs(String code) {
return chatClient.prompt()
.user("Generate documentation for:\n" + code)
.options(OpenAiChatOptions.builder()
.temperature(0.5) // Balanced
.maxTokens(1000)
.presencePenalty(0.3) // Reduce repetition
.build())
.call()
.content();
}
}
Choosing the Right Model
// Service for model selection based on task
@Service
public class ModelSelectionService {
public String chooseModel(String task) {
return switch (task.toLowerCase()) {
case "code", "debug", "refactor" -> "gpt-4", // Best coding performance
case "chat", "general", "qa" -> "gpt-3.5-turbo", // Cost-effective
case "analysis", "document", "long" -> "gpt-4-turbo", // 128K context
case "creative", "story", "poem" -> "gpt-4", // Better creativity
case "simple", "classification" -> "gpt-3.5-turbo", // Faster, cheaper
default -> "gpt-3.5-turbo" // Default to cost-effective
};
}
public ChatOptions getOptionsForTask(String task) {
return switch (task.toLowerCase()) {
case "code" -> OpenAiChatOptions.builder()
.temperature(0.2)
.maxTokens(2000)
.build();
case "creative" -> OpenAiChatOptions.builder()
.temperature(0.9)
.maxTokens(1500)
.presencePenalty(0.5)
.build();
case "analysis" -> OpenAiChatOptions.builder()
.temperature(0.3)
.maxTokens(3000)
.topP(0.95)
.build();
default -> OpenAiChatOptions.builder()
.temperature(0.7)
.maxTokens(1000)
.build();
};
}
}
Architecture Selection Guide
| Use Case | Recommended Architecture | Why |
|---|---|---|
| Chatbots | Decoder-only (GPT, Llama) | Generative, conversational |
| Classification | Encoder-only (BERT) | Better understanding, bidirectional context |
| Translation | Encoder-Decoder (T5) | Sequence-to-sequence transformation |
| Code Generation | Decoder-only with MoE | Specialized experts for coding patterns |
| Long Documents | Hybrid (Transformer + SSM) | Efficient long-context modeling |
| Cost-Sensitive | Dense small models | Predictable inference cost |
| Quality-First | Large MoE models | Best performance with sparse activation |
Summary for Interviews
- Transformer blocks consist of Multi-Head Attention + FFN, wrapped in residuals and normalization.
- Self-attention computes similarity between all token pairs via QK^T, scaled by the square root of the key dimension.
- Multi-head attention allows different heads to focus on different aspects (syntax, semantics, position).
- Pre-norm (LayerNorm before sublayer) is standard for decoder-only LLMs; more stable than post-norm.
- GQA (Grouped-Query Attention) is the 2025 standard: reduces KV cache by 4-8x with minimal quality loss.
- MoE (Mixture-of-Experts) enables scaling to trillions of parameters by activating only relevant experts per token.
- RoPE (Rotary Positional Embeddings) dominates for position encoding; enables better extrapolation to long contexts.
- SwiGLU activation outperforms ReLU/GELU for LLMs; adds gating mechanism to FFN.
- Linear attention variants enable O(N) complexity for 1M+ token contexts; used in hybrid models.
- 2025 architecture trends: MoE for scaling, GQA for efficiency, RoPE for positioning, hybrid (Transformer + SSM) for long context.
For hands-on practice:
1. Study GQA implementations
2. Explore MoE models
3. Build intuition with attention visualization:
- BertViz - Attention visualization
- Transformer Explainer - Interactive attention math
4. Experiment with RoPE