Transformer Architecture: The Engine of LLMs

"The Transformer is the first sequence transduction model relying entirely on attention." — Vaswani et al. (2017)

To pass an LLM interview, simply knowing "it uses attention" is not enough. You must understand why specific design choices were made (Pre-Norm vs Post-Norm, SwiGLU vs ReLU, GQA vs MHA, MoE vs Dense) and the mathematical operations inside the block.


1. The High-Level View

A modern Decoder-Only Transformer (like GPT-4 or Llama 3) consists of a stack of identical blocks. Each block has two main sub-layers:

  1. Multi-Head Self-Attention (MHA): Mixing information between tokens.
  2. Feed-Forward Network (FFN): Processing information within each token independently.

Critically, these are wrapped in Residual Connections and Layer Normalization.
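
As a rough sketch of how one block is wired, here is a minimal pre-norm skeleton (illustrative only; attn and ffn stand in for real attention and feed-forward modules):

import torch.nn as nn

class DecoderBlock(nn.Module):
    # Illustrative skeleton: attn and ffn are assumed callables, not real implementations
    def __init__(self, d_model, attn, ffn):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # sub-layer 1: tokens exchange information
        x = x + self.ffn(self.norm2(x))   # sub-layer 2: per-token processing
        return x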

2025 Evolution:

  • FFN → MoE: Many models now use Mixture-of-Experts for the feed-forward layer
  • MHA → GQA: Grouped-Query Attention reduces KV cache memory
  • Standard → Hybrid: Some models mix Transformer with State Space Models (Mamba)

2. Self-Attention: The "Routing" Layer

Attention allows tokens to "talk" to each other: each token asks specific questions of the others to build context.

The Query, Key, Value Intuition

Every token produces three vectors:

  • Query (Q): "What am I looking for?" (e.g., a noun looking for its adjective).
  • Key (K): "What do I contain?" (e.g., I am an adjective).
  • Value (V): "If you attend to me, here is my information."

The Engineering Perspective

Self-attention computes relationships between tokens through these steps:

  1. Similarity: Computes scores for every pair of tokens (QK^T)
  2. Scaling: Divides scores by √d_k so dot products don't grow with head dimension; without it, large scores saturate the softmax and gradients vanish
  3. Normalization: Converts scores to probabilities (softmax)
  4. Aggregation: Weighted sum of value vectors
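
To make these four steps concrete, here is a minimal single-head sketch in PyTorch (shapes and the causal mask are illustrative assumptions, not a specific model's implementation):

import math
import torch

def single_head_attention(Q, K, V, causal=True):
    # Q, K, V: (seq_len, d_k) for one head
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1)         # 1. Similarity: (seq_len, seq_len)
    scores = scores / math.sqrt(d_k)         # 2. Scaling: keep magnitudes in check
    if causal:
        n = scores.size(-1)
        future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float('-inf'))
    weights = torch.softmax(scores, dim=-1)  # 3. Normalization: rows sum to 1
    return weights @ V                       # 4. Aggregation: weighted sum of values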

Why Multi-Head?

One head might focus on syntax (noun-verb agreement). Another might focus on semantics (synonyms). Another might look at position (previous word).

| Model | Heads | Head Dimension | Total Dimension |
|---|---|---|---|
| Llama 3 8B | 32 | 128 | 4,096 |
| Llama 3 70B | 64 | 128 | 8,192 |
| GPT-4 | 96+ (est.) | 128 | 12,288 |

3. Grouped Query Attention (GQA) - The 2025 Standard

As context windows grew (8k → 128k → 1M+), the KV Cache became a memory bottleneck. Storing Key and Value matrices for every head is expensive.
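
A back-of-the-envelope calculation shows why (assuming a Llama-3-8B-like configuration of 32 layers, 8 KV heads, head dimension 128, and an fp16 cache; the numbers are illustrative):

# KV cache size = 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes per value
n_layers, n_kv_heads, head_dim = 32, 8, 128   # assumed GQA-8 config
bytes_per_value, seq_len = 2, 128_000         # fp16, 128k-token context

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")  # ~15.6 GiB; ~4x more with 32 MHA KV heads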

The Spectrum: MHA → GQA → MQA

| Mechanism | Query Heads | KV Heads | KV Cache Size | Quality | Speed |
|---|---|---|---|---|---|
| MHA (Multi-Head) | H | H | 100% | Best | Slowest |
| GQA (Grouped-Query) | H | G (where G < H) | G/H of MHA | Near-best | Faster |
| MQA (Multi-Query) | H | 1 | 1/H of MHA | Lower | Fastest |

How GQA Works

Instead of each head having its own K/V projections, groups of query heads share K/V:

# MHA: 32 heads, 32 KV pairs
q_heads = 32
kv_heads = 32

# GQA: 32 query heads, 8 KV pairs (groups of 4)
q_heads = 32
kv_heads = 8 # Each KV pair serves 4 query heads
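
In tensor terms, the sharing is often implemented by projecting K/V with only kv_heads heads and repeating them so every query head has a partner (a hedged sketch with illustrative shapes):

import torch

batch, seq_len, head_dim = 2, 16, 128
q_heads, kv_heads = 32, 8
group_size = q_heads // kv_heads                       # 4 query heads per KV head

q = torch.randn(batch, q_heads, seq_len, head_dim)
k = torch.randn(batch, kv_heads, seq_len, head_dim)    # only 8 K heads are cached
v = torch.randn(batch, kv_heads, seq_len, head_dim)    # only 8 V heads are cached

# Repeat K/V so each group of 4 query heads shares one KV head
k = k.repeat_interleave(group_size, dim=1)             # (batch, 32, seq_len, head_dim)
v = v.repeat_interleave(group_size, dim=1)
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5     # then attention proceeds as usual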

Benefits:

  • Memory reduction: KV cache shrinks by H/G (e.g., 4x for 32 query heads sharing 8 KV heads; 8x when 64 query heads share 8)
  • Bandwidth reduction: Less memory transfer during inference
  • Quality retention: GQA-8 achieves ~98-99% of MHA quality

Adoption:

  • Llama 3 70B: Uses GQA for efficient inference
  • T5-XXL: GQA-8 for production deployment
  • Gemini 2.5: Uses GQA variants for long context

2025: Weighted GQA (WGQA)

Innovation: Learnable parameters for each K/V head enable weighted averaging during fine-tuning.

Benefits:

  • 0.53% average improvement over standard GQA
  • Converges to MHA quality with no inference overhead
  • Model learns optimal grouping during training

4. Mixture-of-Experts (MoE) - The Scaling Revolution

Instead of one monolithic feed-forward network, MoE uses multiple specialized "expert" networks. Each token is routed to the most relevant experts.

Architecture

Key Components

  1. Router: Gating network that selects top-k experts for each token
  2. Experts: Specialized FFN networks (typically 8-64 per layer)
  3. Load Balancing: Auxiliary loss ensures all experts are utilized
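
A simplified router sketch (plain softmax gating with top-k selection; the expert count, top_k, and the absence of capacity limits and auxiliary losses are simplifying assumptions):

import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    def __init__(self, d_model, num_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                       # x: (num_tokens, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        return topk_w, topk_idx                 # mixing weights and expert ids per token

Each selected expert then runs its FFN on the token, and the expert outputs are combined with these weights.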

2025 MoE Models

| Model | Total Params | Active Params | Experts | Top-K | Notes |
|---|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 13B | 8 | 2 | Open-source, matches Llama 2 70B |
| Llama 4 | TBD | TBD | TBD | TBD | MoE variant rumored |
| DeepSeek-V3 | 671B | 37B | 256 | 8 | Shared experts, diverse routing |
| GPT-4 | ~1.7T (est.) | ~220B (est.) | ~128 (est.) | TBD | MoE widely suspected |
| Switch Transformer | 1.6T | TBD | 2048 | 1 | Research milestone |
| GLaM | 1.2T | TBD | 64 | 2 | Google's trillion-parameter model |

Why MoE Matters

Training Efficiency:

  • Same quality as dense model with 1/3 the compute (GLaM result)
  • Allows scaling to trillions of parameters
  • Carbon footprint: Up to 10x reduction vs dense models

Inference Efficiency:

  • Only activates relevant experts per token
  • Example: Mixtral's ~47B total parameters with 13B active run at roughly the speed of a 13B dense model while matching a dense 70B in quality
  • Enables massive models on consumer hardware (with quantization)

Training Stability (2025 Advances):

  • Router Z-loss: Penalizes large router logits, stabilizing training
  • Shared experts: Reduces redundancy, increases diversity
  • Sigmoid gating: More stable than softmax for expert selection

MoE vs Dense FFN

| Aspect | Dense FFN | MoE |
|---|---|---|
| Parameters | Fixed per layer | Scales with experts |
| Compute | Always active | Sparse activation |
| Quality | Baseline | Same or better |
| Inference | Predictable | Variable (depends on routing) |
| Training | Stable | Requires tricks (Z-loss, aux loss) |

5. Pre-Norm vs. Post-Norm

Post-Norm (Original Transformer, BERT)

LayerNorm is applied after the residual connection.

  • Issue: Gradients can explode near the output layers during initialization, requiring a "warm-up" stage.

Pre-Norm (GPT-2, Llama, PaLM)

LayerNorm is applied before the sublayer.

  • Benefit: Gradients flow through the "residual highway" (the addition path) untouched. Training is much more stable at scale.
  • Trade-off: Potentially slightly less expressive (theoretical debate), but stability wins for LLMs.
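
The difference is only where normalization sits relative to the residual addition (a schematic sketch; sublayer stands for attention or the FFN):

# Post-Norm (original Transformer, BERT)
def post_norm_block(x, sublayer, norm):
    return norm(x + sublayer(x))       # normalize after the residual add

# Pre-Norm (GPT-2, Llama, PaLM)
def pre_norm_block(x, sublayer, norm):
    return x + sublayer(norm(x))       # the residual path stays untouched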

2025 Consensus: Pre-Norm is universal for decoder-only LLMs. Post-Norm still used in some encoder-decoder models (T5).


6. Feed-Forward Networks (FFN) & MoE: The "Knowledge" Layer

If Attention is "routing" information, the FFN (or MoE) is "processing" it. Some researchers posit that FFNs act as Key-Value Memories storing factual knowledge.

Evolution of Activations

  1. ReLU (Original): Rectified Linear Unit. Problem: "Dead neurons" (zero gradient).
  2. GELU (GPT-2/3): Gaussian Error Linear Unit. Smoother, probabilistic.
  3. SwiGLU (PaLM, Llama): Swish-Gated Linear Unit.

What is SwiGLU?

It adds a "gate" to the FFN. Instead of just passing data through, we compute two paths and multiply them. This requires 3 matrix multiplications instead of 2, but consistently yields better performance for the same compute budget.
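
A hedged sketch of a SwiGLU FFN in the Llama style (the dimension names and bias-free projections are assumptions based on common open implementations):

import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate path
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value path
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # projection back

    def forward(self, x):
        # Swish(gate) * up, then project down: 3 matmuls instead of 2
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))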

MoE as FFN Replacement

Standard Transformer Block:

Attention → Dense FFN → Output

MoE Transformer Block:

Attention → Router → Selected Experts → Combined Output

Each expert is a specialized FFN:

  • Expert 1: Specializes in coding patterns
  • Expert 2: Specializes in mathematical reasoning
  • Expert 3: Specializes in factual knowledge
  • Expert 4-8: Other specializations

7. Linear Attention & Hybrid Architectures

Linear Attention (2020+)

Problem: Standard attention has O(N²) complexity due to the QK^T matrix computation.

Solution: Use kernel functions to approximate attention without explicit N × N matrix.
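
The trick is associativity: with a feature map φ, (φ(Q)φ(K)^T)V can be computed as φ(Q)(φ(K)^T V), so the N × N matrix never materializes. A hedged, non-causal sketch using an elu+1 feature map (one of several choices in the literature):

import torch
import torch.nn.functional as F

def linear_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    phi_q = F.elu(Q) + 1                        # positive feature map
    phi_k = F.elu(K) + 1
    kv = phi_k.transpose(-2, -1) @ V            # (d_k, d_v): cost is linear in seq_len
    z = phi_q @ phi_k.sum(dim=0, keepdim=True).transpose(-2, -1)  # (seq_len, 1) normalizer
    return (phi_q @ kv) / (z + 1e-6)            # (seq_len, d_v), no N x N matrix formed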

Benefits:

  • O(N) complexity instead of O(N²)
  • Enables truly massive context windows (1M+ tokens)
  • Trade-off: Slight quality degradation

Adoption:

  • RWKV: Recurrent architecture with linear attention
  • Mamba/State Space Models: Linear complexity by design
  • Hybrid models: Mix Transformer and linear attention layers

2025: Higher-Order Attention (Nexus)

Innovation: Query and Key vectors are outputs of nested self-attention loops.

Benefits:

  • Captures multi-hop relationships in single layer
  • More expressive than standard first-order attention
  • Enables complex reasoning without deep stacks

Status: Research stage, not yet production in major LLMs.


8. Positional Encodings Revisited

RoPE (Rotary Positional Embeddings) - Gold Standard

Used by Llama 2/3/4, PaLM, Mistral, GPT-NeoX.

  • Intuition: Encode position by rotating the vector in space.
  • Mechanism:
    • Tokens at position m are rotated by an angle mθ.
    • The dot product (similarity) between two tokens depends only on their relative distance (m − n).
  • Why it wins:
    • Decay: Attention naturally decays as tokens get further apart (long-term dependency management).
    • Extrapolation: It handles context lengths longer than training data better than absolute embeddings.
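
A hedged sketch of the core rotation applied to a query or key vector (single head, standard θ schedule, half-split layout; real implementations vectorize and cache the angles):

import torch

def apply_rope(x, positions, base=10000.0):
    # x: (seq_len, head_dim); positions: (seq_len,) token indices
    seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # theta_i
    angles = positions.float().unsqueeze(-1) * freqs                   # m * theta_i
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # rotate each (x1, x2) pair by its position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# e.g. q = apply_rope(q, torch.arange(q.size(0)))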

2025: PaTH Attention

Innovation: Treats in-between words as a path of data-dependent transformations (Householder reflections).

Benefits:

  • Positional memory: Tracks state changes across sequences
  • Better sequential reasoning: Improved code execution tracking
  • Selective forgetting: Combined with Forgetting Transformers (FoX) to down-weight old info

Status: Cutting-edge research, not yet in production models.


9. Interview FAQ

Q: What is the computational complexity of Self-Attention?

A: O(N²), where N is the sequence length.

  • Computing QK^T results in an N × N matrix.
  • This is why long context (100k+) is hard; doubling context quadruples compute.
  • 2025 Solutions:
    • FlashAttention-2: Optimizes IO but is still O(N²) mathematically
    • Linear Attention: O(N) complexity, slight quality trade-off
    • Ring Attention: Distributed across GPUs, enables 1M+ context
    • Sliding Window: Only attend to nearby tokens + global cache
Q: Why do we need Layer Normalization?

A: To stabilize the distribution of activations across deep networks and to ensure that no single feature dominates magnitude-wise. Without normalization, gradients would tend to explode or vanish in a network with 100+ layers.

2025 Update: RMSNorm (Root Mean Square Normalization) is replacing LayerNorm in many models (Llama, Gemma) because it's simpler and faster:

  • Normalizes by root mean square instead of mean and variance
  • More computationally efficient than LayerNorm
  • Better stability for very deep networks
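
A minimal RMSNorm sketch (the eps value and the bias-free learned scale follow common open implementations):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learned scale, no bias
        self.eps = eps

    def forward(self, x):
        # normalize by root mean square only: no mean subtraction, no variance
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight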
Q: How does a Decoder-only model prevent "cheating" during training?

A: Through Causal Masking. In the self-attention step, we set the attention scores for all future tokens (positions j > i) to −∞. When passed through softmax, these become 0, ensuring token i can only attend to positions 0…i.

Implementation:

import torch

# Create causal mask: True above the diagonal marks future positions
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
# Apply to attention scores before the softmax
scores = scores.masked_fill(mask.bool(), float('-inf'))
Q: What is the purpose of the Residual (Skip) Connection?

A: It mitigates the vanishing gradient problem. By allowing gradients to flow directly through the network via addition (x + f(x)), errors can backpropagate from the last layer to the first without being diminished by multiple multiplication steps.

2025 Insight: Residual connections also combine well with gradient checkpointing, which trades compute for memory during training.

Q: When should I use GQA vs MHA vs MQA?

A:

Use MHA when:

  • Quality is paramount (research, benchmarks)
  • Context window is short (< 8k tokens)
  • Memory is not a constraint

Use GQA when:

  • Default choice for production LLMs in 2025
  • Long context (32k-128k tokens)
  • Memory-constrained deployment
  • Want near-MHA quality with faster inference

Use MQA when:

  • Maximal throughput is required
  • Can accept 5-10% quality degradation
  • Very large batch inference (e.g., API serving)

2025 Verdict: GQA-8 or GQA-4 is the sweet spot for most applications.

Q: What causes training instability in MoE models?

A: Three main issues:

  1. Router collapse: All tokens route to the same expert, leaving others unused

    • Fix: Auxiliary load-balancing loss, expert capacity factor
  2. Expert overflow: Expert receives more tokens than its capacity factor allows

    • Fix: Drop tokens or route to next layer
  3. Gradient imbalance: Some experts receive much larger gradients than others

    • Fix: Router Z-loss, normalized expert losses

2025 Solutions:

  • Shared experts: Reduces redundancy, improves load balancing
  • Sigmoid gating: More stable than softmax for expert selection
  • Stable MoE training: Warm-up periods, gradual expert activation
Q: How does RoPE differ from absolute positional embeddings?

A: Absolute embeddings add a fixed vector to each token based on its position. Position is encoded as a fixed property of the token.

RoPE rotates the query and key vectors based on position using rotation matrices. The dot product between queries and keys depends only on their relative distance, not absolute positions.

Benefits:

  • Better extrapolation to longer sequences
  • Natural decay of attention with distance
  • No learned positional parameters

2025 Dominance: RoPE is used in almost all decoder-only LLMs (Llama, GPT-4, PaLM, Mistral).


Spring AI Model Configuration

Spring AI provides unified configuration for different LLM providers with consistent parameter tuning options.

Basic Model Configuration

// Spring AI Model Configuration
@Configuration
public class LLMConfiguration {

    @Bean
    public ChatModel chatModel(OpenAiApi openAiApi) {
        return OpenAiChatModel.builder()
                .openAiApi(openAiApi)
                .options(OpenAiChatOptions.builder()
                        .model("gpt-4")
                        .temperature(0.7)
                        .maxTokens(2000)
                        // Understanding these parameters:
                        // - temperature: Controls randomness (0.0 = deterministic, 1.0 = creative)
                        // - maxTokens: Limits response length
                        // - topP: Nucleus sampling (0.9 = keep 90% probability mass)
                        // - presencePenalty: Reduces repetition
                        .build())
                .build();
    }

    // For models requiring specific attention settings
    @Bean
    public ChatModel longContextModel(OpenAiApi openAiApi) {
        return OpenAiChatModel.builder()
                .openAiApi(openAiApi)
                .options(OpenAiChatOptions.builder()
                        .model("gpt-4-turbo") // 128K context
                        // When to use long context:
                        // - Document analysis > 50 pages
                        // - Codebase reviews
                        // - Multi-document synthesis
                        .build())
                .build();
    }
}

Parameter Tuning Guide

Different tasks require different parameter settings for optimal results:

// Effect of sampling parameters
@Service
public class ParameterTuningService {

    private final ChatClient chatClient;

    public ParameterTuningService(ChatClient.Builder chatClientBuilder) {
        this.chatClient = chatClientBuilder.build();
    }

    // Code generation: Low temperature for consistency
    public String generateCode(String description) {
        return chatClient.prompt()
                .user("Write code to: " + description)
                .options(OpenAiChatOptions.builder()
                        .temperature(0.2) // Low = more deterministic
                        .maxTokens(1500)
                        .topP(0.95)
                        .build())
                .call()
                .content();
    }

    // Creative writing: Higher temperature
    public String generateStory(String prompt) {
        return chatClient.prompt()
                .user("Write a story about: " + prompt)
                .options(OpenAiChatOptions.builder()
                        .temperature(0.9) // High = more creative
                        .maxTokens(2000)
                        .topP(0.9)
                        .build())
                .call()
                .content();
    }

    // Technical documentation: Balanced settings
    public String generateDocs(String code) {
        return chatClient.prompt()
                .user("Generate documentation for:\n" + code)
                .options(OpenAiChatOptions.builder()
                        .temperature(0.5) // Balanced
                        .maxTokens(1000)
                        .presencePenalty(0.3) // Reduce repetition
                        .build())
                .call()
                .content();
    }
}

Choosing the Right Model

// Service for model selection based on task
@Service
public class ModelSelectionService {

    public String chooseModel(String task) {
        return switch (task.toLowerCase()) {
            case "code", "debug", "refactor" -> "gpt-4";          // Best coding performance
            case "chat", "general", "qa" -> "gpt-3.5-turbo";      // Cost-effective
            case "analysis", "document", "long" -> "gpt-4-turbo"; // 128K context
            case "creative", "story", "poem" -> "gpt-4";          // Better creativity
            case "simple", "classification" -> "gpt-3.5-turbo";   // Faster, cheaper
            default -> "gpt-3.5-turbo";                           // Default to cost-effective
        };
    }

    public ChatOptions getOptionsForTask(String task) {
        return switch (task.toLowerCase()) {
            case "code" -> OpenAiChatOptions.builder()
                    .temperature(0.2)
                    .maxTokens(2000)
                    .build();
            case "creative" -> OpenAiChatOptions.builder()
                    .temperature(0.9)
                    .maxTokens(1500)
                    .presencePenalty(0.5)
                    .build();
            case "analysis" -> OpenAiChatOptions.builder()
                    .temperature(0.3)
                    .maxTokens(3000)
                    .topP(0.95)
                    .build();
            default -> OpenAiChatOptions.builder()
                    .temperature(0.7)
                    .maxTokens(1000)
                    .build();
        };
    }
}

Architecture Selection Guide

| Use Case | Recommended Architecture | Why |
|---|---|---|
| Chatbots | Decoder-only (GPT, Llama) | Generative, conversational |
| Classification | Encoder-only (BERT) | Better understanding, bidirectional context |
| Translation | Encoder-Decoder (T5) | Sequence-to-sequence transformation |
| Code Generation | Decoder-only with MoE | Specialized experts for coding patterns |
| Long Documents | Hybrid (Transformer + SSM) | Efficient long-context modeling |
| Cost-Sensitive | Dense small models | Predictable inference cost |
| Quality-First | Large MoE models | Best performance with sparse activation |

Summary for Interviews

  1. Transformer blocks consist of Multi-Head Attention + FFN, wrapped in residuals and normalization.
  2. Self-attention computes similarity between all token pairs via QK^T, scaled by the square root of the key dimension.
  3. Multi-head attention allows different heads to focus on different aspects (syntax, semantics, position).
  4. Pre-norm (LayerNorm before sublayer) is standard for decoder-only LLMs; more stable than post-norm.
  5. GQA (Grouped-Query Attention) is the 2025 standard: reduces KV cache by 4-8x with minimal quality loss.
  6. MoE (Mixture-of-Experts) enables scaling to trillions of parameters by activating only relevant experts per token.
  7. RoPE (Rotary Positional Embeddings) dominates for position encoding; enables better extrapolation to long contexts.
  8. SwiGLU activation outperforms ReLU/GELU for LLMs; adds gating mechanism to FFN.
  9. Linear attention variants enable O(N) complexity for 1M+ token contexts; used in hybrid models.
  10. 2025 architecture trends: MoE for scaling, GQA for efficiency, RoPE for positioning, hybrid (Transformer + SSM) for long context.
Implementation Resources

For hands-on practice:

1. Study GQA implementations:

2. Explore MoE models:

3. Build intuition with attention viz:

4. Experiment with RoPE: