Transformer Architecture: The Engine of LLMs
"The Transformer is the first sequence transduction model relying entirely on attention." — Vaswani et al. (2017)
To pass an LLM interview, simply knowing "it uses attention" is not enough. You must understand why specific design choices were made (Pre-Norm vs Post-Norm, SwiGLU vs ReLU, GQA vs MHA, MoE vs Dense) and the mathematical operations inside the block.
1. The High-Level View
A modern Decoder-Only Transformer (like GPT-4 or Llama 3) consists of a stack of identical blocks. Each block has two main sub-layers:
- Multi-Head Self-Attention (MHA): Mixing information between tokens.
- Feed-Forward Network (FFN): Processing information within each token independently.
Critically, these are wrapped in Residual Connections and Layer Normalization.
2025 Evolution:
- FFN → MoE: Many models now use Mixture-of-Experts for the feed-forward layer
- MHA → GQA: Grouped-Query Attention reduces KV cache memory
- Standard → Hybrid: Some models mix Transformer with State Space Models (Mamba)
2. Self-Attention: The "Routing" Layer
Attention allows tokens to "talk" to each other. It asks specific questions to build context.
The Query, Key, Value Intuition
Every token produces three vectors:
- Query (Q): "What am I looking for?" (e.g., a noun looking for its adjective).
- Key (K): "What do I contain?" (e.g., I am an adjective).
- Value (V): "If you attend to me, here is my information."
The Engineering Perspective
Self-attention computes relationships between tokens through these steps (a minimal sketch follows the list):
- Similarity: Compute a score for every pair of tokens (QK^T)
- Scaling: Divide by the square root of the key dimension so the dot products do not grow large and push softmax into its saturated, near-zero-gradient region
- Normalization: Convert scores to probabilities (softmax)
- Aggregation: Take the weighted sum of value vectors
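A minimal sketch of these four steps (plain PyTorch, single head, no masking; tensor shapes and names are illustrative):
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # similarity + scaling
    weights = F.softmax(scores, dim=-1)            # normalization to probabilities
    return weights @ v                             # aggregation of value vectors

q = k = v = torch.randn(1, 4, 64)
out = scaled_dot_product_attention(q, k, v)        # (1, 4, 64)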
Why Multi-Head?
One head might focus on syntax (noun-verb agreement). Another might focus on semantics (synonyms). Another might look at position (previous word).
| Model | Heads | Head Dimension | Total Dimension |
|---|---|---|---|
| Llama 3 8B | 32 | 128 | 4,096 |
| Llama 3 70B | 64 | 128 | 8,192 |
| GPT-4 | 96+ (est.) | 128 | 12,288 |
3. Grouped Query Attention (GQA) - The 2025 Standard
As context windows grew (8k → 128k → 1M+), the KV Cache became a memory bottleneck. Storing Key and Value matrices for every head is expensive.
The Spectrum: MHA → GQA → MQA
| Mechanism | Query Heads | KV Heads | KV Cache Size | Quality | Speed |
|---|---|---|---|---|---|
| MHA (Multi-Head) | H | H | 100% | Best | Slowest |
| GQA (Grouped-Query) | H | G (where G < H) | G/H (e.g., 1/4 to 1/8) | Near-best | Faster |
| MQA (Multi-Query) | H | 1 | 1/H | Lower | Fastest |
How GQA Works
Instead of each head having its own K/V projections, groups of query heads share K/V:
# MHA: 32 heads, 32 KV pairs
q_heads = 32
kv_heads = 32
# GQA: 32 query heads, 8 KV pairs (groups of 4)
q_heads = 32
kv_heads = 8 # Each KV pair serves 4 query heads
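A minimal runnable sketch of the sharing mechanism (PyTorch; shapes mirror the numbers above, and repeat_interleave is one of several ways real implementations broadcast the shared KV heads):
import torch

batch, seq, q_heads, kv_heads, head_dim = 1, 16, 32, 8, 128
group_size = q_heads // kv_heads                   # 4 query heads per KV head

q = torch.randn(batch, q_heads, seq, head_dim)
k = torch.randn(batch, kv_heads, seq, head_dim)    # only 8 K heads are cached
v = torch.randn(batch, kv_heads, seq, head_dim)    # only 8 V heads are cached

# Broadcast each KV head to its group of query heads at attention time
k = k.repeat_interleave(group_size, dim=1)         # (batch, 32, seq, head_dim)
v = v.repeat_interleave(group_size, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v            # same output shape as MHA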
Benefits:
- Memory reduction: e.g., 8x smaller KV cache for GQA-8 with 64 query heads (4x with 32 query heads)
- Bandwidth reduction: Less memory transfer during inference
- Quality retention: GQA-8 achieves ~98-99% of MHA quality
Adoption:
- Llama 3 70B: Uses GQA for efficient inference
- T5-XXL: GQA-8 for production deployment
- Gemini 2.5: Uses GQA variants for long context
2025: Weighted GQA (WGQA)
Innovation: Learnable parameters for each K/V head enable weighted averaging during fine-tuning.
Benefits:
- 0.53% average improvement over standard GQA
- Converges to MHA quality with no inference overhead
- Model learns optimal grouping during training
4. Mixture-of-Experts (MoE) - The Scaling Revolution
Instead of one monolithic feed-forward network, MoE uses multiple specialized "expert" networks. Each token is routed to the most relevant experts.
Key Components (a routing sketch follows this list)
- Router: Gating network that selects top-k experts for each token
- Experts: Specialized FFN networks (typically 8-64 per layer)
- Load Balancing: Auxiliary loss ensures all experts are utilized
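A minimal sketch of top-k routing over a handful of experts (PyTorch; the expert MLPs, sizes, and the simple loop-based dispatch are illustrative — production kernels batch this, enforce capacity limits, and add the auxiliary losses mentioned above):
import torch
import torch.nn as nn

d_model, n_experts, top_k = 512, 8, 2
router = nn.Linear(d_model, n_experts)             # gating network
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model))
    for _ in range(n_experts))                     # specialized FFNs

def moe_forward(x):                                # x: (tokens, d_model)
    gate = torch.softmax(router(x), dim=-1)        # (tokens, n_experts)
    weights, idx = gate.topk(top_k, dim=-1)        # keep the top-k experts per token
    weights = weights / weights.sum(-1, keepdim=True)
    out = torch.zeros_like(x)
    for e in range(n_experts):                     # dispatch each token to its selected experts
        token_ids, slot = (idx == e).nonzero(as_tuple=True)
        if token_ids.numel():
            out[token_ids] += weights[token_ids, slot, None] * experts[e](x[token_ids])
    return out

y = moe_forward(torch.randn(10, d_model))          # (10, d_model); only 2 of 8 experts run per token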
2025 MoE Models
| Model | Total Params | Active Params | Experts | Top-K | Notes |
|---|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 13B | 8 | 2 | Open-source, matches Llama 2 70B |
| Llama 4 | TBD | TBD | TBD | TBD | MoE variant rumored |
| DeepSeek-V3 | 671B | 37B | 256 | 8 | Shared experts, diverse routing |
| GPT-4 | ~1.7T (est.) | ~220B (est.) | ~128 (est.) | TBD | MoE widely suspected |
| Switch Transformer | 1.6T | TBD | 2048 | 1 | Research milestone |
| GLaM | 1.2T | TBD | 64 | 2 | Google's trillion-parameter model |
Why MoE Matters
Training Efficiency:
- Same quality as dense model with 1/3 the compute (GLaM result)
- Allows scaling to trillions of parameters
- Carbon footprint: Up to 10x reduction vs dense models
Inference Efficiency:
- Only activates relevant experts per token
- Example: Mixtral's ~47B total parameters with ~13B active run at roughly the inference cost of a 13B dense model while competing with far larger dense models on quality
- Enables massive models on consumer hardware (with quantization)
Training Stability (2025 Advances):
- Router Z-loss: Penalizes large router logits, stabilizing training
- Shared experts: Reduces redundancy, increases diversity
- Sigmoid gating: More stable than softmax for expert selection
MoE vs Dense FFN
| Aspect | Dense FFN | MoE |
|---|---|---|
| Parameters | Fixed per layer | Scales with experts |
| Compute | Always active | Sparse activation |
| Quality | Baseline | Same or better |
| Inference | Predictable | Variable (depends on routing) |
| Training | Stable | Requires tricks (Z-loss, aux loss) |
5. Pre-Norm vs. Post-Norm
Post-Norm (Original Transformer, BERT)
LayerNorm is applied after the residual connection.
- Issue: Gradients can explode near the output layers during initialization, requiring a "warm-up" stage.
Pre-Norm (GPT-2, Llama, PaLM)
LayerNorm is applied before the sublayer.
- Benefit: Gradients flow through the "residual highway" (the addition path) untouched. Training is much more stable at scale.
- Trade-off: Potentially slightly less expressive (theoretical debate), but stability wins for LLMs.
2025 Consensus: Pre-Norm is universal for decoder-only LLMs. Post-Norm still used in some encoder-decoder models (T5).
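The difference is only where normalization sits relative to the residual addition. A schematic sketch (Python; "sublayer" stands for attention or the FFN, "norm" for LayerNorm/RMSNorm):
# Post-Norm (original Transformer, BERT): normalize after the residual add
def post_norm_block(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-Norm (GPT-2, Llama): normalize the input; the residual path stays untouched
def pre_norm_block(x, sublayer, norm):
    return x + sublayer(norm(x))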
6. Feed-Forward Networks (FFN) & MoE: The "Knowledge" Layer
If Attention is "routing" information, the FFN (or MoE) is "processing" it. Some researchers posit that FFNs act as Key-Value Memories storing factual knowledge.
Evolution of Activations
- ReLU (Original): Rectified Linear Unit. Problem: "Dead neurons" (zero gradient).
- GELU (GPT-2/3): Gaussian Error Linear Unit. Smoother, probabilistic.
- SwiGLU (PaLM, Llama): Swish-Gated Linear Unit.
What is SwiGLU?
It adds a "gate" to the FFN. Instead of just passing data through, we compute two paths and multiply them. This requires 3 matrix multiplications instead of 2, but consistently yields better performance for the same compute budget.
MoE as FFN Replacement
Standard Transformer Block:
Attention → Dense FFN → Output
MoE Transformer Block:
Attention → Router → Selected Experts → Combined Output
Each expert is a specialized FFN. A simplified picture:
- Expert 1: Specializes in coding patterns
- Expert 2: Specializes in mathematical reasoning
- Expert 3: Specializes in factual knowledge
- Experts 4-8: Other specializations
(In practice, learned specialization is often less cleanly interpretable than this, but the routing principle is the same.)
7. Linear Attention & Hybrid Architectures
Linear Attention (2020+)
Problem: Standard attention has O(N²) complexity due to the QK^T matrix computation.
Solution: Use kernel functions to approximate attention without materializing the explicit N × N matrix (a sketch follows the benefits list).
Benefits:
- O(N) complexity instead of O(N²)
- Enables truly massive context windows (1M+ tokens)
- Trade-off: Slight quality degradation
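A minimal sketch of the kernel trick (PyTorch, non-causal, using the common elu(x) + 1 feature map from the linear-attention literature; this illustrates the idea rather than any specific production model):
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, seq, d_k), v: (batch, seq, d_v)
    q, k = F.elu(q) + 1, F.elu(k) + 1              # positive feature map phi(.)
    kv = torch.einsum('bnd,bne->bde', k, v)        # d_k x d_v summary: O(N), not O(N^2)
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)

q = k = v = torch.randn(1, 1024, 64)
out = linear_attention(q, k, v)                    # never materializes the 1024 x 1024 score matrix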
Adoption:
- RWKV: Recurrent architecture with linear attention
- Mamba/State Space Models: Linear complexity by design
- Hybrid models: Mix Transformer and linear attention layers
2025: Higher-Order Attention (Nexus)
Innovation: Query and Key vectors are outputs of nested self-attention loops.
Benefits:
- Captures multi-hop relationships in single layer
- More expressive than standard first-order attention
- Enables complex reasoning without deep stacks
Status: Research stage, not yet production in major LLMs.
8. Positional Encodings Revisited
RoPE (Rotary Positional Embeddings) - Gold Standard
Used by Llama 2/3/4, PaLM, Mistral, GPT-NeoX.
- Intuition: Encode position by rotating the vector in space.
- Mechanism:
- A token at position m has its query and key vectors rotated by angle m·θ_i (a different θ_i for each pair of dimensions).
- The dot product (similarity) between two tokens then depends only on their relative distance (m − n).
- Why it wins (a minimal code sketch follows this list):
- Decay: Attention naturally decays as tokens get further apart (long-term dependency management).
- Extrapolation: It handles context lengths longer than training data better than absolute embeddings.
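A minimal sketch of the rotation (PyTorch; each pair of dimensions is rotated by a position-dependent angle — real implementations cache the sines/cosines and fuse this into the attention kernel):
import torch

def rope(x, positions, base=10000.0):
    # x: (seq, dim) with even dim; rotate each (even, odd) pair by position * theta_i
    dim = x.size(-1)
    theta = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    angles = positions[:, None].float() * theta[None, :]                   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = rope(torch.randn(8, 64), torch.arange(8))   # dot products now depend only on relative distance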
2025: PaTH Attention
Innovation: Treats in-between words as a path of data-dependent transformations (Householder reflections).
Benefits:
- Positional memory: Tracks state changes across sequences
- Better sequential reasoning: Improved code execution tracking
- Selective forgetting: Combined with Forgetting Transformers (FoX) to down-weight old info
Status: Cutting-edge research, not yet in production models.
9. Interview FAQ
Q: What is the computational complexity of Self-Attention?
A: O(N²·d), where N is the sequence length and d the head dimension.
- Computing QK^T produces an N × N score matrix.
- This is why long context (100k+) is hard: doubling the context quadruples compute (a quick size calculation follows the solutions list below).
- 2025 Solutions:
- FlashAttention-2: Optimizes memory IO, but the math is still O(N²)
- Linear Attention: O(N) complexity, slight quality trade-off
- Ring Attention: Distributed across GPUs, enables 1M+ context
- Sliding Window: Only attend to nearby tokens + global cache
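A quick back-of-the-envelope illustration of the quadratic growth (per head, per layer, fp16 scores; FlashAttention avoids materializing this matrix, but the FLOPs stay quadratic):
# The score matrix is N x N per head: doubling N quadruples it
for n in (8192, 32768, 131072):
    gib = n * n * 2 / 2**30          # 2 bytes per fp16 entry
    print(f"N={n}: {gib:.1f} GiB per head per layer")
# N=8192: 0.1 GiB, N=32768: 2.0 GiB, N=131072: 32.0 GiB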
Q: Why do we need Layer Normalization?
A: To stabilize the distribution of activations across deep networks and to ensure that no single feature dominates in magnitude. Without it, gradients would explode or vanish in a network with 100+ layers.
2025 Update: RMSNorm (Root Mean Square Normalization) is replacing LayerNorm in many models (Llama, Gemma) because it's simpler and faster:
- Normalizes by root mean square instead of mean and variance
- More computationally efficient than LayerNorm
- Better stability for very deep networks
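A minimal sketch of RMSNorm (PyTorch; the epsilon and learned scale follow common open-source implementations such as Llama's, but treat the details as illustrative):
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))   # learned scale, no bias
        self.eps = eps

    def forward(self, x):
        # Divide by the root mean square only; no mean subtraction, no variance
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

out = RMSNorm(512)(torch.randn(2, 16, 512))   # same shape, per-token unit RMS (up to the scale)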
Q: How does a Decoder-only model prevent "cheating" during training?
A: Through Causal Masking. In the self-attention step, we set the attention scores for all future tokens (positions j > i) to −∞. When passed through softmax, these weights become 0, ensuring token i can only attend to tokens 1 through i.
Implementation:
import torch

seq_len = 8
scores = torch.randn(seq_len, seq_len)  # dummy attention logits
# Create causal mask: 1s above the diagonal mark future positions
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
# Apply to attention scores before softmax
scores = scores.masked_fill(mask.bool(), float('-inf'))
Q: What is the purpose of the Residual (Skip) Connection?
A: It mitigates the vanishing gradient problem. By letting gradients flow directly through the addition path (output = x + Sublayer(x)), errors can backpropagate from the last layer to the first without being diminished by repeated multiplications.
2025 Insight: Residual blocks are also the natural unit for gradient checkpointing, which trades compute for memory during training.
Q: When should I use GQA vs MHA vs MQA?
A:
Use MHA when:
- Quality is paramount (research, benchmarks)
- Context window is short (< 8k tokens)
- Memory is not a constraint
Use GQA when:
- You want the default choice for production LLMs in 2025
- Long context (32k-128k tokens)
- Memory-constrained deployment
- Want near-MHA quality with faster inference
Use MQA when:
- Maximal throughput is required
- Can accept 5-10% quality degradation
- Very large batch inference (e.g., API serving)
2025 Verdict: GQA-8 or GQA-4 is the sweet spot for most applications.
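To make the trade-off concrete, a rough fp16 KV-cache size calculation (the formula 2 tensors × layers × KV heads × head_dim × seq_len × 2 bytes is standard; the Llama-3-70B-like shape below is illustrative):
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # K and V tensors, per layer, per KV head, per position
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 2**30

# 80 layers, head_dim 128, 128k context
print(kv_cache_gib(80, 64, 128, 131072))   # MHA (64 KV heads): 320 GiB
print(kv_cache_gib(80, 8, 128, 131072))    # GQA-8:              40 GiB
print(kv_cache_gib(80, 1, 128, 131072))    # MQA:                 5 GiB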
Q: What causes training instability in MoE models?
A: Three main issues:
1. Router collapse: All tokens route to the same expert, leaving others unused
- Fix: Auxiliary load-balancing loss, expert capacity factor
2. Expert overflow: An expert receives more tokens than its capacity factor allows
- Fix: Drop tokens or route to the next layer
3. Gradient imbalance: Some experts receive much larger gradients than others
- Fix: Router Z-loss, normalized expert losses
2025 Solutions (the standard auxiliary losses are sketched after this list):
- Shared experts: Reduces redundancy, improves load balancing
- Sigmoid gating: More stable than softmax for expert selection
- Stable MoE training: Warm-up periods, gradual expert activation
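A minimal sketch of the two standard stabilizers, the Switch-style load-balancing loss and the router Z-loss (PyTorch, top-1 routing for simplicity; the 0.01 and 0.001 coefficients are typical values from the literature, not tuned):
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, n_experts):
    # Switch-style: n_experts * sum_i (fraction of tokens sent to expert i) * (mean router prob of i)
    probs = torch.softmax(router_logits, dim=-1)                      # (tokens, n_experts)
    routed = F.one_hot(router_logits.argmax(dim=-1), n_experts).float()
    return n_experts * (routed.mean(dim=0) * probs.mean(dim=0)).sum()

def router_z_loss(router_logits):
    # Penalizes large router logits so the softmax does not saturate (ST-MoE)
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()

logits = torch.randn(100, 8)                                          # router logits: 100 tokens, 8 experts
aux = 0.01 * load_balancing_loss(logits, 8) + 0.001 * router_z_loss(logits)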
Q: How does RoPE differ from absolute positional embeddings?
A: Absolute embeddings add a fixed vector to each token based on its position. Position is encoded as a fixed property of the token.
RoPE rotates the query and key vectors based on position using rotation matrices. The dot product between queries and keys depends only on their relative distance, not absolute positions.
Benefits:
- Better extrapolation to longer sequences
- Natural decay of attention with distance
- No learned positional parameters
2025 Dominance: RoPE is used in nearly all modern decoder-only LLMs (Llama, PaLM, Mistral, GPT-NeoX); closed models such as GPT-4 do not publish architectural details.
Spring AI Model Configuration
Spring AI provides unified configuration for different LLM providers with consistent parameter tuning options.
Basic Model Configuration
// Spring AI Model Configuration
@Configuration
public class LLMConfiguration {
@Bean
public ChatModel chatModel(OpenAiApi openAiApi) {
return OpenAiChatModel.builder()
.openAiApi(openAiApi)
.options(OpenAiChatOptions.builder()
.model("gpt-4")
.temperature(0.7)
.maxTokens(2000)
// Understanding these parameters:
// - temperature: Controls randomness (0.0 = deterministic, 1.0 = creative)
// - maxTokens: Limits response length
// - topP: Nucleus sampling (0.9 = keep 90% probability mass)
// - presencePenalty: Reduces repetition
.build())
.build();
}
// For models requiring specific attention settings
@Bean
public ChatModel longContextModel() {
return OpenAiChatModel.builder()
.options(OpenAiChatOptions.builder()
.model("gpt-4-turbo") // 128K context
// When to use long context:
// - Document analysis > 50 pages
// - Codebase reviews
// - Multi-document synthesis
.build())
.build();
}
}
Parameter Tuning Guide
Different tasks require different parameter settings for optimal results:
// Effect of sampling parameters
@Service
public class ParameterTuningService {
private final ChatClient chatClient;
public ParameterTuningService(ChatClient.Builder builder) {
this.chatClient = builder.build();
}
// Code generation: Low temperature for consistency
public String generateCode(String description) {
return chatClient.prompt()
.user("Write code to: " + description)
.options(OpenAiChatOptions.builder()
.temperature(0.2) // Low = more deterministic
.maxTokens(1500)
.topP(0.95)
.build())
.call()
.content();
}
// Creative writing: Higher temperature
public String generateStory(String prompt) {
return chatClient.prompt()
.user("Write a story about: " + prompt)
.options(OpenAiChatOptions.builder()
.temperature(0.9) // High = more creative
.maxTokens(2000)
.topP(0.9)
.build())
.call()
.content();
}
// Technical documentation: Balanced settings
public String generateDocs(String code) {
return chatClient.prompt()
.user("Generate documentation for:\n" + code)
.options(OpenAiChatOptions.builder()
.temperature(0.5) // Balanced
.maxTokens(1000)
.presencePenalty(0.3) // Reduce repetition
.build())
.call()
.content();
}
}
Choosing the Right Model
// Service for model selection based on task
@Service
public class ModelSelectionService {
public String chooseModel(String task) {
return switch (task.toLowerCase()) {
case "code", "debug", "refactor" -> "gpt-4", // Best coding performance
case "chat", "general", "qa" -> "gpt-3.5-turbo", // Cost-effective
case "analysis", "document", "long" -> "gpt-4-turbo", // 128K context
case "creative", "story", "poem" -> "gpt-4", // Better creativity
case "simple", "classification" -> "gpt-3.5-turbo", // Faster, cheaper
default -> "gpt-3.5-turbo" // Default to cost-effective
};
}
public ChatOptions getOptionsForTask(String task) {
return switch (task.toLowerCase()) {
case "code" -> OpenAiChatOptions.builder()
.temperature(0.2)
.maxTokens(2000)
.build();
case "creative" -> OpenAiChatOptions.builder()
.temperature(0.9)
.maxTokens(1500)
.presencePenalty(0.5)
.build();
case "analysis" -> OpenAiChatOptions.builder()
.temperature(0.3)
.maxTokens(3000)
.topP(0.95)
.build();
default -> OpenAiChatOptions.builder()
.temperature(0.7)
.maxTokens(1000)
.build();
};
}
}
Architecture Selection Guide
| Use Case | Recommended Architecture | Why |
|---|---|---|
| Chatbots | Decoder-only (GPT, Llama) | Generative, conversational |
| Classification | Encoder-only (BERT) | Better understanding, bidirectional context |
| Translation | Encoder-Decoder (T5) | Sequence-to-sequence transformation |
| Code Generation | Decoder-only with MoE | Specialized experts for coding patterns |
| Long Documents | Hybrid (Transformer + SSM) | Efficient long-context modeling |
| Cost-Sensitive | Dense small models | Predictable inference cost |
| Quality-First | Large MoE models | Best performance with sparse activation |
Summary for Interviews
- Transformer blocks consist of Multi-Head Attention + FFN, wrapped in residuals and normalization.
- Self-attention computes similarity between all token pairs via QK^T, scaled by the square root of the key dimension.
- Multi-head attention allows different heads to focus on different aspects (syntax, semantics, position).
- Pre-norm (LayerNorm before sublayer) is standard for decoder-only LLMs; more stable than post-norm.
- GQA (Grouped-Query Attention) is the 2025 standard: reduces KV cache by 4-8x with minimal quality loss.
- MoE (Mixture-of-Experts) enables scaling to trillions of parameters by activating only relevant experts per token.
- RoPE (Rotary Positional Embeddings) dominates for position encoding; enables better extrapolation to long contexts.
- SwiGLU activation outperforms ReLU/GELU for LLMs; adds gating mechanism to FFN.
- Linear attention variants enable O(N) complexity for 1M+ token contexts; used in hybrid models.
- 2025 architecture trends: MoE for scaling, GQA for efficiency, RoPE for positioning, hybrid (Transformer + SSM) for long context.
For hands-on practice:
1. Study GQA implementations
2. Explore MoE models
3. Build intuition with attention visualization:
- BertViz - Attention visualization
- Transformer Explainer - Interactive attention math
4. Experiment with RoPE