Introduction to Large Language Models
"LLMs are not just text predictors; they are compressed representations of the world's knowledge, accessible through natural language."
Large Language Models (LLMs) represent a paradigm shift in Artificial Intelligence, moving from task-specific models to general-purpose reasoning engines. For software engineers and AI practitioners, understanding LLMs requires looking beyond the hype and grasping the underlying statistical and architectural principles that drive them.
What is an LLM?
At its core, an LLM is a probabilistic engine that predicts the next token from the preceding context; formally, it models P(x_t | x_1, ..., x_{t-1}). For engineers, it is usually more useful to understand what LLMs can do than to work through the underlying math.
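To make "predicts the next token" concrete, here is a minimal sketch of a single decoding step. The three-word vocabulary and the logit values are made up for illustration; a real model produces one logit per vocabulary entry from its final layer.

```java
import java.util.Random;

/** Toy next-token step: logits -> softmax probabilities -> greedy or sampled token. */
class NextTokenSketch {

    /** Numerically stable softmax: subtract the max logit before exponentiating. */
    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double l : logits) max = Math.max(max, l);
        double[] probs = new double[logits.length];
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            probs[i] = Math.exp(logits[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }

    /** Greedy decoding: pick the index with the highest probability. */
    static int argmax(double[] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) best = i;
        }
        return best;
    }

    /** Sample an index from the distribution by walking the cumulative sum. */
    static int sample(double[] probs, Random rng) {
        double r = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < probs.length; i++) {
            cumulative += probs[i];
            if (r < cumulative) return i;
        }
        return probs.length - 1; // guard against floating-point rounding
    }

    public static void main(String[] args) {
        String[] vocab = {"Paris", "London", "banana"}; // hypothetical vocabulary
        double[] logits = {4.0, 2.0, -1.0};             // made-up scores for "The capital of France is"
        double[] probs = softmax(logits);
        System.out.println("greedy:  " + vocab[argmax(probs)]);
        System.out.println("sampled: " + vocab[sample(probs, new Random(42))]);
    }
}
```

Generation repeats this step: the chosen token is appended to the context and the model is run again.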
Modern LLM Capabilities
LLMs have evolved from simple text completion to sophisticated reasoning engines:
- Code Generation: Write, debug, and explain code across multiple languages
- Document Analysis: Extract insights from technical documentation, research papers, and contracts
- Conversation Systems: Maintain context across multi-turn dialogue with memory
- Tool Use: Interact with APIs, databases, and external systems
- Multi-step Reasoning: Break down complex problems into intermediate steps
The Engineering Perspective
For production systems, think of LLMs as text-to-text transformations:
// Conceptual view: LLM as text processor
Input: "Summarize this document: [content]"
Processing: Model traverses layers of attention and feed-forward networks
Output: "[summary]"
The key insight: LLMs learn patterns from training data and apply them during inference. They don't "know" facts in the human sense—they've seen statistical correlations that they can reproduce.
Spring AI Integration Setup
Spring AI provides a unified abstraction layer for working with LLMs in Spring Boot applications. This simplifies switching between models and providers while maintaining consistent APIs.
Basic ChatClient Configuration
# application.properties (using Doppler for environment variables)
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.anthropic.api-key=${ANTHROPIC_API_KEY}
# Or use the recommended Doppler injection pattern:
# spring.ai.openai.api-key=${doppler.OPENAI_API_KEY}
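// Service layer for LLM interactions
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;

@Service
public class LLMChatService {

    private final ChatClient chatClient;

    public LLMChatService(ChatModel chatModel) {
        this.chatClient = ChatClient.builder(chatModel).build();
    }

    // Blocking call: returns the full response text once generation finishes
    public String chat(String userMessage) {
        return chatClient.prompt()
                .user(userMessage)
                .call()
                .content();
    }

    // Streaming responses for real-time applications
    public Flux<String> chatStream(String userMessage) {
        return chatClient.prompt()
                .user(userMessage)
                .stream()
                .content();
    }
}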
Model Selection Guide
Choosing the right model depends on your use case, budget, and performance requirements:
| Use Case | Recommended Model | Why |
|---|---|---|
| Code Generation & Debugging | Claude Opus 4.7 or GLM-5.1 | 87.6% SWE-bench Verified (Opus 4.7), GLM-5.1 is MIT-licensed |
| General-Purpose Chat | GPT-5.4 or Gemini 3.1 Pro | Strong reasoning, native multimodal |
| Long Document Analysis | Claude Opus 4.7 or Gemini 3.1 Pro | 1M token context (Opus 4.7), 1M–2M tokens (Gemini) |
| Cost-Sensitive Applications | Gemma 4 26B MoE (self-hosted) | Apache 2.0, outperforms models 20x its size |
| Enterprise Self-Hosted | Granite 4.1 8B/30B (IBM) | Apache 2.0, 8B matches 32B MoE, 512K context |
| On-Premise Deployment | GLM-5.1 (MIT) or Gemma 4 (Apache 2.0) | Fully open-source, frontier-grade |
| Multilingual Applications | Qwen 3.6-Plus or Gemma 4 (140+ languages) | Strong non-English, agentic capabilities |
| Complex Reasoning | OpenAI o3/o4 or Claude Opus 4.7 | Explicit reasoning chains, math/science tasks |
| Edge / Mobile | Gemma 4 E2B/E4B | 128K context on mobile/IoT devices |
| Life Sciences / Drug Discovery | GPT-Rosalind | First LLM purpose-built for biology and genomics research |
Configuration Example
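import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Primary;

@Configuration
public class LLMConfiguration {

    // Understanding these parameters:
    // - temperature: Controls randomness (near 0.0 = almost deterministic, higher = more varied)
    // - maxTokens: Limits response length
    // - topP: Nucleus sampling (0.9 = keep 90% probability mass)
    // - presencePenalty: Penalizes tokens that already appeared, which reduces repetition
    @Bean
    @Primary // two ChatModel beans exist, so mark the default one for unqualified injection
    public ChatModel chatModel(OpenAiApi openAiApi) {
        return OpenAiChatModel.builder()
                .openAiApi(openAiApi)
                // defaultOptions(...) is the builder method name in Spring AI 1.x; verify for your version
                .defaultOptions(OpenAiChatOptions.builder()
                        .model("gpt-4")
                        .temperature(0.7)
                        .maxTokens(2000)
                        .build())
                .build();
    }

    // For long-context use cases
    @Bean
    public ChatModel longContextModel(OpenAiApi openAiApi) {
        return OpenAiChatModel.builder()
                .openAiApi(openAiApi) // the builder needs an API client here as well
                .defaultOptions(OpenAiChatOptions.builder()
                        .model("gpt-4o") // 128K context
                        .maxTokens(4000)
                        .build())
                .build();
    }
}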
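To make the temperature and topP comments above concrete, the sketch below applies both to a small probability distribution. It is an illustration of the general technique only, not how any particular provider implements sampling; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrates what temperature and top-p do to a next-token distribution. */
class SamplingParamsSketch {

    /** Softmax over logits divided by temperature (temperature must be > 0). */
    static double[] softmaxWithTemperature(double[] logits, double temperature) {
        double max = Double.NEGATIVE_INFINITY;
        for (double l : logits) max = Math.max(max, l);
        double[] probs = new double[logits.length];
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            probs[i] = Math.exp((logits[i] - max) / temperature);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }

    /**
     * Nucleus (top-p) filter: keep the most probable tokens until their
     * cumulative mass reaches topP, zero out the rest, then renormalize.
     */
    static double[] topPFilter(double[] probs, double topP) {
        Integer[] order = new Integer[probs.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> probs[i]).reversed());

        double[] kept = new double[probs.length];
        double mass = 0.0;
        for (int idx : order) {
            kept[idx] = probs[idx];
            mass += probs[idx];
            if (mass >= topP) break;
        }
        for (int i = 0; i < kept.length; i++) kept[i] /= mass;
        return kept;
    }

    public static void main(String[] args) {
        double[] logits = {2.0, 1.0, 0.0}; // made-up logits
        System.out.println("T=0.1: " + Arrays.toString(softmaxWithTemperature(logits, 0.1)));
        System.out.println("T=1.0: " + Arrays.toString(softmaxWithTemperature(logits, 1.0)));
        System.out.println("T=2.0: " + Arrays.toString(softmaxWithTemperature(logits, 2.0)));
        System.out.println("top-p 0.9: " + Arrays.toString(topPFilter(softmaxWithTemperature(logits, 1.0), 0.9)));
    }
}
```

Lower temperatures concentrate probability on the top token, higher ones flatten the distribution, and top-p removes the low-probability tail before sampling.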
Model Architectures: The "Big Three" + Modern Evolutions
In 2017, the "Attention Is All You Need" paper introduced the Transformer. Since then, the architecture has branched into three distinct families and hybrid variants. You must know the difference between these for interviews.
1. Encoder-Only (Auto-Encoding)
- Mechanism: Corrupts input (masks words) and tries to reconstruct it using bidirectional context (looking at both left and right context).
- Core Ability: "Understanding" and classification. These models create rich vector representations of text.
- Use Cases: Sentiment analysis, Named Entity Recognition (NER), Search/Embeddings.
- Examples: BERT, RoBERTa, DistilBERT.
2. Decoder-Only (Auto-Regressive)
- Mechanism: Predicts the next token based only on previous tokens (causal masking). It cannot "see" the future.
- Core Ability: Generative tasks.
- Use Cases: Chatbots, Code Generation, Storytelling.
- Examples: GPT-3/4, Llama 3/4, Claude, Gemini.
- Note: This is the dominant architecture for modern "Generative AI", and many recent models in this family are Mixture-of-Experts (MoE) variants.
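The difference between bidirectional context (encoder-only) and causal masking (decoder-only) comes down to which positions each token is allowed to attend to. The following sketch builds both masks for a short sequence; the class name and the text rendering are illustrative choices.

```java
/** Builds attention masks: mask[i][j] == true means position i may attend to position j. */
class AttentionMaskSketch {

    /** Decoder-style causal mask: each position sees itself and earlier positions only. */
    static boolean[][] causalMask(int n) {
        boolean[][] mask = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                mask[i][j] = true;
            }
        }
        return mask;
    }

    /** Encoder-style bidirectional mask: every position sees every position. */
    static boolean[][] bidirectionalMask(int n) {
        boolean[][] mask = new boolean[n][n];
        for (boolean[] row : mask) java.util.Arrays.fill(row, true);
        return mask;
    }

    /** Renders a mask as rows of 'x' (visible) and '.' (masked). */
    static String render(boolean[][] mask) {
        StringBuilder sb = new StringBuilder();
        for (boolean[] row : mask) {
            for (boolean visible : row) sb.append(visible ? 'x' : '.');
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("Causal (decoder-only):");
        System.out.print(render(causalMask(4)));
        System.out.println("Bidirectional (encoder-only):");
        System.out.print(render(bidirectionalMask(4)));
    }
}
```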
3. Encoder-Decoder (Seq2Seq)
- Mechanism: An Encoder processes the input into contextual representations, and a Decoder generates the output token by token while attending to them through cross-attention.
- Core Ability: Transforming one sequence into another.
- Use Cases: Translation (English → French), Summarization (Long Article → Abstract).
- Examples: T5, BART.
4. Hybrid Architectures (2024+)
The Latest Frontier: Combining Transformer blocks with State Space Models (SSM) like Mamba.
- Mechanism: Interleaves Transformer attention layers with linear-complexity SSM layers.
- Advantages:
  - O(n) complexity instead of O(n²) for attention
  - Better long-context modeling without memory blowout
  - Maintains strong performance on benchmarks
- Examples:
  - Jamba (AI21 Labs): Transformer + Mamba hybrid
  - RecurrentGemma (Google): Griffin architecture mixing attention and linear recurrence
  - Qwen3-Next: Uses Gated DeltaNets for linear attention
  - Nemotron 3 (NVIDIA): Incorporates Mamba-2 layers
- Performance: Research shows these hybrids often outperform pure Transformers or pure SSM models.
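The linear-complexity claim is easiest to see in a toy recurrence. The sketch below is a deliberately simplified scalar gated recurrence, not Mamba's selective scan or any production SSM layer; the fixed `decay` parameter is an assumption made for illustration (real layers learn input-dependent gates over vector states).

```java
import java.util.Arrays;

/** Toy linear recurrence: one pass over the sequence with a constant-size state. */
class LinearRecurrenceSketch {

    /**
     * Computes h_t = decay * h_{t-1} + (1 - decay) * x_t for every step.
     * Each step needs only the previous state, so memory stays constant and
     * total work grows linearly with the sequence length.
     */
    static double[] scan(double[] x, double decay) {
        double[] h = new double[x.length];
        double state = 0.0;
        for (int t = 0; t < x.length; t++) {
            state = decay * state + (1.0 - decay) * x[t];
            h[t] = state;
        }
        return h;
    }

    public static void main(String[] args) {
        double[] h = scan(new double[] {1.0, 1.0, 1.0}, 0.5);
        System.out.println(Arrays.toString(h)); // prints [0.5, 0.75, 0.875]
    }
}
```

Attention, by contrast, compares each new token against every earlier token, which is where the O(n²) cost comes from.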
State-of-the-Art Models (2025–2026)
As of April 2026, the LLM landscape has undergone dramatic shifts: open-source models now rival or surpass closed-source frontier models on key benchmarks, agentic capabilities have become a core differentiator, and the industry faces a philosophical split between open access and gated deployment.
Closed-Source / Gated Models
| Model | Parameters | Context Window | Key Strengths | Best For |
|---|---|---|---|---|
| GPT-5.5 | Trillions (MoE) | 1M tokens | Terminal-Bench 2.0 SOTA (82.7%), SWE-Bench Pro 58.6%, agentic coding, computer use, scientific research | Agentic coding, computer use, knowledge work, research |
| GPT-5.4 | Trillions (MoE) | 128K tokens | Advanced reasoning, complex analytics, multimodal | General-purpose, enterprise analytics, complex reasoning |
| Claude Opus 4.7 | ~175B+ | 1M tokens | 87.6% SWE-bench, 94.2% GPQA, 3.3x higher-res vision, xhigh effort level | Software development, agentic coding, vision tasks |
| Claude Mythos | Unknown | Large | Most capable Anthropic model (93.9% SWE-bench), cybersecurity scanning | Gated access only (Project Glasswing, ~50 partners) |
| Gemini 3.1 Pro / Ultra | ~500B+ (est.) | 1M–2M tokens | Native multimodal (text/image/audio/video), real-time voice (90+ langs) | Long-document analysis, multimodal workflows, enterprise |
| OpenAI o3/o4 series | Unknown | Moderate | Explicit reasoning chains, advanced math/problem-solving | Scientific reasoning, complex math, research tasks |
Open-Source / Open-Weight Models
| Model | Parameters | Context Window | Key Strengths | Best For |
|---|---|---|---|---|
| GLM-5.1 (Zhipu AI) | 744B (MoE, 40B active) | 200K tokens | Beat GPT-5.4 & Claude Opus 4.6 on SWE-Bench Pro, MIT license | Software engineering, self-hosted frontier |
| Gemma 4 31B (Google) | 31B dense | 256K tokens | #3 open model on Arena AI, Apache 2.0, native vision+audio+agentic | On-premise deployment, edge-to-cloud |
| Gemma 4 26B MoE (Google) | 26B (3.8B active) | 256K tokens | #6 open model, outperforms models 20x its size, Apache 2.0 | Latency-sensitive applications, local inference |
| Llama 4 Scout / Maverick (Meta) | MoE | 128K tokens | Open-weight, competitive with frontier models | Open-source alternatives to closed models |
| Qwen 3.6-Plus (Alibaba) | MoE | 1M tokens | Strong agentic capabilities, multilingual, agentic coding | Asian languages, agentic workflows |
| DeepSeek V4 Pro (DeepSeek) | 1.6T MoE (49B active) | 1M tokens | Hybrid Attention (CSA+HCA), KV cache reduced to 2%, agent-optimized | Agentic workloads, long-context reasoning |
| DeepSeek V4 Flash (DeepSeek) | 284B MoE (13B active) | 1M tokens | Lightweight inference, per-token FLOPs reduced to 10-27% of V3 | High-throughput inference, cost-sensitive scenarios |
| Kimi K2.6 (Moonshot AI) | ~1T MoE (32B active) | 262K tokens | Open-source SOTA coding, native multimodal (vision + text), agent-optimized | Agentic coding, multimodal understanding |
| Granite 4.1 (IBM) | 3B / 8B / 30B dense | 512K tokens | 5-stage progressive pre-training, 8B dense matches Granite 4.0 32B MoE, Apache 2.0 | Enterprise deployment, self-hosted inference |
Key Insights for 2026
- The Open vs. Gated Split: The defining story of 2026 is the philosophical fracture. Anthropic locked Claude Mythos behind a 50-company firewall (125 per M tokens), while Zhipu AI open-sourced GLM-5.1 (744B MoE, MIT license), which beat both GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro. Meanwhile, Mozilla launched Thunderbolt, an open-source AI client for self-hosted AI infrastructure, further empowering the sovereign AI movement.
- Cognitive Density over Raw Scale: The industry is pivoting from "biggest model wins" to cognitive density: packing more reasoning into smaller, efficient models. Gemma 4 26B MoE (3.8B active params) outperforms models 20x its size. This is driven by cost, speed, and practicality.
- MoE is Ubiquitous: Nearly all frontier models now use Mixture-of-Experts. GLM-5.1 (744B total, 40B active), Gemma 4 26B (3.8B active), Llama 4, and GPT-5.4 are all MoE architectures. This allows massive total capacity with efficient inference.
- Context Window Evolution:
  - Standard: 128K tokens
  - Long-context: 200K–256K tokens (Claude, Gemma 4)
  - Massive: 1M tokens (DeepSeek V4 Pro/Flash, Gemini 3.1 Pro)
  - The real innovation is making massive context usable for agentic workloads: DeepSeek V4's Hybrid Attention reduces KV cache to ~2% of standard GQA
  - GPT-6 "Spud" is rumored to support 1M–2M tokens
- Agentic Capabilities Are Table Stakes: Models now ship with native function-calling, structured JSON output, and system instructions for autonomous agents. GPT-5.5 achieved 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro, making agentic coding its defining capability. Gemma 4, Qwen 3.6, and Claude all emphasize agentic workflows as a core feature.
- Frontier Model Pricing War: GPT-5.5 delivers frontier intelligence at half the cost of competitive coding models (30 per M tokens), while DeepSeek V4 Flash offers 1M context at rock-bottom prices. The race is no longer just about capability; it's about capability-per-dollar.
- The US-China AI Divide: DeepSeek V4 Pro (1.6T MoE, 1M context) runs on Huawei Ascend chips, completely bypassing Nvidia/CUDA. Its Hybrid Attention architecture (Compressed Sparse Attention + Hierarchical Context Attention) specifically targets agentic failure modes like context blowup and KV cache exhaustion. OpenAI, Anthropic, and Google are collaborating to combat model distillation by Chinese competitors. This geopolitical dimension is reshaping the entire supply chain.
Key Terminology
Parameters
The weights and biases of the neural network.
- 7B Parameters: Capable of running on consumer hardware (MacBook M3, gaming PC with GPU).
- 13B-70B Parameters: Requires decent GPU (A40/A100) for production use.
- 100B+ Parameters: Requires enterprise GPUs (H100 cluster) or efficient MoE architecture.
- Trillions: Frontier models (presumed GPT-4, Gemini Ultra) use MoE to effectively achieve this scale.
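The hardware tiers above follow from simple arithmetic: memory for the weights is roughly parameter count times bytes per parameter. The sketch below covers weights only and ignores the KV cache, activations, and runtime overhead, so treat the results as lower bounds. The precision choices (2 bytes for FP16/BF16, 0.5 bytes for 4-bit quantization) are standard; the class name is illustrative.

```java
/** Back-of-the-envelope memory needed just to hold model weights. */
class ModelMemorySketch {

    /**
     * billionsOfParams * 1e9 params * bytesPerParam bytes, divided by 1e9 bytes per GB,
     * simplifies to billionsOfParams * bytesPerParam.
     */
    static double weightGigabytes(double billionsOfParams, double bytesPerParam) {
        return billionsOfParams * bytesPerParam;
    }

    public static void main(String[] args) {
        System.out.println("7B  @ FP16 : " + weightGigabytes(7, 2.0) + " GB");
        System.out.println("7B  @ 4-bit: " + weightGigabytes(7, 0.5) + " GB");
        System.out.println("70B @ FP16 : " + weightGigabytes(70, 2.0) + " GB");
        System.out.println("70B @ 4-bit: " + weightGigabytes(70, 0.5) + " GB");
    }
}
```

This is why a quantized 7B model fits on a laptop, while a 70B model in FP16 needs more than one 80 GB data-center GPU.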
Context Window
The amount of text (in tokens) the model can "keep in mind" at once.
- Standard: 8k - 32k tokens (~30-120 pages).
- Long Context: 128k (GPT-4o, Llama 3.1), 200k (Claude).
- Massive Context: 1M-2M (Gemini 2.0 Pro) - equivalent to multiple books or entire codebases.
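- Trade-off: Standard attention needs O(n²) compute in the sequence length. Linear attention and SSM layers bring this down to O(n), while Ring Attention keeps exact attention but shards the sequence across devices so that no single GPU has to hold all of it. Forget-gate variants such as the Forgetting Transformer help the model discount stale context.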
Mixture-of-Experts (MoE)
A technique to scale model capacity without proportional compute increase.
- How it works: Each token is routed to a subset of "expert" sub-networks (e.g., 8 out of 224 experts).
- Benefits: Model can have huge total parameters (405B+) but only activate a small fraction per token (e.g., 21B active).
- Examples: Mixtral 8x22B, Llama 4, GPT-4 (rumored).
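The routing step can be sketched in a few lines. This is a minimal top-k router under simplifying assumptions: experts are plain scalar functions, the gate logits are given rather than computed by a learned layer, and load balancing is ignored. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Minimal top-k Mixture-of-Experts routing for a single token. */
class MoeRoutingSketch {

    /** Indices of the k largest gate logits, highest first. */
    static int[] topKExperts(double[] gateLogits, int k) {
        Integer[] order = new Integer[gateLogits.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(gateLogits[b], gateLogits[a]));
        int[] chosen = new int[k];
        for (int i = 0; i < k; i++) chosen[i] = order[i];
        return chosen;
    }

    /** Softmax over the chosen experts' logits only, so the weights sum to 1. */
    static double[] routingWeights(double[] gateLogits, int[] chosen) {
        double max = Double.NEGATIVE_INFINITY;
        for (int e : chosen) max = Math.max(max, gateLogits[e]);
        double[] w = new double[chosen.length];
        double sum = 0.0;
        for (int i = 0; i < chosen.length; i++) {
            w[i] = Math.exp(gateLogits[chosen[i]] - max);
            sum += w[i];
        }
        for (int i = 0; i < w.length; i++) w[i] /= sum;
        return w;
    }

    /** Runs only the chosen experts and mixes their outputs by routing weight. */
    static double forward(double x, double[] gateLogits, DoubleUnaryOperator[] experts, int k) {
        int[] chosen = topKExperts(gateLogits, k);
        double[] w = routingWeights(gateLogits, chosen);
        double y = 0.0;
        for (int i = 0; i < chosen.length; i++) {
            y += w[i] * experts[chosen[i]].applyAsDouble(x);
        }
        return y;
    }

    public static void main(String[] args) {
        DoubleUnaryOperator[] experts = { v -> v, v -> 2 * v, v -> 3 * v, v -> 4 * v };
        double[] gateLogits = {0.1, 2.0, -1.0, 1.0}; // made-up router scores for one token
        System.out.println(Arrays.toString(topKExperts(gateLogits, 2))); // prints [1, 3]
        System.out.println(forward(1.0, gateLogits, experts, 2));
    }
}
```

Only two of the four experts run for this token, which is the source of the compute savings; all four still have to be kept in memory.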
Training Stages
- Pre-training: The expensive part. Learning language patterns from internet-scale data (trillions of tokens).
  - Result: Base Model (can complete text but doesn't follow instructions)
  - Cost: Millions of dollars, thousands of GPUs, weeks of training
- Supervised Fine-Tuning (SFT): Teaching the model to follow instructions using high-quality Q&A datasets.
  - Result: Chat/Instruct Model (understands conversational intent)
  - Data: Millions of instruction-response pairs, often curated by humans
- Alignment (RLHF/DPO/GRPO): Refining behavior to be helpful, harmless, and honest.
  - RLHF: Reinforcement Learning from Human Feedback (GPT-style)
  - DPO: Direct Preference Optimization (simpler, more stable)
  - GRPO: Group Relative Policy Optimization (newer, more efficient; introduced in DeepSeekMath and used to train DeepSeek-R1)
  - Result: Aligned Model that refuses harmful requests and follows user intent
Interview FAQ
Q: Why did Transformers replace RNNs/LSTMs?
A: Two main reasons:
- Parallelization: RNNs process a sequence token by token, so a sequence of length n needs O(n) sequential steps, which makes training on GPUs inefficient. Transformers process the whole sequence at once using matrix operations.
- Long-term Dependencies: RNNs "forget" information over long sequences due to the vanishing gradient problem. The Attention mechanism connects every token to every other token directly, making the "distance" between any two words effectively 1.
2025 Update: However, Transformers have O(n²) complexity. New hybrid models (Transformer + Mamba/SSM) combine the best of both: parallel training and efficient O(n) inference for long contexts.
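The "every token connects to every other token" point can be seen in a plain implementation of scaled dot-product attention, softmax(QKᵀ/√d)·V. This is a single-head, unmasked, unbatched sketch on Java arrays, written for readability rather than speed; real implementations use batched matrix kernels.

```java
/** Single-head scaled dot-product attention on plain arrays (no masking, no batching). */
class ScaledDotProductAttentionSketch {

    /** q and k are [n][d] matrices, v is [n][dv]; returns an [n][dv] matrix. */
    static double[][] attend(double[][] q, double[][] k, double[][] v) {
        int d = q[0].length;
        int dv = v[0].length;
        double[][] out = new double[q.length][dv];
        for (int i = 0; i < q.length; i++) {
            // Score query i against every key: this inner loop is the O(n^2) part.
            double[] scores = new double[k.length];
            double max = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < k.length; j++) {
                double dot = 0.0;
                for (int c = 0; c < d; c++) dot += q[i][c] * k[j][c];
                scores[j] = dot / Math.sqrt(d);
                max = Math.max(max, scores[j]);
            }
            // Softmax over the scores (max-subtracted for numerical stability).
            double sum = 0.0;
            for (int j = 0; j < scores.length; j++) {
                scores[j] = Math.exp(scores[j] - max);
                sum += scores[j];
            }
            // Output row i is the attention-weighted average of the value rows.
            for (int j = 0; j < scores.length; j++) {
                double weight = scores[j] / sum;
                for (int c = 0; c < dv; c++) out[i][c] += weight * v[j][c];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] q = {{10.0, 0.0}};             // query that strongly matches the first key
        double[][] k = {{1.0, 0.0}, {0.0, 1.0}};
        double[][] v = {{2.0, 0.0}, {0.0, 4.0}};
        System.out.println(java.util.Arrays.toString(attend(q, k, v)[0]));
    }
}
```

Each output row depends on all value rows in one step, regardless of how far apart the tokens are, and every row can be computed independently, which is what makes training parallel.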
Q: What is the difference between a Base Model and an Instruct Model?
A: A Base Model (e.g., Llama-3-Base) is trained only to predict the next token. If you ask it "What is the capital of France?", it might reply "And what is the capital of Germany?" because it thinks it's completing a list of quiz questions.
An Instruct Model (e.g., Llama-3-Instruct) has undergone SFT (Supervised Fine-Tuning) on instruction-response pairs. It understands the intent of a query and knows how to act as an assistant.
Key Insight: Always use Instruct/Chat models for user-facing applications. Base models are only useful for continued pre-training or research.
Q: Can an LLM learn new knowledge at inference time?
A: No, the model's weights are frozen after training. It can learn temporarily through In-Context Learning (putting the info in the prompt), but once that context window is closed, the knowledge is gone.
To "teach" an LLM new knowledge, you have three options:
- Fine-tuning: Update the model weights on new data (expensive, requires expertise)
- RAG (Retrieval Augmented Generation): Retrieve relevant documents and include them in the prompt (most common)
- Prompt Engineering: Provide the knowledge directly in the system prompt or user message (for small, static knowledge)
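As a rough illustration of the RAG option, the sketch below retrieves the most relevant document and places it in the prompt. Word overlap is used here as a toy stand-in for embedding similarity, and the documents, prompt wording, and class name are all hypothetical. In a real system the retrieval step would query a vector store.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy retrieval-augmented prompt assembly using word overlap as the relevance score. */
class RagPromptSketch {

    /** Lower-cased set of words in the text. */
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        result.remove(""); // split can yield an empty leading token
        return result;
    }

    /** Number of distinct words shared by the query and the document. */
    static int overlapScore(String query, String doc) {
        Set<String> shared = words(query);
        shared.retainAll(words(doc));
        return shared.size();
    }

    /** Returns the topK documents with the highest overlap score. */
    static List<String> retrieve(String query, List<String> docs, int topK) {
        List<String> ranked = new ArrayList<>(docs);
        ranked.sort(Comparator.comparingInt((String d) -> overlapScore(query, d)).reversed());
        return ranked.subList(0, Math.min(topK, ranked.size()));
    }

    /** Places the retrieved context ahead of the question in a single prompt string. */
    static String buildPrompt(String query, List<String> context) {
        StringBuilder sb = new StringBuilder("Answer using only the context below.\n\nContext:\n");
        for (String c : context) sb.append("- ").append(c).append('\n');
        return sb.append("\nQuestion: ").append(query).toString();
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "Refund requests are accepted within 30 days of purchase.",
                "Our office is closed on public holidays.",
                "Shipping takes 3 to 5 business days.");
        String query = "What is the refund window in days?";
        System.out.println(buildPrompt(query, retrieve(query, docs, 1)));
    }
}
```

The resulting string is what would be sent as the user message; the model's weights never change, it simply reads the retrieved text from its context window.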
Q: What is Mixture-of-Experts (MoE) and why is it important?
A: MoE is an architectural innovation that decouples model size from computational cost. Instead of activating all parameters for every token (as in dense models), MoE models route each token to a small subset of specialized "expert" sub-networks.
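Example: Mixtral 8x22B has 141B total parameters but activates only ~39B per token, because the router sends each token to 2 of the 8 experts in every MoE layer (the attention layers and embeddings are shared by all tokens).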
Benefits:
- Scale: Can build massive models (400B+) without proportional inference costs
- Specialization: Different experts can specialize in different domains (coding, math, creative writing)
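- Efficiency: Faster inference than a dense model of the same total size, since only the active experts run for each token. Memory usage does not shrink the same way: all experts must still be loaded.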
Trade-offs:
- Training complexity: Requires careful load balancing to ensure all experts are utilized
- Implementation complexity: Need to implement routing logic and expert selection
2025 State: Most frontier models (GPT-4, Llama 4, Gemini) are believed to use MoE to achieve their scale.
Q: How do long-context models (1M+ tokens) work without running out of memory?
A: Traditional attention has O(n²) complexity, meaning a 1M-token context would require ~1 trillion operations per attention layer. Modern models use several techniques:
- Ring Attention: Distribute the sequence across multiple GPUs, each holding one block. Devices pass their key/value blocks to the next device in a ring, so exact attention is computed without any single GPU holding the full sequence.
- Linear Attention: Replace the quadratic softmax attention with linear-complexity alternatives (e.g., Mamba, Gated DeltaNets). These achieve O(n) complexity.
- Sliding Window / Local Attention: Only attend to nearby tokens, using a global "cache" for distant important information.
- Forgetting Transformers (FoX): Add a learned forget gate to softmax attention so that older, less relevant tokens are down-weighted.
Trade-off: Some of these methods sacrifice theoretical modeling power for practical efficiency. However, hybrid models (Transformer + SSM) often achieve 95%+ of Transformer quality at a fraction of the cost.
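To put numbers on the savings from local attention, the sketch below counts how many query-key pairs are scored under full causal attention versus a sliding window. The 4,096-token window in `main` is an arbitrary illustrative value, not a setting taken from any specific model.

```java
/** Counts the query-key pairs scored by full causal vs. sliding-window attention. */
class SlidingWindowSketch {

    /** Full causal attention: token t attends to t positions, so the total is n(n+1)/2. */
    static long fullCausalPairs(long n) {
        return n * (n + 1) / 2;
    }

    /** Sliding window: each token attends to itself and at most (window - 1) previous tokens. */
    static long slidingWindowPairs(long n, long window) {
        if (window >= n) return fullCausalPairs(n);
        // The first `window` tokens see 1, 2, ..., window positions;
        // each of the remaining (n - window) tokens sees exactly `window`.
        return window * (window + 1) / 2 + (n - window) * window;
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        long window = 4_096L;
        long full = fullCausalPairs(n);
        long local = slidingWindowPairs(n, window);
        System.out.println("full causal pairs:    " + full);
        System.out.println("sliding window pairs: " + local);
        System.out.println("reduction factor:     " + (double) full / local);
    }
}
```

The window version grows linearly in n, which is why it is usually paired with some mechanism for carrying distant information forward.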
Q: What's the difference between RLHF, DPO, and GRPO?
A: These are three methods for aligning LLMs with human preferences:
RLHF (Reinforcement Learning from Human Feedback):
- Process: Train a reward model on human preference data → Use PPO (Proximal Policy Optimization) to optimize the LLM
- Pros: Well-established, strong results
- Cons: Complex, requires training a separate reward model, unstable
DPO (Direct Preference Optimization):
- Process: Directly optimize the policy using preference pairs without a reward model
- Pros: Simpler, more stable, easier to implement
- Cons: Can be less sample-efficient than RLHF
GRPO (Group Relative Policy Optimization):
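- Process: For each prompt, sample a group of outputs and score each one relative to the group's mean reward, which removes the need for a separate value (critic) model. Introduced in DeepSeekMath and used to train DeepSeek-R1.
- Pros: More efficient than PPO-based RLHF, well suited to reasoning tasks; follow-up variants add improvements like active sampling and token-level loss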
- Cons: Newer, less battle-tested
2025 State: GRPO and DPO are becoming preferred over traditional RLHF due to simplicity and stability. Many state-of-the-art models (DeepSeek, Llama 4) use these newer methods.
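The core arithmetic behind DPO and GRPO is compact enough to sketch. The code below shows the per-pair DPO loss, -log σ(β·[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]), and GRPO's group-relative advantage, (r_i - mean) / std. It covers only these two formulas, not a training loop; the log-probabilities and rewards in `main` are made-up inputs, and the `eps` term is a common numerical-safety assumption.

```java
import java.util.Arrays;

/** Core formulas behind DPO (pairwise loss) and GRPO (group-relative advantage). */
class PreferenceOptimizationSketch {

    /**
     * DPO loss for one preference pair, given sequence log-probabilities under the
     * policy and the frozen reference model. Uses log(1 + e^(-m)) = -log(sigmoid(m)).
     */
    static double dpoLoss(double policyChosen, double policyRejected,
                          double refChosen, double refRejected, double beta) {
        double margin = beta * ((policyChosen - refChosen) - (policyRejected - refRejected));
        return Math.log1p(Math.exp(-margin));
    }

    /** GRPO advantage: each reward normalized by the mean and std of its own group. */
    static double[] groupRelativeAdvantages(double[] rewards, double eps) {
        double mean = 0.0;
        for (double r : rewards) mean += r;
        mean /= rewards.length;
        double variance = 0.0;
        for (double r : rewards) variance += (r - mean) * (r - mean);
        double std = Math.sqrt(variance / rewards.length);
        double[] advantages = new double[rewards.length];
        for (int i = 0; i < advantages.length; i++) {
            advantages[i] = (rewards[i] - mean) / (std + eps); // eps avoids division by zero
        }
        return advantages;
    }

    public static void main(String[] args) {
        // Policy identical to the reference: margin is 0, so the loss is ln(2).
        System.out.println(dpoLoss(-1.0, -1.0, -1.0, -1.0, 0.1));
        // Four sampled answers to one prompt, rewarded 1 if correct and 0 otherwise.
        System.out.println(Arrays.toString(groupRelativeAdvantages(new double[] {1, 0, 0, 1}, 1e-8)));
    }
}
```

Neither formula needs a learned value model: DPO uses the reference model as its baseline, and GRPO uses the other samples in the group.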
Summary for Interviews
- LLMs are probabilistic next-token predictors that exhibit emergent reasoning capabilities at scale.
- Transformer architecture (2017) enabled parallel training and long-range dependencies, but hybrid models (2024+) are improving efficiency.
- Three main architectures: Encoder-only (BERT), Decoder-only (GPT, Llama), Encoder-Decoder (T5). Decoder-only dominates generative AI.
- 2026 state-of-the-art: GPT-5.4 (reasoning), Claude Opus 4.7 (coding, 87.6% SWE-bench), Gemini 3.5 Flash (agentic, 4x faster than frontier), Gemini 3.1 Pro (multimodal, 1M+ context), GLM-5.1 (open-source frontier), Gemma 4 (edge-to-cloud), GPT-Rosalind (life sciences).
- Mixture-of-Experts (MoE) is ubiquitous — nearly all frontier models use MoE to scale efficiently.
- Training pipeline: Pre-training → SFT → Alignment (RLHF/DPO/GRPO).
- In-context learning ≠ learning: Weights are frozen; use RAG for external knowledge.
- Long context is mainstream: 128K–2M tokens via Ring Attention, linear attention, and hybrid architectures.
- The open vs. gated split: Open-source models (GLM-5.1, Gemma 4) now rival or surpass closed models, while some frontier models (Claude Mythos) are deliberately locked behind gated access.
References
- The State of LLMs 2025 - Comprehensive analysis of 2024-2025 advances
- Hybrid Architectures for Language Models - Systematic analysis of Transformer + SSM hybrids
- Attention Is All You Need (2017) - The original Transformer paper