
Introduction to Large Language Models

"LLMs are not just text predictors; they are compressed representations of the world's knowledge, accessible through natural language."

Large Language Models (LLMs) represent a paradigm shift in Artificial Intelligence, moving from task-specific models to general-purpose reasoning engines. For software engineers and AI practitioners, understanding LLMs requires looking beyond the hype and grasping the underlying statistical and architectural principles that drive them.


What is an LLM?

At its core, an LLM is a probabilistic engine that predicts the next token based on previous context. While the mathematical formulation involves conditional probabilities, for engineers it's more useful to understand what LLMs can do rather than the underlying math.
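For reference, the conditional-probability formulation mentioned above is the standard autoregressive factorization. The symbols are notation introduced here, not taken from the text: x_t is the token at position t, T is the sequence length, and θ stands for the model weights.

```latex
P_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} P_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Generation applies this one factor at a time: the model produces a distribution over the next token given everything so far, one token is sampled from it, and the process repeats with the extended context.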

Modern LLM Capabilities

LLMs have evolved from simple text completion to sophisticated reasoning engines:

  • Code Generation: Write, debug, and explain code across multiple languages
  • Document Analysis: Extract insights from technical documentation, research papers, and contracts
  • Conversation Systems: Maintain context across multi-turn dialogue (the application supplies prior turns or a memory store on each request; the model itself is stateless)
  • Tool Use: Interact with APIs, databases, and external systems
  • Multi-step Reasoning: Break down complex problems into intermediate steps

The Engineering Perspective

For production systems, think of LLMs as text-to-text transformations:

// Conceptual view: LLM as text processor
Input: "Summarize this document: [content]"
Processing: Model traverses layers of attention and feed-forward networks
Output: "[summary]"

The key insight: LLMs learn patterns from training data and apply them during inference. They don't "know" facts in the human sense—they've seen statistical correlations that they can reproduce.


Spring AI Integration Setup

Spring AI provides a unified abstraction layer for working with LLMs in Spring Boot applications. This simplifies switching between models and providers while maintaining consistent APIs.

Basic ChatClient Configuration

# application.properties (using Doppler for environment variables)
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.anthropic.api-key=${ANTHROPIC_API_KEY}

# Or use the recommended Doppler injection pattern:
# spring.ai.openai.api-key=${doppler.OPENAI_API_KEY}

// Service layer for LLM interactions
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;

@Service
public class LLMChatService {

    private final ChatClient chatClient;

    public LLMChatService(ChatModel chatModel) {
        this.chatClient = ChatClient.builder(chatModel).build();
    }

    // Blocking call: returns the full response text once generation completes
    public String chat(String userMessage) {
        return chatClient.prompt()
                .user(userMessage)
                .call()
                .content();
    }

    // Streaming responses for real-time applications: emits text chunks as they arrive
    public Flux<String> chatStream(String userMessage) {
        return chatClient.prompt()
                .user(userMessage)
                .stream()
                .content();
    }
}

Model Selection Guide

Choosing the right model depends on your use case, budget, and performance requirements:

| Use Case | Recommended Model | Why |
| --- | --- | --- |
| Code Generation & Debugging | Claude Opus 4.7 or GLM-5.1 | 87.6% SWE-bench Verified (Opus 4.7), GLM-5.1 is MIT-licensed |
| General-Purpose Chat | GPT-5.4 or Gemini 3.1 Pro | Strong reasoning, native multimodal |
| Long Document Analysis | Claude Opus 4.7 or Gemini 3.1 Pro | 1M token context (Opus 4.7), 1M–2M tokens (Gemini) |
| Cost-Sensitive Applications | Gemma 4 26B MoE (self-hosted) | Apache 2.0, outperforms models 20x its size |
| Enterprise Self-Hosted | Granite 4.1 8B/30B (IBM) | Apache 2.0, 8B matches 32B MoE, 512K context |
| On-Premise Deployment | GLM-5.1 (MIT) or Gemma 4 (Apache 2.0) | Fully open-source, frontier-grade |
| Multilingual Applications | Qwen 3.6-Plus or Gemma 4 (140+ languages) | Strong non-English, agentic capabilities |
| Complex Reasoning | OpenAI o3/o4 or Claude Opus 4.7 | Explicit reasoning chains, math/science tasks |
| Edge / Mobile | Gemma 4 E2B/E4B | 128K context on mobile/IoT devices |
| Life Sciences / Drug Discovery | GPT-Rosalind | First LLM purpose-built for biology and genomics research |

Configuration Example

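import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.openai.OpenAiChatModel;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class LLMConfiguration {

    // Understanding these parameters:
    // - temperature: Controls randomness (0.0 = near-deterministic, 1.0 = more varied)
    // - maxTokens: Limits response length
    // - topP: Nucleus sampling (0.9 = sample from the smallest token set covering 90% of probability mass)
    // - presencePenalty: Penalizes tokens that have already appeared, reducing repetition
    @Bean
    public ChatModel chatModel(OpenAiApi openAiApi) {
        return OpenAiChatModel.builder()
                .openAiApi(openAiApi)
                // defaultOptions(...) is the builder method name in Spring AI 1.x; check your version
                .defaultOptions(OpenAiChatOptions.builder()
                        .model("gpt-4")
                        .temperature(0.7)
                        .maxTokens(2000)
                        .build())
                .build();
    }

    // For long-context use cases
    @Bean
    public ChatModel longContextModel(OpenAiApi openAiApi) {
        return OpenAiChatModel.builder()
                .openAiApi(openAiApi) // the builder needs an API client to call the provider
                .defaultOptions(OpenAiChatOptions.builder()
                        .model("gpt-4o") // 128K context
                        .maxTokens(4000)
                        .build())
                .build();
    }
}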
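To make the temperature and topP comments above concrete, here is a minimal plain-Java sketch of what those two settings do to a next-token distribution. It is a toy illustration only: the class and method names (SamplingSketch, softmaxWithTemperature, topPFilter) are hypothetical, and real providers apply these steps server-side with their own implementation details.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy illustration of temperature scaling and nucleus (top-p) filtering. */
final class SamplingSketch {

    /** Softmax over logits divided by the temperature (temperature must be > 0). */
    static double[] softmaxWithTemperature(double[] logits, double temperature) {
        double max = Arrays.stream(logits).max().getAsDouble();
        double[] probs = new double[logits.length];
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            // Subtracting the max keeps exp() numerically stable.
            probs[i] = Math.exp((logits[i] - max) / temperature);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }

    /** Keeps the smallest set of most-likely tokens whose mass reaches topP, then renormalizes. */
    static double[] topPFilter(double[] probs, double topP) {
        Integer[] order = new Integer[probs.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort token indices by descending probability.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -probs[i]));

        double[] kept = new double[probs.length];
        double mass = 0.0;
        for (int idx : order) {
            kept[idx] = probs[idx];
            mass += probs[idx];
            if (mass >= topP) {
                break; // everything less likely than this token is dropped
            }
        }
        for (int i = 0; i < kept.length; i++) {
            kept[i] /= mass;
        }
        return kept;
    }
}
```

Lower temperatures concentrate probability on the top token, while top-p removes the unlikely tail before sampling; the two are usually tuned together rather than pushed to extremes at the same time.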

Model Architectures: The "Big Three" + Modern Evolutions

In 2017, the "Attention Is All You Need" paper introduced the Transformer. Since then, the architecture has branched into three distinct families and hybrid variants. You must know the difference between these for interviews.

1. Encoder-Only (Auto-Encoding)

  • Mechanism: Corrupts input (masks words) and tries to reconstruct it using bidirectional context (looking at both left and right context).
  • Core Ability: "Understanding" and classification. These models create rich vector representations of text.
  • Use Cases: Sentiment analysis, Named Entity Recognition (NER), Search/Embeddings.
  • Examples: BERT, RoBERTa, DistilBERT.

2. Decoder-Only (Auto-Regressive)

  • Mechanism: Predicts the next token based only on previous tokens (causal masking). It cannot "see" the future.
  • Core Ability: Generative tasks.
  • Use Cases: Chatbots, Code Generation, Storytelling.
  • Examples: GPT-3/4, Llama 3/4, Claude, Gemini.
  • Note: This is the dominant architecture for modern "Generative AI", increasingly built as Mixture-of-Experts (MoE) variants.
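The causal masking described for decoder-only models can be sketched in a few lines of plain Java. This is an illustrative toy, not any framework's implementation: CausalMaskSketch and causalAttentionWeights are hypothetical names, and the raw scores would normally come from scaled query-key dot products.

```java
/** Toy causal self-attention weights: position i may only attend to positions 0..i. */
final class CausalMaskSketch {

    /** Applies a causal mask to raw attention scores and returns row-wise softmax weights. */
    static double[][] causalAttentionWeights(double[][] scores) {
        int n = scores.length;
        double[][] weights = new double[n][n];
        for (int i = 0; i < n; i++) {
            // Only columns 0..i are visible to row i.
            double max = Double.NEGATIVE_INFINITY;
            for (int j = 0; j <= i; j++) {
                max = Math.max(max, scores[i][j]);
            }
            double sum = 0.0;
            for (int j = 0; j <= i; j++) {
                weights[i][j] = Math.exp(scores[i][j] - max);
                sum += weights[i][j];
            }
            for (int j = 0; j <= i; j++) {
                weights[i][j] /= sum;
            }
            // Entries with j > i stay 0.0: future tokens are masked out.
        }
        return weights;
    }
}
```

With all scores equal, the first token attends only to itself, the second splits its attention evenly over two positions, and so on. An encoder-only model would skip the mask and normalize over the full row instead.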

3. Encoder-Decoder (Seq2Seq)

  • Mechanism: An Encoder processes the input into a sequence of contextual representations, and a Decoder generates the output token by token while attending to them through cross-attention.
  • Core Ability: Transforming one sequence into another.
  • Use Cases: Translation (English → French), Summarization (Long Article → Abstract).
  • Examples: T5, BART.

4. Hybrid Architectures (2024+)

The Latest Frontier: Combining Transformer blocks with State Space Models (SSM) like Mamba.

  • Mechanism: Interleaves Transformer attention layers with linear-complexity SSM layers.
  • Advantages:
    • O(n) complexity instead of O(n²) for attention
    • Better long-context modeling without memory blowout
    • Maintains strong performance on benchmarks
  • Examples:
    • Jamba (AI21 Labs): Transformer + Mamba hybrid
    • RecurrentGemma (Google): Griffin architecture mixing attention and linear recurrence
    • Qwen3-Next: Uses Gated DeltaNets for linear attention
    • Nemotron 3 (NVIDIA): Incorporates Mamba-2 layers
  • Performance: Research shows these hybrids often outperform pure Transformers or pure SSM models.

State-of-the-Art Models (2025–2026)

As of April 2026, the LLM landscape has undergone dramatic shifts: open-source models now rival or surpass closed-source frontier models on key benchmarks, agentic capabilities have become a core differentiator, and the industry faces a philosophical split between open access and gated deployment.

Closed-Source / Gated Models

| Model | Parameters | Context Window | Key Strengths | Best For |
| --- | --- | --- | --- | --- |
| GPT-5.5 | Trillions (MoE) | 1M tokens | Terminal-Bench 2.0 SOTA (82.7%), SWE-Bench Pro 58.6%, agentic coding, computer use, scientific research | Agentic coding, computer use, knowledge work, research |
| GPT-5.4 | Trillions (MoE) | 128K tokens | Advanced reasoning, complex analytics, multimodal | General-purpose, enterprise analytics, complex reasoning |
| Claude Opus 4.7 | ~175B+ | 1M tokens | 87.6% SWE-bench, 94.2% GPQA, 3.3x higher-res vision, xhigh effort level | Software development, agentic coding, vision tasks |
| Claude Mythos | Unknown | Large | Most capable Anthropic model (93.9% SWE-bench), cybersecurity scanning | Gated access only (Project Glasswing, ~50 partners) |
| Gemini 3.1 Pro / Ultra | ~500B+ (est.) | 1M–2M tokens | Native multimodal (text/image/audio/video), real-time voice (90+ langs) | Long-document analysis, multimodal workflows, enterprise |
| OpenAI o3/o4 series | Unknown | Moderate | Explicit reasoning chains, advanced math/problem-solving | Scientific reasoning, complex math, research tasks |

Open-Source / Open-Weight Models

| Model | Parameters | Context Window | Key Strengths | Best For |
| --- | --- | --- | --- | --- |
| GLM-5.1 (Zhipu AI) | 744B (MoE, 40B active) | 200K tokens | Beat GPT-5.4 & Claude Opus 4.6 on SWE-Bench Pro, MIT license | Software engineering, self-hosted frontier |
| Gemma 4 31B (Google) | 31B dense | 256K tokens | #3 open model on Arena AI, Apache 2.0, native vision+audio+agentic | On-premise deployment, edge-to-cloud |
| Gemma 4 26B MoE (Google) | 26B (3.8B active) | 256K tokens | #6 open model, outperforms models 20x its size, Apache 2.0 | Latency-sensitive applications, local inference |
| Llama 4 Scout / Maverick (Meta) | MoE | 128K tokens | Open-weight, competitive with frontier models | Open-source alternatives to closed models |
| Qwen 3.6-Plus (Alibaba) | MoE | 1M tokens | Strong agentic capabilities, multilingual, agentic coding | Asian languages, agentic workflows |
| DeepSeek V4 Pro (DeepSeek) | 1.6T MoE (49B active) | 1M tokens | Hybrid Attention (CSA+HCA), KV cache reduced to 2%, agent-optimized | Agentic workloads, long-context inference |
| DeepSeek V4 Flash (DeepSeek) | 284B MoE (13B active) | 1M tokens | Lightweight inference, per-token FLOPs reduced to 10-27% of V3 | High-throughput inference, cost-sensitive scenarios |
| Kimi K2.6 (Moonshot AI) | ~1T MoE (32B active) | 262K tokens | Open-source SOTA coding ability, native multimodal (vision + text), agent-optimized | Agentic coding, multimodal understanding |
| Granite 4.1 (IBM) | 3B / 8B / 30B dense | 512K tokens | 5-stage progressive pre-training, 8B dense matches Granite 4.0 32B MoE, Apache 2.0 | Enterprise deployment, self-hosted inference |

Key Insights for 2026

  1. The Open vs. Gated Split: The defining story of 2026 is the philosophical fracture. Anthropic locked Claude Mythos behind a 50-company firewall ($25/$125 per M tokens), while Zhipu AI open-sourced GLM-5.1 (744B MoE, MIT license), and it beat both GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro. Meanwhile, Mozilla launched Thunderbolt, an open-source AI client for self-hosted AI infrastructure, further empowering the sovereign AI movement.

  2. Cognitive Density over Raw Scale: The industry is pivoting from "biggest model wins" to cognitive density — packing more reasoning into smaller, efficient models. Gemma 4 26B MoE (3.8B active params) outperforms models 20x its size. This is driven by cost, speed, and practicality.

  3. MoE is Ubiquitous: Nearly all frontier models now use Mixture-of-Experts. GLM-5.1 (744B total, 40B active), Gemma 4 26B (3.8B active), Llama 4, and GPT-5.4 are all MoE architectures. This allows massive total capacity with efficient inference.

  4. Context Window Evolution:

    • Standard: 128K tokens
    • Long-context: 200K–256K tokens (GLM-5.1, Gemma 4)
    • Massive: 1M tokens (DeepSeek V4 Pro/Flash, Gemini 3.1 Pro, Claude Opus 4.7)
    • The real innovation is making massive context usable for agentic workloads — DeepSeek V4's Hybrid Attention reduces KV cache to ~2% of standard GQA
    • GPT-6 "Spud" is rumored to support 1M–2M tokens

  5. Agentic Capabilities Are Table Stakes: Models now ship with native function-calling, structured JSON output, and system instructions for autonomous agents. GPT-5.5 achieved 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro, making agentic coding its defining capability. Gemma 4, Qwen 3.6, and Claude all emphasize agentic workflows as a core feature.

  6. Frontier Model Pricing War: GPT-5.5 delivers frontier intelligence at half the cost of competitive coding models ($5/$30 per M tokens), while DeepSeek V4 Flash offers 1M context at rock-bottom prices. The race is no longer just about capability; it's about capability-per-dollar.

  7. The US-China AI Divide: DeepSeek V4 Pro (1.6T MoE, 1M context) runs on Huawei Ascend chips, completely bypassing Nvidia/CUDA. Its Hybrid Attention architecture (Compressed Sparse Attention + Hierarchical Context Attention) specifically targets agentic failure modes like context blowup and KV cache exhaustion. OpenAI, Anthropic, and Google are collaborating to combat model distillation by Chinese competitors. This geopolitical dimension is reshaping the entire supply chain.


Key Terminology

Parameters

The weights and biases of the neural network.

  • 7B Parameters: Capable of running on consumer hardware (MacBook M3, gaming PC with GPU).
  • 13B-70B Parameters: Requires decent GPU (A40/A100) for production use.
  • 100B+ Parameters: Requires enterprise GPUs (H100 cluster) or efficient MoE architecture.
  • Trillions: Frontier models (presumed GPT-4, Gemini Ultra) use MoE to effectively achieve this scale.
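The hardware tiers above follow from simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope helper with hypothetical names (WeightMemorySketch, weightGigabytes); it covers the weights only and ignores activations, the KV cache, and framework overhead.

```java
/** Back-of-the-envelope memory estimate for model weights only. */
final class WeightMemorySketch {

    /**
     * Returns the memory needed for the weights, in gigabytes (10^9 bytes).
     * bytesPerParameter is 2.0 for fp16/bf16, 1.0 for 8-bit, 0.5 for 4-bit quantization.
     */
    static double weightGigabytes(double parameters, double bytesPerParameter) {
        return parameters * bytesPerParameter / 1e9;
    }
}
```

Under these assumptions a 7B model needs about 14 GB in fp16 and about 3.5 GB at 4-bit, which is why it fits on consumer hardware, while a 70B model at fp16 needs about 140 GB and has to be quantized or split across several GPUs.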

Context Window

The amount of text (in tokens) the model can "keep in mind" at once.

  • Standard: 8k - 32k tokens (roughly 12-50 pages, assuming ~0.75 words per token and ~500 words per page).
  • Long Context: 128k (GPT-4o, Llama 3.1), 200k (Claude).
  • Massive Context: 1M-2M (Gemini 2.0 Pro) - equivalent to multiple books or entire codebases.
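  • Trade-off: Standard attention requires O(n²) compute in sequence length. Linear-attention and state-space layers reduce this to O(n); Ring Attention keeps exact attention but shards the sequence across devices so that per-device memory stays bounded; Forgetting Transformers add a forget gate to softmax attention.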
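Besides attention compute, long contexts are limited by the key/value (KV) cache that grows with every token kept in context. The sketch below estimates its size; the class and method names are hypothetical, and the hyperparameters used in the note afterwards (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) are illustrative assumptions in the style of an 8B-class model with grouped-query attention, not figures from this article.

```java
/** Rough KV-cache size estimate for a decoder-only Transformer. */
final class KvCacheSketch {

    /** Bytes of KV cache per token: keys and values (factor 2) for every layer and KV head. */
    static long bytesPerToken(int layers, int kvHeads, int headDim, int bytesPerValue) {
        return 2L * layers * kvHeads * headDim * bytesPerValue;
    }

    /** Total cache size in GiB (2^30 bytes) for a single sequence of contextTokens tokens. */
    static double cacheGibibytes(long contextTokens, int layers, int kvHeads, int headDim, int bytesPerValue) {
        return contextTokens * (double) bytesPerToken(layers, kvHeads, headDim, bytesPerValue) / (1L << 30);
    }
}
```

With the assumed configuration, each token costs 128 KiB of cache, so a single 128K-token (131,072) sequence needs about 16 GiB on top of the weights. That is the pressure that KV-cache compression and hybrid architectures aim to relieve.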

Mixture-of-Experts (MoE)

A technique to scale model capacity without proportional compute increase.

  • How it works: A learned router sends each token to a small subset of "expert" sub-networks (e.g., 2 of 8 experts in Mixtral, 8 of 256 routed experts in DeepSeek-V3).
  • Benefits: Model can have huge total parameters but only activate a small fraction per token (e.g., DeepSeek-V2: 236B total, 21B active).
  • Examples: Mixtral 8x22B, Llama 4, GPT-4 (rumored).
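The routing step can be sketched as top-k selection over router scores followed by a softmax over the selected experts. This is a simplified illustration under stated assumptions: MoeRoutingSketch and route are hypothetical names, the router logits would come from a learned linear layer, and production routers add load-balancing terms and capacity limits that are omitted here.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy top-k expert routing for a single token. */
final class MoeRoutingSketch {

    /**
     * Returns one gate weight per expert: non-zero only for the k highest-scoring
     * experts, with the non-zero weights summing to 1. Assumes 1 <= k <= number of experts.
     */
    static double[] route(double[] routerLogits, int k) {
        Integer[] order = new Integer[routerLogits.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort expert indices by descending router score.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -routerLogits[i]));

        // Softmax restricted to the k selected experts.
        double max = routerLogits[order[0]];
        double[] gates = new double[routerLogits.length];
        double sum = 0.0;
        for (int r = 0; r < k; r++) {
            int expert = order[r];
            gates[expert] = Math.exp(routerLogits[expert] - max);
            sum += gates[expert];
        }
        for (int r = 0; r < k; r++) {
            gates[order[r]] /= sum;
        }
        return gates;
    }
}
```

Only experts with a non-zero gate are evaluated for that token, and their outputs are combined using the gate weights. This is how total parameter count and per-token compute are decoupled.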

Training Stages

  1. Pre-training: The expensive part. Learning language patterns from internet-scale data (trillions of tokens).

    • Result: Base Model (can complete text but doesn't follow instructions)
    • Cost: Millions of dollars, thousands of GPUs, weeks of training
  2. Supervised Fine-Tuning (SFT): Teaching the model to follow instructions using high-quality Q&A datasets.

    • Result: Chat/Instruct Model (understands conversational intent)
    • Data: Thousands to millions of instruction-response pairs, often curated by humans
  3. Alignment (RLHF/DPO/GRPO): Refining behavior to be helpful, harmless, and honest.

    • RLHF: Reinforcement Learning from Human Feedback (GPT-style)
    • DPO: Direct Preference Optimization (simpler, more stable)
    • GRPO: Group Relative Policy Optimization (newer, more efficient; introduced in DeepSeekMath and popularized by DeepSeek-R1)
    • Result: Aligned Model that refuses harmful requests and follows user intent

Interview FAQ

Q: Why did Transformers replace RNNs/LSTMs?

A: Two main reasons:

  1. Parallelization: RNNs process word-by-word sequentially (t_1, t_2, ...), making training on GPUs inefficient. Transformers process the whole sequence at once using matrix operations.
  2. Long-term Dependencies: RNNs "forget" information over long sequences due to the vanishing gradient problem. The Attention mechanism connects every token to every other token directly, making the "distance" between any two words effectively 1.

2025 Update: However, self-attention has O(n²) complexity in sequence length. New hybrid models (Transformer + Mamba/SSM) combine the best of both: parallel training and efficient O(n) inference for long contexts.

Q: What is the difference between a Base Model and an Instruct Model?

A: A Base Model (e.g., Llama-3-Base) is trained only to predict the next token. If you ask it "What is the capital of France?", it might reply "And what is the capital of Germany?" because it thinks it's completing a list of quiz questions.

An Instruct Model (e.g., Llama-3-Instruct) has undergone SFT (Supervised Fine-Tuning) on instruction-response pairs. It understands the intent of a query and knows how to act as an assistant.

Key Insight: Always use Instruct/Chat models for user-facing applications. Base models are mainly useful as a starting point for fine-tuning, continued pre-training, or research.

Q: Can an LLM learn new knowledge at inference time?

A: No, the model's weights are frozen after training. It can learn temporarily through In-Context Learning (putting the info in the prompt), but once that context window is closed, the knowledge is gone.

To "teach" an LLM new knowledge, you have three options:

  1. Fine-tuning: Update the model weights on new data (expensive, requires expertise)
  2. RAG (Retrieval Augmented Generation): Retrieve relevant documents and include them in the prompt (most common)
  3. Prompt Engineering: Provide the knowledge directly in the system prompt or user message (for small, static knowledge)
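As a rough illustration of the RAG option, the sketch below ranks stored text chunks by cosine similarity to a query embedding and assembles an augmented prompt. Everything here is hypothetical scaffolding: RagSketch, Chunk, retrieve, and buildPrompt are illustrative names, the two-dimensional vectors in the usage note stand in for real embedding-model output, and a production system would use a vector database rather than an in-memory list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Minimal retrieve-then-prompt sketch with toy embeddings. */
final class RagSketch {

    /** A document chunk with a precomputed embedding. */
    static final class Chunk {
        final String text;
        final double[] embedding;

        Chunk(String text, double[] embedding) {
            this.text = text;
            this.embedding = embedding;
        }
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the topK chunks most similar to the query embedding, best first. */
    static List<Chunk> retrieve(List<Chunk> store, double[] queryEmbedding, int topK) {
        List<Chunk> sorted = new ArrayList<>(store);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -cosine(c.embedding, queryEmbedding)));
        return sorted.subList(0, Math.min(topK, sorted.size()));
    }

    /** Builds the augmented prompt: retrieved context first, then the user question. */
    static String buildPrompt(List<Chunk> context, String question) {
        StringBuilder sb = new StringBuilder("Answer using only the context below.\n\nContext:\n");
        for (Chunk c : context) {
            sb.append("- ").append(c.text).append('\n');
        }
        return sb.append("\nQuestion: ").append(question).toString();
    }
}
```

For example, with one chunk about refunds embedded as (1, 0) and one about shipping embedded as (0, 1), a query embedded as (0.9, 0.1) retrieves the refund chunk. The resulting prompt string is what would be sent to the model, whose weights stay unchanged throughout.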

Q: What is Mixture-of-Experts (MoE) and why is it important?

A: MoE is an architectural innovation that decouples model size from computational cost. Instead of activating all parameters for every token (as in dense models), MoE models route each token to a small subset of specialized "expert" sub-networks.

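Example: Mixtral 8x22B has 141B total parameters but only activates ~39B per token, because each token is routed to 2 of the 8 experts in every MoE layer (the active count also includes the shared attention and embedding weights).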

Benefits:

  • Scale: Can build massive models (400B+) without proportional inference costs
  • Specialization: Different experts can specialize in different domains (coding, math, creative writing)
  • Efficiency: Fewer FLOPs per token and faster inference than a dense model of the same total size (all experts must still be held in memory)

Trade-offs:

  • Training complexity: Requires careful load balancing to ensure all experts are utilized
  • Implementation complexity: Need to implement routing logic and expert selection

2025 State: Most frontier models (GPT-4, Llama 4, Gemini) are believed to use MoE to achieve their scale.

Q: How do long-context models (1M+ tokens) work without running out of memory?

A: Traditional attention has O(n²) complexity in sequence length: a 1M-token context implies ~10^12 (1 trillion) query-key scores per attention head in each layer. Modern models use several techniques:

  1. Ring Attention: Distribute the sequence across multiple GPUs, each holding one block of queries. Key/value blocks are passed from device to device in a ring, so attention stays exact while per-device memory stays bounded. Total compute is still quadratic.

  2. Linear Attention: Replace the quadratic softmax attention with linear-complexity alternatives (e.g., state space models such as Mamba, or linear attention variants such as Gated DeltaNet). These achieve O(n) complexity.

  3. Sliding Window / Local Attention: Only attend to nearby tokens, optionally keeping a few global tokens or full-attention layers for distant important information.

  4. Forgetting Transformers (FoX): Add a data-dependent forget gate to softmax attention that down-weights older, less relevant tokens.

Trade-off: Some of these methods sacrifice theoretical modeling power for practical efficiency. However, hybrid models (Transformer + SSM) are reported to match or approach Transformer quality at a fraction of the cost.
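To put numbers on the quadratic cost, the sketch below counts how many query-key pairs are scored under full causal attention versus a sliding window. The names (AttentionCostSketch, fullCausalPairs, slidingWindowPairs) are hypothetical, and the 4,096-token window used in the note afterwards is an illustrative assumption rather than a figure from any particular model.

```java
/** Counts query-key pairs scored per attention head under two masking schemes. */
final class AttentionCostSketch {

    /** Full causal attention over n tokens: token i attends to i tokens, so n(n+1)/2 pairs. */
    static long fullCausalPairs(long n) {
        return n * (n + 1) / 2;
    }

    /** Sliding window: each token attends to itself and at most the previous (window - 1) tokens. */
    static long slidingWindowPairs(long n, long window) {
        if (window >= n) {
            return fullCausalPairs(n);
        }
        // The first `window` tokens see 1, 2, ..., window tokens; the rest see exactly `window`.
        return window * (window + 1) / 2 + (n - window) * window;
    }
}
```

For a 1M-token sequence, full causal attention scores about 5 × 10^11 pairs per head, while a 4,096-token window scores about 4.1 × 10^9, more than 100 times fewer. The window grows linearly with sequence length, which is the source of the O(n) behavior.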

Q: What's the difference between RLHF, DPO, and GRPO?

A: These are three methods for aligning LLMs with human preferences:

RLHF (Reinforcement Learning from Human Feedback):

  • Process: Train a reward model on human preference data → Use PPO (Proximal Policy Optimization) to optimize the LLM
  • Pros: Well-established, strong results
  • Cons: Complex, requires training a separate reward model, unstable

DPO (Direct Preference Optimization):

  • Process: Directly optimize the policy using preference pairs without a reward model
  • Pros: Simpler, more stable, easier to implement
  • Cons: Can be less sample-efficient than RLHF
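For reference, the DPO objective can be written as a single classification-style loss over preference pairs. The notation is introduced here for illustration: x is the prompt, y_w the preferred response, y_l the rejected response, π_θ the model being trained, π_ref a frozen reference model (typically the SFT checkpoint), σ the logistic function, and β a hyperparameter controlling how far the policy may drift from the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
  \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Minimizing this loss raises the likelihood of the preferred response relative to the rejected one, measured against the reference model, which is why no separate reward model or reinforcement learning loop is needed.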

GRPO (Group Relative Policy Optimization):

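  • Process: Newer method (introduced in DeepSeekMath, used to train DeepSeek-R1) that samples a group of outputs per prompt and scores each one relative to the group average, removing the need for a separate value (critic) model
  • Pros: More efficient than PPO-based RLHF, well suited to reasoning tasks with verifiable rewards; follow-up variants such as DAPO add dynamic sampling and a token-level loss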
  • Cons: Newer, less battle-tested
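The "group relative" part of GRPO can be sketched as standardizing each sampled output's reward against the other outputs for the same prompt. This is a minimal illustration with hypothetical names (GrpoAdvantageSketch, advantages); it assumes population standard deviation and a small epsilon, and it omits the clipped policy-gradient update and KL penalty that a full training loop would add.

```java
/** Toy computation of group-relative advantages for one prompt's sampled outputs. */
final class GrpoAdvantageSketch {

    /** Standardizes each reward by the group's mean and (population) standard deviation. */
    static double[] advantages(double[] rewards) {
        int n = rewards.length;
        double mean = 0.0;
        for (double r : rewards) {
            mean += r;
        }
        mean /= n;

        double variance = 0.0;
        for (double r : rewards) {
            variance += (r - mean) * (r - mean);
        }
        double std = Math.sqrt(variance / n);

        double eps = 1e-8; // avoids division by zero when all rewards are equal
        double[] adv = new double[n];
        for (int i = 0; i < n; i++) {
            adv[i] = (rewards[i] - mean) / (std + eps);
        }
        return adv;
    }
}
```

Outputs that beat the group average get a positive advantage and are reinforced, while below-average outputs are discouraged. When every output in the group earns the same reward, all advantages are zero and the group contributes no learning signal.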

2025 State: GRPO and DPO are becoming preferred over traditional RLHF due to simplicity and stability. Many state-of-the-art models (DeepSeek, Llama 4) use these newer methods.


Summary for Interviews

  1. LLMs are probabilistic next-token predictors that exhibit emergent reasoning capabilities at scale.
  2. Transformer architecture (2017) enabled parallel training and long-range dependencies, but hybrid models (2024+) are improving efficiency.
  3. Three main architectures: Encoder-only (BERT), Decoder-only (GPT, Llama), Encoder-Decoder (T5). Decoder-only dominates generative AI.
  4. 2026 state-of-the-art: GPT-5.4 (reasoning), Claude Opus 4.7 (coding, 87.6% SWE-bench), Gemini 3.5 Flash (agentic, 4x faster than frontier), Gemini 3.1 Pro (multimodal, 1M+ context), GLM-5.1 (open-source frontier), Gemma 4 (edge-to-cloud), GPT-Rosalind (life sciences).
  5. Mixture-of-Experts (MoE) is ubiquitous — nearly all frontier models use MoE to scale efficiently.
  6. Training pipeline: Pre-training → SFT → Alignment (RLHF/DPO/GRPO).
  7. In-context learning ≠ learning: Weights are frozen; use RAG for external knowledge.
  8. Long context is mainstream: 128K–2M tokens via Ring Attention, linear attention, and hybrid architectures.
  9. The open vs. gated split: Open-source models (GLM-5.1, Gemma 4) now rival or surpass closed models, while some frontier models (Claude Mythos) are deliberately locked behind gated access.
Further Reading