Threads & Concurrency: From Hardware Primitives to Lock-Free Design
Concurrency is the heart of modern high-performance computing. As Moore's Law shifts from increasing clock speeds to increasing core counts, software must leverage threads and synchronization to utilize modern hardware effectively.
This chapter dives into the mechanics of threading, the reality of hardware-level atomicity, and the complex world of memory consistency models.
1. The Essence of a Thread
While a process is a unit of resource ownership, a thread is the unit of execution.
1.1 Thread Anatomy
All threads within a single process share:
- Address Space: Code, Global Data, and Heap.
- Resources: Open file descriptors, signals, and working directory.
Each thread has its own private:
- Program Counter (PC): Where the thread is in the code.
- Stack: Local variables and function call history.
- Registers: Current computational state.
- Thread Local Storage (TLS): Small variables private to the thread.
1.2 Linux Threading: The NPTL Model
In Linux, there is no fundamental difference between a process and a thread at the kernel level. Both are represented by task_struct.
- NPTL (Native POSIX Thread Library): The modern implementation where each
pthread_create()call creates a "Lightweight Process" (LWP) via theclone()system call. - Sharing: Threads are simply LWPs that share the same
mm_struct(memory descriptor) andfiles_struct.
2. Synchronization Internals: How Locks Actually Work
At the high level, we use mutex.lock(). But what happens inside?
2.1 The Spinlock (Kernel-Space Primitive)
A spinlock is a "busy-wait" lock. A thread continuously checks a memory location in a loop until it becomes available.
- Pro: No context switch overhead. Extremely fast if the lock is held for a short time.
- Con: Wastes CPU cycles. Can cause "Priority Inversion" if not managed carefully.
- Implementation: Uses a
test-and-setorexchangeatomic instruction.
2.2 The Futex (Fast Userspace Mutex)
Modern Linux mutexes are built on Futexes.
- The Optimization: If there is no contention (only one thread wants the lock), the lock is acquired entirely in user space via a single atomic instruction. No system call is made.
- Contention: If the lock is already held, the thread makes a
futex()system call to ask the kernel to put it to sleep and wake it up when the lock is free. - Benefit: Combines the speed of user-space atomics with the efficiency of kernel-space sleeping.
3. Hardware Support: Atomicity and Barriers
Software synchronization is only possible because hardware provides specific guarantees.
3.1 Atomic Instructions
- Compare-and-Swap (CAS):
atomic_compare_exchange(ptr, expected, new).- Read the value at
ptr. - If it equals
expected, writenew. - Return whether it succeeded.
- Read the value at
- Fetch-and-Add: Atomically increments a value and returns the old one.
3.2 Memory Barriers (Fences)
Modern CPUs are "Out-of-Order." They might execute a Write to memory after a subsequent Read if there is no data dependency.
- Memory Barrier: A special instruction (e.g.,
mfenceon x86) that forces the CPU to complete all pending memory operations before proceeding. - Why it matters: Without barriers, Thread A might set
ready = truebefore the data it prepared is actually visible to Thread B.
3.3 Memory Consistency Models
- TSO (Total Store Ordering): Used by x86. Most operations are naturally ordered, making multi-threading easier but hardware slightly slower.
- Weak Ordering: Used by ARM and RISC-V. The CPU can reorder almost anything. Software must use explicit barriers everywhere.
4. Classic Concurrency Problems and Patterns
4.1 The Producer-Consumer Problem
Using Semaphores to coordinate access to a bounded buffer.
- Semaphore
full: Tracks occupied slots (Init: 0). - Semaphore
empty: Tracks free slots (Init: Buffer Size). - Mutex: Protects the buffer's internal pointers.
4.2 Reader-Writer Locks
Optimizing for scenarios where data is read often but modified rarely.
- Reader Preference: Allows infinite readers, but can starve writers.
- Writer Preference: New readers must wait if a writer is in queue.
5. The Deadlock Menace
5.1 The Four Necessary Conditions (Coffman)
- Mutual Exclusion: Only one thread can use a resource at a time.
- Hold and Wait: A thread holds a resource while waiting for another.
- No Preemption: Resources cannot be forcibly taken away.
- Circular Wait: Thread A waits for B, who waits for A.
5.2 Deadlock Avoidance: The Banker's Algorithm
The OS simulates the allocation of resources to see if it leads to a "Safe State" (a sequence where everyone can eventually finish). If a request leads to an unsafe state, the request is denied or delayed.
6. Advanced Topic: Lock-Free Programming
Lock-free algorithms use CAS instead of mutexes to ensure that at least one thread in the system always makes progress.
6.1 The ABA Problem
- Thread A reads value
A. - Thread B changes it to
B, then back toA. - Thread A performs a CAS, thinks nothing changed, and potentially corrupts data.
- Solution: Tagged pointers (adding a version number to the pointer).
6.2 Data Structures
- Lock-free Stack: Use CAS to update the
Headpointer. - RCU (Read-Copy-Update): Used extensively in the Linux kernel. Readers never lock; writers create a copy, update it, and swap the pointer. The old copy is deleted only after all current readers are done.
7. Performance Pitfalls in Multi-threading
7.1 False Sharing
When two threads on different cores update different variables that happen to reside on the same Cache Line (usually 64 bytes).
- Result: The hardware must constantly bounce the cache line between cores, destroying performance.
- Fix: Pad variables so they reside on different cache lines.
7.2 Context Switch Overhead
Switching between threads requires:
- Saving registers.
- Switching stacks.
- Flushing parts of the CPU pipeline.
- Rule of thumb: Context switches are expensive (~1-5 microseconds). Use Thread Pools to avoid frequent creation/destruction.
8. Summary Checklist
- Can you explain why
futexis faster than a traditional kernel mutex? - What is the difference between a
spinlockand amutex? - How does
Compare-and-Swapenable lock-free programming? - What is "Priority Inversion" and how does "Priority Inheritance" solve it?
- Explain the ABA problem in the context of lock-free stacks.
End of Chapter 03. Continue to Chapter 04: Memory Management.