Threads & Concurrency: From Hardware Primitives to Lock-Free Design

Concurrency is the heart of modern high-performance computing. As Moore's Law shifts from increasing clock speeds to increasing core counts, software must leverage threads and synchronization to utilize modern hardware effectively.

This chapter dives into the mechanics of threading, the reality of hardware-level atomicity, and the complex world of memory consistency models.

1. The Essence of a Thread

While a process is a unit of resource ownership, a thread is the unit of execution.

1.1 Thread Anatomy

All threads within a single process share:

Address Space: Code, Global Data, and Heap.
Resources: Open file descriptors, signals, and working directory.

Each thread has its own private:

Program Counter (PC): Where the thread is in the code.
Stack: Local variables and function call history.
Registers: Current computational state.
Thread Local Storage (TLS): Small variables private to the thread.

1.2 Linux Threading: The NPTL Model

In Linux, there is no fundamental difference between a process and a thread at the kernel level. Both are represented by task_struct.

NPTL (Native POSIX Thread Library): The modern implementation where each pthread_create() call creates a "Lightweight Process" (LWP) via the clone() system call.
Sharing: Threads are simply LWPs that share the same mm_struct (memory descriptor) and files_struct.

2. Synchronization Internals: How Locks Actually Work

At the high level, we use mutex.lock(). But what happens inside?

2.1 The Spinlock (Kernel-Space Primitive)

A spinlock is a "busy-wait" lock. A thread continuously checks a memory location in a loop until it becomes available.

Pro: No context switch overhead. Extremely fast if the lock is held for a short time.
Con: Wastes CPU cycles. Can cause "Priority Inversion" if not managed carefully.
Implementation: Uses a test-and-set or exchange atomic instruction.

2.2 The Futex (Fast Userspace Mutex)

Modern Linux mutexes are built on Futexes.

The Optimization: If there is no contention (only one thread wants the lock), the lock is acquired entirely in user space via a single atomic instruction. No system call is made.
Contention: If the lock is already held, the thread makes a futex() system call to ask the kernel to put it to sleep and wake it up when the lock is free.
Benefit: Combines the speed of user-space atomics with the efficiency of kernel-space sleeping.

3. Hardware Support: Atomicity and Barriers

Software synchronization is only possible because hardware provides specific guarantees.

3.1 Atomic Instructions

Compare-and-Swap (CAS): atomic_compare_exchange(ptr, expected, new).
1. Read the value at ptr.
2. If it equals expected, write new.
3. Return whether it succeeded.
Fetch-and-Add: Atomically increments a value and returns the old one.

3.2 Memory Barriers (Fences)

Modern CPUs are "Out-of-Order." They might execute a Write to memory after a subsequent Read if there is no data dependency.

Memory Barrier: A special instruction (e.g., mfence on x86) that forces the CPU to complete all pending memory operations before proceeding.
Why it matters: Without barriers, Thread A might set ready = true before the data it prepared is actually visible to Thread B.

3.3 Memory Consistency Models

TSO (Total Store Ordering): Used by x86. Most operations are naturally ordered, making multi-threading easier but hardware slightly slower.
Weak Ordering: Used by ARM and RISC-V. The CPU can reorder almost anything. Software must use explicit barriers everywhere.

4. Classic Concurrency Problems and Patterns

4.1 The Producer-Consumer Problem

Using Semaphores to coordinate access to a bounded buffer.

Semaphore full: Tracks occupied slots (Init: 0).
Semaphore empty: Tracks free slots (Init: Buffer Size).
Mutex: Protects the buffer's internal pointers.

4.2 Reader-Writer Locks

Optimizing for scenarios where data is read often but modified rarely.

Reader Preference: Allows infinite readers, but can starve writers.
Writer Preference: New readers must wait if a writer is in queue.

5. The Deadlock Menace

5.1 The Four Necessary Conditions (Coffman)

Mutual Exclusion: Only one thread can use a resource at a time.
Hold and Wait: A thread holds a resource while waiting for another.
No Preemption: Resources cannot be forcibly taken away.
Circular Wait: Thread A waits for B, who waits for A.

5.2 Deadlock Avoidance: The Banker's Algorithm

The OS simulates the allocation of resources to see if it leads to a "Safe State" (a sequence where everyone can eventually finish). If a request leads to an unsafe state, the request is denied or delayed.

6. Advanced Topic: Lock-Free Programming

Lock-free algorithms use CAS instead of mutexes to ensure that at least one thread in the system always makes progress.

6.1 The ABA Problem

Thread A reads value A.
Thread B changes it to B, then back to A.
Thread A performs a CAS, thinks nothing changed, and potentially corrupts data.

Solution: Tagged pointers (adding a version number to the pointer).

6.2 Data Structures

Lock-free Stack: Use CAS to update the Head pointer.
RCU (Read-Copy-Update): Used extensively in the Linux kernel. Readers never lock; writers create a copy, update it, and swap the pointer. The old copy is deleted only after all current readers are done.

7. Performance Pitfalls in Multi-threading

When two threads on different cores update different variables that happen to reside on the same Cache Line (usually 64 bytes).

Result: The hardware must constantly bounce the cache line between cores, destroying performance.
Fix: Pad variables so they reside on different cache lines.

7.2 Context Switch Overhead

Switching between threads requires:

Saving registers.
Switching stacks.
Flushing parts of the CPU pipeline.

Rule of thumb: Context switches are expensive (~1-5 microseconds). Use Thread Pools to avoid frequent creation/destruction.

8. Summary Checklist

Can you explain why futex is faster than a traditional kernel mutex?
What is the difference between a spinlock and a mutex?
How does Compare-and-Swap enable lock-free programming?
What is "Priority Inversion" and how does "Priority Inheritance" solve it?
Explain the ABA problem in the context of lock-free stacks.

End of Chapter 03. Continue to Chapter 04: Memory Management.

1. The Essence of a Thread​

1.1 Thread Anatomy​

1.2 Linux Threading: The NPTL Model​

2. Synchronization Internals: How Locks Actually Work​

2.1 The Spinlock (Kernel-Space Primitive)​

2.2 The Futex (Fast Userspace Mutex)​

3. Hardware Support: Atomicity and Barriers​

3.1 Atomic Instructions​

3.2 Memory Barriers (Fences)​

3.3 Memory Consistency Models​

4. Classic Concurrency Problems and Patterns​

4.1 The Producer-Consumer Problem​

4.2 Reader-Writer Locks​

5. The Deadlock Menace​

5.1 The Four Necessary Conditions (Coffman)​

5.2 Deadlock Avoidance: The Banker's Algorithm​

6. Advanced Topic: Lock-Free Programming​

6.1 The ABA Problem​

6.2 Data Structures​

7. Performance Pitfalls in Multi-threading​

7.1 False Sharing​

7.2 Context Switch Overhead​

8. Summary Checklist​