## Overview ### Paradigms ^ Paradigm ^ Memory model ^ API ^ Typical scale ^ | Shared-memory | Common address space | [[openmp|OpenMP]], pthreads | Single node | | Distributed-memory | Private per process | [[mpi|MPI]] | Multi-node cluster | | GPU | Host + device | CUDA, HIP, OpenCL | Single GPU | | Hybrid | Mixed | MPI + OpenMP, MPI + CUDA | Multi-node + GPU | ### Laws and models ^ Name ^ Formula ^ What it predicts ^ | [[amdahls-law|Amdahl's law]] | $S = \dfrac{1}{s + (1-s)/p}$ | Maximum speedup given serial fraction $s$ and $p$ processors | | [[gustafsons-law|Gustafson's law]] | $S = p - s(p-1)$ | Speedup when problem size scales with $p$ | | [[roofline-model|Roofline model]] | $\min(\pi,\; I \cdot \beta)$ | Whether a kernel is compute-bound or memory-bandwidth-bound | ### Common failure modes ^ Issue ^ Cause ^ Detection ^ | Race condition | Concurrent unsynchronized read/write | Helgrind, ThreadSanitizer | | Deadlock | Circular lock dependency | Missing progress; `gdb` `info threads` | | False sharing | Two variables in the same cache line | `perf c2c`, cache miss counters | | Load imbalance | Unequal work distribution | Per-thread time profile | | Memory ordering | Relaxed atomics with wrong assumptions | Litmus tests; address and memory sanitizers | ## Three paradigms Parallel computing splits into three broad paradigms based on where the parallelism lives. **Shared-memory parallelism** runs multiple threads on a single machine with a common address space. [[openmp|OpenMP]] is the standard approach in C, C++, and Fortran: a few `#pragma omp` directives turn a serial loop into a parallel one. Threads communicate by reading and writing shared variables, which makes synchronization — mutexes, barriers, atomics — the main source of bugs and overhead. **Distributed-memory parallelism** runs processes across separate machines (or separate address spaces on one machine), each with its own private memory. [[mpi|MPI]] is the dominant standard. Processes communicate explicitly by sending and receiving messages. There is no shared state to race on, but the programmer is responsible for every byte that crosses a process boundary. MPI is the backbone of large cluster workloads. **GPU parallelism** offloads computation to a GPU, which can run thousands of lightweight threads simultaneously. CUDA is NVIDIA's programming model for this. GPU parallelism is best suited for problems where the same operation is applied to a large array of data — matrix multiplication, FFTs, stencil operations. The bottleneck is usually memory bandwidth and the cost of transferring data between host (CPU) memory and device (GPU) memory. ## Performance and correctness Parallel programs introduce failure modes that serial programs don't have: race conditions, deadlocks, false sharing, memory ordering issues. A race condition occurs when two threads read and write shared data without synchronization and the outcome depends on the order of execution. A deadlock occurs when two threads are each waiting for a lock the other holds. False sharing is a subtler hardware-level issue: two threads write to different variables that happen to sit in the same cache line, causing the cache coherence protocol to thrash. On the performance side, the [[roofline-model|roofline model]] is a useful frame for understanding whether a kernel is compute-bound or memory-bandwidth-bound, which determines where to focus optimization effort. For quick empirical benchmarks, [[saxpy]] — a simple vector operation — is a standard starting point for measuring memory bandwidth. ## Practice The fastest way to see parallelism pay off is to compile the same program twice — once without threading, once with — and time both. Here is a parallel sum using OpenMP: ```c // compile: gcc -O2 -fopenmp -o sum sum.c // run: OMP_NUM_THREADS=4 ./sum // description: parallel reduction; compare timing against a serial baseline #include #include int main(void) { long n = 1000000000L, sum = 0; double t = omp_get_wtime(); #pragma omp parallel for reduction(+:sum) for (long i = 0; i < n; i++) sum += i; printf("sum=%ld time=%.2fs\n", sum, omp_get_wtime() - t); return 0; } ``` Build and run both versions: ```bash $ gcc -O2 -o sum_serial sum.c $ gcc -O2 -fopenmp -o sum_parallel sum.c $ ./sum_serial sum=499999999500000000 time=2.14s $ OMP_NUM_THREADS=4 ./sum_parallel sum=499999999500000000 time=0.57s ``` The parallel version runs roughly 4× faster on 4 cores. The answer is identical — the `reduction(+:sum)` clause handles synchronization. Try setting `OMP_NUM_THREADS` from 1 up to your core count and plot the speedup. It will flatten before reaching the theoretical maximum; that is Amdahl's law in action. ## Tree ``` Parallel Computing ├── Models │ ├── Shared Memory │ │ ├── Threads │ │ ├── Locks / Mutexes │ │ ├── Atomics │ │ ├── Barriers │ │ ├── Thread Pools │ │ └── NUMA │ │ │ ├── Distributed Memory │ │ ├── Message Passing │ │ ├── Collectives │ │ ├── Point-to-point │ │ ├── Topologies │ │ └── RDMA │ │ │ ├── Accelerator Computing │ │ ├── GPUs │ │ ├── FPGAs │ │ ├── TPUs │ │ └── Quantum Accelerators │ │ │ └── Hybrid │ ├── MPI + OpenMP │ ├── MPI + CUDA │ └── MPI + OpenMP + CUDA │ ├── Granularity │ ├── Bit-level │ ├── Instruction-level (ILP) │ ├── Data-level (SIMD) │ ├── Thread-level (TLP) │ ├── Task-level │ └── Process-level │ ├── Execution Models │ ├── SIMD │ ├── MISD │ ├── MIMD │ ├── SPMD │ ├── SIMT │ └── BSP │ ├── Synchronization │ ├── Mutex │ ├── Semaphore │ ├── Condition Variable │ ├── Barrier │ ├── Atomic Operations │ ├── Fences │ └── Lock-free │ ├── Communication │ ├── Shared Variables │ ├── Message Passing │ ├── Channels │ ├── Collectives │ ├── Pipelines │ └── Reduction Trees │ ├── Memory Hierarchy │ ├── Registers │ ├── L1/L2/L3 Cache │ ├── RAM │ ├── NUMA Domains │ ├── GPU Global Memory │ ├── Shared Memory (GPU) │ ├── Constant Memory │ └── Distributed Memory │ ├── Scheduling │ ├── Static │ ├── Dynamic │ ├── Guided │ ├── Work Stealing │ ├── Gang Scheduling │ └── Batch Scheduling │ ├── Decomposition │ ├── Data Parallelism │ ├── Task Parallelism │ ├── Pipeline Parallelism │ ├── Domain Decomposition │ ├── Functional Decomposition │ └── Recursive Decomposition │ ├── Performance Theory │ ├── :contentReference[oaicite:0]{index=0} │ ├── :contentReference[oaicite:1]{index=1} │ ├── Strong Scaling │ ├── Weak Scaling │ ├── Speedup │ ├── Efficiency │ ├── Latency │ ├── Bandwidth │ ├── Throughput │ └── Roofline Model │ ├── Hardware │ ├── Multicore CPUs │ ├── Manycore CPUs │ ├── GPUs │ ├── Clusters │ ├── Supercomputers │ ├── Interconnects │ │ ├── Ethernet │ │ ├── :contentReference[oaicite:2]{index=2} │ │ └── NVLink │ └── Storage │ ├── Parallel FS │ ├── Lustre │ └── GPFS │ ├── Programming Models │ ├── :contentReference[oaicite:3]{index=3} │ ├── :contentReference[oaicite:4]{index=4} │ ├── :contentReference[oaicite:5]{index=5} │ ├── :contentReference[oaicite:6]{index=6} │ ├── :contentReference[oaicite:7]{index=7} │ ├── :contentReference[oaicite:8]{index=8} │ ├── :contentReference[oaicite:9]{index=9} │ └── CSP / Actors │ └── Debugging & Profiling ├── Tracing ├── Sampling ├── Race Detection ├── Deadlock Detection ├── Memory Profiling ├── Cache Profiling ├── GPU Profiling └── MPI Profiling ```