## Overview

### Paradigms

^ Paradigm ^ Memory model ^ API ^ Typical scale ^
| Shared-memory | Common address space | [[openmp|OpenMP]], pthreads | Single node |
| Distributed-memory | Private per process | [[mpi|MPI]] | Multi-node cluster |
| GPU | Host + device | CUDA, HIP, OpenCL | Single GPU |
| Hybrid | Mixed | MPI + OpenMP, MPI + CUDA | Multi-node + GPU |

### Laws and models

^ Name ^ Formula ^ What it predicts ^
| [[amdahls-law|Amdahl's law]] | $S = \dfrac{1}{s + (1-s)/p}$ | Maximum speedup given serial fraction $s$ and $p$ processors |
| [[gustafsons-law|Gustafson's law]] | $S = p - s(p-1)$ | Speedup when problem size scales with $p$ |
| [[roofline-model|Roofline model]] | $\min(\pi,\; I \cdot \beta)$ | Whether a kernel is compute-bound or memory-bandwidth-bound |

### Common failure modes

^ Issue ^ Cause ^ Detection ^
| Race condition | Concurrent unsynchronized read/write | Helgrind, ThreadSanitizer |
| Deadlock | Circular lock dependency | Missing progress; `gdb` `info threads` |
| False sharing | Two variables in the same cache line | `perf c2c`, cache miss counters |
| Load imbalance | Unequal work distribution | Per-thread time profile |
| Memory ordering | Relaxed atomics with wrong assumptions | Litmus tests; address and memory sanitizers |


## Three paradigms

Parallel computing splits into three broad paradigms based on where the parallelism lives.

**Shared-memory parallelism** runs multiple threads on a single machine with a common address space. [[openmp|OpenMP]] is the standard approach in C, C++, and Fortran: a few `#pragma omp` directives turn a serial loop into a parallel one. Threads communicate by reading and writing shared variables, which makes synchronization — mutexes, barriers, atomics — the main source of bugs and overhead.

**Distributed-memory parallelism** runs processes across separate machines (or separate address spaces on one machine), each with its own private memory. [[mpi|MPI]] is the dominant standard. Processes communicate explicitly by sending and receiving messages. There is no shared state to race on, but the programmer is responsible for every byte that crosses a process boundary. MPI is the backbone of large cluster workloads.

**GPU parallelism** offloads computation to a GPU, which can run thousands of lightweight threads simultaneously. CUDA is NVIDIA's programming model for this. GPU parallelism is best suited for problems where the same operation is applied to a large array of data — matrix multiplication, FFTs, stencil operations. The bottleneck is usually memory bandwidth and the cost of transferring data between host (CPU) memory and device (GPU) memory.

## Performance and correctness

Parallel programs introduce failure modes that serial programs don't have: race conditions, deadlocks, false sharing, memory ordering issues. A race condition occurs when two threads read and write shared data without synchronization and the outcome depends on the order of execution. A deadlock occurs when two threads are each waiting for a lock the other holds. False sharing is a subtler hardware-level issue: two threads write to different variables that happen to sit in the same cache line, causing the cache coherence protocol to thrash.

On the performance side, the [[roofline-model|roofline model]] is a useful frame for understanding whether a kernel is compute-bound or memory-bandwidth-bound, which determines where to focus optimization effort. For quick empirical benchmarks, [[saxpy]] — a simple vector operation — is a standard starting point for measuring memory bandwidth.

## Practice

The fastest way to see parallelism pay off is to compile the same program twice — once without threading, once with — and time both. Here is a parallel sum using OpenMP:

```c
// compile: gcc -O2 -fopenmp -o sum sum.c
// run: OMP_NUM_THREADS=4 ./sum
// description: parallel reduction; compare timing against a serial baseline

#include <omp.h>
#include <stdio.h>

int main(void) {
    long n = 1000000000L, sum = 0;
    double t = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += i;
    printf("sum=%ld  time=%.2fs\n", sum, omp_get_wtime() - t);
    return 0;
}
```

Build and run both versions:

```bash
$ gcc -O2 -o sum_serial sum.c
$ gcc -O2 -fopenmp -o sum_parallel sum.c
$ ./sum_serial
sum=499999999500000000  time=2.14s
$ OMP_NUM_THREADS=4 ./sum_parallel
sum=499999999500000000  time=0.57s
```

The parallel version runs roughly 4× faster on 4 cores. The answer is identical — the `reduction(+:sum)` clause handles synchronization. Try setting `OMP_NUM_THREADS` from 1 up to your core count and plot the speedup. It will flatten before reaching the theoretical maximum; that is Amdahl's law in action.

## Tree

```
Parallel Computing
├── Models
│   ├── Shared Memory
│   │   ├── Threads
│   │   ├── Locks / Mutexes
│   │   ├── Atomics
│   │   ├── Barriers
│   │   ├── Thread Pools
│   │   └── NUMA
│   │
│   ├── Distributed Memory
│   │   ├── Message Passing
│   │   ├── Collectives
│   │   ├── Point-to-point
│   │   ├── Topologies
│   │   └── RDMA
│   │
│   ├── Accelerator Computing
│   │   ├── GPUs
│   │   ├── FPGAs
│   │   ├── TPUs
│   │   └── Quantum Accelerators
│   │
│   └── Hybrid
│       ├── MPI + OpenMP
│       ├── MPI + CUDA
│       └── MPI + OpenMP + CUDA
│
├── Granularity
│   ├── Bit-level
│   ├── Instruction-level (ILP)
│   ├── Data-level (SIMD)
│   ├── Thread-level (TLP)
│   ├── Task-level
│   └── Process-level
│
├── Execution Models
│   ├── SIMD
│   ├── MISD
│   ├── MIMD
│   ├── SPMD
│   ├── SIMT
│   └── BSP
│
├── Synchronization
│   ├── Mutex
│   ├── Semaphore
│   ├── Condition Variable
│   ├── Barrier
│   ├── Atomic Operations
│   ├── Fences
│   └── Lock-free
│
├── Communication
│   ├── Shared Variables
│   ├── Message Passing
│   ├── Channels
│   ├── Collectives
│   ├── Pipelines
│   └── Reduction Trees
│
├── Memory Hierarchy
│   ├── Registers
│   ├── L1/L2/L3 Cache
│   ├── RAM
│   ├── NUMA Domains
│   ├── GPU Global Memory
│   ├── Shared Memory (GPU)
│   ├── Constant Memory
│   └── Distributed Memory
│
├── Scheduling
│   ├── Static
│   ├── Dynamic
│   ├── Guided
│   ├── Work Stealing
│   ├── Gang Scheduling
│   └── Batch Scheduling
│
├── Decomposition
│   ├── Data Parallelism
│   ├── Task Parallelism
│   ├── Pipeline Parallelism
│   ├── Domain Decomposition
│   ├── Functional Decomposition
│   └── Recursive Decomposition
│
├── Performance Theory
│   ├── :contentReference[oaicite:0]{index=0}
│   ├── :contentReference[oaicite:1]{index=1}
│   ├── Strong Scaling
│   ├── Weak Scaling
│   ├── Speedup
│   ├── Efficiency
│   ├── Latency
│   ├── Bandwidth
│   ├── Throughput
│   └── Roofline Model
│
├── Hardware
│   ├── Multicore CPUs
│   ├── Manycore CPUs
│   ├── GPUs
│   ├── Clusters
│   ├── Supercomputers
│   ├── Interconnects
│   │   ├── Ethernet
│   │   ├── :contentReference[oaicite:2]{index=2}
│   │   └── NVLink
│   └── Storage
│       ├── Parallel FS
│       ├── Lustre
│       └── GPFS
│
├── Programming Models
│   ├── :contentReference[oaicite:3]{index=3}
│   ├── :contentReference[oaicite:4]{index=4}
│   ├── :contentReference[oaicite:5]{index=5}
│   ├── :contentReference[oaicite:6]{index=6}
│   ├── :contentReference[oaicite:7]{index=7}
│   ├── :contentReference[oaicite:8]{index=8}
│   ├── :contentReference[oaicite:9]{index=9}
│   └── CSP / Actors
│
└── Debugging & Profiling
    ├── Tracing
    ├── Sampling
    ├── Race Detection
    ├── Deadlock Detection
    ├── Memory Profiling
    ├── Cache Profiling
    ├── GPU Profiling
    └── MPI Profiling
```