Ivan's wiki

This is an old revision of the document!

Paradigm	Memory model	API	Typical scale
Shared-memory	Common address space	OpenMP, pthreads	Single node
Distributed-memory	Private per process	MPI	Multi-node cluster
GPU	Host + device	CUDA, HIP, OpenCL	Single GPU
Hybrid	Mixed	MPI + OpenMP, MPI + CUDA	Multi-node + GPU

Name	Formula	What it predicts
Amdahl's law	$S = \dfrac{1}{s + (1-s)/p}$	Maximum speedup given serial fraction $s$ and $p$ processors
Gustafson's law	$S = p - s(p-1)$	Speedup when problem size scales with $p$
Roofline model	$\min(\pi,\; I \cdot \beta)$	Whether a kernel is compute-bound or memory-bandwidth-bound

Issue	Cause	Detection
Race condition	Concurrent unsynchronized read/write	Helgrind, ThreadSanitizer
Deadlock	Circular lock dependency	Missing progress; `gdb` `info threads`
False sharing	Two variables in the same cache line	`perf c2c`, cache miss counters
Load imbalance	Unequal work distribution	Per-thread time profile
Memory ordering	Relaxed atomics with wrong assumptions	Litmus tests; address and memory sanitizers