Differences

This shows you the differences between two versions of the page.

--- parallel-computing-overview [June 15, 2026 at 09:32] – created Ivan Janevski
+++ parallel-computing-overview [June 15, 2026 at 09:34] (current) – Ivan Janevski
@@ Line 25: / Line 25: @@
 | Memory ordering | Relaxed atomics with wrong assumptions | Litmus tests; address and memory sanitizers |
+## Three paradigms
+Parallel computing splits into three broad paradigms based on where the parallelism lives.
+**Shared-memory parallelism** runs multiple threads on a single machine with a common address space. [[openmp|OpenMP]] is the standard approach in C, C++, and Fortran: a few `#pragma omp` directives turn a serial loop into a parallel one. Threads communicate by reading and writing shared variables, which makes synchronization — mutexes, barriers, atomics — the main source of bugs and overhead.
+**Distributed-memory parallelism** runs processes across separate machines (or separate address spaces on one machine), each with its own private memory. [[mpi|MPI]] is the dominant standard. Processes communicate explicitly by sending and receiving messages. There is no shared state to race on, but the programmer is responsible for every byte that crosses a process boundary. MPI is the backbone of large cluster workloads.
+**GPU parallelism** offloads computation to a GPU, which can run thousands of lightweight threads simultaneously. CUDA is NVIDIA's programming model for this. GPU parallelism is best suited for problems where the same operation is applied to a large array of data — matrix multiplication, FFTs, stencil operations. The bottleneck is usually memory bandwidth and the cost of transferring data between host (CPU) memory and device (GPU) memory.
+## Performance and correctness
+Parallel programs introduce failure modes that serial programs don't have: race conditions, deadlocks, false sharing, memory ordering issues. A race condition occurs when two threads read and write shared data without synchronization and the outcome depends on the order of execution. A deadlock occurs when two threads are each waiting for a lock the other holds. False sharing is a subtler hardware-level issue: two threads write to different variables that happen to sit in the same cache line, causing the cache coherence protocol to thrash.
+On the performance side, the [[roofline-model|roofline model]] is a useful frame for understanding whether a kernel is compute-bound or memory-bandwidth-bound, which determines where to focus optimization effort. For quick empirical benchmarks, [[saxpy]] — a simple vector operation — is a standard starting point for measuring memory bandwidth.
+## Practice
+The fastest way to see parallelism pay off is to compile the same program twice — once without threading, once with — and time both. Here is a parallel sum using OpenMP:
+```c
+// compile: gcc -O2 -fopenmp -o sum sum.c
+// run: OMP_NUM_THREADS=4 ./sum
+// description: parallel reduction; compare timing against a serial baseline
+#include <omp.h>
+#include <stdio.h>
+int main(void) {
+    long n = 1000000000L, sum = 0;
+    double t = omp_get_wtime();
+    #pragma omp parallel for reduction(+:sum)
+    for (long i = 0; i < n; i++)
+        sum += i;
+    printf("sum=%ld  time=%.2fs\n", sum, omp_get_wtime() - t);
+    return 0;
+}
+```
+Build and run both versions:
+```bash
+$ gcc -O2 -o sum_serial sum.c
+$ gcc -O2 -fopenmp -o sum_parallel sum.c
+$ ./sum_serial
+sum=499999999500000000  time=2.14s
+$ OMP_NUM_THREADS=4 ./sum_parallel
+sum=499999999500000000  time=0.57s
+```
+The parallel version runs roughly 4× faster on 4 cores. The answer is identical — the `reduction(+:sum)` clause handles synchronization. Try setting `OMP_NUM_THREADS` from 1 up to your core count and plot the speedup. It will flatten before reaching the theoretical maximum; that is Amdahl's law in action.
+## Tree
+```
+Parallel Computing
+├── Models
+│   ├── Shared Memory
+│   │   ├── Threads
+│   │   ├── Locks / Mutexes
+│   │   ├── Atomics
+│   │   ├── Barriers
+│   │   ├── Thread Pools
+│   │   └── NUMA
+│   │
+│   ├── Distributed Memory
+│   │   ├── Message Passing
+│   │   ├── Collectives
+│   │   ├── Point-to-point
+│   │   ├── Topologies
+│   │   └── RDMA
+│   │
+│   ├── Accelerator Computing
+│   │   ├── GPUs
+│   │   ├── FPGAs
+│   │   ├── TPUs
+│   │   └── Quantum Accelerators
+│   │
+│   └── Hybrid
+│       ├── MPI + OpenMP
+│       ├── MPI + CUDA
+│       └── MPI + OpenMP + CUDA
+│
+├── Granularity
+│   ├── Bit-level
+│   ├── Instruction-level (ILP)
+│   ├── Data-level (SIMD)
+│   ├── Thread-level (TLP)
+│   ├── Task-level
+│   └── Process-level
+│
+├── Execution Models
+│   ├── SIMD
+│   ├── MISD
+│   ├── MIMD
+│   ├── SPMD
+│   ├── SIMT
+│   └── BSP
+│
+├── Synchronization
+│   ├── Mutex
+│   ├── Semaphore
+│   ├── Condition Variable
+│   ├── Barrier
+│   ├── Atomic Operations
+│   ├── Fences
+│   └── Lock-free
+│
+├── Communication
+│   ├── Shared Variables
+│   ├── Message Passing
+│   ├── Channels
+│   ├── Collectives
+│   ├── Pipelines
+│   └── Reduction Trees
+│
+├── Memory Hierarchy
+│   ├── Registers
+│   ├── L1/L2/L3 Cache
+│   ├── RAM
+│   ├── NUMA Domains
+│   ├── GPU Global Memory
+│   ├── Shared Memory (GPU)
+│   ├── Constant Memory
+│   └── Distributed Memory
+│
+├── Scheduling
+│   ├── Static
+│   ├── Dynamic
+│   ├── Guided
+│   ├── Work Stealing
+│   ├── Gang Scheduling
+│   └── Batch Scheduling
+│
+├── Decomposition
+│   ├── Data Parallelism
+│   ├── Task Parallelism
+│   ├── Pipeline Parallelism
+│   ├── Domain Decomposition
+│   ├── Functional Decomposition
+│   └── Recursive Decomposition
+│
+├── Performance Theory
+│   ├── :contentReference[oaicite:0]{index=0}
+│   ├── :contentReference[oaicite:1]{index=1}
+│   ├── Strong Scaling
+│   ├── Weak Scaling
+│   ├── Speedup
+│   ├── Efficiency
+│   ├── Latency
+│   ├── Bandwidth
+│   ├── Throughput
+│   └── Roofline Model
+│
+├── Hardware
+│   ├── Multicore CPUs
+│   ├── Manycore CPUs
+│   ├── GPUs
+│   ├── Clusters
+│   ├── Supercomputers
+│   ├── Interconnects
+│   │   ├── Ethernet
+│   │   ├── :contentReference[oaicite:2]{index=2}
+│   │   └── NVLink
+│   └── Storage
+│       ├── Parallel FS
+│       ├── Lustre
+│       └── GPFS
+│
+├── Programming Models
+│   ├── :contentReference[oaicite:3]{index=3}
+│   ├── :contentReference[oaicite:4]{index=4}
+│   ├── :contentReference[oaicite:5]{index=5}
+│   ├── :contentReference[oaicite:6]{index=6}
+│   ├── :contentReference[oaicite:7]{index=7}
+│   ├── :contentReference[oaicite:8]{index=8}
+│   ├── :contentReference[oaicite:9]{index=9}
+│   └── CSP / Actors
+│
+└── Debugging & Profiling
+    ├── Tracing
+    ├── Sampling
+    ├── Race Detection
+    ├── Deadlock Detection
+    ├── Memory Profiling
+    ├── Cache Profiling
+    ├── GPU Profiling
+    └── MPI Profiling
+```