parallel-computing-overview
Differences
This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| parallel-computing-overview [June 15, 2026 at 09:32] – created Ivan Janevski | parallel-computing-overview [June 15, 2026 at 09:34] (current) – Ivan Janevski | ||
|---|---|---|---|
| Line 25: | Line 25: | ||
| | Memory ordering | Relaxed atomics with wrong assumptions | Litmus tests; address and memory sanitizers | | | Memory ordering | Relaxed atomics with wrong assumptions | Litmus tests; address and memory sanitizers | | ||
| + | |||
| + | |||
| + | ## Three paradigms | ||
| + | |||
| + | Parallel computing splits into three broad paradigms based on where the parallelism lives. | ||
| + | |||
| + | **Shared-memory parallelism** runs multiple threads on a single machine with a common address space. [[openmp|OpenMP]] is the standard approach in C, C++, and Fortran: a few `#pragma omp` directives turn a serial loop into a parallel one. Threads communicate by reading and writing shared variables, which makes synchronization — mutexes, barriers, atomics — the main source of bugs and overhead. | ||
| + | |||
| + | **Distributed-memory parallelism** runs processes across separate machines (or separate address spaces on one machine), each with its own private memory. [[mpi|MPI]] is the dominant standard. Processes communicate explicitly by sending and receiving messages. There is no shared state to race on, but the programmer is responsible for every byte that crosses a process boundary. MPI is the backbone of large cluster workloads. | ||
| + | |||
| + | **GPU parallelism** offloads computation to a GPU, which can run thousands of lightweight threads simultaneously. CUDA is NVIDIA' | ||
| + | |||
| + | ## Performance and correctness | ||
| + | |||
| + | Parallel programs introduce failure modes that serial programs don't have: race conditions, deadlocks, false sharing, memory ordering issues. A race condition occurs when two threads read and write shared data without synchronization and the outcome depends on the order of execution. A deadlock occurs when two threads are each waiting for a lock the other holds. False sharing is a subtler hardware-level issue: two threads write to different variables that happen to sit in the same cache line, causing the cache coherence protocol to thrash. | ||
| + | |||
| + | On the performance side, the [[roofline-model|roofline model]] is a useful frame for understanding whether a kernel is compute-bound or memory-bandwidth-bound, | ||
| + | |||
| + | ## Practice | ||
| + | |||
| + | The fastest way to see parallelism pay off is to compile the same program twice — once without threading, once with — and time both. Here is a parallel sum using OpenMP: | ||
| + | |||
| + | ```c | ||
| + | // compile: gcc -O2 -fopenmp -o sum sum.c | ||
| + | // run: OMP_NUM_THREADS=4 ./sum | ||
| + | // description: | ||
| + | |||
| + | #include < | ||
| + | #include < | ||
| + | |||
| + | int main(void) { | ||
| + | long n = 1000000000L, | ||
| + | double t = omp_get_wtime(); | ||
| + | #pragma omp parallel for reduction(+: | ||
| + | for (long i = 0; i < n; i++) | ||
| + | sum += i; | ||
| + | printf(" | ||
| + | return 0; | ||
| + | } | ||
| + | ``` | ||
| + | |||
| + | Build and run both versions: | ||
| + | |||
| + | ```bash | ||
| + | $ gcc -O2 -o sum_serial sum.c | ||
| + | $ gcc -O2 -fopenmp -o sum_parallel sum.c | ||
| + | $ ./ | ||
| + | sum=499999999500000000 | ||
| + | $ OMP_NUM_THREADS=4 ./ | ||
| + | sum=499999999500000000 | ||
| + | ``` | ||
| + | |||
| + | The parallel version runs roughly 4× faster on 4 cores. The answer is identical — the `reduction(+: | ||
| + | |||
| + | ## Tree | ||
| + | |||
| + | ``` | ||
| + | Parallel Computing | ||
| + | ├── Models | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | ├── Granularity | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | ├── Execution Models | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | ├── Synchronization | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | ├── Communication | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | ├── Memory Hierarchy | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | ├── Scheduling | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | ├── Decomposition | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | ├── Performance Theory | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | ├── Hardware | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | ├── Programming Models | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | │ | ||
| + | └── Debugging & Profiling | ||
| + | ├── Tracing | ||
| + | ├── Sampling | ||
| + | ├── Race Detection | ||
| + | ├── Deadlock Detection | ||
| + | ├── Memory Profiling | ||
| + | ├── Cache Profiling | ||
| + | ├── GPU Profiling | ||
| + | └── MPI Profiling | ||
| + | ``` | ||
parallel-computing-overview.1781515952.txt.gz · Last modified: by Ivan Janevski
