Differences

--- parallel-computing [June 10, 2026 at 21:07] – Ivan Janevski
+++ parallel-computing [June 10, 2026 at 22:29] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
 # Parallel computing
-**Parallel computing** is a type of software engineering where you use parallelism (more threads) to increase the amount of computation power to tackle a problem.
+**Parallel computing** is a style of programming where a computation is broken into parts that run simultaneously across multiple processors, cores, or machines. The motivation is straightforward: a single core has a clock speed ceiling, and modern CPUs gain performance by adding more cores rather than running each core faster. To take advantage of that, programs have to be written with parallelism in mind.
-There are essentially three spheres of parallel computing: 1. CPU parallelism ([[openmp|OpenMP]]), 2. distributed parallelism ([[mpi|MPI]]), and 3. GPU parallelism ([[cuda|CUDA]]). I would carve out a fourth category, which is 4.
+Not every program benefits equally. [[amdahls-law|Amdahl's law]] shows that, for a fixed problem size, the sequential fraction of a program (the part that cannot be parallelized) sets a hard ceiling on speedup regardless of how many cores you add. [[gustafsons-law|Gustafson's law]] is the more optimistic counterpart: if you scale the problem size alongside the hardware, so that the parallel work grows while the serial part stays roughly fixed, speedup grows linearly with the number of processors. In practice, many HPC workloads sit in Gustafson's regime: you buy more nodes to solve a bigger problem, not just to solve the same one faster.
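+To make the contrast concrete, here is a minimal sketch that evaluates both laws side by side. The function names and the parallel fraction `P = 0.95` are assumptions chosen for illustration; in Gustafson's formula, `P` is the parallel fraction of the scaled run as measured on the parallel machine.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Fixed problem size: S(N) = 1 / ((1 - P) + P / N)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


def gustafson_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Problem size scaled with N: S(N) = (1 - P) + P * N."""
    return (1.0 - parallel_fraction) + parallel_fraction * n_cores


if __name__ == "__main__":
    P = 0.95  # assumed parallel fraction, for illustration only
    for n in (1, 8, 64, 1024):
        print(f"N={n:5d}  Amdahl={amdahl_speedup(P, n):7.2f}  "
              f"Gustafson={gustafson_speedup(P, n):8.2f}")
    # With a fixed problem size the speedup can never exceed 1 / (1 - P).
    print(f"Amdahl ceiling: {1.0 / (1.0 - P):.1f}")
```

+With these assumed numbers, the Amdahl column flattens out below 20 while the Gustafson column keeps growing with `N`, which is the difference between the fixed-size and scaled-size views.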
-Parallel computing is related to performance engineering. This makes sense, because usually increasing parallelism increases performance, but this is not always so. Not every piece of code can be parallelized and sometimes the overhead of creating threads can outweigh the potential speedup.
+## Three paradigms
-Generally speaking, the potential speedup you'd get by parallelizing across $N$ cores is governed by Amdahl's law
+Parallel computing splits into three broad paradigms based on where the parallelism lives.
-$$S(\text{N-cores}) = \frac{1}{(1 - P) + \frac{P}{N}}$$
+**Shared-memory parallelism** runs multiple threads on a single machine with a common address space. [[openmp|OpenMP]] is the standard approach in C, C++, and Fortran: a few `#pragma omp` directives turn a serial loop into a parallel one. Threads communicate by reading and writing shared variables, which makes synchronization — mutexes, barriers, atomics — the main source of bugs and overhead.
+**Distributed-memory parallelism** runs processes across separate machines (or separate address spaces on one machine), each with its own private memory. [[mpi|MPI]] is the dominant standard. Processes communicate explicitly by sending and receiving messages. There is no shared state to race on, but the programmer is responsible for every byte that crosses a process boundary. MPI is the backbone of large cluster workloads.
+**GPU parallelism** offloads computation to a GPU, which can run thousands of lightweight threads simultaneously. [[cuda|CUDA]] is NVIDIA's programming model for this. GPU parallelism is best suited for problems where the same operation is applied to a large array of data, such as matrix multiplication, FFTs, and stencil operations. The bottleneck is usually memory bandwidth and the cost of transferring data between host (CPU) memory and device (GPU) memory.
+## Performance and correctness
+Parallel programs introduce failure modes that serial programs don't have: race conditions, deadlocks, false sharing, and memory ordering issues. A race condition occurs when two or more threads access the same data without synchronization, at least one of them writes, and the outcome depends on the order of execution. A deadlock occurs when two threads are each waiting for a lock the other holds. False sharing is a subtler hardware-level issue: two threads write to different variables that happen to sit in the same cache line, causing the cache coherence protocol to thrash.
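+As a small illustration of the race-condition case, the sketch below protects a shared counter with a mutex. The helper name `locked_count` is hypothetical, and Python threads are used only to show the synchronization pattern, not to gain speed. Without the lock, the increment is an unsynchronized read-modify-write and updates may be lost, although whether that actually happens depends on the interpreter and on timing.

```python
import threading


def locked_count(n_threads: int, increments_per_thread: int) -> int:
    """Increment a shared counter from several threads, guarded by a mutex."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments_per_thread):
            # "counter += 1" reads, adds, and writes back. Holding the lock
            # makes those three steps atomic with respect to other workers.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker before reading the result
    return counter


if __name__ == "__main__":
    print(locked_count(n_threads=4, increments_per_thread=10_000))  # 40000
```

+The lock serializes every increment, which is also why heavy synchronization shows up as overhead: the protected region runs one thread at a time.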
+On the performance side, the [[roofline-model|roofline model]] is a useful frame for understanding whether a kernel is compute-bound or memory-bandwidth-bound, which determines where to focus optimization effort. For quick empirical benchmarks, [[saxpy]] (the vector update $y \leftarrow a x + y$) is a standard starting point for measuring memory bandwidth.
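+The roofline bound can be sketched in a few lines. The function names and the machine numbers (1000 GFLOP/s peak compute, 100 GB/s memory bandwidth) are hypothetical and do not describe any real device. The SAXPY intensity assumes single precision and that each element costs two loads and one store of 4 bytes each, ignoring cache effects.

```python
def saxpy(a: float, x: list[float], y: list[float]) -> list[float]:
    """Reference SAXPY: returns a * x + y element by element."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# SAXPY per element: 2 flops (one multiply, one add) and 12 bytes of traffic
# (read x, read y, write y at 4 bytes each), so the intensity is 1/6 flop/byte.
SAXPY_INTENSITY = 2.0 / 12.0

if __name__ == "__main__":
    peak, bandwidth = 1000.0, 100.0  # hypothetical machine, for illustration
    bound = attainable_gflops(peak, bandwidth, SAXPY_INTENSITY)
    regime = "memory-bound" if bound < peak else "compute-bound"
    print(f"SAXPY bound: {bound:.2f} GFLOP/s ({regime})")
    # Kernels need at least peak / bandwidth flops per byte to reach the
    # compute roof; this is the ridge point of the roofline.
    print(f"Ridge point: {peak / bandwidth:.1f} flop/byte")
```

+Because SAXPY's intensity sits far below the ridge point of this assumed machine, its measured throughput mostly reflects memory bandwidth, which is why it works as a bandwidth benchmark.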
+## List of concepts
+ - [[list-of-parallel-computing-concepts]]