parallel-computing

Differences

This shows you the differences between two versions of the page.

parallel-computing [June 10, 2026 at 20:44] Ivan Janevski
parallel-computing [June 10, 2026 at 22:29] (current) – external edit 127.0.0.1
Line 1: Line 1:
 # Parallel computing # Parallel computing
-**Parallel computing** is a type of software engineering where you use parallelism (e.g. more threads, more processes, more nodes, etc.) to increase the amount of computation power to tackle a problem.
+**Parallel computing** is a style of programming where a computation is broken into parts that run simultaneously across multiple processors, cores, or machines. The motivation is straightforward: a single core has a clock speed ceiling, and modern CPUs gain performance by adding more cores rather than running each core faster. To take advantage of that, programs have to be written with parallelism in mind.
  
-There are essentially three spheres of parallel computing: 1. CPU parallelism ([[openmp|OpenMP]]), 2. distributed parallelism ([[mpi|MPI]]), and 3. GPU parallelism ([[cuda|CUDA]]).
+Not every program benefits equally. [[amdahls-law|Amdahl's law]] shows that the sequential fraction of a program (the part that cannot be parallelized) sets a hard ceiling on speedup regardless of how many cores you add. [[gustafsons-law|Gustafson's law]] is the more optimistic counterpart: if you scale the problem size alongside the hardware, speedup grows roughly linearly with the number of processors. In practice, many HPC workloads follow Gustafson's regime: you buy more nodes to solve a bigger problem, not just to solve the same one faster.
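
The two laws can be compared numerically. The sketch below is illustrative only: the function names and the parallel fraction `p = 0.95` are assumptions, not part of the page. It uses the usual textbook forms, $S = 1 / ((1 - p) + p/n)$ for Amdahl and $S = (1 - p) + p \cdot n$ for Gustafson, where in the Gustafson case `p` is the parallel fraction of the runtime measured on the parallel machine.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Fixed problem size: speedup for parallel fraction p on n cores."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(p: float, n: int) -> float:
    """Scaled problem size: speedup when the parallel part grows with n."""
    return (1.0 - p) + p * n


if __name__ == "__main__":
    p = 0.95  # assumed parallel fraction, chosen only for illustration
    for n in (1, 12, 1024):
        print(f"n={n:5d}  Amdahl={amdahl_speedup(p, n):7.2f}  "
              f"Gustafson={gustafson_speedup(p, n):8.2f}")
```

With these assumed numbers, the Amdahl speedup stays below $1 / (1 - p) = 20$ no matter how large `n` gets, while the Gustafson speedup keeps growing with `n`.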
  
-Parallel computing is related to performance engineering. This makes sense, because having say 12 CPU cores working on a problem.
+## Three paradigms
-$$S(\text{N-cores}) = \frac{1}{(1 - P) + \frac{P}{N}}$$
+
  
- - $S(N)$ - Speed up
+Parallel computing splits into three broad paradigms based on where the parallelism lives.
- - $P$ - Length
+
  
-```
+**Shared-memory parallelism** runs multiple threads on a single machine with a common address space. [[openmp|OpenMP]] is the standard approach in C, C++, and Fortran: a few `#pragma omp` directives (`!$omp` in Fortran) turn a serial loop into a parallel one. Threads communicate by reading and writing shared variables, which makes synchronization (mutexes, barriers, atomics) the main source of bugs and overhead.
 + 
 +**Distributed-memory parallelism** runs processes across separate machines (or separate address spaces on one machine), each with its own private memory. [[mpi|MPI]] is the dominant standard. Processes communicate explicitly by sending and receiving messages. There is no shared state to race on, but the programmer is responsible for every byte that crosses a process boundary. MPI is the backbone of large cluster workloads. 
 + 
 +**GPU parallelism** offloads computation to a GPU, which can run thousands of lightweight threads simultaneously. CUDA is NVIDIA's programming model for this. GPU parallelism is best suited for problems where the same operation is applied to a large array of data — matrix multiplication, FFTs, stencil operations. The bottleneck is usually memory bandwidth and the cost of transferring data between host (CPU) memory and device (GPU) memory. 
 + 
 +## Performance and correctness 
 + 
 +Parallel programs introduce failure modes that serial programs don't have: race conditions, deadlocks, false sharing, memory ordering issues. A race condition occurs when two threads read and write shared data without synchronization and the outcome depends on the order of execution. A deadlock occurs when two threads are each waiting for a lock the other holds. False sharing is a subtler hardware-level issue: two threads write to different variables that happen to sit in the same cache line, causing the cache coherence protocol to thrash. 
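
As a minimal sketch of the race-condition case, the following uses Python's standard `threading` module; the function name `run_counter` and its parameters are hypothetical. The increment of the shared counter is a read-modify-write, so it is guarded by a lock. Without the lock, two threads could read the same old value and one update would be lost. In CPython the unsynchronized version may often appear to work, but its result is not guaranteed.

```python
import threading


def run_counter(num_threads: int = 4, increments: int = 10_000) -> int:
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            # Critical section: only one thread at a time may perform
            # the read-modify-write on the shared counter.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Because every update happens under the lock, the final count equals `num_threads * increments` on every run. The cost is that the threads serialize on the lock, which is the synchronization overhead described above.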
 + 
 +On the performance side, the [[roofline-model|roofline model]] is a useful frame for understanding whether a kernel is compute-bound or memory-bandwidth-bound, which determines where to focus optimization effort. For quick empirical benchmarks, [[saxpy]] — a simple vector operation — is a standard starting point for measuring memory bandwidth. 
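
The roofline bound is simple enough to sketch directly. The code below is illustrative only: the machine numbers (1000 GFLOP/s peak, 100 GB/s bandwidth) are made-up assumptions, the function names are hypothetical, and the pure-Python `saxpy` shows the operation rather than serving as a benchmark. It assumes single-precision SAXPY, which performs 2 floating-point operations per element while moving 12 bytes (read `x`, read `y`, write `y`).

```python
def saxpy(a, x, y):
    """Return a * x + y element-wise for two equal-length sequences."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


# Single precision: 2 flops per element, 3 accesses of 4 bytes each.
SAXPY_INTENSITY = 2 / 12  # flops per byte

if __name__ == "__main__":
    peak, bandwidth = 1000.0, 100.0  # assumed machine, for illustration only
    bound = attainable_gflops(peak, bandwidth, SAXPY_INTENSITY)
    print(f"ridge point: {peak / bandwidth:.1f} flop/byte")
    print(f"SAXPY bound: {bound:.2f} GFLOP/s")
```

On this assumed machine the ridge point is 10 flop/byte, so SAXPY at roughly 0.17 flop/byte sits well inside the memory-bound region. That is why it is useful for measuring bandwidth rather than compute throughput.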
 + 
 +## List of concepts 
 + 
 + - [[list-of-parallel-computing-concepts]]
  
-``` 
parallel-computing.1781124280.txt.gz · Last modified: by Ivan Janevski