Site Tools


parallel-computing-overview

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Next revision
Previous revision
parallel-computing-overview [June 15, 2026 at 09:32] – created Ivan Janevskiparallel-computing-overview [June 15, 2026 at 09:34] (current) Ivan Janevski
Line 25: Line 25:
 | Memory ordering | Relaxed atomics with wrong assumptions | Litmus tests; address and memory sanitizers | | Memory ordering | Relaxed atomics with wrong assumptions | Litmus tests; address and memory sanitizers |
  
 +
 +
 +## Three paradigms
 +
 +Parallel computing splits into three broad paradigms based on where the parallelism lives.
 +
 +**Shared-memory parallelism** runs multiple threads on a single machine with a common address space. [[openmp|OpenMP]] is the standard approach in C, C++, and Fortran: a few `#pragma omp` directives turn a serial loop into a parallel one. Threads communicate by reading and writing shared variables, which makes synchronization — mutexes, barriers, atomics — the main source of bugs and overhead.
 +
 +**Distributed-memory parallelism** runs processes across separate machines (or separate address spaces on one machine), each with its own private memory. [[mpi|MPI]] is the dominant standard. Processes communicate explicitly by sending and receiving messages. There is no shared state to race on, but the programmer is responsible for every byte that crosses a process boundary. MPI is the backbone of large cluster workloads.
 +
 +**GPU parallelism** offloads computation to a GPU, which can run thousands of lightweight threads simultaneously. CUDA is NVIDIA's programming model for this. GPU parallelism is best suited for problems where the same operation is applied to a large array of data — matrix multiplication, FFTs, stencil operations. The bottleneck is usually memory bandwidth and the cost of transferring data between host (CPU) memory and device (GPU) memory.
 +
 +## Performance and correctness
 +
 +Parallel programs introduce failure modes that serial programs don't have: race conditions, deadlocks, false sharing, memory ordering issues. A race condition occurs when two threads read and write shared data without synchronization and the outcome depends on the order of execution. A deadlock occurs when two threads are each waiting for a lock the other holds. False sharing is a subtler hardware-level issue: two threads write to different variables that happen to sit in the same cache line, causing the cache coherence protocol to thrash.
 +
 +On the performance side, the [[roofline-model|roofline model]] is a useful frame for understanding whether a kernel is compute-bound or memory-bandwidth-bound, which determines where to focus optimization effort. For quick empirical benchmarks, [[saxpy]] — a simple vector operation — is a standard starting point for measuring memory bandwidth.
 +
 +## Practice
 +
 +The fastest way to see parallelism pay off is to compile the same program twice — once without threading, once with — and time both. Here is a parallel sum using OpenMP:
 +
 +```c
 +// compile: gcc -O2 -fopenmp -o sum sum.c
 +// run: OMP_NUM_THREADS=4 ./sum
 +// description: parallel reduction; compare timing against a serial baseline
 +
 +#include <omp.h>
 +#include <stdio.h>
 +
 +int main(void) {
 +    long n = 1000000000L, sum = 0;
 +    double t = omp_get_wtime();
 +    #pragma omp parallel for reduction(+:sum)
 +    for (long i = 0; i < n; i++)
 +        sum += i;
 +    printf("sum=%ld  time=%.2fs\n", sum, omp_get_wtime() - t);
 +    return 0;
 +}
 +```
 +
 +Build and run both versions:
 +
 +```bash
 +$ gcc -O2 -o sum_serial sum.c
 +$ gcc -O2 -fopenmp -o sum_parallel sum.c
 +$ ./sum_serial
 +sum=499999999500000000  time=2.14s
 +$ OMP_NUM_THREADS=4 ./sum_parallel
 +sum=499999999500000000  time=0.57s
 +```
 +
 +The parallel version runs roughly 4× faster on 4 cores. The answer is identical — the `reduction(+:sum)` clause handles synchronization. Try setting `OMP_NUM_THREADS` from 1 up to your core count and plot the speedup. It will flatten before reaching the theoretical maximum; that is Amdahl's law in action.
 +
 +## Tree
 +
 +```
 +Parallel Computing
 +├── Models
 +│   ├── Shared Memory
 +│   │   ├── Threads
 +│   │   ├── Locks / Mutexes
 +│   │   ├── Atomics
 +│   │   ├── Barriers
 +│   │   ├── Thread Pools
 +│   │   └── NUMA
 +│   │
 +│   ├── Distributed Memory
 +│   │   ├── Message Passing
 +│   │   ├── Collectives
 +│   │   ├── Point-to-point
 +│   │   ├── Topologies
 +│   │   └── RDMA
 +│   │
 +│   ├── Accelerator Computing
 +│   │   ├── GPUs
 +│   │   ├── FPGAs
 +│   │   ├── TPUs
 +│   │   └── Quantum Accelerators
 +│   │
 +│   └── Hybrid
 +│       ├── MPI + OpenMP
 +│       ├── MPI + CUDA
 +│       └── MPI + OpenMP + CUDA
 +
 +├── Granularity
 +│   ├── Bit-level
 +│   ├── Instruction-level (ILP)
 +│   ├── Data-level (SIMD)
 +│   ├── Thread-level (TLP)
 +│   ├── Task-level
 +│   └── Process-level
 +
 +├── Execution Models
 +│   ├── SIMD
 +│   ├── MISD
 +│   ├── MIMD
 +│   ├── SPMD
 +│   ├── SIMT
 +│   └── BSP
 +
 +├── Synchronization
 +│   ├── Mutex
 +│   ├── Semaphore
 +│   ├── Condition Variable
 +│   ├── Barrier
 +│   ├── Atomic Operations
 +│   ├── Fences
 +│   └── Lock-free
 +
 +├── Communication
 +│   ├── Shared Variables
 +│   ├── Message Passing
 +│   ├── Channels
 +│   ├── Collectives
 +│   ├── Pipelines
 +│   └── Reduction Trees
 +
 +├── Memory Hierarchy
 +│   ├── Registers
 +│   ├── L1/L2/L3 Cache
 +│   ├── RAM
 +│   ├── NUMA Domains
 +│   ├── GPU Global Memory
 +│   ├── Shared Memory (GPU)
 +│   ├── Constant Memory
 +│   └── Distributed Memory
 +
 +├── Scheduling
 +│   ├── Static
 +│   ├── Dynamic
 +│   ├── Guided
 +│   ├── Work Stealing
 +│   ├── Gang Scheduling
 +│   └── Batch Scheduling
 +
 +├── Decomposition
 +│   ├── Data Parallelism
 +│   ├── Task Parallelism
 +│   ├── Pipeline Parallelism
 +│   ├── Domain Decomposition
 +│   ├── Functional Decomposition
 +│   └── Recursive Decomposition
 +
 +├── Performance Theory
 +│   ├── :contentReference[oaicite:0]{index=0}
 +│   ├── :contentReference[oaicite:1]{index=1}
 +│   ├── Strong Scaling
 +│   ├── Weak Scaling
 +│   ├── Speedup
 +│   ├── Efficiency
 +│   ├── Latency
 +│   ├── Bandwidth
 +│   ├── Throughput
 +│   └── Roofline Model
 +
 +├── Hardware
 +│   ├── Multicore CPUs
 +│   ├── Manycore CPUs
 +│   ├── GPUs
 +│   ├── Clusters
 +│   ├── Supercomputers
 +│   ├── Interconnects
 +│   │   ├── Ethernet
 +│   │   ├── :contentReference[oaicite:2]{index=2}
 +│   │   └── NVLink
 +│   └── Storage
 +│       ├── Parallel FS
 +│       ├── Lustre
 +│       └── GPFS
 +
 +├── Programming Models
 +│   ├── :contentReference[oaicite:3]{index=3}
 +│   ├── :contentReference[oaicite:4]{index=4}
 +│   ├── :contentReference[oaicite:5]{index=5}
 +│   ├── :contentReference[oaicite:6]{index=6}
 +│   ├── :contentReference[oaicite:7]{index=7}
 +│   ├── :contentReference[oaicite:8]{index=8}
 +│   ├── :contentReference[oaicite:9]{index=9}
 +│   └── CSP / Actors
 +
 +└── Debugging & Profiling
 +    ├── Tracing
 +    ├── Sampling
 +    ├── Race Detection
 +    ├── Deadlock Detection
 +    ├── Memory Profiling
 +    ├── Cache Profiling
 +    ├── GPU Profiling
 +    └── MPI Profiling
 +```
parallel-computing-overview.1781515952.txt.gz · Last modified: by Ivan Janevski