parallel-computing-overview

Overview
Three paradigms
Performance and correctness
Practice
Tree

Overview

Paradigms

Paradigm	Memory model	API	Typical scale
Shared-memory	Common address space	OpenMP, pthreads	Single node
Distributed-memory	Private per process	MPI	Multi-node cluster
GPU	Host + device	CUDA, HIP, OpenCL	Single GPU
Hybrid	Mixed	MPI + OpenMP, MPI + CUDA	Multi-node + GPU

Laws and models

Name	Formula	What it predicts
Amdahl's law	$S = \dfrac{1}{s + (1-s)/p}$	Maximum speedup given serial fraction $s$ and $p$ processors
Gustafson's law	$S = p - s(p-1)$	Speedup when problem size scales with $p$
Roofline model	$\min(\pi,\; I \cdot \beta)$	Whether a kernel is compute-bound or memory-bandwidth-bound

Common failure modes

Issue	Cause	Detection
Race condition	Concurrent unsynchronized read/write	Helgrind, ThreadSanitizer
Deadlock	Circular lock dependency	Missing progress; `gdb` `info threads`
False sharing	Two variables in the same cache line	`perf c2c`, cache miss counters
Load imbalance	Unequal work distribution	Per-thread time profile
Memory ordering	Relaxed atomics with wrong assumptions	Litmus tests; address and memory sanitizers

Three paradigms

Parallel computing splits into three broad paradigms based on where the parallelism lives.

Shared-memory parallelism runs multiple threads on a single machine with a common address space. OpenMP is the standard approach in C, C++, and Fortran: a few #pragma omp directives turn a serial loop into a parallel one. Threads communicate by reading and writing shared variables, which makes synchronization — mutexes, barriers, atomics — the main source of bugs and overhead.

Distributed-memory parallelism runs processes across separate machines (or separate address spaces on one machine), each with its own private memory. MPI is the dominant standard. Processes communicate explicitly by sending and receiving messages. There is no shared state to race on, but the programmer is responsible for every byte that crosses a process boundary. MPI is the backbone of large cluster workloads.

GPU parallelism offloads computation to a GPU, which can run thousands of lightweight threads simultaneously. CUDA is NVIDIA's programming model for this. GPU parallelism is best suited for problems where the same operation is applied to a large array of data — matrix multiplication, FFTs, stencil operations. The bottleneck is usually memory bandwidth and the cost of transferring data between host (CPU) memory and device (GPU) memory.

Performance and correctness

Parallel programs introduce failure modes that serial programs don't have: race conditions, deadlocks, false sharing, memory ordering issues. A race condition occurs when two threads read and write shared data without synchronization and the outcome depends on the order of execution. A deadlock occurs when two threads are each waiting for a lock the other holds. False sharing is a subtler hardware-level issue: two threads write to different variables that happen to sit in the same cache line, causing the cache coherence protocol to thrash.

On the performance side, the roofline model is a useful frame for understanding whether a kernel is compute-bound or memory-bandwidth-bound, which determines where to focus optimization effort. For quick empirical benchmarks, SAXPY — a simple vector operation — is a standard starting point for measuring memory bandwidth.

Practice

The fastest way to see parallelism pay off is to compile the same program twice — once without threading, once with — and time both. Here is a parallel sum using OpenMP:

// compile: gcc -O2 -fopenmp -o sum sum.c
// run: OMP_NUM_THREADS=4 ./sum
// description: parallel reduction; compare timing against a serial baseline
 
#include <omp.h>
#include <stdio.h>
 
int main(void) {
    long n = 1000000000L, sum = 0;
    double t = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += i;
    printf("sum=%ld  time=%.2fs\n", sum, omp_get_wtime() - t);
    return 0;
}

Build and run both versions:

$ gcc -O2 -o sum_serial sum.c
$ gcc -O2 -fopenmp -o sum_parallel sum.c
$ ./sum_serial
sum=499999999500000000  time=2.14s
$ OMP_NUM_THREADS=4 ./sum_parallel
sum=499999999500000000  time=0.57s

The parallel version runs roughly 4× faster on 4 cores. The answer is identical — the reduction(+:sum) clause handles synchronization. Try setting OMP_NUM_THREADS from 1 up to your core count and plot the speedup. It will flatten before reaching the theoretical maximum; that is Amdahl's law in action.

Tree

Parallel Computing
├── Models
│   ├── Shared Memory
│   │   ├── Threads
│   │   ├── Locks / Mutexes
│   │   ├── Atomics
│   │   ├── Barriers
│   │   ├── Thread Pools
│   │   └── NUMA
│   │
│   ├── Distributed Memory
│   │   ├── Message Passing
│   │   ├── Collectives
│   │   ├── Point-to-point
│   │   ├── Topologies
│   │   └── RDMA
│   │
│   ├── Accelerator Computing
│   │   ├── GPUs
│   │   ├── FPGAs
│   │   ├── TPUs
│   │   └── Quantum Accelerators
│   │
│   └── Hybrid
│       ├── MPI + OpenMP
│       ├── MPI + CUDA
│       └── MPI + OpenMP + CUDA
│
├── Granularity
│   ├── Bit-level
│   ├── Instruction-level (ILP)
│   ├── Data-level (SIMD)
│   ├── Thread-level (TLP)
│   ├── Task-level
│   └── Process-level
│
├── Execution Models
│   ├── SIMD
│   ├── MISD
│   ├── MIMD
│   ├── SPMD
│   ├── SIMT
│   └── BSP
│
├── Synchronization
│   ├── Mutex
│   ├── Semaphore
│   ├── Condition Variable
│   ├── Barrier
│   ├── Atomic Operations
│   ├── Fences
│   └── Lock-free
│
├── Communication
│   ├── Shared Variables
│   ├── Message Passing
│   ├── Channels
│   ├── Collectives
│   ├── Pipelines
│   └── Reduction Trees
│
├── Memory Hierarchy
│   ├── Registers
│   ├── L1/L2/L3 Cache
│   ├── RAM
│   ├── NUMA Domains
│   ├── GPU Global Memory
│   ├── Shared Memory (GPU)
│   ├── Constant Memory
│   └── Distributed Memory
│
├── Scheduling
│   ├── Static
│   ├── Dynamic
│   ├── Guided
│   ├── Work Stealing
│   ├── Gang Scheduling
│   └── Batch Scheduling
│
├── Decomposition
│   ├── Data Parallelism
│   ├── Task Parallelism
│   ├── Pipeline Parallelism
│   ├── Domain Decomposition
│   ├── Functional Decomposition
│   └── Recursive Decomposition
│
├── Performance Theory
│   ├── :contentReference[oaicite:0]{index=0}
│   ├── :contentReference[oaicite:1]{index=1}
│   ├── Strong Scaling
│   ├── Weak Scaling
│   ├── Speedup
│   ├── Efficiency
│   ├── Latency
│   ├── Bandwidth
│   ├── Throughput
│   └── Roofline Model
│
├── Hardware
│   ├── Multicore CPUs
│   ├── Manycore CPUs
│   ├── GPUs
│   ├── Clusters
│   ├── Supercomputers
│   ├── Interconnects
│   │   ├── Ethernet
│   │   ├── :contentReference[oaicite:2]{index=2}
│   │   └── NVLink
│   └── Storage
│       ├── Parallel FS
│       ├── Lustre
│       └── GPFS
│
├── Programming Models
│   ├── :contentReference[oaicite:3]{index=3}
│   ├── :contentReference[oaicite:4]{index=4}
│   ├── :contentReference[oaicite:5]{index=5}
│   ├── :contentReference[oaicite:6]{index=6}
│   ├── :contentReference[oaicite:7]{index=7}
│   ├── :contentReference[oaicite:8]{index=8}
│   ├── :contentReference[oaicite:9]{index=9}
│   └── CSP / Actors
│
└── Debugging & Profiling
    ├── Tracing
    ├── Sampling
    ├── Race Detection
    ├── Deadlock Detection
    ├── Memory Profiling
    ├── Cache Profiling
    ├── GPU Profiling
    └── MPI Profiling

Table of Contents