Differences

--- parallel-computing-overview [June 15, 2026 at 09:32] – Ivan Janevski
+++ parallel-computing-overview [June 15, 2026 at 09:34] (current) – Ivan Janevski
@@ Line 79: / Line 79: @@
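 The parallel version runs roughly 4× faster on 4 cores. The answer is identical because the `reduction(+:sum)` clause gives each thread a private copy of `sum` and combines the copies when the loop ends, so no explicit locking is needed. Try setting `OMP_NUM_THREADS` from 1 up to your core count and plot the speedup. The curve flattens below the ideal linear speedup because the serial part of the program does not shrink as threads are added; that is Amdahl's law in action.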
+## Topic Tree
+```
+Parallel Computing
+├── Models
+│   ├── Shared Memory
+│   │   ├── Threads
+│   │   ├── Locks / Mutexes
+│   │   ├── Atomics
+│   │   ├── Barriers
+│   │   ├── Thread Pools
+│   │   └── NUMA
+│   │
+│   ├── Distributed Memory
+│   │   ├── Message Passing
+│   │   ├── Collectives
+│   │   ├── Point-to-point
+│   │   ├── Topologies
+│   │   └── RDMA
+│   │
+│   ├── Accelerator Computing
+│   │   ├── GPUs
+│   │   ├── FPGAs
+│   │   ├── TPUs
+│   │   └── Quantum Accelerators
+│   │
+│   └── Hybrid
+│       ├── MPI + OpenMP
+│       ├── MPI + CUDA
+│       └── MPI + OpenMP + CUDA
+│
+├── Granularity
+│   ├── Bit-level
+│   ├── Instruction-level (ILP)
+│   ├── Data-level (SIMD)
+│   ├── Thread-level (TLP)
+│   ├── Task-level
+│   └── Process-level
+│
+├── Execution Models
+│   ├── SIMD
+│   ├── MISD
+│   ├── MIMD
+│   ├── SPMD
+│   ├── SIMT
+│   └── BSP
+│
+├── Synchronization
+│   ├── Mutex
+│   ├── Semaphore
+│   ├── Condition Variable
+│   ├── Barrier
+│   ├── Atomic Operations
+│   ├── Fences
+│   └── Lock-free
+│
+├── Communication
+│   ├── Shared Variables
+│   ├── Message Passing
+│   ├── Channels
+│   ├── Collectives
+│   ├── Pipelines
+│   └── Reduction Trees
+│
+├── Memory Hierarchy
+│   ├── Registers
+│   ├── L1/L2/L3 Cache
+│   ├── RAM
+│   ├── NUMA Domains
+│   ├── GPU Global Memory
+│   ├── Shared Memory (GPU)
+│   ├── Constant Memory
+│   └── Distributed Memory
+│
+├── Scheduling
+│   ├── Static
+│   ├── Dynamic
+│   ├── Guided
+│   ├── Work Stealing
+│   ├── Gang Scheduling
+│   └── Batch Scheduling
+│
+├── Decomposition
+│   ├── Data Parallelism
+│   ├── Task Parallelism
+│   ├── Pipeline Parallelism
+│   ├── Domain Decomposition
+│   ├── Functional Decomposition
+│   └── Recursive Decomposition
+│
+├── Performance Theory
+│   ├── Amdahl's Law
+│   ├── Strong Scaling
+│   ├── Weak Scaling
+│   ├── Speedup
+│   ├── Efficiency
+│   ├── Latency
+│   ├── Bandwidth
+│   ├── Throughput
+│   └── Roofline Model
+│
+├── Hardware
+│   ├── Multicore CPUs
+│   ├── Manycore CPUs
+│   ├── GPUs
+│   ├── Clusters
+│   ├── Supercomputers
+│   ├── Interconnects
+│   │   ├── Ethernet
+│   │   └── NVLink
+│   └── Storage
+│       ├── Parallel FS
+│       ├── Lustre
+│       └── GPFS
+│
+├── Programming Models
+│   ├── OpenMP
+│   ├── MPI
+│   ├── CUDA
+│   └── CSP / Actors
+│
+└── Debugging & Profiling
+    ├── Tracing
+    ├── Sampling
+    ├── Race Detection
+    ├── Deadlock Detection
+    ├── Memory Profiling
+    ├── Cache Profiling
+    ├── GPU Profiling
+    └── MPI Profiling
+```