# Writing guide on parallel computing

**Writing guide on parallel computing** (this article) covers conventions specific to parallel computing articles in this wiki. Read [[general-writing-guide|the general writing guide]] first; this document supplements it.

## The central challenge

Parallel computing articles fail in one consistent way: they explain what a primitive does before establishing why a C programmer would care. The reader lands on `reduction-mpi.txt` and is told that "`MPI_Reduce` applies a commutative associative operator across the send buffers of all processes" before ever connecting that to the problem they came to solve. The result is dense, correct text that is hard to follow.

The fix is to lead with something the reader already knows, then show how the parallel concept extends or replaces it. This does not mean adding a generic intro paragraph; it means opening with a specific, concrete hook.

## Types of hooks

### Sequential code first

Use this when the concept has a direct serial equivalent. Show the sequential version first, then introduce the parallel version as a natural extension or replacement. The sequential snippet should be short: a loop, an array operation, an index calculation.

Examples:

- Reduction: show `for (int i = 0; i < N; i++) sum += arr[i]` before `MPI_Allreduce` or `reduction(+:sum)`
- Scatter/gather: show a loop of `MPI_Send` calls before `MPI_Scatter`
- Prefix reductions: show a sequential scan loop before `MPI_Exscan`
- Parallel loops: show a serial `for` loop before `#pragma omp parallel for`
- Derived datatypes: show the manual pack-into-buffer loop before `MPI_Type_vector`
- Virtual topologies: show the manual `(row-1+P)%P * Q + col` neighbour arithmetic before `MPI_Cart_shift`

### Broken naive code first

Use this when the concept exists to fix a correctness problem that the naive parallel approach has.
Show the broken version (the race condition, the deadlock, the out-of-order read), then explain what goes wrong, then introduce the fix.

Examples:

- OpenMP reduction: show `#pragma omp parallel for` on `sum += a[i]` as a broken race before `reduction(+:sum)`
- OpenMP atomic: show `count++` as a load/increment/store race before `#pragma omp atomic`
- MPI deadlock: show two processes each calling `MPI_Send` before explaining cyclic send dependencies

### Syscall or POSIX API analogy

Use this when the concept maps cleanly to something from single-machine systems programming that the reader likely already knows. Name the analogy explicitly; don't expect the reader to notice the parallel themselves.

| Parallel concept | Analogy |
|---|---|
| MPI point-to-point | BSD sockets (`send`/`recv`), with rank replacing file descriptor |
| MPI blocking/non-blocking | `read()` vs `aio_read()` + `aio_suspend()` |
| MPI probing | `MSG_PEEK` / `ioctl(FIONREAD)` |
| MPI parallel I/O | `pwrite()` with explicit offset |
| MPI persistent communication | HTTP keep-alive vs per-request reconnect |
| OpenMP critical section | `pthread_mutex_lock` / `pthread_mutex_unlock` |
| OpenMP flush | `volatile`: prevent register caching, force memory visibility |

### Hardware or memory model analogy

Use this for concepts that are grounded in hardware behaviour the reader may not have encountered in single-threaded work.

Examples:

- SIMD: "a CPU can add four floats in a single instruction rather than one at a time" before defining SSE/AVX
- False sharing: explain cache line width (typically 64 bytes) and coherence invalidation before showing the performance collapse
- MPI one-sided: `mmap(MAP_SHARED)` within a machine, then extend the idea across nodes
- Shared memory windows: the messaging layer adds copy overhead even between processes on the same node that share physical memory
- Thread affinity / NUMA: laptop (uniform memory) vs.
  multi-socket server (each socket has its own RAM bank) before introducing `OMP_PLACES`

### Scope or namespace analogy

Use this for concepts about isolation, scoping, and context.

Examples:

- MPI communicators: TCP port numbers scope traffic between programs; communicators scope traffic between MPI contexts
- MPI communicator duplication: a library inheriting your file descriptors can corrupt `stdin`; passing `MPI_COMM_WORLD` to a library lets it intercept your messages. `MPI_Comm_dup` is the `dup()` equivalent

## What level of familiarity to assume

**For basic concepts** (point-to-point, parallel loops, reduction, data sharing, communicators): assume only C knowledge. The reader knows pointers, structs, loops, and POSIX. They do not know what a rank is, what a communicator is, or what fork-join means. Define these at point of first use.

**For intermediate concepts** (collectives, non-blocking, scheduling, tasks, send modes): assume the reader has read the parent article (`mpi.txt` or `openmp.txt`) and the basic concepts that precede this one in the numbered list. Light cross-links are preferred over re-explaining.

**For advanced concepts** (one-sided communication, virtual topologies, communicator duplication, process groups, hybrid MPI+OpenMP, NUMA affinity): assume MPI or OpenMP familiarity. The hook can be shorter. Cross-link to prerequisite articles rather than restating them.

## Code snippets in parallel computing articles

Two short snippets in the same section are fine when contrasting a before and after (sequential vs. parallel, broken vs. correct). The sequential or broken snippet does not need a compile/run/description header. The parallel or corrected snippet is still a concept illustration, not a full MVE, so it also does not need a header unless the article is specifically a how-to.

Keep the sequential "before" snippet to a few lines. Its only job is to give the reader a footing before the parallel version.
If it grows beyond a loop body, it is probably carrying too much weight and should move to its own section or article.

## Correctness vs. performance articles

Parallel computing articles tend to fall into two categories. Treat them differently.

**Correctness concepts** (data sharing, race conditions, deadlock, atomic, critical sections, flush, message ordering) should lead with the failure mode (what goes wrong without the construct) before explaining the fix. The broken code or the pathological scenario is the hook.

**Performance concepts** (scheduling, false sharing, SIMD, thread affinity, performance model, persistent communication, non-blocking communication) should lead with the observable symptom (slower than expected, threads idle, bandwidth wasted) before explaining the underlying cause and remedy.

## Structure of overview articles

`openmp.txt` and `mpi.txt` are overview articles, not concept articles. They follow the same four-part pattern as `docker.txt` and `perf.txt`:

1. **Intro**: conversational and second-person. Explain what the tool is, what problem it solves, and the core execution model (fork-join, SPMD). The hello-world code block goes here. Aim for the tone of `docker.txt`: "if you have X problem, this is the shortest path." Analogies comparing to something the reader already knows are welcome here too.
2. **Practice**: "try this yourself" section. Start with compiling and running the hello-world from the intro, explain the non-deterministic output, then show the simplest example that demonstrates actual parallel communication or coordination (not just printing ranks). Write in the second person throughout. The practice section should end with a pointer to where to go next in the Concepts list.
3. **Concepts**: numbered list of links to dedicated concept articles. No prose here; the links speak for themselves.
4. **Overview**: quick-reference function lists, tables of constants and operators, environment variable tables.
   Pure reference material. Do not add prose or tutorials here.

The Intro and Practice sections should feel personal and approachable: acknowledge when something is non-obvious ("the mental model takes some adjustment"), acknowledge when something is surprisingly easy ("it genuinely feels like cheating"), and explain the rough edges (non-deterministic output, blocking semantics, OpenMP pragmas being silently ignored when `-fopenmp` is absent).

## Cross-links between parallel computing articles

Internal links between related parallel concepts are encouraged. The most useful cross-links are:

- From an advanced concept to the prerequisite basic concept it builds on
- From a correctness fix to the failure mode it addresses (e.g. `atomic` links to `critical-sections`, `nowait` links to `barrier`)
- Between OpenMP and MPI articles that cover equivalent concepts (e.g. `reduction-openmp.txt` and `reduction-mpi.txt`)

Use `[[page-id|display text]]` with a descriptive display name rather than the raw page ID.
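
## Example: a sequential-first hook

To make the "sequential code first" hook concrete, here is a minimal sketch of the kind of "before" snippet a virtual topologies article might open with. It spells out the manual `(row-1+P)%P * Q + col` neighbour arithmetic as plain C, with no MPI calls. The function name `grid_neighbour`, the row-major rank layout, and the periodic (wrap-around) boundaries are assumptions made for illustration; an actual article would pick whatever layout its later `MPI_Cart_shift` example uses.

```c
/*
 * Hypothetical "before" snippet: neighbour lookup on a P x Q process grid.
 * Assumptions (not taken from any existing article): ranks are laid out in
 * row-major order, both dimensions wrap around, and drow/dcol are -1, 0 or +1.
 */
int grid_neighbour(int rank, int drow, int dcol, int P, int Q)
{
    int row = rank / Q;                  /* row of this rank in the grid    */
    int col = rank % Q;                  /* column of this rank in the grid */
    int nrow = (row + drow + P) % P;     /* wrap at the top/bottom edge     */
    int ncol = (col + dcol + Q) % Q;     /* wrap at the left/right edge     */
    return nrow * Q + ncol;              /* back to a row-major rank        */
}
```

On a 3 x 4 grid, `grid_neighbour(0, -1, 0, 3, 4)` wraps upwards from the top row and returns rank 8. A snippet of roughly this size gives the reader a footing; the parallel version that follows it in the article would then replace the hand-written index arithmetic with a Cartesian communicator.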