Table of Contents

Performance model (MPI)

Every write() syscall has a fixed overhead regardless of how many bytes you write: there is a kernel crossing, buffer management, and bookkeeping before the first byte moves. For a local file the overhead is small relative to I/O cost, but over a network the per-message overhead can be much larger than the cost of moving the data itself. The performance model for MPI communication captures this as $\alpha + \beta n$, where $\alpha$ is the per-message latency (a fixed overhead paid regardless of size) and $\beta$ is the per-byte transfer time. Latency dominates for small messages; bandwidth dominates for large ones. The practical implication is that fewer, larger messages are almost always better: one message of 100 elements pays $\alpha$ once, while 100 separate sends pay it 100 times.

// bad: N messages, each paying alpha
for (int i = 0; i < N; i++)
    MPI_Send(&vals[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
 
// good: one message, alpha paid once
MPI_Send(vals, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);

This model also explains two MPI-specific behaviors. Small messages (typically below a few kilobytes, though the threshold is implementation-defined) use the eager protocol: the sender copies the data into a pre-allocated buffer and returns immediately; the receiver picks it up later. Large messages use the rendezvous protocol: sender and receiver handshake first, then data moves directly between their buffers with no intermediate copy. The handshake is why MPI_Send can block on large messages even when no explicit synchronisation is requested. It is also why buffering-based deadlocks often appear only at scale: small test data fits in the eager buffer and the code runs fine; production-size data triggers rendezvous and the code hangs.