# MPI

**MPI** (Message Passing Interface) is a standard for distributed-memory parallel programming in C, C++, and Fortran. Unlike [[openmp|OpenMP]], where you add pragmas and the threads share your program's memory, with MPI you launch N independent copies of your program, each with its own private address space, and they coordinate by explicitly sending and receiving messages. Because the communication model makes no assumption that processes share any hardware, the same MPI program can run on a laptop or on a cluster with thousands of nodes without code changes.

The mental model takes some adjustment if you are coming from single-process C or even OpenMP. You are not writing one program that spawns workers. You are writing a program that will be instantiated N times simultaneously, and each copy plays a different role based on which number, the **rank**, it receives at launch. Rank 0 might distribute data, ranks 1 through N-1 might compute, and rank 0 might then collect the results. Same source file, same binary, different runtime behaviour.

The execution model is **SPMD** (single program, multiple data): all processes launch together via `mpirun` or `mpiexec` and run until they all call `MPI_Finalize`. Every MPI program must call `MPI_Init` before any other MPI function and `MPI_Finalize` at the end.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's rank, 0 .. size-1
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    printf("process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

Launch with `mpirun -n 4 ./program` to start 4 processes. Every process executes the full `main`, so `printf` is called by all four. Output order is non-deterministic.

Compile with `mpicc` (C) or `mpicxx` (C++), which wrap the system compiler with the right include paths and link flags, so you do not call `gcc` directly.
## Practice

Compile and run the hello-world above:

```bash
$ mpicc -o hello hello.c
$ mpirun -n 4 ./hello
process 2 of 4
process 0 of 4
process 3 of 4
process 1 of 4
```

Output order is non-deterministic. Run it a few times and the order changes. Try `mpirun -n 1` and `mpirun -n 8`. The binary does not change; the number of processes does. This is SPMD in action: one executable, different runtime identity per instantiation.

The hello-world does not communicate, so it does not show what MPI is actually for. Here is the simplest program that does: rank 0 sends an integer to rank 1, which receives and prints it.

```c
// compile: mpicc -o ping ping.c
// run: mpirun -n 2 ./ping
// description: rank 0 sends a value to rank 1
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, value = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        // send 1 int to rank 1 with tag 0
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("rank 0: sent %d\n", value);
    } else if (rank == 1) {
        // receive 1 int from rank 0 with tag 0
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1: received %d\n", value);
    }
    // any additional ranks skip the exchange instead of waiting forever

    MPI_Finalize();
    return 0;
}
```

The `if (rank == 0)` branch and the `else if (rank == 1)` branch are in the same source file but execute on different processes, potentially on different machines. Rank 1 blocks in `MPI_Recv` until the message has arrived. Rank 0 blocks in `MPI_Send` until the send buffer is safe to reuse; depending on the implementation and the message size, that may be as soon as the message has been buffered internally, or only once rank 1 has reached the matching `MPI_Recv`. That blocking behaviour is one of the most important things to internalise early: MPI communication is not fire-and-forget, and a correct program must not rely on `MPI_Send` returning before the receive is posted. More on this in [[blocking-mpi|blocking and non-blocking communication]].

From here, [[point-to-point-mpi|point-to-point communication]] is the natural next concept, then [[collectives-mpi|collectives]] once that feels comfortable.

## Concepts

1. [[communicators-mpi|Communicators]]
2. [[point-to-point-mpi|Point-to-point communication]]
3. [[blocking-mpi|Blocking and non-blocking]]
4. [[send-modes-mpi|Send modes]]
5. [[message-ordering-mpi|Message ordering]]
6. [[probing-mpi|Probing for messages]]
7. [[deadlock-mpi|Deadlock]]
8. [[performance-model-mpi|Performance model]]
9. [[collectives-mpi|Collectives]]
10. [[reduction-mpi|Reduction]]
11. [[scatter-and-gather-mpi|Scatter and gather]]
12. [[prefix-reductions-mpi|Prefix reductions]]
13. [[nonblocking-collectives-mpi|Non-blocking collectives]]
14. [[persistent-communication-mpi|Persistent communication]]
15. [[derived-datatypes-mpi|Derived datatypes]]
16. [[virtual-topologies-mpi|Virtual topologies]]
17. [[process-groups-mpi|Process groups]]
18. [[communicator-duplication-mpi|Communicator duplication]]
19. [[one-sided-mpi|One-sided communication]]
20. [[shared-memory-windows-mpi|Shared memory windows]]
21. [[parallel-io-mpi|Parallel I/O]]
22. [[time-measurement-mpi|Time measurement]]
23. [[hybrid-openmp-mpi|Hybrid MPI+OpenMP]]

## Overview

### Functions

```c
// Initialisation
MPI_Init(&argc, &argv)                      // initialise MPI; must be first
MPI_Finalize()                              // shut down MPI; must be last
MPI_Abort(comm, errorcode)                  // terminate all processes in comm

// Communicator queries
MPI_Comm_rank(comm, &rank)                  // rank of calling process in comm
MPI_Comm_size(comm, &size)                  // number of processes in comm
MPI_Comm_split(comm, color, key, &newcomm)  // partition into sub-communicators
MPI_Comm_free(&comm)                        // release a communicator

// Point-to-point
MPI_Send(buf, count, type, dest, tag, comm)
MPI_Recv(buf, count, type, src, tag, comm, &status)
MPI_Sendrecv(sbuf, sc, st, dest, stag, rbuf, rc, rt, src, rtag, comm, &status)
MPI_Isend(buf, count, type, dest, tag, comm, &req)
MPI_Irecv(buf, count, type, src, tag, comm, &req)
MPI_Wait(&req, &status)
MPI_Test(&req, &flag, &status)
MPI_Waitall(count, reqs, statuses)

// Collectives
MPI_Barrier(comm)
MPI_Bcast(buf, count, type, root, comm)
MPI_Scatter(sbuf, sc, st, rbuf, rc, rt, root, comm)
MPI_Gather(sbuf, sc, st, rbuf, rc, rt, root, comm)
MPI_Scatterv(sbuf, scounts, displs, st, rbuf, rc, rt, root, comm)
MPI_Gatherv(sbuf, sc, st, rbuf, rcounts, displs, rt, root, comm)
MPI_Allgather(sbuf, sc, st, rbuf, rc, rt, comm)
MPI_Alltoall(sbuf, sc, st, rbuf, rc, rt, comm)
MPI_Reduce(sbuf, rbuf, count, type, op, root, comm)
MPI_Allreduce(sbuf, rbuf, count, type, op, comm)
MPI_Scan(sbuf, rbuf, count, type, op, comm)  // inclusive prefix reduction

// Derived datatypes
MPI_Type_contiguous(count, oldtype, &newtype)
MPI_Type_vector(count, blocklength, stride, oldtype, &newtype)
MPI_Type_create_struct(count, blocklengths, displs, types, &newtype)
MPI_Type_commit(&type)
MPI_Type_free(&type)

// One-sided
MPI_Win_create(base, size, disp_unit, info, comm, &win)
MPI_Put(obuf, oc, ot, target_rank, disp, tc, tt, win)
MPI_Get(obuf, oc, ot, target_rank, disp, tc, tt, win)
MPI_Accumulate(obuf, oc, ot, target_rank, disp, tc, tt, op, win)
MPI_Win_fence(assert, win)
MPI_Win_free(&win)

// Timing
MPI_Wtime()  // wall-clock time in seconds
MPI_Wtick()  // resolution of MPI_Wtime
```

### Built-in datatypes

^ C type ^ MPI type ^
| `int` | `MPI_INT` |
| `long` | `MPI_LONG` |
| `float` | `MPI_FLOAT` |
| `double` | `MPI_DOUBLE` |
| `char` | `MPI_CHAR` |
| `unsigned char` | `MPI_UNSIGNED_CHAR` |
| `long long` | `MPI_LONG_LONG` |

### Built-in reduction operators

^ Operator ^ Meaning ^
| `MPI_SUM` | sum |
| `MPI_PROD` | product |
| `MPI_MAX` | maximum |
| `MPI_MIN` | minimum |
| `MPI_LAND` | logical and |
| `MPI_LOR` | logical or |
| `MPI_BAND` | bitwise and |
| `MPI_BOR` | bitwise or |
| `MPI_MAXLOC` | maximum value and its index (typically the rank that holds it) |
| `MPI_MINLOC` | minimum value and its index (typically the rank that holds it) |