A naive parallel program launches one MPI rank per core. That works, but every rank then holds its own copy of any replicated data (MPI processes do not share an address space), and the number of ranks exchanging messages, and with it the message count, grows with the total core count. An often better model on a multi-core node is one MPI rank per node (or per socket), with OpenMP threads filling the cores within it. Inter-node communication goes through MPI; intra-node parallelism uses shared memory through OpenMP. This hybrid MPI+OpenMP model is widely used on modern HPC clusters.
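To make the trade-off concrete, the arithmetic can be sketched as a tiny cost model. The struct, the function name, and the idea of a fixed-size table replicated in every rank are illustrative assumptions, not part of any MPI API.

```c
/* Hypothetical cost model for one job layout (illustration only). */
typedef struct {
    long ranks;                            /* total MPI ranks in the job      */
    long threads_per_rank;                 /* OpenMP threads inside each rank */
    unsigned long long replicated_per_node; /* bytes per node spent on copies  */
} layout_t;

/* replicated_bytes is the size of the data every rank must hold privately.
 * Assumes cores_per_node is a multiple of ranks_per_node. */
layout_t make_layout(long nodes, long cores_per_node,
                     long ranks_per_node, unsigned long long replicated_bytes)
{
    layout_t l;
    l.ranks = nodes * ranks_per_node;
    l.threads_per_rank = cores_per_node / ranks_per_node;
    l.replicated_per_node = (unsigned long long)ranks_per_node * replicated_bytes;
    return l;
}
```

For an assumed machine of 100 nodes with 64 cores each and a 1 GiB replicated table, make_layout(100, 64, 64, 1ULL << 30) gives 6400 ranks and 64 GiB of copies per node, whereas make_layout(100, 64, 1, 1ULL << 30) gives 100 ranks of 64 threads and 1 GiB per node.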
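When threads are involved, MPI must be initialised with MPI_Init_thread instead of MPI_Init. The call requests a level of thread support and returns, in its last argument, the level the library actually provides. That level may be lower than the one requested, so it should be checked: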
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "insufficient MPI thread support\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
The four thread-safety levels are:
- MPI_THREAD_SINGLE: only one thread will execute; equivalent to MPI_Init
- MPI_THREAD_FUNNELED: multiple threads exist but only the main thread makes MPI calls; the most common level for MPI+OpenMP
- MPI_THREAD_SERIALIZED: multiple threads make MPI calls but not concurrently; the application serialises them
- MPI_THREAD_MULTIPLE: multiple threads call MPI concurrently; requires a thread-safe MPI build and has higher overhead
With MPI_THREAD_FUNNELED, all MPI calls must come from the main thread, that is, the thread that called MPI_Init_thread. In an OpenMP program this is normally the master thread, so MPI calls go either outside parallel regions or inside one guarded with #pragma omp master. The typical structure is to post non-blocking communication on the master thread, enter a parallel region to compute the interior while the halos are in transit, then wait for the communication to complete before computing the boundary.
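    // reqs[] is assumed to hold persistent requests
    // (e.g. created earlier with MPI_Send_init / MPI_Recv_init)
    while (!converged) {
        MPI_Startall(nreqs, reqs);   // post halo exchange on master thread

        #pragma omp parallel for
        for (int i = interior_lo; i < interior_hi; i++)
            update(i);               // interior computation overlaps with comms

        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);   // halos have arrived

        #pragma omp parallel for
        for (int i = 0; i < halo_size; i++)
            update_halo(i);          // boundary cells that depend on halo data

        // the convergence test that updates `converged` is omitted here
    }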