Hybrid MPI+OpenMP (MPI)

A naive parallel program launches one MPI rank per core. That works, but MPI processes share no address space, so every rank keeps its own copy of any replicated data, and the number of messages grows with the total core count. On a multi-core node it is usually better to run one MPI rank per node (or per socket) and let OpenMP threads fill the cores within it. Inter-node communication goes through MPI; intra-node parallelism uses shared memory through OpenMP. This hybrid MPI+OpenMP model is widely used on modern HPC clusters.

When threads are involved, MPI must be initialised with MPI_Init_thread instead of MPI_Init to declare the required level of thread safety:

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
if (provided < MPI_THREAD_FUNNELED) {
    fprintf(stderr, "insufficient MPI thread support\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
}

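The four thread-safety levels, in increasing order of support, are:

MPI_THREAD_SINGLE: the process has only one thread.
MPI_THREAD_FUNNELED: the process may be multi-threaded, but only the thread that called MPI_Init_thread makes MPI calls.
MPI_THREAD_SERIALIZED: any thread may make MPI calls, but only one thread at a time.
MPI_THREAD_MULTIPLE: any thread may make MPI calls at any time, with no restrictions.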

With MPI_THREAD_FUNNELED, all MPI calls must happen on the master thread, either outside parallel regions or inside one guarded with #pragma omp master. The typical structure is to post non-blocking communication on the master thread, enter a parallel region to compute the interior while halos travel, then wait for communication before computing the boundary.

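// reqs[] holds persistent requests created beforehand with
// MPI_Send_init / MPI_Recv_init, which is what MPI_Startall requires.
while (!converged) {
    MPI_Startall(nreqs, reqs);           // start halo exchange on the master thread
    #pragma omp parallel for
    for (int i = interior_lo; i < interior_hi; i++)
        update(i);                        // interior computation overlaps with comms
    MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);  // halos have now arrived
    #pragma omp parallel for
    for (int i = 0; i < halo_size; i++)
        update_halo(i);                   // boundary cells that depend on halo data
    // update `converged` here, e.g. from an MPI_Allreduce of the local residual
}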
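
The other option mentioned above, communicating from inside a parallel region under #pragma omp master, can be sketched without MPI so that it compiles anywhere. What follows is a minimal illustration, not a definitive implementation. It assumes a 1D array with one ghost cell at each end and a two-point averaging stencil. The names hybrid_step, stencil, halo_exchange_fn and fill_ghosts_periodic are hypothetical, and fill_ghosts_periodic is a single-process stand-in for the real MPI halo exchange.

```c
#include <stddef.h>

/* Hypothetical callback standing in for the MPI halo exchange: it must fill
   the ghost cells u[0] and u[n+1] and write nothing else. */
typedef void (*halo_exchange_fn)(double *u, size_t n);

/* Single-process stand-in (assumption): periodic wrap-around instead of MPI.
   In a real code the master thread would make the MPI calls here. */
void fill_ghosts_periodic(double *u, size_t n) {
    u[0] = u[n];
    u[n + 1] = u[1];
}

/* Two-point averaging stencil for cell i (illustrative choice). */
static double stencil(const double *u, long i) {
    return 0.5 * (u[i - 1] + u[i + 1]);
}

/* One time step on owned cells u[1..n]; u[0] and u[n+1] are ghost cells. */
void hybrid_step(double *u, double *unew, size_t n, halo_exchange_fn exchange) {
    const long last = (long)n;
    #pragma omp parallel
    {
        /* Only the master thread communicates, as MPI_THREAD_FUNNELED requires.
           'master' has no implied barrier, so the other threads carry on. */
        #pragma omp master
        exchange(u, n);

        /* Interior cells 2..n-1 never read a ghost cell. Dynamic scheduling lets
           the master thread join in once it has finished communicating. The
           implicit barrier ending this loop also ensures the exchange is done. */
        #pragma omp for schedule(dynamic, 64)
        for (long i = 2; i < last; i++)
            unew[i] = stencil(u, i);

        /* Boundary cells 1 and n read the freshly filled ghost cells. */
        #pragma omp single
        {
            if (last >= 1) unew[1] = stencil(u, 1);
            if (last >= 2) unew[last] = stencil(u, last);
        }
    }
}
```

Calling hybrid_step(u, unew, n, fill_ghosts_periodic) performs one step. Without OpenMP enabled the pragmas are ignored and the code runs serially with the same result. The dynamic schedule is a design choice: with a static schedule the master thread's share of the interior would sit idle until communication finished, which defeats the overlap.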