This is an old revision of the document!

OpenMP

OpenMP is a shared-memory parallelism API for C, C++, and Fortran.

If you have a loop that takes too long and you want it to use all the cores on your machine instead of just one, OpenMP is usually the shortest path there. It works via compiler directives (#pragma omp in C/C++), a small runtime library (libgomp with GCC, libomp with Clang), and a set of environment variables. In contrast to distributed-memory models like MPI, where each process has its own memory, OpenMP threads all share the same address space, so you can often add parallelism without changing your data layout.

The execution model is fork-join: the program starts as a single thread. When it hits a #pragma omp parallel block, it forks into a team of threads that all execute the block concurrently, then join back into one thread at the closing brace, where an implicit barrier waits for the whole team. You do not write any thread creation, management, or teardown code; the compiler inserts all of that. A compiler that does not support OpenMP, or one invoked without -fopenmp, ignores every #pragma omp directive and the program runs serially, which is useful for debugging: drop -fopenmp from the compile command and you get a serial build of the same source. The one caveat is calls to the omp_* runtime functions, which fail to link without the runtime library unless they are guarded with #ifdef _OPENMP.

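#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();        // this thread's ID within the team
        int nthreads = omp_get_num_threads();  // size of the team
        printf("thread %d of %d\n", tid, nthreads);
    }
    return 0;
}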

In most implementations the thread count defaults to the number of logical cores. You can override it at runtime with OMP_NUM_THREADS=N ./prog without recompiling, with omp_set_num_threads(n) inside the program, or with a num_threads(N) clause directly on the pragma; each later mechanism in that list takes precedence over the earlier ones. Output order from a parallel region is non-deterministic, because the OS schedules the threads independently of their IDs. Compile with -fopenmp and include <omp.h> for the omp_* runtime functions.
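The precedence between the clause and the runtime call can be checked directly. The following is a minimal sketch, not a definitive test: the helper names team_size_with_clause and team_size_after_set are made up for illustration, and the #else branch supplies serial stand-ins so the file also builds without -fopenmp.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (illustrative) so the file links without -fopenmp. */
static int  omp_get_num_threads(void)  { return 1; }
static void omp_set_num_threads(int n) { (void)n; }
#endif

/* Team size granted when the pragma itself requests n threads. */
int team_size_with_clause(int n) {
    int size = 0;
    #pragma omp parallel num_threads(n)
    {
        #pragma omp single
        size = omp_get_num_threads();   /* one thread records the team size */
    }
    return size;
}

/* Team size granted after omp_set_num_threads(n), with no clause on the pragma. */
int team_size_after_set(int n) {
    int size = 0;
    omp_set_num_threads(n);             /* overrides OMP_NUM_THREADS from here on */
    #pragma omp parallel
    {
        #pragma omp single
        size = omp_get_num_threads();
    }
    return size;
}
```

Calling team_size_after_set(2) and then team_size_with_clause(3) would normally report 2 and then 3, whatever OMP_NUM_THREADS says. The runtime is allowed to grant fewer threads than requested, so treat the request as an upper bound rather than a guarantee.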

Adding more threads does not always mean proportionally faster code. Amdahl's law says that if a fraction $s$ of the program is inherently serial, the speedup on $N$ threads is at most $1/(s + (1-s)/N)$, which approaches $1/s$ no matter how many threads you add. If the parallel loop accounts for 80% of runtime, then $s = 0.2$ and the speedup can never exceed 5×. This is why shrinking the serial fraction matters more than adding cores, and why removing overhead such as false sharing, load imbalance, and unnecessary barriers is what gets you closer to that bound.
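To get a feel for how quickly the curve flattens, Amdahl's bound can be tabulated. This is a small sketch of the standard formula $1/(s + (1-s)/N)$; the function names are illustrative and nothing here measures a real program.

```c
#include <stdio.h>

/* Predicted speedup for serial fraction s (0 < s <= 1) on n threads. */
double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / (double)n);
}

/* Upper bound as the thread count grows without limit. */
double amdahl_limit(double s) {
    return 1.0 / s;
}

/* Print predicted speedups for 1, 2, 4, ... threads up to max_threads. */
void print_amdahl_table(double s, int max_threads) {
    for (int n = 1; n <= max_threads; n *= 2)
        printf("threads=%3d  speedup=%.2f\n", n, amdahl_speedup(s, n));
    printf("limit        speedup=%.2f\n", amdahl_limit(s));
}
```

With s = 0.2, four threads give a predicted speedup of 2.5 and sixteen threads give 4.0, already most of the way to the limit of 5.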

Practice

Compile the hello-world above and run it:

$ gcc -fopenmp -o hello hello.c
$ OMP_NUM_THREADS=4 ./hello
thread 2 of 4
thread 0 of 4
thread 3 of 4
thread 1 of 4

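The output order is non-deterministic: run it a few times and you will get different permutations. Now try dropping -fopenmp. Because hello.c calls omp_get_thread_num() and omp_get_num_threads(), a plain GCC build only links if those calls (along with the #include <omp.h>) are guarded with #ifdef _OPENMP, a macro the compiler defines only when OpenMP is enabled, and replaced by the serial values 0 and 1 otherwise. With that guard in place:

$ gcc -o hello hello.c
$ ./hello
thread 0 of 1

Without the flag, every #pragma omp directive is ignored and the program runs as a single thread. This is one of the most useful OpenMP debugging techniques: if your parallel output looks wrong, remove -fopenmp and check whether the serial output is correct first.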
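One way to write that guard is sketched below. The stand-in functions and the say_hello wrapper are illustrative, not part of OpenMP; the only OpenMP-defined piece is the _OPENMP macro.

```c
#include <stdio.h>
#ifdef _OPENMP            /* defined by the compiler only when OpenMP is enabled */
#include <omp.h>
#else
/* Serial stand-ins: a lone thread has ID 0 in a team of 1. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Print one greeting per thread; returns the number of greetings printed. */
int say_hello(void) {
    int printed = 0;
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("thread %d of %d\n", tid, nthreads);
        #pragma omp atomic
        printed++;                    /* shared counter, updated atomically */
    }
    return printed;
}
```

Calling say_hello() from main gives the same hello-world as before. Built without -fopenmp, the pragmas are skipped, the stand-ins are used, and the single line printed is thread 0 of 1.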

Here is a more realistic example: parallelising a sum of a billion integers with a single pragma.

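// compile: gcc -O2 -fopenmp -o sum sum.c
// run: OMP_NUM_THREADS=4 ./sum
// description: parallel reduction; omit -fopenmp for a serial baseline

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
// serial fallback so the file still links without -fopenmp (CPU time, not wall time)
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

int main(void) {
    long n = 1000000000L, sum = 0;   // assumes a 64-bit long (LP64)
    double t = omp_get_wtime();
    // each thread accumulates a private copy of sum; the copies are added at the end
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += i;
    printf("sum=%ld  time=%.2fs\n", sum, omp_get_wtime() - t);
    return 0;
}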

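On a 4-core machine this can run roughly 4× faster with -fopenmp than without, although the exact figure depends on the compiler and hardware. The only addition to the loop is the pragma line with its reduction(+:sum) clause. Without the clause, every thread would update the shared sum concurrently, which is a data race and gives a wrong answer; the clause gives each thread a private partial sum and combines them at the end. When it works this cleanly, it genuinely feels like cheating.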
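Since the sum 0 + 1 + ... + (n-1) has the closed form n(n-1)/2, the reduction can be checked against it. A minimal sketch, with illustrative function names; long long is used so the result fits regardless of how wide long is on the platform.

```c
/* Sum 0 + 1 + ... + (n-1) with a parallel reduction. */
long long reduction_sum(long long n) {
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Closed form of the same sum, used as the reference answer. */
long long closed_form_sum(long long n) {
    return n * (n - 1) / 2;
}

/* Returns 1 if the parallel result matches the closed form, 0 otherwise. */
int reduction_is_correct(long long n) {
    return reduction_sum(n) == closed_form_sum(n);
}
```

Integer addition is associative, so the result should match exactly for any thread count. Deleting the reduction clause from reduction_sum and rerunning the check with several threads is a quick way to see the data race produce wrong totals.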

Try varying OMP_NUM_THREADS from 1 up to your core count and beyond. Speedup will plateau or even decline past the hardware thread count: at that point threads compete for the same cores and thread-management overhead grows, on top of the ceiling that Amdahl's law already imposes.
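The same sweep can be done from inside the program with a num_threads clause instead of the environment variable. The sketch below is one possible harness under stated assumptions: the workload (a partial harmonic sum) and the names harmonic and scaling_sweep are chosen for illustration, and the serial stand-ins report CPU time rather than wall time.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
/* Serial stand-ins (illustrative) so the file builds without -fopenmp. */
static double omp_get_wtime(void)       { return (double)clock() / CLOCKS_PER_SEC; }
static int    omp_get_max_threads(void) { return 1; }
#endif

/* Partial harmonic sum 1/1 + 1/2 + ... + 1/n, computed on `threads` threads. */
double harmonic(long n, int threads) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) num_threads(threads)
    for (long i = 1; i <= n; i++)
        sum += 1.0 / (double)i;
    return sum;
}

/* Time harmonic(n) for 1, 2, 4, ... threads, up to twice the runtime's default,
   so the last rows deliberately oversubscribe the machine.
   Returns the number of thread counts tried. */
int scaling_sweep(long n) {
    int limit = 2 * omp_get_max_threads();
    int rows = 0;
    double base = 0.0;
    for (int t = 1; t <= limit; t *= 2) {
        double start = omp_get_wtime();
        double result = harmonic(n, t);
        double elapsed = omp_get_wtime() - start;
        if (t == 1)
            base = elapsed;           /* single-thread time is the baseline */
        printf("threads=%2d  time=%.3fs  speedup=%.2f  (sum=%.6f)\n",
               t, elapsed, elapsed > 0.0 ? base / elapsed : 0.0, result);
        rows++;
    }
    return rows;
}
```

A call such as scaling_sweep(500000000L) takes long enough per row for the timings to mean something; very small n mostly measures thread start-up cost. The floating-point sum may differ in the last digits between thread counts because the additions happen in a different order.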

Concepts

Overview

Directives

#pragma omp parallel                         // fork a team of threads; join at closing brace
#pragma omp parallel for                     // distribute loop iterations across the team
#pragma omp parallel for reduction(+:s)      // loop with a parallel reduction
#pragma omp parallel sections                // distribute independent blocks across the team
#pragma omp section                          // one block inside a sections region
#pragma omp single                           // one thread runs the block; others wait at end
#pragma omp master                           // thread 0 only; no implicit barrier (deprecated in OpenMP 5.1 in favour of masked)
#pragma omp task                             // package work for any idle thread to execute
#pragma omp taskwait                         // wait for all child tasks to finish
#pragma omp barrier                          // all threads wait until every thread arrives
#pragma omp critical                         // mutual exclusion — one thread at a time
#pragma omp atomic                           // single hardware-atomic read-modify-write
#pragma omp simd                             // assert the loop is safe to vectorise
#pragma omp flush                            // enforce memory visibility across threads
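The difference between atomic and critical is easiest to see side by side. In this sketch (function names are illustrative) atomic protects a single scalar update, while critical protects a compare-and-assign that spans more than one operation.

```c
#include <limits.h>

/* Count even values; the shared counter is bumped with an atomic update. */
int count_even(const int *data, int n) {
    int count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (data[i] % 2 == 0) {
            #pragma omp atomic
            count++;
        }
    }
    return count;
}

/* Find the maximum; the test and the assignment must happen together,
   so the pair is wrapped in a critical section. */
int find_max(const int *data, int n) {
    int best = INT_MIN;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical
        {
            if (data[i] > best)
                best = data[i];
        }
    }
    return best;
}
```

Both functions are written this way to show the directives, not because it is the fastest option: every iteration of find_max serialises on the critical section. In real code a reduction clause (reduction(+:count), or reduction(max:best) from OpenMP 3.1 on) would usually be the better choice.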

Functions

omp_get_thread_num()      // ID of the calling thread (0 … N-1)
omp_get_num_threads()     // number of threads in the current team
omp_get_max_threads()     // threads that would be used if a parallel region started now
omp_set_num_threads(n)    // set the default thread count at runtime
omp_get_num_procs()       // number of logical processors available to the program
omp_get_wtime()           // wall-clock time in seconds; use for timing parallel regions
omp_in_parallel()         // 1 if called from inside a parallel region, 0 otherwise
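A quick way to see what these functions return on a given machine is to collect them into one report. The struct and function names below are hypothetical, and the #else branch provides serial stand-ins for builds without -fopenmp.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (illustrative) for a build without -fopenmp. */
static int omp_get_num_procs(void)   { return 1; }
static int omp_get_max_threads(void) { return 1; }
static int omp_in_parallel(void)     { return 0; }
#endif

/* Snapshot of what the runtime reports. */
struct runtime_report {
    int procs;         /* omp_get_num_procs() */
    int max_threads;   /* omp_get_max_threads() */
    int outside_flag;  /* omp_in_parallel() called outside any parallel region */
    int inside_flag;   /* omp_in_parallel() called inside a parallel region */
};

struct runtime_report query_runtime(void) {
    struct runtime_report r;
    r.procs = omp_get_num_procs();
    r.max_threads = omp_get_max_threads();
    r.outside_flag = omp_in_parallel();
    r.inside_flag = 0;
    #pragma omp parallel
    {
        #pragma omp single
        r.inside_flag = omp_in_parallel();   /* recorded by one thread only */
    }
    return r;
}

void print_report(struct runtime_report r) {
    printf("procs=%d  max_threads=%d  in_parallel: outside=%d inside=%d\n",
           r.procs, r.max_threads, r.outside_flag, r.inside_flag);
}
```

Note that inside_flag can legitimately be 0 even in an OpenMP build: omp_in_parallel() reports an active region, and a region whose team has only one thread (for example with OMP_NUM_THREADS=1) does not count as active.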

Environment variables

Variable	Default	Description
`OMP_NUM_THREADS`	implementation-defined (usually the logical core count)	Number of threads to use in each parallel region
`OMP_SCHEDULE`	implementation-defined	Schedule kind and optional chunk size for loops declared `schedule(runtime)`, e.g. `dynamic,4`
`OMP_PROC_BIND`	implementation-defined (often `false`)	Thread-to-core affinity policy: `true`, `false`, `close`, `spread`, or `master`
`OMP_PLACES`	(unset)	Placement units for affinity: `cores`, `threads`, or `sockets`
`OMP_MAX_ACTIVE_LEVELS`	usually `1`	Maximum nesting depth of simultaneously active parallel regions
`OMP_DISPLAY_ENV`	`false`	Print OpenMP version and active settings at startup: `TRUE` or `VERBOSE`
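OMP_SCHEDULE only affects loops that opt in with schedule(runtime). The sketch below shows such a loop, plus a helper that asks the runtime which schedule it picked up. The function names are illustrative, and the bit mask that strips a possible monotonic modifier assumes the modifier occupies the top bit of omp_sched_t, as in the OpenMP 5.0 definition.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sum 0..n-1; the iteration schedule is taken from OMP_SCHEDULE at run time. */
long long runtime_scheduled_sum(long long n) {
    long long sum = 0;
    #pragma omp parallel for schedule(runtime) reduction(+:sum)
    for (long long i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Print the schedule the runtime will apply to schedule(runtime) loops. */
void print_runtime_schedule(void) {
#ifdef _OPENMP
    omp_sched_t kind;
    int chunk;
    omp_get_schedule(&kind, &chunk);
    /* Assumption: a monotonic modifier, if present, is the top bit. */
    unsigned base = (unsigned)kind & 0x7fffffffu;
    const char *name = base == (unsigned)omp_sched_static  ? "static"
                     : base == (unsigned)omp_sched_dynamic ? "dynamic"
                     : base == (unsigned)omp_sched_guided  ? "guided"
                     : base == (unsigned)omp_sched_auto    ? "auto"
                     : "other";
    printf("schedule(runtime) -> %s, chunk %d\n", name, chunk);
#else
    printf("built without OpenMP: loops run serially\n");
#endif
}
```

Running a program that calls print_runtime_schedule() with OMP_SCHEDULE=dynamic,4 set should report a dynamic schedule with chunk size 4, and the sum is the same whichever schedule is chosen. Setting OMP_DISPLAY_ENV=TRUE alongside it is a convenient cross-check of what the runtime actually parsed.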

Ivan's wiki
