In a single-process program, two readings of clock_gettime(CLOCK_MONOTONIC, &ts) taken before and after a region give its elapsed time as seen by that one process. In a parallel program, each process may run on a different core or node, and the clocks of different nodes are generally not synchronised with one another. Timing on just one process gives only that process's view of the wall time, which can be shorter than the actual parallel section if that process happens to finish early. MPI_Wtime is MPI's equivalent of clock_gettime: it returns wall-clock time in seconds as a double, measured from an arbitrary point in the past, so only differences between two calls on the same process are meaningful. The standard pattern is to time the region on every process and reduce the per-process elapsed times with MPI_MAX, which yields the time taken by the slowest process. MPI_Wtick returns the timer resolution in seconds, which is useful for confirming that the timer is precise enough for the region being measured.
    MPI_Barrier(MPI_COMM_WORLD);  // align all processes before starting the clock
    double t0 = MPI_Wtime();
    do_work();
    double elapsed = MPI_Wtime() - t0;

    double max_elapsed;
    MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%.6f s (timer resolution: %.2e s)\n", max_elapsed, MPI_Wtick());
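Reducing with MPI_MAX gives the wall time of the parallel section, which is set by the slowest process; MPI_Reduce delivers that value only to the root (rank 0 here), which is why only rank 0 prints it. A mean would underestimate the wall time whenever processes finish at different times. For microbenchmarks, wrap the timed region in a loop and divide the total by the iteration count, so that the measured interval stays well above the timer resolution.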