SIMD (OpenMP)
A modern CPU can add four floats (or eight, with 256-bit AVX registers) in a single instruction rather than one at a time, loading them into a wide register and operating on all lanes simultaneously. This is SIMD (single instruction, multiple data), exposed on x86 as SSE/AVX and on ARM as NEON. The compiler attempts to use these instructions automatically (auto-vectorisation), but auto-vectorisation can be blocked by possible pointer aliasing, non-unit strides, or conditionals the compiler cannot prove safe. #pragma omp simd is an explicit assertion that a loop is safe to vectorise, allowing the compiler to emit SIMD instructions even when it would otherwise be cautious. The assertion is not checked: if the iterations are not in fact independent, the vectorised loop can produce wrong results.
#pragma omp simd reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += a[i] * b[i];
}
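For context, the reduction loop can be wrapped in a complete function. This is a minimal sketch: the function name dot and its signature are assumptions chosen for illustration, not part of the original text. It also assumes a compiler that honours the pragma, for example GCC or Clang with -fopenmp or -fopenmp-simd.

```c
#include <stddef.h>

/* Dot product of two float arrays of length n.
 * The name and signature are illustrative only. */
float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;

    /* reduction(+:sum) lets each SIMD lane accumulate its own partial sum;
     * the partial sums are combined into sum after the loop.
     * If OpenMP support is not enabled, the pragma is ignored and the loop
     * still compiles as an ordinary scalar loop. */
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because the lanes add in a different order than a scalar loop would, the floating-point result may differ slightly in the last bits for general inputs.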
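The simd construct can be combined with parallel for as #pragma omp parallel for simd, which distributes iterations across threads and vectorises the chunk of iterations each thread receives. The aligned(ptr : 32) clause asserts that ptr points to a 32-byte-aligned address, which the aligned AVX load/store instructions (for example vmovaps on 256-bit registers) require. Like simd itself, this is a promise the compiler does not verify: if the pointer is not actually aligned, the behaviour is undefined.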
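The combined construct and the aligned clause might be used together as in the following sketch. The names axpy_aligned and alloc_floats_32 are hypothetical helpers introduced here for illustration. The sketch assumes a C11 library that provides aligned_alloc, and it relies on the caller to pass only pointers that really are 32-byte aligned.

```c
#include <stdlib.h>

/* Compute y[i] += alpha * x[i] for i in [0, n).
 * The caller promises that x and y are 32-byte aligned;
 * the aligned clause passes that promise on to the compiler. */
void axpy_aligned(float alpha, const float *x, float *y, int n)
{
    /* Threads split the iteration range; each thread's chunk is vectorised. */
    #pragma omp parallel for simd aligned(x, y : 32)
    for (int i = 0; i < n; i++) {
        y[i] += alpha * x[i];
    }
}

/* Allocate room for n floats on a 32-byte boundary (hypothetical helper).
 * aligned_alloc expects the size to be a multiple of the alignment,
 * so the byte count is rounded up to the next multiple of 32.
 * Returns NULL on failure; release the memory with free(). */
float *alloc_floats_32(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    size_t rounded = (bytes + 31) / 32 * 32;
    return aligned_alloc(32, rounded);
}
```

Memory from plain malloc is typically only guaranteed to be aligned for the largest standard scalar type, so passing such a pointer to axpy_aligned would break the promise made by the aligned clause.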
