Modern CPUs have Vector Processing Units (VPUs) that allow the processor to do the same instruction on multiple data, SIMD per cycle.
|System||microarchitecture||Instruction Set||SIMD width|
On KNL with 512 bit vector operations 8 double precision operations can be done with each instruction. A code which takes advantage of that can potentially achieve an 8x speedup!
In many cases a compiler is able to transform sequential code into vector operations automatically - a process known as automatic vectorization.
do i = 1, n c(i) = a(i) + b(i) end do
Could be transformed by the compiler such that blocks of 4 elements are processed at a time:
do i = 1, n, 4 c(i) = a(i) + b(i) c(i+1) = a(i+1) + b(i+1) c(i+2) = a(i+2) + b(i+2) c(i+3) = a(i+3) + b(i+3) end do
The loop trip count must be known at entry to the loop at runtime. Statements that can change the trip count dynamically at runtime (such as Fortran's
IF, etc. or C/C++'s
break) must not be present inside the loop.
Branching in the loop inhibits vectorization. Thus, C/C++'s
switchstatements are not allowed. However,
ifstatements are allowed as long as they can be implemented as masked assignments. The calculation is done for all
ifbranches but the results is stored only for those elements for which the mask evaluates to true.
Only the innermost loop is eligible for vectorization. If the compiler transforms an outer loop into an inner loop as a result of optimization, then the loop may be vectorized.
A function call or I/O inside a loop prohibits vectorization. Intrinsic math functions such as
sin, etc. are allowed because such library functions are usually vectorized versions. A loop containing a function that is inlined by the compiler can be vectorized because there will be no more function call.
Data dependencies in the loop could prevent vectorization.
Non-contiguous memory access hampers vectorization efficiency. Eight consecutive ints or floats, or four consecutive doubles, may be loaded directly from memory in a single AVX instruction. But if they are not adjacent, they must be loaded separately using multiple instructions, which is considerably less efficient.
Also known as "flow dependency". Vectorization creates wrong results.
do i=2,n a(i) = a(i-1) + 1 end do
Also known as "anti-dependency" and can be vectorized.
do i=2,n a(i-1) = a(i) + 1 end do
Also known as "output dependency" and cannot be vectorized.
do i=2,n a(i-1) = x(i) a(i) = 2.0 * i end do
Reduction operations can be vectorized.
s=0.0 do i=1,n s = s + a(i) * b(i) end do
Data movement instructions are more efficient when operating on data objects that are aligned.
While Fortran does not have extensions in the language itself for data alignment some compilers provide non-portable directives or command line flags: the Cray compiler has the directive
!DIR$ VECTOR ALIGNED and the Intel compiler has the compiler flag
Fortran alignment example¶
The following test code examines the effect of memory alignment in a simple-minded matrix-matrix multiplication case. We pad the matrices with extra rows to make them aligned at certain boundaries.
program matmat implicit none integer :: n = 31 integer :: itmax = 200000 #ifdef REAL4 real, allocatable :: a(:,:), b(:,:), c(:,:) #else real*8, allocatable :: a(:,:), b(:,:), c(:,:) #endif #ifdef ALIGN16 !dir$ attributes align : 16 :: a,b,c #elif defined(ALIGN32) !dir$ attributes align : 32 :: a,b,c #elif defined(ALIGN64) !dir$ attributes align : 64 :: a,b,c #endif integer i, j, k, it integer :: vl, nr integer*8 c1, c2, cr, cm real*8 dt !... Vector length #ifdef ALIGN16 vl = 16 / (storage_size(a) / 8) #elif defined(ALIGN32) vl = 32 / (storage_size(a) / 8) #elif defined(ALIGN64) vl = 64 / (storage_size(a) / 8) #else vl = 1 #endif nr = ((n + (vl - 1)) / vl) * vl ! padded row dimension allocate (a(nr,n), b(nr,n), c(nr,n)) !... Initialization do j=1,n #if defined(ALIGN16) || defined(ALIGN32) || defined(ALIGN64) !dir$ vector aligned #endif do i=1,nr a(i,j) = cos(i * 0.1 + j * 0.2) b(i,j) = sin(i * 0.1 + j * 0.2) c(i,j) = 0. end do end do !... Main loop call system_clock(c1, cr, cm) do it=1,itmax do j=1,n do k=1,n #if defined(ALIGN16) || defined(ALIGN32) || defined(ALIGN64) !dir$ vector aligned #endif do i=1,nr c(i,j) = c(i,j) + a(i,k) * b(k,j) end do end do end do end do call system_clock(c2, cr, cm) print *, c(1,1)+c(n,n), dble(c2-c1)/dble(cr) deallocate(a, b, c) end
AoS vs SoA¶
A data object can become complex with multiple component elements or attributes. Programmers often represent a group of such data objects using an array of Fortran's derived data type or C's struct objects (i.e., an array of structures or AoS). Although an AoS provides a natural way to represent such data, memory reference of any component requires non-unit stride access. Such a situation is illustrated in the following example code. When the main loop is transformed into a vector loop, three components of a 'coords' object will be stored into three separate vector registers, one for each component. With the AoS data layout, loading into such a register will require stride 3 (or more) access, reducing efficiency of the vector load.
A better data structure for vectorization is to separate each component of the objects into its own array, and then form a data object composed of three arrays (i.e., a structure of arrays or SoA). When the main loop is vectorized, each component will be loaded into a separate register but this will be done with unit-stride access. Therefore, vectorization will be more efficient.
program aossoa implicit none integer :: n = 1000 integer :: itmax = 10000000 #ifdef SOA type coords real, pointer :: x(:), y(:), z(:) end type type (coords) :: p #else type coords real :: x, y, z end type type (coords), allocatable :: p(:) #endif real, allocatable :: dsquared(:) integer i, it integer*8 c1, c2, cr, cm real*8 dt !... Initialization #ifdef SOA allocate(p%x(n), p%y(n), p%z(n), dsquared(n)) do i=1,n p%x(i) = cos(i + 0.1) p%y(i) = cos(i + 0.2) p%z(i) = cos(i + 0.3) end do #else allocate(p(n), dsquared(n)) do i=1,n p(i)%x = cos(i + 0.1) p(i)%y = cos(i + 0.2) p(i)%z = cos(i + 0.3) end do #endif !... Main loop call system_clock(c1, cr, cm) do it=1,itmax #ifdef SOA do i=1,n dsquared(i) = p%x(i)**2 + p%y(i)**2 + p%z(i)**2 end do #else do i=1,n dsquared(i) = p(i)%x**2 + p(i)%y**2 + p(i)%z**2 end do #endif end do call system_clock(c2, cr, cm) dt = dble(c2-c1)/dble(cr) print *, dsquared(1)+dsquared(n/2)+dsquared(n), dt #ifdef SOA deallocate(p%x, p%y, p%z, dsquared) #else deallocate(p, dsquared) #endif end
Elemental functions are functions that can be also invoked with an array actual argument and return array results of the same shape as the argument array. This convenient feature is quite common in Fortran as it is widely used in many intrinsic functions.
A function call inside a loop generally inhibits vectorization. However, if an elemental function is called within a loop, the loop can be executed in vector mode. In vector mode, the function is called with multiple data packed in a vector register and returns packed data.
module fofx implicit none contains elemental function f(x) real(8) :: f real(8), intent(in) :: x f = cos(x * x + 1.0_8) / (x * x + 1.0_8) end function f end module fofx program main use fofx implicit none integer :: n = 1024 integer :: itmax = 1000000 real(8), allocatable :: a(:), x(:) integer :: i, it integer(8) :: c1, c2, cr, cm real(8) :: dt allocate (a(n), x(n)) !... Initialization do i=1,n x(i) = cos(i * 0.1_8) + 0.2_8 end do !... Main loop call system_clock(c1, cr, cm) do it=1,itmax do i=1,n a(i) = f(x(i)) end do x(n) = x(n) + 1.0_8 end do call system_clock(c2, cr, cm) dt = real(c2-c1, 8)/real(cr, 8) write(*,*) n, a(1)+a(n/2)+a(n), dt deallocate(a, x) end program main
The OpenMP standard has the SIMD construct since 4.0 to specify the execution of a loop in vectorization mode (i.e., SIMD operations).
#pragma omp simd [clause...]
!$omp simd [clause...]
where the optional clause could be:
safelen(length)- maximum length for safe vectorization without incurring data dependency
aligned(list[:alignment])- list of the variables are aligned to the number of bytes expressed in the optional parameter.
reduction(reduction-identifier:list)- list the variables where a reduction operation (i.e.,
maxfor maximum, etc.) result is stored
collapse(n)- how many levels of the nested loops that immediately follow the OpenMP directive should be collapsed into a single aggregate loop with larger iteration space.
do it=1,itmax do j=1,n do k=1,n #if defined(ALIGN16) !$omp simd aligned(a,b,c:16) #elif defined(ALIGN32) !$omp simd aligned(a,b,c:32) #elif defined(ALIGN64) !$omp simd aligned(a,b,c:64) #endif do i=1,nr c(i,j) = c(i,j) + a(i,k) * b(k,j) end do end do end do end do
It is also possible to declare that a function can be vectorized with OpenMP.
module fofx implicit none contains #ifdef ELEMENTAL !$omp declare simd (f) #endif function f(x) #ifdef REAL4 real f, x #else real*8 f, x #endif #ifdef REAL4 f = cos(x * x + 1.e0) / (x * x + 1.e0) #else f = cos(x * x + 1.d0) / (x * x + 1.d0) #endif end function f end module fofx ...