# Vectorization¶

Modern CPUs have Vector Processing Units (VPUs) that allow the processor to do the same instruction on multiple data, SIMD per cycle.

System microarchitecture Instruction Set SIMD width
Cori Haswell AVX2 256 bits
Cori KNL AVX-512 512 bits

Tip

On KNL with 512 bit vector operations 8 double precision operations can be done with each instruction. A code which takes advantage of that can potentially achieve an 8x speedup!

## Auto-vectorization¶

In many cases a compiler is able to transform sequential code into vector operations automatically - a process known as automatic vectorization.

Example

do i = 1, n
c(i) = a(i) + b(i)
end do


Could be transformed by the compiler such that blocks of 4 elements are processed at a time:

do i = 1, n, 4
c(i) = a(i) + b(i)
c(i+1) = a(i+1) + b(i+1)
c(i+2) = a(i+2) + b(i+2)
c(i+3) = a(i+3) + b(i+3)
end do


## Vectorization requirements¶

1. The loop trip count must be known at entry to the loop at runtime. Statements that can change the trip count dynamically at runtime (such as Fortran's EXIT, computed IF, etc. or C/C++'s break) must not be present inside the loop.

2. Branching in the loop inhibits vectorization. Thus, C/C++'s switch statements are not allowed. However, if statements are allowed as long as they can be implemented as masked assignments. The calculation is done for all if branches but the results is stored only for those elements for which the mask evaluates to true.

3. Only the innermost loop is eligible for vectorization. If the compiler transforms an outer loop into an inner loop as a result of optimization, then the loop may be vectorized.

4. A function call or I/O inside a loop prohibits vectorization. Intrinsic math functions such as cos, sin, etc. are allowed because such library functions are usually vectorized versions. A loop containing a function that is inlined by the compiler can be vectorized because there will be no more function call.

5. Data dependencies in the loop could prevent vectorization.

6. Non-contiguous memory access hampers vectorization efficiency. Eight consecutive ints or floats, or four consecutive doubles, may be loaded directly from memory in a single AVX instruction. But if they are not adjacent, they must be loaded separately using multiple instructions, which is considerably less efficient.

### Data dependency¶

Also known as "flow dependency". Vectorization creates wrong results.

do i=2,n
a(i) = a(i-1) + 1
end do


Also known as "anti-dependency" and can be vectorized.

do i=2,n
a(i-1) = a(i) + 1
end do


#### write-after-write¶

Also known as "output dependency" and cannot be vectorized.

do i=2,n
a(i-1) = x(i)
a(i)   = 2.0 * i
end do


#### Reduction operations¶

Reduction operations can be vectorized.

s=0.0
do i=1,n
s = s + a(i) * b(i)
end do


## Memory alignment¶

Data movement instructions are more efficient when operating on data objects that are aligned.

Note

While Fortran does not have extensions in the language itself for data alignment some compilers provide non-portable directives or command line flags: the Cray compiler has the directive !DIRVECTOR ALIGNED and the Intel compiler has the compiler flag -align array64byte. ### Fortran alignment example¶ The following test code examines the effect of memory alignment in a simple-minded matrix-matrix multiplication case. We pad the matrices with extra rows to make them aligned at certain boundaries.  program matmat implicit none integer :: n = 31 integer :: itmax = 200000 #ifdef REAL4 real, allocatable :: a(:,:), b(:,:), c(:,:) #else real*8, allocatable :: a(:,:), b(:,:), c(:,:) #endif #ifdef ALIGN16 !dir attributes align : 16 :: a,b,c
#elif defined(ALIGN32)
!dirattributes align : 32 :: a,b,c #elif defined(ALIGN64) !dir attributes align : 64 :: a,b,c
#endif
integer i, j, k, it
integer :: vl, nr
integer*8 c1, c2, cr, cm
real*8 dt

!...  Vector length

#ifdef ALIGN16
vl = 16 / (storage_size(a) / 8)
#elif defined(ALIGN32)
vl = 32 / (storage_size(a) / 8)
#elif defined(ALIGN64)
vl = 64 / (storage_size(a) / 8)
#else
vl = 1
#endif

nr = ((n + (vl - 1)) / vl) * vl      ! padded row dimension
allocate (a(nr,n), b(nr,n), c(nr,n))

!...  Initialization

do j=1,n
#if defined(ALIGN16) || defined(ALIGN32) || defined(ALIGN64)
!dirvector aligned #endif do i=1,nr a(i,j) = cos(i * 0.1 + j * 0.2) b(i,j) = sin(i * 0.1 + j * 0.2) c(i,j) = 0. end do end do !... Main loop call system_clock(c1, cr, cm) do it=1,itmax do j=1,n do k=1,n #if defined(ALIGN16) || defined(ALIGN32) || defined(ALIGN64) !dir vector aligned
#endif
do i=1,nr
c(i,j) = c(i,j) + a(i,k) * b(k,j)
end do
end do
end do
end do
call system_clock(c2, cr, cm)

print *, c(1,1)+c(n,n), dble(c2-c1)/dble(cr)
deallocate(a, b, c)
end


## AoS vs SoA¶

Array of Structures vs Structures of Arrays

A data object can become complex with multiple component elements or attributes. Programmers often represent a group of such data objects using an array of Fortran's derived data type or C's struct objects (i.e., an array of structures or AoS). Although an AoS provides a natural way to represent such data, memory reference of any component requires non-unit stride access. Such a situation is illustrated in the following example code. When the main loop is transformed into a vector loop, three components of a 'coords' object will be stored into three separate vector registers, one for each component. With the AoS data layout, loading into such a register will require stride 3 (or more) access, reducing efficiency of the vector load.

A better data structure for vectorization is to separate each component of the objects into its own array, and then form a data object composed of three arrays (i.e., a structure of arrays or SoA). When the main loop is vectorized, each component will be loaded into a separate register but this will be done with unit-stride access. Therefore, vectorization will be more efficient.

      program aossoa

implicit none
integer :: n = 1000
integer :: itmax = 10000000
#ifdef SOA
type coords
real, pointer :: x(:), y(:), z(:)
end type
type (coords) :: p
#else
type coords
real :: x, y, z
end type
type (coords), allocatable :: p(:)
#endif
real, allocatable :: dsquared(:)
integer i, it
integer*8 c1, c2, cr, cm
real*8 dt

!...  Initialization

#ifdef SOA
allocate(p%x(n), p%y(n), p%z(n), dsquared(n))
do i=1,n
p%x(i) = cos(i + 0.1)
p%y(i) = cos(i + 0.2)
p%z(i) = cos(i + 0.3)
end do
#else
allocate(p(n), dsquared(n))
do i=1,n
p(i)%x = cos(i + 0.1)
p(i)%y = cos(i + 0.2)
p(i)%z = cos(i + 0.3)
end do
#endif

!...  Main loop

call system_clock(c1, cr, cm)
do it=1,itmax
#ifdef SOA
do i=1,n
dsquared(i) = p%x(i)**2 + p%y(i)**2 + p%z(i)**2
end do
#else
do i=1,n
dsquared(i) = p(i)%x**2 + p(i)%y**2 + p(i)%z**2
end do
#endif
end do
call system_clock(c2, cr, cm)

dt = dble(c2-c1)/dble(cr)
print *, dsquared(1)+dsquared(n/2)+dsquared(n), dt
#ifdef SOA
deallocate(p%x, p%y, p%z, dsquared)
#else
deallocate(p, dsquared)
#endif
end


## Elemental functions¶

Elemental functions are functions that can be also invoked with an array actual argument and return array results of the same shape as the argument array. This convenient feature is quite common in Fortran as it is widely used in many intrinsic functions.

A function call inside a loop generally inhibits vectorization. However, if an elemental function is called within a loop, the loop can be executed in vector mode. In vector mode, the function is called with multiple data packed in a vector register and returns packed data.

### Fortran example¶

module fofx
implicit none
contains
elemental function f(x)
real(8) :: f
real(8), intent(in) :: x
f = cos(x * x + 1.0_8) / (x * x + 1.0_8)
end function f
end module fofx

program main
use fofx
implicit none
integer :: n = 1024
integer :: itmax = 1000000
real(8), allocatable :: a(:), x(:)
integer :: i, it
integer(8) :: c1, c2, cr, cm
real(8) ::  dt

allocate (a(n), x(n))

!... Initialization

do i=1,n
x(i) = cos(i * 0.1_8) + 0.2_8
end do

!... Main loop

call system_clock(c1, cr, cm)
do it=1,itmax
do i=1,n
a(i) = f(x(i))
end do
x(n) = x(n) + 1.0_8
end do
call system_clock(c2, cr, cm)

dt = real(c2-c1, 8)/real(cr, 8)
write(*,*) n, a(1)+a(n/2)+a(n), dt

deallocate(a, x)
end program main


## OpenMP¶

The OpenMP standard has the SIMD construct since 4.0 to specify the execution of a loop in vectorization mode (i.e., SIMD operations).

#pragma omp simd [clause...]

!omp simd [clause...]  where the optional clause could be: • safelen(length) - maximum length for safe vectorization without incurring data dependency • aligned(list[:alignment]) - list of the variables are aligned to the number of bytes expressed in the optional parameter. • reduction(reduction-identifier:list) - list the variables where a reduction operation (i.e., + for summation, min for minimum, max for maximum, etc.) result is stored • collapse(n) - how many levels of the nested loops that immediately follow the OpenMP directive should be collapsed into a single aggregate loop with larger iteration space. ### Memory alignment¶  do it=1,itmax do j=1,n do k=1,n #if defined(ALIGN16) !omp simd aligned(a,b,c:16)
#elif defined(ALIGN32)
!omp simd aligned(a,b,c:32) #elif defined(ALIGN64) !omp simd aligned(a,b,c:64)
#endif
do i=1,nr
c(i,j) = c(i,j) + a(i,k) * b(k,j)
end do
end do
end do
end do


### Elemental functions¶

It is also possible to declare that a function can be vectorized with OpenMP.

     module fofx
implicit none
contains
#ifdef ELEMENTAL
!\$omp declare simd (f)
#endif
function f(x)
#ifdef REAL4
real f, x
#else
real*8 f, x
#endif
#ifdef REAL4
f = cos(x * x + 1.e0) / (x * x + 1.e0)
#else
f = cos(x * x + 1.d0) / (x * x + 1.d0)
#endif
end function f
end module fofx
...