Memory Subsystem Tuning

There are a number of important steps a user can take to tune an application for the memory hierarchy. We go into detail on some of these issues below.

Maximize Cache Reuse

  1. Always minimize stride length. Stride-1 access is optimal on most systems, and particularly on vector systems. If that is not possible, low-stride access should be the goal; it increases cache efficiency and enables hardware and software prefetching. Strides that are large powers of two are typically the worst case, causing cache-line conflicts and cache misses.

    The following code snippets illustrate the correct way to access contiguous elements, i.e. stride 1, for a matrix in both C and Fortran.

    Fortran Example:
    Real*8 :: a(m,n), b(m,n), c(m,n)
    ...
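    ! Fortran stores arrays in column-major order: the first index
    ! varies fastest, so the inner loop runs over j for stride-1 access.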
    do i=1,n
      do j=1,m
       a(j,i)=b(j,i)+c(j,i)
      end do
    end do
    					
    C Example:
    double a[m][n], b[m][n], c[m][n];
    ...
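    /* C stores arrays in row-major order: the last index varies
       fastest, so the inner loop runs over j for stride-1 access. */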
    for (i=0;i < m;i++){
      for (j=0;j < n;j++){
        a[i][j]=b[i][j]+c[i][j];
      }
    }
    				

  2. Another approach to data reuse is cache blocking. The idea is to load chunks of data sized so that they fit in the various levels of cache while they are in use. Otherwise, data must be reloaded into cache from memory each time it is needed because it is no longer in cache; this event is commonly known as a cache miss. Cache misses are costly from a computational standpoint, since the latency of loading data from memory is a few orders of magnitude higher than loading it from cache. The goal is to keep as much of the data as possible in cache while it is in use and to minimize loads from memory.

    This concept is illustrated in the following matrix-matrix multiply example, where the i, j, k loops are blocked so that the largest possible submatrices remain in cache while the computation is ongoing. A reasonable rule of thumb is to choose the blocking factor nb so that three nb x nb blocks of 8-byte elements fit in the target cache level (3 * nb**2 * 8 bytes <= cache size).

    Example: Matrix multiplication

    Real*8 a(n,n), b(n,n), c(n,n)
    do ii=1,n,nb  ! <- nb is the blocking factor
      do jj=1,n,nb
        do kk=1,n,nb
          ! work on one nb x nb block at a time, so that the three
          ! submatrices stay resident in cache
          do i=ii,min(n,ii+nb-1)
            do j=jj,min(n,jj+nb-1)
              do k=kk,min(n,kk+nb-1)
                c(i,j)=c(i,j)+a(i,k)*b(k,j)
              end do
            end do
          end do
        end do
      end do
    end do
    		
  3. Another standard issue is the leading dimension of stored arrays: it is always best to avoid leading dimensions that are a multiple of a high power of two. More particularly, users should be aware of the cache line size and associativity. Performance degrades when the stride is a multiple of the cache's way size (cache size divided by associativity), because successive accesses then compete for the same few cache lines.

    Example: consider an L1 cache that is 16 KB in size and 4-way set associative, with a cache line of 64 bytes.
    Real*8 :: a(1024,50)
    ...
    do i=1,n
      a(1,i)=0.50*a(1,i)
    end do
    				
    A 16 KB 4-way set-associative cache with 64-byte lines holds 256 lines, organized as 64 sets of 4 lines; each of the 4 ways is 4 KB (4096 bytes), so addresses 4 KB apart map to the same set. The stride of this loop is 1024 elements * 8 bytes = 8 KB, a multiple of 4 KB, so every access maps to the same set and only 4 of the 256 cache lines are ever used. The user ends up with an effective 256-byte cache instead of the original 16 KB, due to a non-optimal choice of leading dimension.
    Solution: change the leading dimension to 1028 (1024 + 4 elements, i.e. half a cache line of padding), as sketched below.
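    A minimal sketch of the fix; the four padding rows are never referenced:

    Real*8 :: a(1028,50)   ! leading dimension padded by half a cache line
    ...
    do i=1,n
      a(1,i)=0.50*a(1,i)   ! successive accesses now map to different sets
    end do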

Encourage Data Prefetching to Hide Memory Latency

Prefetching is the ability to predict the next cache line to be accessed and start bringing it in from memory. If data is requested far enough in advance, the latency to memory can be hidden. The compiler inserts prefetch instructions into loops -- instructions that move data from main memory into cache in advance of its use. Prefetching may also be requested by the user through compiler directives.
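The directive form is compiler-specific; as a sketch, assuming the Intel Fortran directive syntax, a user-requested prefetch for a dot product might look like:

!DIR$ PREFETCH x, y   ! compiler-specific hint; syntax varies by vendor
do i=1,n
  s=s+x(i)*y(i)
end do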

Example: In the following dot-product example, the number of prefetched streams is increased from 2 to 4 to 6 while keeping the same functionality. Take care, however: prefetching an increasing number of streams does not necessarily translate into increased performance. There is a threshold beyond which prefetching more streams becomes counterproductive.
2 streams:

do i=1,n
  s=s+x(i)*y(i)
end do
dotp=s

4 streams:

do i=1,n/2
  s0=s0+x(i)*y(i)
  s1=s1+x(i+n/2)*y(i+n/2)
end do
do i=2*(n/2)+1,n   ! picks up the last element when n is odd
  s0=s0+x(i)*y(i)
end do
dotp=s0+s1

6 streams:

do i=1,n/3
  s0=s0+x(i)*y(i)
  s1=s1+x(i+n/3)*y(i+n/3)
  s2=s2+x(i+2*n/3)*y(i+2*n/3)
end do
do i=3*(n/3)+1,n   ! picks up the remainder when n is not divisible by 3
  s0=s0+x(i)*y(i)
end do
dotp=s0+s1+s2
Work Within Available Physical Memory

The most important issue is to make sure the problem size fits within physical memory. Working in virtual memory leads to performance degradation and should be avoided at all costs. In addition, swapping causes problems on some Linux systems.

Floating-Point Tuning

Unroll Inner Loops to Hide FP Latency

In the following dot-product example, two points are illustrated. If the inner loop trip count is small, loop overhead dominates and it is better to unroll the inner loop. In addition, unrolling inner loops hides floating-point latency by keeping several independent partial sums in flight. A more advanced measure for micro-level optimization is the ratio of operations to data accesses in a compute step -- more precisely, the ratio of floating multiply-adds to data accesses. The higher this ratio, the better.

...
do i=1,n,k
  s1 = s1 + x(i)*y(i)
  s2 = s2 + x(i+1)*y(i+1)
  s3 = s3 + x(i+2)*y(i+2)
  s4 = s4 + x(i+3)*y(i+3)
  ...
  sk = sk + x(i+k-1)*y(i+k-1)
end do
...
dotp = s1 + s2 + s3 + s4 + ... + sk

Avoid Divide Operations

The following example illustrates a very common transformation: a floating-point divide is much more expensive than a multiply, so if a divide by a loop-invariant value occurs inside a loop, it is better to compute the reciprocal once outside the loop and multiply instead, provided no dependencies exist. Another alternative is to replace the loop with a call to an optimized vector library, if available (see the sketch after this example).

Before:
a=...
do i=1,n
  x(i)=x(i)/a
end do

After:
a=...
ainv=1.0/a
do i=1,n
  x(i)=x(i)*ainv
end do
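If an optimized library is available, the same scaling can also be expressed as a single BLAS call. A minimal sketch, assuming a standard BLAS implementation is linked:

a=...
call dscal(n, 1.0d0/a, x, 1)   ! BLAS: x(1:n) = x(1:n) * (1/a)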

I/O Subsystem Tuning

A common-sense approach is to use what the vendor provides, i.e., take advantage of the hardware. On Linux clusters, for example, this means using the Parallel Virtual File System (PVFS); on IBM systems, it means using the fast General Parallel File System (GPFS) provided by IBM.

Another common-sense approach to optimizing I/O is to be aware of the existence and location of filesystems, i.e., whether a filesystem is locally mounted or accessed remotely. Local filesystems are much faster than remote ones, due to network bandwidth limits, disk speed, and the overhead of accessing a filesystem over the network; using local filesystems should be the goal at the design level.

Other approaches involve choosing the best software options available. Some of them are enumerated below; a short sketch combining the first three follows the list.

  1. Read or write as much data as possible with a single READ/WRITE/PRINT; avoid performing many writes of small records.
  2. Use binary instead of ASCII format, because of the overhead of converting between the internal representation of real numbers and character strings. In addition, ASCII files are larger than the corresponding binary files.
  3. In Fortran, prefer direct access to sequential access. Direct (random) access files do not have record-length indicators at the beginning and end of each record.
  4. If available, use asynchronous I/O to overlap reads/writes with computation. This capability is available on Cray, IBM, and SGI systems.
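The following minimal sketch combines points 1-3: one large unformatted, direct-access write. The unit number and file name are illustrative:

real*8 :: a(1000000)
integer :: rl
inquire(iolength=rl) a   ! record length needed to hold the whole array
open(10, file='data.bin', form='unformatted', access='direct', recl=rl)
write(10, rec=1) a       ! one large binary write instead of many small ones
close(10)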
Fortran90 Performance Pitfalls

There are a couple of design issues in Fortran 90 applications that users should be aware of.

First, the F90 array syntax for multi-dimensional arrays impacts performance. Consider the two snippets given below:

Case 1:
do j = js,je
  do k = ks,ke
    do i = is,ie
      rt(i,k,j) = rt(i,k,j) - smdiv*(rt(i,k,j) - rtold(i,k,j))
    enddo
  enddo
enddo
		
Case 2:
rt(is:ie,ks:ke,js:je) = rt(is:ie,ks:ke,js:je) - &
    smdiv * (rt(is:ie,ks:ke,js:je) - rtold(is:ie,ks:ke,js:je))
		

Although it is more elegant, the array syntax in the second approach carries a significant performance penalty over explicit loops on cache-based systems. Vector systems, on the other hand, tend to prefer array syntax from a performance standpoint. More importantly, the array syntax can generate large temporary arrays on the program stack.

The way arrays are declared also impacts performance. In the following example, the F90 assumed-shape arrays of the second case are far more costly: compile time balloons from 46 seconds to 3120 seconds, and run time increases as well. (This timing test was done on an IBM Power3 for a climate application.)

Case 1:
REAL, DIMENSION( ims:ime , kms:kme , jms:jme ) :: r, rt, rw, rtold
Results in F77-style explicit-shape arrays

  Compile time:  46 seconds
  Run time:      0.064 seconds / call

Case 2:
REAL, DIMENSION( ims:    , kms:    , jms:    ) :: r, rt, rw, rtold
Results in F90-style assumed-shape arrays

  Compile time:  3120 seconds!!
  Run time:      0.083 seconds / call
		

Another issue arises when an F90 assumed-shape array is passed as an argument to a subroutine. Passing an assumed-shape array (or an array section) may cause the compiler to pass a contiguous temporary copy rather than the array itself. This F90 copy-in/copy-out overhead is very inefficient and can also cause errors when calling external libraries such as MPI, since the library may end up referencing the temporary copy rather than the original array.
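The classic manifestation is with nonblocking MPI. A minimal sketch of the hazard, assuming the standard MPI Fortran binding (dest, tag, and the buffer sizes are illustrative):

use mpi
real*8 :: buf(1000)
integer :: dest, tag, req, ierr
...
! The strided section below is not contiguous, so the compiler may pass
! a temporary contiguous copy to MPI_ISEND (copy-in/copy-out). The
! nonblocking send then references the temporary, which is freed when
! the call returns, and the data actually sent are undefined.
call MPI_ISEND(buf(1:1000:2), 500, MPI_REAL8, dest, tag, &
               MPI_COMM_WORLD, req, ierr)
! Safer: copy the section into a contiguous buffer first, or describe
! the stride with an MPI derived datatype and pass the full array.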