There are a number of important steps that a user can take to optimize their application for the memory hierarchy. We go into detail on some of these issues below.

Maximize Cache Reuse
The following snippets of code illustrate the correct way to access contiguous elements, i.e. with stride 1, for a matrix in both Fortran and C.
Fortran Example:

    Real*8 :: a(m,n), b(m,n), c(m,n)
    ...
    do i=1,n
      do j=1,m
        a(j,i)=b(j,i)+c(j,i)
      end do
    end do
C Example:

    double a[m][n], b[m][n], c[m][n];
    ...
    for (i=0; i<m; i++){
      for (j=0; j<n; j++){
        a[i][j]=b[i][j]+c[i][j];
      }
    }
This concept is also illustrated in the following matrix-matrix multiply example, where the i, j, k loops are blocked so that the largest possible submatrices of a, b, and c remain in cache while the computation is ongoing.
Example: Matrix multiplication
    Real*8 a(n,n), b(n,n), c(n,n)
    do ii=1,n,nb                    ! nb is the blocking factor
      do jj=1,n,nb
        do kk=1,n,nb
          do i=ii,min(n,ii+nb-1)
            do j=jj,min(n,jj+nb-1)
              do k=kk,min(n,kk+nb-1)
                c(i,j)=c(i,j)+a(i,k)*b(k,j)
              end do
            end do
          end do
        end do
      end do
    end do
Example: consider an L1 cache that is 16 KB in size and 4-way set associative, with a cache line of 64 bytes.
    Real*8 :: a(1024,50)
    ...
    do i=1,n
      a(1,i)=0.50*a(1,i)
    end do
A 16 KB 4-way set-associative cache has 4 ways of 4 KB (4096 bytes) each. With 64-byte cache lines, that is 64 lines per way and 256 lines in total. Each column of a spans 1024 x 8 = 8192 bytes, a multiple of 4096, so every element a(1,i) maps to the same cache set, which can hold only 4 lines. The non-optimal choice of leading dimension effectively reduces the L1 from 256 cache lines to only 4: the user ends up with a 256-byte cache out of the original 16 KB.
Solution: change the leading dimension to 1028 (1024 plus half a cache line, i.e. four more Real*8 elements) so that successive columns no longer map to the same cache set.
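A minimal sketch of the padded declaration (the four extra rows per column are allocated but never referenced):

    Real*8 :: a(1028,50)   ! leading dimension padded from 1024 to 1028
    ...
    do i=1,n
      a(1,i)=0.50*a(1,i)   ! successive a(1,i) now fall in different sets
    end do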
Prefetching is the ability to predict the next cache line to be accessed and to start bringing it in from memory early. If data is requested far enough in advance, the latency to memory can be hidden. The compiler inserts prefetch instructions into loops -- instructions that move data from main memory into cache in advance of its use. Prefetching may also be requested by the user through directives.
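Some compilers accept per-loop prefetch directives. The sketch below uses the Intel-style !DIR$ PREFETCH directive purely as an illustration; the exact directive name and syntax vary by compiler and should be checked against the compiler's documentation:

    !DIR$ PREFETCH x, y    ! compiler-specific hint to prefetch the x and y streams
    do i=1,n
      s=s+x(i)*y(i)
    end do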
Example: In the following dot-product example, the number of streams prefetched is increased from 2 to 4 to 6 while keeping the same functionality. However, care must be taken: prefetching an increasing number of streams does not necessarily translate into increased performance. There is a threshold beyond which prefetching a higher number of streams becomes counterproductive.
2 streams:

    do i=1,n
      s=s+x(i)*y(i)
    end do
    dotp=s

4 streams:

    do i=1,n/2
      s0=s0+x(i)*y(i)
      s1=s1+x(i+n/2)*y(i+n/2)
    end do
    do i=2*(n/2)+1,n
      s0=s0+x(i)*y(i)
    end do
    dotp=s0+s1

6 streams:

    do i=1,n/3
      s0=s0+x(i)*y(i)
      s1=s1+x(i+n/3)*y(i+n/3)
      s2=s2+x(i+2*n/3)*y(i+2*n/3)
    end do
    do i=3*(n/3)+1,n
      s0=s0+x(i)*y(i)
    end do
    dotp=s0+s1+s2
Unroll Inner Loops to Hide FP Latency
In the following dot-product example, two points are illustrated. If the inner-loop trip count is small, the loop overhead dominates, and it is better to unroll the inner loop. In addition, unrolling inner loops hides floating-point latency by keeping several independent partial sums in flight. A more advanced notion of micro-level optimization is the ratio of floating multiply-add operations to data accesses in a compute step; the higher this ratio, the better. (A dot product, for example, performs one multiply-add per two loads, a ratio of only 0.5, whereas a blocked matrix multiply reuses each loaded element many times.)
    ...
    do i=1,n,k
      s1 = s1 + x(i)*y(i)
      s2 = s2 + x(i+1)*y(i+1)
      s3 = s3 + x(i+2)*y(i+2)
      s4 = s4 + x(i+3)*y(i+3)
      ...
      sk = sk + x(i+k-1)*y(i+k-1)
    end do
    ...
    dotp = s1 + s2 + s3 + s4 + ... + sk
Avoid Divide Operations
The following example illustrates a very common optimization, since a floating-point divide is more expensive than a multiply. If a divide by a loop-invariant value occurs inside a loop, it is better to compute the reciprocal once outside the loop and substitute a multiply, provided no dependencies exist. Another alternative is to replace the loop with an optimized vector intrinsics library, if available.
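A minimal sketch of this transformation (the variable names are illustrative):

    ! Before: one divide per loop iteration
    do i=1,n
      y(i) = x(i)/a
    end do

    ! After: compute the reciprocal once, multiply inside the loop
    rinv = 1.0d0/a
    do i=1,n
      y(i) = x(i)*rinv
    end do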
I/O Subsystem Tuning
A common-sense approach entails using what is provided by the vendor, i.e. taking advantage of the hardware. On Linux-based clusters, for example, this would mean using the Parallel Virtual File System (PVFS); on IBM systems, it would mean using the fast General Parallel File System (GPFS) provided by IBM.
Another common-sense approach to optimizing I/O is to be aware of the existence and locations of the filesystems, i.e. whether a filesystem is mounted locally or accessed remotely. Local filesystems are much faster, since remote access is limited by network bandwidth, disk speed, and the overhead of going over the network; local access should therefore always be the goal at the design level.
Other approaches involve considering the best software options available.
There are also a couple of design issues for Fortran 90 applications that users should be aware of.
First, F90 array syntax impacts performance. Consider the two snippets given below:
Case 1:
    do j = js,je
      do k = ks,ke
        do i = is,ie
          rt(i,k,j) = rt(i,k,j) - smdiv*(rt(i,k,j) - rtold(i,k,j))
        enddo
      enddo
    enddo
Case 2:
    rt(is:ie,ks:ke,js:je) = rt(is:ie,ks:ke,js:je) - &
        smdiv*(rt(is:ie,ks:ke,js:je) - rtold(is:ie,ks:ke,js:je))
Although the array syntax of the second approach is more elegant, it carries a significant performance penalty over explicit loops on cache-based systems. (Vector systems, by contrast, tend to prefer the array syntax from a performance standpoint.) More importantly, the array syntax generates large temporary arrays on the program stack.
The way the arrays are declared also impacts performance. In the following example, the F90 assumed-shape declaration of the second case incurs a dramatic compile-time penalty and a noticeable run-time penalty. (This timing test was done on an IBM Power3 for a climate application.)
Case 1:
    REAL, DIMENSION( ims:ime , kms:kme , jms:jme ) :: r, rt, rw, rtold

Results in F77-style assumed-size arrays.
Compile time: 46 seconds
Run time: 0.064 seconds / call
Case 2:
    REAL, DIMENSION( ims: , kms: , jms: ) :: r, rt, rw, rtold

Results in F90-style assumed-shape arrays.
Compile time: 3120 seconds!!
Run time: 0.083 seconds / call
Another issue arises when an F90 assumed-shape array is passed as an argument to a subroutine. Because an assumed-shape dummy may describe a non-contiguous array section, the compiler may pass a contiguous copy rather than the array itself. This F90 copy-in/copy-out overhead is very inefficient and can also cause errors when calling external libraries such as MPI.
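A minimal sketch of how the copy can arise (the routine and variable names are illustrative):

    subroutine scale_section(a)
      real, dimension(:) :: a      ! assumed-shape dummy: may describe a
                                   ! non-contiguous section such as b(1:n:2)
      ! legacy_scale expects contiguous storage, so the compiler may pack
      ! a into a contiguous temporary before the call (copy-in) and copy
      ! the results back afterwards (copy-out).
      call legacy_scale(a, size(a))
    end subroutine scale_section

    subroutine legacy_scale(buf, n)
      integer :: n, i
      real :: buf(n)               ! F77-style explicit-shape dummy
      do i = 1, n
        buf(i) = 0.5*buf(i)
      end do
    end subroutine legacy_scale

If a non-blocking MPI call is handed such a temporary, the copy may be released before the communication completes, which is one source of the errors mentioned above.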