Profiling

Can I speed up my program?

=Introduction=

How fast does each part of my program take? Perhaps I could speed it up? As with debugging, we could add print statements, use a stopwatch, compile and re-run etc. etc. But that would be tedius and there are tools around which do the job way better.

Programming Languages

 * The very significant influence of an optimising compiler.
 * Memory hierarchies

Scripting Languages

 * Avoid loops, especially in interpreted languages (like R).
 * tic, toc timings in matlab
 * The python profiler.

=Compiler Options=

You probably want to make sure that you've added all the go-faster flags that are available before you embark on any profiling. Activating optimisations for speed can make your progam run a lot faster!

=Tools=


 * TAU
 * PerfExpert
 * gprof
 * valgrind

=Using Valgrind=

Valgrind is an excellent open-source tool for debugging and profiling.

Compile your program as normal, adding in any optimisation flags that you desire, but also add the -g flag so that valgrind can report useful information such as line numbers etc. Then run it through the callgrind tool, embedded in valgrind:

valgrind --tool=callgrind your-program [program options]

When your program has run you find a file called callgrind.out.xxxx in the current directory, where xxxx is replaced with a number (the process ID of the command you have just executed).

You can inspect the contents of this newly created file using a graphical display too call kcachegrind:

kcachegrind callgrind.out.xxxx

(For those using Enterprise Linux, you call install valgrind and kcachegrind using yum install kdesdk valgrind.)

=An Simple Example=

svn co https://svn.ggy.bris.ac.uk/subversion-open/profiling/trunk ./profiling cd profiling/examples/example1 make valgrind --tool=callgrind ./div.exe >& div.out kcachegrind callgrind.out.xxxx

We can see from the graphical display given by kcachegrind that our inefficient division routine takes far more of the runtime that our efficient routine. Using this kind of information, we can focus our re-engineering efforts on the slower parts of our program.



=OK, but how do I make my code run faster?=

OK, let's assume that we've located a region of our program that is taking a long time. So far so good, but how can we address that? There are--of course--a great many reasons why some code may take a long time to run. One reason could be just that it has a lot to do! Let's assume, however, that the code can be made to run faster by applying a little more of the old grey matter. With this in mind, let's consider a few common examples of code which can tweaked to run faster.

A Common Situation Involoving Loops
For our first example, consider the way we loop over the contents of a 2-dimensional array--a pretty common situation!. Skip over a directory and compile the example programs:

cd ../example2 make

In this directory, we have two almost identical programs. Compare the contents of loopingA.f90 and loopingB.f90. You will see that inside each program a 2-dimensional array is declared and two nested loops are used to visit each of the cells of the array in turn (and an arithmetic value is dropped in). The only way in which the two programs differ is the order in which the cycle through the contents of the arrays. In loopingA.f90, the outer loop is over the columns of the array and the inner loop is over the rows (i.e. for a given value of ii, all values of jj and cycled over). The opposite is true for loopingB.f90.

Let's compare how long it takes these two programs to run. We can do this using the Linux time command:

time ./loopingA.exe

and now,

time ./loopingB.exe

Dude! loopingA.exe takes more than twice the time of loopingB.exe. What's the reason?

Well, Fortran stores it's 2-dimensional arrays in column-major order. Our 2-dimension array is actually stored in the computer memory as a 1-dimension array, where the cells in a given column are next to each other. For example, the array:


 * $$ \begin{bmatrix}

1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$$

would actually be stored in memory as:

1 4  2  5  3  6

That means is that our outer loop should be over the columns, and our inner loop over the rows. Otherwise we would end up hopping all around the memory, which takes extra time, and explains why loopingA takes longer.

The opposite situation is true for the C programming language, which stores its 2-dimensional arrays in row-major order. You can type:

make testC

To run some equivalent C code examples.

Not Doing the Same Thing Twice (or more!)
The following is the Manning formula drawn from the field of hydrology:


 * $$V = \frac{1}{n} R_h ^\frac{2}{3} \cdot S^\frac{1}{2}$$

where:


 * $$V$$ || is the cross-sectional average velocity (m/s)
 * $$n$$ || is the Gauckler–Manning coefficient (independent of units)
 * $$R_h$$ || is the hydraulic radius (m)
 * $$S$$ || is the slope of the water surface or the linear hydraulic head loss (m/m) ($$S=h_f/L$$)
 * }
 * $$R_h$$ || is the hydraulic radius (m)
 * $$S$$ || is the slope of the water surface or the linear hydraulic head loss (m/m) ($$S=h_f/L$$)
 * }
 * }

A program to compute estimates of $$V$$, given an array of N hydraulic radii may contain the following loop:

do ii=1,N V(ii) = (1/n) * (R(n)**(2.0/3.0)) * sqrt(S) end do

However, this loop may be rather wasteful. For example, if the value for the Gauckler–Manning coefficient, $$n$$ is the same for all the hydraulic radii, we would be computing (1/n) needlessly each time through the loop and could take it outside:

n_inv = (1/n)

do ii=1,N V(ii) = n_inv * (R(n)**(2.0/3.0)) * sqrt(S) end do

This would make particular sense, since the division operator is an expensive one. We also don't need to compute (2.0/3.0)) over and over:

n_inv = (1/n) exponent=2.0/3.0

do ii=1,N V(ii) = n_inv * (R(n)**exponent) * sqrt(S) end do

Lastly, if the slope of the water surface is constant, then the square-root in the loop is also superfluous:

n_inv = (1/n) exponent=2.0/3.0 root_S = sqrt(S)

do ii=1,N V(ii) = n_inv * (R(n)**exponent) * root_S end do

Now, modern optimising compilers are getting better and better at spotting this sort of thing and may actually create machine code from the first example which is as if you had made all the subsequent improvements. This is partly why we profile our programs in the first place, as it is very difficult to second guess what the compiler will do with your source code.

With that as a caveat it is worth just briefly listing typically expensive operations (in terms of CPU cycles) which a profile may point you too and which you may be able to address through some programming tweaks:


 * memory operations such as allocate in Fortran, or malloc in C
 * sqrt, exp and log
 * division and raising raising numbers to a power

It's difficult to make sweeping statements in the area of code optimisation, but it will certainly pay not to do the same thing twice (or more!), especially is that thing is a relatively expensive operation.

Compiler Optimisations
Once you have re-factored any portions of your code highlighted by profiling your program, you may like to take a few minutes to peruse all the command line options available on your compiler. A number of these will be aimed at execution speed. Those aimed at Streaming SIMD Extensions (SSE) can be particularly effective.

A caveat, however, is that you should be careful when invoking aggressive compiler optimisations since, under certain circumstances, they can change the values that your program will output. See, for example the numerics tutorial.