There are some tools that provide useful insights when we're looking at low-level behaviour of code. Let's look at a few. You might want some for the Mini-Project in this course.


We met GDB and gdbtui in labs 2 and 3.

It's most useful when there's a gap between your expectation for the behaviour of some code and the results you're seeing.

With appropriate application of breakpoints and watch values (and appropriate test cases), you should be able to narrow in on the problem.
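As a quick refresher, a session might look something like this (the file, function, and variable names here are made up for illustration):

$ gcc -g -O0 buggy.c
$ gdb ./a.out
(gdb) break calculate_something
(gdb) run
(gdb) watch total
(gdb) continue

Breaking at the function gets you to the interesting code; watching the variable stops execution each time its value changes.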


If this is a way of interacting with code that feels good to you, then great. If not, that's fine with me (or maybe you need more practice with it).

Interactive debuggers like this are available for many programming languages and/or IDEs. Have a look for one in your favourite languages.

Measuring Time

A question we have often had for our code: how long is it taking to run?

In timing tests I have given, I used the x86 time stamp counter, which counts the number of CPU cycles since the core was reset.

Measuring Time

It's accessible with the rdtsc (read time-stamp counter) instruction, but I have been using the C __rdtsc compiler intrinsic (i.e. way to tell the compiler I want exactly that instruction).

#include <stdint.h>
#include <x86intrin.h>
uint64_t start = __rdtsc();
result = calculate_something();
uint64_t end = __rdtsc();
uint64_t elapsed = end - start;

Measuring Time

The time stamp counter is probably not the best way to measure running time.

There is no guarantee that the counters on different CPU cores are synchronized: if your code is moved between cores, you might get start and end times that are unrelated. Also, a core's clock speed varies with its power state (e.g. power-saving vs. full-speed), so a raw cycle count might not be very meaningful.

Measuring Time

But for this course, I wanted to be able to give examples in CPU cycles, so I decided to ignore the problems.

Do as I say, not as I do…

Measuring Time

The more-correct thing to do is to look at elapsed time at high resolution. The C function clock_gettime reads a clock at nanosecond resolution (or as close as the hardware supports): CLOCK_MONOTONIC gives wall clock (elapsed real) time, and CLOCK_PROCESS_CPUTIME_ID (used below) gives the CPU time our process actually consumed.

#define _POSIX_C_SOURCE 199309L
#include <time.h>
struct timespec start, end;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
result = calculate_something();
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
double elapsed = (end.tv_sec - start.tv_sec)
                 + (end.tv_nsec - start.tv_nsec) / 1e9;

Measuring Time

I wrote a quick experiment to try them both:

Calculated 5011704 in 458092524 cycles.
Calculated 5011704 in 0.122615 s.

This was on a 3.7 GHz processor, and the two numbers seem very close, at least in this case:

>>> 458092524 / 3.7e9
0.12380879027027027
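For reference, a minimal sketch of what such an experiment could look like (x86-only; calculate_something here is a placeholder of my own, not the code from my test, and it should be compiled with a modest optimization level so the loop isn't optimized away):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

// placeholder workload: the original test presumably did something else
uint64_t calculate_something(void) {
    uint64_t total = 0;
    for (uint64_t i = 0; i < 100000000; i++) {
        total += i % 7;
    }
    return total;
}

int main(void) {
    struct timespec ts_start, ts_end;
    // cycle count surrounds the clock_gettime calls, so it includes
    // their (small) overhead
    uint64_t c_start = __rdtsc();
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_start);
    uint64_t result = calculate_something();
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_end);
    uint64_t c_end = __rdtsc();
    double elapsed = (ts_end.tv_sec - ts_start.tv_sec)
                     + (ts_end.tv_nsec - ts_start.tv_nsec) / 1e9;
    printf("Calculated %llu in %llu cycles.\n",
           (unsigned long long)result, (unsigned long long)(c_end - c_start));
    printf("Calculated %llu in %f s.\n", (unsigned long long)result, elapsed);
    return 0;
}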

Measuring Time

In general when measuring the running time of code…

  • Do your best to measure the parts you care about, not parts you don't (e.g. program startup, allocating memory or filling arrays, writing files with results, etc).

Measuring Time

  • Remember the CPU power states: idle CPUs will be in a slower low-power mode and take a little time to come back to full speed. Configuring them into a performance mode, or doing a little pre-measurement warmup (see the sketch after this list), may be necessary.
  • Processors can overheat and thermal throttle, slowing themselves down to avoid damage, especially in laptops and small-form-factor systems. It usually happens around 99°C: if your CPU temperatures are in the 90°C range, your results may be meaningless.
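A minimal warmup sketch (calculate_something is a hypothetical stand-in for the workload being measured; the repeat count is arbitrary):

volatile uint64_t sink = 0;
for (int warm = 0; warm < 3; warm++) {
    sink += calculate_something();  // untimed runs: core ramps up to full speed
}
// ...now take the timed measurement

The volatile sink discourages the compiler from optimizing the warmup away.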

Timing Commands

Timing the whole program execution can measure lots of stuff you don't want (e.g. creating data structures for the real test), but it works on any program, which is flexible.

Also, there are lots of tools that do it with various levels of sophistication.

Timing Commands

The Linux time command is a common quick way to see the running time of a command:

$ time ./a.out
real    0m1.714s
user    0m1.389s
sys     0m0.114s

This is 1.714 s of elapsed clock time: 1.389 s of processor time running our code, and 0.114 s spent in system calls (≈ the operating system doing stuff on our behalf).

Note: for a program that uses two threads and does lots of compute work in both, you'd expect user to be around double real.
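If you want to see that for yourself, a toy two-thread spinner is enough (a sketch of my own; compile with gcc -pthread, then run it under time):

#include <pthread.h>
#include <stdio.h>

// spin burning CPU time so the thread shows up in the user total
void *spin(void *arg) {
    (void)arg;
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 2000000000UL; i++) {
        x += i;
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, spin, NULL);
    pthread_create(&t2, NULL, spin, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("done\n");
    return 0;
}

With both threads busy the whole run, user should come out roughly twice real.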

Timing Commands

That way of calling time uses the built-in in the Bash shell. There's also an executable /usr/bin/time that gives a lot more info with -v:

$ /usr/bin/time -v ./a.out
    Command being timed: "./a.out"
    User time (seconds): 3.70
    System time (seconds): 0.22
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.94
    Maximum resident set size (kbytes): 394760
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 98382
    Swaps: 0

Timing Commands

Both the time built-in and /usr/bin/time command are a quick way to check running time of a whole program. They're useful if you're okay with fairly coarse timing, or can't do clock_gettime kind of stuff for some reason (e.g. not being able to modify the source).


The Perf tool can run your program while counting a huge variety of stats about what's going on.

The default perf stat gives a good quick summary…


$ perf stat ./a.out
 Performance counter stats for './a.out':

            711.62 msec task-clock                #    1.000 CPUs utilized          
                 2      context-switches          #    2.810 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
            12,565      page-faults               #   17.657 K/sec                  
     3,099,396,198      cycles                    #    4.355 GHz                    
     2,160,022,445      instructions              #    0.70  insn per cycle         
       451,838,503      branches                  #  634.939 M/sec                  
           945,485      branch-misses             #    0.21% of all branches        

       0.711910230 seconds time elapsed

       0.675934000 seconds user
       0.035996000 seconds sys


perf can tell you about the things it can count, and you can select specific counters with -e on the command line:

perf list

e.g. we can focus on branches by selecting only a few counters:

perf stat -e branches,branch-misses,\
br_inst_retired.not_taken ./a.out


Let's try to confirm my story about the branch predictor and this code:

for (uint64_t i = 0; i < n; i++) {
    if (arr[i] < 0) {
        total += arr[i];
    }
}

I claimed that with a random array there are lots of branch mispredictions and with an alternating even-odd array there are few.
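The array setup isn't shown here; these hypothetical helpers are one way to produce the two cases for this if (my sketch, assuming a signed 64-bit array):

#include <stdlib.h>
#include <stdint.h>

// unpredictable: roughly half the values negative, in random order
void fill_random(int64_t *arr, uint64_t n) {
    for (uint64_t i = 0; i < n; i++) {
        arr[i] = (int64_t)rand() - RAND_MAX / 2;
    }
}

// alternating: even indices non-negative, odd indices negative,
// a pattern the branch predictor learns almost immediately
void fill_alternating(int64_t *arr, uint64_t n) {
    for (uint64_t i = 0; i < n; i++) {
        arr[i] = (i % 2 == 0) ? (int64_t)i : -(int64_t)i;
    }
}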


With a random array:

     1,393,825,373      branches                                                    
       321,245,207      branch-misses             #   23.05% of all branches        
       360,752,605      br_inst_retired.not_taken                                   

With an alternating even-odd array:

     1,394,767,126      branches                                                    
           262,392      branch-misses             #    0.02% of all branches        
       360,848,628      br_inst_retired.not_taken                                   


for (uint64_t i = 0; i < n; i++) {
    if (arr[i] < 0) {
        total += arr[i];
    }
}

Remember there's a conditional branch implied by both the for and the if. The loop-bottom branch should be extremely predictable: it's taken on all but the last iteration (or not taken, depending on how the compiler arranged the loop logic).

So here, the mispredicted branches (23% of all branches) are probably almost exactly half of the if branches: the if is roughly half of all branches executed, and with random data it mispredicts about half the time.


Or maybe we're interested in cache performance. Then maybe a command like this:

perf stat -e L1-dcache-load-misses,L1-dcache-loads,\
LLC-load-misses,LLC-loads ./a.out

(LLC = last-level cache, probably L3.)


Let's try with the random access array sum code: the example that had an array of indices that could be shuffled or not to access the data randomly or sequentially.

for (uint32_t i = 0; i < length; i++) {
    total += data[indices[i]];
}


With the indices array shuffled:

        64,394,902      L1-dcache-load-misses     #    8.68% of all L1-dcache accesses  (67.09%)
       741,750,025      L1-dcache-loads                                               (67.09%)
        25,817,611      LLC-load-misses           #   53.76% of all LL-cache accesses  (66.68%)
        48,026,135      LLC-loads                                                     (66.19%)

Without the shuffle:

         6,450,308      L1-dcache-load-misses     #    1.54% of all L1-dcache accesses  (67.01%)
       418,986,443      L1-dcache-loads                                               (67.01%)
           266,494      LLC-load-misses           #   61.86% of all LL-cache accesses  (66.29%)
           430,775      LLC-loads                                                     (65.98%)

Remember this is the whole program: creating the arrays, shuffling, and summing.


With the shuffle, there are more total memory accesses: those are the work of shuffling the array. We're not quite making a fair comparison, because we're measuring programs that do different things (apart from the comparison we want).

// shuffle_array(indices, length);


For a fair comparison, we should always create a copy of indices that's shuffled, and just choose which to use. That would artificially make the counts the same before the part we want to measure.

indices_shuffled = (uint32_t*)malloc(length * sizeof(uint32_t));
if (indices_shuffled == NULL) { … }
memcpy(indices_shuffled, indices, length * sizeof(uint32_t));
shuffle_array(indices_shuffled, length);
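shuffle_array isn't shown in these excerpts; a typical implementation would be a Fisher–Yates shuffle, something like this sketch (assuming rand() is an acceptable source of randomness):

#include <stdlib.h>
#include <stdint.h>

// Fisher–Yates shuffle: walk backwards, swapping each element with a
// random position at or before it (rand() % ... is slightly biased,
// but fine for an experiment like this)
void shuffle_array(uint32_t *arr, uint32_t length) {
    if (length < 2) {
        return;
    }
    for (uint32_t i = length - 1; i > 0; i--) {
        uint32_t j = (uint32_t)rand() % (i + 1);
        uint32_t tmp = arr[i];
        arr[i] = arr[j];
        arr[j] = tmp;
    }
}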


Using indices_shuffled:

        67,401,304      L1-dcache-load-misses     #    8.96% of all L1-dcache accesses  (66.86%)
       752,394,854      L1-dcache-loads                                               (66.86%)
        25,553,268      LLC-load-misses           #   54.52% of all LL-cache accesses  (66.86%)
        46,870,946      LLC-loads                                                     (66.39%)

Using the unshuffled indices:

        39,101,601      L1-dcache-load-misses     #    5.13% of all L1-dcache accesses  (66.61%)
       762,779,577      L1-dcache-loads                                               (66.61%)
        10,132,539      LLC-load-misses           #   50.84% of all LL-cache accesses  (66.78%)
        19,929,036      LLC-loads                                                     (66.78%)


Those ways of starting perf all had it start the program and measure all of it.

It's possible to start Perf on an already-running process if you know its process ID (PID).

perf stat -e branches,branch-misses -p 12345

Any Linux process/CPU monitoring tool will tell you the PIDs of running programs. Or…


So will the getpid C function. A quick way to measure only some of the code in your program is two terminal windows and something like this:

printf("Pausing. Start: perf stat -p %u\n", getpid());
// the code you care about measuring...
printf("The good part is done. Ctrl-C on perf.\n");


I did that with my shuffled array adder and the main loop. With the shuffled indices:

        32,527,861      L1-dcache-load-misses     #   53.28% of all L1-dcache accesses  (68.93%)
        61,045,114      L1-dcache-loads                                               (69.09%)
        16,311,932      LLC-load-misses           #   57.91% of all LL-cache accesses  (68.16%)
        28,168,193      LLC-loads                                                     (65.08%)

Using the unshuffled indices:

         4,967,712      L1-dcache-load-misses     #    7.90% of all L1-dcache accesses  (68.60%)
        62,906,699      L1-dcache-loads                                               (68.66%)
           535,821      LLC-load-misses           #   71.49% of all LL-cache accesses  (62.73%)
           749,518      LLC-loads                                                     (62.68%)


The lesson, I think: it's not too hard to use Perf to see broad trends in code. Seeing exactly what's happening is tricky and you have to carefully consider what's going on and what parts you care about.

The length of the perf list menu of counters gives away how much detail is available, if you can figure out the right question to ask.

An experiment…


Can I confirm my theory about the data hazards slowing down the array sum code?

I'm going to compare these two implementations: same loop, but using one vs two accumulators.

for (uint64_t i = 0; i < n - 1; i += 2) {
    total += arr[i];
    total += arr[i + 1];
}

for (uint64_t i = 0; i < n - 1; i += 2) {
    total1 += arr[i];
    total2 += arr[i + 1];
}
// ...then total1 + total2 at the end


Compiled -O2 and run:

perf stat -e L1-dcache-loads,branches,branch-misses,\
instructions,cycles ./a.out
One accumulator:

       784,176,365      L1-dcache-loads                                               (80.00%)
       451,016,427      branches                                                      (80.14%)
           391,793      branch-misses             #    0.09% of all branches          (80.15%)
     2,158,643,786      instructions              #    0.70  insn per cycle           (80.14%)
     3,080,949,868      cycles                                                        (79.58%)

Two accumulators:

       779,680,179      L1-dcache-loads                                               (79.80%)
       451,253,316      branches                                                      (80.28%)
           283,216      branch-misses             #    0.06% of all branches          (80.28%)
     2,169,797,960      instructions              #    1.17  insn per cycle           (80.28%)
     1,860,696,786      cycles                                                        (79.36%)

Same memory accesses, branches, and instructions, but the two-accumulator version took about 40% fewer cycles (1.86 vs. 3.08 billion). I still think it's probably data hazards. Still no proof.


It's also possible to record the Perf counts to a file, so you can take a more detailed look later.

First: include debugging symbols when you compile (-g) and probably step back on compiler optimization (-O1) so the assembly maps more closely to your code.

Then something like:

perf record -e branches,branch-misses ./a.out


Then you can explore the recorded data:

perf report

There's more info here: which functions were accounting for more/​less of each thing.

Keys: ↑/↓ (to move around), enter (to descend into an option), esc (to move back out), a (annotate code with counter results), in annotation t (to switch between percents/​totals), h for help.


If you want to look at the annotation of your code line-by-line, note that the counters are associated with lines close to when the event occurred, not necessarily the exact line.

You can ask for increased precision (if it's supported) by adding :p or :pp to the counter. So the pattern becomes something like:

gcc … -O1 -g code.c
perf record -e branches:pp,branch-misses:pp ./a.out
perf report


I have had much more comprehensible results from Perf on code compiled -O1 instead of -O3.

I suspect the higher optimization levels transform the code enough that it becomes hard to map runtime behaviour back to my code.


Running perf generally requires an admin to allow access to the hardware performance counters (there are apparently security implications). Also, most of the counters aren't available in virtual machines.

See the Running Perf page on the course web site for info.


Another tool (that you can run without admin powers): Valgrind.

With its defaults, it looks for memory problems, including leaks:

valgrind ./a.out
==2923447== HEAP SUMMARY:
==2923447==     in use at exit: 0 bytes in 0 blocks
==2923447==   total heap usage: 3 allocs, 3 frees, 51,201,024 bytes allocated
==2923447== All heap blocks were freed -- no leaks are possible


Probably more interesting for us: cachegrind, which simulates the cache and reports on cache performance.

valgrind --tool=cachegrind ./a.out

Let's try it on the random access array sum problem…


With the indices_shuffled array:

==2933165== I   refs:      602,938,883
==2933165== I1  misses:          1,530
==2933165== LLi misses:          1,517
==2933165== I1  miss rate:        0.00%
==2933165== LLi miss rate:        0.00%
==2933165== D   refs:      258,000,928  (173,577,018 rd   + 84,423,910 wr)
==2933165== D1  misses:     10,164,053  (  9,115,035 rd   +  1,049,018 wr)
==2933165== LLd misses:      5,798,679  (  4,749,697 rd   +  1,048,982 wr)
==2933165== D1  miss rate:         3.9% (        5.3%     +        1.2%  )
==2933165== LLd miss rate:         2.2% (        2.7%     +        1.2%  )
==2933165== LL refs:        10,165,583  (  9,116,565 rd   +  1,049,018 wr)
==2933165== LL misses:       5,800,196  (  4,751,214 rd   +  1,048,982 wr)
==2933165== LL miss rate:          0.7% (        0.6%     +        1.2%  )


With the non-shuffled indices array:

==2933109== I   refs:      602,938,547
==2933109== I1  misses:          1,521
==2933109== LLi misses:          1,508
==2933109== I1  miss rate:        0.00%
==2933109== LLi miss rate:        0.00%
==2933109== D   refs:      258,000,799  (173,576,937 rd   + 84,423,862 wr)
==2933109== D1  misses:      6,497,057  (  5,448,039 rd   +  1,049,018 wr)
==2933109== LLd misses:      2,984,517  (  1,935,535 rd   +  1,048,982 wr)
==2933109== D1  miss rate:         2.5% (        3.1%     +        1.2%  )
==2933109== LLd miss rate:         1.2% (        1.1%     +        1.2%  )
==2933109== LL refs:         6,498,578  (  5,449,560 rd   +  1,049,018 wr)
==2933109== LL misses:       2,986,025  (  1,937,043 rd   +  1,048,982 wr)
==2933109== LL miss rate:          0.3% (        0.2%     +        1.2%  )


Note: without the shuffle there are ≈6.5M misses at D1 (L1 data) and ≈3.0M at LLd (last-level data, probably L3). With the shuffle, ≈10.2M and ≈5.8M.

Both are measuring the whole program (creating, filling, and shuffling the arrays, before the part I care about).


If you have compiled with debugging info (-g), the cg_annotate command will give much more detailed info about which functions and lines of code were hitting/​missing cache.

cg_annotate cachegrind.out.12345

… and the output is far too wide to put on a lecture slide.


On the shuffled indexing code, it reported this line:

total += data[indices[i]];

… having a D1mr (D1 cache read misses) of:

4,453,109 (48.86%)

… which matches my understanding of what's happening. It might be clearer if the two array accesses were on separate lines.
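e.g. rewriting it something like this (my variation, not the original code) would let cg_annotate attribute the misses to each access separately:

uint32_t idx = indices[i];   // sequential read of the indices array
total += data[idx];          // scattered read of data: where the misses should show up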


Note: Valgrind will slow down execution of your code: it replaces the memory allocator (malloc and friends) and collects detailed counts of what happens while the code is running.

So, it's great for analyzing memory behaviour. Don't try to measure performance at the same time.


Sometimes it's not clear what part(s) of your code are taking the time.

In particular, if you want to make your code faster, speeding up something that takes 0.01% of the total running time isn't going to matter. If you have code that's taking 80% of the total program running time, that's where to look for speed gains.


In general, a profiler will record what parts of your code are taking time during execution. It usually reports per-function, and tells you how many times each function was called, or how much time execution spent in that code.


Valgrind also has a trick for this: Callgrind.

It measures the number of times each instruction runs (and the total number of instructions executed within a function call). We have seen that number of instructions ≠ running time, but it's related.


Again, compile with debugging info (-g). Then run with Callgrind:

valgrind --tool=callgrind ./a.out

That produces a callgrind.out.* file. Then:

callgrind_annotate callgrind.out.12345


In the Callgrind report, look for large percents. We expect main to take ≈100% of the running time, but below that you might find surprises.


e.g. an excerpt for the random array sum code:

         4 ( 0.00%)      build_arrays(&data, &indices, length);
297,261,446 (49.30%)  => random_access_2.c:build_arrays (1x)
 1,048,591 ( 0.17%)      for (INDEX_T i = 0; i < length; i++) {
 5,767,174 ( 0.96%)          total += data[indices[i]];
         .               }
         4 ( 0.00%)  int main(int argc, char* argv[]) {
         ⋮
602,792,553 (99.98%)  => random_access.c:test (1x)
         3 ( 0.00%)  }

i.e. it takes more instructions to build the arrays than to sum them. We know that's not where the time is taken, but it's interesting.


There are other interesting profiling tools out there that we won't cover here.


The lesson here: there are many tools that can help you understand what's happening in your code (or between your code and the hardware). It's not always easy to work with them or interpret their output, but there's lots of info to be discovered.

But don't forget: you can also look at code, understand what it does, and infer things about its behaviour based on what we have discussed in this course.