There are some tools that provide useful insights when we're looking at low-level behaviour of code. Let's look at a few. You might want some for the Mini-Project in this course.
We met GDB and gdbtui in labs 2 and 3.
It's most useful when there's a gap between your expectation for the behaviour of some code and the results you're seeing.
With appropriate application of breakpoints and watch values (and appropriate test cases), you should be able to narrow in on the problem.
If this is a way of interacting with code that feels good to you, then great. If not, that's fine with me (or maybe you need more practice with it).
Interactive debuggers like this are available in many programming languages and/or IDEs. Have a look in your favourite languages/tools.
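For reference, a quick gdb session for narrowing in on a problem might look something like this (the program, function, and variable names here are hypothetical, just to show the shape of it): set a breakpoint on the function you suspect, run, inspect and watch a variable, and continue until the watchpoint fires.

$ gcc -g -O0 broken.c -o broken
$ gdb ./broken
(gdb) break calculate_something
(gdb) run
(gdb) print total
(gdb) watch total
(gdb) continue
(gdb) backtrace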
A question we have often had for our code: how long is it taking to run?
I have occasionally used the x86 time stamp counter, which counts the number of CPU cycles since the core was reset.
It's accessible with the rdtsc (read time-stamp counter) instruction, but I have been using the C __rdtsc compiler intrinsic (i.e. a way to tell the compiler I want exactly that instruction).
#include <x86intrin.h>
#include <stdint.h>

uint64_t start = __rdtsc();
result = calculate_something();
uint64_t end = __rdtsc();
uint64_t elapsed = end - start;
The time stamp counter sounds great but is probably not the best way to measure running time.
There is no guarantee that each CPU core will have a synchronized counter: if your code is moved between cores, you might get start and end times that are unrelated. The number of cycles can also vary with the power state of the core (e.g. power-saving vs full-speed), so a cycle count might not mean what you think.
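If you do want to use the TSC anyway, one way to sidestep the cross-core problem (not something the course code does; just a sketch using the Linux-specific sched_setaffinity call) is to pin the measuring thread to a single core first:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

// Pin the calling thread to CPU 0 so the TSC readings come from one core.
static void pin_to_cpu0(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // pid 0 == this thread
        perror("sched_setaffinity");
    }
}

That doesn't fix the power-state issue, just the cross-core one.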
The more-correct thing to do is to look at elapsed real time at a high resolution. The C function clock_gettime will get the time at nanosecond resolution (or as close as the hardware supports).
#define _POSIX_C_SOURCE 199309L
#include <time.h>

struct timespec start, end;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
result = calculate_something();
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
double elapsed = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
I wrote a quick experiment to try them both:
Calculated 50029111 in 3264988053 cycles.
Calculated 50029111 in 0.95471 s.
This was on a processor with a 5.5 GHz boost clock, and the two measurements disagree (in a consistent way), but it's hard to know what the actual frequency of that core was at that moment:
>>> 3264988053 / 5.5e9
0.5936341914545454
A note on the argument to clock_gettime…
CLOCK_PROCESS_CPUTIME_ID indicates the amount of CPU time your process has used: if the process is paused (because the OS gives another process time on the CPU cores), it won't be affected. If you have 3 threads running at 100%, it will progress by 3 s for every real-time second elapsed.
CLOCK_REALTIME measures wall-clock time: it counts only real time, not CPU utilization. If you have 3 threads running, it will progress by 1 s for every real-time second elapsed.
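A minimal sketch of the difference (my own example, not from the course code): time a sleep with both clocks. CLOCK_REALTIME advances by about 2 s, but CLOCK_PROCESS_CPUTIME_ID barely moves, since a sleeping process uses almost no CPU.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double seconds_between(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    struct timespec cpu0, cpu1, real0, real1;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);
    clock_gettime(CLOCK_REALTIME, &real0);

    sleep(2);  // idle: real time passes, but almost no CPU time is used

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);
    clock_gettime(CLOCK_REALTIME, &real1);

    printf("CPU time: %g s, real time: %g s\n",
        seconds_between(cpu0, cpu1), seconds_between(real0, real1));
    return 0;
}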
In general when measuring the running time of code…
Timing the whole program execution can measure lots of stuff you don't want (e.g. creating data structures for the real test), but it works on any program, which is flexible.
Also, there are lots of tools that do it with various levels of sophistication.
The Linux time command is a common quick way to see the running time of a command:
$ time ./a.out

real    0m1.714s
user    0m1.389s
sys     0m0.114s
This is 1.714 s of elapsed clock time; 1.389 s of processor time; and 0.114 s in system calls (≈ the operating system doing stuff on the program's behalf).
Note: for a program that uses two threads and does lots of compute work in both, you'd expect user to be around double real.
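A quick way to see that for yourself (my own sketch, not course code): run the same busy loop in two threads, then time it. On a machine with at least two free cores, user should come out roughly twice real.

#include <pthread.h>
#include <stdio.h>

// Busy-work: each thread spins for a few seconds so it uses ~100% of one core.
static void *spin(void *arg) {
    (void)arg;
    volatile unsigned long long total = 0;
    for (unsigned long long i = 0; i < 2000000000ULL; i++) {
        total += i;
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, spin, NULL);
    pthread_create(&t2, NULL, spin, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("done\n");
    return 0;
}

Compile with gcc -pthread and run it under time.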
That way of calling time is the built-in in the bash shell. There's also an executable /usr/bin/time that gives a lot more info with -v:
$ /usr/bin/time -v ./a.out
        Command being timed: "./a.out"
        User time (seconds): 3.70
        System time (seconds): 0.22
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.94
        ⋮
        Maximum resident set size (kbytes): 394760
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 98382
        ⋮
        Swaps: 0
        ⋮
Both the time built-in and the /usr/bin/time command are a quick way to check the running time of a whole program. They're useful if you're okay with fairly coarse timing, or can't do clock_gettime-style measurement for some reason (e.g. not being able to modify the source).
The Perf tool can run your program while counting a huge variety of stats about what's going on.
The default perf stat gives a good quick summary of what's going on…
$ perf stat ./a.out

 Performance counter stats for './a.out':

            711.62 msec task-clock          #    1.000 CPUs utilized
                 2      context-switches    #    2.810 /sec
                 0      cpu-migrations      #    0.000 /sec
            12,565      page-faults         #   17.657 K/sec
     3,099,396,198      cycles              #    4.355 GHz
     2,160,022,445      instructions        #    0.70  insn per cycle
       451,838,503      branches            #  634.939 M/sec
           945,485      branch-misses       #    0.21% of all branches

       0.711910230 seconds time elapsed

       0.675934000 seconds user
       0.035996000 seconds sys
perf list will tell you about the things perf can count, and you can select counters with -e on the command line.
e.g. we can focus on branches by selecting only a few counters:
perf stat -e branches,branch-misses ./a.out
Let's try to confirm my story about the branch predictor and this code:
for (uint64_t i = 0; i < n; i++) {
    if ( arr[i] < 0 ) {
        total += arr[i];
    }
}
I claimed that with a random array there are lots of branch mispredictions and with an alternating even-odd array there are few.
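The notes don't show how the two test arrays were filled; a plausible setup (my guess, just to make the two cases concrete) is values with random signs vs. a strictly alternating negative/positive pattern. Results with each kind of array are below.

#include <stdint.h>
#include <stdlib.h>

// Hypothetical fill routines: one unpredictable pattern, one perfectly regular.
void fill_random_signs(int64_t *arr, uint64_t n) {
    for (uint64_t i = 0; i < n; i++) {
        // ~50% negative, in no pattern: the 'arr[i] < 0' branch is hard to predict
        arr[i] = (rand() % 2 == 0) ? -1 : 1;
    }
}

void fill_alternating(int64_t *arr, uint64_t n) {
    for (uint64_t i = 0; i < n; i++) {
        // negative, positive, negative, …: a pattern the branch predictor learns quickly
        arr[i] = (i % 2 == 0) ? -1 : 1;
    }
}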
With a random array:
     1,393,825,373      branches
       321,245,207      branch-misses             #   23.05% of all branches
With an alternating even-odd array:
     1,394,767,126      branches
           262,392      branch-misses             #    0.02% of all branches
for (uint64_t i = 0; i < n; i++) {
    if ( arr[i] < 0 ) {
        total += arr[i];
    }
}
Remember there's a conditional branch implied by both the for and the if. The loop-bottom branch should be extremely predictable: it's taken on all but the last iteration (or not taken, depending how the compiler arranged the loop logic). So here, the 23% of mispredictions is probably almost exactly half of the if branches.
Or maybe we're interested in cache performance. Then maybe a command like this:

perf stat -e L1-dcache-load-misses,L1-dcache-loads,\
    LLC-load-misses,LLC-loads ./a.out
(LLC = last-level cache, probably L3.) Or on another computer, I had to:

perf stat -e L1-dcache-load-misses,L1-dcache-loads,\
    cache-misses,cache-references ./a.out
Let's try with the random access array sum code: the example that had an array of indices that could be shuffled or not to access the data randomly or sequentially.
for (uint32_t i = 0; i < length; i++) {
    total += data[indices[i]];
}
With the indices array shuffled:
        64,394,902      L1-dcache-load-misses     #    8.68% of all L1-dcache accesses  (67.09%)
       741,750,025      L1-dcache-loads                                                 (67.09%)
        25,817,611      LLC-load-misses           #   53.76% of all LL-cache accesses   (66.68%)
        48,026,135      LLC-loads                                                       (66.19%)
Without the shuffle:
         6,450,308      L1-dcache-load-misses     #    1.54% of all L1-dcache accesses  (67.01%)
       418,986,443      L1-dcache-loads                                                 (67.01%)
           266,494      LLC-load-misses           #   61.86% of all LL-cache accesses   (66.29%)
           430,775      LLC-loads                                                       (65.98%)
Remember this is the whole program: creating the arrays, shuffling, and summing.
With the shuffle, there are more total memory accesses: those are the work of shuffling the array. We're not quite making a fair comparison, because we're measuring programs that do different things (apart from the comparison we want).
// shuffle_array(indices, length);
For a fair comparison, we should always create a copy of indices that's shuffled, and just choose which one to use. That artificially makes the counts the same before the part we want to measure.
indices_shuffled = (uint32_t*)malloc(length * sizeof(uint32_t));
if (indices_shuffled == NULL) { … }
memcpy(indices_shuffled, indices, length * sizeof(uint32_t));
shuffle_array(indices_shuffled, length);
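The shuffle_array function isn't shown in these notes; a minimal Fisher–Yates sketch (my own, leaning on rand() being random enough for this experiment) would look like:

#include <stdint.h>
#include <stdlib.h>

// Fisher–Yates shuffle: walk from the end, swapping each element
// with a randomly chosen element at or before it.
void shuffle_array(uint32_t *arr, uint32_t length) {
    if (length < 2) {
        return;
    }
    for (uint32_t i = length - 1; i > 0; i--) {
        uint32_t j = rand() % (i + 1);  // slightly biased, but fine here
        uint32_t tmp = arr[i];
        arr[i] = arr[j];
        arr[j] = tmp;
    }
}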
Using indices_shuffled:
        67,401,304      L1-dcache-load-misses     #    8.96% of all L1-dcache accesses  (66.86%)
       752,394,854      L1-dcache-loads                                                 (66.86%)
        25,553,268      LLC-load-misses           #   54.52% of all LL-cache accesses   (66.86%)
        46,870,946      LLC-loads                                                       (66.39%)
Using the unshuffled indices:
        39,101,601      L1-dcache-load-misses     #    5.13% of all L1-dcache accesses  (66.61%)
       762,779,577      L1-dcache-loads                                                 (66.61%)
        10,132,539      LLC-load-misses           #   50.84% of all LL-cache accesses   (66.78%)
        19,929,036      LLC-loads                                                       (66.78%)
Those ways of starting perf all had it start the program and measure all of it.
It's possible to start Perf on an already-running process if you know its process ID (PID).
perf stat -e branches,branch-misses -p 12345
Any Linux process/CPU monitoring tools will tell you the PID of running programs. Or…
So will the C getpid function. A quick way to measure only some of the code in your program is two terminal windows and something like this:
printf("Pausing. Start: perf stat -p %u\n", getpid()); sleep(10); // the code you care about measuring... ⋮ printf("The interesting part is done. Ctrl-C on perf.\n"); sleep(10);
I did that with my shuffled array adder and the main loop. With the shuffled indices:
        32,527,861      L1-dcache-load-misses     #   53.28% of all L1-dcache accesses  (68.93%)
        61,045,114      L1-dcache-loads                                                 (69.09%)
        16,311,932      LLC-load-misses           #   57.91% of all LL-cache accesses   (68.16%)
        28,168,193      LLC-loads                                                       (65.08%)
Using the unshuffled indices:
         4,967,712      L1-dcache-load-misses     #    7.90% of all L1-dcache accesses  (68.60%)
        62,906,699      L1-dcache-loads                                                 (68.66%)
           535,821      LLC-load-misses           #   71.49% of all LL-cache accesses   (62.73%)
           749,518      LLC-loads                                                       (62.68%)
The lesson, I think: it's not too hard to use Perf to see broad trends in code. Seeing exactly what's happening is tricky and you have to carefully consider what's going on and what parts you care about.
The length of the perf list menu of counters gives away how much detail is available, if you can figure out the right question to ask.
An experiment…
Can I confirm my theory about the data hazards slowing down the array sum code?
I'm going to compare these two implementations: same loop, but using one vs two accumulators.
for (uint64_t i = 0; i < n - 1; i += 2) {
    total += arr[i];
    total += arr[i + 1];
}
for (uint64_t i = 0; i < n - 1; i += 2) {
    total1 += arr[i];
    total2 += arr[i + 1];
}
Compiled -O2 and run:

perf stat -e L1-dcache-loads,branches,branch-misses,\
    instructions,cycles ./a.out
One accumulator (total):

       784,176,365      L1-dcache-loads                                                 (80.00%)
       451,016,427      branches                                                        (80.14%)
           391,793      branch-misses             #    0.09% of all branches            (80.15%)
     2,158,643,786      instructions              #    0.70  insn per cycle             (80.14%)
     3,080,949,868      cycles                                                          (79.58%)
Two accumulators (total1/total2):

       779,680,179      L1-dcache-loads                                                 (79.80%)
       451,253,316      branches                                                        (80.28%)
           283,216      branch-misses             #    0.06% of all branches            (80.28%)
     2,169,797,960      instructions              #    1.17  insn per cycle             (80.28%)
     1,860,696,786      cycles                                                          (79.36%)
Same memory accesses, branches, instructions. It just took almost 1/2 the time. I still think it's probably data hazards. Still no proof.
It's also possible to record the Perf counts to a file, so you can take a more detailed look later.
First: include debugging symbols when you compile (-g), and probably use a lower optimization level (-O1, more on that below).
Then something like:
perf record -e branches,branch-misses ./a.out
Then you can explore the recorded data:
perf report
There's more info here: which functions were counting for more/less of each thing.
Keys: ↑, ↓ (move around), enter (descend into an option), esc (move back out), a (annotate code with counter results), t within the annotation (switch between percents/totals), h for help.
If you want to look at the annotation of your code line-by-line, note that the counters are associated with lines close to where the event occurred, not necessarily the exact line. You can ask for increased precision (if it's supported) by adding :p or :pp to the counter. So the pattern becomes something like:
gcc … -O1 -g code.c
perf record -e branches:pp,branch-misses:pp ./a.out
perf report
I have had much more comprehensible results from Perf on code compiled -O1 instead of -O3. The higher optimization levels modify things enough that it becomes hard to map runtime behaviour back to my code.
Running perf generally requires an admin to allow access to the hardware performance counters (there are apparently security implications). Also, most of the counters aren't available in virtual machines.
See the Running Perf page on the course web site for info.
Another tool (that you can run without admin powers): Valgrind.
With its defaults, it looks for memory leak problems:
valgrind ./a.out
==2923447== HEAP SUMMARY:
==2923447==     in use at exit: 0 bytes in 0 blocks
==2923447==   total heap usage: 3 allocs, 3 frees, 51,201,024 bytes allocated
==2923447==
==2923447== All heap blocks were freed -- no leaks are possible
Probably more interesting for us: cachegrind, which reports cache performance.
valgrind --tool=cachegrind ./a.out
Let's try it on the random access array sum problem…
With the indices_shuffled array:
==2933165== I refs:        602,938,883
==2933165== I1  misses:          1,530
==2933165== LLi misses:          1,517
==2933165== I1  miss rate:        0.00%
==2933165== LLi miss rate:        0.00%
==2933165==
==2933165== D refs:        258,000,928  (173,577,018 rd + 84,423,910 wr)
==2933165== D1  misses:     10,164,053  (  9,115,035 rd +  1,049,018 wr)
==2933165== LLd misses:      5,798,679  (  4,749,697 rd +  1,048,982 wr)
==2933165== D1  miss rate:         3.9% (        5.3%   +        1.2%  )
==2933165== LLd miss rate:         2.2% (        2.7%   +        1.2%  )
==2933165==
==2933165== LL refs:        10,165,583  (  9,116,565 rd +  1,049,018 wr)
==2933165== LL misses:       5,800,196  (  4,751,214 rd +  1,048,982 wr)
==2933165== LL miss rate:          0.7% (        0.6%   +        1.2%  )
With the non-shuffled indices array:
==2933109== I refs:        602,938,547
==2933109== I1  misses:          1,521
==2933109== LLi misses:          1,508
==2933109== I1  miss rate:        0.00%
==2933109== LLi miss rate:        0.00%
==2933109==
==2933109== D refs:        258,000,799  (173,576,937 rd + 84,423,862 wr)
==2933109== D1  misses:      6,497,057  (  5,448,039 rd +  1,049,018 wr)
==2933109== LLd misses:      2,984,517  (  1,935,535 rd +  1,048,982 wr)
==2933109== D1  miss rate:         2.5% (        3.1%   +        1.2%  )
==2933109== LLd miss rate:         1.2% (        1.1%   +        1.2%  )
==2933109==
==2933109== LL refs:         6,498,578  (  5,449,560 rd +  1,049,018 wr)
==2933109== LL misses:       2,986,025  (  1,937,043 rd +  1,048,982 wr)
==2933109== LL miss rate:          0.3% (        0.2%   +        1.2%  )
Note: without the shuffle there are ≈6.5M misses at D1 (L1 data) and ≈3M at LLd (last-level data, L3). With the shuffle, those grow to ≈10M and ≈5.8M.
Both are measuring the whole program (creating, filling, and shuffling the arrays, before the part I care about).
If you have compiled with debugging info (-g), the cg_annotate command will give much more detailed info about which functions and lines of code were hitting/missing cache.
cg_annotate cachegrind.out.12345
… and the output is far too wide to put on a lecture slide.
On the shuffled indexing code, it reported this line:
total += data[indices[i]];
… having a D1mr (D1 cache read misses) count of:
4,453,109 (48.86%)
… which matches my understanding of what's happening. It might be more clear if the two array accesses were on separate lines.
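For example (my own sketch, not the course code), splitting that line would let cg_annotate attribute the misses to each array separately:

uint32_t idx = indices[i];  // sequential access: should mostly hit in cache
total += data[idx];         // the (possibly shuffled) access: where the misses should show up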
Note: Valgrind will slow down the execution of your code: it replaces the memory allocator (malloc and friends) and collects detailed counts of what happens while the code is running.
So, it's great for analyzing memory behaviour. Don't try to measure performance at the same time.
Sometimes it's not clear what part(s) of your code are taking the time.
In particular, if you want to make your code faster, speeding up something that takes 0.01% of the total running time isn't going to matter. If you have code that's taking 80% of the total program running time, that's where to look for speed gains.
In general, a profiler will record what parts of your code are taking time during execution. It usually reports per-function, telling you how many times each function was called and how much time execution spent in that code.
Valgrind also has a trick for this: Callgrind.
It measures the number of times each instruction runs (and total number of instructions taken within a function call). We have seen that number of instructions ≠ running time, but it's related.
Again, compile with debugging info (-g). Then run with Callgrind:
valgrind --tool=callgrind ./a.out
That produces a callgrind.out.* file. Then:
callgrind_annotate callgrind.out.12345
In the Callgrind report, look for large percents. We expect main to take ≈100% of the running time, but below that you might find surprises.
e.g. an excerpt for the random array sum code:
          4 ( 0.00%)      build_arrays(&data, &indices, length);
297,261,446 (49.30%)  => random_access_2.c:build_arrays (1x)
⋮
  1,048,591 ( 0.17%)      for (INDEX_T i = 0; i < length; i++) {
  5,767,174 ( 0.96%)          total += data[indices[i]];
          .               }
⋮
          4 ( 0.00%)  int main(int argc, char* argv[]) {
⋮
602,792,553 (99.98%)  => random_access.c:test (1x)
          3 ( 0.00%)  }
i.e. it takes more instructions to build the arrays than to sum them. We know that's not where the time is taken, but it's interesting.
There are other interesting profiling tools out there as well.
When experimenting with timing code, I decided to see if I could measure memory latency (formerly described as about 50 cycles)…
My strategy: create an array much larger than the L3 cache, then randomly access it and see how long accesses take on average. They should mostly be cache misses, so true memory accesses with the corresponding latency.
The other operations in my loop should be able to hang out in the pipeline and get done concurrently with the previous/next memory access, so I think I'm measuring almost exclusively memory access time.
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (size_t i = 0; i < reads; i++) {
    total += data[rand() % length];
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
double elapsed = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
printf("Per access: %g ns\n", elapsed / reads * 1e9);
The times I measured were consistently 100–400 cycles, not the 50 I expected.
First problem: not all memory is created equal. The >200 cycle latency was on lower-spec machines.
On my home desktop, the DDR5-6000 CL30 memory should have 30 × 2000 / 6000 = 10 ns latency, and I was consistently seeing around 17 ns. Eventually, I realized that everybody's description of memory latency is technically correct but effectively wrong.
Accessing memory is done through L3 cache, so the latency of a memory access is the latency of the memory itself plus the latency of the L3 cache.
So, I shouldn't expect a memory access to happen in ≈50 cycles. I should expect more like ≈100 cycles since an L3 cache access alone is >40 cycles. Around 200 cycles makes sense for slower memory (or memory without XMP/EXPO enabled?).
I have added a note to the Memory & Storage slide deck, but not corrected it fully: I think that would cause more confusion overall.
But am I really measuring memory latency here? Based on my understanding of the pipeline + superscalar processor, almost all of the running time should be the memory access.
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (size_t i = 0; i < reads; i++) {
    total += data[rand() % length];
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
i.e. one loop iteration should take ⪅ the memory latency.
But is it possible that multiple memory accesses are actually happening concurrently (because of the pipeline, or dual channel memory, or …) and I'm actually measuring some fraction of the real latency?
Maybe? My results make some sense at least, and increase/decrease on different systems in approximately the ways I'd expect. They're at least related to memory latency.
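One way to rule out overlapping accesses (not something this experiment does; just a sketch of a common technique, using a hypothetical next array) is pointer chasing: make each load's address depend on the previous load's result, so the accesses can't overlap.

// Pointer chasing: 'next' holds a random permutation forming one big cycle,
// so next[pos] tells us where to read from on the following iteration.
size_t pos = 0;
for (size_t i = 0; i < reads; i++) {
    pos = next[pos];   // the next address depends on this load's result
}
total += pos;          // use the result so the loop isn't optimized away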
The lesson here: there are many tools that can help you understand what's happening in your code (or between your code and the hardware). It's not always easy to work with them or interpret their output, but there's lots to be discovered.
Computers are complex systems, so measuring and understanding what's happening can be a real challenge.
But don't forget: you can also look at code, understand what it does, and infer things about its behaviour based on what we have discussed in this course.