There are some tools that provide useful insights when we're looking at low-level behaviour of code. Let's look at a few. You might want some for the Mini-Project in this course.
We met GDB and gdbtui in labs 2 and 3.
It's most useful when there's a gap between your expectation for the behaviour of some code and the results you're seeing.
With appropriate application of breakpoints and watch values (and appropriate test cases), you should be able to narrow in on the problem.
If this is a way of interacting with code that feels good to you, then great. If not, that's fine with me (or maybe you need more practice with it).
Interactive debuggers like this are available in many programming languages and/or IDEs. Have a look in your favourite languages/tools.
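For reference, a quick gdb session for narrowing in on a problem might look something like this (the program, function, and variable names here are hypothetical, just to show the shape of it): set a breakpoint on the function you suspect, run, inspect and watch a variable, and continue until the watchpoint fires.

$ gcc -g -O0 broken.c -o broken
$ gdb ./broken
(gdb) break calculate_something
(gdb) run
(gdb) print total
(gdb) watch total
(gdb) continue
(gdb) backtrace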
A question we have often had for our code: how long is it taking to run?
I have occasionally used the x86 time stamp counter, which counts the number of CPU cycles since the core was reset.
It's accessible with the rdtsc (read time-stamp counter) instruction, but I have been using the C __rdtsc compiler intrinsic (i.e. a way to tell the compiler I want exactly that instruction).
#include <x86intrin.h>
#include <stdint.h>

uint64_t start = __rdtsc();
result = calculate_something();
uint64_t end = __rdtsc();
uint64_t elapsed = end - start;
The time stamp counter sounds great but is probably not the best way to measure running time.
There is no guarantee that each CPU core will have a synchronized counter: if your code is moved between cores, you might get start and end times that are unrelated. The number of cycles can also vary with the power state of the core (e.g. power-saving vs full-speed), so a cycle count might not mean what you think.
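If you do want to use the TSC anyway, one way to sidestep the cross-core problem (not something the course code does; just a sketch using the Linux-specific sched_setaffinity call) is to pin the measuring thread to a single core first:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

// Pin the calling thread to CPU 0 so the TSC readings come from one core.
static void pin_to_cpu0(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // pid 0 == this thread
        perror("sched_setaffinity");
    }
}

That doesn't fix the power-state issue, just the cross-core one.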
The more-correct thing to do is to look at elapsed real time at a high resolution. The C function clock_gettime will get the time at nanosecond resolution (or as close as the hardware supports).
#define _POSIX_C_SOURCE 199309L
#include <time.h>

struct timespec start, end;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
result = calculate_something();
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
double elapsed = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
I wrote a quick experiment to try them both:
Calculated 50029111 in 3264988053 cycles.
Calculated 50029111 in 0.95471 s.
This was on a processor with a 5.5 GHz boost clock, and the two measurements disagree (in a consistent way), but it's hard to know what the actual frequency of that core was at that moment:
>>> 3264988053 / 5.5e9
0.5936341914545454
A note on the argument to clock_gettime…
CLOCK_PROCESS_CPUTIME_ID indicates the amount of CPU time your process has used: if the process is paused (because the OS gives another process time on the CPU cores), it won't be affected. If you have 3 threads running at 100%, it will progress by 3 s for every real-time second elapsed.
CLOCK_REALTIME measures wall-clock time: it counts only real time, not CPU utilization. If you have 3 threads running, it will progress by 1 s for every real-time second elapsed.
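A minimal sketch of the difference (my own example, not from the course code): time a sleep with both clocks. CLOCK_REALTIME advances by about 2 s, but CLOCK_PROCESS_CPUTIME_ID barely moves, since a sleeping process uses almost no CPU.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double seconds_between(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    struct timespec cpu0, cpu1, real0, real1;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);
    clock_gettime(CLOCK_REALTIME, &real0);

    sleep(2);  // idle: real time passes, but almost no CPU time is used

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);
    clock_gettime(CLOCK_REALTIME, &real1);

    printf("CPU time: %g s, real time: %g s\n",
        seconds_between(cpu0, cpu1), seconds_between(real0, real1));
    return 0;
}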
In general when measuring the running time of code…
Timing the whole program execution can measure lots of stuff you don't want (e.g. creating data structures for the real test), but it works on any program, which is flexible.
Also, there are lots of tools that do it with various levels of sophistication.
The Linux time command is a common quick way to see the running time of a command:
$ time ./a.out

real    0m1.714s
user    0m1.389s
sys     0m0.114s
This is 1.714 s of elapsed clock time; 1.389 s of processor time; and 0.114 s in system calls (≈ the operating system doing stuff on the program's behalf).
Note: for a program that uses two threads and does lots of compute work in both, you'd expect user to be around double real.
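A quick way to see that for yourself (my own sketch, not course code): run the same busy loop in two threads, then time it. On a machine with at least two free cores, user should come out roughly twice real.

#include <pthread.h>
#include <stdio.h>

// Busy-work: each thread spins for a few seconds so it uses ~100% of one core.
static void *spin(void *arg) {
    (void)arg;
    volatile unsigned long long total = 0;
    for (unsigned long long i = 0; i < 2000000000ULL; i++) {
        total += i;
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, spin, NULL);
    pthread_create(&t2, NULL, spin, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("done\n");
    return 0;
}

Compile with gcc -pthread and run it under time.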
That way of calling time is the built-in in the bash shell. There's also an executable /usr/bin/time that gives a lot more info with -v:
$ /usr/bin/time -v ./a.out
        Command being timed: "./a.out"
        User time (seconds): 3.70
        System time (seconds): 0.22
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.94
        ⋮
        Maximum resident set size (kbytes): 394760
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 98382
        ⋮
        Swaps: 0
        ⋮
Both the time built-in and the /usr/bin/time command are a quick way to check the running time of a whole program. They're useful if you're okay with fairly coarse timing, or can't do clock_gettime-style measurement for some reason (e.g. not being able to modify the source).
The Perf tool can run your program while counting a huge variety of stats about what's going on.
The default perf stat gives a good quick summary of what's going on…
$ perf stat ./a.out

 Performance counter stats for './a.out':

            711.62 msec task-clock          #    1.000 CPUs utilized
                 2      context-switches    #    2.810 /sec
                 0      cpu-migrations      #    0.000 /sec
            12,565      page-faults         #   17.657 K/sec
     3,099,396,198      cycles              #    4.355 GHz
     2,160,022,445      instructions        #    0.70  insn per cycle
       451,838,503      branches            #  634.939 M/sec
           945,485      branch-misses       #    0.21% of all branches

       0.711910230 seconds time elapsed

       0.675934000 seconds user
       0.035996000 seconds sys
perf list will tell you about the things perf can count, and you can select counters with -e on the command line.
e.g. we can focus on branches by selecting only a few counters:
perf stat -e branches,branch-misses ./a.out
Let's try to confirm my story about the branch predictor and this code:
for (uint64_t i = 0; i < n; i++) {
    if ( arr[i] < 0 ) {
        total += arr[i];
    }
}
I claimed that with a random array there are lots of branch mispredictions and with an alternating even-odd array there are few.
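The notes don't show how the two test arrays were filled; a plausible setup (my guess, just to make the two cases concrete) is values with random signs vs. a strictly alternating negative/positive pattern. Results with each kind of array are below.

#include <stdint.h>
#include <stdlib.h>

// Hypothetical fill routines: one unpredictable pattern, one perfectly regular.
void fill_random_signs(int64_t *arr, uint64_t n) {
    for (uint64_t i = 0; i < n; i++) {
        // ~50% negative, in no pattern: the 'arr[i] < 0' branch is hard to predict
        arr[i] = (rand() % 2 == 0) ? -1 : 1;
    }
}

void fill_alternating(int64_t *arr, uint64_t n) {
    for (uint64_t i = 0; i < n; i++) {
        // negative, positive, negative, …: a pattern the branch predictor learns quickly
        arr[i] = (i % 2 == 0) ? -1 : 1;
    }
}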
With a random array:
     1,393,825,373      branches
       321,245,207      branch-misses             #   23.05% of all branches
With an alternating even-odd array:
     1,394,767,126      branches
           262,392      branch-misses             #    0.02% of all branches
for (uint64_t i = 0; i < n; i++) {
    if ( arr[i] < 0 ) {
        total += arr[i];
    }
}
Remember there's a conditional branch implied by both the for and the if. The loop-bottom branch should be extremely predictable: it's taken on all but the last iteration (or not taken, depending how the compiler arranged the loop logic). So here, the 23% of mispredictions is probably almost exactly half of the if branches.
Or maybe we're interested in cache performance. Then maybe a command like this:

perf stat -e L1-dcache-load-misses,L1-dcache-loads,\
    LLC-load-misses,LLC-loads ./a.out
(LLC = last-level cache, probably L3.) Or on another computer, I had to:

perf stat -e L1-dcache-load-misses,L1-dcache-loads,\
    cache-misses,cache-references ./a.out
Let's try with the random access array sum code: the example that had an array of indices that could be shuffled or not to access the data randomly or sequentially.
for (uint32_t i = 0; i < length; i++) {
    total += data[indices[i]];
}
With the indices array shuffled:
        64,394,902      L1-dcache-load-misses     #    8.68% of all L1-dcache accesses  (67.09%)
       741,750,025      L1-dcache-loads                                                 (67.09%)
        25,817,611      LLC-load-misses           #   53.76% of all LL-cache accesses   (66.68%)
        48,026,135      LLC-loads                                                       (66.19%)
Without the shuffle:
         6,450,308      L1-dcache-load-misses     #    1.54% of all L1-dcache accesses  (67.01%)
       418,986,443      L1-dcache-loads                                                 (67.01%)
           266,494      LLC-load-misses           #   61.86% of all LL-cache accesses   (66.29%)
           430,775      LLC-loads                                                       (65.98%)
Remember this is the whole program: creating the arrays, shuffling, and summing.
With the shuffle, there are more total memory accesses: those are the work of shuffling the array. We're not quite making a fair comparison, because we're measuring programs that do different things (apart from the comparison we want).
// shuffle_array(indices, length);
For a fair comparison, we should always create a copy of indices that's shuffled, and just choose which one to use. That artificially makes the counts the same before the part we want to measure.
indices_shuffled = (uint32_t*)malloc(length * sizeof(uint32_t));
if (indices_shuffled == NULL) { … }
memcpy(indices_shuffled, indices, length * sizeof(uint32_t));
shuffle_array(indices_shuffled, length);
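The shuffle_array function isn't shown in these notes; a minimal Fisher–Yates sketch (my own, leaning on rand() being random enough for this experiment) would look like:

#include <stdint.h>
#include <stdlib.h>

// Fisher–Yates shuffle: walk from the end, swapping each element
// with a randomly chosen element at or before it.
void shuffle_array(uint32_t *arr, uint32_t length) {
    if (length < 2) {
        return;
    }
    for (uint32_t i = length - 1; i > 0; i--) {
        uint32_t j = rand() % (i + 1);  // slightly biased, but fine here
        uint32_t tmp = arr[i];
        arr[i] = arr[j];
        arr[j] = tmp;
    }
}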
Using indices_shuffled:
        67,401,304      L1-dcache-load-misses     #    8.96% of all L1-dcache accesses  (66.86%)
       752,394,854      L1-dcache-loads                                                 (66.86%)
        25,553,268      LLC-load-misses           #   54.52% of all LL-cache accesses   (66.86%)
        46,870,946      LLC-loads                                                       (66.39%)
Using the unshuffled indices:
        39,101,601      L1-dcache-load-misses     #    5.13% of all L1-dcache accesses  (66.61%)
       762,779,577      L1-dcache-loads                                                 (66.61%)
        10,132,539      LLC-load-misses           #   50.84% of all LL-cache accesses   (66.78%)
        19,929,036      LLC-loads                                                       (66.78%)
Those ways of starting perf all had it start the program and measure all of it.
It's possible to start Perf on an already-running process if you know its process ID (PID).
perf stat -e branches,branch-misses -p 12345
Any Linux process/CPU monitoring tools will tell you the PID of running programs. Or…
So will the C getpid function. A quick way to measure only some of the code in your program is two terminal windows and something like this:
printf("Pausing. Start: perf stat -p %u\n", getpid()); sleep(10); // the code you care about measuring... ⋮ printf("The interesting part is done. Ctrl-C on perf.\n"); sleep(10);
I did that with my shuffled array adder and the main loop. With the shuffled indices:
        32,527,861      L1-dcache-load-misses     #   53.28% of all L1-dcache accesses  (68.93%)
        61,045,114      L1-dcache-loads                                                 (69.09%)
        16,311,932      LLC-load-misses           #   57.91% of all LL-cache accesses   (68.16%)
        28,168,193      LLC-loads                                                       (65.08%)
Using the unshuffled indices:
         4,967,712      L1-dcache-load-misses     #    7.90% of all L1-dcache accesses  (68.60%)
        62,906,699      L1-dcache-loads                                                 (68.66%)
           535,821      LLC-load-misses           #   71.49% of all LL-cache accesses   (62.73%)
           749,518      LLC-loads                                                       (62.68%)
The lesson, I think: it's not too hard to use Perf to see broad trends in code. Seeing exactly what's happening is tricky and you have to carefully consider what's going on and what parts you care about.
The length of the perf list menu of counters gives away how much detail is available, if you can figure out the right question to ask.
An experiment…
Can I confirm my theory about the data hazards slowing down the array sum code?
I'm going to compare these two implementations: same loop, but using one vs two accumulators.
for (uint64_t i = 0; i < n - 1; i += 2) {
    total += arr[i];
    total += arr[i + 1];
}
for (uint64_t i = 0; i < n - 1; i += 2) {
    total1 += arr[i];
    total2 += arr[i + 1];
}
Compiled -O2 and run:

perf stat -e L1-dcache-loads,branches,branch-misses,\
    instructions,cycles ./a.out
One accumulator (total):

       784,176,365      L1-dcache-loads                                                 (80.00%)
       451,016,427      branches                                                        (80.14%)
           391,793      branch-misses             #    0.09% of all branches            (80.15%)
     2,158,643,786      instructions              #    0.70  insn per cycle             (80.14%)
     3,080,949,868      cycles                                                          (79.58%)
Two accumulators (total1/total2):

       779,680,179      L1-dcache-loads                                                 (79.80%)
       451,253,316      branches                                                        (80.28%)
           283,216      branch-misses             #    0.06% of all branches            (80.28%)
     2,169,797,960      instructions              #    1.17  insn per cycle             (80.28%)
     1,860,696,786      cycles                                                          (79.36%)
Same memory accesses, branches, instructions. It just took almost 1/2 the time. I still think it's probably data hazards. Still no proof.
It's also possible to record the Perf counts to a file, so you can take a more detailed look later.
First: include debugging symbols when you compile (-g), and probably use a lower optimization level (-O1, more on that below).
Then something like:
perf record -e branches,branch-misses ./a.out
Then you can explore the recorded data:
perf report
There's more info here: which functions were counting for more/less of each thing.
Keys: ↑, ↓ (move around), enter (descend into an option), esc (move back out), a (annotate code with counter results), t within the annotation (switch between percents/totals), h for help.
If you want to look at the annotation of your code line-by-line, note that the counters are associated with lines close to where the event occurred, not necessarily the exact line. You can ask for increased precision (if it's supported) by adding :p or :pp to the counter. So the pattern becomes something like:
gcc … -O1 -g code.c
perf record -e branches:pp,branch-misses:pp ./a.out
perf report
I have had much more comprehensible results from Perf on code compiled -O1 instead of -O3. The higher optimization levels modify things enough that it becomes hard to map runtime behaviour back to my code.
Running perf generally requires an admin to allow access to the hardware performance counters (there are apparently security implications). Also, most of the counters aren't available in virtual machines.
See the Running Perf page on the course web site for info.
Another tool (that you can run without admin powers): Valgrind.
With its defaults, it looks for memory leak problems:
valgrind ./a.out
==2923447== HEAP SUMMARY:
==2923447==     in use at exit: 0 bytes in 0 blocks
==2923447==   total heap usage: 3 allocs, 3 frees, 51,201,024 bytes allocated
==2923447==
==2923447== All heap blocks were freed -- no leaks are possible
Probably more interesting for us: cachegrind, which reports cache performance.
valgrind --tool=cachegrind ./a.out
Let's try it on the random access array sum problem…
With the indices_shuffled array:
==2933165== I refs:        602,938,883
==2933165== I1  misses:          1,530
==2933165== LLi misses:          1,517
==2933165== I1  miss rate:        0.00%
==2933165== LLi miss rate:        0.00%
==2933165==
==2933165== D refs:        258,000,928  (173,577,018 rd + 84,423,910 wr)
==2933165== D1  misses:     10,164,053  (  9,115,035 rd +  1,049,018 wr)
==2933165== LLd misses:      5,798,679  (  4,749,697 rd +  1,048,982 wr)
==2933165== D1  miss rate:         3.9% (        5.3%   +        1.2%  )
==2933165== LLd miss rate:         2.2% (        2.7%   +        1.2%  )
==2933165==
==2933165== LL refs:        10,165,583  (  9,116,565 rd +  1,049,018 wr)
==2933165== LL misses:       5,800,196  (  4,751,214 rd +  1,048,982 wr)
==2933165== LL miss rate:          0.7% (        0.6%   +        1.2%  )
With the non-shuffled indices array:
==2933109== I refs:        602,938,547
==2933109== I1  misses:          1,521
==2933109== LLi misses:          1,508
==2933109== I1  miss rate:        0.00%
==2933109== LLi miss rate:        0.00%
==2933109==
==2933109== D refs:        258,000,799  (173,576,937 rd + 84,423,862 wr)
==2933109== D1  misses:      6,497,057  (  5,448,039 rd +  1,049,018 wr)
==2933109== LLd misses:      2,984,517  (  1,935,535 rd +  1,048,982 wr)
==2933109== D1  miss rate:         2.5% (        3.1%   +        1.2%  )
==2933109== LLd miss rate:         1.2% (        1.1%   +        1.2%  )
==2933109==
==2933109== LL refs:         6,498,578  (  5,449,560 rd +  1,049,018 wr)
==2933109== LL misses:       2,986,025  (  1,937,043 rd +  1,048,982 wr)
==2933109== LL miss rate:          0.3% (        0.2%   +        1.2%  )
Note: without the shuffle there are ≈6.5M misses at D1 (L1 data) and ≈3M at LLd (last-level data, L3). With the shuffle, those grow to ≈10M and ≈5.8M.
Both are measuring the whole program (creating, filling, and shuffling the arrays, before the part I care about).
If you have compiled with debugging info (-g), the cg_annotate command will give much more detailed info about which functions and lines of code were hitting/missing cache.
cg_annotate cachegrind.out.12345
… and the output is far too wide to put on a lecture slide.
On the shuffled indexing code, it reported this line:
total += data[indices[i]];
… having a D1mr (D1 cache read misses) count of:
4,453,109 (48.86%)
… which matches my understanding of what's happening. It might be more clear if the two array accesses were on separate lines.
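For example (my own sketch, not the course code), splitting that line would let cg_annotate attribute the misses to each array separately:

uint32_t idx = indices[i];  // sequential access: should mostly hit in cache
total += data[idx];         // the (possibly shuffled) access: where the misses should show up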
Note: Valgrind will slow down the execution of your code: it replaces the memory allocator (malloc and friends) and collects detailed counts of what happens while the code is running.
So, it's great for analyzing memory behaviour. Don't try to measure performance at the same time.
Sometimes it's not clear what part(s) of your code are taking the time.
In particular, if you want to make your code faster, speeding up something that takes 0.01% of the total running time isn't going to matter. If you have code that's taking 80% of the total program running time, that's where to look for speed gains.
In general, a profiler will record what parts of your code are taking time during execution. It usually reports per-function, telling you how many times each function was called and how much time execution spent in that code.
Valgrind also has a trick for this: Callgrind.
It measures the number of times each instruction runs (and total number of instructions taken within a function call). We have seen that number of instructions ≠ running time, but it's related.
Again, compile with debugging info (-g). Then run with Callgrind:
valgrind --tool=callgrind ./a.out
That produces a callgrind.out.* file. Then:
callgrind_annotate callgrind.out.12345
In the Callgrind report, look for large percents. We expect main to take ≈100% of the running time, but below that you might find surprises.
e.g. an excerpt for the random array sum code:
          4 ( 0.00%)      build_arrays(&data, &indices, length);
297,261,446 (49.30%)  => random_access_2.c:build_arrays (1x)
⋮
  1,048,591 ( 0.17%)      for (INDEX_T i = 0; i < length; i++) {
  5,767,174 ( 0.96%)          total += data[indices[i]];
          .               }
⋮
          4 ( 0.00%)  int main(int argc, char* argv[]) {
⋮
602,792,553 (99.98%)  => random_access.c:test (1x)
          3 ( 0.00%)  }
i.e. it takes more instructions to build the arrays than to sum them. We know that's not where the time is taken, but it's interesting.
There are other interesting profiling tools out there as well.
When experimenting with timing code, I decided to see if I could measure memory latency (formerly described as about 50 cycles)…
My strategy: create an array much larger than the L3 cache, then randomly access it and see how long accesses take on average. They should mostly be cache misses, so true memory accesses with the corresponding latency.
The other operations in my loop should be able to hang out in the pipeline and get done concurrently with the previous/next memory access, so I think I'm measuring almost exclusively memory access time.
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (size_t i = 0; i < reads; i++) {
    total += data[rand() % length];
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
double elapsed = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
printf("Per access: %g ns\n", elapsed / reads * 1e9);
The times I measured were consistently 100–400 cycles, not the 50 I expected.
First problem: not all memory is created equal. The >200 cycle latency was on lower-spec machines.
On my home desktop, the DDR5-6000 CL30 memory should have 30 × 2000 / 6000 = 10 ns latency, and I was consistently seeing around 17 ns. Eventually, I realized that everybody's description of memory latency is technically correct but effectively wrong.
Accessing memory is done through L3 cache, so the latency of a memory access is the latency of the memory itself plus the latency of the L3 cache.
So, I shouldn't expect a memory access to happen in ≈50 cycles. I should expect more like ≈100 cycles since an L3 cache access alone is >40 cycles. Around 200 cycles makes sense for slower memory (or memory without XMP/EXPO enabled?).
I have added a note to the Memory & Storage slide deck, but not corrected it fully: I think that would cause more confusion overall.
But am I really measuring memory latency here? Based on my understanding of the pipeline + superscalar processor, almost all of the running time should be the memory access.
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for (size_t i = 0; i < reads; i++) {
    total += data[rand() % length];
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
i.e. one loop iteration should take ⪅ the memory latency.
But is it possible that multiple memory accesses are actually happening concurrently (because of the pipeline, or dual channel memory, or …) and I'm actually measuring some fraction of the real latency?
Maybe? My results make some sense at least, and increase/decrease on different systems in approximately the ways I'd expect. They're at least related to memory latency.
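One way to rule out overlapping accesses (not something this experiment does; just a sketch of a common technique, using a hypothetical next array) is pointer chasing: make each load's address depend on the previous load's result, so the accesses can't overlap.

// Pointer chasing: 'next' holds a random permutation forming one big cycle,
// so next[pos] tells us where to read from on the following iteration.
size_t pos = 0;
for (size_t i = 0; i < reads; i++) {
    pos = next[pos];   // the next address depends on this load's result
}
total += pos;          // use the result so the loop isn't optimized away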
The lesson here: there are many tools that can help you understand what's happening in your code (or between your code and the hardware). It's not always easy to work with them or interpret their output, but there's lots to be discovered.
Computers are complex systems, so measuring and understanding what's happening can be a real challenge.
But don't forget: you can also look at code, understand what it does, and infer things about its behaviour based on what we have discussed in this course.