When we store stuff in memory, what does that really mean? When we use memory in our program, what's going on?
We know the basics of working with memory… our programs have static, stack, and heap storage. We can work with all of them from both assembly and C (but we haven't allocated heap memory from assembly).
There are still some mysteries. e.g. I have mentioned that memory is slow: retrieving a value from memory takes something like 50 CPU cycles (depending on CPU speed, memory speed & latency).
That sounds bad.
Update: the preceding calculation is correct, but the effective memory latency for a program is the latency of the memory itself plus the latency of the L3 cache: more like 100 cycles. I'm not going to update every slide in this deck to reflect that, just to avoid confusion, but know that the "about 50" should actually be "about 100".
But then, how does this code (compiled with -O0) take only around 5 cycles per array element?
```c
int64_t array_sum_1(int64_t* arr, unsigned n) {
    int64_t total = 0;
    for (unsigned i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}
```
There's a memory access in the loop: shouldn't it take ≥50 cycles per element?
When we refer to a computer's memory, we're talking about the random-access memory (RAM). Here, random access refers to being able to access any part of it in a similar amount of time.
[Compare a spinning hard drive or CD, where access time varies depending on how far the disk needs to physically spin before the right data is under the read/write head.]
Modern computers use dynamic RAM (DRAM) where (roughly speaking) each bit is held by a tiny capacitor, whose charged and discharged states can represent 0 and 1.
The tiny capacitors naturally lose their charge (on a microsecond scale) and must periodically have their charge refreshed.
DRAM is a different storage technology from the flip-flops that store the bits in our registers: those are static RAM (SRAM).
Relatively speaking: DRAM is slower and cheaper. So, we can have a lot of it, but it's not as fast as registers.
A ≈50 cycle wait every time we read or write memory doesn't sound like something we'd be willing to accept.
The workaround: cache.
The idea: fast SRAM is expensive, but we can have a little. It can sit between the processor and memory, and hold a copy of some of the data from memory. Then the processor can access that subset of memory quickly.
The cache is managed by the hardware: we don't directly control what's in it, but it affects the performance of our programs, so knowing what's going on can help us write better code.
When some code does a memory access, an entire cache line (typically 64 bytes) will be read from memory into cache.
That's generally useful: most code will access nearby memory next, so it's useful to have it ready to go.
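If you want to check the cache line size on your own machine, glibc exposes it through sysconf. This is a minimal sketch: _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and some systems report 0; on Linux, `getconf LEVEL1_DCACHE_LINESIZE` from the shell does the same thing.

```c
// Print the L1 data cache line size as reported by the system.
// _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension: not guaranteed to be
// available (or non-zero) everywhere.
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0) {
        printf("L1 data cache line: %ld bytes\n", line);
    } else {
        printf("cache line size not reported\n");
    }
    return 0;
}
```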
Modern computers have several layers of cache, each progressively larger and slower.
Typically in a modern system, the fastest/smallest/closest cache is L1 (level 1). Each CPU core has its own, and there is separate L1 cache for instructions (i.e. for the memory accesses to fill the instruction register) and data (i.e. the memory the program itself is accessing).
Then L2 (also separate for each core) and L3 (shared by all cores in the processor).
Fetching from L1 cache takes ∼4 cycles; ∼10 cycles for L2 cache and ∼40 for L3 (depending on the processor).
Memory access takes around 50 cycles.
A typical Haswell processor (i7-4790) has cache arranged like this:
See the Haswell block diagram for more details.
The caches can't store as much as memory: sometimes a memory access will be found in the cache, a cache hit. Sometimes it won't, and the value must be found in memory: a cache miss.
If we can arrange our code so there are more hits, it will be faster.
If we have to wait ≈40 cycles to access L3 cache, that doesn't sound much better than ≈50 for memory.
But that's just latency: the delay between request and response. There's also the issue of bandwidth or throughput: the amount of data that can be transferred per second.
A silly example to illustrate the difference…
Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway. Andrew Tanenbaum, ∼1985
The modern equivalent would be SD cards in an SUV. A microSD card is 15×11×1 mm. That's 6 million per m³.
Let's say I bought 6 million 256 GB microSD cards and drove across the country. The drive might take a week. The bandwidth of that data transfer is 20 Tb/s. The latency is 7 days.
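Roughly where the 20 Tb/s comes from: \(6\times10^{6} \times 256\text{ GB} \approx 1.5\times10^{18}\) bytes \(\approx 1.2\times10^{19}\) bits, and a week is about \(6\times10^{5}\) seconds, so \(1.2\times10^{19} / 6\times10^{5} \approx 2\times10^{13}\) bits per second: about 20 Tb/s.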
Compare 20 Tb/s throughput and 7 day latency vs Ethernet: 1 or 10 Gb/s bandwidth and latency <1 ms.
Both latency and throughput matter.
So there are several places your data might be at any moment: registers, cache, memory. Also disks/storage that we'll discuss later.
There's a spectrum of size vs latency and bandwidth. My best summary…
| Storage | Size | Latency (cycles) | Throughput |
|---|---|---|---|
| Registers | <kB | 1 | |
| L1 Data Cache | 32 kB/core | 4 | 400 GB/s |
| L2 Cache | 256 kB/core | 12 | 100 GB/s |
| L3 Cache | 8 MB | 36 | 40 GB/s |
| Memory | many GB | 44 | 13 GB/s |
| Solid State Drive | few TB | 200–600 | 500–5000 MB/s |
| Spinning HD | several TB | 10k–20k | 80–160 MB/s |
| Network storage | many TB | 70k–175k | 1 GB/s |
The result of having memory cache: we can access memory faster than the memory actually operates, but only if we're lucky and the data we want is in cache.
How can we make it more likely that the memory we're using is in cache? By keeping in mind how caching works.
When the cache is full, values are moved out (to the next layer of cache, or to memory), least recently used first. So, we expect data that has been accessed recently to be in cache.
The cache will also hold data near other data that has been used recently (because the cache is filled by cache line). So, data near other data we have been working with should be in cache.
The message: working with data near what you have recently been working with should be fast.
How much difference does it make? It depends on how much data you have and how near/recent.
Let's experiment…
I'm going to create two arrays: a data array, and an array of indices into that array that will hold values 0 to n−1, like this:
```c
int64_t* data = (int64_t*)malloc(length * sizeof(int64_t));
uint32_t* indices = (uint32_t*)malloc(length * sizeof(uint32_t));
for (uint32_t i = 0; i < length; i++) {
    data[i] = random_value();
    indices[i] = i;
}
```
We will iterate through the data by accessing data[indices[i]].
```c
int64_t total = 0;
for (uint32_t i = 0; i < length; i++) {
    total += data[indices[i]];
}
```
So far, sweeping through data[indices[i]] will access data in order: high locality.
If we randomly shuffle indices between the two loops, access to data will be in a random order: horrible locality.
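The shuffle code itself isn't shown in these notes; a standard Fisher–Yates shuffle, called between the filling loop and the summing loop, would do the job. A sketch (using rand() for simplicity; the actual experiment code may use a different random number generator):

```c
// Shuffle the index array into a uniformly random order (Fisher–Yates).
#include <stdint.h>
#include <stdlib.h>

void shuffle_indices(uint32_t* indices, uint32_t length) {
    if (length < 2) {
        return;
    }
    for (uint32_t i = length - 1; i > 0; i--) {
        uint32_t j = (uint32_t)(rand() % (i + 1));  // 0 <= j <= i
        uint32_t tmp = indices[i];
        indices[i] = indices[j];
        indices[j] = tmp;
    }
}
```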
Predictions…
Randomly accessing arrays small enough to fit in cache should be fast. Everything should stay in cache, so we can access any part fairly quickly.
As things get a little larger than the cache, random accesses should slow down slightly because a few accesses are cache misses (but most still are in cache).
As arrays grow, there will be more and more cache misses, so slower accesses.
With \(2^{17}\) elements, the two arrays are 0.5 + 1 MB: all of the data fits in my L2 cache. Running times are 1.6 cycles per element without shuffling indices and 2.2 cycles per element with.
With array length \(2^{24}\) (64 + 128 MB), the arrays are much larger than the L3 cache.
Without shuffling the indices, the loop takes about 2.5 cycles/element. With the index array shuffled, it takes around 27 cycles/element.
With indices randomly shuffled, marking L1, L2, L3 sizes (note x and y are both log scale):
Observations…
The boundaries between L1/L2/L3 are difficult to see, but it looks like there's a little change at those points.
It's very clear where the data no longer fits in L3. As we get much larger than L3, most times we touch the data array it's a cache miss and we force a real memory access.
But if we don't shuffle the indices, we have great locality and almost all of the array accesses should be in L1 cache. It should be fast regardless of the size of the array.
And in fact…
Without shuffling the indices (note different y scale):
I tried the array sum problem by shuffling the indices just a little: randomly moving them within ≈10 of their original position (gently_shuffle_array in the file).
Performance was almost exactly the same as the "no shuffle" results. You don't have to access data exactly sequentially, just enough so that what you're looking for is usually in cache. Close is close enough.
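The gently_shuffle_array function isn't reproduced here; one plausible way to shuffle only locally (a guess at the idea, not the course's code) is a Fisher–Yates shuffle within small consecutive blocks, so no index moves far from where it started:

```c
// Shuffle within consecutive blocks of `block` indices, so no index ends
// up more than block - 1 positions from its original spot. A sketch of a
// "gentle" shuffle, not the original gently_shuffle_array. block must be >= 1.
#include <stdint.h>
#include <stdlib.h>

void gentle_shuffle(uint32_t* indices, uint32_t length, uint32_t block) {
    for (uint32_t start = 0; start < length; start += block) {
        uint32_t end = start + block;
        if (end > length) {
            end = length;
        }
        for (uint32_t i = end - 1; i > start; i--) {  // Fisher–Yates within [start, end)
            uint32_t j = start + (uint32_t)(rand() % (i - start + 1));
            uint32_t tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
        }
    }
}
```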
The lesson: keep your memory accesses local as much as possible.
You can have and work with a lot of data in memory. If you work with it in nearby chunks, it will be much faster.
Or: you shouldn't actually access random-access memory randomly.
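A classic place this choice shows up: traversing a 2D array. C stores each row contiguously, so looping row-by-row is sequential access, while looping column-by-column strides a whole row's worth of bytes on every step. A sketch (with N = 4096 the array is 128 MB, far bigger than L3, so the column-order version misses cache on almost every access):

```c
// Two ways to sum the same 2D array. C is row-major: m[i][0..N-1] are
// adjacent in memory, so sum_by_rows has good locality while sum_by_cols
// jumps N * sizeof(int64_t) bytes between consecutive accesses.
#include <stdint.h>

#define N 4096

int64_t sum_by_rows(int64_t (*m)[N]) {
    int64_t total = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            total += m[i][j];
    return total;
}

int64_t sum_by_cols(int64_t (*m)[N]) {
    int64_t total = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            total += m[i][j];
    return total;
}
```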
One data structure that has notable memory locality problems: linked lists.
Depending how/when elements are allocated, they could be scattered around a process's memory allocation. Traversing the linked list would then have bad locality.
Contrast an array that's guaranteed to have good locality (if you access it in order).
Of course, we could try to keep our linked list elements local to each other: maybe we could allocate them around the same time and not do too much inserting/deleting. Then why use a linked list at all?
There are ways to implement linked lists that have better memory locality, but that's off-topic for us.
Possible linked list use-case: build a collection that's awkward to construct in an array (e.g. when you don't know the size up front) as a linked list, then convert it to an array for faster processing.
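A sketch of that use-case, assuming a minimal singly-linked list of int64_t values (the node layout here is made up for illustration):

```c
// Copy a linked list's values into a contiguous array so that later
// passes over the data have good locality.
#include <stdint.h>
#include <stdlib.h>

struct node {
    int64_t value;
    struct node* next;
};

int64_t* list_to_array(const struct node* head, size_t count) {
    int64_t* arr = (int64_t*)malloc(count * sizeof(int64_t));
    if (arr == NULL) {
        return NULL;
    }
    size_t i = 0;
    for (const struct node* cur = head; cur != NULL && i < count; cur = cur->next) {
        arr[i++] = cur->value;
    }
    return arr;
}
```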
Memory locality and caching explain why we expect B-trees to have overall better performance than balanced binary search trees.
B-trees ensure several high-locality memory accesses before following a pointer/reference to the next (possibly far away) node. Binary trees don't, so they may have bad locality of reference.
None of what we have discussed explains why this can run in ≈2 cycles per element: if it takes 4 cycles to access L1, shouldn't 4 be the minimum?
```c
for (uint32_t i = 0; i < length; i++) {
    total += data[indices[i]];
}
```
Also, it seems like there must be at least five instructions in the main for loop (mov, add, inc, cmp, jl). Shouldn't that take ≥5 cycles plus the L1 fetch?
Those mysteries will have to wait.
All of the places we've seen to store data have been volatile: whatever is stored in them disappears when the power is off.
The other class of places we can store data: persistent or non-volatile storage, which continues to hold data even without power.
Persistent storage is generally drives or disks (even if they aren't actually disk-shaped). Historically, that has mostly been spinning hard disks. Like this (with the top removed):
The spinning disks in a hard disk are coated with a material that can be magnetized: little areas can be magnetized one way or the other to represent a bit.
The disk itself spins and a read/write head floats above it and can either detect the magnetic orientation or change it.
Spinning hard drives have fairly high latency: reading specific data has to wait for the read/write head to move to the right position, and for the disk to spin to the right spot.
Their bandwidth is also relatively low: they can only read/write data as fast as the disk spins under the read/write head.
The other storage option that's common: solid state drives (SSDs).
SSDs use NAND flash storage, which is randomly addressable and has both lower latency and higher bandwidth than spinning disks.
One of the factors for bandwidth and latency of SSDs is how they're connected to the rest of the computer: it could be a traditional storage connector like SATA.
Or a connection designed for faster data transfer like NVMe. NVMe connections are essentially 4-lane PCIe (if that's a meaningful statement for you).
PCIe 5.0 NVMe has a theoretical max bandwidth of 16 GB/s, but apparently 2.6 GB/s is more realistic for real drives.
SATA drives have a maximum bandwidth of 600 MB/s.
Basically, saying something is an SSD covers a huge range of latency and throughput.
All SSDs (in a normal price range?) are still much slower than memory, but the gap is much smaller than it was in the days of spinning HDDs.
Another possibility for persistent storage: any of those storage devices, accessed over a network.
Operating systems can mount network disks and users/programs can use them like local storage.
Network storage can be shared among many users, and can be larger and faster (because a dedicated storage device can hold many drives, for both more storage and combined bandwidth).
Latency will be the combination of the drives' latency and the network latency. Bandwidth will be limited by both the drives and the network.
The idea of memory addresses as we have been thinking of them is a little too simple to actually work in a real modern system.
Most importantly: computers have more than one program running at a time. What if they want the same memory address?
What if we did this:
```asm
.data
stuff: .quad 0
```
… how would we know that our stuff address was different from any address used by any other program?
It turns out that none of our memory addresses have actually been locations in physical memory. Physical memory: the actual memory hardware.
Physical memory does have addresses that refer to bytes stored there, but there has always been another layer of abstraction between our code and it.
Memory addresses the programmer sees are in virtual memory. Virtual memory addresses are translated to physical memory addresses by the computer's memory management unit (MMU).
When our programs ask the operating system for memory, the operating system gives us some virtual memory and assigns that to some physical memory.
The OS tells the MMU which parts of physical memory belong to which process, and the MMU is in charge of translating addresses as instructions run.
It will also prevent our program from being able to access another program's memory. This is the cause of a segmentation fault: trying to access memory outside your segment.
So your code sees a nice contiguous block of virtual memory, and the MMU translates that to the physical memory the OS has given you.
Virtual memory is given to programs, and mapped to physical memory, in units called pages.
Each program has some pages of memory that it can use. The MMU keeps a list of what physical memory each page is assigned to: the page table.
The page size is 4 kB on most systems, but can be larger.
```
$ getconf PAGE_SIZE
4096
```
Larger page sizes mean more waste when programs need smaller amounts of memory; smaller pages mean a larger page table.
Page size is a compile-time choice for the Linux kernel, so you probably aren't going to change it.
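To make the translation concrete: with 4 kB pages, the low 12 bits of a virtual address are the offset within a page and the remaining bits identify the page. A sketch of that split (the constants assume the usual 4 kB page size):

```c
// Split a virtual address into page number and page offset, assuming
// 4 kB (2^12 byte) pages. The MMU does the equivalent in hardware when
// it looks the page number up in the page table.
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)   // 4096 bytes
#define PAGE_MASK  (PAGE_SIZE - 1)

int main(void) {
    uint64_t addr = 0x7ffd1234abcdULL;           // an example virtual address
    uint64_t page_number = addr >> PAGE_SHIFT;   // which page
    uint64_t offset = addr & PAGE_MASK;          // where within that page
    printf("page %#llx, offset %#llx\n",
           (unsigned long long)page_number, (unsigned long long)offset);
    return 0;
}
```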
The translation lookaside buffer (TLB) is effectively a cache for page table lookups, built into the MMU. Without it, looking up pages in the page table would slow everything down.
Larger pages would also mean more TLB hits, which might be nice if you're using a lot of memory.
So as a programmer/user, we can mostly ignore the virtual/physical translation that's happening.
But the virtual memory system has one more trick…
The operating system can take the data from a virtual memory page, copy it to disk, and then take it out of physical memory.
That space on disk is known as swap space or swap and a page moved to swap is swapped out.
The full virtual memory picture is more like…
So swap allows your computer to have more virtual memory than physical memory.
When a program tries to access a page that's swapped out, the MMU triggers a page fault. That tells the operating system to move that page back from swap into physical memory (possibly moving another page to swap if no physical memory is available). Then the memory access can continue.
Remember that disk is much slower than memory: handling a page fault is going to take much longer than a normal memory access.
If that happens occasionally, it's not too bad.
If programs are actively using more data than will fit in physical memory, page faults will happen frequently and the OS will spend a lot of time moving pages back and forth from disk; the system becomes basically unusable: thrashing.
If your system is using swap, memory locality is even more important.
I tried the random array access code in a VM with 1 GB of memory…
Sequential access is still bad, but maybe tolerable.
Swap use has been less common in recent years: memory is cheap/large enough to have enough for most things most people do. Memory is so much faster than disks that using swap hasn't seemed worth it.
A possible future: NVMe drives are fast enough that swap might be useful again. Terabytes of cheap and usable virtual memory? Maybe?
Refinement of something mentioned earlier…
When your program requests memory from the OS, the physical memory isn't quite yours until you access it. When you first touch a page of memory, a minor page fault is triggered and a physical page is cleared out and assigned to you.
If you access a page that's swapped out, it triggers a major page fault.
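You can see how many of each your process has taken with getrusage (or, for a whole command, with /usr/bin/time -v). A minimal sketch:

```c
// Report how many minor and major page faults this process has taken so
// far. ru_minflt and ru_majflt are the fields Linux fills in.
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        printf("minor page faults: %ld\n", ru.ru_minflt);
        printf("major page faults: %ld\n", ru.ru_majflt);
    }
    return 0;
}
```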
Cache sits between the processor and memory because of the speed difference between them: most code does/can/should have good memory locality, so things can be usefully cached.
There's a similar speed difference between disk and memory, so caching makes sense there too: disk cache.
There's no extra hardware for disk cache. The operating system is in charge: it keeps recently-used files (actually, disk blocks) in memory.
How much data is cached will depend on the operating system.
Linux is particularly aggressive: it will use any free memory to cache data from storage. It can even swap out virtual memory to make space for disk cache. There's a Linux swappiness setting that controls this, if you would like to express yourself through kernel parameters.
On a random Linux system:
```
$ free -m
               total        used        free      shared  buff/cache   available
Mem:           32021        4426        5521         127       22072       27104
Swap:          33742           3       33739
```
i.e. 32 GB physical memory, about 4 GB used by programs, 22 GB used for disk cache. There is 3 MB of stuff swapped out even though there's plenty of space. If you look too quickly, it seems like there's only 5 GB of memory free, but the disk cache could be freed up, leaving the "available" space.
It might not come up in this course, but disk cache makes it incredibly hard to time disk operations.
Unless you're very careful, you're going to find out that the first thing you do is slow (while the file is read and cached), and everything after that is fast (because it's actually coming from memory).
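A sketch of the effect: read the same file twice and time each pass. The second read usually comes from the disk cache and is dramatically faster (the file name below is just a placeholder, and if the file was read recently, even the first pass may already be cached):

```c
// Read a file twice and time each pass. The first read typically comes
// from disk; the second usually comes from the OS disk cache.
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double read_whole_file(const char* path) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    FILE* f = fopen(path, "rb");
    if (f == NULL) {
        perror("fopen");
        exit(1);
    }
    char buf[1 << 16];
    while (fread(buf, 1, sizeof(buf), f) == sizeof(buf)) {
        // keep reading until a short read (end of file)
    }
    fclose(f);

    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void) {
    const char* path = "some_big_file.bin";  // placeholder file name
    printf("first read:  %.3f s\n", read_whole_file(path));
    printf("second read: %.3f s\n", read_whole_file(path));
    return 0;
}
```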
The memory hierarchy in a computer describes the spectrum of places data could be stored. Those could be managed either by the program/programmer explicitly (registers, allocated memory, files on disk) or by the system (cache, swap).
Even the ones not under the programmer's direct control have a huge effect on how fast data can be accessed: programmers sometimes need to work with them in mind.
The usual diagram of the memory hierarchy:
A lot of the speed/latency choices are made when you choose PC components (e.g. you can choose different RAM speeds, or SSD connectivity). Also the processor, number of cores, etc.
I'll defer to Linus Tech Tips: How to Build a PC, the last guide you’ll ever need! (2024 Update)