Memory & Storage

Memory & Storage

When we store stuff, what does that really mean? When we use memory in our program, what's going on?

We know the basics of working with memory… our programs have static, stack, heap. We can work with them all from both assembly and C (but haven't allocated heap memory from assembly).

Memory & Storage

There are still some mysteries. e.g. I have mentioned that memory is slow: retrieving a value from memory takes something like 50 CPU cycles (depending on CPU speed, memory speed & latency).

That sounds bad.

Memory & Storage

But then, how does this code (compiled with -O0) take only around 5 cycles per array element?

int64_t array_sum_1(int64_t* arr, unsigned n) {
    int64_t total = 0;
    for (unsigned i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

That's a memory access in the loop. Shouldn't it take ≥50 cycles per element?

Memory

When we refer to a computer's memory, we're talking about the random-access memory (RAM). Here, random access refers to being able to access any part of it in a similar amount of time.

[Compare a spinning hard drive or CD, where access time varies depending on how far the disk needs to physically spin before the right data is under the read/write head.]

Memory

Modern computers use dynamic RAM (DRAM), where (roughly speaking) each bit is held by a tiny capacitor: the charged and discharged states represent the two possible values.

The tiny capacitors naturally lose their charge (on a millisecond scale) and must periodically have their charge refreshed.

Memory

DRAM is a different storage technology from the flip-flops that store the bits in our registers: those are static RAM, or SRAM.

Relatively speaking: DRAM is slower and cheaper. So, we can have a lot of it, but it's not as fast as registers.

Cache

When you read or write memory, a ≈50 cycle waiting time doesn't sound like something I'd be willing to accept.

The workaround: cache.

Cache

The idea: fast SRAM is expensive, but we can have a little. It can sit between the processor and memory, and hold a copy of some of the data from memory. Then the processor can access that subset of memory quickly.

The cache is managed by the hardware: we don't directly control what's in it, but it affects the performance of our programs, so knowing what's going on can help us write better code.

Cache

When some code does a memory access, an entire cache line will be read from memory into cache (typically 64 bytes).

That's generally useful: most code will access nearby memory next, so it's good to have it ready to go.
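A rough back-of-the-envelope, using the 64-byte line size and the 8-byte int64_t elements from the earlier loop: one cache line holds

\[ \frac{64\ \mathrm{bytes/line}}{8\ \mathrm{bytes/element}} = 8\ \mathrm{elements}\,, \]

so a sequential scan pays the full memory latency at most once per 8 elements. That's part (though not all) of why array_sum_1 can average so few cycles per element.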

Cache

There are several layers of cache in modern computers, progressively larger and slower.

Typically in a modern system, the fastest/​smallest/​closest cache is L1 (level 1). Each CPU core has its own, and there is separate L1 cache for instructions (i.e. for the memory accesses to fill the instruction register) and data (i.e. the memory the program itself is accessing).

Cache

Then L2 (also separate for each core) and L3 (shared by all cores in the processor).

Fetching from L1 cache takes ∼4 cycles; ∼10 cycles for L2 cache and ∼40 for L3 (depending on the processor).

Memory access takes around 14 ns so for a 3.5 GHz clock speed we're waiting,

\[ (3.5\times10^{9} \tfrac{\mathrm{cycles}}{\mathrm{s}}) \times (14\times10^{-9}\mathrm{\ s}) = 49\mathrm{\ cycles}\,. \]

Cache

A typical Haswell processor (i7-4790) has cache arranged like this:

Haswell cache

Cache

The caches can't store as much as memory: sometimes a memory access will be found in the cache, a cache hit. Sometimes it won't, and the value must be found in memory: a cache miss.

If we can arrange our code so there are more hits, it will be faster.
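As a rough model (made-up numbers, just to illustrate): if a fraction \(p\) of accesses hit a 4-cycle cache and the rest go to ≈50-cycle memory, the average access time is

\[ t_{\mathrm{avg}} = p\cdot 4 + (1-p)\cdot 50\ \mathrm{cycles}\,, \]

so a 90% hit rate gives \(0.9\times 4 + 0.1\times 50 = 8.6\) cycles on average. Every extra hit helps.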

Latency & Bandwidth

If we have to wait ∼40 cycles to access L3 cache, that doesn't sound much better than ∼50 for memory.

But that's just latency: the delay between request and response. There's also the issue of bandwidth or throughput: the amount of data that can be transferred per second.

A silly example to illustrate the difference…

Latency & Bandwidth

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway. Andrew Tanenbaum, ∼1985

The modern equivalent would be SD cards in an SUV. A microSD card is 15×11×1 mm. That's 6 million per m³.

Let's say I bought 6 million 256 GB microSD cards and drove across the country. The drive might take a week. The bandwidth of that data transfer is 20 Tb/s. The latency is 7 days.
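The arithmetic behind that figure, roughly:

\[ \frac{(6\times10^{6}\ \mathrm{cards}) \times (256\times10^{9}\ \mathrm{B}) \times 8\ \tfrac{\mathrm{b}}{\mathrm{B}}}{7 \times 86400\ \mathrm{s}} \approx 2\times10^{13}\ \tfrac{\mathrm{b}}{\mathrm{s}} = 20\ \mathrm{Tb/s}\,. \]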

Latency & Bandwidth

Compare 20 Tb/s throughput and 7 day latency vs Ethernet: 1–10 Gb/s bandwidth and latency <1 ms.

Both latency and throughput matter.

Memory Hierarchy

So there are several places your data might be at any moment: registers, cache, memory. Also disks/​storage that we'll discuss later.

There's a spectrum of size vs latency and bandwidth. My best summary…

Memory Hierarchy

Storage             Size          Latency (cycles)   Throughput
Registers           < 1 kB        1
L1 Data Cache       32 kB/core    4*                 400 GB/s*
L2 Cache            256 kB/core   12*                100 GB/s*
L3 Cache            8 MB          36*                40 GB/s*
Memory              many GB       48*                13 GB/s*
Solid State Drive   few TB        200–600*           500–5000 MB/s*
Spinning HD         several TB    10k–20k*           80–160 MB/s*
Network storage     many TB       70k–175k*          1 GB/s

Memory Locality

The result of having memory cache: we can access memory faster than the memory actually operates, but only if we're lucky and the data we want is in cache.

How can we make it more likely that the memory we're using is in cache? By keeping in mind how caching works.

Memory Locality

When the cache is full, values are evicted (to the next layer of cache, or to memory), generally least-recently-used first. So, we expect data that has been accessed recently to be in cache.

The cache will also hold data near other data that has been used recently (because cache is filled by cache line). So, data near other data we have been working with should be in cache.

Memory Locality

The message: working with data near what you have recently been working with should be fast.

How much difference does it make? It depends on how much data you have and how near/​recent.

Memory Locality

Let's experiment…

I'm going to create two arrays: a data array, and an array of indices into that array that will hold the values 0 to length-1, like this:

int64_t* data = (int64_t*)malloc(length * sizeof(int64_t));
uint32_t* indices = (uint32_t*)malloc(length * sizeof(uint32_t));
for (uint32_t i = 0; i < length; i++) {
    data[i] = random_value();
    indices[i] = i;
}

Memory Locality

We will iterate through the data by accessing data[indices[i]].

int64_t total = 0;
for (uint32_t i = 0; i < length; i++) {
    total += data[indices[i]];
}

Memory Locality

for (uint32_t i = 0; i < length; i++) {
    data[i] = random_value();
    indices[i] = i;
}
int64_t total = 0;
for (uint32_t i = 0; i < length; i++) {
    total += data[indices[i]];
}

So far, sweeping through data[indices[i]] will access data in order: high locality.

If we randomly shuffle indices between the two loops, access to data will be in a random order: horrible locality.
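Here's one way such a shuffle could be implemented: a standard Fisher–Yates pass over the index array (a sketch; the function name and the use of rand() are mine, and the actual course code may differ).

#include <stdint.h>
#include <stdlib.h>

/* Randomly shuffle the index array in place (Fisher–Yates). */
void shuffle_indices(uint32_t* indices, uint32_t length) {
    if (length < 2) {
        return;
    }
    for (uint32_t i = length - 1; i > 0; i--) {
        uint32_t j = (uint32_t)rand() % (i + 1);  /* pick j in [0, i] */
        uint32_t tmp = indices[i];
        indices[i] = indices[j];
        indices[j] = tmp;
    }
}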

Memory Locality

Predictions…

Randomly accessing arrays small enough to fit in cache should be fast. Everything should stay in cache, so we can access any part fairly quickly.

As things get a little larger than the cache, random accesses should slow down slightly because a few accesses are cache misses (but most still are in cache).

As arrays grow, there will be more and more cache misses, so slower accesses.

Memory Locality

With \(2^{17}\) elements, the two arrays are 0.5 + 1 MB: all of the data fits in my L2 cache. Running times are 1.6 cycles per element without shuffling indices and 2.2 cycles per element with.

With array length \(2^{24}\) (64 + 128 MB), the arrays are much larger than the L3 cache.

Without shuffling the indices, the loop takes about 2.5 cycles/element. With the index array shuffled, it takes around 27 cycles/element.

Memory Locality

With indices randomly shuffled, marking L1, L2, L3 sizes (note x and y are both log scale):

random array access timing

Memory Locality

Observations…

The boundaries between L1/​L2/​L3 are difficult to see, but it looks like there's a little change at those points.

It's very clear where data no longer fits in L3. As the arrays get much larger than L3, most times we touch the data array the value isn't in cache, and we force a real memory access.

Memory Locality

But if we don't shuffle the indices, we have great locality and almost all of the array accesses should be in L1 cache. It should be fast regardless of the size of the array.

And in fact…

Memory Locality

Without shuffling the indices (note different y scale):

sequential array access timing

Memory Locality

I tried the array sum problem by shuffling the indices just a little: randomly moving them within ≈10 of their original position (gently_shuffle_array in the file).

Performance was almost exactly the same as the no shuffle results. You don't have to access data exactly sequentially, just enough so that what you're looking for is usually in cache. Close is close enough.
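My guess at what a gentle shuffle like that could look like (a sketch of the idea, with my own function name; not necessarily what gently_shuffle_array actually does): swap each index with a partner at most about 10 positions away.

#include <stdint.h>
#include <stdlib.h>

/* Swap each index with a partner at most MAX_DIST positions away,
   so every value stays close to its original position. */
#define MAX_DIST 10
void gentle_shuffle(uint32_t* indices, uint32_t length) {
    for (uint32_t i = 0; i < length; i++) {
        long j = (long)i + rand() % (2 * MAX_DIST + 1) - MAX_DIST;  /* i-10 .. i+10 */
        if (j < 0) j = 0;
        if (j >= (long)length) j = (long)length - 1;
        uint32_t tmp = indices[i];
        indices[i] = indices[j];
        indices[j] = tmp;
    }
}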

Memory Locality

The lesson: keep your memory accesses local as much as possible.

You can have and work with a lot of data in memory. If you work with it in nearby chunks, it will be much faster.

Or: you shouldn't actually access random-access memory randomly.

Memory Locality

One data structure that has notable memory locality problems: linked lists.

singly linked list
Wikimedia Commons File:Singly-linked-list.svg

Depending how/when elements are allocated, they could be scattered around a process's memory allocation. Traversing the linked list would then have bad locality.

Contrast an array that's guaranteed to have good locality (if you access it in order).
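To make the pointer-chasing concrete, a minimal sketch (my own struct and function names): each step of a list traversal follows a pointer to wherever that node happened to be allocated, so with scattered nodes, each step is likely a cache miss.

#include <stdint.h>
#include <stddef.h>

struct node {
    int64_t value;
    struct node* next;
};

/* Sum a singly-linked list. Unlike the array loop, each iteration
   must wait for cur->next to load before it knows where to go next. */
int64_t list_sum(struct node* head) {
    int64_t total = 0;
    for (struct node* cur = head; cur != NULL; cur = cur->next) {
        total += cur->value;
    }
    return total;
}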

Memory Locality

singly linked list

Of course, we could try to keep our linked list elements local to each other: maybe we could allocate them around the same time and not do too much inserting/​deleting. Then why use a linked list at all?

There are ways to implement linked lists that have better memory locality, but that's off-topic for us.

Memory Locality

None of what we have discussed explains why this can run in ≈2 cycles per element: if it takes 4 cycles to access L1, shouldn't 4 be the minimum?

for (uint32_t i = 0; i < length; i++) {
    total += data[indices[i]];
}

Also, it seems like there must be at least five instructions in the main for loop (mov, add, inc, cmp, jl). Shouldn't that take ≥5 cycles + the L1 fetch?

Those mysteries will have to wait.

Storage

All of the places we've seen to store data have been volatile: whatever is stored in them disappears when the power goes off.

The other class of places we can store data: persistent or non-volatile that continues to hold data even without power.

Storage

Persistent storage is generally drives or disks (even if they aren't actually disk-shaped). Historically, that has been mostly spinning hard disks. Like this (with the top removed):

hard drive
WikiMedia Commons File:Laptop-hard-drive-exposed.jpg

Storage

The spinning disks in a hard disk are coated with a material that can be magnetized: little areas can be magnetized one way or the other to represent a bit.

The disk itself spins and a read/​write head floats above it and can either detect the magnetic orientation or change it.

Storage

Spinning hard drives have fairly high latency: reading specific data has to wait for the read/​write head to move to the right position, and for the disk to spin to the right spot.

Their bandwidth is also relatively low: they can only read/​write data as fast as the disk spins under the read/​write head.

Storage

The other storage option that's common: solid state drives (SSDs).

SSDs use NAND flash storage, which is randomly addressable and has both lower latency and higher bandwidth than spinning disks.

Storage

One of the factors for bandwidth and latency of SSDs is how they're connected to the rest of the computer: it could be a traditional storage connector like SATA.

SATA SSD
Wikimedia Commons File:Vertex_2_Solid_State_Drive….jpg

Storage

Or a connection designed for faster data transfer like NVMe. NVMe connections are essentially 4-lane PCIe (if that's a meaningful statement for you).

M.2 NVMe SSD
Wikimedia Commons File:Samsung_980_….jpg

Storage

PCIe 5.0 NVMe has a theoretical max bandwidth of 16 GB/s, but apparently 2.6 GB/s is more realistic for real drives.

SATA drives have a maximum bandwidth of 600 MB/s.

Storage

Basically, saying something is an SSD covers a huge range of latency and throughput.

All SSDs (in a normal price range?) are still much slower than memory, but the gap is much smaller than it was in the days of spinning HDDs.

Network Storage

Another possibility for persistent storage: any of those storage things, except over a network.

Operating systems can mount network disks and users/​programs can use them like local storage.

Network Storage

Network storage can be shared among many users, and can be larger and faster (because a dedicated storage device can hold many drives, both for more storage and to combine their bandwidth).

Latency will be the combination of the drives and the network latency. Bandwidth will be limited by both the drives and network.

Virtual Memory

The idea of memory addresses as we have been thinking of them is a little too simple to actually work in a real modern system.

Most importantly: computers have more than one program running at a time. What if they want the same memory address?

Virtual Memory

What if we did this:

    .data
stuff:
    .quad 0

… how would we know that our stuff address was different than any address used by any other program?

Virtual Memory

It turns out that none of our memory addresses have actually been locations in physical memory. Physical memory: the actual memory hardware.

Physical memory does have addresses that refer to bytes stored there, but there has always been another layer of abstraction between our code and it.

Virtual Memory

Memory addresses the programmer sees are in virtual memory. Virtual memory addresses are translated to physical memory addresses by the computer's memory management unit (MMU).

When our programs ask the operating system for memory, the operating system gives us some virtual memory and assigns that to some physical memory.
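As a concrete (Linux-flavoured) sketch of asking the OS for memory: mmap with MAP_ANONYMOUS requests fresh pages of virtual memory directly, and malloc does something like this under the hood for large allocations.

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    /* Ask the OS for one page of anonymous, zero-filled virtual memory. */
    size_t len = 4096;
    void* p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("got a page at virtual address %p\n", p);
    munmap(p, len);
    return 0;
}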

Virtual Memory

The OS tells the MMU which parts of physical memory belong to which process, and the MMU is in charge of translating addresses as instructions run.

It will also prevent our program from being able to access another program's memory. This is the cause of a segmentation fault: trying to access memory outside your segment.
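For example, this deliberately broken program touches an address that is almost certainly not mapped for it, so the MMU flags the access and the OS kills the process with a segmentation fault:

int main(void) {
    volatile int* p = (volatile int*)16;  /* an address almost certainly not mapped */
    *p = 42;                              /* the MMU rejects the access: SIGSEGV */
    return 0;
}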

Virtual Memory

So your code sees a nice contiguous block of virtual memory, and the MMU translates that to the physical memory the OS has given you.

virtual memory
Source: WikiMedia Commons File:Virtual memory.svg

Virtual Memory

Virtual memory is given to programs, and mapped to physical memory in units of a page.

Each program has some pages of memory that it can use. The MMU keeps a list of what physical memory each page is assigned to: the page table.

Virtual Memory

The page size is 4 kB on most systems, but can be larger.

$ getconf PAGE_SIZE
4096

Larger page sizes mean more waste when programs need smaller amounts of memory; smaller pages mean a larger page table.

Page size is a compile-time choice for the Linux kernel, so you probably aren't going to change it.
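The same number is also available from inside a C program via POSIX sysconf (a minimal sketch):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Equivalent to `getconf PAGE_SIZE` on the command line. */
    long page_size = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes\n", page_size);
    return 0;
}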

Virtual Memory

The translation lookaside buffer (TLB) is a small cache in the MMU for page table lookups. Without it, every memory access would need an extra trip to the in-memory page table, which would slow everything down.

Larger pages would also mean more TLB hits, which might be nice if you're using a lot of memory.

Virtual Memory

So as a programmer/​user, we can mostly ignore the virtual/​physical translation that's happening.

But the virtual memory system has one more trick…

Swap

The operating system can take the data from a virtual memory page, copy it to disk, and then take it out of physical memory.

That space on disk is known as swap space or swap and a page moved to swap is swapped out.

The full virtual memory picture is more like…

Swap

virtual memory
Source: WikiMedia Commons File:Virtual memory.svg

Swap

So swap allows your computer to have more virtual memory than physical memory.

When a program tries to access a page that's swapped out, the MMU triggers a page fault. That tells the operating system to move that page back from swap into physical memory (possibly moving another page to swap if no physical memory is available). Then the memory access can continue.

Swap

Remember that disk is much slower than memory: handling a page fault is going to take much longer than a normal memory access.

If that happens occasionally, it's not too bad.

Swap

If programs are actively using more data than will fit in physical memory, page faults will happen frequently, the OS will spend a lot of time moving pages back and forth to disk, and the system will be basically unusable: thrashing.

If your system is using swap, memory locality is even more important.

I tried the random array access code in a VM with 1 GB of memory…

Swap

random array access with swap

Swap

Sequential access is still bad, but maybe tolerable.

sequential array access with swap

Swap

Swap use has been less common in recent years: memory is cheap/​large enough to have enough for most things most people do. Memory is so much faster than disks that using swap hasn't seemed worth it.

A possible future: NVMe drives are fast enough that swap might be useful again. A terabyte of usable virtual memory? Maybe?

Swap

Refinement of something mentioned earlier…

When your program requests memory from the OS, the physical memory isn't quite yours until you access it. When you first touch a page of memory, a minor page fault is triggered and a physical page is assigned (and zeroed) for you.

If you access a page that's swapped out, it triggers a major page fault.
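You can see both counters for your own process: POSIX getrusage reports them (and GNU /usr/bin/time -v prints the same numbers). A minimal sketch:

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    /* Report this process's page-fault counts so far. */
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        printf("minor page faults: %ld\n", ru.ru_minflt);
        printf("major page faults: %ld\n", ru.ru_majflt);
    }
    return 0;
}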

Disk Cache

Cache sits between the processor and memory because of the speed difference between them: most code does/​can/​should have good memory locality, so things can be usefully cached.

There's a similar speed difference between disk and memory, so caching makes sense there too: disk cache.

Disk Cache

There's no extra hardware for disk cache. The operating system is in charge: it keeps recently-used files (actually, disk blocks) in memory.

How much data is cached will depend on the operating system.

Disk Cache

Linux is particularly aggressive: it will use any free memory to cache data from storage. It can even swap out virtual memory to get space for disk cache.

There's a Linux swappiness setting that controls this, if you would like to express yourself through kernel parameters.

Disk Cache

On a random Linux system:

$ free -m
       total   used   free  shared  buff/cache  available
Mem:   32021   4426   5521     127       22072      27104
Swap:  33742      3  33739

i.e. 32 GB physical memory, about 4 GB used for programs, 22 GB used for disk cache. There is 3 MB of stuff swapped out even though there's plenty of space. If you look too quickly, it seems like there's only 5 GB of memory free, but the disk cache can be given back whenever programs need the memory, which is what the larger "available" number reflects.

Disk Cache

It might not come up in this course, but disk cache makes it incredibly hard to time disk operations.

Unless you're very careful, you're going to find out that the first thing you do is slow (while the file is read and cached), and everything after that is fast (because it's actually coming from memory).

Memory Hierarchy

The memory hierarchy in a computer describes the spectrum of places data could be stored. Data could be placed there either by the program/programmer explicitly (registers, allocated memory, files on disk) or managed by the system (cache, swap).

Even the ones not under the programmer's direct control have a huge effect on how fast data can be accessed: programmers sometimes need to work with them in mind.

Memory Hierarchy

The usual diagram of the memory hierarchy:

memory hierarchy