In this lab, you will investigate the performance of a real memory hierarchy using timers and hardware performance counters. This lab relies on some of the same material as lab 11, so make sure you understand the basics of hardware performance counters and the perf tool.
In this lab, you will use a simple program that creates a linked list and traverses it over and over again. The length of the linked list—which you can specify on the command line—determines the working set of the program. The working set is informally defined as the memory used repeatedly throughout the program’s execution (a formal definition is actually quite tricky). Caches do a great job improving performance when the program’s working set fits in cache, but things quickly fall apart once the working set is too large.
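Since each node holds only a single next pointer, sizeof(node_t) is 8 bytes on a typical 64-bit machine (the allocator's bookkeeping and alignment padding make the real footprint somewhat larger, often 16 or 32 bytes per allocation). A rough sketch for translating list lengths into working set sizes:

```shell
# Rough working-set estimate: list length x sizeof(node_t).
# Assumes 8-byte pointers; actual memory use is higher because
# malloc adds per-allocation overhead and alignment padding.
bytes_per_node=8
for items in 1000 100000 10000000; do
  echo "$items nodes -> $((items * bytes_per_node)) bytes"
done
```

Comparing these estimates against your machine's L1, L2, and L3 capacities is a good way to choose which list sizes are worth testing.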
The source code for our sample program, cache-test.c, is as follows:
#include <stdio.h>
#include <stdlib.h>

#define ACCESSES 1000000000

// Define a linked list node type with no data
typedef struct node {
  struct node* next;
} node_t;

int main(int argc, char** argv) {
  // Check to make sure we received a command line option
  if(argc < 2) {
    fprintf(stderr, "Usage: %s <list size>\n", argv[0]);
    return 1;
  }

  // How many items should we have in the list?
  int items = atoi(argv[1]);

  // Keep a pointer to the beginning and end of the list
  node_t* head = NULL;
  node_t* last = NULL;

  // Repeatedly add items to the end of the list
  for(int i=0; i<items; i++) {
    // Allocate memory for the linked list node
    node_t* n = (node_t*)malloc(sizeof(node_t));

    // We're adding to the end, so the next pointer is NULL
    n->next = NULL;

    // Is the list empty? If so, the new node is the head and tail
    if(last == NULL) {
      head = n;
      last = n;
    } else {
      last->next = n;
      last = n;
    }
  }

  // Now that we have a list, traverse the list over and over again until we've
  // visited `ACCESSES` nodes in our linked list.
  node_t* current = head;
  for(int i=0; i<ACCESSES; i++) {
    if(current == NULL) current = head;
    else current = current->next;
  }

  return 0;
}
Compile this program with the command:
$ gcc --std=c11 -o cache-test cache-test.c
The most time-consuming part of the program is the last loop in main. There we traverse our linked list over and over again until we've visited ACCESSES nodes. Without a cache, this traversal should take the same amount of time regardless of how many unique nodes there are in the list. With a cache, however, we would expect much better performance when the entire linked list fits into the cache.
Make sure you understand what this program is doing before you move on. If you have any questions, the class mentors or I can help you step through it.
The perf tool will be useful for this lab, primarily using the perf stat command to count occurrences of specific events. We'll focus on four main events:

L1-dcache-loads
L1-dcache-load-misses
cache-references (there are more specific events available in the perf tool, but this one works just as well for our purposes)
cache-misses

Using these four events, we can estimate cache miss rates. The L1 cache miss rate is the number of L1-dcache-load-misses over the number of L1-dcache-loads. The L2 cache miss rate is the number of cache-references over the number of L1-dcache-load-misses (approximately; I can explain why if you're interested). Finally, the L3 cache miss rate is cache-misses over cache-references.
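As a sketch of the arithmetic, here is how the three ratios would be computed from a set of counts. The counter values below are made up for illustration, not real measurements:

```shell
# Hypothetical counts from a single perf stat run (illustration only)
l1_loads=1000000000
l1_load_misses=50000000
cache_refs=40000000
cache_misses=10000000

# L1 miss rate = L1-dcache-load-misses / L1-dcache-loads
awk -v m=$l1_load_misses -v l=$l1_loads 'BEGIN { printf "L1 miss rate: %.3f\n", m/l }'

# L2 miss rate ~= cache-references / L1-dcache-load-misses
awk -v r=$cache_refs -v m=$l1_load_misses 'BEGIN { printf "L2 miss rate: %.3f\n", r/m }'

# L3 miss rate = cache-misses / cache-references
awk -v c=$cache_misses -v r=$cache_refs 'BEGIN { printf "L3 miss rate: %.3f\n", c/r }'
```

With these particular made-up numbers the rates come out to 0.050, 0.800, and 0.250; your measured values will of course differ.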
You can run a program using perf stat with multiple -e options to count more than one event at a time. This allows you to measure all the cache behaviors at once, as well as the program's runtime.
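For example, a single run counting all four events at once might look like this (the list size 100000 and the output filename are arbitrary choices):

```shell
# Count all four cache events in one run of the test program.
# perf stat prints its report to stderr, so redirect that to a file.
perf stat -e L1-dcache-loads -e L1-dcache-load-misses \
          -e cache-references -e cache-misses \
          ./cache-test 100000 2> perf-output.txt
cat perf-output.txt
```

Note that perf stat writes its report to stderr rather than stdout, which is why the `2>` redirection is needed to capture it.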
Run our test program with the perf stat command on a variety of input sizes to look for points where the performance changes. Be organized and keep all of your data! You will need to report the program's runtime over a range of working set sizes, as well as the miss rates for all three levels of the cache. You must cover a wide enough range of working set sizes to include cases where the L1 cache performs very well all the way up to working set sizes that stress the level 3 cache.
Graph your results. You must plot runtime, and you might want to plot total cache misses, miss rates, or both. I recommend plotting each measurement on a separate graph, but you may be able to fit the measurements you need on a single graph if you use an alternate y-axis. This is entirely up to you.
Explain what is happening at each inflection point. What is responsible for each substantial change in performance? A thoughtful analysis will include at least two paragraphs discussing the results. Particularly detailed or convincing discussions will earn extra credit.
You can learn quite a bit about the structure of each cache by looking at the files under /sys/devices/system/cpu/cpu0/cache/, which may be helpful for explaining your performance results. The cat command line tool is useful for printing out the contents of these files.

You may also want to use >> to append the output from the perf tool to a text file after each run. This example script runs the sleep command, which doesn't do much:

#!/bin/bash
nums="1 2 3 4"
for i in $nums; do
  perf stat -e cycles sleep $i 2>> output.txt
done
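The same pattern can be adapted for this lab. A sketch of one possible sweep script follows; the list sizes and the results.txt filename are placeholders that you should tune to cover your machine's L1 through L3 caches:

```shell
#!/bin/bash
# Sweep a range of list lengths, appending each perf report to results.txt.
# The sizes below are placeholders; add enough points to see every
# inflection, from lists that fit in L1 up to lists that overflow L3.
sizes="1000 10000 100000 1000000 10000000"
for s in $sizes; do
  echo "=== list size: $s ===" >> results.txt
  perf stat -e L1-dcache-loads -e L1-dcache-load-misses \
            -e cache-references -e cache-misses \
            ./cache-test $s 2>> results.txt
done
```

The echo line labels each block of perf output with its list size, which makes the file much easier to turn into a table or graph later.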
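To connect your measurements back to the hardware, here is one way to read the cache parameters out of the sysfs files mentioned above. These paths are standard on Linux, but they may be missing in some virtualized environments, hence the fallback branch:

```shell
# Print the level, type, and size of each cache visible to cpu0.
# Falls back to a message if sysfs cache info is unavailable.
cache_dir=/sys/devices/system/cpu/cpu0/cache
if [ -d "$cache_dir" ]; then
  for d in "$cache_dir"/index*; do
    echo "L$(cat "$d/level") $(cat "$d/type"): $(cat "$d/size")"
  done
else
  echo "cache information unavailable"
fi
```

Each index* directory also contains files such as ways_of_associativity and coherency_line_size, which are worth a look when you write up your analysis.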