In this lab, you will investigate the performance of a real memory hierarchy using timers and hardware performance counters. This lab relies on some of the same material as lab 11, so make sure you understand the basics of hardware performance counters and the perf tool.
In this lab, you will use a simple program that creates a linked list and traverses it over and over again. The length of the linked list—which you can specify on the command line—determines the working set of the program. The working set is informally defined as the memory used repeatedly throughout the program’s execution (a formal definition is actually quite tricky). Caches do a great job improving performance when the program’s working set fits in cache, but things quickly fall apart once the working set is too large.
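Since each node holds only a single next pointer, sizeof(node_t) is 8 bytes on a typical 64-bit machine (the allocator's bookkeeping and alignment padding make the real footprint somewhat larger, often 16 or 32 bytes per allocation). A rough sketch for translating list lengths into working set sizes:

```shell
# Rough working-set estimate: list length x sizeof(node_t).
# Assumes 8-byte pointers; actual memory use is higher because
# malloc adds per-allocation overhead and alignment padding.
bytes_per_node=8
for items in 1000 100000 10000000; do
  echo "$items nodes -> $((items * bytes_per_node)) bytes"
done
```

Comparing these estimates against your machine's L1, L2, and L3 capacities is a good way to choose which list sizes are worth testing.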
The source code for our sample program, cache-test.c, is as follows:
#include <stdio.h>
#include <stdlib.h>

#define ACCESSES 1000000000

// Define a linked list node type with no data
typedef struct node {
  struct node* next;
} node_t;

int main(int argc, char** argv) {
  // Check to make sure we received a command line option
  if(argc < 2) {
    fprintf(stderr, "Usage: %s <list size>\n", argv[0]);
    return 1;
  }

  // How many items should we have in the list?
  int items = atoi(argv[1]);

  // Keep a pointer to the beginning and end of the list
  node_t* head = NULL;
  node_t* last = NULL;

  // Repeatedly add items to the end of the list
  for(int i=0; i<items; i++) {
    // Allocate memory for the linked list node
    node_t* n = (node_t*)malloc(sizeof(node_t));

    // We're adding to the end, so the next pointer is NULL
    n->next = NULL;

    // Is the list empty? If so, the new node is the head and tail
    if(last == NULL) {
      head = n;
      last = n;
    } else {
      last->next = n;
      last = n;
    }
  }

  // Now that we have a list, traverse the list over and over again until we've
  // visited `ACCESSES` nodes in our linked list.
  node_t* current = head;
  for(int i=0; i<ACCESSES; i++) {
    if(current == NULL) current = head;
    else current = current->next;
  }

  return 0;
}
Compile this program with the command:
$ gcc --std=c11 -o cache-test cache-test.c
The most time-consuming part of the program is the last loop in main. There we traverse our linked list over and over again until we've visited ACCESSES nodes. Without a cache, this traversal should take the same amount of time regardless of how many unique nodes there are in the list. With a cache, however, we would expect much better performance when the entire linked list fits into the cache.
Make sure you understand what this program is doing before you move on. If you have any questions, the class mentors or I can help you step through it.
The perf tool will be useful for this lab, primarily using the perf stat command to count occurrences of specific events. We'll focus on four main events:

L1-dcache-loads
L1-dcache-load-misses
cache-references (there are more specific events available in the perf tool, but this one works just as well for our purposes)
cache-misses

Using these four events, we can estimate cache miss rates. The L1 cache miss rate is the number of L1-dcache-load-misses over the number of L1-dcache-loads. The L2 cache miss rate is the number of cache-references over the number of L1-dcache-load-misses (approximately; I can explain why if you're interested). Finally, the L3 cache miss rate is cache-misses over cache-references.
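As a sketch of the arithmetic, here is how the three ratios would be computed from a set of counts. The counter values below are made up for illustration, not real measurements:

```shell
# Hypothetical counts from a single perf stat run (illustration only)
l1_loads=1000000000
l1_load_misses=50000000
cache_refs=40000000
cache_misses=10000000

# L1 miss rate = L1-dcache-load-misses / L1-dcache-loads
awk -v m=$l1_load_misses -v l=$l1_loads 'BEGIN { printf "L1 miss rate: %.3f\n", m/l }'

# L2 miss rate ~= cache-references / L1-dcache-load-misses
awk -v r=$cache_refs -v m=$l1_load_misses 'BEGIN { printf "L2 miss rate: %.3f\n", r/m }'

# L3 miss rate = cache-misses / cache-references
awk -v c=$cache_misses -v r=$cache_refs 'BEGIN { printf "L3 miss rate: %.3f\n", c/r }'
```

With these particular made-up numbers the rates come out to 0.050, 0.800, and 0.250; your measured values will of course differ.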
You can run a program using perf stat with multiple -e options to count more than one event at a time. This allows you to measure all the cache behaviors at once, as well as the program's runtime.
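For example, a single run counting all four events at once might look like this (the list size 100000 and the output filename are arbitrary choices):

```shell
# Count all four cache events in one run of the test program.
# perf stat prints its report to stderr, so redirect that to a file.
perf stat -e L1-dcache-loads -e L1-dcache-load-misses \
          -e cache-references -e cache-misses \
          ./cache-test 100000 2> perf-output.txt
cat perf-output.txt
```

Note that perf stat writes its report to stderr rather than stdout, which is why the `2>` redirection is needed to capture it.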
Run our test program with the perf stat command on a variety of input sizes to look for points where the performance changes. Be organized and keep all of your data! You will need to report the program's runtime over a range of working set sizes, as well as the miss rates for all three levels of the cache. You must cover a wide enough range of working set sizes to include cases where the L1 cache performs very well all the way up to working set sizes that stress the level 3 cache.
Graph your results. You must plot runtime, and you might want to plot total cache misses, miss rates, or both. I recommend plotting each measurement on a separate graph, but you may be able to fit the measurements you need on a single graph if you use an alternate y-axis. This is entirely up to you.
Explain what is happening at each inflection point. What is responsible for each substantial change in performance? A thoughtful analysis will include at least two paragraphs discussing the results. Particularly detailed or convincing discussions will earn extra credit.
You can learn quite a bit about the structure of each cache by looking at the files under /sys/devices/system/cpu/cpu0/cache/, which may be helpful for explaining your performance results. The cat command line tool is useful for printing out the contents of these files.

You may also want to use >> to append the output from the perf tool to a text file after each run. This example script runs the sleep command, which doesn't do much:

#!/bin/bash
nums="1 2 3 4"
for i in $nums; do
  perf stat -e cycles sleep $i 2>> output.txt
done
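The same pattern can be adapted for this lab. A sketch of one possible sweep script follows; the list sizes and the results.txt filename are placeholders that you should tune to cover your machine's L1 through L3 caches:

```shell
#!/bin/bash
# Sweep a range of list lengths, appending each perf report to results.txt.
# The sizes below are placeholders; add enough points to see every
# inflection, from lists that fit in L1 up to lists that overflow L3.
sizes="1000 10000 100000 1000000 10000000"
for s in $sizes; do
  echo "=== list size: $s ===" >> results.txt
  perf stat -e L1-dcache-loads -e L1-dcache-load-misses \
            -e cache-references -e cache-misses \
            ./cache-test $s 2>> results.txt
done
```

The echo line labels each block of perf output with its list size, which makes the file much easier to turn into a table or graph later.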
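To connect your measurements back to the hardware, here is one way to read the cache parameters out of the sysfs files mentioned above. These paths are standard on Linux, but they may be missing in some virtualized environments, hence the fallback branch:

```shell
# Print the level, type, and size of each cache visible to cpu0.
# Falls back to a message if sysfs cache info is unavailable.
cache_dir=/sys/devices/system/cpu/cpu0/cache
if [ -d "$cache_dir" ]; then
  for d in "$cache_dir"/index*; do
    echo "L$(cat "$d/level") $(cat "$d/type"): $(cat "$d/size")"
  done
else
  echo "cache information unavailable"
fi
```

Each index* directory also contains files such as ways_of_associativity and coherency_line_size, which are worth a look when you write up your analysis.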