Lab: Hardware Performance Counters

Assigned:: Wednesday, Nov 15, 2017
Due:: Monday, Nov 20, 2017 (by 5pm)
Collaboration:: Work with your assigned partner for this lab. You may use your classmates as a resource, but please cite them. Sharing of complete or nearly-complete answers is not permitted. If you do not know whether it is acceptable use a specific resource you should ask.

Overview

In this lab, you will learn how to use hardware performance counters to monitor the execution of real programs on real, complex hardware. These counters can give you insight into how a program executes and why performance may differ between two similar versions of the same program.

Groups

Aditi and Andrew
Nripesh and Clara
Mari and Hoang
Bazil and John
Prabir and Nick
Matt and Greyson
An and Dennis
Reilly and Gemma
Paps and Ryan
Faizaan and Jimin
Kathryn and Rojina
Lex and Zachary

Introduction

Hardware performance counters can count a wide variety of events inside a processor as it executes your programs. We will use the Linux perf tool to access the counters to determine why one version of a program runs faster than another. The perf tool has many sub-commands, but we will use just four:

perf list: This sub-command lists the events available on the current machine. Use this to figure out what kinds of events you can count.
perf stat: This sub-command allows you to run a program with one or more events counted during the execution. When the program finishes, perf prints out the number of times each counted event occurred.
perf record: This sub-command records not only how many events occurred as the program executed, but also where some fraction of those events happened. One out of every, say, 10,000 occurrences of an event, perf will record which instruction was executing at the time. This can give you a sense of which parts of a program are “responsible” for the events.
perf report: After running perf record, use this sub-command to view the results of the execution. You can navigate by function name, then “annotate” a function to see which instructions account for what fraction of the events.

We’ll walk through an example use of each command in the next section.

Part A: Mystery Program 1

We have two very similar versions of a program that have wildly different performance:

Version 1: `a-v1.c`

#include <stdlib.h>
#include <time.h>

#define ITERATIONS 10000
#define ARRAY_SIZE 32768

int main(int argc, char** argv) {
  srand(time(NULL));

  int data[ARRAY_SIZE];
  
  for(int i=0; i<ARRAY_SIZE; i++) {
    data[i] = rand() % 256;
  }

  for(int i=0; i<ITERATIONS; i++) {
    int sum = 0;
    for(int j=0; j<ARRAY_SIZE; j++) {
      if(data[j] >= 128) {
        sum += data[j];
      }
    }
  }

  return 0;
}

To compile this program, copy the above code to a file a-v1.c and run the following command:

$ gcc --std=c11 -g -o a-v1 a-v1.c

Version 2: `a-v2.c`

#include <stdlib.h>
#include <time.h>

#define ITERATIONS 10000
#define ARRAY_SIZE 32768

int compare(const void* ap, const void* bp) {
  int a = *(int*)ap;
  int b = *(int*)bp;
  if(a < b) return -1;
  else if(a == b) return 0;
  else return 1;
}

int main(int argc, char** argv) {
  srand(time(NULL));

  int data[ARRAY_SIZE];
  
  for(int i=0; i<ARRAY_SIZE; i++) {
    data[i] = rand() % 256;
  }

  qsort(data, ARRAY_SIZE, sizeof(int), compare);

  for(int i=0; i<ITERATIONS; i++) {
    int sum = 0;
    for(int j=0; j<ARRAY_SIZE; j++) {
      if(data[j] >= 128) {
        sum += data[j];
      }
    }
  }

  return 0;
}

To compile this program, copy the above code to a file a-v2.c and run the following command:

$ gcc --std=c11 -g -o a-v2 a-v2.c

A Performance Mystery

Now that we have two similar versions of the program, let’s see how they perform. You can run the programs with the time command to measure runtime:

$ time ./a-v1
real	0m1.839s
user	0m1.832s
sys	0m0.008s

$ time ./a-v2
real	0m0.729s
user	0m0.724s
sys	0m0.000s

You should pay attention to the “real” time; this is the wall clock time elapsed while the program ran. Oddly enough, version 1 of this program takes 2.5x longer to run than version 2. We’ll use performance counters to figure out why.

Using `perf`

The first thing we can check is to see if the two programs are running dramatically different numbers of instructions. First, we need to figure out which event to count.

$ perf list
List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
...

As you can see, there is an event called instructions. Hit q to quit out of perf list. We’ll use this event with perf stat:

$ perf stat -e instructions ./a-v1

Performance counter stats for 'a-v1':

    3,281,876,619      instructions             

      1.879405634 seconds time elapsed

$ perf stat -e instructions ./a-v2

Performance counter stats for 'a-v2':

    3,297,032,646      instructions             

      0.727112668 seconds time elapsed

Even though version 2 runs faster, it looks like it runs slightly more instructions than version one. Look through other events in the perf list output and try them to see if you can find a huge difference. You probably won’t know what all of these events are, but many should be familiar. I am happy to answer questions about events. You can ignore the events related to the cache; we will learn what a cache is and look much more closely at these events next week.

Digging in with `perf record`

Once you find an event that’s wildly different, you may have a clue about the root cause of the performance difference.

While it doesn’t give us a good sense of why the program ran the way it did, we’ll look at the cycles event as an example. To dig in on performance events, we first use the perf record event:

$ perf record -e cycles ./a-v1
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.298 MB perf.data (~13014 samples) ]

This command runs the program and produces output in the perf.data file. We can look at the data using the perf report tool.

$ perf report
Samples: 7K of event 'cycles', Event count (approx.): 6637428286
94%  a-v1  a-v1               [.] main
03%  a-v1  libc-2.19.so       [.] __random
02%  a-v1  [kernel.kallsyms]  [k] truncate_inode_pages_final
00%  a-v1  [kernel.kallsyms]  [k] __inode_permission
00%  a-v1  [kernel.kallsyms]  [k] native_write_msr_safe

This shows that 99.94% of instructions were executed in the program a-v1 inside the main function (not a huge surprise there). If you use the arrow keys to select the line for the main function and hit Enter, you can view more detail. Select the “Annotate main” option, which gives you an output that should look something like this:

main  /home/curtsinger/courses/2016F/csc211/performance-counters/a-v1
       │      addl   $0x1,-0x4(%rbp)
       │5c:   cmpl   $0x7fff,-0x4(%rbp)
       │    ↑ jle    32
       │        data[i] = rand() % 256;
       │      }
       │
       │      for(int i=0; i<ITERATIONS; i++) {
       │      movl   $0x0,-0x8(%rbp)
       │    ↓ jmp    af
       │        int sum = 0;
       │6e:   movl   $0x0,-0xc(%rbp)
       │        for(int i=0; i<ARRAY_SIZE; i++) {
       │      movl   $0x0,-0x10(%rbp)
       │    ↓ jmp    a2
       │          if(data[i] >= 128) {
  4.90 │7e:   mov    -0x10(%rbp),%eax
  0.07 │      cltq
  1.18 │      mov    -0x20010(%rbp,%rax,4),%eax
 19.37 │      cmp    $0x7f,%eax
       │    ↓ jle    9e
       │            sum += data[i];
 11.73 │      mov    -0x10(%rbp),%eax
 11.57 │      cltq
  1.01 │      mov    -0x20010(%rbp,%rax,4),%eax
  6.95 │      add    %eax,-0xc(%rbp)
       │        data[i] = rand() % 256;
       │      }
       │
       │      for(int i=0; i<ITERATIONS; i++) {
       │        int sum = 0;
       │        for(int i=0; i<ARRAY_SIZE; i++) {
 19.40 │9e:   addl   $0x1,-0x10(%rbp)
 13.79 │a2:   cmpl   $0x7fff,-0x10(%rbp)
 10.02 │    ↑ jle    7e
       │
       │      for(int i=0; i<ARRAY_SIZE; i++) {
       │        data[i] = rand() % 256;
       │      }
       │
       │      for(int i=0; i<ITERATIONS; i++) {
       │      addl   $0x1,-0x8(%rbp)
  0.01 │af:   cmpl   $0x270f,-0x8(%rbp)
       │    ↑ jle    6e
       │            sum += data[i];
       │          }
       │        }
       │      }
       │
       │      return 0;
       │      mov    $0x0,%eax
       │    }
       │      leaveq
       │    ← retq

This output interleaves assembly with the C code it was generated by. While the output is a bit hard to follow, you can generally find the C code for a set of instructions just after the instructions. The numbers in the left column show what percent of cycles were spent executing each instruction over the entire run. Once you’ve found an event that is significantly different between the two programs, this view should help you pinpoint the exact line of code or instruction where that event is occurring. When you’re done with this view, hit q twice to quit out of perf report.

Your Task

Using the perf tool, explain why version 1 is so much slower than version 2. You should be specific: you should identify a meaningful event that explains the difference and be able to explain why this event is occurring. Describe the process you used to diagnose the issue, and explain your findings.

Part B: A second challenge

Here are two versions of another program:

Version 1: `b-v1.c`

#include <stdio.h>

#define ITERATIONS 10000
#define ARRAY_SIZE 32768

int main(int argc, char** argv) {
  int data[ARRAY_SIZE];
  
  for(int i=0; i<ARRAY_SIZE; i++) {
    data[i] = i+1;
  }

  data[ARRAY_SIZE-1] = -1;

  for(int i=0; i<ITERATIONS; i++) {
    int sum = 0;
    int index = 0;
    while(index != -1) {
      sum += data[index];
      index = data[index];
    }
  }

  return 0;
}

Copy this code to the file b-v1.c and compile with the command:

$ gcc --std=c11 -g -o b-v1 b-v1.c

Version 2: `b-v2.c`

#include <stdio.h>

#define ITERATIONS 10000
#define ARRAY_SIZE 32768

int main(int argc, char** argv) {
  int data[ARRAY_SIZE];
  
  for(int i=0; i<ARRAY_SIZE; i++) {
    data[i] = i+1;
  }

  data[ARRAY_SIZE-1] = -1;

  for(int i=0; i<ITERATIONS; i++) {
    int sum = 0;
    int index = 0;
    while(index != ARRAY_SIZE) {
      sum += data[index];
      index++;
    }
  }

  return 0;
}

Copy this code to the file b-v2.c and compile with the command:

$ gcc --std=c11 -g -o b-v2 b-v2.c

Your Task

As with our first pair of programs, these two versions are similar but have very different performance:

$ time ./b-v1
real	0m1.215s
user	0m1.212s
sys	0m0.000s

$ time ./b-v2
real	0m0.731s
user	0m0.728s
sys	0m0.000s

Use hardware performance counters to identify the root cause of the performance difference. Describe the process you used to diagnose the issue, and explain your findings.

Lab: Hardware Performance Counters

Overview

Groups

Introduction

Part A: Mystery Program 1

Version 1: a-v1.c

Version 2: a-v2.c

A Performance Mystery

Using perf

Digging in with perf record

Your Task

Part B: A second challenge

Version 1: b-v1.c

Version 2: b-v2.c

Your Task

Version 1: `a-v1.c`

Version 2: `a-v2.c`

Using `perf`

Digging in with `perf record`

Version 1: `b-v1.c`

Version 2: `b-v2.c`