In this lab, you will learn how to use hardware performance counters to monitor the execution of real programs on real, complex hardware. These counters can give you insight into how a program executes and why performance may differ between two similar versions of the same program.
Hardware performance counters can count a wide variety of events inside a processor as it executes your programs. We will use the Linux perf
tool to access the counters to determine why one version of a program runs faster than another. The perf
tool has many sub-commands, but we will use just four:
perf list
perf stat
perf
prints out the number of times each counted event occurred.perf record
perf
will record which instruction was executing at the time. This can give you a sense of which parts of a program are “responsible” for the events.perf report
perf record
, use this sub-command to view the results of the execution. You can navigate by function name, then “annotate” a function to see which instructions account for what fraction of the events.We’ll walk through an example use of each command in the next section.
We have two very similar versions of a program that have wildly different performance:
a-v1.c
#include <stdlib.h>
#include <time.h>
#define ITERATIONS 10000
#define ARRAY_SIZE 32768
int main(int argc, char** argv) {
srand(time(NULL));
int data[ARRAY_SIZE];
for(int i=0; i<ARRAY_SIZE; i++) {
data[i] = rand() % 256;
}
for(int i=0; i<ITERATIONS; i++) {
int sum = 0;
for(int j=0; j<ARRAY_SIZE; j++) {
if(data[j] >= 128) {
sum += data[j];
}
}
}
return 0;
}
To compile this program, copy the above code to a file a-v1.c
and run the following command:
$ gcc --std=c11 -g -o a-v1 a-v1.c
a-v2.c
#include <stdlib.h>
#include <time.h>
#define ITERATIONS 10000
#define ARRAY_SIZE 32768
int compare(const void* ap, const void* bp) {
int a = *(int*)ap;
int b = *(int*)bp;
if(a < b) return -1;
else if(a == b) return 0;
else return 1;
}
int main(int argc, char** argv) {
srand(time(NULL));
int data[ARRAY_SIZE];
for(int i=0; i<ARRAY_SIZE; i++) {
data[i] = rand() % 256;
}
qsort(data, ARRAY_SIZE, sizeof(int), compare);
for(int i=0; i<ITERATIONS; i++) {
int sum = 0;
for(int j=0; j<ARRAY_SIZE; j++) {
if(data[j] >= 128) {
sum += data[j];
}
}
}
return 0;
}
To compile this program, copy the above code to a file a-v2.c
and run the following command:
$ gcc --std=c11 -g -o a-v2 a-v2.c
Now that we have two similar versions of the program, let’s see how they perform. You can run the programs with the time
command to measure runtime:
$ time ./a-v1
real 0m1.839s
user 0m1.832s
sys 0m0.008s
$ time ./a-v2
real 0m0.729s
user 0m0.724s
sys 0m0.000s
You should pay attention to the “real” time; this is the wall clock time elapsed while the program ran. Oddly enough, version 1 of this program takes 2.5x longer to run than version 2. We’ll use performance counters to figure out why.
perf
The first thing we can check is to see if the two programs are running dramatically different numbers of instructions. First, we need to figure out which event to count.
$ perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
...
As you can see, there is an event called instructions
. Hit q to quit out of perf list
. We’ll use this event with perf stat
:
$ perf stat -e instructions ./a-v1
Performance counter stats for 'a-v1':
3,281,876,619 instructions
1.879405634 seconds time elapsed
$ perf stat -e instructions ./a-v2
Performance counter stats for 'a-v2':
3,297,032,646 instructions
0.727112668 seconds time elapsed
Even though version 2 runs faster, it looks like it runs slightly more instructions than version one. Look through other events in the perf list
output and try them to see if you can find a huge difference. You probably won’t know what all of these events are, but many should be familiar. I am happy to answer questions about events. You can ignore the events related to the cache; we will learn what a cache is and look much more closely at these events next week.
perf record
Once you find an event that’s wildly different, you may have a clue about the root cause of the performance difference.
While it doesn’t give us a good sense of why the program ran the way it did, we’ll look at the cycles
event
as an example. To dig in on performance events, we first use the perf record
event:
$ perf record -e cycles ./a-v1
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.298 MB perf.data (~13014 samples) ]
This command runs the program and produces output in the perf.data
file. We can look at the data using the perf report
tool.
$ perf report
Samples: 7K of event 'cycles', Event count (approx.): 6637428286
99.94% a-v1 a-v1 [.] main
0.03% a-v1 libc-2.19.so [.] __random
0.02% a-v1 [kernel.kallsyms] [k] truncate_inode_pages_final
0.00% a-v1 [kernel.kallsyms] [k] __inode_permission
0.00% a-v1 [kernel.kallsyms] [k] native_write_msr_safe
This shows that 99.94% of instructions were executed in the program a-v1
inside the main
function (not a huge surprise there).
If you use the arrow keys to select the line for the main
function and hit Enter, you can view more detail.
Select the “Annotate main” option, which gives you an output that should look something like this:
main /home/curtsinger/courses/2016F/csc211/performance-counters/a-v1
│ addl $0x1,-0x4(%rbp)
│5c: cmpl $0x7fff,-0x4(%rbp)
│ ↑ jle 32
│ data[i] = rand() % 256;
│ }
│
│ for(int i=0; i<ITERATIONS; i++) {
│ movl $0x0,-0x8(%rbp)
│ ↓ jmp af
│ int sum = 0;
│6e: movl $0x0,-0xc(%rbp)
│ for(int i=0; i<ARRAY_SIZE; i++) {
│ movl $0x0,-0x10(%rbp)
│ ↓ jmp a2
│ if(data[i] >= 128) {
4.90 │7e: mov -0x10(%rbp),%eax
0.07 │ cltq
1.18 │ mov -0x20010(%rbp,%rax,4),%eax
19.37 │ cmp $0x7f,%eax
│ ↓ jle 9e
│ sum += data[i];
11.73 │ mov -0x10(%rbp),%eax
11.57 │ cltq
1.01 │ mov -0x20010(%rbp,%rax,4),%eax
6.95 │ add %eax,-0xc(%rbp)
│ data[i] = rand() % 256;
│ }
│
│ for(int i=0; i<ITERATIONS; i++) {
│ int sum = 0;
│ for(int i=0; i<ARRAY_SIZE; i++) {
19.40 │9e: addl $0x1,-0x10(%rbp)
13.79 │a2: cmpl $0x7fff,-0x10(%rbp)
10.02 │ ↑ jle 7e
│
│ for(int i=0; i<ARRAY_SIZE; i++) {
│ data[i] = rand() % 256;
│ }
│
│ for(int i=0; i<ITERATIONS; i++) {
│ addl $0x1,-0x8(%rbp)
0.01 │af: cmpl $0x270f,-0x8(%rbp)
│ ↑ jle 6e
│ sum += data[i];
│ }
│ }
│ }
│
│ return 0;
│ mov $0x0,%eax
│ }
│ leaveq
│ ← retq
This output interleaves assembly with the C code it was generated by. While the output is a bit hard to follow, you can generally find the C code for a set of instructions just after the instructions. The numbers in the left column show what percent of cycles were spent executing each instruction over the entire run.
Once you’ve found an event that is significantly different between the two programs, this view should help you pinpoint the exact line of code or instruction where that event is occurring.
When you’re done with this view, hit q twice to quit out of perf report
.
Using the perf
tool, explain why version 1 is so much slower than version 2. You should
be specific: you should identify a meaningful event that explains the difference and be
able to explain why this event is occurring. Describe the process you used to diagnose the
issue, and explain your findings.
Here are two versions of another program:
b-v1.c
#include <stdio.h>
#define ITERATIONS 10000
#define ARRAY_SIZE 32768
int main(int argc, char** argv) {
int data[ARRAY_SIZE];
for(int i=0; i<ARRAY_SIZE; i++) {
data[i] = i+1;
}
data[ARRAY_SIZE-1] = -1;
for(int i=0; i<ITERATIONS; i++) {
int sum = 0;
int index = 0;
while(index != -1) {
sum += data[index];
index = data[index];
}
}
return 0;
}
Copy this code to the file b-v1.c
and compile with the command:
$ gcc --std=c11 -g -o b-v1 b-v1.c
b-v2.c
#include <stdio.h>
#define ITERATIONS 10000
#define ARRAY_SIZE 32768
int main(int argc, char** argv) {
int data[ARRAY_SIZE];
for(int i=0; i<ARRAY_SIZE; i++) {
data[i] = i+1;
}
data[ARRAY_SIZE-1] = -1;
for(int i=0; i<ITERATIONS; i++) {
int sum = 0;
int index = 0;
while(index != ARRAY_SIZE) {
sum += data[index];
index++;
}
}
return 0;
}
Copy this code to the file b-v2.c
and compile with the command:
$ gcc --std=c11 -g -o b-v2 b-v2.c
As with our first pair of programs, these two versions are similar but have very different performance:
$ time ./b-v1
real 0m1.215s
user 0m1.212s
sys 0m0.000s
$ time ./b-v2
real 0m0.731s
user 0m0.728s
sys 0m0.000s
Use hardware performance counters to identify the root cause of the performance difference. Describe the process you used to diagnose the issue, and explain your findings.