Exercise: Debugging

Collaboration

Work with your assigned group for this exercise.

Submitting

You do not need to turn anything in for this exercise.

Preparation

This exercise uses a collection of buggy programs I have prepared for you to practice using gdb. You will complete this exercise in groups; one member of each group will need to start a session on MathLAN, which you can then share.

Start the exercise by connecting to a MathLAN machine, cloning the exercise repository at /home/curtsinger/csc213/exercises/debugging, and opening your code with VSCode. I’ll outline the steps for this process below, but it’s nearly the same as for the first assignment.

Follow these steps to set up your working copy of this exercise:

Open https://gpu.cs.grinnell.edu and log in with your Grinnell account. You will need to approve your login with Duo or enter a one-time password if you do not use Duo.
Start an XFCE session. If you see a blank page or some other error for the session, close the tab and then click on your session again to re-open it. This fixes most issues, but please ask for help if you run into persistent issues.
Open a terminal window in the XFCE Session.
Run the following commands in your terminal to set up a working directory for this class:
```
$ mkdir -p ~/csc213/assignments ~/csc213/exercises ~/csc213/labs
```
Now use the git command to check out a copy of the starter code for the exercise:
```
$ git clone /home/curtsinger/csc213/exercises/debugging ~/csc213/exercises/debugging
```
And now you can use the code command to open the starter code with Visual Studio Code.
```
$ code ~/csc213/exercises/debugging
```
A Visual Studio Code window should appear with the debugging directory open in the file browser. You may see a welcome message, which you can close. You can also close any prompts to upgrade to a new version of VSCode.
Open a terminal inside of VSCode using the Terminal menu. By default, the terminal panel appears at the bottom of the window. I find it more convenient to move it to the right side; just right-click somewhere near the top of the panel that appears and choose Move Panel Right.
Now you can run make in the terminal to build the starter code, or just type ctrl+shift+b to run the default build task (which just runs make).

Part A: Catching Segfaults

We’ll start out by looking at our first buggy program, partA. While the source code is in the directory you copied, this exercise will walk you through a debugging session without the source code. Instead, we’ll rely on gdb to show us lines of code where errors occurred.

Run the partA program outside of gdb to verify that it does indeed have a bug:

$ ./partA
Segmentation fault

A great next step to debug this program is to start it in gdb:

$ gdb ./partA
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
...
Reading symbols from ./partA...done.
(gdb)

            

To run the program in gdb, enter the command run and hit enter. This time, you should end up with output that looks something like this:

Starting program: /home/awesomestudent/csc213/exercises/debugging/partA

Program received signal SIGSEGV, Segmentation fault.
0x00005555555551a6 in total_characters (words=0x7fffffffe660, num_words=10) at partA.c:20
20	    while (words[i].word[j++] != '\0') count += words[i].count;

In this run of the program, gdb is telling us that a segmentation fault happened inside of the total_characters function on line 20 of a source file named partA.c. Normally you will have access to the source code for the programs you are debugging, but often parts of your program (such as libraries) will not have debug information, or you may just not understand the provided code yet. To replicate that environment, try to complete this exercise without opening the source files.

The first step when debugging a fault like this is to see how we reached the code where the error occurred. We can see that by running the backtrace (or bt) command:

(gdb) backtrace
#0  0x00005555555551a6 in total_characters (words=0x7fffffffe660, num_words=10) at partA.c:20
#1  0x00005555555553d9 in main () at partA.c:66

            

This shows us the line where our segmentation fault occurred, and also tells us that this function was called from the main function.

Starting our search

I typically begin debugging segmentation faults or other types of errors that stop the program with two questions:

What parts of the current line could have triggered the failure?
How did we get to this error?

We can answer the second question using backtrace, but you will have to rely on your C knowledge to answer the first. We need to look at the source line where the error occurred, which may be off the screen at this point. To bring it back, use the frame command:

(gdb) frame
#0  0x00005555555551a6 in total_characters (words=0x7fffffffe660, num_words=10)
    at partA.c:20
20	    while (words[i].word[j++] != '\0') count += words[i].count;

            

Our program crashed with a segmentation fault, which occurs when you dereference an invalid pointer. The pointer may be NULL, or it could hold some other invalid memory address. Work with your group to come up with a list of all the parts of this line that dereference a pointer; this could happen when the code uses the * operator, array indexing, or ->. Once you have a list, move on to the next step.

Hunting for invalid pointers

Once you have a list of operations that dereference pointers, you can use gdb to look at the pointer values to see if any of them are suspicious. One possible operation that dereferences a pointer is words[i]. If the words pointer is not valid, indexing into it as an array would trigger a segmentation fault. Use the print command to look at this value:

(gdb) print words
$1 = (word_count_t *) 0x7fffffffe660

This shows us that words has type word_count_t*, and its value is 0x7fffffffe660. We can tell right away that words is not NULL (NULL is zero on most reasonable machines), but is 0x7fffffffe660 a valid pointer? You will gradually develop a sense of what a real pointer looks like, but you can check to see if an address is valid using gdb as well. The info proc mappings gdb command can show you all of the valid ranges in your program’s address space. Keep in mind that your output almost certainly will not match the example output below, so be sure to run the command on your own.

(gdb) info proc mappings
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
      0x555555554000     0x555555555000     0x1000        0x0 /home/awesomestudent/csc213/exercises/debugging/partA
      0x555555555000     0x555555556000     0x1000     0x1000 /home/awesomestudent/csc213/exercises/debugging/partA
      0x555555556000     0x555555557000     0x1000     0x2000 /home/awesomestudent/csc213/exercises/debugging/partA
      0x555555557000     0x555555558000     0x1000     0x2000 /home/awesomestudent/csc213/exercises/debugging/partA
      0x555555558000     0x555555559000     0x1000     0x3000 /home/awesomestudent/csc213/exercises/debugging/partA
      0x7ffff7ddc000     0x7ffff7dfe000    0x22000        0x0 /usr/lib/x86_64-linux-gnu/libc-2.28.so
      0x7ffff7dfe000     0x7ffff7f46000   0x148000    0x22000 /usr/lib/x86_64-linux-gnu/libc-2.28.so
      0x7ffff7f46000     0x7ffff7f92000    0x4c000   0x16a000 /usr/lib/x86_64-linux-gnu/libc-2.28.so
      0x7ffff7f92000     0x7ffff7f93000     0x1000   0x1b6000 /usr/lib/x86_64-linux-gnu/libc-2.28.so
      0x7ffff7f93000     0x7ffff7f97000     0x4000   0x1b6000 /usr/lib/x86_64-linux-gnu/libc-2.28.so
      0x7ffff7f97000     0x7ffff7f99000     0x2000   0x1ba000 /usr/lib/x86_64-linux-gnu/libc-2.28.so
      0x7ffff7f99000     0x7ffff7f9f000     0x6000        0x0
      0x7ffff7fd0000     0x7ffff7fd3000     0x3000        0x0 [vvar]
      0x7ffff7fd3000     0x7ffff7fd5000     0x2000        0x0 [vdso]
      0x7ffff7fd5000     0x7ffff7fd6000     0x1000        0x0 /usr/lib/x86_64-linux-gnu/ld-2.28.so
      0x7ffff7fd6000     0x7ffff7ff4000    0x1e000     0x1000 /usr/lib/x86_64-linux-gnu/ld-2.28.so
      0x7ffff7ff4000     0x7ffff7ffc000     0x8000    0x1f000 /usr/lib/x86_64-linux-gnu/ld-2.28.so
      0x7ffff7ffc000     0x7ffff7ffd000     0x1000    0x26000 /usr/lib/x86_64-linux-gnu/ld-2.28.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x27000 /usr/lib/x86_64-linux-gnu/ld-2.28.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000        0x0
      0x7ffffffde000     0x7ffffffff000    0x21000        0x0 [stack]

            

This shows all of the virtual address ranges accessible to this program, each established by the operating system. Most of these were set up via calls to mmap. Note that some mappings are placed at random locations, so your addresses may not match up exactly. If you look through the entries, you’ll see that words has a value that falls between the start and end addresses of the last entry. This entry corresponds to the program’s stack, so words points to space on the stack. Because words references a valid region of memory, reading through words itself is not what caused the segmentation fault.

Continue printing values of variables used on the current line until you have identified the offending pointer. You can write C-like expressions after print in gdb, such as print words[i].count. Make sure you’ve identified the specific pointer access that is causing the segmentation fault before you move on.

Using this information

Now that you’ve discovered the offending pointer, the next step is to examine the code of the main function to figure out why it is calling total_characters with an array that contains an invalid pointer. Use what you found in the previous part to fix the error; you’ll probably find it quickly, but without a debugger you may not have been able to find the issue. Once you’ve fixed the bug, rerun make and verify that the program finishes without crashing.

As you’ll see in the next part, sometimes finding the corrupted value is just the first step in a longer debugging process.

Part B: Diagnosing Mysterious Bugs

For this part, we will look at a short program with exactly one memory error. Open the source file partB.c. This is a pretty straightforward program that copies one array to another. If you see the error already you’re more observant than most CS students and faculty; this is a common mistake that fools many people. But, for the purposes of this exercise, try not to hunt through the code; we’re going to save ourselves some work by using gdb instead.

First, we’ll run the program without gdb:

$ ./partB
I've made a huge mistake.

Unlike our first example, this program does not stop at the point where an error occurred. Instead, we just get the wrong result. Still, we can use gdb to track down the root cause of the error. Start the program with gdb:

$ gdb ./partB
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
...
Reading symbols from ./partB...done.
(gdb) run
Starting program: /home/awesomestudent/csc213/exercises/debugging/partB
I've made a huge mistake.
[Inferior 1 (process 3102) exited normally]

            

We’re still getting the wrong answer, so we can work backwards through the program. If the sums of the two arrays are not equal, we could check to see what the values of those sums are. A breakpoint is a reasonable way to do this. The program has computed the sums by line 28, so we’ll set a breakpoint, run the program again, and then print both sums with gdb:

(gdb) break partB.c:28
Breakpoint 1 at 0x5555555551f4: file partB.c, line 28.

(gdb) run
Starting program: /home/awesomestudent/csc213/exercises/debugging/partB

Breakpoint 1, main () at partB.c:28
28	  if (array1_sum == array2_sum)

(gdb) print array1_sum
$1 = 1295394831

(gdb) print array2_sum
$2 = 15

It looks like array2_sum is computed correctly, but array1_sum is not. That’s odd, because we are copying values from array1 to array2, and yet somehow array1 is being overwritten. Just to verify that’s really happening, try printing some values from array1:

(gdb) print array1[0]
$3 = 1431654944
(gdb) print array1[1]
$4 = 21845

            

The values of array1 are definitely changed from their initial values, but we didn’t write any code to directly modify array1. That means this program has a buffer overrun: some write to a different buffer went past the end of that buffer and landed in array1. Buffer overruns can be difficult to track down. However, gdb gives us the tools we need to catch this buffer overrun as it occurs. There are two techniques that make sense in different circumstances, and we’ll track down the error with both.

Catching the error with conditional breakpoints

We know that the values of array1 are being overwritten by some code in our program. Because we have a small program, we can actually narrow this down pretty easily; the only code that writes to memory in our program is the loop on lines 18–20. This loop just runs a few times, so we could set a breakpoint on each iteration of the loop and inspect the result:

(gdb) break partB.c:19
Breakpoint 2 at 0x5555555551ae: file partB.c, line 19.

(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y

Starting program: /home/awesomestudent/csc213/exercises/debugging/partB

Breakpoint 2, main () at partB.c:19
19	    array2[i] = array1[i];

            

We’re now at the first write to memory. Because we suspect a buffer overrun, we should make sure our array indices are in-bounds. You can print i every time, or you can use gdb’s display command to print the value of a variable each time the program stops.

(gdb) display i
1: i = 0
(gdb) continue
Continuing.

Breakpoint 2, main () at partB.c:19
19	    array2[i] = array1[i];
1: i = 1

            

Now each time we continue, the program will stop at our breakpoint and print i. To repeat the previous command, just press Enter on an empty line in gdb. This process will take us to the error eventually, and in our case after just a few iterations. However, this does not work well if your code loops thousands of times before an error occurs. For this, we can use conditional breakpoints.

First, remove all breakpoints from your program:

(gdb) delete
Delete all breakpoints? (y or n) y

Now, we’ll set a conditional breakpoint. We aren’t concerned about in-bounds writes to memory, but we do want to catch the first out-of-bounds write. That occurs for indices greater than or equal to 5, the length of array2.

(gdb) break partB.c:19 if i >= 5
Breakpoint 3 at 0x5555555551ae: file partB.c, line 19.

(gdb) run
Starting program: /home/awesomestudent/csc213/exercises/debugging/partB

Breakpoint 3, main () at partB.c:19
19	    array2[i] = array1[i];
1: i = 5

            

Now we’ve stopped the program at exactly the point where an out-of-bounds write occurs. Given this information, you can go back to the code and figure out why this loop is running for too many iterations. This approach works well when you have a good idea of where a buffer overrun is occurring and you want to catch it “in the act.” However, it’s not always clear which code you should be checking; you have to understand the program very well to know where all possible writes might occur, and large programs will have many lines of code that write to memory and could potentially overrun a buffer. That’s where watchpoints are useful.

Catching the error with watchpoints

In this case, we’re going to ignore the code and instead watch for modifications to memory. First, delete our breakpoints, set a new breakpoint after we’ve added up the arrays, and run the program.

(gdb) delete
Delete all breakpoints? (y or n) y

(gdb) break partB.c:28
Breakpoint 4 at 0x5555555551f4: file partB.c, line 28.

(gdb) run
Starting program: /home/awesomestudent/csc213/exercises/debugging/partB

Breakpoint 4, main () at partB.c:28
28	  if (array1_sum == array2_sum)

            

Now we’re at the point where we’ve computed invalid array sums. Instead of looking at the sums themselves, we’ll look inside the arrays. Remember that we checked the values in array1 before; let’s compare those to the values in array2:

(gdb) print array1[0]
$5 = 1431654944
(gdb) print array2[0]
$6 = 1

            

We’d expect these values to match, since the code copies from array1 to array2. Clearly array1[0] is being overwritten. To catch this overwriting, we’ll delete our breakpoints and start the program again. The start command will begin executing the program and stop once we reach main.

(gdb) delete
Delete all breakpoints? (y or n) y

(gdb) start
Temporary breakpoint 5 at 0x555555555182: file partB.c, line 14.
Starting program: /home/awesomestudent/csc213/exercises/debugging/partB

Temporary breakpoint 5, main () at partB.c:14
14	  int array1[] = {1, 2, 3, 4, 5};

            

Now that the program has started, we can set a watchpoint to monitor array1[0] for changes. If we tried to do this before starting the program, we might get the wrong address; many parts of the program are loaded at random addresses on each run, so we need to make sure we get addresses from the current run.

(gdb) watch array1[0]
Hardware watchpoint 6: array1[0]

(gdb) continue
Continuing.
Hardware watchpoint 6: array1[0]

Old value = 1431654944
New value = 1
0x0000555555555189 in main () at partB.c:14
14	  int array1[] = {1, 2, 3, 4, 5};

            

We’ve now stopped our program at the first modification to array1[0]. This is actually initializing the array, so we haven’t found the write we’re hunting for.

(gdb) continue
Continuing.
Hardware watchpoint 6: array1[0]

Old value = 1
New value = 1431654944
main () at partB.c:18
18	  for (int i = 0; i < sizeof(array1); i++)
1: i = 8

            

Now we’ve stopped the program at the point where array1[0] is overwritten. This leads us to the same loop we found with conditional breakpoints, but we did not need to know which code was overwriting the array contents. In general, watchpoints are useful when you know a value is being changed but you don’t know why. I recommend using these over conditional breakpoints in most cases, but they have some limitations. You are limited to just four hardware watchpoints at a time, and a watchpoint can only detect modifications to a range of 1, 2, 4, or 8 bytes, not an entire array or a large struct.

Wrapping Up

Now that you’ve tracked down the problem with partB.c, make sure you know how to fix it. This case is somewhat contrived, but hopefully you can take some of these techniques and use them for debugging your own programs in the future. There are many more gdb commands, so I recommend running the help command to see what commands are available. The gdb command line also includes Tab completion, so you can auto-complete many commands if you remember how they start. If you learn any new, useful gdb commands, please share them with the class!

There are quite a few gdb commands we did not use in this example. Two notable examples are step and next, which allow you to walk through your program one line at a time. In general, you want to avoid doing this; use breakpoints and watchpoints to stop the program at the point you want instead. Sometimes, though, you have no choice but to step through a program one line at a time. The next command will go to the next line of the current function. If the current line calls another function, gdb will execute that function and break when it returns. The step command will run until the next source line, whether it is in the same function or not. These are equivalent to the “step over” and “step into” operations that many graphical debuggers support.