Exercise: Getting Started with Threads

Now that you have a basic understanding of concurrency and threads, we’ll put those concepts into practice with a few simple multithreaded programs. As you complete this exercise you will likely need to refer to documentation for the pthread_* functions. The two best resources are the textbook chapter on the Threads API and the Linux man pages.

Your task today will be to implement a parallel program that counts words in a large corpus of text documents. The starter code includes a single-threaded implementation, along with an input dataset from the Miller Center of Public Affairs that includes over 1000 speeches delivered by US presidents. The provided text has been processed slightly from the original, available at https://data.millercenter.org.

Starter Code and Introduction

Log in to a MathLAN machine and launch a terminal. Run the following commands to get a copy of the starter code for this exercise and open it in VSCode:

$ mkdir -p ~/csc213/exercises
$ git clone /home/curtsinger/csc213/exercises/threads ~/csc213/exercises/threads
$ code ~/csc213/exercises/threads

Review the code provided with this exercise. Here’s a quick overview of what you’ll find in each file:

Makefile
This file specifies the process for building the example program from source. Run make in the terminal to build the project. You’ll notice this program is set up to build with Address Sanitizer by default; please leave this enabled.
countwords.c
This file implements a single-threaded word counting program. The main function validates command line arguments, and then passes the validated array of filenames to count_words. You’ll be able to reuse the word counting logic in the provided implementation.
inputs
This directory contains more than 1000 files from the text corpus we’ll be processing. You might end up looking inside specific files if you run into issues, but you should be able to leave these files alone in most cases.

To compile and run the starter code, run the following commands in a terminal:

$ make
clang -g -fsanitize=address -o countwords countwords.c -lpthread
$ ./countwords inputs/*
Total word count: 4146109

You’ll likely notice the counting can take a while (a few seconds, at least), but that’s mostly time spent loading the files on the first run. The OS does a good job of caching file data in memory, so later runs should be faster. We’ll use the time command to measure how long a run of the program takes:

$ time ./countwords inputs/*
Total word count: 4146109
real    0m0.160s
user    0m0.135s
sys     0m0.017s

The output from time tells us how much time elapsed on the clock (the “real” time) during the run, as well as the time spent in the program itself (“user”) and in the OS (“sys”). Our goal is to make real time shorter, and using threads is often a good way to do that. We might not see a big improvement for a short-running program like this one, but for even larger inputs we could see benefits from using threads. The goal for today is to preserve the existing functionality while dividing the work up in parallel.

A. Counting With Threads

Any time we are using threads to perform computation, we have to decide how to divide up the work into pieces that can be done concurrently. For our countwords program, the easiest place to start is to create a thread to count the words in each file. There are some downsides to this approach that we’ll address in later parts of the exercise, but we can ignore those issues for now.

Follow these high-level steps to convert the countwords program to one that counts in parallel:

  1. Write a new function, worker, that will run as the body of each thread. This function should count the words in a single file. As you write this function, update count_words so it calls worker. Make sure you still see the same total word count.

  2. Modify the worker function so it matches the parameter and return type requirements for pthread_create. The function must take a single void* argument, and must return a void* result. You can pass a filename string as the void* argument without much extra work, but you may want to look at the Thread API reading to see some recommendations for returning non-pointer values from threads. Keep in mind, it is never safe for a function to return a pointer to one of its local variables.

  3. Instead of calling worker directly in count_words, create a thread for each file. Every thread will run the worker function, but each should have a different argument. You will need to save the pthread_t identifier for each of your threads so you can join with them later.

  4. Finally, once all threads are created, join with each thread. You can join with threads in any order, as long as you join with them all before returning a result. Whatever method you choose for returning counts from threads will dictate how you combine the counts from each thread.

Once you have a working implementation, run it with a timer to see how its performance compares to the original single-threaded version. If everything works correctly, you should see output that looks something like this:

$ time ./countwords inputs/*
Total word count: 4146109

real    0m0.189s
user    0m1.337s
sys     0m0.453s

Unfortunately, the real time required to run this program is longer than the original program. That’s likely because creating threads is not free; we created over 1000 threads, but our machines can only run a few of them at one time. The work of creating more threads is reflected in the sys time, which is significantly higher than the 0.017s we saw with the original program.

You may have noticed the user and sys times are both longer than the program’s actual runtime. That’s because these times are totals across all threads in the program. This is a normal result when you are running multiple threads, so don’t be alarmed when you see this in later parts of the exercise.

B. Creating a Thread Pool

In part A, we divided work up across many threads but we didn’t see any performance benefit from that change. We’ll address that in this part of the exercise using a design called a thread pool.

A thread pool is a group of threads that complete work in parallel; it reduces the cost of creating threads by using a fixed number of threads instead of one thread per item of work. For this to work, we’ll need our threads to count words from more than one file. Instead of counting a single pre-assigned file, each worker thread keeps counting words until all files have been counted.

The pseudocode below shows what a typical worker thread does in a thread pool:

While the worklist is not empty:
  Take work item w from the worklist
  Complete work item w
Return the result

You’ve already written code to complete work items (that’s counting words in our case), and returning a result should be nearly identical to what you’ve done in part A. The first two lines of pseudocode will require some changes, though.

We can use the files and num_files parameters passed to count_words as our worklist. Every thread will need access to both those parameters, as well as a next_item index that records where the next available file in the worklist is located. Normally we would pass these as separate parameters to our worker function, but the pthread_create function forces us to write worker functions that take a single void* parameter.

The typical way we pass multiple values to a thread in C is by putting them inside a struct. Create a single struct and pass a pointer to that struct as the thread parameter. It is important that all worker threads have a pointer to the same struct, since they will be coordinating through the next_item index. When one thread updates the index of the next available item, other threads should see that update.

We also need threads to take turns when they interact with the worklist. A thread must be able to check if the list is empty and claim the next item from the worklist without being interrupted by another thread. If we don’t address this, we could end up with two threads counting in the same file. This issue is an example of an atomicity violation, which we will read about and discuss more in the future.

We can force threads to take turns by using a lock, which is implemented in a pthread_mutex_t in C. We’ll talk more about locking discipline soon, but for now you will want a single lock that each thread will hold any time it is accessing the index field of your thread arguments. Make sure you have a single pthread_mutex_t that is accessed by all threads; making your lock a local variable in the thread function won’t work. The best place to put your lock is in the struct along with the value it protects (next_item in this case). Don’t forget to initialize the lock, which will require a call to the pthread_mutex_init function.

Finally, you’ll need to update count_words to create a fixed number of threads. It would be good practice to #define a THREAD_COUNT constant that you can adjust later. As of the time of writing, the lab machines for CSC 213 can run many threads simultaneously, so start with a thread count of 20.

Once you have everything working, run the program again to see if we’ve sped anything up. Here’s some representative output you should expect, although the results depend a lot on small implementation details for programs that run for such a short time:

$ time ./countwords inputs/*
Total word count: 4146109

real    0m0.088s
user    0m1.033s
sys     0m0.063s

As you can see, we’ve finally made performance better than the single-threaded version, at least for this example. Don’t worry too much if your implementation is still slower; we’ll write threaded programs that do much more computationally intensive work than counting words in the near future, and you’ll definitely see a speedup there.

C. Designing a Better Thread Pool

The thread pool you implemented in part B was good enough to improve program performance, but it has some limitations. One major limitation is that we cannot safely add work to the worklist after creating worker threads.

Once worker threads are running they could finish at any time. If the main thread adds another file to the worklist it may be processed, but if the worker threads have exited already by this point they won’t count words in the added file. This is an example of an order violation, where two events in a concurrent program must happen in a specific order (work must be added before threads exit), but there is no mechanism in place to ensure that order.

Work with your partner(s) to design a different thread pool that allows you to add work at any point. You should be able to convince yourself that your design will guarantee that worker threads do not exit until all work is finished. Make sure your design still allows threads to exit once the main thread is done adding work to the list.