Assignment: Parallel Lettercount

Assigned
  • May 3, 2021
Due
  • May 10, 2021 by 11:59pm
Collaboration
    All assignments in this class should be completed individually. You may ask for assistance from the instructor or mentor, but you may not discuss any aspect of your work on the assignment with other students in the class.
Submitting
    To submit this assignment, run make zip in the VSCode terminal to create a zip archive of your work. Log in to Gradescope at https://gradescope.com and upload it to the Parallel Lettercount assignment on Gradescope. You can resubmit as many times as you like before the deadline.

Overview

For this assignment, you will write a parallel program that counts how many times each of the 26 letters of the alphabet occurs in an input file. The program can ignore all non-alphabetic characters, but should count both uppercase and lowercase versions of the same letter together. Your program must be able to run with one, two, four, or eight threads, depending on a command line parameter.

To run the program, use the command ./lettercount <N> <input file>, where <N> is a placeholder for the number of threads and <input file> is a path to a text file the program should process.

The starter code for this assignment already includes code to read the number of threads and the input file from the command line. This code uses the mmap function to load the file into memory, so you can access it like a normal array of bytes and let the OS do the work of accessing the file behind the scenes. See the examples below for more detail on how the program should run.

To complete your implementation, you’ll need to add at least one additional function that runs in the body of each thread. This thread function will be responsible for counting a subset of the characters in an input file. The comments in the starter code include instructions and some hints, so please read them carefully.

Starter Code

Run the following commands on MathLAN to make a copy of the starter code for this assignment:

$ mkdir -p ~/csc213/assignments; git clone /home/curtsinger/csc213/assignments/lettercount ~/csc213/assignments/

Then run this command to open the project in VSCode:

$ code ~/csc213/assignments/lettercount

Questions & Answers

Should adding threads speed up the program?
It may not, depending on your synchronization strategy. That’s okay, since the purpose of this assignment is to make sure you understand how to use threads and locks to protect shared data. However, adjusting your implementation to improve performance may be a useful exercise if you want an additional challenge. Note: You will likely need to run your code on MathLAN for threads to result in any sort of performance improvement.
Can we just count separately in each thread and then combine counts later?
No, at least not entirely. If you would like to use a local count and update a global count periodically that’s okay, but make sure your threads do not count more than 100 characters between updates to the shared counters.
How evenly do we have to divide work between threads?
Each thread should count approximately file_size / num_threads characters. It’s okay if there is a slight imbalance, but not thread should count more than num_threads extra characters.

Examples

For all of the examples below, you should get the same counts regardless of the number of threads you use.

The first input contains only spaces and lowercase alphabetic characters, with the standard typewriter test phrase “the quick brown fox jumps over the lazy dog”.

$ ./lettercount 2 inputs/input1.txt
a: 1
b: 1
c: 1
d: 1
e: 3
f: 1
g: 1
h: 2
i: 1
j: 1
k: 1
l: 1
m: 1
n: 1
o: 4
p: 1
q: 1
r: 2
s: 1
t: 2
u: 2
v: 1
w: 1
x: 1
y: 1
z: 1

The second test input mixes upper and lower case letters, along with some other punctuation. This text is the Grinnell College mission statement.

$ ./lettercount 4 inputs/input2.txt
a: 71
b: 8
c: 35
d: 42
e: 104
f: 22
g: 21
h: 42
i: 69
j: 0
k: 7
l: 49
m: 14
n: 60
o: 67
p: 11
q: 4
r: 49
s: 50
t: 70
u: 27
v: 10
w: 16
x: 2
y: 13
z: 0

And finally, the third provided input contains the complete works of Shakespeare, provided by Project Gutenberg (see inputs/input3-license.txt).

$ ./lettercount 8 inputs/input3.txt
a: 288594
b: 61788
c: 87839
d: 149127
e: 446147
f: 80333
g: 68054
h: 236584
i: 253329
j: 4752
k: 35362
l: 169657
m: 111219
n: 242751
o: 313890
p: 58249
q: 3577
r: 237250
s: 248518
t: 328987
u: 128697
v: 37497
w: 89286
x: 5217
y: 94173
z: 1626