Assignment: Ngram Generator

Assigned
  • April 1, 2021
Due
  • April 5, 2021 by 11:59pm
Collaboration
    All assignments in this class should be completed individually. You may ask for assistance from the instructor or mentor, but you may not discuss any aspect of your work on the assignment with other students in the class.
Submitting
    To submit this assignment, run make zip in the VSCode terminal to create a zip archive of your work. Log in to Gradescope at https://gradescope.com and upload it to the Ngram Generator assignment on Gradescope. You can resubmit as many times as you like before the deadline.

Overview

For this assignment, you will implement a simple command line utility that reads in strings from standard input and produces a list of all the ngrams contained in the input. An ngram is a length-n sequence of characters, sometimes used for language modeling or other kinds of text analysis. Your program should print all of the ngrams in a given string; see the examples below for details.

You will need to complete this assignment using the provided starter code, and upload your code to Gradescope. Follow these steps to set up your working copy of this assignment:

  1. Open https://gpu.cs.grinnell.edu and log in with your Grinnell account. You will need to approve your login with Duo or enter in a one-time-password if you do not use Duo.
  2. Start an XFCE session. If you see a blank page or some other error for the session, close the tab and then click on your session again to re-open it. This fixes most issues, but please ask for help if you run into persistent issues.
  3. Open a terminal window in the XFCE Session.
  4. Run the following commands in your terminal to set up a working directory for this class:
    $ mkdir -p ~/csc213/assignments ~/csc213/exercises ~/csc213/labs
    
  5. Now use the git command to check out a copy of the starter code for the assignment:
    $ git clone /home/curtsinger/csc213/assignments/ngram ~/csc213/assignments/
    
  6. And now you can use the code command to open the starter code with Visual Studio Code.
    $ code ~/csc213/assignments/ngram
    
  7. A Visual Studio Code window should appear with the ngram directory open in the file browser. You may see a welcome message, which you can close. You can also close any prompts to upgrade to a new version of VSCode.
  8. Open a terminal inside of VSCode using the Terminal menu. By default, terminals appear on the bottom of the window. I find it more conveninent to move it to the right side; just right-click somewhere near the top of the panel that appears and choose Move Panel Right.
  9. Now you can run make in the terminal to build the starter code, or just type ctrl+shift+b to run the default build task (which just runs make).

We’ll use VSCode as the default editor for this class. You can use other editors if you prefer, but you’ll be missing out on some useful features. The VSCode projects I distribute will automatically format your C code, and will include some default settings that help with syntax highlighting, running build tasks, etc.

At this point you should read through the requirements for the assignment and review the provided code.

Requirements

The best way to understand the expected output for this tool is to see an example. The ngram program reads from standard input, which we can send to the program using the shell | operator. This pipe takes the output of the preceding command and sends it as input to the next command. We can generate output easily with the echo command, but the echo command adds a newline to the end of the string by default, so I will use it with the option -n to omit the newline in the examples below.

Here is a simple run of the ngram tool to produce all the trigrams of a given input:

$ echo -n "This is a test" | ./ngram 3
Thi
his
is
s i
 is
is
s a
 a
a t
 te
tes
est

Note that the tool does not print “st” or “t” at the end of the run—it should only print complete trigrams. If the input does not include enough characters to form any complete ngrams, the tool should print nothing. The ngram command should work for ngrams as short as 1, up to , the largest positive integer that can be represented in a signed 32 bit number (provided your computer has enough memory to hold an ngram that large and your operating system will allow you to request that much memory). In other words, you may not hard-code an upper limit on the number of characters in an ngram. Similarly, there is no upper limit on the length of the input to the program. That means you should not attempt to read all of the input and store it in the program at one time.

The starter code includes some basic validation of the single command line argument that specifies the length of the ngrams that should be collected. Please write your solution in the main function below this validation code. You are welcome to add helper functions, but please do not change or remove the code that validates the N parameter.

Please use reasonable coding practices in writing your solution; that means you should write comments, indent your code properly, and use informative variable names. You should also follow best-practices in writing your solution; that means you should free and memory you malloc, and make sure to check for errors returned by standard POSIX functions. I strongly encourage you to search for POSIX functions that will help you construct your solution rather than writing everything from scratch.

I highly recommend that you read input one character at a time using a function like fgetc; this is a bit easier than reading multiple characters at a time, and should be just as fast. One approach you should not use is to read in the entire input at once. The input may be too large for you to store in memory at one time, so you will have to process it as you read it in.

You may have noticed that the provided Makefile builds with the -Wall and -Werror options. This tells the compiler to give all available warnings, and to treat any warning as a build error. I will use these options when building your code for grading, so make sure the code you submit does not produce any errors or warnings.

Questions & Answers

What if the input string is shorter than N?
The program should print nothing.
What if N or the input is huge?
It should work in theory. You will eventually run up against limits like the amount of memory on your machine or the maximum stack size. That’s fine; just don’t hard-code an upper limit on N or the length of the program’s input.
When should the ngram program stop?
When it reaches the end of the program’s input, not the end of a line.

Additional Examples

Here are a few additional example runs of ngram that may be useful for testing your implementation:

$ echo -n "Hello" | ./ngram 1
H
e
l
l
o
$ echo -n "Hi" | ./ngram 2
Hi
$ echo -n "Hi" | ./ngram 3
$ echo -n "213 is great" | ./ngram 2
21
13
3
 i
is
s
 g
gr
re
ea
at