Assignment: Archive Printer

Assigned

April 12, 2021

Due

April 19, 2021 by ~~11:59pm~~ class time on Tuesday

Collaboration

All assignments in this class should be completed individually. You may ask for assistance from the instructor or mentor, but you may not discuss any aspect of your work on the assignment with other students in the class.

Submitting

make zip

https://gradescope.com

Archive Printer

Overview

For this assignment, you will write a small program that prints out the contents of a UNIX archive. This format, which typically uses the file extension “.ar”, is a predecessor of the .tar format. UNIX archives are also used to hold libraries you can link into compiled programs.

An archive contains one or more files using a relatively simple data layout. Your program will open an archive file (using some provided code) and print out each file in the archive. To read the archive format you will need to use some pointer manipulation techniques you probably have not seen before this course.

You can find the full details about the .ar format on Wikipedia, but you will only have to support a limited subset of the format for this assignment. The overall structure of a .ar file is a file signature followed by a file header, the data for that file, the next file header, that file’s data, and so on until the end of the file.

You will need to complete this assignment using the provided starter code, and upload your code to Gradescope. Follow these steps to set up your working copy of this assignment:

Open https://gpu.cs.grinnell.edu and log in with your Grinnell account. You will need to approve your login with Duo or enter a one-time password if you do not use Duo.
Start an XFCE session. If you see a blank page or some other error for the session, close the tab and then click on your session again to re-open it. This fixes most issues, but please ask for help if you run into persistent issues.
Open a terminal window in the XFCE Session.
Run the following commands in your terminal to set up a working directory for this class:
```
$ mkdir -p ~/csc213/assignments ~/csc213/exercises ~/csc213/labs
```

Now use the git command to check out a copy of the starter code for the assignment:

$ git clone /home/curtsinger/csc213/assignments/print-archive ~/csc213/assignments/print-archive

And now you can use the code command to open the starter code with Visual Studio Code.
```
$ code ~/csc213/assignments/print-archive
```
A Visual Studio Code window should appear with the print-archive directory open in the file browser. You may see a welcome message, which you can close. You can also close any prompts to upgrade to a new version of VSCode.
Open a terminal inside of VSCode using the Terminal menu. By default, terminals appear at the bottom of the window. I find it more convenient to move it to the right side; just right-click somewhere near the top of the panel that appears and choose Move Panel Right.
Now you can run make in the terminal to build the starter code, or just type ctrl+shift+b to run the default build task (which just runs make).

We’ll use VSCode as the default editor for this class. You can use other editors if you prefer, but you’ll be missing out on some useful features. The VSCode projects I distribute will automatically format your C code, and will include some default settings that help with syntax highlighting, running build tasks, etc.

At this point you should read through the requirements for the assignment and review the provided code.

File Signature

The file signature is the string !<arch> followed by a line feed character (code 0x0A). Each .ar file begins with this signature so a program reading it can verify the format. The starter code checks this file signature, so you can safely assume your code will only have to process .ar files. You will still need to skip past these eight bytes (seven normal characters and a line feed). Note that there is not a null terminator at the end of this string. Also, keep in mind that the file signature only appears at the start of the .ar file, not before every file.
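The skip described above can be sketched roughly as follows. The helper name `skip_signature` and the constant `AR_SIGNATURE_LEN` are hypothetical and are not part of the starter code. The starter code already validates the signature, so the `memcmp` check is included only to illustrate comparing bytes that have no null terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Length of "!<arch>\n": seven visible characters plus a line feed, no '\0'.
#define AR_SIGNATURE_LEN 8

// Hypothetical helper: returns a pointer to the first file header,
// or NULL if the data does not begin with the .ar signature.
const uint8_t* skip_signature(const uint8_t* data, size_t file_size) {
  assert(data != NULL);
  if (file_size < AR_SIGNATURE_LEN) return NULL;
  // memcmp compares a fixed number of raw bytes, so the missing
  // null terminator in the archive is not a problem.
  if (memcmp(data, "!<arch>\n", AR_SIGNATURE_LEN) != 0) return NULL;
  // Arithmetic on a uint8_t* moves one byte per unit.
  return data + AR_SIGNATURE_LEN;
}
```

Because the signature appears only once, a helper like this would be called a single time before the loop over file headers.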

File Header

Each file header gives us a variety of important information about each file, although we’ll just need two pieces of the header: the file identifier (its name) and the file size. Here are all the fields in the file header along with their size in bytes.

File Identifier (16 bytes): This contains 16 ASCII characters that record the name of the file in the archive. The end of the filename is denoted by a / character, followed by spaces to fill the rest of the 16 bytes. Filenames longer than 15 characters are stored in a different way, but you do not need to support longer filenames.
File Modification Timestamp (12 bytes): This stores the last time the file was modified as a string. Rather than writing raw numbers, the .ar format stores numeric values in human-readable ASCII characters. You can use sscanf, strtol, or atoi to convert these values to integers. Any unused bytes after the number are filled with spaces.
Owner ID (6 bytes): This stores the numeric ID of the file’s owner. Again, this is a string of numeric digits in base ten.
Group ID (6 bytes): This stores the numeric ID of the file’s assigned group.
File Mode (8 bytes): This stores the permissions for the file: whether the owner, group, and other users can read, write, and/or execute the file. This value is stored in ASCII characters, but the number is written in octal (base eight).
File Size (10 bytes): This stores the size of the file measured in bytes. Like most of the previous numbers, this is a string of decimal digits that can be converted to an integer with sscanf, strtol, or atoi.
Ending Characters (2 bytes): Each file header ends with the characters 0x60 and 0x0A.
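One possible way to mirror the fields listed above in C is sketched below. The names `ar_header_t`, `parse_decimal_field`, and `identifier_length` are hypothetical and do not come from the starter code. The sketch assumes a GCC- or Clang-compatible compiler for `__attribute__((packed))`, and it copies each numeric field into a terminated buffer before converting it, since the fields themselves are not null-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical struct mirroring the field sizes listed above (60 bytes total).
typedef struct __attribute__((packed)) ar_header {
  char identifier[16];  // file name, terminated by '/', padded with spaces
  char timestamp[12];   // decimal digits, padded with spaces
  char owner_id[6];     // decimal digits
  char group_id[6];     // decimal digits
  char mode[8];         // octal digits
  char size[10];        // decimal digits
  char ending[2];       // 0x60 followed by 0x0A
} ar_header_t;

_Static_assert(sizeof(ar_header_t) == 60, "header must be exactly 60 bytes");

// Convert a space-padded ASCII decimal field to an integer.
// The field has no '\0', so copy it into a terminated buffer first.
size_t parse_decimal_field(const char* field, size_t width) {
  char buf[32];
  assert(width < sizeof(buf));
  memcpy(buf, field, width);
  buf[width] = '\0';
  // strtoul stops at the first non-digit, which here is a padding space.
  return (size_t)strtoul(buf, NULL, 10);
}

// Number of characters in the identifier before the terminating '/'.
size_t identifier_length(const ar_header_t* header) {
  size_t len = 0;
  while (len < sizeof(header->identifier) && header->identifier[len] != '/') {
    len++;
  }
  return len;
}
```

With a header pointer `h`, a call such as `parse_decimal_field(h->size, sizeof(h->size))` would yield the number of data bytes that follow the header.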

File Data

The actual contents of a file begin immediately after the ending characters of the file header. The number of bytes of file data is the file size, stored in the file header. There is no special character to mark the end of the file data.

One odd constraint of the .ar format is that file headers must always begin an even number of bytes away from the start of the file. The header itself is an even number of bytes, so if the file data has an odd length, one byte of padding is added immediately after the file data. This padding byte is not part of the file data.
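The padding rule can be expressed as a small offset calculation. This is a minimal sketch; the helper name `next_header_offset` and the constant `AR_HEADER_SIZE` are hypothetical, and offsets are measured in bytes from the start of the archive.

```c
#include <assert.h>
#include <stddef.h>

// Sum of the header field sizes: 16 + 12 + 6 + 6 + 8 + 10 + 2.
#define AR_HEADER_SIZE 60

// Hypothetical helper: given the offset of a file header and the size of
// that file's data, return the offset where the next header begins.
size_t next_header_offset(size_t header_offset, size_t data_size) {
  assert(header_offset % 2 == 0);  // headers always start on even offsets
  size_t next = header_offset + AR_HEADER_SIZE + data_size;
  // An odd-sized file is followed by one padding byte that is
  // not part of its data.
  if (next % 2 != 0) next++;
  return next;
}
```

For example, the first header sits at offset 8, right after the signature. A 30-byte file puts the next header at offset 98, while a 31-byte file puts it at offset 100.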

Questions & Answers

Why are input files 4 and 5 a little larger than they should be?: Make sure you account for the padding character that .ar files add when the file is an odd number of bytes.
What is a uint8_t?: The “u” means unsigned, the “8” means 8 bits, and it’s an integer type. These are useful when you care more about the size of some data than exactly how it’s represented.
Could the character '\0' appear in the contents of a file?: Yes, but I won’t test this.
Should we print the slash and spaces in a file identifier?: No, just the filename.
Does our program have to work for empty files in the archive?: No.

Requirements

Your task is to implement the print_contents(uint8_t* data, size_t file_size) function in the starter code. This function will be called with a pointer to the beginning of an archive file’s entire contents, along with the size of the archive file. The function should print the name of each file followed by a newline, the contents of the file, and then another newline. The sample inputs will include files that end in newlines, so there should be a blank line after each file’s contents.
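Neither the file name nor the file data is null-terminated inside the archive, so printing them with a plain `%s` would read too far. The sketch below shows one way to write a single entry with explicit lengths. The helper `print_entry` and its `FILE*` parameter are hypothetical (the parameter is there so the output can be redirected for testing); it is not the required `print_contents` function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical helper: write one archive entry to `out` in the required
// format: the name, a newline, the contents, and another newline.
void print_entry(FILE* out, const char* name, size_t name_len,
                 const uint8_t* contents, size_t contents_size) {
  assert(out != NULL);
  // "%.*s" prints at most name_len characters, so no '\0' is needed.
  fprintf(out, "%.*s\n", (int)name_len, name);
  // fwrite copies exactly contents_size raw bytes.
  fwrite(contents, 1, contents_size, out);
  fputc('\n', out);
}
```

Passing `stdout` as the first argument would send the output to the terminal.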

Creating Archive Files

To test your archive reader, you will need some input archive files. The starter code includes an inputs directory that contains some simple test files. You can create your own .ar file if you would like to test additional inputs with a command like this one:

$ ar rcs output.ar input1.txt input2.txt

This will create a new file named output.ar that should work with your reader, as long as you create the file on a Linux machine. Like many old file formats, this one has many variants. On macOS, the ar tool produces a slightly different version of the format that your program does not need to support. Your program will work fine on the provided inputs even when you run it on a Mac, but if you want to make additional inputs you will have to do so on a Linux machine.

Working with Pointers

This assignment will almost certainly force you to manipulate pointers in an unfamiliar way. We’re used to using pointers to access consecutive values of the same type: arrays. This format instead intersperses headers with file data of variable length. That means you’re likely going to need to do addition on pointers. You can add constants to pointers, but it’s important that you understand how this works. Adding 5 to an int* will add 5 * sizeof(int) bytes to the pointer. Generally if you’re working with values of a known size you would use fixed-size types like uint8_t, uint16_t, uint32_t or uint64_t, which are guaranteed to be 8-bit, 16-bit, 32-bit, and 64-bit unsigned integers, respectively.
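To make the scaling rule concrete, the sketch below measures how far a pointer moves when 5 is added to it. All function names here are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Number of bytes between two addresses, measured with uint8_t pointers.
ptrdiff_t byte_distance(const void* from, const void* to) {
  assert(from != NULL && to != NULL);
  return (const uint8_t*)to - (const uint8_t*)from;
}

// Adding 5 to an int* skips five ints, which is 5 * sizeof(int) bytes.
ptrdiff_t bytes_skipped_by_int_pointer(void) {
  int values[8] = {0};
  int* p = values;
  return byte_distance(p, p + 5);
}

// Adding 5 to a uint8_t* skips exactly five bytes.
ptrdiff_t bytes_skipped_by_byte_pointer(void) {
  uint8_t bytes[8] = {0};
  uint8_t* p = bytes;
  return byte_distance(p, p + 5);
}
```

This is why a `uint8_t*` is convenient for walking through an archive: offsets such as a header size or a file size are already measured in bytes, so they can be added to the pointer directly.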

You might also want to create a struct to hold the file header data. You can add fields that are the appropriate size for each entry, but your compiler might try to insert additional padding between fields in the struct. To prevent this, you have to add the option __attribute__((packed)) to the struct definition. For example, the following struct will almost certainly include hidden padding bytes to bring the size up to a more reasonable value (8 bytes seems likely):

struct somestruct {
  int x;
  char chars[3];
};

If we instead want this struct to be packed together with no extra space (so it matches a specification like the one for our file header), we could write:

struct __attribute__((packed)) mystruct {
  int x;
  char chars[3];
};

If you skip this attribute your file header may contain unwanted padding bytes, so your reader will not access the correct values in the header. Whether you use a struct for the header or not, you will almost certainly need pointer math to get from one file header to the start of the next file header.
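The effect of the attribute can be checked with `sizeof`. The following is a minimal sketch assuming a GCC- or Clang-compatible compiler; the struct names `padded_example` and `packed_example` and the helper `padding_bytes` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

// Without the attribute, the compiler may add padding after chars.
struct padded_example {
  int x;
  char chars[3];
};

// With the attribute, the fields are laid out back to back.
struct __attribute__((packed)) packed_example {
  int x;
  char chars[3];
};

// The packed struct is exactly the sum of its field sizes.
_Static_assert(sizeof(struct packed_example) == sizeof(int) + 3,
               "packed struct should contain no padding");

// How many hidden padding bytes the compiler added to the unpacked version.
size_t padding_bytes(void) {
  return sizeof(struct padded_example) - sizeof(struct packed_example);
}
```

A similar `_Static_assert` on a header struct (60 bytes for the fields described earlier) is a cheap way to catch a layout mistake at compile time.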

Examples

The inputs directory includes five archives, each containing one more file than the previous one. These files all contain text, and include a mix of even and odd sizes. Here are the expected outputs for each input file. Pay close attention to the number of blank lines. The padding between the end of a file’s data and the next file header is a newline character, so incorrect implementations could potentially print one additional newline after odd-sized files.

$ ./print-archive inputs/input1.ar
a.txt
Greetings from the file a.txt

$ ./print-archive inputs/input2.ar
a.txt
Greetings from the file a.txt

b.txt
Hello from b.txt as well!

$ ./print-archive inputs/input3.ar
a.txt
Greetings from the file a.txt

b.txt
Hello from b.txt as well!

c.txt
Yet another hello, this time from c.txt.

$ ./print-archive inputs/input4.ar
a.txt
Greetings from the file a.txt

b.txt
Hello from b.txt as well!

c.txt
Yet another hello, this time from c.txt.

d.txt
An again, here's a hello from d.txt.

$ ./print-archive inputs/input5.ar
a.txt
Greetings from the file a.txt

b.txt
Hello from b.txt as well!

c.txt
Yet another hello, this time from c.txt.

d.txt
An again, here's a hello from d.txt.

e.txt
This is getting a bit old, but here's e.txt.