Table of Contents
Counting Words
Programming in C is a very important baseline skill for CSC4103. This exercise should make sure you’re comfortable with the basics of the language. In particular, you need to be fluent in working with struct
’s, linked data structures (e.g., lists), pointers, arrays, typedefs, and such. Before you start you might want to (re)read the Study Guide: C and attempt to answer all the questions there.
You will be writing a program called words
. words
is a program that counts (1) the total amount of words and (2) the frequency of each word in a file(s). It then prints the results to stdout
. Like most UNIX utilities in the real world, your program should read its input from each of the files specified as command line arguments, printing the cumulative word counts. If no file is provided, your program should read from stdin
.
In C, header files (suffixed by .h
) are how we engineer abstractions. They define the objects, types, methods, and – most importantly – documentation. The corresponding .c
file (i.e., the file with the same base name, just with the file extension .c
) provides the implementation of the abstraction. You should be able to write code with the header file without peeking under the covers at its implementation.
In this case, words/word_count.h
provides the definition of the struct word_count
, which we will use as a linked list to keep track of a word and its frequency. This has been typedef’d into WordCount
. This means that instead of typing out struct word_count
, we can use WordCount
as shorthand. The header file also gives us a list of functions that are defined in words/word_count.c
. Part of this assignment is to write code for these functions in words/word_count.c
.
We have provided you with a compiled version of sort_words
so that you do not need to write the wordcount_sort
function. However, you may still need to write your own comparator function (i.e. wordcount_less
). The Makefile
links this in with your two object files, words.o
and word_count.o
.
Note that words.o
is an ELF formatted binary. As such you will need to use a system which can run ELF executables to test your program (such as the CSC4103 Docker container). Windows and OS X do NOT use ELF and as such should not be used for testing.
For this section, you will be making changes to words/main.c
and words/word_count.c
. After editing these files, cd into the words
directory and run make
in the terminal. This will create the words executable. (Remember to run make
after making code changes to generate a fresh executable). Use this executable (and your own test cases) to test your program for correctness. The autograder will use a series of test cases to determine your final score for this section.
For the below examples, suppose we have a file called words.txt
that contains the following content:
abc def AaA
bbb zzz aaa
Using functions such as
rewind
andfseek
will cause some issues withstdin
that might not be apparent when testing locally, so you should avoid using those in your implementation.
Total count
Your first task will be to implement total word count. When executed, words
will print the total number of words counted to stdout. At this point, you will not need to make edits to words/word_count.c
. A complete implementation of the num_words
function along with the appropriate changes to words/main.c
can suffice.
A word is defined as a sequence of contiguous alphabetical characters of length greater than one. All words should be converted to their lower-case representation and be treated as not case-sensitive. All non-alphabetical characters are ignored and treated as word separators. The maximum length of a word has been defined at the top of words/main.c
, see the definition of the macro MAX_WORD_LEN
.
After completing this part, running ./words words.txt
should print:
The total number of words is: 6
Remember to support input from multiple files and standard input! Running ./words words.txt words.txt
should print:
The total number of words is: 12
To test reading from standard input, check that ./words < words.txt
prints:
The total number of words is: 6
As two final sanity checks, printf hi | ./words
should print:
The total number of words is: 1
and printf '' | ./words
should output:
The total number of words is: 0
Implementing a List of Words
The files words/word_count.h
and words/word_count.c
contain a skeleton implementation of a single-linked list. The next task for you is to complete the implementation of all of the functions in words/word_count.c
. In particular you will have to work on the following functions:
free_words
: Free the list ofWordCount
’s referred to by thewclist
argument. This will free all list nodes and all strings (words) stored in the list.len_words
: Return the length of the list ofWordCount
’s.find_word
: Return the pointer to theWordCount
element of the word, if it exists,NULL
otherwise.add_word
: If the given word is present in the givenWordCount
’s list, increment the count, otherwise insert a new list node with count 1. For your convenience, we have provided the functionnew_string
that you can use to create and fill the buffer holding the string (word) to be stored.
As an example, here is a possible implementation of len_words
:
ssize_t len_words(WordCount* wchead)
{
/* Return the length of the list of WordCount's. */
ssize_t len = 0;
while (wchead != NULL)
{
++len;
wchead = wchead->next;
}
return len;
}
You will be using this list of words for implementing the functionalities as described below. Please refer to the file words/main.c
in the starter code for more details and instructions.
Frequency count
Your next task will be to implement frequency counting. Your program should print each unique word as well as the number of times it occurred. This should be sorted in order of frequency (low first). The alphabetical ordering of words should be used as a tie breaker. The wordcount_sort
function has been defined for you in words/wc_sort.o
. However, you will need to implement the wordcount_less
function in main.c
.
You will need to implement the functions in words/word_count.c
to support the linked list data structure (i.e. WordCount
a.k.a. struct word_count
). The complete implementation of words/word_count.c
will prove to be useful when implementing count_words()
in words/main.c
. After completing this part, running ./words -f words.txt
should print:
1 abc
1 bbb
1 def
1 zzz
2 aaa
The output should be primary sorted based on the frequencies of the words. For words with equal frequencies, the ordering should be determined by the words’ alphabetical sequencing. All words should be converted to lower case before being processed.
Sorting is implemented for you by the function wordcount_sort
(by linking against words/wc_sort.o
), which however relies on calling a function named wordcount_less
for comparing two WordCount
instances that we have provided as a skeleton only. You will have to complete this function’s implementation in words/main.c
Hint: You can run the code below to verify the basic functionality of your program (don’t treat this as a testing spec though):
cat <filename> | tr " " "\n" | tr -s "\n" | tr "[:upper:]" "[:lower:]" | tr -d -C "[:lower:]\n" | sort | uniq -c | sort -n
Error handling
In this course, you should get accustomed to writing code defensively. More specifically, you should always include error handling code such that your code does not panic or segfault in the event of errors.
In this assignment, you will implement several functions, all of which have docstrings indicating what you should return if an error occurs during execution of the function. Make sure that whenever you call these functions in the rest of your code, you check the returned value and handle error cases appropriately.
In addition to handling errors, it is always good practice to perform argument validation at the beginning of a function. A simple example of argument validation is ensuring that pointers are not NULL.
Below are a few examples of naive code and their counterparts that handle errors appropriately.
Example 1
Without error handling:
char *new_string(char *str)
{
return strcpy((char *) malloc(strlen(str) + 1), str);
}
malloc
can potentially return a null pointer. If that happens, we will be passing a null pointer as the destination argument to strcpy
, which expects a non-null buffer.
With error handling:
char *new_string(char *str)
{
char *new_str = (char *) malloc(strlen(str) + 1);
if (new_str == NULL)
{
return NULL;
}
return strcpy(new_str, str);
}
Here, we decide explicitly how we want to handle the case where malloc
returns a null pointer. Since the function returns a char *
, it’s reasonable to return NULL
(note that in this assignment, we will be clear about what you should return in error cases). Note that the caller of this function should do error handling of its own by checking if the return value of new_string
is NULL
.
Example 2
Without error handling:
int main(int argc, char *argv[])
{
FILE *input_file;
// ...
char *file_name = ...
input_file = fopen(file_name, "r");
// Perform some logic using input_file.
}
Here, we perform some downstream tasks under the implicit assumption that the call to fopen
succeeded and that input_file
is a valid FILE *
. But what if fopen
fails? For example, the file with name file_name
may not exist. In that case, input_file
will be NULL
. If we try to dereference input_file
later, we can segfault or even encounter other undefined behavior that can lead to an uncontrolled crash of our program.
With error handling:
int main(int argc, char *argv[])
{
FILE *input_file;
// ...
char *file_name = ...
input_file = fopen(file_name, "r");
if (input_file == NULL)
{
fprintf(stderr, "File '%s' does not exist.\n", file_name);
return 1;
}
// Perform some logic using input_file.
}
Here, we address the case where fopen
fails and print a helpful error message before returning the error code 1, which will get propagated to main’s caller and allow for controlled exit of the program.
Next up: Limits.