pthread_words

Implementation

words and list_words operate in a single thread, opening, reading, and processing each file one after another. With pthread_words, your task is to provide the same end-to-end functionality while using multiple threads (one thread for each input file). This means you are not allowed to materialize your intermediate results (i.e. write each thread’s results to a separate file and aggregate these). Make sure to read through word_count.h and Makefile to see what macros are used for pthread_words. In particular, the word_count_list_t will differ from the same structure used for list_words.

pthread_words.c will serve as the driver program, meaning it will manage the creation and upkeep of the threads you create. Each file should be processed in a separate thread. We recommend you reference pthread.c to draw inspiration and clues as to how you should structure your code.

You are permitted (and explicitly encouraged) to re-use and adapt the relevant parts of the code you wrote in the ‘list_words’ part of this assignment. This will greatly simplify the task at hand.

As additional reading material, we suggest that you (re)read the Study Guides Threads and Synchonization before starting to work on this tasks.

Synchronization

When implementing words_count_pthread.c, keep in mind the functionality of all these methods are identical to what you did in words_count_list.c. However, you must use synchronization techniques to ensure coordination amongst different threads to prevent race conditions.

Your synchronization must be fine-grained. Different threads should be able to open and read their respective files concurrently, serializing only their modifications to shared data. In particular, it is unacceptable to use a global lock around the call to the count_words function in pthread_words.c, since it would prevent multiple threads from concurrently reading files. We reserve the right to deduct points from such implementations. Instead, you should only synchronize access to the word count list data structure in word_count_pthread.c. You will need to ensure all such modifications are complete before printing the result or terminating the process.

We recommend that you start by just implementing the thread-per-file aspect (i.e. without synchronization). This will be quite similar to the code you’ve written in word_count_list.c. Your program might not even error, since multithreaded programs with synchronization bugs may appear to work properly much of the time. Once you’re confident in the logic of your methods, then add in the necessary synchronization.

You can use the helper functions provided by word_helpers.h, especially the function count_words that can split the content of a given file into words and fill the given word list with the results. There is no need for you to implement the splitting of a file into words again - you have already done that in Assignment 0.

To help you find subtle synchronization bugs in your program, we have provided a decently large input for your words program in the gutenberg/ directory. These files were generated from select stories from Project Gutenberg, making sure to choose short stories so that the word count program does not take too long to run. Make sure to compare the results of running pthread_words to running words to check if your output is correct. As stated before, this does not ensure your synchronization is correct, but it might alert you to subtle synchronization bugs that may not manifest for smaller inputs.

To verify that your code does the right thing you may run make check_pthread. This will execute your pthread_words program by generating word counts for the files in the gutenberg/ directory.

These files were generated from select stories from Project Gutenberg, making sure to choose short stories so that the word count program does not take too long to run. The results directory also contains files (with the file extension .frequencies) that have been generated by a correctly working version of the code. make check_pthread compares the output generated by your version with the content of those files.

Questions

After correctly implementing pthread_words (i.e. passing autograder tests), compare list_words and pthread_words by answering the following questions in results/answers.md.

Q1: Briefly compare the performance of list_words and pthread_words when run on the Gutenberg dataset. How might you explain their relative performance? Hint: use the time command for this.

Q2: Under what circumstances would pthread_words perform better than list_words?

Q3: Under what circumstances would list_words perform better than pthread_words?

Q4: Is it possible to use multithreading in a way that pthread_words always performs better than list_words? Explain!

Next up: Submission.