pthread_words
Implementation
words
and list_words
operate in a single thread, opening, reading, and processing each file one after another. With pthread_words
, your task is to provide the same end-to-end functionality while using multiple threads (one thread for each input file). This means you are not allowed to materialize your intermediate results (i.e. write each thread’s results to a separate file and aggregate these). Make sure to read through word_count.h
and Makefile
to see what macros are used for pthread_words
. In particular, the word_count_list_t
will differ from the same structure used for list_words
.
pthread_words.c
will serve as the driver program, meaning it will manage the creation and upkeep of the threads you create. Each file should be processed in a separate thread. We recommend you reference pthread.c
to draw inspiration and clues as to how you should structure your code.
You are permitted (and explicitly encouraged) to re-use and adapt the relevant parts of the code you wrote in the ‘list_words’ part of this assignment. This will greatly simplify the task at hand.
As additional reading material, we suggest that you (re)read the Study Guides Threads and Synchonization before starting to work on this tasks.
Synchronization
When implementing words_count_pthread.c
, keep in mind the functionality of all these methods are identical to what you did in words_count_list.c
. However, you must use synchronization techniques to ensure coordination amongst different threads to prevent race conditions.
Your synchronization must be fine-grained. Different threads should be able to open and read their respective files concurrently, serializing only their modifications to shared data. In particular, it is unacceptable to use a global lock around the call to the count_words
function in pthread_words.c
, since it would prevent multiple threads from concurrently reading files. We reserve the right to deduct points from such implementations. Instead, you should only synchronize access to the word count list data structure in word_count_pthread.c
. You will need to ensure all such modifications are complete before printing the result or terminating the process.
We recommend that you start by just implementing the thread-per-file aspect (i.e. without synchronization). This will be quite similar to the code you’ve written in word_count_list.c
. Your program might not even error, since multithreaded programs with synchronization bugs may appear to work properly much of the time. Once you’re confident in the logic of your methods, then add in the necessary synchronization.
You can use the helper functions provided by
word_helpers.h
, especially the functioncount_words
that can split the content of a given file into words and fill the given word list with the results. There is no need for you to implement the splitting of a file into words again - you have already done that in Assignment 0.
To help you find subtle synchronization bugs in your program, we have provided a decently large input for your words program in the gutenberg/
directory. These files were generated from select stories from Project Gutenberg, making sure to choose short stories so that the word count program does not take too long to run. Make sure to compare the results of running pthread_words to running words to check if your output is correct. As stated before, this does not ensure your synchronization is correct, but it might alert you to subtle synchronization bugs that may not manifest for smaller inputs.
To verify that your code does the right thing you may run make check_pthread
. This will execute your pthread_words
program by generating word counts for the files in the gutenberg/
directory.
These files were generated from select stories from Project Gutenberg, making sure to choose short stories so that the word count program does not take too long to run. The results
directory also contains files (with the file extension .frequencies
) that have been generated by a correctly working version of the code. make check_pthread
compares the output generated by your version with the content of those files.
Questions
After correctly implementing pthread_words
(i.e. passing autograder tests), compare list_words
and pthread_words
by answering the following questions in results/answers.md
.
Q1: Briefly compare the performance of
list_words
andpthread_words
when run on the Gutenberg dataset. How might you explain their relative performance? Hint: use thetime
command for this.Q2: Under what circumstances would
pthread_words
perform better thanlist_words
?Q3: Under what circumstances would
list_words
perform better thanpthread_words
?Q4: Is it possible to use multithreading in a way that
pthread_words
always performs better thanlist_words
? Explain!
Next up: Submission.