Example MapReduce Job
This section walks through an example of running word count on the MapReduce cluster you will implement in this assignment. This is only an example; don't hardcode any of the numbers listed here. MapReduce may run differently depending on whether and when failures happen.
The order in which workers poll for tasks and the time it takes them to complete those tasks are non-deterministic. The specifics of which worker receives which task are not important for your implementation.
Setup
- The coordinator (`mr-coordinator`) is started.
- After a few seconds, 3 workers (`mr-worker`) are started. We'll refer to the workers as worker 1, worker 2, and worker 3.
Job submission
- A client (`mr-client`) submits this job via the `SUBMIT_JOB` RPC:

  ```
  files = [
    "data/gutenberg/p.txt",
    "data/gutenberg/q.txt",
    "data/gutenberg/r.txt",
    "data/gutenberg/s.txt"
  ]
  output_dir = "/tmp/map-reduce/gutenberg"
  app = "wc"
  n_reduce = 2
  args = [1, 2, 3]
  ```

  The args are not used by the word count `map` and `reduce` functions, but they are included to show you how they should be used for other applications that may depend on args. Since there are 4 input files, there are 4 `map` tasks (one per input file). Since `n_reduce` is 2, there are two `reduce` tasks.
- The coordinator accepts the job, assigns it an ID of 0, and returns this ID to the client. The job is added to the coordinator's queue.
- The client periodically polls the status of the job using the `POLL_JOB` RPC with `job_id = 0` to see when it completes or fails.
Map Task Execution
- Worker 1 polls the coordinator for a task, and is assigned `map` task 0 for job 0.
- Worker 2 polls the coordinator for a task, and is assigned `map` task 1 for job 0. Similarly, worker 3 is assigned `map` task 2 for job 0.
- Each worker executes its assigned `map` task. This part is already implemented for you, so you do not have to understand it fully. If you are interested, you can look at these notes for a more concrete example.
  - They create `n_reduce` (2 in this case) temporary files on disk with names `mr-i-j`, where `i` is the `map` task and `j` is the `reduce` task. We'll refer to these intermediate output files as buckets.
  - They read the input file into memory (e.g., `map` task 0 would read `data/gutenberg/p.txt`).
  - They call the `map` function corresponding to the `wc` application. The key is the input filename; the value is the file's contents. The auxiliary args `[1, 2, 3]` are also passed to the `map` function.
  - They iterate over the resulting key-value pairs. For each KV pair:
    - The key is hashed using the `ihash` function in `lib/lib.h`.
    - A reduce bucket is selected by computing `ihash(key) % n_reduce`.
    - The KV pair is written into the corresponding bucket using the length-delimited writer implemented in `codec/codec.c`. The key is written first, then the value.
- Workers 1, 2, and 3 finish their `map` tasks and notify the coordinator.
- The coordinator assigns the final `map` task (task 3) to worker 1. Workers 2 and 3 sit idle since there are no available tasks (`map` tasks must finish before `reduce` tasks can be assigned). They periodically poll the coordinator to see if new tasks are available.
- Worker 1 finishes its `map` task and notifies the coordinator. All `map` tasks are now complete.
Reduce Task Execution
- Workers 1 and 2 poll the coordinator and are assigned `reduce` tasks 0 and 1, respectively. Immediately after this, worker 2 crashes.
- Worker 1 begins executing `reduce` task 0. It reads in the relevant key-value pairs from the intermediate files created during the map stage that correspond to `reduce` task 0.
- Worker 1 concatenates all the key-value pairs it obtains, and then sorts the pairs by key. Once again, this logic is implemented for you, but feel free to look at these notes or the starter code for a better understanding of how it works.
  - For each run of key-value pairs corresponding to the same key `K`, worker 1 does the following:
    - Calls the word count `reduce` function with key `K`, the list of values corresponding to `K`, and auxiliary args `[1, 2, 3]`. The `reduce` function returns a single value `V`.
    - Writes `(K, V)` to the output file `/tmp/map-reduce/gutenberg/mr-out-0`.
- Worker 1 completes `reduce` task 0 and notifies the coordinator.
- Workers 1 and 3 continually poll the coordinator for tasks. After some time passes without a finished-task notification from worker 2, the coordinator will realize that worker 2 has crashed. Since worker 2 was executing `reduce` task 1, the coordinator will schedule `reduce` task 1 for re-execution.
- The next time worker 3 polls the coordinator, it is told to execute `reduce` task 1. Worker 1 continues to wait.
- The `reduce` task completes successfully, and the coordinator is notified. The coordinator notes that all tasks for job 0 have been completed.
Job Completion
- On the next `POLL_JOB` RPC issued by the MapReduce client, the coordinator notifies the client that the job was completed.
- The client runs some post-processing on the MapReduce output files to convert them to a human-readable format. Our autograder will inspect this final output.