Fault Tolerance
This part is optional and is worth 5 points of extra credit on this assignment.
Please keep in mind that the value of the extra credit may not be proportionate to the time it takes to complete it.
In this part, you will complete your MapReduce system by implementing fault tolerance. Specifically, you will update your coordinator to handle worker crashes and failure.
You should handle worker crashes by detecting when a worker has failed, and reassigning relevant tasks to other workers. In this system, we will consider a worker failed if it hasn’t completed its assigned task within TASK_TIMEOUT_SECS seconds.
Worker crashes
Your coordinator should be able to determine whether a worker has died. This should be implemented by checking whether an incomplete task was assigned longer than TASK_TIMEOUT_SECS seconds ago.
Completed tasks should not be reassigned, since their output is stored on disk.
Since the wait times we define happen to be multiples of seconds, you may use time.h. An example is provided below:
#include <stdio.h>
#include <time.h>
#include <unistd.h>
int main () {
time_t start = time(NULL);
sleep(5);
printf("Seconds since start of program = %ld\n", time(NULL) - start);
return 0;
}
More accurate waiting (Optional)
While we will not be testing the accuracy of your wait times, you may find the code snippet below helpful if you want to support millisecond precision for wait times. Keep in mind that improving wait time precision is entirely optional.
#include <sys/time.h> suseconds_t micros_elapsed(struct timeval time) { struct timeval current_time; gettimeofday(¤t_time, NULL); return (current_time.tv_sec - time.tv_sec) * 1000000 + current_time.tv_usec - time.tv_usec; }
In order to use this effectively, your coordinator will also need to store when each map or reduce task is assigned. Modify your data structures to keep track of the assigned time of each task as either a time_t or a struct time_val, and update this field appropriately using the time or get_time_of_day functions.
Once you have done this, implement the logic for reassigning tasks that are not completed within TASK_TIMEOUT_SECS seconds.