Table of Contents

Assignment 3: MapReduce

Due Monday, December 8, 11:59 pm Central Time

Submissions handed in by the due date receive a small on-time bonus.

All students are granted a pre-approved extension or “grace period” of 24 hours after the due date. Late submissions are accepted during the grace period with no penalty.

The grace period expires Tuesday, December 9, 11:59 pm Central Time, after which no further late submissions are accepted (or are even possible).

This project is an individual homework. No group work is allowed.

Introduction

MapReduce is a programming model for scalable and highly parallelized data processing. It abstracts away the complexities of developing a fault tolerant distributed system by exposing a simple API where the user specifies two functions:

  • map: produces a set of key/value pairs from the input data
  • reduce: combines values corresponding to the same key

With these functions, tasks can be automatically parallelized and executed on a cluster. This paradigm is particularly powerful since it allows programmers with little background in distributed systems to write parallelizable code for a wide variety of real world tasks.

In this assignment, which is loosely based on a lab from MIT, you will be implementing your own fault tolerant MapReduce system in C. Specifically, you will be implementing a coordinator process that distributes tasks to worker processes that carry out the actual computation. You will also handle worker failure by implementing task redistribution. The system design you will be building is similar to that outlined in the MapReduce paper.

In this assignment, we use the term ‘coordinator’ rather than ‘master’.

The purpose of this assignment is to teach you about distributed algorithms, so we will not be too strict on memory management (i.e. we will not test that you free all of the data you allocate). Of course, poor memory management that causes segmentation faults or other major issues will cause you to fail tests, but do not spend a lot of time worrying about memory leaks that do not impact the functionality of your code.

RPC lab

To help you get a basic idea of how the coordinator will communicate with workers, we have put together a walk-though that demonstrates through how rpcgen works. rpcgen is a protocol compiler that helps with writing Remote Procedure Call (RPC) applications in C. This walk-though is worth 5% of your grade on the entire assignment. The code you write for the walk-though will be in a separate directory (walk-though-rpc) from the rest of the Map Reduce code (hw-map-reduce). To trigger the autograder for the walk-though portion, simply push everything to your repository. There are no extensions for the walk-though portion of this assignment.

Final

The final component consists of the rest of the tasks (through Fault tolerance).

Project Tasks

Here is a list of documents that we suggest you read and work on in sequence. This will help to minimize the time you spend on every step between setting up your environment and submitting your work.