[Distributed File System] MapReduce: simplified data processing on large clusters
Overview
This is a summary of the paper "MapReduce: simplified data processing on large clusters".
What is the paper trying to do?
Inspired by the map and reduce primitives of functional languages, the paper introduces a new abstraction in which user-written map and reduce operations let programs parallelize large computations easily. In short, it allows anyone to execute programs with parallelization, fault tolerance, data distribution, and load balancing without dealing with the mess that usually comes with them.
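As a rough illustration, here is a minimal, single-process sketch of the paper's word-count example in Python. The names map_fn, reduce_fn, and run_mapreduce, along with the in-memory shuffle, are stand-ins of my own; the real library distributes these phases across thousands of machines.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Map phase: emit an intermediate (word, 1) pair for every word.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce phase: sum all counts emitted for the same word.
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy "shuffle": group intermediate values by key, which the real
    # MapReduce library does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

docs = [("a.txt", "the quick brown fox"), ("b.txt", "the lazy dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```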
What do you think is the contribution of the paper?
The major contribution of this paper is an interface that distributes large-scale computations using MapReduce, achieving ‘automatic parallelization’. Another big contribution is an implementation of this interface on large clusters of commodity PCs.
What are its major strengths?
- The model is easy to use.
  - Because it hides the details of parallelization, fault tolerance, locality optimization, and load balancing, even programmers without experience in parallel and distributed systems can use it.
- Many problems can be expressed as MapReduce computations.
  - The use cases of MapReduce are vast, ranging from web search services to sorting and data mining (see the inverted-index sketch after this list).
- The authors developed an implementation of MapReduce that scales to large clusters of thousands of machines.
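To illustrate how different problems fit the same shape, below is a hypothetical sketch of the paper's inverted-index example: map emits (word, document) pairs and reduce collects the sorted list of documents containing each word. The grouping loop plays the same toy shuffle role as in the word-count sketch above.

```python
from collections import defaultdict

def index_map(doc_name, contents):
    # Emit (word, doc_name) for every distinct word in the document.
    for word in set(contents.split()):
        yield word, doc_name

def index_reduce(word, doc_names):
    # Collect the sorted list of documents that contain the word.
    return word, sorted(doc_names)

docs = [("a.txt", "the quick brown fox"), ("b.txt", "the lazy dog")]
groups = defaultdict(list)
for name, text in docs:
    for word, doc in index_map(name, text):
        groups[word].append(doc)
print([index_reduce(w, ds) for w, ds in sorted(groups.items())])
# [('brown', ['a.txt']), ('dog', ['b.txt']), ('fox', ['a.txt']),
#  ('lazy', ['b.txt']), ('quick', ['a.txt']), ('the', ['a.txt', 'b.txt'])]
```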
Resources:
MapReduce: simplified data processing on large clusters (https://dl.acm.org/doi/10.1145/1327452.1327492)