[Distributed File System] MapReduce: simplified data processing on large clusters

Overview
This is a summary of the paper "MapReduce: Simplified Data Processing on Large Clusters".

**What is the paper trying to do?**

Inspired by the map and reduce primitives in functional languages, the paper introduces a new abstraction in which map and reduce operations allow a program to parallelize large computations easily. In short, it lets anyone execute programs with parallelization, fault tolerance, data distribution, and load balancing without dealing with the mess that usually comes with them (see the sketch at the end of this summary).

**What do you think is the contribution of the paper?**

The major contribution of this paper is an interface that automatically parallelizes and distributes large-scale computations expressed as MapReduce programs. Another big contribution is an implementation of this interface that runs on large clusters of commodity PCs.

**What are its major strengths?**

The model is easy to use. Because it hides the details of parallelization, fault tolerance, locality optimization, and load balancing, it is usable even by programmers without experience in parallel and distributed systems.
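To make the abstraction concrete, here is a minimal sketch of the paper's canonical word-count example in plain Python. The names `map_fn`, `reduce_fn`, and `run_mapreduce`, and the single-process in-memory shuffle, are illustrative assumptions for this summary; a real MapReduce runtime executes these phases across many machines.

```python
from collections import defaultdict

# Map: for each (document name, contents) input pair, emit (word, 1).
def map_fn(key, value):
    for word in value.split():
        yield word, 1

# Reduce: for each word, sum all of its emitted counts.
def reduce_fn(key, values):
    yield key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle phase: group intermediate values by key.
    # A real implementation partitions this work across reduce workers.
    groups = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    # Reduce phase: apply reduce_fn to each key's grouped values.
    results = {}
    for ik, ivs in sorted(groups.items()):
        for rk, rv in reduce_fn(ik, ivs):
            results[rk] = rv
    return results

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The point the paper makes is that the user writes only the map and reduce functions; the runtime takes care of partitioning the input, scheduling tasks across the cluster, and recovering from worker failures.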