[Cloud Computing] Spark: Cluster Computing with Working Sets
What is the paper trying to do? What do you think is the contribution of the paper?
The paper, “Spark: Cluster Computing with Working Sets”, presents Spark, a new framework that retains the scalability and fault tolerance of MapReduce while supporting applications that reuse a working set of data across multiple parallel operations. Such applications perform poorly in MapReduce, which only handles acyclic data flow graphs well, but they work well in Spark. Spark achieves this through an abstraction called “resilient distributed datasets (RDDs)”: a “read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost”. Although Spark is still a prototype, the authors demonstrate that Spark can outperform “Hadoop by 10x in iterative machine learning workloads and can be used interactively to scan a 39GB dataset with sub-second latency”.
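A minimal sketch of this cached-working-set idea, written against today's Spark Scala API rather than the 2010 prototype's exact API (the app name, the local master setting, and the HDFS path are placeholders I am assuming for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ErrorCount {
  def main(args: Array[String]): Unit = {
    // Standalone context; the master URL would normally come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("ErrorCount").setMaster("local[*]"))

    // Build an RDD from a log file; the path is a placeholder.
    val logLines = sc.textFile("hdfs://...")

    // Derive a new RDD and ask Spark to keep it in memory after the first
    // computation, so later parallel operations reuse the cached working set
    // instead of re-reading the file from disk.
    val errors = logLines.filter(_.contains("ERROR")).cache()

    // Two parallel operations over the same cached RDD.
    val totalErrors = errors.count()
    val timeoutErrors = errors.filter(_.contains("timeout")).count()

    println(s"errors=$totalErrors timeouts=$timeoutErrors")
    sc.stop()
  }
}
```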
What are its major strengths?
The main strength of Spark lies in its resilient distributed datasets (RDDs):
- RDDs let users explicitly cache a working set of data in memory across machines and reuse it in multiple MapReduce-like parallel operations.
- RDDs are fault tolerant. If a partition of an RDD is lost, the RDD retains enough information about how that partition was derived from other datasets to reconstruct just the lost piece.
- Spark provides three simple data abstractions: RDDs and two restricted types of shared variables, broadcast variables and accumulators. These are expressive enough to capture applications that currently pose challenges for existing cluster computing frameworks (see the sketch after this list).
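To make the shared-variable bullet concrete, here is a hedged sketch of the two restricted types. It uses present-day Spark API names (sc.broadcast, sc.longAccumulator, sc.parallelize), which differ slightly from the 2010 prototype described in the paper, and the lookup table and records are invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SharedVariables").setMaster("local[*]"))

    // Broadcast variable: ship a read-only lookup table to each worker once,
    // instead of copying it with every closure.
    val countryNames = sc.broadcast(Map("US" -> "United States", "FR" -> "France"))

    // Accumulator: workers can only add to it; only the driver reads the result.
    val badRecords = sc.longAccumulator("badRecords")

    val records = sc.parallelize(Seq("US,3", "FR,5", "??,1"))
    val parsed = records.flatMap { line =>
      val Array(code, value) = line.split(",")
      countryNames.value.get(code) match {
        case Some(name) => Some(name -> value.toInt)
        case None       => badRecords.add(1); None   // count unparseable rows
      }
    }

    // The accumulator value is only reliable after an action forces evaluation.
    parsed.collect().foreach(println)
    println(s"bad records: ${badRecords.value}")
    sc.stop()
  }
}
```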