0 votes
1 view
in Big Data Hadoop & Spark by (11.5k points)

The Spark research paper has prescribed a new distributed programming model over classic Hadoop MapReduce, claiming the simplification and vast performance boost in many cases specially on Machine Learning. However, the material to uncover the internal mechanics on Resilient Distributed Datasets with Directed Acyclic Graph seems lacking in this paper.

1 Answer

0 votes
by (31.4k points)
edited by

In spark all the jobs comprises of a series of operators and run on a set of data. And in a job all the operators are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible. For instance, let’s assume that you have to submit a Spark job which contains a map operation followed by a filter operation. Spark DAG optimizer would rearrange the order of these operators, as filtering would reduce the number of records to undergo map operation.

At high level, when any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.

  • The DAG scheduler divides operators into stages of tasks. A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together. For e.g. Many map operators can be scheduled in a single stage. The final result of a DAG scheduler is a set of stages.

  • The Stages are passed on to the Task Scheduler.The task scheduler launches tasks via cluster manager (Spark Standalone/Yarn/Mesos). The task scheduler doesn't know about the dependencies of the stages.

  • The Worker executes the tasks on the Slave.

Let's take an example of counting how many log messages appear at each level of severity,

Following is the log file that starts with the severity level,

INFO I'm Info message

WARN I'm a Warn message

INFO I'm another Info message

and the following scala code extracts the same,

val input = sc.textFile("log.txt")

val splitedLines = input.map(line => line.split(" "))

                        .map(words => (words(0), 1))

                        .reduceByKey{(a,b) => a + b}

This sequence of commands implicitly defines a DAG of RDD objects (RDD lineage) that will be used later when an action is called. Each RDD maintains a pointer to one or more parents along with the metadata about what type of relationship it has with the parent. For example, when we call val b = a.map() on a RDD, the RDD b keeps a reference to its parent a, that's a lineage.

To display the lineage of an RDD, Spark provides a debug method toDebugString(). For example executing toDebugString() on the splitedLines RDD, will output the following:

(2) ShuffledRDD[6] at reduceByKey at <console>:25 []

    +-(2) MapPartitionsRDD[5] at map at <console>:24 []

    |  MapPartitionsRDD[4] at map at <console>:23 []

    |  log.txt MapPartitionsRDD[1] at textFile at <console>:21 []

    |  log.txt HadoopRDD[0] at textFile at <console>:21 []

The first line (from the bottom) shows the input RDD. We created this RDD by calling sc.textFile(). Below is the more diagrammatic view of the DAG graph created from the given RDD.


If you want to know more about Spark, then do check out this awesome video tutorial:

Welcome to Intellipaat Community. Get your technical queries answered by top developers !