in Data Science by (11.4k points)

I have finished installing Spark and have run a few test cases with master and worker nodes set up. Even so, I am quite confused about what exactly a job means in the context of Spark (not SparkContext). I have the following questions:

  • How is a job different from a driver program?
  • Is the application itself part of the driver program?

I read the Spark documentation, but this is still not clear to me.

Kindly help with an example if possible. It would be very helpful.

1 Answer

by (32.3k points)

Generally, a job can be described as a piece of code that reads some input from HDFS or local storage, performs some computation on the data, and writes some output data.

Spark has its own definition of "job".

In Spark, a job is a parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect).

Let's say you need to do the following:

  1. Load a file with people's names and addresses into RDD1

  2. Load a file with people's names and phone numbers into RDD2

  3. Join RDD1 and RDD2 by name, to get RDD3

  4. Map on RDD3 to get a nice HTML presentation card for each person as RDD4

  5. Save RDD4 to file

  6. Map RDD1 to extract zipcodes from the addresses to get RDD5

  7. Aggregate on RDD5 to get a count of how many people live in each zipcode as RDD6

  8. Collect RDD6 and print these stats to stdout


So, now the driver program is this entire piece of code, running all 8 steps.

Step 5 is a job, because that is where the entire HTML card set is actually produced (save is an action, not a transformation). Similarly, the collect in step 8 creates a second job.
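To make the "actions create jobs" point concrete, here is a minimal sketch of the 8 steps in plain Python. It is a toy model, not Spark's API: `ToyRDD`, `from_list`, `save`, `jobs_run` and the sample people are all made up for illustration. Transformations only record what to do, and each action runs the recorded work and counts as one job.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class)."""

    jobs_run = 0  # how many actions (jobs) have been triggered so far

    def __init__(self, compute):
        self._compute = compute  # zero-argument function that builds the records

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    # Transformations: return a new ToyRDD, no work happens yet.
    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self._compute()])

    def join(self, other):
        def run():
            right = {}
            for k, v in other._compute():
                right.setdefault(k, []).append(v)
            return [(k, (v, w)) for k, v in self._compute() for w in right.get(k, [])]
        return ToyRDD(run)

    def reduce_by_key(self, fn):
        def run():
            acc = {}
            for k, v in self._compute():
                acc[k] = fn(acc[k], v) if k in acc else v
            return list(acc.items())
        return ToyRDD(run)

    # Actions: run the recorded work, so each call counts as one job.
    def collect(self):
        ToyRDD.jobs_run += 1
        return self._compute()

    def save(self, sink):
        ToyRDD.jobs_run += 1
        sink.extend(self._compute())


# Steps 1-2: made-up sample data instead of files.
rdd1 = ToyRDD.from_list([("Ann", "1 Main St, 10001"),
                         ("Bob", "2 Oak Ave, 10001"),
                         ("Cy", "3 Elm Rd, 94105")])
rdd2 = ToyRDD.from_list([("Ann", "555-0100"), ("Bob", "555-0101"), ("Cy", "555-0102")])

rdd3 = rdd1.join(rdd2)                                                     # step 3
rdd4 = rdd3.map(lambda kv: f"<div>{kv[0]}: {kv[1][0]}, {kv[1][1]}</div>")  # step 4
jobs_before_save = ToyRDD.jobs_run      # still 0: only transformations so far

cards = []
rdd4.save(cards)                                                           # step 5: first job

rdd5 = rdd1.map(lambda kv: (kv[1].split(", ")[-1], 1))                     # step 6
rdd6 = rdd5.reduce_by_key(lambda a, b: a + b)                              # step 7
stats = rdd6.collect()                                                     # step 8: second job

print(stats, "jobs:", ToyRDD.jobs_run)
```

Note that in this toy model `rdd1` is rebuilt for each of the two jobs, since nothing is kept between actions.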

The other steps are organized into stages, with each job being the result of a sequence of stages. For simple things a job can have a single stage, but the need to repartition data (for example, the join in step 3) or anything else that breaks the locality of the data usually causes more stages to appear. You can think of stages as computations that produce intermediate results, which can, in fact, be persisted.
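As a rough illustration of why repartitioning splits a job into stages, here is a plain-Python sketch of the zipcode count (steps 6 and 7). The function names, the sample zipcodes and the modulo partitioner are assumptions made for this example. Real Spark plans stages from the RDD lineage, which this does not attempt to model.

```python
from collections import defaultdict

def narrow_stage(partitions, fn):
    """Apply fn to every record inside its own partition (no data movement)."""
    return [[fn(rec) for rec in part] for part in partitions]

def shuffle(partitions, num_out):
    """Regroup (key, value) records so equal keys land in the same partition."""
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            # Toy partitioner: only works for numeric-string keys like zipcodes.
            out[int(key) % num_out].append((key, value))
    return out

def reduce_stage(partitions):
    """Sum the values per key inside each partition."""
    result = []
    for part in partitions:
        totals = defaultdict(int)
        for key, value in part:
            totals[key] += value
        result.append(dict(totals))
    return result

# Two input partitions of zipcodes (made-up sample data).
zipcodes = [["10001", "94105"], ["10001", "60614"]]

stage1 = narrow_stage(zipcodes, lambda z: (z, 1))  # stage 1: map, data stays local
shuffled = shuffle(stage1, num_out=2)              # stage boundary: data moves
stage2 = reduce_stage(shuffled)                    # stage 2: aggregate per key

counts = {k: v for part in stage2 for k, v in part.items()}
print(counts)
```

The map step can run on each partition independently, but the count per zipcode cannot be finished until all records for that zipcode sit in the same partition, which is what forces the second stage.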

Now, since we'll be using RDD1 more than once, we can persist it, avoiding recomputation.
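Here is a small plain-Python sketch of what persisting buys you. `LazyData` and `count_loads` are hypothetical names for a toy model, not Spark classes. It counts how often the source data is loaded when two separate actions both depend on RDD1, with and without persisting it.

```python
class LazyData:
    """Toy lazy dataset (a hypothetical stand-in, not Spark's RDD class)."""

    def __init__(self, compute):
        self._compute = compute  # function that builds the records
        self._keep = False       # switched on by persist()
        self._cache = None

    def persist(self):
        self._keep = True
        return self

    def _records(self):
        if not self._keep:
            return self._compute()         # recomputed on every use
        if self._cache is None:
            self._cache = self._compute()  # computed once, then reused
        return self._cache

    def map(self, fn):
        return LazyData(lambda: [fn(r) for r in self._records()])

    def collect(self):  # action
        return list(self._records())


def count_loads(use_persist):
    """Run two actions that both depend on rdd1 and count the source loads."""
    loads = 0

    def load_people():
        nonlocal loads
        loads += 1
        return [("Ann", "10001"), ("Bob", "94105")]  # made-up sample data

    rdd1 = LazyData(load_people)
    if use_persist:
        rdd1.persist()
    rdd1.map(lambda p: p[0]).collect()  # first job that needs rdd1
    rdd1.map(lambda p: p[1]).collect()  # second job that needs rdd1
    return loads


print(count_loads(use_persist=False), count_loads(use_persist=True))
```

Without persisting, every action walks back to the source and loads it again. With persisting, the first action fills the cache and the second one reuses it.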

To conclude, we basically talked about how the logic of a given algorithm gets broken down into jobs and stages. A task, in turn, is the unit of work that pushes one particular piece (partition) of the data through a given stage, on a given executor.
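A minimal sketch of the task idea, again in plain Python rather than Spark: one stage is run as one task per partition, with a thread pool standing in for the executors. `run_stage`, `extract_zip`, `num_executors` and the addresses are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def run_stage(partitions, stage_fn, num_executors=2):
    """Run one task per partition; each task applies stage_fn to its partition."""
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        # pool.map returns results in the same order as the input partitions.
        results = list(pool.map(stage_fn, partitions))
    return results, len(partitions)

def extract_zip(partition):
    """The stage's work on one partition: take the zipcode out of each address."""
    return [address.split(", ")[-1] for address in partition]

# Three partitions of made-up addresses, so this stage has three tasks.
partitions = [
    ["1 Main St, 10001", "2 Oak Ave, 10001"],
    ["3 Elm Rd, 94105"],
    ["4 Pine Ln, 60614"],
]

results, num_tasks = run_stage(partitions, extract_zip)
print(num_tasks, results)
```

The same stage function runs in every task, and only the partition it works on differs.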

I hope this gives you a better understanding of “job” in Spark.
