Generally, a job is a piece of code that reads some input from HDFS or the local file system, performs some computation on the data, and writes some output.
Spark has its own definition of "job".
In Spark, a job is a parallel computation consisting of multiple tasks that get spawned in response to a Spark action (e.g. save, collect).
Let's say you need to do the following:
1. Load a file with people's names and addresses into RDD1
2. Load a file with people's phone numbers and names into RDD2
3. Join RDD1 and RDD2 by name to get RDD3
4. Map RDD3 to get a nice HTML presentation card for each person as RDD4
5. Save RDD4 to file
6. Map RDD1 to extract zipcodes from the addresses, giving RDD5
7. Aggregate RDD5 to count how many people live in each zipcode, giving RDD6
8. Collect RDD6 and print these stats to stdout
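The 8 steps above can be sketched in PySpark roughly as follows (the file names, the CSV-like input format, and the HTML template are all assumptions; running it requires a Spark installation):

```python
from pyspark import SparkContext

sc = SparkContext(appName="people-cards")

# Steps 1-2: load the inputs (assumed format: "name,address" and "name,phone").
rdd1 = sc.textFile("people_addresses.txt").map(lambda l: tuple(l.split(",", 1)))  # (name, address)
rdd2 = sc.textFile("people_phones.txt").map(lambda l: tuple(l.split(",", 1)))     # (name, phone)

# Step 3: join by name -> (name, (address, phone)). This forces a shuffle.
rdd3 = rdd1.join(rdd2)

# Step 4: render a (hypothetical) HTML card per person.
rdd4 = rdd3.map(lambda kv: "<div>{}<br/>{}<br/>{}</div>".format(kv[0], kv[1][0], kv[1][1]))

# Step 5: save is an action -> the first job.
rdd4.saveAsTextFile("cards")

# Step 6: extract the zipcode (assumed to be the last token of the address).
rdd5 = rdd1.map(lambda kv: kv[1].split()[-1])

# Step 7: count people per zipcode.
rdd6 = rdd5.map(lambda z: (z, 1)).reduceByKey(lambda a, b: a + b)

# Step 8: collect is an action -> the second job.
for zipcode, count in rdd6.collect():
    print(zipcode, count)
```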
So, now the driver program is this entire piece of code, running all 8 steps.
Step 5 will be considered a job, as in this step the entire HTML card set is produced (save is an action, not a transformation). Similarly, the collect in step 8 creates another job.
The other steps will be sorted into stages, with each job being the result of a sequence of stages. For simple things a job can have a single stage, but the need to repartition data (for example, the join in step 3) or anything else that breaks the locality of the data usually causes more stages to appear. You can think of stages as computations that produce intermediate results, which can, in fact, be persisted.
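You can see where Spark draws the stage boundaries by printing an RDD's lineage with `toDebugString`: each new indentation level in the output corresponds to a shuffle, i.e. a new stage. A minimal sketch (again assuming a Spark runtime):

```python
from pyspark import SparkContext

sc = SparkContext(appName="stages-demo")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
counts = pairs.reduceByKey(lambda x, y: x + y)  # reduceByKey forces a shuffle

# The lineage shows a ShuffledRDD at the point where a new stage begins.
print(counts.toDebugString().decode())
```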
Now, since we'll be using RDD1 more than once (in steps 3 and 6), we can persist it and avoid recomputing it.
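Persisting is a one-line call on the RDD. `cache()` keeps it in memory after its first computation; `persist()` lets you choose a storage level explicitly. A small sketch (the sample data is made up):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persist-demo")
rdd1 = sc.parallelize([("alice", "12 Main St 94110"), ("bob", "3 Oak Ave 10001")])

# Without this, both the save job (step 5) and the collect job (step 8)
# would recompute rdd1 from its source.
rdd1.persist(StorageLevel.MEMORY_ONLY)  # equivalent to rdd1.cache()
```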
To conclude: we have basically talked about how the logic of a given algorithm gets broken down. Also, a task is the unit of work that runs a given stage's computation on one partition of the data, on a given executor.
I hope this helped you better understand what a "job" is in Spark.