Is my understanding right?

  1. Application: one spark-submit.

  2. Job: once a lazy evaluation is triggered, there is a job.

  3. Stage: it is related to the shuffle and the transformation type. I find it hard to understand where the boundary of a stage is.

  4. Task: it is the unit of operation. One transformation per task, one task per transformation.

Any help improving this understanding would be appreciated.

1 Answer

Yes, you are going in the right direction. Just keep a few things in mind.

  • The application corresponds to the driver program, that is, the main function that creates the SparkContext. One spark-submit launches one application.

  • Whenever you apply an action on an RDD, a "job" is created. Jobs are work submitted to Spark.

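  • Jobs are divided into "stages" at shuffle boundaries. Narrow transformations such as map and filter are pipelined together in the same stage; a wide transformation that needs a shuffle, such as reduceByKey or groupByKey, starts a new stage.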

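  • Moving forward, each stage is divided into tasks based on the number of partitions in the RDD: one task per partition. A task runs all the pipelined transformations of its stage on that one partition, so it is not one task per transformation. Tasks are therefore the smallest units of work in Spark.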
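
To make the stage and task breakdown above concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: the names split_into_stages, tasks_per_stage, and WIDE_TRANSFORMATIONS are hypothetical, the set of shuffle-causing transformations is an assumed, non-exhaustive list, and placing a wide transformation entirely in the new stage is a simplification of what Spark actually does.

```python
# Toy model only: none of these names are Spark APIs.
# Assumed (non-exhaustive) set of transformations that usually trigger a shuffle.
WIDE_TRANSFORMATIONS = {"reduceByKey", "groupByKey", "sortByKey", "repartition"}


def split_into_stages(transformations):
    """Cut a linear chain of transformation names into stages at each shuffle."""
    stages = [[]]
    for name in transformations:
        if name in WIDE_TRANSFORMATIONS:
            # Simplification: the post-shuffle side of a wide transformation
            # opens a new stage.
            stages.append([])
        stages[-1].append(name)
    return stages


def tasks_per_stage(transformations, input_partitions, shuffle_partitions):
    """One task per partition: input partitions first, shuffle partitions afterwards."""
    stages = split_into_stages(transformations)
    return [input_partitions] + [shuffle_partitions] * (len(stages) - 1)


# Hypothetical job: the chain of transformations that one action would trigger.
lineage = ["map", "filter", "reduceByKey", "map"]
stages = split_into_stages(lineage)
tasks = tasks_per_stage(lineage, input_partitions=4, shuffle_partitions=2)

for index, (stage, task_count) in enumerate(zip(stages, tasks)):
    print(f"stage {index}: {stage} -> {task_count} tasks")
```

In this sketch the job splits into two stages, [map, filter] and [reduceByKey, map], with 4 and 2 tasks respectively, so 6 tasks in total. Note that map and filter share the same 4 tasks, since each task applies both to its own partition.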

