When we submit a Spark job for execution, whether locally or on a cluster, its behavior depends largely on one component: the “Driver”. Where the “Driver” component of the Spark job resides defines the behavior of the job.
Basically, there are two “deploy modes” in Spark: “client mode” and “cluster mode”. Let’s discuss each in detail.
Spark Client Mode
In this mode, the “driver” component of the Spark job runs on the machine from which the job is submitted. Hence, this mode is called “client mode”.
This mode works well when the job-submitting machine is within or near the “Spark infrastructure”, since there is little network latency for the data movement needed to produce the final result between the “Spark infrastructure” and the “driver”.
When the job-submitting machine is far from the “Spark infrastructure” and network latency is high, this mode does not perform well.
The driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all the available resources at its disposal to execute work.
The driver opens up a dedicated Netty HTTP server and distributes the specified JAR files to all Worker nodes (a big advantage).
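As a sketch, a client-mode submission to a standalone cluster could look like the following; the example class, master URL, and JAR path are placeholder assumptions, not specifics from this answer:

# Client mode: the driver runs on this (job-submitting) machine.
# The class, master URL, and JAR path are illustrative placeholders.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://<master-host>:7077 \
  --deploy-mode client \
  ./examples/jars/spark-examples.jar 100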
Spark Cluster Mode
In cluster mode, by contrast, the “driver” component of the Spark job will not run on the local machine from which the job is submitted; instead, Spark launches the “driver” component inside the cluster. Hence, this mode is called “cluster mode”.
This mode works well when the job-submitting machine is remote from the “Spark infrastructure”: since the “driver” component runs within the “Spark infrastructure”, data movement between the job-submitting machine and the “Spark infrastructure” is reduced.
In this mode, the chance of network disconnection between the “driver” and the “Spark infrastructure” is also lower, since they reside in the same infrastructure, which reduces the chance of job failure.
The Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
The Driver runs as a dedicated, standalone process inside the Worker.
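For comparison, the same submission in cluster mode only changes the --deploy-mode value (again with the same placeholder names assumed above):

# Cluster mode: the driver is launched inside the cluster, on a Worker node.
# The class, master URL, and JAR path are illustrative placeholders.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://<master-host>:7077 \
  --deploy-mode cluster \
  ./examples/jars/spark-examples.jar 100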
Now, answering your second question, the way to choose which mode to run in is by using the --deploy-mode flag. From the Spark Configuration page:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
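Note that if --deploy-mode is not specified, spark-submit defaults to client mode.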