in Big Data Hadoop & Spark by (11.5k points)

TL;DR: In a Spark Standalone cluster, what are the differences between client and cluster deploy modes? How do I set which mode my application is going to run on?

We have a Spark Standalone cluster with three machines, all of them with Spark 1.6.1:

  • A master machine, which also is where our application is run using spark-submit
  • 2 identical worker machines

From the Spark Documentation, I read:

(...) For standalone clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.

However, I don't really understand the practical differences by reading this, and I don't get what are the advantages and disadvantages of the different deploy modes.

Additionally, when I start my application using spark-submit, even if I set the property spark.submit.deployMode to "cluster", the Spark UI for my context shows the following entry:

So I am not able to test both modes to see the practical differences. That being said, my questions are:

1) What are the practical differences between Spark Standalone client deploy mode and cluster deploy mode? What are the pros and cons of each?

2) How do I choose which one my application is going to run in, using spark-submit?


1 Answer

by (32.5k points)

When we submit a Spark job for execution, whether locally or on a cluster, its behavior depends on one component: the "driver". Where the driver of a Spark job resides determines how the job behaves.

Basically, there are two deploy modes in Spark: "client mode" and "cluster mode". Let's discuss each in detail.

Spark Client Mode

In this mode, the "driver" component of the Spark job runs on the machine from which the job is submitted. Hence, this mode is called "client mode".

  • When the machine submitting the job is inside or close to the "Spark infrastructure", there is little network latency for the data movement between the driver and the cluster when final results are collected, so this mode works very well.

  • When the submitting machine is far from the "Spark infrastructure" and network latency is high, this mode does not perform well.

  • If the job is submitted from a dedicated machine (for example the Master node, as in your setup), the driver runs there in its own process and has all of that machine's resources at its disposal.

  • The driver opens a dedicated Netty-based file server and distributes the specified JAR files to all Worker nodes (a big advantage).
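As a sketch, a client-mode submission might look like the following (the master URL, class name, and jar name are placeholders, not taken from the question):

```shell
# Hypothetical client-mode submission; host, class, and jar are placeholders.
# The driver starts inside this spark-submit process on the submitting machine,
# so the shell stays attached until the application finishes.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode client \
  my-app.jar
```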

Spark Cluster Mode

In cluster mode, by contrast, the "driver" component does not run on the local machine from which the job is submitted; Spark launches the driver inside the cluster instead. Hence, this mode is called "cluster mode".

  • When the submitting machine is remote from the "Spark infrastructure", this mode works well: the driver runs inside the cluster, which reduces the data movement between the submitting machine and the cluster.

  • Because the driver and the rest of the "Spark infrastructure" reside in the same network, the chance of a network disconnection between them is reduced, and with it the chance of job failure.

  • The driver runs on one of the cluster's Worker nodes; the Master chooses which Worker.

  • The Driver runs as a dedicated, standalone process inside the Worker.
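Analogously, a cluster-mode submission could be sketched like this (again with placeholder names):

```shell
# Hypothetical cluster-mode submission; placeholders as above.
# spark-submit returns once the driver has been launched on a Worker;
# the driver keeps running inside the cluster after the client exits.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  my-app.jar
```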

Now, answering your second question, the way to choose which mode to run in is by using the --deploy-mode flag. From the Spark Configuration page:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
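If you prefer a configuration property, the --deploy-mode flag corresponds to spark.submit.deployMode, so the following sketch should be equivalent. Note that the property must be supplied to spark-submit itself (or via spark-defaults.conf); setting it inside the application after launch is too late, because the deploy mode is decided before your code runs:

```shell
# Equivalent submission using the configuration property instead of the flag;
# <main-class>, <master-url>, etc. are the usual placeholders.
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --conf spark.submit.deployMode=cluster \
  <application-jar> [application-arguments]
```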
