in Big Data Hadoop & Spark by (11.5k points)

The doc https://spark.apache.org/docs/1.1.0/submitting-applications.html

describes deploy-mode as :

--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)

Using this diagram fig1 as a guide (taken from http://spark.apache.org/docs/1.2.0/cluster-overview.html) :

[fig1: Spark cluster-overview diagram — a Driver Program (containing the SparkContext) connects through a Cluster Manager to Executors on the Worker Nodes]

If I kick off a Spark job :

./bin/spark-submit \
  --class com.driver \
  --master spark://MY_MASTER:7077 \
  --executor-memory 845M \
  --deploy-mode client \
  ./bin/Driver.jar


Then the Driver Program will be MY_MASTER, as specified in fig1.

If instead I use --deploy-mode cluster, will the Driver Program be shared among the Worker Nodes? If so, does that mean the Driver Program box in fig1 can be dropped (as it is no longer utilized), since the SparkContext will also be shared among the worker nodes?

Under what conditions should cluster be used instead of client?

1 Answer

0 votes
by (31.4k points)

The deploy mode of a Spark job simply tells us where the driver program will run. There are two possibilities: either the driver program runs on a worker node inside the cluster (cluster mode), or it runs on an external client machine (client mode).

In client mode, the "driver" component of the Spark job runs on the local machine from which the job is submitted; hence this mode is called "client mode".

  • If the job-submitting machine is within or near the Spark infrastructure, and there is no high network latency for the data movement between the driver and the cluster needed to produce the final result, this mode works very well.

  • If the job-submitting machine is far from the Spark infrastructure and has high network latency to it, this mode does not perform well.

In cluster mode, the "driver" component of the Spark job does not run on the local machine from which the job is submitted. Instead, Spark launches the "driver" component inside the cluster.

  • When the job-submitting machine is remote from the Spark infrastructure, the "driver" component runs within that infrastructure, so data movement between the submitting machine and the cluster is reduced. Therefore, this mode works well in that scenario.

  • When we work in cluster mode, the chances of a network disconnection between the "driver" and the Spark infrastructure are reduced, so the chance of job failure is much lower.
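As a sketch, the command from the question could be resubmitted in cluster mode like this. The class name, master URL, memory setting, and jar path are carried over from the question as placeholders, not a known working setup:

```shell
# Same job as in the question, but with --deploy-mode cluster:
# the driver is launched on one of the worker nodes inside the
# cluster instead of on the machine running spark-submit.
./bin/spark-submit \
  --class com.driver \
  --master spark://MY_MASTER:7077 \
  --executor-memory 845M \
  --deploy-mode cluster \
  ./bin/Driver.jar
```

One practical difference: in cluster mode the jar must be reachable from the worker nodes (e.g. on a shared filesystem or via an HDFS/HTTP URL), since the driver process that reads it is launched there rather than on your local machine.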


