in Big Data Hadoop & Spark by (11.4k points)

I am new to Apache Spark, and I just learned that Spark supports three types of cluster:

  • Standalone - meaning Spark will manage its own cluster
  • YARN - using Hadoop's YARN resource manager
  • Mesos - Apache's dedicated resource manager project

Since I am new to Spark, I think I should try Standalone first. But I wonder which one is recommended. Say that in the future I need to build a large cluster (hundreds of instances): which cluster type should I go with?

1 Answer

by (32.3k points)

Basically, we have three cluster types for Spark:

  • Standalone

  • Apache Mesos

  • Hadoop YARN
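Whichever cluster manager you pick, the application code stays the same; what mainly changes is the master URL given to spark-submit. Here is a minimal sketch of that idea. The helper names, the host names and app.py are placeholders I made up for illustration; the URL schemes and default ports (7077 for a standalone master, 5050 for a Mesos master) are the usual ones, but check them against your own setup.

```python
# Hypothetical helpers: build a spark-submit command line per cluster manager.
# Host names and "app.py" are placeholders, not values from this thread.

def master_url(manager, host=None, port=None):
    """Return the value passed to spark-submit's --master option."""
    if manager == "standalone":
        # 7077 is the default port of a standalone master
        return f"spark://{host}:{port or 7077}"
    if manager == "mesos":
        # 5050 is the default port of a Mesos master
        return f"mesos://{host}:{port or 5050}"
    if manager == "yarn":
        # No host needed: the ResourceManager address is read from the
        # Hadoop configuration (HADOOP_CONF_DIR or YARN_CONF_DIR).
        return "yarn"
    raise ValueError(f"unknown cluster manager: {manager}")


def spark_submit_args(manager, app, host=None, port=None):
    """Return the spark-submit command as an argument list."""
    return ["spark-submit", "--master", master_url(manager, host, port), app]


# Print one command line per cluster manager.
for manager, host in [("standalone", "master-host"),
                      ("yarn", None),
                      ("mesos", "mesos-host")]:
    print(" ".join(spark_submit_args(manager, "app.py", host)))
```

So starting on Standalone and moving to another manager later is mostly a deployment change, not a rewrite of the application.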

Spark Standalone cluster (Spark deploy cluster) is Spark’s own built-in cluster environment. Since it ships with the default distribution of Apache Spark, it is in many cases the easiest way to set up a cluster and run your Spark applications on it.

Standalone mode also provides nearly the same features as the other cluster managers.

Standalone has two kinds of nodes:

  • Standalone Master - It is the resource manager for the Spark Standalone cluster.

  • Standalone Worker (standalone slave) - It is a worker in the Spark Standalone cluster. It launches the executors on its node; the tasks themselves are assigned to the executors by the application's driver.

YARN offers good support for HDFS data locality.

Most Hadoop distributions already install YARN and HDFS together.

On YARN, a Spark executor maps to a single YARN container. To deploy applications to a YARN cluster, you need a Spark build with YARN support.
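Because each executor lives in one container, the memory requested from YARN per executor is larger than the executor memory you configure: Spark adds a memory overhead on top. A rough sketch of that arithmetic follows. The function name is made up; the 10% factor and the 384 MB floor are Spark's defaults for the executor memory overhead as far as I know, so treat them as assumptions and check the documentation for your version. YARN may also round the request up to its own allocation increment, which is not modeled here.

```python
# Sketch: memory requested from YARN for one executor container.
# The defaults below are assumptions about Spark's memory overhead settings.

OVERHEAD_FACTOR = 0.10    # assumed default fraction of executor memory
MIN_OVERHEAD_MB = 384     # assumed minimum overhead in MB


def container_memory_mb(executor_memory_mb, overhead_mb=None):
    """Executor memory plus overhead, in MB.

    If overhead_mb is not given, use max(10% of executor memory, 384 MB).
    """
    if overhead_mb is None:
        overhead_mb = max(int(executor_memory_mb * OVERHEAD_FACTOR),
                          MIN_OVERHEAD_MB)
    return executor_memory_mb + overhead_mb


# A 4 GB executor asks for 4096 + 409 MB; a 1 GB executor hits the 384 MB floor.
print(container_memory_mb(4096))
print(container_memory_mb(1024))
```

This matters when sizing executors: the container size, not the executor memory alone, has to fit within what a YARN node can allocate.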

Advantages of YARN over Mesos and Standalone:

  • YARN lets you dynamically share and centrally configure the same pool of cluster resources among all frameworks that run on YARN.

  • YARN provides security features: authentication, service-level authorization, authentication for web consoles, and data confidentiality.

Mesos handles workloads in a distributed environment through dynamic resource sharing and isolation. The Mesos cluster manager is a recommended choice when it comes to managing large-scale clusters.

It is open-source software that sits between the application layer and the operating system, making it easier to deploy and manage applications efficiently in large-scale clustered environments.

The main idea behind Mesos is to pool a large collection of heterogeneous resources. Mesos introduces a mechanism called resource offers, a form of distributed two-level scheduling: Mesos decides how many resources to offer each framework, while each framework decides which offers to accept and which computations to run on them.
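To make the two levels concrete, here is a toy simulation in plain Python. It is not the Mesos API: the class and function names are invented, and the "master" simply offers resources to frameworks in a fixed order, whereas a real Mesos master uses its own allocation policy to choose who gets each offer.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Free resources on one node, as offered to a framework."""
    node: str
    cpus: int
    mem_mb: int


class Framework:
    """Toy framework scheduler with identical pending tasks."""

    def __init__(self, name, task_cpus, task_mem_mb, pending_tasks):
        self.name = name
        self.task_cpus = task_cpus
        self.task_mem_mb = task_mem_mb
        self.pending_tasks = pending_tasks

    def resource_offer(self, offer):
        """Second level: the framework decides whether to use the offer.

        Returns (cpus, mem_mb) consumed by one task, or None to decline.
        """
        if self.pending_tasks == 0:
            return None
        if offer.cpus < self.task_cpus or offer.mem_mb < self.task_mem_mb:
            return None
        self.pending_tasks -= 1
        return (self.task_cpus, self.task_mem_mb)


def run_offer_round(free, frameworks):
    """First level: the master decides who is offered each node's resources.

    `free` maps node name -> (cpus, mem_mb) and is updated in place.
    Returns a list of (framework name, node) placements.
    """
    placements = []
    for node, (cpus, mem) in list(free.items()):
        for fw in frameworks:  # fixed order, a simplification
            accepted = fw.resource_offer(Offer(node, cpus, mem))
            if accepted is None:
                continue  # declined: the same resources go to the next framework
            cpus -= accepted[0]
            mem -= accepted[1]
            placements.append((fw.name, node))
        free[node] = (cpus, mem)
    return placements


free = {"n1": (4, 8192), "n2": (2, 2048)}
frameworks = [Framework("spark", 2, 4096, 2), Framework("other", 1, 1024, 1)]
print(run_offer_round(free, frameworks))
print(free)
```

The point of the split is that the master never needs to understand what a framework's tasks look like; it only hands out resources, and each framework applies its own placement logic.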

One advantage of Mesos over YARN and Standalone is its thin resource-sharing layer, which gives frameworks a common interface for accessing cluster resources and so enables fine-grained sharing across diverse cluster computing frameworks. The purpose is to increase resource utilization by deploying multiple distributed systems onto a shared pool of nodes.

