Apache Spark Architecture

Apache Spark has become a key cluster computing framework that has set the world of big data on fire. It is an accessible and powerful tool for tackling a wide variety of big data challenges.

Introduction to Spark Architecture

Apache Spark is an actively developed and unified computing engine and a set of libraries. It is used for parallel data processing on computer clusters and has become a standard tool for any developer or data scientist interested in big data.

Spark supports multiple widely used programming languages, such as Java, Python, R, and Scala. It includes libraries for a diverse range of tasks, such as SQL, streaming, and machine learning. It runs anywhere from a laptop to a cluster of thousands of servers, which makes it an easy system to start with and to scale up to big data processing at an incredibly large scale.

The Apache Spark framework uses a master-slave architecture that consists of a driver, which runs on a master node, and many executors, which run on worker nodes across the cluster. Apache Spark can be used for both batch processing and real-time processing.

Working of the Apache Spark Architecture

The basic Apache Spark architecture diagram is shown in the figure below:

[Figure: Apache Spark architecture diagram]

The Driver Program in the Apache Spark architecture runs the main program of an application and creates the SparkContext, which provides all the basic functionality. The Spark Driver also contains various other components, such as the DAG Scheduler, Task Scheduler, Scheduler Backend, and Block Manager, which are responsible for translating the user-written code into jobs that are actually executed on the cluster.

The Spark Driver and SparkContext collectively watch over job execution within the cluster. The Spark Driver works with the Cluster Manager, which handles resource allocation, to run its jobs. Each job is then split into multiple smaller tasks, which are distributed to the worker nodes.
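The split-and-distribute flow can be pictured with a toy model in plain Python. This is a minimal sketch, not Spark code: the names `split_into_partitions` and `run_job` are hypothetical, and a thread pool stands in for the worker nodes purely as an assumption for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal, contiguous partitions."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_job(data, task, num_partitions=4, num_workers=2):
    """Toy 'driver': create one task per partition and run them on a pool."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in the same order as the partitions.
        return list(pool.map(task, partitions))


if __name__ == "__main__":
    partials = run_job(list(range(1, 11)), task=sum)
    print(partials)       # one partial sum per partition
    print(sum(partials))  # the 'driver' combines the partial results
```

In this sketch the job (summing ten numbers) becomes four tasks, one per partition, and the caller plays the driver's role by combining the partial results.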

Whenever an RDD is created in the SparkContext, it can be distributed across many worker nodes and can also be cached there.

Worker nodes execute the tasks assigned to them, using the resources allocated by the Cluster Manager, and return the results to the SparkContext.

An executor is responsible for the execution of these tasks. The lifetime of executors is the same as that of the Spark Application. If we want to increase the performance of the system, we can increase the number of workers so that the jobs can be divided into more logical portions.

Spark Architecture Applications

Let us understand the high-level components that are a part of the architecture of the Apache Spark application:

The Spark Driver

The Spark Driver resembles the cockpit of a Spark application. It performs the role of the Spark application’s execution controller and keeps track of all the application state for the Spark cluster. The Spark driver must interface with the cluster manager in order to obtain physical resources and launch executors.

The Spark Executors

The tasks assigned by the Spark driver are performed by the Spark executors. The core responsibility of a Spark executor is to take the assigned tasks, run them, and report back their success or failure state and results. Each Spark application has its own separate executor processes.
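The "run the task, then report state and result" contract can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: `TaskResult` and `execute_task` are hypothetical names and do not correspond to Spark's internal classes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class TaskResult:
    """What a toy executor reports back to the driver for one task."""
    task_id: int
    succeeded: bool
    value: Any = None
    error: Optional[str] = None


def execute_task(task_id: int, func: Callable[[], Any]) -> TaskResult:
    """Toy 'executor': run one task and report its state and result."""
    try:
        return TaskResult(task_id, True, value=func())
    except Exception as exc:
        # Report the failure to the caller instead of crashing.
        return TaskResult(task_id, False, error=f"{type(exc).__name__}: {exc}")


if __name__ == "__main__":
    tasks = [lambda: 2 + 2, lambda: 1 / 0]
    for task_id, func in enumerate(tasks):
        print(execute_task(task_id, func))
```

A driver built on this idea could inspect `succeeded` and decide whether to reschedule a failed task.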

The Cluster Manager

The cluster manager maintains a cluster of machines that will run Spark applications. It has its own “driver” (sometimes called the “master”) and “worker” abstractions. Unlike Spark's driver and executors, which are processes, these are tied to physical machines.

In the Spark architecture illustration, the machine on the left is the Cluster Manager Driver Node. The circles are the daemon processes that run on and manage the individual worker nodes. These are just the processes of the Cluster Manager; at this point, no Spark application is running yet.

When the time comes to run a Spark application, resources to run it are requested from the cluster manager. Depending on the configuration of the application, this request can include a place to run the Spark driver, or it can cover only the resources for the executors of the Spark application.

Over the course of the execution of the Spark application, the Cluster Manager manages the underlying machines that the application is running on.
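The bookkeeping a cluster manager does can be approximated with a toy allocator. This is only a sketch under stated assumptions: `ToyClusterManager`, its methods, and the "fill workers in name order" policy are invented for illustration and do not reflect how any real cluster manager schedules.

```python
class ToyClusterManager:
    """Tracks free cores per worker machine and grants executor slots."""

    def __init__(self, worker_cores):
        # worker_cores: mapping of worker name -> number of free cores
        self.free_cores = dict(worker_cores)

    def request_executors(self, num_executors, cores_per_executor):
        """Grant up to num_executors; return the worker chosen for each."""
        granted = []
        for worker in sorted(self.free_cores):
            while (len(granted) < num_executors
                   and self.free_cores[worker] >= cores_per_executor):
                self.free_cores[worker] -= cores_per_executor
                granted.append(worker)
        return granted

    def release(self, granted, cores_per_executor):
        """Return cores to the pool when the application finishes."""
        for worker in granted:
            self.free_cores[worker] += cores_per_executor


if __name__ == "__main__":
    manager = ToyClusterManager({"worker-1": 4, "worker-2": 2})
    executors = manager.request_executors(num_executors=3, cores_per_executor=2)
    print(executors)           # which worker hosts each granted executor
    print(manager.free_cores)  # cores left while the application runs
    manager.release(executors, cores_per_executor=2)
```

The point of the sketch is the division of labor: the manager only hands out and reclaims machine resources, while deciding what runs on them is left to the application.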

Modes of Execution

An execution mode determines where the resources mentioned previously are physically located when the application is run. There are three modes of execution to choose from:

Cluster Mode

Cluster mode is the most common way of running Spark applications. In this mode, a user submits a pre-compiled JAR, a Python script, or an R script to a cluster manager. The cluster manager then launches the driver process on a worker node inside the cluster, in addition to the executor processes. This implies that the cluster manager is in charge of maintaining all Spark application-related processes.

Client Mode

Client mode is almost the same as cluster mode except that the Spark driver remains on the client machine that submitted the application. This means that the client machine maintains the Spark driver process, and the cluster manager maintains the executor ones. These machines are commonly known as gateway machines or edge nodes.

Local Mode

In the local mode, the entire Spark application is run on a single machine. It achieves parallelism through threads on that single machine. This is a common way to test applications or experiment with local development. However, it is not recommended for running production applications.
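As a rough illustration, the three modes map onto `spark-submit` arguments as sketched below. The `--master` and `--deploy-mode` options and the `local[N]` master string are part of `spark-submit`; the helper `build_spark_submit`, the default of `yarn` as the cluster manager, and the file name `app.py` are assumptions made for this example.

```python
def build_spark_submit(app_path, mode, master="yarn", local_threads=2):
    """Return a spark-submit argument list for the chosen execution mode."""
    if mode == "local":
        # Local mode: the whole application runs as threads on one machine.
        return ["spark-submit", "--master", f"local[{local_threads}]", app_path]
    if mode in ("cluster", "client"):
        # Cluster mode: the driver runs inside the cluster.
        # Client mode: the driver stays on the submitting machine.
        return ["spark-submit", "--master", master, "--deploy-mode", mode, app_path]
    raise ValueError(f"unknown execution mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("cluster", "client", "local"):
        print(" ".join(build_spark_submit("app.py", mode)))
```

The only difference between the cluster and client commands is the value passed to `--deploy-mode`, which mirrors the description above: the two modes differ only in where the driver runs.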

Two Main Abstractions of Apache Spark

Apache Spark has a well-defined layer architecture that is designed on two main abstractions:

  • Resilient Distributed Dataset (RDD): An RDD is an immutable (read-only), fundamental collection of elements that can be operated on in parallel across many machines. Each RDD can be divided into logical partitions, which are then processed on different nodes of a cluster.
  • Directed Acyclic Graph (DAG): DAG is the scheduling layer of the Apache Spark architecture that implements stage-oriented scheduling. Compared to MapReduce which creates a graph in two stages, Map and Reduce, Apache Spark can create DAGs that contain many stages.
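The two abstractions can be illustrated together with a toy class in plain Python. This is a sketch under assumptions, not Spark's implementation: `ToyRDD` is a hypothetical name, its lineage is a simple linear chain rather than a general DAG, and it does not model stages or shuffles.

```python
class ToyRDD:
    """Immutable, partitioned dataset that records its lineage lazily."""

    def __init__(self, partitions, lineage=()):
        self._partitions = tuple(tuple(p) for p in partitions)
        self._lineage = tuple(lineage)  # recorded (name, function) steps

    def map(self, func):
        # A transformation returns a NEW ToyRDD; nothing is computed yet.
        return ToyRDD(self._partitions, self._lineage + (("map", func),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def plan(self):
        """Names of the recorded steps, in execution order."""
        return [name for name, _ in self._lineage]

    def collect(self):
        """Action: run the recorded steps on every partition."""
        result = []
        for partition in self._partitions:
            items = list(partition)
            for name, func in self._lineage:
                if name == "map":
                    items = [func(x) for x in items]
                else:
                    items = [x for x in items if func(x)]
            result.extend(items)
        return result


if __name__ == "__main__":
    base = ToyRDD([[1, 2, 3], [4, 5, 6]])
    even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
    print(even_squares.plan())     # the recorded chain of steps
    print(even_squares.collect())  # computed only when the action runs
    print(base.collect())          # the original dataset is unchanged
```

Each partition is processed independently in `collect`, which is what would let a real engine run the partitions on different nodes.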

Cluster Managers in Spark Architecture

The SparkContext can work with various Cluster Managers, like Standalone Cluster Manager, Yet Another Resource Negotiator (YARN), or Mesos, which allocate resources to containers in the worker nodes. The work is done inside these containers.
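An application selects its cluster manager through the master URL. The sketch below builds that string for the three managers named above. The URL formats (`spark://host:port`, `mesos://host:port`, and plain `yarn`) and the default ports 7077 and 5050 follow Spark's documented conventions; the helper `master_url` and the host names are hypothetical.

```python
# Conventional default ports for the standalone master and the Mesos master.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}
SCHEMES = {"standalone": "spark", "mesos": "mesos"}


def master_url(manager, host=None, port=None):
    """Build the master URL string for a given cluster manager."""
    if manager == "yarn":
        # With YARN, the ResourceManager address comes from the Hadoop
        # configuration, so the master is just the literal string "yarn".
        return "yarn"
    if manager in DEFAULT_PORTS:
        if host is None:
            raise ValueError(f"{manager} needs a master host")
        chosen_port = port if port is not None else DEFAULT_PORTS[manager]
        return f"{SCHEMES[manager]}://{host}:{chosen_port}"
    raise ValueError(f"unknown cluster manager: {manager!r}")


if __name__ == "__main__":
    print(master_url("standalone", "master-host"))
    print(master_url("yarn"))
    print(master_url("mesos", "mesos-host"))
```

The resulting string is what would be passed as the `--master` option of `spark-submit` or set as `spark.master` in the application's configuration.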

Standalone Cluster

The Spark Standalone Cluster comprises a Standalone Master that functions as the Resource Manager, along with Standalone Workers serving as the worker nodes. In this cluster mode, by default, each application runs a single executor on each worker node.

To start execution, a client connects to the Standalone Master and requests the resources it needs. Acting as the application master, the client works with the Resource Manager to obtain those resources. This Cluster Manager also provides a web-based user interface (UI) where users can view details about the cluster and job statistics.

Hadoop YARN (Yet Another Resource Negotiator)

YARN takes care of resource management for the Hadoop ecosystem. It has two components:

  • Resource Manager: It manages resources across all applications in the system. It consists of a Scheduler and an Application Manager. The Scheduler allocates resources to the various applications.
  • Node Manager: The Node Manager runs on each worker machine and hosts the containers, including the one that runs an application's Application Master. Each MapReduce task runs in a container. An application or job thus requires one or more containers, and the Node Manager monitors these containers and their resource usage and reports it to the Resource Manager.

YARN also provides security for the authorization and authentication of web consoles for data confidentiality. Hadoop uses Kerberos to authenticate its users and services.

Apache Mesos

Apache Mesos handles the workload from many sources by using dynamic resource sharing and isolation. It helps in deploying and managing applications in large-scale cluster environments. Apache Mesos consists of three components:

  • Mesos Master: The Mesos Master coordinates the cluster. A cluster can run several Mesos Masters to provide fault tolerance (the capability to keep operating and recover when a failure occurs).
  • Mesos Slave: Mesos Slave is an instance that offers resources to the cluster. Mesos Slave assigns resources only when a Mesos Master assigns a task.
  • Mesos Frameworks: Mesos Frameworks allow applications to request resources from the cluster so that the application can perform the tasks.

This brings us to the end of this section. To sum up, Spark helps us break down compute-intensive jobs into smaller, more concise tasks, which are then executed by the worker nodes. The same basic architecture also lets it process both real-time and archived data.

About the Author

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.