Two Main Abstractions of Apache Spark
Apache Spark has a well-defined layered architecture built on two main abstractions:
- Resilient Distributed Dataset (RDD): An RDD is an immutable (read-only), fundamental collection of elements that can be operated on in parallel across many nodes. Each RDD is divided into logical partitions, which are then processed on different nodes of a cluster.
- Directed Acyclic Graph (DAG): The DAG is the scheduling layer of the Apache Spark architecture and implements stage-oriented scheduling. Whereas MapReduce always builds a graph with two stages, Map and Reduce, Apache Spark can build DAGs that contain many stages.
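To make the RDD idea concrete, here is a minimal sketch in plain Python, not Spark code. It models a dataset split into read-only partitions that are transformed in parallel. The helper names `partition` and `map_partitions` are hypothetical, and threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, num_partitions):
    """Split items into num_partitions contiguous, roughly equal slices."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    size, extra = divmod(len(items), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(tuple(items[start:end]))  # tuples model read-only partitions
        start = end
    return parts


def map_partitions(parts, fn):
    """Apply fn to every element, using one worker thread per partition.

    Returns NEW partitions and leaves the input untouched, mirroring how
    a transformation on an immutable dataset yields a new dataset.
    """
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        return list(pool.map(lambda p: tuple(fn(x) for x in p), parts))


numbers = partition(list(range(10)), 3)
squares = map_partitions(numbers, lambda x: x * x)
print(numbers)  # [(0, 1, 2, 3), (4, 5, 6), (7, 8, 9)]
print(squares)  # [(0, 1, 4, 9), (16, 25, 36), (49, 64, 81)]
```

The original partitions are unchanged after the transformation, which is the property that lets a lost partition be recomputed from its source.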
The Apache Spark framework uses a master–slave architecture that consists of a driver, which runs as the master node, and many executors, which run on the worker nodes of the cluster. Apache Spark can be used for both batch processing and real-time processing.
Working of the Apache Spark Architecture
The basic Apache Spark architecture is shown in the figure below:
The Driver Program in the Apache Spark architecture runs the main program of an application and creates the SparkContext, which is the entry point to Spark's basic functionality. The Spark Driver also contains components such as the DAG Scheduler, Task Scheduler, Scheduler Backend, and Block Manager, which are responsible for translating user-written code into jobs that are actually executed on the cluster.
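As a rough illustration of what the DAG Scheduler does with user code, the sketch below groups a linear chain of operations into stages. It is a simplified toy model, not Spark's implementation: it assumes a new stage begins at every operation that needs a shuffle. The operation names are real RDD methods, while `split_into_stages` and the `WIDE` set are hypothetical.

```python
# Operations that usually need a shuffle (data moves between partitions).
# Illustrative only; the set is not exhaustive.
WIDE = {"reduceByKey", "groupByKey", "join", "distinct", "repartition", "sortByKey"}


def split_into_stages(operations):
    """Group a linear chain of operations into stages.

    Narrow operations are pipelined into the current stage; each wide
    operation starts a new stage because it needs shuffled input.
    """
    stages = [[]]
    for op in operations:
        if op in WIDE and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


job = ["textFile", "flatMap", "map", "reduceByKey", "filter", "sortByKey", "collect"]
for number, stage in enumerate(split_into_stages(job)):
    print(f"stage {number}: {' -> '.join(stage)}")
# stage 0: textFile -> flatMap -> map
# stage 1: reduceByKey -> filter
# stage 2: sortByKey -> collect
```

A MapReduce job would correspond to exactly two such stages, whereas this chain already produces three.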
The Spark Driver and SparkContext together oversee job execution within the cluster. The Spark Driver works with the Cluster Manager to manage jobs, and the Cluster Manager handles resource allocation. Each job is then split into multiple smaller tasks, which are distributed to the worker nodes.
Whenever an RDD is created in the SparkContext, it can be distributed across many worker nodes and can also be cached there.
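The benefit of caching can be sketched without Spark. The class below is a hypothetical stand-in for a lazily computed dataset: without caching, every action recomputes the data, and with caching the result of the first computation is reused. `LazyDataset` and its methods are invented for illustration and only loosely mirror the RDD API.

```python
class LazyDataset:
    """Toy dataset that is recomputed on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function producing the data
        self._cached = None
        self._use_cache = False
        self.computations = 0        # how many times the data was recomputed

    def cache(self):
        """Mark the dataset so the first computed result is kept in memory."""
        self._use_cache = True
        return self

    def collect(self):
        """Return the data, recomputing it only when no cached copy exists."""
        if self._use_cache and self._cached is not None:
            return list(self._cached)
        self.computations += 1
        result = list(self._compute())
        if self._use_cache:
            self._cached = result
        return list(result)


uncached = LazyDataset(lambda: (x * 2 for x in range(5)))
uncached.collect()
uncached.collect()

cached = LazyDataset(lambda: (x * 2 for x in range(5))).cache()
cached.collect()
cached.collect()

print(uncached.computations, cached.computations)  # 2 1
```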
Worker nodes execute the tasks assigned to them, using the resources allocated by the Cluster Manager, and return the results to the SparkContext.
An executor is responsible for the execution of these tasks. The lifetime of executors is normally the same as that of the Spark application. To increase the performance of the system, we can increase the number of workers so that the tasks of a job run on more nodes in parallel.
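The effect of adding workers can be estimated with simple arithmetic. The sketch below is a toy model, not Spark's scheduler. It assumes each executor core runs one task at a time and that tasks are handed out round-robin; the executor names and both helper functions are hypothetical.

```python
import math


def assign_tasks(num_tasks, executors):
    """Distribute task IDs over executor names in round-robin order."""
    assignment = {name: [] for name in executors}
    for task_id in range(num_tasks):
        assignment[executors[task_id % len(executors)]].append(task_id)
    return assignment


def waves(num_tasks, num_executors, cores_per_executor=1):
    """Rounds needed to finish all tasks when every core runs one task at a time."""
    slots = num_executors * cores_per_executor
    return math.ceil(num_tasks / slots)


print(assign_tasks(8, ["exec-1", "exec-2", "exec-3"]))
# {'exec-1': [0, 3, 6], 'exec-2': [1, 4, 7], 'exec-3': [2, 5]}
print(waves(8, 2), waves(8, 4))  # 4 2
```

Doubling the executors from two to four halves the number of rounds here, but only because there are enough tasks to keep every slot busy.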
The SparkContext can work with various Cluster Managers, such as the Standalone Cluster Manager, Yet Another Resource Negotiator (YARN), or Mesos, which allocate resources to containers on the worker nodes. The work is done inside these containers.
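In practice, the Cluster Manager is typically selected through the application's master URL. The sketch below maps the URL forms documented by Spark (`local[...]`, `spark://host:port`, `yarn`, `mesos://host:port`) to the managers named above. The function name and the host names are made up for illustration, and other managers are deliberately left out.

```python
def cluster_manager_for(master_url):
    """Return the Cluster Manager implied by a Spark master URL (simplified)."""
    if master_url.startswith("local"):
        return "local mode (no cluster manager)"
    if master_url.startswith("spark://"):
        return "Standalone"
    if master_url == "yarn":
        return "YARN"
    if master_url.startswith("mesos://"):
        return "Mesos"
    raise ValueError(f"master URL not covered by this sketch: {master_url!r}")


# Hypothetical host names; 7077 and 5050 are the usual default ports.
for url in ["local[4]", "spark://master-host:7077", "yarn", "mesos://mesos-host:5050"]:
    print(f"{url:28} -> {cluster_manager_for(url)}")
```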
In a Spark Standalone Cluster, the Standalone Master acts as the Resource Manager and the Standalone Workers act as the worker nodes.
In the Standalone Cluster mode, by default only one executor per application runs the tasks on each worker node.
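One executor per worker is the default behaviour. According to the Spark configuration documentation, setting `spark.executor.cores` lets an application run several executors on the same worker when the worker has enough cores. The sketch below estimates that count under simplifying assumptions: it looks at a single application and ignores limits such as `spark.cores.max`. The function and parameter names are hypothetical.

```python
def executors_per_worker(worker_cores, worker_memory_mb,
                         executor_memory_mb, executor_cores=None):
    """Estimate how many executors of one application fit on one worker.

    executor_cores=None models the default, where a single executor takes
    all of the worker's available cores.
    """
    if worker_cores < 1 or executor_memory_mb > worker_memory_mb:
        return 0
    if executor_cores is None:
        return 1
    by_cores = worker_cores // executor_cores
    by_memory = worker_memory_mb // executor_memory_mb
    return min(by_cores, by_memory)


# A worker with 16 cores and 64 GB of memory; executors of 8 GB each.
print(executors_per_worker(16, 65536, 8192))                    # 1
print(executors_per_worker(16, 65536, 8192, executor_cores=4))  # 4
```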
A client establishes a connection with the Standalone Master, asks for resources, and starts the execution process on the worker node.
Here, the client is the application master, and it requests resources from the Resource Manager. This Cluster Manager provides a Web UI for viewing cluster and job statistics.
Hadoop YARN (Yet Another Resource Negotiator)
YARN takes care of resource management for the Hadoop ecosystem. It has two components:
- Resource Manager: It manages resources across all applications in the system. It consists of a Scheduler and an Application Manager. The Scheduler allocates resources to the various applications.
- Node Manager: The Node Manager runs on each worker node and hosts the containers, including the container that runs an application's Application Master. Each MapReduce task runs in a container. An application or job thus requires one or more containers, and the Node Manager monitors these containers and their resource usage, which it reports to the Resource Manager.
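The division of labour between the two components can be sketched as a toy allocator. It assumes a simple first-fit policy on memory only; real YARN schedulers such as the Capacity Scheduler and the Fair Scheduler are considerably more elaborate. The node names and the `allocate_containers` function are hypothetical.

```python
def allocate_containers(node_free_mb, requests_mb):
    """Place each container request on the first node with enough free memory.

    node_free_mb: dict of node name -> free memory reported by that node.
    requests_mb:  list of container sizes requested by applications.
    Returns (placements, free), where placements[i] is the node chosen for
    request i, or None when no node can host it.
    """
    free = dict(node_free_mb)  # copy so the caller's view is not mutated
    placements = []
    for size in requests_mb:
        node = next((name for name, mb in free.items() if mb >= size), None)
        if node is not None:
            free[node] -= size
        placements.append(node)
    return placements, free


nodes = {"node-1": 4096, "node-2": 2048}
placements, free = allocate_containers(nodes, [1024, 2048, 2048, 2048])
print(placements)  # ['node-1', 'node-1', 'node-2', None]
print(free)        # {'node-1': 1024, 'node-2': 0}
```

The last request stays unplaced until a running container finishes and its node reports the freed memory.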
YARN also supports authentication and authorization for its web consoles to protect data confidentiality. Hadoop uses Kerberos to authenticate its users and services.
Apache Mesos handles the workload from many sources by using dynamic resource sharing and isolation. It helps in deploying and managing applications in large-scale cluster environments. Apache Mesos consists of three components:
- Mesos Master: The Mesos Master provides fault tolerance (the ability to keep operating and to recover when a failure occurs). A cluster can run several Mesos Masters, one of which acts as the leader at any given time.
- Mesos Slave: A Mesos Slave is an instance that offers resources to the cluster. It commits those resources only when the Mesos Master assigns it a task.
- Mesos Frameworks: Mesos Frameworks allow applications to request resources from the cluster so that they can perform their tasks.
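The interplay of the three components can be sketched as a single offer cycle. This is a toy model that tracks only CPUs. It assumes the master offers each slave's free resources in turn and that the framework accepts an offer for every pending task that fits. The agent and task names and the `run_offer_cycle` function are hypothetical.

```python
def run_offer_cycle(agents, pending_tasks):
    """Simulate one round of resource offers.

    agents:        dict of agent name -> free CPUs it offers.
    pending_tasks: list of (task name, CPUs needed) held by the framework.
    Returns (launched, remaining): launched is a list of (task, agent)
    pairs, remaining the tasks that fit no offer in this round.
    """
    launched = []
    remaining = list(pending_tasks)
    for agent, free_cpus in agents.items():   # master offers this agent's CPUs
        still_waiting = []
        for task, needed in remaining:        # framework decides what to accept
            if needed <= free_cpus:
                launched.append((task, agent))
                free_cpus -= needed
            else:
                still_waiting.append((task, needed))
        remaining = still_waiting
    return launched, remaining


agents = {"agent-1": 4, "agent-2": 2}
tasks = [("a", 3), ("b", 2), ("c", 1), ("d", 4)]
launched, remaining = run_offer_cycle(agents, tasks)
print(launched)   # [('a', 'agent-1'), ('c', 'agent-1'), ('b', 'agent-2')]
print(remaining)  # [('d', 4)]
```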
This brings us to the end of this section. To sum up, Spark helps us break down intensive, high-computation jobs into smaller tasks, which are then executed by the worker nodes. The same basic architecture handles the processing of both real-time and archived data.