Two Main Abstractions of Apache Spark
Apache Spark has a well-defined, layered architecture that is built on two main abstractions:
- Resilient Distributed Dataset (RDD): An RDD is an immutable (read-only), fundamental collection of elements that can be operated on across many nodes at the same time (parallel processing). Each dataset in an RDD is divided into logical partitions, which are then computed on different nodes of the cluster.
- Directed Acyclic Graph (DAG): The Directed Acyclic Graph is the scheduling layer of the Apache Spark architecture; it implements stage-oriented scheduling. Whereas MapReduce creates a graph with only two stages, Map and Reduce, Spark can create DAGs that contain many stages.
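The two abstractions above can be pictured with a toy sketch. This is not Spark's API; the names `make_partitions` and `run_stage` are illustrative inventions that mimic an immutable, partitioned dataset being pushed through a multi-stage DAG in parallel.

```python
# Toy sketch (not Spark's API): an "RDD" is modeled as an immutable tuple
# of partitions, and a "DAG" as a sequence of stages applied to every
# partition in parallel.
from concurrent.futures import ThreadPoolExecutor

def make_partitions(items, n):
    """Split a dataset into n logical partitions."""
    return tuple(tuple(items[i::n]) for i in range(n))

def run_stage(partitions, fn):
    """Apply one stage to every partition in parallel."""
    with ThreadPoolExecutor() as pool:
        return tuple(pool.map(lambda part: tuple(fn(x) for x in part), partitions))

data = make_partitions(range(1, 9), n=4)       # immutable, partitioned dataset
squared = run_stage(data, lambda x: x * x)     # stage 1 of the DAG
shifted = run_stage(squared, lambda x: x + 1)  # stage 2 of the DAG
total = sum(x for part in shifted for x in part)
print(total)  # -> 212
```

Each stage only reads its input partitions and produces new ones, which mirrors why RDDs can stay read-only while still supporting many chained stages.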
Apache Spark uses a master–slave architecture that consists of a driver, which runs on a master node, and many executors, which run across the worker nodes in the cluster. Spark can be used for batch processing as well as real-time processing.
Apache Spark Architecture Working
The basic Apache Spark architecture is shown in the figure below.
The Driver Program calls the main program of the application and creates the SparkContext, which contains all the basic functionalities. The Spark Driver also contains various other components, such as the DAG Scheduler, Task Scheduler, Backend Scheduler, and Block Manager, which are responsible for translating user-written code into jobs that are actually executed on the cluster.
The Spark Driver and the SparkContext collectively oversee job execution within the cluster. The Spark Driver works with the Cluster Manager, which handles all the resource allocation. Each job is then split into multiple smaller tasks, which are distributed onto the worker nodes.
Whenever an RDD is created in the Spark Context, it can be distributed across many worker nodes and can also be cached there.
Worker nodes execute the tasks assigned by the Cluster Manager and return the results to the SparkContext.
An executor is responsible for the execution of these tasks. The lifetime of an executor is the same as that of the Spark application. To increase the performance of the system, you can increase the number of workers so that jobs can be divided into more logical partitions.
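The split-and-gather pattern described above can be sketched in a few lines. This is a simplified stand-in, not Spark code: `driver` is a hypothetical function that splits a job into one task per worker, runs the tasks concurrently, and gathers the partial results, much as the driver collects results from executors.

```python
# Minimal sketch (assumed names, not Spark's API): a "driver" splits a job
# into tasks, hands them to a pool of "workers", and combines the results.
# Increasing num_workers lets more tasks run at the same time.
from concurrent.futures import ThreadPoolExecutor

def driver(job, num_workers):
    """Split a job (a list of records) into one task per worker and
    combine the per-task partial results."""
    tasks = [job[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        partials = list(workers.map(sum, tasks))  # each worker sums its task
    return sum(partials)                          # driver combines the results

print(driver(list(range(100)), num_workers=4))  # -> 4950
```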
The SparkContext can work with various Cluster Managers, such as the Standalone Cluster Manager, Yet Another Resource Negotiator (YARN), or Mesos, which allocate resources to containers on the worker nodes. The work is done inside these containers.
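The cluster manager is typically chosen through the `--master` option of `spark-submit`. The host names and `app.py` below are placeholders; the URL schemes are the standard forms for each manager.

```shell
# Selecting a cluster manager via spark-submit (hosts and app.py are placeholders):
spark-submit --master spark://master-host:7077 app.py   # Standalone Cluster Manager
spark-submit --master yarn app.py                       # Hadoop YARN
spark-submit --master mesos://master-host:5050 app.py   # Apache Mesos
spark-submit --master "local[4]" app.py                 # local mode with 4 threads
```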
Spark Standalone Cluster
The Standalone Master is the Resource Manager, whereas the Standalone Worker is the worker in the Spark Standalone Cluster.
In Standalone Cluster mode, there is only one executor to run the tasks on each worker node.
A client establishes a connection with the Standalone Master, asks for resources, and starts the execution process on the worker node.
Here, the client acts as the application master and requests resources from the Resource Manager. This Cluster Manager has a Web UI to view all clusters and job statistics.
Hadoop YARN (Yet Another Resource Negotiator)
YARN takes care of the resource management for the Hadoop ecosystem. It has two components:
Resource Manager: It manages the resources for all applications in the system. It consists of a Scheduler and an Application Manager; the Scheduler allocates resources to the various applications.
Node Manager: The Node Manager hosts the Containers, and each MapReduce task runs in one Container. An application or job requires one or more Containers. The Node Manager monitors these Containers and their resource usage, and reports this to the Resource Manager.
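The division of labor between the two components can be illustrated with a toy model. The class and method names below are invented for illustration and are not the Hadoop API: the Resource Manager's scheduler grants containers out of each node's capacity, and each Node Manager reports its usage back.

```python
# Toy sketch of YARN-style container allocation (assumed names, not the
# Hadoop API).
class NodeManager:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity   # how many containers this node can host
        self.containers = []       # running containers (application names)

    def launch(self, app):
        self.containers.append(app)

    def report(self):
        """Usage report sent back to the Resource Manager."""
        return self.name, len(self.containers), self.capacity

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app, num_containers):
        """Scheduler: place containers on nodes that still have free capacity."""
        granted = 0
        for node in self.nodes:
            while granted < num_containers and len(node.containers) < node.capacity:
                node.launch(app)
                granted += 1
        return granted

nodes = [NodeManager("node-1", 2), NodeManager("node-2", 3)]
rm = ResourceManager(nodes)
print(rm.allocate("job-a", 4))           # -> 4 (2 on node-1, 2 on node-2)
print([n.report() for n in nodes])
```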
YARN also provides security features, such as authentication and authorization for web consoles and data confidentiality. Hadoop uses Kerberos to authenticate its users and services.
Apache Mesos
Apache Mesos handles the workload from many sources by using dynamic resource sharing and isolation. It helps in deploying and managing applications in large-scale cluster environments. Apache Mesos consists of three components:
Mesos Master: The Mesos Master provides fault tolerance (the capability to operate, and to recover from loss, after a failure occurs). A cluster can run many Mesos Masters.
Mesos Slave: A Mesos Slave is an instance that offers resources to the cluster. It runs tasks only when the Mesos Master assigns them.
Mesos Frameworks: Mesos Frameworks allow applications to request resources from the cluster so that they can perform their tasks.
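How the three components interact can be sketched with Mesos's resource-offer idea: slaves advertise their resources, the master forwards these offers, and a framework accepts enough of them to run its tasks. The function names below are illustrative assumptions, not the Mesos API.

```python
# Toy sketch of the Mesos offer model (assumed names, not the Mesos API).
def collect_offers(slaves):
    """Master gathers resource offers (here: CPU counts) from each slave."""
    return [(name, cpus) for name, cpus in slaves.items()]

def framework_accept(offers, cpus_needed):
    """Framework accepts offers until its task's CPU demand is met."""
    accepted, remaining = [], cpus_needed
    for name, cpus in offers:
        if remaining <= 0:
            break
        take = min(cpus, remaining)
        accepted.append((name, take))
        remaining -= take
    return accepted

slaves = {"slave-1": 2, "slave-2": 4}
offers = collect_offers(slaves)
print(framework_accept(offers, cpus_needed=5))  # -> [('slave-1', 2), ('slave-2', 3)]
```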
This brings us to the end of this section. To sum up, Spark breaks intensive, high-computation jobs down into smaller, more concise tasks, which are then executed by the worker nodes. It handles both real-time and archived data using the same basic architecture.