Spark is an open-source cluster-computing framework that addresses many of the limitations of MapReduce.
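It is suitable for near-real-time stream processing, simple operations, and processing large datasets across a network of machines. It is also well suited to graph processing and iterative execution; as an analytics engine, however, it is not designed for OLTP workloads.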
Compared with Hadoop's disk-based, two-stage MapReduce, Spark's in-memory primitives provide up to 100 times faster performance for certain applications.
This speed makes it well suited to machine learning algorithms, because programs can load data into the memory of a cluster and query it repeatedly.
Components of a Spark Project
A Spark project comprises various components such as:
Spark Core
Spark Core is the foundation of the entire Spark project. It contains the basic functionality of Spark, including components for task scheduling, memory management, interacting with storage systems, fault recovery, and more.
Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark's main programming abstraction.
Spark Streaming
Spark Streaming leverages the fast scheduling capability of Spark Core for streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches.
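The mini-batch idea can be sketched in plain Python. This is only a toy model and not the Spark Streaming API: `mini_batches` and `process_batch` are hypothetical names, the input lines are made up, and batches are cut by a fixed record count here, whereas Spark Streaming cuts them by a time interval.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def mini_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a (possibly unbounded) stream into consecutive small batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:  # the stream is exhausted
            return
        yield batch


def process_batch(batch: List[str]) -> Dict[str, int]:
    """The per-batch transformation: count the words seen in this batch."""
    counts: Dict[str, int] = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical input: five log lines arriving one after another.
incoming = ["error disk", "ok", "error net", "ok ok", "error disk"]

# Each mini-batch is processed independently, as a small batch job.
results = [process_batch(b) for b in mini_batches(incoming, batch_size=2)]
```

Each entry of `results` holds the word counts for one mini-batch, so the stream is handled as a series of small batch computations rather than record by record.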
Spark SQL
Spark SQL sits on top of Spark Core. It introduces SchemaRDD (renamed DataFrame in Spark 1.3), a data abstraction that supports structured and semi-structured data.
A SchemaRDD can be manipulated through a domain-specific language (DSL) that Spark SQL provides in Scala, Java, and Python. Spark SQL also supports the SQL language itself, through command-line interfaces and an ODBC/JDBC (Open Database Connectivity / Java Database Connectivity) server.
Machine Learning Library
Machine Learning Library, also known as MLlib, is a distributed machine learning framework that sits on top of Spark.
MLlib implements many common statistical and machine learning algorithms. Thanks to Spark's memory-based architecture, it has been reported to be as much as nine times faster than the disk-based Hadoop version of Apache Mahout.
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction, along with an optimized runtime for that abstraction.
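To make the Pregel idea more concrete, here is a toy vertex-centric loop in plain Python. It is not the GraphX API: the function `pregel_shortest_paths` and the sample graph are made up for illustration, and the sketch assumes non-negative edge weights. In each superstep, vertices that received messages update their state and send new messages along their outgoing edges; the loop ends when no messages remain.

```python
from typing import Dict, List, Tuple

# vertex -> list of (neighbour, edge weight); every vertex must appear as a key
Graph = Dict[str, List[Tuple[str, float]]]


def pregel_shortest_paths(graph: Graph, source: str) -> Dict[str, float]:
    """Single-source shortest paths computed in Pregel-style supersteps."""
    dist = {v: float("inf") for v in graph}
    # Messages for the first superstep: the source learns distance 0.
    inbox: Dict[str, List[float]] = {source: [0.0]}

    while inbox:  # halt when no vertex received a message
        outbox: Dict[str, List[float]] = {}
        for vertex, messages in inbox.items():
            best = min(messages)  # combine incoming messages
            if best < dist[vertex]:  # vertex program: keep the better distance
                dist[vertex] = best
                for neighbour, weight in graph[vertex]:
                    outbox.setdefault(neighbour, []).append(best + weight)
        inbox = outbox  # messages are delivered in the next superstep
    return dist


# Hypothetical sample graph.
toy_graph: Graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0)],
    "c": [("d", 1.0)],
    "d": [],
}
distances = pregel_shortest_paths(toy_graph, "a")
```

In a distributed engine the vertices would be spread over many machines and the messages would travel over the network between supersteps; the sketch keeps everything in one process to show only the control flow.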
Apache Spark also offers in-memory computation, which gives it an edge in processing speed.
It can keep large amounts of data in cluster memory between operations, which reduces the disk reads and writes a query needs and so increases processing speed.
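The benefit of keeping data in memory can be illustrated with a small plain-Python sketch. It only mimics the idea: `SlowSource` and `CachedDataset` are hypothetical names and nothing here is Spark API (in Spark, the comparable calls on an RDD are `cache()` and `persist()`). The point is that the expensive load happens once and later queries reuse the in-memory copy.

```python
from typing import Callable, List, Optional


class SlowSource:
    """Stand-in for a dataset on disk or across the network."""

    def __init__(self) -> None:
        self.calls = 0  # how many times the slow read actually happened

    def load(self) -> List[int]:
        self.calls += 1
        return list(range(1, 11))


class CachedDataset:
    """Loads data lazily on first use and keeps it in memory afterwards."""

    def __init__(self, loader: Callable[[], List[int]]) -> None:
        self._loader = loader
        self._data: Optional[List[int]] = None

    def data(self) -> List[int]:
        if self._data is None:  # first access: pay the loading cost once
            self._data = self._loader()
        return self._data

    def query(self, predicate: Callable[[int], bool]) -> List[int]:
        return [x for x in self.data() if predicate(x)]


source = SlowSource()
ds = CachedDataset(source.load)
evens = ds.query(lambda x: x % 2 == 0)  # triggers the single load
large = ds.query(lambda x: x > 7)       # served from memory
```

After both queries, `source.calls` is still 1, which is the effect that makes repeated or iterative queries over the same data cheap.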
Now, on to the Spark architecture.
Apache Spark has two main abstractions:
RDD (Resilient Distributed Dataset): An RDD is an immutable (read-only), fundamental collection of elements that can be operated on by many nodes at the same time (parallel processing). Each RDD is divided into logical partitions, which can be computed on different nodes of a cluster.
Ways to create RDDs −
Parallelizing an existing collection in your driver program.
Referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, etc.
Applying a transformation to an existing RDD, which results in the creation of a new RDD.
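The three ways listed above can be mimicked with a toy class in plain Python. This is a sketch, not Spark code: `ToyRDD` and its methods are hypothetical and run in a single process. In PySpark, the corresponding real calls are `SparkContext.parallelize`, `SparkContext.textFile`, and transformations such as `rdd.map`.

```python
import os
import tempfile
from typing import Callable, Iterable, Tuple


class ToyRDD:
    """Read-only collection split into partitions (stored as tuples)."""

    def __init__(self, partitions: Iterable[Iterable]) -> None:
        self._partitions: Tuple[tuple, ...] = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, items: Iterable, num_partitions: int) -> "ToyRDD":
        """Way 1: split an existing in-memory collection into partitions."""
        items = list(items)
        size = max(1, -(-len(items) // num_partitions))  # ceiling division
        return cls(items[i:i + size] for i in range(0, len(items), size))

    @classmethod
    def from_text_file(cls, path: str, num_partitions: int) -> "ToyRDD":
        """Way 2: reference external storage, one element per line."""
        with open(path, encoding="utf-8") as handle:
            return cls.parallelize(handle.read().splitlines(), num_partitions)

    def map(self, func: Callable) -> "ToyRDD":
        """Way 3: a transformation returns a new ToyRDD; self is unchanged."""
        return ToyRDD(map(func, part) for part in self._partitions)

    def collect(self) -> list:
        """Gather all partitions back into one list."""
        return [item for part in self._partitions for item in part]


numbers = ToyRDD.parallelize([1, 2, 3, 4], num_partitions=2)
squares = numbers.map(lambda x: x * x)  # numbers itself is not modified

# Hypothetical file, created here only so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lines.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("alpha\nbeta\ngamma\n")
    lines = ToyRDD.from_text_file(path, num_partitions=2)
```

Because the partitions are stored as tuples and `map` builds a new object, the original collection is never modified, which mirrors the read-only nature of an RDD.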
DAG (Directed Acyclic Graph): A DAG is a sequence of computations performed on data, where each node is an RDD partition and each edge is a transformation applied to the data.
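A minimal way to picture this is a dataset that records its transformations instead of running them straight away. The sketch below is plain Python with a hypothetical `LazyDataset` class; it models only a linear chain (the simplest kind of DAG) and treats each whole dataset, rather than each partition, as a node.

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Records transformations as a lineage chain; computes only on collect()."""

    def __init__(self, source: Optional[List[int]] = None,
                 parent: Optional["LazyDataset"] = None,
                 op_name: str = "source",
                 op: Optional[Callable[[List[int]], List[int]]] = None) -> None:
        self._source = source
        self._parent = parent    # edge back to the dataset this one derives from
        self._op_name = op_name
        self._op = op

    def map(self, func: Callable[[int], int]) -> "LazyDataset":
        return LazyDataset(parent=self, op_name="map",
                           op=lambda rows: [func(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "LazyDataset":
        return LazyDataset(parent=self, op_name="filter",
                           op=lambda rows: [r for r in rows if pred(r)])

    def lineage(self) -> List[str]:
        """Walk the edges back to the source, then list them in execution order."""
        node, names = self, []
        while node is not None:
            names.append(node._op_name)
            node = node._parent
        return names[::-1]

    def collect(self) -> List[int]:
        """The action: replay the recorded operations starting from the source."""
        if self._parent is None:
            return list(self._source or [])
        return self._op(self._parent.collect())


pipeline = LazyDataset(source=[1, 2, 3, 4, 5]).map(lambda x: x * 10).filter(lambda x: x > 20)
plan = pipeline.lineage()    # nothing has been computed yet
result = pipeline.collect()  # the recorded chain runs only now
```

Keeping the recorded chain around is also what allows a lost result to be recomputed from its source, which is the fault-recovery idea behind lineage.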
Now for the working architecture of Spark. It runs on a master-slave architecture:
On the master node, you have the driver program, which drives your application.
It is the central point and the entry point of the Spark Shell (Scala, Python, and R). The driver program runs the main() function of the application and is the place where the Spark Context is created. The Spark driver contains various components, including the DAGScheduler, TaskScheduler, SchedulerBackend, and BlockManager, which are responsible for translating Spark user code into actual Spark jobs executed on the cluster.
The driver program, which runs on the master node of the Spark cluster, schedules job execution and negotiates with the cluster manager.
It translates the RDDs into an execution graph and splits the graph into multiple stages.
The driver stores the metadata about all the Resilient Distributed Datasets and their partitions.
The driver is the cockpit of job and task execution: it converts a user application into smaller execution units known as tasks. Tasks are then executed by the executors, i.e., the worker processes that run individual tasks.
The driver exposes information about the running Spark application through a web UI on port 4040.
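The splitting of the execution graph into stages can be pictured with a short sketch. It relies on a detail not spelled out above: Spark starts a new stage wherever a transformation needs a shuffle, while operations that need no shuffle are pipelined within one stage. The function `split_into_stages` and the hand-written `SHUFFLE_OPS` set are illustrative assumptions, not how the DAGScheduler is implemented.

```python
from typing import List

# Assumption for this sketch: a small, hand-picked set of operations that shuffle.
SHUFFLE_OPS = {"reduceByKey", "groupByKey", "repartition"}


def split_into_stages(ops: List[str]) -> List[List[str]]:
    """Group a linear chain of operations into stages at shuffle boundaries."""
    stages: List[List[str]] = []
    current: List[str] = []
    for op in ops:
        if op in SHUFFLE_OPS and current:
            stages.append(current)  # close the stage before the shuffle
            current = []
        current.append(op)          # the shuffling operation opens the next stage
    if current:
        stages.append(current)
    return stages


# Hypothetical word-count style job, written as a list of operation names.
job = ["textFile", "flatMap", "map", "reduceByKey", "filter", "collect"]
stages = split_into_stages(job)
```

For this job the sketch yields two stages: reading and mapping in the first, and everything from `reduceByKey` onward in the second.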
Worker nodes are the slave nodes whose job is to execute the tasks. The tasks run on the partitioned RDDs in the worker nodes, and the results are returned to the Spark Context.
The Spark Context takes the job, breaks it into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDD, perform operations, collect the results, and return them to the main Spark Context.
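The flow described above (split the job into tasks, run one task per partition, gather the results) can be sketched with Python's standard thread pool. The names `make_partitions`, `task`, and `run_job` are hypothetical, and threads merely stand in for executor processes on worker nodes, so the sketch shows the division of work rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def make_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    """Split the data into at most num_partitions contiguous chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def task(partition: List[int]) -> int:
    """Work done by one task on one partition: a partial sum of squares."""
    return sum(x * x for x in partition)


def run_job(data: List[int], num_workers: int) -> int:
    """Driver side: split the job into tasks, hand them out, combine the results."""
    partitions = make_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(task, partitions))  # one task per partition
    return sum(partial_results)


total = run_job(list(range(1, 101)), num_workers=4)
```

The combined result is the same for any number of workers; only the number of partitions, and therefore the number of tasks that can run side by side, changes.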
If you increase the number of workers, you can divide jobs into more partitions and execute them in parallel over multiple systems, which is typically much faster.
As the number of workers increases, the total memory available also increases, and you can cache data so that jobs execute faster.
I suggest visiting this tutorial for in-depth knowledge of the run-time workings of the Spark architecture.