
What is Apache Spark?


What is Spark Framework?

Apache Spark is a fast, flexible, developer-friendly platform for large-scale SQL, machine learning, batch processing, and stream processing. At its core, it is a data processing framework that can quickly run processing tasks on very large data sets and distribute those tasks across multiple computers, either on its own or in conjunction with other distributed computing tools.

Crunching through large data stores requires marshaling enormous computing power. Spark shoulders much of that programming burden: its easy-to-use API removes a great deal of the menial, labor-intensive work of distributed computing and big data processing.

Spark Features

Following are the features of Apache Spark:

  • Speed: Apache Spark can run applications in a Hadoop cluster up to 100 times faster in memory and 10 times faster on disk. It achieves this by reducing the number of read/write operations to disk and keeping intermediate processing data in memory.
  • Supports multiple languages: It provides built-in APIs in Java, Python, and Scala, opening up the option to write applications in different languages. For interactive querying, Spark also comes with 80 high-level operators.
  • Advanced analytics: It supports MapReduce, SQL queries, machine learning, streaming data, and graph algorithms.

Spark Components

Spark as a whole consists of various Spark tools, libraries, APIs, and more. The main components of Apache Spark are as follows:

  • Spark Core

Spark Core is the basic building block of Spark. It includes the components for job scheduling, memory management, fault tolerance, and more. Spark Core is also home to the API that defines Resilient Distributed Datasets (RDDs) and provides the operations for building and manipulating them.
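
To make that concrete, here is a minimal PySpark sketch of building and manipulating an RDD through the Core API; the application name and data are illustrative only.

```python
from pyspark.sql import SparkSession

# Create a local SparkSession; its SparkContext exposes the RDD (Core) API.
spark = SparkSession.builder.appName("core-rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))       # build an RDD from a local collection
squares = numbers.map(lambda x: x * x)       # transformation (evaluated lazily)
print(squares.reduce(lambda a, b: a + b))    # action that triggers the computation: 385

spark.stop()
```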

  • Spark SQL

Spark SQL is Apache Spark’s ‘go-to’ tool for working with structured and semi-structured data. It allows querying data via SQL as well as via Apache Hive’s dialect of SQL, the Hive Query Language (HQL). It also supports data from various sources such as Hive tables, Parquet files, log files, and JSON. In addition, it lets programmers combine SQL queries with the programmatic data manipulations supported by RDDs in Python, Java, Scala, and R.
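
As a rough illustration (the file name and columns are hypothetical), Spark SQL lets you register a dataset as a view and query it with plain SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

events = spark.read.json("events.json")        # hypothetical JSON source
events.createOrReplaceTempView("events")       # expose it to SQL

top_users = spark.sql("""
    SELECT user_id, COUNT(*) AS n_events
    FROM events
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""")
top_users.show()
```

The same result could be expressed with the DataFrame API (groupBy/count), which is what makes it possible to mix SQL with programmatic manipulation in the same job.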

  • Spark Streaming

Spark Streaming processes live streams of data, handling records as soon as they are generated by their sources. Examples of such data include log files and messages containing status updates posted by users.
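
The sketch below uses the classic DStream API to count words arriving on a TCP socket in five-second micro-batches; the host and port are placeholders, and newer applications would typically use Structured Streaming instead.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, batchDuration=5)          # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)      # placeholder source
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
word_counts.pprint()                                 # print each batch's counts

ssc.start()
ssc.awaitTermination()
```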

  • GraphX

GraphX is Apache Spark’s library for graphs and graph-parallel computation. It ships with a number of common graph algorithms that simplify graph analytics for users.

  • MLlib

Apache Spark comes with MLlib, a library of common Machine Learning (ML) utilities. It provides various types of ML algorithms, including regression, clustering, and classification, which can be applied to data to extract meaningful insights.
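
For example, a minimal clustering job with MLlib’s DataFrame-based API might look like the sketch below; the two-column toy dataset is made up purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-kmeans-demo").getOrCreate()

# Toy dataset with two numeric features
df = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)], ["x", "y"])

features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)
model = KMeans(k=2, seed=42).fit(features)     # cluster the points into two groups
model.transform(features).select("x", "y", "prediction").show()
```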

Kickstart your career by enrolling in the Apache Spark course in Dubai.


Apache Spark Architecture

Apache Spark has a master-slave architecture with its cluster made up of a single master and multiple slaves. This architecture relies on two abstractions:

  • Resilient Distributed Dataset (RDD)
  • Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)

A Resilient Distributed Dataset is a set of data items distributed across worker nodes and held in memory. Let’s look at what each part of the term means; a short PySpark sketch illustrating the partitioning follows the list:

  • Resilient: Restores data on failure
  • Distributed: Data is distributed among different nodes
  • Dataset: Group of data
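
A quick way to see the “distributed” part is to ask an RDD how it is partitioned; the local master and the toy data below are purely illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-partitions").master("local[4]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(12), numSlices=4)   # ask Spark to split the data into 4 partitions
print(rdd.getNumPartitions())                  # 4
print(rdd.glom().collect())                    # the elements held by each partition
```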

Directed Acyclic Graph (DAG)

A Directed Acyclic Graph is a finite directed graph that keeps track of the operations applied to RDDs. It is arranged as a sequence of data computations made up of vertices and edges: the vertices represent RDDs and the edges represent the operations applied to them.

Here, ‘graph’ refers to the structure of these computations, while ‘directed’ and ‘acyclic’ describe how they are connected: each edge points in one direction and the chain of operations never loops back on itself. Let’s take a look at the Spark architecture.

[Figure: Apache Spark architecture]

Let’s take a look at some of the other important elements in the above architecture.

Driver Program

The Driver Program runs the application’s main() function. It creates the SparkContext object, which coordinates the Spark application as an independent set of processes on a cluster. To run on a cluster, the SparkContext connects to one of several cluster manager types and then performs the following operations (a minimal driver sketch follows the list):

  • Acquires executors on nodes in the cluster
  • Sends the application code, defined by Python files or JAR files passed to the SparkContext, to the executors
  • Sends tasks to the executors to run
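
Put together, a minimal driver program might look like the following sketch; the application name and master URL are placeholders:

```python
from pyspark.sql import SparkSession

def main():
    # The driver creates the SparkSession, which wraps the SparkContext.
    # The master URL tells the SparkContext which cluster manager to contact.
    spark = (SparkSession.builder
             .appName("driver-demo")
             .master("local[*]")      # e.g. "yarn" or "spark://host:7077" on a real cluster
             .getOrCreate())

    data = spark.sparkContext.parallelize(range(100))
    print(data.sum())                 # the action is broken into tasks that run on executors

    spark.stop()

if __name__ == "__main__":
    main()
```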

Cluster Manager

  • The cluster manager allocates resources across applications. Spark can run on a number of different cluster managers.
  • The available cluster managers include Apache Mesos, Hadoop YARN, and the Standalone Scheduler.
  • The Standalone Scheduler is Spark’s own cluster manager, which makes it possible to install Spark on an otherwise empty set of machines.

Worker Node

  • The worker node is the slave node
  • It runs the application code in the cluster

Executor

  • It is a process that is launched for an application on a worker node
  • It runs tasks and stores data in memory or disk storage
  • It reads data from and writes data to external sources
  • Every application gets its own executors

Task

  • A unit of work that is sent to an executor

Spark RDD

RDD stands for Resilient Distributed Dataset. It is the fundamental data structure in Spark and the basis of its in-memory computation. RDDs let Spark perform in-memory calculations on large clusters in a fault-tolerant manner, which is what enables Spark to run MapReduce-style operations faster and more efficiently.
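
The in-memory aspect is easiest to see with caching. In the sketch below (the HDFS path is hypothetical), the filtered RDD is kept in memory after the first action so that later actions reuse it instead of re-reading the file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-cache-demo").getOrCreate()
sc = spark.sparkContext

logs = sc.textFile("hdfs:///data/app.log")                    # hypothetical input path
errors = logs.filter(lambda line: "ERROR" in line).cache()    # keep the result in memory

print(errors.count())                                   # first action materialises and caches the RDD
print(errors.filter(lambda l: "timeout" in l).count())  # reuses the cached data, no re-read
```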

Spark MLlib

Machine Learning in Apache Spark is performed with the help of Spark MLlib, Spark’s highly scalable machine learning library. Its most notable features are its high-level algorithms and its speed. MLlib includes the most widely used ML algorithms and utilities, covering regression, pattern identification/mining, classification, clustering, and more. It also provides certain lower-level machine learning primitives, such as a generic gradient descent optimization algorithm.

Some of the commonly used Spark MLlib tools are listed below; a short sketch combining several of them follows the list:

  • ML Algorithms
  • Featurization
  • Pipelines
  • Persistence
  • Utilities
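
The sketch below shows how these tools fit together: featurization stages, an ML algorithm, a Pipeline tying them together, and persistence of the fitted model. The training rows and save path are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-pipeline-demo").getOrCreate()

# Tiny labelled text dataset, purely for illustration
train = spark.createDataFrame(
    [("spark makes big data simple", 1.0), ("hadoop mapreduce batch job", 0.0)],
    ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),        # featurization
    HashingTF(inputCol="words", outputCol="features"),    # featurization
    LogisticRegression(maxIter=10),                       # ML algorithm
])

model = pipeline.fit(train)                                  # runs all stages in order
model.write().overwrite().save("/tmp/spark-pipeline-model")  # persistence
```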

Spark SQL

Spark SQL is the Apache Spark module used for structured data processing. It acts as a distributed SQL query engine and makes it easy to combine SQL query processing with machine learning; in essence, it integrates relational processing with Spark’s functional programming. It was originally built to run Apache Hive queries on top of Spark, but the limitations of that approach led to it being re-implemented as a native part of the Spark stack, where it now serves as Spark’s own SQL engine.

Spark Core

Spark Core is the fundamental element underlying all the Spark components. It provides functionalities such as task dispatching, input-output operations, and scheduling, along with fault tolerance, memory management, job scheduling, and interaction with storage systems. These functionalities are exposed through APIs in languages such as Scala and Java. In short, Spark Core is the main execution engine of the entire Spark platform.

What is DAG in Spark?

DAG stands for Directed Acyclic Graph. It consists of vertices and edges, where the vertices represent the RDDs (Resilient Distributed Datasets) and the edges denote the operations to be applied to a particular RDD. The DAG comes into play through ‘actions’: whenever an action is called, the DAG is submitted to the DAG Scheduler, which splits the graph into stages of executable tasks.
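
You can observe this yourself with a short PySpark sketch: transformations only extend the lineage (the DAG), and nothing executes until an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

grouped = (sc.parallelize(range(1000))
             .map(lambda x: (x % 10, x))          # transformation: adds to the DAG, runs nothing
             .reduceByKey(lambda a, b: a + b))    # another transformation, still lazy

print(grouped.toDebugString())   # the lineage the DAG Scheduler will turn into stages
print(grouped.count())           # the action: the DAG is submitted and tasks actually run
```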


Apache Hadoop and Apache Spark

One of the biggest challenges with respect to Big Data is analyzing the data. There are multiple solutions available to do this. The most popular one is Apache Hadoop.

Apache Hadoop is an open-source framework written in Java that allows us to store and process Big Data in a distributed environment, across various clusters of computers using simple programming constructs. To do this, Hadoop uses an algorithm called MapReduce, which divides the task into small parts and assigns them to a set of computers. Hadoop also has its own file system, Hadoop Distributed File System (HDFS), which is based on Google File System (GFS). HDFS is designed to run on low-cost hardware.

Apache Spark is an open-source distributed cluster-computing framework. It is a data processing engine developed to provide faster and easier-to-use analytics than Hadoop MapReduce. Before the Apache Software Foundation took over Spark, it was developed at the University of California, Berkeley’s AMPLab.

Learn about Apache Spark from Apache Spark Training and excel in your career as an Apache Spark Specialist.

Hadoop vs Spark

Let’s take a quick look at the key differences between Hadoop and Spark:

  1. Performance: Spark is fast because it uses RAM instead of disk for reading and writing intermediate data. Hadoop stores data on multiple sources and processes it in batches with the help of MapReduce.
  2. Cost: Since Hadoop relies on ordinary disk storage for data processing, it runs at a lower cost. Spark, on the other hand, runs at a higher cost because its in-memory computations require large amounts of RAM to spin up nodes for real-time data processing.
  3. Processing: On both platforms, data processing happens in a distributed environment. Hadoop is suitable for batch processing and linear data processing, while Spark is ideal for real-time processing and for handling live, unstructured data streams.
  4. Scalability: Hadoop can quickly accommodate rapid growth in data volume with the help of HDFS. Spark, in turn, relies on the fault-tolerant HDFS when working with larger volumes of data.
  5. Security: Although Hadoop is more secure overall, Spark can integrate with it to reach a higher security level. Spark uses authentication via event logging or a shared secret, while Hadoop makes use of multiple authentication and access control methods.
  6. Machine learning (ML): When it comes to Machine Learning, Spark is the superior platform because it has MLlib to perform iterative in-memory ML computations. Its tools can also handle classification, regression, pipeline construction, persistence, evaluation, and more.

These differences make it clear that there was scope for improvement over Hadoop, which is why Spark was introduced.

Prepare yourself for the industry by going through Hadoop Interview Questions and Answers now!

How Is Spark Better than Hadoop?

  • In-memory Processing: In-memory processing is faster than Hadoop’s approach because no time is spent moving data and processes in and out of disk. Spark can be up to 100 times faster than MapReduce since the work is done in memory.
  • Stream Processing: Apache Spark supports stream processing, which involves continuous input and output of data. Stream processing is also called real-time processing.
  • Lower Latency: Apache Spark is faster than Hadoop because it caches most of the input data in memory using Resilient Distributed Datasets (RDDs). RDDs manage the distributed processing of data and its transformations, which is where Spark performs most of its operations. Each dataset in an RDD is partitioned into logical portions, which can then be computed on different nodes of a cluster.
  • Lazy Evaluation: Apache Spark starts evaluating a computation only when the result is actually needed, which plays an important role in its speed.
  • Fewer Lines of Code: Spark’s concise APIs (its core is implemented in Scala) let you express a job in far fewer lines of code than an equivalent Hadoop MapReduce program; see the word-count sketch after this list.
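
For instance, a complete distributed word count fits in a handful of lines of PySpark (the input and output paths below are placeholders), whereas the equivalent Hadoop MapReduce job typically needs separate mapper, reducer, and driver classes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///data/books/*.txt")   # placeholder input path
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///data/wordcounts")                    # placeholder output path
```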

Want to gain detailed knowledge of Spark? Read this extensive Spark tutorial!

Use Cases of Apache Spark in Real Life

Many companies use Apache Spark to improve their business insights. These companies gather terabytes of data from users and use it to enhance consumer services. Some of the Apache Spark use cases are as follows:

  • E-commerce: Many e-commerce giants use Apache Spark to improve their consumer experience. Some of the companies which implement Spark to achieve this are:

    1. eBay: eBay deploys Apache Spark to provide discounts or offers to its customers based on their earlier purchases. Using this not only enhances the customer experience but also helps the company provide a smooth and efficient user interface to the customers.

    2. Alibaba: Alibaba runs some of the largest Spark jobs in the world. Some of these jobs analyze big data, while others perform feature extraction on image data; the results are represented on a large graph, and Spark is used to derive insights from it.
  • Healthcare: Apache Spark is being deployed by many healthcare companies to provide their customers with better services. One such company is MyFitnessPal, which helps people achieve a healthier lifestyle through diet and exercise. Using Spark, MyFitnessPal has been able to scan through the food calorie data of about 90 million users, which has helped it identify high-quality food items.
  • Media and Entertainment: Some of the video streaming websites use Apache Spark, along with MongoDB, to show relevant ads to their users based on their previous activity on that website. For example, Netflix, one of the major players in the video streaming industry, uses Apache Spark to recommend shows to its users based on the previous shows they have watched.

Intellipaat provides the most comprehensive Cloudera Spark course to fast-track your career!

Why Use Hadoop and Spark Together?

If you are thinking of Spark as a complete replacement for Hadoop, you are mistaken. There are several scenarios where Hadoop and Spark go hand in hand:

  • Spark can run on Hadoop, on Apache Mesos, standalone, or in the cloud.
  • Spark’s MLlib components provide capabilities that are not easily matched by Hadoop’s MapReduce; using them, Machine Learning algorithms can be executed faster, in memory.
  • Spark does not have its own distributed file system. By combining Spark with Hadoop, you can make use of Hadoop’s capabilities: HDFS for storage and the YARN Resource Manager for cluster administration and data management. A sketch of Spark reading data from HDFS under YARN follows this list.
  • Hadoop provides enhanced security, which is critical for production workloads, while Spark workloads can be deployed on available resources anywhere in the cluster without manually allocating and tracking individual tasks.
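
As a rough sketch of the combination, the job below reads a CSV file that already lives in HDFS (the path and column are hypothetical); when submitted with spark-submit --master yarn, YARN allocates and manages the executors while HDFS supplies the storage.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-on-hadoop").getOrCreate()

# Read a dataset that already lives in HDFS (hypothetical path and schema)
sales = spark.read.csv("hdfs:///warehouse/sales.csv", header=True, inferSchema=True)

# A simple aggregation; the work is split into tasks scheduled on YARN-managed executors
sales.groupBy("region").count().show()

spark.stop()
```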

Check out our MapReduce Cheat Sheet in Hadoop.

Increased Demand for Spark Professionals

Apache Spark is witnessing widespread demand, with enterprises finding it increasingly difficult to hire the right professionals to take on challenging roles in real-world scenarios. The Apache Spark community is one of the fastest-growing Big Data communities, with over 750 contributors from more than 200 companies worldwide.

Apache Spark developers are also among the highest-paid programmers in the Hadoop ecosystem, out-earning developers working with ten other Hadoop development tools. As per a survey by O’Reilly Media, having these skills under your belt can boost your salary by about $11,000, and mastering Scala programming can add another $4,000 to your annual salary.

Professionals skilled in Apache Spark and Storm earn average yearly salaries of about $150,000, whereas Data Engineers earn about $98,000. According to Indeed, the average salary for Spark Developers in San Francisco is 35 percent higher than the average for Spark Developers across the United States.

About the Author

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big Data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.