
What is Apache Spark?

Apache Spark is a fast, in-memory Big Data processing engine equipped with Machine Learning capabilities that runs up to 100 times faster than Hadoop MapReduce. It is a unified engine built around the concept of ease of use.


Apache Spark definition

Apache Spark is a processing engine from the Apache Software Foundation that is powering Big Data applications around the world. It takes over from where Hadoop MapReduce left off, or rather from where MapReduce is finding it increasingly difficult to cope with the exacting needs of a fast-paced enterprise.

The large amounts of unstructured data and the need for greater speed to deliver real-time analytics have made this technology a real alternative for Big Data computational exercises.


Spark at a glance:
  • Strength: In-memory processing and iterative computation
  • Availability: Open source
  • Data processing: Streaming and batch processing


Spark vs Hadoop – A thorough comparison

Now that you have a basic introduction to Apache Spark, here are eight parameters on which Spark is compared with Hadoop. Hadoop lags behind in most of these comparisons, but then again Spark is an advancement over Hadoop.

  • Speed – Spark is up to 100 times faster in memory; MapReduce is considerably slower at processing data.
  • Simplicity – Spark has over 100 high-level operators that help transform and work with semi-structured data; Hadoop's APIs are also simple, but less so than Spark's.
  • Versatility – Spark works well with SQL workloads, streaming data and Machine Learning operations; Hadoop is not as adept.
  • Iterative application support – Spark uses RDDs to delegate smaller workloads to individual nodes; MapReduce lacks built-in support for iterative applications.
  • Recovery – RDDs allow recovery of partitions on failed nodes by recomputing the DAG; Hadoop has no RDDs but is also highly resistant to failures.
  • Scheduler – Spark acts as its own flow scheduler by virtue of in-memory computation; Hadoop uses Oozie, an external job scheduler, for complex flows.
  • Caching – Spark enhances system performance by caching data in memory for further iterations (see the sketch after this list); MapReduce cannot cache data in memory.
  • Programming difficulty – With high-level operators and RDDs, programming in Spark is easy; in Hadoop, developers often need to hand-code each and every operation.
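As a small illustration of the caching point above, here is a hedged Spark sketch, assuming an existing SparkContext named sc and a purely hypothetical HDFS path, that caches an RDD in memory so that repeated iterations over the same data do not re-read it from disk:

```scala
// sc is an existing SparkContext; the HDFS path and CSV layout are hypothetical placeholders
val ratings = sc.textFile("hdfs:///data/ratings.csv")
  .map(_.split(",")(2).toDouble)   // assume the third column is a numeric rating
  .cache()                         // keep the parsed values in memory for reuse

// Each iteration hits the in-memory cache instead of re-reading and re-parsing the file.
for (i <- 1 to 5) {
  val mean = ratings.sum() / ratings.count()
  println(s"Iteration $i, mean rating = $mean")
}
```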

Architecture Overview


Let’s familiarize ourselves with the architecture through which Spark works. The role of each component is elucidated here.

Role of Driver in Spark Architecture

The driver is the entry point of the Spark shell. The application’s main() function runs in the driver program, and this is where the SparkContext is created. The Spark driver contains components like the DAGScheduler, TaskScheduler and BlockManager, which are key to translating Spark user code into Spark jobs.

  • The driver schedules job execution and negotiates with the cluster manager.
  • It translates RDDs into an execution graph and splits that graph into stages of tasks.
  • The driver stores metadata about RDDs and their partitions.
  • The driver converts the user application into tasks (a minimal sketch follows this list).
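To make the driver’s role concrete, here is a minimal, hedged sketch of a driver program (not taken from any particular codebase): main() runs in the driver, creates the SparkContext, and the driver turns the RDD operations into jobs and tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal driver program: main() runs here and creates the SparkContext,
// which the driver uses to turn the user code below into jobs and tasks.
object DriverExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DriverExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A trivial job: the driver builds the execution graph for this RDD
    // and schedules its tasks on the executors.
    val evens = sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
    println(s"Even numbers: $evens")

    sc.stop()
  }
}
```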

Role of worker in Spark Architecture

A worker is a distributed agent responsible for executing tasks. Every Spark application gets its own executor processes on the workers, and these run for the entire lifetime of the Spark application.

  • Workers perform all data processing
  • Workers read data from and write data to external sources
  • Workers store computation data in an in-memory cache or on hard disk drives
  • Workers interact with storage systems


Role of Cluster Manager in Spark Architecture

The cluster manager acquires resources on the Spark cluster and allocates them to Spark jobs. There are three types of cluster managers that a Spark application can use to allocate and deallocate physical resources such as CPU and memory for client Spark jobs: Standalone, Apache Mesos and Hadoop YARN. Choosing a cluster manager depends on the application, as the cluster managers provide different scheduling capabilities. For beginners, the standalone cluster manager is the easiest one to use when developing a new Spark application.
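As an illustrative sketch (the standalone host name below is a placeholder, not a real cluster), the cluster manager a Spark application uses is selected through the master setting when the session is created:

```scala
import org.apache.spark.sql.SparkSession

// The choice of cluster manager is made through the master setting; the URLs are placeholders.
val spark = SparkSession.builder()
  .appName("ClusterManagerChoice")
  .master("local[*]")                 // no cluster manager: run locally while developing
  // .master("spark://host:7077")     // standalone cluster manager (hypothetical host)
  // .master("yarn")                  // Hadoop YARN (requires Hadoop configuration on the classpath)
  .getOrCreate()

println(spark.sparkContext.master)
spark.stop()
```

In practice the same choice is often made with the --master flag of spark-submit rather than hard-coding it in the application.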


The major components of the Spark ecosystem are as below:

Apache Spark Core

This is the fundamental processing engine of a Spark application. All the other components of Spark depend directly on Spark Core. Its most important features include in-memory processing and referencing data from external data sources.


“At Talend, we’ve teamed with MapR for a number of years and helped thousands of organizations with our advanced data integration platform that offers native support for Apache Spark and Spark Streaming.”– Ciaran Dynes, Vice President, Products, Talend

Spark Streaming

This is another component of Spark that displays its high-speed computational prowess. It works exceptionally well with streaming data to provide real-time analytics. The data is segregated into multiple micro-batches and, using the RDD abstraction, processed in a massively parallel and continuous manner to suit the needs of streaming workloads.
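As a minimal, hedged example of this micro-batch model (the localhost socket source and the 5-second batch interval are arbitrary choices for illustration), a streaming word count can be sketched as follows:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A minimal streaming word count; the host and port of the text source are placeholders.
val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))   // incoming data is cut into 5-second batches

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                      // emit the counts of each micro-batch

ssc.start()             // start processing the stream of micro-batches
ssc.awaitTermination()
```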


Spark SQL

This is the Spark component that creates a level of data abstraction called the SchemaRDD (known as the DataFrame in current releases) for working with both structured and semi-structured data through the SQL query language.
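A short, hedged sketch of how this abstraction is used in practice, with hypothetical sample data (in current Spark releases the SchemaRDD appears as the DataFrame):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical structured data loaded as a DataFrame
val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")

// The data can be queried with plain SQL...
spark.sql("SELECT name FROM people WHERE age > 30").show()
// ...or with the equivalent DataFrame API
people.filter($"age" > 30).select("name").show()
```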

GraphX

This is the graph processing capability of the Apache Spark framework, an amalgamation of iterative graph computation, exploratory analysis and ETL capabilities. It is possible to view the data both as graphs and as collections, and to combine graphs with RDDs. It allows customized iterative graph algorithms through a set of specialized APIs.
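The following hedged GraphX sketch, assuming an existing SparkContext named sc and made-up vertex and edge data, shows the same data viewed as a graph, an iterative algorithm (PageRank) run on it, and the result joined back with the original collection:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// sc is an existing SparkContext; the vertices and edges are hypothetical sample data
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)           // view the RDDs as a graph
val ranks = graph.pageRank(0.001).vertices   // run an iterative graph algorithm

// Combine the graph results back with the original vertex collection
ranks.join(vertices).collect().foreach(println)
```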


“Spark is beautiful. With Hadoop, it would take us six-seven months to develop a machine learning model. Now, we can do about four models a day.” – Rajiv Bhat, Senior Vice President of Data Sciences and Marketplace at InMobi, told the Economic Times

MLlib

Spark can also be used for Machine Learning algorithms and applications through its MLlib library, which provides a Machine Learning framework for Spark in a memory-based distributed environment. MLlib can be extremely fast compared with other Machine Learning frameworks such as Apache Mahout. Through this library, developers can experiment with Machine Learning directly on Spark.
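As a minimal, hedged MLlib example (assuming an existing SparkSession named spark and a handful of made-up points), a k-means clustering model can be trained in memory like this:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// spark is an existing SparkSession; the points below are hypothetical sample data
val points = spark.createDataFrame(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(8.0, 9.0), Vectors.dense(9.0, 8.0)
).map(Tuple1.apply)).toDF("features")

// Fit a k-means model on the distributed, in-memory dataset
val model = new KMeans().setK(2).setSeed(1L).fit(points)
model.clusterCenters.foreach(println)
```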

Curious to know more? Read this extensive Spark Tutorial!

Spark consists of a fundamental processing engine called Spark Core, accompanied by a set of libraries. The Spark interface lets you program a Spark cluster with data parallelism and fault tolerance. Applications for its distributed processing engine can be written in Scala, Java or Python, and multiple APIs are available for developing distributed ETL applications. The libraries written on top of Spark Core let you run varied workloads such as SQL data parsing, Machine Learning and stream processing.

Spark conveniently replaces Hadoop MapReduce, and it goes much further. It has immensely profound implications for the data science community. The MLlib library is extensively deployed for Machine Learning applications and increasingly fulfills the needs of data science domains like classification, clustering, regression, collaborative filtering and dimensionality reduction.

Spark comes with a Machine Learning pipeline API that provides a high-level abstraction for defining data science workflows. Some of the abstractions provided by Spark ML are Estimator, Transformer, Pipeline and Parameter.
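Here is a hedged sketch of those abstractions in use, with entirely hypothetical training rows: Tokenizer and HashingTF act as Transformers, LogisticRegression is an Estimator, its set* calls supply the Parameters, and the Pipeline chains the stages into one workflow.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// spark is an existing SparkSession; the labelled text rows are hypothetical
import spark.implicits._
val training = Seq(
  (0L, "spark makes big data simple", 1.0),
  (1L, "slow batch job on disk", 0.0)
).toDF("id", "text", "label")

// Transformers turn one DataFrame into another
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
// An Estimator is fitted on data to produce a model; Parameters are set via the set* calls
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// A Pipeline chains the stages into a single workflow
val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
```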

The shortcomings of MapReduce

MapReduce has some serious limitations in the way it processes data, which make high-speed processing next to impossible. The reasons for MapReduce’s slower throughput are as below:

  • It imposes a linear dataflow structure on distributed programs
  • It follows a lengthy cycle of reading data from disk, mapping the data with a function, reducing the mapped data, and writing the reduced data back to disk

Go through these Top Spark Interview Questions to grab top Big Data jobs!

The Spark RDD abstraction is the solution

The Apache Spark RDD (Resilient Distributed Dataset) is one of the basic Spark concepts, and it works in a fundamentally different way. Each dataset is partitioned logically so that it can be computed independently on different nodes of the cluster. An RDD is a read-only, partitioned collection of records. It is highly fault-tolerant and enables massive parallel processing that increases speeds by several factors.

An RDD can be created in either of two ways: by parallelizing an existing collection, or by referencing an external dataset in an external storage system such as HDFS, HBase or another shared data source.
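Both creation methods can be sketched as below (assuming an existing SparkContext named sc; the HDFS path is a placeholder):

```scala
// sc is an existing SparkContext; the HDFS path is a hypothetical placeholder
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))      // method 1: parallelize a collection
val fromExternal   = sc.textFile("hdfs:///data/sample.txt")   // method 2: reference an external dataset

// Partitions of an RDD are computed independently on different nodes
val doubled = fromCollection.map(_ * 2)
println(doubled.collect().mkString(", "))
```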

Spark is a newer way of computing Big Data on a Hadoop cluster, but it is designed so that it can also work independently outside of Hadoop without a hitch. Spark has its own Big Data computation and cluster management system, so it basically uses Hadoop only for its storage needs. Even that can be replaced with some other storage method to remove the dependence on Hadoop completely, since there is no such thing as Spark storage.

According to MapR, Spark is gaining a steady groundswell of support due to some of its distinctive advantages.

It is extremely good for batch processing, deploying iterative algorithms, interactive querying and working with streaming data. These are some of the features that set it apart from MapReduce, but another distinctive advantage of Spark is that it does away with managing multiple tools for multiple tasks.

It provides APIs for Java, Scala and Python, so regardless of the language in which your application is written, it is possible to process the data. Spark also has scores of high-level operators for interactive querying.

The various ways in which Spark is deployed

Standalone: In this deployment the Spark application runs on top of the Hadoop Distributed File System, and Spark can work in coordination with the traditional computing engine, MapReduce.

Spark in MapReduce: Spark can be deployed inside MapReduce applications in order to speed up data computation, i.e. to speed up the mapping and reducing functions using the built-in Spark functionality. Here the Spark shell can be used without any administrative rights.


“Organizations look to MapR’s Converged Data Platform to take advantage of Spark’s distributed in-memory storage for high performance processing across a variety of use cases, including fraud detection, sensor and IoT device analytics and consumer sentiment analysis.” – Girish Pancha, CEO, StreamSets

Hadoop YARN: In this method the Spark application works on YARN without any pre-installation or root access. This way the Spark program integrates well into the Hadoop ecosystem, and the other components can work on top of this stack.

Become an expert Hadoop Architect by enrolling in the Big Data Hadoop Online Training Course!

What is Apache Spark used for?

Today there is widespread deployment of Big Data. With each passing day the requirements of enterprises increase, and therefore there is a need for a faster and more efficient way of processing data. Most of the data is unstructured and is coming in thick and fast as streaming data.

Banking – More and more banks are adopting Spark platforms to analyze and access social media profiles, emails, call recordings, complaint logs and forum discussions to garner insights that can help them take the right business decisions for credit risk assessment, customer segmentation and targeted advertising.

E-commerce – Spark finds great application in the e-commerce industry. Real-time transaction details can be fed to streaming clustering algorithms like K-means and to collaborative filtering. The results can then be combined with other data sources such as product reviews, social media profiles and customer comments to improve recommendations to clients based on new trends.
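As a hedged sketch of the collaborative filtering part of this use case (assuming an existing SparkSession named spark and made-up user/product ratings), Spark’s ALS recommender can be used like this:

```scala
import org.apache.spark.ml.recommendation.ALS

// spark is an existing SparkSession; the (user, product, rating) rows are hypothetical
import spark.implicits._
val ratings = Seq(
  (1, 101, 5.0f), (1, 102, 3.0f),
  (2, 101, 4.0f), (2, 103, 1.0f)
).toDF("userId", "productId", "rating")

// Collaborative filtering with ALS, as mentioned for e-commerce recommendations
val als = new ALS()
  .setUserCol("userId").setItemCol("productId").setRatingCol("rating")
  .setRank(10).setMaxIter(5)
val model = als.fit(ratings)
model.recommendForAllUsers(3).show()   // top-3 product recommendations per user
```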

Alibaba’s Taobao uses Spark to analyse hundreds of petabytes of data on its e-commerce platform. A plethora of merchants interact with the platform, and these interactions represent a large graph on which Machine Learning processing is done.

eBay uses Apache Spark to provide targeted offers, enhance customer experience and optimize overall performance. The Apache Spark engine is leveraged at eBay through Hadoop YARN, which manages all the cluster resources to run generic tasks. eBay’s Spark users leverage Hadoop clusters in the range of 2,000 nodes, 20,000 cores and 100 TB of RAM through YARN.



Healthcare – Apache Spark is used to run advanced analytics on patient records to figure out which patients are more likely to fall sick after being discharged. The hospital can then better deploy healthcare services to the identified patients, saving costs for both hospitals and patients.


“Providing significant performance improvements over MapReduce, Spark is the tool of choice for data scientists and analysts to turn their data into real results. With Spark under the hood of Trifacta, we can now execute large-scale data transformations at interactive response rates.” – Wei Zheng, Vice President of Products, Trifacta

MyFitnessPal is used to achieve a healthy lifestyle through good food and exercise. With the end goal of identifying high-quality food items, MyFitnessPal uses Apache Spark to clean the data entered by its users. The food calorie data of over 80 million users has been scanned by MyFitnessPal.

Media and gaming – Spark is used to identify patterns from real-time in-game events in the gaming industry. It is used for business purposes like auto-adjustment of game levels based on complexity, player retention and also targeted advertising.

In real-world scenarios, examples of Spark use cases can be found in real-time statistics, predictive analytics, working with sensor data, log data processing, fraud detection, and so on.

Organizations in fields as diverse as marketing, manufacturing, finance, law enforcement and scientific research are benefiting hugely from it. IBM has made significant contributions to running Spark in the cloud. Databricks’ Project Tungsten has brought major changes and has improved Spark’s memory allocation capabilities.

This Dice Insights article clearly sums up why Spark is the next big thing in Big Data!

What is the right audience to learn Apache Spark?

Apache Spark can be mastered by professionals in the IT domain to increase their marketability. Big Data Hadoop professionals surely need to learn Apache Spark, since it is the next most important technology in Hadoop data processing. Beyond that, ETL professionals, SQL professionals and project managers can gain immensely if they master Apache Spark. Data Scientists also need in-depth knowledge of it to excel in their careers: Spark is extensively deployed in Machine Learning scenarios, and Data Scientists are expected to work in the Machine Learning domain, making them the right candidates for Apache Spark training. Anyone with an innate desire to learn the latest emerging technologies can also learn Spark.

Increased demand for Spark Professionals everywhere

Apache Spark is seeing widespread demand, with enterprises finding it increasingly difficult to hire the right professionals to take on challenging roles in real-world scenarios. It is a fact that today the Apache Spark community is one of the fastest-growing Big Data communities, with over 750 contributors from more than 200 companies worldwide.

It is also a fact that Apache Spark developers are among the highest paid programmers when it comes to programming for the Hadoop framework, as compared to ten other Hadoop development tools. A recent survey by O’Reilly Media made it evident that having Apache Spark skills under your belt can give you a salary hike to the tune of $11,000, and mastering Scala programming can give you a further jump of another $4,000 in annual salary.

Professionals skilled in Apache Spark and Storm earn average yearly salaries of about $150,000, whereas data engineers earn about $98,000. As per Indeed.com, the average salary for Spark developers in San Francisco is 35 percent higher than the average salary for Spark developers in the United States.

Intellipaat provides the most comprehensive Spark online training course to fast track your career!

 
