Apache Spark is an open-source cluster computing framework that was started at UC Berkeley’s AMP Lab in 2009. It is designed for fast computation, particularly for interactive queries and iterative workloads such as Machine Learning, and it is claimed to run some applications up to 100 times faster than Hadoop MapReduce.
- It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
- Spark is not a modified version of Hadoop, and it does not depend on Hadoop because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
- Spark can use Hadoop in two ways: for storage, through HDFS, and for processing, by running on YARN.
- Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming.
- It reduces the management burden of maintaining separate tools.
Evolution of Apache Spark
Spark began in 2009 as a research project in UC Berkeley’s AMP Lab, where it was developed by Matei Zaharia. It was open-sourced in 2010 under a BSD license and donated to the Apache Software Foundation in 2013, and it has been a top-level Apache project since February 2014.
A survey conducted by Typesafe among 2,100 developers reports that awareness of Spark is growing steadily, with 71 percent of respondents saying that they have some experience with the framework. Spark has now reached more than 500 organizations of all sizes, which between them commit thousands of developers and extensive resources to the project. The survey report also says that Spark has received backing from leading companies such as IBM, which has integrated Spark into its own products and open-sourced its own Machine Learning technology to extend Spark’s capabilities. IBM also announced that it would put more than 3,500 researchers and developers to work on Spark-related projects and build Spark into several existing software products, including the SPSS (Statistical Package for the Social Sciences) predictive analytics software, which it bought for $1.2 billion in 2009.
In its early stages, Spark’s appeal came primarily from its focus on plugging the gaps in MapReduce, namely, the lack of speed and of in-memory processing. Benchmark results have shown that Spark, running on Amazon Elastic Compute Cloud (EC2) machines, sorted 100 TB of data in just 23 minutes, whereas Hadoop MapReduce took 72 minutes to achieve the same result. Spark managed to do so with less than one-tenth of the machines, i.e., only 206, compared to the 2,100 machines used by Hadoop. In 2014, Apache Spark won the Daytona GraySort benchmark (100 TB category): the Databricks team (including Spark committers Reynold Xin, Xiangrui Meng, and Matei Zaharia) tied with the Themis team from UCSD, and the two shared a new world record in sorting.
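To put the sorting figures quoted above into perspective, the short calculation below works out the wall-clock speedup and a rough "machine-minutes" comparison. It uses only the numbers from the text. Machine-minutes is an assumed, simplified proxy for total compute that ignores differences in hardware between the two runs, so treat the result as a back-of-the-envelope estimate and not as an official benchmark metric.

```python
# Figures quoted in the text above.
spark_minutes, spark_machines = 23, 206
hadoop_minutes, hadoop_machines = 72, 2100

# Wall-clock speedup and the fraction of machines Spark needed.
speedup = hadoop_minutes / spark_minutes
machine_ratio = spark_machines / hadoop_machines

# Machine-minutes: a rough proxy (our assumption) for total compute spent.
spark_machine_minutes = spark_minutes * spark_machines
hadoop_machine_minutes = hadoop_minutes * hadoop_machines
compute_ratio = hadoop_machine_minutes / spark_machine_minutes

print(f"Wall-clock speedup: {speedup:.2f}x")
print(f"Machines used by Spark: {machine_ratio:.1%} of Hadoop's")
print(f"Spark machine-minutes: {spark_machine_minutes}")
print(f"Hadoop machine-minutes: {hadoop_machine_minutes}")
print(f"Machine-minute ratio: {compute_ratio:.1f}x")
```

By this rough measure, the run was about three times faster on the clock while using roughly a tenth of the machines.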
One of Spark’s main selling points is its accessibility to anyone with an understanding of databases and some scripting skills, while it also lets Data Scientists and Statisticians work through interactive web interfaces and languages such as Python. This makes it simpler for companies to understand their data and to find the tools to process it. Unlike MapReduce, Spark can also run without Hadoop, and it works with resource managers such as YARN or Mesos. Spark is also well suited to many complex problems, such as iterative and interactive workloads.
Features of Apache Spark
- Speed: Spark helps applications in a Hadoop cluster run up to 100 times faster in memory and 10 times faster on disk. This is possible because Spark reduces the number of read/write operations on the disk and keeps the intermediate processing data in memory.
- Supporting multiple languages: Spark has built-in APIs (Application Programming Interfaces) in Java, Scala, and Python, so we can write applications in different languages. It also offers more than 80 high-level operators for interactive querying.
- Advanced Analytics: Spark supports not only ‘Map’ and ‘Reduce’ but also SQL queries, streaming data, Machine Learning (ML), and Graph algorithms.
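The two ideas behind the features above, the map/reduce pattern and keeping intermediate data in memory, can be illustrated with a minimal plain-Python sketch. This is not Spark code and uses no Spark API. `FakeDisk`, `iterate_from_disk`, and `iterate_in_memory` are hypothetical names invented for this illustration, and the "disk" is only a counter of how many times the data is read.

```python
from collections import Counter
from functools import reduce

# --- Map and reduce: a word count over a few lines of text ---
lines = ["spark is fast", "hadoop is reliable", "spark is in memory"]

# "Map" step: turn each line into partial word counts.
mapped = [Counter(line.split()) for line in lines]

# "Reduce" step: merge the partial counts into one result.
word_counts = reduce(lambda left, right: left + right, mapped, Counter())


# --- In-memory reuse: fewer reads for an iterative job ---
class FakeDisk:
    """Hypothetical stand-in for slow storage that counts its reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._records)


def iterate_from_disk(disk, passes):
    """Re-read the dataset from storage on every pass."""
    total = 0
    for _ in range(passes):
        total += sum(disk.read())
    return total


def iterate_in_memory(disk, passes):
    """Read the dataset once, then reuse the in-memory copy."""
    cached = disk.read()
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total


disk_a = FakeDisk([1, 2, 3, 4])
disk_b = FakeDisk([1, 2, 3, 4])
result_a = iterate_from_disk(disk_a, passes=5)
result_b = iterate_in_memory(disk_b, passes=5)

print(word_counts["spark"], word_counts["is"])
print(result_a, result_b, disk_a.reads, disk_b.reads)
```

Both iterative variants compute the same total, but the second touches storage only once. That is the intuition behind the speed claim for iterative workloads. The actual gain in a real cluster depends on the workload and on how much of the data fits in memory.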
Spark Built on Hadoop
There are three ways in which Spark can be deployed with Hadoop components:
- Standalone: In a Spark Standalone deployment, Spark sits on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
- Hadoop YARN: In a Hadoop YARN deployment, Spark runs on YARN without any pre-installation or root access. This helps integrate Spark into the Hadoop ecosystem, or Hadoop stack, and allows other components to run on top of the stack.
- Spark in MapReduce (SIMR): SIMR is used to launch Spark jobs, in addition to the standalone deployment. With the help of SIMR, users can start Spark and use its shell without any administrative access.
Components of Spark
Apache Spark Core
Spark Core is the general execution engine for the Spark platform, on which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3), which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming uses Spark Core’s fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.
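The mini-batch idea described above can be sketched in plain Python. This is only an illustration of the concept and not the Spark Streaming API. `mini_batches` and `process_batch` are hypothetical helpers, and the sketch groups records by count for simplicity, whereas Spark Streaming forms its batches by time interval.

```python
from itertools import islice


def mini_batches(stream, batch_size):
    """Group an iterable of records into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Apply the same transformation to every batch: count error records."""
    errors = [record for record in batch if "ERROR" in record]
    return len(errors)


# A small, made-up stream of log records.
events = [
    "INFO start", "ERROR disk", "INFO ok",
    "ERROR net", "ERROR cpu", "INFO done",
    "INFO bye",
]

# Each mini-batch is processed independently with the same logic.
error_counts = [process_batch(batch) for batch in mini_batches(events, batch_size=3)]
print(error_counts)
```

Because every batch goes through the same transformation, the logic written for batch data can be reused almost unchanged for streaming data, which is the main appeal of the mini-batch design.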
MLlib (Machine Learning Library)
MLlib is a distributed Machine Learning framework on top of Spark that benefits from Spark’s distributed memory-based architecture. According to benchmarks run by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (this was before Mahout gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
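The Pregel abstraction mentioned above treats a graph computation as a series of "supersteps" in which vertices receive messages, update their own state, and send messages to their neighbors. The plain-Python sketch below illustrates that idea with single-source shortest paths. It does not use the GraphX API. The function name `pregel_shortest_paths` and the sample graph are assumptions made for this example.

```python
import math


def pregel_shortest_paths(edges, source, max_supersteps=50):
    """Pregel-style shortest paths.

    edges maps each vertex to a list of (neighbor, weight) pairs.
    """
    vertices = set(edges) | {dst for outs in edges.values() for dst, _ in outs}
    distance = {vertex: math.inf for vertex in vertices}

    # Superstep 0: only the source vertex receives a message.
    inbox = {source: [0]}
    for _ in range(max_supersteps):
        if not inbox:
            break  # no messages in flight, so the computation halts
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)  # merge all incoming messages
            if best < distance[vertex]:  # vertex program: keep the shorter path
                distance[vertex] = best
                # Send the improved distance along every outgoing edge.
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox
    return distance


# A small, made-up directed graph. "E" has no incoming edges.
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2)],
    "C": [("D", 1)],
    "E": [("A", 7)],
}
distances = pregel_shortest_paths(graph, source="A")
print(distances)
```

A vertex only sends messages when its own value improves, so the computation stops on its own once no shorter path can be found. Vertices that cannot be reached from the source keep an infinite distance.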
Apache Spark has become a dominant force in analytics today because it serves a wide range of sectors, such as banking, telecommunications, and gaming, and is used by giants like Apple, Facebook, IBM, and Microsoft. Moreover, Spark suits many different applications because it supports several programming languages, including Java, Scala, Python, and R, and handles streaming data, Machine Learning, and graph processing.