Apache Spark is an open-source cluster computing framework originally started at the AMPLab at UC Berkeley in 2009. Spark is a lightning-fast cluster computing framework designed for fast computation, handling interactive queries and iterative workloads. It is well suited to machine learning, where it can deliver up to 100 times faster performance for various applications.
- It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
- Spark is not a modified version of Hadoop and does not depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
- Spark can use Hadoop in two ways: for storage and for processing.
- Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming.
- It reduces the management burden of maintaining separate tools.
Evolution of Apache Spark
Spark was originally developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and has been a top-level Apache project since February 2014.
A recent survey of 2,100 developers by Typesafe reports that awareness of Spark is still growing day by day, with 71% of respondents claiming experience with the framework. Spark has now reached more than 500 organizations of all sizes, which together contribute thousands of developers and extensive resources to the project's progress. The survey also notes that Spark has received backing from leading companies such as IBM, which has integrated Spark into its own products and open-sourced its own machine learning technology to extend Spark's capabilities. On October 26, IBM committed more than 3,500 researchers and developers to Spark-related projects across several existing software products, including the SPSS (Statistical Package for the Social Sciences) predictive analytics software it bought for $1.2 billion in 2009.
In its early stages, Spark's appeal was primarily down to its focus on plugging the deficiencies of MapReduce, namely its lack of speed and of in-memory processing. Recent real-world experiments have shown that Spark sorted 100 TB of data in just 23 minutes, compared to Hadoop's 72 minutes for the same task on a fleet of Amazon Elastic Compute Cloud machines. Spark managed this with fewer than one tenth of the machines: 206 compared to 2,100 for Hadoop. This won the 2014 Gray Sort Benchmark (Daytona 100 TB category), with the Databricks team, including Spark committers Reynold Xin, Xiangrui Meng, and Matei Zaharia, tying for first place with a Themis team from UCSD and setting a new world record in sorting.
A major part of Spark's business appeal is its accessibility: anyone with a basic understanding of databases and some scripting skills can use it, through interactive shells and languages such as Python that are widely used by data scientists and statisticians. This makes it simpler to recruit people who know their data well and to give them the tools to process it. Unlike MapReduce, Spark can also run without Hadoop, working with resource managers such as YARN or Mesos. Spark therefore seems well placed to solve many complex problems in the near future.
Features of Apache Spark
- Speed − Spark helps applications in a Hadoop cluster run up to 100 times faster in memory and 10 times faster on disk. It achieves this by reducing the number of read/write operations to disk: intermediate processing data is kept in memory.
- Supports multiple languages − Spark provides built-in APIs (Application Programming Interfaces) in Java, Scala, and Python, so applications can be written in different languages. It also offers around 80 high-level operators for interactive querying.
- Advanced Analytics − Spark supports not only 'Map' and 'Reduce', but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
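To make the 'Map' and 'Reduce' model concrete, here is a minimal plain-Python sketch of the word-count pattern that Spark generalizes. This is illustrative only, not Spark API code; the dataset and names are made up for the example.

```python
from functools import reduce
from collections import Counter

# A tiny input "dataset" of lines (illustrative only).
lines = ["spark is fast", "spark is general"]

# Map phase: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce phase: sum the counts per key (word).
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}), pairs, Counter())

print(dict(counts))  # {'spark': 2, 'is': 2, 'fast': 1, 'general': 1}
```

In Spark the same shape appears as `flatMap`/`map` followed by `reduceByKey`, with the work distributed across a cluster instead of a single list.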
Spark Built on Hadoop
The following diagram describes the three ways in which Spark can be built with Hadoop components.
There are three ways of Spark deployment.
- Standalone − In a standalone deployment, Spark sits on top of HDFS (Hadoop Distributed File System), with space allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
- Hadoop YARN − In a YARN deployment, Spark simply runs on YARN with no pre-installation or root access required. This helps integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
- Spark in MapReduce (SIMR) − SIMR is used to launch Spark jobs in addition to a standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.
Components of Spark
Apache Spark Core
Spark Core is the general execution engine for the Spark platform, on which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
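The value of in-memory computing is easiest to see by contrast with recomputation. The following plain-Python sketch (illustrative only, not the Spark API) shows why caching an intermediate result in memory, analogous to persisting an RDD, avoids running a costly transformation once per action.

```python
compute_calls = 0

def expensive_transform(data):
    """Stand-in for a costly transformation over a dataset."""
    global compute_calls
    compute_calls += 1
    return [x * x for x in data]

data = range(5)

# Without caching: each "action" recomputes the transformation.
total = sum(expensive_transform(data))
count = len(expensive_transform(data))
assert compute_calls == 2  # recomputed once per action

# With caching: compute once, keep the result in memory, reuse it
# (analogous to calling cache()/persist() on an RDD in Spark).
cached = expensive_transform(data)
total = sum(cached)
count = len(cached)
assert compute_calls == 3  # one extra computation serves both actions
```

Spark applies the same idea across a cluster, keeping intermediate datasets in the memory of worker nodes instead of writing them back to disk between steps.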
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.
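The mini-batch idea can be sketched in plain Python (illustrative only, not the Spark Streaming API): an unbounded stream is cut into small batches, and each batch is then processed like an ordinary static dataset.

```python
def mini_batches(stream, batch_size):
    """Group an iterable 'stream' into lists of at most batch_size items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final, possibly smaller, batch
        yield batch

# Simulated stream of events; each mini-batch gets a batch-wise
# transform, analogous to applying RDD transformations per batch.
stream = [3, 1, 4, 1, 5, 9, 2, 6]
results = [sum(batch) for batch in mini_batches(stream, batch_size=3)]
print(results)  # [8, 15, 8]
```

In Spark Streaming the batch boundary is a time interval rather than a count, and each interval's data becomes an RDD processed by the same engine as batch jobs.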
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark, taking advantage of the distributed memory-based Spark architecture. According to benchmarks run by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is more than nine times faster than the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
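The Pregel abstraction can be illustrated with a minimal vertex-centric sketch in plain Python (this is not the GraphX API itself; the function and graph below are made up for the example). In each superstep, every vertex reads the messages sent by its neighbors, updates its own state, and the loop repeats until no state changes. Here the vertex program labels each vertex with the smallest vertex id in its connected component.

```python
def pregel_components(edges, vertices):
    """Label each vertex with the smallest vertex id reachable from it."""
    # Build an undirected adjacency list.
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    state = {v: v for v in vertices}   # initial label = own id
    changed = True
    while changed:                     # one iteration = one superstep
        changed = False
        # "Messages": each vertex offers its current label to its neighbors.
        inbox = {v: [state[u] for u in neighbors[v]] for v in vertices}
        for v in vertices:
            new_label = min([state[v]] + inbox[v])
            if new_label != state[v]:  # vertex program: keep the minimum
                state[v] = new_label
                changed = True
    return state

labels = pregel_components(edges=[(1, 2), (2, 3), (5, 6)],
                           vertices=[1, 2, 3, 5, 6])
print(labels)  # {1: 1, 2: 1, 3: 1, 5: 5, 6: 5}
```

GraphX runs the same vertex-program/message-passing pattern in a distributed setting, with Spark scheduling each superstep across the cluster.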