Apache Spark was started by Matei Zaharia in 2009 as a research project at UC Berkeley’s AMPLab. Among its early users were machine learning experts who used Spark to monitor and predict traffic congestion in the San Francisco Bay Area. Before proceeding any further, let’s understand the shortcomings of Hadoop MapReduce.
Limitations of Hadoop MapReduce
1) It relies on disk-based processing: intermediate results are written to disk between stages
2) Applications are built primarily in Java
3) Stream processing is not supported
4) Because Hadoop MapReduce relies heavily on Java, it is exposed to Java security vulnerabilities that cyber criminals can exploit
To overcome these limitations, Apache Spark was developed. The Apache Software Foundation counts Apache Spark among its most successful projects. Amazon, Yahoo, Alibaba, Pinterest, Netflix and many other multinational companies use Spark. Spark is becoming almost indispensable wherever real-time data analytics is required.
There is a plethora of data that needs to be processed in real time. Let’s look at the magnitude of the data being discussed.
1) 30+ petabytes of user-generated data are stored, accessed and analyzed by Facebook
2) Every minute, YouTube users upload 48 hours of new video
3) Facebook has to serve upload requests on the order of 100 terabytes
4) Every minute, over 500 websites are created
This is why there is a huge need for a competent real-time analytics framework, a need that Spark fulfills. Its main features are as follows.
1) High processing speed – Spark can run workloads up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk. It achieves this mainly by reducing the number of reads and writes to disk.
2) Fault tolerance – Spark’s RDD (Resilient Distributed Dataset) abstraction is designed to handle the failure of any worker node in the cluster gracefully. An RDD records the transformations used to build it, so lost partitions can be recomputed and no data is lost.
3) In-memory processing – Disk access is becoming increasingly expensive as data volumes grow. Reading terabytes to petabytes of data from disk and writing them back creates a huge overhead, so Spark keeps data in memory for faster access, which greatly increases processing speed. Spark’s DAG (directed acyclic graph) execution engine, with its acyclic data flow and in-memory computation, is another reason for the high speed.
4) Dynamic – Spark provides over 80 high-level operators that make it easy to develop parallel applications. Spark itself is written in Scala, but applications can also be written in Python, Java and R. Hadoop MapReduce, whose native API is Java, does not offer this flexibility.
5) Integration with Hadoop – Spark can run on top of a Hadoop cluster, using YARN for resource scheduling. Those proficient in Hadoop can therefore work with Spark without much difficulty.
6) Lazy evaluation – One reason Spark processes data so quickly is that it delays evaluation until a result is actually required. Transformations are only recorded in a DAG, and the computation runs when the driver program requests data through an action.
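To make the ideas of lazy evaluation, in-memory reuse and lineage-based recovery more concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use the Spark API: the `LazyDataset` class and its `lose_cache` method are hypothetical names invented for this illustration, and a real RDD is partitioned and distributed across a cluster, which this toy ignores. It only shows the general pattern of recording transformations first and running them when an action asks for a result.

```python
from typing import Any, Callable, Iterator, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical class, not part of Spark)."""

    def __init__(self, parent: Optional["LazyDataset"], op_name: str,
                 fn: Callable[[Optional[Iterator[Any]]], Iterator[Any]]) -> None:
        self.parent = parent      # upstream dataset, None for the source
        self.op_name = op_name    # label recorded in the lineage
        self.fn = fn              # builds this step's iterator from upstream
        self._cached: Optional[List[Any]] = None

    @classmethod
    def from_list(cls, items) -> "LazyDataset":
        data = list(items)
        return cls(None, "source", lambda _upstream: iter(data))

    # Transformations: they only record a step, no data is processed here.
    def map(self, f: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(self, "map", lambda it: (f(x) for x in it))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self, "filter", lambda it: (x for x in it if pred(x)))

    def lineage(self) -> List[str]:
        """Return the chain of recorded steps, from the source to this dataset."""
        node, chain = self, []
        while node is not None:
            chain.append(node.op_name)
            node = node.parent
        return chain[::-1]

    def _iterate(self) -> Iterator[Any]:
        if self._cached is not None:
            return iter(self._cached)          # reuse the in-memory copy
        upstream = self.parent._iterate() if self.parent is not None else None
        return self.fn(upstream)

    # Action: this is the point where the recorded steps actually run.
    def collect(self, keep_in_memory: bool = False) -> List[Any]:
        result = list(self._iterate())
        if keep_in_memory:
            self._cached = result
        return result

    def lose_cache(self) -> None:
        """Simulate a worker failure that wipes the in-memory copy."""
        self._cached = None


calls: List[int] = []   # records every time square() really executes


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = LazyDataset.from_list(range(1, 6))
evens_squared = numbers.map(square).filter(lambda v: v % 2 == 0)

calls_before_action = len(calls)                      # nothing has run yet
result = evens_squared.collect(keep_in_memory=True)   # the action runs the chain
calls_after_first = len(calls)

evens_squared.collect()                               # served from memory
calls_after_cached = len(calls)

evens_squared.lose_cache()                            # simulated failure
recovered = evens_squared.collect()                   # rebuilt from the lineage
calls_after_recovery = len(calls)

print(evens_squared.lineage(), result, recovered)
```

In this sketch, `square` is not called at all until `collect` is invoked, the second `collect` does no new work because the result is held in memory, and after the simulated failure the same result is rebuilt by replaying the recorded steps. Spark applies the same broad idea at cluster scale, with many more operators and with partitions spread over worker nodes.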
Domains where Spark is used
Now, let’s take a look at the various domains where Spark is used.
1. Banking : It is imperative to guarantee fault-tolerant transactions across the entire banking system, and Spark is helpful in this regard. It is heavily used for fraud detection, credit risk analytics and many other purposes.
2. Government : Government agencies also use Spark’s real-time analytics to bolster national security. To stay updated on threats, nations across the world need analytics that keep track of all their intelligence, military and police agencies.
3. Telecommunications : Telecom companies use real-time analytics to support calls, video chats and streaming. To improve customer experience, measurements of jitter and delay are taken into account.
4. Healthcare : Healthcare providers use real-time analytics to continuously monitor the medical status of critical patients. Hospitals on the lookout for blood and organ transplants need to stay in constant contact with each other during emergencies. Getting medical treatment on time can be a matter of life and death for patients. MyFitnessPal is a good example of a firm that uses Spark; it tracks the calorie data of 80 million users.
5. Stock market : Stockbrokers use real-time analytics to predict the movement of stock portfolios. Organizations rethink their business models after using real-time analytics to analyze the market demand for their brand. Renaissance Technologies successfully manages around $27 billion worth of investments using algorithmic trading and real-time analytics. In India, various firms bring the benefits of real-time analytics to ordinary investors in the stock market. Minance, Squareoff and Return Wealth are some of them.
Spark has all the benefits of Hadoop in terms of processing capability, and then some. Prudent firms run Spark on top of HDFS, with Spark handling real-time analytics and HDFS handling storage. This is a complete package, and Spark experts claim that there is nothing sacrosanct about Hadoop. The use of Spark will only increase as real-time analytics turns from a want into a dire necessity for most companies. The growth of Spark will run parallel to the growth of big data analytics. Data scientists find Spark’s simple APIs and processing speed very useful. Just as Spark provides real-time analytics, we provide the most up-to-date training on Spark. Our projects are advanced and address the requirements of the IT industry well. As one example, we have used machine learning with Spark to provide movie recommendations. Consider Spark training from us and ensure proficiency in this big data technology.
If you really want to be proficient in real time analytics then Intellipaat’s Spark training is for you!