Databricks conducted a study which about 1400 Spark users participated in 2015. The study showed that about 56% more Spark users ran Spark streaming in 2015 as compared to 2014. Almost half of the respondents said that Spark streaming was their favorite Spark component. In the 2016 Apache Spark survey of Databricks about half of the participants said that for building real-time streaming use cases they considered Spark Streaming as an essential component. The production use of Spark streaming increased to 22% in 2016 as compared to 14% in 2015. This explains how prevalently it is used in the analytics world. Companies like Netflix, Pinterest and Uber are the famous names which use Spark streaming in their game. Spark streaming is nothing but an extension of core Spark API that is responsible for fault-tolerant, high throughput, scalable processing of live streams. Spark streaming takes live data streams as input and provides as output batches by dividing them. These streams are then processed by Spark engine and final stream results in batches.
Check out this insightful video on Spark Tutorial For Beginners
Why streaming analytics is needed?
A gigantic proportion of data is being generated by the vast majority of companies that are ever poised to leverage value from it and that too in real time. IoT devices, online transactions, sensors, social networks are generating huge data that needs to be acted upon quickly.
Example, do you know that billions of devices will be connected to the IoT in the years to come? They will generate enormous amount of data ready to be processed. Entrepreneurs are already turning their gaze to leverage this great opportunity and in doing that the need for streaming capabilities is very much present. The same is with data with online transactions and detecting frauds in bank credit transactions. Hence there is a dire need for large scale real time data streaming than ever.
Spark streaming architecture
Spark streaming discretizes into micro batches of streaming data instead of processing the streaming data in steps of records per unit time. Data is accepted in parallel by the Spark streaming’s receivers and in the worker nodes of Spark this data is held as buffer. To process batches the Spark engine which is typically latency optimized runs short tasks and outputs the results to other systems. Based on available resources and locality of data Spark tasks are dynamically assigned to the workers. Improved load balancing and rapid fault recovery are its obvious benefits.
Resilient distributed dataset (RDD) constitutes each batch of data and for fault tolerant dataset in Spark this is the basic abstraction. It is because of this feature that streaming data can be processed using any code snippet of Spark or library.
Features of Spark streaming
Ease of use – The language integrated API of Apache Spark is used by Spark streaming to stream processing. One can write streaming jobs in a similar way how batch jobs are written. Java, Scala and Python are supported by Spark streaming.
Spark Integration – A similar code can be reused because Spark streaming runs on Spark and this is useful for running ad-hoc queries on stream state, batch processing, join streams against historical data. Apart from analytics, powerful interactive applications can be built.
Fault tolerance – Lost work and operator state can both be recovered by Spark streaming without adding extra code from the developer.
Advantages of discretized stream processing from Spark streaming
Dynamic load balancing – Fine-grained allocation of computations to resources is possible from dividing the data from small micro-batches. As an example think of a simple workload where partition has to happen on the input data by a key and has to be processed. The demerit in traditional approach which the majority analytics players follow is they process one record at a time and if one record is more computationally more demanding than others then this poses as a bottleneck and slows down the pipeline. The pipeline involves receiving streaming data from data source, process in parallel the data on a cluster and finally output the results to downstream systems. Hence, the job’s tasks in Spark streaming will be load balanced across the workers where some workers will process longer time taking tasks and other workers process shorter time taking tasks.
Fast failure and straggler recovery – While dealing with node failures, legacy systems often have to restart the failed operator on another node and to recompute the lost information they have to replay some part of the data stream. It is to be noted that only one node is handling the recomputation and until a new node hasn’t caught up after the replay, the pipeline won’t proceed. In Spark however the case is different where computation can run anywhere without affecting the correctness and it is divided into small, deterministic tasks in achieving that feat. In the cluster of nodes, failed tasks can be relaunched in parallel. This distributes across many nodes evenly all the recomputations. Compared to the traditional approach recovery from failure is faster.
Unifying batch, streaming and interactive analytics is easy – DStream or distributed stream is a key programming abstraction in Spark streaming. An RDD represents each batch of streaming data. A series of RDDs constitute a DStream. Batch and streaming workloads interoperate seamlessly thanks to this common representation. On each batch of streaming data users can apply arbitrary Spark functions. Spark is therefore ideal for unifying batch, streaming and interactive workloads. There are systems which don’t have a common abstraction and therefore it is a pain to unify them.
Spark streaming performance
The capability to batch data and use Spark engine by the Spark streaming component gives higher throughput to other streaming systems. Latencies as low as few hundred milliseconds can be achieved by Spark streaming. As Spark processes all data together it does so in batches. Micro batching seems to add too much to overall latency. In practice however, batching latency is one among many components of end-to-end pipeline latency. As an example, over a sliding window typically many applications compute and this window is updated periodically like a 15 second window that slides every 1.5 seconds. From multiple sources, pipelines collect records and wait typically to process out-of-order data. Before firing a trigger an automatic triggering algorithm wait for a time period. Batching rarely adds overheads as when compared to end-to-end latency. One would therefore need fewer machines to handle the same workload due to the virtue of throughput gains from DStreams.
Spark streaming applications
There are four ways how Spark Streaming is being implemented nowadays.
Streaming ETL – Before being stockpiled into data stores data is cleaned and aggregated.
Triggers – Abnormal activity is detected in real time and downstream actions are triggered consequentially.
Sophisticated sessions and continuous learning – Events can be grouped and analyzed together of a live session. Session information is used to continuously update machine learning models.
Data enrichment – By joining live data with a static dataset real time analysis can be derived when the live data is enriched with more information.
Spark streaming use cases
1) Uber collects from their mobile users everyday terabytes of event data for real time telemetry analysis. Uber converts the unstructured event data into structured data as it is collected and sends it for complex analytics by building a continuous ETL pipeline using Kafka, Spark Streaming, and HDFS.
2) An ETL data pipeline built by Pinterest feeds data to Spark via Spark streaming to provide a picture as to how the users are engaging with Pins across the globe in real time. Recommendation engine of Pinterest is therefore very good in that it is able to show related pins as people use the service to plan places to go, products to buy and recipes to cook.
3) From various sources, billions of events are received by Netflix. They have used Kafka and Spark streaming to incept a real time engine that gives users the most relevant movie recommendations.
Spark streaming houses within it the capability to recover from failures in real time. The resource allocation is dynamically adapted depending on the workload. Streaming data with SQL queries has never been easier. Apache foundation has been incepting new technologies like Spark, Hadoop and other big data tools. Spark streaming is the streaming data capability of Spark and a very efficient one at that. For performing analytics on the real-time data streams Spark streaming is the best option as compared to the legacy streaming alternatives.
Master Spark streaming through Intellipaat’s Spark Scala training!