In 2015, Databricks ran a survey in which about 1,400 Spark users participated. It showed that roughly 56% more users ran Spark Streaming in 2015 than in 2014, and almost half of the respondents named Spark Streaming their favorite Spark component.
In Databricks’ 2016 Apache Spark survey, about half of the participants said they considered Spark Streaming an essential component for building real-time streaming use cases.
Production use of Spark Streaming grew from 14% in 2015 to 22% in 2016, which shows how prevalent it has become in the analytics world. Well-known companies such as Netflix, Pinterest, and Uber use Spark Streaming in production.
What is Apache Spark Streaming?
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. It takes live data streams as input and divides them into batches; the Spark engine then processes those batches and produces the final results as a stream of batches.
Spark Streaming Example
Some real-time examples of Apache Spark Streaming are:
- Website and network monitoring
- Fraud detection
- Internet of Things sensors
- Advertising
- Web clicks
Spark Streaming Architecture
Instead of processing the streaming data one record at a time, Spark Streaming discretizes it into micro-batches. Spark Streaming’s receivers accept data in parallel, and the data is buffered in the memory of Spark’s worker nodes.
The latency-optimized Spark engine then runs short tasks to process these batches and outputs the results to other systems. Spark tasks are assigned to the workers dynamically, based on the locality of the data and the available resources, which brings improved load balancing and rapid fault recovery.
Each batch of data is a Resilient Distributed Dataset (RDD), the basic abstraction for a fault-tolerant dataset in Spark. Because of this, streaming data can be processed using any Spark code or library.
How does Spark Streaming work?
Spark Streaming separates a data stream into batches of X seconds. This stream of batches, called a DStream (discretized stream), is internally a sequence of RDDs. The Spark application processes these RDDs with the Spark APIs, and the results of the RDD operations are returned in batches.
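As a minimal sketch of this model in PySpark, the example below counts words arriving on a socket; the host, port, and 10-second batch interval are illustrative assumptions, not values from the original article.

```python
# Minimal Spark Streaming (DStream) sketch, assuming a text source on localhost:9999.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WordCount")        # at least 2 threads: 1 receiver + 1 processor
ssc = StreamingContext(sc, 10)                    # X = 10 seconds: each batch covers 10s of data

lines = ssc.socketTextStream("localhost", 9999)   # a DStream: one RDD per 10-second batch
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))  # ordinary Spark operations, applied per batch

counts.pprint()                                   # print a few elements of each batch's result
ssc.start()
ssc.awaitTermination()
```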
Data Source in Spark Streaming
Spark Streaming supports various data sources. Some of them are:
- HDFS directories
- TCP sockets
- Kafka
- Flume
- Twitter and more
Data streams can be processed with Spark’s core APIs, DataFrames and SQL, or machine learning APIs, and the results can be persisted to a file system, HDFS, databases, or any data sink that offers a Hadoop OutputFormat.
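As a hedged illustration of one of these sources, the sketch below watches an HDFS directory for new files and writes filtered results back to HDFS; the paths and the filter are hypothetical.

```python
# Sketch: HDFS directory as a streaming source; paths and filter logic are assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="HdfsDirectorySource")
ssc = StreamingContext(sc, 30)                                  # 30-second batches

# Each batch picks up files newly added to this directory.
logs = ssc.textFileStream("hdfs://namenode:8020/data/incoming")

errors = logs.filter(lambda line: "ERROR" in line)

# Persist each batch as text files; any Hadoop OutputFormat-compatible sink would also work.
errors.saveAsTextFiles("hdfs://namenode:8020/data/errors/part")

ssc.start()
ssc.awaitTermination()
```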
Features of Spark Streaming
Ease of use – Spark Streaming brings Apache Spark’s language-integrated API to stream processing, so streaming jobs can be written in much the same way as batch jobs. Java, Scala, and Python are supported.
Spark Integration – Because Spark Streaming runs on Spark, the same code can be reused for batch processing, joining streams against historical data, or running ad-hoc queries on stream state. Beyond analytics, powerful interactive applications can be built.
Fault tolerance – Spark Streaming recovers both lost work and operator state out of the box, without requiring extra code from the developer (see the checkpointing sketch below).
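This recovery relies on checkpointing. A minimal sketch of how it is typically enabled in PySpark follows; the checkpoint directory, source, and running-count logic are illustrative assumptions.

```python
# Sketch of checkpoint-based recovery; paths and counting logic are assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs://namenode:8020/checkpoints/wordcount"

def create_context():
    sc = SparkContext(appName="FaultTolerantCount")
    ssc = StreamingContext(sc, 10)
    ssc.checkpoint(CHECKPOINT_DIR)                   # metadata and state checkpoints go here

    lines = ssc.socketTextStream("localhost", 9999)
    pairs = lines.flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1))

    # Stateful operator: running count per word, recoverable from the checkpoint.
    def update(new_values, running):
        return sum(new_values) + (running or 0)

    pairs.updateStateByKey(update).pprint()
    return ssc

# After a failure and restart, the context (and operator state) is rebuilt from the checkpoint.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```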
Spark Streaming Applications
Spark Streaming is typically applied in four ways:
Streaming ETL – Data is cleaned and aggregated before being stored in data stores.
Triggers – Abnormal activity is detected in real time, and downstream actions are triggered as a result.
Sophisticated Sessions and Continuous Learning – Events belonging to a live session can be grouped and analyzed together, and the session information can be used to continuously update machine learning models.
Data Enrichment – Live data is enriched with more information by joining it with a static dataset, which allows richer real-time analysis (see the sketch below).
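As a rough sketch of the data-enrichment pattern, each batch of a live stream can be joined with a static lookup dataset through transform(); the data shapes, source, and reference data here are hypothetical.

```python
# Data enrichment sketch: join each streaming batch with a static dataset; data is assumed.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DataEnrichment")
ssc = StreamingContext(sc, 5)

# Static reference data, e.g. user_id -> country, loaded once.
user_info = sc.parallelize([("u1", "US"), ("u2", "IN"), ("u3", "DE")])

# Live events arriving as "user_id,action" lines over a socket (illustrative source).
events = ssc.socketTextStream("localhost", 9999) \
            .map(lambda line: tuple(line.split(",", 1)))       # -> (user_id, action)

# transform() exposes each batch as an RDD, so it can be joined with the static RDD.
enriched = events.transform(lambda rdd: rdd.join(user_info))   # -> (user_id, (action, country))
enriched.pprint()

ssc.start()
ssc.awaitTermination()
```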
Why is streaming analytics needed?
Most companies generate gigantic volumes of data and are poised to extract value from it, increasingly in real time. IoT devices, online transactions, sensors, and social networks all produce data that needs to be acted upon quickly.
For example, billions of devices will be connected to the IoT in the years to come, generating enormous amounts of data ready to be processed. Businesses are already moving to seize this opportunity, and doing so requires streaming capabilities. The same applies to online transactions and detecting fraud in bank credit transactions. The need for large-scale, real-time data streaming is therefore greater than ever.
Advantages of Discretized Stream Processing in Spark Streaming
Dynamic load balancing – Dividing the data into small micro-batches allows fine-grained allocation of computations to resources. As an example, consider a simple workload where the input data has to be partitioned by a key and processed.
The drawback of the traditional approach, which most analytics players follow, is that they process one record at a time. If one record is computationally more demanding than the others, it becomes a bottleneck and slows down the pipeline.
The pipeline involves receiving streaming data from the data source, processing the data in parallel on a cluster, and finally outputting the results to downstream systems. With Spark Streaming, the job’s tasks are load-balanced across the workers, so some workers process the longer tasks while others process the shorter ones.
Fast failure and straggler recovery – When a node fails, legacy systems often have to restart the failed operator on another node and replay part of the data stream to recompute the lost information.
Note that only one node handles the recomputation, and the pipeline cannot proceed until that node has caught up after the replay. In Spark, the situation is different: computation is divided into small, deterministic tasks that can run anywhere without affecting correctness.
Failed tasks can be relaunched in parallel on the other nodes of the cluster, which spreads the recomputation evenly across many nodes and makes recovery from failure faster than with the traditional approach.
Unifying batch, streaming, and interactive analytics is easy – The DStream (discretized stream) is the key programming abstraction in Spark Streaming. Each batch of streaming data is represented by an RDD, and a DStream is a series of such RDDs.
Thanks to this common representation, batch and streaming workloads interoperate seamlessly: users can apply arbitrary Spark functions to each batch of streaming data. Spark is therefore ideal for unifying batch, streaming, and interactive workloads, whereas systems without a common abstraction are painful to unify.
Because Spark Streaming batches data and runs it on the Spark engine, it achieves higher throughput than other streaming systems while still reaching latencies as low as a few hundred milliseconds.
Since Spark processes data in batches, micro-batching may seem to add too much to the overall latency. In practice, however, batching latency is only one of many components of end-to-end pipeline latency.
For example, many applications compute over a sliding window that is updated periodically, such as a 15-second window that slides every 1.5 seconds. Pipelines also collect records from multiple sources and typically wait to handle out-of-order data.
An automatic triggering algorithm also waits for a time period before firing a trigger. Compared to the rest of the end-to-end latency, batching therefore rarely adds significant overhead, and thanks to the throughput gains from DStreams, fewer machines are needed to handle the same workload.
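A hedged sketch of such a windowed computation is shown below, using the 15-second window sliding every 1.5 seconds mentioned above; the 1.5-second batch interval, source, and click-count logic are assumptions for illustration.

```python
# Sliding-window sketch: 15s window sliding every 1.5s; batch interval and logic are assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WindowedCounts")
ssc = StreamingContext(sc, 1.5)                  # window and slide must be multiples of this
ssc.checkpoint("/tmp/window-checkpoint")         # required by the inverse-reduce form below

clicks = ssc.socketTextStream("localhost", 9999) \
            .map(lambda page: (page, 1))

# Count clicks per page over the last 15 seconds, recomputed every 1.5 seconds.
windowed = clicks.reduceByKeyAndWindow(
    lambda a, b: a + b,        # add counts entering the window
    lambda a, b: a - b,        # subtract counts leaving the window (incremental update)
    15, 1.5)

windowed.pprint()
ssc.start()
ssc.awaitTermination()
```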
Spark Streaming Output Operations
Output operations push the data from a DStream out to external systems such as file systems or databases.
| Output Operation | Meaning |
| --- | --- |
| print() | Prints the first ten elements of every batch of data in a DStream. The printing is done on the driver node that runs the application. |
| saveAsTextFiles(prefix, [suffix]) | Saves the DStream’s contents as text files. The file name at each batch interval is generated from the prefix (and optional suffix). |
| saveAsHadoopFiles(prefix, [suffix]) | Saves the contents of the DStream as Hadoop files. |
| saveAsObjectFiles(prefix, [suffix]) | Saves the contents of the DStream as SequenceFiles of serialized Java objects. |
| foreachRDD(func) | A generic output operator that applies a function, func, to each RDD generated from the stream (see the sketch below). |
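As a hedged example of foreachRDD, the fragment below pushes each batch to an external system one partition at a time; send_to_sink is a hypothetical client call standing in for whatever API your database exposes, and counts is assumed to be a DStream built earlier in the job.

```python
# foreachRDD sketch: write each batch to an external sink; send_to_sink is hypothetical.
def send_partition(records):
    # In a real job, open a connection to the external system here,
    # write all records of this partition, then close the connection.
    for record in records:
        send_to_sink(record)                   # hypothetical client call, not a Spark API

def save_to_external_system(rdd):
    if not rdd.isEmpty():
        rdd.foreachPartition(send_partition)   # one connection per partition, run on the workers

# 'counts' is any DStream produced earlier in the job.
counts.foreachRDD(save_to_external_system)
```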
Spark Streaming Use Cases
- Uber collects terabytes of event data from its mobile users every day for real-time telemetry analysis. Through a continuous ETL pipeline built with Kafka, Spark Streaming, and HDFS, Uber converts the unstructured event data into structured data as it is collected and sends it on for complex analytics.
- Pinterest’s ETL data pipeline feeds data to Spark via Spark Streaming to provide a real-time picture of how users across the globe are engaging with Pins. This is what makes Pinterest’s recommendation engine so good at showing related Pins as people use the service to plan places to go, products to buy, and recipes to cook.
- Netflix receives billions of events from various sources and uses Kafka and Spark Streaming to build a real-time engine that gives users the most relevant movie recommendations.
Conclusion
Spark Streaming has the built-in capability to recover from failures in real time, and resource allocation is adapted dynamically to the workload. Querying streaming data with SQL has never been easier. The Apache Software Foundation keeps producing technologies such as Spark, Hadoop, and other big data tools.
Spark Streaming is Spark’s streaming data capability, and a very efficient one. For performing analytics on real-time data streams, it is a better option than the legacy streaming alternatives.