Apache Storm Tutorial for Beginners

By Abhijit | Last updated on January 9, 2025 | 90693 Views

What is Apache Storm?

Apache Storm is about real-time processing of data streams. It consists of higher level of abstraction than simple message passing (which permits describing topologies as a DAG), per-process fault-tolerance and definite at-least-once semantics for each message in the structure.

A cutting-edge tool for machine learning and real-time analytics
Integrating it with YARN increases efficiency of Hadoop
Processes large volumes of real-time data quickly
Able to process one million records per seconds on each node

What is Trident?

Trident is a high-level concept for doing real-time calculations on top of Storm. It permits you to flawlessly combine high output (millions of messages/second), impressive stream processing with low latency distributed querying. The concepts of Trident will be very familiar to other tolls which have high level batch processing like Cascading or Pig. It has all the features like aggregations, joins, grouping, functions, filters and much more.

Trident is an interesting concept on top of Storm. Further providing advanced level constructs “alaCascading”, it batches groups of Tuples to two important parts:

Create perceptive about processing easier.
Reassure effective data perseverance, even with the support of an API that can provide precisely once semantics for round about cases.

Kick start your career by enrolling in Apache Spark training.

Getting started: each()

To manipulate each Tuple in the group either by a Filter or a Function the each() method is used. Every call to each() method allows us to do an implicit analysis of the Tuples by choosing a subset of any tuple (the non-projected values will still be accessible in continuous calls), but if we need to project them explicitly for any reason, we can use project() API method.

parallelismHint() and partitionBy()

The parallelismHint() method designs the topology up-to where we positioned it to implement with the quantified degree of parallelism.

The shuffle() method is a repartitioning process. It permits us to identify how Tuples should be directed to the following processing layer, as well as making different layers perhaps route with different degrees of parallelism. Shuffle() method executes a random routing meanwhile partitionBy() makes a routing based on a steady hashing of the Fields we require in it. So parallelismHint() method is like that it relates a definite degree of parallelism to all actions before it until there’s some sort of repartitioning will done.

Example

topology.newStream(“spout”, spout)

.parallelismHint(2)
.shuffle()
.each(new Fields(“actor”, “testing”), new PerActorTweetsFilter(“Parent”))

.parallelismHint(5)
.each(new Fields(“actor”, ” testing “), new Utils.PrintFilter());

Go through these Apache Storm Interview Questions to get high-paying Big Data jobs!

Aggregation

This Aggregator method is used to process each batch of tuples and get a map of counts based on location. This method is also a repartitioning process. Aggregation() method combined all the Tuples of a batch in a random but single process.
On the other hand, partitionAggregate() method is not a repartitioning process. In fact, it runs an Aggregation function on the part of the batch that deals with each partition.

groupBy()

The groupBy() method rationally gathered by some Fields and creates a GroupedStream. This grouping changes the actions of the following aggregate() method. As an alternative of aggregating the whole batch, it will aggregate each group individually. So it splits each Stream into multiple Streams, as many as different groups are in the batch.

groupBy() method is not a repartitioning process. groupBy() method is followed by aggregation() method but it is not followed by partitionAggregation() method.

Example

topology.newStream(“spout”, spout)
.groupBy(new Fields(“location”))
.aggregate(new Fields(“location”), new Count(), new Fields(“count”))
.each(new Fields(“location”, “count”), new Utils.PrintFilter());

Enroll yourself in Apache Storm Online Training Course today to master real-time analytics skills!

About the Author

Abhijit

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.