Kafka and Storm integrate very well to form a real-time ecosystem. Today's industry generates large volumes of real-time streaming data that need to be processed in real time. Below are some examples.
- Hundreds of sensors are placed around machinery to monitor its health and predict failures ahead of time.
- Sensors around cell towers can report power outages, diesel pilferage, and so on in real time.
To deal with real-time data, we need the right infrastructure to store and process the incoming data in a distributed manner.
What is Apache Kafka?
Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is often used in place of traditional message brokers based on JMS or AMQP because of its higher throughput, reliability, and replication. Kafka works in combination with Apache Storm, Apache HBase, and Apache Spark for real-time analytics and rendering of streaming data. Kafka can carry geospatial data from a fleet of long-haul trucks or sensor data from heating and cooling equipment in office buildings. Whatever the industry or use case, Kafka brokers massive message streams for low-latency analysis in Enterprise Apache Hadoop. Kafka supports a wide range of use cases as a general-purpose messaging system for scenarios where high throughput, reliable delivery, and horizontal scalability are important. Common use cases include:
- Stream Processing
- Website Activity Tracking
- Metrics Collection and Monitoring
- Log Aggregation
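To make the producer side concrete, here is a minimal sketch of a Java producer publishing one JSON tweet to Kafka using the kafka-clients API. The broker address and the topic name "tweets" are assumptions for illustration, not part of any particular setup.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TweetProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // try-with-resources closes the producer, flushing any pending sends
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            String tweetJson = "{\"id\": 1, \"lang\": \"en\", \"text\": \"sample tweet\"}";
            producer.send(new ProducerRecord<>("tweets", tweetJson)); // assumed topic name
        }
    }
}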
What is Apache Storm?
Apache Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning, and continuous monitoring of operations. Specific new business opportunities include real-time customer service management, data monetization, operational dashboards, and cyber-security analytics and threat detection. Because Storm integrates with YARN via Apache Slider, YARN manages Storm while also accounting for the data governance, security, and operations components of a modern data architecture. Storm is extremely fast, able to process over a million records per second per node on a cluster of modest size. Enterprises harness this speed and combine it with other data access applications in Hadoop to prevent undesirable events or to optimize positive outcomes.
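As a quick illustration of how Kafka and Storm fit together, below is a minimal sketch of a topology that wires a Kafka spout to a parsing bolt using the classic storm-kafka module. The ZooKeeper address, topic name, parallelism hints, and the ParseTweetBolt class are all assumptions for illustration; package names follow Storm 1.x.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;

public class SentimentTopology {
    public static void main(String[] args) throws Exception {
        // Locate the Kafka brokers through ZooKeeper (address is an assumption)
        ZkHosts zkHosts = new ZkHosts("localhost:2181");
        SpoutConfig spoutConfig = new SpoutConfig(zkHosts, "tweets", "/kafka", "tweet-spout");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
        // ParseTweetBolt is a hypothetical name for the JSON-parsing bolt shown later in this post
        builder.setBolt("parse-tweet", new ParseTweetBolt(), 2).shuffleGrouping("kafka-spout");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("twitter-sentiment", new Config(), builder.createTopology());
    }
}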
In this blog, Twitter stream sentiment analysis is performed on real-time tweets stored in JSON format.
Steps involved
- A Kafka producer to read files off the disk and send them to the Kafka cluster
- A Kafka spout to consume incoming messages from Kafka brokers
- Incoming tweets, which arrive in JSON format, need to be parsed to emit tweet_id and tweet_text.
- Once tweet_id and tweet_text are extracted, a data cleaning operation (filtering) is required to remove all non-alphabetic characters (a sketch of this cleaning bolt appears after this list).
- Removing stop words, to reduce noise for the classifiers, forms the next step of data cleaning.
- The cleaned data then needs to be passed to classifiers to perform sentiment analysis.
- Sentiment analysis tells us whether a tweet is positive or negative.
- Joining and scoring the classifier results helps us aggregate the results.
- Finally, the scoring results need to be written to HDFS (a sketch of an HDFS bolt appears at the end of this post).
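To illustrate the two cleaning steps above, here is a hypothetical Storm bolt that strips non-alphabetic characters and drops common stop words. The field names, the stop-word list, and the bolt name are assumptions; a real topology would use a much fuller stop-word set.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class CleanTweetBolt extends BaseRichBolt {
    // Tiny illustrative stop-word list; assumed, not from the original post
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "to", "and", "of"));
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        long id = tuple.getLongByField("tweet_id");
        String text = tuple.getStringByField("tweet_text");

        // Step 1: keep only alphabetic characters and whitespace
        String alphaOnly = text.replaceAll("[^a-zA-Z\\s]", "").toLowerCase();

        // Step 2: drop stop words to reduce noise for the classifiers
        String cleaned = Arrays.stream(alphaOnly.split("\\s+"))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .collect(Collectors.joining(" "));

        collector.emit(new Values(id, cleaned));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tweet_id", "cleaned_text"));
    }
}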
Once the Kafka producer and Storm topology are implemented, one of the first things to do is parse the JSON tweets.
This JSON parsing can be accomplished with the Jackson Databind library:
JsonNode root = mapper.readValue(json, JsonNode.class);
long id;
String text;
// Only process tweets whose language code is English
if (root.get("lang") != null &&
        "en".equals(root.get("lang").textValue())) {
    if (root.get("id") != null && root.get("text") != null) {
        id = root.get("id").longValue();
        text = root.get("text").textValue();
        // Emit the (tweet_id, tweet_text) tuple down the pipeline
        collector.emit(new Values(id, text));
    } else {
        LOGGER.debug("tweet id and/or text was null");
    }
} else {
    LOGGER.debug("Ignoring non-English tweet");
}
This snippet uses the basic ObjectMapper class from the Jackson Databind library to map the simple JSON object provided by the Twitter Streaming API to a JsonNode. The language code is tested first, as the topology does not support non-English tweets. After ensuring the existence of the desired data, namely tweet_id and tweet_text, the new tuple is emitted down the Storm pipeline.
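For downstream bolts to access these values by name, the parsing bolt also needs to declare its output fields. A minimal sketch, with the field names tweet_id and tweet_text assumed:

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // Field names are assumptions; they must match what downstream bolts read
    declarer.declare(new Fields("tweet_id", "tweet_text"));
}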
Once the required tweet_id and tweet_text are emitted, they need to be cleaned, and sentiment analysis should be performed.
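After classification, the final step in the list above is broadcasting the scored results to HDFS. A minimal, hypothetical configuration using the HdfsBolt from the storm-hdfs module might look like the following; the output path, NameNode URL, and sync/rotation settings are assumptions.

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

public class HdfsSinkConfig {
    public static HdfsBolt buildHdfsBolt() {
        // Write comma-delimited tuples, syncing to HDFS every 100 tuples
        RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter(",");
        SyncPolicy syncPolicy = new CountSyncPolicy(100);
        // Roll over to a new file every 5 MB
        FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(5.0f, Units.MB);
        FileNameFormat fileNameFormat =
                new DefaultFileNameFormat().withPath("/tweets/scores/"); // assumed path

        return new HdfsBolt()
                .withFsUrl("hdfs://localhost:8020") // assumed NameNode URL
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(format)
                .withRotationPolicy(rotationPolicy)
                .withSyncPolicy(syncPolicy);
    }
}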