What is Hadoop Streaming - How Streaming Works

Table of content

Show More

Introduction to Hadoop Streaming

Hadoop Streaming uses UNIX standard streams as the interface between Hadoop and your program so you can write MapReduce program in any language which can write to standard output and read standard input. Hadoop offers a lot of methods to help non-Java development.

The primary mechanisms are Hadoop Pipes which gives a native C++ interface to Hadoop and Hadoop Streaming which permits any program that uses standard input and output to be used for map tasks and reduce tasks.

With this utility, one can create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.

Watch this video on Hadoop before going further on this Hadoop tutorial

Video Thumbnail

Features of Hadoop Streaming

Some of the key features associated with Hadoop Streaming are as follows :

  • Hadoop Streaming is a part of the Hadoop Distribution System.
  • It facilitates ease of writing Map Reduce programs and codes.
  • Hadoop Streaming supports almost all types of programming languages such as Python, C++, Ruby, Perl etc.
  • The entire Hadoop Streaming framework runs on Java. However, the codes might be written in different languages as mentioned in the above point.
  • The Hadoop Streaming process uses Unix Streams that act as an interface between Hadoop and Map Reduce programs.
  • Hadoop Streaming uses various Streaming Command Options and the two mandatory ones are – -input directoryname or filename and -output directoryname

Hadoop Streaming architecture

Hadoop Streaming architecture

As it can be clearly seen in the diagram above that there are almost 8 key parts in a Hadoop Streaming Architecture. They are :

  • Input Reader/Format
  • Key Value
  • Mapper Stream
  • Key-Value Pairs
  • Reduce Stream
  • Output Format
  • Map External
  • Reduce External

The involvement of these components will be discussed in detail when we explain the working of the Hadoop streaming. However, to precisely summarize the Hadoop Streaming Architecture, the starting point of the entire process is when the Mapper reads the input value from the Input Reader Format. Once the input data is read, it is mapped by the Mapper as per the logic given in the code. It then passes through the Reducer stream and the data is transferred to the output after data aggregation is done. A more detailed description is given in the below section on the working of the Hadoop Streaming.

Get 100% Hike!

Master Most in Demand Skills Now!

How does Hadoop Streaming Work?

  • Input is read from standard input and the output is emitted to standard output by Mapper and the Reducer. The utility creates a Map/Reduce job, submits the job to an appropriate cluster, and monitors the progress of the job until completion.
  • Every mapper task will launch the script as a separate process when the mapper is initialized after a script is specified for mappers. Mapper task inputs are converted into lines and fed to the standard input and Line oriented outputs are collected from the standard output of the procedure Mapper and every line is changed into a key, value pair which is collected as the outcome of the mapper.
  • Each reducer task will launch the script as a separate process and then the reducer is initialized after a script is specified for reducers. As the reducer task runs, reducer task input key/value pairs are converted into lines and fed to the standard input (STDIN) of the process.
  • Each line of the line-oriented outputs is converted into a key/value pair after it is collected from the standard output (STDOUT) of the process, which is then collected as the output of the reducer.

Hadoop Streaming using Python

Hadoop Streaming supports any programming language that can read from standard input and write to standard output. For Hadoop streaming, one must consider the word-count problem. Codes are written for the mapper and the reducer in python script to be run under Hadoop.

Mapper Code

!/usr/bin/python
import sys
for intellipaatline in sys.stdin: # Input takes from standard input
intellipaatline = intellipaatline.strip() # Remove whitespace either side
words = intellipaatline.split() # Break the line into words
for myword in words: # Iterate the words list
output print '%st%s' % (myword, 1) # Write the results to standard

Reducer Code

#!/usr/bin/python from operator
import item getter
import sys current_word = "" current_count = 0 word = "" for intellipaatline in sys.stdin:
# Input takes from standard input intellipaatline = intellipaatline.strip()
# Remove whitespace either side word , count = intellipaatline.split('t', 1)
# Split the input we got from mapper.py try:
# Convert count variable to integer count = int(count) except ValueError:
# Count was not a number, so silently ignore this line continue if current_word == word: current_count += count else:
if current_word: print '%st%s' % (current_word, current_count)
# Write result to standard o/p current_count = count current_word = word if current_word == word:
# Do not forget to output the last word if needed! print '%st%s' % (current_word, curr

Mapper and Reducer codes should be saved in mapper.py and reducer.py in the Hadoop home directory.

WordCount Execution

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1. 2.1.jar
-input input_dirs  -
output output_dir  -
mapper<path/mapper.py
-reducer <path/reducer.py
Where “” is used for line continuation for clear readability

Important Hadoop Streaming Commands

Parameters Description
-input directory/file-name Input location for the mapper. (Required)
-output directory-name Output location for reducer. (Required)
-mapper executable or script or JavaClassName Mapper executable. (Required)
-reducer executable or script or JavaClassName Reducer executable. (Required)
-file file-name Create the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName Class you offer should return key, value pairs of Text class. If not specified TextInputFormat is used as the default.
-outputformat JavaClassName Class you offer should take key, value pairs of Text class. If not specified TextOutputformat is used as the default.
-partitioner JavaClassName Class that determines which reduce a key is sent to.
-combiner streaming Command or JavaClassName Combiner executable for map output.
-inputreader For backward compatibility: specifies a record reader class instead of an input format class.
-verbose Verbose output.
-lazyOutput Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output. collect or Context.write.
-numReduceTasks Specifies the number of reducers.
-mapdebug Script to call when map task fails.
-reducedebug Script to call when reduction makes the task failure
-cmdenv name=value Passes the environment variable to streaming commands.

Hadoop Pipes

It is the name of the C++ interface to Hadoop MapReduce. Unlike Hadoop Streaming which uses standard I/O to communicate with the map and reduce code Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.

That’s all for this section of the Hadoop tutorial. Let’s move on to the next one on Hbase!

Our Big Data Courses Duration and Fees

Program Name
Start Date
Fees
Cohort starts on 11th Jan 2025
₹22,743
Cohort starts on 1st Feb 2025
₹22,743
Cohort starts on 25th Jan 2025
₹22,743

About the Author

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.