bing
Flat 10% & upto 50% off + Free additional Courses. Hurry up!

Streaming

 

It uses UNIX standard streams as the interface between Hadoop and your program so you can write Mapreduce program in any language which can write to standard output and read standard input. Hadoop offers a lot of methods to help non-Java development.

The primary mechanisms are Hadoop Pipes which gives a native C++ interface to Hadoop and Hadoop Streaming which permits any program that uses standard input and output to be used for map tasks and reduce tasks.

With this utility one can create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

 

Example using Python

Streaming supports any programming language that can read from standard input and write to standard output. For Hadoop streaming one must consider the word-count problem. Codes are written for the mapper and the reducer in python script to be run under Hadoop.

Mapper Code

!/usr/bin/python

import sys

for intellipaatline in sys.stdin:   # Input takes from standard input

intellipaatline = intellipaatline.strip()  # Remove whitespace either side

words = intellipaatline.split()  # Break the line into words

for myword in words:  # Iterate the words list

output print '%s\t%s' % (myword, 1)  # Write the results to standard

 

 

Reducer Code

#!/usr/bin/python

from operator import itemgetter import sys

current_word = ""

current_count = 0

word = ""

for intellipaatline in sys.stdin: # Input takes from standard input

intellipaatline = intellipaatline.strip()# Remove whitespace either side

word , count = intellipaatline.split('\t', 1)  # Split the input we got from mapper.py

try: # Convert count variable to integer

count = int(count)

except ValueError:

# Count was not a number, so silently ignore this line continue

if current_word == word: current_count += count

else:

if current_word:

print '%s\t%s' % (current_word, current_count) # Write result to standard o/p

current_count = count

current_word = word

if current_word == word:  # Do not forget to output the last word if needed!

print '%s\t%s' % (current_word, current_count)

Mapper and Reducer codes should be saved in mapper.py and reducer.py in Hadoop home directory.

 

WordCount Execution

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1. 2.1.jar \

-input input_dirs \ -

output output_dir \ -

mapper <path/mapper.py \

-reducer <path/reducer.py

Where “\” is used for line continuation for clear readability

 

How Streaming Works

Input is read from standard input and the output is emitted to standard output by Mapper and the Reducer. Utility creates a Map/Reduce job, submits the job to an appropriate cluster, and monitors the progress of the job until completion.

Every mapper task will launch the script as a separate process when the mapper is initialized after a script is specified for mappers. Mapper task inputs are converted into lines and fed to the standard input and Line oriented outputs are collected from the standard output of the procedure Mapper and every line is changed into a key, value pair which is collected as the outcome of the mapper.

Each reducer task will launch the script as a separate process and then the reducer is initialized after a script is specified for reducers.As the reducer task runs, reducer task input key/values pairs are converted into lines and feds to the standard input (STDIN) of the process.

Each line of the line-oriented outputs is converted into a key/value pair after it is collected from the standard output (STDOUT) of the process, which is then collected as the output of the reducer.

 

Important Commands

Parameters Description
-input directory/file-name Input location for mapper. (Required)
-output directory-name Output location for reducer. (Required)
-mapper executable or script or JavaClassName Mapper executable. (Required)
-reducer executable or script or JavaClassName Reducer executable. (Required)
-file file-name Create the mapper, reducer or combiner executable available locally on the compute nodes.
-inputformat JavaClassName Class you offer should return key,value pairs of Text class. If not specified TextInputFormat is used as the default.
-outputformat JavaClassName Class you offer should take key, value pairs of Text class. If not specified TextOutputformat is used as the default.
-partitioner JavaClassName Class that determines which reduce a key is sent to.
-combiner streamingCommand or JavaClassName Combiner executable for map output.
-cmdenv name=value Passes the environment variable to streaming commands.
-inputreader For backwards compatibility: specifies a record reader class instead of an input format class.
-verbose Verbose output.
-lazyOutput Creates output lazily. For example if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect or Context.write.
-numReduceTasks Specifies the number of reducers.
-mapdebug Script to call when map task fails.
-reducedebug Script to call when reduce task fails.

 

Hadoop Pipes

It is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming which uses standard I/O to communicate with the map and reduce code Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.

"0 Responses on Streaming"

Training in Cities

Bangalore, Hyderabad, Chennai, Delhi, Kolkata, UK, London, Chicago, San Francisco, Dallas, Washington, New York, Orlando, Boston

100% Secure Payments. All major credit & debit cards accepted Or Pay by Paypal.

top

Sales Offer

  • To avail this offer, enroll before 05th December 2016.
  • This offer cannot be combined with any other offer.
  • This offer is valid on selected courses only.
  • Please use coupon codes mentioned below to avail the offer
offer-june

Sign Up or Login to view the Free Streaming.