Apache Oozie Tutorial

Let’s see the topics covered in this section of the Hadoop tutorial:

What is Apache Oozie in Hadoop?
Types of Oozie Jobs
How does Oozie work?
Why Oozie?
Features of Apache Oozie

What is Apache Oozie in Hadoop?

Apache Oozie is a scheduler system used to run and manage Hadoop jobs in a distributed environment. It lets you combine multiple complex jobs that run in a particular order to accomplish a larger task. Within such a sequence, two or more jobs can also be programmed to run in parallel.

Oozie is widely used because it is well integrated with the Hadoop stack: it supports several types of Hadoop jobs, such as Pig, Hive, and Sqoop, along with system-specific tasks such as Shell scripts and Java programs.

Oozie triggers workflow actions, which in turn use the Hadoop execution engine to execute the actual tasks. This way, Oozie leverages the existing Hadoop machinery for failover, load balancing, etc.

Oozie detects the completion of tasks through callbacks and polling. When it starts a task, Oozie provides the task with a unique callback HTTP URL, and the task notifies that URL when it completes. If the task fails to invoke the callback URL, Oozie polls the task for completion.
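The callback-then-poll behaviour described above can be sketched in a few lines of Python. This is only an illustration: the Task class, the wait_for_completion function, and the in-process callback (standing in for the HTTP callback URL) are all hypothetical and are not part of Oozie.

```python
import time


class Task:
    """Hypothetical stand-in for a Hadoop job started by the scheduler."""

    def __init__(self, name, sends_callback):
        self.name = name
        self.sends_callback = sends_callback
        self.done = False

    def run(self, on_complete):
        # The task finishes; a well-behaved task then notifies the callback.
        self.done = True
        if self.sends_callback:
            on_complete(self.name)


def wait_for_completion(task, poll_interval=0.01, max_polls=5):
    """Return how completion was detected: 'callback', 'poll', or 'timeout'."""
    notified = []
    # notified.append plays the role of the unique callback HTTP URL.
    task.run(on_complete=notified.append)
    if notified:
        return "callback"
    # No callback arrived, so fall back to polling the task state.
    for _ in range(max_polls):
        if task.done:
            return "poll"
        time.sleep(poll_interval)
    return "timeout"
```

A task that reports back is detected through the callback, while a task that stays silent is still detected by the polling loop.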

Let’s now look at the types of jobs in Oozie.

Types of Oozie Jobs

Oozie Workflow Jobs

A workflow is a sequence of actions arranged as a directed acyclic graph (DAG). The actions depend on each other: an action is executed only after the previous action has completed and its output is available. Decision nodes can be used to determine how and under what conditions some jobs should run.

Each type of action corresponds to a particular kind of job, and each type has its own tags in the workflow definition. Job scripts must be placed in HDFS before the workflow is executed.
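To make the idea of action-specific tags concrete, here is a minimal sketch that builds a one-action workflow definition with Python's standard xml.etree.ElementTree module. The workflow name, the node names, the script name clean.pig, and the schema version uri:oozie:workflow:0.5 are assumptions chosen for illustration; check the element names against the workflow schema of your Oozie version before using them.

```python
import xml.etree.ElementTree as ET


def build_workflow(name, script_path):
    """Build a workflow definition with a single Pig action (illustrative only)."""
    wf = ET.Element(
        "workflow-app",
        {"name": name, "xmlns": "uri:oozie:workflow:0.5"},  # assumed schema version
    )
    ET.SubElement(wf, "start", {"to": "run-script"})

    action = ET.SubElement(wf, "action", {"name": "run-script"})
    pig = ET.SubElement(action, "pig")  # tags inside here are specific to Pig actions
    ET.SubElement(pig, "job-tracker").text = "${jobTracker}"
    ET.SubElement(pig, "name-node").text = "${nameNode}"
    ET.SubElement(pig, "script").text = script_path  # script must already be in HDFS
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})

    kill = ET.SubElement(wf, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Pig action failed"
    ET.SubElement(wf, "end", {"name": "end"})
    return wf


workflow = build_workflow("demo-wf", "clean.pig")
xml_text = ET.tostring(workflow, encoding="unicode")
```

The resulting xml_text is what would be saved as the workflow definition file and uploaded to HDFS together with the script it refers to.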

Oozie Coordinator Jobs

Coordinator jobs consist of workflow jobs that are triggered by time and data availability. The workflows in a coordinator job start when a given condition is satisfied. A coordinator job is defined by the following properties:

Start: Start datetime for the job
End: End datetime for the job
TimeZone: Time zone of the coordinator application
Frequency: Frequency, in minutes, at which the job is executed
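As a rough illustration of how Start, End, and Frequency interact, the sketch below lists the times at which a time-triggered workflow would be due. The function name nominal_times is hypothetical, and treating End as exclusive is an assumption made for this example, not a statement about Oozie's exact behaviour.

```python
from datetime import datetime, timedelta, timezone


def nominal_times(start, end, frequency_minutes):
    """Yield every run time from start up to (but not including) end."""
    step = timedelta(minutes=frequency_minutes)
    current = start
    while current < end:  # assumption: the end datetime itself is excluded
        yield current
        current += step


start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc)
runs = list(nominal_times(start, end, frequency_minutes=15))
```

With a one-hour window and a frequency of 15 minutes, runs holds four datetimes: 00:00, 00:15, 00:30, and 00:45 UTC.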

Apache Oozie Bundle

An Oozie Bundle packages coordinator and workflow jobs together. It lets you execute a particular set of coordinator applications, often called a data pipeline. There is no explicit dependency among the coordinator applications in a bundle, but data dependencies can be used to create an implicit data application pipeline.
You can start, stop, suspend, resume, and rerun an Oozie Bundle, which gives you better and easier operational control.

Moving ahead in this Apache Oozie tutorial, we will look at how Oozie runs a workflow job.

How does Oozie work?

Now that you know what Oozie is, let’s see how exactly it works. Oozie is a service that runs in the cluster. Clients submit workflow definitions to it for immediate processing. A workflow consists of two types of nodes: control-flow nodes and action nodes.

An action node represents a workflow task, such as running a MapReduce job, importing data, or running a Shell script.

A control-flow node controls the workflow execution between actions, for example by allowing constructs like conditional logic. Control-flow nodes include a start node (which starts a workflow job), an end node (which marks the end of a job), and an error node (which the workflow transitions to if an error occurs; Oozie workflow definitions call it a kill node).
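The interplay between control-flow nodes and action nodes can be sketched as a small graph walk. Everything here is a simplified model for illustration: the WORKFLOW dictionary, the node names, and the run_workflow function are hypothetical, and real Oozie workflows support more node types than the four shown.

```python
# Each node names the node to visit next; actions branch on success or failure.
WORKFLOW = {
    "start": {"kind": "start", "to": "import-data"},
    "import-data": {"kind": "action", "ok": "run-script", "error": "fail"},
    "run-script": {"kind": "action", "ok": "end", "error": "fail"},
    "end": {"kind": "end"},
    "fail": {"kind": "kill"},  # the error node
}


def run_workflow(nodes, run_action):
    """Walk the workflow from the start node; return (final kind, visited path).

    run_action(name) is a caller-supplied function that returns True when
    the named action succeeds and False when it fails.
    """
    path = []
    current = "start"
    while True:
        node = nodes[current]
        path.append(current)
        kind = node["kind"]
        if kind in ("end", "kill"):
            return kind, path
        if kind == "start":
            current = node["to"]
        else:  # action node: follow the ok or error transition
            current = node["ok"] if run_action(current) else node["error"]


result_ok = run_workflow(WORKFLOW, lambda name: True)
result_failed = run_workflow(WORKFLOW, lambda name: name != "run-script")
```

When every action succeeds, the walk ends at the end node; when run-script fails, the error transition sends the walk to the fail node instead.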

At the end of the workflow, Oozie can use an HTTP callback to update the client with the workflow status.

Why Oozie?

The main reason to use Oozie is to manage the several types of jobs that are processed in a Hadoop system.

The user specifies the dependencies between jobs in the form of a DAG. Oozie consumes this information and runs the jobs in the order defined by the workflow, which saves the user the time of managing the entire workflow by hand. Oozie also lets you specify how frequently a job is executed.
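To see how a DAG of dependencies translates into an execution order, consider the following sketch, which uses Python's standard graphlib module (Python 3.9 or later). The job names and the execution_batches function are hypothetical, and this is not how Oozie is implemented internally; it only shows the ordering idea.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it can start.
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"clean"},
    "publish": {"aggregate", "report"},
}


def execution_batches(deps):
    """Group jobs into batches; jobs in the same batch could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(dependencies)
# batches == [["ingest"], ["clean"], ["aggregate", "report"], ["publish"]]
```

The third batch contains two jobs because aggregate and report depend only on clean, so nothing prevents them from running side by side.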

Features of Apache Oozie

Oozie provides a client API for launching, controlling, and monitoring jobs from a Java application, as well as a command-line interface.
Using Web Service APIs, jobs can be controlled from anywhere.
Jobs can be scheduled to run periodically.
Email notifications can be sent after the completion of jobs.
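As an example of the Web Service APIs, the sketch below builds the URL for querying a job's status and fetches it using only the Python standard library. The base URL, the job ID, and both function names are made up for illustration, and the /v1/job/<job-id>?show=info path is an assumption that should be verified against the Web Services API documentation for your Oozie version.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def job_info_url(base_url, job_id):
    """Build the job-information URL (endpoint path is an assumption)."""
    query = urlencode({"show": "info"})
    return f"{base_url.rstrip('/')}/v1/job/{quote(job_id)}?{query}"


def fetch_job_info(base_url, job_id, timeout=10):
    """Fetch and decode the JSON job description; needs a reachable Oozie server."""
    with urlopen(job_info_url(base_url, job_id), timeout=timeout) as response:
        return json.load(response)


# Hypothetical server address and job ID.
url = job_info_url("http://localhost:11000/oozie/", "0000001-240101000000000-oozie-W")
```

Because this is plain HTTP, the same request can be made from any machine or language that can reach the Oozie server.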

That is all for this Apache Oozie tutorial. So far, we have learned what Oozie in Hadoop is, how it works, why we need it, and what its features are. In the next section of this tutorial, we will learn about Apache Flume.

About the Author

Abhijit

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big Data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.