• Articles
  • Tutorials
  • Interview Questions

What is Airflow DAG? Key Components and Benefits

Tutorial Playlist

Welcome to this Airflow blog! We’re here to give you a complete picture of Airflow DAGs – what they are, how they work, and the smartest ways to create them. Keep reading to become an expert in this essential part of Airflow. 

Table of Contents

Watch the video below to understand the concept of Apache Airflow

What is Apache Airflow DAG?

An Airflow DAG, short for Directed Acyclic Graph, is a helpful tool that lets you organize and schedule complicated tasks with data. It is an open-source platform for putting together and scheduling complex data workflows. It is specially customized for Machine Learning Operations (MLOps) and other data-related tasks.

Let’s understand the meaning of DAG in an elaborative way:

Directed: When dealing with multiple tasks, it’s crucial for each task to be linked to at least one preceding or succeeding task.
Acyclic: Tasks are prevented from depending on themselves. This ensures the prevention of infinite loops.
Graph: All tasks are visualizable as nodes and vertices within a graphical framework. It illustrates task relationships.

Airflow DAGs refer to a Python-coded data pipeline. Each DAG represents a set of tasks intended for execution. They are arranged to exhibit task relationships through the Airflow UI. These DAGs make use of the advantageous characteristics of DAG structures for constructing effective data pipelines.

Airflow’s DAGs provide the flexibility to be defined according to specific requirements. Whether they encompass a solitary task or an extensive arrangement of thousands. The structuring possibilities are diverse.

Furthermore, an occurrence of a DAG in action on a specific date is termed a “DAG run.” These DAG runs can be initiated by the Airflow scheduler. As per the defined schedule of the DAG, they can be manually triggered.

Take your career to the next height, enroll in Data Science Course!

Key Components of Apache Airflow

Apache Airflow relies on several key components that operate continuously to keep its system functional. Apache Airflow’s components work together seamlessly to help manage tasks and workflows efficiently. 

Core Components of Apache Airflow

Some of the core components of Apache airflow are listed below:

  • Webserver: Airflow’s web-based control panel makes use of Gunicorn. It serves a User Interface (UI) built with Flask. It allows interaction and monitoring of workflows.
  • Scheduler: A scheduler acts like an intelligent manager. It is responsible for scheduling tasks. Essentially, this role includes planning which tasks need to be completed when and where. 
  • Database: Your workflow and task information is stored here. It acts like an electronic filing cabinet. Postgres typically powers this database. Though other options such as MySQL, MsSQL or SQLite could also work just as well.
  • Executor: This component is what actually gets things done. Its hands do all the work. Once Airflow is up and running, its executor becomes active within its scheduler.

EPGC IITR iHUB

Situational Components of Apache Airflow

When using Astro CLI to run Airflow locally on your machine, you will notice it creates three separate sections devoted to its core components.

These components may also require additional pieces for specific tasks or features.

  • Worker: Workers are the hands-on experts who carry out tasks assigned by an executor. Your choice of executor determines whether these worker components will be necessary or not.
  • Trigger: This component works alongside your scheduler. It helps with tasks that can be postponed until later. It is not necessary for everyone. But, setting it up individually might be useful if using specific features.

Learn more key features of Apache Spark in this Apache Spark Tutorial!

Installation of Airflow DAGs

Airflow DAGs are quite easy to install. But, before that, make sure that pip is available on your system. Here are the following steps: 

Step 1: Begin by installing pip, if not already present. To install pip, execute the below command.

$ sudo apt-get install python3-pip

If you have already installed it, you can directly jump to step 3.

Step 2: Specify the installation location by entering the following command:

$ export AIRFLOW_HOME=~/airflow

Step 3: The next step is to use pip and install Apache Airflow 

$ pip3 install apache-airflow

Step 4: Initialize the backend to manage the workflow by giving the following command.

$ airflow initdb

Step 5: Now you have to run the following command to start your web server or the Apache user interface 

$ airflow webserver -p 8080

Step 6:  Activate the Airflow scheduler for workflow monitoring:

$ airflow scheduler

How to Create and Declare an Airflow DAG

Creating and declaring an airflow DAG is quite simple. Let us take an example where we will create a workflow. The main aim of this workflow is to send “Welcome to Intellipaat” to the log.

The below workflow will execute the send_welcome_message function, printing “Welcome to Intellipaat” to the log. You can customize the DAG further with additional tasks, operators, and parameters to suit your requirements. Below are the following steps:

Step 1: Import Required Modules

Start by importing the necessary modules from the Airflow library.

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

Step 2: Define Default Arguments

Set the default arguments for the DAG, including the start date and other parameters.

from datetime import datetime, timedelta
default_args = {
'owner': 'your_name',
'start_date': datetime(2023, 9, 1),
'retries': 1,
'retry_delay': timedelta(minutes=5),
}

Step 3: Instantiate the DAG

Create an instance of the DAG by passing the DAG name and default arguments.

dag = DAG(
    'intellipaat_welcome',
    default_args=default_args,
    schedule_interval=None,  # Set the schedule interval as needed
    catchup=False,
    tags=['intellipaat']
)

Step 4: Define the Python Function

Create a Python function that sends the desired message to the log.

def send_welcome_message():
    message = "Welcome to Intellipaat"
    print(message)

Step 5: Create an Operator

Instantiate a PythonOperator and provide the task details along with the Python function defined earlier.

send_welcome_task = PythonOperator(
    task_id='send_welcome_task',
    python_callable=send_welcome_message,
    dag=dag
)

Step 6: Define Task Dependencies

Define the task dependencies by using the bitshift operator.

send_welcome_task

Step 7: Complete the Workflow

Connect the tasks in the DAG by specifying their order of execution using the set_downstream() method.

send_welcome_task

Step 8: Save the DAG Configuration

Save the DAG configuration to a Python file for Airflow to read and execute.

dag

Step 9: Run the Workflow

After configuring the DAG, you can run the workflow by initiating the Airflow scheduler.

airflow scheduler

How to Load an Airflow DAG

When it comes to loading your custom DAG (Directed Acyclic Graph) files into the Airflow chart, you have three options. These methods are fully compatible, allowing you to apply multiple approaches simultaneously.

Option 1: Local Loading from the Files Folder

This is the optimal solution if your intention is to deploy the chart from your local file system. For local loading, follow these steps:

  • Copy your DAG files into the designated files/dags directory.
  • This action triggers the creation of a configuration map that includes these DAG files.
  • The configuration map is then prepared across all Airflow nodes. This ensures widespread access.

Option 2: Utilizing an Existing Config Map

If you wish to use an established configuration map, follow the below steps:

  • Manually create a configuration map that consists of all of your DAG files.
  • During the Airflow chart deployment, provide the name of the configuration map you created.
  • It can be done by using the option ‘airflow.dagsConfigMap’ in your deployment process.

Option 3: Integrating DAG Files from a GitHub Repository

To simplify the process of integrating DAG files from a GitHub repository, follow these instructions:

  • Store your DAG files within a GitHub repository.
  • Now integrate your files into Airflow pods through an initContainer.
  • Periodically update the repository using a sidecar container.

Deploy your Airflow using the following options. It will ensure that you replace placeholders accordingly:

  • Enable DAG file cloning
airflow.cloneDagFilesFromGit.enabled=true
  • Specify the repository URL
airflow.cloneDagFilesFromGit.repository=https://github.com/USERNAME/REPOSITORY
  • Designate the repository branch
airflow.cloneDagFilesFromGit.branch=master

Common Airflow DAG Mistakes

We have enlisted some common mistakes developers frequently make when working with Airflow DAGs. Have a look at them!

Circular Dependencies: A common error is creating circular dependencies among tasks within a DAG. This happens when tasks reference each other in a way that they form a loop. Due to this, Airflow struggles to determine task execution order. This leads to unexpected outcomes and causes tasks to run indefinitely.

Misaligned Timezones: Another frequent issue is not configuring task execution timezones. If tasks within a DAG are set to different time zones, it can result in task execution at unintended times. This leads to confusion and inaccurate scheduling.

Uncaught Exceptions: Failing to handle exceptions properly can disrupt the entire DAG execution. If an uncaught exception occurs during task execution, the DAG might fail, and subsequent tasks won’t execute. Proper exception handling and logging are crucial to maintaining DAG reliability.

Large DAGs and Task Dependencies: Creating overly complex and large DAGs with numerous task dependencies can hinder Airflow’s performance. Such DAGs become harder to manage, debug, and maintain. Breaking down large DAGs into smaller ones with clear dependencies can mitigate this issue.

Ignoring Airflow Best Practices: Neglecting established Airflow best practices can lead to inefficiencies and errors. We will discuss the best practices for Airflow in detail now.

Check out the top Apache Spark Interview Questions to crack your next interview!

Best Practices for Writing Airflow

Incorporate the below listed best practices into your Airflow workflows. It will enhance your airflow’s readability, maintainability, and overall effectiveness.

Modularizing for Reusability:  Structure your workflows into modular tasks to increase their reusability and maintainability. Break complex processes down into smaller, focused steps. Make sure that each module represents one of several processes involved. This approach creates clarity and prevents disruptions.

Meaningful Task Names and Documentation: Use meaningful names for every task in your DAG. Make notes or documentation to assist other developers in understanding its purpose. Ensure that collaborators can easily understand its logic and objectives. T

Parameterization and Templating: Utilize parameterization and templating to develop flexible workflows that adapt easily to changing environments. Airflow’s built-in features, such as Jinja templating, enable you to incorporate variables within tasks. 

Sensible Error Handling and Logging: Implement error handling mechanisms into your workflows to manage failures, retries, exceptions, and any exceptions for each task. Logging provides crucial visibility into how the workflow runs. 

Testing and Validation: Before releasing workflows to production, conduct comprehensive testing and validation. It can be done with the help of Airflow tools such as the airflow test command for individual task testing. 

Data Science SSBM

Benefits of Airflow DAGs

Airflow DAGs offer numerous advantages that simplify workflows and expand data processing pipelines. Below are five such advantages.

Structured Workflow Visualization: Airflow DAGs offer a clear and structured visual representation of complex workflows. Teams can use this visual map to understand the order of tasks, their dependencies, and the overall flow. 

Dependency Management: With Airflow DAGs, task dependencies can easily be defined. This ensures that tasks run in their proper order, eliminating data inconsistencies and providing accurate results.

Dynamic Scheduling: DAGs allow dynamic scheduling of tasks based on various parameters. These parameters are time, data availability, or external triggers. This enables DAGs to maximize resource utilization. As tasks can be executed precisely when they’re required, reducing idle times and increasing efficiency.

Retry and Error Handling: Failure is an integral part of data workflows, so Airflow DAGs come equipped with error handling and retry mechanisms. They automatically retry failed tasks. This reduces manual intervention while upholding workflow integrity.

Pre- built Operators: Airflow offers users a vast library of pre-built operators. It offers the flexibility to build custom operators to customize workflows specifically for their business needs. Through its extensibility, users can incorporate specialized tasks and integrate with various tools and platforms.

Simplify Complex Workflows: Airflow DAGs simplify complex workflows. It provides increased productivity and adaptability in managing data pipelines. 

Conclusion

To conclude, Airflow DAGs (Directed Acyclic Graphs) represent an effective solution. Apache Airflow DAGs have become very important for data engineers and developers. They help automate data pipelines and handle tricky workflows. With Airflow, these professionals can make DAGs that describe tasks, say when they should happen, and watch how they’re doing.

FAQs

Why use Airflow DAGs?

Airflow DAGs offer an efficient overview of complex workflows. It facilitates collaboration among teams. Their use ensures proper task execution sequences and dynamic scheduling. It also possesses error-handling capabilities.

How do I create dependencies between tasks in an Airflow DAG?

Dependencies can be established using operators and bitshift operations. Tasks scheduled to follow another one can be marked ‘set_downstream’.

Can tasks in an Airflow DAG run in parallel?

Yes, by setting appropriate dependencies, the tasks in an Airflow DAG can run concurrently. Tasks without direct dependencies may even execute simultaneously.

What is Dynamic Scheduling in Airflow DAGs?

Dynamic scheduling allows tasks to be executed according to certain factors. Factors like time, data availability, and external triggers for optimal resource usage.

How Does Error Handling in Airflow DAGs Work?

Airflow DAGs feature built-in error handling and retry mechanisms. Failed tasks can be automatically retried. Moreover, notifications can be set up for critical failures.

Course Schedule

Name Date Details
Data Scientist Course 18 May 2024(Sat-Sun) Weekend Batch
View Details
Data Scientist Course 25 May 2024(Sat-Sun) Weekend Batch
View Details
Data Scientist Course 01 Jun 2024(Sat-Sun) Weekend Batch
View Details

About the Author

Principal Data Scientist

Meet Akash, a Principal Data Scientist who worked as a Supply Chain professional with expertise in demand planning, inventory management, and network optimization. With a master’s degree from IIT Kanpur, his areas of interest include machine learning and operations research.