Apache Airflow Review: the good, the bad

When getting started with Apache Airflow, data engineers often have questions like the two below:

“What are people’s opinions of Airflow?”

“Are there any gotchas I need to be aware of in Airflow?”

In this post, we look at why Airflow is popular (the good) and some of its pitfalls (the bad). This article assumes you have a basic knowledge of what Apache Airflow is and of the concepts of DAGs and tasks.

The Good

Good news first

1. Dependency management

Dependency management represents the flow of data through the data pipeline. In a simple case it can be visualized as Task A -> Task B -> Task C, but real use cases are often much more complex (e.g. a fan-out scenario where Task A is followed by Task B, Task C and Task D running in parallel, which are then joined by Task E).

In pre-Airflow days, complex dependencies were managed by writing the entire data pipeline in one script or by maintaining a database that tracked schedule times, run counts, etc. These solutions had to be developed per use case and were time consuming. One of the major goals of Airflow was to handle dependency management, and it does this extremely well: it allows for standard data flows, complex branching using BranchPythonOperator, task retries, catchup runs, etc. A minimal sketch of the fan-out scenario is shown after the figure below.

Dependency management (ref: https://airflow.apache.org/ )
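To give a concrete feel for how dependencies are declared, here is a minimal sketch of the fan-out scenario above, using DummyOperator placeholders (the DAG and task names are made up for illustration):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('fan-out-example',
          default_args={'owner': 'airflow', 'start_date': datetime(2020, 2, 28)},
          schedule_interval='0 0 * * *')

# Placeholder tasks; in a real pipeline these would be PythonOperator,
# MySqlOperator, etc.
task_a = DummyOperator(task_id='task_a', dag=dag)
task_b = DummyOperator(task_id='task_b', dag=dag)
task_c = DummyOperator(task_id='task_c', dag=dag)
task_d = DummyOperator(task_id='task_d', dag=dag)
task_e = DummyOperator(task_id='task_e', dag=dag)

# Task A fans out to B, C and D, which all feed into Task E.
task_a >> [task_b, task_c, task_d]
[task_b, task_c, task_d] >> task_e

The >> operator is all it takes to wire tasks together; Airflow works out the execution order and parallelism from these declarations.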

2. Templating and Macros

The whole Airflow model is designed to operate on a time-based schedule, usually rerunning the same scripts over different time ranges. As such, there is an option to use templated scripts that are filled in with the appropriate time values at DAG run time. For example:

from datetime import datetime

from airflow import DAG
from airflow.operators.mysql_operator import MySqlOperator

default_args = {'owner': 'airflow', 'start_date': datetime(2020, 2, 28)}

dag = DAG('simple-mysql-dag',
          default_args=default_args,
          schedule_interval='0 0 * * *')

# Runs the templated SQL file against the connection defined as mysql_default.
mysql_task = MySqlOperator(dag=dag,
                           mysql_conn_id='mysql_default',
                           task_id='mysql_task',
                           sql='<path>/sample_sql.sql',
                           params={'test_user_id': -99})

mysql_task  # single task, so there are no dependencies to declare

and the SQL file at sample_sql.sql as:

USE your_database;

DROP TABLE IF EXISTS event_stats_staging;
CREATE TABLE event_stats_staging
AS SELECT date
        , user_id
        , sum(spend_amt) total_spend_amt
     FROM event
    WHERE date = '{{ ds }}'
      AND user_id <> {{ params.test_user_id }}
    GROUP BY date, user_id;

   INSERT INTO event_stats (
          date
        , user_id
        , total_spend_amt
   )
   SELECT date
        , user_id
        , total_spend_amt
     FROM event_stats_staging;

DROP TABLE event_stats_staging;

In the above, {{ ds }} will be replaced with the execution date (formatted as YYYY-MM-DD) when the DAG is run. In addition to the built-in template variables and macros, you can pass in your own parameters at run time through the params field of the task operator and reference them as {{ params.<name> }}.

3. Operators

Apache Airflow has a lot of operators that you can use to build your pipeline. If you are getting started with Airflow for your project, search for an existing operator for your use case before writing your own implementation. If you have a unique use case, you can write your own operator by inheriting from BaseOperator, or from the closest existing operator if all you need is a small change to its behavior. A minimal sketch is shown below.
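If you do end up writing one, the shape looks roughly like this (GreetOperator and its greeting logic are hypothetical, purely for illustration):

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class GreetOperator(BaseOperator):
    """A hypothetical operator that logs a templated greeting."""

    # Fields listed here are rendered with Jinja before execute() runs.
    template_fields = ('name',)

    @apply_defaults
    def __init__(self, name, *args, **kwargs):
        super(GreetOperator, self).__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        # The context dict carries run-time values such as the execution date.
        self.log.info('Hello %s, running for %s', self.name, context['ds'])

The execute method is the only thing Airflow requires you to implement; templated fields and custom constructor arguments are optional extras.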

4. UI and logs

Apache Airflow has a great UI, where you can see the status of your DAGs, check run times, check logs, re-run tasks and much more. By default these are open to everyone who has access to the server, but you can set up additional authentication if required. The logs can be synced to external storage, providing an easy way to check for errors, etc.

Manage Task (ref: https://airflow.apache.org )

5. Based on functional principles

Apache Airflow was designed based on functional principles. The key ideas are data immutability and idempotence.

Data immutability in this context means storing the raw data, processing it, and storing the processed data separately. This gives us the option to rerun the processing in case of errors. An idempotent process is one that, given the same input, always produces the same output. This makes the process easy to test and re-run.

Following these principles, data flows through the pipeline, being transformed at each task and producing an output that is used by the next task. At the end of the pipeline you will have both the raw data and the transformed data. These concepts help us manage, troubleshoot and backfill data. A backfill is a rerun of your data pipeline for past dates, required due to some change in business logic, a code error, etc.
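As an illustration of idempotence, here is a minimal sketch of a task callable that deletes and re-inserts the rows for its execution date, so rerunning it for the same date always leaves the table in the same state. It reuses the mysql_default connection and the event / event_stats tables from the templating example above; the task itself is made up for illustration.

from airflow.hooks.mysql_hook import MySqlHook


def load_daily_stats(ds, **kwargs):
    # ds is the execution date (YYYY-MM-DD) that Airflow passes to the task.
    hook = MySqlHook(mysql_conn_id='mysql_default')

    # 1. Remove any rows previously written for this execution date.
    hook.run("DELETE FROM event_stats WHERE date = %s", parameters=(ds,))

    # 2. Recompute and insert the rows for this execution date only.
    hook.run(
        """
        INSERT INTO event_stats (date, user_id, total_spend_amt)
        SELECT date, user_id, SUM(spend_amt)
          FROM event
         WHERE date = %s
         GROUP BY date, user_id
        """,
        parameters=(ds,),
    )

Wired up with a PythonOperator (python_callable=load_daily_stats, with provide_context=True on Airflow 1.10), a failed run can be retried or backfilled without leaving duplicate rows behind.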

6. Open source

Apache Airflow is a very active open source project with a huge community. This means that for most of the issues you face you can find an answer online, you can read the source code to understand how it works under the hood, and you can contribute changes back to Airflow.

The Bad

1. Data processing model is not intuitive to new engineers

New engineers have a hard time internalizing the way Airflow is meant to be used. The most common complaint is writing test cases: writing tests is extremely difficult for data pipelines handling raw data, although there have been some improvements in this area, e.g. Great Expectations. The other complaints arise from writing non-idempotent tasks, which, when they fail and are retried, lead to duplicate data, messy data or outright failures. Read this article to get a good understanding of the recommended data processing model. A lightweight first step for testing is sketched below.
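Independent of the data itself, one test that is easy to add is a DAG integrity test that fails the build whenever a DAG file no longer imports cleanly. A minimal pytest-style sketch, assuming the tests run in an environment where Airflow and your DAG folder are available:

from airflow.models import DagBag


def test_dags_load_without_errors():
    # Parse every DAG file in the configured DAGs folder, skipping the bundled examples.
    dag_bag = DagBag(include_examples=False)

    # Any file that fails to import or parse ends up in import_errors.
    assert dag_bag.import_errors == {}
    assert len(dag_bag.dags) > 0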

2. Changing schedule interval requires renaming the DAG

If you decide to change the schedule interval, you have to rename your DAG, because the previous task instances will not align with the new interval.

3. CI/CD is tricky

If you are deploying Airflow on Docker, a CI/CD process that restarts the Docker container will kill any running tasks. You will have to time your deploys to coincide with a period of no activity, or with DAG run(s) that you are ok with being restarted or partially rerun. This approach is fine if you have a few DAGs, but if the number of DAGs is high it is advisable to use something like git-sync or an S3 sync, where your DAG files live in external storage and your deploy simply syncs them into your Docker container.

4. No native Windows support

It is not straightforward to run Airflow natively on Windows, but this can be mitigated by using Docker.

Conclusion

In spite of all these cons, Apache Airflow still provides an excellent batch data processing framework for managing ETL. A lot of companies use Airflow for the benefits it provides. I hope this post gives you an idea of the pros and cons of Apache Airflow. If you have any questions, thoughts or comments, please leave a comment below.