Apache Airflow
Author, schedule, and manage workflows.
Created at Airbnb as an open-source project in 2014, Airflow was brought into the Apache Software Foundation’s Incubator Program in 2016 and announced as a Top-Level Apache Project in 2019. Now, it’s widely recognized as the industry’s leading workflow management solution.
Airflow has many data integrations with popular databases, applications, and tools, as well as dozens of cloud services — with more added each month. The power of a large and engaged open community ensures that Airflow offers comprehensive coverage of new data sources and other providers, and remains up to date with existing ones.
Airflow Architecture
The main components of the architecture are:
- Web Server: This is Airflow’s UI. It gives an overview of the overall health of the different Directed Acyclic Graphs (DAGs) and helps visualize the components and state of each DAG. The Web Server also provides the ability to manage users, roles, and different configurations for the Airflow setup.
- Scheduler: This is the most important part of Airflow. It orchestrates the DAGs and their tasks, takes care of their interdependencies, limits the number of runs of each DAG so that one DAG doesn’t overwhelm the entire system, and makes it easy for users to schedule and run DAGs on Airflow.
- Executor: While the Scheduler orchestrates the tasks, the executors are the components that actually execute tasks. There are various types of executors that come with Airflow, such as SequentialExecutor, LocalExecutor, CeleryExecutor, and the KubernetesExecutor. People usually select the executor that suits their use case best. We will cover the details later in this blog.
- Metadata Database: Airflow supports a variety of databases for its metadata store. This database stores metadata about DAGs, their runs, and other Airflow configurations like users, roles, and connections. The Web Server shows the DAGs’ states and their runs from the database. The Scheduler also updates this information in this metadata database.
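To make the division of labor between these components concrete, here is a toy model in plain Python. It is not Airflow’s actual implementation: the three-task DAG (`extract`, `transform`, `load`) is hypothetical, a dictionary stands in for the metadata database, the `execute` function stands in for an executor, and the `schedule` loop, built on the standard library’s `graphlib.TopologicalSorter`, plays the Scheduler’s role of releasing a task only after its upstream tasks have succeeded.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# Stand-in for the metadata database: task id -> current state.
metadata_db = {task_id: "queued" for task_id in dag}


def execute(task_id):
    """Stand-in for an executor: 'run' the task and report success."""
    # A real executor would start a local process, a Celery job, or a pod.
    return True


def schedule(dag, db):
    """Stand-in for the scheduler: release tasks whose upstreams are done."""
    order = []
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    while sorter.is_active():
        for task_id in sorter.get_ready():
            db[task_id] = "running"
            ok = execute(task_id)
            db[task_id] = "success" if ok else "failed"
            order.append(task_id)
            if not ok:
                return order  # downstream tasks are never released
            sorter.done(task_id)
    return order


run_order = schedule(dag, metadata_db)
print(run_order)    # ['extract', 'transform', 'load']
print(metadata_db)  # every task ends in the 'success' state
```

In this sketch the state dictionary is the only thing the "scheduler" writes to, which mirrors how the Web Server can display DAG state purely by reading the metadata database.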
Advantages of Airflow
- Airflow is completely open source and can be deployed on VMs, Kubernetes, and Docker.
- The Airflow UI shows previous DAG runs in a graphical representation. Users can also check the logs for failed jobs.
- Users can enable Airflow to sync DAGs from GitHub automatically.
- Airflow provides a nice UI through which a user can manage jobs, secrets, connections, etc.
Inconvenient Truths About Apache Airflow
- Airflow onboarding is not intuitive: Among users’ top grievances were “a lack of best practices on developing DAGs” and “no easy option to launch.” This latter issue has been partially addressed in Airflow Version 2.0 (which was released after the survey), but the default quick-start setup runs on an SQLite database where no parallelization is possible and everything happens sequentially. As Airflow’s Quick Start guide points out, “this is very limiting” and “you should outgrow very quickly.”
- The Airflow Scheduler interval is not intuitive: Airflow’s primary use case is scheduling periodic batches, not frequent runs, as even its own documentation attests: “Workflows are expected to be mostly static or slowly changing.” This means there are few capabilities for those who need to sample or push data on an ad hoc and ongoing basis, which makes it less than ideal for some ETL and data science use cases.
- No versioning in Airflow Scheduler: You’ll find many traditional software development and DevOps practices missing from Airflow, and a big one of those is the ability to maintain versions of your pipelines. There’s no easy way to document all that you’ve built and, if needed, revert to a prior version. If, for example, you delete a Task from your DAG and redeploy it, you’ll lose the associated metadata on the Task Instance. This makes Airflow somewhat fragile, and unless you’ve written a script to capture this yourself, it makes debugging issues much more difficult. It also isn’t possible to backtest candidate fixes against historical data to validate them.
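As a rough illustration of the kind of "capture it yourself" script mentioned in the versioning point, the sketch below fingerprints a DAG file’s source text and appends a snapshot whenever it changes. Everything here is an assumption made for illustration: the `fingerprint` and `record_version` helpers, the in-memory `history` list, and the `example_dag` ID are hypothetical and not part of Airflow. A real setup would more likely rely on Git history or a database table.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(dag_source):
    """Return a short, stable hash of a DAG file's source text."""
    return hashlib.sha256(dag_source.encode("utf-8")).hexdigest()[:12]


def record_version(history, dag_id, dag_source):
    """Append a snapshot if the source changed; return True if one was added."""
    digest = fingerprint(dag_source)
    # Find the most recent snapshot for this DAG, if any.
    latest = next((h for h in reversed(history) if h["dag_id"] == dag_id), None)
    if latest is not None and latest["digest"] == digest:
        return False  # nothing changed since the last snapshot
    history.append({
        "dag_id": dag_id,
        "digest": digest,
        "source": dag_source,  # kept so an earlier version can be restored
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })
    return True


history = []  # hypothetical in-memory store
v1 = "extract >> transform >> load"
v2 = "extract >> load"  # the 'transform' task was deleted

record_version(history, "example_dag", v1)  # first deployment is recorded
record_version(history, "example_dag", v1)  # unchanged redeploy is skipped
record_version(history, "example_dag", v2)  # the edited DAG is recorded
print(len(history))  # 2
```

Because each snapshot keeps the full source, the earlier definition that still contained the deleted task can be looked up again, which is the piece Airflow does not retain on its own according to the point above.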
Check out the content, and feel free to let me know which topics you would like insights on …