CVE-2023–24329 Bypassing URL Blackslisting using Blank in Python urllib library

6 min readAug 23, 2023

Summary

in-depth analysis of CVE-2023–24329

Description

CVE: CVE-2023–24329

Description: An issue in the urllib.parse component of Python before 3.11.4 allows attackers to bypass blocklisting methods by supplying a URL that starts with blank characters.

Vendor: Python Software Foundation

Version: => 3.11.3

Background Story

Python is the most popular programming language in the world. The exploit of CVE-2023–24329 is pretty easy, but its impact is

significant because the urllib.parse library is used in many web applications to prevent blacklisting attacks, such as

injection and file upload. When this library is broken, attackers can manipulate inputs and bypass filters

the bug affects any version before 3.11.4 , I'm using python3 3.11.3 version which is the last version before the patch

Reproducing the Vulnerability

urllib.parse breaks the URL (A Uniform Resource Locator) strings up in components such as the schema (file, FTP, HTTP, HTTP)

that’s used in the URL so when parsing a URL with URL as the example.com

[1] in normal case urlparse function detects the schema of the provided URL and this is used to prevent LFI (file)

[2] if the user adds space to the function will focus to detect the scheme which considers the URL schema with space as below

Demo of the Exploitation scenario

I made a simple web app to Demonstrate the exploit of this CVE in case the library is used to prevent LFI (Local File

Inclusion) using the Flask library which easily can be installed by pip3 install flask with urllib with two files app.py and

index.html

app.py

from flask import Flask, render_template, request
import urllib.request
import urllib.error
app = Flask(__name__)
def safeURLOpener(inputLink):
    block_schemes = ["file", "gopher", "expect", "php", "dict", "ftp", "glob", "data"]
    block_host = ["instagram.com", "youtube.com", "tiktok.com"]
    input_scheme = urllib.parse.urlparse(inputLink).scheme
    input_hostname = urllib.parse.urlparse(inputLink).hostname
    if input_scheme in block_schemes:
        return "Input scheme is forbidden"
    if input_hostname in block_host:
        return "Input hostname is forbidden"
    try:
        target = urllib.request.urlopen(inputLink)
        content = target.read().decode('utf-8')
        return content
    except urllib.error.URLError as e:
        return "Error opening URL: " + str(e)
@app.route('/', methods=['GET', 'POST'])
def index():
    content = ""
    error = None
    if request.method == 'POST':
        domain = request.form.get('domain')
        if domain:
            content = safeURLOpener(domain)
    return render_template('index.html', content=content, error=error)
if name == '__main__':
    app.run(debug=True)

Index.html

<!DOCTYPE html>
<html>
<head>
    <title>Domain Content Viewer</title>
</head>
<body>
    <h1>Domain Content Viewer</h1>
    <form method="post">
        <label for="domain">Enter a domain:</label>
        <input type="text" name="domain" id="domain" value="{{ request.form['domain'] }}">
        <button type="submit">Submit</button>
    </form>
    {% if content %}
    <h2>Content:</h2>
    <pre>{{ content }}</pre>
    {% endif %}
    {% if error %}
    <h2>Error:</h2>
    <p>{{ error }}</p>
    {% endif %}
</body>
</html>

run the app using python3

which running by default to 127.0.0.1:5000

in case submitting a common payload of LFI file:c:\WINDOWS\win.ini Windows-based

the application returned with Input schema is forbidden because file schema blocked by the blocklist

After adding a space in the URL, the application blocklisting failed to detect the URL schema and causes bypassing of the

blocklist

Setting the Debugging Environment

the vulnerable library: https://github.com/python/cpython/tree/3.11/Lib/urllib

from urllib.parse import urlparse  
  url_to_parse = " https://www.vicarius.io/"  
  
output = urlparse(url_to_parse)  
  
print(output)

then add the breakpoint in the line that uses the function to trigger the vulnerable function

start the debugger by clicking on the debugger button in the tools bar

After running the debugger and stepping into it to let the IDE go to the source code of the installed Python version in the

machine, the default in Windows is C:\Users\yosef\AppData\Local\Programs\Python\Python39\Lib\urllib.

It appears that it calls the urlparse function, which exists in the parse.py file.

urlparse

urlparse function first breaks down the provided argument into two pieces

the scheme and

URl set it to an empty string (scheme='') as the default and the given

URL. The function then proceeds to parse the argument

by passing it to the _coerce_args function

_coerce_args

_coerce_args function make sure the given argument if it was a string

return noop and if not raise an error Cannot nix str and non-str arguments

then urlparse call urlspilt to spilt the given url into scheme, netloc

(network location), url, query, and fragment.

urlsplilt

urlsplilt have all the important works, which use

UNSAFE_URL_BYTES_TO_REMOVE variable as shown below to remove '\t', '\r', '\n' prevent injection

Root Cause

for c in url[:i] which checks for the characters in the URL first part before the colon (file schema) using url.find(':') for

all characters if c among the valid characters that exist in scheme_chars variable as shown below:

if c characters existed in scheme_chars the code will proceed and save it as the scheme component and when and the root cause of this CVE when the code fails to get the the schema the code processing without getting the schema as shown in the stack tracing below:

and then checks for the netloc which starts in line number 380 in the code checking the part before // and checks if []

not exist in IPv6 URL format rise an error says Invalid IPv6 URL then it checks for the fragments (sections indicated by #)

and then the query and pass the netloc to _checknetloc() function

and for _checknetloc(netloc)

The _checknetloc() function handles the network location by first confirming whether it consists of ASCII characters or not. It then proceeds to replace symbols like '@', ':', '#', and '?' and normalizes the resulting string. Afterward, the function checks if these characters still exist; if they do, an error is raised

and then it saves the into variable parseResult which appears below picture of the code processed without getting the

schema and netloc components and save the URL into path

Patch diffing

link: https://github.com/python/cpython/pull/99421/commits/a284d69de1d1a42714576d4a9562145a94e62127

the modification of the library was by adding

test_attributes_bad_scheme function which checks by looping a range of possible invalid

scenarios schemes, including ".", "+", "-", "0", "http&" and non-ascii to prevent the bypassing using a blank as the previous

exploitation

Mitigation

Upgrade Python to the lasted version 3.11.4

Final Thoughts

after diving deep into the source code of the Python library urllib and debugging the library we take a look can simple bugs

do a significant impact as we see simple plank can bypass any blacklisting

Reference:

urllib.parse space handling CVE-2023-24329 appears unfixed · Issue #102153 · python/cpython

Everyone (including the submitter of the now public exploit who submitted the issue half a year ago to…

github.com

Join vsociety: https://vsociety.io/

Checkout our discord: https://discord.gg/sHJtMteYHQ