A type confusion bug in nft_set_elem_init (leading to a buffer overflow)

28 min read · Apr 29, 2024

Screenshots from the blog posts

Summary

An issue was discovered in the Linux kernel. A type confusion bug in nft_set_elem_init (leading to a buffer overflow) could be used by a local attacker to escalate privileges. This is a different vulnerability than CVE-2022-32250.

Description

An issue was discovered in the Linux kernel. A type confusion bug in nft_set_elem_init (leading to a buffer overflow) could be used by a local attacker to escalate privileges. This is a different vulnerability than CVE-2022-32250. (The attacker can obtain root access, but must start with an unprivileged user namespace to obtain CAP_NET_ADMIN access.) This can be fixed in nft_setelem_parse_data in net/netfilter/nf_tables_api.c.

What is NF_Tables?

NF_Tables is a packet-filtering framework in the Linux kernel that provides an efficient and flexible way to classify and manipulate network packets. It is designed to replace the older iptables and ip6tables tools for firewall and packet filtering tasks, offering improved performance, syntax, and capabilities.

NF_Tables allows you to define rulesets to control the flow of network packets through your system. It uses a rule-based syntax to match packets based on various criteria and then applies actions to those packets, such as dropping, accepting, or modifying them. The rules are organized into tables, chains, and rulesets, providing a hierarchical structure for packet filtering.

The central object in nftables is the table. In the context of netfilter and nftables, a table is a container for organizing and storing rules. Consider the command below:

nft> add table ip my-table

This creates a new table named my-table in the ip family, which handles IPv4 packets.

+------------------------+
|        my-table        |
|      (IP Filtering)    |
+------------------------+

A table can contain different objects, such as sets, which are used to store data. For example:

nft> add set ip my-table my-set { type ipv4_addr; }

This command creates a new set named my-set associated with the table my-table, and it's configured to store IPv4 addresses.

+------------------------+
|        my-table        |
|      (IP Filtering)    |
|         +--------+     |
|         | my-set |     |
|         +--------+     |
|       (IPv4 Addresses) |
+------------------------+

Finally, chains of rules are created; each rule specifies an action to apply to received packets.

+------------------------+
|        my-table        |
|      (IP Filtering)    |
|         +--------+     |
|         | my-set |     |
|         +--------+     |
|       (IPv4 Addresses) |
|         +--------+     |
|         | Chain  |-----|--> Rule 1
|         +--------+     |
|         | Chain  |-----|--> Rule 2
|         +--------+     |
|         | Chain  |-----|--> Rule 3
+------------------------+

Let’s understand this with an example:

Consider a scenario where you want to control access to a web server. You could use netfilter to create a table named “web-filter” with a set named “allowed-ips” containing IP addresses allowed to access the server. You might create a chain of rules within this table to permit or deny access based on the source IP addresses. For example:

add table ip web-filter
add set ip web-filter allowed-ips { type ipv4_addr; }
add chain ip web-filter input { type filter hook input priority 0; }
add rule ip web-filter input ip saddr @allowed-ips accept
add rule ip web-filter input drop

Build the Lab

As this is a kernel module vulnerability, it’s tricky to debug, so you need a little more patience than usual 🦐

VirtualBox
I used 2 Linux Virtual Machines.

We have to debug a kernel module, and the kernel is not a user-space process, so GDB alone cannot be used to debug it; we need a client/server architecture. The kernel can be debugged remotely using a GDB server on the target machine and gdb on the host (development) machine. The Linux kernel has a GDB server implementation called KGDB. It communicates with a GDB client over a network or serial port connection.

Host/Development Machine: Runs gdb against the vmlinux file which contains the symbols and performs debugging
Target Machine: Runs kgdb and is the machine to be debugged

    ------------------                              --------------------
    |       Host      |                             |       Target     |
    |                 |                             |                  |
    |  -------------  |                             |   ------------   |
    | |     gdb     | |<--------------------------->|  |    kgdb    |  |
    | |             | |             Serial or       |  |            |  |
    | --------------  |             Ethernet        |  -------------   |
    |       |         |             Connection      |        |         |
    |  -------------- |                             |  --------------  |
    | | Kernel image ||                             |  |Linux Kernel | |
    | | with debug   ||                             |  |(zImage)     | |
    | | symbols      ||                             |  |             | |
    | | (vmlinux)    ||                             |  --------------- |
    | ----------------|                             |                  |
    -------------------                             --------------------
Hence, two machines are required to use kgdb. Some notes on KGDB:

KGDB is a GDB server implementation integrated into the Linux kernel. It supports serial port communication (available in the mainline kernel) and network communication (patch required).

It has been available in the mainline Linux kernel since version 2.6.26 (x86 and sparc) and 2.6.27 (arm, MIPS, and PPC).

It enables full control over kernel execution on the target, including memory reads and writes, step-by-step execution, and even breakpoints in interrupt handlers.

There may be other ways to do this, but this is the setup I generally use.

I am using Ubuntu AMD64-22.04 LTS iso: https://releases.ubuntu.com/

Connect and Create a Serial Port in VirtualBox

Assumption for this step:

You have the ISO image downloaded locally and have already created 2 VMs with it.

For the demonstration I created 2 machines named target-server and dev machine.

Creating a serial port in VirtualBox and connecting the machines is easy:

Select your target machine in VirtualBox and go to the Settings option.
Once you are in the Settings tab of the target server, select Serial Ports and enter the configuration below.

Don’t check Connect to existing pipe/socket, as we don't have an existing one.

Once the serial port is configured, follow steps 1 and 2 for the dev machine as well, but on the dev machine you need to check Connect to existing pipe/socket and make sure you specify the same Path/Address.

Once that’s done, congratulations: the lab is ready.

WARNING

! Do not start the dev machine first, otherwise you will see a serial port error. The two machines are connected through the serial port, and the dev machine expects to connect to a pipe/socket that already exists, which is only the case once the target machine is running.

DEBUGGING KERNEL — nf_tables

As already discussed, the vulnerability lies in nf_tables, which is a kernel module. To debug a kernel module we need to follow some steps, so let's do the initial setup first:

1. Verify that the machines (dev and target) are communicating over the serial port by sending a message across it.

I sent a message from the target machine to the dev machine and confirmed that they are communicating with each other over the serial port. The installed system is Ubuntu 22.04; if you downloaded it from the official Ubuntu website, it ships a recent kernel rather than an older one, so we have to downgrade the kernel. Let's do that in the debugging stage.

2. Download an affected version of the kernel. For this step I downloaded v5.12 from the official kernel GitHub repository.

Once you have checked out the affected version of the kernel, you need to build and install this image and add it to GRUB. Before we do that, we need the following packages to be available:

1. build-essential
2. flex
3. bison
4. libnftnl-dev
5. libmnl-dev
6. nftables - (Installed by default but just in case missing)
7. libncurses-dev
8. dkms
9. libssl-dev
10. libelf-dev

Once the packages are installed, let’s enable KGDB in the kernel config. Move into the git repo where the kernel source was downloaded and run the make nconfig command.

This command opens the configuration in a text-based menu, where you can verify that the KGDB option is enabled.

Select Generic Kernel Debugging Instruments.
Verify that the KGDB and Magic SysRq options are selected.
Once these settings are verified, we need to check one more variable, DEBUG_INFO, which should be y as well. To look for the variable, press F8 and search for it.

With the libraries installed and the debug options verified, and since the flaw is in nf_tables, we need to make sure that this module is also enabled. Let's verify that too.

To do that, go to Networking Support > Networking Option > Network Packet Filtering Framework > Core Netfilter Configuration.

To be on the safe side (building and installing the kernel and its modules takes a lot of time), I enabled all netfilter modules for nf_tables so that this step does not have to be repeated because something was missed.

Press F6 to save the changes, then run make -j8 to build the Linux kernel with 8 parallel jobs.

Go out and grab a coffee, as it is going to take long. Believe me, very long.

Verify that the vmlinux file exists in the source directory.

After make -j8 succeeds, run the make modules_install command and wait for it to complete.

Once that’s completed, run make install, which installs the v5.12 kernel into /boot. Then run update-grub and reboot to restart the machine. During the restart, the boot menu will display the option to select the kernel version; select v5.12 and boot it.

Verify the kernel version by running uname -r.

We have now downgraded to an affected version of the kernel.

Next, we want to enable the GDB scripts for the target kernel. The GDB scripts are a collection of helper scripts that simplify kernel debugging.

To do that we have to perform 2 steps. For the target machine's kernel, CONFIG_GDB_SCRIPTS must be enabled; it was already enabled in our build.

On the dev machine we have to create a ~/.gdbinit file and write add-auto-load-safe-path <location-bin-file> in it.

To start debugging the target machine we also have to copy the compiled kernel source folder, built with debug info, to the dev server. To make copying easy I installed OpenSSH on the target server and used the scp command on the dev server to copy the compiled linux folder from the target to the dev machine.

On the target machine we made a tar.gz file, and on the dev server we used the scp command to copy linux.tar.gz from target to dev.

Make the tar.gz file with tar: tar -czvf linux.tar.gz linux

On the dev server I copied the folder to /home/target/Desktop/linux:

scp <username>@<ip-address>:<file-to-copy> <dev-server-location-to-paste>

Then I extracted the archive on the dev server with the tar command: tar -xzvf linux.tar.gz

Open the copied vmlinux with gdb, and make sure to run it as the root user on the dev machine.

Next, to debug the kernel we have to give kgdboc the serial port and baud rate so that we can debug the kernel from the dev machine.

Run the SysRq magic sequence on the target server:

echo g > /proc/sysrq-trigger

and on the dev server run target remote /dev/ttyS0 inside gdb.

We can see the kgdb breakpoint trigger. Let's put breakpoints on our suspected functions.

As we have enabled the GDB scripts, let's load symbols for our beloved affected nf_tables.ko module. Running apropos lx lists the available helper commands; run lx-symbols to load symbols for nf_tables.ko and the other loaded modules into GDB.

Once the module symbols are loaded we can put a breakpoint on the suspected function, in our case nft_set_elem_init in nf_tables_api.c, and start our static analysis.

STATIC ANALYSIS

Looking at the code paths that use the dlen field, my attention was drawn to the memcpy call in nft_set_elem_init in /net/netfilter/nf_tables_api.c.

This call stands out because it uses two distinct objects. The destination buffer is located inside an nft_set_ext object named ext, while the size of the copy comes from a different object, an nft_set. The ext object is dynamically allocated at line 5195 as part of elem, and the space reserved for it is determined by tmpl->len.

Let’s represent the relevant objects and their relationships in a diagram:

+---------------------+
|   nft_set_ext (ext) |
|---------------------|
|    Destination      |
|      Buffer         |
|        +            |
|        |            |
|        v            |
|      elem           |
|        |            |
|        |            |
|        v            |
|---------------------|
|       tmpl->len     |
+---------------------+

+---------------------+
|     nft_set         |
|---------------------|
|       Source        |
|        Size         |
|        +            |
|        |            |
|        v            |
|---------------------|
|       set->dlen     |
+---------------------+

The upper part of the diagram represents the nft_set_ext object (ext), where the destination buffer is stored. The buffer is dynamically allocated together with elem, and the size reserved for it is determined by tmpl->len.
The lower part of the diagram represents the nft_set object, where the size used for the memcpy operation (set->dlen) is stored.
The two objects are connected only through the memcpy operation.
The question is therefore how the size of the destination buffer (tmpl->len) relates to the value stored in set->dlen. If nothing forces the two values to agree, the copy can write past the end of the buffer.
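To make the concern concrete, here is a minimal user-space sketch in C. It is not kernel code: the names model_tmpl, model_set, and overflow_bytes are hypothetical stand-ins, and the model only captures the idea that the destination is sized by one value while the copy length comes from another.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two kernel objects discussed above. */
struct model_tmpl { size_t len;  };  /* bytes reserved for the destination */
struct model_set  { size_t dlen; };  /* bytes the memcpy will copy */

/*
 * Returns how many bytes a copy of set->dlen bytes would write past a
 * destination that only has tmpl->len bytes reserved (0 if it fits).
 */
size_t overflow_bytes(const struct model_tmpl *tmpl,
                      const struct model_set *set)
{
    if (set->dlen <= tmpl->len)
        return 0;
    return set->dlen - tmpl->len;
}
```

If both values always agreed, overflow_bytes would always return 0; the rest of the analysis is about finding a case where they do not.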

Let’s check where nft_set_elem_init is called to dig further.

It is referenced at lines 5560 and 5793.

The function is invoked from another function named nft_add_set_elem, located in /net/netfilter/nf_tables_api.c. The purpose of nft_add_set_elem is to add an element to a netfilter set.

  +-----------------------------+
  |                             |
  |    nft_add_set_elem         |
  |    (/net/netfilter/         |
  |    nf_tables_api.c)         |
  |                             |
  +--------+--------------------+
           |
           | calls
           |
  +--------v--------------------+
  |                             |
  |    nft_set_elem_init        |
  |                             |
  +-----------------------------+

The nft_set structure has a field named dlen, which indicates the length of the data associated with the NFT_SET_EXT_DATA extension. On the nft_set_ext side, there is a descriptor named desc, whose len field determines how much space is reserved for the NFT_SET_EXT_DATA data. Contrary to expectations, set->dlen is not used for the reservation; the length is determined by desc.len instead. The desc structure is initialized in the function nft_setelem_parse_data in /net/netfilter/nf_tables_api.c, which is where the length for NFT_SET_EXT_DATA is set.

+------------------------+        +---------------------+
|                        |        |                     |
|      nft_set           |        |      nft_set_ext    |
|                        |        |                     |
|------------------------|        |---------------------|
|        ...             |        |         ...         |
|------------------------|        |---------------------|
|         dlen           |        |                     |
|                        |        |---------------------|
|                        |        |                     |
|                        |        |        desc         |
|                        |        |---------------------|
|                        |        |        len          |
+------------------------+        +---------------------+
                                  |        ...          |
                                  +---------------------+

The nft_data_init function is responsible for initializing the data and desc structures based on user-provided data. This initialization occurs at (1).

A critical check between desc->len and set->dlen is performed at (2). The check is conditional: it is applied only when the data associated with the added element has a type different from NFT_DATA_VERDICT.

The user controls set->dlen when creating a new set.
The only restrictions are that set->dlen must be at most 64 bytes, and the set's data type must be different from NFT_DATA_VERDICT.
When desc->type is equal to NFT_DATA_VERDICT, desc->len is set to 16 bytes.
If an element of type NFT_DATA_VERDICT is added to a set with data type NFT_DATA_VALUE, the check is skipped and desc->len can differ from set->dlen.
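The conditional check described above can be modelled in a few lines of C. This is a simplified user-space sketch, not the kernel source: the enum, struct, and function names are hypothetical, and strict_check is only one plausible way to tighten the validation, shown for contrast.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative types; the names are local to this sketch. */
enum model_dtype { MODEL_DATA_VALUE, MODEL_DATA_VERDICT };
#define MODEL_VERDICT_LEN 16u  /* desc->len for a verdict, per the text */

struct model_desc {
    enum model_dtype type;
    size_t len;
};

/*
 * The check as described in the text: the length comparison is skipped
 * when the element data is a verdict. Returns true if accepted.
 */
bool vulnerable_check(const struct model_desc *desc, size_t set_dlen)
{
    if (desc->type != MODEL_DATA_VERDICT && desc->len != set_dlen)
        return false;
    return true;
}

/*
 * A stricter variant (an assumption for illustration): the element must
 * match the set in both data type and length.
 */
bool strict_check(const struct model_desc *desc,
                  enum model_dtype set_dtype, size_t set_dlen)
{
    return desc->type == set_dtype && desc->len == set_dlen;
}
```

With a verdict element (len 16) and a value set whose dlen is 64, vulnerable_check accepts the element while strict_check rejects it, which is exactly the mismatch the overflow relies on.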

The vulnerability shows up in the nft_set_elem_init function, where a heap buffer overflow is possible. With set->dlen at its maximum of 64 bytes and only 16 bytes reserved, the overflow can reach 48 bytes, potentially leading to a security compromise.

In the code, a local variable elem of type struct nft_set_elem is declared. This variable is used to store information about new elements during their creation, and it is passed to nft_set_elem_init together with the data provided by the user.

The structure struct nft_set_elem is defined in include/net/netfilter/nf_tables.h. It contains unions for key, key_end, and data, each with a maximum size of 64 bytes.
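As a rough picture of that layout, here is a simplified C model. It is a sketch under assumptions, not the kernel definition: the real unions have additional members, the priv pointer is assumed, and stale_bytes is a hypothetical helper that shows how much of the data union stays untouched after a short write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_DATA_MAXLEN 64  /* maximum size of each union, per the text */

/* Simplified model: each member is a union backed by a 64-byte buffer. */
struct model_set_elem {
    union { uint32_t buf[MODEL_DATA_MAXLEN / sizeof(uint32_t)]; } key;
    union { uint32_t buf[MODEL_DATA_MAXLEN / sizeof(uint32_t)]; } key_end;
    union { uint32_t buf[MODEL_DATA_MAXLEN / sizeof(uint32_t)]; } data;
    void *priv;  /* assumed: opaque pointer to the backend element */
};

/* Bytes of the data union left untouched when only `written` bytes are stored. */
size_t stale_bytes(size_t written)
{
    size_t total = sizeof(((struct model_set_elem *)0)->data);

    return written >= total ? 0 : total - written;
}
```

For a 16-byte verdict, stale_bytes(16) evaluates to 48, the same figure as the maximum overflow length.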

Root Cause

The root cause is that, although 64 bytes are reserved in the data union, only 16 bytes are written into elem.data when the buffer overflow is triggered. As a result, the remaining 48 bytes copied by the overflow are whatever happens to be in the uninitialized part of elem.data. In other words, the overflow does not give direct control over the data being copied; it copies leftover contents, which makes the impact less predictable and the bug harder to exploit in a controlled manner.

Exploitation and Explanation:

A public exploit / PoC is available for the vulnerability:

@merlinepedra25 : https://github.com/merlinepedra25/CVE-2022-34918-LPE-PoC

I used the exploit.

Requirements to run the exploit:

You need the libmnl-dev and libnftnl-dev packages installed on your machine.

Affected Version

Linux: introduced in commit fdb9c405e35bdc6e305b9b4e20ebc141ed14fc81 (https://github.com/torvalds/linux/commit/fdb9c405e35bdc6e305b9b4e20ebc141ed14fc81); it affects the Linux kernel since version 5.8.
Ubuntu <= 22.04 before security patch

Test Environment

Platform

Ubuntu 22.04 amd64

Versions

Linux ubuntu 5.12.0 #2 SMP Aug 18 14:17:41 JST 2023 x86_64 x86_64 x86_64 GNU/Linux

Running Exploit

# Once the exploit is downloaded, go to the downloaded folder and run the make command

make # generates the poc binary; run ./poc afterwards to launch the exploit

Result

Use git to download the exploit from: https://github.com/merlinepedra25/CVE-2022-34918-LPE-PoC
Run the make command to create the poc executable, then run ./poc.

Exploitation — Strategy Explanation

As mentioned, multiple exploits may be available for the vulnerability; here we discuss the strategy used by @merlinepedra25 in the exploit at https://github.com/merlinepedra25/CVE-2022-34918-LPE-PoC.

Root Cause One More Time :

The issue is a heap overflow in the nft_set_elem_init() function, where the overflow length can be as much as 48 bytes (64 - 16). The vulnerable object can be allocated by the kernel memory allocator (kmalloc) in the 64, 96, 128, or 192 byte caches. This example focuses on the case where the vulnerable object is allocated in kmalloc-64.

Example

Think of the nft_set_elem_init() function as a construction site where different-sized containers are allocated to store materials. Now, imagine a flaw in how these containers are handled, allowing for an overflow of materials.

In this construction analogy, the overflow length can be substantial — up to 48 extra units of material. The specific containers affected by this vulnerability are the ones designated as kmalloc-64, kmalloc-96, kmalloc-128, or kmalloc-192. For the sake of illustrating this example, let's focus on the kmalloc-64 container.

+------------------------------------------------------+
|                 nft_set_elem_init()                   |
|                   Heap Overflow                       |
+------------------------------------------------------+
|                         48 bytes                      |
| <---------------------------------------------------> |
|                                                       |
|    +----------------------+    +-----------------+    |
|    |   kmalloc-64 object   |    |   Unused Space  |   |
|    +----------------------+    +-----------------+     |
|<--| Vulnerability Object  |<---| Extra Overflow  |<----|
|    |                      |    |    (48 bytes)   |     |
|    |                      |    |                 |     |
|    +----------------------+    +-----------------+     |
|                                                        |
+------------------------------------------------------+

1. Construction Site (Heap):

The heap is like a construction site where memory is dynamically allocated to store different-sized containers.

2. nft_set_elem_init() Function:

This function represents a specific process in the construction site where materials are handled.

3. Heap Overflow:

The vulnerability in nft_set_elem_init() allows for an overflow of 48 bytes beyond the allocated container.

4. Affected Containers (kmalloc):

The vulnerability impacts containers designated as kmalloc-64, kmalloc-96, kmalloc-128, or kmalloc-192. In this example, we focus on the kmalloc-64 container.

5. Vulnerability Object (kmalloc-64):

The specific object affected by the overflow is the kmalloc-64 container. This is where the vulnerability resides, and it's selected when exploiting the issue.

6. Unused Space and Extra Overflow:

The space that follows the kmalloc-64 container becomes the target of the overflow. The overflow, amounting to 48 bytes, extends into this space.
In essence, the vulnerability is like a construction flaw allowing materials to spill over into an unintended area.

Exploit Development Strategy:

As already discussed, the root cause is a heap overflow in the nft_set_elem_init() function. The overflow length can reach 64 - 16 = 48 bytes, and the vulnerable object can be located in kmalloc-{64,96,128,192} (the kmalloc-64 vulnerable object is the one selected for exploitation in this article).

Imagine a scenario where you’ve identified a potential security vulnerability, a bit like finding an unguarded entrance in a fortress. However, the challenge lies in exploiting this vulnerability because you don’t have direct control over the data causing the security breach. It’s like trying to navigate through a maze blindfolded.

In the code, there’s a variable called elem.data that plays a crucial role in the overflow, but it starts uninitialized. This uninitialized variable could be a key to controlling the overflow, turning it into a powerful tool for a potential attacker.

Let’s dive into the caller function, nf_tables_newsetelem, which is like the gatekeeper managing entries into a secure area. It adds elements to a set, and it does so by calling nft_add_set_elem for each element the user wants to include.

Now, imagine this process as a series of doors in a secure facility. The user, like a visitor with a key, can control the number of doors they want to pass through. The key insight is that the process of passing through doors (calls to nft_add_set_elem) can be chained together. This chaining is possible because of the user's ability to iterate over attributes using nla_for_each_nested. It's akin to having a sequence of interconnected rooms.

Now, let’s bring in a real-life analogy: consider each element being added as a room in a building, and each room has its unique set of attributes. The user, acting as a designer, controls the number of rooms they want to design and the features within each room.

Here comes the clever part — as each room (element) is added, it contributes to the overall structure of the building (stack). The uninitialized elem.data is like a space in each room that the user can leverage.

Random Data Stage: Initially, random data occupies the stack, much like furnishing an empty building with random items.
Adding NFT_DATA_VALUE Element: Introducing an element with NFT_DATA_VALUE data is like designing a room with specific features. This user-controlled data now occupies a section of the stack.
Adding NFT_DATA_VERDICT Element: Finally, adding a second element with NFT_DATA_VERDICT data triggers the buffer overflow. The residue of the last element, which contains user-controlled data, is copied during the overflow. This is akin to a design flaw in the building, causing unintended consequences.
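The three stages can be simulated in user space. The sketch below is only a model under assumptions: a single local buffer stands in for the stack slot that elem.data occupies across successive calls, the stale contents are zeroed so the result is deterministic, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_DATA_MAXLEN 64  /* size of the data union, per the text */
#define MODEL_VERDICT_LEN 16  /* bytes written for a verdict element */

/*
 * Simulates two consecutive element additions that reuse the same slot,
 * then the 64-byte copy that follows the second one. `copied` receives
 * what the overflowing memcpy would read.
 */
void simulate_residue(const unsigned char value_data[MODEL_DATA_MAXLEN],
                      const unsigned char verdict[MODEL_VERDICT_LEN],
                      unsigned char copied[MODEL_DATA_MAXLEN])
{
    unsigned char slot[MODEL_DATA_MAXLEN];  /* stands in for elem.data */

    /* Stage 1: stale contents (zeroed here to keep the sketch deterministic). */
    memset(slot, 0, sizeof(slot));

    /* Stage 2: a value element fills all 64 bytes with user data. */
    memcpy(slot, value_data, MODEL_DATA_MAXLEN);

    /* Stage 3: a verdict element overwrites only the first 16 bytes. */
    memcpy(slot, verdict, MODEL_VERDICT_LEN);

    /* The copy uses the set's length (64), so 48 residue bytes go along. */
    memcpy(copied, slot, MODEL_DATA_MAXLEN);
}
```

After the call, bytes 0 to 15 of copied hold the verdict and bytes 16 to 63 hold the data supplied in stage 2, which is how the attacker regains control over the overflowing bytes.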

In essence, the exploit is like a designer strategically placing rooms in a building, utilizing uninitialized spaces to create a chain reaction that results in controlled data influencing the security of the entire structure. The ability to chain these design decisions allows for a unique and independent way to manipulate the overflow, making it less reliant on specific system configurations.

CACHE: A Place where Overflow will Happen

Imagine you’re planning a construction project, and before getting into the details of how to exploit a vulnerability, you need to understand the terrain: specifically, the cache where the overflow is going to happen. In our case, this is represented by the elem object allocated at (0). The size of this elem is dynamic and depends on choices made by the user, as seen in the nft_add_set_elem function. The size can be influenced by options like NFT_SET_ELEM_KEY and NFT_SET_ELEM_KEY_END, which allow the reservation of two buffers with a maximum length of 64 bytes each in elem. This implies that the overflow can potentially occur in multiple caches.

Let’s relate this to a real-life example:

Construction Site Analogy:

Think of the construction project as building a structure, where different-sized containers are used to store materials.
The elem object is like a container whose size can be influenced by choices made during the planning phase of the construction project.

Cache Sizes (Ubuntu 22.04 with GFP_KERNEL):

In our project, we are working on Ubuntu 22.04 with the GFP_KERNEL flag. The relevant cache sizes are kmalloc-64, kmalloc-96, kmalloc-128, and kmalloc-192.

Now, all that’s left is to make sure our elem is aligned with the cache object size for the most effective overflow. The diagram below represents the construction of elem aligning it on 64 bytes, considering the cache sizes.

+------------------------------------+
|            Construction Site       |
|    +--------------------------+    |
|    |           elem           |    |
|    |      (User-Selected)     |    |
|    |--------------------------|    |
|    | NFT_SET_ELEM_KEY,        |    |
|    | NFT_SET_ELEM_KEY_END,    |    |
|    | and other options        |    |
|    +--------------------------+    |
|                                    |
|(Cache Sizes: kmalloc-{64,96,128,192}) |
+------------------------------------+

The construction site represents the memory space where the elem object is allocated.
elem is dynamic and influenced by user-selected options, such as NFT_SET_ELEM_KEY and NFT_SET_ELEM_KEY_END.
The diagram visually depicts the alignment of elem on a cache object size (64 bytes in this case) to optimize the overflow.

Exploit Construction strategy:

The construction involves allocating a certain amount of memory for the object header, adding padding through the use of NFT_SET_ELEM_KEY, and allocating space to store element data of type NFT_DATA_VERDICT. The goal is to make the object fill a kmalloc-64 slot exactly, so that the overflow is as effective as possible.

1. Object Header (20 bytes):

Like the labels, tags, or identifiers you might attach to boxes on a shelf, the construction allocates 20 bytes for the object header. This is the essential information needed to identify and manage the stored elements.

2. Padding via NFT_SET_ELEM_KEY (28 bytes):

Just as you might strategically arrange smaller items around the edges of a box to fill space efficiently, the construction used NFT_SET_ELEM_KEY to add 28 bytes of padding. This helps optimize the layout within the kmalloc-64 cache.

3. Element Data Storage for NFT_DATA_VERDICT (16 bytes):

Similar to allocating specific compartments for certain types of products on a shelf, 16 bytes are reserved to store element data of type NFT_DATA_VERDICT. This could be likened to allocating space for a specific category of items.

+---------------------------------------+
|          Shelf-64 (kmalloc-64)        |
|    +----------------------------+     |
|    |        Object Header       |     |
|    |   (Identification Tags)    |     |
|    +-------------------------------+  |
|    |   Padding via NFT_SET_ELEM_KEY|  |
|    |-------------------------------|  | 
|    |                               |  |
|    |                               |  |
|    +-------------------------------+  |
|    |    Element Data Storage       |  |
|    |  (NFT_DATA_VERDICT Type)      |  |
|    +-------------------------------+  |
+---------------------------------------+

Shelf-64 (kmalloc-64): Represents the specific cache size targeted by the construction strategy.
Object Header: Serves as identification tags, labels, or headers attached to each storage unit.
Padding via NFT_SET_ELEM_KEY: Analogous to strategically filling space with smaller items on a shelf to maximize efficiency.
Element Data Storage: Reserved space for a specific type of data (in this case, NFT_DATA_VERDICT), comparable to allocating specific compartments for certain categories of products on a shelf.
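The arithmetic behind this layout is easy to check. The following C sketch uses only the sizes quoted above (20, 28, and 16 bytes, plus a 64-byte copy); the function names are hypothetical and the code is a calculation aid, not part of any exploit.

```c
#include <assert.h>
#include <stddef.h>

#define HDR_LEN      20  /* object header, per the text */
#define KEY_PAD_LEN  28  /* padding supplied through NFT_SET_ELEM_KEY */
#define VERDICT_LEN  16  /* space reserved for NFT_DATA_VERDICT data */
#define COPY_LEN     64  /* bytes actually copied when set->dlen is 64 */

/* Total size of the constructed object. */
size_t object_size(void)
{
    return HDR_LEN + KEY_PAD_LEN + VERDICT_LEN;
}

/* Offset inside the object where the element data begins. */
size_t data_offset(void)
{
    return HDR_LEN + KEY_PAD_LEN;
}

/* Bytes written beyond a slab slot of `slot_size` bytes by the copy. */
size_t spill_past_slot(size_t slot_size)
{
    size_t end_of_copy = data_offset() + COPY_LEN;

    return end_of_copy > slot_size ? end_of_copy - slot_size : 0;
}
```

The object is exactly 64 bytes, the data area starts at offset 48, and a 64-byte copy starting there spills 48 bytes into whatever follows the kmalloc-64 slot.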

The overflow occurs in kmalloc-x caches and not in the kmalloc-cg-x caches where classical msg_msg objects are allocated, so an alternative information leak method is needed. The exploit therefore uses user_key_payload objects, which are typically used to store sensitive user information in the kernel.

Imagine you’re trying to gather information from labeled boxes in a storage facility, but you can’t directly access the boxes you need. However, you discover another set of special boxes that might hold the information you’re looking for — these are the user_key_payload boxes. Each box has a structure similar to the ones you've been trying to access before, containing a header indicating the size of the object and a buffer for user data.

Structure of user_key_payload Object:

struct user_key_payload {
    struct rcu_head rcu;          /* RCU destructor */
    unsigned short  datalen;      /* length of this data */
    char            data[] __aligned(__alignof__(u64)); /* actual data */
};

In the storage facility, these special boxes are allocated within the function user_preparse in a way similar to how you might allocate space for certain items based on their size.

Allocation in user_preparse Function:

int user_preparse(struct key_preparsed_payload *prep) {
    struct user_key_payload *upayload;
    size_t datalen = prep->datalen;

    if (datalen <= 0 || datalen > 32767 || !prep->data)
        return -EINVAL;

    upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL);  // Allocation at (6)
    if (!upayload)
        return -ENOMEM;

    /* attach the data */
    prep->quotalen = datalen;
    prep->payload.data[0] = upayload;
    upayload->datalen = datalen;
    memcpy(upayload->data, prep->data, datalen);  // Copying data at (7)
    return 0;
}

The allocation at (6) ensures that the length of the allocated space is based on the size of the user-provided data. The data is then stored just after the header with a memcpy call at (7). The headers of user_key_payload objects are 24 bytes long, allowing them to be used to fill several caches, from kmalloc-32 to kmalloc-8k.
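
To make the size arithmetic concrete, here is a minimal userland sketch (not kernel code). The struct is a stand-in that mirrors the layout shown above on a 64-bit build. The names mock_rcu_head, mock_user_key_payload, and kmalloc_cache_for are hypothetical, and the cache-size table is an assumption that lists only the usual general-purpose kmalloc sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct rcu_head on a 64-bit kernel: two pointers, 16 bytes. */
struct mock_rcu_head {
    uint64_t next;
    uint64_t func;
};

/* Mirrors the field layout of user_key_payload, for illustration only. */
struct mock_user_key_payload {
    struct mock_rcu_head rcu;
    unsigned short datalen;
    _Alignas(8) char data[];
};

/* Assumed subset of the general-purpose kmalloc cache sizes. */
static const size_t kmalloc_sizes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Returns the smallest listed cache that fits header + datalen, or 0 if none does. */
size_t kmalloc_cache_for(size_t datalen)
{
    /* Same bounds that user_preparse enforces on datalen. */
    assert(datalen > 0 && datalen <= 32767);

    size_t need = offsetof(struct mock_user_key_payload, data) + datalen;
    for (size_t i = 0; i < sizeof(kmalloc_sizes) / sizeof(kmalloc_sizes[0]); i++) {
        if (need <= kmalloc_sizes[i])
            return kmalloc_sizes[i];
    }
    return 0; /* larger than the biggest cache in this table */
}
```

Under these assumptions, a payload of 9 to 40 bytes lands in kmalloc-64, which is the range that matters when the key has to sit next to other 64-byte objects.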

The goal is similar to the previous method with msg_msg objects: overwrite the datalen field with a larger value than the initial one. When retrieving the information stored, the corrupted object will return more data than initially provided by the user.
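
The effect of a corrupted datalen can be simulated safely in userland. The sketch below is only a model: mock_slab, mock_key_store, mock_key_read, and mock_corrupt_datalen are hypothetical names, the "slab" is a plain byte array holding two adjacent 64-byte slots, and the read helper simply trusts the stored length, which is the behaviour the leak relies on.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE      64  /* one kmalloc-64 object */
#define HDR_SIZE       24  /* user_key_payload header size from the text */
#define DATALEN_OFFSET 16  /* datalen follows the 16-byte rcu_head (assumed 64-bit layout) */

/* Two adjacent slots: slot 0 plays the key payload, slot 1 its neighbour. */
struct mock_slab {
    unsigned char bytes[2 * SLOT_SIZE];
};

/* Stores a key payload in slot 0: length in the header, data right after it. */
void mock_key_store(struct mock_slab *slab, const void *data, uint16_t len)
{
    assert(len <= SLOT_SIZE - HDR_SIZE);
    memcpy(slab->bytes + DATALEN_OFFSET, &len, sizeof(len));
    memcpy(slab->bytes + HDR_SIZE, data, len);
}

/* Models reading the key back: copies min(datalen, outlen) bytes, trusting datalen. */
size_t mock_key_read(const struct mock_slab *slab, void *out, size_t outlen)
{
    uint16_t len;
    memcpy(&len, slab->bytes + DATALEN_OFFSET, sizeof(len));

    size_t n = len < outlen ? len : outlen;
    /* Clamp to the mock slab so the simulation itself stays memory-safe. */
    if (n > sizeof(slab->bytes) - HDR_SIZE)
        n = sizeof(slab->bytes) - HDR_SIZE;
    memcpy(out, slab->bytes + HDR_SIZE, n);
    return n;
}

/* Models the overflow rewriting datalen in the key object. */
void mock_corrupt_datalen(struct mock_slab *slab, uint16_t new_len)
{
    memcpy(slab->bytes + DATALEN_OFFSET, &new_len, sizeof(new_len));
}
```

Storing an 8-byte payload and then raising datalen to 56 makes mock_key_read return 56 bytes, the last 16 of which belong to the neighbouring slot.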

However, there’s a limitation to this approach. The number of allocated objects is restricted by sysctl variables, specifically kernel.keys.maxkeys (limit on the number of allowed keys) and kernel.keys.maxbytes (limit on the number of stored bytes). The default values for Ubuntu 22.04 are very low:

kernel.keys.maxbytes = 20000
kernel.keys.maxkeys = 200
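
A rough way to see how tight these limits are is to compute an upper bound on the number of keys that can be sprayed. The sketch below assumes that each key is charged its payload length plus its description length plus one byte. That accounting rule and the names key_quota, key_charge, and max_sprayable_keys are assumptions made for illustration, so treat the result as an estimate only.

```c
#include <assert.h>
#include <stddef.h>

struct key_quota {
    size_t maxkeys;   /* kernel.keys.maxkeys */
    size_t maxbytes;  /* kernel.keys.maxbytes */
};

/* Assumed per-key charge: description plus terminating NUL plus payload bytes. */
size_t key_charge(size_t desc_len, size_t payload_len)
{
    return desc_len + 1 + payload_len;
}

/* Upper bound on the number of keys: whichever limit is hit first. */
size_t max_sprayable_keys(struct key_quota q, size_t desc_len, size_t payload_len)
{
    assert(q.maxkeys > 0 && q.maxbytes > 0);

    size_t by_bytes = q.maxbytes / key_charge(desc_len, payload_len);
    return by_bytes < q.maxkeys ? by_bytes : q.maxkeys;
}
```

With the defaults above, small payloads are capped by maxkeys (200 keys), while a 1024-byte payload with an 8-character description is capped by maxbytes at about 19 keys.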

+---------------------------------------------------+
|            Storage Facility (Kernel)              |
|    +-----------------------------------------+    |
|    |          user_key_payload Box           |    |
|    |     (Header + Buffer for User Data)     |    |
|    +-----------------------------------------+    |
|    |           Allocation (kmalloc)          |    |
|    |         Based on User Data Size         |    |
|    +-----------------------------------------+    |
+---------------------------------------------------+

The storage facility represents the kernel memory space.
user_key_payload boxes are analogous to storage boxes containing a header and user data.
Allocation is performed based on the size of user-provided data, and the goal is to manipulate the headers for controlled overflow.

In this analogy, think of the user_key_payload boxes as specially labeled storage containers that might hold the information you're looking for, and the challenge is to efficiently use them to extract valuable details about the system.

The exploit focuses on the kmalloc-64 cache because of its small object size. It targets percpu_ref_data objects, which are allocated in this cache and contain function pointers useful for computing the Kernel Address Space Layout Randomization (KASLR) base and module bases. These objects are allocated during the initialization of an io_ring_ctx object, specifically in the io_ring_ctx_alloc function, which is part of the core kernel. The io_uring_setup syscall is the simplest way to allocate them, and the close syscall is used to trigger their release.

Put together, the memory-leak phase goes through these steps:

Focus on the kmalloc-64 cache for efficient information leakage.
The target objects within this cache are percpu_ref_data objects, which contain useful pointers.
The percpu_ref_data structure includes pointers to functions (release and confirm_switch), which reveal the KASLR base or module bases when leaked, and a pointer to a dynamically allocated object (ref), which is useful for computing the physmap base.
percpu_ref_data objects are allocated during the initialization of an io_ring_ctx object, which the io_uring_setup syscall triggers.
The io_ring_ctx_alloc function in fs/io_uring.c is responsible for this allocation.
Leaking the address of either io_ring_ctx_ref_free or io_rsrc_node_ref_zero is enough to compute the KASLR base.
As a beneficial side effect, io_uring_setup also allocates percpu_ref_data objects whose release field holds the address of io_rsrc_node_ref_zero, which gives the exploit a second recognizable pointer to look for.
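
The last two steps can be sketched as a small helper that turns a leaked release pointer into a candidate KASLR base. The two symbol offsets below are HYPOTHETICAL placeholders (real values depend on the exact kernel build), and the 2 MiB alignment check is an assumption about how x86-64 KASLR slides the kernel text. The function name kaslr_base_from_release is also made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL offsets from the kernel text base; they must be taken from
 * the symbol table of the exact kernel build being targeted. */
#define HYPOTHETICAL_OFF_IO_RING_CTX_REF_FREE  0x0040a1e0ULL
#define HYPOTHETICAL_OFF_IO_RSRC_NODE_REF_ZERO 0x0040b2f0ULL

/* Assumption: the kernel text base is aligned to 2 MiB. */
#define KERNEL_TEXT_ALIGN 0x200000ULL

/* Tries each known offset; returns the first candidate base that is
 * correctly aligned, or 0 if the pointer matches neither function. */
uint64_t kaslr_base_from_release(uint64_t leaked_release)
{
    const uint64_t offsets[] = {
        HYPOTHETICAL_OFF_IO_RING_CTX_REF_FREE,
        HYPOTHETICAL_OFF_IO_RSRC_NODE_REF_ZERO,
    };

    assert(KERNEL_TEXT_ALIGN != 0);
    for (size_t i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
        if (leaked_release < offsets[i])
            continue;
        uint64_t base = leaked_release - offsets[i];
        if (base % KERNEL_TEXT_ALIGN == 0)
            return base;
    }
    return 0;
}
```

Because the same helper accepts either function address, it does not matter which of the two kinds of percpu_ref_data object ends up next to the corrupted key.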

Example Diagram for leaking steps:

+---------------------------------------------+
|          Kernel Memory Space                |
|                                             |
|    +-----------------------------------+    |
|    |            kmalloc-64 Cache       |    |
|    |                                   |    |
|    |  +-----------------------------+  |    |
|    |  |     percpu_ref_data Object  |  |    |
|    |  |                             |  |    |
|    |  |  +-----------------------+  |  |    |
|    |  |  |     count             |  |  |    |
|    |  |  |     release           |  |  |    |
|    |  |  |     confirm_switch    |  |  |    |
|    |  |  |     force_atomic      |  |  |    |
|    |  |  |     allow_reinit      |  |  |    |
|    |  |  |     rcu               |  |  |    |
|    |  |  |     ref               |  |  |    |
|    |  |  +-----------------------+  |  |    |
|    |  +-----------------------------+  |    |
|    |                                   |    |
|    +-----------------------------------+    |
+---------------------------------------------+

The diagram represents the kernel memory space with a focus on the kmalloc-64 cache.
Within this cache, percpu_ref_data objects are allocated during the initialization of an io_ring_ctx object using the io_uring_setup syscall.
These percpu_ref_data objects contain pointers to functions and a reference to dynamically allocated objects, making them valuable targets for information leakage.
The goal is to exploit the leak of information about functions like io_ring_ctx_ref_free and io_rsrc_node_ref_zero to compute the KASLR base and improve the overall exploit.
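
As a rough check that such an object fits in a kmalloc-64 slot, the field list from the diagram can be written as a userland stand-in. This is a sketch assuming a 64-bit build where pointers and atomic_long_t are 8 bytes. mock_percpu_ref_data, leaked_ptrs, and parse_leaked_slot are hypothetical names, and the real kernel definition may differ between versions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct mock_rcu_head {
    uint64_t next;
    uint64_t func;
};

/* Field order follows the diagram; pointer-sized fields are modelled as uint64_t. */
struct mock_percpu_ref_data {
    int64_t  count;            /* atomic_long_t in the kernel */
    uint64_t release;          /* function pointer: leaks a kernel text address */
    uint64_t confirm_switch;   /* function pointer */
    bool     force_atomic : 1;
    bool     allow_reinit : 1;
    struct mock_rcu_head rcu;
    uint64_t ref;              /* pointer to a dynamically allocated object */
};

_Static_assert(sizeof(struct mock_percpu_ref_data) <= 64,
               "stand-in must fit in a 64-byte slot");

/* The two fields the leak is after. */
struct leaked_ptrs {
    uint64_t release;
    uint64_t ref;
};

/* Pulls release and ref out of the raw bytes of one leaked 64-byte slot. */
struct leaked_ptrs parse_leaked_slot(const unsigned char *slot)
{
    struct leaked_ptrs out;

    assert(slot != NULL);
    memcpy(&out.release, slot + offsetof(struct mock_percpu_ref_data, release), sizeof(out.release));
    memcpy(&out.ref, slot + offsetof(struct mock_percpu_ref_data, ref), sizeof(out.ref));
    return out;
}
```

With this layout the stand-in is 56 bytes, with release at offset 8 and ref at offset 48, so both pointers sit inside a single 64-byte object.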

High-Level Steps to Develop an Exploit

1. Heap Layout Construction:

Construct the heap layout with the following components:
vul_obj: Vulnerability object
user_key_payload: Payload containing user-controlled data
percpu_ref_data: Per-CPU reference data

2. Overflow and Tamper (Leak Addresses):

Trigger a heap overflow to tamper with user_key_payload.
Overwrite user_key_payload->datalen with a larger value so that reading the key leaks percpu_ref_data->release (which gives the kernel base address) and percpu_ref_data->ref (which gives the physmap base address).
This step obtains the kernel addresses needed later for privilege escalation.

3. Heap Layout Reconstruction (Arbitrary Write):

Construct a new heap layout with the following components:
vul_obj: Vulnerability object
simple_xattr: Simple extended attribute
Trigger another overflow to tamper with simple_xattr and manipulate its linked list.

4. Restricted Arbitrary Write:

Leverage the restricted arbitrary write on simple_xattr to modify modprobe_path.
The write is performed when the extended attribute (xattr) is removed from the linked list.
The goal is to escalate privileges by overwriting modprobe_path, which defaults to /sbin/modprobe, so that it points to an attacker-controlled program and arbitrary commands can be executed.

5. Prerequisites:

The physmap address needs to be leaked.
The root directory (“/”) contains both the kernel base address and the physmap address.
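
Why the write in steps 3 and 4 is "restricted" becomes clearer with a userland model of a doubly linked list unlink. The sketch below is a simplification with hypothetical names (mock_list_head, mock_list_del). It performs the same two stores as a classic list removal and ignores any list-hardening checks a real kernel may apply.

```c
#include <assert.h>
#include <stddef.h>

struct mock_list_head {
    struct mock_list_head *next;
    struct mock_list_head *prev;
};

/* Classic unlink: each neighbour receives a pointer to the other one.
 * If an overflow controls entry->next and entry->prev, these two stores
 * become attacker-directed writes. */
void mock_list_del(struct mock_list_head *entry)
{
    assert(entry != NULL && entry->next != NULL && entry->prev != NULL);

    entry->next->prev = entry->prev;  /* writes prev at address next + 8 */
    entry->prev->next = entry->next;  /* writes next at address prev + 0 */
}
```

Setting prev to the target address and next to the desired value writes that value to the target, but the mirrored store also writes into the memory that next points to. The written value therefore has to be a valid, writable address itself, which is why a leaked physmap address is a prerequisite.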

+-------------------------------------+
|            vul_obj                  |
+-------------------------------------+
|        user_key_payload             |
|    +--------------------------+     |
|    |        percpu_ref_data   |     |
|    +--------------------------+     |
|                                     |
|                                     |
|            (Heap Overflow)          |
|                                     |
+-------------------------------------+
|           simple_xattr              |
|    +--------------------------+     |
|    |      Modified list       |     |
|    +--------------------------+     |
|                                     |
|            (Arbitrary Write)        |
|                                     |
+-------------------------------------+

The heap layout is arranged so that the overflow from vul_obj lands on a chosen neighbouring object (user_key_payload first, then simple_xattr).
The first overflow is used to leak kernel addresses (percpu_ref_data->release and percpu_ref_data->ref).
The second overflow corrupts the simple_xattr list pointers, and removing the extended attribute then performs a restricted arbitrary write that modifies modprobe_path.
Successful exploitation leads to privilege escalation, allowing an attacker to execute arbitrary commands with elevated privileges.

Patch Diffing

A change was made to fix the vulnerability. As discussed, the flaw lies in the nftables framework, in the nft_setelem_parse_data function that handles NFT_MSG_NEWSETELEM, the message used to add elements to a set. The vulnerability stems from an oversight in the type check performed during element addition.

The nft_setelem_parse_data function initializes data and desc and then performs a legality check to confirm that the size of the incoming data matches the set's data length. The problem arises when a VERDICT element is submitted to a set that stores the VALUE type. In this scenario, the check does not cover the VERDICT type, so a VERDICT element can be added to a VALUE set, leading to a potential heap overflow.

The remedy involves a patch that introduces a dedicated check for the VERDICT type, ensuring both type and length conform to the set's expectations.

+---------------------------------------+
|           nft_setelem_parse_data      |
|                                       |
|  +---------------------------+        |
|  |        Initialization     |        |
|  |                           |        |
|  |  - nft_data_init          |        |
|  |  - Size legality check    |        |
|  +---------------------------+        |
|                |                      |
|                v                      |
|  +---------------------------+        |
|  |   Type Check (Before)     |        |
|  |  - VERDICT vs. VALUE type |        |
|  |  - Length verification    |        |
|  +---------------------------+        |
|                |                      |
|                v                      |
|  +---------------------------+        |
|  |    Heap Overflow Occurs   |        |
|  |  - Addition of VERDICT    |        |
|  |    to a VALUE set         |        |
|  +---------------------------+        |
|                |                      |
|                v                      |
|  +---------------------------+        |
|  |      Patched Check        |        |
|  |  - Specific VERDICT check |        |
|  |  - Type and length match  |        |
|  +---------------------------+        |
|                                       |
+---------------------------------------+

The diagram outlines the control flow within the nft_setelem_parse_data function.
The vulnerability arises when the type check fails to appropriately handle the introduction of a VERDICT element into a VALUE set, potentially leading to a heap overflow.
The patched version includes an additional check specifically addressing the VERDICT type, ensuring both type and length align with the set’s expectations, thereby preventing the heap overflow vulnerability.
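
The difference between the two checks can be modelled in a few lines of userland C. This is a simplified sketch paraphrased from the description above, not the kernel source. The mock_ types and the functions check_before_patch and check_after_patch are stand-ins invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum mock_data_type {
    MOCK_DATA_VALUE,
    MOCK_DATA_VERDICT
};

/* What the set expects versus what the incoming element describes. */
struct mock_set  { enum mock_data_type dtype; uint32_t dlen; };
struct mock_desc { enum mock_data_type type;  uint32_t len;  };

/* Before the patch: the length comparison is skipped for VERDICT elements,
 * and the element type is never compared with the set's type. */
bool check_before_patch(const struct mock_set *set, const struct mock_desc *desc)
{
    assert(set != NULL && desc != NULL);
    if (desc->type != MOCK_DATA_VERDICT && desc->len != set->dlen)
        return false;
    return true;
}

/* After the patch: both the type and the length must match the set. */
bool check_after_patch(const struct mock_set *set, const struct mock_desc *desc)
{
    assert(set != NULL && desc != NULL);
    enum mock_data_type expected =
        set->dtype == MOCK_DATA_VERDICT ? MOCK_DATA_VERDICT : MOCK_DATA_VALUE;
    return desc->type == expected && desc->len == set->dlen;
}
```

A 16-byte VERDICT element offered to a VALUE set with a 64-byte data length passes the first check and fails the second, while correctly typed elements pass both.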

Final Thoughts

Analyzing CVE-2022-34918 and addressing the security concern has been an illuminating experience. Delving into a heap overflow driven by restricted user data inputs, understanding its implications, and applying the necessary fixes has deepened my understanding of the nf_tables.ko module and of heap overflow exploitation.

Furthermore, I would like to acknowledge @Arthur Mongodin for his remarkable contribution in crafting an exploit for the vulnerability. The exploit not only provides a practical demonstration of the vulnerability but also enabled me to test and confirm that the vulnerability exists.

I trust that reading this account was as delightful for you as it was for me to craft it.

There can also be multiple ways to exploit the vulnerability. This exploit assumes that a particular address is consistently mapped in kernel space, which is not universally guaranteed. Consequently, its reliability is not absolute, although it has a commendable success rate. Another challenge is the kernel panic that occurs once the exploit completes. To mitigate this, efforts are underway to identify objects that can persist in kernel memory after the exploitation process ends. This requires thorough experimentation with various placements, but it is a worthwhile task.