Use-After-Free Vulnerability: CVE-2022-32250

45 min read · Aug 25, 2023

Summary

CVE-2022-32250: net/netfilter/nf_tables_api.c in the Linux kernel through 5.18.1 allows a local user (able to create user/net namespaces) to escalate privileges to root because an incorrect NFT_STATEFUL_EXPR check leads to a use-after-free.

What is NF_Tables?

NF_Tables is a packet-filtering framework in the Linux kernel that provides an efficient and flexible way to classify and manipulate network packets. It is designed to replace the older iptables and ip6tables tools for firewall and packet filtering tasks, offering improved performance, syntax, and capabilities.

NF_Tables allows you to define rulesets to control the flow of network packets through your system. It uses a rule-based syntax to match packets based on various criteria and then applies actions to those packets, such as dropping, accepting, or modifying them. Rules are grouped into chains and chains into tables, providing a hierarchical structure for packet filtering.

Here's a simple example of using NF_Tables to create a rule that allows incoming SSH (TCP port 22) connections:

1. Installing nf_tables (if not already installed):

Make sure NF_Tables is installed on your system. You can typically install it using your distribution package manager.

2. Creating a nf_tables Rule:

Open a terminal and run the following commands as the root user (or using sudo):

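# Create the table and base chain if they do not exist yet
nft add table ip filter
nft add chain ip filter input { type filter hook input priority 0 \; }

# Create an nftables rule to allow incoming SSH connections
nft add rule ip filter input tcp dport 22 accept

# List the rules to verify
nft list ruleset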

Let's break down the command:

- nft add rule: This is the command to add a rule to an nf_tables ruleset.

- ip: Specifies the address family of the table, here IPv4.

- filter: Refers to the filter table, which is commonly used for packet filtering.

- input: Refers to the input chain, which is used for incoming packets.

- tcp dport 22: This is the matching criteria. It matches TCP packets with destination port 22 (SSH).

- accept: This is the action to take if the packet matches the criteria. In this case, it allows the packet.

Testing the Rule: To test the rule, you can initiate an SSH connection to your system from another machine. If the rule is correctly configured, the connection should be allowed.

NF_Tables provides a wide range of features, including more complex rule structures, address translation, connection tracking, and more. The above example demonstrates how to create a basic rule to allow SSH connections, but NF_Tables can be used for much more advanced network packet filtering and manipulation tasks.

Expression and Registers

In nf_tables, expressions and registers are used to define actions and store data while processing network packets. Expressions are building blocks that allow you to perform various actions on packets, such as accepting, dropping, or modifying them. Registers are memory locations used to store packet data temporarily for further processing.

Imagine a security guard at the entrance of a building whose job is to decide who can enter and what they are allowed to do inside. To make these decisions, the guard needs to look at certain information about the people entering: their name, ID card, purpose of visit, and so on.

In the world of networks, similar decisions need to be made to allow or control network traffic. This is where expressions come in. Just as the guard needs information to make decisions, expressions in network filtering gather information from network traffic and perform actions based on it.

Expressions are like pieces of logic that help us understand what's happening in the network traffic and allow us to take action accordingly. They are the building blocks of rules in a network firewall or filter.

Think of expressions as mini-programs that analyze network data to answer questions like "Who is sending this data?" or "What type of data is being sent?" This information is crucial for making decisions about whether to allow or block traffic.

When someone creates a new type of expression, like logic with a specific function, it’s defined in a special place in the computer’s brain (kernel). This special place knows the name of the expression, what it can do, and how it should be used.

In the Linux kernel, when a new type of expression is created by a module (such as net/netfilter/nft_immediate.c), there is a special structure called nft_expr_type associated with it. This structure contains important information about the expression type, including its name, a table of functions (called ops functions) that define its behavior, various flags, and more.

Explanation with an Analogy:

Think of expression types as different types of tools you can use to build something. Each tool has its own set of characteristics and functions. The nft_expr_type structure is like a user manual for each tool. It tells you the tool's name, what functions it can perform, and any special features it has.

Example Analogy: Building Blocks with Different Tools

Imagine you have a box of building blocks, each representing a different type of expression. You want to build various structures using these blocks. Each block type has a user manual that explains how to use it.

Block Types: Imagine you have blocks of different shapes, like squares, triangles, and circles.
User Manuals: Each block type comes with a user manual that tells you its name, how to stack them, how to connect them, and any special features they have.
Building: You follow the instructions in the user manual to stack and connect the blocks in specific ways to create different structures.

                  +-----------------------+
                  |nft_expr_type Structure|
                  +-----------------------+
                  | Name: Immediate       |
                  | Ops Table:            |
                  | - Op1: Perform Action |
                  | - Op2: Handle Event   |
                  | Flags: Important      |
                  +-----------------------+

                         |
                         v

+------------------+   +------------------+   +------------------+
|    Expression    |   |    Expression    |   |    Expression    |
|   Type: Square   |   |  Type: Triangle  |   |   Type: Circle   |
|   Ops Table:     |   |  Ops Table:      |   |   Ops Table:     |
|   - Op1: Stack   |   |  - Op1: Connect  |   |   - Op1: Roll    |
|   - Op2: Paint   |   |  - Op2: Color    |   |   - Op2: Bounce  |
|                  |   |                  |   |                  |
+------------------+   +------------------+   +------------------+

In this analogy:

- The nft_expr_type structure is like the user manual for each block type.
- Each block type corresponds to an expression type defined in the kernel.
- The ops table in the structure is like the set of instructions in the user manual.
- The building process corresponds to using the expression types to achieve specific actions or behaviors.

Just like you use different building blocks to create different structures, in the kernel, different expression types are used to achieve various functionalities within the networking and filtering systems. The nft_expr_type structure provides the necessary information for the kernel to understand and utilize these expression types effectively.
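To make the "user manual" idea concrete, here is a minimal userland sketch in C. It is not the kernel's code: the struct names, the registry and the "add" type are hypothetical and exist only for illustration. It shows a type that carries a name, an ops table of function pointers and flags, and a generic caller that drives any expression through that table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the kernel's real definitions. */
struct model_expr;

/* The "ops table": what an expression of this type can do. */
struct model_expr_ops {
    void (*init)(struct model_expr *e, int arg);
    int  (*eval)(const struct model_expr *e, int input);
};

/* The "user manual": name, ops table and flags of one expression type. */
struct model_expr_type {
    const char *name;
    const struct model_expr_ops *ops;
    unsigned int flags;
};

/* One concrete expression built from a type. */
struct model_expr {
    const struct model_expr_type *type;
    int value;
};

/* "immediate": ignores its input and always yields a fixed value. */
static void immediate_init(struct model_expr *e, int arg) { e->value = arg; }
static int immediate_eval(const struct model_expr *e, int input)
{
    (void)input;
    return e->value;
}

/* "add" (made up for this sketch): adds a fixed value to its input. */
static void add_init(struct model_expr *e, int arg) { e->value = arg; }
static int add_eval(const struct model_expr *e, int input)
{
    return input + e->value;
}

static const struct model_expr_ops immediate_ops = { immediate_init, immediate_eval };
static const struct model_expr_ops add_ops = { add_init, add_eval };

static const struct model_expr_type registered_types[] = {
    { "immediate", &immediate_ops, 0 },
    { "add",       &add_ops,       0 },
};

/* Find a registered type by name, or return NULL if it is unknown. */
const struct model_expr_type *model_type_find(const char *name)
{
    size_t n = sizeof(registered_types) / sizeof(registered_types[0]);
    for (size_t i = 0; i < n; i++)
        if (strcmp(registered_types[i].name, name) == 0)
            return &registered_types[i];
    return NULL;
}

/* Build an expression of the named type and run it once; 0 on success. */
int model_expr_run(const char *name, int arg, int input, int *out)
{
    struct model_expr e;

    assert(name != NULL && out != NULL);
    e.type = model_type_find(name);
    if (e.type == NULL)
        return -1;
    e.type->ops->init(&e, arg);
    *out = e.type->ops->eval(&e, input);
    return 0;
}
```

The caller never needs to know which type it is handling: calling model_expr_run("immediate", 7, 100, &out) and model_expr_run("add", 5, 100, &out) goes through the same code path, and only the ops table differs.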

A More Practical Example:

An example of an expression is the “counter” expression, which is used to count the number of packets that match a specific rule. Let’s see how to use the “counter” expression to count incoming packets on a specific port.

Suppose you want to count the number of incoming packets on port 80 (HTTP) using nf_tables. First, you would create a new table to store the rule:

nft add table ip leak_chain

Then, you would add a new chain to the table:

nft add chain ip leak_chain input { type filter hook input priority 0 \; }

Next, you can define the rule with the “counter” expression:

nft add rule ip leak_chain input tcp dport 80 counter

Now, every time a packet with TCP destination port 80 arrives, nf_tables will increment the counter for that rule. To view the counters, you can use the following command:

nft list ruleset

You will see the counters associated with the rule:

table ip leak_chain {
    chain input {
        type filter hook input priority 0; policy accept;
        tcp dport 80 counter packets 1024 bytes 122880
    }
}

In this example, 1024 packets with a total of 122880 bytes have been counted so far.

Registers in nf_tables hold data while a rule is being evaluated: one expression loads a value into a register and a later expression in the same rule reads it. For example, one expression can load the source IP address of a packet into a register, and the next one can compare it against a value or look it up in a set.

Overall, expressions and registers in nf_tables provide powerful mechanisms for customizing packet processing and implementing advanced filtering and networking logic.
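As a rough illustration of how expressions and registers cooperate for a rule like tcp dport 80 counter, here is a small userland model in C. All names here (model_packet, model_regs, model_rule_eval and the helpers) are hypothetical, and the real kernel evaluation loop is more involved; the sketch only shows the idea of one expression filling a register, the next one checking it, and the counter running only when the match succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model; not the kernel's data structures. */
struct model_packet  { uint16_t dport; uint32_t len; };
struct model_regs    { uint32_t data[4]; bool stop; };
struct model_counter { uint64_t packets; uint64_t bytes; };

/* "tcp dport": load the destination port into register 0. */
static void load_dport(const struct model_packet *p, struct model_regs *r)
{
    r->data[0] = p->dport;
}

/* "80": compare register 0 with a constant; stop the rule on mismatch. */
static void cmp_eq(struct model_regs *r, uint32_t wanted)
{
    if (r->data[0] != wanted)
        r->stop = true;
}

/* "counter": account for the packet. */
static void count_packet(struct model_counter *c, const struct model_packet *p)
{
    c->packets++;
    c->bytes += p->len;
}

/* Evaluate the three expressions of the rule in order for one packet. */
void model_rule_eval(struct model_counter *c, const struct model_packet *p)
{
    struct model_regs r = { {0}, false };

    assert(c != NULL && p != NULL);
    load_dport(p, &r);
    cmp_eq(&r, 80);
    if (r.stop)
        return;          /* packet did not match; later expressions are skipped */
    count_packet(c, p);
}
```

Feeding this model two packets for port 80 and one for port 22 leaves the counter at two packets, which mirrors how the packets/bytes values in the nft list ruleset output grow only for matching traffic.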

Two flag values of an expression type matter for this write-up: stateful (NFT_EXPR_STATEFUL) and garbage collectible (NFT_EXPR_GC). These flags indicate specific characteristics or behaviors of the expression type.

                           +------------------+
                           | nft_expr_type    |
                           +------------------+
                           | select_ops       | --> Select appropriate ops
                           | release_ops      | --> Release ops resources
                           | ops              | --> Default ops
                           | list             | --> Internal list
                           | name             | --> Identifier
                           | owner            | --> Module reference
                           | policy           | --> Attribute policy
                           | maxattr          | --> Highest attribute number
                           | family           | --> Address family
                           | flags            | --> Expression type flags
                           +------------------+
                                     |
            +------------------------|------------------------+
            |                        |                        |
            |                  +------------+          +------------+
            |                  | NFT_EXPR_  |          | NFT_EXPR_  |
            |                  | STATEFUL   |          | GC         |
            |                  +------------+          +------------+
            |                        ^                        ^
            |                        |                        |
            +------------------------|------------------------+
                                     |
                              Specific Expression Flags
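
The flags field is just a bitmask, so checking a characteristic is a single bitwise test. The sketch below is a userland illustration in C: the MODEL_* names and values are placeholders that mirror NFT_EXPR_STATEFUL and NFT_EXPR_GC, and the flag assignment of the three example types is an assumption made for this example, not something copied from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder flag bits mirroring NFT_EXPR_STATEFUL / NFT_EXPR_GC. */
#define MODEL_EXPR_STATEFUL 0x1u
#define MODEL_EXPR_GC       0x2u

/* Hypothetical, reduced version of an expression type: name and flags only. */
struct model_expr_type {
    const char *name;
    unsigned int flags;
};

/* Example types; which flags each one carries is assumed for illustration. */
const struct model_expr_type model_counter   = { "counter",   MODEL_EXPR_STATEFUL };
const struct model_expr_type model_connlimit = { "connlimit", MODEL_EXPR_STATEFUL | MODEL_EXPR_GC };
const struct model_expr_type model_lookup    = { "lookup",    0 };

/* True when the type keeps per-instance state. */
bool model_type_is_stateful(const struct model_expr_type *t)
{
    assert(t != NULL);
    return (t->flags & MODEL_EXPR_STATEFUL) != 0;
}

/* True when the type asks for garbage collection of its state. */
bool model_type_needs_gc(const struct model_expr_type *t)
{
    assert(t != NULL);
    return (t->flags & MODEL_EXPR_GC) != 0;
}
```

With these definitions, model_type_is_stateful(&model_lookup) is false while model_type_is_stateful(&model_counter) is true, which is the kind of distinction the stateful check in the CVE summary is about.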

Of course, this is a very simplified picture, but I hope it gives you enough to get started.

If you want to know more about nf_tables and expressions, see the links below:

https://wiki.nftables.org/wiki-nftables/index.php/What_is_nftables%3F

https://www.vicarius.io/vsociety/posts/3001

Build the Lab

Because this is a kernel module vulnerability, it is tricky to debug, so you need a little more patience than usual 🦐

VirtualBox
I used 2 Linux Virtual Machines.

We have to debug kernel code, and the kernel is not a user-space process, so GDB alone cannot attach to it; we need a client/server architecture. The kernel can be debugged remotely using a GDB server on the target machine and gdb on the host/development machine. The Linux kernel has a GDB server implementation called KGDB. It communicates with a GDB client over a network or serial port connection.

Host/Development Machine: Runs gdb against the vmlinux file which contains the symbols and performs debugging
Target Machine: Runs kgdb and is the machine to be debugged

    ------------------                              --------------------
    |       Host      |                             |       Target     |
    |                 |                             |                  |
    |  -------------  |                             |   ------------   |
    | |     gdb     | |<--------------------------->|  |    kgdb    |  |
    | |             | |             Serial or       |  |            |  |
    | --------------  |             Ethernet        |  -------------   |
    |       |         |             Connection      |        |         |
    |  -------------- |                             |  --------------  |
    | | Kernel image ||                             |  |Linux Kernel | |
    | | with debug   ||                             |  |(zImage)     | |
    | | symbols      ||                             |  |             | |
    | | (vmlinux)    ||                             |  --------------- |
    | ----------------|                             |                  |
    -------------------                             --------------------

Hence, two machines are required for using kgdb.

KGDB is a GDB server implementation integrated into the Linux kernel. It supports serial port communication (available in the mainline kernel) and network communication (patch required).

It has been available in the mainline Linux kernel since version 2.6.26 (x86 and sparc) and 2.6.27 (arm, mips, and PPC).

It enables full control over kernel execution on the target, including memory reads and writes, step-by-step execution, and even breakpoints in interrupt handlers.

There may be other ways to do this, but this is the setup I generally use.

I am using the Ubuntu 22.04 LTS (amd64) ISO: https://releases.ubuntu.com/ and Debian as the dev machine: https://www.debian.org/distrib/

Connect and Create a Serial Port in VirtualBox

Assumption for this step:

It is assumed that you have downloaded the ISO images locally and already created 2 VMs with them.

For the demonstration, I created 2 machines named target and Dev Machine.

Creating a serial port in VirtualBox and connecting the machines is easy:

1. Select your target machine in VirtualBox and go to the Settings option.
2. In the Settings tab of the target machine, select Serial Ports and enter the configuration below:

Don't check Connect to existing pipe/socket, as we don't have an existing one.

Once the serial port is configured, follow steps 1 and 2 for the dev machine as well, but on the dev machine you need to check Connect to existing pipe/socket and make sure you specify the same Path/Address.

Once that's done, congratulations: the lab is ready.

WARNING
! Do not start the dev machine first, otherwise you will see a serial port error: the two machines are connected through the serial port, and the dev machine attaches to the pipe/socket that the target machine creates.

DEBUGGING KERNEL — nf_tables

As discussed, the vulnerability lies in nf_tables, which is a kernel module, so debugging it requires some initial setup. Let's do that first:

Verify that the machines (dev and target) can communicate over the serial port by sending a message across it.

I sent a message multiple times from the target machine to the dev machine and confirmed that they are communicating with each other on the serial port.

If you downloaded Ubuntu 22.04 from the official Ubuntu website, it ships a recent kernel rather than an older one, so we have to downgrade the kernel. Let's do that as part of the debugging stage.

Download an affected version of the kernel. For this step I downloaded v5.12 from the official kernel GitHub repository.

Once you have checked out the affected version of the kernel, you need to build and install this image and add it to GRUB, but before we do that we need some packages to be available:

1. build-essential
2. flex
3. bison
4. libnftnl-dev
5. libmnl-dev
6. nftables - (Installed by default but just in case missing)
7. libncurses-dev
8. dkms

Once the packages are installed, let's enable KGDB in the kernel config so that we can debug the kernel. Move into the git repo where we downloaded the kernel source and run the make nconfig command.

1. This command opens the config in a menu-based view, where we can verify that KGDB is enabled.

2. Select Generic Kernel Debugging Instruments.

3. Verify that the KGDB and Magic SysRq options are selected.

4. Once these settings are verified, we need to check one more option, DEBUG_INFO, which should be y as well. To look for it, press F8 and search for the name.

With that, everything is verified: the packages are installed and the debugging options are in place. Since the flaw is in nf_tables, we need to make sure that this module is also enabled and installed, so let's verify that too.

To do that, go to Networking Support > Networking Options > Network Packet Filtering Framework > Core Netfilter Configuration.

To be on the safe side (building and installing the kernel and its modules takes a long time), and so that no source file is missing while debugging, I enabled all netfilter modules for nf_tables so that we don't have to repeat this step.

Press F6 to save the changes, then run make -j8 to build the Linux kernel with multiple jobs in parallel. Go grab a coffee, as this is going to take long, believe me, very long.

After a successful make -j8 build, you should have the vmlinux file along with the built-in modules.

After make -j8 succeeds, run make modules_install and wait for it to complete.

Once that's completed, run make install, which installs the v5.12 kernel into /boot. After that, run update-grub and then reboot to restart the machine.

During the restart, the boot menu offers a choice of kernel versions; select v5.12 and boot it. Verify the kernel version with uname -r.

Now we have downgraded to the affected version of the kernel.

Next, I wanted to enable the GDB scripts for the affected target kernel. The GDB scripts are a collection of helper scripts that simplify kernel debugging.

To do that we have to perform 2 steps:

1. On the target machine, the kernel has to be built with CONFIG_GDB_SCRIPTS, which was already enabled in our target kernel config.

2. On the dev machine, we have to create a ~/.gdbinit file and add the line add-auto-load-safe-path <location-bin-file>

To start debugging the target machine, we also have to copy the compiled kernel source tree (built with debug info) to the dev machine. To make copying easy, I installed OpenSSH on the target machine and used the scp command on the dev machine to copy the compiled linux folder from the target to the dev machine.

On the target machine we made a tar.gz file, and on the dev machine we used scp to copy linux.tar.gz over.

Make the tar.gz file with tar: tar -czvf linux.tar.gz linux

On the dev machine I copied the archive to /home/target/Desktop/:

scp target@10.0.2.15:/home/target/Desktop/linux.tar.gz /home/target/Desktop/

Then I extracted the archive on the dev machine with the tar command: tar -xzvf linux.tar.gz

Open the copied vmlinux with gdb

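Next, to debug the kernel we have to tell kgdboc which serial port and baud rate to use (for example, by writing a value such as ttyS0,115200 to /sys/module/kgdboc/parameters/kgdboc on the target), so that we can debug the kernel from the dev machine.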

Run the sysrq magic sequence in target server

echo g > /proc/sysrq-trigger

On the dev machine, run target remote /dev/ttyS0 inside gdb.

We can see that the kgdb breakpoint has triggered. Let's put breakpoints on our suspected functions.

Since we have enabled the GDB scripts, let's load the symbols of our beloved affected nf_tables.ko module. Running apropos lx lists the available helper commands; run lx-symbols to load the symbols for nf_tables.ko and the other loaded modules into GDB.

Once the module symbols are loaded, put a breakpoint on the nf_tables_newset function in nf_tables_api.c and start your static analysis.

Background

Before we understand why the problem occurred, we need to learn some basic information that helps us grasp the weakness.

What are Sets in nf_tables?

In nf_tables, there's a concept called "Sets". Sets are basically collections of keys and, optionally, values. Sets have a wide range of uses, but let's simplify things. Imagine a set as a special kind of storage where you can associate things together. For example, think of it as a fancy list of key-value items.

Example of Sets:

Let’s say you have a list of numbers representing different ports, like 22, 80, and 443. Now, imagine you want to block all incoming messages on those ports. To do this, you can put these port numbers into a set. Then, you can use a special method called “nft_lookup” to quickly check if an incoming message’s port number is in the set. If it is, you can block that message.

   +---------------------+
   |       Set           |
   |                     |
   |   +-----+           |
   |   |  22 |           |
   |   +-----+           |
   |   |  80 |           |
   |   +-----+           |
   |   | 443 |           |
   |   +-----+           |
   +---------------------+
         /|\  
          |
          | nft_lookup
          |
   +---------------------+
   | Incoming Message    |
   |                     |
   |  Port: 80           |
   +---------------------+

In this diagram:

The “Set” is like a special container that holds port numbers 22, 80, and 443.
When an incoming message arrives (with Port: 80 in this case), we use the “nft_lookup” method to quickly check if the port number is in the “Set.”
If the port number is found in the “Set,” we can take an action, like blocking the message.
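
The diagram above can be sketched in a few lines of userland C. This is only a model: model_port_set and its functions are hypothetical names, and a fixed-size array stands in for whatever backend the kernel picks for a real set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PORT_SET_MAX 8

/* Hypothetical, simplified set of port numbers. */
struct model_port_set {
    uint16_t keys[MODEL_PORT_SET_MAX];
    size_t count;
};

/* Is the port one of the keys stored in the set? */
bool model_port_set_contains(const struct model_port_set *s, uint16_t port)
{
    assert(s != NULL);
    for (size_t i = 0; i < s->count; i++)
        if (s->keys[i] == port)
            return true;
    return false;
}

/* Add a port; returns false only when the set is full. */
bool model_port_set_add(struct model_port_set *s, uint16_t port)
{
    assert(s != NULL);
    if (model_port_set_contains(s, port))
        return true;                 /* already present, nothing to do */
    if (s->count == MODEL_PORT_SET_MAX)
        return false;
    s->keys[s->count++] = port;
    return true;
}

/* The "lookup": block the message when its port is in the set. */
bool model_should_block(const struct model_port_set *blocked, uint16_t dport)
{
    return model_port_set_contains(blocked, dport);
}
```

After adding 22, 80 and 443 to the set, model_should_block() returns true for port 80 and false for a port such as 8080 that was never added.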

struct nft_set {
	struct list_head		list;
	struct list_head		bindings;
	struct nft_table		*table;
	possible_net_t			net;
	char				*name;
	u64				handle;
	u32				ktype;
	u32				dtype;
	u32				objtype;
	u32				size;
	u8				field_len[NFT_REG32_COUNT];
	u8				field_count;
	u32				use;
	atomic_t			nelems;
	u32				ndeact;
	u64				timeout;
	u32				gc_int;
	u16				policy;
	u16				udlen;
	unsigned char			*udata;
	/* runtime data below here */
	const struct nft_set_ops	*ops ____cacheline_aligned;
	u16				flags:14,
					genmask:2;
	u8				klen;
	u8				dlen;
	u8				num_exprs;
	struct nft_expr			*exprs[NFT_SET_EXPR_MAX];
	struct list_head		catchall_list;
	unsigned char			data[]
		__attribute__((aligned(__alignof__(u64))));
};

Sets and Expressions: In a computer program, you can have something called a “set” that stores a collection of items. You can also have “expressions” that help you perform different actions.

When creating a set, you have the option to include some additional data (referred to as “user data”) with the set. Depending on whether you include this user data or not, the way the set is stored in memory changes.

A quick look into nft_set_ops

/**
 * struct nft_set_ops - nf_tables set operations
 *
 * @lookup: look up an element within the set
 * @update: update an element if exists, add it if doesn't exist
 * @delete: delete an element
 * @insert: insert new element into set
 * @activate: activate new element in the next generation
 * @deactivate: lookup for element and deactivate it in the next generation
 * @flush: deactivate element in the next generation
 * @remove: remove element from set
 * @walk: iterate over all set elements
 * @get: get set elements
 * @privsize: function to return size of set private data
 * @init: initialize private data of new set instance
 * @destroy: destroy private data of set instance
 * @elemsize: element private size
 *
 * Operations lookup, update and delete have simpler interfaces, are faster
 * and currently only used in the packet path. All the rest are slower,
 * control plane functions.
 */
struct nft_set_ops {
 bool    (*lookup)(const struct net *net,
        const struct nft_set *set,
        const u32 *key,
        const struct nft_set_ext **ext);
 bool    (*update)(struct nft_set *set,
        const u32 *key,
        void *(*new)(struct nft_set *,
              const struct nft_expr *,
              struct nft_regs *),
        const struct nft_expr *expr,
        struct nft_regs *regs,
        const struct nft_set_ext **ext);
 bool    (*delete)(const struct nft_set *set,
        const u32 *key);

 int    (*insert)(const struct net *net,
        const struct nft_set *set,
        const struct nft_set_elem *elem,
        struct nft_set_ext **ext);
 void    (*activate)(const struct net *net,
          const struct nft_set *set,
          const struct nft_set_elem *elem);
 void *    (*deactivate)(const struct net *net,
            const struct nft_set *set,
            const struct nft_set_elem *elem);
 bool    (*flush)(const struct net *net,
       const struct nft_set *set,
       void *priv);
 void    (*remove)(const struct net *net,
        const struct nft_set *set,
        const struct nft_set_elem *elem);
 void    (*walk)(const struct nft_ctx *ctx,
      struct nft_set *set,
      struct nft_set_iter *iter);
 void *    (*get)(const struct net *net,
            const struct nft_set *set,
            const struct nft_set_elem *elem,
            unsigned int flags);

 u64    (*privsize)(const struct nlattr * const nla[],
          const struct nft_set_desc *desc);
 bool    (*estimate)(const struct nft_set_desc *desc,
          u32 features,
          struct nft_set_estimate *est);
 int    (*init)(const struct nft_set *set,
      const struct nft_set_desc *desc,
      const struct nlattr * const nla[]);
 void    (*destroy)(const struct nft_set *set);
 void    (*gc_init)(const struct nft_set *set);

 unsigned int   elemsize;
};

What is Lookup Expression?

Imagine you have a set with some data, and you want to check if a specific item is in that set. This is where the “lookup expression” comes in. It’s like asking a question: “Is this thing in the set?”

Example:

Let’s use an example that everyone can relate to: a phone contact list.

Imagine you have a list of contacts (a set) in your phone. Each contact has a name (the key) and a phone number (the value).

Now, let’s say you want to check if you have a contact named “Alice” in your list. You’re performing a lookup operation to see if “Alice” is in your contact list.

Diagram:

+----------------------+
|     Contacts Set     |
+----------------------+        +--------------+
|   Alice   - 123456   |        |    Lookup    |
|   Bob     - 789012   | <----- |  Expression  |
|   Charlie - 345678   |        | key: "Alice" |
+----------------------+        +--------------+

In this diagram, the “Contacts Set” is like your set of data, and each row represents a contact with a name (key) and phone number (value). The “Lookup Expression” checks if a specific name (e.g., “Alice”) is present in the set. If it is, you might get the phone number associated with that name.

Now let’s take one step forward and relate the same with nftables

Understanding nf_tables Lookup Expressions

Imagine you have a network firewall, and you want to block certain types of incoming network packets based on their port numbers. To do this, you set up a “rule set” with a list of ports to block. But how does the firewall actually check incoming packets against these rules? This is where nf_tables lookup expressions come into play.

Let’s break down the key concepts and explain them using simple diagrams:

1. Setting Up a Rule Set:

First, you create a rule set with a list of ports you want to block. This is like creating a list of “blocked areas.”

   +-----------------+
   |    Rule Set     |
   +-----------------+
   | Port 80         |
   | Port 443        |
   +-----------------+

2. Performing Packet Checks:

Now, when a network packet arrives, the firewall needs to check if it matches any of the blocked ports. This is where the lookup expression comes in.

   +-----------------+
   |    Rule Set     |
   +-----------------+
   | Port 80         |
   | Port 443        |
   +-----------------+
        ↑
        |
        |
   +-----------------+
   | Lookup Express  |
   | (nft_lookup)    |
   +-----------------+
   |   Check if      |
   |   packet matches|
   |   blocked ports |
   +-----------------+

What the code looks like:

struct nft_lookup {
	struct nft_set			*set;
	u8				sreg;
	u8				dreg;
	bool				invert;
	struct nft_set_binding		binding;
};

struct nft_set_binding {
	struct list_head		list;
	const struct nft_chain		*chain;
	u32				flags;
};

3. Understanding nft_lookup Expression:

The nft_lookup expression helps perform this check. It has several parts:

- set: This points to the rule set (our list of blocked ports).

- sreg: This tells the firewall where to find the incoming packet's port number.

- dreg: This specifies where to store a value if a match is found (the decision to block).

- invert: This determines whether to invert the match result.

- binding: This connects the lookup expression to the rule set.

   +-----------------+
   |    Rule Set     |
   +-----------------+
   | Port 80         |
   | Port 443        |
   +-----------------+
        ↑
        |
        |
   +-----------------+
   | Lookup Express  |
   | (nft_lookup)    |
   +-----------------+
   |  set: Rule Set  |
   |  sreg: Port     |
   |  dreg: Decision |
   |  invert: No     |
   |  binding: Link  |
   +-----------------+
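
To show how these fields interact, here is a hedged userland sketch in C of a lookup evaluation. The model_* types are hypothetical, the is_map field is an assumption introduced for this example (it decides whether a value is copied to dreg), and the set is a plain array rather than a real nf_tables backend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NUM_REGS 4

/* Hypothetical, simplified model; not the kernel's structures. */
struct model_regs { uint32_t data[MODEL_NUM_REGS]; };
struct model_elem { uint32_t key; uint32_t value; };
struct model_set  { const struct model_elem *elems; size_t count; };

struct model_lookup {
    const struct model_set *set; /* the set to search */
    uint8_t sreg;                /* register holding the key */
    uint8_t dreg;                /* register receiving the value */
    bool invert;                 /* flip the match result */
    bool is_map;                 /* assumed: copy the value only for key/value sets */
};

/* Returns true when the rule should continue after this expression. */
bool model_lookup_eval(const struct model_lookup *l, struct model_regs *regs)
{
    const struct model_elem *found = NULL;
    uint32_t key;

    assert(l != NULL && regs != NULL);
    assert(l->sreg < MODEL_NUM_REGS && l->dreg < MODEL_NUM_REGS);

    key = regs->data[l->sreg];
    for (size_t i = 0; i < l->set->count; i++) {
        if (l->set->elems[i].key == key) {
            found = &l->set->elems[i];
            break;
        }
    }

    /* invert turns "found" into "not found" and vice versa. */
    bool matched = (found != NULL) != l->invert;
    if (matched && found != NULL && l->is_map)
        regs->data[l->dreg] = found->value;
    return matched;
}
```

With the key 443 in the source register and a set containing 80 and 443, the expression matches and the stored value lands in dreg; with invert set, a key that is absent from the set counts as the match instead.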

4. Connecting Expressions:

If you have multiple lookup expressions (like lookup1 and lookup2) checking against the same rule set (`set1`), each expression's binding is added to that set's bindings list, so they end up linked together through their bindings. This creates a chain of checks.

   +-----------------+
   |    Rule Set     |
   +-----------------+
        ↑
        |
        |
   +-----------------+
   | Lookup Express  |
   | (lookup1)       |
   +-----------------+
        |
        ↓
   +-----------------+
   | Lookup Express  |
   | (lookup2)       |
   +-----------------+

All of these bindings form a linked list, so let's represent them in linked-list format.
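
As a rough sketch of that linked list, the userland C model below binds two lookup expressions to one set and then unbinds one of them. It deliberately uses a simple singly linked list where the kernel uses its own list_head type, and every model_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one binding per expression that uses the set. */
struct model_binding {
    struct model_binding *next;
    const char *owner;              /* e.g. "lookup1" */
};

struct model_set {
    const char *name;
    struct model_binding *bindings; /* head of the binding list */
    unsigned int use;               /* how many expressions reference the set */
};

/* Link a binding into the set's list. */
void model_set_bind(struct model_set *s, struct model_binding *b)
{
    assert(s != NULL && b != NULL);
    b->next = s->bindings;
    s->bindings = b;
    s->use++;
}

/* Unlink a binding; returns false when it was not on the list. */
bool model_set_unbind(struct model_set *s, struct model_binding *b)
{
    assert(s != NULL && b != NULL);
    for (struct model_binding **pp = &s->bindings; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == b) {
            *pp = b->next;
            b->next = NULL;
            s->use--;
            return true;
        }
    }
    return false;
}

/* Walk the list and count the bindings. */
size_t model_set_binding_count(const struct model_set *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (const struct model_binding *b = s->bindings; b != NULL; b = b->next)
        n++;
    return n;
}
```

The important property is that the set keeps pointers into memory owned by the expressions, so a binding has to be unlinked before the expression that contains it goes away.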

The slab cache in which the expression is allocated varies depending on the expression type.

By using nftables lookup expressions, the firewall efficiently determines if incoming packets match any blocked ports. The expressions are like organized "filters" that help the firewall decide what to do with each packet. Just like checking a list of places you're not allowed to enter and deciding whether you should be blocked or allowed through.

nft_dynset and nft_connlimit:

nft_dynset Expression:

Now, let’s say you want something even more versatile. You have shelves (sets) again, but this time you have fancier items on them, like notes that can be read or written. The nft_dynset expression is like a magic note that lets you both read from and write to a specific shelf. It's like having a magical notebook for each shelf where you can jot down new information.

With nft_dynset, you're not just reading values like in nft_lookup, but you can also update or add new values.

Just like before, the nft_dynset expression is "connected" to the shelf it's working with.

          +-------------------+
          | nft_dynset        |
          +-------------------+
                |
                v
          +-------------------+
          | Set (Shelf)       |
          +-------------------+
                |
                v
          +-------------------+
          | Note: Read/Write  |
          +-------------------+
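
The read/write behaviour can be sketched as follows. This is a userland C model with hypothetical names (model_dyn_set, model_dynset_eval), in which "update" means: return the element for a key if it exists, otherwise add it. The per-element hit counter is an assumption added to make the write visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_DYN_MAX 4

/* Hypothetical model of a set that is modified from the packet path. */
struct model_dyn_elem { uint32_t key; uint32_t hits; };
struct model_dyn_set {
    struct model_dyn_elem elems[MODEL_DYN_MAX];
    size_t count;
};

/* Read side: find an element by key, or return NULL. */
struct model_dyn_elem *model_dyn_find(struct model_dyn_set *s, uint32_t key)
{
    assert(s != NULL);
    for (size_t i = 0; i < s->count; i++)
        if (s->elems[i].key == key)
            return &s->elems[i];
    return NULL;
}

/* Write side: return the element for key, adding it if it is missing. */
struct model_dyn_elem *model_dyn_update(struct model_dyn_set *s, uint32_t key)
{
    struct model_dyn_elem *e = model_dyn_find(s, key);

    if (e != NULL)
        return e;
    if (s->count == MODEL_DYN_MAX)
        return NULL;                 /* set is full */
    e = &s->elems[s->count++];
    e->key = key;
    e->hits = 0;
    return e;
}

/* Per-packet evaluation: record the key and bump its hit counter. */
bool model_dynset_eval(struct model_dyn_set *s, uint32_t key)
{
    struct model_dyn_elem *e = model_dyn_update(s, key);

    if (e == NULL)
        return false;
    e->hits++;
    return true;
}
```

Unlike the lookup model, evaluating this expression changes the set: the first packet from an address creates its element, and later packets update it.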

nft_connlimit Expression:

Now, let’s talk about a clever guard that watches over a gate. The nft_connlimit expression acts like this guard, allowing only a certain number of people (connections) through the gate from a single place (IP address). It's like a bouncer at a party ensuring that no one sneaks in too many times.

nft_connlimit is a special kind of expression that focuses on controlling the number of connections from one place (IP address). What makes it unusual is that its expression type carries the NFT_EXPR_STATEFUL and NFT_EXPR_GC flags. The NFT_EXPR_STATEFUL flag is what marks an expression as allowed to be attached to a set; expressions without it are not supposed to be.

          +-------------------+
          | nft_connlimit      |
          +-------------------+
                |
                v
          +-------------------+
          | Set (List of IPs) |
          +-------------------+
                |
                v
          +-------------------+
          | Guard: Connections |
          +-------------------+

STATIC ANALYSIS:

The vulnerability arises from not correctly cleaning up resources when a “lookup” or “dynset” expression is supplied while creating a set using the NFT_MSG_NEWSET message. The function nf_tables_newset() is responsible for handling the NFT_MSG_NEWSET netlink message.

The issue occurs when we try to add an nft_lookup expression to a set. To do this, we send NFT_MSG_NEWSET, whose callback is the nf_tables_newset function. Inside it, there is a call to nft_set_elem_expr_alloc, which in turn calls the nft_expr_init function.

When you’re creating a set in the context of network filtering (like a firewall), you need to provide certain information to define the set. This information includes an associated table, a set name, a length for the set’s keys, and an ID. Assuming you’ve met all the basic requirements, a specific function will be used to create a new nft_set structure. This structure is important because it helps keep track of the newly created set and its properties.

Once the initialization of the set is complete, the code checks whether an expression was supplied for the set.

If the NFTA_SET_EXPR attribute is present, it triggers a call to the function nft_set_elem_expr_alloc().

This function handles expressions of different types. If allocating or initializing the expression fails, the code jumps to a label that takes care of destroying the entire set.

Interestingly, even if only one expression fails to initialize, all the related expressions get destroyed using nft_expr_destroy(). However, note that when the err_set_expr_alloc path is taken, the expression that failed to initialize has not been added to the set->exprs array, so it is not destroyed at this point. Instead, it was already destroyed earlier, inside nft_set_elem_expr_alloc().

Let’s break down the nft_set_elem_expr_alloc function a little.

expr = nft_expr_init(ctx, attr);

The above statement in the code initializes an expression.

Then, it checks if the expression type is acceptable to be associated with a set, specifically if it has the NFT_EXPR_STATEFUL attribute

if (!(expr->ops->type->flags & NFT_EXPR_STATEFUL))
		goto err_set_elem_expr;

This order of operations looks backward, and it is. It allows an expression type that is not suitable for use in a set to be initialized anyway. As a result, if anything set up by expr = nft_expr_init(ctx, attr); is not properly cleaned up afterwards, it can outlive the expression itself.

if (nla[NFTA_SET_EXPR]) {
		expr = nft_set_elem_expr_alloc(&ctx, set, nla[NFTA_SET_EXPR]);
		if (IS_ERR(expr)) {
			err = PTR_ERR(expr);
			goto err_set_expr_alloc;
		}

It’s important to note that only a few expression types carry NFT_EXPR_STATEFUL, but this arrangement allows any expression type to be initialized first, whether or not it has the flag.

In the previously discussed code, we observed that the destruction process invokes the expression’s destruction function through nf_tables_expr_destroy, and subsequently the expression is freed.

At first glance, since nft_set_elem_expr_alloc invokes nft_expr_destroy, it may seem that everything is properly cleaned up, and the ability to initialize a non-stateful expression may not look like a security vulnerability by itself. However, this seemingly harmless behavior makes it much easier for vulnerabilities to emerge.

DIG DEEPER:

nft_lookup expression is added to a set using NFT_MSG_NEWSET.
NFT_MSG_NEWSET calls nf_tables_newset.
Inside nf_tables_newset, nft_set_elem_expr_alloc is called.
nft_set_elem_expr_alloc calls nft_expr_init.

+-------------------------+
| Add nft_lookup to Set   |
+-------------------------+
           |
           v
+-------------------------+
| NFT_MSG_NEWSET Callback |
+-------------------------+
           |
           v
+-------------------------+
|  nf_tables_newset        |
+-------------------------+
           |
           v
+-------------------------+
| nft_set_elem_expr_alloc  |
+-------------------------+
           |
           v
+-------------------------+
|   nft_expr_init          |
+-------------------------+

1. Add nft_lookup to Set:

The process begins when we want to add an nft_lookup expression to a set.

2. NFT_MSG_NEWSET Callback:

To add the nft_lookup expression to the set, we use the NFT_MSG_NEWSET callback. This callback is triggered and executed.

3. nf_tables_newset:

Inside the NFT_MSG_NEWSET callback, the nf_tables_newset function is called.
This function is responsible for creating a new set.

4. nft_set_elem_expr_alloc:

Within nf_tables_newset, a call is made to nft_set_elem_expr_alloc.
This function is involved in allocating memory for the expression and binding it to the set.

5. nft_expr_init:

Finally, inside nft_set_elem_expr_alloc, there’s a call to nft_expr_init.
nft_expr_init initializes the expression and prepares it for binding to the set.

static struct nft_expr *nft_expr_init(const struct nft_ctx *ctx,
                                      const struct nlattr *nla)
{
    // Initialize variables
    struct nft_expr_info expr_info;
    struct nft_expr *expr;
    struct module *owner;
    int err;

    // Parse expression attributes
    err = nf_tables_expr_parse(ctx, nla, &expr_info);
    if (err < 0)
        goto err1;
    // Allocate memory for the expression
    err = -ENOMEM;
    expr = kzalloc(expr_info.ops->size, GFP_KERNEL);
    if (expr == NULL)
        goto err2;
    // Initialize the expression using expression-specific ops
    err = nf_tables_newexpr(ctx, &expr_info, expr); // [1]
    if (err < 0)
        goto err3; // Free expression memory on failure
    return expr;
err3:
    // Cleanup on error
    kfree(expr);
err2:
    // Release resources
    owner = expr_info.ops->type->owner;
    if (expr_info.ops->type->release_ops)
        expr_info.ops->type->release_ops(expr_info.ops);
    module_put(owner);
err1:
    // Return error
    return ERR_PTR(err);
}

Let’s understand the above code with a metaphor. Imagine you’re building a toolbox for different types of tasks. You need a way to prepare tools based on user instructions. The nft_expr_init function does something similar by preparing tools (expressions) based on user input.

1. Initialization:

nft_expr_info stores information about the expression type and attributes.
nft_expr is where we'll store our prepared expression.
owner points to the kernel module that implements the expression type, so its reference can be dropped again if something fails.

2. Parsing Attributes:

We check the user’s instructions (expression attributes) to figure out what kind of tool they want.
If something goes wrong (negative value), we jump to err1.

3. Memory Allocation:

We need space to create the tool, so we allocate memory (like carving out space in a toolbox).
If we can’t allocate memory (kzalloc returns NULL), we jump to err2 with the error set to -ENOMEM.

4. Initializing the Tool (Expression):

Now we prepare the actual tool using expression-specific instructions.
If preparation fails (negative value), we jump to err3 to clean up.

5. Finishing Touches:

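If there’s an issue in the last step, we free the expression's memory (err3) and then fall through to err2.
If we couldn’t allocate memory earlier, we only release the ops and drop the module reference (err2).
If there was an issue with the attributes, nothing has been acquired yet, so we simply return the error (err1).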

6. Success:

If everything went smoothly, we hand over the prepared tool (expression) to the user.

User Input               Parsing Attributes             Memory Allocation
    (Expression Attributes)      (expr_info)                      (kzalloc)
        +------+                 +----------------+             +------------+
        |      |                 |                |             |            |
        |      v                 v                v             v            |
        |  +-------------------+ +--------------+ +-----------+ |            |
        |  | nft_expr_init     | | nf_tables_   | |           | | kzalloc    |
        |  |                   | | expr_parse   | |  -ENOMEM  | |            |
        |  |                   | |              | |           | |            |
        |  |                   | |              | |           | |            |
        |  |                   | |              | |           | |            |
        |  |                   | |              | |           | |            |
        |  +--------+----------+ +-------+------+ +-----------+ +-----+------+
        |           |                    |                  |         |
        +-----------|--------------------|------------------|---------+
                    |                    |                  |
        Error       v                    |                  |
       Handling     +--------------------|------------------+
                    |                    v
                    |                +----------------+
                    |                |   nf_tables_   |
                    |                | newexpr        |
                    |                |                |
                    |                +----------------+
                    |                        |
                    +------------------------+
                                             |
                                          Error
                                          Handling

The code initializes an expression by calling the function nf_tables_newexpr. If the initialization fails, it releases the memory allocated for the expression.

static int nf_tables_newexpr(const struct nft_ctx *ctx,
                 const struct nft_expr_info *expr_info,
                 struct nft_expr *expr)
{
    const struct nft_expr_ops *ops = expr_info->ops;
    int err;

    // Set the expression's ops to the provided ops
    expr->ops = ops;
    // If the expression's ops have an init function, call it
    if (ops->init) {
        err = ops->init(ctx, expr, (const struct nlattr **)expr_info->tb); // [2]
        if (err < 0)
            goto err1;
    }
    return 0;
err1:
    // If there's an error during initialization, undo and clean up
    expr->ops = NULL;
    return err;
}

The code first fetches the specific set of operations (ops) associated with the expression from expr_info. Then, it sets up the expression by assigning these operations.

If there’s an init function defined in the operations, it's called to perform the necessary initialization steps. If the initialization fails (returns an error), the code cleans up by resetting the expression's operations and returning the error.

Let’s use a simple analogy to explain this process:

Imagine you’re building a robot. This robot has different parts (operations) that need to be put together correctly to make it work. There’s a central control room (the nf_tables_newexpr function) where the robot is being assembled and initialized.

Fetching Operations (ops): You have a manual (operations) that tells you how to assemble and initialize the robot. This manual contains steps for each part. In the code, ops is this manual, and it specifies the steps needed to set up the expression.
Assigning Operations (expr->ops = ops): You follow the manual and start attaching the parts according to the steps. In the code, this step sets up the expression by assigning the operations (ops) to it.
Initialization (ops->init): Some parts require a specific setup before the robot can work. You follow the steps in the manual to perform this setup. In the code, if an init function is defined in the operations (ops), it's called to perform the initialization steps.
Successful Initialization: If everything goes smoothly during setup and initialization, your robot is ready to go.
Initialization Failure: If there’s a problem during setup or initialization, you stop and fix the issue. In the code, if the init function returns an error, it means something went wrong during setup.
Cleaning Up (expr->ops = NULL): If there's an error, you undo the changes you made so far to avoid leaving the robot in an inconsistent state. In the code, if initialization fails, the operations (ops) are reset to NULL.

Each type of expression that defines how packets are matched and processed has its own set of operations defined. Let’s focus on the operations (ops) associated with the “lookup” expression type as an example.

Below is a representation of the operations for the “lookup” expression:

static const struct nft_expr_ops nft_lookup_ops = {
	.type       = &nft_lookup_type,
	.size       = NFT_EXPR_SIZE(sizeof(struct nft_lookup)),
	.eval       = nft_lookup_eval,
	.init       = nft_lookup_init,
	.activate   = nft_lookup_activate,
	.deactivate = nft_lookup_deactivate,
	.destroy    = nft_lookup_destroy,
	.dump       = nft_lookup_dump,
	.validate   = nft_lookup_validate,
	.reduce     = nft_lookup_reduce,
};

type: Specifies the type of the expression, which is "lookup" in this case.
size: Determines the size of memory required for the expression, which is calculated based on the size of the struct nft_lookup data structure.
eval: Points to a function (nft_lookup_eval) that evaluates the expression's logic for packet matching.
init: Points to a function (nft_lookup_init) that initializes the expression's state.
activate: Points to a function (nft_lookup_activate) that activates the expression when used in a rule.
deactivate: Points to a function (nft_lookup_deactivate) that deactivates the expression.
destroy: Points to a function (nft_lookup_destroy) that cleans up and destroys the expression's resources.
dump: Points to a function (nft_lookup_dump) that serializes the expression's configuration into netlink attributes, so that userspace tools can list and display it.
validate: Points to a function (nft_lookup_validate) that validates the expression's configuration and settings.
reduce: Points to a function (nft_lookup_reduce) used by the kernel's register-tracking pass, which tries to drop redundant operations where possible.

nft_lookup_ops
+-----------------------------------------------------------------------+
|                                                                       |
| type      ---->  &nft_lookup_type                                     |
| size      ---->  NFT_EXPR_SIZE(sizeof(struct nft_lookup))             |
| eval      ---->  nft_lookup_eval                                      |
| init      ---->  nft_lookup_init                                      |
| activate  ---->  nft_lookup_activate                                  |
| deactivate---->  nft_lookup_deactivate                                |
| destroy   ---->  nft_lookup_destroy                                   |
| dump      ---->  nft_lookup_dump                                      |
| validate  ---->  nft_lookup_validate                                  |
| reduce    ---->  nft_lookup_reduce                                    |
|                                                                       |
+-----------------------------------------------------------------------+

Let’s break down the ops->init function of the lookup expression, which is nft_lookup_init.

static int nft_lookup_init(const struct nft_ctx *ctx,
			   const struct nft_expr *expr,
			   const struct nlattr * const tb[])
{
	struct nft_lookup *priv = nft_expr_priv(expr);
	u8 genmask = nft_genmask_next(ctx->net);
	struct nft_set *set;
	u32 flags;
	int err;

	// Check if required attributes are provided
	if (tb[NFTA_LOOKUP_SET] == NULL ||
	    tb[NFTA_LOOKUP_SREG] == NULL)
		return -EINVAL;

	// Look up the specified set based on attributes
	set = nft_set_lookup_global(ctx->net, ctx->table, tb[NFTA_LOOKUP_SET],
				    tb[NFTA_LOOKUP_SET_ID], genmask);
	if (IS_ERR(set))
		return PTR_ERR(set);

	// Load the source register (sreg) from the netlink attribute
	err = nft_parse_register_load(tb[NFTA_LOOKUP_SREG], &priv->sreg,
				      set->klen);
	if (err < 0)
		return err;

	// Parse and process the lookup flags if provided
	if (tb[NFTA_LOOKUP_FLAGS]) {
		flags = ntohl(nla_get_be32(tb[NFTA_LOOKUP_FLAGS]));

		// Validate and process the lookup flags
		if (flags & ~NFT_LOOKUP_F_INV)
			return -EINVAL;

		if (flags & NFT_LOOKUP_F_INV) {
			if (set->flags & NFT_SET_MAP)
				return -EINVAL;
			priv->invert = true;
		}
	}

	// Handle destination register (dreg) if provided
	if (tb[NFTA_LOOKUP_DREG] != NULL) {
		if (priv->invert)
			return -EINVAL;
		if (!(set->flags & NFT_SET_MAP))
			return -EINVAL;

		// Parse and store the destination register (dreg) for actions
		err = nft_parse_register_store(ctx, tb[NFTA_LOOKUP_DREG],
					       &priv->dreg, NULL, set->dtype,
					       set->dlen);
		if (err < 0)
			return err;
	} else if (set->flags & NFT_SET_MAP)
		return -EINVAL;

	// Set flags for the binding based on the set's flags
	priv->binding.flags = set->flags & NFT_SET_MAP;

	// Bind the set to the lookup expression
	err = nf_tables_bind_set(ctx, set, &priv->binding);
	if (err < 0)
		return err;

	// Store the set in the lookup expression's private data
	priv->set = set;
	return 0;
}

The code calls nf_tables_bind_set to connect (or bind) the expression we're working with to a specific set. This binding fails with -EBUSY if the set we're trying to bind to is anonymous and already has a binding. So, for the binding to succeed, the set used for the lookup must either be a named set or be an anonymous set that nothing is bound to yet.

The following check in nf_tables_bind_set enforces this:
if (!list_empty(&set->bindings) && nft_set_is_anonymous(set))
  return -EBUSY;

We discussed this earlier, but it is worth spelling out once more. When an expression is added to a set, nft_set_elem_expr_alloc calls nft_expr_init. For a lookup expression, this initializes the expression and binds it to a set. Let's examine the process step by step:

nft_set_elem_expr_alloc is called to allocate an expression associated with a set.
nft_expr_init is called to initialize the expression using the attributes provided.
If the NFT_EXPR_STATEFUL flag is not present in the expression's type, the expression is destroyed using the nft_expr_destroy function.
nft_expr_destroy eventually invokes the expression-specific destruction function (ops->destroy).
If the NFT_EXPR_STATEFUL flag is present, additional checks and operations are performed based on the set's properties.

struct nft_expr *nft_set_elem_expr_alloc(const struct nft_ctx *ctx,
                     const struct nft_set *set,
                     const struct nlattr *attr)
{
    struct nft_expr *expr;
    int err;

    expr = nft_expr_init(ctx, attr); // [1]
    if (IS_ERR(expr))
        return expr;

    err = -EOPNOTSUPP;
    if (!(expr->ops->type->flags & NFT_EXPR_STATEFUL)) // [2]
        goto err_set_elem_expr;

    if (expr->ops->type->flags & NFT_EXPR_GC) {
        if (set->flags & NFT_SET_TIMEOUT)
            goto err_set_elem_expr;
        if (!set->ops->gc_init)
            goto err_set_elem_expr;
        set->ops->gc_init(set);
    }
    return expr;

err_set_elem_expr:
    nft_expr_destroy(ctx, expr); // [3]
    return ERR_PTR(err);
}

void nft_expr_destroy(const struct nft_ctx *ctx, struct nft_expr *expr)
{
    nf_tables_expr_destroy(ctx, expr);
    kfree(expr);
}

static void nf_tables_expr_destroy(const struct nft_ctx *ctx,
                   struct nft_expr *expr)
{
    const struct nft_expr_type *type = expr->ops->type;

    if (expr->ops->destroy)
        expr->ops->destroy(ctx, expr); // [4]
    module_put(type->owner);
}

Let’s consider an example where you’re configuring a firewall using nftables. You want to add an expression to a set that specifies certain rules for matching network packets.

You have a set named “allowed_ips_set” that contains a list of IP addresses allowed to pass through the firewall.
You want to add a lookup expression that matches packets based on their source IP address.
The lookup expression is associated with “allowed_ips_set”.
The lookup expression type does not carry the NFT_EXPR_STATEFUL flag.
Because the flag is absent, the expression is rejected after it has already been initialized, and its expression-specific destruction function is called.

          +-------------------------+
          | nft_set_elem_expr_alloc  |
          +-------------------------+
                |
                v
          +-------------------+
          | nft_expr_init     | [1]
          +-------------------+
                |
                v
          +------------------------+
          |Check NFT_EXPR_STATEFUL | [2]
          +------------------------+
           /                  \
          v                    x
+----------------------------+ +-----------------------------+
| Invoke nft_expr_destroy   | | Continue processing based on |
| (Expression is destroyed) | | set and expression properties|
+----------------------------+ +-----------------------------+

nft_set_elem_expr_alloc is called to allocate memory for the expression associated with the set.
nft_expr_init initializes the expression.
If the NFT_EXPR_STATEFUL flag is not present, the expression is destroyed using nft_expr_destroy.
If NFT_EXPR_STATEFUL is present, additional checks and operations are performed based on set properties.

Let’s take a look at the nft_lookup_destroy function:

static void nft_lookup_destroy(const struct nft_ctx *ctx,
                   const struct nft_expr *expr)
{
    struct nft_lookup *priv = nft_expr_priv(expr);

    nf_tables_destroy_set(ctx, priv->set); // [1]
}

void nf_tables_destroy_set(const struct nft_ctx *ctx, struct nft_set *set)
{
    if (list_empty(&set->bindings) && nft_set_is_anonymous(set)) // [2]
        nft_set_destroy(ctx, set);
}

Consider a system that uses a lookup expression in its packet filtering rules. This lookup expression is created to match packets against a predefined set of values. When the lookup expression is destroyed, there’s a problem that arises due to the order of operations and the absence of certain checks.

In nft_lookup_destroy, a call is made to nf_tables_destroy_set to potentially destroy the set that the lookup expression is bound to.
Inside nf_tables_destroy_set, a check is performed to determine if it's safe to destroy the set. The set will only be destroyed if it's anonymous and if there are no bindings to it. In other words, the set won't be destroyed if it has a name or if it's still associated with other parts of the system.
If it’s safe to destroy the set, the function nft_set_destroy is called to actually destroy the set.

nft_lookup_destroy
          +----------------------+
          |                      |
          | Call to              |
          | nf_tables_destroy_set|
          |                      |
          +----------------------+
                      |
                      v
          +----------------------+
          |                      |
          | Check if safe to     |
          | destroy set          |
          |                      |
          +----------------------+
                      |
                      v
          +----------------------+
          |                      |
          | nft_set_destroy       |
          |                      |
          +----------------------+

Root Cause:

The issue arises from the sequence of operations within the nft_set_elem_expr_alloc function. Specifically, nft_expr_init is called before confirming whether the expression's type has the NFT_EXPR_STATEFUL flag. This misordering has a significant consequence: an expression lacking the flag, such as a lookup, is fully initialized, and nft_lookup_init binds it to an existing set through nf_tables_bind_set, before the flag check rejects it and sends it down the destruction path.

Now consider what that destruction path does. nft_lookup_destroy calls nf_tables_destroy_set, but the set is left alone because its bindings list is not empty (it now contains the new expression's binding). No step on this path removes the binding from that list. nft_expr_destroy then frees the expression with kfree, and the binding lives inside the expression's memory. Consequently, the set's bindings linked list continues to hold a reference to memory that has already been freed. This situation culminates in a Use-After-Free vulnerability.

+-----------------------+
|nft_set_elem_expr_alloc|
+-----------------------+
          |
          v
+-----------------+
| nft_expr_init   | [Problem: Initialization before checking flag]
+-----------------+
          |
          v
+--------------------------------+
| nft_lookup_init binds the      |
| expression to the set          |
+--------------------------------+
          |
          v
+--------------------------------+
| NFT_EXPR_STATEFUL check fails  |
+--------------------------------+
          |
          v
+--------------------------------+
| nft_expr_destroy: set is kept, |
| binding is not unlinked        |
+--------------------------------+
          |
          v
+--------------------------------+
| Expression memory is freed     |
+--------------------------------+
          |
          v
+--------------------------------+
| Set Bindings Linked List       |
|   (Contains pointer to         |
|    freed memory)               |
+--------------------------------+

# nft_set_elem_expr_alloc calls nft_expr_init to initialize the expression.
# nft_expr_init fully initializes the expression without checking the NFT_EXPR_STATEFUL flag.
# During initialization, nft_lookup_init binds the expression to the set.
# Back in nft_set_elem_expr_alloc, the NFT_EXPR_STATEFUL check fails and nft_expr_destroy is called.
# nft_lookup_destroy leaves the set alone because it still has bindings, and the binding is not removed from the list.
# The expression's memory is freed, but the binding remains linked.
# The set's bindings linked list retains a reference to freed memory.

Exploitation and Explanation:

Two public exploits / PoCs are available for this vulnerability:

@junomonster: https://github.com/theori-io/CVE-2022-32250-exploit
@Yordan: https://github.com/ysanatomic/CVE-2022-32250-LPE

I used the exploit created by @junomonster

The requirement to run the exploit:

You need the libmnl-dev and libnftnl-dev packages installed on your machine.

Affected Version

Linux, before commit 520778042ccca019f3ffa136dd0ca565c486cedd (26 May, 2022)
Ubuntu <= 22.04 before security patch

Test Environment

Platform
Ubuntu 22.04 amd64
Versions
Linux ubuntu 5.12.0 #2 SMP Aug 18 14:17:41 JST 2023 x86_64 x86_64 x86_64 GNU/Linux

Running

gcc exp.c -o exp -l mnl -l nftnl -w
./exp

Warning

This exploit corrupts Linux kernel slabs, which might cause a kernel panic when attempting to acquire root privileges.
Make sure your OS ships a recent version of libnftnl. On some systems, such as Ubuntu 18.04 LTS, the last supported libnftnl version is 1.0.0.7, which does not support bitwise-op in the library, so the exploit will not work there.

Result

Use git to download the exploit from: https://github.com/theori-io/CVE-2022-32250-exploit
Compile the exp.c file and run the resulting ./exp binary.

Exploitation Strategy

As mentioned above, multiple exploits are available for this vulnerability. Here we discuss the strategy used by @junomonster in his exploit: https://github.com/theori-io/CVE-2022-32250-exploit.

The exploit has three main steps:

1. Leak the heap address using struct user_key_payload.

+------------------+
|                  |
|   Kernel Heap    |
|                  |
+------------------+
        |
        V
+---------------------------------+
|      struct user_key_payload    |
| (Leaked heap address pointer)   |
+---------------------------------+

2. Leak a kernel text address using mqueue to defeat KASLR.

+----------------------------------+
|             Memory Layout        |
| (Randomized Kernel Text Address) |
+----------------------------------+
        |
        V
+-------------------------------+
|    mqueue Exploitation        |
| (Expose KASLR Offset)         |
+-------------------------------+

3. Overwrite modprobe_path

+-------------------------------+
|          Kernel Configuration |
|            (modprobe_path)    |
+-------------------------------+
        |
        V
+--------------------------------+
|     Exploit: Overwrite         |
|       modprobe_path            |
+--------------------------------+

Step 1: Leak Heap Address

In the context of kernel exploitation, a “heap address” refers to a memory address in the kernel’s dynamic memory allocation region, known as the heap. “Leaking” such an address means disclosing a kernel pointer value to user space (an information leak). It should not be confused with a memory leak, where allocated memory is never released. A leaked heap address tells the attacker where attacker-controlled objects live in kernel memory.

struct msg_msg: This is a data structure in the Linux kernel representing a message buffer for inter-process communication. It's used by various IPC mechanisms to exchange data between processes. It is popular in kernel exploits because corrupting its size field can turn an ordinary message receive into an out-of-bounds read, which makes it a common building block for arbitrary read and write primitives.

GFP_KERNEL and GFP_KERNEL_ACCOUNT: These are memory allocation flags in the Linux kernel that determine how memory is allocated. On kernels that provide separate accounted caches, they also determine which family of slab caches serves the allocation.

kmalloc-cg-xx and kmalloc-xx: These are slab caches used for dynamic memory allocation in the Linux kernel. kmalloc allocates memory from the kernel heap, and the "xx" part represents the size class of the allocation. "kmalloc-cg-xx" and "kmalloc-xx" are two different families of caches, so objects from one family never share a slab with objects from the other.

Example

Let’s break down the statement using an example and a simplified diagram:

Suppose we have two data structures, struct msg_msg and struct nft_lookup, and we are exploiting the vulnerability on the 5.12.0 Linux kernel.

Memory Allocation: On this specific version:
struct msg_msg is allocated using the GFP_KERNEL_ACCOUNT flag.
struct nft_lookup is allocated using the GFP_KERNEL flag.
Memory Allocation Mechanisms: When memory is allocated using kmalloc, the slab cache that is used depends on the flags:
If the GFP_KERNEL_ACCOUNT flag is set, the kernel uses the kmalloc-cg-xx slab caches.
If the GFP_KERNEL flag is set (without GFP_KERNEL_ACCOUNT), the kernel uses the kmalloc-xx slab caches.

Kernel Heap
  +-------------------------+
  |       Free Memory       |
  +-------------------------+
  |    kmalloc-cg-xx Used   | <-- struct msg_msg
  +-------------------------+
  |      kmalloc-xx Used    | <-- struct nft_lookup
  +-------------------------+
  |       Free Memory       |
  +-------------------------+

Because the two structures come from different caches, a struct msg_msg cannot be used to reclaim the memory freed through the struct nft_lookup UAF, so its size field cannot be manipulated this way. For that reason, the exploit uses the user keyring to leak heap addresses instead.

User Keyring and Key Payload: A user keyring is a mechanism in the Linux kernel to manage user-level security keys. A key can have an associated payload, which is the actual data associated with the key. In this case, the payload is defined by struct user_key_payload, which is allocated in user_preparse().

As explained in the static analysis section, the vulnerability is a Use-After-Free (UAF) issue. A UAF occurs when a pointer to a previously allocated structure is used after that structure has been deallocated, which is what makes exploitation possible.

The vulnerability is exploited through a process of allocation, manipulation, and double UAF.

user_preparse() allocates memory for the user_key_payload structure along with the user-defined data. The size of the allocated memory is determined by datalen.
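The size calculation can be made concrete with a small user-space sketch. It copies the struct user_key_payload layout quoted later in this article and assumes a 64-bit build; the helper names model_payload_alloc_size and model_max_datalen are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* User-space copies of the kernel layouts quoted in this article. */
struct callback_head {
    struct callback_head *next;
    void (*func)(struct callback_head *head);
};

struct user_key_payload {
    struct callback_head rcu;      /* 16 bytes on a 64-bit build */
    unsigned short       datalen;  /* length of data[] in bytes  */
    char data[] __attribute__((aligned(8)));
};

/* Bytes requested for a key whose payload is `datalen` bytes long. */
size_t model_payload_alloc_size(size_t datalen)
{
    return sizeof(struct user_key_payload) + datalen;
}

/* Largest datalen that still fits in a slab object of `object_size` bytes. */
size_t model_max_datalen(size_t object_size)
{
    if (object_size < sizeof(struct user_key_payload))
        return 0;
    return object_size - sizeof(struct user_key_payload);
}
```

Under these assumptions the header occupies 0x18 bytes, so a payload of up to 40 bytes keeps the whole allocation inside a 64-byte object. Choosing datalen is therefore how user space steers the key payload into a particular size class.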

An exploit creates and manipulates structures in a way that a subsequent free operation will free one structure while retaining a pointer to another. By triggering the UAF twice, the exploit can control the contents of chunks that have already been freed, which yields an arbitrary read/write capability and a heap address leak.

      +------------------------+
      | struct user_key_payload|                 Kernel Memory
      +------------------------+
      |        rcu_head        |                 +------------------+
      |        datalen         |                 | struct nft_lookup|
      |          ...           |                 +------------------+
      |   User-defined data    |                 |        next      |
      |                        |                 +------------------+
      |                        |
      +------------------------+
               |      ^
               |      |
               v      |
          +------------------+
          |   nft_lookup     |
          +------------------+
          |      binding     |
          |       ...        |
          +------------------+
          |       next       |
          +------------------+

The struct user_key_payload overlaps with the nft_lookup structure due to the UAF. The UAF leads to the exploit controlling the memory layout and performing arbitrary read and write operations, including leaking heap addresses.

Step 2: Leak KASLR

To leak KASLR, the exploit uses the msg_msg data structure, which is commonly abused in Linux kernel exploits. However, using it against this particular vulnerability is difficult because of limitations on writing the field at offset 0x18 (hexadecimal) of the object. The mqueue subsystem has functions that are better suited for exploitation.

The msg_msg data structure is part of the Linux kernel's message queue subsystem. The exploit attempts to modify the field at offset 0x18 of an object (data structure), which is difficult due to limitations imposed by the kernel's security mechanisms. The mqueue subsystem provides the do_mq_timedsend function, which is used to send messages to a message queue.
This function allocates a new posix_msg_tree_node (represented by new_leaf) using kmalloc.

Some more data structures to keep in mind:

struct posix_msg_tree_node: This structure is allocated from kmalloc-64 and contains the members rb_node, msg_list, and priority.
struct rb_node: This structure represents a red-black tree node and contains the members __rb_parent_color, rb_right, and rb_left.

struct posix_msg_tree_node {
    struct rb_node      rb_node;
    struct list_head    msg_list;
    int                 priority;
};
struct rb_node {
    unsigned long  __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));
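To see why this structure is a convenient target, it helps to compute its layout. The following user-space sketch copies the two definitions above, adds a minimal struct list_head, and assumes a 64-bit build; the helper names model_msg_list_offset and model_tree_node_size are hypothetical.

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's doubly linked list head. */
struct list_head {
    struct list_head *next, *prev;
};

struct rb_node {
    unsigned long  __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));

struct posix_msg_tree_node {
    struct rb_node   rb_node;   /* offsets 0x00..0x17 on a 64-bit build */
    struct list_head msg_list;  /* next at 0x18, prev at 0x20           */
    int              priority;  /* 0x28, followed by padding            */
};

/* Offset of msg_list (and therefore of msg_list.next) in the node. */
size_t model_msg_list_offset(void)
{
    return offsetof(struct posix_msg_tree_node, msg_list);
}

/* Total size of the node, which decides the kmalloc size class. */
size_t model_tree_node_size(void)
{
    return sizeof(struct posix_msg_tree_node);
}
```

On a 64-bit build this gives an offset of 0x18 for msg_list and a total size of 0x30 bytes, which fits the kmalloc-64 class mentioned above. In this layout, offset 0x18 holds msg_list.next, the pointer to the first queued message.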

Steps to Exploit:
Allocate UAF 1: Allocate a chunk of memory that contains the vulnerable structure struct nft_expr with the UAF vulnerability.
Overwrite struct posix_msg_tree_node: Exploit the vulnerability in the do_mq_timedsend function to overwrite the struct posix_msg_tree_node with UAF 1.
Create UAF 2: Use the overwritten structure to create UAF 2 and connect it to UAF 1. This is achieved by manipulating pointers.
Overwrite struct user_key_payload: Utilize the keyctl function to overwrite struct user_key_payload with UAF 2.

+-----------------+
| UAF 1           |           +-----------------+
| struct nft_expr |           | UAF 2           |
|                 |           | struct nft_expr |
| data (payload)  |           |                 |
+-----------------+           +-----------------+
       |                             |
       |                             |
       v                             v
+-----------------+           +-----------------+
| struct rb_node  |           | struct rb_node  |
|                 |           |                 |
+-----------------+           +-----------------+
       |                             |
       v                             v
+-----------------+           +-----------------+
| struct msg_list |           | struct msg_list |
+-----------------+           +-----------------+

Due to the manipulation of pointers and structures, msg_list->next of struct posix_msg_tree_node points to the payload of UAF 2. This means that the data fields in UAF 2 overlap with the structure, enabling unauthorized access to the msg_msg data fields. The KASLR leakage can happen if certain structures are allocated below UAF 2.

Some more structures to analyze for the KASLR leak:

struct percpu_ref_data {
    atomic_long_t       count;
    percpu_ref_func_t   *release;
    percpu_ref_func_t   *confirm_switch;
    bool            force_atomic:1;
    bool            allow_reinit:1;
    struct rcu_head     rcu;
    struct percpu_ref   *ref;
};
struct user_key_payload {
    struct rcu_head rcu;
    unsigned short  datalen;
    char        data[] __aligned(__alignof__(u64));
};

struct callback_head {
    struct callback_head *next;
    void (*func)(struct callback_head *head);
} __attribute__((aligned(sizeof(void *))));
#define rcu_head callback_head

struct percpu_ref_data: This structure holds various fields, including a reference count (count) and callbacks (release and confirm_switch).
struct user_key_payload: This structure contains a field called rcu of type struct rcu_head. It also has datalen and data.
The exploitation involves a scenario where a struct percpu_ref_data is allocated and used. This structure contains a reference count (count) that is manipulated during certain operations.
At some point, the kfree(msg_msg->security) operation is performed within the free_msg function. This operation frees the memory associated with a security-related structure (msg_msg->security).
If a struct percpu_ref_data is allocated below a specific point in memory referred to as "UAF 2," and then later freed, it can lead to a kernel crash. The crash occurs because the reference count (count) assigned by the io_uring functions becomes a specific value (0x8000000000000001), which is invalid as a pointer and leads to the crash.
To exploit this vulnerability and perform a KASLR leak, an attacker needs to arrange for a struct user_key_payload to exist below UAF 2 in memory. The struct user_key_payload contains an rcu field, which, if manipulated correctly, can trigger a callback when the critical section is terminated. This callback can be used to leak memory addresses and KASLR.

NOTE: RCU (Read-Copy Update): RCU is a synchronization mechanism used in the Linux kernel for efficient read-side access to shared data. It allows multiple threads to read data simultaneously without locking, but it becomes more complex when writing data.

do_mq_timedreceive: This is a function that’s used to read messages from a mqueue with a timeout. It reads the data contained in a struct msg_msg.
msg_get: This function is responsible for retrieving messages from the linked list of messages in the mqueue.
struct msg_msg: This represents a message in the mqueue. It contains the actual message data and some metadata.
msg_get calls msg_get_first_leaf: In this step, msg_get refers to the leaf node (first message) in the linked list of messages.
list_del: This function is called to unlink the first struct msg_msg from the linked list, effectively removing it from the mqueue.
store_msg: After unlinking, the store_msg function is executed. It copies the message data to the user space (user program).
copy_to_user: This function copies data from kernel space to user space, allowing the user program to access the message content.
free_msg: Once the data is copied, free_msg is called to release the memory used by the struct msg_msg and its associated resources.
msg_msg->security: This refers to the security-related pointer stored in the struct msg_msg data structure.
msg_msg->data and rcu->func: This is where the actual issue lies. The region that is read as the message data overlaps rcu->func, which holds a kernel function address, so copying the message to user space makes a KASLR leak possible.

Step 3: modprobe Path Overwrite

1. UAF 1:

The vulnerability occurs in the msg_get function.
The list_del function is used to remove an element (struct msg_msg) from a linked list.
The UAF happens because the contents of the removed struct msg_msg can still be accessed even after it has been removed from the list.
The exploit can manipulate the m_list pointers and access other fields in the structure.

2. UAF 2:

Another UAF vulnerability exists, involving different structures (nft_expr and user_key_payload).
Similar to UAF 1, a freed structure’s contents can still be accessed.
The exploit can manipulate the fields of the freed structures to overwrite memory.
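Why control over the m_list pointers matters becomes clearer when the unlink step is written out. The sketch below is a user-space model of the two stores that a list unlink performs. It omits the pointer poisoning and the optional list-hardening checks of a real kernel, and the name model_list_del is hypothetical.

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's doubly linked list head. */
struct list_head {
    struct list_head *next, *prev;
};

/*
 * Unlink `entry` from its list. If an attacker controls entry->next and
 * entry->prev, these two stores write attacker-chosen pointers to
 * attacker-chosen addresses:
 *   *(next + offsetof(prev)) = prev
 *   *(prev + offsetof(next)) = next
 */
void model_list_del(struct list_head *entry)
{
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
}
```

Both stores dereference the supplied pointers, so in this model both addresses must be writable. That constrains which values can be written, and it is one plausible reason why a target such as the modprobe path has to be overwritten with some care.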

             +-------------------------+
UAF 1        | struct msg_msg (Freed)  |
             +-------------------------+
             |    m_list (next)        |   <-------------------------+
             +-------------------------+                             |
             |    m_list (prev)        |                             |
             +-------------------------+                             |
             |          ...            |                             |
             +-------------------------+                             |
             |    m_type               |                             |
             +-------------------------+                             |
             |    m_ts                 |                             |
             +-------------------------+                             |
             |   next (freed)          |                             |
             +-------------------------+                             |
             |    security             |                             |
             +-------------------------+                             |
             |   ... (rest of struct)   |                            |
             +-------------------------+                             |
                                                                     |
UAF 2                                                                |
             +-------------------------+                             |
             | struct nft_expr (Freed) |   <------------------------+
             +-------------------------+
             |        ...              |
             +-------------------------+
             | struct user_key_payload |
             +-------------------------+
             |        ...              |
             +-------------------------+
UAF 1                                       UAF 2
         -------------------------------             -------------------------------
0x0     |    rb color   |    rb_right   |           |   rcu->next   | rcu->func ptr |
         --------------- ---------------             --------------- ---------------
0x10    |    rb_left    |  (UAF 2+0x18) |           |   data_len    |     data[0]   |
         --------------- ---------------             --------------- ---------------
0x20    |      ....     |      ....     |           |     data[1]   |     data[2]   |
         --------------- ---------------             --------------- ---------------
0x30    |      ....     |      ....     |           |     data[3]   |     data[4]   |
         --------------- ---------------             --------------- ---------------
UAF 1 overlap info: struct nft_expr == struct posix_msg_tree_node
UAF 2 overlap info: struct nft_expr == struct user_key_payload
data[0] = m_list->next   /   data[1] = m_list->prev   /   data[2] = m_type   /   data[3] = m_ts
struct msg_msg {
    struct list_head m_list;
    long m_type;
    size_t m_ts;        /* message text size */
    struct msg_msgseg *next;
    void *security;
    /* the actual message follows immediately */
};
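The overlap table above can be checked with a little offset arithmetic. This is a user-space sketch under stated assumptions: a 64-bit build, 64-byte slab objects for both UAF chunks, and a fake msg_msg that starts at UAF 2 + 0x18 (the start of data[]). The names MODEL_OBJECT_SIZE, model_uaf2_offset, model_data_slot, and model_spill_offset are hypothetical.

```c
#include <stddef.h>

/* User-space copies of the layouts quoted in this article. */
struct list_head { struct list_head *next, *prev; };

struct callback_head {
    struct callback_head *next;
    void (*func)(struct callback_head *head);
};

struct user_key_payload {
    struct callback_head rcu;
    unsigned short       datalen;
    char data[] __attribute__((aligned(8)));
};

struct msg_msgseg;                 /* opaque in this sketch */

struct msg_msg {
    struct list_head   m_list;
    long               m_type;
    size_t             m_ts;       /* message text size */
    struct msg_msgseg *next;
    void              *security;
};

/* Assumed slab object size for the UAF chunks. */
#define MODEL_OBJECT_SIZE 64

/* Offset, from the start of UAF 2, of a fake msg_msg field. */
size_t model_uaf2_offset(size_t msg_field_offset)
{
    return offsetof(struct user_key_payload, data) + msg_field_offset;
}

/* Index of the 8-byte data[] slot (as numbered in the table) for a field. */
size_t model_data_slot(size_t msg_field_offset)
{
    return msg_field_offset / 8;
}

/* Offset inside the NEXT slab object if the field lies past UAF 2,
 * or (size_t)-1 if the field is still inside UAF 2. */
size_t model_spill_offset(size_t msg_field_offset)
{
    size_t off = model_uaf2_offset(msg_field_offset);
    return off >= MODEL_OBJECT_SIZE ? off - MODEL_OBJECT_SIZE : (size_t)-1;
}
```

Under these assumptions, m_type and m_ts fall on slots data[2] and data[3], matching the table. The security pointer lands on the first 8 bytes of the object that follows UAF 2, and the message body begins at the second 8 bytes of that object, which is where rcu->func of a neighboring struct user_key_payload would sit. This is consistent with the leak described in Step 2, although the real placement depends on the actual slab layout.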

Patch Diffing

The patch fixes the vulnerability by changing the code logic. As we discussed, the vulnerability came from a specific order of operations: expressions were initialized before certain flags were checked. The fix moves the check for stateful expressions (expressions that retain their state across operations) to before their creation. This prevents the early initialization of expressions that would be destroyed immediately afterwards but could still perform operations before their destruction.

I think the new approach is better because it not only fixes the current vulnerability but also eliminates other potential vulnerabilities that could arise at the same location from initializing expressions that are destroyed immediately afterwards.

Before Fix

Initialize 'exp'
  +--------+
  |        |
  v        |
 [exp]     |
  |        |
  +--------+
   |
   v
  Check 'flag'

After Fix:

Check 'flag'
  +--------+
  |        |
  v        |
 [flag]    |
  |        |
  +--------+
   |
   v
Check stateful expressions
  +--------+
  |        |
  v        |
 [exp]     |
  |        |
  +--------+
   |
   v
 Perform operations using 'exp'
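The difference between the two orderings can be sketched in user-space C. This is a simplified model and not the actual patch: every name in it (MODEL_EXPR_STATEFUL, struct model_expr, model_alloc_before_fix, model_alloc_after_fix, model_init_calls) is hypothetical, and the counter merely stands in for whatever side effects expression initialization has.

```c
#include <stddef.h>

/* Hypothetical stand-in for the stateful-expression flag. */
#define MODEL_EXPR_STATEFUL 0x1u

struct model_expr_type { unsigned int flags; };

struct model_expr {
    const struct model_expr_type *type;
    int initialized;
};

/* Counts how often initialization (and its side effects) actually ran. */
int model_init_calls;

void model_expr_init(struct model_expr *e, const struct model_expr_type *t)
{
    e->type = t;
    e->initialized = 1;
    model_init_calls++;
}

/* Before the fix: initialize first, check the flag afterwards.
 * Returns 0 if the expression is accepted, -1 if it is rejected. */
int model_alloc_before_fix(struct model_expr *e,
                           const struct model_expr_type *t)
{
    model_expr_init(e, t);
    if (!(t->flags & MODEL_EXPR_STATEFUL)) {
        e->initialized = 0;   /* destroyed, but init has already run */
        return -1;
    }
    return 0;
}

/* After the fix: check the flag first, initialize only if it passes. */
int model_alloc_after_fix(struct model_expr *e,
                          const struct model_expr_type *t)
{
    if (!(t->flags & MODEL_EXPR_STATEFUL))
        return -1;            /* rejected before any side effect */
    model_expr_init(e, t);
    return 0;
}
```

Both variants reject a non-stateful expression, but only the second one does so without ever running its initialization. In the model, model_init_calls stays at zero after the fix, which is the property that removes the window for the use-after-free.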

Final Thoughts

Analyzing CVE-2022-32250 and working through this security issue has been an illuminating experience. Delving into the UAF (use-after-free), understanding its implications, and studying the fix has deepened my understanding of the nf_tables.ko module and of UAF exploitation.

Furthermore, I would like to acknowledge the remarkable contribution of @junomonster and Yordan (anatomic) in crafting an exploit for the vulnerability. Their exploit has not only provided a practical demonstration of the vulnerability but has also enabled me to test and validate that the vulnerability exists.

I trust that reading this account was as delightful for you as it was for me to craft it.

Also, there are multiple ways to exploit the vulnerability; I will add links in the reference section for further details. The exploit does not always work straightforwardly, as its chances of gaining root access on a specific vulnerable kernel can be low. It requires thorough experimentation with various object placements, but it is a worthwhile exercise.