CVE-2022-1015: NF_Tables Out-of-Bounds Access (LPE)

Aug 15, 2023

Summary

A flaw was found in the Linux kernel in linux/net/netfilter/nf_tables_api.c of the netfilter subsystem. This flaw allows a local user to cause an out-of-bounds write issue.

Description

What are Netfilter and NF_TABLES?

NetFilter & NF_Tables

Netfilter, also known as net/netfilter, is a crucial component within the kernel that helps manage network traffic. Think of it as a system of hooks placed throughout the network modules. These hooks act like designated spots where other modules can register to perform specific actions on network packets.

Let’s understand this with an example. Imagine a busy highway with multiple exits, each marked with a signpost. These signposts represent the hooks in Netfilter. When a network packet passes through the system, it reaches these signposts. At each signpost, there are handlers waiting to take control and decide what to do with the packet.

These handlers can have different instructions for each signpost. For example, one handler might allow the packet to pass through, another might decide to drop it, and yet another might modify its content before allowing it to continue on its journey.

In this way, Netfilter provides a flexible and powerful framework within the kernel to manage network traffic. It allows different modules to work together and apply specific actions based on their designated hooks. This functionality enables tasks such as packet filtering, firewalling, and network address translation (NAT) to ensure the security and smooth operation of network communication.

High-level netfilter architecture (source: arthurchiao.art)

NF_TABLES ?

Continuing the example above, **nf_tables** is like a team of specialized security personnel. They are assigned to specific checkpoints and have the expertise to analyze and make decisions about each vehicle they encounter.

When a vehicle (network packet) reaches a checkpoint (hook), the corresponding nf_tables personnel take over. They examine the vehicle (packet content) and determine whether it should be allowed or not, or undergo further investigation based on specific rules and policies.

nftables is the modern packet classification framework built into the Linux kernel. Most Linux distributions have already switched to nftables for packet filtering, and iptables has become something of a legacy tool. Network engineers use a command-line tool called nft to write the rules that filter traffic.

nf_tables filters the traffic using the concept of Tables, Chains, and Rules.

Tables contain one or more chains, and each chain has one or more rules. So the process is to create a table, then create a chain under it, and then define rules under the chain.

Let’s take an example. As system engineers, we want to restrict access from our server to a vulnerable website named demo.testfire.net. To achieve this filtering we will use the well-known nft command-line tool.

# nft command to add a table
nft add table <table_family> <table_name>
# example:
nft add table ip leak_table
# nft command to add a chain
nft add chain <table_family> <table_name> <chain_name>
# example:
nft add chain ip leak_table output_chain
# nft command to add a rule
nft add rule <table_family> <table_name> <chain_name> <rule-definition>
# example:
nft add rule ip leak_table output_chain ip saddr 192.168.1.0/24 accept

Before blocking the destination domain I needed its IP address, so I pinged the domain and got 65.61.137.117.

The table, chain, and rule that block the outbound traffic to this domain attach a rule matching daddr 65.61.137.117 to the output hook.

After the rules were applied the difference is clear: before (green color) we were able to get a response, but once the ruleset is applied on the output hook we can no longer reach the domain at the daddr (destination) address we specified.

Expressions and Registers

In nf_tables, expressions and registers are used to define actions and store data while processing network packets. Expressions are building blocks that allow you to perform various actions on packets, such as accepting, dropping, or modifying them. Registers are memory locations used to store packet data temporarily for further processing.

An example of an expression is the “counter” expression, which is used to count the number of packets that match a specific rule. Let’s see how to use the “counter” expression to count incoming packets on a specific port.

Suppose you want to count the number of incoming packets on port 80 (HTTP) using nf_tables. First, you would create a new table to store the rule:

nft add table ip leak_chain

Then, you would add a new chain to the table:

nft add chain ip leak_chain input { type filter hook input priority 0 \; }

Next, you can define the rule with the “counter” expression:

nft add rule ip leak_chain input tcp dport 80 counter

Now, every time a packet with TCP destination port 80 arrives, nf_tables will increment the counter for that rule. To view the counters, you can use the following command:

nft list ruleset

You will see the counters associated with the rule:

table ip leak_chain {
    chain input {
        type filter hook input priority 0; policy accept;
        tcp dport 80 counter packets 1024 bytes 122880
    }
}

In this example, 1024 packets with a total of 122880 bytes have been counted so far.

Registers in nf_tables store pieces of packet data so that later expressions can reference them. For example, one expression can load the source IP address of a packet into a register, and a following expression can then compare or otherwise use that value.

Overall, expressions and registers in nf_tables provide powerful mechanisms for customizing packet processing and implementing advanced filtering and networking logic.
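To make the interplay between expressions and registers more concrete, here is a small user-space toy model of the rule "tcp dport 80 counter". It is only a sketch and not kernel code: every toy_* name, the packet struct, and the choice of register slot 4 are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy register file: 20 slots of 4 bytes, i.e. 80 bytes in total. */
struct toy_regs {
    uint32_t data[20];
};

/* Toy packet: only the fields this example needs (hypothetical layout). */
struct toy_packet {
    uint32_t saddr;   /* source IPv4 address */
    uint16_t dport;   /* TCP destination port */
    uint32_t length;  /* packet length in bytes */
};

/* Toy counter state, comparable to "counter packets N bytes M". */
struct toy_counter {
    uint64_t packets;
    uint64_t bytes;
};

/* Payload-like expression: store the destination port in a register. */
static void toy_load_dport(struct toy_regs *regs, unsigned int dreg,
                           const struct toy_packet *pkt)
{
    regs->data[dreg] = pkt->dport;
}

/* Compare-like expression: read a register and compare it with a constant. */
static bool toy_cmp_eq(const struct toy_regs *regs, unsigned int sreg,
                       uint32_t value)
{
    return regs->data[sreg] == value;
}

/* Evaluate the toy rule "tcp dport 80 counter" against one packet. */
static bool toy_eval_rule(const struct toy_packet *pkt, struct toy_counter *ctr)
{
    struct toy_regs regs = {0};

    toy_load_dport(&regs, 4, pkt);   /* write: packet -> register 4 */
    if (!toy_cmp_eq(&regs, 4, 80))   /* read: register 4 -> comparison */
        return false;                /* mismatch: the rule stops here */

    ctr->packets += 1;               /* counter expression */
    ctr->bytes += pkt->length;
    return true;
}
```

Calling toy_eval_rule() once per packet updates the counter only for packets whose destination port is 80. The point to take away is that one expression writes a register (a destination register) and the next one reads it (a source register), which is exactly where register indexes come into play later in this article.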

Of course, this is a very simplified picture, but I hope it is enough to get started.

If you want to know more about nf_tables, you can find it here: https://wiki.nftables.org/wiki-nftables/index.php/What_is_nftables%3F

Build the Lab

As this is a kernel module vulnerability it is tricky to debug, so you need a little more patience than usual 🦐

I will try to keep it simple. For the lab I used:

1. VirtualBox

2. Two Linux virtual machines, as the vulnerability affects kernel versions v5.12 to v5.17.

As we have to debug a kernel module, and the kernel is not a user-space process, GDB alone cannot be used for debugging; we need a client/server architecture. The kernel can be debugged remotely by combining a GDB server on the target machine with gdb on the host/development machine. The Linux kernel has a GDB server implementation called KGDB. It communicates with a GDB client over a network or serial port connection.

Host/Development Machine: Runs gdb against the vmlinux file which contains the symbols and performs debugging
Target Machine: Runs kgdb and is the machine to be debugged

    ------------------                              --------------------
    |       Host      |                             |       Target     |
    |                 |                             |                  |
    |  -------------  |                             |   ------------   |
    | |     gdb     | |<--------------------------->|  |    kgdb    |  |
    | |             | |             Serial or       |  |            |  |
    | --------------  |             Ethernet        |  -------------   |
    |       |         |             Connection      |        |         |
    |  -------------- |                             |  --------------  |
    | | Kernel image ||                             |  |Linux Kernel | |
    | | with debug   ||                             |  |(zImage)     | |
    | | symbols      ||                             |  |             | |
    | | (vmlinux)    ||                             |  --------------- |
    | ----------------|                             |                  |
    -------------------                             --------------------
Hence, two machines are required for using kgdb.

KGDB is a GDB server implementation integrated into the Linux kernel.
It supports serial port communication (available in the mainline kernel) and network communication (patch required).
It has been available in the mainline Linux kernel since version 2.6.26 (x86 and sparc) and 2.6.27 (arm, mips and ppc).
It enables full control over kernel execution on the target, including memory read and write, step-by-step execution, and even breakpoints in interrupt handlers.

There might be other ways to do it, but this is the setup I generally use.

3. I am using Ubuntu AMD64-22.04 LTS iso: https://releases.ubuntu.com/.

Connect and Create a Serial Port in VirtualBox

Assumptions for this step:

It is assumed that you have the ISO image downloaded locally and have already created 2 VMs with it. For the demonstration I created 2 machines named target and Dev Machine. Creating a serial port in VirtualBox and connecting the machines is very easy:

1. Select your target machine in VirtualBox and go to the settings options.
2. Once you are in the settings tab of the target machine, select Serial Ports and enter the below configuration:

Don’t check Connect to existing pipe/socket, as we don't have an existing one.

3. Once the serial port is configured, follow steps 1 and 2 for the dev machine as well, but on the dev machine you need to check Connect to existing pipe/socket and make sure you specify the same Path/Address.

Once that’s done, congratulations: the lab is ready.

WARNING

! Do not start the Dev machine first, otherwise you will see a serial port
error, since, as you might have already noticed, the 2 machines are connected
together through the serial port.

DEBUGGING KERNEL — nf_tables

As already discussed, the vulnerability lies in nf_tables, which is a kernel module. To debug a kernel module we need to follow some steps, so let's do the initial setup first:

1. Verify that the machines (dev and target) can communicate over the serial port by sending a message across it.

A fresh Ubuntu 22.04 image downloaded from the official Ubuntu website will not be running one of the older, affected kernels, so we have to downgrade the kernel. We do that in the following steps.

2. Download an affected kernel version. As per the CVE details, the affected versions are v5.12 to v5.17, so I downloaded v5.16-rc3 from the official kernel GitHub repository.

Run git checkout on the v5.16-rc3 tag; you can verify the result with git status.

Once you have checked out the affected kernel version, you need to build and install this image and add it to GRUB. Before we do that, some packages need to be available:

1. build-essential
2. flex
3. bison

Once the packages are installed, let’s enable KGDB in the kernel configuration. Move into the git repository where we downloaded the kernel source and run the make nconfig command.

This command opens the configuration in a menu-based view, where we can verify that the KGDB option is enabled.

1. Select Kernel hacking

2. Select the Generic kernel Debugging Instruments

3. Verify the KGDB and magic sysrq option is selected

Once these settings are verified, we need to verify one more variable: DEBUG_INFO should be y as well. To look for the variable, press F8 and search for it.

So far everything has been verified and the libraries are installed. As the flaw is in nf_tables, we need to make sure that this module is also enabled, so let's verify that too.

To do that we will go to the Networking Support > Networking Option > Network Packet Filtering Framework > Core Netfilter Configuration

To be on the safe side (building and installing the kernel and its modules takes a lot of time) and to avoid missing any file while debugging, I enabled all the netfilter modules related to nf_tables so that this step does not have to be repeated.

Press F6 to save the changes, then run make -j8 to build the Linux kernel with 8 parallel jobs. Go grab a coffee, as this is going to take a long time. Believe me, very long.

When I came back 4 hours later, I had run into the below error:

If you also face this error, you can resolve it by setting CONFIG_SYSTEM_TRUSTED_KEYS to an empty string.

After the change, rebuild with the same command, make -j8,

and wait for the vmlinux file to appear in the source directory.

After a successful make -j8 build, the vmlinux file was available along with some built-in modules.

Next, run make modules_install and wait for the command to complete.

Once that’s completed, run make install, which installs the v5.16-rc3 kernel into /boot. Then run update-grub and reboot to restart the machine.

During the restart, the boot menu displays the option to select the kernel version. Select v5.16-rc3 and boot it.

Verify the kernel version by running uname -r.

Now we are running the affected kernel version.

Next, I wanted to enable the GDB scripts for the affected target kernel. The GDB scripts are a collection of helper scripts that simplify kernel debugging.

To do that we have to perform 2 steps:

1. On the target machine we have to enable CONFIG_GDB_SCRIPTS, which was already enabled in our target kernel configuration.

2. On the dev machine we have to create a ~/.gdbinit file and add the line add-auto-load-safe-path <location-bin-file>

On my dev machine I have also added peda. For those who don't know peda, it is another very helpful tool that integrates with gdb and helps with reading registers, functions, etc.

Integrating peda with GDB : https://qiita.com/miyase256/items/248a486cca671686c58c

To start debugging the target machine we also have to copy the compiled kernel source folder (built with debug info) to the dev machine. To make the copy easy, I installed OpenSSH on the target machine and used the scp command on the dev machine to copy the compiled linux folder from target to dev.

On the target machine we make a tar.gz file, and on the dev machine we use scp to copy linux.tar.gz from target to dev.

Make the tar.gz file with tar: tar -czvf linux.tar.gz linux

On the dev machine I copied the folder to /home/target/Desktop/linux

scp target@10.0.2.15:/home/target/Desktop/linux.tar.gz /home/target/Desktop/

Next, to debug the kernel we have to pass the serial port and baud rate to kgdboc so that we can debug the kernel from the dev machine.

Run the SysRq magic sequence on the target machine:

echo g > /proc/sysrq-trigger

On the dev machine, run target remote /dev/ttyS0 inside gdb.

We can see that the kgdb breakpoint was triggered. Let's put a breakpoint in our suspected functions, but first let's try to understand the real root cause.

As we have enabled the GDB scripts, let's load our beloved affected nf_tables.ko module. apropos lx lists the available helper commands; run lx-symbols to load the symbols for nf_tables.ko and the other loaded modules into GDB.

Once the module is loaded, let’s put a breakpoint on the nft_do_chain function in nf_tables_core.c and start the static analysis.

Static Analysis

Root Cause:

Initial Analysis

The vulnerability originates from the functions nft_validate_register_store and nft_validate_register_load. These functions are responsible for ensuring that register indexes and data intended for writing (storing) or reading (loading) are within the valid range of registers. To understand this better, let's delve into the parsing functions: nft_parse_register_store and nft_parse_register_load. These parsing functions invoke the aforementioned validation functions, setting the stage for a closer examination.

/* net/netfilter/nf_tables_api.c */
int nft_parse_register_load(const struct nlattr *attr, u8 *sreg, u32 len)
{
 u32 reg; // 4 byte register variable
 int err;
 reg = nft_parse_register(attr); // gets the register index from an attribute
 err = nft_validate_register_load(reg, len); // calls the validating function
 if (err < 0) // the validating function returned an error, so propagate it
  return err;
 *sreg = reg; // save the register index into sreg (a pointer that is provided as an argument)
 // sreg = source register -> the register from which we read
 return 0;
}
EXPORT_SYMBOL_GPL(nft_parse_register_load);
int nft_parse_register_store(const struct nft_ctx *ctx,
        const struct nlattr *attr, u8 *dreg,
        const struct nft_data *data,
        enum nft_data_types type, unsigned int len)
{
 int err;
 u32 reg; // 4 byte register variable
 reg = nft_parse_register(attr); // parsed from an attribute
 err = nft_validate_register_store(ctx, reg, data, type, len);
 /* here we pass a bit more arguments to the validating function */
 /* because we are going to be writing into the registers and not reading from them */
 if (err < 0)
  return err;
 *dreg = reg; // once again saves the register index into dreg
 // dreg = destination register -> the register in which we write
 return 0;
}

Within the code above, note that the variable reg is a 32-bit unsigned integer (u32), while the sreg and dreg pointers refer to 8-bit unsigned variables (u8). This narrowing is logical when considering the underlying register mechanism. The entire register space spans 80 bytes (0x50). Therefore, after validation, keeping only the least significant byte suffices: whenever the register index is valid and within bounds, it always fits in these 8 bits.
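The narrowing from u32 to u8 can be tried out in user space. The sketch below uses the hypothetical helper names narrow_register and survives_narrowing, which are not kernel functions; it only shows what the assignment *sreg = reg does to the value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Keep only the least significant byte, like the store into *sreg / *dreg. */
static uint8_t narrow_register(uint32_t reg)
{
    return (uint8_t)reg;
}

/* True when the 32-bit value is unchanged by the narrowing. */
static bool survives_narrowing(uint32_t reg)
{
    return narrow_register(reg) == reg;
}
```

A valid index such as 0x13 survives unchanged, while a value such as 0xFFFFFFF0 is silently reduced to 0xF0. That is harmless only as long as the validation in front of the assignment really rejects every out-of-range value.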

For instance, imagine a hardware setup where a microcontroller governs several hardware components, each represented by a specific register. These registers can store various types of data, from configuration settings to sensor readings. In this scenario, the reg variable could be employed to interact with a larger register, allowing the storage of a more substantial range of information. Meanwhile, sreg and dreg pointers could be utilized for narrower registers dedicated to specific functions, such as controlling LEDs or reading button states. Since the register space is 80 bytes, it's prudent to optimize memory usage by retaining only the essential data for further processing, which aligns with the provided code's design.

Initially, the architecture featured a mere set of four 16-byte registers. Over time, these registers evolved into a configuration of sixteen 4-byte registers. Yet, for compatibility purposes, the original 16-byte register offsets remained intact. This effectively presents the registers as a unified buffer, delineated by two distinct offset types.

+------------------------+
| 16-byte Register 1     |  (Offset: 0)
|                        |
+------------------------+
| 16-byte Register 2     |  (Offset: 16)
|                        |
+------------------------+
| 16-byte Register 3     |  (Offset: 32)
|                        |
+------------------------+
| 16-byte Register 4     |  (Offset: 48)
|                        |
+------------------------+
| 4-byte Register 1      |  (Offset: 0)
|                        |
+------------------------+
| 4-byte Register 2      |  (Offset: 4)
|                        |
+------------------------+
| 4-byte Register 3      |  (Offset: 8)
|                        |
+------------------------+
| ...                    |
|                        |
+------------------------+
| 4-byte Register 16     |  (Offset: 60)
|                        |
+------------------------+

enum nft_registers {
	NFT_REG_VERDICT,
	NFT_REG_1,
	NFT_REG_2,
	NFT_REG_3,
	NFT_REG_4,
	__NFT_REG_MAX,
	NFT_REG32_00	= 8,
	NFT_REG32_01,
	NFT_REG32_02,
	...
	NFT_REG32_13,
	NFT_REG32_14,
	NFT_REG32_15,
};

Examining the enum type reveals the presence of both offset types within it. NFT_REG_VERDICT points to zero, while NFT_REG_1 through NFT_REG_4 point to indexes from one to four. This pattern continues with NFT_REG32_00 defined as eight, and subsequent indexes incrementing by one.
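To see how the two naming schemes address the same memory, the following sketch computes byte offsets from the start of the whole 80-byte buffer, so the 16-byte verdict slot is included (the ASCII diagram above counts from the first data register instead). The sketch_* names and the struct are assumptions for illustration, modelled on my reading of the kernel's register buffer as an array of twenty 32-bit words.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REG_SIZE    16  /* size of an old-style register in bytes */
#define SKETCH_REG32_SIZE   4  /* size of a new-style register in bytes */

/* Sketch of the register buffer: 20 words of 4 bytes = 80 bytes (0x50). */
struct sketch_regs {
    uint32_t data[20];
};

/* Byte offset of old-style register n (0 = verdict, 1..4 = NFT_REG_1..NFT_REG_4). */
static size_t reg16_byte_offset(unsigned int n)
{
    return (size_t)n * SKETCH_REG_SIZE;
}

/* Byte offset of new-style register NFT_REG32_<n>, with n in 0..15. */
static size_t reg32_byte_offset(unsigned int n)
{
    return SKETCH_REG_SIZE + (size_t)n * SKETCH_REG32_SIZE;
}
```

With these helpers, NFT_REG_1 and NFT_REG32_00 both start at byte 16, and NFT_REG_2 and NFT_REG32_04 both start at byte 32, which is the "unified buffer with two offset types" described above.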

For example, consider an analogy with a toolbox where different tools are stored in compartments labeled 0, 1, 2, 3, and 4. Each compartment can hold a specific tool, and the labels represent the indexes of those compartments. Furthermore, there’s an auxiliary set of compartments labeled 8, 9, 10, and so on, each accommodating a different tool. This arrangement allows easy access to the tools based on their designated indexes.

Diagram:

+----------------+
|  NFT_REG32_00  | (Index: 8)
+----------------+
|  NFT_REG32_01  | (Index: 9)
+----------------+
|  NFT_REG32_02  | (Index: 10)
+----------------+
|  NFT_REG32_03  | (Index: 11)
+----------------+
|    ...         |
+----------------+
|  NFT_REG_1     | (Index: 1)
+----------------+
|  NFT_REG_2     | (Index: 2)
+----------------+
|  NFT_REG_3     | (Index: 3)
+----------------+
|  NFT_REG_4     | (Index: 4)
+----------------+
|  NFT_REG_VERDICT| (Index: 0)
+----------------+

In this analogy, the compartments correspond to the enum members, and the indexes represent their associated values. The enumeration offers a convenient and structured way to reference different tools, facilitating efficient usage based on their specific positions.

When the initialization process of an expression encounters the requirement to parse a register from the user’s netlink message, it invokes either the nft_parse_register_load or the nft_parse_register_store routine, based on whether it's dealing with a source register or a destination register.

For instance, imagine a scenario in which a custom firewall rule is being constructed using nftables. The user sends a netlink message to configure this rule, specifying source and destination registers. The initialization routine of the expression parses these registers using the appropriate routine: nft_parse_register_load for the source register and nft_parse_register_store for the destination register. This ensures that the user's intent is accurately captured and processed within the nftables framework.

         +-------------------------+
         | User's Netlink Message  |
         +-------------------------+
                     |
                     v
+----------------------------------+
| nftables Initialization Process  |
+----------------------------------+
          |                    |
          v                    v
+-------------------+   +-------------------+
| nft_parse_register |   | nft_parse_register |
| _load Routine     |   | _store Routine     |
+-------------------+   +-------------------+

The user’s netlink message is processed by the nftables initialization process. Depending on whether the register is a source or destination, the appropriate parsing routine is invoked to handle the register information. This process ensures accurate and targeted parsing of the user’s input, enhancing the overall functionality and reliability of the nftables expression.

nft_parse_register ?

Parsing and translation happen in the nft_parse_register function.

When handling a register value passed through a netlink attribute, a transformation takes place based on the range of the register index. If the index falls between NFT_REG_VERDICT and NFT_REG_4 (zero to four inclusive), it is an old-style 16-byte register and is scaled by NFT_REG_SIZE / NFT_REG32_SIZE, that is 16 / 4 = 4. Any other value is treated as a new-style 4-byte register and is shifted by NFT_REG_SIZE / NFT_REG32_SIZE - NFT_REG32_00, that is 4 - 8 = -4.

Consider an analogy with a set of numbered boxes, where each box represents a different register index. If one of the boxes numbered 0 to 4 is chosen, the index is multiplied by 4; otherwise it is mapped to its slot by subtracting 4. This mapping ensures alignment between the enum values and the actual register indices. For example, an enum value like NFT_REG32_00 (8), which might seem to map to index 0, actually lands on index 4 because of the 16-byte verdict register at the beginning.

In essence, this process streamlines the register handling to accommodate changes in register sizes and offsets, providing a consistent and aligned mapping for efficient management.

[Diagram: Register Transformation]
```
       +----------------------------------------+
Index: | 0 | 1 | 2 | 3 | 4 | ... | NFT_REG32_00 |
       +----------------------------------------+
                           |
                           v
                   (Transformation)
                           |
                           v
       +----------------------------------------+
Index: | 0 | 4 | 8 | 12| 16| ... |      4       |
       +----------------------------------------+
```

This represents how the register indices are transformed based on the calculation described above. The transformation ensures that the enum values and register indices are correctly aligned and that the handling of register offsets is streamlined for efficient use.
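The translation can be reproduced in user space. The function below is a sketch based on my reading of nft_parse_register in kernels of this era; the sketch_* names are hypothetical, and reading the value from the netlink attribute (including the byte-order conversion) is left out.

```c
#include <stdint.h>

#define SKETCH_REG_SIZE    16  /* bytes per old-style register */
#define SKETCH_REG32_SIZE   4  /* bytes per new-style register */
#define SKETCH_REG_4        4  /* last old-style enum value */
#define SKETCH_REG32_00     8  /* first new-style enum value */

/* Translate a user-supplied register number into a 4-byte slot index. */
static uint32_t sketch_parse_register(uint32_t reg)
{
    if (reg <= SKETCH_REG_4)
        return reg * SKETCH_REG_SIZE / SKETCH_REG32_SIZE;   /* reg * 4 */

    /* Unsigned arithmetic: this is reg - 4, wrapping for tiny values. */
    return reg + SKETCH_REG_SIZE / SKETCH_REG32_SIZE - SKETCH_REG32_00;
}
```

Note that nothing in this step rejects large inputs: a user-supplied value of 0xFFFFFFF4 simply becomes 0xFFFFFFF0, and it is entirely up to the validation functions to catch it.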

nft_validate_register_load ?

Let’s debug further and look at nft_validate_register_load now.

You might have spotted the vulnerability already, and yes, you are right:

if (reg * NFT_REG32_SIZE + len > sizeof_field(struct nft_regs, data))

This code snippet hints at a potential integer overflow, doesn’t it? Let’s dive into it with an example. The expression reg * NFT_REG32_SIZE + len is computed in 32 bits, so a large enough reg makes the result wrap around to a small value. Imagine reg is set to 0xFFFFFFF0 and len is set to 0x40. The multiplication wraps to 0xFFFFFFC0, and adding len wraps again to 0. The condition therefore evaluates as false, the function returns without an error, and an out-of-range register passes validation. Not every large value works: with reg set to 0xFFFFFFFE and len set to 2, the result is 0xFFFFFFFA, which is still greater than 0x50 and is rejected.

reg * NFT_REG32_SIZE + len
   -----------------------------
      |      |         |
   reg value  |         |
              |         |
       NFT_REG32_SIZE  len
              |         |
              |         |
              v         v
  +-------------------+---------------------+
  |  Potential Integer Overflow Vulnerability |
  +------------------------------------------+

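The “reg value” is multiplied by NFT_REG32_SIZE and then added to len. If the result exceeds the size of the data field of struct nft_regs (0x50 bytes), the register is rejected. If the 32-bit result wraps around instead, the check is bypassed, which is what makes out-of-bounds access, and potentially code execution, possible.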
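The wrap-around is easy to verify outside the kernel. The sketch below mirrors the bounds check with fixed-width 32-bit types; the function name and constants are hypothetical, and the two early checks (index below the first data register, zero length) are reproduced from my reading of the kernel source rather than from the snippet above.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_REG32_SIZE      4u
#define SKETCH_REGS_DATA_SIZE 80u  /* size of the register buffer, 0x50 */
#define SKETCH_FIRST_DATA_REG  4u  /* first slot after the 16-byte verdict */

/* User-space imitation of the load validation, using 32-bit arithmetic. */
static int sketch_validate_register_load(uint32_t reg, uint32_t len)
{
    if (reg < SKETCH_FIRST_DATA_REG)
        return -EINVAL;
    if (len == 0)
        return -EINVAL;
    /* The sum is a 32-bit value and silently wraps around on overflow. */
    if (reg * SKETCH_REG32_SIZE + len > SKETCH_REGS_DATA_SIZE)
        return -ERANGE;
    return 0;
}
```

Calling it with (0xFFFFFFFE, 2) returns -ERANGE, as expected for an out-of-range register, but (0xFFFFFFF0, 0x40) returns 0: the sum wraps to zero and the check passes. Once the value is stored into the u8 register field, only 0xF0 remains, which is far outside the 20 valid slots.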

A Big Notable Consideration?

However, it’s important to highlight a key consideration. Our entire analysis hinges on the assumption that the register arriving at the validation function is a 32-bit entity. This assumption, though reasonable, may not hold true in every scenario. The parameter of the function is of the enum type 'nft_registers'. Now, enums are typically designed to hold integer values (32-bit) by default. Nonetheless, an optimization might come into play, resizing the enum to only accommodate the values explicitly defined in its enumeration. If this optimization is in effect, our 'nft_registers' enum might be shrunk down to a mere 1 byte (char). Consequently, only the least-significant byte would make its way to the flawed validation process, adding an extra layer of complexity to our understanding.
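To illustrate why the width of the parameter matters, here is the same kind of check in a hypothetical variant where the register argument is only one byte wide, as it would be if the enum were shrunk. The function name and constants are assumptions for illustration only.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical variant: the register parameter is a single byte. */
static int sketch_validate_load_u8(uint8_t reg, uint32_t len)
{
    if (reg < 4u)
        return -EINVAL;
    if (len == 0)
        return -EINVAL;
    /* reg is at most 0xFF here, so reg * 4 can no longer wrap by itself. */
    if (reg * 4u + len > 80u)
        return -ERANGE;
    return 0;
}
```

For the inputs that slipped through the 32-bit check (reg = 0xFFFFFFF0, len = 0x40), only the byte 0xF0 would reach this function, the sum would be 0x400, and the call would fail with -ERANGE. This is why it is worth checking in the disassembly which width the compiler actually used.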

To check whether this optimization is active, let’s put a breakpoint on nft_parse_register_load and disassemble the function.

If you take a look at line <+42> and below, you can see the assembly generated for the vulnerable if statement.

Let’s take a look at what the disassembled statements mean:

lea edx,[rdx+rax*4]: This instruction uses the "load effective address" (LEA) operation, here purely as arithmetic: it adds the value of the rdx register to four times the value of the rax register and stores the low 32 bits of the result in edx, effectively edx = rdx + (rax * 4). This is the reg * NFT_REG32_SIZE + len computation.
cmp edx,0x50: This instruction compares the value in edx with the immediate value 0x50 (80 in decimal). cmp subtracts the immediate from the register value and sets the flags according to the result, which the following conditional jump uses to decide whether validation fails.

Congratulations: the check operates on a full 32-bit value, so in this kernel build the enum has not been shrunk by the optimization.

Exploitation and Explanation — POV:

We have 2 exploits available for the vulnerability / POC:

@David Bouman’s: https://github.com/pqlx/CVE-2022-1015
@Yordan: https://github.com/ysanatomic/CVE-2022-1015

Requirement to run the exploit:

You need libmnl-dev and libnftnl-dev packages installed in your machine.

NOTE | WARNING:

Also, before running the exploit, make sure your OS provides an up-to-date libnftnl. On some releases, such as Ubuntu 18.04 LTS, the last supported libnftnl version is 1.0.0.7, which doesn't support bitwise-op in the library, so things will blow up and the exploits will not work.

Now that we’ve established that there are no optimization obstacles, let’s delve into exploring the potential for exploiting this scenario.

To successfully exploit this situation, our focus is on creating and modifying nf_tables objects, such as tables and chains. This requires the CAP_NET_ADMIN capability. Fortunately, this capability can be obtained inside a user+network namespace. However, it’s crucial to note that the exploit has to get out of that namespace again during exploitation, because the capability only holds inside it.

At its core, this vulnerability arises from an improper validation process. This oversight grants us the ability to manipulate register values in a way that allows us to access addresses on the stack that exist outside of nft_regs. As a result, this opens the door to Out-Of-Bounds Read and Write scenarios, which, in turn, can lead to the execution of Arbitrary Code within the kernel-space.
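To get a feeling for how far outside nft_regs such an access can land, the sketch below turns a stored 8-bit register index into a byte offset, assuming (as in the rest of this article) that each slot is 4 bytes and the buffer is 80 bytes long. The helper names are hypothetical, and not every index is reachable in practice, since the value still has to slip past the flawed check.

```c
#include <stdint.h>

#define SKETCH_REG32_SIZE      4
#define SKETCH_REGS_DATA_SIZE 80

/* Byte offset, from the start of the register buffer, of a stored index. */
static unsigned int sketch_reg_byte_offset(uint8_t index)
{
    return (unsigned int)index * SKETCH_REG32_SIZE;
}

/* Bytes past the end of the buffer (0 if the access starts in bounds). */
static unsigned int sketch_bytes_past_end(uint8_t index)
{
    unsigned int off = sketch_reg_byte_offset(index);

    return off >= SKETCH_REGS_DATA_SIZE ? off - SKETCH_REGS_DATA_SIZE : 0;
}
```

For the index 0xF0 from the earlier example, the access starts 0x3C0 bytes into the buffer, that is 880 bytes past its end, and the largest 8-bit index (0xFF) starts at byte 1020. Since the register buffer lives on the kernel stack, everything in that window is neighbouring stack data.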

           +---------------------------------------------+
           |                                             |
           |           Potential Exploitation            |
           |                                             |
           +---------------------------------------------+
                       |
                       v
           +---------------------------------------------+
           |                                             |
           |   Namespace 1 (User + Network)              |
           |                                             |
           |   +-------------------------------+         |
           |   |     CAP_NET_ADMIN Capability     |      |
           |   |   (Required for nf_tables access) |     |
           |   +-------------------------------+         |
           |                                             |
           +---------------------------------------------+
                       |
                       v
           +---------------------------------------------+
           |                                             |
           |         Exploitation Outside Namespace      |
           |                                             |
           |   +-----------------------------------+     |
           |   |   Improper Validation Exploited   |     |
           |   |   Out-Of-Bounds Access            |     |
           |   |   Arbitrary Code Execution        |     |
           |   +-----------------------------------+     |
           |                                             |
           +---------------------------------------------+

We begin inside a user+network namespace, where we acquire the necessary capability. From there we trigger the improper validation to gain Out-Of-Bounds Read and Write capabilities, which ultimately leads to Arbitrary Code Execution in kernel space and lets the payload leave the namespace.

The following snippets, taken from two public exploits, show how the namespace is set up:

Example 1:

reference: https://github.com/ysanatomic/CVE-2022-1015

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* write_to_file() is a printf-style helper from the referenced exploit (not shown here). */

void setup_user_and_network_ns(void) {
	/* Remember the original IDs before entering the new user namespace. */
	uid_t uid = getuid();
	gid_t gid = getgid();

	if (unshare(CLONE_NEWUSER) < 0) {
		perror("[-] unshare(CLONE_NEWUSER)");
		exit(EXIT_FAILURE);
	}
	if (unshare(CLONE_NEWNET) < 0) {
		perror("[-] unshare(CLONE_NEWNET)");
		exit(EXIT_FAILURE);
	}

	/* Pin the process to CPU 0. */
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) < 0) {
		perror("[-] sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	// now we map uid and gid
	write_to_file("/proc/self/uid_map", "0 %d 1", uid);
	// deny setgroups (see user_namespaces(7))
	write_to_file("/proc/self/setgroups", "deny");
	// remap gid
	write_to_file("/proc/self/gid_map", "0 %d 1", gid);
}
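The snippet above relies on a write_to_file helper that is not shown. A minimal sketch of what such a helper could look like follows; this is a hypothetical stand-in, not the code from the referenced repository, and the 256-byte buffer and the int return value are assumptions. It formats the string first and issues a single write(2), which matters because the ID map files expect the whole mapping in one write.

```c
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for the helper used above: format a string and
 * write it to an existing file in one write(2) call.
 * Returns 0 on success, -1 on any error. */
static int write_to_file(const char *path, const char *fmt, ...)
{
    char buf[256];
    va_list ap;
    int len, fd;
    ssize_t written;

    va_start(ap, fmt);
    len = vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    if (len < 0 || (size_t)len >= sizeof(buf))
        return -1;              /* formatting failed or output truncated */

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    written = write(fd, buf, (size_t)len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}
```

Unlike the calls in the snippet above, this version reports failures through its return value, so a caller can stop early if one of the mappings is rejected.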

Example 2:

reference: https://github.com/pqlx/CVE-2022-1015/blob/721190651f60ab069d9aaa71967dd196912c201a/pwn.c#L473

puts("[+] Dropping into network namespace");

// We're too lazy to perform uid mapping and such.
char* new_argv[] = {
    "/usr/bin/unshare",
    "-Urn",
    argv[0],
    "EXPLOIT",
    NULL
};
execve(new_argv[0], new_argv, envp);
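Whichever way the namespace is entered, it can be worth confirming that CAP_NET_ADMIN is actually in the effective set before touching nf_tables. The following is a small sketch under the assumption that the caller has already read /proc/self/status into a string; the function name is made up, and 12 is the value of CAP_NET_ADMIN in <linux/capability.h>.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CAP_NET_ADMIN_BIT 12  /* CAP_NET_ADMIN in <linux/capability.h> */

/* Return 1 if capability `cap` is set in the CapEff line of a
 * /proc/<pid>/status dump, 0 if it is clear, -1 if the line is missing. */
static int capeff_has(const char *status_text, int cap)
{
    const char *line = strstr(status_text, "CapEff:");
    uint64_t mask;

    if (!line)
        return -1;
    /* strtoull skips the tab that follows the label. */
    mask = strtoull(line + strlen("CapEff:"), NULL, 16);
    return (int)((mask >> cap) & 1);
}
```

Calling capeff_has(status, CAP_NET_ADMIN_BIT) after the unshare step should return 1 if the namespace setup worked.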

Exploring Primitives:

Now, let’s look at the building blocks of the exploit. These building blocks, known as primitives, are nf_tables expressions that use registers, either reading data from them or writing data to them. Our task is to identify the primitives that give us the most useful out-of-bounds access.

nft_immediate_expr:

This expression lets us write constant data directly into registers, which in theory makes it a candidate for an Out-of-Bounds (OOB) write. However, there is a constraint: it can write at most 16 bytes at a time. This limitation is significant, because such a short length greatly restricts which register values pass the validation check.

For instance, with a length of 16 the smallest register value that passes the validation condition is 0xfffffffc (0xfffffffc * 4 + 0x10 wraps around to 0 in 32-bit arithmetic). This narrow range makes the expression hard to use for our purposes.

nft_payload:

The nft_payload expression is a powerful tool for an OOB write operation. It copies data from the packet directly into the registers. One of its advantages is that it can handle up to 0xff bytes at a time, which is the maximum achievable by any expression. Now, let's determine the boundaries of this capability.

At the lower bound, our len value is at its maximum of 0xff. At this point, the minimal register value that satisfies the validation condition is 0xffffffc1. Consequently, the smallest offset we can reach is 0xc1 * 4 = 0x304, relative to the beginning of nft_regs on the stack.

Conversely, the upper bound is reached when the low byte of our register value is at its peak, 0xff. In this scenario, the highest attainable length is 0x54, because 0x3fffffff * 4 + 0x54 = 0x50 in 32-bit arithmetic, and 0x50 is less than or equal to 0x50. As a result, the highest offset we can reach is 0xff * 4 + 0x54 = 0x450.

In summary, our window spans from the lowest offset of 0x304 to the highest offset of 0x450, which covers a total of 0x14c = 332 bytes of the stack.
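These bounds can be double-checked with a few lines of C. This is a sketch that assumes the check has the form reg * 4 + len <= 0x50 evaluated in 32-bit arithmetic and that only the low byte of the register value is kept afterwards; the helper names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the flawed check (32-bit arithmetic, limit 0x50). */
static bool passes_check(uint32_t reg, uint32_t len)
{
    return reg >= 4 && len != 0 && reg * 4u + len <= 0x50u;
}

/* Smallest low byte (the part that survives truncation to u8) that
 * still passes the check for a given length; -1 if none does. */
static int min_low_byte(uint32_t len)
{
    for (uint32_t low = 0; low <= 0xff; low++)
        if (passes_check(0xffffff00u | low, len))
            return (int)low;
    return -1;
}

/* Largest length accepted when the low byte is 0xff. 0x3fffffff is
 * used because reg * 4 only depends on the low 30 bits of reg. */
static uint32_t max_len_at_top(void)
{
    uint32_t best = 0;

    for (uint32_t len = 1; len <= 0xff; len++)
        if (passes_check(0x3fffffffu, len))
            best = len;
    return best;
}
```

With len = 0xff, min_low_byte returns 0xc1 (byte offset 0x304), and max_len_at_top returns 0x54, which gives the end offset 0xff * 4 + 0x54 = 0x450.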

+---------------------------------------------------------+
|                  Stack Memory                           |
|                                                         |
|          +-----------------------------------------+    |
| Offset   |              nft_regs                   |    |
|          |                                         |    |
| 0x000    +-----------------------------------------+    |
|          |                ...                      |    |
|          +-----------------------------------------+    |
|          |      0x000  |  0x000  |  0x000  |  0x000|    |
|          +-----------------------------------------+    |
|          |    ...                                  |    |
|          +-----------------------------------------+    |
| Offset   |  0x304     |                            |    |
|          +-----------------------------------------+    |
|          |    ...                                  |    |
|          +-----------------------------------------+    |
| Offset   |  0x450         |                        |    |
|          +-----------------------------------------+    |
|          |                                         |    |
|          +-----------------------------------------+    |
|          |                ...                      |    |
|          +-----------------------------------------+    |
|          |              nft_stack                  |    |
|          +-----------------------------------------+    |
|          |                                         |    |
|          +-----------------------------------------+    |
+---------------------------------------------------------+

This analysis equips us with a clear understanding of the potential of these primitives, allowing us to strategically leverage them for our exploration and exploitation endeavors.

nft_bitwise Expression:

The nft_bitwise expression is a powerful expression within nf_tables that allows performing bitwise operations on registers. This versatile expression enables the manipulation of data within specified registers, offering flexibility for various scenarios.

Consider a scenario where you want to copy specific data from one register to another without altering the data itself. This is where the nft_bitwise expression becomes valuable. By setting the operation type (op) to NFT_BITWISE_LSHIFT or NFT_BITWISE_RSHIFT and providing a data value of zero, you can effectively copy data between registers.

Here is the struct definition:

struct nft_bitwise {
	u8			sreg;
	u8			dreg;
	enum nft_bitwise_ops	op:8;
	u8			len;
	struct nft_data		mask;
	struct nft_data		xor;
	struct nft_data		data;
};

The sreg and len fields specify which registers the bitwise operation reads from, and dreg specifies where the result of the operation is written.
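To see why a shift by zero behaves like a plain copy, consider the simplified model below. It is not the kernel's implementation, only an illustration of the data flow (read len bytes starting at the source, shift, store at the destination); the function name and the guard for a zero shift are my own.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a left shift across several 32-bit registers.
 * Bits shifted out of a word are carried into the word before it. */
static void model_lshift(uint32_t *dst, const uint32_t *src,
                         size_t len, unsigned shift /* 0..31 */)
{
    size_t words = (len + 3) / 4;   /* round the byte length up to words */
    uint32_t carry = 0;

    for (size_t i = words; i > 0; i--) {
        uint32_t w = src[i - 1];

        dst[i - 1] = (w << shift) | carry;
        /* Guard shift == 0: shifting a 32-bit value by 32 is undefined in C. */
        carry = shift ? w >> (32 - shift) : 0;
    }
}
```

With shift set to 0 every word is stored unchanged, so the expression turns into a register-to-register copy of len bytes.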

Here’s an analogy:

Imagine you have two rooms (registers) with different items (data) in them. You want to transfer items from one room to another without changing the items themselves. The nft_bitwise expression acts like a careful transfer, ensuring the data remains intact during the process.

+------------------+    nft_bitwise    +------------------+
| Source Register  |  -------------->  | Destination Reg. |
|   (Room A)       |                   |    (Room B)      |
| Data: 0xA5B3     |                   | Data: 0x0000     |
+------------------+                   +------------------+

Bounds of Operation:

The nft_bitwise expression has specific boundaries for its operations. Our max length cannot be 0xff, because then both our sreg and dreg would have to be out-of-bounds, which we don’t want. So our length must be 0x40 = 64 at the maximum (16 data registers of 4 bytes each).

Lower Bound: With the maximum length of 0x40, the lowest usable register value is 0xfffffff0, because 0xfffffff0 * 4 + 0x40 = 0x00 < 0x50 in 32-bit arithmetic. Converted to a byte offset, that is 0xf0 * 4 = 0x3c0 relative to the beginning of nft_regs.
Upper Bound: The highest value the low byte of a register can have is 0xff. In that case 0x3fffffff * 4 + 0x40 = 0x3c < 0x50. Converted to a byte offset, that is 0xff * 4 + 0x40 = 0x43c.

So our range runs from offset 0x3c0 to offset 0x43c, a span of 0x7c bytes.

These are all expressions we needed for the exploitation.

Exploitation code reference :

Example 1 reference: https://github.com/ysanatomic/CVE-2022-1015/blob/1368b4e83f656a4cc868d85b61b8f048bef20752/exploit.c#L298C1-L318C2

static void add_bitwise(struct nftnl_rule *r, uint32_t shift_type, uint32_t expr_len,
    uint32_t expr_sreg, uint32_t expr_dreg, void* data, uint32_t data_len)
{
	if(expr_len > 0xff) {
		printf("Bitwise len is over 0xff \n");
		exit(EXIT_FAILURE);
	}
	/* ... rest of the function omitted; see the reference above ... */
}

Example 2 reference: https://github.com/pqlx/CVE-2022-1015/blob/721190651f60ab069d9aaa71967dd196912c201a/pwn.c#L66

static int calc_vuln_expr_params_div(struct vuln_expr_params* result, uint8_t desired, uint32_t min_len, uint32_t max_len, int shift)
{
    uint64_t base_ = (uint64_t)(1) << (32 - shift);
    uint32_t base = (uint32_t)(base_ - 1);
    if (base == 0xffffffff) {
        base = 0xfffffffb; // max actual value 
    }
    /* ... rest of the function omitted; see the reference above ... */
}

Exploitation Strategy

Our approach to exploitation is relatively straightforward. The netfilter hook a chain is attached to and the protocol of the packet that triggers it both change the stack layout, which determines what lies inside our out-of-bounds (OOB) read and write ranges. If the existing stack layout isn’t suitable, we can try different hook and protocol combinations until we find a layout that is. Our strategy involves these steps:

Identifying Favorable Hook and Protocol: We begin by searching for a suitable combination of netfilter hook and packet protocol. This combination should result in a kernel address falling within our OOB read range. This address will play a crucial role in further stages of our exploitation.
Leaking and Kernel Base Calculation: Once we’ve found the right hook and protocol, we perform a “leak” operation to extract the address we identified earlier. With this address, we can calculate the kernel’s base, which is a fundamental reference point for subsequent actions.
Optimizing OOB Write Layout: The next step involves finding another hook and protocol combination that ensures the stack layout at our OOB write range is favorable for our purposes. This favorable layout allows us to inject a complete Return-Oriented Programming (ROP) chain onto the stack.
Building and Injecting ROP Chain: With a suitable stack layout in place, we construct a ROP chain tailored to our requirements. This chain is meticulously crafted to perform the desired actions within the altered stack layout. Once prepared, we inject this ROP chain into the stack, accomplishing our goals.

Leaking a Kernel Address

To craft a robust exploit, our first task involves obtaining a leaked kernel image address.

Picture this: with KASLR enabled, the kernel image base is placed at one of 512 possible positions. It's a bit like a hidden treasure's location on a map, except that the map is kernel memory. In some attack scenarios, guessing with a 1 in 512 chance might be acceptable, but a reliable exploit needs the actual address.
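The figure of 512 and the later base calculation can be sketched as follows. The constants are assumptions about a typical x86_64 configuration (a 1 GiB randomization window starting at 0xffffffff80000000 with 2 MiB alignment), and symbol_offset is a hypothetical input that has to be looked up for the exact kernel build being targeted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86_64 defaults; verify them for the target configuration. */
#define KERNEL_TEXT_START 0xffffffff80000000ull
#define KASLR_RANGE       0x40000000ull  /* 1 GiB window */
#define KASLR_ALIGN       0x200000ull    /* 2 MiB steps  */

static unsigned kaslr_slot_count(void)
{
    return (unsigned)(KASLR_RANGE / KASLR_ALIGN);
}

/* Recover the image base from one leaked .text pointer.
 * symbol_offset is the distance of the leaked symbol from the image
 * base. Returns false if the result is not a plausible base. */
static bool kernel_base_from_leak(uint64_t leak, uint64_t symbol_offset,
                                  uint64_t *base_out)
{
    uint64_t base;

    if (leak < symbol_offset)
        return false;
    base = leak - symbol_offset;
    if (base < KERNEL_TEXT_START || base >= KERNEL_TEXT_START + KASLR_RANGE)
        return false;           /* outside the randomization window */
    if (base % KASLR_ALIGN != 0)
        return false;           /* not on a slot boundary: wrong offset? */
    *base_out = base;
    return true;
}
```

The alignment test doubles as a sanity check: if the computed base does not land on a 2 MiB boundary, the leaked value is probably not the symbol we assumed.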

Here’s where things get interesting. We’re going to use our nft_bitwise OOB (out-of-bounds) read primitive to read data from the stack. Think of it as a skillful chess move. We're hoping that somewhere within the 0x7c-byte range we can peek into, there is a valuable kernel address.

Now, let’s visualize this process:

                  +--------------------------------------------+
                  |                                            |
                  |                                            |
                  |             Kernel Image                   |
                  |                                            |
                  |                                            |
                  +--------------------------------------------+
                                                  ^            
                                                  |
                                                  |
              +----------------+   Stack Data     |     Interval we can read
              |   nft_bitwise  | <----------------+     (0x7c bytes)
              |     OOB read   |
              +----------------+

We’re focusing on the Kernel Image and our nft_bitwise OOB read operation. We're carefully manipulating the stack, hoping that somewhere within that range, a piece of the kernel's memory is waiting to be discovered. With a bit of luck and skill, we're aiming to extract that elusive kernel image address, a crucial step in building a stable exploit.

Remember, just like an adventurous treasure hunt, this process involves careful planning and strategy. The ultimate goal? Turning those odds into a reliable exploit that provides a pathway to our desired outcome.

Execution of nft_do_chain?

The nft_do_chain function plays a pivotal role. When a hook is "triggered," nft_do_chain is invoked to iterate through the rules within a chain and execute their expressions.

Upon analyzing the assembly generated by nft_do_chain, our focus is on identifying instructions that access the registers. This insight helps us determine the precise location of the registers on the stack.

Since we already have a breakpoint at nft_do_chain, let's disassemble the function and look into it.

The important instruction is at offset <+125>.

In the nft_do_chain function's do_chain section, we encounter a crucial line of code: regs.verdict.code = NFT_CONTINUE;. This particular line establishes the default verdict code, which you probably recognize as NFT_CONTINUE.

Let’s break down the significance of this action and its implications. Within the realm of nftables, verdict codes serve as decision signals that dictate how packets should be processed within a chain. They determine the course of action for a packet, whether it should proceed, break the chain, jump to another chain, and so on.

The verdict codes are enumerated, and you can find them listed as follows:

enum nft_verdicts {
	NFT_CONTINUE = -1, // -1 is 0xffffffff due to Two's Complement
	NFT_BREAK    = -2,
	NFT_JUMP     = -3,
	NFT_GOTO     = -4,
	NFT_RETURN   = -5,
};

So, when we encounter the instruction <+125>, which sets the verdict register to NFT_CONTINUE, it signifies that the default course of action is to allow the packet to continue through the chain. This is a pivotal step in determining the fate of the packet based on the rules and actions specified in the chain.
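This store is also what lets us locate the registers. In the sketch below, a simplified stand-in for the kernel's register file (the type names are mine, and the layout is reduced to the parts that matter here), the verdict overlays the start of the register array, so the stack slot that receives 0xffffffff is the beginning of nft_regs.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_REG32_NUM 20   /* 4 verdict slots + 16 data registers */

enum model_verdicts { MODEL_NFT_CONTINUE = -1 };

/* Simplified verdict: a code followed by a chain pointer. */
struct model_verdict {
    uint32_t code;
    void *chain;
};

/* Simplified register file: the verdict shares storage with the
 * first registers, so verdict.code sits at offset 0. */
struct model_regs {
    union {
        uint32_t data[MODEL_REG32_NUM];
        struct model_verdict verdict;
    };
};

/* The value written by "regs.verdict.code = NFT_CONTINUE". */
static uint32_t default_verdict_word(void)
{
    return (uint32_t)MODEL_NFT_CONTINUE;
}
```

Here default_verdict_word() yields 0xffffffff, the constant to look for in the disassembly, and the data array is 0x50 bytes long, matching the limit used in the bounds calculations.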

To visualize the concept, consider a simplified analogy: Imagine you’re directing traffic at an intersection. You have different signs (verdict codes) that signal whether a vehicle can proceed, stop, change lanes, or take a different route. The NFT_CONTINUE verdict is akin to the "Green Light," indicating that the packet should keep moving through the chain without hindrance.

      +----------------------------+
      |                            |
      |    +--> NFT_CONTINUE -->   |
      |    |                       |
      |    |                       |
      |    |                       |
Packet+--> |   Chain Processing    |
      |    |                       |
      |    |                       |
      |    |                       |
      |    +--> NFT_JUMP ----------+
      |                            |
      +----------------------------+

Moreover, the instructions at <+343> and <+348> compare the verdict against NFT_CONTINUE. This check is what keeps a packet moving through the chain only as long as no expression has changed the verdict.
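The overall control flow can be pictured with a toy evaluation loop. This is a heavily simplified sketch and not the real nft_do_chain: the types, the three sample expressions, and the verdict constants are invented for illustration, and the real function handles jumps, tracing, and chain policy as well.

```c
#include <stddef.h>
#include <stdint.h>

enum { V_CONTINUE = -1, V_BREAK = -2, V_ACCEPT = 1 };  /* toy verdicts */

struct toy_regs { int32_t verdict; uint32_t data[16]; };
typedef void (*toy_expr_fn)(struct toy_regs *regs);
struct toy_rule { const toy_expr_fn *exprs; size_t n_exprs; };

/* Three sample expressions. */
static void toy_load(struct toy_regs *regs)   { regs->data[0] = 0x41; }
static void toy_break(struct toy_regs *regs)  { regs->verdict = V_BREAK; }
static void toy_accept(struct toy_regs *regs) { regs->verdict = V_ACCEPT; }

/* Run every expression of every rule, re-checking the verdict after
 * each expression. */
static int toy_do_chain(const struct toy_rule *rules, size_t n_rules,
                        struct toy_regs *regs)
{
    regs->verdict = V_CONTINUE;             /* default verdict */

    for (size_t r = 0; r < n_rules; r++) {
        for (size_t e = 0; e < rules[r].n_exprs; e++) {
            rules[r].exprs[e](regs);
            if (regs->verdict != V_CONTINUE)
                break;                      /* stop this rule early */
        }
        if (regs->verdict == V_BREAK) {
            regs->verdict = V_CONTINUE;     /* rule did not match: next rule */
            continue;
        }
        if (regs->verdict != V_CONTINUE)
            return regs->verdict;           /* final decision */
    }
    return V_CONTINUE;                      /* end of chain reached */
}
```

A rule whose first expression breaks never runs its remaining expressions, while a later rule can still issue the final verdict.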

With this, we know where on the stack nft_regs is located.

After spending a good amount of time on this, it became clear that reading the registers directly from nft_do_chain is not going to be an easy task.

A simpler approach is to use the nft_bitwise OOB read primitive to copy stack data into our registers. Since we can read an interval of up to 0x7c bytes, it’s highly likely that a kernel address is present within this range.

Address leaking is now a straightforward process. Once a .text address lies within our Out-of-Bounds (OOB) read range, we attach a chain to the output hook and send a UDP packet to ourselves on the loopback interface. The rule on that chain needs the right expressions to copy the address out of the stack and hand it back to us.

Let’s break down the procedure with an example:

Imagine you’re in a building (the kernel memory) and you want to send a secret message (the leaked address) to your friend who is in another room (a user-space program). You can do this by using a set of hand signals (expressions) that you both understand.

1. Copying the Address to Registers: To initiate this covert communication, you first copy the secret message onto your hand (registers). This way, you’re ready to reveal it when the time is right.

2. Writing the Address to Payload: Next, you carefully inscribe the secret message onto a small note (UDP packet’s payload). This note is then attached to a paper airplane (UDP packet) that you throw across the hallway (network).

3. Listening and Receiving: Your friend eagerly awaits by the window, ready to catch the paper airplane (UDP packet). As soon as the paper airplane arrives, your friend retrieves the note and reads the secret message.

In terms of the expressions mentioned:

- The bitwise expression can be likened to encoding the secret message onto your hand. It prepares the information to be utilized.

- The payload_set expression corresponds to crafting the note and attaching it to the paper airplane. The specific instructions ensure that the secret message is accurately placed.

- The UDP listener represents your friend’s vigilance by the window, waiting to catch the paper airplane.

Here’s a simplified diagram to illustrate the process:

+-----------------+                    +-----------------+
|   Your Room     |                    |   Friend's Room |
| (Kernel Memory) |                    |(User-space prog)|
+-----------------+                    +-----------------+
        |                                      ^
        | (1) Copy the Address to Registers    |
        |                                      |
        v                                      |
+----------------------------------------------+
|    Expressions and Rules (Secret Message)   |
|                                              |
| +------------------------------------------+ |
| |  Bitwise Expression (Encode)             | |
| |  Payload Set Expression (Note Creation)  | |
| +------------------------------------------+ |
|                                              |
+----------------------------------------------+
        |                                      ^
        | (2) Write Address to Payload (Note)  |
        |                                      |
        v                                      |
+----------------------------------------------+
|           UDP Packet (Paper Airplane)         |
|                                              |
| +------------------------------------------+ |
| |             UDP Listener (Window)        | |
| +------------------------------------------+ |
|                                              |
+----------------------------------------------+
        |                                      ^
        | (3) Receive and Read the Address     |
        |                                      |
        v                                      |
+-----------------------+             +------------------+
|  Friend Retrieves the |             | Secret Message  |
|   Note from Airplane  |             |  (Leaked Address)|
+-----------------------+             +------------------+

So, with carefully orchestrated expressions and rules (hand signals and notes), we can covertly communicate a specific address, which allows us to bypass KASLR and progress toward achieving Local Privilege Escalation.
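On the receiving side, pulling the address out of the datagram is the easy part. The sketch below assumes the rule placed the 8-byte value at a known offset inside the UDP payload; that offset is a hypothetical parameter here, and the plausibility test assumes an x86_64 kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy the 8 bytes at `off` out of a received payload. memcpy keeps the
 * host byte order, which matches how the value sat on the stack. */
static bool extract_leak(const uint8_t *payload, size_t len, size_t off,
                         uint64_t *out)
{
    if (off > len || len - off < sizeof(*out))
        return false;           /* value would run past the payload */
    memcpy(out, payload + off, sizeof(*out));
    return true;
}

/* Rough plausibility test: on x86_64, kernel text pointers have
 * 0xffffffff in their upper 32 bits. */
static bool looks_like_kernel_text(uint64_t addr)
{
    return (addr >> 32) == 0xffffffffull;
}
```

If looks_like_kernel_text rejects the value, the chosen hook and protocol combination most likely did not place a kernel address in the readable range.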

Having worked out how to expose a kernel address, the next objective is Arbitrary Code Execution. In the earlier discussion of primitives, we identified the nft_payload expression as the most suitable candidate for an Out-of-Bounds (OOB) write, because it can write up to 0xff bytes, which is just under 32 eight-byte words. The exploit aims to use this to write approximately 20 or more words onto the stack while keeping the system stable.

Patch Diffing

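Kernels from version 5.12 onward are susceptible; the vulnerability was introduced by commit 345023b0db31. It was addressed and fixed in version 5.17, which can be seen in commit 6e1acfa387b9.

The first commit is the code modification that introduced the vulnerability. Previously, the return value of nft_parse_register was implicitly downcast to a u8 by assignment to priv->dreg, so the bounds check only ever saw a value between 0 and 0xff. After the change, the check runs on the full 32-bit value and the truncation to u8 happens afterwards, which is what makes the integer overflow exploitable.

To mitigate this vulnerability, the fix validates the register value supplied by user space before it is used: nft_parse_register now rejects any value outside the known register ranges with -ERANGE instead of passing it through.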
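The effect of the fix can be modeled in user space as shown below. This is a sketch modeled on the upstream change rather than a copy of it: the function name is mine, and the numeric constants are the register numbers and sizes as I understand them from the nf_tables user-space API headers.

```c
#include <errno.h>
#include <stdint.h>

/* Register numbering as exposed to user space. */
#define REG_4       4u    /* last 128-bit register (NFT_REG_4)     */
#define REG32_00    8u    /* first 32-bit register (NFT_REG32_00)  */
#define REG32_15    23u   /* last 32-bit register (NFT_REG32_15)   */
#define REG_SIZE    16u   /* bytes in a 128-bit register           */
#define REG32_SIZE  4u    /* bytes in a 32-bit register            */

/* Model of the patched parser: anything outside the two known ranges
 * is rejected instead of being passed on to the bounds check. */
static int model_parse_register_fixed(uint32_t reg, uint32_t *preg)
{
    if (reg <= REG_4) {                         /* verdict .. NFT_REG_4 */
        *preg = reg * REG_SIZE / REG32_SIZE;
        return 0;
    }
    if (reg >= REG32_00 && reg <= REG32_15) {   /* 32-bit registers */
        *preg = reg + REG_SIZE / REG32_SIZE - REG32_00;
        return 0;
    }
    return -ERANGE;
}
```

With this logic, a value such as 0xfffffff4 never reaches the length check at all, so the integer overflow can no longer be triggered from user space.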

Final Thoughts

Exploring this vulnerability was a captivating journey of rediscovery. While the nf_tables codebase may appear intricate initially, it reveals its simplicity once you become familiar with its nuances.

The process of exploitation was akin to an enriching educational adventure, albeit with occasional challenges. Particularly intriguing was the pursuit of an opportune hook point in the stack, offering favorable conditions for exploitation.

I extend my sincere gratitude to David Bouman and Yordan (anatomic) for the exploit and detailed analysis, which served as a valuable guide, especially a comprehensive overview of nf_tables that kickstarted my investigative journey.

I trust that reading this account was as delightful for you as it was for me to craft it.

Also note that the exploit often does not work out of the box: gaining root on a specific vulnerable kernel build is unlikely without adjustments. It requires thorough experimentation with various chain hook placements (such as input vs. output), adjustments to the nft_bitwise address leak offsets, and meticulous positioning of ROP gadgets and symbols.

Luckily, in my case Yordan's exploit worked without any such adjustments, but if you run into issues, update the offsets, chains, bitwise address, and symbols for your kernel.