In our previous research, we used WinAFL to fuzz user-space applications running on Windows, and found over 50 vulnerabilities in Adobe Reader and Microsoft Edge.
For our next challenge, we decided to go after something bigger: fuzzing the Windows kernel. As an added bonus, we can take our user-space bugs and use them together with any kernel bugs we find to create a full chain – because RCEs without a sandbox escape/privilege escalation are pretty much worthless nowadays.
With a target in mind, we set out to explore the kernel fuzzer landscape, see what options we had in pursuit of our goal, and perhaps heavily modify existing tools to better suit our needs.
We have plenty of experience with AFL and WinAFL, so we started our journey looking for a similar fuzzer that can be used to attack the Windows kernel.
A short Google search inevitably brought us to kAFL. AFL with a `k` prefix sounded like exactly what we needed.
kAFL is a research fuzzer from Ruhr-Universität Bochum that leverages AFL-style fuzzing to attack OS kernels. At first sight, it seemed to be exactly what we were looking for. kAFL supports Linux, macOS, and Windows and was used to find vulnerabilities in the Linux kernel's Ext4 file system and in macOS.
kAFL has similar principles to AFL, but since it targets OS kernels, it needs to do more work around the fuzzing loop. The fuzzing loop is the process where, in each cycle, one test case is tested against its target and the feedback is processed (see Figure 1).
Figure 1: Fuzzing loop cycle.
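To make the cycle concrete, here is a minimal sketch of a coverage-guided loop in Go. It is not kAFL's code: the `Target` function stands in for the agent, the target driver and the Intel PT trace, and the single bit-flip mutation is a placeholder for AFL's much richer strategies. All names are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Target runs one input and returns the coverage points it hit. In kAFL this
// role is played by the agent, the target driver and the Intel PT trace.
type Target func(input []byte) map[uint32]bool

// mutate flips one random bit in a copy of the input: a tiny stand-in for
// AFL's much richer mutation strategies.
func mutate(rng *rand.Rand, in []byte) []byte {
	out := append([]byte(nil), in...)
	if len(out) == 0 {
		return []byte{byte(rng.Intn(256))}
	}
	out[rng.Intn(len(out))] ^= 1 << uint(rng.Intn(8))
	return out
}

// fuzz runs the loop: pick a queued input, mutate it, execute it, and keep the
// mutant only if it reached coverage that no earlier input reached.
func fuzz(target Target, seeds [][]byte, iterations int, seed int64) [][]byte {
	rng := rand.New(rand.NewSource(seed))
	queue := append([][]byte(nil), seeds...)
	if len(queue) == 0 {
		return queue
	}
	seen := map[uint32]bool{}
	for _, s := range queue {
		for pc := range target(s) {
			seen[pc] = true
		}
	}
	for i := 0; i < iterations; i++ {
		child := mutate(rng, queue[rng.Intn(len(queue))])
		novel := false
		for pc := range target(child) {
			if !seen[pc] {
				seen[pc] = true
				novel = true
			}
		}
		if novel {
			queue = append(queue, child)
		}
	}
	return queue
}

func main() {
	// Toy target: the "coverage point" is simply the value of the first byte.
	toy := func(in []byte) map[uint32]bool { return map[uint32]bool{uint32(in[0]): true} }
	queue := fuzz(toy, [][]byte{{0}}, 1000, 1)
	fmt.Println("inputs kept:", len(queue))
}
```

With the toy target, every new first-byte value counts as new coverage, so the queue only grows when an input reaches something new. This keep-if-novel rule is what lets AFL-style fuzzers work their way into deeper code.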
When kAFL first starts, the fuzzer (1) spawns multiple virtual machines running the target OS from a saved state. In the VM snapshot, there is a preloaded agent (2) running inside the VM.
The agent (2) and the fuzzer (1) cooperate to drive the fuzzing process forward. The agent runs in user space, communicates with the fuzzer through hypercalls, and sends it the address ranges of the target driver. These ranges restrict coverage tracing to the code the agent specified.
At the beginning of the loop, the fuzzer sends an input (3) to the agent through shared memory. kAFL uses a mutation strategy similar to AFL’s to generate new inputs.
Next, the agent notifies the hypervisor to start (4) collecting coverage. Then the agent sends (5) the inputs to a target kernel component: for example, if we are targeting a driver named test.sys (6) that is responsible for parsing compressed images, the agent sends generated input to the driver to test it.
Finally, the agent asks to stop (7) collecting coverage from KVM (8), and the fuzzer processes the coverage trace. kAFL’s coverage implementation uses Intel Processor Trace (IntelPT or IPT) for the coverage feedback mechanism.
When the guest OS tries to start, stop or (9) collect coverage, it issues a hypercall to KVM.
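The address ranges and the Intel PT trace come together when the trace is turned into feedback. The sketch below assumes the trace has already been decoded into a sequence of basic-block addresses (real IPT decoding is far more involved), drops blocks outside the traced driver, and folds the rest into an AFL-style edge bitmap. The block-address hash and all names are illustrative assumptions, not kAFL's implementation.

```go
package main

import "fmt"

// Range is a [Start, End) address range of a traced driver, as reported by the agent.
type Range struct{ Start, End uint64 }

// mapSize matches AFL's default 64 KiB coverage bitmap.
const mapSize = 1 << 16

func inRanges(ip uint64, ranges []Range) bool {
	for _, r := range ranges {
		if ip >= r.Start && ip < r.End {
			return true
		}
	}
	return false
}

// edgeBitmap folds a decoded sequence of basic-block addresses into an
// AFL-style edge bitmap, dropping blocks outside the traced ranges.
func edgeBitmap(trace []uint64, ranges []Range) []byte {
	bitmap := make([]byte, mapSize)
	var prev uint64
	for _, ip := range trace {
		if !inRanges(ip, ranges) {
			continue
		}
		cur := (ip ^ ip>>16) & (mapSize - 1) // hypothetical block-address hash
		bitmap[cur^prev]++                  // hit counter for edge prev -> cur (wraps at 256)
		prev = cur >> 1                     // AFL's shift keeps A->B distinct from B->A
	}
	return bitmap
}

// edgesHit counts the distinct edges recorded in the bitmap.
func edgesHit(bitmap []byte) int {
	n := 0
	for _, b := range bitmap {
		if b != 0 {
			n++
		}
	}
	return n
}

func main() {
	driver := []Range{{Start: 0x1000, End: 0x2000}}
	trace := []uint64{0x1000, 0x1010, 0x9000} // 0x9000 is outside the driver
	fmt.Println("edges hit:", edgesHit(edgeBitmap(trace, driver)))
}
```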
kAFL’s crash detection mechanism (see Figure 2) works as follows:
Figure 2: kAFL crash detection.
The agent (1) inside the VM issues a hypercall (2) to KVM with the addresses of BugCheck and BugCheckEx. KVM (3), in turn, patches (4) these addresses with a shellcode (5) that issues a hypercall when executed.
Therefore, when the machine encounters a bug, the kernel calls the patched versions of BugCheck or BugCheckEx to issue the hypercall to notify (6) the fuzzer of a crash.
Now that we understood the mechanisms, we considered how they could be adapted to our needs in a Windows environment.
Our focus is on parts that are accessible from user space. These parts are fairly complicated and can be used for local Privilege Escalation (PE).
From our experience, AFL is good for the following targets:
Fast targets that can perform more than 100 iterations per second.
Parsers – especially for binary formats.
We looked for suitable targets in the Windows kernel (Figure 3).
Figure 3: Windows kernel components.
These are the targets we had in mind:
File systems such as NTFS, FAT, VHD, and others.
Crypto/Code integrity (CI).
Fonts (which were moved to user space starting with Windows 10).
A typical kernel bug in Windows
We took a step back and looked at a fairly typical kernel bug – CVE-2018-0744:
Figure 4: A typical bug in win32k.
This program contains multiple system calls that take as input highly structured data such as structs, constants (magic numbers), function pointers, strings, and flags. In addition, there is a dependency between system calls: the output of one syscall is used as the input for other syscalls. This type of structure is very common in the case of kernel bugs, where a sequence of syscalls is used to reach a buggy state where a vulnerability is triggered.
More on the importance of structure-aware fuzzing, with examples, can be found here.
Windows kernel attack surface: kAFL vs. syscall fuzzer
After we observed the bug described above, we realized that using an AFL-style fuzzer would limit us to relatively small parts of the kernel. Most of the Windows kernel is reachable through syscalls that take highly structured data, whereas kAFL would confine us to the kernel's binary parsers, such as device drivers, file systems, the PE format, the registry and others. These parts are relatively small compared to the amount of code reachable from syscalls. With a syscall fuzzer, we could potentially reach many more attack surfaces, such as virtual memory management, the process manager, graphics, User, GDI, security, networking and many more.
At this point, we realized that we needed to look for a syscall fuzzer.
Syzkaller is a coverage-guided, structure-aware kernel fuzzer (a.k.a. a smart syscall fuzzer). It supports several operating systems and runs on multiple machine types (QEMU, GCE, mobile phones, …) and multiple architectures (x86-64, aarch64). To date, Syzkaller has found 3,700 bugs in the Linux kernel, and conservative estimates are that 1 in 6 of them are security bugs.
Syzkaller is a structure-aware fuzzer, meaning that it has a description for each syscall. Syscall descriptions are written in text files using a `go`-like syntax. Syz-sysgen, one of the Syzkaller tools, parses and formats these descriptions. When it completes successfully, it transforms the text files into `go` code that is compiled together with the fuzzer code into an executable called syz-fuzzer.
Syz-fuzzer is the main executable for driving the fuzzing process inside the guest VM.
Syzkaller has its own syntax to describe programs, syscalls, structs, unions and more. The generated programs are also called syz programs. An example can be found here.
Syzkaller employs a few strategies for mutating existing programs. It saves the programs that produce new code coverage, in syz format, in a database known as the corpus. The corpus allows us to stop the fuzzer, make our changes, and then resume from where we stopped.
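The keep-if-new-coverage rule and the ability to resume can be sketched in a few lines of Go. This is a simplified model with hypothetical names, not Syzkaller's corpus database: coverage is a set of integers, and after a restart the saved programs are simply re-run to rebuild the coverage set.

```go
package main

import "fmt"

// Corpus keeps every program that produced coverage no earlier program reached.
type Corpus struct {
	progs  []string        // programs in syz text form
	signal map[uint64]bool // every coverage point seen so far
}

func NewCorpus() *Corpus { return &Corpus{signal: map[uint64]bool{}} }

// Add saves prog if cover contains at least one new point and reports whether it did.
func (c *Corpus) Add(prog string, cover []uint64) bool {
	novel := false
	for _, pc := range cover {
		if !c.signal[pc] {
			c.signal[pc] = true
			novel = true
		}
	}
	if novel {
		c.progs = append(c.progs, prog)
	}
	return novel
}

// Restore rebuilds a corpus after a restart by re-running the saved programs,
// since in this sketch only the programs (not their coverage) are persisted.
func Restore(saved []string, run func(string) []uint64) *Corpus {
	c := NewCorpus()
	for _, p := range saved {
		c.Add(p, run(p))
	}
	return c
}

func main() {
	c := NewCorpus()
	fmt.Println(c.Add("r0 = open(...)", []uint64{1, 2}))            // true: new coverage
	fmt.Println(c.Add("r0 = open(...)\nclose(r0)", []uint64{1, 2})) // false: nothing new
	fmt.Println(c.Add("r0 = open(...)\nread(r0)", []uint64{2, 3}))  // true: point 3 is new
}
```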
Figure 5: Syzkaller architecture (Linux).
Syzkaller’s main binary is called syz-manager (1). When it starts, it performs the following actions:
Load the corpus (2) of programs from earlier runs, start multiple test (3) machines, copy the executor (6) and fuzzer (5) binaries to each machine using ssh (4), and execute syz-fuzzer (5).
Syz-fuzzer (5) then fetches the corpus from the manager and starts generating programs. Each program is sent back to the manager for safekeeping in case of a crash. Syz-fuzzer then sends the program through IPC (7) to the executor (6), which runs the syscalls (8) and collects coverage from the kernel (9) (KCOV in the case of Linux).
KCOV is a compile-time instrumentation feature that lets user space collect per-thread code coverage across the entire kernel. If a new coverage trace is detected, the fuzzer (11) reports back to the manager.
Syzkaller aims to be an unsupervised fuzzer, which means that it tries to automate the entire fuzzing process. For example, when a crash occurs, Syzkaller spawns multiple reproducer machines to isolate the crashing syz program from the program log. The reproducers try to minimize the crashing program as much as possible. When the process is complete, Syzkaller usually produces either a syz program or C code that reproduces the crash. Syzkaller can also extract a list of maintainers from git and email them the details of the crash.
Syzkaller supports the Linux kernel and has impressive results. Looking at Syzkaller, we thought to ourselves: if only we could fuzz the Linux kernel on Windows. This led us to explore WSL.
Windows Subsystem for Linux (WSL) is a compatibility layer for running Linux binaries natively on Windows. It translates Linux syscalls into Windows API calls. The first version was released in 2016 and includes two drivers: lxcore and lxss.
It was designed to let developers run bash and core Linux commands.
WSLv1 uses lightweight processes called pico processes to host Linux binaries, and dedicated drivers called pico providers to handle the syscalls from those pico processes (for more information see here: 1, 2).
Because WSL closely resembles the Linux kernel, we could re-use most of the existing Linux grammar, as well as the syz-executor and syz-fuzzer binaries, which are compatible with the Linux environment.
We wanted to find bugs for Privilege Escalation (PE), but WSL v1 is not shipped by default and might be difficult to exploit from a sandbox, since it runs in a different type of process (a pico process). Still, we thought it would be better to first gain some experience with Syzkaller on Windows with minimal changes.
And the porting began
We first installed Ubuntu from the Microsoft Store. We then added an SSH server with `apt install openssh-server` and configured SSH keys. Next, we wanted to add coverage tracing support. Unfortunately, the Windows kernel is closed source and doesn't provide compile-time instrumentation like KCOV in Linux.
We thought of a few alternatives that would help us get coverage trace:
Using an emulator like QEMU / BOCHS and adding coverage instrumentation.
Using static binary instrumentation like in pe-afl.
Using a hypervisor with coverage sampling like in apple-pie.
Using hardware support for coverage like Intel-PT.
We decided to use Intel-PT because it provides traces for compiled binaries at run time, it's relatively fast, and it supplies full coverage information, meaning we can get the starting Instruction Pointer (IP) of each basic block we visited, in the order it was executed.
Using Intel-PT from inside our VM, where the target OS runs, requires a few modifications to KVM.
We used large parts of kAFL's KVM patches to support coverage with Intel-PT.
In addition, we created a KCOV-like interface through hypercalls, so when the executor tries to start, stop or collect coverage, it issues hypercalls.
We needed a bug oracle to enable us to detect crashes. The Syzkaller crash detection mechanism reads the output of the VM console and relies on pre-defined regular expressions to detect kernel panics, warnings, etc.
We needed a crash detection mechanism for our port that would print a warning to the output console for Syzkaller to catch.
To detect BSODs, we used kAFL's technique. We patched BugCheck and BugCheckEx with a shellcode that issues a hypercall and notifies that a crash happened by writing a unique message to the QEMU output console.
We added a regex to syz-manager to detect crash messages in QEMU's output console. To improve bug detection in the kernel, we also used Driver Verifier with Special Pool to detect pool corruptions (`verifier /flags 0x1 /driver lxss.sys lxcore.sys`).
A common issue with fuzzers is that they encounter the same bug many times. To avoid duplicate bugs, Syzkaller requires a unique output for each crash. Our first approach was to extract from the stack a few relative addresses that fall within the module ranges we trace, and print them to the QEMU output console.
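A minimal Go sketch of the console-matching side follows. The marker string `SYZ-WIN-CRASH: bugcheck=...` is invented for this example (the actual message is not given here), and this is a standalone function, not Syzkaller's own report-parsing code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// crashRe matches the marker a (hypothetical) bugcheck hypercall handler
// prints to QEMU's console, e.g. "SYZ-WIN-CRASH: bugcheck=0xc2".
var crashRe = regexp.MustCompile(`SYZ-WIN-CRASH: bugcheck=(0x[0-9a-fA-F]+)`)

// detectCrash scans console output and returns the bugcheck code if a crash
// marker is present.
func detectCrash(console []byte) (uint64, bool) {
	m := crashRe.FindSubmatch(console)
	if m == nil {
		return 0, false
	}
	code, err := strconv.ParseUint(string(m[1]), 0, 64) // base 0 accepts the 0x prefix
	if err != nil {
		return 0, false
	}
	return code, true
}

func main() {
	out := []byte("Windows booted\nSYZ-WIN-CRASH: bugcheck=0xc2\n")
	if code, ok := detectCrash(out); ok {
		fmt.Printf("crash detected, bugcheck 0x%x\n", code)
	}
}
```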
Figure 6: Symbolizer #1 result.
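The first symbolizer can be approximated like this: scan raw stack values, keep the first few that point into a traced module, and print them as module+offset so the signature does not depend on where the driver was loaded. The module layout and addresses below are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Module is a loaded driver we trace, e.g. lxcore.sys.
type Module struct {
	Name       string
	Base, Size uint64
}

// signature keeps the first n stack values that point into a traced module and
// rewrites them as module+offset, which stays stable when the load base changes.
func signature(stack []uint64, mods []Module, n int) string {
	var parts []string
	for _, v := range stack {
		if len(parts) >= n {
			break
		}
		for _, m := range mods {
			if v >= m.Base && v-m.Base < m.Size {
				parts = append(parts, fmt.Sprintf("%s+0x%x", m.Name, v-m.Base))
				break
			}
		}
	}
	return strings.Join(parts, " ")
}

func main() {
	mods := []Module{{Name: "lxcore", Base: 0xfffff80000000000, Size: 0x100000}}
	stack := []uint64{0x41414141, 0xfffff80000001230, 0xfffff80000004560, 0xfffff80000007890}
	fmt.Println(signature(stack, mods, 2)) // lxcore+0x1230 lxcore+0x4560
}
```

Because every stack value that happens to fall inside a module is treated as a frame, stale return addresses and other leftovers can leak into the signature, which makes crashes hard to analyze from this output alone.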
Before running the fuzzer, we wanted to make sure that it can actually find a real bug, as otherwise we are just wasting CPU time. Unfortunately, at the time we couldn’t find a public PoC of a real bug to perform this test.
Therefore, we decided to patch a specific flow in one of the syscalls to emulate a bug. The fuzzer found it, which was a good sign, so we started fuzzing.
First fuzzing attempt
A short time after we started the fuzzer, we noticed a crash with this error message: CRITICAL_STRUCTURE_CORRUPTION. We quickly found out that it was due to PatchGuard. Our crash detection mechanism followed kAFL's approach of patching BugCheck and BugCheckEx with shellcode that issues a hypercall on a crash, and modifying kernel code like this is exactly what PatchGuard was designed to catch.
To work around this issue, we added a driver that starts at boot and registers a bugcheck callback with ntos using KeRegisterBugCheckCallback. Now, when the kernel crashes, it calls our driver, which then issues a hypercall notifying the fuzzer of the crash.
We ran the fuzzer again and got a new bug with a different error code. We tried to reproduce the crash to help us understand it, and discovered that performing root cause analysis from offsets and random junk off the stack is difficult. We decided that we needed a better approach to get crash information.
We tried to run `kd` on our host machine under Wine to produce a call stack, but that didn’t work well, as it took around 5 minutes to generate the call stack.
This approach created a bottleneck for our fuzzer. During reproduction, Syzkaller attempts to minimize the crashing program(s) as much as possible, and it waits for the call stack after each minimization attempt to determine whether it's the same crash. Therefore, we decided to use a remote Windows machine with KD and tunnel all the UDP connections there. That actually worked well, but when we scaled it up to 38 machines, connections were dropped and Syzkaller interpreted them as hangs.
At this point, we asked ourselves, how are KD and WinDBG able to generate a call stack?
The answer: they use StackWalk from DbgHelp.dll.
To generate a call stack, we need the StackFrame, ContextRecord and ReadMemoryRoutine.
Figure 7: Symbolizer architecture.
Figure 7 shows the architecture:
We retrieved the stack, registers and driver addresses from the guest and passed them through KVM back to QEMU.
QEMU sent them to a remote Windows machine, where our symbolizer called StackWalk with all the relevant arguments and retrieved a call stack.
Having a Windows machine running alongside our fuzzer is not ideal, so we wondered how hard it would be to implement a minimal kernel debugger in `go` and compile it into Syzkaller.
We started with a PDB parser and fetcher. After that, we implemented an x64 stack unwinder using the unwind information stored in the PE.
The last part was implementing KD over serial, which was pretty slow, so we started working on KDNET; once it was finished, we integrated it into Syzkaller.
This solution was far better than the previous ones. Our de-duplication mechanism is now based on the faulting frame, and we also get the BugCheck error code, registers and a call stack.
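A faulting-frame key can be sketched as follows: skip the frames that belong to the bugcheck path itself and use the first remaining frame, together with the bugcheck code, as the crash title. The skip list and the title format are assumptions for illustration.

```go
package main

import "fmt"

// Frame is one symbolized call-stack entry.
type Frame struct{ Module, Function string }

// skipFrames lists frames on the crash-reporting path rather than at the bug.
// The exact list is an assumption chosen for illustration.
var skipFrames = map[string]bool{
	"nt!KeBugCheckEx":       true,
	"nt!KiBugCheckDispatch": true,
	"nt!KiPageFault":        true,
	"nt!MmAccessFault":      true,
}

// crashTitle builds a de-duplication key from the bugcheck code and the first
// frame that is not part of the bugcheck machinery.
func crashTitle(bugcheck uint32, stack []Frame) string {
	for _, f := range stack {
		name := f.Module + "!" + f.Function
		if !skipFrames[name] {
			return fmt.Sprintf("BugCheck 0x%x in %s", bugcheck, name)
		}
	}
	return fmt.Sprintf("BugCheck 0x%x (no usable frame)", bugcheck)
}

func main() {
	stack := []Frame{{"nt", "KeBugCheckEx"}, {"nt", "KiPageFault"}, {"win32kfull", "_OpenClipboard"}}
	fmt.Println(crashTitle(0x50, stack)) // BugCheck 0x50 in win32kfull!_OpenClipboard
}
```

Instruction offsets are deliberately left out of the title, so the same bug reached at slightly different instructions within one function still collapses into a single report.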
Another issue we encountered was coverage stability. Syzkaller uses multiple threads to find data races. For example, when a generated program has 4 syscalls, it can divide it into two threads so one thread runs syscalls 1 and 2 and the other thread runs syscalls 3 and 4.
In our coverage implementation, we used one buffer per process. Because the threads' records were interleaved in that single buffer in whatever order the scheduler happened to run them, running the same program multiple times resulted in a different coverage trace on each run.
Coverage instability hurts the fuzzer's ability to find new and interesting code paths and, ultimately, bugs.
We wanted to fix this issue by changing our coverage implementation to be similar to KCOV’s implementation.
We knew that KCOV tracks coverage per thread, and we wanted the same mechanism.
To create KCOV-like traces, we need:
Tracking threads in KVM for swapping buffers.
Adding thread handle awareness to our KCOV hypercall API.
For tracking threads, we needed a hook for context switches. We know that we can get the current thread from the GS segment register:
Figure 8: KeGetCurrentThread function.
We looked at what happens during a context switch and found the swapgs instruction in the function that handles context switches. When swapgs occurs, it causes a VM exit that the hypervisor can catch.
Figure 9: swapgs inside the SwapContext function.
This means that if we can track swapgs, we can also monitor the thread swaps in KVM.
This looked like a good hooking point to monitor the context switch and handle IntelPT for traced threads.
So we removed the code that disables interception of MSR_KERNEL_GS_BASE.
Figure 10: MSR intercept.
That gave us a hook for switching ToPA buffers at each context switch. The ToPA (Table of Physical Addresses) entries tell Intel-PT the physical addresses where it can write the trace output.
We still had a few more minor issues to deal with:
Disabling auto-loaded programs and unnecessary services to make boot faster.
Windows Update randomly restarted our machines and consumed lots of CPU.
Windows Defender randomly killed our fuzzer.
In general, we adjusted our guest machine for best performance.
WSL Fuzzing Results
Overall, we fuzzed WSL for 4 weeks with 38 vCPUs. At the end, we had a working prototype and a much better understanding of how Syzkaller works.
We found 4 DoS bugs and a few deadlocks. However, we didn't find any security vulnerabilities, which was disappointing, so we decided to move on to a real PE target.
Moving to a real target
Fuzzing WSL was a good way to get to know Syzkaller on Windows. But at this point, we wanted to go back to a real Privilege Escalation target – one that is shipped with Windows by default and accessible from a variety of sandboxes.
We looked at the Windows kernel attack surface and decided to start with Win32k. Win32k is the kernel side of the Windows subsystem, which is the GUI infrastructure of the operating system. It is also a common target for Local Privilege Escalation (LPE) because it’s accessible from many sandboxes.
It includes the kernel side of two subsystems:
The Window Manager also known as User.
The Graphic Device Interface also known as GDI.
It has many syscalls (~1200), which makes it a good target for grammar-based fuzzers (as shown earlier with CVE-2018-0744). Starting from Windows 10, win32k is divided into multiple drivers: win32k, win32kbase and win32kfull.
To make Syzkaller work for win32k we had to change a few things:
Compile fuzzer and executor binaries to Windows.
OS related changes.
Exposing Windows syscalls to the fuzzer.
Cross-compiling with MinGW for convenience.
Starting with the fuzzer source code, we added relevant implementation for Windows such as pipes, shared memory and more.
The grammar is a crucial part of the fuzzer which we explain in depth later.
We then fixed the executor so it cross-compiles with MinGW. We also had to fix shared memory and pipes, and disable fork mode, since fork doesn't exist on Windows.
As part of compiling the grammar, syz-sysgen generates a header file (syscalls.h) that includes all the syscall names/numbers. In the case of Windows, we settled on the exported syscall wrappers and WinAPI functions (e.g. CreateWindowExA and NtUserSetSystemMenu).
Most of the syscall wrappers are exported by win32u.dll and gdi32.dll. To expose them to our executor binary, we used gendef to generate definition files from the DLLs. We then used mingw-dlltool to generate import libraries and linked them into the executor.
As we said earlier, we wanted to make sure that our fuzzer is able to reproduce old bugs, as otherwise we are wasting CPU time.
This time we had a real bug (CVE-2018-0744, see Figure 4) and we wanted to reproduce it. We added the relevant syscalls and let the fuzzer look for it, but unfortunately, it failed. We suspected a bug in our port, so we wrote a syz program and ran it with syz-execprog, the Syzkaller tool that executes syz programs directly, to check that it worked. The syscalls were called successfully, but unfortunately the machine didn't crash.
After a short time, we realized that the fuzzer was running in session 0. All services, including our SSH service, are console applications that run in session 0, which was not designed for GUI. So we changed the fuzzer to run as a normal user in session 1. Once we did that, Syzkaller was able to reproduce the bug successfully.
Our conclusion is that we always have to test new code by emulating bugs or reproducing old ones.
We added 15 APIs in total and ran the fuzzer again.
Our first crash was a Use-After-Free in win32kfull!_OpenClipboard. But for some reason, this crash didn't reproduce on other machines. At first we thought it was caused by another bug we had introduced, but it reproduced on the same machine even without the fuzzer.
The call stack and the crashing program didn’t help us understand what was wrong.
So we went and looked in IDA at the crashing area:
Figure 11: Crashing site – win32kfull!_OpenClipboard.
We noticed that the crash happens inside a conditional block that depends on a flag of an ETW provider: Win32kTraceLoggingLevel.
This flag is turned on on some machines and off on others, so we concluded that we had probably received an A/B test machine. We reported this crash and reinstalled Windows.
We ran the fuzzer again and got a new bug, this time a Denial-of-Service bug in RegisterClassExA. At this point, our motivation skyrocketed, because if 15 syscalls resulted in 2 bugs, that means 1500 syscalls would result in 200 bugs.
Grammar in win32k
Because there was no prior public research on syscall fuzzing of win32k, we had to write the grammar from scratch.
Our first thought was that maybe we could automate this process, but we stumbled upon two problems:
First, Windows headers are not enough to generate grammar, because they lack information that is crucial for a syscall fuzzer: they don't list unique strings, they don't indicate which DWORD parameters are actually flags, and many structs are defined as LPVOID.
Second, many syscalls are simply not documented (e.g. NtUserSetSystemMenu).
Fortunately, many parts of Windows are technically open source:
Windows NT Leaked sources – https://github.com/ZoloZiak/WinNT4
Windows 2000 Leaked sources – https://github.com/pustladi/Windows-2000
Windows Research Kit – https://github.com/Zer0Mem0ry/ntoskrnl
We looked for each syscall in MSDN and in the leaked sources, and we also verified it with IDA and WinDBG.
Many of the API signatures we wrote were easy to produce, but some were a real nightmare, involving lots of structs and undocumented arguments; some syscalls had 15 or more arguments.
After a few hundred syscalls, we ran the fuzzer again and we got 3 GDI vulnerabilities and some DoS bugs(!).
At this point, we had covered a few hundred syscalls in win32k, and we wanted to find more bugs. We concluded that it was time to go deeper, look for more information about Win32k and reach more complicated attack surfaces.
Fuzzers are not magic: to find bugs, we need to make sure we cover most of the attack surface of our target.
We went back to read more prior work on Win32k and to understand old bugs and bug classes. We then added support for the newly learned attack surfaces to our fuzzer.
One example is the GDI shared handle table. _PEB!GdiSharedHandleTable points to an array of structs that hold information about GDI handles shared between all processes. We added this to Syzkaller through a pseudo-syscall, GetGdiHandle(type, index), that takes a handle type and an index. It iterates over the GDI shared handle table from the beginning up to index and returns the last handle of the requested type.
This resulted in CVE-2019-1159, a Use-After-Free triggered by a single syscall using a global GDI handle that is created at boot.
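A sketch of the pseudo-syscall's logic in Go follows. In the real executor the table would be read from the live PEB; here a slice stands in for it. The GdiCell field layout and the handle encoding (Upper << 16 | index) follow commonly published descriptions of this table and should be treated as assumptions.

```go
package main

import "fmt"

// GdiCell models one entry of the table that PEB.GdiSharedHandleTable points to.
// The field layout follows commonly published descriptions and is an assumption here.
type GdiCell struct {
	KernelAddress uint64
	ProcessID     uint16
	Count         uint16
	Upper         uint16 // assumed to be the high 16 bits of the handle
	Type          uint16
	UserAddress   uint64
}

// getGdiHandle sketches the GetGdiHandle(type, index) pseudo-syscall: walk the
// table from the start up to index and return the last handle of the requested
// type, or 0 if there is none.
func getGdiHandle(table []GdiCell, typ uint16, index int) uint32 {
	var h uint32
	for i := 0; i <= index && i < len(table); i++ {
		if table[i].Type == typ {
			h = uint32(table[i].Upper)<<16 | uint32(i)
		}
	}
	return h
}

func main() {
	table := []GdiCell{{Upper: 0x10, Type: 1}, {Upper: 0x20, Type: 5}, {Upper: 0x30, Type: 1}, {Upper: 0x40, Type: 5}}
	fmt.Printf("0x%08x\n", getGdiHandle(table, 5, 2)) // 0x00200001
}
```

Letting the fuzzer choose only a type and an index keeps the argument space small while still pointing every call at a real, already-existing handle, including global ones created at boot.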
We fuzzed for 1.5 months with 60 vCPUs.
We found 10 vulnerabilities (3 pending, 1 duplicate)
We also found 3 DoS bugs, 1 crash in WinLogon and a few deadlocks.
LPE → RCE?
Local privilege escalation bugs are cool, but how about an RCE?
Introducing WMF – Windows Metafile Format.
WMF is an image file format. It was designed back in the 1990s and supports both vector graphics and bitmaps. Over the years, Microsoft extended it into newer formats such as EMF and EMF+.
Microsoft also added a feature to the format that lets you add a record that is played back to reproduce graphical output. When these records are played back, the image parser calls an NtGdi system call. You can read more about this format in j00ru’s lecture.
The number of syscalls that accept an EMF file is limited, but luckily for us, we found a vulnerability in StretchBlt, which accepts an EMF file.
Our goal was to find Windows kernel bugs using a fuzzer.
We started exploring the fuzzer landscape for the Windows kernel, and since we had experience with AFL-style fuzzers, we looked for one that works similarly and found kAFL.
We looked at kAFL and searched for attack surfaces in the Windows kernel, but quickly found out that a syscall fuzzer can reach a lot more attack surfaces.
We searched for syscall fuzzers and found Syzkaller.
At this point, we started porting it to WSL, since WSL is the most similar to the Linux kernel and would let us gain some experience with Syzkaller on Windows. We implemented coverage instrumentation for the Windows kernel using IntelPT. We described our crash detection mechanism and our crash symbolizer, which we also used for bug de-duplication. We found a few coverage stability issues and shared our solution for them.
After we found some DoS bugs, we decided to move to a real PE target – win32k – but we had to implement missing parts in Syzkaller. We then did a sanity check and stress test to make sure the fuzzer is not wasting CPU time. After that we invested a lot of time in writing grammar, reading about our target and eventually adding support for newly learned parts in Win32k back to the fuzzer.
Overall, our research led us to find 8 vulnerabilities, DoS bugs and deadlocks in the Windows 10 kernel.