Promotor: Prof. Dr. H. Bos
Co-Promotor: Dr. C. Giuffrida
Software keeps on growing. New functionality is constantly added to active projects, leading to a steady influx of changes and, in general, to an increase in the number of source code lines developers need to manage. For example, the Linux kernel grew from 180 thousand lines of code in version 1.0 to 19.5 million in version 4.1, and went on to reach 30 million lines of code by version 5.11 [77]; Chromium likewise grew from 24.2 million lines of code in 2019 to 33.3 million in 2024 [71]. A whole array of software engineering techniques and tooling, such as object-oriented programming and git, has been developed just to manage this increasing size and complexity.

When a software project sits at a privilege boundary, as the Linux kernel does, its interface becomes part of an attack surface, because it may receive malicious data from attackers with well-defined capabilities. It thus becomes necessary to verify that an attacker cannot gain any additional capabilities by exploiting vulnerabilities in such software. This process is commonly referred to as a security review and can be conducted in various forms. The most straightforward one is manual review: a security expert inspects the code and tries to find vulnerabilities the programmers may have inadvertently introduced. The obvious problem with this approach is that it scales neither in size nor in time: companies need more security experts as the software grows, but also as the number of changes to it increases, since more changes require more frequent reviews.

Exactly as in software engineering, the increase in size and pace of evolution of modern software has warranted the development of new techniques that partially automate security reviews and thus reduce the manpower required to conduct one. These techniques can roughly be split into static and dynamic program analysis techniques. The former category more closely mimics the work performed by a human reviewer: it does not run the code, but merely inspects it looking for bug patterns. This makes it possible to apply static techniques even to software that cannot easily be run in isolation, but it also means that false positives may be generated when a tool “misinterprets” the software under examination. The second category, dynamic techniques, is instead characterized by the execution of the software itself in order to examine its behavior. These techniques usually require more engineering work to set up, but have the advantage of not producing false positives. It is common practice for reviewers to employ tools from both categories when conducting a security review, resorting to manual review only for classes of bugs, such as data races, that are hard to find otherwise.

Among dynamic techniques, fuzzing is the one that has seen the most industry adoption in recent years. An important example of its deployment is OSS-Fuzz [63], a framework maintained by Google to ensure that open source software commonly used in security-sensitive contexts is continuously tested with fuzzing. Microsoft also invested in a tool called SAGE [30] to test its software internally and found a myriad of memory corruption bugs as a result. For the Linux kernel, another project developed by Google, Syzkaller [72], constantly fuzzes the syscalls exposed to userland, looking for vulnerabilities.
In addition, the offensive security community actively maintains projects such as AFL++ [25] and LibAFL [26], which practitioners commonly deploy in fuzzing campaigns to look for vulnerabilities.

Fuzzing is based on the idea of looking for unexpected behavior in software by feeding it unexpected, usually randomly generated, input data. This makes it well suited for security reviews because it allows the reviewer to specify exactly the interface through which an attacker may introduce malicious data: as a result, the fuzzer reports only bugs that are triggerable from that interface and that are consequently relevant for the analysis at hand. A fuzzer is thus essentially solving a search problem: given an input space, which depends on the interface, it looks for specific inputs, or test cases, that trigger bugs. The performance of the search, that is, the ability to find more bugs, depends on two main factors: how the inputs are generated and how fast they are generated. Generating higher-quality inputs requires fewer attempts before a bug is triggered, but simply being able to make more attempts, because the fuzzer runs faster, also leads to finding bugs sooner. For this reason, execution speed has always been one of the most important characteristics of a fuzzer, and each new optimization has to strike the right balance between its overhead and its benefits to yield a net improvement in performance.

The recent success fuzzing has enjoyed can be attributed to the introduction of a test case generation technique that struck this balance particularly well: coverage-guided, mutation-based, greybox fuzzing, pioneered by AFL [82] in 2013. The main innovation brought by AFL was the use of instrumentation to collect lightweight coverage statistics, which are then fed back into the test case generation algorithm itself: if a test case covers previously unseen code, it is inserted into a queue and later mutated to produce new, but similar, test cases. The intuition behind this approach is that if a test case covered unseen code, its execution trace likely touched code in a module that has not yet been fully explored; test cases similar to it are then likely to explore more code in the same module. Following this algorithm, fuzzing becomes an optimization problem whose goal is to maximize code coverage, under the assumption that more coverage leads to uncovering new bugs (a minimal sketch of this feedback loop is shown at the end of this section).

Several academic authors have pushed to further improve fuzzing, both in terms of test case generation techniques and in terms of execution speed, often applying the general concept of fuzzing to a specific subset of software where additional performance can be obtained by relying on stronger assumptions. Three such specializations are relevant for this thesis: snapshot-based fuzzing, hybrid fuzzing and directed fuzzing.
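To make the coverage-feedback loop described above concrete, the sketch below shows a minimal coverage-guided, mutation-based fuzzer in Python. It illustrates only the idea, not any real fuzzer's implementation: the toy target, its hand-written edge labels, the mutation operators, and the function names (target, mutate, fuzz) are hypothetical simplifications introduced here, whereas real tools such as AFL++ collect coverage through compile-time instrumentation of the target and use far richer mutation and scheduling strategies.

import random

def target(data: bytes) -> set:
    """Toy stand-in for an instrumented target: returns the set of
    'edges' it covered and raises an exception when the 'bug' is hit."""
    covered = {"entry"}
    if len(data) > 0 and data[0] == ord("F"):
        covered.add("byte0_is_F")
        if len(data) > 1 and data[1] == ord("U"):
            covered.add("byte1_is_U")
            if len(data) > 2 and data[2] == ord("Z"):
                covered.add("byte2_is_Z")
                raise RuntimeError("bug reached")   # stand-in for a crash
    return covered

def mutate(data: bytes) -> bytes:
    """Apply one random byte-level mutation (overwrite, insert, or delete)."""
    buf = bytearray(data) if data else bytearray(b"\x00")
    pos = random.randrange(len(buf))
    op = random.randrange(3)
    if op == 0:                      # overwrite a byte
        buf[pos] = random.randrange(256)
    elif op == 1:                    # insert a byte
        buf.insert(pos, random.randrange(256))
    elif len(buf) > 1:               # delete a byte
        del buf[pos]
    return bytes(buf)

def fuzz(seeds, iterations=200_000):
    queue = list(seeds)      # "interesting" test cases kept for further mutation
    seen_coverage = set()    # edges observed across all executions so far
    crashes = []
    for _ in range(iterations):
        parent = random.choice(queue)
        child = mutate(parent)
        try:
            covered = target(child)
        except RuntimeError:
            crashes.append(child)
            continue
        # Coverage feedback: keep the input only if it reached new edges.
        if not covered <= seen_coverage:
            seen_coverage |= covered
            queue.append(child)
    return crashes

if __name__ == "__main__":
    found = fuzz([b"AAAA"])
    print(len(found), "crashing inputs, e.g.", found[:1])

The design point the sketch preserves is that coverage is the only feedback signal: a mutated input is kept for further mutation solely because it reached previously unseen edges, mirroring the queue-and-mutate loop described above.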