Source dfde60dd... — STIMSMITH

SOURCE ARCHIVE

SHA256: dfde60ddde2381e40984428394ffaddd863de28737ab56e5f8d9cb485221a69b

URL: https://comsec-files.ethz.ch/papers/cascade_sec24.pdf

TYPE: application/pdf

SIZE: 821.3 KB

FETCHED: 6/14/2026, 11:39:14 AM

EXTRACTOR: liteparse

CHARS: 108,164

EXTRACTED CONTENT

108,164 chars

Cascade: CPU Fuzzing via Intricate Program Generation

Flavien Solt Katharina Ceesay-Seitz Kaveh Razavi ETH Zurich ETH Zurich ETH Zurich

    Abstract This paper presents Cascade, a new CPU fuzzer that explic-

Generating interesting test cases for CPU fuzzing is akin itly generates valid RISC-V programs of arbitrary length with to generating programs that exercise unusual states inside highly randomized and interdependent data and control flows. the CPU. The performance of CPU fuzzing is heavily influ- Executing many instructions per program enables Cascade to enced by the quality of these programs and by the overhead of efficiently trigger bugs. When a bug is triggered, the inter- bug detection. Our analysis of existing state-of-the-art CPU dependence of the data and control flows results in program fuzzers shows that they generate programs that are either non-termination, enabling Cascade to detect bugs at program overly simple or execute a small fraction of their instruc- granularity without any runtime overhead. Our evaluation tions due to invalid control flows. Combined with expensive shows that Cascade achieves significantly more coverage than instruction-granular bug detection mechanisms, this leads to the state-of-the-art fuzzers and discovers a large number of inefficient fuzzing campaigns. We present Cascade, a new new bugs in RISC-V CPUs of various complexities. approach for generating valid RISC-V programs of arbitrary length with highly randomized and interdependent control Programs for CPU fuzzing. To investigate properties of and data flows. Cascade relies on a new technique called the programs generated by state-of-the-art CPU fuzzers [32, asymmetric ISA pre-simulation for entangling control flows 35, 36], we defined two metrics measuring 1) completion: with data flows when generating programs. This entanglement the proportion of the instructions in a program that actually results in non-termination when a program triggers a bug in execute, 2) prevalence: the overhead due to non-randomized the target CPU, enabling Cascade to detect a CPU bug at pro- initial and final sequences. We find that for each generated pro- gram granularity without introducing any runtime overhead. gram, only a small fraction of the instructions gets a chance Our evaluation shows that long Cascade programs are more to actually execute on the target CPU and that most executed effective in exercising the CPU’s internal design. Cascade instructions perform non-randomized initialization. These achieves 28.2x to 97x more coverage than the state-of-the-art findings hint at low instruction throughput in existing state-of- CPU fuzzers and uncovers 37 new bugs (28 new CVEs) in the-art CPU fuzzers. Furthermore, detecting a bug triggered 5 RISC-V CPUs with varying degrees of complexity. The by these programs introduces its own set of challenges. programs that trigger these bugs are long and intricate, im- Bug detection. Generally, the most obvious way to check peding triaging. To address this challenge, Cascade features whether a program triggered a bug is by checking the CPU’s an automated pruning method that reduces a program to a architectural state (i.e., registers and memory) instruction by minimal number of instructions that trigger the bug. instruction against a golden model [8, 9, 30, 34, 35]. Apart from performance implications, this approach has practical 1 Introduction limitations; as an example, the registers are not always easy to identify in a given CPU’s RTL design. In particular, out- With the increasing popularity of open-source RISC-V CPUs of-order CPUs may keep multiple versions of the registers at and the high cost of formal verification, CPU fuzzing is gain- the same time and identifying the correct ones for monitoring ing momentum [8, 9, 14, 30, 32, 34, 35]. The effectiveness of introduces a non-trivial effort when porting a fuzzer to a new CPU fuzzing strongly depends on the quality of the test cases CPU [4]. It is also possible to force termination by handling and efficient bug detection. State-of-the-art CPU fuzzers fail uncontrolled exceptions in the programs [32]. While this tech- at both: they execute only a small fraction of the intended nique reduces the problem to only checking the architectural instructions per test case, and rely on incomplete or expensive state at the end, it can miss bugs that happen in the middle instruction-granular bug detection mechanisms at runtime. of the program. An ideal solution generates programs that

execute to completion without the need for checking the state during program execution Cascade. We rely on the idea that finding a bug-triggering program in a CPU is equivalent to finding a program for which the CPU behaves incorrectly. We propose a new fuzzer called Cascade that constructs complex and random RISC-V programs that exert the CPU’s architectural and microarchi- tectural features without requiring time-consuming and error- prone expert effort specific to each CPU. The programs gener- ated by Cascade mix a highly randomized data flow with the control flow, while steering the control flow at all ficiently predict some necessary register values, we introduce the novel notion of asymmetric ISA pre-simulation, where in- stead of using the Instruction Set Architecture (ISA) simulator to compare the CPU under test with a golden reference model, we use it to construct programs with valid and predictable architectural control flows. Constructing valid programs with highly interdependent control and data flows provides us with three interesting properties. First, the programs can be long, which significantly boost fuzzing performance. Second, with highly randomized data and control flows, we explore unusual operand values and control flows. Lastly, the highly entangled data and control flows enables the non-pervasive detection of bugs amidst long programs by transforming bug expressions into program non-terminations, enabling Cascade to detect bugs without any runtime overhead. The programs generated by Cascade exercise a rich set of functionalities provided by the RISC-V ISA; they support ex- ceptions without (necessarily) causing termination, data flow- dependent privilege transitions, complex FPU (Floating-Point Unit) operations, and operations under randomized Control and Status Registers (CSRs) exploring different operational states of the CPU. We evaluate Cascade on 6 real-world RISC-V CPUs of different complexities and ISA extensions: VexRiscv, PicoRV32, Kronos, CVA6, Rocket and BOOM. Compared to the state-of-the-art fuzzers such as TheHuzz and DifuzzRTL, Cascade achieves the same coverage 28.2 and 97 times faster, respectively. Cascade discovers 37 new bugs (29 new CVEs) in 5 of these 6 designs which is more than all the state-of-the-art CPU fuzzers combined [10, 14, 32, 35, 36]. We additionally found a critical bug in the popular Yosys synthesizer that results in a wrong netlist. Automated program reduction. Cascade-generated pro- grams that trigger these new bugs can be long and highly com- plex, making the analysis of the non-termination intractable by humans. To tackle this challenge, Cascade relies on a new automated program reduction technique that creates minor in-place modifications to a bug-triggering program. These modifications iteratively reduce the program to a minimal triggering form while preserving the sufficient bug-triggering CPU state. Using this technique, we find that result in hangs, wrong values and exceptions, decode issues and quantifiable performance counter inaccuracies. Further- more, we find that Cascade’s program reduction is signifi- cantly helpful when reporting the bugs to the respective CPU maintainers. Contributions. Our contributions are as follows: • We design and implement Cascade for generating valid RISC-V programs with highly complex and entangled data and control flows for finding CPU bugs. Cascade correct-by-construction programs achieve high fuzzing performance without introducing any overhead for bug detection. times. To ef- • We design and implement a new analysis method to effi- ciently reduce Cascade-generated programs to a minimal form, while preserving the bug-triggering behavior. • We evaluate the performance of Cascade in terms of speed and coverage and compare it with state-of-the- art CPU fuzzers [32, 35, 36]. We report a total of 37 new bugs found in 5 of the 6 considered RISC-V CPU designs. We further report a bug in the popular open- source Yosys synthesizer. Open sourcing. For the benefit of the research and CPU design and testing communities, we publish the source code and experiments of Cascade at this URL: https://comsec.ethz.ch/cascade. 2 Background In this section, we provide background on formal verification, fuzzing software and hardware, and finally on RISC-V. 2.1 Formal verification of hardware Assertion-based formal verification aims to prove that a hardware design satisfies certain properties, for all possi- ble input values that it may receive and for infinite depth in time. Usually, formal verification engineers manually write properties that target specific verification goals, often tak- ing design-specific knowledge into account. Tools for au- tomated property generation either generate a certain kind of properties, like information flow properties [19], require a (semi-)formal model of the specification [21, 38], or use novel languages [18]. Furthermore, exhaustively verifying complex properties on real world designs does not always scale and requires semi-manual abstraction techniques such as black-boxing or initial value abstraction [20,25]. Given the high computational and manual cost of formal verification, alternative fuzzing techniques are starting to gain popularity. bug- 2.2 Software fuzzing these bugs Fuzzing consists in applying random inputs to a unit under test and observing whether the unit behaves as expected and

whether the output is correct. The goal is to find unknown                                      Average completion rate (19.3%)
bugs by covering as much state space as possible by iteratively                                 Median completion rate (1.7%)
mutating the input data.             While fuzzing usually does not      25
provide formal guarantees of correctness, it has been shown               0
to be a very effective technique to find bugs [23].                        0 10 20 30     40    50 60     70  80 90     100
           Every fuzzer needs a strategy to generate test cases and          Fuzzing stage completion (%)
to detect bugs when they happen.               Software fuzzers may      Figure 1: Completion rates of DifuzzRTL executions.
rely on some form of coverage feedback [15, 26, 29, 58],
static or dynamic taint analysis [16, 27, 40, 42] or gram-
mars [6, 24, 28, 44, 51, 57] to incrementally find test cases          specification, which share most features. Second, CPUs may
that better exercise the software’s functionality.         Software    implement ISA extensions. In addition to the base ISA, com-
fuzzers mainly rely on two techniques to detect when bugs              mon extensions are F (floating-point), D (double-precision
are triggered: crashes [29] and sanitizers [45].         Sanitziers    support), M (integer multiplication and division), A (atomic
provide, for example, address related checks like buffer over-         operations), and C (compressed instructions). Compared with
flows [45], checks for undefined behavior such as division by          other established ISAs, RISC-V ISA features a small number
zero [22] or floating-point numerical issues [17].                     of instructions. In particular, RISC-V requires two instruc-
                                                                       tions to load an immediate 32-bit value into a register (lui fol-
2.3 Hardware fuzzing                                                   lowed addi). Furthermore, the only operations that influence
Due to the high cost of formal verification, CPUs are an inter-        the program control flow are jal (direct jump), jalr (indi-
esting target for fuzzing. Hardware fuzzers for CPUs differ            rect jump), branches, exceptions and privilege level changes.
from software fuzzers on both aspects of test-case generation          Note that in RISC-V, the targets of branches are immediates.
and bug detection. While software fuzzers often generate               In the case of self-modifying code, the fence.i instruction
random input data streams with little or no input format con-          must be executed before executing newly-stored instructions.
siderations, hardware fuzzers need to prioritize generation of           RISC-V, in its common implementation in open source
inputs that follow certain protocols, like bus or ISA specifi-         CPUs, supports up to three privilege levels which are ma-
cations, in order to be effective [55]. Inspired by software           chine mode (M), supervisor mode (S) and user mode (U),
fuzzing, state-of-the-art hardware fuzzers generate new test           in decreasing order of privilege. The M mode is the only
cases by generating random instruction sequences and mutat-            mandatory privilege level. Upward privilege transitions are
ing the test case [8–10, 14, 30, 32, 34–36]. Every new CPU             done through interrupts or exceptions, and downward transi-
fuzzer finds new bugs, but often fewer [8, 9, 14, 30, 32, 34, 35].     tions happen through specific instructions (mret and sret).
       Since processors hang only in rare cases, crash detection is    Depending on the target privilege level, the architectural fetch
not sufficient for bug detection, and sanitizers are currently         address following an exception is pre-set in the mtvec or
limited to handwritten SystemVerilog assertions [36]. There-           stvec Control and Status Registers (CSRs). CSRs are further
fore, hardware fuzzing needs new methods for bug detection.            used to configure how the CPU should operate in certain con-
Most hardware fuzzers apply differential fuzzing by compar-            ditions, such as whether exceptions should be delegated to
ing register values of the CPU under test with the results from        another privilege level, or if the floating-point unit is enabled.
a purely software-based Instruction Set Simulator (ISS) that           Spike [47] is a widely used open-source ISS for RISC-V.
serves as a golden model [32]. Comparing the result of every
instruction is expensive in terms of runtime, and only feasible        3   Motivation and Challenges
for simple CPUs where a direct mapping from RTL design
signals to registers is possible. Some CPU fuzzers [32] dump           In this section, we first analyze important aspects of the
the register values via storage instructions at the end of a test      recently-published CPU fuzzers DifuzzRTL [32] and The-
case and compare the results with the ones from the ISS. Such          Huzz [35]. From these observations, we describe challenges
a comparison is only possible if a test case completes and             which will guide the design of Cascade.
intermediate deviations between the ISS and the CPU under
test are propagated through time until the end of the test case.       3.1 Observations
2.4 RISC-V                                                             We collected 500 CPU inputs generated by DifuzzRTL [32]
                                                                       for the Rocket core to understand key aspects of test cases
RISC-V [59] is a free and open ISA, consisting of an un-               generated by a state-of-the-art CPU fuzzer. We also analyze
privileged and a privileged specification. RISC-V targets a            the test cases generated by TheHuzz [35] based on the de-
large diversity of CPUs, and therefore provides a set of op-           scription in their paper since TheHuzz is not open source at
tions. First, CPUs may comply with the 32-bit or the 64-bit            the time of this writing.

Test cases (%)

                                      Average prevalence (5.8%)       Challenge 1. How to generate programs that complete
  25                                  Median prevalence (3.0%)        and have a high fuzzing prevalence?

   0    0    5 10 15  20                  25    30 35   40                      Section 4 discusses a new design for program construc-
    Prevalence of fuzzing instructions (%)                            tion. Longer programs have a higher prevalence, but only if
Figure 2: Prevalence of fuzzing instructions in DifuzzRTL.            they complete. We propose to build valid programs that are
                                                                      expected to complete, by pre-defining a control flow ahead
                                                                      of execution. This is in contrast with existing CPU fuzzers
Completion.            Figure 1 shows the percentage of the fuzzing   that rely on mutation-based test-case generation strategies.
instructions that execute at least once in each test case pro-        The result of this step is a set of intermediate programs that
duced by DifuzzRTL. We measure completion on the ISS,                 have instructions with complex data flows, but control flows
assumed bug-free. We observe that these programs do not               that are not dependent on the (complex) data flows. Our next
generally complete the execution of their fuzzing sections,           challenge is to make the control flows dependent on the com-
mostly because of the difficulty to predict intermediate val-         plex data flows while ensuring completion. Entangling data
ues that affect the control flow, such as operands of branch          flows into control flows has two major benefits. First, it helps
instructions. Similarly, TheHuzz limits its test cases to 20          finding bugs related to (speculative) control flows. Second,
instructions, (10 of the first being non-control-flow instruc-        it enables transforming a data-flow bug symptom into a pro-
tions), again, because it is difficult to control the behavior of     gram non-termination, effectively providing a non-pervasive
control-flow instructions when fuzzing.                               design-agnostic way of detecting data-flow bugs in arbitrarily
                                                                      long and complex programs without any runtime overhead.
 Observation 1. Completion: CPU fuzzers struggle with
 fully executing their test cases.                                    Challenge 2. How to generate valid programs with a high
                                                                      degree of dependence between data and control flows?

Prevalence.              In the programs generated by DifuzzRTL, we       Section 5 proposes a novel method for efficiently construct-
separate the instructions in two categories:           overhead and   ing a complex data-dependent control flow for test cases,
fuzzing instructions. Overhead instructions are generic, non-         called asymmetric ISA pre-simulation. Asymmetric ISA pre-
randomized instructions such as setup routines or hard-coded          simulation mixes a highly randomized data flow with the
exception handlers, while fuzzing instructions are the ones           control flow of the intermediate programs. This method is
actually randomized. We define prevalence as the proportion           based on the new insight that instead of using an ISS for dif-
of executed instructions that are fuzzing instructions. Figure 2      ferential fuzzing, we can use it for generating a valid program.
shows the prevalence in the programs generated by Difuz-              Repeated usage of the ISS for the ultimate program genera-
zRTL. Strikingly, only a small proportion of the executed             tion, however, would introduce a large performance penalty.
instructions are fuzzing instructions. Similarly, the very short      We design a new scheme that allows us to reduce the number
fuzzing sequences of TheHuzz are preceded by an overwhelm-            of ISS calls to only one per generated program.
ing amount of configuration (overhead) instructions [35, 46].               Our prototype fuzzer implementation of the ideas presented
 Observation 2. Prevalence: most executed instructions                in Sections 4 and 5, called Cascade, generates programs that
 correspond to overhead and are not fuzzing.                          trigger a large number of new bugs, more than any CPU
                                                                      fuzzer so far. The programs leading to these bugs, however,
                                                                      may be long and complex by construction, to an extent that
Conclusion.             The existing coverage-guided approaches are   makes it inconceivable to interpret them manually from logic
insufficient. They produce malformed non-completing pro-              waveforms, like it may have been done so far. This leads us
grams dominated by overhead instructions.                             to our last challenge.

3.2     Overview of challenges                                         Challenge 3. How to extract and interpret the bug from
                                                                      a complex program?
Our analysis suggests that we might significantly improve
fuzzing results by constructing programs that address these                  Section 6 discusses a new algorithm for iterative program
limitations. We provide an overview of the challenges based           reduction to find a minimal number of instructions inside
on our previous observations and how we address them in the           program triggering CPU bugs. We successfully applied it for
rest of this paper. The first challenge concerns completion           all the bugs found by Cascade, which significantly helped us
and prevalence of the generated programs.                             when reporting the issues to the respective CPU maintainers.

Test cases (%)

Intermediate program construction 2. Ultimate program construction Listing 1: Example basic block. params block I « [JF xor x3,x4,x9 csrrwi x9,mcause,15 beq x9,x4,0x8000098e () fld f8, ( x3 ) feq.s x4,f9, f8 test cases database jalr x9, ( x7 )
1. ```
Program
                                                                   behavior while reducing the program complexity as described
 Figure 3: Overview of Cascade.                                    in Section 6. The analysis consists in iteratively reducing
                                                                   the program (f) and re-running it to check if the bug is still
```

4 Design triggered (g), and to produce a minimal program that is easy to understand (h), as explained in Section 6. We outline Cascade, before explaining the intermediate pro- gram construction. In Section 5, we explain how Cascade 4.2 Intermediate program construction uses ISS feedback to entangle data and control flow. We first explain the structure of the programs generated by 4.1 Cascade overview Cascade. We then show how Cascade selects instructions to form basic blocks. We finally discuss the memory manage- Figure 3 provides an overview of the fuzzing process and its ment mechanism for ensuring sound program construction. components. Cascade proceeds in four steps: (1) intermediate program construction, (2) ultimate program construction, (3) 4.2.1 High-level program structure program execution and (4) program reduction. The program construction (steps 1 and 2) is decoupled from the other steps, Basic blocks. A program is a sequence of instructions so the same program can be executed on multiple versions of within basic blocks. Basic blocks are made of zero or more the same CPU, or on CPUs with compatible extensions. instructions that do not affect the control flow, followed by a single instruction that affects the control flow. Accordingly, Intermediate program construction. Cascade takes as an Cascade constructs programs as sequences of basic blocks, input the supported ISA extensions, addresses for dumping where the last instruction of a basic block steers to the next registers and stopping the RTL simulation, and descriptions basic block. Each basic block is placed at a random location of known bugs to circumvent, and starts with a brief cali- in memory (see Section 4.2.3). The order of basic blocks’ bration stage (taking less than a second) to evaluate some execution is not necessarily identical with their placement in CPU parameters, such as supported CSR bits (a). Generating memory. The memory outside of basic blocks and of their programs is then a parallel task. Cascade first generates the data is left uninitialized, defaulting to zeros in simulation. intermediate program as a sequence of basic blocks, where control flow is isolated from the data flow (b). Initial and final basic blocks. All programs start with an initial basic block that sets up an initial state and random Ultimate program construction. An ISS then executes the register values before jumping to the first fuzzing basic block, whole intermediate program once to collect the data-flow de- which eventually jumps to the next, and so on. The program pendent values of registers (c). Cascade uses this information ends with a final basic block that optionally dumps register to entangle the data flow with the control flow, as explained values, and sends a signal indicating that the program’s end in Section 5 to produce the ultimate program (d). It is defined was reached. No overhead instruction, as defined in Section 3, as a triple (ELF file, descriptor, and the expected final register is executed between the initial and final basic blocks. values (optional)), where the descriptor is a short identifier that permits re-generating the same program. 4.2.2 Basic block generation Program execution. The program execution is also a par- allel task. Each ultimate program is executed independently Control-flow behaviors. We call instructions that do not on a simulated design under test. The output (e) is a pair change the control flow of a given program still instructions, (descriptor, success). The descriptor is the same as in (d). The and others hopping instructions. For example, a branch can success flag is raised if the program terminated successfully, be taken (still) or non-taken (hopping). Similarly, a memory and optionally if the dumped register values match. load instruction such as lw can succeed (still) or raise an exception (hopping). We define the instructions that may Program reduction. The program reduction phase takes be still or hopping depending on register values, or that are as input a test descriptor that leads to an execution failure unconditionally hopping but whose destination depends on and produces a reduced program that preserves the buggy register values, as cf-ambiguous. For example, beq and lw

are cf-ambiguous, but add and illegal instructions are not. (c) later transformations of the program (discussed in Sec- Listing 1 shows an example basic block, where cf-ambiguous tions 5 and 6) do not have unintended effects. Intuitively, the instructions are represented in red. constraint (b) would prevent from detecting some potential Picking instructions. Instructions are grouped in cate- bugs related to self-modifying code. RISC-V requires self- gories, which are listed in Appendix A. Some groups of modifying code to use fence.i, which means that such bugs instructions will behave in the same context-dependent way, are limited in scope. We leave the exploration of such bugs for example all floating-point instructions will be conditioned as future work and instead focus on non-self-modifying code. by whether an FPU is present and activated. Hence, Cascade Progressive memory allocation. Cascade allocates mem- picks instructions hierarchically, by first choosing a category, ory on the fly for each new instruction. Whenever a hopping and then a specific instruction. Both are chosen randomly instruction is generated, space is allocated for the first instruc- with certain probabilities, which are varied between programs. tion in the next basic block, at a reachable random new ad- When picking a cf-ambiguous instruction, Cascade chooses dress that offers enough space for a new basic block. Initially, immediately whether it must be still or hopping. Cascade space is allocated (at random locations) for the initial and final biases the choice of operands by granting higher probabilities basic blocks, which have known upper bounds in length. Ad- to registers recently used as outputs. ditionally, to anticipate program reduction, Cascade allocates In particular, Cascade supports complex FPU operations space for a context setter block used for program reduction and CSR interactions. It additionally supports exceptions and leaves it empty (details are discussed in Section 6). and privilege switches as simple hopping instructions, which can only be picked under certain architectural conditions that Strong memory allocation. The memory allocator can are generated by the mechanism explained in Section 5. Ul- strongly allocate memory areas, i.e., forbid loads from there. timately, the complex data flow originates from combining Reading from a memory location could entangle its data with initial static data with a diverse stream of random instructions. the data and control flows. Strong allocations prevent read- ing from a memory location that stores very specific parts Circumventing known bugs. Ideally, discovered bugs of the program where some instructions differ between the should be fixed immediately. In reality, however, this may intermediate and ultimate programs as described in Section 5. take time and effort. Due to limited human resources, some bugs may remain unfixed for a long time, polluting bug re- Memory operations. Memory stores, randomized in num- ports and potentially restrain test case continuation. Known ber and size, can only target some specific memory areas, allo- bugs may be ignored after triaging the causes of bug reports. cated at the start of the program’s creation. Memory loads can However, known CPU bugs can influence the control flow of target any address, except for the strong allocations. Heuris- Cascade programs early, shadowing the rest of the program. tically, we bias memory loads to target more often memory Furthermore, triaging is an expensive operation that must areas that have been recently written, although so far, this then be repeated many times for the instances of the same bug. specific heuristic has not been critical in finding bugs. Instead, Cascade allows circumventing known bugs by con- structing programs that will not trigger them. By increasing 5 the completion rate, Cascade is able to progress and find more Ultimate Program Construction bugs faster. As an example, in the Rocket core, the retired The fundamental idea behind asymmetric ISA pre-simulation instruction performance counter ignores ecall and ebreak is to execute an intermediate program on the ISS to collect instructions (R1) [1]. The circumvention configures Cascade feedback for constructing the ultimate program with a control to avoid that generated programs read this counter after these flow that is identical but dependent on the data flow. instructions until it has been written again, hence it covers exactly the bug. Steering the control flow. There are three schemes for gen- Note that Cascade’s approach of circumventing certain erating an arbitrary, but controlled, control flow. The first bugs may result in the under-exploration of the components scheme is not to control the values used by cf-ambiguous in which the bugs exist. Automatically circumventing bugs instructions such as jalr but to place the next basic block ac- via automated RTL patching is an interesting future direction cordingly. This scheme is not viable because most addresses that can address this issue altogether. are inaccessible or already allocated. The second scheme is not to involve the random data flow and rely on direct 4.2.3 Memory management branches or registers loaded with fixed values. This is how the control flow of the intermediate program is constructed. To produce sound programs, Cascade imposes some con- The third scheme entangles data and control flow by observ- straints on the memory layout of the generated programs. In ing the data flow and applying an offset to a register used by particular, Cascade ensures that (a) instructions do not over- a cf-ambiguous instruction. We rely on the third scheme to lap, (b) store operations do not overwrite instructions, and construct the control flow of the ultimate program.

                                   (a) lui     ro, imm,            overwritten by an instruction to become gen (e’), or by an
                                 | (b) addi       rors,   imm      ordinary instruction to become free (e).
                                  |(©) wor    roms(rum)   ro       Pre-simulation.               Cascade relies on an ISS to determine

Figure 4: Life cycle of registers regarding offset construc- the value of dependent registers when encountering a cf- tion. The instructions can be separated by other (unrelated) ambiguous instruction. So far, the state of the art [8, 9, 14, 30, instructions. 32, 34, 35] always submits the same program that is given to the CPU to the ISS. If we imitated the state of the art, the ISS should be called every time a cf-ambiguous instruction is 5.1 Offset construction and register lifecycle encountered becauser rapp would be random until the correct Some cf-ambiguous instructions such as indirect jumps o f f can be computed from rd. The ISS could not proceed to (jalr) require a specific operand value val that Cascade the next basic block (or complete a correct memory operation) intends to impose. Following the principle of dependency without running at least once per cf-ambiguous instruction. preservation, we let val depend on a randomly picked depen- We introduce the asymmetric ISA pre-simulation method dent register rd. Since we do not want to constrain rd’s value, to address this critical performance bottleneck by requiring rd is generated by the random data flow and its value is un- a single ISS execution. The fundamental insight is to pro- known when we pick the cf-ambiguous instruction. Hence, to vide to the ISS not the ultimate program that the CPU under calculate val we propose to generate an offset register value test will execute, but an intermediate program such that (a) to be eventually combined with rd’s value. The combination the values of all dependent registers (in the constructions of of rd and ro f f is performed by an offset applier instruction, for applied registers) are identical between the two programs instance xor, whose output is defined as the applied register and (b) the programs’ control flows are identical and (c) in rapp and holds the intended value val. the intermediate program, no applied register depends on the randomized data flow. Branches. Because applied registers are a somehow pre- Concretely, the intermediate program differs from the in- cious resource, Cascade does not use this method for branches. tended ultimate program in three aspects. When generating Instead, it uses the ISS feedback to obtain the operand values the intermediate program: 1) we choose the immediates imm1 and selects a suitable branch opcode, depending on whether and imm2 to set the value of ro f f to val, 2) we substitute of the the branch should be still or hopping. offset-applying instruction with ar mv instruction, which copies o f f into rapp, and 3) we transform non-taken branches into Registers involved. In total, this construction involves two NOPs, because the value of the operands are not yet known. registers (rd, whose value is randomized by the program’s data Given these transformations, the intermediate program com- flow, and ro f f to offset its value). Since there is no reason to plies with the aforementioned requirements. specifically reuse rd or ro f f as an output of the offset applier, Intermediate register states. This scheme for offset con- a third register could be used as rapp, in accordance with the struction guarantees by design that at any point in the program, principle of maximizing the degrees of freedom. all free and applied registers have the same value in the inter- Offset state machines. This scheme implies that 1) rapp mediate and ultimate program. This invariant does not hold must be available for use when the cf-ambiguous instruction is for registers in the gen, ready and unrel states. Therefore, they picked, 2) rapp’s calculation requires that ro f f is ready before should never be used as the input for any instruction other than the offset applier instruction, and 3) rd must be available the next instruction in the offset construction cycle, respec- when the offset applier instruction is executed. To comply tively steps (b) and (c) in Figure 4. Note that state machine with these requirements, we maintain a simple state machine transitions are a specific instruction category in Cascade, as for each architectural integer register. The state machine is enumerated in Appendix A. composed of five states: free, under generation (gen), ready, unreliable (unrel) and applied, as illustrated in Figure 4. All 5.2 Privilege transitions registers initially start in the free state. Cascade can create instruction (a) to move a free register to state gen, or (b) to Exceptions are a particular case of hopping instructions. They move a register from gen to ready. When there is at least require some basic bookkeeping in the fuzzer to maintain the one register, ro f f , in the ready state, then Cascade can pick privilege state and delegation flags, which indicate the target an offset applier instruction. If Cascade chooses to define privilege level active when some exception occurs. rapp = ro f f , then ro f f is moved to the applied state (c). Else, We use one to two offset/dependent/applied register triples ro f f is moved to the unrel state (c’) and rapp is moved to per exception. The first is used to populate the trap vector, the applied state (c”). Once rapp is used by a cf-ambiguous either mtvec or stvec, depending on the current privilege instruction (d), it returns to the free state. An unrel register state and whether the exception will be delegated. The second must not be used as an input to any instruction and may be is used when an exception depends on a register value, e.g., in

  a)
BE        |      |                                                    toward the final block will replace the (bug-triggering) hop-
                                                                      ping instruction of Bn−1, hence removing Bn erases the buggy
                                                                      behavior. The tail instruction being a hopping instruction is
  b)   B                                                         BE   hence the necessary and sufficient condition for failing.
                                                                      6.2 Identifying the bug’s head
  o        =                                                          Most often, a single instruction, provided with the correct
                                                                      architectural context, is sufficient to trigger the bug reliably.
Figure 5: (a) Original, (b) tail-reduced and (c) fully reduced        In such cases, finding the tail instruction is enough to under-
program. Black rectangles and arrows represent basic blocks           stand the bug. However, some bugs required a sequence of
and control flow. Ctx is the context setter block.                    instructions (up to 2), possibly far apart, to be triggered, such
                                                                      as the bugs V1-V9 and V14 that we describe in Section 7,
a misaligned memory load exception. It is populated similarly         because they rely on a specific microarchitectural context.
as an indirect jump or a load would be. Note that since basic         Hence, for these bugs, identifying the head is necessary. The
blocks are generated on the fly, the value to be inserted into        result of this step is illustrated in Figure 5 (c).
the trap vector is only known when the exception-triggering           Maintaining the architectural context.             Identifying the
instruction is generated, which may be many basic blocks              tail instruction can be done by exclusively inserting direct
later; while this adds some necessary complexity to the fuzzer,       jump instructions in the right places iteratively, but identifying
it does not negatively affect the program’s degrees of free-          the head instruction is more challenging because removing
dom in any way. The implementation of downward privilege              predecessor instructions ahead may influence the architectural
transitions is in all respects comparable with exceptions.            state. We leverage the following insight to identify the bug’s
                                                                      head: we find the head by preserving the architectural state but
6     Program Reduction                                               simplifying the microarchitectural state. Concretely, only for
                                                                      this step, a context setter basic block is inserted, by replacing
                                                                      the initial block’s hopping instruction.          Once a candidate
Cascade generates potentially long test cases, and CPU bugs           head basic block and instruction is chosen, the context setter
are revealed by programs not terminating, thanks to the en-           uses the ISS to infer the architectural context, including for
tanglement of the data and control flows. To understand the           instance register values, privilege level and some performance
underlying CPU bug, it is necessary to reduce the programs to         counter values.       It then loads the architectural state of the
a minimal form, while preserving the instructions and states          CPU, using simple instructions such as wide loads for register
that trigger the CPU bug. We first show how to find the last          values, and basic instruction sequences for populating CSRs
instruction (tail) involved in triggering the bug, then the first     and setting the proper privilege. The detection of the bug’s
(head). Figure 5 illustrates the program reduction process.           head is then performed similarly to the tail, following the
                                                                      converse algorithm where the head instruction is the first
6.1    Identifying the bug’s tail                                     instruction that, when omitted, erases the buggy behavior.
We propose to reduce the program progressively by trans-              Sandwich instructions.            For all bugs found by Cascade so
forming some instructions into direct jumps that skip some            far, identifying the tail and head instructions has always been
of the last basic blocks and observing whether the bug is still       sufficient for understanding and reproducing the bugs. How-
triggered. The result is illustrated in Figure 5 (b).                 ever, Cascade includes facilities to flatten the remaining in-
                                                                      structions (in black in Figure 5 (c)) and removing them itera-
Identifying the tail basic block and instruction.           Cascade   tively. Such transformations are not guaranteed to work on a

finds via a binary search the last basic block which, when ~~ given specific program, for example if the hopping instruction omitted along with its successors, erases the buggy behavior. was an exception and some following instruction checks the To remove such a final sequence, Cascade replaces its pre- exception cause. In practice, finding another program that decessor’s hopping instruction with a direct jump toward the reveals the same bug in case of failure of advanced transfor- final block. Cascade then searches the bug’s tail instruction mations is sufficient, notably because finding new programs in the converse way. that trigger the same bug is fast, as we show in Section 7.6.3. Failing control-flow instructions. If the tail instruction is a 7 Evaluation hopping instruction, the algorithm above will find a tail basic block Bn, but no tail instruction. This is because for skipping In this section, we evaluate Cascade in terms of performance, the basic block Bn but not Bn−1, a direct jump instruction program metrics, coverage, and the discovered bugs. We

use microbenchmarks to quantify the performance of pro-                    100                                         Interm. program construc.
gram construction and compare it with previous work (Sec-                   80    78.9%  77.8%                         Asymmetric ISA pre-sim.
                                                                                                                       Final ELF writing
tion 7.1).              We evaluate the impact of program lengths on        60                      55.2%      54.7%    49.8%
fuzzing throughput (Section 7.2). We then evaluate the pro-                 40
gram metrics for Cascade (Section 7.3) as we did for Difuz-                 20                                                          26.5%
zRTL in Section 3.1.                We compare the coverage achieved         0
by Cascade according to multiple coverage metrics and com-                        kronos picorv32   vexriscv   rocket   cva6            boom
pare it with the state-of-the-art fuzzers that specifically target      Figure 6: Performance of program construction. The rest of
these metrics (Section 7.4). We then show the bug discovery             the time is spent in the RTL simulation.
efficacy of programs of different lengths (Section 7.5). We              10⁵             PicoRV32    Rocket
also describe the 37 new bugs found in 5 of the 6 evaluated                              Kronos      CVA6
RISC-V CPUs and in Yosys, evaluate the time to detection                 10³             VexRiscv    BOOM
(Section 7.6) and provide a full list in Appendix D. Finally, we         10¹
evaluate the performance of program reduction (Section 7.7).
Evaluation setting.            The performance results were obtained               1         10      100       1,000    10,000          100,000
                                                                                   Number of fuzzing instructions per program (program length)
on a machine equipped with two AMD EPYC 7H12 pro-                       Figure 7:                 CPU-under-test execution throughput given pro-
cessors at 2.6 GHz containing 256 logical cores and 1 TB                gram length. Note the logarithmic Y axis.
of DRAM. We use Verilator 5.005 to simulate the CPUs’
RTL. As an ISS, we use spike (version 1.1.1-dev, commit
fcbdbe79). We use a recent version of a widely-used com-                Input reuse.                 To make Cascade’s evaluation as pessimistic
mercial simulator to collect simulator-based coverage similar           as possible, we systematically dynamically generate new in-
to previous work [35]. We use the most recent versions of               puts by default. Note that by construction, Cascade’s inputs
each CPU, where bugs are fixed or circumvented by Cascade.              are reusable across designs that share compatible ISA exten-
Appendix B provides detailed information about these designs.           sions, and across CPU generations. Hence the input genera-
Notably, we tested Cascade on a variety of CPU complexities,            tion is, in fact, a one-time cost that can further be amortized.
from a simple minimal 32-bit integer core (PicoRV32) to an
application-class Linux-capable out-of-order core (BOOM).               Runtime overhead.                   DifuzzRTL reports a runtime overhead
We implemented Cascade as 6 k lines of Python code.                     of 6.1% to 6.9% for control register coverage, and 97% for
Baselines.                  We compare with the existing open-source    multiplexer select coverage. TheHuzz reports a runtime over-
generic CPU fuzzers, which are RFUZZ [36] and Difuz-                    head of 71% These slowdowns do not include input genera-
zRTL [32]. Despite the claim made in the original paper [35],           tion. Cascade incurs no runtime overhead by design.
TheHuzz is not open source at the time of this writing, and its
authors were reluctant to answer any question or share their            7.2     Throughput of long programs
code over a period of a year, hence we rely on the results              To understand the throughput boost provided by long pro-
reported in their paper [35]. We re-implemented RFUZZ to                grams, we measure the number of fuzzing instructions exe-
support Verilog, and relied on the Docker image provided by             cuted per second when controlling the number of instructions
DifuzzRTL [31].                                                         per program generated by Cascade, including program gen-
7.1 Program generation performance                                      eration time. The experiment was run for 1 core-hour per
                                                                        bar. Figure 7 shows the throughputs of program execution,
To quantify the performance of program construction, we                 while Figure 8 details the average duration of a single pro-
measure the amount of time spent in intermediate program                gram generation. Constructing very long programs requires
construction, asymmetric ISA pre-simulation and RTL sim-                managing a wider memory range, resulting in longer program
ulation, for each CPU under test, over 24 hours of fuzzing.             generation times. Consequently, the overhead of generating
Figure 6 shows the results.                                             longer programs counteracts the improvements in terms of
                                                                        fuzzing throughput when programs become too large.
Results.    While by construction, the duration of the instruc-                      To find the sweet spot for the length of programs, Figure 9
tion generation and asymmetric ISA pre-simulation is identi-            shows that the effective fuzzing throughput when considering
cal across designs, the RTL simulation time increases with the          both the raw fuzzing throughput and the overhead of pro-
complexity of the design, hence the proportion of time spent            gram generation at the same time. These results show that
generating programs largely decreases with the complexity               programs of 10 k instructions generally provide the best ef-
of the designs. When generating the programs in real time,              fective fuzzing throughput, and that the fuzzing throughput
the program generation takes between 26.5% and 78.9% of                 is improved by three orders of magnitude between single-
the total fuzzing time.                                                 instruction and 10 k-instruction programs.

Exec. fuzz. instrs/s Time per step (%)

 10²       PicoRV32   Rocket                                                                     Average prevalence (90.3%)
           Kronos     CVA6                                                   10                  Median prevalence (92.5%)
 10¹       VexRiscv   BOOM                                                    0
                                                                                   0   10   20   30 40 50 60 70     80     90
                                                                                            Prevalence of fuzzing instructions (%)
          1     10           100      1,000  10,000   100,000            Figure 10: Prevalence of fuzzing instructions for Cascade.
          Number of fuzzing instructions per program (program length)
Figure 8:                Program generation performance given program                                          Average #deps (2.1)
length. Note the logarithmic Y axis.                                         25                                Median #deps (2.0)
                                                                              0
 10³                                                                                   0       1    2  3     4     5     6
                                                                                                 Length of the dependency chain
                                            PicoRV32   Rocket
 10¹                                        Kronos     CVA6              Figure 11: Length of the dependency chains for DifuzzRTL.
                                            VexRiscv   BOOM              Control-flow instructions are represented in orange.
          1     10           100      1,000  10,000   100,000
          Number of fuzzing instructions per program (program length)
Figure 9: Effective fuzzing throughput given program length.             7.4       Coverage evaluation
Note the logarithmic Y axis.                                             We show that Cascade is faster in increasing coverage com-
7.3     Program metrics                                                  pared to the state-of-the-art fuzzers with their own coverage
                                                                         metrics. We consider the open-source coverage metrics used
As part of our initial observations in Section 3, we exposed the         for fuzzing (i.e., multiplexer select and control register cover-
completion and prevalence program properties. We evaluate                age). To compare with the results reported by TheHuzz [35],
these metrics for Cascade. We additionally evaluate the length           we additionally consider the coverage metrics provided by
of dependency chains between fuzzing instructions, which                 the commercial simulator (branches, conditions, expressions,
matters for non-termination when a bug is triggered.                     FSM states and transitions, statements, and toggles).
Prevalence and completion for Cascade.                  Since Cascade    7.4.1     Control register coverage (DifuzzRTL)
always completes except when finding a CPU bug, we observe
the expected completion rate of 100%. Figure 10 shows the                We first compare the control register coverage of Cascade
prevalence of fuzzing instructions for Cascade.              The high    with the one achieved by DifuzzRTL, which explicitly aims
prevalence is because programs are relatively long and fully             at maximizing this coverage.  DifuzzRTL supports legacy
randomized except the initial and final basic blocks.                    versions of Rocket and BOOM. Running Cascade on this
Dependencies.                    We analyze dependency chains between    BOOM version leads to quasi-systematic timeouts, revealing
fuzzing instructions. On top of entangling the data flow with            old bugs in the obsolete BOOM RTL. Hence, we executed all
the control flow to force non-termination when a bug is trig-            test cases on the legacy Rocket core provided in the Docker
gered, these dependencies can further exercise corner cases              image [31], which was already exempt of unexpected bugs.
inside a CPU’s design. For this analysis, we calculate the               Results.      Figure 13 shows the control register coverage
length of instruction dependency chains. Each register starts            achieved by Cascade and DifuzzRTL. Cascade achieves more
        with a dependency number of -1, which gets reset when it is a    coverage than DifuzzRTL, in a shorter time. In particular,
destination of a CSR read or of an instruction that only takes           Cascade achieves the same coverage in 30 minutes (15 min-
immediates. We calculate an instruction’s dependency num-                utes when using a pre-generated corpus) as DifuzzRTL in 48
ber as the maximum of the dependency numbers of the source               hours, which is a speedup of 97x (respectively 186x).
registers, plus one, which also becomes the new dependency
number of the destination register.                                      7.4.2     Multiplexer select coverage (RFUZZ)
          Figure 11 and Figure 12 show the distribution of the length
of the dependency chains for DifuzzRTL and Cascade, re-                  We now compare the multiplexer select coverage achieved
spectively. The fuzzing instructions with zero dependencies              by Cascade with the one achieved by RFUZZ. We adapted
here are instructions in the form of xor rd, rd, rd. The                 RFUZZ as a sequence of Yosys [60] passes to support Ver-
programs generated by DifuzzRTL have very few interdepen-                ilog, more general than FIRRTL [33], and ensured that the
dencies. Similarly, TheHuzz does not explicitly generate or              results match with the original open-source implementation.
favor programs with dependencies [35], hence we expect even              We will open-source the instrumentation code to foster further
lower numbers. In contrast, Cascade generates programs with              research. We execute Cascade on the RFUZZ-instrumented
longer dependency chains, which could further be improved if             versions of the designs as well. We execute the RFUZZ exper-
needed by lowering the probabilities of picking dependency-              iments until completion, i.e., as implemented in the original
resetting instructions such as CSR reads.                                RFUZZ fuzzer, when all input sequences in the adherence

Fuzz. instrs/s Avg. gen. time (s)

Test cases (%) Test cases (%)

    20       Average dependencies (9.5)                                   100         PicoRV32 (172 coverage points)
             Median dependencies (4.0)                                     50                                          Cascade
     0  0                50        100                                      0                                          RFUZZ
            Length of the dependency chain                                100         Kronos (178 coverage points)
Figure 12:             Length of the dependency chains for Cascade.        50                                          Cascade
                                                                                                                       RFUZZ
Control-flow instructions are represented in orange. Fuzzing                0
instructions depended on up to 270 other fuzzing instructions.            100         VexRiscv (634 coverage points)   Cascade
                                                                           50                                          RFUZZ
                                                                            0
    1,000,000          Cascade (with corpus)                              100         Rocket (2265 coverage points)    Cascade
      500,000          Cascade (live generation)                           50                                          RFUZZ
                       DifuzzRTL                                            0
                                                                          100         BOOM (7752 coverage points)
            0                                                              50                                          Cascade
            0    10 20 30 40     48                                         0                                          RFUZZ
            Time (h)                                                       0      20   40      60                      80     100
        Figure 13: Achieved control register coverage.                                  Time (seconds)
                                                                      Figure 14: Multiplexer select coverage achieved by Cascade.

to the corpus cease to increase coverage. We had to exclude
CVA6 because of the Yosys bug found by Cascade.                        450k
Results.            Figure 14 shows the multiplexer select coverage    425k           Cascade      Intersection: 37k instrs
achieved by Cascade and RFUZZ. First, in terms of cover-               400k           DifuzzRTL
age, for the two simpler designs, RFUZZ is a bit slower than               0     200k   400k   600k     800k               1,000k
Cascade (note that in this experiment, Cascade also suffers                            Number of instructions
from the runtime overhead due to the instrumentation), but            Figure 15:    Simulator-collected coverage per instruction of
eventually achieves a superior coverage. Since RFUZZ is a             Cascade and DifuzzRTL. Note that the y-axis starts at 400k.
lower-level fuzzer, it is expected to be able to toggle some
more multiplexer signals eventually, for example by fuzzing           7.5     Efficacy of long programs
the bus protocol, whereas an ISA-level fuzzer like Cascade
and many others [8, 9, 14, 30, 32, 34, 35] will not explore these     To better understand the influence of program length on the
state machines. However, as soon as CPUs become more                  efficacy of finding bugs, we enforce the number of fuzzing
complex, RFUZZ is unable to make any progress.                        instructions per program, fuzz on 64 cores and report the
        Second, regarding delay, the order of magnitude to saturate   bug discovery times in Figure 16. Even limited to a single
coverage for Cascade is approximately 100 seconds, which              fuzzing instruction, Cascade discovers C8-C9 in a core-hour,
corresponds to its order of magnitude for finding bugs as             undetected by TheHuzz [35] and HypFuzz [14]. Yet clearly,
we show later. This suggests that, although a naive RFUZZ             longer programs are more efficient at finding bugs. Also note
implementation seems inadequate to finding bugs in CPUs,              that some bugs are hard to separate for this measurement, in
especially when they are complex, its coverage metric seems           particular C2-C7 and C10, and may overlap. The bug found
relevant to evaluating the quality of a fuzzing campaign.             only by the longest programs in 24 core-hours is K3.

7.4.3   Simulator-based coverage (TheHuzz)                            7.6    Bug discoveries
We compare the simulator-based coverage achieved by Cas-              Cascade discovered 37 new bugs in 5 CPUs and 1 bug in
cade with the one reported by TheHuzz. We reproduce the               the Yosys synthesizer. Figure 17 classifies the new bugs in
experiment from TheHuzz that led to their Figure 5 [35]. We           six categories, and Appendix D provides more information
compare the simulator coverages of Cascade and DifuzzRTL,             in terms of description, CWEs and CVEs. We first analyze
and use DifuzzRTL as a pivot to compare with TheHuzz. Note            the discovered bugs and their security implications, and then
that to fit the methodology of TheHuzz, both DifuzzRTL and            evaluate the bug detection performance of Cascade.
Cascade rely on pre-generated corpuses in this experiment.
Results.     Figure 15 shows a per-instruction speedup of Cas-        7.6.1  Bug descriptions
cade over DifuzzRTL of 27x for the simulator-collected cov-
erage, while TheHuzz reports a speedup of 3.33x. Since The-           Exceptions.      Cascade discovered 11 exception-related bugs
Huzz reports a runtime slowdown of 71%, Cascade is 28.2x              in 4 designs, which we define as missing or spurious excep-
faster than TheHuzz on TheHuzz’s target coverage metric.              tions. For example, in VexRiscv, interactions with a disabled










        Coverage pts   Test cases (%)

Simulator coverage Coverage points (%)

 38                                                                              maintainers contacted us to test a fix, which we found to fix
 30                                                                              some bug (C2), but preserving another (C3).
 20                              1 fuzz. instr.           1,000 fuzz. instr.
 10                              10 fuzz. instr.         10,000 fuzz. instr.     Hangs.       Cascade discovered 3 bugs that cause hangs in Kro-
                                 100 fuzz. instr.                                nos, PicoRV32 and VexRiscv (V12, P1, K2). In PicoRV32,
 0     0.0 2.5 5.0           7.5 10.0   12.5          15.0     17.5     20.0     the hang is systematic for accessing some CSR addresses. The
     Median time to discovery (core-hours)                                       hang in Kronos, also related to CSR access, depends on the
Figure 16: Time to reveal each bug when fuzzing with fixed                       microarchitectural state. The hang in VexRiscv is achievable
numbers of instructions on 64 cores.                                             when speculatively executing an illegal compressed floating-
                                                                                 point instruction and later executing a legitimate floating-
     12     11                                       PicoRV32           CVA6     point instruction: microarchitectural resources are reserved
 10                                                  Kronos      BOOM
                                                     VexRiscv                    but are never released, which results in a deadlock. The latter
 5                           4       4                3         3                bug is similar to the recently-discovered Zenbleed bug [39],
                                                                                 where an architectural bug leaves traces after speculation.
     Uarchvals Exceptions    Archvals Archflags       Hangs  Perfcnt
Figure 17:                         Discovered CPU bugs. Exceptions: missing or   Performance counter inaccuracies.            Cascade discovered
spurious exceptions. Uarchvals: wrong computations under                         3 inaccurate performance counter bugs (Perfcnts) in Kronos,
microarchitectural conditions. Archvals: systematic wrong                        VexRiscv and BOOM (K4, V13, B2). They incur an offset in
computations. Archflags: wrong status flags. Hangs: CPU                          the retired instruction counters when written by software.
hangs. Perfcnt: wrong performance counter values
FPU are wrongly permitted (V10, V11). In Kronos, writes                          Yosys logic synthesizer bug.      Yosys [60] is a popular open-
to a non-existent CSR fail to trigger an exception (K3). In                      source logic synthesizer, which is typically used for instru-
PicoRV32, Kronos and CVA6, spurious exceptions may be                            menting a CPU [32, 36, 48, 54]. It may also be used to part of
triggered by some CSR accesses (P2-P5, C8-C9) or incor-                          an emulation or ASIC flow. Cascade found a bug (Y1) that
rectly decoded valid instructions (P6, K5).                                      leads to incorrect logic synthesis of CVA6’s FPU.
Microarchitectural-state-dependent wrong computations.                           Other findings.            Besides discovering new bugs in CPUs
Cascade discovered 12 bugs that produce wrong computa-                           and in Yosys, Cascade found known bugs in CVA6, and the
tions under certain microarchitectural conditions (Uarchvals)                    performance counter inaccuracy in Rocket discovered by Di-
in Kronos and VexRiscv. In Kronos, a bug in the hazard detec-                    fuzzRTL and reported by TheHuzz and HypFuzz [14, 32, 35].
tion unit (K1) causes, under some conditions, wrong register                           Additionally, Cascade found two issues related to X over-
forwarding in a read-after-write-after-write double-hazard,                      propagation in VexRiscv (shadowing of RTL logical signals
where it forwards the first written value instead of the sec-                    by undefined values [33, 56]), which may happen under some
ond. In VexRiscv and CVA6, wrong calculations occur under                        conditions. First, some uninitialized instruction cache lines
some microarchitectural FPU conditions (V1-V9, V14, C10).                        can enter the pipeline and pollute signals. Second, in case
Such bugs are often hard to fix. Indeed, the VexRiscv main-                      of FPU underflow, a hardware table is accessed with a too
tainers proposed a fix, which solved most of the occurrences,                    large index. HypFuzz [14] discusses the implications of such
but Cascade discovered a way to tamper with the FPU’s mi-                        vulnerabilities when describing their third vulnerability [14].
croarchitectural state again with a different approach (V8, V9,
V14), which has ultimately been fixed by the maintainers.                        7.6.2   Security implications
Systematic wrong computations.                            Cascade discovered 4
bugs in BOOM and CVA6 that produce wrong output val-                             We classify the security implications of the discovered bugs
ues regardless of the microarchitectural state (Archvals). In                    into six categories, as presented in Figure 19. The same bug
BOOM, double-precision divisions and square roots ignore                         may belong to more than one category.
the (immediate) static rounding mode (B1). In CVA6, we
found two wrong output sign bugs (C1, C6), and one unex-                         Data-flow integrity.         We define data-flow integrity bugs
pected infinity bug (C7).                                                        as allowing a malicious time-shared user to cause a victim
                                                                                 entity to execute computations with a wrong result. Cascade
Systematic wrong flags.                         Cascade discovered 4 incorrect   found 12 such bugs, where preparing the (micro)architectural
flag bugs in CVA6 (Archflags) (C2-C5). Since flags are typi-                     state can cause wrong computations (V1-V9, V14, K1, B1).
cally used to guide a control flow, emitting FPU operations                      For example, exploiting the BOOM dynamic rounding bug
that will set flags incorrectly may provide an illegitimate con-                 (B1), an attacker process can force a victim process to take a
trol flow influence. Such bugs are difficult to fix. The CVA6                    different floating-point rounding mode than expected.

New bugs New bugs found

1000 100 10 1 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 P1 P2 P3 P4 P5 P6 K1 K2 K3 K4 K5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 B1 B2 Y1 R1 Figure 18: Time to reveal each bug when fuzzing on 64 processes. R1 is the known instruction counting bug in Rocket. Note the logarithmic scale. Bugs are labeled as follows: B (BOOM), C (CVA6), K (Kronos), P (PicoRV32), V (VexRiscv), and Y (Yosys).

20    18              PicoRV32          VexRiscv     BOOM
                      Kronos            CVA6        Yosys                30
          12                                                             20
10                    9     8                                            10
 0                                      4     3     1                       PicoRV32 Kronos VexRiscv Rocket  CVA6 BOOM
                                                                            Figure 20: Program reduction performance.

Figure 19:         Security implications of discovered bugs. Note   Denial of service.        All the hangs that we found (V12, P1,
that some bugs belong to more than one category.                    K2) can happen at any privilege level, leading to DoS attacks.
                                                                    Logic hiding.        Using the discovered Yosys bug (Y1), a ma-
Information leakage.           Sources of information leakage are   licious contributor can inject bugs into a design while submit-
many. For example, data-flow integrity bugs may also en-            ting an apparently innocuous RTL design, by transforming the
able information leakage in the opposite direction, from the        RTL into an equivalent one that will trigger Y1 accordingly.
victim to an attacker running on the same core. So do bugs
that illegally set some flags (C2-C5), and some unauthorized        7.6.3   Bug detection performance
accesses (V10, V11), creating core-local side channels.             We fuzz on 64 processes for 24 hours and summarize the
Control-flow hijack.         Cascade found 9 bugs that let an at-   time to discover each bug in Figure 18, where we repeated
tacker process influence the control flow of a victim process       the discovery 10 times with different seeds. Note that some
(C1-C7, V5, V6). For example, CVA6 sets wrong flags in di-          bugs are hard to separate for this experiment, in particular
verse situations (C2-C5) and produces wrong finitude or sign        C2-C7 and C10, and may overlap. In most runs, bugs are
results (C1, C6, C7), which typically influence the control         discovered in less than 18 core-hours (17 minutes). Design
flow. Concretely, we found that the inexact flag is not set in      and bug complexities influence the discovery time.
some cases of underflow or overflow (C4), preventing some
potential security checks. On VexRiscv, the microarchitecture       7.7    Program reduction performance
can be prepared to corrupt comparisons, for example making
two equal registers be considered distinct (V5, V6).         This   We evaluate the performance of Cascade’s program reduction.
preparation must be done by a process on the same core.             Since all designs still have at least one non-fixed bug (V14,
                                                                    P6, K4, C9, B2 and R1), we focus on them for this analysis.
Spurious exceptions.           Cascade found 8 spurious exception   We run the fuzzer to collect 100 programs that trigger the bug
bugs (P2-P6, K5, C8, C9), which break isolation boundaries          in each design. Then, we perform the program reduction on
from higher privilege levels, for example, in the context of        these programs, and measure the time required to reduce each
trusted execution environments. For example, in PicoRV32,           program and normalize it by the number of instructions in the
reading some mandatory CSRs causes exceptions (P2), hence,          program, including the initial and final blocks.
a malicious machine could give the illusion that they are           Results.      Some program reductions could fail, e.g., because
accessible, but actually the machine solely emulates the inter-     a load instruction would target the hopping instruction of a
actions, providing arbitrary values.                                basic block preceding a candidate tail block during the re-
                                                                    duction phase and its result would propagate to the control
Missing checks.              Cascade found 4 missing checks, with   flow. Experimentally, we did not observe such failures. Fig-
several implications. First, they may permit to bypass security     ure 20 shows the time to reduce each program, normalized by
checks (V10, V11), problematic, for example, if the FPU is          the length of the program. program lengths varied between
time-shared with other cores.          Second, they can deceive a   193 and 62, 870 instructions. The timeout upper bound, con-
victim to believe that a feature is supported (K3, P5). They        servatively set to 30 × ninstructions + 1000 cycles, influences
can additionally be exploited to escape program analysis,           program reduction performance. In this specific experiment,
where supposedly dead code is shadowed by an exception.             all programs were reduced to a single instruction.

New bugs Time to discovery (s)

Seconds / k instr

Info leakageDF violationsCF hijackSpur. except.No check DoS Logic hiding

8 Discussion of Rocket and BOOM [31]. ProcessorFuzz [12] is a concur- rent work that generates instructions and collects coverage We discuss porting Cascade to other ISAs and limitations of of control and status registers on the ISS and reported the coverage metrics used by the state-of-the-art CPU fuzzers. discovery of 9 new bugs (see Appendix C), including 7 on Adaptations to other ISAs. While the fundamentals of BlackParrot, maintained by certain authors of ProcessorFuzz. Cascade’s approach comply with most widespread ISAs, its Kabylkas et al. present Dromajo and Logic Fuzzer [34] and current implementation is RISC-V-specific. Adaptation to report the discovery of 13 new bugs, including 4 with the RISC ISAs such as ARM and MIPS requires ISS and instruc- help of Logic Fuzzer. Bruns et al. [8] exploited ISS coverage tion generator adaptations. We expect porting Cascade to to find new VexRiscv bugs. Similarly, Herdt et al. [30] ex- CISC ISAs to be more challenging due to more instructions. ploited ISS coverage to find 10 new bugs in their own core. N. Coverage metrics. The coverage metrics used by the state- Bruns et al. [9] generate an infinite instruction stream on the of-the-art CPU fuzzers such as TheHuzz [35] and Difuz- memory channel to find a bug in their own core. Such an infi- zRTL [32] are dominated by ineffective terms while intro- nite instruction stream is known to cause compatibility issues ducing runtime overhead. For TheHuzz, toggle coverage due to strong microarchitectural assumptions [8]. Kande et represents around a million points and 89% of the achievable al. [35] proposed TheHuzz, based on commercial RTL simula- points on Rocket. Given that TheHuzz considers the sum tor coverage feedback, and fond 7 new bugs. Chen et al. [14] of all coverage points of all types, any mutation that would proposed a technique to speed up the coverage obtained by produce such a single new toggle would be considered as dis- TheHuzz and found one new bug (see Appendix C). Cascade covering new coverage. Limitations of such simple metrics is the first fuzzer to take a constructive approach to tackle the are well-known [53], yet the more powerful alternative, func- observations and challenges exposed in Section 3. tional coverage, requires significantly more effort [7]. This 10 explains why HypFuzz, building on the same metric as The- Conclusion Huzz, found a single new bug (see Appendix C). DifuzzRTL We presented Cascade, a CPU fuzzer based on the explicit defines control registers as registers which control a multi- construction of intricate RISC-V programs. Cascade is effec- plexer inside the same HDL module. This coverage metric tive: it finds 37 new bugs on 5 RISC-V CPUs with varying continues to increase when exploring registers whose fanout degrees of complexity and vastly outperforms the state-of-the- multiplexers already took all values. Ultimately, DifuzzRTL art coverage-guided CPU fuzzers. What sets Cascade apart optimizes for exploring an immense number of combinations from the state of the art is its ability to efficiently construct of values for registers that are mainly arbitrarily selected [12]. long complex programs that enable high-throughput CPU 9 Related Work fuzzing while terminating by design. Any non-termination signifies the discovery of a bug in the target CPU. We de- In the recent years, hardware fuzzing has flourished. We first scribed how Cascade generates its programs using a new cover generic hardware fuzzers, then CPU fuzzers. technique, asymmetric ISA pre-simulation, which enables Cascade to efficiently entangle an arbitrarily-complex con- Generic hardware fuzzers. RFUZZ [36] is a generic hard- trol flow with an arbitrarily-complex data flow. Since the ware fuzzer that relies on multiplexer control signals to gener- bug-triggering programs generated by Cascade may be long ate inputs based on AFL [29]. DirectFuzz [10] targets RFUZZ and complex, we introduced a new technique for program toward specific modules. Trippel et al. [55] proposed to fuzz reduction transforming a program to the only few instructions the hardware simulation binary itself, using software fuzzing that trigger the CPU bug. methods. Ruep et al. [43] proposed a fuzzer for SpinalHDL Ethical considerations. We reported all bugs to their re- designs. Li et al. [37] proposed a new coverage metric for spective maintainers, and proposed fixes when our understand- fuzzing. Ragab et al. [41] proposed a distance-to-target feed- ing of the language and design was sufficient. back metric to direct fuzzers towards specific targets. None of these publications reported the discovery of any bugs. Acknowledgements CPU fuzzers. Many active open-source repositories rely on testing tools, such as the RISC-V compliance suite [46], The authors would like to thank the anonymous reviewers for which generates basic unit tests, and RISCV-DV, a UVM- their valuable feedback, Tobias Kovats for his contribution based testing framework based on commercial simulators [2]. to RFUZZ re-implementation, and the maintainers of the de- To address their insufficiencies, several projects proposed signs we tested for their support in understanding and fixing fuzzing CPUs. DifuzzRTL [32] generates instructions and some of the bugs. This work was supported in part by a Mi- collects control register coverage to guide the fuzzing process crosoft Swiss JRC grant and by the Swiss State Secretariat for and discovered 16 new bugs. Its source code, while relying on Education, Research and Innovation under contract number obsolete languages, is available for fuzzing legacy versions MB22.00057 (ERC-StG PROMISE).

References [13] C. Chen. Miss illegal instruction exception when rd of mulhu is the same as rs1 or rs2. https://github.com [1] Chips Alliance. Ebreak instruction retires. h t t p s : /openhwgroup/cva6/issues/885#issuecomment-1 //github.com/chipsalliance/rocket-chip/issu 469547149. [Online; accessed 4-June-2023]. es/2672. [Online; accessed 4-June-2023]. [14] C. Chen, R. Kande, N. Nyugen, F. Andersen, A. Tyagi, [2] Chips Alliance. Risc-v dv. https://github.com A. R. Sadeghi, and J. Rajendran. Hypfuzz: Formal- /chipsalliance/riscv- dv. [Online; accessed assisted processor fuzzing. arXiv:2304.02485, 2023. 4-June-2023]. [15] H. Chen, Y. Li, B. Chen, Y. Xue, and Y. Liu. Fot: A [3] K. Asanovic, R. Avizienis, J. Bachrach, S. Beamer, versatile, configurable, extensible fuzzing framework. D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, In ACM FSE, 2018. A. Izraelevitz, et al. The rocket chip generator. UC Berkeley, 2016. [16] P. Chen and H. Chen. Angora: Efficient fuzzing by principled search. In IEEE SP, 2018. [4] K. Asanovic, D. A. Patterson, and C. Celio. The berkeley out-of-order machine (boom): An industry- [17] C. Courbet. Nsan: a floating-point numerical sanitizer. competitive, synthesizable, parameterized risc-v proces- In ACM SIGPLAN CC, 2021. sor. Technical report, UC Berkeley, 2015. [18] S. Deng, D. Gümü¸so˘glu, W. Xiong, S. Sari, Y. S. Gener, [5] K. Asanovic, D. A. Patterson, and C. Celio. The C. Lu, O. Demir, and J. Szefer. Secchisel framework for berkeley out-of-order machine (boom): An industry- security verification of secure processor architectures. competitive, synthesizable, parameterized risc-v proces- In HASP, 2019. sor. UC Berkeley, 2015. [19] C. Deutschbein, A. Meza, F. Restuccia, R. Kastner, and [6] T. Blazytko, C. Aschermann, M. Schlögel, A. Abbasi, C. Sturton. Isadora: Automated information flow prop- S. Schumilo, S. Wörner, and T. Holz. Grimoire: Syn- erty generation for hardware designs. In ASHES, 2021. thesizing structure while fuzzing. In USENIX SEC, [20] K. Devarajegowda, M. R. Fadiheh, E. Singh, C. Barrett, 2019. S. Mitra, W. Ecker, D. Stoffel, and W. Kunz. Gap-free [7] T. Bojan, M. A. Arreola, E. Shlomo, and T. Shachar. processor verification by s2qed and property generation. Functional coverage measurements and results in post- In DATE, 2020. silicon validation of core™ 2 duo family. In 2007 IEEE [21] K. Devarajegowda, E. Kaja, S. Prebeck, and W. Ecker. HLDVT, 2007. Isa modeling with trace notation for context free prop- [8] N. Bruns, V. Herdt, D. Große, and R. Drechsler. Effi- erty generation. In DAC, 2021. cient cross-level processor verification using coverage- [22] LLVM Developers. Undefinedbehaviorsanitizer. http guided fuzzing. In VLSI, pages 97–103, 2022. s://clang.llvm.org/docs/UndefinedBehaviorS [9] N. Bruns, V. Herdt, E. Jentzsch, and R. Drechsler. Cross- anitizer.html. [Online; accessed 4-June-2023]. level processor verification via endless randomized in- [23] Z. Y. Ding and C. Le Goues. An empirical study of struction stream generation with coverage-guided aging. oss-fuzz bugs. In MSR, 2021. In DATE, 2022. [10] S. Canakci, L. Delshadtehrani, F. Eris, M. B. Taylor, [24] Martin Eberlein, Yannic Noller, Thomas Vogel, and M. Egele, and A. Joshi. Directfuzz: Automated test Lars Grunske. Evolutionary grammar-based fuzzing. In generation for rtl designs using directed graybox fuzzing. SSBSE, 2020. In DAC, 2021. [25] M. R. Fadiheh, J. Müller, R. Brinkmann, S. Mitra, [11] S. Canakci, C. Rajapaksha, L. Delshadtehrani, D. Stoffel, and W. Kunz. A formal approach for de- A. Nataraja, M. B. Taylor, M. Egele, and A. Joshi. tecting vulnerabilities to transient execution attacks in Processorfuzz: Guiding processor fuzzing using control out-of-order processors. In DAC, 2020. and status registers. arXiv:2209.01789, 2022. [26] A. Fioraldi, D. Maier, H. Eißfeldt, and M. Heuse. Afl++ [12] S. Canakci, C. Rajapaksha, L. Delshadtehrani, combining incremental steps of fuzzing research. In A. Nataraja, M. B. Taylor, M. Egele, and A. Joshi. Pro- WOOT, pages 10–10, 2020. cessorfuzz: Processor fuzzing with control and status [27] V. Ganesh, T. Leek, and M. Rinard. Taint-based directed registers guidance. In HOST, 2023. whitebox fuzzing. In ICSE, 2009.

    [28] P. Godefroid, A. Kiezun, and M. Y. Levin. Grammar-     [42] S. Rawat, V. Jain, A. Kumar, L. Cojocar, C. Giuffrida,

based whitebox fuzzing. In ASPLOS, 2008. and H. Bos. Vuzzer: Application-aware evolutionary [29] Google. American fuzzy lop. https://github.com fuzzing. In NDSS, volume 17, pages 1–14, 2017. /google/AFL. [Online; accessed 4-June-2023]. [43] K. Ruep and D. Große. Spinalfuzz: Coverage-guided [30] V. Herdt, D. Große, E. Jentzsch, and R. Drechsler. Ef- fuzzing for spinalhdl designs. In IEEE ETS, 2022. ficient cross-level testing for processor verification: A [44] S. Sargsyan, S. Kurmangaleev, M. Mehrabyan, risc-v case-study. In FDL, 2020. M. Mishechkin, T. Ghukasyan, and S. Asryan. [31] J. Hur, S. Song, D. Kwon, E. Baek, J. Kim, and B. Lee. Grammar-based fuzzing. In IVMEM, 2018. Difuzzrtl: Differential fuzz testing to find cpu bugs. [45] K. Serebryany, D. Bruening, A. Potapenko, and https://github.com/compsec-snu/difuzz-rtl. D. Vyukov. Addresssanitizer: A fast address sanity [Online; accessed 4-June-2023]. checker. https://clang.llvm.org/docs/Addres [32] J. Hur, S. Song, D. Kwon, E. Baek, J. Kim, and B. Lee. sSanitizer.html. [Online; accessed 4-June-2023]. Difuzzrtl: Differential fuzz testing to find cpu bugs. In [46] RISC-V Software. Risc-v unit tests. https://github IEEE SP, 2021. .com/riscv-software-src/riscv-tests. [Online; [33] A. Izraelevitz, J. Koenig, P. Li, R. Lin, A. Wang, A. Mag- accessed 4-June-2023]. yar, D. Kim, C. Schmidt, C. Markley, J. Lawson, et al. [47] RISC-V Software. Spike risc-v isa simulator. https: Reusability is firrtl ground: Hardware construction lan- //github.com/riscv-software-src/riscv-isa guages, compiler frameworks, and transformations. In -sim. [Online; accessed 4-June-2023]. ICCAD, 2017. [48] F. Solt, B. Gras, and K. Razavi. Cellift: Leveraging [34] N. Kabylkas, T. Thorn, S. Srinath, P. Xekalakis, and cells for scalable and precise dynamic information flow J. Renau. Effective processor verification with logic tracking in rtl. In USENIX SEC, 2022. fuzzer enhanced co-simulation. In MICRO, 2021. [49] SonalPinto. Kronos risc-v (with all cascade fixes). ht [35] R. Kande, A. Crump, G. Persyn, P. Jauernig, A. R. t p s : / / g i t h u b . c o m / c a s c a d e - a r t i f a c t s Sadeghi, A. Tyagi, and J. Rajendran. {TheHuzz}: - d e s i g n s / c a s c a d e - k r o n o s. [Online; accessed Instruction fuzzing of processors using {Golden- 7-December-2023]. Reference} models for finding {Software-Exploitable} vulnerabilities. In USENIX SEC, 2022. [50] SpinalHDL. Vexriscv. h t t p s : / / g i t h u b . c o m / S p i n a l H D L / V e x R i s c v. [Online; accessed [36] K. Laeufer, J. Koenig, D. Kim, J. Bachrach, and K. Sen. 2-September-2023]. Rfuzz: Coverage-directed fuzz testing of rtl on fpgas. In ICCAD, 2018. [51] P. Srivastava and M. Payer. Gramatron: Effective [37] T. Li, H. Zou, D. Luo, and W. Qu. Symbolic simulation grammar-aware fuzzing. In ISSTA, 2021. enhanced coverage-directed fuzz testing of rtl design. [52] S. Sutherland. I’m still in love with my x! In Design In ISCAS, 2021. and Verification Conference (DVCon), 2013. [38] T. Ludwig, J. Urdahl, D. Stoffel, and W. Kunz. Proper- [53] S. Tasiran and K. Keutzer. Coverage metrics for func- ties first—correct-by-construction rtl design in system- tional validation of hardware designs. In IEEE Design level design flows. In IEEE TCAD, 2020. & Test of Computers, 2001. [39] T. Ormandy. Zenbleed. h t t p s : / / l o c k . c m p x c [54] M. Tiwari, H. M. G. Wassel, B. Mazloom, S. Mysore, h g 8 b . c o m / z e n b l e e d . h t ml. [Online; accessed F. T. Chong, and T. Sherwood. Complete information 2-September-2023]. flow tracking from the gates up. In ASPLOS, 2009.

    [40] S. Österlund, K. Razavi, H. Bos, and C. Giuffrida.       [55] T. Trippel, K. G. Shin, A. Chernyakhovsky, G. Kelly,

Parmesan: Sanitizer-guided greybox fuzzing. In D. Rizzo, and M. Hicks. Fuzzing hardware like software. USENIX SEC, 2020. In USENIX SEC, 2022.

  [41] H. Ragab, K. Koning, H. Bos, and C. Giuffrida. Bugs-   [56] M. Turpin and P. V. Engineer.      The dangers of living

bunny: Hopping to rtl targets with a directed hardware- with an x (bugs hidden in your verilog). In Synopsys design fuzzer. In SILM, 2022. Users Group Meeting, 2003.

[57] J. Wang, B. Chen, L. Wei, and Y. Liu. Superion: and written in SystemVerilog (IEEE 1800-2017). Rocket [3] Grammar-aware greybox fuzzing. In ICSE, 2019. (004297b, BigCore, rv64imafd) is a reference application- [58] P. Wang, X. Zhou, K. Lu, T. Yue, and Y. Liu. Sok: class CPU with Linux support, maintained by the Chips Al- The progress, challenges, and perspectives of directed liance, written in Chisel. BOOM [5] (004297b, Medium- greybox fuzzing. arXiv:2005.11907, 2020. Boom, rv64imafd) is an out-of-order application-class CPU with Linux support, maintained by the Chips Alliance, written [59] A. Waterman and K. Asanovic. The risc-v instruction in Chisel. set manual. https://github.com/riscv/riscv-i sa-manual. [Online; accessed 4-June-2023]. C Considerations of Previous Fuzzing Claims [60] C. Wolf, J. Glaser, and J. Kepler. Yosys-a free verilog In this appendix, we correct some claims from previous work synthesis suite. In Austrochip, 2013. and underline that an ISS-CPU mismatch is not always a bug. [61] YosysHQ. Picorv32 - a size-optimized risc-v cpu. ht In TheHuzz [35], it is considered as a bug (B4) that CVA6 tps://github.com/YosysHQ/picorv32. [Online; does not throw an exception when executing self-modifying accessed 2-September-2023]. code without fence.i. RISC-V does not specify any be- havior in this case, hence this is a feature request and not a [62] F. Zaruba and L. Benini. The cost of application-class bug. In HyPFuzz [14], their vulnerability V2 was later de- processing: Energy and performance analysis of a linux- nied by the CVA6 maintainers [13], and V3 is not a bug [52]. ready 1.7-ghz 64-bit risc-v core in 22-nm fdsoi technol- Since V3 is not a bug, we naturally did not count similar ogy. In VLSI, 2019. occurrences as bugs in the evaluation of Cascade (see Sec- tion 7.6.1). Conclusively, HypFuzz only describes a single A Instruction Categories new bug. In the archived version of ProcessorFuzz [11], Bug 10 is not a bug but a feature request, because the described The instruction categories used in Cascade are the following: behavior is explicitly permitted by the RISC-V specification. REGFSM (Register lifecycle), FPUFSM (Update the FPU state), ALU (rv32 ALU operations), ALU64 (rv64 ALU oper- D Enumeration of Discovered Bugs ations), MULDIV (rv32 multiplications and divisions), MUL- DIV64 (rv64 multiplications and divisions), AMO (rv32 atom- Table 1 shows the bugs we found in the three designs, the ics), AMO64 (rv64 atomics), JAL (Direct jumps), JALR (Indi- corresponding CWEs and CVEs. rect jumps), BRANCH (Branches), MEM (rv32 integer mem- Concurrent findings. After a first bug report campaign, we ory operations), MEM64 (rv64 integer memory operations), submitted some new bug reports that were, in the meanwhile, MEMFPU (rv32 floating memory operations), FPU (rv32 concurrently found by the VexRiscv maintainer. These bugs floating-point operations), FPU64 (rv64 floating-point opera- are V10 and V11. From existing github issues, it cannot be tions), MEMFPUD (rv32 double memory operations), FPUD excluded that symptoms of C1 had been noticed in the past. (rv32 double operations), FPUD64 (rv64 double operations), However, the root cause was clearly not understood, until TVECFSM (Update trap vector), PPFSM (Update previous Cascade and its analysis facilities allowed us to propose a fix, privileges), EPCFSM (Update trap previous PC), MEDELEG that has been approved and merged by the maintainers. (Update exception delegation), EXCEPTION (Trigger an ex- ception), RDWRCSR (Read/write some CSRs), DWNPRV (Transition privileges downward), FENCES (Fences or wfi).

B CPUs under Test

  Our testbench consists of 6 RISC-V CPUs of varying com-

plexities and ISA extensions. VexRiscv [50] (e142e12, Linux, AHB-L, FPU, rv32imfd) is a highly parametrable CPU written in SpinalHDL, supporting Linux. PicoRV32 [61] (f00a88c, Default, rv32im) is a size-optimized CPU written in Verilog (IEEE 1364-2005). Kronos [49] (13678d4, Default, rv32i) is a CPU optimized for FPGA applications, written in SystemVerilog (IEEE 1800-2017). CVA6 [62] (109f9e9, 4- way 8 kB caches, rv64imafd) is an application-class CPU with Linux support supported by the OpenHW Group organization,

                 Table 1: Bug Report Table.

Design Id Bug Description CWE CVE Sev. V1 Non-deterministic conversion from single-precision float to int 681 2023-34885 4.9 V2 fmin with one NaN does not always return the other operand 193 2023-34885 4.9 V3 Conversion from double to float may pollute the mantissa 681 2023-34885 4.9 V4 Dependent arithmetic/muldiv FPU operations may yield incorrect results 193 2023-34885 4.9 V5 Equal registers may be considered distinct by fle.s and feq.s 697 2023-34885 4.9 V6 flt.s may return 1 when operands are equal 697 2023-34885 4.9 V7 Under some microarchitectural conditions, square root may be imprecise 1339 2023-34885 4.9 VexRiscv V8 Single-precision muldiv followed by conversion may pollute the mantissa 681 2023-34891 4.9 V9 Dependent arithmetic/muldiv operations may cause largely wrong output 682 2023-34891 4.9 V10 Operations on floating-point registers are authorized when FPU is disabled 1189 2023-34885 4.9 V11 Wrong access control to the FPU flags leaks information 1189 2023-34892 4.9 V12 Hang on speculatively executed compressed FPU instructions 1342 2023-34896 7.7 V13 Inaccurate instruction count when minstret is written by software 684 2023-40063 3.6 V14 Some register comparisons are still incorrect despite a partial fix 697 2023-34885 4.9 P1 Accessing a non-implemented CSR causes the CPU to hang 1281 2023-34898 4.6 P2 Spurious exceptions when reading mandatory CSRs 1281 2023-34897 2.6 PicoRV32 P3 Performance counters are not writable 284 2023-34900 2.6 P4 Performance counters can only be read using some opcodes 284 2023-34914 2.6 P5 Performance counter addresses are incorrect 684 2023-34913 2.6 P6 Spurious exception when decoding fence instructions 705 2023-34899 5.0 K1 RaWaW double-hazard may cause a wrong register value to be forwarded 226 2023-34902 6.6 K2 Reading existing CSRs causes the CPU to hang in some uarch conditions 1281 2023-34901 7.1 Kronos K3 In some uarch conditions, no exception when writing inexistent CSRs 1281 2023-42310 2.6 K4 Inaccurate instruction count when minstret is written by software 684 2023-40066 3.6 K5 Incorrect decode logic for fence and fence.i 684 2023-34903 5.0 2023-40064 5.0 C1 Double-precision multiplications yield wrong sign when rounding down 682 2023-34904 4.4 C2 Single-precision floating-point operations may treat NaNs as zeros 684 2023-34906 5.1 C3 Division by NAN incorrectly sets NX and NV fflags 684 2023-34905 5.1 C4 The inexact (NX) flag not set in case of overflow or underflow 684 2023-34907 5.1 CVA6 C5 Division of zero by zero incorrectly sets the DZ flag 684 2023-34909 5.1 C6 Plus and Minus infinity microarchitectural structures are inverted 1221 2023-34910 4.4 C7 Infinities are not rounded properly and stick to infinity 1339 2023-34911 4.4 C8 Spurious exceptions when reading some performance counters 684 2023-34911 5.0 C9 Wrong supervisor performance counter access control 684 2023-42311 5.0 C10 Under some microarchitectural circumstances, wrong NAN conversion 682 2023-34908 5.1 BOOM B1 Static rounding is ignored for fdiv.s and fsqrt.s 1339 2023-34882 6.5 B2 Inaccurate instruction count when minstret is written by software 684 2023-40065 3.6 Yosys Y1 Logic synthesis of CVA6 inserts a logic bug into the FPU 682 2023-34884 6.8