Source f29a2b66... — STIMSMITH

SOURCE ARCHIVE

SHA256: f29a2b6656711f9b5e900270d7e91da7fc4c9e7916cd74589626a3075bb1365c

URL: https://www.ndss-symposium.org/wp-content/uploads/2026-s1663-paper.pdf

TYPE: application/pdf

SIZE: 1088.2 KB

FETCHED: 6/14/2026, 10:22:09 AM

EXTRACTOR: liteparse

CHARS: 105,525

EXTRACTED CONTENT

105,525 chars

GoldenFuzz: Generative Golden Reference Hardware Fuzzing

Lichao Wu, Technical University of Darmstadt, lichao.wu@trust.tu-darmstadt.de
Mohamadreza Rostami, Technical University of Darmstadt, mohamadreza.rostami@trust.tu-darmstadt.de
Huimin Li, Technical University of Darmstadt, huimin.li@trust.tu-darmstadt.de
Nikhilesh Singh, Technical University of Darmstadt, nikhilesh.singh@trust.tu-darmstadt.de
Ahmad-Reza Sadeghi, Technical University of Darmstadt, ahmad.sadeghi@trust.informatik.tu-darmstadt.de

Abstract—Modern hardware systems, driven by demands for high performance and application-specific functionality, have grown increasingly complex, introducing large surfaces for bugs and security-critical vulnerabilities. Fuzzing has emerged as a scalable solution for discovering such flaws. Yet, existing hardware fuzzers suffer from limited semantic awareness, inefficient test refinement, and high computational overhead due to reliance on slow device simulation.

In this paper, we present GoldenFuzz, a novel two-stage hardware fuzzing framework that partially decouples test case refinement from coverage and vulnerability exploration. GoldenFuzz leverages a fast, ISA-compliant Golden Reference Model (GRM) as a “digital twin” of the Device Under Test (DUT). It fuzzes the GRM first, enabling rapid, low-cost test case refinement, accelerating deep architectural exploration and vulnerability discovery on DUT. During the fuzzing pipeline, GoldenFuzz iteratively constructs test cases by concatenating carefully chosen instruction blocks that balance the subtle inter- and intra-instructions quality. A feedback-driven mechanism leveraging insights from both high- and low-coverage samples further enhances GoldenFuzz’s capability in hardware state exploration. Our evaluation of three RISC-V processors, RocketChip, BOOM, and CVA6, demonstrates that GoldenFuzz significantly outperforms existing fuzzers in achieving the highest coverage with minimal test case length and computational overhead. GoldenFuzz uncovers all known vulnerabilities and discovers five new ones, four of which are classified as highly severe with CVSS v3 severity scores exceeding seven out of ten. It also identifies two previously unknown vulnerabilities in the commercial BA51-H core extension.

I. INTRODUCTION

Modern processors contain billions of transistors, multiple cores, and sophisticated performance optimization mechanisms that offer unprecedented computational capabilities. This complexity simultaneously enlarges the surface of the hardware attack, exposing systems to a broader range of functional bugs and security-critical flaws, for example, system instability [1], incorrect computation [2], [3], and information leakages [4]–[6]. Fixing such flaws post-fabrication could be a challenging and costly task [7]. Consequently, detecting and mitigating hardware vulnerabilities in the pre-silicon stage becomes crucial to preserving system stability and security with a reduced budget.

The hardware community has proposed several security verification techniques, from formal static methods to simulation-based dynamic approaches [8]–[17]. In recent years, fuzzing has gained attention for its capacity to automate vulnerability detection and scalability on complex hardware designs. By systematically probing target devices, hardware fuzzing excels in discovering bugs and security vulnerabilities, making it a practical tool to improve hardware security in increasingly sophisticated computing systems [18]–[28]. Industry leaders, including Intel and Google, are actively investing in hardware fuzzing research [18] to strengthen their verification efforts.

Unfortunately, existing fuzzers face several significant challenges. First, they primarily rely on random mutations or heuristics to generate test cases, failing to capture the complex dependencies and execution semantics inherent to modern Instruction Set Architectures (ISAs). This shortcoming leads to duplicated test cases and subpar exploration of edge cases. Second, previous work typically refines test cases in an ad-hoc manner, lacking a structured mechanism to refine input based on semantic insights. Consequently, they struggle to produce meaningful instruction sequences that reach deeper hardware states. Although some fuzzers attempt to improve test selection through, for example, steered control flow [24] or reinforcement learning [21], they often operate without a clear separation between test case refinement and coverage exploration, leading to inefficiencies in both processes. Finally, existing work relies on executing test cases solely on the slow simulated DUT. Each iteration requires evaluating numerous test cases, leading to excessive computational overhead and limiting the fuzzer’s ability to explore a broad range of hardware states in a constrained time frame.

Network and Distributed System Security (NDSS) Symposium 2026
23-27 February 2026, San Diego, CA, USA
ISBN 979-8-9919276-8-0
https://dx.doi.org/10.14722/ndss.2026.231663
www.ndss-symposium.org

Our goals and contributions: We present GoldenFuzz, a novel hardware fuzzing framework that addresses these bottlenecks by decoupling the traditional monolithic fuzzing process from direct, continuous interaction with the target hardware. Concretely, GoldenFuzz restructures hardware fuzzing into two distinct stages: the Golden Reference Model (GRM) fuzzing and the Device Under Test (DUT) fuzzing. In the first fuzzing stage, inspired by the concept of a digital twin, GoldenFuzz introduces a fast, software-based GRM of the DUT¹. This enables rapid exploration and efficient refinement of fuzzing strategies, guiding the generation of test cases that inherently achieve deeper architectural coverage. Subsequently, the optimized fuzzing policy is transferred directly to the DUT fuzzing stage, greatly accelerating the discovery of meaningful hardware behaviors and potential vulnerabilities. GoldenFuzz integrates a customized language model that captures the semantic structures of instruction sequences, generating smarter, more targeted test cases, effectively serving as a translator fluent in the processor’s intricate architectural dialect. By continuously analyzing both successful and unsuccessful test cases, GoldenFuzz adapts its strategy to systematically target complex, hard-to-reach processor states. Together, these design innovations enable GoldenFuzz to significantly surpass state-of-the-art fuzzers in coverage growth, uncover deeper and previously hidden vulnerabilities, and dramatically reduce computational overhead in hardware fuzzing. Concretely, we make the following contributions:

• We introduce a two-stage hardware fuzzer, GoldenFuzz, that first emphasizes rapid test case refinement at low computational cost, then shifts to focused vulnerability detection to uncover critical weaknesses. During the first phase, GoldenFuzz, for the first time, utilizes a GRM, a software model designed to execute RISC-V instructions in strict adherence to the ISA specification, to represent the DUT. This implementation facilitates the evolution of test cases to contain richer semantics, potentially achieving higher hardware state coverage. This insight transfers seamlessly to the DUT, reducing discrepancies and maximizing meaningful fuzzing outcomes.
• We introduce a block-wise test case generation scheme that produces multiple instruction blocks in each iteration and appends the chosen block to subsequent iterations. A new scoring mechanism, integrating inter- and intra-test case evaluations, drives extensive instruction space exploration, ensuring that each progressive block effectively aligns with fuzzing objectives.
• Our framework leverages a novel language model–based generator capable of accurately producing assembly instructions by understanding inter-instruction semantics. This generator integrates the target’s feedback, enhancing the test case generation capability with higher coverage. Furthermore, for the first time, we incorporate both “winning” and “losing” test cases (in terms of hardware coverage) into the fuzzing process. By directly comparing these paired cases, our method continually refines its test generation strategy to expose new vulnerabilities.
• To the best of our knowledge, GoldenFuzz identifies all previously known bugs and vulnerabilities. Additionally, five new vulnerabilities are discovered in tested cores (RocketChip [29], Boom [30], and CVA6 [31]), including four critical ones with CVSS 3.0 scores above seven (from a maximum of ten). For the real-world application, it identifies two previously unknown vulnerabilities in the commercial BA51-H core extension.

¹ Although the GRM and DUT implementations may differ internally due to hardware-specific optimizations or undocumented behaviors, they both strictly follow the ISA specification.

The remainder of this paper is structured as follows. We provide the necessary background information in Section II. In Section III, we describe the design of GoldenFuzz in detail; the implementation is introduced in Section IV. Section V evaluates the performance of GoldenFuzz regarding the hardware coverage and computational cost. Section VI details the fuzzing findings, including detected vulnerabilities and bugs. Section VII performs an ablation and hyperparameter study on several critical fuzzing settings. Section VIII discusses the GoldenFuzz with more insights. Section IX discusses related works. Section X concludes this work.

II. PRELIMINARIES

A. Fuzzing

Fuzzing is a widely used methodology for testing and verifying complex hardware and software designs, such as processors, cryptographic modules, and communication protocols, to uncover potential vulnerabilities [20], [32]–[34]. A traditional fuzzer begins with randomly produced yet valid test stimuli. State-of-the-art fuzzers have employed iterative mutation algorithms to expand the DUT’s state space coverage. Throughout this process, the DUT’s outputs, including execution traces and any crash information, are captured and analyzed. In software contexts, suspicious crashes may directly reveal exploitable bugs. For hardware, the DUT’s execution traces are compared against a GRM or predefined assertions. Any detected discrepancies become red flags for potential vulnerabilities. This iterative cycle repeats until an acceptable level of state-space coverage is achieved or critical weaknesses are discovered. GRM is only used for vulnerability detection; it does not influence the fuzzing pipeline.

Hardware fuzzing strategies are generally categorized into black, grey, and white boxes based on the degree of internal knowledge available about the DUT [23], [35]. Among these techniques, coverage-based white-box fuzzing has become particularly popular for hardware verification, as it systematically evaluates state-space exploration using coverage metrics such as finite-state machine (FSM) coverage, line coverage, condition coverage, and multiplexer (MUX) toggle coverage. Within a hardware fuzzing pipeline, initial input seeds are generated and mutated to produce multiple test cases. Feedback from simulation-based coverage analysis of these test cases then guides the selective refinement of promising inputs while discarding unproductive ones. This feedback-driven loop supports efficient navigation of the DUT’s state space; any anomalies or unexpected DUT responses are recorded for follow-up vulnerability assessment.
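The traditional loop described above (mutate, simulate, keep inputs that add coverage, compare the DUT against a GRM) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the implementation of any fuzzer cited here: the toy `dut` and `grm` functions, the planted bug, the nibble-based coverage metric, and the single-operand mutation are all hypothetical.

```python
import random


def grm(program):
    """Hypothetical reference model: 8-bit accumulator that adds each operand."""
    acc, trace = 0, []
    for op in program:
        acc = (acc + op) & 0xFF
        trace.append(acc)
    return trace


def dut(program):
    """Hypothetical device model with a planted bug on operand 0x7F.

    Returns the execution trace and the set of coverage points hit
    (assumed here to be the high nibble of each operand).
    """
    acc, trace, coverage = 0, [], set()
    for op in program:
        coverage.add(op >> 4)
        extra = 1 if op == 0x7F else 0  # planted bug: off-by-one result
        acc = (acc + op + extra) & 0xFF
        trace.append(acc)
    return trace, coverage


def mutate(program, rng):
    """Replace one randomly chosen operand with a random byte."""
    child = list(program)
    child[rng.randrange(len(child))] = rng.randrange(256)
    return child


def fuzz(iterations=2000, seed=0):
    rng = random.Random(seed)
    corpus = [[0, 0, 0, 0]]      # initial seed input
    seen = set()                 # coverage points reached so far
    mismatches = []              # inputs whose DUT trace differs from the GRM
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        trace, cov = dut(candidate)
        if trace != grm(candidate):      # differential check only
            mismatches.append(candidate)
        if not cov <= seen:              # keep inputs that add coverage
            seen |= cov
            corpus.append(candidate)
    return seen, mismatches
```

Note that the reference model appears only in the differential check and never steers generation, which mirrors the traditional role of the GRM described above.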


B. Language Model and Fine-Tuning

Language models are one of the most advanced methods in Natural Language Processing (NLP), as they significantly increase the capability of intelligent systems to analyze and generate coherent text. They support various tasks by predicting how words naturally follow one another, including translation, summarization, and conversational agents [36]. These language models often rely on transformer architectures [37], which use attention mechanisms to capture complex relationships across long text sequences. Among the most influential transformer-based models are Generative Pre-trained Transformers (GPT) [38], built using a unidirectional approach. GPT models predict each successive token using only previous tokens, relying on stacked transformer decoder layers with multi-head self-attention and feed-forward networks.

Training a language model can be highly resource-intensive; fine-tuning becomes an effective and low-cost learning strategy when customizing pre-trained language models for specific tasks. Common methods include Reinforcement Learning from Human Feedback (RLHF) [39], which aligns the model’s outputs with human preferences or guidelines, and Direct Preference Optimization (DPO) [40], [41], which avoids explicit reward shaping in favor of pairwise comparisons between candidate outputs. These methods enable the model to capture subtle human-defined quality metrics and to produce responses that align with particular application needs.

III. GOLDENFUZZ

A. General Framework

An overview of GoldenFuzz is shown in Fig. 1. First, our fuzzer, powered by a customized language model (LM), is pre-trained on a corpus of assembly instructions. Next, instead of immediately testing on the slower DUT, the LM first interacts with a fast software-based GRM. During this stage, the LM generates small groups of instructions, denoted as instruction blocks, based on previously tested examples. These instruction blocks are rapidly evaluated on the GRM, and the feedback is used to teach the LM to generate more valid and meaningful instruction sequences.

After sufficient refinement, the DUT fuzzing stage starts, where the improved LM now targets the actual hardware (DUT) with a similar fuzzing pipeline. Because the LM was previously refined to produce instruction blocks with high semantic validity, test cases at this stage have a greater potential to explore deep, complex hardware states, achieving more effective testing while minimizing slow hardware simulation overhead. Finally, by comparing execution traces from the DUT against those from the GRM, GoldenFuzz efficiently identifies discrepancies that reveal vulnerabilities and bugs.

Fig. 1: An Overview of the GoldenFuzz framework. (Panel labels: 1 Fuzzer Init.; 2 GRM Fuzzing; 3 DUT Fuzzing; 4 Vul. Detection; Assembly Collection; LM Pretraining; Random Inst. Block Gen.; Best Inst. Block Gen.; Exe. Traces; Differential Test.)

B. The Motivation Behind the GRM Fuzzing

Traditionally, GRM serves primarily as an oracle to identify mismatches and bugs by comparing DUT outputs against expected behavior. In contrast, GoldenFuzz applies GRM to coarsely refine the fuzzing policy before the actual DUT fuzzing, addressing two critical limitations of prior fuzzers [19], [21], [23]: their tendency to generate test cases containing limited (fewer than 10) executable instructions, and their computational inefficiency due to the prohibitive overhead of cycle-level hardware simulation.

First, since the GRM strictly follows the RISC-V ISA specification, refining the fuzzing policy using GRM feedback encourages the generation of high-quality test cases. While acknowledging that subtle deviations or undocumented behaviors in the DUT might not always be perfectly reflected in the GRM, GoldenFuzz addresses this potential mismatch not by attempting to improve coverage directly from the GRM, but by setting a more general objective: to enhance test case validity based on the ISA. This approach enables the fuzzer to learn valid concatenations between instructions and leave the more dedicated, coverage-guided fuzzing to the DUT. Second, the GRM significantly reduces computational overhead. Unlike DUT simulations, the GRM provides a fast, ISA-compliant software environment, enabling efficient fuzzing-policy training. By quickly learning patterns of structurally valid instruction sequences, GoldenFuzz transitions smoothly to real hardware, generating high-quality test cases without costly online DUT-guided test case refinement. As empirically validated in Section VII, this approach facilitates more valid and executable instructions per test case. Besides, as demonstrated in Section VI, our GRM-driven approach enables the discovery of vulnerability classes that require deep and sustained execution paths, issues that are typically missed by conventional random or shallow test generation strategies.

C. Fuzzer Initialization

Before the two fuzzing stages, GoldenFuzz is initialized to gain essential instruction generation capability. GoldenFuzz treats a test case, containing a sequence of instructions, as a linguistic construct akin to a sentence composed of meaningful words. Essentially, each instruction is viewed as a linguistic unit within a higher-level syntax, ensuring its collective arrangement conveys a coherent intent to the DUT. Our approach involves a fuzzer driven by a customized language model (implementation detailed in Section IV-A), leveraging its adeptness in natural language processing tasks to generate hardware-focused instruction synthesis. Note that an effective instruction generation engine must internalize both intra-instruction semantics (valid and meaningful construction of a


single instruction) and inter-instruction semantics (synergistic assembly of multiple instructions to unveil vulnerabilities). In this section, we initialize the model with robust intra-instruction knowledge. The refinement of its inter-instruction capabilities through dynamic interaction with the target is elaborated in Section III-D.

We prepare a diverse set of randomly sampled instructions to equip the fuzzer with a foundational grasp of intra-instruction semantics. More concretely, we assemble a corpus of $J$ assembly instructions $\{I_1, I_2, \ldots, I_J\}$. The choice of assembly instructions enhances the semantic connections between instructions, facilitating the assembly of meaningful sequences that can probe deeper into vulnerabilities. The instruction corpus is concatenated into a single linear structure $D$, separated by <eoi> (end of instruction):

\[ D = I_1\ \texttt{<eoi>}\ I_2\ \texttt{<eoi>}\ I_3\ \texttt{<eoi>}\ \cdots\ I_J\ \texttt{<eoi>}. \tag{1} \]

This explicit segmentation aids in clarifying instruction boundaries and preserving per-instruction semantics. Next, each instruction $I_i$ is tokenized as $T_i$, forming $T = T_1\, T_{\texttt{<eoi>}}\, T_2\, T_{\texttt{<eoi>}} \cdots T_J\, T_{\texttt{<eoi>}}$ such that each $T_i$ contains $k_i$ tokens $t_{i,1}, t_{i,2}, \ldots, t_{i,k_i}$.

We define an auto-regressive objective that compels the model to predict the subsequent token given all previously observed tokens. Let $t = \{t_1, t_2, \ldots, t_K\}$ represent the entire token sequence of $T$, with $K$ ($\geq J$) being the total number of tokens. The model is trained to estimate $P_\theta(t_{i+1} \mid t_1, t_2, \ldots, t_i)$ for each $i$, where $\theta$ denotes the trainable parameters. We minimize the negative log-likelihood $\mathcal{L}$:

\[ \mathcal{L}(\theta) = -\sum_{i=1}^{K-1} \log P_\theta(t_{i+1} \mid t_1, t_2, \ldots, t_i). \tag{2} \]

Eq. 2 ensures that the model incrementally refines its parameters to produce tokens following previous contexts.

D. Fuzzing Pipeline

As mentioned in Section III-A, both the GRM fuzzing and DUT fuzzing phases in GoldenFuzz follow a similar fuzzing pipeline. The key difference lies in their targets and scoring functions. An overview of the GoldenFuzz pipeline is shown in Fig. 2, consisting of three major steps.

• Test Case Generation and Simulation. Guided by the current fuzzing policy, we produce test cases by sampling from GoldenFuzz’s memory and iteratively creating, selecting (presented by the heart symbol), and concatenating instruction blocks (IBs), each containing multiple instructions. The chosen target (GRM or DUT) executes these IB concatenations and provides feedback.
• Test Case Scoring. GoldenFuzz evaluates IBs differently for each fuzzing stage. For GRM fuzzing, the IB is simply evaluated by its validity. Instead of enforcing validity at the instruction level, an IB that is executable with no GRM exception is considered valid. IB-level validity allows the fuzzer to learn the valid concatenation between instructions and potentially construct multi-IB test cases with non-trivial control flow patterns. For DUT fuzzing, we employ a dual-layer scoring system. This approach incentivizes newly uncovered coverage within a single test case (intra-test scoring) while deducting points for coverage already identified by other tests (inter-test scoring). As a result, this technique drives the fuzzing process toward more hidden hardware states.
• Fuzzing Policy Refinement and Memory Update. Using the computed scores, we form preference pairs by identifying “winning” and “losing” test cases based on their scores (each marked by a symbol in Fig. 2, where the color indicates the fuzzing iteration). These pairs are directly fed back to the fuzzer, driving it toward producing instruction sequences that deliver improved coverage. At the same time, the fuzzer’s memory is updated to maintain a consistent record of test cases aligned with the evolving fuzzing policy.

1) Block-Wise Test Case Generation: One of the core novelties in GoldenFuzz is a structured, block-wise test case generation strategy that balances complexity, coverage, and learning efficiency. Rather than generating full instruction sequences in a single step, GoldenFuzz constructs test cases incrementally from smaller units called instruction blocks (IBs). Each IB contains one or more instructions, depending on the current fuzzing configuration. In each fuzzing iteration $i$, GoldenFuzz selects a set of previously generated IBs either randomly or based on performance, depending on the fuzzing stage (see Fig. 1). These selected IBs are used as prefixes to generate new candidate blocks. Precisely, let $b^j_i$ denote the $j$-th selected block from iteration $i$. For each $b^j_i$, we generate $N$ new instruction blocks, forming a candidate set $B_{i+1} = \{b^1_{i+1}, b^2_{i+1}, \ldots, b^N_{i+1}\}$. Each new test case is then constructed by concatenating a selected prefix $b^j_i$ with one of the newly generated candidates $b^k_{i+1} \in B_{i+1}$, resulting in $b_{i,i+1} = b^j_i \oplus b^k_{i+1}$ (for simplicity, we omit the specific $j$ and $k$ in the notation).

This strategy has three key advantages. First, by using $b^j_i$ as a starting point, we anchor the execution in a known hardware state, allowing the fuzzer to focus its exploration on how the new block $b^k_{i+1}$ affects the state transition. Second, this block-wise construction simplifies learning: instead of reasoning over full test cases of length $N$ instructions, the fuzzer only needs to learn over smaller blocks of size roughly $N/M$ when the test case is divided into $M$ blocks. This decomposition reduces the complexity of the learning problem and improves convergence during feedback-guided optimization. Finally, repeatedly sampling new IBs allows the fuzzer to explore different (and potentially new) hardware spaces. As a result, each iteration pushes the hardware state exploration further, guiding the search process into less-discovered state space while preserving the insight gained from earlier discoveries.

Algorithm 1 illustrates the instruction block (IB) generation process for a single fuzzing iteration. The process begins with the selected IB set $B_i$ from iteration $i$. To generate the next set $B_{i+1}$, the fuzzer extends each IB $b \in B_i$ by taking a sequence of actions (Line 5), each appending a new instruction token


Fig. 2: An Overview of the GoldenFuzz Pipeline. (Panel labels: Fuzzer with Policy and Memory; Iterative Instruction Block Generation, Gen. #1 to Gen. #N; Simulation on DUT/GRM with Coverage Map; Scoring; Preference Pairs; Memory Update & Policy Refinement.)
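The loop depicted in Fig. 2 (generate blocks from sampled prefixes, simulate, score, form preference pairs, update a bounded memory) can be sketched as follows. This is a minimal, hypothetical illustration and not GoldenFuzz's implementation: `generate_blocks` stands in for the language-model generator with random opcodes, `simulate` is a toy target whose coverage is simply the set of distinct opcode values, and the score is the count of newly covered points rather than the paper's scoring function. All function and parameter names are assumptions.

```python
import random
from collections import deque


def generate_blocks(prefix, n, rng, block_len=3):
    """Stand-in for the LM generator (assumption): extend the prefix with
    n random instruction blocks of block_len toy opcodes each."""
    return [prefix + [rng.randrange(32) for _ in range(block_len)]
            for _ in range(n)]


def simulate(test_case):
    """Toy target standing in for the GRM or DUT: the coverage of a test
    case is assumed to be its set of distinct opcode values."""
    return set(test_case)


def run_iteration(memory, global_cov, rng, n_candidates=4, keep=2):
    """One pipeline iteration; returns (winner, loser) preference pairs."""
    prefixes = rng.sample(list(memory), k=min(keep, len(memory)))
    scored = []
    for prefix in prefixes:
        for case in generate_blocks(prefix, n_candidates, rng):
            # Simplified score: number of coverage points not seen before.
            scored.append((len(simulate(case) - global_cov), case))
    scored.sort(key=lambda item: item[0], reverse=True)
    winners = [case for _, case in scored[:keep]]
    losers = [case for _, case in scored[-keep:]]
    pairs = list(zip(winners, losers))
    for case in winners:
        global_cov |= simulate(case)   # update the shared coverage map
        memory.append(case)            # bounded deque drops the oldest entry
    return pairs


def run(iterations=5, seed=0, memory_size=8):
    rng = random.Random(seed)
    memory = deque([[0]], maxlen=memory_size)
    global_cov = set()
    pairs = []
    for _ in range(iterations):
        pairs = run_iteration(memory, global_cov, rng)
        # A real fuzzer would refine its generation policy on `pairs` here.
    return memory, global_cov, pairs
```

Using a `deque` with `maxlen` is one simple way to obtain first-in-first-out eviction; the policy refinement step is deliberately left as a comment because it depends on the model being trained.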

to $b$ (Line 6). This continues until a special separator token, <eoi>, is generated, signaling the end of an instruction (Line 7). Once the number of instructions in the current IB reaches the predefined limit, num_inst_per_block (Line 4), the fully constructed block, consisting of the original prefix and the newly generated instructions, is added to $B_{i+1}$ (Line 11). After generation, each IB in $B_{i+1}$ is simulated to determine whether it is “dead”. We consider an IB “dead” if it meets one of two conditions: (1) it is syntactically invalid, meaning it violates ISA specification, or (2) it prematurely terminates execution, for example by containing control-flow terminators like a ret (return) instruction. Since dead IBs cannot be meaningfully extended in future iterations, they are excluded from subsequent iterations. However, during the GRM fuzzing stage, these IBs still serve as useful negative (“losing”) examples, contributing to the fuzzing policy refinement.

Algorithm 1 Instruction Block Generation.
Require: B_i, fuzzer, num_blocks, num_inst_per_block
Ensure: B_{i+1}
 1:  B_{i+1} ← {}
 2:  for b in B_i do
 3:      cnt ← 0
 4:      while cnt < num_inst_per_block do
 5:          actions ← fuzzer(b, num_blocks)
 6:          b, eoi ← step(b, actions)
 7:          if eoi then
 8:              cnt ← cnt + 1
 9:          end if
10:      end while
11:      B_{i+1} ← append(b)
12:  end for
13:  B_{i+1} ← remove_dead_blocks(B_{i+1})
14:  if GRM fuzzing then
15:      B_{i+1} ← get_random_n(B_{i+1})
16:  else
17:      B_{i+1} ← get_best_n(B_{i+1})
18:  end if

2) Inter and Intra-test Case Scoring: Traditional hardware fuzzing methods usually assess each test case fully, measuring total coverage only after the test case is complete. In contrast, in the DUT fuzzing stage, GoldenFuzz evaluates each IB and employs a dual-layer scoring system:

Intra-test case scoring. We employ intra-test case scoring to incentivize newly uncovered coverage points within a single test case. Let $b_i$ be the $i$-th IB in a test case, and let $H_{b_i}$ represent the set of coverage points already revealed by $b_i$. When transitioning to IB $b_{i+1}$, the coverage newly discovered by combining $b_i$ with $b_{i+1}$ is denoted by $G(b_{i,i+1})$. As newly uncovered states are more valuable (more likely to trigger unexplored hardware states), they have a higher weight ($w$) than previously seen states. Specifically, we use two base weights, $\alpha$ and $\beta$ (with $\alpha < \beta$), so that

\[ w(x) = \begin{cases} \alpha, & \text{if } x \in H_{b_i}, \\ \beta, & \text{if } x \notin H_{b_i}, \end{cases} \tag{3} \]

where $x$ denotes a coverage point. $\alpha$ is a lower reward for coverage points that merely repeat what was already observed by previous IBs, whereas $\beta$ is a higher reward to incentivize new coverage.

Inter-test case scoring. Inter-test case scoring deducts the coverage score already identified by other test cases. Let $f(x)$ be the frequency with which a coverage point $x$ is encountered by test cases other than the current one. We introduce a small constant factor to reduce $\beta$ in proportion to $f(x)$, thereby lowering the reward for states that are already popular among different test cases:

\[ \beta'(x) = \max\big(\beta - f(x) \cdot \text{factor},\ \alpha\big). \tag{4} \]

Intuitively, as $f(x)$ increases, $\beta'(x)$ linearly transitions from $\beta$ (fully rewarding) down toward $\alpha$, reflecting the diminishing scores assigned to repeatedly covered points.

Combining two scoring schemes, the overall score for the transition from $b_i$ to $b_{i+1}$ is defined by summing the adjusted


weights for every coverage point $x$ in $G(b_{i,i+1})$. We have:

\[ S(b_{i,i+1}) = \sum_{x \in G(b_{i,i+1})} w'(x), \tag{5} \]

where $\beta$ in Eq. 3 is replaced with $\beta'(x)$ in Eq. 4 to form $w'(x)$. This strategy enables dynamic score adjustment in response to coverage frequency, potentially mitigating the risk of the fuzzer over-prioritizing already successful test cases. As shown in Section V-A, the hardware coverage keeps increasing with more test cases, while other fuzzers saturate early.

3) Fuzzing Policy Refinement: After scoring test cases, the next step is to refine the fuzzing policy using a principled, data-driven approach. Traditional hardware fuzzers, as described in Section II-A, typically adopt a mutation-based strategy inspired by the American Fuzzy Lop (AFL) pipeline [32]. They randomly mutate successful test cases while discarding unsuccessful ones. While straightforward, this approach is inefficient and wastes valuable information: it neglects insights from “losing” test cases that could guide future iterations and makes refining “good” test cases erratic, as it lacks feedback on which mutations are most effective.

GoldenFuzz overcomes these limitations by directly integrating feedback into the fuzzing process. Instead of randomly guessing which test cases are promising, GoldenFuzz explicitly pairs “winning” and “losing” test cases based on their scores, forming preference pairs. These pairs allow the fuzzer to efficiently utilize feedback regardless of test case coverage and refine its preference (i.e., fuzzer’s fuzzing policy) through direct comparison. By consistently prioritizing “winning” test cases, the fuzzer evolves toward generating test cases with higher coverage, thus probing into the deeper state space.

Concretely, we define the intrinsic reward of a test case $b$ using the model’s likelihood of generating it:

\[ r(b) = \frac{\beta}{|b|} \sum_{i=1}^{|b|} \log \pi_\theta(t_i \mid t_{<i}), \tag{6} \]

where $t_i$ denotes the $i$-th token of $b$; $t_{<i}$ represents the sequence of tokens preceding $t_i$; $\pi_\theta$ is the current fuzzing policy parameterized by $\theta$; and $\beta$ is introduced to scale the reward. This per-token likelihood quantifies how “natural” or policy-consistent a given test case is. Normalizing by sequence length ensures that test cases are not unfairly penalized or rewarded based on their instruction length alone, thus forcing GoldenFuzz to generate high-quality IBs instead of just making, e.g., long IBs, to win more rewards.

Using feedback from the GRM or DUT, pairwise preferences are established by comparing test cases and identifying a “winner” based on the scoring function (Eq. 5). Given two test cases $b_w$ (“winner”) and $b_l$ (“loser”), employing Simple Preference Optimization (SimPO) [41], a specialized form of Direct Preference Optimization [40] with lower computation cost, we update the fuzzing policy by minimizing:

\[ \mathcal{L}(\theta) = -\mathbb{E}_{(b_w, b_l) \sim \mathcal{B}} \big[ \log \sigma\big( r(b_{i,w}) - r(b_{i,l}) - \gamma \big) \big], \tag{7} \]

where $\mathcal{B}$ denotes the set of preference pairs constructed from […] and $\gamma$ establishes a target reward margin, ensuring that preference differences translate into meaningful fuzzing policy shifts. Eq. 7 encourages the fuzzer to shift output probability mass toward generating “winner” instruction sequences that are empirically more likely to uncover new coverage points.

However, naively applying Eq. 7 to our fuzzing scheme introduces significant challenges. DPO is typically employed as an offline optimization method, where preference pairs are collected before optimization. In contrast, hardware fuzzing demands an online optimization approach, where the fuzzing policy is refined iteratively during runtime. This continuous refinement introduces the risk of overfitting. Over time, the generated instructions risk drifting toward narrower instruction sets, neglecting other potentially valuable regions of the state space in the “distributional tails” [42]. Such rare points (a.k.a. corner cases) are critical for exposing deep and previously unseen vulnerabilities. Eventually, the fuzzer could collapse, unable to produce even syntactically correct instructions.

To address this, we introduce a fuzzing memory $\mathcal{M}$ to balance immediate gains with maintaining exploration diversity. After each iteration $i$, we identify $N$ top IBs and preference pairs from $B_i$ and store these “exemplars” in $\mathcal{M}$. $\mathcal{M}$ follows the “first-in-first-out” principle: we remove the oldest set from iteration $i - N$ to prevent unbounded growth and maintain a rolling window of strong candidates:

\[ \mathcal{M} \leftarrow \big(\mathcal{M} \setminus B^{\star}_{i-N}\big) \cup B^{\star}_{i+1}. \tag{8} \]

The IBs and preference pairs are sampled from $\mathcal{M}$ during a new refinement iteration or fuzzing policy. We assign exponential recency weighting to each sample so that recent samples are more likely to be sampled, but keep older samples in the mix. This prevents forgetting earlier coverage strategies and guards against overfitting to the current iteration’s local maxima [43]. Over time, this yields a stable fuzzing process.

E. Bug and Vulnerability Detection

Differential testing is extensively utilized in hardware fuzzing to identify “crashes”. Under this methodology, a single test case is executed on both the DUT and a GRM based on the ISA. The execution traces obtained from these models are then compared [19], [21], [23], [24]. In alignment with state-of-the-art hardware fuzzers, GoldenFuzz employs differential testing involving the DUT and the GRM. This approach has demonstrated its effectiveness across various hardware fuzzers, particularly in the context of RISC-V fuzzers, which have, to date, facilitated the discovery of most bugs and vulnerabilities [19]–[21], [23]–[25].

Although highly effective at identifying issues, differential testing can generate many mismatches, many of which are either duplicates or false positives [23]. Given the manual nature of vulnerability analysis, prolonged investigation of erroneous or redundant mismatches can significantly hinder the efficiency and scalability of hardware fuzzing, particularly as design complexity grows.
In our workflow, each unique the GRM evaluations; σ denotes the sigmoid function; γ mismatch, whether classified as a true bug or a false positive,

initially requires manual inspection. For the five new vulnerabilities, confirmation and classification typically took between 5 and 30 minutes per case, depending on the complexity of the scenario and the need to trace privilege transitions or instruction sequences. In the early stages of fuzzing, when many mismatches are new, this manual effort could increase.

To (partially) address this limitation and improve scalability, we propose a filtering approach to streamline the analysis process. After each mismatch is analyzed, it is added to a known mismatch list along with its environmental context (e.g., privilege level, instruction type, register values, and exception details). If the same mismatch recurs, the system can automatically classify it without further manual intervention. As fuzzing progresses, the proportion of previously seen mismatches increases, and the filtering mechanism substantially reduces the number of cases requiring manual analysis. This approach ensures that, even for large and complex designs, the manual effort required for vulnerability confirmation and analysis remains manageable.

IV. IMPLEMENTATION

A. Fuzzer Design and Pre-training

We implement our fuzzer using a GPT-2 language model tailored for hardware instruction generation. Employing this open-source architecture, consisting of 1.5 billion parameters and a vocabulary of over 50 000 tokens, poses two major challenges. First, GPT-2’s pretrained parameters are grounded in natural (English) language, which differs substantially from assembly language and binary code syntax. Second, GPT-2 employs Byte-Pair Encoding (BPE) to handle tokenization, splitting infrequently occurring words into smaller subword units. Although BPE benefits broad-domain language tasks, it introduces unnecessary complexity for hardware instructions. To address these issues, we built a customized GPT model designed explicitly for the RISC-V instruction set. Instead of relying on subwords, we assign individual tokens to each opcode and operand. This straightforward tokenization strategy preserves opcode/operand-level correctness without vastly increasing the vocabulary size. Besides, the token length of each instruction is reduced by more than half compared with BPE, making the training stage much easier, as the model does not need to learn long token patterns. During our preliminary study, we also experimented with reducing the model size by decreasing the number of layers, attention heads, and embedding dimensions. However, this led to suboptimal performance during fuzzing, potentially due to the reduced model capacity limiting its ability to capture the complex dependencies and semantics required to generate effective and diverse instruction blocks.

Training data preparation is fundamental to the fuzzer’s performance and its capacity to explore the DUT. To ensure comprehensive test case coverage with different instructions, our custom fuzzer model is trained from scratch using 10 million randomly generated RISC-V assembly instructions, including all possible RISC-V instructions and extensions: the fundamental 32-bit and 64-bit RISC-V ISAs and various extensions, such as integer multiplication/division, single-precision floating-point operations, atomic memory operations, compressed instructions, and machine-level instructions. This exhaustive inclusion allows our fuzzer to generate relevant test cases with maximum diversity, including less frequently utilized or specialized components of the instruction set.

Besides, due to the generative nature of the underlying LLM, diverse and even syntactically incorrect instructions still emerge naturally. While undesirable in typical LLM tasks, this characteristic is beneficial in hardware fuzzing, as such incorrectness helps evaluate corner cases. The initial training, described in Section III-C, involves 50,000 epochs with a learning rate of 1e-6 and takes approximately one hour on one NVIDIA A6000 GPU. During the subsequent fuzzing policy refinement phase (discussed in Section IV-B), we lower the learning rate to 2e-7 to maintain stable fine-tuning and ensure that the fuzzer’s output remains coherent.

B. Hardware Fuzzing Settings

Recall from Section III-D that GoldenFuzz initiates the fuzzing process by generating test cases constructed from multiple instruction blocks (IBs). In our configuration, each test case contains five IBs, each with six instructions. Preliminary experiments confirm that carefully chosen sets of around 30 instructions are generally sufficient to detect vulnerabilities, aligning with our observations on state-of-the-art fuzzers. We assess this hyperparameter choice in Section VII-B. In both fuzzing stages, 80 IBs are sampled from the fuzzing memory in each iteration, each of which forms the input to the model for generating five new IBs.

Each generated test case is executed on the GRM or the DUT. During the GRM fuzzing, as mentioned in Section III-B, the fuzzing policy is refined using the test case validity judged by the GRM. In the DUT fuzzing stage, hardware coverage, collected through Synopsys VCS [44], is used as the feedback, which is determined by examining a range of metrics that capture design behavior. These include finite state machine (FSM) coverage, condition coverage, and line coverage. FSM coverage evaluates how thoroughly the test cases explore the different states and transitions within state machines. Condition coverage evaluates whether logical conditions, such as branches and conditional expressions, have been covered, thus revealing how well decision points in the logic are tested. Line coverage measures how many lines of the Register Transfer Level (RTL) code have been stimulated. These coverage metrics offer a comprehensive assessment of each test case’s effectiveness in validating the hardware, guiding the fuzzing process toward more complete and revealing explorations of the design’s behavior. Besides, we randomize initial register values to maximize the likelihood of triggering corner cases. For the scoring function, we choose α = 0.1, β = 1, and factor = 1e-5 to heavily reward the exploration of new hardware states.

After computing scores, we form preference pairs of “winning” and “losing” IBs to refine the fuzzing policy, guided by the test case score using coverage metrics as inputs. We consider the best-score and worst-score IBs to be “winners” and “losers”, respectively. Hyperparameter tuning is crucial to


this iterative learning process. We pay special attention to three hyperparameters: the learning rate, the reward scaling factor β for sequence likelihood (Eq. 6), and the target reward margin γ (Eq. 7). We test a range of learning rates (1e-7, 2e-7, 5e-7, 1e-6, and 5e-6) and find that a smaller learning rate (e.g., 2e-7) provides efficient refinements and prevents the fuzzer from collapsing into incoherent or repetitive outputs. Similarly, β = 10 (chosen from a range of 1 to 10) delivers a balanced scaling between winning and losing responses. For the target reward margin γ, we settle on 0.8 after a fine-grained grid search from 0.1 to 1 in increments of 0.1. Each training iteration uses a batch size of 128, enabling efficient GPU memory usage and stable fuzzing policy updates. In general, these hyperparameters can be directly applied to new targets, with adaptation (if needed) starting from small learning rates and gradual adjustment of reward scaling.

GoldenFuzz employs the Spike simulator [45] as the GRM during the profiling stage, while DUT implementations include RocketChip [29], BOOM [30], and CVA6 [31]. We rely on a combined simulation workflow involving Synopsys VCS and Spike for vulnerability detection. Synopsys VCS provides detailed hardware simulation traces, recording register updates and memory operations at each executed instruction’s boundary. Concurrently, Spike serves as the high-level reference model for RISC-V ISA execution, producing idealized reference traces that outline the expected register and memory states after each instruction when running a RISC-V binary. Our framework identifies discrepancies in register values, memory addresses, and memory contents by comparing the hardware simulation traces from VCS and Spike’s execution traces utilizing a mismatch detector. Any mismatch flags a potential bug or vulnerability in the DUT or Spike, which will be analyzed manually by GoldenFuzz’s user to confirm the bug or vulnerability.

We developed an automated framework to identify mismatches by parsing trace outputs generated by the target cores. For each test case, this tool processes execution traces for both the core and the GRM, which include details such as time, clock cycles, addresses, instructions, execution privilege levels, register values post-instruction execution, memory transactions, and exception details. These traces are then compared instruction by instruction. Since the initial portions of all test cases are identical due to the initialization of the environment, the parser skips these instructions. Upon detecting a mismatch between the core and GRM traces, the framework applies the filtering approach described in Section III-E. The mismatch is disregarded if the environmental information associated with the mismatch aligns with any predefined filter criteria. Otherwise, it is logged in a file for further manual analysis.

V. PERFORMANCE EVALUATION

This section evaluates the performance of GoldenFuzz through a comprehensive analysis and benchmark on hardware coverage and computational cost.

A. Hardware Coverage

GoldenFuzz adopts a coverage-guided white-box fuzzing strategy to maximize exploration of hardware states, thereby elevating the probability of exposing bugs and vulnerabilities.

For the hardware coverage benchmark, we compare GoldenFuzz against four state-of-the-art fuzzers: Cascade [24], DifuzzRTL [25], TheHuzz [19], and ChatFuzz [21], with a special emphasis on Cascade, the most recent among them, and ChatFuzz, the most recent LLM-based fuzzer. For a thorough assessment, we measure condition, line, and FSM coverage across three RISC-V cores: RocketChip [29], BOOM [30], and CVA6 [31]. Due to the page limit, we limit our analysis of DifuzzRTL and TheHuzz to condition coverage on RocketChip.

Fig. 3: Coverage Benchmark. Panels: (a) RocketChip, (b) BOOM, (c) CVA6, (d) All Fuzzers on RocketChip. Coverage increase values shown in the panel titles: 2.09%, 5.16%, 1.12%.

The results are shown in Fig. 3. We adjust the figure scale to better visualize the trend in coverage. The coverage increase introduced by GoldenFuzz compared with the best performing fuzzer is shown in the figure title. GoldenFuzz consistently outperforms Cascade and ChatFuzz across all tested cores and coverage metrics, except FSM coverage on RocketChip (bottom figure in Fig. 3a), where both fuzzers achieve the same coverage. Notably, Cascade employs basic block concatenation, where each block terminates with an instruction that modifies the program counter. This approach allows its test cases to grow to thousands of instructions, with 10 000 instructions as optimal for maximizing coverage and vulnerability detection. In contrast, GoldenFuzz operates with significantly shorter test cases of 30 instructions. Consequently, Cascade, in some test scenarios, achieves higher coverage at the beginning due to its longer test sequences. However, the early coverage gains do not inherently sustain long-term exploration. The coverage of GoldenFuzz steadily increases across all three


cores, eventually surpassing Cascade, whose coverage plateaus earlier. On the other hand, although ChatFuzz also employs a language model, its binary test case format constrains the fuzzer’s ability to understand the inter- and intra-instruction semantics, eventually leading to lower coverage.

Next, we compare coverage among all tested fuzzers on RocketChip using condition coverage metrics, again including Cascade for completeness. As presented in Fig. 3d, GoldenFuzz significantly exceeds the coverage achieved by DifuzzRTL, TheHuzz, and ChatFuzz, each of which plateaus quickly. By contrast, GoldenFuzz maintains superior coverage even as the number of test cases grows. Impressively, it achieves coverage comparable to that of other fuzzers using test cases of only 30 instructions and fewer than 1% of the test cases, demonstrating its efficiency in exploring diverse hardware states. This advantage translates directly to detecting bugs and vulnerabilities with minimal hardware simulation overhead, affirming GoldenFuzz as a high-performance tool for hardware security assessments. We evaluate the robustness of GoldenFuzz in Section VII-C.

B. Computational Cost

The computational efficiency of a fuzzer directly impacts its overall runtime performance. The overhead in GoldenFuzz primarily stems from four components: (1) fuzzer pre-training, (2) test case generation, (3) the fuzzer’s fuzzing policy refinement, and (4) DUT instrumentation. The pre-training of the fuzzer, a one-time task, takes around one hour to finish on an NVIDIA A6000 GPU. Although the remaining three components primarily run on a GPU, we analyze their computational performance on both CPU and GPU for a fair comparison with related works.

On an AMD EPYC 9684X CPU, GoldenFuzz generates each test case in just 0.34 seconds, which shrinks to 0.012 seconds on an A6000 GPU. These times significantly outperform advanced fuzzing tools such as Cascade (2.06 seconds per test) and TheHuzz (2.47 seconds) in the same CPU setting. Indeed, GoldenFuzz’s instruction block-based workflow keeps overhead low: each new test only requires (1) choosing existing instruction blocks and (2) appending one block of six additional instructions. Throughput can be raised further by scaling up parallel input generation, trading off memory for the ability to produce many tests simultaneously.

In the GRM fuzzing stage, evaluating a test case takes around 0.004 seconds on a CPU, contributing only minimal additional cost. Direct preference optimization appears to be time-intensive. However, with our customized and small language model and limited preference pairs per iteration, each tuning iteration completes in under 40 seconds on an A6000 GPU and less than 200 seconds using only a CPU. The bottleneck for GoldenFuzz lies in DUT instrumentation, which averages 1.36 seconds per test case when running 80 test cases in parallel. Despite this, GoldenFuzz reduces its total testing volume by more than half compared to other fuzzers and still achieves comparable or even superior coverage (Fig. 3d), substantially reducing the overall computational overhead.

VI. FUZZING FINDINGS

A. Testcase Quality

While Section V-A has shown that GoldenFuzz significantly improves hardware coverage, this section provides a deeper empirical analysis of the mechanisms driving this improvement. Concretely, we explain why GoldenFuzz achieves higher and faster coverage compared to existing fuzzers.

     1  li x2, 0xa9b1d00fffffffff
     2  csrs pmpaddr0, x2
     3  ...
     4  li x14, 0xff0f0fccdfaaaa1f
     5  csrs pmpcfg0, x14
     6  ...
     7  li x7, 0x000000000005bcfa
     8  csrs mstatus, x7
     9  ...
    10  mret

Listing 1: Testcase Quality Running Example

Listing 1 shows a simplified test case generated by GoldenFuzz. While the instruction values and memory operations involved in this test case may initially seem arbitrary, they actually constitute a minimal and valid sequence required to transition the processor from Machine (M-mode) to Supervisor (S-mode) privilege. Traditional fuzzers and verification tools often fail to uncover such vulnerabilities because they struggle with the enormous search space and typically lack the semantic understanding needed to generate these specific privilege-escalation scenarios. As a result, these tools are unlikely to produce the precise instruction and memory patterns necessary to trigger the vulnerabilities within a reasonable timeframe. In contrast, GoldenFuzz leverages semantic guidance to synthesize such targeted behaviors efficiently, often within the first 1 000 test cases. This capability explains why GoldenFuzz was able to detect the five new vulnerabilities that other tools missed. These results empirically support several key mechanisms behind our framework’s effectiveness:

• Effectiveness of GRM feedback. The GRM steers the generation process toward semantically valid test cases that are less likely to trigger immediate exceptions, thus enabling deeper exploration of the DUT. For instance, instructions like li x2, 0xa9b1d00fffffffff and csrs pmpcfg0, x14 emerge naturally as the fuzzer explores the input space.

• Coverage-driven DUT fuzzing. When running on the DUT, feedback based on hardware coverage metrics (e.g., condition or transition coverage) guides the fuzzer to generate tests that explore unvisited states. Privilege transitions (Line 10), such as mret and sret, require multiple conditions to be satisfied. Their presence in the test case demonstrates that GoldenFuzz can learn and exploit such constraints to expand coverage.

• Language model integration for enhanced understanding. By integrating coverage feedback into the training process, the language model-based fuzzer learns to associate


instruction patterns with state transitions. For example, the co-occurrence of pmpaddr and pmpcfg reflects the model’s understanding of PMP (Physical Memory Protection) configurations, indicating that the model captures semantic dependencies across instruction sequences.

B. Detected Vulnerabilities

GoldenFuzz identified five previously unknown vulnerabilities in the open-source cores and two from the commercialized BA51-H core [46]. These vulnerabilities have been reported to the respective benchmark developers, who confirmed that they were unaware of these issues before our disclosure. We have obtained the following Common Vulnerabilities and Exposures (CVE) entries for these vulnerabilities: CVE-2025-45883 and CVE-2025-45881. The five vulnerabilities we identified are severe (four have a CVSS 3.0 score exceeding 7) and complex, involving multi-instruction execution paths and triggers from different privilege modes. Besides, the effectiveness of GoldenFuzz is not limited to discovering these specific vulnerabilities. During our analysis, we observed that our fuzzer could also trigger bugs and vulnerabilities that were previously reported but remain unresolved. Furthermore, static analysis revealed that our fuzzer could generate test cases for vulnerabilities already addressed in recent updates [47], [48]. In certain instances, our fuzzer even detected bugs not reported in related fuzzing studies, although the benchmark developers acknowledged being aware of these issues.

The following paragraphs explain each vulnerability and present Proof-of-Concept (PoC) code to demonstrate how these vulnerabilities can be triggered. The PoC code examples are simplified and represent minimal subsets of the test cases generated by the fuzzer.

V1 & V2 Incorrect Endianness Changes in CVA6 Processors. The RISC-V ISA specification defines the MBE and SBE fields in the mstatus and mstatush registers as controlling the endianness of memory accesses, with the exception of instruction fetches, which are inherently little-endian. These fields have the following specific roles. Machine Endianness (MBE) governs the endianness for memory accesses in M-mode. Setting MBE to 0 enforces little-endian memory accesses, while setting it to 1 enforces big-endian accesses. Supervisor Endianness (SBE) determines the endianness of memory accesses in S-mode, provided that S-mode is supported in the implementation. The SBE field is of particular importance for supervisor- and hypervisor-level operations, such as page table management. However, a deviation from the expected behavior has been observed on the RV64 CVA6 core. Specifically, manipulating the MBE or SBE fields in the mstatus register does not alter the endianness of explicit memory access instructions as anticipated. For instance, despite clearing or setting the MBE or SBE bits, the endianness of subsequent sw (store word) and lb (load byte) instructions remains unaffected. The issue can be reproduced via Listing 2.

    1  li t2, 0x12345678
    2  sw t2, 0(t1)
    3  lb t3, 0(t1)        // Exp(0x78) == Obs(0x78)
    4  li t0, (1 << 37)    // Set MBE (bit 37)
    5  csrs mstatus, t0
    6  sw t2, 0(t1)
    7  lb t3, 0(t1)        // Exp(0x12) != Obs(0x78)

Listing 2: Sample Code Snippet Demonstrating the Triggering of Vulnerabilities 1 and 2.

In Listing 2, we present code demonstrating the first vulnerability associated with MBE. Notably, the same code can be adapted to exploit a second vulnerability by altering the execution mode from M-mode to S-mode. This transition can be achieved by appropriately configuring the mstatus register followed by executing the MRET instruction. Additionally, the offset for setting the bit (t0) should be adjusted to 36, corresponding to SBE, instead of 37, which pertains to MBE. In both vulnerability scenarios, the expected value from the last load instruction, Line 7, should be 0x12, representing a big-endian load for 0x12345678. Nonetheless, in both cases, the processor loads 0x78, indicating that the endianness configuration did not change despite setting the MBE and SBE bits.

The vulnerabilities V1 and V2 are classified as severe due to their high CVSS 3.0 scores of 7.5. These vulnerabilities present multiple attack vectors that adversaries can exploit to compromise the confidentiality, integrity, and availability of the system. Specifically, the processor fails to enforce the expected endianness for certain memory access instructions, such as sw and lb. An attacker can exploit this flaw by creating scenarios where discrepancies between the expected and actual behavior affect critical memory management structures, such as page tables. This exploitation can facilitate bypassing memory isolation mechanisms, leading to the corruption of page tables or unauthorized access to memory regions. Consequently, the attacker may gain elevated privileges or extract sensitive information. Furthermore, by combining these vulnerabilities with a software memory corruption attack, an attacker can destabilize the operating system kernel or hypervisor, resulting in arbitrary code execution or denial-of-service conditions.

V3 Improper Masking of Delegated Supervisor Timer Interrupts (STI) in CVA6 Processors. According to the RISC-V ISA specification [49], [50], delegated interrupts should be masked at the delegator privilege level. Specifically, if the Supervisor Timer Interrupt (STI) is delegated to S-mode by setting the appropriate bit (5th bit in mideleg), the interrupt should not be taken when executing in M-mode. In this configuration, STIs should only trigger in S-mode, with control transferring to the corresponding S-mode interrupt handler. Conversely, if mideleg[5] is cleared, the interrupt is not delegated and should be taken in any mode, transferring control to the M-mode handler.

However, a deviation from the specified behavior has been observed in the RV64 CVA6 [31] core. Specifically, when the STI is delegated to the S-mode by setting mideleg[5], the interrupt remains visible in the M-mode, contrary to

expectations. Even though the STI is delegated, it is still reflected in the mip register while executing in the M-mode rather than being masked. Instead, the interrupt should only appear in the sip register when executing in S-mode, indicating proper delegation. This behavior violates the RISC-V specification, as delegated interrupts should not appear at the delegator privilege level (in this case, M-mode). The issue can be reproduced using the PoC code snippet in Listing 3.

    li t0, (1 << 5)    // Load STI interrupt
    csrs mip, t0       // Set STI in mip register
    csrr t0, mip       // Check mask
    csrr t0, sip       // Check delegation

Listing 3: Sample Code Demonstrating the Masking of Delegated STI.

The vulnerability V3 is classified as severe due to its high CVSS 3.0 score of 7.6. Specifically, when the STI is delegated to the S-mode, it remains visible in the M-mode, violating the expected isolation between privilege levels defined by the RISC-V ISA. An attacker operating in S-mode can exploit this flaw by triggering repetitive STI interrupts and observing side effects in M-mode, such as execution timing variations or unexpected interrupt handling. This behavior allows the attacker to infer sensitive information about the M-mode firmware’s internal state, including interrupt handling logic, memory management operations, or context-switching activities. Moreover, if M-mode software inadvertently processes these unexpected interrupts, it may expose unintended information or introduce vulnerabilities. For instance, M-mode may log the interrupts, update counters, or modify critical state variables, enabling an S-mode attacker to deduce whether M-mode is engaged in specific privileged operations or to infer details about memory layout and interrupt delegation mechanisms. Such leakage can be leveraged to bypass privilege isolation, potentially escalating the attacker’s privileges or providing a foothold for further exploitation.

V4 Improper Handling of stval Register in CVA6 Processors for HFENCE.GVMA Instruction. According to the RISC-V ISA specification [49], [50], when an illegal instruction exception occurs in Hypervisor/Supervisor Mode (HS-mode), the stval register can optionally return the faulting instruction bits. Specifically, if stval is written with a nonzero value when an illegal instruction exception occurs, it should contain the shortest of: a) the actual faulting instruction, b) the first ILEN bits of the faulting instruction, and c) the first SXLEN bits of the faulting instruction.

    li t0, (1 << 20)          // Load TVM
    csrs mstatus, t0          // Set TVM
    // Switch to HS-mode
    hfence.gvma zero, zero    // Exception here

Listing 4: Sample Code Demonstrating the Improper stval Handling for HFENCE.GVMA

Furthermore, in HS-mode, when the mstatus.TVM flag is set, executing the HFENCE.GVMA instruction should trigger an illegal instruction exception. Upon triggering the exception, the stval register should be set to the value of the faulting instruction. In this case, stval should hold the value of the HFENCE.GVMA instruction itself. However, a deviation from this expected behavior has been observed in the CVA6 core [31]. Specifically, instead of setting stval to the correct faulting instruction value (0x62000073), the core erroneously sets it to 0x1. This behavior violates the RISC-V specification, as the stval register is expected to contain the faulting instruction. The issue can be reproduced using the PoC code snippet in Listing 4. The vulnerability V4 is classified as moderate (CVSS 3.0 score 5.5), due to its potential impact on exception handling and debugging. This vulnerability could lead to uncertainty during debugging or exception handling. This, in turn, may result in incorrect exception handling behavior, particularly in complex systems that rely on accurate faulting instruction information for diagnostics or recovery procedures. In systems where accurate fault reporting is crucial for security or stability, this could introduce difficulties in identifying the root cause of the exception, potentially affecting system reliability or security.

V5 Access Control Issue in CSR Register Files. During the discussion with the CVA6 developers regarding vulnerabilities V1 and V2, we identified a further vulnerability that follows from the behavior seen in V1 and V2 and could be far more critical. This issue represents a critical access control problem in the Control Status Register (CSR) register files module. Specifically, the MBE and SBE bits, which are expected to be read-only and set to zero, can be modified, contrary to the CVA6 specification [51]. As demonstrated in Listing 2, the MBE and SBE bits could be changed, despite being expected to be locked at zero. The vulnerability V5 is classified as severe (CVSS 3.0 score 7.6), due to its potential impact on CSR access control, as it allows for unintended modification of critical status information that is supposed to be protected.

Two New Vulnerabilities on a Commercial Core. GoldenFuzz successfully detected two bugs in the implementation of extensions of Beyond’s BA51-H core [46], a commercialized design developed during the CROSSCON project [52]. For confidentiality reasons, we cannot disclose the full bug details. However, one of the identified issues appeared while accessing specific registers, highlighting a subtle interaction flaw within the newly developed functionalities. This discovery demonstrates GoldenFuzz’s capability to uncover critical design flaws in evolving commercial hardware, showcasing its significant value for industrial applications.

VII. ABLATION AND HYPERPARAMETER STUDY

A. The Need for GRM Fuzzing

As the core component of GoldenFuzz, GRM plays a crucial role in reducing overhead within the GoldenFuzz framework by acting as a “digital twin” of the DUT. To assess the effectiveness and necessity of this component, we conduct an ablation study on the performance of GoldenFuzz with and without GRM.


[Figure 4: three plots, not reproduced here. Panels: (a) The Need for the GRM Exploration; (b) Instruction Block Settings; (c) Robustness of GoldenFuzz. The x-axis of each panel is the number of test cases.]
Fig. 4: Ablation and Hyperparameter Studies on Critical Settings.
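The two rate metrics reported in Fig. 4a (invalid rate and extendable rate) can be illustrated with a minimal sketch. It is not the authors' implementation: the `IBOutcome` record, its fields, and the `BRANCH_THRESHOLD` value standing in for "largely executable" blocks that contain branches are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IBOutcome:
    """Hypothetical record of how one instruction block (IB) behaved."""
    syntactically_valid: bool
    executed_fraction: float  # share of the block's instructions that executed
    has_branch: bool


# Assumed cutoff for "largely executable" blocks containing branches;
# the paper does not state a concrete value.
BRANCH_THRESHOLD = 0.8


def is_extendable(ib: IBOutcome) -> bool:
    """An IB is extendable (not 'dead') if it is valid and fully executable,
    or largely executable when it contains branch instructions."""
    if not ib.syntactically_valid:
        return False
    if ib.executed_fraction >= 1.0:
        return True
    return ib.has_branch and ib.executed_fraction >= BRANCH_THRESHOLD


def rates(outcomes: List[IBOutcome]) -> Tuple[float, float]:
    """Return (invalid_rate, extendable_rate) over a batch of IB outcomes."""
    total = len(outcomes)
    if total == 0:
        return 0.0, 0.0
    invalid = sum(1 for ib in outcomes if not ib.syntactically_valid)
    extendable = sum(1 for ib in outcomes if is_extendable(ib))
    return invalid / total, extendable / total


if __name__ == "__main__":
    batch = [
        IBOutcome(True, 1.0, False),   # valid and fully executed
        IBOutcome(True, 0.9, True),    # branch block, largely executed
        IBOutcome(True, 0.5, False),   # valid but terminated early
        IBOutcome(False, 0.0, False),  # syntactically invalid
    ]
    print(rates(batch))
```

Under these assumptions, a batch with many invalid or prematurely terminating blocks shows a high invalid rate and a low extendable rate, which is the pattern the ablation attributes to fuzzing without the GRM.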

Figure 4a presents the results across three key metrics: 1) coverage points, which aggregate condition, line, and finite-state machine (FSM) coverage; 2) invalid rate, which measures the proportion of IBs that are syntactically invalid; and 3) extendable rate, which quantifies the percentage of IBs that are suitable for concatenation in future iterations. As defined in Section III-D, an IB is considered extendable (or, not “dead”) if it is syntactically valid and fully executable, or largely so in the case of branch instructions. Across all three metrics, GRM-guided fuzzing (red curve) consistently outperforms the baseline that explores solely the DUT. Specifically, GRM fuzzing achieves higher overall coverage, generates fewer invalid IBs, and yields a greater proportion of extendable IBs. These results highlight a critical advantage of using a GRM during early-stage policy refinement: it enables safer and more efficient coverage exploration without risking the stability or limited observability of the DUT. From a higher level, by leveraging the GRM for initial, rapid, and low-cost coverage-guided test case refinement before targeting the actual hardware, GoldenFuzz significantly accelerates deep architectural exploration and vulnerability discovery.

B. The Selection of Instruction Blocks Settings.

Instruction block-based test case generation enables GoldenFuzz to steadily explore new hardware states with low computational overhead. To evaluate the impact of instruction block (IB) granularity on fuzzing effectiveness, we compare three block configurations: five IBs of six instructions each (5-6), three IBs of ten instructions (3-10), and one IB containing all 30 instructions (1-30). The y-axis in Fig. 4b shows the hardware coverage points achieved over learning iterations.

The results demonstrate that using more, smaller IBs (5-6) consistently leads to higher coverage. This improvement stems from two key advantages. First, smaller blocks offer finer-grained exploration, allowing each IB to independently probe different hardware states. This increases the likelihood of reaching diverse execution paths and reduces the risk of coverage stagnation. Second, block-wise generation significantly eases the learning task for the fuzzer. Indeed, the fuzzer only needs to learn how to construct shorter, more manageable IBs, reducing the search space and accelerating convergence.

In contrast, larger IBs (e.g., 1-30) are more prone to becoming dead, i.e., failing to execute due to syntax or runtime errors such as invalid instructions or premature termination. Indeed, longer IBs have a higher chance of being discarded, which reduces the pool of viable test cases and leads to lower overall coverage. This degradation can reach a point where the system fails to generate any valid IBs at all.

While configurations like 30-1 (30 one-instruction IBs) may offer greater stability due to minimal instruction complexity per block, they introduce significant computational overhead and reduce the contextual expressiveness of each block. Our experiments show that a middle ground, such as the 5-6 configuration, offers the best trade-off between learning efficiency, coverage growth, and computational feasibility.

C. Robustness of GoldenFuzz.

To evaluate the robustness and consistency of our fuzzing strategy, we conducted five independent fuzzing runs and measured the resulting coverage across three metrics: line, condition, and FSM coverage. Figure 4c presents the mean coverage and its variance (shaded regions) as the number of test cases increases; the y-axis scales are set differently for better visibility. Across all three metrics, GoldenFuzz demonstrates stable and consistent behavior. Indeed, the narrow variance bands across all metrics demonstrate that GoldenFuzz is not only effective but also robust, consistently achieving high architectural coverage regardless of initialization or randomness in the fuzzing process.

VIII. DISCUSSION

Cross-design Generalization. A key advantage of GoldenFuzz is its portability across implementations of the same ISA. In our evaluation across three distinct RISC-V cores, including the out-of-order BOOM processor [30], no changes were required to the framework. Indeed, all ISA-compliant processors, regardless of their microarchitecture, must expose the same architectural state at instruction retirement. Consequently, GoldenFuzz is directly applicable to any conforming RISC-V core. Adapting GoldenFuzz to other ISAs (e.g., ARM or x86) and more complex (closed-source) designs follows the same workflow. While technically feasible, the practical limitations

are primarily tied to (1) the availability of GRM and (2) the closed-source nature of ISA and design implementation. Recall that GoldenFuzz relies on GRM for the low-cost fuzzing policy refinement and, like other white-box fuzzers [19], [21], [24], [25], leverages RTL-level coverage feedback to explore deeper hardware states. The lack of GRM would degrade GoldenFuzz to a conventional fuzzer that directly interacts with DUT; the lack of design insight further degrades GoldenFuzz into a pure random fuzzer and poses a significant challenge to vulnerability detection. These availability and open-source constraints reflect a broader challenge in applying advanced fuzzing techniques to commercial and closed-source platforms.

LLM Advances for GoldenFuzz. Recent advances in Large Language Models (LLMs) offer promising avenues for further improving hardware vulnerability detection. Incorporating techniques like Retrieval-Augmented Generation (RAG) [53] could enhance hardware fuzzing by integrating design specifications, thereby increasing the efficiency of vulnerability detection. For instance, an LLM could retrieve relevant design documentation to inform its test case generation, leading to more targeted and effective fuzzing. Additionally, allowing the fuzzer to understand human languages opens up opportunities to apply prompt engineering techniques to generate more refined test cases specifically tailored for, e.g., targeted building blocks within the hardware design or certain CWEs. These advancements could lead to more intelligent and context-aware fuzzing strategies, thus uncovering critical vulnerabilities.

IX. RELATED WORK

Hardware fuzzing frameworks have been widely deployed to various designs, such as SoCs, CPUs, and isolated IP blocks. We categorize them into generic fuzzers and processor fuzzers.

Generic Fuzzers. Existing works, such as Laeufer et al. [54] and Li et al. [55], demonstrate the feasibility of using FPGA emulation-based generic fuzzing based on multiplexer control signals. However, these approaches are reliant on specific hardware design languages (e.g., HDL), limiting their scalability. Further, the overheads in monitoring multiplexers in complex designs hamper the usability [28]. In contrast, Trippel et al. [18] proposed fuzzing hardware like software by fuzzing the hardware simulation binary rather than porting software fuzzers directly on the hardware designs. Verilator [56] translates the hardware to equivalent software. This approach allows fuzzing to utilize existing software coverage metrics, such as basic block and edge coverage [28]. Still, it faces scalability problems when, e.g., fuzzing a whole CPU.

Processor Fuzzers. TheHuzz [19] simulates the RTL design of the processor with the binary format of the instruction using Synopsys VCS [44] that traces code coverage through various metrics, including branch, condition, toggle, FSM, and functional coverage. However, this method suffers from low computation efficiency and hardware coverage. DifuzzRTL [25] generates instructions and collects control register coverage to guide the fuzzing process. Following this work, MorFuzz [26] achieves a final coverage that is 4.4 times higher than DifuzzRTL by generating fuzzing seeds based on syntax and semantics and using runtime information feedback to mutate instructions. ProcessorFuzz [28] is a concurrent work that generates instructions and collects coverage of control and status registers. However, these works only focus on the coverage of registers generating the select signals of MUXes, leading to missing bugs and vulnerabilities. To increase the design coverage and detect more vulnerabilities, HyPFuzz [20] was proposed to guide the fuzzer by the formal verification tools reaching hard-to-reach design spaces. Alternatively, SoC Fuzzer [57] directs the fuzzing based on the security properties (generic cost function) that detect vulnerabilities in the DUT. Finally, Cascade [24] aims to enhance instruction execution efficiency by constructing long programs and eliminating control flow influences. It conducts the entire fuzzing process at the program-level granularity without using any mutation strategies to guide fuzzing. However, these approaches overlook the complexity of input semantics and receive limited feedback from the DUT. Unfortunately, overly long and complex test cases lead to high time consumption; the basic block’s usage constrains the test case’s variety, thus reducing the bug detection capability. ChatFuzz [21] uses a pre-trained language model, fine-tuned by reinforcement learning on instruction binaries, for the hardware fuzzing. In contrast, we specifically customize a GPT model with the proposed hardware fuzzing scheme, leading to lighter computation overhead and significantly better performance in hardware coverage and vulnerability detection.

X. CONCLUSION

In this paper, we present GoldenFuzz, a novel language-model-based hardware fuzzer that decouples test case refinement from coverage and vulnerability exploration. It first uses a software-based Golden Reference Model (GRM), a “digital twin” conforming to the DUT’s ISA, to efficiently refine fuzzing strategies before executing targeted tests on the actual DUT, reducing the cost for cycle-accurate simulations. Empirical results show that GoldenFuzz outperforms state-of-the-art fuzzers in both coverage and computational cost. It detects all previously known vulnerabilities and uncovers seven new ones in open-source and commercial cores, four of which are rated highly severe (CVSS > 7.0), highlighting their real-world impact.

ACKNOWLEDGEMENT

Our research work was partially funded by Intel’s Scalable Assurance Program, DFG-SFB 1119-236615297, the European Union under Horizon Europe Programme-Grant Agreement 101070537-CrossCon, NSF-DFG-Grant 538883423, the European Research Council under the ERC Programme-Grant 101055025-HYDRANOS, and Synopsys (special thanks to Catherine Le Lan) with EDA tool license support. This work does not in any way constitute an Intel endorsement of a product or supplier. Any opinions, findings, conclusions, or recommendations expressed herein are those of the authors and do not necessarily reflect those of Intel, the European Union, or the European Research Council.


ETHICS CONSIDERATIONS

GoldenFuzz is a hardware fuzzing tool developed to advance the functional and security verification of hardware, particularly processors. Its intended users include security researchers, hardware manufacturers and designers, and hardware security companies. By using the GoldenFuzz framework, users can discover new bugs and vulnerabilities in hardware designs under test. We thoroughly evaluated GoldenFuzz on three widely recognized benchmarks, discovering five new bugs and vulnerabilities. In line with the Menlo Report principles [58], particularly the principle of “Respect for Persons”, we have ensured that the security vulnerabilities identified by GoldenFuzz were promptly communicated to the responsible teams for CVA6 [31], BOOM [30], and RocketChip [29]. This timely disclosure was essential to mitigate risks and protect individuals from potential harm that could arise if adversaries were to uncover these vulnerabilities independently. The responsible parties have acknowledged the issues and are actively working on implementing fixes to address the concerns raised in this paper.

To further adhere to the principle of justice, which requires fairness in distributing benefits and burdens, we have carefully considered the potential risks associated with the misuse of GoldenFuzz. To prevent harm and ensure that the framework is used to advance research and enhance hardware security, access to the source code for GoldenFuzz will be restricted. It will be made available only upon request and with confirmation that it will be used exclusively for responsible research by academic users and hardware manufacturers. This approach aligns with the Menlo Report [58] emphasis on promoting social value while protecting individual rights and privacy, ensuring that GoldenFuzz contributes positively to the hardware security community without introducing risks.

REFERENCES

[1] Intel. (2024) July 2024 update on instability reports on intel core 13th and 14th gen desktop processors. [Online]. Available: https://community.intel.com/t5/Processors/July-2024-Update-on-Instability-Reports-on-Intel-Core-13th-and/m-p/1617113
[2] I. Corporation, “Pentium fdiv bug,” https://www.cs.earlham.edu/~dusko/cs63/fdiv.htm, 2001.
[3] A. Corporation, “Barcelona tlb bug,” https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/revision-guides/41322 10h Rev Gd.pdf, 2012.
[4] P. Borrello, A. Kogler, M. Schwarzl, M. Lipp, D. Gruss, and M. Schwarz, “ÆPIC Leak: Architecturally leaking uninitialized data from the microarchitecture,” in 31st USENIX Security Symposium (USENIX Security 22), 2022.
[5] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher et al., “Spectre attacks: Exploiting speculative execution,” Communications of the ACM, vol. 63, no. 7, pp. 93–101, 2020.
[6] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg, “Meltdown,” arXiv preprint arXiv:1801.01207, 2018.
[7] Intel, “Machine check error avoidance on page size change/cve2018-12207,” https://www.intel.com/content/www/us/en/developer/articles/troubleshooting/software-security-guidance/advisory-guidance/machine-check-error-avoidance-page-size-change.html, 2019.
[8] S. R. Sarangi, A. Tiwari, and J. Torrellas, “Phoenix: Detecting and recovering from permanent processor design bugs with programmable hardware,” in 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06). IEEE, 2006, pp. 26–37.
[9] C. Deutschbein and C. Sturton, “Mining security critical linear temporal logic specifications for processors,” in 2018 19th International Workshop on Microprocessor and SOC Test and Verification (MTV). IEEE, 2018, pp. 18–23.
[10] B. Wile, J. Goss, and W. Roesner, Comprehensive functional verification: The complete industry cycle. Morgan Kaufmann, 2005.
[11] G. Dessouky, D. Gens, P. Haney, G. Persyn, A. Kanuparthi, H. Khattri, J. M. Fung, A.-R. Sadeghi, and J. Rajendran, “HardFails: insights into software-exploitable hardware bugs,” in 28th USENIX Security Symposium (USENIX Security 19), 2019, pp. 213–230.
[12] E. M. Clarke, W. Klieber, M. Nováček, and P. Zuliani, “Model checking and the state explosion problem,” in LASER Summer School on Software Engineering. Springer, 2011, pp. 1–30.
[13] I. Wagner and V. Bertacco, “Engineering trust with semantic guardians,” in 2007 Design, Automation & Test in Europe Conference & Exhibition. IEEE, 2007, pp. 1–6.
[14] M. Hicks, C. Sturton, S. T. King, and J. M. Smith, “Specs: A lightweight runtime mechanism for protecting software from security-critical processor bugs,” in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015, pp. 517–529.
[15] X. Li, M. Tiwari, J. K. Oberg, V. Kashyap, F. T. Chong, T. Sherwood, and B. Hardekopf, “Caisson: a hardware description language for secure information flow,” ACM Sigplan Notices, vol. 46, no. 6, pp. 109–120, 2011.
[16] X. Li, V. Kashyap, J. K. Oberg, M. Tiwari, V. R. Rajarathinam, R. Kastner, T. Sherwood, B. Hardekopf, and F. T. Chong, “Sapper: A language for hardware-level security policy enforcement,” in Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, 2014, pp. 97–112.
[17] D. Zhang, Y. Wang, G. E. Suh, and A. C. Myers, “A hardware design language for timing-sensitive information-flow security,” ACM Sigplan Notices, vol. 50, no. 4, pp. 503–516, 2015.
[18] T. Trippel, K. G. Shin, A. Chernyakhovsky, G. Kelly, D. Rizzo, and M. Hicks, “Fuzzing hardware like software,” in 31st USENIX Security Symposium (USENIX Security 22), 2022, pp. 3237–3254.
[19] R. Kande, A. Crump, G. Persyn, P. Jauernig, A.-R. Sadeghi, A. Tyagi, and J. Rajendran, “TheHuzz: Instruction fuzzing of processors using Golden-Reference models for finding Software-Exploitable vulnerabilities,” in 31st USENIX Security Symposium (USENIX Security 22), 2022, pp. 3219–3236.
[20] C. Chen, R. Kande, N. Nguyen, F. Andersen, A. Tyagi, A.-R. Sadeghi, and J. Rajendran, “HyPFuzz: Formal-Assisted processor fuzzing,” in 32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 1361–1378.
[21] M. Rostami, M. Chilese, S. Zeitouni, R. Kande, J. Rajendran, and A.-R. Sadeghi, “Beyond random inputs: A novel ml-based hardware fuzzing,” in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2024, pp. 1–6.
[22] M. Rostami, S. Zeitouni, R. Kande, C. Chen, P. Mahmoody, J. Rajendran, and A.-R. Sadeghi, “Lost and found in speculation: Hybrid speculative vulnerability detection,” in Proceedings of the 61st ACM/IEEE Design Automation Conference, 2024, pp. 1–6.
[23] M. Rostami, C. Chen, R. Kande, H. Li, J. Rajendran, and A.-R. Sadeghi, “Fuzzerfly effect: Hardware fuzzing for memory safety,” IEEE Security and Privacy, vol. 22, no. 4, pp. 76–86, 2024.
[24] F. Solt, K. Ceesay-Seitz, and K. Razavi, “Cascade: Cpu fuzzing via intricate program generation,” in Proc. 33rd USENIX Secur. Symp, 2024, pp. 1–18.
[25] J. Hur, S. Song, D. Kwon, E. Baek, J. Kim, and B. Lee, “Difuzzrtl: Differential fuzz testing to find cpu bugs,” in 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 2021, pp. 1286–1303.
[26] J. Xu, Y. Liu, S. He, H. Lin, Y. Zhou, and C. Wang, “MorFuzz: Fuzzing processor via runtime instruction morphing enhanced synchronizable co-simulation,” in 32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 1307–1324.
[27] P. Borkar, C. Chen, M. Rostami, N. Singh, R. Kande, A.-R. Sadeghi, C. Rebeiro, and J. Rajendran, “Whisperfuzz: White-box fuzzing for detecting and locating timing vulnerabilities in processors,” 2024. [Online]. Available: https://arxiv.org/abs/2402.03704


[28] S. Canakci, C. Rajapaksha, L. Delshadtehrani, A. Nataraja, M. B. Taylor, M. Egele, and A. Joshi, “Processorfuzz: Processor fuzzing with control and status registers guidance,” in 2023 IEEE International Symposium on Hardware Oriented Security and Trust (HOST). IEEE, 2023, pp. 1–12.
[29] A. et al., “The Rocket Chip Generator,” no. UCB/EECS-2016-17, 2016. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-17.html
[30] J. Zhao, B. Korpan, A. Gonzalez, and K. Asanovic, “SonicBOOM: The 3rd Generation Berkeley Out-of-Order Machine,” 4th Workshop on Computer Architecture Research with RISC-V, 2020.
[31] F. Zaruba and L. Benini, “The cost of application-class processing: Energy and performance analysis of a linux-ready 1.7-ghz 64-bit risc-v core in 22-nm fdsoi technology,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 11, pp. 2629–2640, 2019.
[32] Google, “American fuzzy lop,” https://github.com/google/AFL, 2019.
[33] M. Ammann, L. Hirschi, and S. Kremer, “Dy fuzzing: formal dolev-yao models meet cryptographic protocol fuzz testing,” in 2024 IEEE Symposium on Security and Privacy (SP). IEEE, 2024, pp. 1481–1499.
[34] Y. Chen, T. Lan, and G. Venkataramani, “Exploring effective fuzzing strategies to analyze communication protocols,” in Proceedings of the 3rd ACM Workshop on Forming an Ecosystem Around Software Transformation, 2019, pp. 17–23.
[35] R. Saravanan and S. M. Pudukotai Dinakarrao, “The fuzz odyssey: A survey on hardware fuzzing frameworks for hardware design verification,” in Proceedings of the Great Lakes Symposium on VLSI 2024, 2024, pp. 192–197.
[36] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[38] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding with unsupervised learning,” 2018.
[39] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.
[40] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[41] Y. Meng, M. Xia, and D. Chen, “Simpo: Simple preference optimization with a reference-free reward,” arXiv preprint arXiv:2405.14734, 2024.
[42] I. Shumailov, Z. Shumaylov, Y. Zhao, N. Papernot, R. Anderson, and Y. Gal, “Ai models collapse when trained on recursively generated data,” Nature, vol. 631, no. 8022, pp. 755–759, 2024.
[43] W. Fedus, P. Ramachandran, R. Agarwal, Y. Bengio, H. Larochelle, M. Rowland, and W. Dabney, “Revisiting fundamentals of experience replay,” in International conference on machine learning. PMLR, 2020, pp. 3061–3071.
[44] Synopsys, “Vcs, the industry’s highest performance simulation solution,” https://www.synopsys.com/verification/simulation/vcs.html.
[45] RISC-V, “Spike risc-v isa simulator,” https://github.com/riscv-software-src/riscv-isa-sim.
[46] S. Pinto and M. Breskvar, A Novel Trusted Execution Environment for Next-Generation RISC-V MCUs. Embedded World Conference, 2024.
[47] CVA6, “Cva6 fix,” https://github.com/openhwgroup/cva6/pull/2685.
[48] CVA6, “Cva6 fix,” https://github.com/openhwgroup/cva6/pull/2064.
[49] RISC-V, “The risc-v instruction set manual volume i: Unprivileged isa,” https://github.com/riscv/riscv-isa-manual/releases/tag/riscv-isa-release-8b9dc50-2024-08-30, 2024.
[50] RISC-V, “The risc-v instruction set manual volume ii: Privileged isa,” https://github.com/riscv/riscv-isa-manual/releases/tag/riscv-isa-release-8b9dc50-2024-08-30, 2024.
[51] openhwgroup, “Endianness control in mstatus and mstatush registers,” https://docs.openhwgroup.org/projects/cva6-user-manual/04 cv32a65x/riscv/priv.html.
[52] CROSSCON, “CROSSCON: Cross-platform open security stack for connected devices (horizon-cl3-2021-cs-01-02),” https://crosscon.eu/.
[53] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020.
[54] K. Laeufer, J. Koenig, D. Kim, J. Bachrach, and K. Sen, “Rfuzz: Coverage-directed fuzz testing of rtl on fpgas,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2018, pp. 1–8.
[55] T. Li, H. Zou, D. Luo, and W. Qu, “Symbolic simulation enhanced coverage-directed fuzz testing of rtl design,” in 2021 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2021, pp. 1–5.
[56] Verilator, “Welcome to verilator,” https://www.veripool.org/verilator/, 2019.
[57] M. M. Hossain, A. Vafaei, K. Z. Azar, F. Rahman, F. Farahmandi, and M. Tehranipoor, “Socfuzzer: Soc vulnerability detection using cost function enabled fuzz testing,” in 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2023, pp. 1–6.
[58] “The menlo report: Ethical principles guiding information and communication technology research,” https://www.dhs.gov/sites/default/files/publications/CSD-MenloPrinciplesCORE-20120803 1.pdf, 2012.