DejaVuzz: Disclosing Transient Execution Bugs with Dynamic Swappable Memory and Differential Information Flow Tracking Assisted Processor Fuzzing
Jinyan Xu Yangye Zhou Xingzhi Zhang
Zhejiang University Zhejiang University Zhejiang University
Hangzhou, Zhejiang, China Hangzhou, Zhejiang, China Hangzhou, Zhejiang, China
phantom@zju.edu.cn zhouyangye@zju.edu.cn xingzhizhang@zju.edu.cn
Yinshuai Li Qinhan Tan Yinqian Zhang∗
Southern University of Science and Technology Princeton University Southern University of Science and Technology
Shenzhen, Guangdong, China Princeton, New Jersey, USA Shenzhen, Guangdong, China
liys2022@mail.sustech.edu.cn qinhant@princeton.edu yinqianz@acm.org
Yajin Zhou Rui Chang Wenbo Shen
Zhejiang University Zhejiang University Zhejiang University
Hangzhou, Zhejiang, China Hangzhou, Zhejiang, China Hangzhou, Zhejiang, China
yajin_zhou@zju.edu.cn crix1021@zju.edu.cn shenwenbo@zju.edu.cn
Abstract
Transient execution vulnerabilities have emerged as a critical threat to modern processors. Hardware fuzzing techniques have recently shown promising results in discovering transient execution bugs in large-scale out-of-order processor designs. However, their poor microarchitectural controllability and observability prevent them from effectively and efficiently detecting transient execution vulnerabilities.

This paper proposes DejaVuzz, a novel pre-silicon stage processor transient execution bug fuzzer. DejaVuzz utilizes two innovative operating primitives, dynamic swappable memory and differential information flow tracking, enabling more effective and efficient transient execution vulnerability detection. The dynamic swappable memory enables the isolation of different instruction streams within the same address space. Leveraging this capability, DejaVuzz generates targeted training for arbitrary transient windows and eliminates ineffective training, enabling efficient triggering of diverse transient windows. The differential information flow tracking aids in observing the propagation of sensitive data across the microarchitecture. Based on taints, DejaVuzz designs the taint coverage matrix to guide mutation and uses taint liveness annotations to identify exploitable leakages. Our evaluation shows that DejaVuzz outperforms the state-of-the-art fuzzer SpecDoctor, triggering more comprehensive transient windows with lower training overhead and achieving a 4.7× coverage improvement. DejaVuzz also mitigates control flow over-tainting with acceptable overhead and identifies 5 previously undiscovered transient execution vulnerabilities (with 6 CVEs assigned) on BOOM and XiangShan.

CCS Concepts: • Security and privacy → Side-channel analysis and countermeasures.

Keywords: transient execution bug; hardware fuzzing; processor fuzzing; information flow tracking

∗Corresponding author.
ACM Reference Format:
Jinyan Xu, Yangye Zhou, Xingzhi Zhang, Yinshuai Li, Qinhan Tan, Yinqian Zhang, Yajin Zhou, Rui Chang, and Wenbo Shen. 2025. DejaVuzz: Disclosing Transient Execution Bugs with Dynamic Swappable Memory and Differential Information Flow Tracking Assisted Processor Fuzzing. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS ’25), March 30-April 3, 2025, Rotterdam, Netherlands. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3676642.3736115

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASPLOS ’25, Rotterdam, Netherlands
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-1080-3/2025/03
https://doi.org/10.1145/3676642.3736115

1 Introduction
The recent discovery of transient execution vulnerabilities has unveiled a significant threat to modern processors. These
ASPLOS ’25, March 30-April 3, 2025, Rotterdam, Netherlands Jinyan Xu et al.
vulnerabilities, such as Spectre [21] and Meltdown [25], exploit speculative execution, a key performance optimization, to leak sensitive data through side channels. The ongoing battle between attackers and defenders resembles a continuous cat-and-mouse game. For example, Spectre-V2 [21] prompted the deployment of privilege-isolated branch prediction, but follow-up research soon discovered bugs [3, 44, 50] in other speculation components. Similarly, after Foreshadow [45] was patched, Microarchitectural Data Sampling (MDS) [4, 46] attacks emerged. This arms race not only challenges the effectiveness of existing defense mechanisms but also underscores the need for a proactive approach to automated transient execution bug detection.

Some efforts [27, 28, 51] have been applied to commodity processors. However, due to the black-box nature of off-the-shelf processors, these approaches rely heavily on template-based generation and fixed side channels, which makes it difficult for them to uncover new vulnerabilities. In contrast, detection approaches at the pre-silicon stage have yet to be extensively studied. Detecting these vulnerabilities during the Register Transfer Level (RTL) development phase is crucial, as hardware bugs are usually difficult to fix once the design is manufactured. Early detection allows timely remediation and prevents these bugs from reaching production hardware. Therefore, proactive testing and verification at the pre-silicon stage are imperative for ensuring processor microarchitecture security.

Formal verification and fuzzing are the methods most commonly used to detect transient execution bugs in processor RTL. Although formal approaches can prove security properties exhaustively, they are limited by the state explosion problem, so existing methods [9, 39, 43, 55] address scalability by modeling processor transient execution behavior at a higher level of abstraction. However, given the complexity of out-of-order processor designs, the microarchitectural implementation details ignored by the model are highly error-prone [18, 53]. Furthermore, the in-depth design knowledge and heavy manual effort required for hardware modeling and security property definition also impede the application of formal methods to complex designs.

Recently, processor fuzzing has demonstrated promising results in verifying large-scale complex processor designs [5, 19, 20, 36, 53], and researchers have also begun applying fuzzing to detect transient execution vulnerabilities [11, 12, 18]. IntroSpectre [12] and TEESec [11] use gadget templates to trigger Meltdown-type transient execution vulnerabilities and identify leakage by searching for secret values in the microarchitecture logs. SpecDoctor [18], on the other hand, employs a multi-phase random instruction generation process and utilizes differential testing to detect sensitive data leakage. However, due to the complexity of transient execution vulnerabilities, current fuzzing methods are either too limited [11, 12], only capable of identifying specific leakage patterns, or too inefficient [18], taking days to complete the verification, thereby limiting their practical adoption. To effectively and efficiently fuzz transient execution bugs, the following two challenges need to be addressed.

First, only transiently executed instructions are effective fuzzing payloads, so the fuzzer needs to efficiently trigger diverse transient windows. However, triggering these transient windows requires deliberate microarchitecture training. Due to significant differences in training patterns among microarchitecture components, existing approaches generate limited transient windows with high training overhead (§6.2). The inability to generate varied transient windows means the microarchitecture cannot be fully explored. Additionally, ineffective training instructions waste simulation time, increasing training overhead and reducing fuzzing throughput.

Second, the fuzzer needs to perceive the propagation of sensitive data during transient execution to guide mutation and detect leakages. Information flow tracking is a promising solution, but it suffers from the control flow over-tainting problem in complex designs [37]. Lacking effective methods to trace sensitive data, existing fuzzers cannot measure coverage or identify exploitable leakages (§6.3). Without coverage metrics, the quality of stimuli cannot be assessed, leading to inefficient input mutation. Passing unexploitable leakages to subsequent stages not only produces false positives but also renders later phases futile, further misguiding the fuzzing process.

To address these challenges, we propose DejaVuzz, an effective and efficient pre-silicon processor fuzzer for transient execution vulnerabilities, powered by two novel operating primitives: dynamic swappable memory and differential information flow tracking. Dynamic swappable memory serves as an isolation primitive, responsible for transparently switching instruction sequences to steer the microarchitecture into the desired transient execution behaviors. This primitive resolves conflicts between instruction sequences by time-sharing the address space. To increase the diversity of triggered transient windows, DejaVuzz isolates training and transient instruction sequences to generate arbitrary transient windows, and uses the training derivation strategy to derive targeted training from transient execution information. To reduce the training overhead, DejaVuzz isolates each training instruction sequence to explore different training effects, and eliminates ineffective training through the training reduction strategy. Differential information flow tracking acts as the tracing primitive responsible for observing microarchitectural state changes caused by sensitive data. This primitive mitigates the control flow over-tainting problem by comparing whether different secrets can produce different selections on the same control signal. With the help of taints, DejaVuzz designs a taint coverage matrix to evaluate how sensitive data propagates during transient execution, effectively guiding exploration. Furthermore, DejaVuzz introduces taint liveness annotations
DejaVuzz: Disclosing Transient Execution Bugs with Enhanced Processor Fuzzing ASPLOS ’25, March 30-April 3, 2025, Rotterdam, Netherlands
to bind state registers to related taint registers. By using annotated state registers as liveness signals, DejaVuzz filters out unexploitable taints to reduce false positives.

Overall, this paper makes the following contributions:
• We summarize the challenges of transient execution bug fuzzing in terms of microarchitectural controllability and microarchitectural observability and propose two novel operating primitives: a dynamic swappable memory model to resolve address space conflicts for better microarchitectural control, and a differential information flow tracking technique to mitigate control flow over-tainting for improved microarchitectural observation.
• Utilizing these two operating primitives, we develop a new processor fuzzing framework named DejaVuzz, which can effectively and efficiently detect transient execution bugs. DejaVuzz designs training derivation and training reduction strategies atop dynamic swappable memory to efficiently trigger diverse transient windows, and utilizes taints generated by differential information flow tracking to guide fuzzing and identify leakage.
• We evaluate DejaVuzz on two well-known RISC-V out-of-order processors [54, 57]. Compared to the SOTA fuzzer SpecDoctor [18], DejaVuzz achieves a 4.7× improvement in coverage with more comprehensive transient windows and lower training overhead. DejaVuzz mitigates control flow over-tainting with acceptable overhead and identifies 5 previously unknown transient execution vulnerabilities, all of which are assigned CVE numbers.

To facilitate the community and future research, we publish the source code and experiments of DejaVuzz at https://github.com/sycuricon/DejaVuzz.

2 Background
2.1 Transient Execution Vulnerabilities
As shown in Figure 1, the process of exploiting a transient execution bug can be divided into the following 4 attack steps: ① training the target microarchitecture, ② triggering a transient window through the trained state, ③ accessing sensitive data and encoding it into a side channel, and ④ subsequently decoding the secret from the side channel.

However, different types of transient windows exhibit highly varied training patterns. For Spectre-V1, the training section (blue stripe) and the transient execution section (yellow stripe) are independent, which means these two sections can be generated independently as long as the branch instructions have the same address offset. However, this is not always true for other transient execution bugs such as Spectre-V2 and Spectre-RSB [22, 26]. The Spectre-V2 attack requires different arguments (a0) to switch between training and exploiting the Branch Target Buffer (BTB) with the same code. The Spectre-RSB attack tricks the processor into speculatively returning to a corrupted address by training the Return Stack Buffer (RSB). As seen in the last two types, complex transient windows are mixed with the training section. Triggering such complex transient windows is challenging, as the stimulus generator must carefully handle the semantics of training and transient execution to ensure the window section is executed transiently as expected. Otherwise, non-speculative execution of the window section during training could lead to false positives.

Figure 1. Training and transient execution sections of Spectre-V1, Spectre-V2 and Spectre-RSB. The secret decoding step ④ is omitted.
  Spectre-V1:
    training: ①
        bne  a0, a0, foo
    transient:
        beq  a0, a0, bar      ②
    window: ③
        la   t0, secret
        ld   s0, 0(t0)
        la   t0, leak
        add  t0, t0, s0
        ld   t0, 0(t0)
  Spectre-V2:
    training: ①
    transient:                # with diff a0
        jalr a1, 0(a0)        ②
    window: ③
        la   t0, secret
        ld   s0, 0(t0)
        la   t0, leak
        add  t0, t0, s0
        ld   t0, 0(t0)
  Spectre-RSB:
    training: ①
        call transient
    window: ③
        la   t0, secret
        ld   s0, 0(t0)
        la   t0, leak
        add  t0, t0, s0
        ld   t0, 0(t0)
    transient:
        add  ra, s0, a0       ②
        ret

2.2 Hardware Dynamic Information Flow Tracking
Information Flow Tracking (IFT) has been widely deployed at all levels of hardware abstraction to understand how information flows through a system [15, 24, 41, 48]. Hardware dynamic IFT, known as taint tracking, can dynamically verify information flow properties at runtime. This is achieved by marking sensitive state elements with taints at the circuit level and propagating the taints based on the operations on sensitive data. There are three instrumentation levels for the hardware dynamic IFT mechanism: gate level [42], RTL level [2], and cell level [37]. Figure 2 shows how hardware dynamic IFT is implemented in hardware. The dynamic IFT instrumentation generates a shadow circuit based on the original circuit: all registers in the original circuit are copied to store taints, and the combinational logic gates are replaced with the corresponding taint propagation policy implementation. The taint propagation policies are a set of rules responsible for tainting outputs that are affected by tainted inputs. Policies 1 and 2 are the state-of-the-art taint propagation policies [2, 37] for the AND and MUX cells, respectively. By using shadow circuits, dynamic IFT provides the ability to observe the information flow of the design without affecting the original functionality.

$$O_t^{AND} = (A \,\&\, B_t) \mid (B \,\&\, A_t) \mid (A_t \,\&\, B_t) \tag{1}$$

$$O_t^{MUX} = (S\ ?\ B_t : A_t) \mid \underline{(S_t\ ?\ (A \oplus B) \mid (A_t \mid B_t) : 0)} \tag{2}$$

Taints generated by the direct computation of input taints and signals, as in Policy 1, are referred to as data taints. In Policy 2, in addition to selecting data taints via the selection signal S, the underlined component produces control taints due to the conditional selection semantics of the multiplexer. Unlike data taints, which are only impacted by the actually executed code, control taints also consider changes
occurring on unselected branches (i.e., the $A \oplus B$ term). Thus, once taints propagate to the control flow, they can easily lead to over-tainting [35, 37]. Since taint propagation policies only generate taints without eliminating them, more registers become tainted as the circuit executes, making it increasingly difficult to identify target information flows precisely.

According to our evaluation (§6.3), the state-of-the-art hardware dynamic IFT mechanism CellIFT [37] suffers from the control flow over-tainting problem. Next, we use the Reorder Buffer (RoB) module of BOOM [57] in Figure 2 as an example to explain how the taint explosion occurs during a RoB rollback. The third RoB entry updates its opcode field register rob_3_uopc with the new opcode enq_uopc when a valid micro-operation is enqueued (enq_valid is high) and the tail pointer points to the third entry (rob_tail_idx is equal to 3). Before the RoB rollback, instructions using tainted sensitive data as operands in step ③ write back and taint the RoB state register. When the RoB rolls back, the movement of the tail pointer causes rob_tail_idx to be tainted. Since the frontend also uses the RoB index to maintain state, enq_valid is tainted. According to Policy 1, both inputs of the AND cell are tainted (the comparison result of the Equal cell is also tainted due to the tainted rob_tail_idx), causing the MUX selection signal to be marked as tainted. Furthermore, based on Policy 2, the register rob_3_uopc is also marked as tainted because the two MUX data inputs differ. All 736 RoB entry field registers have similar update logic. Therefore, they are all suddenly tainted when the RoB rolls back.

Figure 2. Hardware dynamic information flow tracking instrumentation. (a) is the example circuit from the BOOM RoB module; (b) is the corresponding IFT shadow circuit. In (a), an Equal cell compares rob_tail_idx with 3 (match_rob3), an AND cell combines match_rob3 with enq_valid (update_rob3), and update_rob3 drives the MUX that selects enq_uopc or the current rob_3_uopc as the next value of register rob_3_uopc.

2.3 Processor Fuzzing for Transient Execution Bugs
Processor fuzzing has been employed to detect various bugs, including functional bugs [19, 20, 36, 53], transient execution bugs [11, 12, 18], and side channel bugs [33]. Although these bugs are characterized differently, existing fuzzers generally follow a similar workflow consisting of three main steps.

First, the input generator generates instruction sequences as stimuli either based on constraints [36, 53] or through random generation [19, 20, 33]. As discussed in §2.1, a transient execution attack involves multiple steps. Thus, existing fuzzers strategically divide the generation into multiple phases. For instance, IntroSpectre and TEESec insert helper gadgets before the main gadget to satisfy the required memory access paths in the software execution model. SpecDoctor sequentially progresses through the transient-trigger, secret-transmit, and secret-receive phases to generate a complete stimulus. During each phase, additional instructions are randomly appended to those generated in the previous phase until specific goals are met. The goals of these phases are to trigger a RoB rollback, generate microarchitectural differences, and cause timing differences, respectively.

Second, the fuzzer uses an RTL simulator to convert the Design Under Test (DUT) into a software model and then uses the model to execute the generated instruction sequences. During simulation, the fuzzer leverages instrumentation to measure coverage and guide mutations. Existing fuzzers define several coverage metrics to reflect general processor behavior, such as mux toggle coverage [23], control register coverage [19, 53], and hardware behavior coverage [20]. However, transient execution bug fuzzers focus solely on microarchitectural state changes caused by sensitive data. Therefore, existing general processor behavior coverage metrics are unsuitable for transient execution bugs, as they are unaware of the propagation of sensitive data.

Third, the fuzzer analyzes the microarchitecture to determine whether any bug exists. Unlike functional bugs, which can be detected using co-simulation [19, 53], transient execution vulnerabilities require detailed microarchitecture analysis. For example, IntroSpectre and TEESec dump the microarchitecture at each cycle and then assess whether leakage has occurred based on the presence of the secret values in the log. SpecDoctor observes execution behavior by hashing the final state of the timing components after transient execution and evaluates leakage by comparing the consistency of the hash values between different variants.

3 Operating Primitives
In this section, we first analyze the challenges of transient execution fuzzing based on the key capabilities required by a fuzzer and identify their root causes. Next, we present the design of two novel operating primitives and explain how they address the root causes. In §4, we address the challenges with designs built on these primitives.

3.1 Challenges and Root Causes
The task of a transient execution bug fuzzer is to generate instruction sequences that trigger transient windows and encode secrets into the microarchitecture, and then determine whether the encoded states can leak the secrets. To achieve this, a competent fuzzer must possess two key capabilities. First, it must effectively train the microarchitecture to trigger
diverse transient windows, since we are only interested in transiently executed behaviors. Second, it must accurately track the propagation of sensitive data, as we only focus on microarchitectural state changes caused by secrets. Based on this observation, we define these two capabilities as microarchitectural controllability and observability, respectively.

Figure 3. Assuming the branch instruction at 0x1010 can trigger transient windows at different addresses by using different branch targets L?, only transient windows that do not conflict with training instructions can be exploited. (W1: conflict with the branch training at 0x0010; W2: conflict with the trigger condition; W3: RoB-squashed instructions replaced with secret-encoding instructions.)

Microarchitectural Controllability [8, 30, 49] refers to the ability of a fuzzer to efficiently manipulate the microarchitecture to trigger desired transient execution behaviors. Existing fuzzers generate transient windows using template-based [11, 12] or random-based [18] methods. While they can successfully trigger transient windows, they fail to address the following two challenges.

C1-1. Limited Window Types. Template-based methods are limited to specific transient window templates, while random-based methods also fail to generate arbitrary transient windows. As shown by W3 in Figure 3, SpecDoctor randomly generates training instructions and replaces the RoB-squashed instructions with secret-encoding instructions to exploit them. However, when the RoB-squashed instructions are mixed with the training instructions (i.e., complex transient windows in §2.1), replacing them may invalidate transient execution. For example, replacing branch training can prevent the predictor from reaching the desired prediction state (W1), while replacing the assignment to the condition comparison register x21 could change the branch outcome (W2). For this reason, SpecDoctor discards all transient windows containing backward jumps. As a result, existing fuzzers are limited to exploring only a restricted subset of transient window types.

C1-2. Inefficient Training. Making the fuzzer recognize the microarchitectural state changes caused by randomly generated instructions and subsequently exploit them is exceptionally challenging. IntroSpectre and TEESec use a manual software execution model to assist in setting up the required microarchitecture but cannot train states beyond the model. SpecDoctor also has difficulty assembling matched training-exploitation instruction pairs because meaningless random training instructions often occupy the required addresses. Unutilized training instructions not only reduce the fuzzing throughput but also diminish the training effectiveness due to potential conflicts.

The root cause of the above challenges is the address space conflict. Since the fuzzer cannot predict training effectiveness or transient window locations, instructions can hardly be placed at the desired addresses. For example, training instructions may occupy addresses needed for transient windows, and different training instructions cannot be tested at the same address. This makes it difficult for existing fuzzers to arrange instructions linearly to trigger the desired transient execution behaviors.

Microarchitectural Observability [14, 29, 56] concerns the ability of a fuzzer to monitor and measure the effects of sensitive data on the microarchitecture. Despite having complete access to the processor’s internal states, existing fuzzers fail to track how sensitive data propagates through the microarchitecture, leading to two challenges.

C2-1. Feedback Gap. Prior work ignores the coverage matrix and thus fails to provide feedback for input mutation, leading to blind and random input mutation. This problem is caused by the inability to track how sensitive data propagates. IntroSpectre and TEESec cannot capture secrets after arithmetic operations because they rely on value matching. SpecDoctor only computes the hash of the final state, and compressing the execution process into a hash prevents capturing the different propagation paths during execution. The missing coverage matrix leaves a gap between input mutation and execution, making it difficult for the fuzzer to explore all possible transient behaviors efficiently.

C2-2. Imprecise Oracle. Caches and buffers are extensively used in processor microarchitecture to improve performance and typically include state registers that indicate the validity of the current data. For example, the Line Fill Buffer (LFB) in BOOM is managed by the Miss Status Holding Register (MSHR). Once the cache line refill is completed, the MSHR switches its state register to invalid to indicate that the data in the LFB is outdated, instead of clearing the LFB. Existing work has incorrectly considered this scenario vulnerable, as IntroSpectre and TEESec would match the sensitive data remaining in the LFB. It would also cause SpecDoctor to generate different hashes. Due to these imprecise oracles, existing fuzzers pass false positives to subsequent steps, resulting in meaningless execution.

The root cause of the above challenges is the lack of a mechanism to track state changes caused by sensitive data. Without the ability to observe the information flow of sensitive data, existing fuzzers are unable to measure coverage based on the distribution of encoded sensitive data or query state registers to identify exploitable leakages.

3.2 Dynamic Swappable Memory
Instead of using scalability-limited templates to solve the address space conflict, the core insight of DejaVuzz is that
address space can be time-shared by different semantics. Figure 4 shows how scheduling instruction sequences within the same address space enables triggering complex transient windows that could not be generated in Figure 3. During simulation, we first load the training instruction sequence (1) or (2) into memory to train the predictor. After training, we flush the memory and load the transient instruction sequence (3) to trigger the backward transient window. For the training instruction sequence, since the full address space is available, we do not need to use aliasing addresses such as 0x0010 to train the predictor. Instead, we can directly place a branch training instruction at 0x1010. Additionally, we can explore different training effects, such as using sequence (1) to train the prediction as not taken or sequence (2) to train it as taken. For the transient instruction sequence, since training instructions are not in the current sequence, W1-type conflicts are avoided, and conflicting register assignments can be moved to other available addresses (e.g., 0x0) to resolve W2-type conflicts. As shown in sequence (3), after setting up the registers, the DUT can directly jump to 0x1010 to trigger the transient window without any conflicts. Besides generating arbitrary transient windows, we can also identify effective training by trying different training instruction sequences. For example, by trying the combinations (1)(3) and (2)(3), we can find that only (2) contributes to triggering the transient window. Thus, switching instruction sequences on demand at different stages effectively resolves address space conflicts, allowing the fuzzer to effectively control the microarchitecture to trigger desired transient execution behaviors.

However, implementing the above switching process with assembly instructions can pollute memory-related training states. To address this, we propose the dynamic swappable memory (swapMem), enabling transparent instruction sequence switching. Since side channel bugs require multiple DUT instances with different secrets to detect behavioral differences, the swapMem is specifically designed for this scenario. As shown at the bottom of Figure 4, the swapMem consists of three regions. The shared region is shared across multiple DUT instances and contains the essential execution environment, including state initialization, trap handling, and runtime instruction sequence scheduling. To facilitate modifying secrets, each DUT has a dedicated region for storing sensitive data and mutable operands. The swappable region holds instruction sequences with different semantics. Each DUT can load the required instruction sequence into the swappable region at runtime according to the swap schedule. Typically, DejaVuzz first executes all training instruction sequences on the DUT, then updates sensitive data permissions, and finally executes the transient instruction sequence. Once a sequence is completed, an exception is triggered, and the trap handler flushes the instruction cache and loads the next sequence into the swappable region. After swapping in the new sequence, the DUT jumps to its entry and continues execution.

Figure 4. Using swapMem to trigger transient execution.
  Swap schedule (over cycles): training sequence (1) or (2), then transient sequence (3).
  trigger_train0: (1)
    0x1010  bne  x20, x20, L1
  trigger_train1: (2)
    0x1010  beq  x20, x20, L1
  transient: (3)
    0x0     # setup x20, x21
            j    0x1010
            # address gap
    0x1010  bne  x20, x21, win
  window:
            la   x25, secret
            lb   x24, 0(x25)
            add  x23, x22, x24
            ld   x25, 0(x23)
  Memory layout: Shared (firmware: trap handler, swap scheduler) | Dedicated (data: secret, mutable operand) | Swappable (swap packets, loaded at runtime)

The swapMem enhances microarchitectural controllability as the isolation primitive, resolving address space conflicts. In §4.1, we discuss how to design instruction sequence generation strategies based on swapMem to trigger diverse windows and optimize training overhead.

3.3 Differential Information Flow Tracking
DejaVuzz employs information flow tracking to identify state changes caused by secrets. However, as discussed in §2.2, the control flow over-tainting problem makes it impossible to identify the propagation of sensitive data. Thus, we propose differential information flow tracking (diffIFT) to mitigate the over-tainting problem.

When fuzzing transient execution vulnerabilities, we consider leakage to occur only if executing the same instruction sequence with different secrets results in different behaviors. However, Policy 2 considers arbitrary input differences rather than differences caused by secrets. Therefore, a core insight of DejaVuzz is that if no secret can influence the value of a control signal, then even if the signal is tainted, it should be ignored, as it cannot select another path. However, it is extremely expensive to precisely compute all potential values of each control signal in an out-of-order processor for all input secrets at each cycle [16]. Inspired by multi-variant execution [7, 31, 34], DejaVuzz approximates the solution with concrete values from multiple variants. Specifically, DejaVuzz creates a differential testing testbench that determines whether sensitive data can produce different values on a control signal by executing the same instructions on two identical DUTs with different secrets.

Table 1 lists the updated control taint propagation rules for all supported control flow cells. The overall policies are similar to CellIFT, except that control taints only propagate when the cross-instance comparison signals are high. The highlighted signals with the diff subscript represent cross-instance comparison signals. Take the multiplexer as an example: when diffIFT encounters a multiplexer whose selection signal S is tainted, diffIFT checks whether the selection signals are consistent between the variants (i.e., $S_{diff} = S_{DUT_1} \oplus S_{DUT_2}$). If there is a difference, it indicates that
Table 1. The control taint propagation policies of diffIFT. can be exploited. Finally, DejaVuzz reports test cases that violate transient window constant time execution or contain Cell Type Propagation Policy exploitable taints as potential bugs. Multiplexer (𝑆 ?𝐵𝑡 :𝐴𝑡 )|(𝑆𝑡 &𝑆diff ? (𝐴ˆ𝐵)|(𝐴𝑡 |𝐵𝑡 ) :0) Comparison Cell 𝑂diff & |(𝐴𝑡 |𝐵𝑡 ) 4.1 Phase 1: Transient Window Triggering Register with En (𝐸𝑛?𝐷𝑡 :𝑄𝑡 )|(𝐸𝑛𝑡 &𝐸𝑛diff ? (𝐷ˆ𝑄)|(𝐷𝑡 |𝑄𝑡 ) :0) Memory Read 𝑚𝑒𝑚𝑡 [𝑎𝑑𝑑𝑟 ]|{WIDTH {𝑎𝑑𝑑𝑟 𝑡 & 𝑎𝑑𝑑𝑟diff }} Phase 1 focuses on triggering diverse transient windows Memory Write (Wen?Wdata𝑡 :𝑚𝑒𝑚𝑡 [𝑎𝑑𝑑𝑟 ])|{WIDTH {( with minimal overhead. For challenge C1-1, DejaVuzz uses Wen𝑡 &Wendiff )|(𝑎𝑑𝑑𝑟 𝑡 & 𝑎𝑑𝑑𝑟diff &Wen)}} swapMem to isolate transient execution from training to gen- erate arbitrary transient windows, and employs the training sensitive data can generate different selections, and diffIFT, derivation strategy (§4.1.1) to generate targeted training. For therefore, performs control taint propagation. Otherwise, challenge C1-2, DejaVuzz further isolates each training and diffIFT only considers data taint propagation. We instrument applies the training reduction strategy (§4.1.2) to identify the DUT at the RTL IR level and thus support word-level and eliminate ineffective training. cells and non-flattened memories. Additionally, the data taint 4.1.1 Step 1.1: Trigger Generation. While swapMem re- propagation policies for data flow cells in diffIFT are consis- solves address space conflicts, allowing DejaVuzz to gen- tent with CellIFT. erate arbitrary transient windows, effective training is still It is worth noting that diffIFT is an underapproximation of required to trigger them. To train the required microarchitec- information flow since it uses concrete values. If a secret pair ture components for triggering transient windows, DejaVuzz happens to produce the same value on a secret dependent employs the training derivation strategy. 
It first randomly control signal, a false negative will occur. When this hap- generates a transient window and then derives targeted train- pens, data taints still propagate accurately, but control taints ing based on the expected transient window. are suppressed due to identical control signals. Therefore, Trigger Instruction Generation. In this step, DejaVuzz DejaVuzz generates secrets for the variant DUT by flipping only generates the trigger section of the transient packet each bit of the original secret to avoid using identical values. ( 1 ). The transient packet refers to the instruction sequence Besides, by leveraging the dedicated region in swapMem, that triggers a transient window and transiently accesses and DejaVuzz can directly load different secret pairs to mitigate encodes sensitive data (i.e., transient instruction sequence false negatives without regenerating the input. (3) in Figure 4). DejaVuzz first randomly generates trigger The diffIFT serves as the tracing primitive to enhance instructions based on the trigger type encoded in the seed. microarchitectural observability. With the help of taints, De- The trigger instructions supported by DejaVuzz cover the jaVuzz is able to observe sensitive data and its derived values entire basic instruction set, including sequential execution across the microarchitecture. In §4.2 and §4.3, we will explain instructions (e.g., integer or floating-point arithmetic opera- how to use taint to compute coverage and identify leakages. tions, valid memory accesses), control transfer instructions 4 The DejaVuzz Framework (e.g., branches, indirect jumps, and returns), and instructions that may trigger architectural exceptions (e.g., illegal instruc- In this section, we demonstrate how DejaVuzz builds on op- tions, memory access violations). 
In the example shown in erating primitives to address the challenges in §3.1, enabling Figure 5, suppose DejaVuzz plans to trigger a transient win- effective and efficient transient execution bug fuzzing. dow caused by a return address misprediction. Next, De- Overview. As shown in Figure 5, the workflow of DejaVuzz jaVuzz generates a dummy transient window filled with nop consists of three phases. The first two phases focus on trig- instructions ( 2 ). For sequential execution instructions and gering and exploring transient execution, while the final exceptions, the transient window is placed immediately af- phase is responsible for detecting leakage. DejaVuzz lever- ter the trigger instruction by default. For control transfer ages swapMem to isolate different instruction sequences instructions, DejaVuzz randomly selects whether to place within the same address space. In Phase 1, DejaVuzz derives the transient window after the trigger instruction. Finally, targeted training for diverse transient windows and evalu- DejaVuzz uses an ISA simulator to compute the operands ates each training to eliminate ineffective training. In Phase required to trigger the transient window and generate the re- 2, DejaVuzz completes the transient window and attempts lated register initialization instructions. Therefore, DejaVuzz to encode sensitive data into the microarchitecture. During covers transient windows triggered by all instruction types, simulation, DejaVuzz uses diffIFT to track sensitive data effectively enhancing transient window diversity. propagation and collects taint as coverage to guide explo- Trigger Training Derivation. DejaVuzz uses the transient ration. In Phase 3, DejaVuzz first checks transient window execution information in transient packets to randomly gen- constant time execution violations. If no timing differences erate multiple trigger training packets ( 3 ). 
The trigger train- are detected, it further uses taint liveness annotations to ing packet refers to the instruction sequence used for train- check whether secrets encoded into the microarchitecture ing microarchitecture to trigger the transient window (i.e.,
70
ASPLOS ’25, March 30-April 3, 2025, Rotterdam, Netherlands Jinyan Xu et al.
S1.1 Trigger Generation S2.1 Window Completion
Seed transient: transient: trigger_train_0: transient: trigger_train_0:
Corpus +0x0 trigger: ❶ trigger: nop trigger: nop add ra, s0, a0 +0x4 ret window: ❺ call swap_done window: call swap_done ld s0, 0(t0) i la t0, secret window: ❷ la t0, secret ld s0, 0(t0) +0x8. nop la t0, leak window_train: nop window_train: Bin. .. nop add t0, t0, s0 ii la a0, secret ❻ nop ❼ la a0, secret nop ld t0, 0(t0) ld a1, 0(a0) nop ld a1, 0(a0)
RTL Sim. trigger_train_0:
+0x0 nop ❸ Feedback to Bin. Bin. S3.1 Constant Time Bin. Bin.
+0x4 call swap_done
... swap_done: Phase 1/2 secret0 secret1 Execution Analysis secret0 secret1
Trace Log # exit w/o ret Generation RTL Sim. Bug RTL Sim. ❹ trigger_train_1: w/ diffIFT Report w/ diffIFT add t0, t1, s2 S1.2 Trigger trigger_train_2: S2.2 Coverage S3.2 Tainted Sink Optimization sub t1, t0, s0 Measurement Liveness Analysis ... Trace Log Taint Log Sanitized Taint Log
Figure 5. DejaVuzz fuzzing workflow for finding transient execution vulnerabilities, taking Spectre-RSB for example.
training instruction sequences (1) and (2) in Figure 4). For the number of its committed instructions, it indicates that each trigger training packet, DejaVuzz first generates a ran- the transient window has been successfully triggered. dom training instruction, and then inserts nop instructions Training Reduction. Although trigger training packets to align it with the trigger instruction in the transient packet. are derived from the transient packet for targeted training, In the example, we generate three trigger training packets, not all training contributes to triggering the transient win- with the training instructions all placed at the same address dow. Fortunately, since each training instruction is isolated (i.e., 0x4) as the trigger instruction ret. Next, DejaVuzz fur- in its packet, DejaVuzz can identify ineffective packets by ther adjusts the control flow of the training instruction if removing one at a time and re-simulating the remaining the training instruction is a control transfer instruction. To packets to see if the transient window still triggers ( 4 ). If be specific, DejaVuzz adjusts the control flow of the train- removing a trigger training packet does not affect transient ing instruction to match the control flow of the generated window triggering, it will be permanently discarded from transient window, enhancing the training effectiveness for the swap schedule. Otherwise, the packet is necessary, and control flow prediction. For example, DejaVuzz avoids using DejaVuzz will keep it in the swap schedule. DejaVuzz eval- a ret instruction to exit the packet trigger_train_0, ensur- uates each trigger training packet in the order of the swap ing that the predicted return address in the RSB matches the schedule. This process repeats until only necessary trigger start address of the transient window (i.e., 0x8). By deriving training packets remain or none are available. 
It is obvious training from transient execution information, DejaVuzz not that integer arithmetic operations do not contribute to re- only generates diverse transient windows but also produces turn address prediction. Therefore, in the example, DejaVuzz targeted training, ensuring the fuzzer can more effectively finds that discarding trigger_train_1 and trigger_train_2 control the microarchitecture to trigger desired transient does not affect the triggering of the transient window, and execution behaviors. finally removes them. By discarding ineffective trigger train- ing packets, DejaVuzz is able to trigger transient windows 4.1.2 Step 1.2: Trigger Optimization. After generating with minimal training overhead. the trigger training packets, DejaVuzz evaluates which pack- ets are helpful in triggering transient windows. Leveraging 4.2 Phase 2: Transient Execution Exploration swapMem, DejaVuzz employs the training reduction strategy DejaVuzz explores which microarchitectures can be used that identifies and discards ineffective trigger training pack- to encode secrets during this phase. DejaVuzz uses taints ets without affecting transient window triggering, thereby as the coverage to guide the exploration (§4.2.2), effectively reducing training overhead. addressing challenge C2-1. Transient Execution Recognition. DejaVuzz packages all these packets together with a swap schedule, which defines 4.2.1 Step 2.1: Window Completion. DejaVuzz replaces their execution order. The schedule specifies that the trigger the dummy transient window with real payloads and gener- training packets are executed first, followed by the transient ates a complete test case. packet. After RTL simulation, DejaVuzz analyzes the RoB Transient Window Completion. DejaVuzz generates two port events recorded in the trace log. If the number of en- blocks in the window section ( 5 ): (i) the secret access block queued instructions within the transient window exceeds and (ii) the secret encoding block. 
In the secret access block,
besides fixed instructions to access sensitive data, it also randomly masks the high-order bits of the address to attempt to cover MDS-type bugs. In the secret encoding block, DejaVuzz randomly generates instructions that depend on secrets in order to propagate secrets across the microarchitecture.

Window Training Derivation. Similar to trigger training packets, DejaVuzz also derives window training packets for the secret access block (❻). A window training packet is the training instruction sequence used to train the memory-related states used by the transient window. In the example, DejaVuzz attempts to warm up sensitive data into the processor's internal buffers in advance, such as the data cache and the load buffer. The generated window training packets are scheduled before the trigger training packets in the swap schedule to avoid invalidating the transient window.

4.2.2 Step 2.2: Coverage Measurement. DejaVuzz performs RTL simulation using the diffIFT-instrumented DUTs and measures coverage from the taint log to guide subsequent stimulus generation.

Taint Coverage. DejaVuzz introduces the first secret-sensitive coverage matrix designed for transient execution vulnerability fuzzing. The taint coverage treats the total number of taints within a local range as an independent coverage point. Specifically, DejaVuzz inserts a new register array bitmap into each RTL module. During each clock cycle, DejaVuzz uses the number of tainted registers within the module as the index and writes 1 to the corresponding slot in the bitmap. After the transient execution, DejaVuzz checks the value of each slot in the bitmap. If a slot's value is 1, the corresponding number of taints has been explored within the module, and DejaVuzz records the index of that slot and its module name as a tuple. Finally, DejaVuzz evaluates input exploration based on the total number of collected (module, index) tuples.

The taint coverage has two key properties. The first is locality: coverage is measured at the module level, reflecting the propagation of sensitive data across different hierarchies. The second is position insensitivity, which helps filter out redundant encoding. For example, when sensitive data is encoded in different slots of the cache data array, the coverage points generated by the cache module remain identical.

Coverage Feedback. Once all packets are ready, DejaVuzz duplicates them with different secrets to generate two swappable stimuli for diffIFT. After RTL simulation, DejaVuzz first identifies the cycle range of the transient window by analyzing RoB port events in the trace log, and then examines taint changes within this range in the taint log. If taints increase, sensitive data has been successfully propagated, and DejaVuzz subsequently measures taint coverage based on the taint log. If the coverage increase is less than the average increase or sensitive data is not successfully propagated, DejaVuzz mutates the seed to regenerate the window section. If the results after multiple attempts still show low coverage growth, DejaVuzz discards the seed and returns to Phase 1.

4.3 Phase 3: Transient Leakage Analysis

In this phase, DejaVuzz analyzes whether the final state can leak sensitive data. For challenge C2-2, DejaVuzz uses taint liveness annotations to filter out unexploitable taints in the final analysis phase (§4.3.2).

4.3.1 Step 3.1: Constant Time Execution Analysis. For test cases that successfully access and propagate sensitive data, DejaVuzz further analyzes whether leakage occurred. It first compares the execution time of the transient window between the variants. If the results are inconsistent, sensitive data has caused timing side channels, such as port contention, during the transient window. DejaVuzz directly reports these test cases as potential vulnerabilities.

Encode Sanitization. Although test cases with transient window constant time execution cannot directly leak secrets through the timing side channel, the encoded sensitive data may still be leaked via other side channels. Since accessing sensitive data during training also generates taints, we need to isolate the taints caused by the secret encoding block before further analyzing whether the encoded sensitive data can be exploited. Therefore, DejaVuzz replaces the secret encoding block in the transient packet with nop instructions (❼) and re-runs the simulation. By comparing the sanitized taint log with the original taint log, DejaVuzz can identify the taints generated by the secret encoding block.

4.3.2 Step 3.2: Tainted Sink Liveness Analysis. The taints produced by diffIFT only indicate reachability. As the LFB example in §3.1 shows, not all encoded secrets are exploitable. Therefore, DejaVuzz further analyzes taint liveness to determine whether the tainted sinks can be exploited.

Taint Liveness Annotation. Inspired by selective data protection [1, 32, 52], DejaVuzz uses annotations to bind taint registers to their corresponding state registers. Developers can annotate the registers with the liveness_mask custom attribute [6, 40] to declare their state registers. Taking LFB as an example, the mshr_valid_vec signal comes from the state register in MSHR, and the lb register is the data buffer in LFB. Line 4 shows the annotation. During diffIFT instrumentation, DejaVuzz automatically connects the liveness signal mshr_valid_vec to the taint register of lb.

1 wire mshrs_0_valid, mshrs_1_valid;
2 wire [15:0] mshr_valid_vec =
3     {{8{mshrs_1_valid}}, {8{mshrs_0_valid}}};
4 (* liveness_mask = "mshr_valid_vec" *)
5 reg [63:0] lb [15:0];
6
7 BoomMSHR mshrs_0 (.io_mshr_valid(mshrs_0_valid));
8 BoomMSHR mshrs_1 (.io_mshr_valid(mshrs_1_valid));
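To make the liveness filter concrete, the following is a minimal Python model of how a liveness vector like mshr_valid_vec (lines 2-4 of the listing) could be applied to the taint words of lb. It is a sketch under stated assumptions, not DejaVuzz's implementation: in DejaVuzz the connection is made in hardware during diffIFT instrumentation. The function names build_liveness_vector and live_tainted_entries are hypothetical; the model assumes each MSHR manages a contiguous block of 8 lb entries (as in lines 2-3) and treats any nonzero taint word as tainted. Python is chosen only for illustration.

```python
from typing import List, Sequence


def build_liveness_vector(mshr_valid: Sequence[bool], entries_per_mshr: int = 8) -> int:
    """Pack per-MSHR valid bits into a per-entry liveness vector (hypothetical helper).

    Mirrors lines 2-3 of the annotation example: MSHR k manages lb entries
    [k * entries_per_mshr, (k + 1) * entries_per_mshr), so its valid bit is
    replicated across that block. Bit i of the result covers lb entry i.
    """
    block = (1 << entries_per_mshr) - 1
    vec = 0
    for k, valid in enumerate(mshr_valid):
        if valid:
            vec |= block << (k * entries_per_mshr)
    return vec


def live_tainted_entries(taints: Sequence[int], liveness: int) -> List[int]:
    """Return indices of entries that are both tainted and live (hypothetical helper).

    taints[i] is the taint word of lb entry i; a nonzero word means some bit
    carries secret-derived data. Tainted entries whose liveness bit is clear
    (e.g., residual data in a slot no valid MSHR owns) are filtered out.
    """
    return [i for i, t in enumerate(taints) if t != 0 and (liveness >> i) & 1]


if __name__ == "__main__":
    # Hypothetical snapshot: mshrs_0 idle, mshrs_1 valid; residual taint in
    # entry 3 (owned by mshrs_0) and live taint in entry 12 (owned by mshrs_1).
    liveness = build_liveness_vector([False, True])
    lb_taints = [0] * 16
    lb_taints[3] = 0xFF
    lb_taints[12] = 0x1
    print(hex(liveness))                              # 0xff00
    print(live_tainted_entries(lb_taints, liveness))  # [12]
```

Only entry 12 is reported: the taint left in entry 3 is ignored because the MSHR managing it is not valid, which mirrors the idea of reporting only sinks with valid liveness signals. Representing the vector as a Python int keeps the bit layout identical to the Verilog concatenation, where the left operand occupies the high-order bits.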
However, since the implementation of the state registers is coupled with the microarchitecture, developers may be unable to reference them directly. To accommodate various implementations, we design the liveness signal interface as a generic vector, with each bit representing whether the corresponding slot in the taint register array is valid. For example, the lower 8 entries of lb are managed by mshrs_0, while the upper 8 entries are managed by mshrs_1. We can construct the liveness signal as shown in lines 2-3. DejaVuzz currently requires developers to manually convert state registers into liveness signal vectors. Table 2 shows the manual effort required for annotation and patching. By default, DejaVuzz treats all register arrays (including registers generated by the Chisel Vec constructor) as potential sinks, and developers can customize sinks as needed. Finally, DejaVuzz identifies the target sinks from the filtered taint log and reports those with valid liveness signals as potential vulnerabilities.

5 Implementation

The implementation consists of 1) a testharness generator responsible for instrumenting RTL source code and integrating swapMem and two DUT instances into a testbench, and 2) the fuzzing pipeline illustrated in Figure 5.

Testharness Generator. We implement swapMem atop the Starship SoC generator [38], with ∼300 LoC of Python for swapMem RTL model generation and ∼500 LoC of DPI-C for the swapMem runtime. The diffIFT instrumentation adds new passes to the Yosys synthesizer to insert taint cells for taint propagation, involving ∼1K LoC of C++. The taint cell library of diffIFT is implemented in Verilog and also takes ∼1K LoC.

Fuzzing Pipeline. The fuzzing pipeline consists of ∼6500 LoC of Python and ∼180 LoC of RISC-V assembly code, covering stimulus generation and fuzzing management. DejaVuzz uses seeds to generate stimuli; a seed contains configurations for trigger instructions and transient windows, as well as entropy for the random instruction generator. The generator supports the RV64GC instruction set and covers common transient window types. The fuzzing manager employs a multi-threaded design, allowing multiple RTL simulation instances to run in parallel.

6 Evaluation

We evaluate DejaVuzz by answering the following questions:
• RQ 1. How effective and efficient is DejaVuzz in triggering diverse transient windows? (§6.2)
• RQ 2. How well does DejaVuzz trace sensitive data, improve coverage, and identify leakages? (§6.3)
• RQ 3. Can DejaVuzz uncover previously unknown transient execution bugs in real-world processors? (§6.4)

6.1 Experimental Setup

All experiments are conducted on a machine with dual AMD EPYC 9334 processors featuring 64 cores and 512GB of RAM. We use the industry-standard RTL simulator Synopsys VCS for RTL simulation. Limited by the number of licenses, we used at most 16 threads in the experiments.

We evaluate DejaVuzz on BOOM [57] and XiangShan [54], two well-known out-of-order processors that are actively maintained in the RISC-V community. BOOM is the third generation of the Berkeley out-of-order machine and is widely evaluated in related academic work [9, 11, 12, 18, 37, 39]. XiangShan is currently the highest-performance open-source RISC-V core and thus has a more complex architecture. Their configurations are summarized in Table 2.

Table 2. Summary of the cores used for evaluation.
Feature | BOOM | XiangShan
Configuration | SmallBOOM | MinimalConfig
ISA | RV64GC | RV64GC
Issue Width | 1 | 2
Commit Width | 1 | 2
Verilog LoC | 171K | 893K
IFT Netlist LoC | 1130K | 8979K
Annotation LoC | 212 | 592

Since IntroSpectre and TEESec only focus on Meltdown-type vulnerabilities and their released artifacts do not include a complete fuzzing framework, we only compare DejaVuzz with SpecDoctor. Due to the complex manual patching of the DUT required by SpecDoctor, we only perform this comparison on BOOM, which both fuzzers support.

6.2 Microarchitectural Controllability Evaluation

We collect 2,500 transient windows separately and summarize their types and training overhead in Table 3. The Training Overhead (TO) refers to the number of training instructions generated to trigger transient windows. Since DejaVuzz uses nop instructions to align training instructions with trigger instructions, we also compute the Effective Training Overhead (ETO) by excluding the padding nop instructions. For misprediction-type transient windows, since predictors have default prediction states, we exclude transient windows that require no training to trigger.

The results show that SpecDoctor can only cover 4 types of transient windows on BOOM and requires about 125 instructions for training. In contrast, DejaVuzz can trigger all types of transient windows with minimal overhead. Notably, the training reduction strategy successfully identifies the necessary training packets for triggering the transient window. Therefore, DejaVuzz can trigger exception-type transient windows with zero overhead and use a few training instructions (excluding nop instructions) to trigger misprediction-type windows. To show the effectiveness of the training derivation strategy, we introduce the DejaVuzz∗ variant. DejaVuzz∗ still uses swapMem, but its training packets consist of random instructions instead of being derived from transient execution information. Due to the training reduction strategy, both DejaVuzz∗ and DejaVuzz have zero or near-zero training overhead for exception-type transient windows. However, since random training fails to align trigger instructions and match transient execution flows, DejaVuzz∗ cannot trigger indirect jump misprediction on XiangShan. For the other misprediction-type transient windows, DejaVuzz∗ incurs higher training overhead due to the lack of targeted training. These results demonstrate that DejaVuzz can effectively and efficiently trigger more diverse transient windows.

Table 3. Training overhead for different types of transient windows. Each column reports TO (ETO).
Processor | Fuzzer | Load/Store Access Fault | Load/Store Page Fault | Load/Store Misalign | Illegal Instruction | Memory Disambiguation | Branch Misprediction | Indirect Jump Misprediction | Return Address Misprediction
BOOM | DejaVuzz | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | ✗ | 0.0 (0.0) | 86.4 (3.8) | 85.7 (2.8) | 85.6 (2.7)
BOOM | DejaVuzz∗ | 1.3 | 0.1 | 1.6 | ✗ | 0.2 | 102.2 | 169.5 | 89.5
BOOM | SpecDoctor | ✗ | 126.6 | ✗ | ✗ | 113.5 | 125.5 | 122.5 | ✗
XiangShan | DejaVuzz | 0.1 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 83.9 (2.8) | 90.1 (2.9) | 88.7 (2.9)
XiangShan | DejaVuzz∗ | 0.0 | 0.0 | 0.0 | 0.0 | 0.4 | 101.0 | ✗ | 97.0
✗ indicates that the corresponding type of transient window failed to trigger.

6.3 Microarchitectural Observability Evaluation

Micro-benchmark. We first evaluate the overhead of diffIFT instrumentation at compile time and runtime, using the state-of-the-art information flow tracking technique CellIFT as a reference. The compilation duration includes Chisel elaboration, Yosys instrumentation, and VCS synthesis. For runtime overhead, we manually implement a benchmark covering common transient execution vulnerability test cases and record simulation times. Table 4 shows the results, indicating that the overhead of diffIFT is acceptable compared to CellIFT. Since CellIFT instruments at the cell level, it requires flattening all memories, resulting in a significantly increased compilation time. In contrast, diffIFT instruments at the RTL IR level, achieving faster instrumentation. Figure 6 further shows the changes in the taint sum over cycles when executing the benchmark on BOOM. The result confirms that CellIFT suffers from taint explosion. Once all registers are tainted, CellIFT loses the ability to track secrets, and the simulation speed is severely degraded. By eliminating control taints caused by identical control signals, diffIFT effectively mitigates control flow over-tainting. Even with two DUTs instantiated in the testbench, the runtime overhead of DejaVuzz is still acceptable.

Table 4. Overhead of differential information flow tracking (time in seconds).
Stage | BOOM Base | BOOM CellIFT | BOOM diffIFT | XiangShan Base | XiangShan CellIFT | XiangShan diffIFT
Compile | 122 | 2856 | 268 | 638 | Timeout after 8h | 1781
Simulation: Spectre-V1 | 2.0 | 152.2 | 4.8 | 4.0 | Timeout after 8h | 17.5
Simulation: Spectre-V2 | 2.1 | 152.4 | 5.7 | 4.5 | Timeout after 8h | 19.8
Simulation: Meltdown | 2.1 | 152.6 | 5.6 | 4.7 | Timeout after 8h | 19.9
Simulation: Spectre-V4 | 2.0 | 152.2 | 4.9 | 4.3 | Timeout after 8h | 17.9
Simulation: Spectre-RSB | 2.0 | 152.0 | 4.8 | 4.3 | Timeout after 8h | 17.9

[Figure 6 panels: CellIFT, diffIFT, and diffIFT_FN, each plotting Taint Sum against Cycle for the Spectre-V1, Spectre-V2, Meltdown, Spectre-V4, and Spectre-RSB test cases.]
Figure 6. Taints during the execution of each test case. The dotted vertical line marks the start of the transient execution.

To understand the impact of false negatives, we also introduce the diffIFT_FN variant in Figure 6. In the diffIFT_FN variant, the two DUT instances in the testbench use the same secret so that all control signals are identical, representing the worst-case scenario of false negatives. After the transient window is triggered, the taint gradually increases as the secret is loaded into registers. However, since all control signals are the same, diffIFT_FN fails to propagate control taints while sensitive data is being encoded, so the taints stop increasing. The remaining taints are data taints carried by residual secrets in multiple caches and buffers. Therefore, when false negatives occur, data taints still propagate accurately, but control taints are suppressed due to identical control signals.

Coverage Evaluation. Next, we evaluate the efficiency of microarchitecture exploration. Figure 7 illustrates the growth trend of taint coverage on BOOM. Each experiment is repeated 5 times, and the shaded area represents the 95% confidence interval. To avoid the impact of performance differences between RTL simulators, we replay the phase 3 test cases generated by SpecDoctor in our environment to obtain comparable results and use the number of iterations as the x-axis. The y-axis represents the number of taint coverage points defined in §4.2.2. Due to the lack of feedback on the sensitive data propagation process, SpecDoctor only performs random mutations on
test cases that can produce different state hashes, limiting its ability to guide fuzzing. With the help of taints, DejaVuzz can guide mutation more effectively, ultimately exploring 4.7× more coverage than SpecDoctor. Moreover, DejaVuzz reaches SpecDoctor's saturation coverage in just 118 iterations. DejaVuzz− is used to demonstrate the effectiveness of using diffIFT as coverage. Instead of using taint coverage, it randomly updates the secret encoding block or regenerates a new transient window in each round. The result shows that DejaVuzz achieves a 22% coverage improvement over DejaVuzz− and reaches in 7,200 iterations the coverage that DejaVuzz− needs 20,000 iterations to reach. The coverage difference between them demonstrates that using taints as coverage enables more efficient microarchitecture exploration.

[Figure 7: Taint Coverage versus Iteration (0 to 20,000) for DejaVuzz, DejaVuzz−, and SpecDoctor.]
Figure 7. Taint coverage for 5 trials over 20,000 iterations.

Liveness Evaluation. We also found an interesting phenomenon: SpecDoctor did not report any vulnerabilities during the coverage evaluation. According to SpecDoctor's design, its phase 3 identified a total of 75 test cases that could encode sensitive data into the timing components and generate different state hashes. In its phase 4, SpecDoctor attempts to generate random instructions to decode secrets from those timing components. Unfortunately, SpecDoctor spent nearly a week executing 100,000 iterations without finding any vulnerabilities. We use taint liveness annotations to analyze all 75 test cases and find that only 17 of them are real leakages, while the rest are false positives. Most false positives are caused by secrets that fail to be encoded into the microarchitecture but still remain in the data cache. The one exception is an invalid test case that executes the transient window during the training. Limited by poor microarchitectural observability, SpecDoctor spends a significant amount of time futilely generating random instructions to decode unexploitable false positives. To further validate the effectiveness of taint liveness annotations, we re-execute the test cases using a DejaVuzz variant without taint liveness annotations. Only 21 test cases are correctly identified, while the remaining 54 cases are misclassified due to residual invalid taints in physical registers or the RoB. This highlights the effectiveness of taint liveness annotations. With the help of the liveness signals, DejaVuzz can identify exploitable leakages without resorting to inefficient and nondeterministic random decode instruction generation.

Table 5. Summary of discovered transient execution bugs.
Processor | Attack Type | Transient Window¹ | Encoded Timing Component²
BOOM | Meltdown | mem-excp | i/dcache, (l2)tlb, lsu
BOOM | Meltdown | mispred, mem-disamb | i/dcache, (l2)tlb
BOOM | Spectre | mem-excp | i/dcache, (fau)btb, ras, loop, lsu, fpu
BOOM | Spectre | mispred, mem-disamb | i/dcache, ras, loop, lsu, fpu
XiangShan | Meltdown | mem-excp, mispred, illegal, mem-disamb | i/dcache
XiangShan | Spectre | mispred, illegal, mem-excp, mem-disamb | i/dcache, lsu, fpu
¹ mem-excp: load/store misalign, load/store access/page fault exceptions; mispred: control-flow misprediction; illegal: illegal instruction exception; mem-disamb: memory disambiguation.
² lsu: load unit contention; fpu: floating-point unit contention; faubtb: first level branch target buffer; ras: return address stack; loop: loop branch predictor.

6.4 Bugs Found in Real-World Processors

Note that coverage is only used to evaluate exploration; higher coverage does not guarantee more bugs. Therefore, we also compare the bugs found during the evaluation. Table 5 categorizes all transient execution vulnerabilities discovered by DejaVuzz by attack type, transient window type, and exploited timing component. In comparison, SpecDoctor can only encode sensitive data into the dcache or trigger lsu port contention. Regarding the time to the first bug, SpecDoctor takes several days, whereas DejaVuzz detects the first bug in about 10 minutes on average with 16 threads. Similar to existing work [9, 12, 39], DejaVuzz can cover all trigger variations of known transient execution vulnerabilities; for example, it can use an unaligned memory access instead of a page fault to trigger the transient window in Meltdown. Additionally, DejaVuzz discovers 5 previously undiscovered transient execution vulnerabilities.

B1. MeltDown-Sampling (CVE-2024-44594) is a hybrid vulnerability of Meltdown and MDS on XiangShan, allowing attackers to sample controllable targets using illegal addresses within a transient window. DejaVuzz generates illegal addresses (e.g., 0x8000...80004000) through the secret access blocks with masks. Due to inconsistent wire widths, when the illegal address is sent from the pipeline to the load unit, the high-bit mask is implicitly truncated. Thus, attackers can sample the secret located at 0x80004000.

B2. Phantom-RSB (CVE-2024-44591) is a vulnerability on BOOM that allows transiently executed instructions to update the RSB. As shown in the code below, an attacker can
DejaVuzz: Disclosing Transient Execution Bugs with Enhanced Processor Fuzzing ASPLOS ’25, March 30-April 3, 2025, Rotterdam, Netherlands
corrupt the RSB based on sensitive data. BOOM implements a mitigation that restores the Top-Of-Stack (TOS) pointer and the return address in the top entry after mispredictions (line 11). However, DejaVuzz discovers that BOOM does not restore entries below the TOS pointer (line 10). After the RSB is corrupted, the attacker can leak the secret by measuring the execution time of the ret instruction.

 1  beq   a0, a0, foo   # Predicting the branch untaken, now TOS→X
 2  la    t0, secret    # Loading secret
 3  ld    s0, 0(t0)
 4  andi  s0, s0, 0x1   # If secret=1, ra=addr of line 6, a valid
 5  sub   s0, x0, s0    # addr; else ra=0, an illegal addr
 6  auipc ra, 0         # Following code requires ra has a valid
 7  and   ra, ra, s0    # addr, illegal addr will be blocked
 8  jalr  x0, 12(ra)    # Return to next, TOS→X-1
 9  jalr  x0, 16(ra)    # Return to next, TOS→X-2
10  jalr  ra, 20(ra)    # Call to next, overwrite X-1
11  jalr  ra, 24(ra)    # Call to next, overwrite X

B3. Phantom-BTB (CVE-2024-44590) is a vulnerability similar to Boombard [18], in which BOOM updates the BTB for exceptions under certain conditions. The following code illustrates the details. A race condition in BOOM occurs when an indirect jump misprediction coincides with an exception commit. In that case, BOOM misinterprets the exception as an indirect jump. It then uses the prediction correction for the mispredicted indirect jump (line 12) to update the BTB entry of the instruction that triggered the exception (line 1).

 1  lw    t0, 1(x0)     # Triggering a misaligned exception
 2  la    t0, secret    # Loading secret
 3  ld    s0, 0(t0)
 4  andi  s0, s0, 0x1   # If secret=1, ra=addr of line 6, a valid
 5  sub   s0, x0, s0    # addr; else ra=0, an illegal addr
 6  auipc ra, 0         # Following code requires ra has a valid
 7  and   ra, ra, s0    # addr, illegal addr will be blocked
 8  jalr  x0, 12(ra)
 9  nop                 # Padding nop to make the final
10  # ...               # misprediction commit with the
11  nop                 # exception in the same cycle
12  jalr  x0, 12(ra)    # Misprediction

B4. Spectre-Refetch (CVE-2024-44592, CVE-2024-44593) is a variant of Spectre-Rewind [10] discovered on both BOOM and XiangShan. DejaVuzz found that the instruction fetch unit can also be a source of port contention. Specifically, placing the secret-dependent branch at an address that triggers an instruction cache miss causes the processor to preempt the fetch unit during transient execution. This allows attackers to infer the secret by measuring the execution time of the first instruction after the transient window.

B5. Spectre-Reload (CVE-2024-44595) is another variant of Spectre-Rewind on XiangShan. DejaVuzz found that, due to prioritization, load queue entries contend for the load write-back port of the memory access unit. Attackers can replace the floating-point division instructions in the secret-dependent branch with cache-missing load instructions. They can then detect increased latency in the cache-missing loads that precede the transient window.

All of the above vulnerabilities can be exploited to leak sensitive data. B1 can directly leak secrets across privilege boundaries, while B2–B5 require access permission for the sensitive data to trigger. We disclosed the identified bugs by sending bug reports to the respective communities, following the security policy listed for each project. According to the maintainers, all vulnerabilities in XiangShan have been fixed, while the bugs in BOOM will be retained for future research. Therefore, we recommend against using the BOOM processor in security-critical environments.

7 Discussion and Limitation

Precision Trade-off. Implementing precise IFT is inevitably expensive since it is an NP-complete problem [17]. diffIFT can mitigate false positives caused by control flow over-tainting. However, it also introduces false negatives because it cannot exhaustively compare all secrets. In practice, DejaVuzz, as a dynamic verification solution, can mitigate false negatives by repeatedly attempting different secret pairs.

Training Preference. Some predictors may require longer training patterns. Consider mispredictions triggered by branch instructions. Training a loop predictor to trigger requires a much longer training instruction sequence than training a local branch history table to trigger. Because of the training reduction strategy, DejaVuzz therefore prefers the least costly training instruction sequence.

Stimulus Migration. The stimuli generated by DejaVuzz only work on swapMem. Fortunately, developers usually only need the simulation waveform files to pinpoint bugs. If the stimuli must be migrated to a standard memory model (e.g., for writing general-purpose exploits), careful manual packet stitching is required.

Manual Annotation. The state registers are coupled to the implementation, so they and their bound taints may reside in different pipeline stages or even in different modules. Semantic information is lost during design synthesis to RTL. As a result, DejaVuzz currently relies on manual taint liveness annotations. We leave automatic taint liveness annotation (e.g., using type-safe hardware description languages or large language models) for future work.

8 Related Work

Processor Fuzzing. Encouraged by the promising results of processor fuzzing on functional bugs [19, 20, 36, 53], several approaches have applied processor fuzzing to transient execution vulnerabilities. IntroSpectre [12] and TEESec [11] use manually crafted gadgets to trigger Meltdown-type vulnerabilities and detect leakages by analyzing processor runtime logs. SpecDoctor [18] generates stimuli for transient execution attacks in multiple phases and determines bugs by observing the final execution time. However, these approaches have the following main limitations. First, they
ASPLOS ’25, March 30-April 3, 2025, Rotterdam, Netherlands Jinyan Xu et al.
linearly generate transient windows or randomly combine instructions for training. This limits both the diversity and the efficiency of triggering transient windows. Second, they can only analyze shallow microarchitectural information. They therefore cannot provide feedback on the propagation of sensitive data or identify exploitable leakages. To address these limitations, DejaVuzz uses swapMem to generate and optimize training instructions that efficiently trigger diverse transient windows. It also employs differential information flow tracking to trace sensitive data, which provides coverage feedback and detects exploitable leakages.

Black-box Microarchitecture Fuzzing. Commercial processors lack interfaces for obtaining fine-grained internal state information, which limits the fuzzing exploration space. Most existing black-box fuzzers, such as SpeechMiner [51] and Transynther [27], rely on domain knowledge. They can only detect vulnerability variants within a limited template scope. Revizor [28, 29] introduces a model-based relational testing approach that generates random instructions to trigger contract violations. However, due to limited microarchitectural controllability, these fuzzers cannot even cover some known vulnerabilities that require simple training. Integrating swapMem (e.g., through DMA) can provide better control over the microarchitecture, facilitating deeper testing of black-box processors.

Formal Verification. By rigorously defining speculative contracts [13], formal verification can ideally catch all transient execution bugs or prove security. In practice, however, today's formal verification tools usually suffer from limited scalability and cannot be directly applied to complex out-of-order processors. To bypass this limitation, optimized verification schemes [9, 39, 43, 47] verify abstract models of out-of-order processors. However, the effectiveness of such formal checks depends on the precision of the models (e.g., B2–B4 all escaped previous formal analyses of BOOM). DejaVuzz can complement formal verification by checking implementation details that the models ignore.

9 Conclusion

In this paper, we presented DejaVuzz, a novel pre-silicon processor fuzzer designed to detect transient execution vulnerabilities effectively and efficiently. DejaVuzz introduces two innovative operating primitives to enhance microarchitectural controllability and observability. By leveraging dynamic swappable memory and differential information flow tracking, DejaVuzz efficiently triggers diverse transient windows, effectively guides mutation, and identifies exploitable leakages. We evaluated DejaVuzz on two well-known RISC-V out-of-order processors. It achieved up to 4.7× higher coverage than the state-of-the-art fuzzer SpecDoctor. Moreover, DejaVuzz identified 5 new transient execution vulnerabilities (with 6 CVEs assigned), showing its effectiveness in detecting previously unknown bugs.

10 Acknowledgments

The authors would like to thank Kaveh Razavi, Katharina Ceesay-Seitz, Tobias Kovats, Yuheng Yang, Ziqi Yuan, Zihang Zhong, Kaiqi Chen, Jin Shi, Tyler Hunt (the shepherd), and all anonymous reviewers for their comments and the time they devoted to improving this work. The authors also appreciate the developers in the open-source RISC-V processor communities for their helpful responses to our bug reports. They are Jerry Zhao from the BOOM team, as well as Yungang Bao, Yinan Xu, Miaomiao Yuan, Zhizun Wang, and Sen Liang from the XiangShan team. The authors are partially supported by the National Key R&D Program of China (No. 2022YFE0113200) and the National Natural Science Foundation of China under Grant No. 62361166633 and Grant U21A20464. Any opinions, findings, and conclusions or recommendations expressed in this material are solely those of the authors and do not necessarily reflect the views of the funding agencies.

References

[1] Salman Ahmed, Hans Liljestrand, Hani Jamjoom, Matthew Hicks, N Asokan, and Danfeng Daphne Yao. 2023. Not All Data are Created Equal: Data and Pointer Prioritization for Scalable Protection Against Data-Oriented Attacks. In 32nd USENIX Security Symposium (USENIX Security 23). 1433–1450.
[2] Armaiti Ardeshiricham, Wei Hu, Joshua Marxen, and Ryan Kastner. 2017. Register transfer level information flow tracking for provably secure hardware design. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. IEEE, 1691–1696.
[3] Enrico Barberis, Pietro Frigo, Marius Muench, Herbert Bos, and Cristiano Giuffrida. 2022. Branch History Injection: On the Effectiveness of Hardware Mitigations Against Cross-Privilege Spectre-v2 Attacks. In 31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Boston, MA, 971–988. https://www.usenix.org/conference/usenixsecurity22/presentation/barberis
[4] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. 2019. Fallout: Leaking data on meltdown-resistant cpus. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. 769–784.
[5] Chen Chen, Rahul Kande, Nathan Nguyen, Flemming Andersen, Aakash Tyagi, Ahmad-Reza Sadeghi, and Jeyavijayan Rajendran. 2023. HyPFuzz: Formal-Assisted Processor Fuzzing. In 32nd USENIX Security Symposium (USENIX Security 23). 1361–1378.
[6] Design Automation Standards Committee et al. 2009. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002) (2009), 1–640.
[7] Benjamin Cox, David Evans, Adrian Filipi, Jonathan Rowanhill, Wei Hu, Jack Davidson, John Knight, Anh Nguyen-Tuong, and Jason Hiser. 2006. N-Variant Systems: A Secretless Framework for Security through Diversity. In USENIX Security Symposium, Vol. 114. 114.
[8] Catherine Easdon, Michael Schwarz, Martin Schwarzl, and Daniel Gruss. 2022. Rapid prototyping for microarchitectural attacks. In 31st USENIX Security Symposium (USENIX Security 22). 3861–3877.
[9] Mohammad Rahmani Fadiheh, Alex Wezel, Johannes Müller, Jörg Bormann, Sayak Ray, Jason M Fung, Subhasish Mitra, Dominik Stoffel, and Wolfgang Kunz. 2022. An exhaustive approach to detecting transient execution side channels in RTL designs of processors. IEEE Trans. Comput. 72, 1 (2022), 222–235.
[10] Jacob Fustos, Michael Bechtel, and Heechul Yun. 2020. SpectreRewind: Leaking secrets to past instructions. In Proceedings of the 4th ACM Workshop on Attacks and Solutions in Hardware Security. 117–126.
[11] Moein Ghaniyoun, Kristin Barber, Yuan Xiao, Yinqian Zhang, and Radu Teodorescu. 2023. Teesec: Pre-silicon vulnerability discovery for trusted execution environments. In Proceedings of the 50th Annual International Symposium on Computer Architecture. 1–15.
[12] Moein Ghaniyoun, Kristin Barber, Yinqian Zhang, and Radu Teodorescu. 2021. Introspectre: A pre-silicon framework for discovery and analysis of transient execution vulnerabilities. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 874–887.
[13] Marco Guarnieri, Boris Köpf, Jan Reineke, and Pepe Vila. 2021. Hardware-software contracts for secure speculation. In 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 1868–1883.
[14] Jana Hofmann, Emanuele Vannacci, Cédric Fournet, Boris Köpf, and Oleksii Oleksenko. 2023. Speculation at Fault: Modeling and Testing Microarchitectural Leakage of CPU Exceptions. In 32nd USENIX Security Symposium (USENIX Security 23). 7143–7160.
[15] Wei Hu, Armaiti Ardeshiricham, and Ryan Kastner. 2021. Hardware information flow tracking. ACM Computing Surveys (CSUR) 54, 4 (2021), 1–39.
[16] Wei Hu, Jason Oberg, Ali Irturk, Mohit Tiwari, Timothy Sherwood, Dejun Mu, and Ryan Kastner. 2011. Theoretical Fundamentals of Gate Level Information Flow Tracking. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30, 8 (2011), 1128–1140. doi:10.1109/TCAD.2011.2120970
[17] Wei Hu, Jason Oberg, Ali Irturk, Mohit Tiwari, Timothy Sherwood, Dejun Mu, and Ryan Kastner. 2012. On the Complexity of Generating Gate Level Information Flow Tracking Logic. IEEE Transactions on Information Forensics and Security 7, 3 (2012), 1067–1080. doi:10.1109/TIFS.2012.2189105
[18] Jaewon Hur, Suhwan Song, Sunwoo Kim, and Byoungyoung Lee. 2022. SpecDoctor: Differential fuzz testing to find transient execution vulnerabilities. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. 1473–1487.
[19] Jaewon Hur, Suhwan Song, Dongup Kwon, Eunjin Baek, Jangwoo Kim, and Byoungyoung Lee. 2021. Difuzzrtl: Differential fuzz testing to find cpu bugs. In 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 1286–1303.
[20] Rahul Kande, Addison Crump, Garrett Persyn, Patrick Jauernig, Ahmad-Reza Sadeghi, Aakash Tyagi, and Jeyavijayan Rajendran. 2022. TheHuzz: Instruction fuzzing of processors using Golden-Reference models for finding Software-Exploitable vulnerabilities. In 31st USENIX Security Symposium (USENIX Security 22). 3219–3236.
[21] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. 2020. Spectre attacks: Exploiting speculative execution. Commun. ACM 63, 7 (2020), 93–101.
[22] Esmaeil Mohammadian Koruyeh, Khaled N Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. 2018. Spectre returns! speculation attacks using the return stack buffer. In 12th USENIX Workshop on Offensive Technologies (WOOT 18).
[23] Kevin Laeufer, Jack Koenig, Donggyu Kim, Jonathan Bachrach, and Koushik Sen. 2018. RFUZZ: Coverage-directed fuzz testing of RTL on FPGAs. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1–8.
[24] Xun Li, Mohit Tiwari, Jason K Oberg, Vineeth Kashyap, Frederic T Chong, Timothy Sherwood, and Ben Hardekopf. 2011. Caisson: a hardware description language for secure information flow. ACM Sigplan Notices 46, 6 (2011), 109–120.
[25] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, Mike Hamburg, and Raoul Strackx. 2020. Meltdown: Reading kernel memory from user space. Commun. ACM 63, 6 (2020), 46–56.
[26] Giorgi Maisuradze and Christian Rossow. 2018. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. 2109–2122.
[27] Daniel Moghimi, Moritz Lipp, Berk Sunar, and Michael Schwarz. 2020. Medusa: Microarchitectural data leakage via automated attack synthesis. In 29th USENIX Security Symposium (USENIX Security 20). 1427–1444.
[28] Oleksii Oleksenko, Christof Fetzer, Boris Köpf, and Mark Silberstein. 2022. Revizor: Testing black-box CPUs against speculation contracts. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 226–239.
[29] Oleksii Oleksenko, Marco Guarnieri, Boris Köpf, and Mark Silberstein. 2023. Hide and Seek with Spectres: Efficient discovery of speculative information leaks with random testing. In 2023 IEEE Symposium on Security and Privacy (SP). IEEE, 1737–1752.
[30] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. 2020. SpecFuzz: Bringing spectre-type vulnerabilities to the surface. In 29th USENIX Security Symposium (USENIX Security 20). 1481–1498.
[31] Sebastian Österlund, Koen Koning, Pierre Olivier, Antonio Barbalace, Herbert Bos, and Cristiano Giuffrida. 2019. kMVX: Detecting kernel information leaks with multi-variant execution. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. 559–572.
[32] Tapti Palit, Jarin Firose Moon, Fabian Monrose, and Michalis Polychronakis. 2021. Dynpta: Combining static and dynamic analysis for practical selective data protection. In 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 1919–1937.
[33] Chathura Rajapaksha, Leila Delshadtehrani, Manuel Egele, and Ajay Joshi. 2023. SIGFuzz: A framework for discovering microarchitectural timing side channels. In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1–6.
[34] Babak Salamat, Todd Jackson, Andreas Gal, and Michael Franz. 2009. Orchestra: intrusion detection using parallel execution and monitoring of program variants in user-space. In Proceedings of the 4th ACM European conference on Computer systems. 33–46.
[35] Edward J Schwartz, Thanassis Avgerinos, and David Brumley. 2010. All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask). In 2010 IEEE symposium on Security and privacy. IEEE, 317–331.
[36] Flavien Solt, Katharina Ceesay-Seitz, and Kaveh Razavi. 2024. Cascade: CPU fuzzing via intricate program generation. In Proc. 33rd USENIX Secur. Symp. 1–18.
[37] Flavien Solt, Ben Gras, and Kaveh Razavi. 2022. CellIFT: Leveraging Cells for Scalable and Precise Dynamic Information Flow Tracking in RTL. In 31st USENIX Security Symposium (USENIX Security 22). 2549–2566.
[38] Sycuricon. 2025. Starship SoC Generator. https://github.com/sycuricon/starship. [Accessed 01-07-2025].
[39] Qinhan Tan, Yuheng Yang, Thomas Bourgeat, Sharad Malik, and Mengjia Yan. 2024. RTL Verification for Secure Speculation Using Contract Shadow Logic. arXiv preprint arXiv:2407.12232 (2024).
[40] Donald Thomas and Philip Moorby. 2008. The Verilog® hardware description language. Springer Science & Business Media.
[41] Mohit Tiwari, Jason K Oberg, Xun Li, Jonathan Valamehr, Timothy Levin, Ben Hardekopf, Ryan Kastner, Frederic T Chong, and Timothy Sherwood. 2011. Crafting a usable microkernel, processor, and I/O system with strict and provable information flow security. ACM SIGARCH Computer Architecture News 39, 3 (2011), 189–200.
[42] Mohit Tiwari, Hassan MG Wassel, Bita Mazloom, Shashidhar Mysore, Frederic T Chong, and Timothy Sherwood. 2009. Complete information flow tracking from the gates up. In Proceedings of the 14th
international conference on Architectural support for programming languages and operating systems. 109–120.
[43] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. 2018. Checkmate: Automated synthesis of hardware exploits and security litmus tests. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 947–960.
[44] Daniël Trujillo, Johannes Wikner, and Kaveh Razavi. 2023. Inception: Exposing new attack surfaces with training in transient execution. In 32nd USENIX Security Symposium (USENIX Security 23). 7303–7320.
[45] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. 2018. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In Proceedings of the 27th USENIX Security Symposium. USENIX Association.
[46] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2019. RIDL: Rogue In-flight Data Load. In S&P.
[47] Zilong Wang, Gideon Mohr, Klaus von Gleissenthall, Jan Reineke, and Marco Guarnieri. 2023. Specification and verification of side-channel security for open-source processors via leakage contracts. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. 2128–2142.
[48] Ofir Weisse, Ian Neal, Kevin Loughlin, Thomas F Wenisch, and Baris Kasikci. 2019. NDA: Preventing speculative execution attacks at their source. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 572–586.
[49] Sander Wiebing, Alvise de Faveri Tron, Herbert Bos, and Cristiano Giuffrida. 2024. InSpectre Gadget: Inspecting the residual attack surface of cross-privilege Spectre v2. In USENIX Security.
[50] Johannes Wikner, Daniël Trujillo, and Kaveh Razavi. 2023. Phantom: Exploiting decoder-detectable mispredictions. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 49–61.
[51] Yuan Xiao, Yinqian Zhang, and Radu Teodorescu. 2020. SPEECHMINER: A Framework for Investigating and Measuring Speculative Execution Vulnerabilities. In 27th Annual Network and Distributed System Security Symposium.
[52] Jinyan Xu, Haoran Lin, Ziqi Yuan, Wenbo Shen, Yajin Zhou, Rui Chang, Lei Wu, and Kui Ren. 2022. RegVault: hardware assisted selective data randomization for operating system kernels. In Proceedings of the 59th ACM/IEEE Design Automation Conference. 715–720.
[53] Jinyan Xu, Yiyuan Liu, Sirui He, Haoran Lin, Yajin Zhou, and Cong Wang. 2023. MorFuzz: Fuzzing processor via runtime instruction morphing enhanced synchronizable co-simulation. In 32nd USENIX Security Symposium (USENIX Security 23). 1307–1324.
[54] Yinan Xu, Zihao Yu, Dan Tang, Guokai Chen, Lu Chen, Lingrui Gou, Yue Jin, Qianruo Li, Xin Li, Zuojun Li, Jiawei Lin, Tong Liu, Zhigang Liu, Jiazhan Tan, Huaqiang Wang, Huizhe Wang, Kaifan Wang, Chuanqi Zhang, Fawang Zhang, Linjuan Zhang, Zifei Zhang, Yangyang Zhao, Yaoyang Zhou, Yike Zhou, Jiangrui Zou, Ye Cai, Dandan Huan, Zusong Li, Jiye Zhao, Zihao Chen, Wei He, Qiyuan Quan, Xingwu Liu, Sa Wang, Kan Shi, Ninghui Sun, and Yungang Bao. 2022. Towards Developing High Performance RISC-V Processors Using Agile Methodology. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). 1178–1199. doi:10.1109/MICRO56248.2022.00080
[55] Yuheng Yang, Thomas Bourgeat, Stella Lau, and Mengjia Yan. 2023. Pensieve: Microarchitectural modeling for security evaluation. In Proceedings of the 50th Annual International Symposium on Computer Architecture. 1–15.
[56] Ruiyi Zhang, Taehyun Kim, Daniel Weber, and Michael Schwarz. 2023. (M)WAIT for It: Bridging the Gap between Microarchitectural and Architectural Side Channels. In 32nd USENIX Security Symposium (USENIX Security 23). 7267–7284.
[57] Jerry Zhao, Ben Korpan, Abraham Gonzalez, and Krste Asanovic. 2020. SonicBOOM: The 3rd Generation Berkeley Out-of-Order Machine. (May 2020).

A Artifact Appendix

A.1 Abstract

DejaVuzz is a novel pre-silicon stage processor transient execution bug fuzzer. This artifact contains the full implementation of DejaVuzz, as well as the datasets and scripts to reproduce the evaluation results in the paper (i.e., Table 3, Table 4, Figure 6, Figure 7, and Table 5).

A.2 Artifact check-list (meta-information)

• Compilation: RISC-V GNU toolchain 12.2.0.
• Data set: SpecDoctor testcase replay dataset, RISC-V transient execution bug dataset.
• Run-time environment: Synopsys VCS Simulator 2023.12.
• Hardware: Machine with >= 16 cores and >= 128 GB RAM.
• Output: Terminal outputs and figures.
• Experiments: Training overhead (Table 3), instrumentation overhead (Table 4), taint comparison (Figure 6), coverage comparison (Figure 7), and bug reproduction (Table 5).
• How much disk space required?: ∼3 TB.
• How much time is needed to prepare workflow?: ∼1 hour.
• How much time is needed to complete experiments?: ∼1 week.
• Publicly available?: https://github.com/sycuricon/DejaVuzz.
• Code/Data licenses?: MIT License.
• Workflow automation framework used?: phantom-make.
• Archived?: https://zenodo.org/records/15861610.

A.3 Description

A.3.1 How to access. The artifact is available on GitHub.

A.3.2 Software dependencies. RISC-V toolchain 12.2.0 and VCS simulator 2023.12 are required to run the experiments. The toolchain can be obtained from GitHub, while VCS requires a commercial license purchased from Synopsys.

A.3.3 Data sets. Evaluating SpecDoctor is beyond the scope of this artifact, so we provide pre-generated SpecDoctor test cases for replaying. In addition, the artifact includes a bug dataset containing all discovered transient execution vulnerabilities. The datasets and the source code of other dependent components can be downloaded from Zenodo.

A.4 Installation

Complete the setup according to the Requirements section in the README.md file located in the root directory. Make sure the setup is fully completed before running any experiment. Otherwise, the experiments may attempt to recompile dependencies, which can lead to unexpected behavior.
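Before starting the setup, it may help to confirm that the host meets the hardware requirements listed in A.2. The Python sketch below is not part of the artifact, and the README.md may define its own checks. It copies the thresholds from the check-list and reads GB and TB as decimal units, which is an assumption. It also probes RAM through POSIX `os.sysconf` values, so it assumes a Linux-like host.

```python
import os
import shutil

# Thresholds copied from the A.2 check-list. Reading GB/TB as decimal
# units (10**9 and 10**12 bytes) is an assumption of this sketch.
MIN_CORES = 16
MIN_RAM_BYTES = 128 * 10**9
MIN_DISK_BYTES = 3 * 10**12  # the check-list states "~3 TB", i.e. approximate


def unmet_requirements(cores, ram_bytes, free_disk_bytes):
    """Return a description of every A.2 hardware requirement that is not met."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"cores: {cores} < {MIN_CORES}")
    if ram_bytes < MIN_RAM_BYTES:
        problems.append(f"RAM: {ram_bytes / 10**9:.0f} GB < 128 GB")
    if free_disk_bytes < MIN_DISK_BYTES:
        problems.append(f"free disk: {free_disk_bytes / 10**12:.1f} TB < 3 TB")
    return problems


def probe_host(workdir="."):
    """Measure logical CPUs, physical RAM, and free disk space under workdir.

    os.cpu_count() counts logical CPUs, which may include SMT threads.
    os.sysconf with these names is POSIX-only (e.g., Linux), not Windows.
    """
    cores = os.cpu_count() or 0
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    free_disk_bytes = shutil.disk_usage(workdir).free
    return cores, ram_bytes, free_disk_bytes
```

On the machine that will run the experiments, `unmet_requirements(*probe_host("/path/to/artifact"))` returns an empty list when all three requirements are satisfied. The path is a placeholder. Passing the directory where the artifact will live checks free space on the filesystem that will actually hold the ∼3 TB of data.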
A.5 Experiment workflow

This artifact includes experiments to reproduce Table 3, Table 4, Figure 6, Figure 7, and Table 5. Each experiment is organized into a dedicated subdirectory under the exp/ directory. To run an experiment, users should go to the corresponding subdirectory and follow the instructions in its README.md file. In most cases, users simply need to execute the provided scripts in order and wait for the results.

A.6 Evaluation and expected results

Training Overhead (Table 3). The scripts to reproduce Table 3 are located in exp/table3. Users should enter this directory and execute the scripts within it. The results are printed on the terminal. The expected result is that DejaVuzz triggers more kinds of transient execution windows than SpecDoctor, with lower Training Overhead (TO). DejaVuzz and DejaVuzz∗ produce similar results for exception-type transient execution windows. However, DejaVuzz shows lower TO for misprediction-type windows, especially in terms of Effective Training Overhead (ETO).

Instrumentation Overhead (Table 4) and Taint Comparison (Figure 6). The scripts to reproduce Table 4 and Figure 6 are located in exp/table4_figure6. Users should enter this directory and execute the scripts within it. The results are saved in several log files and a figure. The expected outcome is that diffIFT exhibits lower compilation and simulation overhead than CellIFT. In the taint comparison figure, CellIFT suffers from taint explosion, while diffIFT keeps taints under control and avoids the explosion. diffIFT_FN misses some taints due to false negatives.

Coverage Comparison (Figure 7). The scripts to reproduce Figure 7 are located in exp/figure7. Users should enter this directory and execute the scripts within it. The results are saved in a figure. The expected result is that DejaVuzz achieves ∼4.7× higher coverage than SpecDoctor and ∼22% higher coverage than DejaVuzz−.

Bug Reproduction (Table 5). The scripts to reproduce Table 5 are located in exp/table5. Users should enter this directory and execute the scripts within it. The results are saved in log files. For the expected results, refer to the detailed execution log analysis in the README.md file in the same directory.
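A reproduced Figure 7 can be compared against the expected ratios stated above by computing the relative improvement of DejaVuzz over each baseline from the final coverage values. The helper below is hypothetical and not part of the artifact. The artifact's output format is not described here, so the final coverage values are passed in by hand. The 15% tolerance is an arbitrary illustrative choice, and the key "DejaVuzz-" uses an ASCII hyphen for DejaVuzz−.

```python
# Expected coverage ratios from A.6 (DejaVuzz coverage / baseline coverage):
# ~4.7x over SpecDoctor and ~22% higher (i.e., 1.22x) than DejaVuzz-.
EXPECTED_RATIOS = {"SpecDoctor": 4.7, "DejaVuzz-": 1.22}


def coverage_ratio(dejavuzz_cov, baseline_cov):
    """Ratio of DejaVuzz's final coverage to a baseline's final coverage."""
    if baseline_cov <= 0:
        raise ValueError("baseline coverage must be positive")
    return dejavuzz_cov / baseline_cov


def check_figure7(dejavuzz_cov, baselines, rel_tol=0.15):
    """Map each baseline name to (observed ratio, within tolerance of expected?).

    rel_tol is a fraction of the expected ratio; 0.15 is an illustrative
    choice, not a threshold defined by the artifact.
    """
    report = {}
    for name, cov in baselines.items():
        observed = coverage_ratio(dejavuzz_cov, cov)
        expected = EXPECTED_RATIOS[name]
        report[name] = (observed, abs(observed - expected) <= rel_tol * expected)
    return report
```

Consider made-up final coverage values of 470 for DejaVuzz, 100 for SpecDoctor, and 385 for DejaVuzz−. The observed ratios are 4.70 and about 1.22, so both fall within the tolerance. A ratio far outside it suggests the run did not complete or used different settings.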