DiffTest-H: Toward Semantic-Aware Communication in Hardware-Accelerated Processor Verification

Kunlin You
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
youkunlin24s@ict.ac.cn

Yinan Xu
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
xuyinan@ict.ac.cn

Kehan Feng
Beijing Institute of Open Source Chip, Beijing, China
fengkehan@bosc.ac.cn

Luoshan Cai
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
cailuoshan22z@ict.ac.cn

Yaoyang Zhou
Beijing Institute of Open Source Chip, Beijing, China
zhouyaoyang@bosc.ac.cn

Yungang Bao
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
baoyg@ict.ac.cn
Abstract

Verification has become the most time-consuming phase in chip development. Co-simulation frameworks simulate the design under test (DUT) with a golden reference model (REF) and compare their instruction-level results for verification, causing over 98% communication overhead: although hardware-accelerated platforms, such as FPGA and emulators, speed up DUT simulation by 300×–10000×, overall co-simulation speedup is still limited to 2.5×–20×.

In this paper, we propose DiffTest-H, a semantic-aware, hardware-accelerated co-simulation framework with three techniques reducing communication overhead while preserving debuggability: (1) Batch minimizes communication frequency by tightly packing structurally diverse verification events into a single transfer. (2) Squash reduces data transmission volume by fusing verification events with a decoupled checking order. (3) Replay preserves instruction-level debuggability by reprocessing the original, unfused verification events around the failure point.

DiffTest-H is deployed on both Palladium emulator and FPGA to verify a 6-wide, out-of-order RISC-V processor, XiangShan. It achieves simulation speeds of 478 KHz and 7.8 MHz respectively, with an 80× and 78× speedup over the baseline, 119× and 1945× faster than 16-thread Verilator, and uncovers 151 bugs in XiangShan.

CCS Concepts
• Hardware → Functional verification.

Keywords
Processor Verification, Simulation Acceleration, Co-simulation

ACM Reference Format:
Kunlin You, Yinan Xu, Kehan Feng, Luoshan Cai, Yaoyang Zhou, and Yungang Bao. 2025. DiffTest-H: Toward Semantic-Aware Communication in Hardware-Accelerated Processor Verification. In 58th IEEE/ACM International Symposium on Microarchitecture (MICRO ’25), October 18–22, 2025, Seoul, Republic of Korea. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3725843.3756108

This work is licensed under a Creative Commons Attribution 4.0 International License.
MICRO ’25, Seoul, Republic of Korea
© 2025 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-1573-0/25/10
https://doi.org/10.1145/3725843.3756108

1 Introduction

Verification has become the most time-consuming phase in modern chip development, accounting for over 50% of the overall workflow [17, 18]. The challenge becomes even more significant for industrial-scale processors with complex microarchitectures and instruction set architectures (ISAs), where exhaustive verification is essential for ensuring functional correctness.

Toward more efficient verification, co-simulation frameworks [14, 21, 23, 28, 42, 54] have been widely adopted in processor verification. In co-simulation, the design under test (DUT) and a golden reference model (REF) run in parallel, comparing their architectural states after each instruction. The co-simulation framework extracts verification events from the DUT, such as instruction commit and register updates, and compares them with REF. Additionally, the DUT-specific non-deterministic events (NDEs) [14, 21, 24, 39, 53, 54], such as external interrupts and MMIO access, must be fully synchronized to REF to align its architectural states with the DUT.

However, existing co-simulation frameworks are still inefficient. Traditional software-based solutions [21, 34, 54, 55] rely on RTL simulators [43, 46] to simulate the DUT. Despite extensive research working on its performance [3, 4, 6, 15, 16, 27, 38, 47–49, 58], the
MICRO ’25, October 18–22, 2025, Seoul, Republic of Korea Kunlin You, Yinan Xu, Kehan Feng, Luoshan Cai, Yaoyang Zhou, and Yungang Bao
simulation speed of large-scale DUTs is only a few kHz, making it impractical for verification requiring billions of test cycles.

Table 1: Verification events in DiffTest [34, 54].

Category           Types  Representative Examples
Control Flow       5      Exceptions and interrupts, Instruction commits, Traps, ...
Register Updates   9      CSRs, General-purpose registers, Floating-point registers, ...
Memory Access      3      Load/store operations, Atomic memory operations, ...
Memory Hierarchy   6      Cache refill operations, L1/L2 TLB operations, ...
RISC-V Extensions  9      Vector/Hypervisor CSRs, Vector registers, ...

Hardware-accelerated platforms, including emulators [7, 8, 19, 41, 44] and FPGAs [22, 24, 31, 40, 56, 57], offer promising simulation speed for better verification efficiency. Our evaluation shows that directly deploying the DUT on the emulator (Cadence Palladium) can yield a 300× speedup over RTL simulation, and a 10,000× speedup on the FPGA (Xilinx VU19P). In contrast, under co-simulation, where the DUT and the REF are deployed on the hardware and software side respectively, the speedup drops to less than 2.5× on the emulator and 20× on the FPGA. The reason is that the hardware-software communication becomes a new bottleneck, with over 98% of co-simulation time consumed by communication overhead.

The hardware-software communication, as a point-to-point interaction, can be modeled by the LogGP model [1, 12] and decomposed into three phases [10, 11, 25, 26, 30, 32, 37, 45]: communication startup, data transmission, and software processing. For example, in the co-simulation framework DiffTest [34, 54], which covers 32 verification events as shown in Table 1, each verification event requires a handshake to start up communication, and then transfers the DUT's architectural state to the REF for comparison, resulting in around 15 communications and 1.2 KB of transmitted data per cycle.

Existing works explore optimizations across the three phases of communication. The frequency of communication startup can be reduced by packing all verification events within a cycle into a single transfer [8, 9, 19]. The data transmission volume can be reduced by fusing same-type events, such as 𝑁 instruction commits into a single 𝑁-commit event [19, 40]. The software processing latency can be hidden through hardware-software parallelism [9, 24, 31, 56, 57]. However, existing works still face the communication bottleneck: the state-of-the-art Fromajo [57] achieves only 1 MHz co-simulation speed on a 100 MHz FPGA. Moreover, fusing verification events across cycles discards per-instruction details, weakening instruction-level debuggability.

To address communication challenges, Shannon and Weaver introduced semantic communication [52], emphasizing that understanding the information semantics improves communication efficiency. In co-simulation, verification events likewise carry three key semantic properties, which can be exploited to optimize communication while preserving debuggability:

(1) Structural Semantics denotes the length and data structure of verification events, which vary significantly across event types and increase the complexity of packing and unpacking. For example, the event lengths in DiffTest [34, 54, 55] differ by up to 170×. Existing packing schemes allocate fixed space for each verification event and pad invalid events with bubbles, resulting in more communications to transmit the same set of valid events. Leveraging structural semantics, we can tightly pack variable-length events with space allocated according to length, and extract packed events according to their data structures. Tight packing eliminates bubbles, so the same events fit in fewer packets and communication frequency drops.

(2) Order Semantics denotes the specific checking order of verification events. For example, the NDEs, such as external interrupts, force updates to the REF's state, requiring prior instructions to be checked while subsequent ones remain unchecked. Existing fusion approaches couple communication with checking order: the NDEs break the fusion of other events, and the already fused ones are transmitted ahead to REF for ordered checking, causing frequent fusion breaks and a limited fusion ratio. Leveraging order semantics, we can decouple communication from checking order: NDEs are transmitted ahead with order tags while other events continue to be fused, and the software reorders events by these tags to restore the required checking order. Order-decoupled fusion reduces fusion breaks and improves the fusion ratio, so less data is transmitted.

(3) Behavioral Semantics denotes the architectural behaviors checked by verification events, which help localize errors to specific microarchitectural components. However, fusing verification events weakens debuggability by discarding per-instruction behavioral details. Existing debugging methods rely on hardware snapshots to rerun the entire DUT for recovering the behavioral details, resulting in considerable resource and time overhead. Leveraging behavioral semantics, we can reprocess only the unfused verification events around the failure point rather than rerun the entire DUT, thereby enabling lightweight instruction-level debugging.

Building on the above three semantic properties, we propose DiffTest-H, a semantic-aware, hardware-accelerated co-simulation framework that significantly reduces communication overhead while preserving instruction-level debuggability:

(1) Batch minimizes communication frequency by tightly packing structurally diverse verification events into a single transfer. Leveraging structural semantics, Batch computes the offset and length of each valid event on hardware for tight packing, while the software parses packed events according to their data structures.

(2) Squash reduces data transmission volume by fusing verification events with a decoupled checking order. Leveraging order semantics, Squash allows NDEs to be transmitted ahead with order tags, while other events continue to be fused, and the software then reorders events by these tags to restore the required checking order.

(3) Replay preserves instruction-level debuggability by reprocessing the original, unfused verification events around the failure point. Leveraging behavioral semantics, Replay reprocesses only the unfused verification events rather than rerunning the entire DUT, enabling lightweight instruction-level debugging.

DiffTest-H is implemented and evaluated within the DiffTest [34, 54, 55] co-simulation framework to verify XiangShan [35, 50, 51, 54, 55], a 6-wide out-of-order RISC-V processor, covering 32 types of verification events, including instructions, cache coherence, TLB, vectorization, and virtualization. Deployed on both the Cadence Palladium emulator and FPGA, DiffTest-H achieves simulation
DiffTest-H: Toward Semantic-Aware Communication in Hardware-Accelerated Processor Verification MICRO ’25, October 18–22, 2025, Seoul, Republic of Korea
speeds of 478 KHz and 7.8 MHz respectively, with an 80× and 78× speedup over the baseline DiffTest, 119× and 1945× faster than 16-thread Verilator. DiffTest-H reduces communication overhead by 99.84% on the emulator, and is 7.8× faster than the state-of-the-art [56, 57] on the FPGA. DiffTest-H has uncovered over 151 complex bugs in XiangShan that require up to 2 months to identify with Verilator but are detected within 11 hours by DiffTest-H on Palladium. All of these bugs have been confirmed and fixed by XiangShan developers with more than 780 lines of code change across 19 pull requests.

In summary, we make the following contributions in this paper:

• We identify three stages of hardware-software communication: communication startup, data transmission, and software processing, and summarize three corresponding optimizations: packing, fusion, and hardware-software parallelism.
• We propose and open-source DiffTest-H¹, a semantic-aware, hardware-accelerated co-simulation framework: Batch minimizes communication frequency by tightly packing verification events. Squash reduces data volume by fusing events with a decoupled checking order. Replay preserves instruction-level debuggability by reprocessing events around failure.
• DiffTest-H, evaluated on XiangShan, an open-source 6-wide out-of-order RISC-V processor, achieves simulation speeds of 478 KHz on the Palladium emulator and 7.8 MHz on the FPGA, with an 80× and 78× speedup over baseline DiffTest, 119× and 1945× faster than 16-thread Verilator. It reduces communication overhead by 99.84% on the emulator and is 7.8× faster than the state-of-the-art [56, 57] on the FPGA.
• DiffTest-H uncovers over 151 complicated bugs in XiangShan, all of which have been fixed by XiangShan developers with over 780 lines of code change across 19 pull requests.

¹ https://github.com/OpenXiangShan/difftest

2 Background

In this section, we present three key aspects of hardware-accelerated processor co-simulation. First, we introduce the fundamental principles and workflow of co-simulation, illustrating how it ensures verification sufficiency and instruction-level debuggability. Second, we demonstrate the general structure of a co-simulation framework with an example of the DiffTest framework [34]. Third, we compare verification platforms of co-simulation, highlighting advantages and bottlenecks of hardware-accelerated co-simulation.

2.1 Processor Co-Simulation Verification

Processor co-simulation [21, 28, 42, 54] verifies functional correctness by running the design under test (DUT) in parallel with a software reference model (REF), and comparing their architectural states after each instruction.

As illustrated in Figure 1, a typical co-simulation workflow begins with the DUT and REF in the same initial state, and performs instruction-level comparison for processor verification: at each instruction, the co-simulation framework ① extracts the DUT's verification events, such as instruction commits, ② notifies the REF to ③ execute a corresponding instruction, and ④ compares architectural states of the DUT and REF. Once a mismatch is detected, the co-simulation aborts with a detailed bug analysis.

Figure 1: Co-simulation verification workflow. Each DUT event notifies the REF for execution and comparison: deterministic events are executed directly by the REF, while non-deterministic events are synchronized from the DUT.

Non-deterministic events (NDEs) [14, 21, 24, 39, 53, 54, 56, 57], such as interrupts and MMIO access, challenge co-simulation. These NDEs are specific to the DUT and cannot be reproduced independently by the REF. To accommodate this, co-simulation frameworks fully synchronize these NDEs from DUT to REF at precise instructions to correctly align their architectural states.

With comprehensive checking of diverse verification events at each instruction, co-simulation offers two major advantages:

Verification Sufficiency. Covering a wide range of verification states, co-simulation ensures sufficient verification of the DUT under ISA-level behaviors as well as complex non-deterministic scenarios.

Instruction-level Debuggability. Conducting comparisons after each instruction, co-simulation halts upon detecting any mismatch with a precise failure context, including mismatched verification events and cycle information for debugging.

2.2 Layout of Co-Simulation Framework

A typical processor co-simulation framework consists of three major components [21, 29, 42, 54, 55]: the monitor, the checker, and the communication unit, distributed across hardware and software to verify the correctness of the DUT.

On the hardware side, monitors are embedded into the processor to capture verification events such as instruction commits, register updates, and memory operations. Since these events are distributed across the DUT's microarchitecture, modern co-simulation frameworks [21, 54, 55] often implement monitors in high-level hardware description languages (HDLs) such as Chisel [2], enabling automated code generation to relieve manual wiring effort. The captured events are then formatted into structured data packets, which can be parsed by the software according to their data structure.

On the software side, the ISA checker operates alongside a REF, typically an Instruction Set Simulator (ISS) such as Spike [20] or NEMU [33]. The REF starts from the same initial state as the DUT, executes instructions accordingly, and is synchronized with non-deterministic events. The ISA checker uses the verification events monitored from the DUT to drive the REF's execution and performs comparisons after each instruction, as shown in Section 2.1.

Between the monitor and the checker, verification events are transmitted across the hardware-software interface, such as DPI-C.
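The step-and-compare workflow of Section 2.1, with the checker driving the REF from monitored events, can be sketched as follows. This is a minimal illustration under stated assumptions, not DiffTest's implementation: `Event`, `ToyRef`, and `cosim` are hypothetical names, and the REF is reduced to a list of register writes rather than a real ISS.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    kind: str      # "commit" (deterministic) or "interrupt" (non-deterministic)
    payload: dict


@dataclass
class ToyRef:
    """Hypothetical stand-in for an ISS: each instruction writes one register."""
    program: list                       # list of (rd, value) pairs
    pc: int = 0
    regs: dict = field(default_factory=dict)
    pending_irq: Optional[int] = None

    def step(self):
        # Execute one instruction of the workload.
        rd, value = self.program[self.pc]
        self.regs[rd] = value
        self.pc += 1

    def sync_interrupt(self, cause):
        # NDEs cannot be reproduced by the REF, so they are copied from the DUT.
        self.pending_irq = cause


def cosim(dut_events, ref):
    """Return None if all events match, else (event index, message)."""
    for idx, ev in enumerate(dut_events):
        if ev.kind == "interrupt":
            ref.sync_interrupt(ev.payload["cause"])
        elif ev.kind == "commit":
            ref.step()
            rd, dut_value = ev.payload["rd"], ev.payload["value"]
            if ref.regs.get(rd) != dut_value:
                return idx, f"x{rd}: DUT={dut_value} REF={ref.regs.get(rd)}"
    return None


program = [(1, 10), (2, 20)]
events = [
    Event("commit", {"rd": 1, "value": 10}),
    Event("interrupt", {"cause": 7}),
    Event("commit", {"rd": 2, "value": 99}),   # injected DUT bug
]
print(cosim(events, ToyRef(program)))
```

Because the comparison runs after every commit, the returned index points at the first mismatching event, which is the instruction-level failure context the section describes.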
Given the diversity of verification events, modern co-simulation frameworks [21, 34, 54, 55] typically use individual DPI-C functions for each event, resulting in frequent communication calls and large data transfers. For example, in the co-simulation framework DiffTest [34, 54, 55] covering 32 types of verification events, each event is transmitted through a separate DPI-C interface with an aggregated size of 11,496 bytes, leading to substantial communication overhead and challenging communication-sensitive simulations.

2.3 Hardware-accelerated Co-simulation

Co-simulation, as discussed earlier, consists of two main components: the REF and the DUT. In general, the REF is implemented in software for flexibility and ease of maintenance, while the DUT is deployed through three distinct approaches: RTL simulation, hardware emulation, and FPGA prototyping. Each option presents a unique trade-off between simulation speed, debuggability, and deployment cost, as summarized in Table 2.

Table 2: Comparison of co-simulation platforms.

Platform       Debuggability    Cost        Optimal Speed
RTL Simulator  Full visibility  Free        ∼3 KHz
Emulator       Waveform         Expensive   ∼500 KHz
FPGA           Limited          Affordable  ∼50 MHz

Depending on the DUT deployment platform, co-simulation can be categorized into two classes:

Software-based Co-simulation deploys both the REF and the DUT within a software environment, typically using RTL simulators such as Verilator [46] or Synopsys VCS [43]. In this setup, the DUT is translated into a high-level programming representation (e.g., C++), forming a directed graph where each node simulates a hardware signal. This method offers full design visibility and facilitates fine-grained debugging. However, the simulation speed is severely limited, typically reaching only a few KHz. As the complexity of the DUT increases, the performance of RTL simulators worsens, making software-based co-simulation impractical for verification requiring billions of test cycles.

Hardware-accelerated Co-simulation addresses the speed limitation by deploying the DUT onto hardware acceleration platforms, such as emulators (e.g., Cadence Palladium [7], Synopsys ZeBu [44], Siemens Veloce [41]) or FPGAs. Unlike software simulation, these platforms directly map the DUT onto physical hardware components, faithfully reproducing its behavior at much higher speeds, often achieving orders-of-magnitude improvements. The REF remains in software, preserving its flexibility.

However, this deployment across hardware and software introduces a new major bottleneck: communication overhead. Since the DUT and the REF are located on different physical platforms, verification states must be frequently transmitted across the hardware-software interface. As the amount of communication (both in terms of communication frequency and data volume) increases, the overall simulation speed becomes limited by the efficiency of the communication interface. For example, in the DiffTest framework applied to XiangShan [35], communication overhead accounts for over 98% of the total simulation time when running on the Palladium emulator at 500 KHz. Similarly, the Fromajo co-simulation framework [56, 57] on a 100 MHz FPGA also experiences communication overhead exceeding 99%. These findings highlight that, while hardware acceleration significantly improves the theoretical speed of DUT emulation, the overall speed of co-simulation is fundamentally constrained by communication efficiency.

3 Analytical Overhead Model

To clarify and quantify the contributors to communication overhead in hardware-accelerated co-simulation, we introduce an analytical overhead model inspired by the LogGP model [1], which decomposes the overhead into three stages of hardware-software interaction: communication startup, data transmission, and software processing. These stages manifest differently across simulation platforms and DUT designs. To ground the model, we provide a quantitative case study across different DUTs and platforms, and identify three key optimization guidelines for communication.

3.1 Theoretical Analysis

The communication overhead in hardware-accelerated co-simulation can be modeled with the LogGP model [1, 12], where the overall latency between the FPGA/emulator and the software can be decomposed into three stages [10, 11, 25, 26, 30, 32, 37, 45]:

Communication Startup. This stage involves handshake and synchronization for each communication invocation, necessary to establish a data connection between the asynchronously running hardware and software. For example, the Cadence Palladium emulator performs hardware-software synchronization at every DPI-C function call [25], while FPGA platforms rely on valid-ready handshakes as dictated by protocols like XDMA [26]. The startup overhead is primarily determined by the communication frequency (N_invokes) and the per-invocation latency (T_sync).

Data Transmission. After the connection is established, data is transmitted over the hardware-software link in fixed-length protocol frames, and each frame incurs transmission and propagation delay. The transmission overhead scales with the total data volume (N_bytes) and the available bandwidth (BW).

Software Processing. On the host side, software must receive data from buffers, drive the REF to execute the same instructions as the DUT (synchronizing non-deterministic behavior such as external interrupts, if any), and compare their states to verify correctness. In traditional step-and-compare strategies [34, 54, 55], hardware emulation pauses its clock until software processing completes. This part of the latency is abstracted as T_software.

The overall communication overhead can be expressed as:

\[ \mathit{Overhead} = N_{\mathrm{invokes}} \times T_{\mathrm{sync}} + N_{\mathrm{bytes}} \times \frac{1}{BW} + T_{\mathrm{software}} \tag{1} \]

3.2 Quantitative Analysis

The analytical model presented in Equation 1 provides a general framework for quantifying communication overhead on hardware-accelerated platforms with three key phases: communication startup, data transmission, and software processing. However, in practice, the relative contributions of the three phases vary significantly depending on both the verification coverage required by the DUT and the characteristics of the validation platform.
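As a worked instance of Equation 1, the sketch below evaluates the three terms for one simulated cycle. The 15 invocations and roughly 1.2 KB per cycle are the baseline DiffTest figures quoted in the introduction; the synchronization latency, bandwidth, and software time are hypothetical values picked only to make the arithmetic concrete, not measurements from the paper.

```python
def comm_overhead(n_invokes, t_sync, n_bytes, bandwidth, t_software):
    """Equation 1: startup + transmission + software processing (seconds)."""
    startup = n_invokes * t_sync
    transmission = n_bytes / bandwidth      # N_bytes * (1 / BW)
    return startup + transmission + t_software


# Per-cycle figures quoted for baseline DiffTest.
N_INVOKES = 15
N_BYTES = 1200           # about 1.2 KB, taken as 1200 bytes here

# Hypothetical platform parameters (assumptions for illustration only).
T_SYNC = 1e-6            # 1 us per invocation
BANDWIDTH = 1e9          # 1 GB/s
T_SOFTWARE = 0.0         # software processing assumed fully overlapped

baseline = comm_overhead(N_INVOKES, T_SYNC, N_BYTES, BANDWIDTH, T_SOFTWARE)
# Packing every event of the cycle into one transfer leaves a single startup.
packed = comm_overhead(1, T_SYNC, N_BYTES, BANDWIDTH, T_SOFTWARE)

print(f"baseline: {baseline * 1e6:.1f} us per cycle")
print(f"packed:   {packed * 1e6:.1f} us per cycle")
```

With these assumed parameters the startup term dominates, so reducing the invocation count matters more than reducing bytes; with a slower link or a faster handshake the balance would shift toward the transmission term.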
Figure 2: Overhead breakdown across DUTs and platforms. (Panels: NutShell + Palladium, XiangShan + Palladium, XiangShan + FPGA. Categories: DUT Emulation, Communication Startup, Data Transmission, Software Processing.)

To demonstrate the model's generality, we conduct evaluations based on NutShell [36], a scalar in-order processor, and XiangShan [35, 50, 51, 54], a 6-wide out-of-order processor, across both Palladium emulation and FPGA platforms. As shown in Figure 2, XiangShan incurs higher data transmission and software processing overhead than NutShell on the same Palladium platform, primarily because its expanded set of verification events leads to larger data volume and more complex checking. When comparing XiangShan across platforms, the FPGA setup shows higher communication startup but lower data transmission overhead, because the FPGA's PCIe interface has higher handshake latency yet greater bandwidth than Palladium's internal link.

3.3 Guiding Communication Optimizations

Based on Equation 1, the total overhead of software–hardware co-simulation can be decomposed into three phases: communication startup, data transmission, and software processing, which can be optimized as follows:

The frequency of communication startups can be reduced by packing multiple verification events into a single transfer [8, 9, 19]. For example, packing 256 16B events into a single 4KB transfer reduces startup cost by 256×.

The data transmission volume can be reduced by fusing same-type events [8, 19, 40, 57], such as 𝑁 instruction commits into one 𝑁-commit event with 𝑁× reduction in data volume.

The software processing latency can be hidden through hardware–software parallelism [9, 24, 31, 56, 57], also known as non-blocking support. Such support is widely available: emulators like Palladium provide primitives such as GFIFO, while FPGAs can emulate non-blocking transmission using multi-buffer FIFOs.

4 Semantic-aware Communication Mechanism

To reduce communication overhead while preserving debuggability, we propose DiffTest-H, a semantic-aware, hardware-accelerated co-simulation framework. This section introduces the overall framework of DiffTest-H and three optimization strategies.

4.1 Overview

DiffTest-H is a semantic-aware, hardware-accelerated co-simulation framework, covering 32 types of verification events and preserving instruction-level debuggability. Figure 3 shows the DiffTest-H framework under a dual-core design. The framework comprises five components: monitor, acceleration unit, communication unit, debugging unit, and ISA checker.

Figure 3: The DiffTest-H Framework.

As shown in Figure 3, the monitor captures verification events from the DUT. The events are then optimized by the acceleration unit and buffered for potential debugging.

The acceleration unit applies two key optimizations: Squash reduces data volume by fusing events (Section 4.3) and Batch minimizes communication frequency by packing events into a single packet (Section 4.2). The packed data is then transmitted in a non-blocking manner for software processing (Section 4.5), allowing the hardware to continue running at the same time.

Upon reception of verification events, the ISA checker runs the REF model accordingly and compares its state against the DUT to verify correctness. Once a mismatch is detected, the debugging flow is triggered: the Replay unit rolls back the fused buggy events and reprocesses the original unfused ones (Section 4.4), providing instruction-level debugging details around the failure point.

4.2 Batch: Packing with Structural Diversity

4.2.1 Why semantics matter? Minimizing communication frequency relies on effective packing of verification events. However, the structural diversity of events poses significant challenges to both hardware packing and software unpacking. As shown in Figure 4, the 32 types of verification events in the DiffTest co-simulation framework [34, 54, 55] exhibit size differences of up to 170×, along with highly variable transmission frequencies.

As illustrated in Figure 5, existing schemes [8, 9, 19] simplify packing by assigning each verification event a fixed-offset region in the packet. On the hardware side, the packer writes valid events into the assigned region, while on the software side, the parser always reads from the same region and extracts the event according to its data structure. However, this fixed-offset method requires padding for invalid events to preserve offsets for others. Evaluation on DiffTest shows that such padding leads to more than 60% invalid bubbles in the packet, thereby resulting in 1.67× more communications to transmit the same set of valid events.
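The difference between the fixed-offset scheme above and the length-based tight packing outlined in the introduction can be sketched in software. This is an illustrative model only: the event names, slot counts, and sizes in `LAYOUT`, and the functions `pack_fixed`, `pack_tight`, and `unpack_tight`, are hypothetical and do not reflect the DiffTest-H packet format or its hardware packer.

```python
# Hypothetical per-cycle layout: event type -> (slots per cycle, bytes per event).
LAYOUT = {"IC": (2, 8), "RU": (2, 16), "EI": (1, 4)}


def pack_fixed(valid):
    """Fixed-offset packing: every slot is reserved, invalid slots become bubbles."""
    packet, bubbles = bytearray(), 0
    for name, (slots, size) in LAYOUT.items():
        events = valid.get(name, [])
        for i in range(slots):
            if i < len(events):
                packet += events[i]
            else:
                packet += bytes(size)      # zero padding keeps later offsets fixed
                bubbles += size
    return bytes(packet), bubbles


def pack_tight(valid):
    """Tight packing: only valid events are stored; meta records (type, count)."""
    meta, payload = [], bytearray()
    for name in LAYOUT:
        events = valid.get(name, [])
        if events:
            meta.append((name, len(events)))
            for ev in events:
                payload += ev
    return meta, bytes(payload)


def unpack_tight(meta, payload):
    """Recover events; each offset is the sum of the preceding event lengths."""
    out, offset = {}, 0
    for name, count in meta:
        size = LAYOUT[name][1]
        out[name] = [payload[offset + i * size: offset + (i + 1) * size]
                     for i in range(count)]
        offset += count * size
    return out


cycle = {"IC": [b"\x01" * 8], "RU": [b"\x02" * 16]}
fixed_packet, bubble_bytes = pack_fixed(cycle)
meta, tight_payload = pack_tight(cycle)
print(len(fixed_packet), bubble_bytes, len(tight_payload))
```

The sketch does not count the bytes spent on metadata, so it overstates the saving slightly; the point is only that the parser can locate each event from type and count alone, without reserved slots.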
Figure 4: Verification event size and invocations in baseline DiffTest. Event IDs are ordered by increasing size, with transmission frequency measured as invocations per cycle. (Axes: Event ID, ordered by size; event size in bytes; invocations per cycle.)

Figure 5: Comparison of packing schemes. Fixed-offset packing inserts bubbles to preserve offsets, while Batch computes offsets as the sum of preceding event lengths.

Verification events inherently contain structural semantics, namely their length and data structure. By leveraging structural semantics, the hardware packer can dynamically allocate space according to the actual length of each event and tightly pack variable-length events of different types. As shown in Figure 5, the offset of a register update (RU) event can be computed by summing the lengths of the preceding events. On the software side, the parser can also use the length information to compute offsets of specific events and reconstruct them according to their data structure. Such tight packing eliminates invalid bubbles in the packet, improving bandwidth utilization and allowing the same set of valid events to be transmitted with fewer communications.

4.2.2 How semantics work? To minimize communication frequency, we propose Batch, a structure-wise mechanism that tightly packs events with different structures and supports dynamic unpacking on the software side. Batch is designed to address two main challenges: (1) how to tightly pack verification events with diverse structures and lengths; (2) how to dynamically unpack the packed events and reconstruct their original data structures.

As shown in Figure 6, the Batch workflow consists of two parts: Pack and Unpack. In the Pack stage, Batch applies a multi-level packing strategy, which tightly packs events of different structures together with metadata recording their structural information and counts. In the Unpack stage, Batch uses the metadata to extract events of specific lengths from the packet and invokes the corresponding parser functions to recover their original structures.

Figure 6: The Batch workflow. Verification events from different cycles are packed in three levels, accompanied by metadata recording their type and structure. The software then unpacks the events based on the metadata.

Figure 7: Type-level packing in Batch. 𝐾 valid entries are packed from 𝑁 incoming events, and the 𝐾-th entry comes from the valid event whose number of preceding valid events is exactly 𝐾 − 1. (Selection rule shown in the figure: for (i = K; i < N; i++) if (SumV(i) == K-1 && V(i)) entry[K] <= item[i].)

Multi-level Packing Strategy. Batch leverages a 3-level strategy that exploits structural similarity to reduce packing complexity:

(1) Type-Level. At the first level, Batch collects valid events from multiple same-type entries within a cycle and tightly packs them together for subsequent packing. Since events of the same type have identical structures and lengths, Batch employs a muxtree to aggregate them in parallel, where the 𝐾-th packed entry corresponds to the 𝐾-th valid incoming event of the cycle. As illustrated in Figure 7, Batch instantiates a prefix counter for each incoming event to count the number of preceding valid ones in parallel; when the count reaches 𝐾 − 1 and the current event is valid, that event is selected as the 𝐾-th packed entry. In this stage, Batch also generates metadata for each packed event, recording its structure and count to support subsequent hardware packing and software unpacking.

(2) Cycle-Level. At the second level, Batch further packs different types of events within the same cycle. For each type of packed event, Batch dynamically allocates a region in the packet, with its offset computed as the sum of the lengths of the preceding packed events. Since the valid count of packed events varies across cycles, both the offset and the region length need to be computed dynamically according to the actual counts of valid events recorded in the metadata.
(3) Transmission-Level. At the third level, Batch assembles packed data from different cycles into fixed-size transmission packets, with metadata and payloads concatenated separately. Because cycle packets are variable in length, the residual space of a packet may be insufficient to hold a full cycle packet, leaving unused space and wasting bandwidth. To address this, Batch uses metadata to split a cycle packet at event-type boundaries, filling the remaining space of the current packet and placing the rest into the next, thereby reducing required communications with fewer packets.

Figure 8: Comparison of fusion schemes. Order-coupled fusion breaks instruction fusion at each MMIO access (NDEs), while Squash decouples fusion from checking order, transmitting MMIO accesses ahead and then reordering them.

Dynamic Unpacking with Meta. Packing events with diverse structures and lengths complicates unpacking, as the parser must both locate variable-length events within a mixed packet and reconstruct their original structures. Batch resolves this challenge through a meta-guided dynamic unpacking mechanism. Alongside each packet, Batch generates a meta that records event types, counts, and offsets, and links the meta to its parsing function. Guided by this metadata, the software parser extracts variable-length events and invokes the corresponding reconstruction functions to restore their structures. In this way, Batch enables accurate and efficient unpacking of events despite flexible, tight packing.
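As an illustration of the Batch packing and meta-guided unpacking described in Section 4.2.2, the following Python sketch models the type-level selection rule (the 𝐾-th packed entry is the valid slot with exactly 𝐾 − 1 valid predecessors), cycle-level packing with computed offsets, and software unpacking driven by the meta. It is a software model under stated assumptions, not the Chisel implementation: the event layouts in `LAYOUTS`, the function names, and the `(type, count)` meta encoding are all hypothetical, and offsets are derived from the meta rather than stored in it.

```python
import struct
from typing import Dict, List, Optional, Tuple

Fields = Tuple[int, ...]

# Hypothetical event layouts (type id -> name, struct format); not taken from the paper.
LAYOUTS: Dict[int, Tuple[str, str]] = {
    0: ("IC", "<QI"),  # instruction commit: pc (u64), instruction word (u32) -> 12 bytes
    1: ("RU", "<BQ"),  # register update: register index (u8), value (u64) -> 9 bytes
}


def pack_type_level(slots: List[Optional[Fields]]) -> List[Fields]:
    """Type-level packing: entry k (0-based) is the valid slot that has
    exactly k valid slots before it, mirroring the prefix-counter selection."""
    valid = [s is not None for s in slots]
    prefix = [sum(valid[:i]) for i in range(len(slots))]  # valids preceding slot i
    return [
        next(slots[i] for i in range(len(slots)) if valid[i] and prefix[i] == k)
        for k in range(sum(valid))
    ]


def pack_cycle(slots_by_type: Dict[int, List[Optional[Fields]]]) -> Tuple[List[Tuple[int, int]], bytes]:
    """Cycle-level packing: each type's region starts where the previous one ends."""
    meta: List[Tuple[int, int]] = []
    payload = b""
    for type_id in sorted(slots_by_type):
        entries = pack_type_level(slots_by_type[type_id])
        if not entries:
            continue  # no bubble is emitted for a type without valid events
        meta.append((type_id, len(entries)))
        for fields in entries:
            payload += struct.pack(LAYOUTS[type_id][1], *fields)
    return meta, payload


def unpack_cycle(meta: List[Tuple[int, int]], payload: bytes) -> List[Tuple[str, Fields]]:
    """Software unpacking: offsets are the running sum of preceding event lengths."""
    events: List[Tuple[str, Fields]] = []
    offset = 0
    for type_id, count in meta:
        name, fmt = LAYOUTS[type_id]
        size = struct.calcsize(fmt)
        for _ in range(count):
            events.append((name, struct.unpack_from(fmt, payload, offset)))
            offset += size
    return events


# One cycle with two IC slots (one valid) and three RU slots (two valid).
example_slots = {0: [(0x80000000, 0x13), None], 1: [None, (5, 42), (6, 7)]}
example_meta, example_payload = pack_cycle(example_slots)
example_events = unpack_cycle(example_meta, example_payload)
```

In this model the payload carries only the three valid events, and the parser recovers them solely from the per-type counts, which is the property that lets Batch avoid the invalid bubbles of fixed-offset packing.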
4.3 Squash: Fusing with Order Decoupling

4.3.1 Why semantics matter? Reducing transmission data volume calls for fusing same-type verification events across instructions. For instance, a sequence of 𝑁 instructions can be fused into a single 𝑁-commit event with the final PC and the instruction count. However, non-deterministic events (NDEs) challenge fusion with a strict checking order requirement. The NDEs, such as external interrupts and MMIO access, are specific to the DUT and need to be synchronized to REF at precise instructions, forcing updates to the REF’s architectural states. As a result, any instructions prior to an NDE should be checked ahead while subsequent ones remain unchecked.

As illustrated in Figure 8, existing fusion approaches [8, 19, 40, 57] couple fusion to the checking order. Once an NDE is detected, they terminate the ongoing instruction fusion and transmit the fused instructions to the REF, ensuring the required checking order by consistent transmission order. The order coupling design causes frequent fusion breaks and a limited fusion ratio. In real-world workloads with substantial device interaction and frequent exceptions, such as OS boot, device drivers, and I/O-intensive applications, the fusion breaks more often, markedly reducing the overall fusion ratio.

Verification events inherently carry order semantics, which reflect the required checking order between verification events. By leveraging order semantics, the fusion and communication order can be decoupled from the checking order: NDEs can be transmitted ahead with order tags, indicating after which instruction they should be checked. Meanwhile, other events continue to be fused across instructions. On the software side, the checker reorders events by the tags to restore the required checking order. Such order-decoupled fusion reduces NDE-induced fusion breaks and improves fusion efficiency with less data volume.

Moreover, the verification events exhibit natural repetitiveness and locality. For example, some control and status registers (CSRs) remain unchanged over long instruction sequences. Such repetitiveness allows for referencing unchanged parts of prior events to avoid redundant transmission.

Figure 9: The Squash workflow. Verification events are fused across cycles, with the NDEs scheduled ahead for continuous fusion of DEs. Differencing removes the redundancy in MA4 relative to MA2. The software then completes MA4 and restores the checking order by inserting MA back into IC. (Legend: IC = instruction commit, a fusible DE; MA = MMIO access, a non-fusible NDE.)

4.3.2 How semantics work? To reduce transmission data volume while preserving correct checking order, we propose Squash, an order-aware fusion scheme that fuses verification events from different instructions with decoupled checking order. Squash addresses two key challenges: (1) how to preserve checking order and enable continuous fusion without frequent breaks; (2) how to eliminate redundant event contents for further data reduction.

As illustrated in Figure 9, the Squash workflow consists of three stages: fusion and scheduling, differencing, and reordering. On the hardware side, same-type events are fused across instructions, while non-fusible NDEs are scheduled ahead with order tags. Both the fused events and NDEs are then differenced to remove redundancy. On the software side, events are completed from the preceding event and reordered to the original checking sequence.

Fusion and Scheduling. Squash fuses verification events of the same type into a single event that represents the collective effect of a sequence. For instance, a sequence of instruction commits (ICs) can be fused into one fused IC containing dense information such as the
final PC, the number of committed instructions, and corresponding register updates. Meanwhile, non-fusible NDEs are scheduled for transmission ahead with order tags, notifying the checker to check them at the precise instruction order. By binding each event to its nearest preceding instruction commit, Squash preserves the checking order and ensures that the REF verifies events in the same order as the DUT, especially for non-deterministic behaviors such as external interrupts and MMIO access.
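The order-decoupled fusion above can be sketched in Python as follows. This is a minimal model under our own assumptions: the class names, the `after_commits` tag encoding (the number of instruction commits preceding the NDE within the fused window), and the `("step", n)` / `("sync", event)` schedule format are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class Commit:  # deterministic, fusible event (IC)
    pc: int


@dataclass
class Mmio:  # non-deterministic, non-fusible event (MA)
    addr: int
    data: int


@dataclass
class TaggedNde:
    after_commits: int  # order tag: check after this many commits of the window
    event: Mmio


@dataclass
class FusedCommit:
    final_pc: Optional[int]
    count: int


def squash(window: List[Union[Commit, Mmio]]) -> Tuple[List[TaggedNde], FusedCommit]:
    """Hardware side: fuse every commit in the window without breaking at NDEs;
    each NDE is bound to its nearest preceding commit through the tag."""
    ndes: List[TaggedNde] = []
    count, final_pc = 0, None
    for ev in window:
        if isinstance(ev, Commit):
            count += 1
            final_pc = ev.pc
        else:
            ndes.append(TaggedNde(after_commits=count, event=ev))
    return ndes, FusedCommit(final_pc, count)  # NDEs are transmitted ahead


def reorder(ndes: List[TaggedNde], fused: FusedCommit) -> List[Tuple[str, object]]:
    """Software side: rebuild the checking order from the tags."""
    schedule: List[Tuple[str, object]] = []
    done = 0
    for nde in sorted(ndes, key=lambda n: n.after_commits):
        if nde.after_commits > done:
            schedule.append(("step", nde.after_commits - done))
            done = nde.after_commits
        schedule.append(("sync", nde.event))
    if fused.count > done:
        schedule.append(("step", fused.count - done))
    return schedule


# Window shaped like Figure 9: IC1 IC2 MA2 IC3 IC4 MA4 IC5.
window = [Commit(0x100), Commit(0x104), Mmio(0x4000, 1),
          Commit(0x108), Commit(0x10C), Mmio(0x4000, 2), Commit(0x110)]
ndes, fused = squash(window)
schedule = reorder(ndes, fused)
```

For this window a single fused commit of five instructions is produced, and the checker steps the REF by two instructions, synchronizes the first MMIO access, steps two more, synchronizes the second, and steps the last one, matching the reordered sequence drawn in Figure 9.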
Figure 10: Comparison of debugging schemes after fusion. Snapshot debugging re-executes the entire DUT from a snapshot, while Replay retransmits buffered verification events to restore instruction-level debuggability.

Differencing. To exploit event repetitiveness, Squash applies differencing to remove unchanged fields in each event. Events are decomposed into smaller units (e.g., CSR entries), and only modified ones are transmitted (e.g., via XOR operations). On the software side, the checker keeps the latest record and completes events by filling unchanged fields from the previous ones, as in completing MA4 from MA2 in Figure 9.
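A minimal sketch of differencing and completion is given below. The text only states that events are split into units and that changes may be detected via XOR; the bitmask encoding, the function names, and the example CSR values are assumptions made for illustration.

```python
from typing import List, Tuple


def diff_event(prev: List[int], cur: List[int]) -> Tuple[int, List[int]]:
    """Hardware side: keep only the units that differ from the previous event.
    Bit i of the returned mask is set when unit i changed."""
    mask, changed = 0, []
    for i, (p, c) in enumerate(zip(prev, cur)):
        if p ^ c:  # XOR is non-zero exactly when the unit changed
            mask |= 1 << i
            changed.append(c)
    return mask, changed


def complete_event(prev: List[int], mask: int, changed: List[int]) -> List[int]:
    """Software side: fill unchanged units from the latest recorded event."""
    it = iter(changed)
    return [next(it) if (mask >> i) & 1 else p for i, p in enumerate(prev)]


# Hypothetical CSR snapshot with four units; only the second unit changes.
prev_csrs = [0x8000000A00006000, 0x1800, 0x0, 0x222]
cur_csrs = [0x8000000A00006000, 0x1808, 0x0, 0x222]
mask, changed = diff_event(prev_csrs, cur_csrs)
restored = complete_event(prev_csrs, mask, changed)
```

Here one unit plus the mask is transmitted instead of four units, and the checker reconstructs the full event from its latest record.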
4.4 Replay: Debugging with Instruction-level Behaviors

4.4.1 Why semantics matter? Debugging requires localizing processor errors to specific microarchitectural components. Verification events enable this by checking the DUT’s architectural state after each instruction, with each event corresponding to a specific architectural behavior and covering the relevant microarchitectural components. For example, bugs in the DUT’s memory subsystem can be exposed by register updates comparison from load/store instructions, as well as the refill value check from cache accesses.

As shown in Figure 10, existing work [9, 24, 31, 56, 57] relies on verification events to detect errors, but still falls back to waveforms for debugging. To recover waveforms near the failure point, they snapshot the entire DUT and re-execute from the nearest checkpoint. Since the root cause may precede the observed failure, snapshots must be taken periodically, incurring substantial resource and time overhead.

Verification events also carry behavioral semantics, referring to architectural behaviors and their mapping to specific microarchitectural components. These semantics can guide debugging by localizing faults more precisely. To address the loss of per-instruction detail caused by fusion, we reprocess only the unfused verification events around the failure, rather than re-executing the entire DUT. This restores instruction-level behavioral details and pinpoints the faulty instruction and related microarchitectural component, enabling lightweight and effective debugging.

4.4.2 How semantics work? To restore instruction-level debuggability, we propose Replay, a lightweight debugging mechanism that localizes bugs by reprocessing unfused events around the failure point. Replay addresses two key challenges: (1) how to determine the range of retransmission; (2) how to recover the REF’s state for reprocessing.

As illustrated in Figure 11, the Replay workflow contains a hardware retransmission module and a software checking module. On the hardware side, verification events are buffered during fusion and retransmitted upon notification. On the software side, once a fused event mismatches, the REF reverts its state, requests retransmission, and reprocesses the unfused events for debugging.

Figure 11: The Replay workflow. Verification events are buffered during fusion. When a fused event mismatches, the REF reverts it, notifies the hardware to retransmit the buffered unfused ones, and reprocesses them for debugging.

Range Determination. Optimizations and communication latency make it difficult to identify the range of unfused events that need to be replayed. Replay introduces a token-based management mechanism: tokens are assigned to buffered verification events before fusion, and fused together during optimization. Upon detecting a mismatch, Replay uses these tokens to locate the exact range of events and notifies the hardware to retransmit only the necessary buffered events. Tokens also filter out irrelevant events that may arrive between the bug occurrence and replay notification, ensuring consistent replay.

Revert Reference Model. Since mismatches may occur at any check, Replay must revert the REF to the latest checkpoint before reprocessing events. Directly snapshotting the REF at each checkpoint would be prohibitively expensive, especially in memory usage. Instead, Replay adopts a compensation-based strategy: it records only the modifications between consecutive checkpoints. When a fused event is found buggy, Replay restores the REF by rolling back these changes. For example, the original values of memory updates between two checkpoints are logged; reverting simply writes back these logs in reverse order to achieve lightweight state recovery.

4.5 Other Design Issue

Software processing latency can be hidden through hardware–software parallelism enabled by non-blocking communication. In co-simulation, all checks are assumed to match until the first mismatch occurs, so the DUT does not need to wait for software results and can
speculatively continue execution in parallel with software processing, thereby hiding software latency. If the checker later detects a mismatch, it asynchronously notifies the ahead-running DUT and terminates the co-simulation.

However, non-blocking transmission introduces new challenges such as out-of-order delivery and transmission bursts. To ensure correct processing, we employ a unified hardware–software interface (Section 4.2), where all verification events are transmitted in structured packets for ordered parsing. To handle bursts, the communication unit incorporates sending and receiving queues with backpressure, ensuring reliable and balanced transmission.

4.6 Put It All Together

Figure 12: The DiffTest-H workflow. (Steps: ① Monitor, ② Buffering, ③ Squash, ④ Batch, ⑤ NonBlock, ⑥ Step & Compare, ⑦ Replay, ⑧ Report & Finish.)

Putting the acceleration and debugging units together, DiffTest-H effectively optimizes hardware–software communication overhead while preserving instruction-level debuggability. Figure 12 illustrates the workflow of DiffTest-H and the eight procedures.

On the hardware side, the verification events are captured and collected by the monitor unit inserted into the DUT (see ①). The events are buffered for potential debugging purposes (see ②), and then optimized by the acceleration unit (see ③–④). Specifically, Squash performs both fusion and differencing: it fuses verification events (e.g., B3–B4) and filters out unchanged fields between successive events (e.g., B0-2 and B3-4) to reduce data transmission volume (see ③). Batch further packages diverse events across multiple cycles to minimize communication frequency (see ④).

Through the non-blocking communication unit, the packed events are transmitted to the software side (see ⑤), where they are extracted with computed offsets and reconstructed to their original data structure. The events are then checked step-by-step to verify the DUT against the REF (see ⑥).

Upon detecting a mismatch, the replay unit recovers the REF’s state by rolling back the last faulty events (e.g., B3-4) and notifies the hardware to retransmit the corresponding buffered data (e.g., B3–B4) (see ⑦). Finally, the checker reprocesses the pre-fusion events up to the specific buggy point and completes the co-simulation with a detailed debugging report, thereby preserving instruction-level debuggability (see ⑧).

5 Implementation

DiffTest-H is implemented in Chisel, a high-level hardware description language (HDL). This section introduces DiffTest-H’s compatibility across platforms and designs. To optimize DiffTest-H in different scenarios, we have also proposed an open-source tuning toolkit, supporting performance evaluation, SQL analysis, and iterative debugging of DiffTest-H.

Design/Platform Compatibility. As mentioned earlier, the DiffTest-H framework mainly consists of five units besides the DUT: monitor unit, acceleration unit, communication unit, check unit, and replay unit. By simply instantiating a piece of probe logic from the DUT, the DiffTest-H framework can automatically generate matching logic for all five units, providing compatibility for different designs and platforms: we have deployed DiffTest-H on both NutShell, a scalar in-order processor, and XiangShan, a 6-wide out-of-order dual-core processor, across platforms including emulator (Cadence Palladium), FPGA, and RTL simulator (Verilator, VCS), speeding up co-simulation with hardware acceleration.

Tuning Toolkit. To further explore the optimization space for different designs and co-simulation strategies, we have constructed a complete open-source toolkit, which mainly includes three parts:

(1) Performance evaluation support: DiffTest-H integrates performance counters in both software and hardware. On the software side, the counters collect performance statistics, such as the transmission times and data volume. On the hardware side, the counters monitor performance-related indicators, including Squash fusion ratios and Batch packet utilization. These metrics will be used to guide the adjustment of optimization for better performance.

(2) SQL analysis support: DiffTest-H records online transmission data in an SQL database for offline analysis. With this SQL backend, DiffTest-H can also simulate order-decoupled fusion and differencing strategy on the software, thereby fully exploiting event correlations and reducing data transmission volume.

(3) Iterative debugging support: When debugging DiffTest-H’s verification logic, it is time-consuming and resource-wasting to include the unchanged DUT during compilation and execution. To support independent iteration, DiffTest-H decouples the DUT and verification logic by trace dumping and reloading. The mechanism dumps the original verification events captured from the DUT during the first run, which is also called the DUT trace. Based on the traces, DiffTest-H generates and drives verification logic independently, supporting lightweight and rapid iterative debugging.

6 Evaluation

In this section, we evaluate the performance and resource utilization of DiffTest-H across various DUT scales. We further perform an optimization breakdown to quantify the contribution of each strategy within DiffTest-H. Finally, we demonstrate the effectiveness of DiffTest-H in the development of XiangShan, a 6-wide out-of-order processor. Our results are highlighted as follows:
• On Cadence Palladium, DiffTest-H achieves 80× speedup over baseline, and is 119× faster than a 16-thread Verilator simulation, reducing communication overhead by 99.8%.
• On FPGA, DiffTest-H achieves 78× speedup over baseline, and is 1945× faster than a 16-thread Verilator simulation, reducing communication overhead by 98.8%.
• DiffTest-H incurs a maximum resource overhead of 26%, reduced to 6% when disabling Batch packing.
• DiffTest-H uncovers over 151 complex bugs in XiangShan that require up to 2 months to identify with Verilator but are detected within 11 hours by DiffTest-H on Palladium.

Figure 13: Performance comparison. (Series: Verilator, PLDM (Baseline), PLDM (DiffTest-H), FPGA (Baseline), FPGA (DiffTest-H); DUTs: NutShell, XiangShan (Minimal), XiangShan (Default), XiangShan (Default, 2C).)
6.1 Experimental Setup

Table 3: Experimental Setup.

Feature    Configuration
DUT        NutShell, scalar, in-order
           XiangShan (Minimal), 2-wide, out-of-order
           XiangShan (Default), 6-wide, out-of-order
           XiangShan (Default, dual-core), 6-wide, out-of-order
Platform   Emulator: Cadence Palladium
           FPGA: Xilinx VU19P
Workload   Linux boot (∼1.7B instructions)
           KVM, XVISOR, RVV_TEST, SPEC CPU 2006

Table 4: Scales and verification coverage across DUTs.

DUT                       Gates     Event Types   Avg. Bytes per Instr.
NutShell                  0.6 M     6             93
XiangShan (Minimal)       39.4 M    32            692
XiangShan (Default)       57.6 M    32            1437
XiangShan (Default, 2C)   111.8 M   32            3025

To demonstrate the generalizability of DiffTest-H across DUTs and platforms, we evaluate it on both Palladium and FPGA using NutShell and XiangShan across different configurations. To further validate DiffTest-H’s effectiveness under full-system workloads, we employ benchmarks including Linux boot and SPEC CPU2006, covering most verification scenarios involving control flow, register updates, memory access, hierarchy, and optional ISA extensions listed in Table 1. The experimental setup is listed in Table 3 with scales and verification coverage across DUTs listed in Table 4.

6.2 Performance Evaluation

We evaluate DiffTest-H’s performance against both the state-of-the-art RTL simulator and emulation platforms under various setups. The performance results are obtained on realistic benchmarks, including Linux, KVM, XVISOR, RVV_TEST, and SPEC CPU 2006.

Figure 14: Bug detection time. (Series: DiffTest-H, Verilator.)

On the large-scale DUT co-simulation, DiffTest-H demonstrates significant acceleration, achieving an 80× speedup over the unoptimized Palladium baseline and running 119× faster than a 16-thread Verilator simulation. Across all DUT scales, including small and mid-sized configurations, DiffTest-H consistently delivers over 74× speedup compared to the baseline, highlighting its effectiveness across a range of design complexities.

Furthermore, DiffTest-H’s acceleration capability significantly improves the efficiency of functional debugging. As illustrated in Figure 14, complex bugs that require millions to billions of simulation cycles to manifest can be detected within 11 hours on Palladium using DiffTest-H, whereas traditional simulation with Verilator would take up to 2 months under the same conditions. These bugs, uncovered during the verification of the XiangShan project, have been officially reported to and acknowledged by the XiangShan development team, demonstrating the practical effectiveness of DiffTest-H in real-world chip development scenarios.

By greatly accelerating bug discovery and iteration, DiffTest-H enables designers to quickly identify and fix bugs, enhancing the productivity and reliability of chip development.

Figure 13 presents DiffTest-H’s performance when running Linux boot across the DUT configurations listed in Table 4. The performance results are quantified in kilocycles per second (KHz). For comparison, we measure the performance under identical DUT and benchmark across the following setups: (a) 16-thread Verilator, the current state-of-the-art RTL simulator; (b) Unoptimized Palladium setup, serving as the baseline for DiffTest-H; (c) DUT-only Palladium setup, representing the theoretical maximum simulation speed without any co-simulation overhead.

6.3 Optimization Breakdown

To evaluate the effectiveness of DiffTest-H’s optimization techniques across different designs and platforms, Table 5 presents incremental performance results on NutShell and XiangShan with
Palladium, and XiangShan with FPGA. Each row shows the benefit brought by progressively applying Batch, NonBlock, and Squash.

Table 5: Optimization breakdown across DUTs and platforms.

Setup       NutShell on Palladium   XiangShan on Palladium   XiangShan on FPGA
Baseline    14 KHz                  6 KHz                    0.1 MHz
+Batch      102 KHz (7×)            24 KHz (4×)              1.3 MHz (13×)
+NonBlock   389 KHz (28×)           71 KHz (12×)             2.2 MHz (22×)
+Squash     1030 KHz (74×)          478 KHz (80×)            7.8 MHz (78×)

Batch significantly improves performance by reducing communication frequency through tight packing of structurally diverse events, achieving up to 4×–13× speedup over the baseline. NonBlock further accelerates co-simulation by masking software processing latency with hardware-software parallelism, and provides an additional 2×–4× speedup over Batch. Squash greatly reduces data volume by fusing events with a decoupled checking order, contributing the final boost to a total of 74× speedup on NutShell, 80× on XiangShan (Palladium), and 78× on XiangShan (FPGA).

Overall, these optimizations reduce co-simulation time to about 1%–2% of the unoptimized baseline, cutting communication overhead by 99.8% on Palladium and 98.8% on FPGA. This demonstrates that DiffTest-H effectively eliminates the primary performance bottleneck in hardware-accelerated co-simulation, achieving both high speed and minimal extra overhead beyond DUT emulation.

6.4 Resource Analysis

We evaluate the additional resource usage introduced by the complete DiffTest-H framework across different configurations of XiangShan. In our setup, DiffTest-H monitors XiangShan by inserting 128 probes within each core, covering 32 types of verification states. The basic resource usage for both DiffTest-H and the DUTs is summarized in Figure 15, with area results estimated using Cadence Palladium and quantified in million gates.

As shown in Figure 15, DiffTest-H incurs approximately a 6% area overhead without Batch across different DUT configurations. In this setup, DiffTest-H can operate on platforms with software-like communication support, such as Cadence Palladium, achieving accelerated co-simulation with minimal additional area cost. When Batch is enabled, the area overhead increases to an average of 25%. This configuration introduces a unified hardware-software communication interface, significantly simplifying the migration to platforms lacking software-like communication support.

Figure 15: Resource usages. (Series: DUT, DUT+DiffTest-H (No Batch), DUT+DiffTest-H; DUTs: NutShell, XiangShan (Minimal), XiangShan (Default), XiangShan (Default, 2C).)

6.5 Finding Bugs

To demonstrate the effectiveness of DiffTest-H in verification with millions of test cycles and a wide range of verification states, we deployed DiffTest-H on XiangShan, an open-source 6-wide out-of-order dual-core processor within the DiffTest framework. DiffTest-H supports 32 types of verification state, including instructions, cache coherence, TLB, vectorization, and virtualization.

DiffTest-H was extensively used during XiangShan’s development to run real-world benchmarks such as SPEC06 for error detection. These workloads trigger complex microarchitectural corner cases between pipeline stages, memory systems, and exception logic, with many bugs only manifesting after millions or billions of cycles. Compared to baseline DiffTest, DiffTest-H achieved significantly shorter runtime to detect the same errors at similar cycle counts, demonstrating higher co-simulation efficiency without sacrificing debuggability.

Over the past six months, DiffTest-H helped XiangShan uncover over 151 complex bugs. All 151 complex bugs were confirmed and fixed by the XiangShan development team, involving a total of 780 lines of code modifications across 19 pull requests. These bugs span three categories: (1) exception and interrupt handling errors, such as incorrect virtual address generation, misaligned load/store wakeup, and improper interrupt responses; (2) memory hierarchy and coherence issues, including TLB deadlocks during guest page faults, StoreQueue condition mismatches, and cache inconsistencies under specific faults; (3) vector and control logic errors, such as wrong vstart updates, incorrect vs.dirty settings, and faulty vector exception tracking.

Table 6 summarizes 19 pull requests categorized by bug type, while Figure 14 presents the time savings achieved by DiffTest-H compared to Verilator in detecting these bugs.

Table 6: Summary of pull requests fixing bugs detected by DiffTest-H in XiangShan.

Bug Category                              Pull Requests
Exception and interrupt handling errors   #3639, #4239, #4263, #3991, #3778, #4157
Memory hierarchy and coherence issues     #3964, #3685, #3621, #4037, #3719, #4442
Vector and control logic errors           #3876, #3965, #3690, #3643, #3646, #3664, #4361

6.6 Comparison with Prior Work

DiffTest-H supports deployment on both emulator and FPGA. As illustrated in Table 7, IBI-check [8] and SBS-check [19] represent state-of-the-art emulator-based solutions, achieving low communication overhead (∼2%) and moderate area overhead (∼20%). However, their verification states are limited to basic events such as instruction commits and register updates, still incapable of detecting more complex architectural behaviors such as non-determinism discussed in Section 4.3. In contrast, DiffTest-H expands the verification states to 32 architectural behaviors while reducing the communication overhead to just 0.4% with similar area overhead.
Table 7: Comparison of hardware-accelerated co-simulation frameworks.

Work               Platform                      Verification      Communication   Area       DUT-only   Co-sim
                                                 States/Bytes †    Overhead        Overhead   Speed      Speed
IBI-check [8]      IBM AWAN [13]                 2 / 7             20 %            20 %       100 KHz    80 KHz
SBS-check [19]     Gem5 [5] (for estimation ‡)   2 / 7             2 % ‡           22 % ‡     100 KHz ‡  98 KHz ‡
DiffTest-H         Cadence Palladium [7]         32 / 1200         0.4 %           26 %       480 KHz    478 KHz
Fromajo [56, 57]   FireSim [22]                  7 / 24            99 %            Unknown    100 MHz    1 MHz
DiffTest-H         Xilinx VU19P                  32 / 1200         84 %            24 %       50 MHz     7.8 MHz

† The number of verification state types and the average byte size of verification states per retired instruction before optimization.
‡ Speed and overhead of SBS-check is estimated using Gem5, with IBI-check serving as the baseline.
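The communication overhead column of Table 7 is not defined explicitly. The short Python check below assumes, as our own reading, that it equals one minus the ratio of co-simulation speed to DUT-only speed, and compares the derived values with the reported ones. The helper name `comm_overhead_pct` and the row labels are illustrative.

```python
# (DUT-only speed in Hz, co-simulation speed in Hz, reported overhead in %), from Table 7.
ROWS = {
    "IBI-check": (100e3, 80e3, 20.0),
    "SBS-check": (100e3, 98e3, 2.0),
    "DiffTest-H (Palladium)": (480e3, 478e3, 0.4),
    "Fromajo": (100e6, 1e6, 99.0),
    "DiffTest-H (FPGA)": (50e6, 7.8e6, 84.0),
}


def comm_overhead_pct(dut_only_hz: float, cosim_hz: float) -> float:
    """Assumed definition: share of time not spent on DUT simulation itself."""
    return 100.0 * (1.0 - cosim_hz / dut_only_hz)


derived = {name: comm_overhead_pct(dut, cosim) for name, (dut, cosim, _) in ROWS.items()}
for name, (_, _, reported) in ROWS.items():
    print(f"{name:24s} derived {derived[name]:5.1f} %   reported {reported} %")
```

Under this assumed definition, every row agrees with the reported overhead to within half a percentage point, which suggests the column is derived from the two speed columns.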
On FPGA-accelerated platforms, Fromajo [56, 57] is the state-of-the-art framework that runs the DUT on FireSim [22] and compares its execution against the reference model Dromajo [21]. It supports 7 types of architectural states and detects a subset of non-deterministic behaviors. In contrast, DiffTest-H, with a more comprehensive set of 32 verification states, achieves a simulation speed of 7.8 MHz, 7.8× faster than Fromajo.

Overall, compared to the state-of-the-art approaches on both emulator and FPGA, DiffTest-H delivers higher simulation speed, expanded verification coverage, and comparable area overhead.

7 Related Work

Improved RTL Simulators. RTL simulators, such as open-source Verilator [46] and commercial VCS, translate RTL circuits written in Verilog into dataflow graphs, where nodes represent combinational logic and edges represent data values. Recent optimizations for CPU-based RTL simulation mainly focus on the sequential latency of the dataflow graph, including ESSENT [4], RepCut [48], and Khronos [58]. Some other efforts in accelerating RTL simulation fully leverage the task parallelism in the dataflow graph, such as RTLFlow [27] and SAGA [47] running on GPUs, as well as Manticore [16], ASH [15], and Nexus [6] accelerated on the FPGA. Despite these advances, dataflow-based RTL simulation executes instructions rather than circuit logic, requiring multiple host cycles for one design cycle and thus limiting speed to orders of magnitude below FPGA prototyping.

Hardware-Accelerated Co-Simulation. Differing from RTL simulators, hardware emulators synthesize RTL circuits into gates in specialized ASICs or FPGAs, reaching a speed of MHz over industrial-scale designs. Traditional emulators include Cadence Palladium, Synopsys Zebu, Siemens Veloce, and Xilinx FPGA. Considering the ease of maintenance and running speed, existing co-simulation mainly adopts software-implemented ISA reference models such as Spike [20] and NEMU [33]. To adopt hardware emulators for accelerating co-simulation, it is necessary to consider the cross-platform communication overhead, which consumes over 98% of co-simulation time. According to the communication direction, existing works can be categorized into two groups:

Software-to-hardware Communication. ENCORE [40] runs the DUT on the emulator and the REF on the host server independently, and transmits software data to the emulator for comparison. However, as mentioned in Section 2.2, the reference model relies on the data from design for state alignment under external interrupts. Running software independently will lead to divergence in the execution path of the reference model.

Hardware-to-software Communication. Due to the large amount of hardware verification data, communication overhead accounts for more than 98% of overall co-simulation time. Recent approaches, including IBI-check [8] and ArChiVED [19], employ static data packaging and checksum-based compression to optimize communication. However, these works neglect non-deterministic behaviors in co-simulation, which is critical for aligning the reference model state with the DUT under external interrupts and stimulus. While DESSERT [24], ZP Cosim [31], and Fromajo [57] identify several key sources of non-determinism and make some optimizations toward communication, they are inefficient for handling the large-scale, diverse verification events typical of industrial designs.

8 Conclusion

We propose DiffTest-H, a semantic-aware, hardware-accelerated co-simulation framework. It enhances verification efficiency while maintaining verification completeness and instruction-level debuggability by three semantic-aware communication optimizations. Batch minimizes communication frequency by tightly packing structurally diverse verification events into a single transfer. Squash reduces data transmission volume by fusing verification events with a decoupled checking order. Replay preserves instruction-level debuggability by reprocessing the original, unfused verification events around the failure point. DiffTest-H is deployed on both emulator and FPGA to verify XiangShan [35, 50, 51, 54, 55], a 6-wide out-of-order RISC-V processor. DiffTest-H achieves a 478 KHz and 7.8 MHz simulation speed respectively, 80× and 78× faster than the baseline, and uncovers 151 bugs in XiangShan. We have open-sourced DiffTest-H to the community, promoting verification efficiency for broader chip designs.

Acknowledgments

This work is co-authored by Shoulin Zhang, Ziqing Zhang, and Kan Shi, with valuable support for the FPGA-based experiments. The authors would like to thank the anonymous reviewers for their valuable feedback and comments. This work is supported in part by the National Natural Science Foundation of China (Grant No. 62090022, 62090023) and the Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDA0320000, XDA0320300).
References

[1] Albert Alexandrov, Mihai F Ionescu, Klaus E Schauser, and Chris Scheiman. 1995. LogGP: Incorporating long messages into the LogP model - one step closer towards a realistic model for parallel computation. In Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures. 95–105.
[2] Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: constructing hardware in a scala embedded language. In Proceedings of the 49th annual design automation conference. 1216–1225.
[3] Scott Beamer. 2020. A case for accelerating software RTL simulation. IEEE Micro 40, 4 (2020), 112–119.
[4] Scott Beamer, Thomas Nijssen, Krishna Pandian, and Kyle Zhang. 2021. ESSENT: A high-performance RTL simulator. In Workshop on Open-Source EDA Technology (WOSET), at International Conference on Computer-Aided Design (ICCAD).
[5] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A Wood. 2011. The gem5 simulator. ACM SIGARCH computer architecture news 39, 2 (2011), 1–7.
[6] Peter Birch. 2022. Open source FPGA-based emulation with nexus. In Workshop on Open-Source EDA Technology (WOSET), Vol. 1.
[7] Cadence. n.d. Palladium. https://www.cadence.com/en_US/home/tools/system-design-and-verification/emulation-and-prototyping/palladium.html
[8] Debapriya Chatterjee, Anatoly Koyfman, Ronny Morad, Avi Ziv, and Valeria Bertacco. 2012. Checking architectural outputs instruction-by-instruction on acceleration platforms. In Proceedings of the 49th Annual Design Automation Conference. 955–961.
[9] Yuxiao Chen, Yisong Chang, Ke Zhang, Mingyu Chen, and Yungang Bao. 2023. REMU: Enabling Cost-Effective Checkpointing and Deterministic Replay in FPGA-based Emulation. In 2023 IEEE 41st International Conference on Computer Design (ICCD). 21–29. doi:10.1109/ICCD58817.2023.00014
[10] Young-kyu Choi, Jason Cong, Zhenman Fang, Yuchen Hao, Glenn Reinman, and Peng Wei. 2016. A quantitative analysis on microarchitectures of modern CPU-FPGA platforms. In Proceedings of the 53rd Annual Design Automation Conference. 1–6.
[11] Ryan A Cooke and Suhaib A Fahmy. 2020. Characterizing latency overheads in the deployment of FPGA accelerators. In 2020 30th International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 347–352.
[12] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten Von Eicken. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming. 1–12.
[13] J. Darringer, E. Davidson, D.J. Hathaway, B. Koenemann, M. Lavin, J.K. Morrell, K. Rahmat, W. Roesner, E. Schanzenbach, G. Tellez, and L. Trevillyan. 2000. EDA in IBM: past, present, and future. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 19, 12 (2000), 1476–1497.
[14] Simon Davidmann and Lee Moore. 2022. Introduction to the 5 Levels of RISC-V Processor Verification. In Design and Verification Conference and Exhibition (DVCon).
[15] Fares Elsabbagh, Shabnam Sheikhha, Victor A Ying, Quan M Nguyen, Joel S Emer, and Daniel Sanchez. 2023. Accelerating rtl simulation with hardware-software co-design. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 153–166.
[16] Mahyar Emami, Sahand Kashani, Keisuke Kamahori, Mohammad Sepehr Pourghannad, Ritik Raj, and James R Larus. 2023. Manticore: Hardware-accelerated RTL simulation with static bulk-synchronous parallelism. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4. 219–237.
[17] Harry D. Foster. 2022. Part 3: The 2022 Wilson Research Group Functional Verification Study. https://blogs.sw.siemens.com/verificationhorizons/2022/10/30/part-3-the-2022-wilson-research-group-functional-verification-study/
[18] Harry D. Foster. 2024. Wilson Research Group IC/ASIC functional verification trend report. https://resources.sw.siemens.com/en-US/white-paper-2024-wilson-research-group-ic-asic-functional-verification-trend-report/
[19] Chang-Hong Hsu, Debapriya Chatterjee, Ronny Morad, Raviv Gal, and Valeria Bertacco. 2014. ArChiVED: architectural checking via event digests for high performance validation. In Proceedings of the Conference on Design, Automation & Test in Europe (Dresden, Germany) (DATE ’14). European Design and Automation Association, Leuven, BEL, Article 317, 6 pages.
[20] RISC-V International. 2025. Spike, a RISC-V ISA Simulator. https://github.com/riscv-software-src/riscv-isa-sim
[21] Nursultan Kabylkas, Tommy Thorn, Shreesha Srinath, Polychronis Xekalakis, and Jose Renau. 2021. Effective Processor Verification with Logic Fuzzer Enhanced Co-simulation. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Greece) (MICRO ’21). Association for Computing Machinery, New York, NY, USA, 667–678. doi:10.1145/3466752.3480092
[22] Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, and Krste Asanovic. 2018. FireSim: FPGA-accelerated cycle-exact scale-out system simulation in the public cloud. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 29–42.
[23] Michael Kantrowitz and Lisa M Noack. 1996. I’m done simulating; now what? Verification coverage analysis and correctness checking of the DEC chip 21164 Alpha microprocessor. In Proceedings of the 33rd Annual Design Automation Conference. 325–330.
[24] Donggyu Kim, Christopher Celio, Sagar Karandikar, David Biancolin, Jonathan Bachrach, and Krste Asanović. 2018. DESSERT: Debugging RTL Effectively with State Snapshotting for Error Replays across Trillions of Cycles. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL). 76–764. doi:10.1109/FPL.2018.00021
[25] Sunwoo Kim, Jooho Wang, Youngho Seo, Sanghun Lee, Yeji Park, Sungkyung Park, and Chester Sungchung Park. 2020. Transaction-level model simulator for communication-limited accelerators. arXiv preprint arXiv:2007.14897 (2020).
[26] Zhiwei Li, Boyan Ding, Haoyang Wu, and Tao Wang. 2017. A Flexible Frame-Oriented Host-FPGA Communication Framework for Software Defined Wireless Network. In 2017 International Conference on Networking and Network Applications (NaNA). IEEE, 118–124.
[27] Dian-Lun Lin, Haoxing Ren, Yanqing Zhang, Brucek Khailany, and Tsung-Wei Huang. 2022. From rtl to cuda: A gpu acceleration flow for rtl simulation with batch stimulus. In Proceedings of the 51st International Conference on Parallel Processing. 1–12.
[28] lowRISC. 2025. Ibex. https://ibex-core.readthedocs.io/en/latest/03_reference/verification.html
[29] S Marconi, E Conti, P Placidi, J Christiansen, and T Hemperek. 2017. IEEE Standard for Universal Verification Methodology Language Reference Manual.
[30] Romina Soledad Molina, Veronica Gil-Costa, María Liz Crespo, and Giovanni Ramponi. 2022. High-level synthesis hardware design for fpga-based accelerators: Models, methodologies, and frameworks. IEEE Access 10 (2022), 90429–90455.
[31] Anoop Mysore Nataraja. 2023. A Research-Fertile Co-Emulation Framework for RISC-V Processor Verification. Master’s thesis. University of Washington.
[32] Rolf Neugebauer, Gianni Antichi, José Fernando Zazo, Yury Audzevich, Sergio López-Buedo, and Andrew W Moore. 2018. Understanding PCIe performance for end host networking. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. 327–341.
[33] OpenXiangShan. 2025. NEMU. https://github.com/OpenXiangShan/NEMU
[34] OpenXiangShan. n.d. DiffTest. https://github.com/OpenXiangShan/difftest
[35] OpenXiangShan. n.d. XiangShan. https://github.com/OpenXiangShan/XiangShan
[36] OSCPU. n.d. NutShell. https://github.com/OSCPU/NutShell
[37] Lakshmanan Ponnambalam. 2017. Efficient SCE-MI Usage to Accelerate TBA Performance. In Design, Verification & Test of Low Power and Secure Systems (DVCon). IEEE, 2–2. https://dvcon-proceedings.org/document/efficient-sce-mi-usage-to-accelerate-tba-performance/ DVCon Proceedings Archive.
[38] Hao Qian and Yangdong Deng. 2011. Accelerating RTL simulation with GPUs. In 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 687–693.
[39] Shisong Qin, Chao Zhang, Kaixiang Chen, and Zheming Li. 2021. iDEV: Exploring and exploiting semantic deviations in ARM instruction processing. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. 580–592.
[40] Kan Shi, Shuoxiang Xu, Yuhan Diao, David Boland, and Yungang Bao. 2023. ENCORE: Efficient Architecture Verification Framework with FPGA Acceleration. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’23). Association for Computing Machinery, New York, NY, USA, 209–219. doi:10.1145/3543622.3573187
[41] Siemens. n.d. Veloce. https://eda.sw.siemens.com/en-US/ic/hav/veloce-cs/
[42] Synopsys. 2025. ImperasDV. https://www.synopsys.com/verification/imperasdv.html
[43] Synopsys. n.d. VCS. https://www.synopsys.com/verification/simulation/vcs.html
[44] Synopsys. n.d. ZeBu. https://www.synopsys.com/verification/emulation-prototyping/emulation/zebu-200.html
[45] Bill Jason Tomas, Yingtao Jiang, and Mei Yang. 2014. Co-Emulation of Scan-Chain Based Designs Utilizing SCE-MI Infrastructure. arXiv preprint arXiv:1409.3276 (2014).
[46] Verilator. n.d. Verilator. https://github.com/verilator/verilator
[47] Sara Vinco, Debapriya Chatterjee, Valeria Bertacco, and Franco Fummi. 2012. SAGA: SystemC acceleration on GPU architectures. In Proceedings of the 49th Annual Design Automation Conference. 115–120.
[48] Haoyuan Wang and Scott Beamer. 2023. Repcut: Superlinear parallel rtl simulation with replication-aided partitioning. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 572–585.
[49] Haoyuan Wang, Thomas Nijssen, and Scott Beamer. 2024. Don’t Repeat Yourself! Coarse-Grained Circuit Deduplication to Accelerate RTL Simulation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4. 79–93.
[50] Kaifan Wang, Jian Chen, Yinan Xu, Zihao Yu, Wei He, Dan Tang, Ninghui Sun, and Yungang Bao. 2025. XiangShan: An Open-Source Project for High-Performance RISC-V Processors Meeting Industrial-Grade Standards. IEEE Micro (2025).
[51] Kaifan Wang, Jian Chen, Yinan Xu, Zihao Yu, Zifei Zhang, Guokai Chen, Xuan Hu, Linjuan Zhang, Xi Chen, Wei He, Dan Tang, Ninghui Sun, and Yungang Bao. 2024. XiangShan: An Open-Source Project for High-Performance RISC-V Processors Meeting Industrial-Grade Standards. In 2024 IEEE Hot Chips 36 Symposium (HCS). 1–25. doi:10.1109/HCS61935.2024.10665293
[52] Warren Weaver. 1953. Recent contributions to the mathematical theory of communication. ETC: a review of general semantics (1953), 261–281.
[53] Jinyan Xu, Yiyuan Liu, Sirui He, Haoran Lin, Yajin Zhou, and Cong Wang. 2023. MorFuzz: Fuzzing processor via runtime instruction morphing enhanced synchronizable co-simulation. In 32nd USENIX Security Symposium (USENIX Security 23). 1307–1324.
[54] Yinan Xu, Zihao Yu, Dan Tang, Guokai Chen, Lu Chen, Lingrui Gou, Yue Jin, Qianruo Li, Xin Li, Zuojun Li, Jiawei Lin, Tong Liu, Zhigang Liu, Jiazhan Tan, Huaqiang Wang, Huizhe Wang, Kaifan Wang, Chuanqi Zhang, Fawang Zhang, Linjuan Zhang, Zifei Zhang, Yangyang Zhao, Yaoyang Zhou, Yike Zhou, Jiangrui Zou, Ye Cai, Dandan Huan, Zusong Li, Jiye Zhao, Zihao Chen, Wei He, Qiyuan Quan, Xingwu Liu, Sa Wang, Kan Shi, Ninghui Sun, and Yungang Bao. 2022. Towards Developing High Performance RISC-V Processors Using Agile Methodology. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). 1178–1199. doi:10.1109/MICRO56248.2022.00080
[55] Yi-Nan Xu, Zi-Hao Yu, Kai-Fan Wang, Hua-Qiang Wang, Jia-Wei Lin, Yue Jin, Lin-Juan Zhang, Zi-Fei Zhang, Dan Tang, Sa Wang, Kan Shi, Ning-Hui Sun, and Yun-Gang Bao. 2023. Functional Verification for Agile Processor Development: A Case for Workflow Integration. Journal of Computer Science and Technology 38, 4 (2023), 737–753.
[56] Jiahan Zhang, Varun Koyyalagunta, Joe Rahmeh, and Divyang Agrawal. 2023. Integrating a High-Performance Instruction Set Simulator with FireSim to Co-simulate Operating System Boots. In First FireSim and Chipyard User/Developer Workshop at ASPLOS 2023 (ASPLOS ’23 Workshops). https://fires.im/workshop-2023-pdf/04_integ_isa_sim_FireSim_Zhang.pdf
[57] Jerry Zhao, Ben Korpan, Abraham Gonzalez, and Krste Asanovic. 2020. Sonicboom: The 3rd generation berkeley out-of-order machine. In Fourth Workshop on Computer Architecture Research with RISC-V, Vol. 5. International Symposium on Computer Architecture Valencia, Spain, 1–7.
[58] Kexing Zhou, Yun Liang, Yibo Lin, Runsheng Wang, and Ru Huang. 2023. Khronos: Fusing memory access for improved hardware RTL simulation. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 180–193.

A Artifact Appendix

A.1 Abstract

DiffTest-H is an open-source, hardware-accelerated co-simulation framework for processor verification. It deploys the design under test (DUT) on Palladium or FPGA, while comparing its instruction-level architectural state with a golden reference model (REF) on the host server. The artifact includes all code and workflow of DiffTest-H to demonstrate FPGA/Palladium-based simulation speed.

A.2 Artifact check-list (meta-information)

• Hardware: x86-64 Ubuntu servers, Xilinx VU19P FPGA, Cadence Palladium Z1
• Metrics: Simulation Speed.
• Output: Performance report.
• Experiments: (1) FPGA-based simulation speed evaluation with XiangShan. (2) Palladium-based Optimization breakdown with XiangShan/NutShell.
• How much disk space required (approximately)?: About 128 GB.
• How much time is needed to prepare workflow (approximately)?: About 18 hours. (Minimal if using the pre-built bitstream.)
• How much time is needed to complete experiments (approximately)?: Less than 1 hour.
• Publicly available?: Yes. GitHub link: https://github.com/OpenXiangShan/xs-env/tree/micro2025-ae
• Code licenses (if publicly available)?: Mulan Permissive Software License, Version 2
• Archived (provide DOI)?: Yes. DOI link: https://doi.org/10.5281/zenodo.16637351

A.3 Description

A.3.1 How to access. DiffTest-H is open-sourced on GitHub and archived on Zenodo. For reference, we provide runtime logs and performance reports on both platforms. To reduce setup time for FPGA-based experiments, we also include pre-built bitstreams. Please refer to README.md for more details.

A.3.2 Hardware dependencies. Xilinx VU19P FPGA (for FPGA-based simulation), Cadence Palladium (for Palladium-based simulation), x86-64 server with 128GB RAM (host in simulation).

A.3.3 Software dependencies. Vivado 2020.2 (FPGA synthesis and implementation), Mill 0.11 (RTL generation from Chisel).

A.3.4 Data sets. Linux Boot, Microbench.

A.4 Installation

    ## Get latest artifacts from GitHub.
    $ git clone -b micro2025-ae https://github.com/OpenXiangShan/xs-env.git
    ## Install required software dependencies
    $ sudo -s ./setup-tools.sh
    ## Init environment with submodule mechanism.
    $ make init

A.5 Experiment workflow

For the most up-to-date and detailed instructions, please refer to README.md. Below is a brief workflow of the experiments.

A.5.1 FPGA-based Co-simulation Speed with XiangShan. The experiment demonstrates DiffTest-H’s co-simulation speed on the Xilinx VU19P FPGA, as reported in Figure 13 and Table 7. We recommend that users follow the Step 0 Quick Start, which directly leverages our pre-built bitstream, host, and workloads for reliable results in minutes.

Recommended (Step 0): Quick start with pre-built artifacts.

    make write_bitstream
    make write_ddr
    make fpga-run

Full Rebuild (Steps 1-5): From Chisel RTL generation to FPGA execution (∼18 hours).

• Step 1: Generate RTL from Chisel.

    make fpga-rtl DUT=XiangShan
• Step 2: Build Host Executable Binary.

    make fpga-host DUT=XiangShan

• Step 3: Generate Bitstream via Vivado.

    make vivado ## Setup Vivado Project
    make bitstream ## Synthesis, Implementation and Bitstream

• Step 4: Write bitstream and workload to FPGA. (Please check README.md for more details, especially FPGA reset.)

    # Step 4.1: Write bitstream to FPGA
    make write_bitstream FPGA_BIT_HOME=...
    # Step 4.2: Write workload to DDR via tcl
    make write_ddr WORKLOAD=microbench

• Step 5: Run XiangShan Co-simulation

    make fpga-run Host=... WORKLOAD=microbench

A.5.2 Palladium-based Optimization Breakdown with XiangShan/NutShell. The experiment demonstrates the incremental impact of each optimization, as shown in Table 5.

Step 1: Generate RTL from Chisel.

    ## DIFF_CONFIG options:
    # Z for Baseline,
    # EBI for Batch,
    # EBIN for Batch+NonBlock
    # EBINSD for Batch+NonBlock+Squash
    ## DUT options: XiangShan or NutShell
    make sim-rtl DUT=XiangShan DIFF_CONFIG=EBINSD

Step 2: Compile for Palladium.

    ## Build on Palladium, requiring XCELIUM, IXCOM, VXE...
    make pldm-build DUT=XiangShan

Step 3: Run XiangShan/NutShell Co-simulation

    ## WORKLOAD options: linux or microbench
    make pldm-run DUT=XiangShan WORKLOAD=linux

A.6 Evaluation and expected results

Please check the reference/ folder for detailed logs. Below are some critical results of the experiments.

(1) Result of A.5.1: FPGA-based co-simulation speed with XiangShan as shown in Fig. 13 and Table 7, detailed in reference/perf-log.

    Core 0: HIT GOOD TRAP at pc = ...
    Simulation speed: 7780.71 KHz

(2) Result of A.5.2: Palladium-based simulation speed with different optimizations, as shown in Table 5, detailed in reference/perf-log.

    ## Speed of XiangShan-PLDM
    Simulation speed: 6.49 KHz # Baseline
    Simulation speed: 23.84 KHz # Batch
    Simulation speed: 71.22 KHz # Batch+NonBlock
    Simulation speed: 478.12 KHz # Batch+NonBlock+Squash
    ## Speed of NutShell-PLDM
    Simulation speed: 13.67 KHz # Baseline
    Simulation speed: 101.65 KHz # Batch
    Simulation speed: 389.09 KHz # Batch+NonBlock
    Simulation speed: 1030.93 KHz # Batch+NonBlock+Squash
    ## Speed of XiangShan-FPGA
    Simulation speed: 1278.07 KHz # Batch
    Simulation speed: 2198.00 KHz # Batch+NonBlock
    Simulation speed: 7780.71 KHz # Batch+NonBlock+Squash

A.7 Notes

DiffTest-H is developed in the open-source community, and its code and documentation will continue to be updated. Any feedback and issues are welcome via GitHub or the authors’ emails. The usage and reference results of both Palladium and FPGA are included in README.md and the reference/ folder. We are delighted to assist users in reproducing the experiment results.
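When comparing reproduced numbers with the reference speeds in A.6, it can help to convert them into speedups over the baseline. The following is a minimal sketch and is not part of the artifact: it assumes log lines of the form `Simulation speed: <value> KHz # <label>`, mirroring the annotated reference listing in A.6 (real logs may not carry the trailing label), and the helper names `parse_speeds` and `speedups` are hypothetical.

```python
import re

# Matches lines such as "Simulation speed: 23.84 KHz # Batch".
# The trailing "# <label>" follows the annotated listing in A.6; this is an
# assumption about the format, and real logs may omit the label.
SPEED_RE = re.compile(r"Simulation speed:\s*([0-9.]+)\s*KHz\s*#\s*(\S+)")


def parse_speeds(log_text):
    """Return {label: speed_in_khz} for every annotated speed line."""
    return {m.group(2): float(m.group(1)) for m in SPEED_RE.finditer(log_text)}


def speedups(speeds, baseline="Baseline"):
    """Return each configuration's speed divided by the baseline speed."""
    base = speeds[baseline]
    return {label: khz / base for label, khz in speeds.items()}


# XiangShan-PLDM reference values copied from A.6.
XIANGSHAN_PLDM = """\
Simulation speed: 6.49 KHz # Baseline
Simulation speed: 23.84 KHz # Batch
Simulation speed: 71.22 KHz # Batch+NonBlock
Simulation speed: 478.12 KHz # Batch+NonBlock+Squash
"""

if __name__ == "__main__":
    # Print one line per configuration with its speedup over the baseline.
    for label, ratio in speedups(parse_speeds(XIANGSHAN_PLDM)).items():
        print(f"{label:24s} {ratio:7.2f}x")
```

The same two helpers can be applied to the NutShell-PLDM block; the XiangShan-FPGA block lists no baseline entry in A.6, so a baseline value would have to be supplied separately.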