Source a172fdce... — STIMSMITH

SOURCE ARCHIVE

SHA256: a172fdce1267f49c34a18fcf4704410729d5ee35248a9afefe5026886b7b7794

URL: https://talks-pubs.xiangshan.cc/publications/micro2025-DiffTestH.pdf

TYPE: application/pdf

SIZE: 1992.2 KB

FETCHED: 6/6/2026, 10:24:06 PM

EXTRACTOR: liteparse

CHARS: 115,458

EXTRACTED CONTENT

DiffTest-H: Toward Semantic-Aware Communication in Hardware-Accelerated Processor Verification

Kunlin You
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
youkunlin24s@ict.ac.cn

Yinan Xu
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
xuyinan@ict.ac.cn

Kehan Feng
Beijing Institute of Open Source Chip, Beijing, China
fengkehan@bosc.ac.cn

Luoshan Cai
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
cailuoshan22z@ict.ac.cn

Yaoyang Zhou
Beijing Institute of Open Source Chip, Beijing, China
zhouyaoyang@bosc.ac.cn

Yungang Bao
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
baoyg@ict.ac.cn

Abstract
Verification has become the most time-consuming phase in chip development. Co-simulation frameworks simulate the design under test (DUT) with a golden reference model (REF) and compare their instruction-level results for verification, causing over 98% communication overhead: although hardware-accelerated platforms, such as FPGA and emulators, speed up DUT simulation by 300×–10000×, overall co-simulation speedup is still limited to 2.5×–20×.

In this paper, we propose DiffTest-H, a semantic-aware, hardware-accelerated co-simulation framework with three techniques reducing communication overhead while preserving debuggability: (1) Batch minimizes communication frequency by tightly packing structurally diverse verification events into a single transfer. (2) Squash reduces data transmission volume by fusing verification events with a decoupled checking order. (3) Replay preserves instruction-level debuggability by reprocessing the original, unfused verification events around the failure point.

DiffTest-H is deployed on both Palladium emulator and FPGA to verify a 6-wide, out-of-order RISC-V processor, XiangShan. It achieves simulation speeds of 478 KHz and 7.8 MHz respectively, with an 80× and 78× speedup over the baseline, 119× and 1945× faster than 16-thread Verilator, and uncovers 151 bugs in XiangShan.

CCS Concepts
• Hardware → Functional verification.

Keywords
Processor Verification, Simulation Acceleration, Co-simulation

ACM Reference Format:
Kunlin You, Yinan Xu, Kehan Feng, Luoshan Cai, Yaoyang Zhou, and Yungang Bao. 2025. DiffTest-H: Toward Semantic-Aware Communication in Hardware-Accelerated Processor Verification. In 58th IEEE/ACM International Symposium on Microarchitecture (MICRO ’25), October 18–22, 2025, Seoul, Republic of Korea. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3725843.3756108

This work is licensed under a Creative Commons Attribution 4.0 International License.
MICRO ’25, Seoul, Republic of Korea
© 2025 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-1573-0/25/10
https://doi.org/10.1145/3725843.3756108

1 Introduction
Verification has become the most time-consuming phase in modern chip development, accounting for over 50% of the overall workflow [17, 18]. The challenge becomes even more significant for industrial-scale processors with complex microarchitectures and instruction set architectures (ISAs), where exhaustive verification is essential for ensuring functional correctness.

Toward more efficient verification, co-simulation frameworks [14, 21, 23, 28, 42, 54] have been widely adopted in processor verification. In co-simulation, the design under test (DUT) and a golden reference model (REF) run in parallel, comparing their architectural states after each instruction. The co-simulation framework extracts verification events from the DUT, such as instruction commit and register updates, and compares them with REF. Additionally, the DUT-specific non-deterministic events (NDEs) [14, 21, 24, 39, 53, 54], such as external interrupts and MMIO access, must be fully synchronized to REF to align its architectural states with the DUT.

However, existing co-simulation frameworks are still inefficient. Traditional software-based solutions [21, 34, 54, 55] rely on RTL simulators [43, 46] to simulate the DUT. Despite extensive research working on its performance [3, 4, 6, 15, 16, 27, 38, 47–49, 58], the simulation speed of large-scale DUTs is only at a few kHz, making it impractical for verification requiring billions of test cycles.

MICRO ’25, October 18–22, 2025, Seoul, Republic of Korea Kunlin You, Yinan Xu, Kehan Feng, Luoshan Cai, Yaoyang Zhou, and Yungang Bao

Table 1: Verification Events in the DiffTest [34, 54].

Category           Types  Representative Examples
Control Flow       5      Exceptions and interrupts, Instruction commits, Traps, ...
Register Updates   9      CSRs, General-purpose registers, Floating-point registers, ...
Memory Access      3      Load/store operations, Atomic memory operations, ...
Memory Hierarchy   6      Cache refill operations, L1/L2 TLB operations, ...
RISC-V Extensions  9      Vector/Hypervisor CSRs, Vector registers, ...

Hardware-accelerated platforms, including emulator [7, 8, 19, 41, 44] and FPGA [22, 24, 31, 40, 56, 57], offer promising simulation speed for better verification efficiency. Our evaluation shows that directly deploying the DUT on the emulator (Cadence Palladium) can yield a 300× speedup over RTL simulation, and a 10,000× speedup on the FPGA (Xilinx VU19P). In contrast, leveraging co-simulations, where the DUT and the REF are deployed on the hardware and software side respectively, the speedup drops to less than 2.5× on the emulator and 20× on the FPGA. The reason is that the hardware-software communication becomes a new bottleneck, with over 98% co-simulation time consumed by communication overhead.

The hardware-software communication, as a point-to-point interaction, can be modeled by the LogGP model [1, 12] and decomposed into three phases [10, 11, 25, 26, 30, 32, 37, 45]: communication startup, data transmission, and software processing. For example, in the co-simulation framework DiffTest [34, 54], which covers 32 verification events as shown in Table 1, each verification event requires a handshake to start up communication, and then transfer the DUT’s architectural state to the REF for comparison, resulting in around 15 communications and 1.2 KB transmitted data per cycle.

Existing works explore optimizations across the three phases of communication.
The frequency of communication startup can be reduced by packing all verification events within a cycle into a single transfer [8, 9, 19]. The data transmission volume can be reduced by fusing same-type events, such as 𝑁 instruction commits into a single 𝑁-commit event [19, 40]. The software processing latency can be hidden through hardware-software parallelism [9, 24, 31, 56, 57]. However, existing works still face the communication bottleneck: the state-of-the-art Fromajo [57] achieves only 1 MHz co-simulation speed on a 100 MHz FPGA. Moreover, fusing verification events across cycles discards per-instruction details, weakening instruction-level debuggability.

To address communication challenges, Shannon and Weaver introduced semantic communication [52], emphasizing that understanding the information semantics improves communication efficiency. In co-simulation, verification events likewise carry three key semantic properties, which can be exploited to optimize communication while preserving debuggability:

(1) Structural Semantics denotes the length and data structure of verification events, which vary significantly across event types and increase the complexity of packing and unpacking. For example, the event lengths in DiffTest [34, 54, 55] differ by up to 170×. Existing packing schemes allocate fixed space for each verification event and pad invalid events with bubbles, resulting in more communications to transmit the same set of valid events. Leveraging structural semantics, we can tightly pack variable-length events with space allocated according to length, and extract packed events with their data structures. Tight packing eliminates bubbles and reduces the required packets with less communication frequency.

(2) Order Semantics denotes the specific checking order of verification events. For example, the NDEs, such as external interrupts, force updates to the REF’s state, requiring prior instructions to be checked while subsequent ones remain unchecked. Existing fusion approaches couple communication with checking order: the NDEs break the fusion of other events, and the already fused ones are transmitted ahead to REF for ordered checking, causing frequent fusion breaks and a limited fusion ratio. Leveraging order semantics, we can decouple communication from checking order: NDEs are transmitted ahead with order tags while other events continue to be fused, and the software reorders events by these tags to restore the required checking order. Order-decoupled fusion reduces fusion breaks and improves fusion ratio with less data transmitted.

(3) Behavioral Semantics denotes the architectural behaviors checked by verification events, which help localize errors to specific microarchitectural components. However, fusing verification events weakens debuggability by discarding per-instruction behavioral details. Existing debugging methods rely on hardware snapshots to rerun the entire DUT for recovering the behavioral details, resulting in considerable resource and time overhead. Leveraging behavioral semantics, we can reprocess only the unfused verification events around the failure point rather than rerun the entire DUT, thereby enabling lightweight instruction-level debugging.

Building on the above three semantic properties, we propose DiffTest-H, a semantic-aware, hardware-accelerated co-simulation framework significantly reducing communication overhead while preserving instruction-level debuggability:

(1) Batch minimizes communication frequency by tightly packing structurally diverse verification events into a single transfer. Leveraging structural semantics, Batch computes the offset length of each valid event on hardware for tight packing, while the software parses packed events according to their data structures.
(2) Squash reduces data transmission volume by fusing verification events with a decoupled checking order. Leveraging order semantics, Squash allows NDEs to be transmitted ahead with order tags, while other events continue to be fused, and the software then reorders events by these tags to restore the required checking order.

(3) Replay preserves instruction-level debuggability by reprocessing the original, unfused verification events around the failure point. Leveraging behavioral semantics, Replay reprocesses only the unfused verification events rather than rerunning the entire DUT, enabling lightweight instruction-level debugging.

DiffTest-H is implemented and evaluated within the DiffTest [34, 54, 55] co-simulation framework to verify XiangShan [35, 50, 51, 54, 55], a 6-wide out-of-order RISC-V processor, covering 32 types of verification events, including instructions, cache coherence, TLB, vectorization, and virtualization. Deployed on both the Cadence Palladium emulator and FPGA, DiffTest-H achieves simulation speeds of 478 KHz and 7.8 MHz respectively, with an 80× and 78× speedup over the baseline DiffTest, 273× and 1945× faster than 16-thread Verilator. DiffTest-H reduces communication overhead by 99.84% on the emulator, and is 7.8× faster than the state-of-the-art [56, 57] on the FPGA. DiffTest-H has uncovered over 151 complex bugs in XiangShan that require up to 2 months to identify with Verilator but are detected within 11 hours by DiffTest-H on Palladium. All of these bugs have been confirmed and fixed by XiangShan developers with more than 780 lines of code change across 19 pull requests.

In summary, we make the following contributions in this paper:
• We identify three stages of hardware-software communication: communication startup, data transmission, software processing, and summarize three corresponding optimizations: packing, fusion, and hardware-software parallelism.
• We propose and open-source DiffTest-H (https://github.com/OpenXiangShan/difftest), a semantic-aware, hardware-accelerated co-simulation framework: Batch minimizes communication frequency by tightly packing verification events. Squash reduces data volume by fusing events with a decoupled checking order. Replay preserves instruction-level debuggability by reprocessing events around failure.
• DiffTest-H, evaluated on XiangShan, an open-source 6-wide out-of-order RISC-V processor, achieves simulation speeds of 478 KHz on the Palladium emulator and 7.8 MHz on the FPGA, with an 80× and 78× speedup over baseline DiffTest, 273× and 1945× faster than 16-thread Verilator, reducing 99.84% communication overhead on the emulator and is 7.8× faster than the state-of-the-art [56, 57] on the FPGA.
• DiffTest-H uncovers over 151 complicated bugs in XiangShan, all of which have been fixed by XiangShan developers with over 780 lines of code change across 19 pull requests.

2 Background
In this section, we present three key aspects of hardware-accelerated processor co-simulation: First, we introduce the fundamental principles and workflow of co-simulation, illustrating how it ensures verification sufficiency and instruction-level debuggability. Second, we demonstrate the general structure of co-simulation framework with an example of the DiffTest framework [34]. Third, we compare verification platforms of co-simulation, highlighting advantages and bottlenecks of hardware-accelerated co-simulation.

2.1 Processor Co-Simulation Verification
Processor co-simulation [21, 28, 42, 54] verifies functional correctness by running the design under test (DUT) in parallel with a software reference model (REF), and comparing their architectural states after each instruction.

As illustrated in Figure 1, a typical co-simulation workflow begins with the DUT and REF in the same initial state, and performs instruction-level comparison for processor verification: at each instruction, the co-simulation framework ① extracts DUT’s verification events, such as instruction commits, ② notifies REF to ③ execute a corresponding instruction, and ④ compares architectural states of the DUT and REF. Once a mismatch is detected, the co-simulation aborts with a detailed bug analysis.

[Figure 1 diagram: the DUT and the REF load the same workload from the same initial state. Co-sim step with deterministic events (DEs, from workload): ① Instr. Commit (IC), ② step(IC), ③ IC (workload), ④ compare. Co-sim step with non-deterministic events (NDEs, from DUT): ① Ext. Interrupts (EI), ② sync(EI), ③ EI (DUT), ④ compare. Mismatch: abort(), abort co-sim.]
Figure 1: Co-simulation verification workflow. Each DUT event notifies the REF for execution and comparison: deterministic events are executed directly by the REF, while non-deterministic events are synchronized from the DUT.

Non-deterministic events (NDEs) [14, 21, 24, 39, 53, 54, 56, 57], such as interrupts and MMIO access, challenge co-simulation. These NDEs are specific to DUT and cannot be reproduced independently by the REF. To accommodate this, co-simulation frameworks fully synchronize these NDEs from DUT to REF at precise instructions to correctly align their architectural states.

With comprehensive checking of diverse verification events at each instruction, co-simulation offers two major advantages:

Verification Sufficiency. Covering a wide range of verification states, co-simulation ensures sufficient verification of DUT under ISA-level behaviors as well as complex non-deterministic scenarios.

Instruction-level Debuggability. Conducting comparisons after each instruction, co-simulation halts upon detecting any mismatch with a precise failure context, including mismatched verification events and cycle information for debugging.

2.2 Layout of Co-Simulation Framework
A typical processor co-simulation framework consists of three major components [21, 29, 42, 54, 55]: the monitor, the checker, and the communication unit, distributed across hardware and software to verify the correctness of the DUT.

On the hardware side, monitors are embedded into the processor to capture verification events such as instruction commits, register updates, and memory operations. Since these events are distributed across the DUT’s microarchitecture, modern co-simulation frameworks [21, 54, 55] often implement monitors in high-level hardware description languages (HDLs) such as Chisel [2], enabling automated code generation to relieve manual wiring effort. The captured events are then formatted into structured data packets, which can be parsed by the software according to their data structure.

On the software side, the ISA checker operates alongside a REF, typically an Instruction Set Simulator (ISS) such as Spike [20] and NEMU [33]. The REF starts from the same initial state as the DUT, executes instructions accordingly, and is synchronized with non-deterministic events. The ISA checker uses the verification events monitored from the DUT to drive the REF’s execution and performs comparisons after each instruction, as shown in Section 2.1.

Between the monitor and the checker, verification events are transmitted across the hardware-software interface, such as DPI-C.


Given the diversity of verification events, modern co-simulation frameworks [21, 34, 54, 55] typically use individual DPI-C functions for each event, resulting in frequent communication calls and large data transfers. For example, in the co-simulation framework DiffTest [34, 54, 55] covering 32 types of verification events, each event is transmitted through a separate DPI-C interface with an aggregated size of 11,496 bytes, leading to substantial communication overhead and challenging communication-sensitive simulations.

2.3 Hardware-accelerated Co-simulation
Co-simulation, as discussed earlier, consists of two main components: the REF and the DUT. In general, the REF is implemented in software for flexibility and ease of maintenance, while the DUT is deployed through three distinct approaches: RTL simulation, hardware emulation, and FPGA prototyping. Each option presents a unique trade-off between simulation speed, debuggability, and deployment cost, as summarized in Table 2.

Table 2: Comparison of Co-Simulation Platform.

Platform       Debuggability    Cost        Optimal Speed
RTL Simulator  Full visibility  Free        ∼ 3 KHz
Emulator       Waveform         Expensive   ∼ 500 KHz
FPGA           Limited          Affordable  ∼ 50 MHz

Depending on the DUT deployment platform, co-simulation can be categorized into two classes:

Software-based Co-simulation deploys both the REF and the DUT within a software environment, typically using RTL simulators such as Verilator [46] or Synopsys VCS [43]. In this setup, the DUT is translated into a high-level programming representation (e.g., C++), forming a directed graph where each node simulates a hardware signal. This method offers full design visibility and facilitates fine-grained debugging. However, the simulation speed is severely limited, typically reaching only a few KHz. As the complexity of the DUT increases, the performance of RTL simulators worsens, making software-based co-simulation impractical for verification requiring billions of test cycles.

Hardware-accelerated Co-simulation addresses the speed limitation by deploying the DUT onto hardware acceleration platforms, such as emulators (e.g., Cadence Palladium [7], Synopsys ZeBu [44], Siemens Veloce [41]) or FPGAs. Unlike software simulation, these platforms directly map the DUT onto physical hardware components, faithfully reproducing its behavior at much higher speeds, often achieving orders-of-magnitude improvements. The REF remains in software, preserving its flexibility.

However, this deployment across hardware and software introduces a new major bottleneck: communication overhead. Since the DUT and the REF are located on different physical platforms, verification states must be frequently transmitted across the hardware-software interface. As the amount of communication (both in terms of communication frequency and data volume) increases, the overall simulation speed becomes limited by the efficiency of the communication interface. For example, in the DiffTest framework applied to XiangShan [35], communication overhead accounts for over 98% of the total simulation time when running on Palladium emulator at 500 KHz. Similarly, the Fromajo co-simulation framework [56, 57] on a 100 MHz FPGA also experiences communication overhead exceeding 99%. These findings highlight that, while hardware acceleration significantly improves the theoretical speed of DUT emulation, the overall speed of co-simulation is fundamentally constrained by communication efficiency.

3 Analytical Overhead Model
To clarify and quantify the contributors to communication overhead in hardware-accelerated co-simulation, we introduce an analytical overhead model inspired by the LogGP model [1], which decomposes the overhead into three stages of hardware-software interaction: communication startup, data transmission, and software processing. These stages manifest differently across simulation platforms and DUT designs. To ground the model, we provide a quantitative case study across different DUTs and platforms, and identify three key optimization guidelines for communication.

3.1 Theoretical Analysis
The communication overhead in hardware-accelerated co-simulation can be modeled as the LogGP model [1, 12], where the overall latency between the FPGA/emulator and the software can be decomposed into three stages [10, 11, 25, 26, 30, 32, 37, 45]:

Communication Startup. This stage involves handshake and synchronization for each communication invocation, necessary to establish a data connection between the asynchronously running hardware and software. For example, the Cadence Palladium emulator performs hardware-software synchronization at every DPI-C function call [25], while FPGA platforms rely on valid-ready handshakes as dictated by protocols like XDMA [26]. The startup overhead is primarily determined by the communication frequency ($N_{\text{invokes}}$) and the per-invocation latency ($T_{\text{sync}}$).

Data Transmission. After the connection is established, data is transmitted over the hardware-software link in fixed-length protocol frames, and each frame incurs transmission and propagation delay. The transmission overhead scales with the total data volume ($N_{\text{bytes}}$) and the available bandwidth ($BW$).

Software processing. On the host side, software must receive data from buffers, drive the REF to execute the same instructions as the DUT (synchronize non-deterministic behavior such as external interrupts, if any), and compare their states for verifying correctness. In traditional step-and-compare strategies [34, 54, 55], hardware emulation pauses its clock until software processing completes. This part of latency is abstracted as $T_{\text{software}}$.

The overall communication overhead can be expressed as:

\[ \mathit{Overhead} = N_{\text{invokes}} \times T_{\text{sync}} + N_{\text{bytes}} \times \frac{1}{BW} + T_{\text{software}} \tag{1} \]

3.2 Quantitative Analysis
The analytical model presented in Equation 1 provides a general framework for quantifying communication overhead on hardware-accelerated platforms with three key phases: communication startup, data transmission, and software processing. However, in practice, the relative contributions of the three phases vary significantly depending on both the verification coverage required by the DUT and the characteristics of the validation platform.
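As a rough illustration of how the three terms of Equation 1 trade off, the following Python sketch evaluates the model for one cycle. The per-cycle figures (about 15 communications and 1.2 KB, taken here as 1200 bytes) come from the introduction; the link parameters (1 µs per handshake, 1 GB/s) and the names `Link` and `overhead_s` are assumptions chosen only for illustration, not measured values or part of DiffTest-H.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical link parameters (illustrative, not measured)."""
    t_sync_s: float        # per-invocation startup latency T_sync, in seconds
    bandwidth_bps: float   # bandwidth BW, in bytes per second


def overhead_s(n_invokes: int, n_bytes: int, link: Link, t_software_s: float) -> float:
    """Equation 1: N_invokes * T_sync + N_bytes / BW + T_software."""
    startup = n_invokes * link.t_sync_s
    transmission = n_bytes / link.bandwidth_bps
    return startup + transmission + t_software_s


# Assumed link: 1 microsecond per handshake, 1 GB/s bandwidth.
link = Link(t_sync_s=1e-6, bandwidth_bps=1e9)

# Baseline per-cycle load quoted in the introduction:
# about 15 communications and 1.2 KB of data (software time ignored here).
baseline = overhead_s(n_invokes=15, n_bytes=1200, link=link, t_software_s=0.0)

# Packing the same bytes into a single transfer changes only N_invokes.
packed = overhead_s(n_invokes=1, n_bytes=1200, link=link, t_software_s=0.0)

print(f"baseline: {baseline * 1e6:.2f} us per cycle")
print(f"packed:   {packed * 1e6:.2f} us per cycle")
```

Under these assumed parameters the startup term dominates, so reducing the number of invocations matters more than reducing bytes; with a slower link or larger events the balance would shift toward the transmission term.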


[Figure 2 pie charts, one per panel: NutShell + Palladium, XiangShan + Palladium, XiangShan + FPGA. Legend: DUT Emulation, Communication Startup, Data Transmission, Software Processing.]
Figure 2: Overhead breakdown across DUTs and platforms.

To demonstrate the model’s generality, we conduct evaluations based on NutShell [36], a scalar in-order processor, and XiangShan [35, 50, 51, 54], a 6-wide out-of-order processor, across both Palladium emulation and FPGA platforms. As shown in Figure 2, XiangShan incurs higher data transmission and software processing overhead than NutShell on the same Palladium platform, primarily due to its expanded verification events resulting in larger data volume and more complex checking. When comparing XiangShan across platforms, the FPGA setup shows higher communication startup but lower data transmission overhead, which results from the FPGA’s PCIe interface exhibiting higher handshake latency yet greater bandwidth compared to Palladium’s internal link.

3.3 Guiding Communication Optimizations
Based on Equation 1, the total overhead of software–hardware co-simulation can be decomposed into three phases: communication startup, data transmission, and software processing, which can be optimized through the following optimizations:

The frequency of communication startups can be reduced by packing multiple verification events into a single transfer [8, 9, 19]. For example, packing 256 16B events into a single 4KB transfer reduces startup cost by 256×.

The data transmission volume can be reduced by fusing same-type events [8, 19, 40, 57], such as 𝑁 instruction commits into one 𝑁-commit event with 𝑁× reduction in data volume.

The software processing latency can be hidden through hardware–software parallelism [9, 24, 31, 56, 57], also known as non-blocking support. Such support is widely available: emulators like Palladium provide primitives such as GFIFO, while FPGAs can emulate non-blocking transmission using multi-buffer FIFOs.

4 Semantic-aware Communication Mechanism
To reduce communication overhead while preserving debuggability, we propose DiffTest-H, a semantic-aware, hardware-accelerated co-simulation framework. This section introduces the overall framework of DiffTest-H and three optimization strategies.

4.1 Overview
DiffTest-H is a semantic-aware, hardware-accelerated co-simulation framework, covering 32 types of verification events and preserving instruction-level debuggability. Figure 3 shows the DiffTest-H framework under a dual-core design. The framework comprises five components: monitor, acceleration unit, communication unit, debugging unit, and ISA checker.

[Figure 3 diagram labels: DUT (Core 0, Core 1, Global Memory), Monitor, Acceleration (Squash, Batch), Debugging (Replay, Buffer, Rev. Log (Delta)), ISA Checker, REF 0 and REF 1 (ISS), Communication (Send Queue, Recv Queue), Hardware, Software.]
Figure 3: The DiffTest-H Framework.

From Figure 3, the monitor captures verification events from the DUT. The events are then optimized by the acceleration unit and buffered for potential debugging.

The acceleration unit applies two key optimizations: Squash reduces data volume by fusing events (Section 4.3) and Batch minimizes communication frequency by packing events into a single packet (Section 4.2). The packed data is then transmitted in a non-blocking manner for software processing (Section 4.5), allowing the hardware to continue running at the same time.

Upon reception of verification events, the ISA checker runs the REF model accordingly and compares its state against the DUT to verify correctness. Once a mismatch is detected, the debugging flow is triggered: the Replay unit rolls back the fused buggy events and reprocesses the original unfused ones (Section 4.4), providing instruction-level debugging details around the failure point.

4.2 Batch: Packing with Structural Diversity

4.2.1 Why semantics matter?
Minimizing communication frequency relies on effective packing of verification events. However, the structural diversity of events poses significant challenges to both hardware packing and software unpacking. As shown in Figure 4, the 32 types of verification events in the DiffTest co-simulation framework [34, 54, 55] exhibit size differences of up to 170×, along with highly variable transmission frequencies.

As illustrated in Figure 5, existing schemes [8, 9, 19] simplify packing by assigning each verification event a fixed-offset region in the packet. On the hardware side, the packer writes valid events into the assigned region, while on the software side, the parser always reads from the same region and extracts the event according to its data structure. However, this fixed-offset method requires padding for invalid events to preserve offsets for others. Evaluation on DiffTest shows that such padding leads to more than 60% invalid bubbles in the packet, thereby resulting in 1.67× more communications to transmit the same set of valid events.
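To make the bubble problem concrete, the sketch below contrasts fixed-offset packing with packing at computed offsets for a single cycle. It is a minimal software illustration under stated assumptions, not the hardware packer: the slot layout `SLOTS`, the event names and byte lengths, and both function names are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical per-cycle slot layout: (event type, length in bytes).
SLOTS: List[Tuple[str, int]] = [
    ("IC", 16), ("IC", 16), ("RU", 64), ("RU", 64), ("EI", 8),
]


def fixed_offset_pack(events: List[Optional[bytes]]) -> bytes:
    """Every slot keeps its reserved region; invalid slots become zero padding."""
    out = bytearray()
    for (_, length), ev in zip(SLOTS, events):
        out += ev if ev is not None else bytes(length)
    return bytes(out)


def computed_offset_pack(
    events: List[Optional[bytes]],
) -> Tuple[bytes, List[Tuple[str, int]]]:
    """Write only valid events; each offset is the sum of preceding valid lengths."""
    out = bytearray()
    meta: List[Tuple[str, int]] = []  # (type, offset), kept so software can parse
    for (name, length), ev in zip(SLOTS, events):
        if ev is not None:
            assert len(ev) == length, "event must match its declared length"
            meta.append((name, len(out)))
            out += ev
    return bytes(out), meta


# One cycle in which only one instruction commit and one register update are valid.
cycle = [b"\x01" * 16, None, b"\x02" * 64, None, None]
fixed = fixed_offset_pack(cycle)
tight, meta = computed_offset_pack(cycle)
bubble_ratio = 1 - len(tight) / len(fixed)
print(len(fixed), len(tight), meta, f"{bubble_ratio:.0%} bubbles")
```

In this toy cycle the fixed-offset packet carries more padding than payload, while the computed-offset packet holds only valid bytes plus the small amount of metadata the parser needs to locate each event.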


Figure 4: Verification event size and invocations in baseline DiffTest. Event IDs are ordered by increasing size, with transmission frequency measured as invocations per cycle. (Axes: Event Size / Byte and Invoc. per Cycle against Event ID, ordered by size.)

Figure 5: Comparison of packing schemes. Fixed-offset packing inserts bubbles to preserve offsets, while Batch computes offsets as the sum of preceding event lengths. (Panels: Problem: Mixed-type packing; Fixed-offset Packing, with invalid bubbles and a fixed RU offset; Batch: Computed-offset, with the offset taken from preceding lengths.)

Verification events inherently contain structural semantics, namely their length and data structure. By leveraging structural semantics, the hardware packer can dynamically allocate space according to the actual length of each event and tightly pack variable-length events of different types. As shown in Figure 5, the offset of a register update (RU) event can be computed by summing the lengths of prefix events. On the software side, the parser can also use the length information to compute offsets of specific events and reconstruct them according to their data structure. Such tight packing eliminates invalid bubbles in the packet, improving bandwidth utilization and allowing for transmitting the same set of valid events with fewer communications.

4.2.2 How semantics work? To minimize communication frequency, we propose Batch, a structure-wise mechanism that tightly packs events with different structures and supports dynamic unpacking on the software side. Batch is designed to address two main challenges: (1) how to tightly pack verification events with diverse structures and lengths; (2) how to dynamically unpack the packed events and reconstruct their original data structures.

Figure 6: The Batch workflow. Verification events from different cycles are packed in three levels, accompanied by a meta recording its type and structure. The software then unpacks the events based on the meta.

As shown in Figure 6, the Batch workflow consists of two parts: Pack and Unpack. In the Pack stage, Batch applies a multi-level packaging strategy, which tightly packs events of different structures with metadata recording the structural information and numbers. In the Unpack stage, Batch will use the metadata to extract events of specific lengths from the packet and invoke the corresponding parser functions to recover their original structures.

Multi-level Packing Strategy. Batch leverages a 3-level strategy that exploits structural similarity to reduce packing complexity:

(1) Type-Level. At the first level, Batch collects valid events from multiple same-type entries within a cycle and tightly packs them together for subsequent packing. Since events of the same type have identical structures and lengths, Batch employs a muxtree to aggregate them in parallel, where the 𝐾-th packed entry corresponds to the 𝐾-th valid incoming event of the cycle. As illustrated in Figure 7, Batch instantiates a prefix counter for each incoming event to count the number of preceding valid ones in parallel; when the count reaches 𝐾 − 1 and the current event is valid, that event is selected as the 𝐾-th packed entry. In this stage, Batch also generates metadata for each packed event, recording its structure and count to support subsequent hardware packing and software unpacking.

Figure 7: Type-level packaging in Batch. Packing 𝐾 valid entries from 𝑁 incoming semantics, and the 𝐾-th entry comes from the valid one whose prefix valids is exactly 𝐾 − 1. (V = Valid, I = Invalid. Selection rule shown in the figure: for (i = K; i < N; i++) if (SumV(i) == K-1 && V(i)) entry[K] <= item[i].)

(2) Cycle-Level. At the second level, Batch further packs different types of events within the same cycle. For each type of packed event, Batch dynamically allocates a region in the packet, with its offset computed by the length sum of preceding packed events. Since the valid count of packed events varies across cycles, both the offset and the region length need to be computed dynamically according to the actual counts of valid events recorded in the metadata.



(3) Transmission-Level. At the third level, Batch assembles packed data from different cycles into fixed-size transmission packets, with metadata and payloads concatenated separately. Because cycle packets are variable in length, the residual space of a packet may be insufficient to hold a full cycle packet, leaving unused space and wasting bandwidth. To address this, Batch uses metadata to split a cycle packet at event-type boundaries, filling the remaining space of the current packet and placing the rest into the next, thereby reducing required communications with fewer packets.

Figure 8: Comparison of fusion schemes. Order-coupled fusion breaks instruction fusion at each MMIO access (NDEs), while Squash decouples fusion from checking order, transmitting MMIO accesses ahead and then reordering them. (Panels: Problem: Keep-order Fusion; Order-coupled Fusion; Squash: Order Decoupling.)

Dynamic Unpacking with Meta. Packing events with diverse structures and lengths complicates unpacking, as the parser must both locate variable-length events within a mixed packet and reconstruct their original structures. Batch resolves this challenge through a meta-guided dynamic unpacking mechanism. Alongside each packet, Batch generates a meta that records event types, counts, and offsets, and links the meta to its parsing function. Guided by this metadata, the software parser extracts variable-length events and invokes the corresponding reconstruction functions to restore their structures. In this way, Batch enables accurate and efficient unpacking of events despite flexible, tight packing.
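The type-level and cycle-level packing steps, together with meta-guided unpacking, can be illustrated with a small software model. The sketch below is not DiffTest-H's packet format: the event names, the byte lengths in EVENT_LEN, and the meta layout (a list of (type, count) pairs from which offsets are recomputed) are assumptions made for illustration only.

```python
from itertools import accumulate

# Hypothetical per-type event lengths in bytes (illustrative values only).
EVENT_LEN = {"IC": 4, "RU": 8}


def type_level_pack(entries):
    """Compact the valid entries of one event type within a cycle.

    entries: list of (valid, payload) pairs. Follows the prefix-count rule:
    the K-th packed entry is the valid entry with exactly K-1 valid
    predecessors.
    """
    valids = [1 if valid else 0 for valid, _ in entries]
    # prefix[i] = number of valid entries strictly before index i
    prefix = [0] + list(accumulate(valids))[:-1]
    packed = [None] * sum(valids)
    for i, (valid, payload) in enumerate(entries):
        if valid:
            packed[prefix[i]] = payload
    return packed


def cycle_level_pack(events_by_type):
    """Concatenate packed types; each offset is the sum of preceding lengths."""
    meta, data = [], bytearray()
    for etype, entries in events_by_type.items():
        packed = type_level_pack(entries)
        if not packed:
            continue
        meta.append((etype, len(packed)))
        for payload in packed:
            assert len(payload) == EVENT_LEN[etype]
            data += payload
    return meta, bytes(data)


def unpack(meta, data):
    """Meta-guided unpacking: recompute offsets from types and counts."""
    events, offset = [], 0
    for etype, count in meta:
        size = EVENT_LEN[etype]
        for _ in range(count):
            events.append((etype, data[offset:offset + size]))
            offset += size
    assert offset == len(data)
    return events


# One cycle with three IC slots (two valid) and two RU slots (one valid).
cycle = {
    "IC": [(True, b"ic01"), (False, b"\0" * 4), (True, b"ic02")],
    "RU": [(False, b"\0" * 8), (True, b"regupd01")],
}
meta, data = cycle_level_pack(cycle)
events = unpack(meta, data)
```

In this toy cycle the packed payload occupies 16 bytes, whereas reserving a fixed slot for every entry would take 3 × 4 + 2 × 8 = 28 bytes; the difference is the invalid bubbles that computed offsets avoid.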
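The transmission-level step of Batch, which splits a cycle packet at event-type boundaries to fill fixed-size packets, can be sketched in the same spirit. The function name fill_packets, the (type, nbytes) segment representation, and the 32-byte packet size are hypothetical choices for this example; the real hardware additionally concatenates metadata and payloads separately, which is not modeled here.

```python
def fill_packets(cycle_packets, packet_size, split_at_types=True):
    """Assemble cycle packets into fixed-size transmission packets.

    cycle_packets: list of cycles; each cycle is a list of (etype, nbytes)
    segments, one per event type, already packed at the cycle level.
    With split_at_types=True a cycle may be split at event-type boundaries;
    otherwise each cycle is treated as one indivisible segment.
    Returns a list of packets, each a list of (cycle_idx, etype, nbytes).
    """
    packets, current, used = [], [], 0
    for cycle_idx, segments in enumerate(cycle_packets):
        if not split_at_types:
            segments = [("cycle", sum(n for _, n in segments))]
        for etype, nbytes in segments:
            if nbytes > packet_size:
                raise ValueError("segment larger than a packet")
            if used + nbytes > packet_size:
                # Residual space is too small: close the current packet.
                packets.append(current)
                current, used = [], 0
            current.append((cycle_idx, etype, nbytes))
            used += nbytes
    if current:
        packets.append(current)
    return packets


cycles = [
    [("IC", 8), ("RU", 16)],
    [("IC", 8), ("RU", 16)],
    [("IC", 8), ("RU", 8)],
]
split = fill_packets(cycles, packet_size=32)
whole = fill_packets(cycles, packet_size=32, split_at_types=False)
```

With these assumed sizes, splitting at type boundaries fits the three cycles into two full packets, while keeping each cycle whole needs three partially filled ones.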
4.3 Squash: Fusing with Order Decoupling

4.3.1 Why semantics matter? Reducing transmission data volume calls for fusing same-type verification events across instructions. For instance, a sequence of 𝑁 instructions can be fused into a single 𝑁-commit event with the final PC and the instruction count. However, non-deterministic events (NDEs) challenge fusion with a strict checking order requirement. The NDEs, such as external interrupts and MMIO access, are specific to the DUT and need to be synchronized to REF at precise instructions, forcing updates to the REF’s architectural states. As a result, any instructions prior to an NDE should be checked ahead while subsequent ones remain unchecked.

As illustrated in Figure 8, existing fusion approaches [8, 19, 40, 57] couple fusion to the checking order. Once an NDE is detected, they terminate the ongoing instruction fusion and transmit the fused instructions to the REF, ensuring the required checking order by consistent transmission order. The order coupling design causes frequent fusion breaks and a limited fusion ratio. In real-world workloads with substantial device interaction and frequent exceptions, such as OS boot, device drivers, and I/O-intensive applications, the fusion breaks more often, markedly reducing the overall fusion ratio.

Verification events inherently carry order semantics, which reflect the required checking order between verification events. By leveraging order semantics, the fusion and communication order can be decoupled from the checking order: NDEs can be transmitted ahead with order tags, indicating after which instruction they should be checked. Meanwhile, other events continue to be fused across instructions. On the software side, the checker reorders events by the tags to restore the required checking order. Such order-decoupled fusion reduces NDE-induced fusion breaks and improves fusion efficiency with less data volume.

Moreover, the verification events exhibit natural repetitiveness and locality. For example, some control and status registers (CSRs) remain unchanged over long instruction sequences. Such repetitiveness allows for referencing unchanged parts of prior events to avoid redundant transmission.

Figure 9: The Squash workflow. Verification events are fused across cycles, with the NDEs scheduled ahead for continuous fusion of DEs. Differencing removes the redundancy in MA4 relative to MA2. The software then completes MA4 and restores the checking order by inserting MA back into IC. (Legend: IC = Instr. Commit (DEs, fusible); MA = MMIO Access (NDEs, non-fusible).)

4.3.2 How semantics work? To reduce transmission data volume while preserving correct checking order, we propose Squash, an order-aware fusion scheme that fuses verification events from different instructions with decoupled checking order. Squash addresses two key challenges: (1) how to preserve checking order and enable continuous fusion without frequent breaks; (2) how to eliminate redundant event contents for further data reduction.

As illustrated in Figure 9, the Squash workflow consists of three stages: fusion and scheduling, differencing, and reordering. On the hardware side, same-type events are fused across instructions, while non-fusible NDEs are scheduled ahead with order tags. Both the fused events and NDEs are then differenced to remove redundancy. On the software side, events are completed from the preceding event and reordered to the original checking sequence.

Fusion and Scheduling. Squash fuses verification events of the same type into a single event that represents the collective effect of a sequence. For instance, a sequence of instruction commits (ICs) can be fused into one fused IC containing dense information such as the






final PC, the number of committed instructions, and corresponding register updates. Meanwhile, non-fusible NDEs are scheduled for transmission ahead with order tags, notifying the checker to check them at the precise instruction order. By binding each event to its nearest preceding instruction commit, Squash preserves the checking order and ensures that the REF verifies events in the same order as the DUT, especially for non-deterministic behaviors such as external interrupts and MMIO access.
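The interplay of fusion, order tags, and software reordering can be sketched with a small model. The sketch below makes several assumptions that are not taken from the paper: events are plain tuples, an order tag is the number of instructions committed before the NDE, a fused IC carries only its instruction range and final PC, and the function names squash and reorder are hypothetical.

```python
def squash(events, max_fuse):
    """Hardware side (sketch): fuse ICs, send NDEs ahead with order tags.

    events: list of ("IC", pc) or (nde_kind, value) in DUT order.
    Returns transmitted items: ("IC", first, last, final_pc) for a fused
    commit, or (nde_kind, tag, value) where tag is the index of the
    nearest preceding instruction commit.
    """
    sent, committed, fused_start, last_pc = [], 0, 0, None
    for kind, value in events:
        if kind == "IC":
            committed += 1
            last_pc = value
            if committed - fused_start == max_fuse:
                sent.append(("IC", fused_start + 1, committed, last_pc))
                fused_start = committed
        else:
            # NDE: not fused, transmitted immediately with its order tag.
            sent.append((kind, committed, value))
    if committed > fused_start:
        sent.append(("IC", fused_start + 1, committed, last_pc))
    return sent


def reorder(sent):
    """Software side (sketch): restore the checking order from order tags.

    Returns ("IC", first, last) steps interleaved with the NDE items.
    """
    pending, order = [], []
    for item in sent:
        if item[0] != "IC":
            pending.append(item)  # NDE arrives ahead of its fused IC
            continue
        _, first, last, _final_pc = item
        cursor = first
        pending.sort(key=lambda nde: nde[1])
        while pending and pending[0][1] < last:
            kind, tag, value = pending.pop(0)
            if tag >= cursor:
                order.append(("IC", cursor, tag))  # step REF up to the tag
                cursor = tag + 1
            order.append((kind, tag, value))
        order.append(("IC", cursor, last))
    order.extend(sorted(pending, key=lambda nde: nde[1]))
    return order


# Same shape as the Figure 9 timeline: IC1, IC2, MA2, IC3, IC4, MA4, IC5.
dut_events = [("IC", 0x100), ("IC", 0x104), ("MA", "a"),
              ("IC", 0x108), ("IC", 0x10C), ("MA", "b"), ("IC", 0x110)]
sent = squash(dut_events, max_fuse=8)
checked = reorder(sent)
```

In this model the two MMIO accesses no longer break fusion: all five commits travel as one fused IC, and the checker splits that range at the tagged positions before comparing against the REF.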
Figure 10: Comparison of debugging schemes after fusion. Snapshot debugging re-executes the entire DUT from a snapshot, while Replay retransmits buffered verification events to restore instruction-level debuggability.

Differencing. To exploit event repetitiveness, Squash applies differencing to remove unchanged fields in each event. Events are decomposed into smaller units (e.g., CSR entries), and only modified ones are transmitted (e.g., via XOR operations). On the software side, the checker keeps the latest record and completes events by filling unchanged fields from the previous ones, as in completing MA4 from MA2 in Figure 9.
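Differencing and completion can be sketched as a pair of inverse operations over a fixed set of units. The CSR names in FIELDS, the all-zero initial record, and the dictionary representation are assumptions for illustration; the paper only states that events are decomposed into smaller units and that modified ones are detected, for example, via XOR.

```python
# Hypothetical subset of CSR units carried by one event type.
FIELDS = ("mstatus", "mepc", "mcause", "mtvec")


def difference(prev, event):
    """Sender: keep only units whose XOR with the previous record is nonzero."""
    return {name: event[name] for name in FIELDS if prev[name] ^ event[name]}


def complete(prev, delta):
    """Receiver: fill units missing from the delta using the latest record."""
    return {name: delta.get(name, prev[name]) for name in FIELDS}


e1 = {"mstatus": 0x1800, "mepc": 0x80000010, "mcause": 0, "mtvec": 0x80000100}
e2 = {"mstatus": 0x1800, "mepc": 0x80000024, "mcause": 0, "mtvec": 0x80000100}

# Both sides start from the same all-zero record and update it per event.
tx_prev = rx_prev = dict.fromkeys(FIELDS, 0)
sent, received = [], []
for event in (e1, e2):
    delta = difference(tx_prev, event)
    tx_prev = event
    sent.append(delta)
    rx_prev = complete(rx_prev, delta)
    received.append(rx_prev)
```

Here the second event is transmitted as a single changed unit (mepc), and the receiver reconstructs the full event from its latest record. The scheme relies on sender and receiver observing the same event sequence, so that both hold identical previous records.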
4.4 Replay: Debugging with Instruction-level Behaviors

4.4.1 Why semantics matter? Debugging requires localizing processor errors to specific microarchitectural components. Verification events enable this by checking the DUT’s architectural state after each instruction, with each event corresponding to a specific architectural behavior and covering the relevant microarchitectural components. For example, bugs in the DUT’s memory subsystem can be exposed by register updates comparison from load/store instructions, as well as the refill value check from cache accesses.

As shown in Figure 10, existing work [9, 24, 31, 56, 57] relies on verification events to detect errors, but still falls back to waveforms for debugging. To recover waveforms near the failure point, they snapshot the entire DUT and re-execute from the nearest checkpoint. Since the root cause may precede the observed failure, snapshots must be taken periodically, incurring substantial resource and time overhead.

Verification events also carry behavioral semantics, referring to architectural behaviors and their mapping to specific microarchitectural components. These semantics can guide debugging by localizing faults more precisely. To address the loss of per-instruction detail caused by fusion, we reprocess only the unfused verification events around the failure, rather than re-executing the entire DUT. This restores instruction-level behavioral details and pinpoints the faulty instruction and related microarchitectural component, enabling lightweight and effective debugging.

Figure 11: The Replay workflow. Verification events are buffered during fusion. When a fused event mismatches, the REF reverts it, notifies the hardware to retransmit the buffered unfused ones, and reprocesses them for debugging.

4.4.2 How semantics work? To restore instruction-level debuggability, we propose Replay, a lightweight debugging mechanism that localizes bugs by reprocessing unfused events around the failure point. Replay addresses two key challenges: (1) how to determine the range of retransmission; (2) how to recover the REF’s state for reprocessing.

As illustrated in Figure 11, the Replay workflow contains a hardware retransmission module and a software checking module. On the hardware side, verification events are buffered during fusion and retransmitted upon notification. On the software side, once a fused event mismatches, the REF reverts its state, requests retransmission, and reprocesses the unfused events for debugging.

Range Determination. Optimizations and communication latency make it difficult to identify the range of unfused events that need to be replayed. Replay introduces a token-based management mechanism: tokens are assigned to buffered verification events before fusion, and fused together during optimization. Upon detecting a mismatch, Replay uses these tokens to locate the exact range of events and notifies the hardware to retransmit only the necessary buffered events. Tokens also filter out irrelevant events that may arrive between the bug occurrence and replay notification, ensuring consistent replay.

Revert Reference Model. Since mismatches may occur at any check, Replay must revert the REF to the latest checkpoint before reprocessing events. Directly snapshotting the REF at each checkpoint would be prohibitively expensive, especially in memory usage. Instead, Replay adopts a compensation-based strategy: it records only the modifications between consecutive checkpoints. When a fused event is found buggy, Replay restores the REF by rolling back these changes. For example, the original values of memory updates between two checkpoints are logged; reverting simply writes back these logs in reverse order to achieve lightweight state recovery.

4.5 Other Design Issue

Software processing latency can be hidden through hardware–software parallelism enabled by non-blocking communication. In co-simulation, all checks are assumed to match until the first mismatch occurs, so the DUT does not need to wait for software results and can





speculatively continue execution in parallel with software processing, thereby hiding software latency. If the checker later detects a mismatch, it asynchronously notifies the ahead-running DUT and terminates the co-simulation.

However, non-blocking transmission introduces new challenges such as out-of-order delivery and transmission bursts. To ensure correct processing, we employ a unified hardware–software interface (Section 4.2), where all verification events are transmitted in structured packets for ordered parsing. To handle bursts, the communication unit incorporates sending and receiving queues with backpressure, ensuring reliable and balanced transmission.

4.6 Put It All Together

Figure 12: The DiffTest-H workflow. (Steps: ① Monitor, ② Buffering, ③ Squash, ④ Batch, ⑤ NonBlock, ⑥ Step & Compare, ⑦ Replay, ⑧ Report & Finish.)

Putting the acceleration and debugging units together, DiffTest-H effectively optimizes hardware–software communication overhead while preserving instruction-level debuggability. Figure 12 illustrates the workflow of DiffTest-H and the eight procedures.

On the hardware side, the verification events are captured and collected by the monitor unit inserted into DUT (see ①). The events are buffered for potential debugging purposes (see ②), and then optimized by the acceleration unit (see ③–④). Specifically, Squash performs both fusion and differencing: it fuses verification events (e.g., B3–B4) and filters out unchanged fields between successive events (e.g., B0-2 and B3-4) to reduce data transmission volume (see ③). Batch further packages diverse events across multiple cycles to minimize communication frequency (see ④).

Through the non-blocking communication unit, the packed events are transmitted to the software side (see ⑤), where they are extracted with computed offset and reconstructed to their original data structure. The events are then checked step-by-step to verify the DUT against the REF (see ⑥).

Upon detecting a mismatch, the replay unit recovers the REF’s state by rolling back the last faulty events (e.g., B3-4) and notifies the hardware to retransmit the corresponding buffered data (e.g., B3–B4) (see ⑦). Finally, the checker reprocesses the pre-fusion events up to the specific buggy point and completes the co-simulation with a detailed debugging report, thereby preserving instruction-level debuggability (see ⑧).

5 Implementation

DiffTest-H is implemented in Chisel, a high-level hardware description language (HDL). This section introduces DiffTest-H’s compatibility across platforms and designs. To optimize DiffTest-H in different scenarios, we have also proposed an open-source tuning toolkit, supporting performance evaluation, SQL analysis, and iterative debugging of DiffTest-H.

Design/Platform Compatibility. As mentioned earlier, the DiffTest-H framework mainly consists of five units besides DUT: monitor unit, acceleration unit, communication unit, check unit, and replay unit. By simply instantiating a piece of probe logic from the DUT, the DiffTest-H framework can automatically generate matching logic for all five units, providing compatibility for different designs and platforms: we have deployed DiffTest-H on both NutShell, a scalar in-order processor, and XiangShan, a 6-wide out-of-order dual-core processor, across platforms including emulator (Cadence Palladium), FPGA, and RTL simulators (Verilator, VCS), speeding up co-simulation with hardware acceleration.

Tuning Toolkit. To further explore the optimization space for different designs and co-simulation strategies, we have constructed a complete open-source toolkit, which mainly includes three parts:

(1) Performance evaluation support: DiffTest-H integrates performance counters in both software and hardware. On the software side, the counters collect performance statistics, such as the transmission times and data volume. On the hardware side, the counters monitor performance-related indicators, including Squash fusion ratios and Batch packet utilization. These metrics will be used to guide the adjustment of optimization for better performance.

(2) SQL analysis support: DiffTest-H records online transmission data in an SQL database for offline analysis. With this SQL backend, DiffTest-H can also simulate order-decoupled fusion and differencing strategy on the software, thereby fully exploiting event correlations and reducing data transmission volume.

(3) Iterative debugging support: When debugging DiffTest-H’s verification logic, it is time-consuming and resource-wasting to include the unchanged DUT during compilation and execution. To support independent iteration, DiffTest-H decouples the DUT and verification logic by trace dumping and reloading. The mechanism dumps the original verification events captured from the DUT during the first run, which is also called the DUT trace. Based on the traces, DiffTest-H generates and drives verification logic independently, supporting lightweight and rapid iterative debugging.

6 Evaluation

In this section, we evaluate the performance and resource utilization of DiffTest-H across various DUT scales. We further perform an optimization breakdown to quantify the contribution of each strategy within DiffTest-H. Finally, we demonstrate the effectiveness of DiffTest-H in the development of XiangShan, a 6-wide out-of-order processor. Our results are highlighted as follows:


     • On Cadence Palladium, DiffTest-H achieves 80× speedup over baseline, and is 119× faster than a 16-thread Verilator simulation, reducing communication overhead by 99.8%.
     • On FPGA, DiffTest-H achieves 78× speedup over baseline, and is 1945× faster than a 16-thread Verilator simulation, reducing communication overhead by 98.8%.
     • DiffTest-H incurs a maximum resource overhead of 26%, reduced to 6% when disabling Batch packing.
     • DiffTest-H uncovers over 151 complex bugs in XiangShan that require up to 2 months to identify with Verilator but are detected within 11 hours by DiffTest-H on Palladium.

Figure 13: Performance comparison. (Series: Verilator, PLDM (Baseline), PLDM (DiffTest-H), FPGA (Baseline), FPGA (DiffTest-H); DUTs: NutShell, XiangShan (Minimal), XiangShan (Default), XiangShan (Default, 2C); logarithmic vertical axis; some entries are marked N/A.)

6.1 Experimental Setup

Table 3: Experimental Setup.

     Feature    Configuration
     DUT        • NutShell, scalar, in-order
                • XiangShan (Minimal), 2-wide, out-of-order
                • XiangShan (Default), 6-wide, out-of-order
                • XiangShan (Default, dual-core), 6-wide, out-of-order
     Platform   Emulator: Cadence Palladium
                FPGA: Xilinx VU19P
     Workload   Linux boot (∼1.7B instructions)
                KVM, XVISOR, RVV_TEST, SPEC CPU 2006

Table 4: Scales and verification coverage across DUTs.

     DUT                        Gates      Event Types   Avg. Bytes per Instr.
     NutShell                   0.6 M      6             93
     XiangShan (Minimal)        39.4 M     32            692
     XiangShan (Default)        57.6 M     32            1437
     XiangShan (Default, 2C)    111.8 M    32            3025

To demonstrate the generalizability of DiffTest-H across DUTs and platforms, we evaluate it on both Palladium and FPGA using NutShell and XiangShan across different configurations. To further validate DiffTest-H’s effectiveness under full-system workloads, we employ benchmarks including Linux boot and SPEC CPU2006, covering most verification scenarios involving control flow, register updates, memory access, hierarchy, and optional ISA extensions listed in Table 1. The experimental setup is listed in Table 3 with scales and verification coverage across DUTs listed in Table 4.

6.2 Performance Evaluation

We evaluate DiffTest-H’s performance against both the state-of-the-art RTL simulator and emulation platforms under various setups. The performance results are obtained on realistic benchmarks, including Linux, KVM, XVISOR, RVV_TEST, and SPEC CPU 2006.

Figure 13 presents DiffTest-H’s performance when running Linux Boot across the DUT configurations mentioned in Table 4. For comparison, we measure the performance under identical DUT and benchmark across the following setups: (a) 16-thread Verilator, the current state-of-the-art RTL simulator; (b) Unoptimized Palladium setup, serving as the baseline for DiffTest-H; (c) DUT-only Palladium setup, representing the theoretical maximum simulation speed without any co-simulation overhead. The performance results are quantified in kiloCycles per second (KHz).

On the large-scale DUT co-simulation, DiffTest-H demonstrates significant acceleration, achieving an 80× speedup over the unoptimized Palladium baseline and 119× faster than a 16-thread Verilator simulation. Across all DUT scales, including small and mid-sized configurations, DiffTest-H consistently delivers over 74× speedup compared to the baseline, highlighting its effectiveness across a range of design complexities.

Figure 14: Bug detection time. (Series: DiffTest-H, Verilator; logarithmic vertical axis.)

Furthermore, DiffTest-H’s acceleration capability significantly improves the efficiency of functional debugging. As illustrated in Figure 14, complex bugs that require millions to billions of simulation cycles to manifest can be detected within 11 hours on Palladium using DiffTest-H, whereas traditional simulation with Verilator would take up to 2 months under the same conditions. These bugs, uncovered during the verification of the XiangShan project, have been officially reported to and acknowledged by the XiangShan development team, demonstrating the practical effectiveness of DiffTest-H in real-world chip development scenarios.

By greatly accelerating bug discovery and iteration, DiffTest-H enables designers to quickly identify and fix bugs, enhancing the productivity and reliability of the chip development.

6.3 Optimization Breakdown

To evaluate the effectiveness of DiffTest-H’s optimization techniques across different designs and platforms, Table 5 presents incremental performance results on NutShell and XiangShan with




(Figure axis labels recovered from the page: Time to Bug / minute; Speed / KHz; Gates / M. Workload labels: KVM, RVV_Test, Xvisor, 2C_LINUX, KVM+CoreMark, spec06/astar, spec06/bwaves, spec06/gcc, spec06/mcf, spec06/sphinx3, spec06/zeusmp, spec06/xalancbmk.)


Palladium, and XiangShan with FPGA. Each row shows the benefit brought by progressively applying Batch, NonBlock, and Squash.

Table 5: Optimization breakdown across DUTs and platforms.

     Setup        NutShell          XiangShan        XiangShan
                  on Palladium      on Palladium     on FPGA
     Baseline     14 KHz            6 KHz            0.1 MHz
     +Batch       102 KHz (7×)      24 KHz (4×)      1.3 MHz (13×)
     +NonBlock    389 KHz (28×)     71 KHz (12×)     2.2 MHz (22×)
     +Squash      1030 KHz (74×)    478 KHz (80×)    7.8 MHz (78×)

Batch significantly improves performance by reducing communication frequency through tight packing of structurally diverse events, achieving a 4×–13× speedup over the baseline. NonBlock further accelerates co-simulation by masking software processing latency with hardware-software parallelism, and provides an additional 2×–4× speedup over Batch. Squash greatly reduces data volume by fusing events with a decoupled checking order, contributing the final boost to a total of 74× speedup on NutShell, 80× on XiangShan (Palladium), and 78× on XiangShan (FPGA).

Overall, these optimizations reduce co-simulation time to about 1%–2% of the unoptimized baseline, cutting communication overhead by 99.8% on Palladium and 98.8% on FPGA. This demonstrates that DiffTest-H effectively eliminates the primary performance bottleneck in hardware-accelerated co-simulation, achieving both high speed and minimal extra overhead beyond DUT emulation.

Table 6: Summary of pull requests fixing bugs detected by DiffTest-H in XiangShan.

     Bug Category                                Pull Requests
     Exception and interrupt handling errors     #3639, #4239, #4263, #3991, #3778, #4157
     Memory hierarchy and coherence issues       #3964, #3685, #3621, #4037, #3719, #4442
     Vector and control logic errors             #3876, #3965, #3690, #3643, #3646, #3664, #4361

6.5 Finding Bugs

To demonstrate the effectiveness of DiffTest-H in verification with millions of test cycles and a wide range of verification states, we deployed DiffTest-H on XiangShan, an open-source 6-wide out-of-order dual-core processor within the DiffTest framework. DiffTest-H supports 32 types of verification state, including instructions, cache coherence, TLB, vectorization, and virtualization.

DiffTest-H was extensively used during XiangShan’s development to run real-world benchmarks such as SPEC06 for error detection. These workloads trigger complex microarchitectural corner cases between pipeline stages, memory systems, and exception logic, with many bugs only manifesting after millions or billions of cycles. Compared to baseline DiffTest, DiffTest-H achieved significantly shorter runtime to detect the same errors at similar cycle
6.4     Resource Analysis                                                counts, demonstrating higher co-simulation efficiency without sac-
We evaluate the additional resource usage introduced by the com-         rificing debuggability.
plete DiffTest-H framework across different configurations of Xi-                          Over the past six months, DiffTest-H helped XiangShan uncover
angShan. In our setup, DiffTest-H monitors XiangShan by inserting        over 151 complex bugs. All 151 complex bugs were confirmed and
128 probes within each core, covering 32 types of verification states.   fixed by the XiangShan development team, involving a total of
The basic resource usage for both DiffTest-H and the DUTs is sum-        780 lines of code modifications across 19 pull requests. These bugs
marized in Figure 15, with area results estimated using Cadence          span three categories: (1) exception and interrupt handling errors,
Palladium and quantified in million gates.                               such as incorrect virtual address generation, misaligned load/store
 As shown in Figure 15, DiffTest-H incurs approximately a 6%             wakeup, and improper interrupt responses; (2) memory hierarchy
area overhead without Batch across different DUT configurations.         and coherence issues, including TLB deadlocks during guest page
In this setup, DiffTest-H can operate on platforms with software-        faults, StoreQueue condition mismatches, and cache inconsistencies
like communication support, such as Cadence Palladium, achieving         under specific faults; (3) vector and control logic errors, such as
accelerated co-simulation with minimal additional area cost. When        wrong vstart updates, incorrect vs.dirty settings, and faulty
Batch is enabled, the area overhead increases to an average of           vector exception tracking.
25%. This configuration introduces a unified hardware-software                              Table 6 summarizes 19 pull requests categorized by bug type,
communication interface, significantly simplifying the migration         while Figure 14 presents the time savings achieved by DiffTest-H
to platforms lacking software-like communication support.                compared to Verilator in detecting these bugs.
    DUT    DUT+DiffTest-H (No Batch)          DUT+DiffTest-H             6.6     Comparison with Prior Work
 150                                                                     DiffTest-H supports deployment on both emulator and FPGA. As
 120                                                                     illustrated in Table 7, IBI-check [8] and SBS-check [19] represent
    90                                                                   state-of-the-art emulator-based solutions, achieving low communi-
    60                                                                   cation overhead (∼2%) and moderate area overhead (∼20%). How-
    300                                                                  ever, their verification states are limited to basic events such as
        NutShell XiangShan  XiangShan              XiangShan             instruction commits and register updates, still incapable of detect-
                 (Minimal)  (Default)           (Default,2C)             ing more complex architectural behaviors such as non-determinism
                                                                         discussed in Section 4.3. In contrast, DiffTest-H expands the ver-
        Figure 15: Resource usages.                                      ification states to 32 architectural behaviors while reducing the
                                                                         communication overhead to just 0.4% with similar area overhead.
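As a quick consistency check, the speedup factors reported in Table 5 can be recomputed from the rounded speeds in that table. The Python sketch below does this and also prints the remaining runtime fraction, 1/speedup, which is where the 1%–2% figure comes from. The dictionary layout and the function name are illustrative choices, not part of DiffTest-H.

```python
# Speeds from Table 5, all converted to KHz (0.1 MHz = 100 KHz, and so on).
TABLE5_KHZ = {
    "NutShell on Palladium": {"Baseline": 14, "+Batch": 102, "+NonBlock": 389, "+Squash": 1030},
    "XiangShan on Palladium": {"Baseline": 6, "+Batch": 24, "+NonBlock": 71, "+Squash": 478},
    "XiangShan on FPGA": {"Baseline": 100, "+Batch": 1300, "+NonBlock": 2200, "+Squash": 7800},
}


def speedups(speeds):
    """Return the speedup of every setup relative to the Baseline entry."""
    base = speeds["Baseline"]
    return {name: khz / base for name, khz in speeds.items() if name != "Baseline"}


if __name__ == "__main__":
    for platform, speeds in TABLE5_KHZ.items():
        for name, ratio in speedups(speeds).items():
            # 100 / ratio is the percentage of baseline co-simulation time that remains.
            print(f"{platform:24s} {name:10s} {ratio:6.1f}x  time left: {100 / ratio:5.2f}%")
```

With all three optimizations enabled, the remaining time fraction falls between 1% and 2% on every platform, matching the range stated in the text.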





Table 7: Comparison of hardware-accelerated co-simulation frameworks.

 Work               Platform                      Verification      Communication   Area       DUT-only    Co-sim
                                                  States/Bytes †    Overhead        Overhead   Speed       Speed
 IBI-check [8]      IBM AWAN [13]                 2 / 7             20 %            20 %       100 KHz     80 KHz
 SBS-check [19]     Gem5 [5] (for estimation ‡)   2 / 7             2 % ‡           22 % ‡     100 KHz ‡   98 KHz ‡
 DiffTest-H         Cadence Palladium [7]         32 / 1200         0.4 %           26 %       480 KHz     478 KHz
 Fromajo [56, 57]   FireSim [22]                  7 / 24            99 %            Unknown    100 MHz     1 MHz
 DiffTest-H         Xilinx VU19P                  32 / 1200         84 %            24 %       50 MHz      7.8 MHz

† The number of verification state types and the average byte size of verification states per retired instruction before optimization.
‡ Speed and overhead of SBS-check are estimated using Gem5, with IBI-check serving as the baseline.
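The communication overhead column in Table 7 appears consistent with the share of co-simulation time not spent on DUT emulation, that is, 1 − (co-sim speed / DUT-only speed). This formula is an inference from the table values rather than a definition quoted from the text; the short Python sketch below applies it to each row. The row labels and the function name are illustrative.

```python
# (DUT-only speed, co-simulation speed) in KHz, taken from Table 7.
TABLE7_KHZ = {
    "IBI-check (IBM AWAN)": (100, 80),
    "SBS-check (Gem5 estimate)": (100, 98),
    "DiffTest-H (Palladium)": (480, 478),
    "Fromajo (FireSim)": (100_000, 1_000),
    "DiffTest-H (VU19P)": (50_000, 7_800),
}


def communication_overhead(dut_only_khz, cosim_khz):
    """Assumed definition: share of co-simulation time beyond pure DUT emulation."""
    return 1.0 - cosim_khz / dut_only_khz


if __name__ == "__main__":
    for work, (dut_only, cosim) in TABLE7_KHZ.items():
        overhead_pct = 100 * communication_overhead(dut_only, cosim)
        print(f"{work:28s} {overhead_pct:5.1f} %")
```

Under this reading, each row of the table is reproduced to within rounding, including the 0.4% figure on Palladium and the 84% figure on the VU19P FPGA.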

On FPGA-accelerated platforms, Fromajo [56, 57] is the state-of-the-art framework that runs the DUT on FireSim [22] and compares its execution against the reference model Dromajo [21]. It supports 7 types of architectural states and detects a subset of non-deterministic behaviors. In contrast, DiffTest-H, with a more comprehensive set of 32 verification states, achieves a simulation speed of 7.8 MHz, 7.8× faster than Fromajo.

Overall, compared to the state-of-the-art approaches on both emulator and FPGA, DiffTest-H delivers higher simulation speed, expanded verification coverage, and comparable area overhead.

7 Related Work

Improved RTL Simulators. RTL simulators, such as open-source Verilator [46] and commercial VCS, translate RTL circuits written in Verilog into dataflow graphs, where nodes represent combinational logic and edges represent data values. Recent optimizations for CPU-based RTL simulation mainly focus on the sequential latency of the dataflow graph, including ESSENT [4], RepCut [48], and Khronos [58]. Some other efforts in accelerating RTL simulation fully leverage the task parallelism in the dataflow graph, such as RTLFlow [27] and SAGA [47] running on GPUs, as well as Manticore [16], ASH [15], and Nexus [6] accelerated on the FPGA. Despite these advances, dataflow-based RTL simulation executes instructions rather than circuit logic, requiring multiple host cycles for one design cycle and thus limiting speed to orders of magnitude below FPGA prototyping.

Hardware-Accelerated Co-Simulation. Differing from RTL simulators, hardware emulators synthesize RTL circuits into gates in specialized ASICs or FPGAs, reaching MHz-level speeds on industrial-scale designs. Traditional emulators include Cadence Palladium, Synopsys Zebu, Siemens Veloce, and Xilinx FPGAs. Considering the ease of maintenance and running speed, existing co-simulation mainly adopts software-implemented ISA reference models such as Spike [20] and NEMU [33]. To adopt hardware emulators for accelerating co-simulation, it is necessary to consider the cross-platform communication overhead, which consumes over 98% of co-simulation time. According to the communication direction, existing works can be categorized into two groups:

Software-to-hardware Communication. ENCORE [40] runs the DUT on the emulator and the REF on the host server independently, and transmits software data to the emulator for comparison. However, as mentioned in Section 2.2, the reference model relies on data from the design for state alignment under external interrupts. Running software independently will lead to divergence in the execution path of the reference model.

Hardware-to-software Communication. Due to the large amount of hardware verification data, communication overhead accounts for more than 98% of overall co-simulation time. Recent approaches, including IBI-check [8] and ArChiVED [19], employ static data packaging and checksum-based compression to optimize communication. However, these works neglect non-deterministic behaviors in co-simulation, the handling of which is critical for aligning the reference model state with the DUT under external interrupts and stimulus. While DESSERT [24], ZP Cosim [31], and Fromajo [57] identify several key sources of non-determinism and make some optimizations toward communication, they are inefficient for handling the large-scale, diverse verification events typical of industrial designs.

8 Conclusion

We propose DiffTest-H, a semantic-aware, hardware-accelerated co-simulation framework. It enhances verification efficiency while maintaining verification completeness and instruction-level debuggability through three semantic-aware communication optimizations. Batch minimizes communication frequency by tightly packing structurally diverse verification events into a single transfer. Squash reduces data transmission volume by fusing verification events with a decoupled checking order. Replay preserves instruction-level debuggability by reprocessing the original, unfused verification events around the failure point. DiffTest-H is deployed on both emulator and FPGA to verify XiangShan [35, 50, 51, 54, 55], a 6-wide out-of-order RISC-V processor. DiffTest-H achieves simulation speeds of 478 KHz and 7.8 MHz, respectively, 80× and 78× faster than the baseline, and uncovers 151 bugs in XiangShan. We have open-sourced DiffTest-H to the community, promoting verification efficiency for broader chip designs.

Acknowledgments

This work is co-authored by Shoulin Zhang, Ziqing Zhang, and Kan Shi, with valuable support for the FPGA-based experiments. The authors would like to thank the anonymous reviewers for their valuable feedback and comments. This work is supported in part by the National Natural Science Foundation of China (Grant No. 62090022, 62090023) and the Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDA0320000, XDA0320300).


References

[1] Albert Alexandrov, Mihai F Ionescu, Klaus E Schauser, and Chris Scheiman. 1995. LogGP: Incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation. In Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures. 95–105.
[2] Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: constructing hardware in a scala embedded language. In Proceedings of the 49th annual design automation conference. 1216–1225.
[3] Scott Beamer. 2020. A case for accelerating software RTL simulation. IEEE Micro 40, 4 (2020), 112–119.
[4] Scott Beamer, Thomas Nijssen, Krishna Pandian, and Kyle Zhang. 2021. ESSENT: A high-performance RTL simulator. In Workshop on Open-Source EDA Technology (WOSET), at International Conference on Computer-Aided Design (ICCAD).
[5] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A Wood. 2011. The gem5 simulator. ACM SIGARCH computer architecture news 39, 2 (2011), 1–7.
[6] Peter Birch. 2022. Open source FPGA-based emulation with nexus. In Workshop on Open-Source EDA Technology (WOSET), Vol. 1.
[7] Cadence. n.d. Palladium. https://www.cadence.com/en_US/home/tools/system-design-and-verification/emulation-and-prototyping/palladium.html
[8] Debapriya Chatterjee, Anatoly Koyfman, Ronny Morad, Avi Ziv, and Valeria Bertacco. 2012. Checking architectural outputs instruction-by-instruction on acceleration platforms. In Proceedings of the 49th Annual Design Automation Conference. 955–961.
[9] Yuxiao Chen, Yisong Chang, Ke Zhang, Mingyu Chen, and Yungang Bao. 2023. REMU: Enabling Cost-Effective Checkpointing and Deterministic Replay in FPGA-based Emulation. In 2023 IEEE 41st International Conference on Computer Design (ICCD). 21–29. doi:10.1109/ICCD58817.2023.00014
[10] Young-kyu Choi, Jason Cong, Zhenman Fang, Yuchen Hao, Glenn Reinman, and Peng Wei. 2016. A quantitative analysis on microarchitectures of modern CPU-FPGA platforms. In Proceedings of the 53rd Annual Design Automation Conference. 1–6.
[11] Ryan A Cooke and Suhaib A Fahmy. 2020. Characterizing latency overheads in the deployment of FPGA accelerators. In 2020 30th International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 347–352.
[12] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten Von Eicken. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming. 1–12.
[13] J. Darringer, E. Davidson, D.J. Hathaway, B. Koenemann, M. Lavin, J.K. Morrell, K. Rahmat, W. Roesner, E. Schanzenbach, G. Tellez, and L. Trevillyan. 2000. EDA in IBM: past, present, and future. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 19, 12 (2000), 1476–1497.
[14] Simon Davidmann and Lee Moore. 2022. Introduction to the 5 Levels of RISC-V Processor Verification. In Design and Verification Conference and Exhibition (DVCon).
[15] Fares Elsabbagh, Shabnam Sheikhha, Victor A Ying, Quan M Nguyen, Joel S Emer, and Daniel Sanchez. 2023. Accelerating rtl simulation with hardware-software co-design. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 153–166.
[16] Mahyar Emami, Sahand Kashani, Keisuke Kamahori, Mohammad Sepehr Pourghannad, Ritik Raj, and James R Larus. 2023. Manticore: Hardware-accelerated RTL simulation with static bulk-synchronous parallelism. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4. 219–237.
[17] Harry D. Foster. 2022. Part 3: The 2022 Wilson Research Group Functional Verification Study. https://blogs.sw.siemens.com/verificationhorizons/2022/10/30/part-3-the-2022-wilson-research-group-functional-verification-study/
[18] Harry D. Foster. 2024. Wilson Research Group IC/ASIC functional verification trend report. https://resources.sw.siemens.com/en-US/white-paper-2024-wilson-research-group-ic-asic-functional-verification-trend-report/
[19] Chang-Hong Hsu, Debapriya Chatterjee, Ronny Morad, Raviv Gal, and Valeria Bertacco. 2014. ArChiVED: architectural checking via event digests for high performance validation. In Proceedings of the Conference on Design, Automation & Test in Europe (Dresden, Germany) (DATE ’14). European Design and Automation Association, Leuven, BEL, Article 317, 6 pages.
[20] RISC-V International. 2025. Spike, a RISC-V ISA Simulator. https://github.com/riscv-software-src/riscv-isa-sim
[21] Nursultan Kabylkas, Tommy Thorn, Shreesha Srinath, Polychronis Xekalakis, and Jose Renau. 2021. Effective Processor Verification with Logic Fuzzer Enhanced Co-simulation. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Greece) (MICRO ’21). Association for Computing Machinery, New York, NY, USA, 667–678. doi:10.1145/3466752.3480092
[22] Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, and Krste Asanovic. 2018. FireSim: FPGA-accelerated cycle-exact scale-out system simulation in the public cloud. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 29–42.
[23] Michael Katrowitz and Lisa M Noack. 1996. I’m done simulating; now what? Verification coverage analysis and correctness checking of the DEC chip 21164 Alpha microprocessor. In Proceedings of the 33rd Annual Design Automation Conference. 325–330.
[24] Donggyu Kim, Christopher Celio, Sagar Karandikar, David Biancolin, Jonathan Bachrach, and Krste Asanović. 2018. DESSERT: Debugging RTL Effectively with State Snapshotting for Error Replays across Trillions of Cycles. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL). 76–764. doi:10.1109/FPL.2018.00021
[25] Sunwoo Kim, Jooho Wang, Youngho Seo, Sanghun Lee, Yeji Park, Sungkyung Park, and Chester Sungchung Park. 2020. Transaction-level model simulator for communication-limited accelerators. arXiv preprint arXiv:2007.14897 (2020).
[26] Zhiwei Li, Boyan Ding, Haoyang Wu, and Tao Wang. 2017. A Flexible Frame-Oriented Host-FPGA Communication Framework for Software Defined Wireless Network. In 2017 International Conference on Networking and Network Applications (NaNA). IEEE, 118–124.
[27] Dian-Lun Lin, Haoxing Ren, Yanqing Zhang, Brucek Khailany, and Tsung-Wei Huang. 2022. From rtl to cuda: A gpu acceleration flow for rtl simulation with batch stimulus. In Proceedings of the 51st International Conference on Parallel Processing. 1–12.
[28] lowRISC. 2025. Ibex. https://ibex-core.readthedocs.io/en/latest/03_reference/verification.html
[29] S Marconi, E Conti, P Placidi, J Christiansen, and T Hemperek. 2017. IEEE Standard for Universal Verification Methodology Language Reference Manual.
[30] Romina Soledad Molina, Veronica Gil-Costa, María Liz Crespo, and Giovanni Ramponi. 2022. High-level synthesis hardware design for fpga-based accelerators: Models, methodologies, and frameworks. IEEE Access 10 (2022), 90429–90455.
[31] Anoop Mysore Nataraja. 2023. A Research-Fertile Co-Emulation Framework for RISC-V Processor Verification. Master’s thesis. University of Washington.
[32] Rolf Neugebauer, Gianni Antichi, José Fernando Zazo, Yury Audzevich, Sergio López-Buedo, and Andrew W Moore. 2018. Understanding PCIe performance for end host networking. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. 327–341.
[33] OpenXiangShan. 2025. NEMU. https://github.com/OpenXiangShan/NEMU
[34] OpenXiangShan. n.d. DiffTest. https://github.com/OpenXiangShan/difftest
[35] OpenXiangShan. n.d. XiangShan. https://github.com/OpenXiangShan/XiangShan
[36] OSCPU. n.d. NutShell. https://github.com/OSCPU/NutShell
[37] Lakshmanan Ponnambalam. 2017. Efficient SCE-MI Usage to Accelerate TBA Performance. In Design, Verification & Test of Low Power and Secure Systems (DVCon). IEEE, 2–2. https://dvcon-proceedings.org/document/efficient-sce-mi-usage-to-accelerate-tba-performance/ DVCon Proceedings Archive.
[38] Hao Qian and Yangdong Deng. 2011. Accelerating RTL simulation with GPUs. In 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 687–693.
[39] Shisong Qin, Chao Zhang, Kaixiang Chen, and Zheming Li. 2021. iDEV: Exploring and exploiting semantic deviations in ARM instruction processing. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. 580–592.
[40] Kan Shi, Shuoxiang Xu, Yuhan Diao, David Boland, and Yungang Bao. 2023. ENCORE: Efficient Architecture Verification Framework with FPGA Acceleration. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’23). Association for Computing Machinery, New York, NY, USA, 209–219. doi:10.1145/3543622.3573187
[41] Siemens. n.d. Veloce. https://eda.sw.siemens.com/en-US/ic/hav/veloce-cs/
[42] Synopsys. 2025. ImperasDV. https://www.synopsys.com/verification/imperasdv.html
[43] Synopsys. n.d. VCS. https://www.synopsys.com/verification/simulation/vcs.html
[44] Synopsys. n.d. ZeBu. https://www.synopsys.com/verification/emulation-prototyping/emulation/zebu-200.html
[45] Bill Jason Tomas, Yingtao Jiang, and Mei Yang. 2014. Co-Emulation of Scan-Chain Based Designs Utilizing SCE-MI Infrastructure. arXiv preprint arXiv:1409.3276 (2014).
[46] Verilator. n.d. Verilator. https://github.com/verilator/verilator
[47] Sara Vinco, Debapriya Chatterjee, Valeria Bertacco, and Franco Fummi. 2012. SAGA: SystemC acceleration on GPU architectures. In Proceedings of the 49th Annual Design Automation Conference. 115–120.
[48] Haoyuan Wang and Scott Beamer. 2023. Repcut: Superlinear parallel rtl simulation with replication-aided partitioning. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 572–585.


[49] Haoyuan Wang, Thomas Nijssen, and Scott Beamer. 2024. Don’t Repeat Yourself! Coarse-Grained Circuit Deduplication to Accelerate RTL Simulation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4. 79–93.
[50] Kaifan Wang, Jian Chen, Yinan Xu, Zihao Yu, Wei He, Dan Tang, Ninghui Sun, and Yungang Bao. 2025. XiangShan: An Open-Source Project for High-Performance RISC-V Processors Meeting Industrial-Grade Standards. IEEE Micro (2025).
[51] Kaifan Wang, Jian Chen, Yinan Xu, Zihao Yu, Zifei Zhang, Guokai Chen, Xuan Hu, Linjuan Zhang, Xi Chen, Wei He, Dan Tang, Ninghui Sun, and Yungang Bao. 2024. XiangShan: An Open-Source Project for High-Performance RISC-V Processors Meeting Industrial-Grade Standards. In 2024 IEEE Hot Chips 36 Symposium (HCS). 1–25. doi:10.1109/HCS61935.2024.10665293
[52] Warren Weaver. 1953. Recent contributions to the mathematical theory of communication. ETC: a review of general semantics (1953), 261–281.
[53] Jinyan Xu, Yiyuan Liu, Sirui He, Haoran Lin, Yajin Zhou, and Cong Wang. 2023. MorFuzz: Fuzzing processor via runtime instruction morphing enhanced synchronizable co-simulation. In 32nd USENIX Security Symposium (USENIX Security 23). 1307–1324.
[54] Yinan Xu, Zihao Yu, Dan Tang, Guokai Chen, Lu Chen, Lingrui Gou, Yue Jin, Qianruo Li, Xin Li, Zuojun Li, Jiawei Lin, Tong Liu, Zhigang Liu, Jiazhan Tan, Huaqiang Wang, Huizhe Wang, Kaifan Wang, Chuanqi Zhang, Fawang Zhang, Linjuan Zhang, Zifei Zhang, Yangyang Zhao, Yaoyang Zhou, Yike Zhou, Jiangrui Zou, Ye Cai, Dandan Huan, Zusong Li, Jiye Zhao, Zihao Chen, Wei He, Qiyuan Quan, Xingwu Liu, Sa Wang, Kan Shi, Ninghui Sun, and Yungang Bao. 2022. Towards Developing High Performance RISC-V Processors Using Agile Methodology. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). 1178–1199. doi:10.1109/MICRO56248.2022.00080
[55] Yi-Nan Xu, Zi-Hao Yu, Kai-Fan Wang, Hua-Qiang Wang, Jia-Wei Lin, Yue Jin, Lin-Juan Zhang, Zi-Fei Zhang, Dan Tang, Sa Wang, Kan Shi, Ning-Hui Sun, and Yun-Gang Bao. 2023. Functional Verification for Agile Processor Development: A Case for Workflow Integration. Journal of Computer Science and Technology 38, 4 (2023), 737–753.
[56] Jiahan Zhang, Varun Koyyalagunta, Joe Rahmeh, and Divyang Agrawal. 2023. Integrating a High-Performance Instruction Set Simulator with FireSim to Co-simulate Operating System Boots. In First FireSim and Chipyard User/Developer Workshop at ASPLOS 2023 (ASPLOS ’23 Workshops). https://fires.im/workshop-2023-pdf/04_integ_isa_sim_FireSim_Zhang.pdf
[57] Jerry Zhao, Ben Korpan, Abraham Gonzalez, and Krste Asanovic. 2020. Sonicboom: The 3rd generation berkeley out-of-order machine. In Fourth Workshop on Computer Architecture Research with RISC-V, Vol. 5. International Symposium on Computer Architecture Valencia, Spain, 1–7.
[58] Kexing Zhou, Yun Liang, Yibo Lin, Runsheng Wang, and Ru Huang. 2023. Khronos: Fusing memory access for improved hardware RTL simulation. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 180–193.

A Artifact Appendix

A.1 Abstract

DiffTest-H is an open-source, hardware-accelerated co-simulation framework for processor verification. It deploys the design under test (DUT) on Palladium or FPGA, while comparing its instruction-level architectural state with a golden reference model (REF) on the host server. The artifact includes all code and workflow of DiffTest-H to demonstrate FPGA/Palladium-based simulation speed.

A.2 Artifact check-list (meta-information)

• Hardware: x86-64 Ubuntu servers, Xilinx VU19P FPGA, Cadence Palladium Z1
• Metrics: Simulation Speed.
• Output: Performance report.
• Experiments: (1) FPGA-based simulation speed evaluation with XiangShan. (2) Palladium-based Optimization breakdown with XiangShan/NutShell.
• How much disk space required (approximately)?: About 128 GB.
• How much time is needed to prepare workflow (approximately)?: About 18 hours. (Minimal if using the pre-built bitstream.)
• How much time is needed to complete experiments (approximately)?: Less than 1 hour.
• Publicly available?: Yes. GitHub link: https://github.com/OpenXiangShan/xs-env/tree/micro2025-ae
• Code licenses (if publicly available)?: Mulan Permissive Software License, Version 2
• Archived (provide DOI)?: Yes. DOI link: https://doi.org/10.5281/zenodo.16637351

A.3 Description

A.3.1 How to access. DiffTest-H is open-sourced on GitHub and archived on Zenodo. For reference, we provide runtime logs and performance reports on both platforms. To reduce setup time for FPGA-based experiments, we also include pre-built bitstreams. Please refer to README.md for more details.

A.3.2 Hardware dependencies. Xilinx VU19P FPGA (for FPGA-based simulation), Cadence Palladium (for Palladium-based simulation), x86-64 server with 128GB RAM (host in simulation).

A.3.3 Software dependencies. Vivado 2020.2 (FPGA synthesis and implementation), Mill 0.11 (RTL generation from Chisel).

A.3.4 Data sets. Linux Boot, Microbench.

A.4 Installation

    ## Get latest artifacts from GitHub.
    $ git clone -b micro2025-ae https://github.com/OpenXiangShan/xs-env.git
    ## Install required software dependencies
    $ sudo -s ./setup-tools.sh
    ## Init environment with submodule mechanism.
    $ make init

A.5 Experiment workflow

For the most up-to-date and detailed instructions, please refer to README.md. Below is a brief workflow of the experiments.

A.5.1 FPGA-based Co-simulation Speed with XiangShan. The experiment demonstrates DiffTest-H’s co-simulation speed on Xilinx VU19P FPGA as shown in Figure 13 and Table 7. We recommend that users start with the Step 0 Quick Start, which directly leverages our pre-built bitstream, host, and workloads for reliable results in minutes.

Recommended (Step 0): Quick start with pre-built artifacts.

    make write_bitstream
    make write_ddr
    make fpga-run

Fully Rebuild (Steps 1-5): From Chisel RTL generation to FPGA execution (∼18 hours).

• Step 1: Generate RTL from Chisel.

    make fpga-rtl DUT=XiangShan


• Step 2: Build Host Executable Binary.

    make fpga-host DUT=XiangShan

• Step 3: Generate Bitstream via Vivado.

    make vivado ## Setup Vivado Project
    make bitstream ## Synthesis, Implementation and Bitstream

• Step 4: Write bitstream and workload to FPGA. (Please check README.md for more details, especially FPGA reset.)

    # Step 4.1: Write bitstream to FPGA
    make write_bitstream FPGA_BIT_HOME=...
    # Step 4.2: Write workload to DDR via tcl
    make write_ddr WORKLOAD=microbench

• Step 5: Run XiangShan Co-simulation

    make fpga-run Host=... WORKLOAD=microbench

A.5.2 Palladium-based Optimization Breakdown with XiangShan/NutShell. The experiment demonstrates the incremental impact of each optimization, as shown in Table 5.

Step 1: Generate RTL from Chisel.

    ## DIFF_CONFIG options:
    # Z for Baseline,
    # EBI for Batch,
    # EBIN for Batch+NonBlock
    # EBINSD for Batch+NonBlock+Squash
    ## DUT options: XiangShan or NutShell
    make sim-rtl DUT=XiangShan DIFF_CONFIG=EBINSD

Step 2: Compile for Palladium.

    ## Build on Palladium, requiring XCELIUM, IXCOM, VXE...
    make pldm-build DUT=XiangShan

Step 3: Run XiangShan/NutShell Co-simulation

    ## WORKLOAD options: linux or microbench
    make pldm-run DUT=XiangShan WORKLOAD=linux

A.6 Evaluation and expected results

Please check the reference/ folder for detailed logs. Below are some critical results of the experiments.

(1) Result of A.5.1: FPGA-based co-simulation speed with XiangShan as shown in Fig.13 and Table 7, detailed in reference/perf-log.

    Core 0: HIT GOOD TRAP at pc = ...
    Simulation speed: 7780.71 KHz

(2) Result of A.5.2: Palladium-based simulation speed with different optimizations, as shown in Table 5, detailed in reference/perf-log.

    ## Speed of XiangShan-PLDM
    Simulation speed: 6.49 KHz # Baseline
    Simulation speed: 23.84 KHz # Batch
    Simulation speed: 71.22 KHz # Batch+NonBlock
    Simulation speed: 478.12 KHz # Batch+NonBlock+Squash
    ## Speed of NutShell-PLDM
    Simulation speed: 13.67 KHz # Baseline
    Simulation speed: 101.65 KHz # Batch
    Simulation speed: 389.09 KHz # Batch+NonBlock
    Simulation speed: 1030.93 KHz # Batch+NonBlock+Squash
    ## Speed of XiangShan-FPGA
    Simulation speed: 1278.07 KHz # Batch
    Simulation speed: 2198.00 KHz # Batch+NonBlock
    Simulation speed: 7780.71 KHz # Batch+NonBlock+Squash

A.7 Notes

DiffTest-H is developed in the open-source community, and its code and documentation will continue to be updated. Any feedback and issues are welcome via GitHub or the authors’ emails. The usage and reference results of both Palladium and FPGA are included in README.md and the reference/ folder. We are delighted to assist users in reproducing the experiment results.
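The reference speeds listed in A.6 follow a fixed `Simulation speed: <value> KHz # <config>` pattern, so they are easy to post-process. The Python sketch below parses such lines and prints the gain of each optimization step over the previous one. The regular expression and the function names are assumptions made for illustration and are not part of the artifact; the real perf-log may contain additional fields. Ratios computed from these raw values can differ from the multipliers in Table 5 (for XiangShan on Palladium, 478.12 / 6.49 is about 74), which match ratios of the rounded table speeds.

```python
import re

# Sample lines copied from the XiangShan-PLDM reference results in A.6.
SAMPLE_LOG = """\
Simulation speed: 6.49 KHz # Baseline
Simulation speed: 23.84 KHz # Batch
Simulation speed: 71.22 KHz # Batch+NonBlock
Simulation speed: 478.12 KHz # Batch+NonBlock+Squash
"""

# Assumed line format; adjust if the real perf-log differs.
LINE_RE = re.compile(r"Simulation speed:\s*([\d.]+)\s*KHz\s*#\s*(\S+)")


def parse_speeds(text):
    """Return a list of (config, speed_khz) pairs in log order."""
    return [(m.group(2), float(m.group(1))) for m in LINE_RE.finditer(text)]


def step_gains(speeds):
    """Return (config, gain over the previous entry) for every entry after the first."""
    return [(cfg, khz / prev_khz)
            for (_, prev_khz), (cfg, khz) in zip(speeds, speeds[1:])]


if __name__ == "__main__":
    speeds = parse_speeds(SAMPLE_LOG)
    for cfg, gain in step_gains(speeds):
        print(f"{cfg:24s} {gain:5.2f}x over previous step")
    print(f"total: {speeds[-1][1] / speeds[0][1]:.1f}x over {speeds[0][0]}")
```

The same two functions can be pointed at the NutShell-PLDM or XiangShan-FPGA blocks by replacing the sample text.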



