STIMSMITH

Co-simulation

Concept WIKI v9 · 6/14/2026

Co-simulation is a simulation approach in which multiple models or simulators are run together and coordinated so their behavior can be compared or combined. In processor and accelerator verification, co-simulation typically couples a design under test (DUT) with a golden reference model, most often an Instruction Set Simulator (ISS) such as Spike, so the two implementations execute the same test program and their per-instruction results can be compared. Synchronization between the DUT and reference can be achieved with Direct Programming Interface (DPI) calls, by monitoring microarchitectural structures such as the reorder buffer, or by injecting instructions through RVFI-based interfaces. The reference is typically retained in software for flexibility, while the DUT may be deployed through software-based RTL simulation or on hardware acceleration platforms such as FPGAs and emulators. Hardware-accelerated setups are dominated by communication overhead, which has motivated techniques such as DiffTest-H's Batch, Squash, and Replay, which respectively reduce communication frequency, reduce transmission volume, and preserve debuggability. Co-simulation has also been specialized for RISC-V vector accelerator verification through a reusable UVM environment, for coverage-guided randomized instruction generation, for processor fuzzing (e.g., MorFuzz's synchronizable and delayed-write-back-tolerant co-simulation), for distributed FMI-based modeling with IP protection, and for cyber-security analysis of power systems.


Overview

Co-simulation is a simulation approach in which multiple models or simulators are run together and coordinated so their behavior can be compared or combined. In processor and accelerator verification, co-simulation typically couples a design under test (DUT) with a golden reference model, most often an Instruction Set Simulator (ISS), so the two implementations execute the same test program and generate per-instruction execution traces that are compared. Synchronization between the DUT and reference can be achieved with Direct Programming Interface (DPI) calls, by monitoring microarchitectural structures such as the reorder buffer, or by injecting instructions through RVFI-based interfaces. The reference model is typically retained in software for flexibility, while the DUT may be deployed through software-based RTL simulation or on hardware acceleration platforms such as FPGAs and emulators. Beyond processor and accelerator verification, co-simulation is also used in cyber-physical system analysis (e.g., coupling power-system and communication-network simulators) and in distributed modeling with intellectual-property protection.

Reference-model comparison in RISC-V processor verification

A common way to apply co-simulation in RISC-V processor verification is reference-model comparison: both the RTL implementation and a golden ISS run the same test program and emit per-instruction execution traces. Many published studies use this idea, but most target only a single core or a small set of cores, and they usually require HDL file modifications that are tightly coupled to microarchitectural details, which makes the resulting verification code hard to reuse or to generate automatically.
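As a minimal sketch of reference-model comparison, the snippet below walks two per-instruction traces in lockstep and reports the first position at which they differ. The `TraceEntry` record layout and the function name `first_mismatch` are assumptions made for illustration; real flows define their own trace formats (RVFI is one example mentioned later in this article).

```python
from dataclasses import dataclass
from itertools import zip_longest
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class TraceEntry:
    """One retired instruction (hypothetical record layout)."""
    pc: int
    insn: int
    rd: int        # destination register index, 0 if none
    rd_value: int  # value written to rd


def first_mismatch(
    dut: Iterable[TraceEntry], ref: Iterable[TraceEntry]
) -> Optional[Tuple[int, Optional[TraceEntry], Optional[TraceEntry]]]:
    """Return (index, dut_entry, ref_entry) for the first difference, else None.

    A missing entry (one trace ended early) is reported as None on that side.
    """
    for index, (d, r) in enumerate(zip_longest(dut, ref)):
        if d != r:
            return index, d, r
    return None
```

A non-`None` result only says where the traces diverge; deciding whether the divergence is a DUT bug, a reference-model bug, or a legal difference still requires investigation.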

Representative examples include:

  • Dromajo is an ISS used as the golden model in a co-simulation setup that verified three RISC-V processors. Dromajo relies on Direct Programming Interface (DPI) calls to synchronize the RTL implementation with the reference. The instruction-completion detection mechanism used for one of the cores modifies the processor code to monitor the reorder buffer and to invoke the ISS at each commit.
  • A separate work verified five additional processors by comparing the RTL implementation against the golden model using RVFI-Direct Instruction Injection (RVFI-DII), an extension of the RISC-V Formal Interface. RVFI-DII injects instructions directly into the processor's fetch interface, which enables the simulation to execute specific instructions rather than the entire program and makes debugging more efficient. RVFI-DII requires highly processor-specific code; for example, the Flute in-order scalar core used an instruction ID attached to the program counter that was propagated through each pipeline stage.

These techniques are summarized in the cited large-scale RISC-V verification work, which observes that the most common practice is to verify only one or a few processors. The FabScalar tool is cited as an exception because it verified 12 cores, but those cores are merely variations automatically generated by the same tool and share similar, well-defined structures.

Verification-tool generalization and the role of RVFI

The RISC-V Formal Framework is an open-source tool that provides a formal specification of the RISC-V ISA and formal testbenches for supported processors. It defines a generic interface, the RISC-V Formal Interface (RVFI), which can be used not only in formal methods but also in alternative methods. In particular, some designers have adopted RVFI as a standard trace format for verifying their cores via co-simulation.

A second reference point is Synopsys ImperasDV, a commercial verification suite for RISC-V processors that provides reference-model comparison and coverage analysis. Similar to the RISC-V Formal Framework, ImperasDV defines a generic interface called the RISC-V Verification Interface (RVVI). Although both tools represent significant advances toward generalization, their verification interfaces require deep changes to the processor, which poses a challenge to achieving rapid, large-scale applicability.

The RISC-V Certification Steering Committee is also evaluating the most suitable test suite to ensure compliance with the RISC-V ISA. Although these tests are intended for large-scale use, they rely on directed testing, which is generally less effective than reference-model comparison.

Comparison of testbench effort in the RISC-V verification work

The cited large-scale RISC-V verification work compares the testbench effort required by related approaches, measuring the lines of code needed to apply the verification method, including configuration files and source files:

  • The CVA6 core was verified using the Dromajo-based approach in 280 lines of code.
  • The Ibex core required more than 450 lines of code when verified using RVFI-DII.
  • The verification solution proposed in the cited work required 270 lines of code to verify Hazard3, a similarly-sized core.

Even though the line counts have roughly the same magnitude, the code reusability differs significantly. Both cited studies depend on modifying structures that are unique to the cores: Ibex had its pipeline stages modified to implement the RVFI-DII interface, and CVA6 had its reorder buffer modified. The cited work, by contrast, deals only with interfaces that are typically similar or standardized. The testbench is also potentially easier to generate using LLMs because it only requires interface detection and module instantiations, while related work depends on more advanced code structures and more context, such as interactions between pipeline stages.

The cited work also evaluated the applicability of the simulation method to a diverse set of processors, listing all simulated cores along with the interfaces they use and the number of bugs identified after running a set of benchmark tests. The "Required Adapter" column of that table shows that six processors needed an adapter module. Among those named here, Hazard3 used an AHB-to-Wishbone adapter, while RVX, Kronos, and RS5 required a Wishbone-to-Pipelined-Wishbone adapter, which introduces a one-cycle delay in the data path.

Co-simulation in processor verification: deployment options

The two main components of a processor-verification co-simulation are the reference model (REF) and the DUT. In general, the REF is implemented in software for flexibility and ease of maintenance, while the DUT is deployed through one of three approaches: RTL simulation, hardware emulation, or FPGA prototyping. Each option presents a different trade-off between simulation speed, debuggability, and deployment cost.

Depending on the DUT deployment platform, processor co-simulation falls into two classes:

  • Software-based co-simulation deploys both the REF and the DUT within a software environment, typically using RTL simulators such as Verilator or Synopsys VCS. The DUT is translated into a high-level programming representation (e.g., C++), forming a directed graph where each node simulates a hardware signal. This method offers full design visibility and facilitates fine-grained debugging. However, the simulation speed is severely limited, typically reaching only a few kHz, making software-based co-simulation impractical for verification requiring billions of test cycles.
  • Hardware-accelerated co-simulation addresses the speed limitation by deploying the DUT onto hardware acceleration platforms, such as emulators (e.g., Cadence Palladium, Synopsys ZeBu, Siemens Veloce) or FPGAs. These platforms directly map the DUT onto physical hardware components, faithfully reproducing its behavior at much higher speeds—often achieving orders-of-magnitude improvements. The REF remains in software, preserving its flexibility. However, this deployment across hardware and software introduces communication overhead.

The hardware-accelerated platforms themselves speed up DUT simulation by 300×–10000×, but overall co-simulation speedup is still limited to 2.5×–20× because more than 98% of co-simulation time is consumed by communication overhead.

LogGP-based communication-overhead model

The communication overhead in hardware-accelerated co-simulation can be modeled by the LogGP model, where the overall latency between the FPGA/emulator and the software is decomposed into three stages:

  1. Communication startup. This stage involves handshake and synchronization for each communication invocation, necessary to establish a data connection between the asynchronously running hardware and software. For example, the Cadence Palladium emulator performs hardware-software synchronization at every DPI-C function call, while FPGA platforms rely on valid-ready handshakes as dictated by protocols like XDMA. The startup overhead is primarily determined by the communication frequency ($N_{invokes}$) and the per-invocation latency ($T_{sync}$).
  2. Data transmission. After the connection is established, data is transmitted over the hardware-software link in fixed-length protocol frames, and each frame incurs transmission and propagation delay. The transmission overhead scales with the total data volume ($\text{bytes}$) and the available bandwidth ($BW$).
  3. Software processing. On the host side, software must receive data from buffers, drive the REF to execute the same instructions as the DUT (synchronize non-deterministic behavior such as external interrupts, if any), and compare their states for verifying correctness. In traditional step-and-compare strategies, hardware emulation pauses its clock until software processing completes. This part of latency is abstracted as $T_{software}$.

The overall communication overhead is expressed as

$$\text{Overhead} = N_{invokes} \cdot T_{sync} + \text{bytes} / BW + T_{software}.$$
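To make the formula concrete, the sketch below evaluates it for one DUT cycle. The 15 communications and 1.2 KB per cycle are the DiffTest baseline figures quoted in this article (1.2 KB is taken as 1200 bytes); the values chosen for `T_SYNC`, `BANDWIDTH`, and `T_SOFTWARE` are placeholders picked only for illustration, not measurements of any platform.

```python
def communication_overhead(n_invokes, t_sync, n_bytes, bandwidth, t_software):
    """Overhead = N_invokes * T_sync + bytes / BW + T_software (times in seconds)."""
    return n_invokes * t_sync + n_bytes / bandwidth + t_software


# Placeholder parameters (assumptions, not measured values).
T_SYNC = 1e-6       # seconds per communication startup
BANDWIDTH = 1e9     # bytes per second
T_SOFTWARE = 2e-6   # seconds of host-side processing

# Baseline: about 15 communications carrying about 1.2 KB in total per cycle.
baseline = communication_overhead(15, T_SYNC, 1200, BANDWIDTH, T_SOFTWARE)

# Same data volume sent in a single invocation instead of 15.
single_transfer = communication_overhead(1, T_SYNC, 1200, BANDWIDTH, T_SOFTWARE)

print(f"baseline:        {baseline * 1e6:.2f} us per cycle")
print(f"single transfer: {single_transfer * 1e6:.2f} us per cycle")
```

Under these placeholder values the startup term dominates the baseline, which is the kind of situation in which lowering $N_{invokes}$ pays off more than raising bandwidth. With different parameters a different term may dominate.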

DiffTest co-simulation and the DiffTest-H acceleration

The DiffTest co-simulation framework covers 32 verification events. For each verification event, a handshake is required to start up communication, and then the DUT's architectural state is transferred to the REF for comparison, resulting in around 15 communications and 1.2 KB of transmitted data per cycle. With this baseline, the state-of-the-art Fromajo framework achieves only 1 MHz co-simulation speed on a 100 MHz FPGA.

DiffTest-H is a semantic-aware, hardware-accelerated co-simulation framework built on top of DiffTest. It introduces three techniques that reduce communication overhead while preserving instruction-level debuggability:

  1. Batch minimizes communication frequency by tightly packing structurally diverse verification events into a single transfer. Leveraging structural semantics, Batch computes the offset length of each valid event on hardware for tight packing, while the software parses packed events according to their data structures.
  2. Squash reduces data transmission volume by fusing verification events with a decoupled checking order. Leveraging order semantics, Squash allows non-deterministic events (NDEs) to be transmitted ahead with order tags, while other events continue to be fused; the software then reorders events by these tags to restore the required checking order.
  3. Replay preserves instruction-level debuggability by reprocessing the original, unfused verification events around the failure point. Leveraging behavioral semantics, Replay reprocesses only the unfused verification events around a failure rather than rerunning the entire DUT, enabling lightweight instruction-level debugging.
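The packing idea behind Batch can be illustrated in software with a small sketch: events of different types and sizes are laid out back to back in one buffer, and the receiver recovers them by looking up each event's layout from its type tag. The event types, field layouts, and function names below are hypothetical and are not DiffTest-H's actual formats; in DiffTest-H the packing side runs in hardware.

```python
import struct

# Hypothetical event layouts: type id -> struct format of the payload.
# "<" selects little-endian with no padding, so events pack tightly.
EVENT_FORMATS = {
    1: "<QI",  # e.g. instruction commit: pc (u64), instruction word (u32)
    2: "<BQ",  # e.g. register write-back: register index (u8), value (u64)
    3: "<Q",   # e.g. trap: cause (u64)
}


def pack_batch(events):
    """Pack (type_id, fields) tuples into one contiguous buffer."""
    buf = bytearray()
    for type_id, fields in events:
        buf += struct.pack("<B", type_id)                    # 1-byte type tag
        buf += struct.pack(EVENT_FORMATS[type_id], *fields)  # variable-size payload
    return bytes(buf)


def parse_batch(buf):
    """Recover the event list by walking the buffer with per-type sizes."""
    events = []
    offset = 0
    while offset < len(buf):
        (type_id,) = struct.unpack_from("<B", buf, offset)
        offset += 1
        fmt = EVENT_FORMATS[type_id]
        events.append((type_id, struct.unpack_from(fmt, buf, offset)))
        offset += struct.calcsize(fmt)
    return events
```

Only valid events occupy space in the buffer, so a cycle with few events produces a short transfer, and all events share one communication startup.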

Together, the three techniques map onto the overhead model as follows: Batch addresses the communication-startup frequency, Squash addresses the data-transmission volume, and Replay restores the debuggability that prior fusing work had traded away for speed. DiffTest-H is implemented and evaluated within the DiffTest co-simulation framework to verify XiangShan, a 6-wide out-of-order RISC-V processor, covering 32 types of verification events.
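The order-tag idea in Squash can be sketched as follows. This is a deliberately simplified model, not DiffTest-H's implementation: it assumes only two event kinds (`"commit"` and `"nde"`), fuses commits into a count plus the last program counter, and tags each non-deterministic event with the number of commits that precede it. All names are hypothetical.

```python
def squash(events):
    """Fuse commit events; send NDEs ahead, each tagged with its position.

    events: list of ("commit", pc) or ("nde", payload) in program order.
    Returns (ndes, fused) where ndes is a list of (order_tag, payload)
    and fused is (commit_count, last_pc).
    """
    ndes = []
    commits = 0
    last_pc = None
    for kind, data in events:
        if kind == "nde":
            ndes.append((commits, data))  # tag = commits seen before this NDE
        else:
            commits += 1
            last_pc = data
    return ndes, (commits, last_pc)


def restore_order(ndes, fused):
    """Software side: rebuild the checking schedule from the order tags.

    Returns a list of ("step", n) and ("nde", payload) actions for the REF.
    """
    schedule = []
    done = 0
    for tag, payload in sorted(ndes, key=lambda item: item[0]):
        if tag > done:
            schedule.append(("step", tag - done))
            done = tag
        schedule.append(("nde", payload))
    commits, _last_pc = fused
    if commits > done:
        schedule.append(("step", commits - done))
    return schedule
```

The point of the tags is that an NDE no longer forces the fused commit stream to be cut and transmitted early; the software can still apply the NDE to the REF at the right instruction boundary.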

Role in cross-level RISC-V processor verification

In the cross-level RISC-V processor-verification evidence, co-simulation integrates an ISS with an RTL core so that randomized instruction streams can be executed while the ISS acts as a reference model. The cited approach describes a tight co-simulation setting in which the ISS reference model is used alongside an endless, unrestricted, randomized instruction stream whose generation evolves at runtime from observed coverage information.

The earlier baseline described in the same evidence integrates the ISS and RTL core into an efficient co-simulation compiled into a single binary with in-memory communication. This setup supports unrestricted instruction generation, including arbitrary load/store and CSR combinations and infinite loops.

Co-simulation in processor fuzzing (MorFuzz)

MorFuzz applies an online co-simulation approach for state verification, using an ISA simulator running in parallel with the DUT as the reference model. The ISA simulator and the DUT execute the same inputs, so the correctness of the DUT's state can be checked by comparing their states after each instruction is executed.

In MorFuzz's architecture overview, the synchronizable co-simulation component synchronizes the legal differences between models. During simulation, MorFuzz uses a simulator to co-simulate with the DUT and compares the architectural state of the DUT and the simulator after each instruction is executed to check correctness. Because co-simulation can also pinpoint which instruction caused a state mismatch, MorFuzz can further analyze whether the difference is legal and, if so, synchronize the correct state from the DUT to the simulator, eliminating the mismatch. This framework allows the simulator to co-simulate synchronously with the DUT, which directs the fuzzer toward deeper states.

Compatible co-simulation and delayed write-back

Existing co-simulation work has assumed that the write-back data is always ready when the DUT commits instructions. MorFuzz notes that this assumption is not always true because of microarchitectural differences between processors; for example, the Rocket core supports delayed write-back, meaning that the write-back data of long-latency instructions (such as multiply, divide, and floating-point instructions) may not be ready at the commit stage.

To accommodate these different microarchitectures, MorFuzz abstracts the state comparison process into two stages:

  1. Commitment stage. The DUT first commits its program counter and the executed instruction. Once the simulator receives the commit request, it executes the next instruction and then checks that the executed instruction is consistent with the one committed by the DUT. If the check passes, the simulator records its reference write-back data to a scoreboard.
  2. Judgment stage. This stage starts after the write-back data of the earlier instruction is ready. MorFuzz compares the DUT's write-back value with the reference value in the scoreboard to determine whether the instruction was executed correctly.

This split accommodates both cases in which the commit and judgment fire simultaneously and cases in which the write-back is delayed, while still allowing MorFuzz to report a potential bug when mismatched behavior is detected.
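The two stages can be sketched as a small checker class. This is an illustrative model rather than MorFuzz's code: the class and method names are hypothetical, the reference model is reduced to a `ref_step` callable that returns `(pc, insn, rd, value)` for the next instruction, and the scoreboard is assumed to be a per-register FIFO.

```python
from collections import defaultdict, deque


class TwoStageChecker:
    """Commit/judgment split with a scoreboard for delayed write-backs."""

    def __init__(self, ref_step):
        self.ref_step = ref_step              # executes one reference instruction
        self.scoreboard = defaultdict(deque)  # rd -> pending reference values

    def commit(self, pc, insn):
        """Commitment stage: step the reference and check pc and instruction."""
        ref_pc, ref_insn, rd, value = self.ref_step()
        if (ref_pc, ref_insn) != (pc, insn):
            return False
        if rd is not None:
            self.scoreboard[rd].append(value)  # remember expected write-back
        return True

    def judge(self, rd, dut_value):
        """Judgment stage: compare the DUT write-back with the recorded value.

        Raises IndexError if no write-back is pending for rd.
        """
        expected = self.scoreboard[rd].popleft()
        return dut_value == expected
```

When write-back data is available at commit, the caller can invoke `judge` immediately after `commit`; for a long-latency instruction, `judge` is simply called later, after younger instructions have already committed.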

Typical components in the cross-level flow

The cross-level flow described in the cross-level evidence contains separate instruction-generation paths for the RTL core and ISS, a core adapter, separate RTL and ISS memories, a coverage observer, an instruction injector, and a comparator. The comparator is responsible for finding functional differences between the RTL core and ISS by comparing register values. Because the RTL core and ISS do not have identical timing behavior, the comparator logs value changes and compares changes at the same position; if it detects a difference, it terminates the simulation.
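A comparator of this kind can be sketched with two change logs and a shared cursor. The sketch is an assumption-laden illustration, not the cited implementation: the class and exception names are hypothetical, each change is modeled as a `(register, value)` pair, and terminating the simulation is modeled by raising an exception.

```python
class TraceMismatch(Exception):
    """Raised to stop the simulation when the two logs disagree."""


class ChangeLogComparator:
    """Compare RTL and ISS register changes by log position, not by time."""

    def __init__(self):
        self.rtl_log = []
        self.iss_log = []
        self.checked = 0  # number of log positions already compared

    def record_rtl(self, reg, value):
        self.rtl_log.append((reg, value))
        self._compare()

    def record_iss(self, reg, value):
        self.iss_log.append((reg, value))
        self._compare()

    def _compare(self):
        # Compare every position for which both sides have logged a change.
        while self.checked < min(len(self.rtl_log), len(self.iss_log)):
            rtl = self.rtl_log[self.checked]
            iss = self.iss_log[self.checked]
            if rtl != iss:
                raise TraceMismatch(
                    f"position {self.checked}: RTL {rtl} != ISS {iss}"
                )
            self.checked += 1
```

Because comparison is driven by position in the log, it does not matter that the ISS typically produces its changes many cycles before or after the pipelined RTL core does.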

A core adapter handles micro-architectural effects such as pipelining, prefetching, and fetch buffering. It checks for addresses that were not fetched by the ISS, fills them with randomized values not generated by the instruction generator, and forwards them to the RTL core.

Coverage-guided co-simulation

The cross-level evidence describes a coverage observer that monitors the ISS internal state, samples executed instructions, and maps them to coverage points. Coverage points are defined as a cross-product of instruction groups, and the coverage observer also performs coverage aging and gives hints to the instruction injector when functionality should be covered again. This makes the co-simulation part of a feedback loop in which the ISS execution state supplies coverage information that guides later instruction injection, smoothing the coverage distribution of randomized instruction streams over time and helping find corner-case bugs in the RTL core.
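The feedback loop can be illustrated with a small observer. Several details here are assumptions made only for the sketch: a coverage point is taken to be a pair of instruction groups for two consecutive instructions, aging is modeled as multiplying every hit count by a decay factor, and the hint is simply the currently least-covered point. The class and method names are hypothetical.

```python
from itertools import product


class CoverageObserver:
    """Track cross-product coverage of instruction groups with aging."""

    def __init__(self, groups, decay=0.5):
        # One coverage point per ordered pair of groups (cross-product).
        self.hits = {point: 0.0 for point in product(groups, repeat=2)}
        self.decay = decay
        self.prev = None

    def sample(self, group):
        """Record the group of an instruction executed by the ISS."""
        if self.prev is not None:
            self.hits[(self.prev, group)] += 1.0
        self.prev = group

    def age(self):
        """Let old hits fade so that points need to be covered again."""
        for point in self.hits:
            self.hits[point] *= self.decay

    def hint(self):
        """Suggest the least-covered point to the instruction injector."""
        return min(self.hits, key=self.hits.get)
```

An injector could call `hint()` periodically and bias the next instructions toward the returned pair, while `age()` keeps long-unvisited points from looking permanently covered.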

Evaluation setup in the cross-level evidence

For evaluation, the cross-level processor-verification work used a 32-bit pipelined RISC-V core from the MINRES The Good Core (TGC) series as the device under test and used the ISS of an open-source SystemC-based RISC-V virtual platform as the reference ISS. To enable co-simulation, the industrial RTL core was translated to C++ using Verilator and integrated into a SystemC test bench together with the ISS.

Co-simulation in RISC-V vector-accelerator verification

Co-simulation has also been applied to verify RISC-V vector accelerators in a reusable UVM-based verification environment targeting different projects with different scalar cores and scalar-to-vector interfaces. The environment follows the Universal Verification Methodology (UVM) premise of building a modular, scalable and reusable testbench, and uses object-oriented programming to favor reuse of modules across projects.

The environment consists of two main parts:

  • An interface-agnostic base environment shared among projects which:
    • Generates vector instructions using an Instruction Set Simulator (ISS) that mimics a project-specific scalar core.
    • Compares results between the ISS and the DUT instruction-by-instruction.
    • Provides a continuous-integration environment with sanity checks, random test generation, and coverage collection.
  • A project-specific environment that implements the behavior of the interface connecting the ISS and the vector accelerator. Communication with the interface-agnostic environment is accomplished using polymorphism.

The base environment integrates Spike as the golden reference model: Spike mimics the scalar core by sending vector instructions to the vector accelerator and runs the same vector instructions itself; the resulting architectural state is then compared against the DUT. The base environment is designed so that any ISA simulator can be used as the reference model by declaring a set of pure virtual methods in a wrapper class that must be implemented by the desired ISS; the wrapper class is overridden during the build phase of the UVM test via UVM's factory-override capabilities.
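The wrapper pattern can be shown by analogy in Python, where an abstract base class plays the role of the wrapper with pure virtual methods and a class attribute stands in for a factory override. The actual environment is written in SystemVerilog/UVM; every name below (`IssWrapper`, `BaseEnvironment`, `StubIss`, and the two methods) is hypothetical and chosen only to illustrate the structure.

```python
from abc import ABC, abstractmethod


class IssWrapper(ABC):
    """Interface the base environment expects from any reference ISS."""

    @abstractmethod
    def step(self):
        """Execute one instruction and return its encoding."""

    @abstractmethod
    def read_vreg(self, index):
        """Return the contents of vector register `index` as bytes."""


class BaseEnvironment:
    """Interface-agnostic environment; only talks to IssWrapper."""

    iss_class = None  # overridden per project or per test

    def build(self):
        if self.iss_class is None:
            raise RuntimeError("no ISS wrapper registered")
        self.iss = self.iss_class()


class StubIss(IssWrapper):
    """Trivial stand-in for a real ISS binding (illustration only)."""

    def step(self):
        return 0x00000013

    def read_vreg(self, index):
        return bytes(16)


class StubEnvironment(BaseEnvironment):
    iss_class = StubIss  # analogous to overriding the wrapper type
```

Because `BaseEnvironment` depends only on the abstract interface, swapping in a different ISS, or a different version of the same ISS, does not require touching the comparison logic.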

This approach has been validated across two projects:

  • The European Processor Initiative (EPI), which uses the Open Vector Interface (OVI). In EPI, memory access is done in the scalar core and data is transmitted to the vector unit through OVI; vector configuration CSRs are handled in the scalar core and transmitted through OVI.
  • The eProcessor EuroHPC project, which uses a custom interface. In eProcessor, the vector unit accesses memory directly (using the AMBA CHI protocol), memory-disambiguation checks need to be done using the scalar-core/accelerator interface to comply with the weak memory ordering model, and the accelerator itself handles vector configuration CSRs.

Because the two accelerators support different versions of the RISC-V Vector (RVV) ISA extension, the infrastructure supports using distinct Spike versions as reference models (for driving vector instructions and for results comparison). This illustrates how co-simulation in the accelerator domain is structured so that the project-specific interface implementation is the only project-dependent component, while the comparison infrastructure, instruction-by-instruction checking, and continuous-integration pipeline are reused.

A previous verification environment targeting the same accelerator had a different UVM agent for each channel of the interface, which involved massive inter-process communication and lacked encapsulation, making it difficult to maintain, extend, and reuse across projects. The current reusable environment replaces that with a shared, interface-agnostic base.

Other application contexts

Co-simulation is not limited to processor or accelerator verification. Public-context evidence shows that:

  • Distributed FMI-based co-simulation has been used as a mechanism for collaborative modeling and simulation by different stakeholders while implicitly helping protect intellectual property; that work proposes an approach on top of UniFMU with additional cybersecurity and IP-protection mechanisms, ensuring that the connection is initiated by the client and that models and binaries live on trusted platforms. The trade-off between IP-protected distribution and performance efficiency was analyzed across four different network settings using two co-simulation demos.
  • Co-simulation for cyber-security analysis of power systems has been used to couple the DIgSILENT PowerFactory power-system simulator, the OMNeT++ communication-network simulator, and Matlab energy-management-system applications in order to simulate data attacks and assess the vulnerability of the energy management system from an integral perspective.

CITATIONS

16 sources
16 citations
[1] In co-simulation, both the RTL implementation and a golden ISS run the same test program and generate per-instruction execution traces; if the traces differ, the cause must be investigated because the mismatch may indicate a bug. Large-Scale RISC-V Processor Verification Using Automated Generation
[2] Dromajo is an ISS used as the golden model in a co-simulation setup to verify three RISC-V processors; it relies on Direct Programming Interface (DPI) calls to synchronize the RTL implementation with the reference, and the instruction-completion detection mechanism for one of the cores modifies the processor code to monitor the reorder buffer and invoke the ISS at each commit. Large-Scale RISC-V Processor Verification Using Automated Generation
[3] RVFI-Direct Instruction Injection (RVFI-DII) is an extension of the RISC-V Formal Interface that injects instructions directly into the processor's fetch interface; it was used to verify five additional processors but requires highly processor-specific code (e.g., the Flute in-order scalar core used an instruction ID attached to the program counter propagated through each pipeline stage). Large-Scale RISC-V Processor Verification Using Automated Generation
[4] The RISC-V Formal Framework defines the RISC-V Formal Interface (RVFI), a generic interface usable in formal and alternative methods; Synopsys ImperasDV defines a similar generic interface called the RISC-V Verification Interface (RVVI). Large-Scale RISC-V Processor Verification Using Automated Generation
[5] In the cited large-scale RISC-V verification work, CVA6 was verified with the Dromajo-based approach in 280 lines of code; Ibex required more than 450 lines of code using RVFI-DII; Hazard3 was verified in 270 lines of code using the proposed method; both cited studies depend on modifying structures unique to the cores (Ibex pipeline stages, CVA6 reorder buffer), while the proposed testbench deals only with interfaces that are typically similar or standardized. Large-Scale RISC-V Processor Verification Using Automated Generation
[6] Six processors in the large-scale evaluation needed an adapter module: Hazard3 used an AHB-to-Wishbone adapter; RVX, Kronos, and RS5 required a Wishbone-to-Pipelined-Wishbone adapter that introduced a one-cycle delay in the data path. Large-Scale RISC-V Processor Verification Using Automated Generation
[7] The DiffTest co-simulation framework covers 32 verification events; each verification event requires a handshake to start up communication, and then the DUT's architectural state is transferred to the REF for comparison, resulting in around 15 communications and 1.2 KB of transmitted data per cycle; the state-of-the-art Fromajo framework achieves only 1 MHz co-simulation speed on a 100 MHz FPGA. DiffTest-H: Toward Semantic-Aware Communication in Hardware-Accelerated Processor Verification
[8] DiffTest-H introduces three techniques — Batch, Squash, and Replay — which target the three LogGP phases of communication-overhead in hardware-accelerated co-simulation: Batch addresses communication-startup frequency by tightly packing structurally diverse verification events into a single transfer; Squash addresses data-transmission volume by fusing verification events with a decoupled checking order, allowing NDEs to be transmitted ahead with order tags while other events continue to be fused and the software reorders them; Replay preserves instruction-level debuggability by reprocessing only the unfused verification events around a failure rather than rerunning the entire DUT. DiffTest-H: Toward Semantic-Aware Communication in Hardware-Accelerated Processor Verification
[9] DiffTest-H is implemented and evaluated within the DiffTest co-simulation framework to verify XiangShan, a 6-wide out-of-order RISC-V processor, covering 32 types of verification events. DiffTest-H: Toward Semantic-Aware Communication in Hardware-Accelerated Processor Verification
[10] The reusable UVM verification environment for RISC-V vector accelerators consists of an interface-agnostic base environment shared among projects (which generates vector instructions using an ISS mimicking a project-specific scalar core, compares results instruction-by-instruction between the ISS and the DUT, and provides a continuous-integration environment with sanity checks, random test generation, and coverage collection) and a project-specific environment that implements the interface behavior between the ISS and the vector accelerator using polymorphism. Reusable Verification Environment for a RISC-V Vector Accelerator
[11] The base verification environment uses Spike as the golden reference model to mimic the scalar core, send vector instructions to the vector accelerator, run the same vector instructions, and compare results; the environment supports any ISA simulator by declaring a set of pure virtual methods in a wrapper class that is overridden during the UVM build phase via factory-override capabilities. Reusable Verification Environment for a RISC-V Vector Accelerator
[12] The reusable vector-accelerator verification environment has been validated across two projects — the European Processor Initiative (EPI), which uses the Open Vector Interface (OVI) and where memory access is done in the scalar core with vector configuration CSRs transmitted through OVI, and the eProcessor EuroHPC project, which uses a custom interface where the vector unit accesses memory directly via the AMBA CHI protocol, memory-disambiguation checks are done using the scalar-core/accelerator interface to comply with the weak memory ordering model, and the accelerator itself handles vector configuration CSRs. Reusable Verification Environment for a RISC-V Vector Accelerator
[13] The infrastructure supports using distinct Spike versions as reference models in order to handle different versions of the RISC-V Vector (RVV) ISA extension supported by the two accelerators. Reusable Verification Environment for a RISC-V Vector Accelerator
[14] A previous verification environment for the same vector unit had a different UVM agent for each channel of the interface, involved massive inter-process communication, and lacked encapsulation, making it difficult to maintain, extend, and reuse across projects. Reusable Verification Environment for a RISC-V Vector Accelerator
[15] Distributed co-simulation plays a key role in enabling collaborative modeling and simulation by different stakeholders while protecting their intellectual property; an FMI-based approach on top of UniFMU proposes enhanced cybersecurity and IP-protection mechanisms that ensure the connection is initiated by the client and that models and binaries live on trusted platforms, with functionality demonstrated in two co-simulation demos across four network settings. FMI-Based Distributed Co-Simulation with Enhanced Security and Intellectual Property Safeguards
[16] Co-simulation has been used as a platform for cyber-security analysis of energy management systems (EMS) by coupling the DIgSILENT PowerFactory power-system simulator with the OMNeT++ communication-network simulator and Matlab for EMS applications (state estimation, optimal power flow), enabling attack simulations against a power grid test case. Co-simulation for Cyber Security Analysis: Data Attacks against Energy Management System

VERSION HISTORY

v9 · 6/14/2026 · minimax/minimax-m3 (current)
v8 · 6/8/2026 · minimax/minimax-m3
v7 · 6/6/2026 · minimax/minimax-m3
v6 · 6/6/2026 · minimax/minimax-m3
v5 · 5/30/2026 · gpt-5.5
v4 · 5/28/2026 · gpt-5.5
v3 · 5/27/2026 · gpt-5.5
v2 · 5/27/2026 · gpt-5.5
v1 · 5/25/2026 · gpt-5.5