STIMSMITH

Crafting a Million Instructions/Sec RISCV-DV

Paper WIKI v1 · 5/26/2026

“Crafting a Million Instructions/Sec RISCV-DV” is a technical paper by Puneet Goel, Ritu Goel, and Jyoti Dahiya on accelerating RISCV-DV, an open-source RISC-V random instruction generator, through a parallelized Embedded UVM implementation and other high-performance-computing techniques. The paper reports a multicore UVM testbench implementation capable of generating millions of constrained-randomized RISC-V instructions per second, over 100× faster than the original SystemVerilog UVM RISCV-DV implementation.

Overview

“Crafting a Million Instructions/Sec RISCV-DV” is a paper subtitled “HPC Techniques to Boost UVM Testbench Performance by Over 100x.” The listed authors are Puneet Goel of Incore Semiconductors, and Ritu Goel and Jyoti Dahiya of Coverify Systems Technology. The paper focuses on improving the performance of RISCV-DV, described in the paper as an open-source random instruction generator widely used for functional verification of RISC-V cores. [C1][C2]

The paper presents a parallelized RISCV-DV port coded in Embedded UVM (eUVM) and explores techniques for improving testbench performance. Its abstract reports a multicore UVM testbench implementation that generates millions of constrained-randomized RISC-V instructions per second, achieving a speedup of over 100× compared with the original RISCV-DV implementation coded in SystemVerilog UVM. [C3]

Motivation

The paper frames the problem in the context of processor verification. It states that high-end RISC CPU cores include complex architectural features such as instruction re-ordering, pipelines, branch prediction, and hyperthreading, and that their functional verification can require generation of approximately 10^15 constrained-random instructions. The paper further states that the original SV/UVM RISCV-DV project generates about 10,000 instructions per second; at that rate, 10^15 instructions would take on the order of 10^11 seconds, an impractically long generation time. [C4]
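
To make the scale concrete, the short calculation below converts the two quoted figures into wall-clock time. It is only a back-of-the-envelope sketch: the function name `generation_years` is made up for illustration, a single generator running at a constant rate is assumed, and the 1,000,000 instructions/sec target is taken from the paper's title rather than from a specific measurement.

```python
# Back-of-the-envelope generation time from the figures quoted above.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year: 31,557,600 seconds


def generation_years(total_instructions: float, rate_per_second: float) -> float:
    """Wall-clock years needed to generate total_instructions at a constant rate."""
    return total_instructions / rate_per_second / SECONDS_PER_YEAR


TOTAL = 1e15          # constrained-random instructions (paper's estimate)
BASELINE_RATE = 1e4   # original SV/UVM RISCV-DV, instructions per second
TARGET_RATE = 1e6     # assumed target: "a million instructions/sec"

baseline_years = generation_years(TOTAL, BASELINE_RATE)
target_years = generation_years(TOTAL, TARGET_RATE)

print(f"baseline generator : {baseline_years:,.0f} years")
print(f"1M instr/sec       : {target_years:,.1f} years")
print(f"ratio              : {baseline_years / target_years:.0f}x")
```

Under these assumptions the baseline works out to roughly three thousand years for one generator, and a 100× faster generator to roughly three decades.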

The introduction also situates the work within a shift toward high-performance computing and multicore software. It describes the end, around 2005, of the frequency improvements driven by Dennard scaling and argues that software developers thereafter needed multicore-enabled concurrent programs to improve performance. The paper applies that reasoning to verification testbenches, noting that while modern HDL simulators can parallelize RTL simulation, comparatively little had been done for multicore testbench parallelization. [C5]

Profiling approach

For profiling, the paper describes the use of uvm_trace instrumentation. In the example shown, the wall-clock time appears in square brackets after the UVM_TRACE tag and records the moment at which the trace call was made. Because RISCV-DV is highly parameterized and its execution time varies with user-selected parameters, the authors chose a comprehensive riscv_instr_base_test with seven directed streams covering the spectrum of possible RISC-V instruction categories. [C6]
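
Timestamped traces of this kind can be turned into per-phase durations by subtracting consecutive timestamps. The sketch below shows the idea under loud assumptions: the exact uvm_trace output format is not given here, so the line layout, the unit (seconds), the phase labels, and every number in `SAMPLE_LOG` are invented for illustration and are not measurements from the paper.

```python
import re

# Hypothetical trace lines. The layout, units, labels, and numbers are all
# assumptions made for this sketch; they are not output from the paper.
SAMPLE_LOG = """\
UVM_TRACE [0.120] phase one begin
UVM_TRACE [4.870] phase two begin
UVM_TRACE [7.310] phase three begin
UVM_TRACE [9.050] phase four begin
UVM_TRACE [9.600] done
"""

# Tag, then a wall-clock timestamp in square brackets, then a free-form label.
TRACE_RE = re.compile(r"UVM_TRACE\s*\[(\d+(?:\.\d+)?)\]\s*(.*)")


def phase_durations(log_text):
    """Return (label, seconds) pairs: time from each trace to the next one."""
    stamps = []
    for line in log_text.splitlines():
        match = TRACE_RE.search(line)
        if match:
            stamps.append((float(match.group(1)), match.group(2).strip()))
    return [
        (label, round(next_time - time, 6))
        for (time, label), (next_time, _) in zip(stamps, stamps[1:])
    ]


durations = phase_durations(SAMPLE_LOG)

# Print the phases from most to least expensive.
for label, seconds in sorted(durations, key=lambda d: d[1], reverse=True):
    print(f"{seconds:8.3f} s  {label}")
```

Sorting the differences by size gives a ranking of phases by cost, which is the kind of ordering a bottleneck list relies on.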

Identified bottlenecks

The paper identifies four major RISCV-DV generator bottlenecks, ordered by decreasing impact on total execution time: [C7]

  1. Directed instruction stream generation and randomization. Directed streams are created and randomized according to command-line ratios for subprograms and the main function, with most time attributed to constraint solver execution.
  2. Non-directed instruction stream generation. RISCV-DV dumps a large non-directed instruction stream into instruction lists for the main program and subprograms; this step also spends most of its time on randomization and constraint solving.
  3. Insertion of directed streams into non-directed streams. This bottleneck worsens as instruction count increases and is described as having O(n²) algorithmic complexity.
  4. Assembly string formatting. The generator repeatedly invokes $sformatf while dumping thousands of instructions, causing many memory allocations.

The paper treats the first two bottlenecks as linear constraint-solving workloads that can scale with multicore parallelization. It also notes that the fourth bottleneck is linear but hindered by frequent memory allocation, and that reducing calls to malloc can improve execution time and enable more scalable parallelization. [C8]
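
A linear randomization workload parallelizes naturally when it is split into independent chunks, each with its own random source. The sketch below illustrates that structure only; it is not the paper's eUVM implementation. The names `generate_chunk` and `generate_stream`, the five-opcode subset, and the seeding scheme are assumptions, and a plain random choice stands in for real constraint solving. Threads are used so the example runs anywhere; in CPython they will not speed up CPU-bound work, so a process pool or a compiled language would be needed for an actual gain.

```python
import random
from concurrent.futures import ThreadPoolExecutor

OPCODES = ("add", "sub", "and", "or", "xor")  # tiny illustrative subset


def generate_chunk(seed, count):
    """Generate `count` random instructions from a private, seeded RNG."""
    rng = random.Random(seed)  # no shared state between chunks
    chunk = []
    for _ in range(count):
        op = rng.choice(OPCODES)
        rd, rs1, rs2 = (rng.randrange(32) for _ in range(3))
        chunk.append(f"{op} x{rd}, x{rs1}, x{rs2}")
    return chunk


def generate_stream(base_seed, chunks, chunk_size, workers):
    """Split the stream into independent chunks and generate them concurrently."""
    seeds = [base_seed + i for i in range(chunks)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so the output is stable.
        parts = list(pool.map(generate_chunk, seeds, [chunk_size] * chunks))
    return [instr for part in parts for instr in part]


serial = generate_stream(base_seed=1, chunks=8, chunk_size=100, workers=1)
parallel = generate_stream(base_seed=1, chunks=8, chunk_size=100, workers=4)

print(len(parallel), "instructions; identical to serial run:", serial == parallel)
```

Deriving each chunk's seed from its index, rather than from the worker that happens to run it, keeps the generated stream reproducible regardless of the worker count.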

Algorithmic optimization target

The paper gives special attention to the third bottleneck: merging directed instruction streams into the initial random dump of undirected RISC-V instructions. It explains that instructions in a directed sequence are tagged as atomic so the sequence can be identified and kept intact. The original RISCV-DV merge implementation is described as greedy: it randomly picks an injection location, and if that location lies inside another directed sequence, it repeatedly chooses another random location until the placement does not violate an existing inserted sequence. [C9]
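
The retry-based merge described above can be sketched as follows, next to one possible non-retrying alternative. Both functions are illustrative assumptions: `greedy_merge` is a reading of the description here, not a port of the RISCV-DV source, and `single_pass_merge` is not taken from the paper. In this sketch each entry is tagged with the index of its directed stream (or `None`), which plays the role of the atomic tag.

```python
import random


def greedy_merge(base, directed_streams, rng):
    """Retry-based merge: pick a random slot, retry if it splits a directed stream."""
    merged = [(instr, None) for instr in base]  # (instruction, stream id or None)
    for sid, stream in enumerate(directed_streams):
        while True:
            pos = rng.randrange(len(merged) + 1)
            # "Inside" means both neighbours belong to the same directed stream.
            inside = (
                0 < pos < len(merged)
                and merged[pos - 1][1] is not None
                and merged[pos - 1][1] == merged[pos][1]
            )
            if not inside:
                break
        merged[pos:pos] = [(instr, sid) for instr in stream]  # O(n) list insert
    return merged


def single_pass_merge(base, directed_streams, rng):
    """Alternative sketch: draw every gap up front, then merge in one pass."""
    order = list(range(len(directed_streams)))
    rng.shuffle(order)  # avoid always placing stream 0 first
    # Gap g sits just before base[g]; gap len(base) is the end of the program.
    gaps = sorted(rng.randrange(len(base) + 1) for _ in order)
    merged, nxt = [], 0
    for g in range(len(base) + 1):
        while nxt < len(gaps) and gaps[nxt] == g:
            sid = order[nxt]
            merged.extend((instr, sid) for instr in directed_streams[sid])
            nxt += 1
        if g < len(base):
            merged.append((base[g], None))
    return merged


def streams_intact(merged, directed_streams):
    """True if every directed stream appears as one contiguous, ordered run."""
    for sid, stream in enumerate(directed_streams):
        idx = [i for i, (_, tag) in enumerate(merged) if tag == sid]
        if not idx or idx != list(range(idx[0], idx[0] + len(stream))):
            return False
        if [merged[i][0] for i in idx] != list(stream):
            return False
    return True


base = [f"nop_{i}" for i in range(20)]
streams = [["lw", "addi", "sw"], ["jal", "ret"], ["csrr", "csrw"]]
a = greedy_merge(base, streams, random.Random(0))
b = single_pass_merge(base, streams, random.Random(0))
print(streams_intact(a, streams), streams_intact(b, streams))
```

In the first function every insertion shifts the tail of the list, so inserting a number of streams proportional to n costs quadratic time in this sketch, before counting retries; whether that matches the paper's own accounting is an assumption. The second function only draws positions in the undirected stream, so it never needs a retry and appends each instruction once, though it does not produce the same placement distribution as the retry scheme.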

Because this merge step is characterized as a suboptimal algorithmic implementation with O(n²) behavior, the paper presents it as requiring a significant architectural change, to be made before the other optimization areas are addressed. [C10]

Significance

Based on the cited material, the paper’s main contribution is the combination of RISCV-DV profiling, algorithmic analysis, multicore parallelization, and memory-allocation reduction to make random RISC-V instruction generation far faster than the original SystemVerilog UVM implementation. The reported result is a multicore UVM testbench capable of generating millions of constrained-randomized RISC-V instructions per second, with more than 100× speedup over the original RISCV-DV baseline. [C3]

CITATIONS

[1] C1: The paper is titled “Crafting a Million Instructions/Sec RISCV-DV” and subtitled “HPC Techniques to Boost UVM Testbench Performance by Over 100x,” with authors Puneet Goel, Ritu Goel, and Jyoti Dahiya. [PDF] Crafting a Million Instructions/Sec RISCV-DV - DVCon Proceedings
[2] C2: RISCV-DV is described as an open-source random instruction generator widely used for functional verification of RISC-V cores. [PDF] Crafting a Million Instructions/Sec RISCV-DV - DVCon Proceedings
[3] C3: The paper describes a parallelized RISCV-DV port in Embedded UVM and reports a multicore UVM implementation generating millions of constrained-randomized RISC-V instructions per second, over 100× faster than the original SystemVerilog UVM RISCV-DV implementation. [PDF] Crafting a Million Instructions/Sec RISCV-DV - DVCon Proceedings
[4] C4: The paper states that high-end RISC CPU verification can require about 10^15 constrained-random instructions and that the original SV/UVM RISCV-DV generates about 10,000 instructions per second. [PDF] Crafting a Million Instructions/Sec RISCV-DV - DVCon Proceedings
[5] C5: The paper connects testbench performance to the post-2005 shift from frequency scaling to multicore concurrency, and notes limited multicore parallelization of testbenches compared with RTL simulation. [PDF] Crafting a Million Instructions/Sec RISCV-DV - DVCon Proceedings
[6] C6: Profiling used uvm_trace wall-clock timestamps and a riscv_instr_base_test with seven directed streams covering possible RISC-V instruction categories. [PDF] Crafting a Million Instructions/Sec RISCV-DV - DVCon Proceedings
[7] C7: The paper identifies four major RISCV-DV bottlenecks: directed stream randomization, non-directed stream generation, directed-stream insertion into non-directed streams with O(n²) behavior, and repeated formatted string generation with $sformatf. [PDF] Crafting a Million Instructions/Sec RISCV-DV - DVCon Proceedings
[8] C8: The first two bottlenecks involve linear constraint-solving workloads suitable for multicore parallelization, while reducing memory allocation in the fourth bottleneck can improve execution time and scalability. [PDF] Crafting a Million Instructions/Sec RISCV-DV - DVCon Proceedings
[9] C9: The original directed-stream merge process tags directed sequence instructions as atomic and greedily retries random insertion locations when a chosen location falls inside another directed sequence. [PDF] Crafting a Million Instructions/Sec RISCV-DV - DVCon Proceedings
[10] C10: The paper treats directed-stream insertion as a suboptimal algorithmic implementation requiring a significant architectural change. [PDF] Crafting a Million Instructions/Sec RISCV-DV - DVCon Proceedings