Crafting a Million Instructions/Sec RISCV-DV Wiki

Overview

“Crafting a Million Instructions/Sec RISCV-DV” is a paper subtitled “HPC Techniques to Boost UVM Testbench Performance by Over 100x.” The listed authors are Puneet Goel of Incore Semiconductors, and Ritu Goel and Jyoti Dahiya of Coverify Systems Technology. The paper focuses on improving the performance of RISCV-DV, described in the paper as an open-source random instruction generator widely used for functional verification of RISC-V cores. [C1][C2]

The paper presents a parallelized RISCV-DV port coded in Embedded UVM (eUVM) and explores techniques for improving testbench performance. Its abstract reports a multicore UVM testbench implementation that generates millions of constrained-randomized RISC-V instructions per second, achieving a speedup of over 100× compared with the original RISCV-DV implementation coded in SystemVerilog UVM. [C3]

Motivation

The paper frames the problem in the context of processor verification. It states that high-end RISC CPU cores include complex architectural features such as instruction re-ordering, pipelines, branch prediction, and hyperthreading, and that their functional verification can require generation of approximately 10^15 constrained-random instructions. The paper further states that the original SV/UVM RISCV-DV project generates about 10,000 instructions per second, which at that scale implies roughly 10^11 seconds of generation time, an impractical figure. [C4]
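The scale of the problem can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the 365.25-day year are assumptions made here, while the instruction count and the two generation rates (about 10,000 per second for the baseline, one million per second from the paper's title) come from the figures quoted above.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year


def generation_time_years(total_instructions: float, rate_per_sec: float) -> float:
    """Wall-clock years needed to generate the instructions on one generator."""
    return total_instructions / rate_per_sec / SECONDS_PER_YEAR


# Figures quoted from the paper: 10^15 instructions, ~10,000 instr/sec baseline.
baseline_years = generation_time_years(1e15, 1e4)
# Rate taken from the paper's title: one million instr/sec.
fast_years = generation_time_years(1e15, 1e6)

print(f"baseline: {baseline_years:,.0f} years")
print(f"1M instr/sec: {fast_years:,.1f} years")
```

On a single generator, the baseline rate works out to a little under 3,200 years, and a rate of one million instructions per second reduces that by a factor of 100.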

The introduction also situates the work within a shift toward high-performance computing and multicore software. It describes the end of Dennard-scaling-driven frequency improvements around 2005 and argues that software developers thereafter needed multicore-enabled concurrent programs to improve performance. The paper applies that reasoning to verification testbenches, noting that while modern HDL simulators can parallelize RTL simulation, comparatively little had been done for multicore testbench parallelization. [C5]

Profiling approach

For profiling, the paper describes the use of uvm_trace instrumentation. In the example shown, wall-clock time appears in square brackets after the UVM_TRACE tag, giving a time snapshot at the point where the trace was invoked. Because RISCV-DV is highly parameterized and its execution time varies with user-selected parameters, the authors chose a comprehensive riscv_instr_base_test with seven directed streams covering the spectrum of possible RISC-V instruction categories. [C6]

Identified bottlenecks

The paper identifies four major RISCV-DV generator bottlenecks, ordered by decreasing impact on total execution time: [C7]

1. Directed instruction stream generation and randomization. Directed streams are created and randomized according to command-line ratios for subprograms and the main function, with most time attributed to constraint solver execution.
2. Non-directed instruction stream generation. RISCV-DV dumps a large non-directed instruction stream into instruction lists for the main program and subprograms; this also spends most effort on randomization and constraint solving.
3. Insertion of directed streams into non-directed streams. This bottleneck worsens as instruction count increases and is described as having O(n²) algorithmic complexity.
4. Assembly string formatting. The generator repeatedly invokes $sformatf while dumping thousands of instructions, causing many memory allocations.

The paper treats the first two bottlenecks as linear constraint-solving workloads that can scale with multicore parallelization. It also notes that the fourth bottleneck is linear but hindered by frequent memory allocation, and that reducing calls to malloc can improve execution time and enable more scalable parallelization. [C8]
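The reason a linear randomization workload parallelizes well can be illustrated with a minimal sketch. It is not the paper's eUVM implementation: the opcode subset, function names, and seeding scheme are all hypothetical, and plain random choice stands in for a constraint solver. Each chunk owns its random generator, so chunks share no state and can run on any number of workers. Python threads are used only to show the structure; because of the interpreter lock they would not deliver the multicore speedup that native threads can.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List

OPCODES = ["add", "sub", "and", "or", "xor"]  # hypothetical subset for illustration


def generate_chunk(seed: int, count: int) -> List[str]:
    """Randomize one independent chunk of instructions with its own RNG."""
    rng = random.Random(seed)
    lines = []
    for _ in range(count):
        op = rng.choice(OPCODES)
        rd, rs1, rs2 = (rng.randrange(32) for _ in range(3))
        lines.append(f"{op} x{rd}, x{rs1}, x{rs2}")
    return lines


def generate_parallel(total: int, chunks: int, workers: int, base_seed: int = 0) -> List[str]:
    """Split the work into chunks, run them on a pool, and concatenate in order."""
    sizes = [total // chunks + (1 if i < total % chunks else 0) for i in range(chunks)]
    seeds = [base_seed + i for i in range(chunks)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(generate_chunk, seeds, sizes))
    return [line for part in parts for line in part]


if __name__ == "__main__":
    program = generate_parallel(total=1000, chunks=8, workers=4)
    print(len(program), program[0])
```

Because every chunk is seeded independently, the output is the same whether one worker or many are used, which keeps a parallel run reproducible.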

Algorithmic optimization target

The paper gives special attention to the third bottleneck: merging directed instruction streams into the initial random dump of undirected RISC-V instructions. It explains that instructions in a directed sequence are tagged as atomic so the sequence can be identified and kept intact. The original RISCV-DV merge implementation is described as greedy: it randomly picks an injection location, and if that location lies inside another directed sequence, it repeatedly chooses another random location until the placement does not violate an existing inserted sequence. [C9]
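The retry behavior described above can be sketched as follows. This is an illustration written for this page, not the RISCV-DV source: the Instr class, the stream_id field (standing in for the atomic tag), and the placeholder instruction text are assumptions. In this sketch each list insertion also shifts the tail of the list; whether that matches the cause of the quadratic cost in the original code is not stated in the summary above.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    text: str
    stream_id: Optional[int] = None  # None marks a non-directed instruction


def greedy_insert(program: List[Instr], stream: List[Instr], rng: random.Random) -> int:
    """Insert one directed stream at a random legal index; return the retry count."""
    retries = 0
    while True:
        idx = rng.randrange(len(program) + 1)
        # Illegal if both neighbours belong to the same directed stream.
        splits_stream = (
            0 < idx < len(program)
            and program[idx - 1].stream_id is not None
            and program[idx - 1].stream_id == program[idx].stream_id
        )
        if not splits_stream:
            program[idx:idx] = stream  # shifts every later element
            return retries
        retries += 1


def greedy_merge(base_count: int, stream_count: int, stream_len: int,
                 seed: int = 1) -> Tuple[List[Instr], int]:
    """Build a placeholder program and greedily merge directed streams into it."""
    rng = random.Random(seed)
    program = [Instr(f"nop_{i}") for i in range(base_count)]
    total_retries = 0
    for sid in range(stream_count):
        stream = [Instr(f"dir_{sid}_{k}", stream_id=sid) for k in range(stream_len)]
        total_retries += greedy_insert(program, stream, rng)
    return program, total_retries
```

As more directed instructions accumulate, a larger share of random positions falls inside an existing sequence, so the expected number of retries grows with the amount already inserted.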

Because the merge step is characterized as a suboptimal algorithmic implementation with O(n²) behavior, the paper presents it as requiring a significant architectural change, to be made before the other optimization areas are addressed. [C10]
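The summary does not say what the redesigned merge looks like. As one possible way to avoid quadratic behavior, and purely as an assumption of this page rather than the paper's method, insertion slots can be chosen up front among the gaps of the non-directed list and the output built in a single pass. Directed sequences then stay intact by construction, and no retries or mid-list insertions are needed. All names below are hypothetical.

```python
import random
from typing import Dict, List


def single_pass_merge(base: List[str], streams: List[List[str]],
                      rng: random.Random) -> List[str]:
    """Merge directed streams into base in O(len(base) + total stream length)."""
    # Pick a gap (0..len(base)) for each stream; gaps only exist between
    # non-directed instructions, so no stream can land inside another.
    slots: Dict[int, List[List[str]]] = {}
    for stream in streams:
        slots.setdefault(rng.randrange(len(base) + 1), []).append(stream)

    merged: List[str] = []
    for i in range(len(base) + 1):
        for stream in slots.get(i, ()):
            merged.extend(stream)  # whole stream appended contiguously
        if i < len(base):
            merged.append(base[i])
    return merged


if __name__ == "__main__":
    base = [f"n{i}" for i in range(10)]
    streams = [[f"d{s}_{k}" for k in range(3)] for s in range(2)]
    print(single_pass_merge(base, streams, random.Random(7)))
```

One design consequence is that positions are drawn relative to the non-directed list only, so the resulting placement distribution is not identical to that of a retry-based scheme.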

Significance

Within the evidence provided, the paper’s main contribution is the combination of RISCV-DV profiling, algorithmic analysis, multicore parallelization, and memory-allocation reduction to make random RISC-V instruction generation far faster than the original SystemVerilog UVM implementation. The reported result is a multicore UVM testbench capable of generating millions of constrained-randomized RISC-V instructions per second, with more than 100× speedup over the original RISCV-DV baseline. [C3]