Profiling Wiki — STIMSMITH

Overview

The RISCV-DV generator case study presents profiling as the most important step in performance optimization. The codebase contained more than 15,000 lines of code, and profiling narrowed the optimization focus to roughly 200 repeatedly executed lines that dominated generator performance. [C1]

Purpose

The primary purpose of profiling in the cited RISCV-DV workflow was to identify performance bottlenecks before applying optimization techniques. The profiling results were combined with analysis of algorithmic complexity to rank bottlenecks by impact on overall generator execution time. [C2]

Profiling granularity

The cited study distinguishes between macro-level and micro-level profiling:

Macro-level profiling: uvm_trace, an eUVM construct, was used for formal identification of testbench bottlenecks. [C3]
Micro-level profiling: the open-source tool gprof is cited as useful for finer-grained profiling. [C4]

Example instrumentation pattern

The RISCV-DV case study shows uvm_trace calls placed around an instruction-randomization loop. The trace log records a wall-clock timestamp in square brackets after the UVM TRACE tag at each invocation point, so the difference between the START and DONE timestamps gives the elapsed time of the enclosed loop. [C5]

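// Trace point before the loop: logs a timestamp tagged "GEN INSTR".
uvm_trace("GEN INSTR", "START", UVM_DEBUG);
foreach (ref instr; instr_list)
  randomize_instr(instr, is_debug_program);
// Trace point after the loop: the timestamp difference is the loop's elapsed time.
uvm_trace("GEN INSTR", "DONE", UVM_DEBUG);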
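
The bracketing idea can also be tried outside eUVM. The following is a minimal Python sketch, not the eUVM implementation: `trace`, `elapsed_by_tag`, and this `randomize_instr` are hypothetical stand-ins, and the log is an in-memory list rather than a UVM log file. It records a wall-clock timestamp at each trace point and then pairs START and DONE entries to obtain the elapsed time per tag.

```python
import time

# Hypothetical in-memory trace log: (tag, event, wall-clock seconds).
trace_log = []


def trace(tag, event):
    """Record a wall-clock timestamp for a named trace point."""
    trace_log.append((tag, event, time.time()))


def elapsed_by_tag(log):
    """Pair each START with the next DONE of the same tag and sum the durations."""
    started = {}
    totals = {}
    for tag, event, stamp in log:
        if event == "START":
            started[tag] = stamp
        elif event == "DONE" and tag in started:
            totals[tag] = totals.get(tag, 0.0) + (stamp - started.pop(tag))
    return totals


def randomize_instr(instr):
    """Placeholder for the real randomization work (assumption)."""
    return instr * 2


instr_list = list(range(1000))

trace("GEN INSTR", "START")
results = [randomize_instr(instr) for instr in instr_list]
trace("GEN INSTR", "DONE")

print(elapsed_by_tag(trace_log))
```

A DONE entry without a preceding START is ignored, which keeps a truncated log from producing misleading durations.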

Profiling setup considerations

Because RISCV-DV is highly parameterized, execution time varies significantly with user-selected parameters. For profiling, the cited study used the comprehensive riscv_instr_base_test with a mix of seven directed streams covering the possible RISC-V instruction categories. [C6]

Bottlenecks identified

Profiling and complexity analysis identified four major bottlenecks in decreasing order of impact: [C2]

1. Creation and randomization of directed instruction streams, where most time was traced to constraint-solver execution.
2. Dumping a large non-directed instruction stream into instruction lists for the main program and sub-programs, again dominated by randomization and constraint solving.
3. Insertion of directed instruction streams into the non-directed instruction stream, which became more severe as instruction count increased and had O(n²) algorithmic complexity.
4. Repeated creation of formatted strings for assembly output, where repeated $sformatf calls caused many memory allocations.
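
To illustrate how an insertion step can become quadratic, here is a minimal Python sketch. The source does not say how the RISCV-DV insertion was implemented, so this shows only one common pattern under stated assumptions: `insert_naive` and `insert_single_pass` are hypothetical names, and `directed` maps an index in the original stream to the block placed before that index. Splicing into the middle of a list shifts its tail each time, so k splices into n items cost on the order of n * k. A single merging pass touches each item once.

```python
def insert_naive(stream, directed):
    """Insert each directed block with a separate list splice.

    Each splice shifts the tail of the list, so k splices into
    n items cost O(n * k), which is O(n^2) when k grows with n.
    """
    out = list(stream)
    # Going from the highest index down keeps the lower indices valid.
    for idx in sorted(directed, reverse=True):
        out[idx:idx] = directed[idx]
    return out


def insert_single_pass(stream, directed):
    """Build the merged stream in one pass: O(n + total directed length)."""
    out = []
    for idx, instr in enumerate(stream):
        if idx in directed:
            out.extend(directed[idx])
        out.append(instr)
    # A block keyed at len(stream) goes after the last instruction.
    out.extend(directed.get(len(stream), []))
    return out


stream = [f"nop_{i}" for i in range(6)]
directed = {0: ["d0_a", "d0_b"], 3: ["d3_a"], 6: ["d6_a"]}

# Both strategies produce the same merged stream.
assert insert_naive(stream, directed) == insert_single_pass(stream, directed)
print(insert_single_pass(stream, directed))
```

Both functions return the same result. Only the amount of element shifting differs, which is why the gap widens as the instruction count grows.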

Optimization guidance derived from profiling

The first two bottlenecks involved constraint solving and had linear algorithmic complexity, making them suitable targets for multicore parallelization in the cited study. The fourth bottleneck also had linear complexity, but frequent memory allocation limited parallelization potential until allocation calls were reduced. The third bottleneck was attributed to sub-optimal algorithmic implementation and required a more significant architectural change. [C7]
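
The following Python sketch shows why linear work over independent items splits cleanly across workers. It is an illustration under assumptions, not the study's implementation: `randomize_chunk` and `generate` are hypothetical, a seeded `random.Random` stands in for constraint solving, and the chunks are assumed to be independent of one another. Threads are used only to keep the example self-contained; in CPython they do not speed up CPU-bound pure-Python work, so `ProcessPoolExecutor` would be the swap for real multicore use.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def randomize_chunk(job):
    """Randomize one chunk with its own seeded generator (stand-in for solving)."""
    seed, count = job
    rng = random.Random(seed)
    return [rng.randrange(32) for _ in range(count)]


def generate(total, chunk_size, base_seed, workers):
    """Split `total` items into chunks and randomize the chunks concurrently."""
    jobs = []
    for chunk_no, start in enumerate(range(0, total, chunk_size)):
        count = min(chunk_size, total - start)
        jobs.append((base_seed + chunk_no, count))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in job order, so the output is deterministic.
        chunks = list(pool.map(randomize_chunk, jobs))
    return [value for chunk in chunks for value in chunk]


serial = generate(total=1000, chunk_size=100, base_seed=7, workers=1)
parallel = generate(total=1000, chunk_size=100, base_seed=7, workers=4)

# Per-chunk seeds make the result independent of the worker count.
assert serial == parallel
print(len(parallel))
```

Giving each chunk its own seed is the design choice that matters here: the output does not depend on scheduling, so a parallel run can be checked against a serial one.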

Caution

uvm_trace is useful for macro-level profiling, but each invocation performs an operating-system call to fetch the current clock time. Excessive use can therefore cause an inordinate increase in testbench runtime. [C8]
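
The cost difference between coarse and fine trace placement can be made concrete by counting clock reads. This is a minimal Python sketch with hypothetical names (`CountingClock`, `trace_per_iteration`, `trace_per_loop`). `time.time()` stands in for the clock fetch that each trace call performs, and no claim is made about the actual per-call cost in eUVM.

```python
import time


class CountingClock:
    """Wraps time.time() and counts how often it is called (hypothetical helper)."""

    def __init__(self):
        self.reads = 0

    def now(self):
        self.reads += 1
        return time.time()


def trace_per_iteration(items, clock):
    """Trace inside the loop: two clock reads per item, 2 * n in total."""
    for _ in items:
        clock.now()  # START
        clock.now()  # DONE


def trace_per_loop(items, clock):
    """Trace around the loop: two clock reads regardless of n."""
    clock.now()  # START
    for _ in items:
        pass
    clock.now()  # DONE


items = range(10_000)
fine, coarse = CountingClock(), CountingClock()
trace_per_iteration(items, fine)
trace_per_loop(items, coarse)
print(fine.reads, coarse.reads)  # 20000 2
```

Tracing around a loop keeps the number of clock fetches constant, while tracing inside it makes the profiling overhead grow with the instruction count.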