Parallelized Fork Wiki

Overview

A Parallelized Fork is a testbench parallelization technique described for eUVM in the context of optimizing the RISCV-DV generator. The technique uses eUVM's fork construct to create multiple tasks and distribute them across CPU threads associated with parallel task executors. It is useful when a sequence contains thousands of transactions stored in a container such as a queue or array: the container is sliced, and each forked task processes one slice. [parallelized-fork-container-slicing]

In the RISCV-DV optimization work, the technique is identified as a suitable approach for large instruction sequences. The eUVM implementation differs from SystemVerilog because eUVM can execute forked processes on multiple cores and can delegate a newly forked process to a specified processor thread. [parallelized-fork-riscvdv-large-sequences]

Mechanics

The core workflow is:

Split a large container of work into slices.
Create a forked task for each slice.
Store each returned Fork object in an array or list.
Configure thread affinity for each fork using set_thread_affinity.
Join the forks to synchronize completion. [parallelized-fork-euvm-semantics]
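
The steps above can be sketched in plain Python. This is a minimal analog under stated assumptions, not eUVM code: threading.Thread stands in for an eUVM fork, process_slice and parallel_sum are hypothetical names, and the thread-affinity step is left out because Python's threading module has no counterpart to set_thread_affinity.

```python
import threading

def process_slice(items, start, end, results, slot):
    # Hypothetical stand-in for per-transaction work: sum one slice.
    # Each task writes only to its own result slot, so no lock is needed.
    results[slot] = sum(items[start:end])

def parallel_sum(items, task_count):
    # Step 1: split the container into contiguous slices (ceiling division).
    chunk = -(-len(items) // task_count)
    results = [0] * task_count
    forks = []
    for slot in range(task_count):
        start = min(slot * chunk, len(items))
        end = min(start + chunk, len(items))
        # Step 2: create one task per slice.
        task = threading.Thread(
            target=process_slice, args=(items, start, end, results, slot)
        )
        task.start()
        # Step 3: keep the handle so it can be joined later.
        forks.append(task)
    # Step 5: join every task before using the results.
    for task in forks:
        task.join()
    return sum(results)
```

Each task receives only index bounds, so the slices never overlap and the tasks stay independent of one another.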

In eUVM, fork returns a Fork object that can be collected into a list, configured later, and joined later. Its set_thread_affinity method assigns that fork to execute on a specified thread. [fork-object-affinity]

A typical pattern from the source is:

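Fork[] forks;
for (int i = 0; i != threadCnt; ++i) {
  // Immediately invoked lambda: i is passed as a parameter so that
  // each fork works with its own copy of the loop index.
  forks ~= (int i) {
    return fork(() {
      randomizeSome(i);
    });
  }(i);
}
// Assign fork i to thread i.
foreach (i, f; forks)
  f.set_thread_affinity(i);
// Wait for every fork to finish.
foreach (f; forks)
  f.join();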

The fork call is wrapped in an immediately invoked lambda that takes the loop index as a parameter, so each forked task captures its own copy of the index. This differs from SystemVerilog, which allows scoped variables to be declared in the fork header. [parallelized-fork-euvm-semantics]
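
Python closures have a comparable pitfall, which makes the reason for the wrapper easy to see. The sketch below is an analog only, not eUVM code, and the function names are hypothetical. The first function lets every task refer to the loop variable itself; the second passes the index through an immediately invoked lambda, mirroring the shape of the eUVM listing.

```python
def late_bound(count):
    tasks = []
    for i in range(count):
        # The lambda refers to the variable i, not to its current value.
        tasks.append(lambda: i)
    # By the time the tasks run, the loop has finished and i holds its last value.
    return [task() for task in tasks]

def bound_per_iteration(count):
    tasks = []
    for i in range(count):
        # The outer lambda is called at once with i, so the inner lambda
        # captures the parameter k, which is distinct for every iteration.
        tasks.append((lambda k: (lambda: k))(i))
    return [task() for task in tasks]
```

Calling late_bound(3) returns [2, 2, 2], while bound_per_iteration(3) returns [0, 1, 2].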

Use in RISCV-DV

In the RISCV-DV generator, Parallelized Fork is used to mitigate a bottleneck in the initial dumping of non-directed instruction streams that form the backbone of the main function and sub-programs. The generator first decides whether parallelization is worthwhile by comparing the number of instructions against a threshold held in the par_instr_threshold configuration parameter (default 4000). For larger dumps, the generator splits the instruction randomization work into par_num_threads slices (default 8) and delegates each slice to a separate thread. [parallelized-fork-riscvdv-threshold]

The randomization listing in the source partitions an instruction list by computing start_idx and end_idx for each thread, creates a fork that randomizes the instructions in that range, sets the fork's thread affinity, appends the fork to a list, and then joins all forks. [parallelized-randomization-listing]
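
The threshold check and index arithmetic can be illustrated with a short Python sketch. It is not the RISCV-DV listing: plan_slices is a hypothetical helper, the defaults 4000 and 8 come from the text above, and two details are assumptions made for illustration, namely that counts strictly below the threshold stay serial and that the last thread absorbs any remainder.

```python
PAR_INSTR_THRESHOLD = 4000  # default threshold quoted in the text
PAR_NUM_THREADS = 8         # default thread count quoted in the text

def plan_slices(instr_count,
                threshold=PAR_INSTR_THRESHOLD,
                num_threads=PAR_NUM_THREADS):
    """Return (start_idx, end_idx) pairs, with end_idx exclusive."""
    # Assumption: below the threshold the work is kept in a single slice.
    if instr_count < threshold:
        return [(0, instr_count)]
    per_thread = instr_count // num_threads
    slices = []
    for thread in range(num_threads):
        start_idx = thread * per_thread
        # Assumption: the last thread also takes the leftover instructions.
        if thread == num_threads - 1:
            end_idx = instr_count
        else:
            end_idx = start_idx + per_thread
        slices.append((start_idx, end_idx))
    return slices
```

With the defaults, plan_slices(4003) gives seven slices of 500 instructions and a final slice covering indices 3500 through 4002.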

For directed instruction streams, the strategy is slightly different: because directed streams are organized into multiple groups, a separate thread is designated for randomizing the streams in each group. This strategy is chosen for code simplicity and because all streams in a given group have identical constraints, which reduces stress on the thread-specific constraint solvers. [directed-stream-parallelization]
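
A per-group assignment might look like the following Python sketch. It is an illustration under assumptions, not the generator's code: the group and stream names are hypothetical, and a seeded random.Random instance per worker stands in for a thread-specific constraint solver.

```python
import random
import threading

def randomize_group(streams, seed, out, key):
    # One generator per worker, standing in for a per-thread solver.
    rng = random.Random(seed)
    # Every stream in the group is handled by the same worker.
    out[key] = [(name, rng.randrange(1 << 16)) for name in streams]

def randomize_directed(groups, seed=0):
    out = {}
    workers = []
    for n, (key, streams) in enumerate(groups.items()):
        # One worker per group; each writes only under its own key.
        worker = threading.Thread(
            target=randomize_group, args=(streams, seed + n, out, key)
        )
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
    return out
```

Partitioning by group rather than by index range means no slice arithmetic is needed, which matches the code-simplicity motivation given above.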

Constraints and performance considerations

Parallelized Fork is not always effective for small numbers of transactions or instructions. Constraint solvers take more time when invoked for the first time for a given constraint; after more transactions are randomized, solvers can reuse already solved constraints and pick random solutions from the solved set. In a parallelized testbench, each thread has a separate solver instance, and sharing a solver instance across threads would defeat the purpose of parallelization. Every thread therefore pays the first-invocation cost on its own, and a small workload does not amortize it. [parallelization-small-workloads]

The technique also interacts with testbench architecture. The RISCV-DV eUVM port refactored statically scoped instruction-registry variables into a separate riscv_instr_registry class, with an instance placed inside the singleton riscv_instr_gen_config class. This was done because globally or statically scoped variables accessed by multiple threads require synchronization locks to avoid races, which can harm concurrent runtime efficiency. [static-variable-refactoring]
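
The shape of that refactoring can be sketched in Python. The classes below are hypothetical stand-ins named after the ones in the text, not the eUVM port itself, and the sketch assumes that the registry is filled on the main thread before any worker starts and is only read afterward.

```python
import threading

class InstrRegistry:
    """Hypothetical stand-in for riscv_instr_registry: state lives on the instance."""
    def __init__(self):
        self.instr_names = []

    def register(self, name):
        self.instr_names.append(name)

class GenConfig:
    """Hypothetical stand-in for the singleton riscv_instr_gen_config."""
    _instance = None

    def __init__(self):
        self.registry = InstrRegistry()

    @classmethod
    def instance(cls):
        # Assumption: first called on the main thread, before workers exist.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

def count_matching(cfg, prefix, out, slot):
    # Workers only read the registry, so this sketch takes no lock.
    out[slot] = sum(1 for n in cfg.registry.instr_names if n.startswith(prefix))

def run(prefixes):
    cfg = GenConfig.instance()
    out = [0] * len(prefixes)
    workers = [
        threading.Thread(target=count_matching, args=(cfg, p, out, i))
        for i, p in enumerate(prefixes)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return out
```

Workers reach the registry through the configuration object they are handed rather than through a statically scoped variable.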

When to apply

Parallelized Fork is appropriate when:

The workload is a large sequence of transactions or instructions stored in a container.
The work can be partitioned into independent slices.
eUVM task executors can run forked tasks across CPU threads.
The overhead of multiple solver instances and thread setup is justified by the amount of work. [parallelized-fork-container-slicing]

It is less appropriate for small randomization workloads, where solver warm-up and per-thread solver overhead can dominate. [parallelization-small-workloads]