Multicore Parallelization Wiki

Overview

Multicore Parallelization in this context refers to distributing testbench execution work across multiple CPU threads, with particular emphasis on compute-intensive sequence and instruction randomization. The evidence describes the technique in eUVM, where forked processes can be assigned to specific processor threads, unlike SystemVerilog fork semantics where a task created by fork executes on the same CPU thread as its parent. [C1]

The technique is motivated by the limitation that multicore support in conventional SystemVerilog-oriented flows is largely focused on RTL and gate-level simulation, while behavioral testbenches share data by reference and require user-level constructs for synchronized shared-data access. [C2]

Execution models

VIP-level parallelism

A multicore eUVM simulator can use multiple task executors, each mapped to its own CPU thread. Synchronization barriers keep those task executors synchronized with the scheduler. [C3]

The evidence cautions that the scheduler and synchronization barriers limit achievable gains. It cites Amdahl’s Law and notes that parallelizing the scheduler itself is unlikely to help much because a simulator often has only a small number of active events and processes at a given simulation time. However, if testbench tasks are compute-intensive, a multi-threaded testbench can provide performance gains. [C3]
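Amdahl's Law makes this limit concrete: if a fraction p of the run time can be spread over n threads, the overall speedup is at most 1 / ((1 - p) + p / n). The following Python sketch evaluates that bound. The fractions in the loop are hypothetical illustrations, not measurements from the evidence.

```python
def amdahl_speedup(parallel_fraction: float, num_threads: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the run time scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_threads)


# Hypothetical fractions: a scheduler-dominated run versus a randomization-heavy one.
for fraction in (0.10, 0.50, 0.90):
    bound = amdahl_speedup(fraction, 8)
    print(f"p={fraction:.2f}: 8 threads give at most {bound:.2f}x")
```

The bound grows only when the parallelizable fraction is large, which matches the observation that compute-intensive testbench tasks are where multi-threading pays off.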

A typical use case is a subsystem-level testbench with multiple UVM agents or Verification IPs (VIPs). Since sequence randomization can be one of the most compute-intensive testbench processes, eUVM can map each UVM agent to a separate CPU thread to distribute sequence randomization. [C3]
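The lock-step relationship between task executors and the scheduler can be sketched with a reusable barrier. This is a Python analogue, not eUVM code: the names task_executor, scheduler, NUM_AGENTS, and NUM_STEPS are hypothetical, and appending to a list stands in for an agent's sequence randomization.

```python
import threading

NUM_AGENTS = 3  # hypothetical number of UVM agents, one executor thread each
NUM_STEPS = 4   # hypothetical number of scheduler steps

# The scheduler and every executor meet at the same reusable barrier.
barrier = threading.Barrier(NUM_AGENTS + 1)
work_log = [[] for _ in range(NUM_AGENTS)]
in_sync = []


def task_executor(agent_index):
    for step in range(NUM_STEPS):
        barrier.wait()                      # wait for the scheduler to open the step
        work_log[agent_index].append(step)  # stand-in for sequence randomization
        barrier.wait()                      # report that this step's work is done


def scheduler():
    for step in range(NUM_STEPS):
        barrier.wait()  # release all executors into the step
        barrier.wait()  # block until every executor has finished it
        # Only the scheduler runs here: the sequential part of each step.
        in_sync.append(all(len(entries) == step + 1 for entries in work_log))


executors = [threading.Thread(target=task_executor, args=(i,)) for i in range(NUM_AGENTS)]
for thread in executors:
    thread.start()
scheduler()
for thread in executors:
    thread.join()
```

Two barrier crossings per step keep the executors from running ahead of the scheduler. The section between them is the only part that runs in parallel, which is why the sequential scheduler bounds the gain.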

Sequence-level parallelism

VIP-level parallelism is less applicable to simple module-level testbenches with only a limited number of UVM components. The evidence therefore describes sequence-level parallelization, which uses eUVM worker threads to exploit multicore concurrency even when there are few UVM components. [C4]

Worker threads are free-running asynchronous threads that are hierarchically owned by the simulator but decoupled from the scheduler. They continue running even when the scheduler activates. Because they are decoupled from the scheduler, worker threads cannot wait for simulator events, though they can trigger events. [C4]

Communication and synchronization

Although worker threads share memory with simulator tasks, UVM requires data exchange through TLM FIFOs for synchronization. Standard UVM TLM FIFOs use events to block reads when empty and writes when full. [C4]

Because asynchronous worker threads cannot wait for simulator events, eUVM provides asynchronous TLM FIFO variants. For example, an async-write TLM FIFO supports a worker thread that generates UVM transactions for a regular UVM task: when the FIFO is full, the worker thread is blocked by a software semaphore; when the receiving task removes an item, the semaphore is released; when the FIFO is empty, the receiving task blocks on the regular read event, and the worker thread triggers that event when it writes data. [C5]
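The handshake described above can be sketched in Python as follows. This is an analogue under stated assumptions, not the eUVM implementation: AsyncWriteFifo and its methods are hypothetical names, threading.Semaphore plays the software semaphore, and threading.Event stands in for the simulator read event.

```python
import threading
from collections import deque


class AsyncWriteFifo:
    """Bounded FIFO: the writer blocks on a semaphore, the reader waits on an event."""

    def __init__(self, capacity):
        self._items = deque()
        self._lock = threading.Lock()
        self._free_slots = threading.Semaphore(capacity)  # blocks the writer when full
        self._data_ready = threading.Event()              # stand-in for the read event

    def put(self, item):
        """Called from the worker thread; it never waits on the event."""
        self._free_slots.acquire()      # blocks while the FIFO is full
        with self._lock:
            self._items.append(item)
            self._data_ready.set()      # trigger the read event for the consumer

    def get(self):
        """Called from the consuming task; waits on the event while the FIFO is empty."""
        while True:
            self._data_ready.wait()
            with self._lock:
                if self._items:
                    item = self._items.popleft()
                    if not self._items:
                        self._data_ready.clear()
                    break
        self._free_slots.release()      # free a slot, unblocking a waiting writer
        return item


def producer(fifo, count):
    for index in range(count):
        fifo.put(f"txn_{index}")


fifo = AsyncWriteFifo(capacity=2)
worker = threading.Thread(target=producer, args=(fifo, 10))
worker.start()
received = [fifo.get() for _ in range(10)]
worker.join()
```

The writer only ever blocks on the semaphore and only triggers the event, mirroring the rule that worker threads can trigger simulator events but cannot wait for them.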

Parallelized fork pattern

The evidence describes a common eUVM pattern for parallelizing a large container of work:

Split a queue or array of transactions into slices.
Create one forked task per slice.
Store the returned Fork objects in an array.
Use set_thread_affinity to bind each fork to a particular task executor or CPU thread.
Join the forks once thread affinity has been set for each of them. [C1]

This pattern is useful when a sequence contains thousands of transactions stored in a container. Each spawned fork processes one slice of the original container, allowing sequence generation to be accelerated across multiple CPU threads. [C1]
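The steps above can be sketched in Python as a structural analogue. The helpers split_into_slices and process_slice are hypothetical, and doubling a value stands in for randomizing a transaction. The affinity step is omitted because Python's threading module has no direct counterpart to set_thread_affinity.

```python
import threading


def split_into_slices(items, num_slices):
    """Return contiguous slices whose lengths differ by at most one."""
    base, extra = divmod(len(items), num_slices)
    slices, start = [], 0
    for index in range(num_slices):
        end = start + base + (1 if index < extra else 0)
        slices.append(items[start:end])
        start = end
    return slices


def process_slice(chunk, results, slot):
    """Hypothetical per-slice work: stands in for randomizing each transaction."""
    results[slot] = [txn * 2 for txn in chunk]


transactions = list(range(10))
slices = split_into_slices(transactions, 4)
results = [None] * len(slices)

forks = []
for slot, chunk in enumerate(slices):
    fork = threading.Thread(target=process_slice, args=(chunk, results, slot))
    forks.append(fork)  # keep the handle so the fork can be joined later
    fork.start()

for fork in forks:
    fork.join()

# Each fork wrote to its own slot, so concatenation restores the original order.
processed = [value for chunk in results for value in chunk]
```

Giving each fork its own output slot avoids any locking on the result container. The sketch shows the structure only; CPython threads do not run Python bytecode on several cores at once, so it does not demonstrate the speedup.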

Application to RISCV-DV

The evidence applies multicore eUVM parallelization to the RISCV-DV generator, whose output is a bare-metal RISC-V assembly program or, alternatively, an executable binary dump that can be loaded directly into simulation or emulation memory. [C6]

The RISCV-DV generator works with large instruction sequences, so the evidence identifies the parallelized fork pattern as the best fit. The first targeted bottleneck is the initial dumping of non-directed instruction streams that form the backbone of the main function and sub-programs. [C7]

The RISCV-DV implementation decides whether to parallelize based on instruction count. The default threshold is par_instr_threshold = 4000. When the number of instructions exceeds that threshold, the generator splits the work into par_num_threads slices, with a default of 8, and delegates instruction randomization for each slice to a separate thread. [C7]

A listing in the evidence shows the implementation structure: compute per-thread start and end indices, create a fork that calls randomize_instr over that slice, assign thread affinity with set_thread_affinity, append the fork to the list, and finally join all forks. [C8]
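The threshold check and the per-thread index computation might look like the Python sketch below. The defaults 4000 and 8 come from the text. The function plan_slices and its index arithmetic are assumptions, since the exact formula in the listing is not reproduced here.

```python
PAR_INSTR_THRESHOLD = 4000  # default threshold stated in the text
PAR_NUM_THREADS = 8         # default slice count stated in the text


def plan_slices(instr_count, threshold=PAR_INSTR_THRESHOLD, num_threads=PAR_NUM_THREADS):
    """Return (start, end) index pairs with end exclusive.

    A single pair means the work stays on one thread.
    """
    if instr_count <= threshold:
        return [(0, instr_count)]
    bounds = []
    for thread_index in range(num_threads):
        # Assumed arithmetic: proportional split that covers every index exactly once.
        start = (instr_count * thread_index) // num_threads
        end = (instr_count * (thread_index + 1)) // num_threads
        bounds.append((start, end))
    return bounds


for count in (4000, 10000):
    print(count, plan_slices(count))
```

Each returned pair would then be handed to one fork that randomizes the instructions in that index range.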

Directed instruction streams use a slightly different strategy: since there are multiple groups of directed streams, a separate thread is designated for randomizing the streams in each group. The evidence states that this reduces stress on thread-specific constraint solvers because streams within a group have identical constraints. [C8]
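A minimal Python sketch of the one-thread-per-group strategy follows. It is an analogue, not RISCV-DV code: the group names and stream counts are hypothetical, and a per-thread random.Random instance stands in for a thread-specific constraint solver.

```python
import random
import threading


def randomize_group(group_name, stream_count, seed, results):
    """Randomize every stream of one group with a solver owned by this thread."""
    solver = random.Random(seed)  # stand-in for a thread-specific constraint solver
    results[group_name] = [solver.randint(0, 31) for _ in range(stream_count)]


# Hypothetical groups: name -> number of directed streams sharing the same constraints.
groups = {"load_store": 3, "loop": 2, "jump": 4}
results = {}

threads = [
    threading.Thread(target=randomize_group, args=(name, count, seed, results))
    for seed, (name, count) in enumerate(groups.items())
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Because all streams in a group share one solver instance, that solver handles only one set of constraints, which is the effect the evidence attributes to this grouping.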

Architectural preconditions

The evidence notes that global or statically scoped variables can reduce the runtime efficacy of concurrent software because shared access requires synchronization locks to avoid race conditions. RISCV-DV contained statically scoped variables in riscv_instr.sv for instruction registration. In the eUVM port, those variables and related functions were refactored into a separate riscv_instr_registry class, with an instance placed inside the singleton riscv_instr_gen_config class to preserve singleton-like behavior. [C7]

This refactoring is presented as part of making the RISCV-DV architecture compliant with the concurrency semantics of the D programming language used by eUVM. [C7]
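The shape of that refactoring can be sketched in Python. The class and method names below are hypothetical stand-ins for riscv_instr_registry and riscv_instr_gen_config, and the registered entry is invented for illustration.

```python
class InstrRegistry:
    """Holds registration state that previously lived in statically scoped variables."""

    def __init__(self):
        self._templates = {}

    def register(self, name, template):
        self._templates[name] = template

    def lookup(self, name):
        return self._templates[name]


class GenConfig:
    """Singleton configuration that owns the one registry instance."""

    _instance = None

    def __init__(self):
        self.instr_registry = InstrRegistry()

    @classmethod
    def get(cls):
        # Lazy creation is not thread-safe; call get() once before starting threads.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


# Hypothetical registration performed during single-threaded setup.
GenConfig.get().instr_registry.register("ADD", {"format": "R"})
entry = GenConfig.get().instr_registry.lookup("ADD")
```

Access now goes through the configuration object rather than through class-level state, while the singleton still guarantees that only one registry exists.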

Performance considerations and caveats

The evidence identifies several constraints on multicore parallelization:

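The scheduler remains a sequential component and synchronization barriers add overhead, so achievable gains are bounded. Parallelizing the scheduler itself offers limited benefit because a simulator often has only a small number of active events and processes at a given simulation time. [C3]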
The technique is most effective for compute-intensive tasks, especially sequence or instruction randomization. [C3]
Parallelization may not help for small transaction or instruction counts because constraint solvers take longer on their first invocation for a constraint; later randomizations can reuse already solved constraint sets. [C7]
Each execution thread in a parallelized testbench should have a separate constraint solver instance; sharing one solver across threads would defeat the purpose of parallelization. [C7]
Profiling is important before optimization: in the RISCV-DV case, profiling narrowed the focus from more than fifteen thousand lines of code to about two hundred repeatedly executed lines that were key to generator performance. [C9]
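The effect of first-invocation solver cost on small workloads can be illustrated with a simple cost model in Python. The model and its numbers are assumptions, not measurements from the evidence: it supposes that each per-thread solver pays its own first-invocation cost, that later randomizations cost a fixed smaller amount, and that thread start-up overhead is negligible.

```python
import math


def serial_time(count, first_cost, steady_cost):
    """Time for one solver to randomize `count` items."""
    if count == 0:
        return 0.0
    return first_cost + (count - 1) * steady_cost


def parallel_time(count, num_threads, first_cost, steady_cost):
    """Wall time is set by the largest slice; every thread warms up its own solver."""
    largest_slice = math.ceil(count / num_threads)
    return serial_time(largest_slice, first_cost, steady_cost)


# Hypothetical costs in arbitrary units: the first solve is 100x a later one.
FIRST_COST, STEADY_COST, THREADS = 100.0, 1.0, 8

for count in (8, 8000):
    serial = serial_time(count, FIRST_COST, STEADY_COST)
    parallel = parallel_time(count, THREADS, FIRST_COST, STEADY_COST)
    print(f"{count} items: modeled speedup {serial / parallel:.2f}x")
```

Under these assumptions, a small workload gains almost nothing because the warm-up cost dominates every thread, while a large workload approaches the thread count.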

Practical guidance

Based on the evidence, Multicore Parallelization is most appropriate when:

The testbench workload is compute-intensive rather than scheduler/event dominated. [C3]
The workload can be divided into independent slices, such as transaction containers or instruction lists. [C1]
Shared global or static state has been refactored or otherwise protected to avoid lock-heavy execution. [C7]
The workload is large enough to amortize per-thread and constraint-solver overhead. [C7]
Communication between asynchronous worker threads and regular UVM tasks is performed through appropriate asynchronous TLM FIFO constructs. [C4][C5]