Sequence Level Parallelism Wiki

Overview

Sequence Level Parallelism is a technique for exploiting multicore concurrency in simple module-level testbenches that contain only a limited number of UVM components. It appears in the context of eUVM optimization for RISCV-DV-style testbench workloads, where parallelization is applied below or within the sequence-generation layer rather than only by mapping large VIP components to different executors. [C1]

Motivation

The cited source contrasts sequence-level parallelization with VIP-level multicore mapping. VIP-level parallelism may be suboptimal for two reasons: scheduler activity can deactivate task executors, and high speedup requires a balanced task load across executors, which different VIPs may not provide. Sequence-level parallelization is introduced as a way to exploit multicore concurrency even in smaller module-level testbenches. [C1]

Worker-thread model

In the eUVM architecture described by the source, the multicore simulator includes multiple task executors and also free-running asynchronous threads called worker threads. These worker threads are hierarchically owned by the simulator but are decoupled from the scheduler, so they continue running even when the scheduler activates. Because worker threads are decoupled from the scheduler, they cannot wait for simulator events, although they can trigger events. [C2]

Data exchange and synchronization

Although worker threads share memory with simulator tasks, the source states that UVM requires data exchange to occur through TLM FIFOs in order to maintain synchronization across threads. Standard UVM TLM FIFOs use events to block reads when the FIFO is empty and writes when the FIFO is full. Since asynchronous worker threads cannot wait for simulator events, eUVM provides asynchronous TLM FIFO variants that replace event-based waiting with software semaphore mechanisms where needed. [C3]

The source describes three asynchronous FIFO use cases: asynchronous write, asynchronous read, and fully asynchronous read/write. An asynchronous write FIFO is used when a worker thread generates UVM transactions that must be transferred to a regular UVM task. When the FIFO is full, the worker thread blocks on a software semaphore; the semaphore is released when the receiving task frees a slot. A fully asynchronous FIFO uses semaphores at both the read and write ports and is intended for data exchange between two asynchronous worker threads. [C4]
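
The semaphore-based blocking described above can be illustrated with a minimal Python sketch. It is an analogy only, not eUVM code: the class name AsyncWriteFifo and the function run_demo are hypothetical, and Python threads stand in for worker threads. Both ends of this sketch wait on semaphores, which corresponds most closely to the fully asynchronous variant; in an asynchronous write FIFO the reading side would be a regular UVM task that can wait on a simulator event instead.

```python
import threading
from collections import deque


class AsyncWriteFifo:
    """Hypothetical bounded FIFO that blocks on semaphores, not events."""

    def __init__(self, capacity):
        self._items = deque()
        self._lock = threading.Lock()
        # Counts free slots; the writer blocks here when the FIFO is full.
        self._free_slots = threading.Semaphore(capacity)
        # Counts stored items; the reader blocks here when the FIFO is empty.
        self._filled_slots = threading.Semaphore(0)

    def put(self, item):
        self._free_slots.acquire()      # blocks while the FIFO is full
        with self._lock:
            self._items.append(item)
        self._filled_slots.release()    # signals that one item is available

    def get(self):
        self._filled_slots.acquire()    # blocks while the FIFO is empty
        with self._lock:
            item = self._items.popleft()
        self._free_slots.release()      # frees a slot, unblocking the writer
        return item


def run_demo(num_transactions=20, capacity=4):
    """One producer thread writes transactions; the caller reads them."""
    fifo = AsyncWriteFifo(capacity)

    def worker():
        for txn_id in range(num_transactions):
            fifo.put({"id": txn_id})

    producer = threading.Thread(target=worker)
    producer.start()
    received = [fifo.get()["id"] for _ in range(num_transactions)]
    producer.join()
    return received
```

With a capacity smaller than the number of transactions, the producer stalls repeatedly on the free-slot semaphore and resumes each time the consumer removes an item, which is the behavior the source attributes to the write side.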

Sequencer parallelization pattern

Worker-thread constructs in eUVM make it possible to parallelize a UVM sequencer by creating multiple free-running worker threads. In the described scheme, the worker threads feed a virtual sequencer in a round-robin fashion. This is useful when transaction generation and randomization are complex or time-consuming enough to become the bottleneck of an RTL simulation. [C5]
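
A rough Python sketch of the round-robin pattern follows. It assumes one bounded queue per worker and a consumer loop standing in for the virtual sequencer; the names generate and run_round_robin, and the transaction fields, are hypothetical. The standard-library queue.Queue is used here as a stand-in for an asynchronous TLM FIFO.

```python
import queue
import random
import threading


def generate(worker_id, count, out_fifo):
    """Worker body: build 'count' transactions and push them downstream."""
    rng = random.Random(worker_id)  # per-worker generator, no shared state
    for seq in range(count):
        txn = {"worker": worker_id, "seq": seq, "data": rng.getrandbits(32)}
        out_fifo.put(txn)           # blocks while this worker's FIFO is full


def run_round_robin(num_workers=3, per_worker=4, depth=2):
    """Drain one transaction from each worker in turn, round-robin."""
    fifos = [queue.Queue(maxsize=depth) for _ in range(num_workers)]
    threads = [
        threading.Thread(target=generate, args=(w, per_worker, fifos[w]))
        for w in range(num_workers)
    ]
    for t in threads:
        t.start()

    order = []
    for _ in range(per_worker):
        for w in range(num_workers):
            txn = fifos[w].get()    # blocks until worker w has produced one
            order.append((txn["worker"], txn["seq"]))

    for t in threads:
        t.join()
    return order
```

Because each worker owns its own queue, the consumption order is deterministic even though the workers run concurrently. Standard CPython threads share the global interpreter lock, so the sketch shows the structure of the pattern rather than an actual speedup.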

Fine-grained parallelism with multicore fork

The source also describes a finer-grained form of sequence-level parallelism using a multicore-parallelized fork. This approach is intended for cases where per-transaction generation and randomization are not expensive enough to justify the overhead of asynchronous TLM FIFO reads and writes. In standard SystemVerilog-style fork semantics, a fork-created task executes on the same CPU thread as the parent thread. In eUVM, forked tasks can instead be distributed across CPU threads associated with multiple task executors. [C6]

This form is useful when a sequence contains thousands of transactions stored in a container such as a queue or array. To accelerate generation, eUVM can slice the sequence-carrying container and split those slices across forked tasks, with each spawned fork processing one slice of the original container. [C7]
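
The slicing idea can be sketched in Python as follows. This is an illustration under assumptions, not the eUVM mechanism: a thread pool stands in for forked tasks bound to task executors, and the helper names split_into_slices, randomize_slice, and randomize_parallel are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(items, num_slices):
    """Split items into num_slices contiguous slices of near-equal length."""
    base, extra = divmod(len(items), num_slices)
    slices, start = [], 0
    for i in range(num_slices):
        end = start + base + (1 if i < extra else 0)
        slices.append(items[start:end])
        start = end
    return slices


def randomize_slice(slice_index, txns):
    """Fill in the data field of every transaction in one slice."""
    rng = random.Random(slice_index)  # independent generator per slice
    for txn in txns:
        txn["data"] = rng.getrandbits(32)
    return len(txns)


def randomize_parallel(txns, num_tasks=4):
    """Process each slice in its own task; returns the size of each slice."""
    slices = split_into_slices(txns, num_tasks)
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        return list(pool.map(randomize_slice, range(num_tasks), slices))
```

Each slice is a separate list that refers to the same transaction objects as the original container, so the tasks update the transactions in place without touching each other's elements and no FIFO traffic is needed.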

Execution semantics

The cited listing for the eUVM parallelized fork shows that a call to the fork construct returns a Fork object, which can be stored for later processing. The fork body is wrapped in a lambda function to capture the scoped variable. The set_thread_affinity method is then used to bind individual forks to parallel task executors. Finally, the forks are joined; as in SystemVerilog, the forks start executing only after a call to join or another blocking construct. [C8]
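
The sequence of steps in that listing (create and store the forks, set affinity, then join) can be mimicked in Python. The sketch below is not the eUVM Fork class: the Python Fork class, its start and wait methods, and join_all are hypothetical stand-ins, and set_thread_affinity only records the requested executor index rather than pinning a thread. The default-argument lambda plays the role of capturing the scoped loop variable.

```python
import threading


class Fork:
    """Hypothetical fork handle whose body runs only once it is started."""

    def __init__(self, body):
        self._thread = threading.Thread(target=body)
        self.affinity = None

    def set_thread_affinity(self, executor_index):
        # Assumption: only records the target executor; no real pinning here.
        self.affinity = executor_index

    def start(self):
        self._thread.start()

    def wait(self):
        self._thread.join()


def join_all(forks):
    """Start every stored fork, then wait for all of them to finish."""
    for f in forks:
        f.start()
    for f in forks:
        f.wait()


def run_demo(num_forks=4):
    results = [None] * num_forks
    forks = []
    for i in range(num_forks):
        # The default argument binds the current value of i to this body;
        # a plain closure would see only the final value of i.
        body = lambda idx=i: results.__setitem__(idx, idx * idx)
        forks.append(Fork(body))

    for i, f in enumerate(forks):
        f.set_thread_affinity(i)    # one fork per (hypothetical) executor

    ran_before_join = any(r is not None for r in results)
    join_all(forks)
    return ran_before_join, results
```

Calling run_demo() returns (False, [0, 1, 4, 9]): no fork body has run before the join, and each body sees its own captured index.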

Applicability

Sequence Level Parallelism is applicable when multicore execution can reduce bottlenecks in sequence or transaction generation. The worker-thread and asynchronous FIFO pattern is useful when transaction generation and randomization are complex and time-consuming. The multicore fork pattern is positioned for scenarios with many simpler transactions where FIFO communication overhead would otherwise offset the benefit of parallelism. [C5][C6][C7]