Vector Processing Unit (VPU) Wiki

Overview

The Vector Processing Unit (VPU) is an academic RISC-V-based decoupled vector accelerator developed in the context of the European Processor Initiative (EPI). The cited verification work states that the accelerator was successfully taped out, implemented version 0.7.1 of the RISC-V Vector extension (RVV), and was connected to a scalar processor core through the Open Vector Interface (OVI). [VPU identity and context]

Within the EPI project, Barcelona Supercomputing Center developed the vector accelerator, SemiDynamics designed the scalar RISC-V core, EXTOLL handled top-level test-chip integration, and Fraunhofer coordinated tape-out. [EPI development roles]

Architecture

The VPU is based on version 0.7.1 of the RISC-V Vector extension. It has eight vector lanes and supports a maximum vector length of 256 elements of 64 bits each, for a total of 16 Kbit (2 KiB) per vector register. Its register-file organization includes 32 logical vector registers and 40 physical vector registers. [VPU architecture]

Each lane contains one Fused Multiply Accumulate (FMA) unit capable of two double-precision operations per cycle, giving the VPU a stated maximum throughput of 16 double-precision floating-point operations per cycle (8 lanes × 2, written as 16 DFlops/cycle). The design supports 64-bit and 32-bit floating-point vector operations, as well as 64-, 32-, 16-, and 8-bit integer vector operations. [VPU compute capabilities]
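
The register size and peak-throughput figures above can be cross-checked with a few lines of arithmetic. This is only an illustrative sketch: the constant names are invented for the example, and the even split of elements across lanes is an assumption rather than something the cited work states.

```python
# Parameters quoted in the Architecture section.
LANES = 8
MAX_VL_ELEMENTS = 256      # maximum vector length, in 64-bit elements
ELEMENT_BITS = 64
FLOPS_PER_FMA = 2          # one multiply plus one add per cycle

register_bits = MAX_VL_ELEMENTS * ELEMENT_BITS
register_bytes = register_bits // 8
peak_dp_flops_per_cycle = LANES * FLOPS_PER_FMA

# Assumption: elements are distributed evenly across the lanes.
elements_per_lane = MAX_VL_ELEMENTS // LANES

print(register_bits, register_bytes, peak_dp_flops_per_cycle, elements_per_lane)
```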

The eight vector lanes are connected to memory operation units, an inter-lane ring, and instruction queues. The scalar core executes scalar instructions and sends vector instructions to the VPU. Vector memory accesses are performed by the scalar core through OVI rather than by the VPU directly. [VPU scalar-core integration]

Out-of-order execution is limited and occurs mostly between arithmetic and memory operations. [VPU memory behavior]

Open Vector Interface integration

OVI is the interface through which the VPU connects to the scalar core. The cited work identifies seven OVI sub-interfaces used in the VPU/core integration: ISSUE, DISPATCH, COMPLETED, MEMOP, LOAD, STORE, and MASK-INDEX. [OVI sub-interfaces]

The ISSUE sub-interface carries requests from the core along with instruction, configuration, and scalar-input information. DISPATCH confirms or kills issued instructions, enabling speculative issue of vector instructions. COMPLETED notifies instruction completion and carries metadata and scalar output. MEMOP carries start and finish signals for memory operations. LOAD sends load data and metadata from the core to the VPU. STORE sends store-operation data from the VPU to the core. MASK-INDEX sends vector content used to generate addresses for masked and indexed memory instructions. [OVI sub-interface roles]
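
The interplay between ISSUE and DISPATCH can be pictured as a small bookkeeping structure: an instruction is held as pending when issued and only takes effect once it is confirmed. The sketch below is a simplified model, not the OVI protocol itself. The class name, the `tag` identifier, and the per-instruction kill semantics are all assumptions made for illustration.

```python
from collections import OrderedDict


class SpeculativeIssueTracker:
    """Tracks instructions that were issued but not yet confirmed or killed."""

    def __init__(self):
        self.pending = OrderedDict()   # tag -> instruction text, in issue order
        self.confirmed = []            # instructions allowed to take effect

    def issue(self, tag, instruction):
        # ISSUE: the core sends the instruction before knowing it will commit.
        self.pending[tag] = instruction

    def dispatch(self, tag, kill):
        # DISPATCH: the core later confirms or kills the issued instruction.
        instruction = self.pending.pop(tag)
        if not kill:
            self.confirmed.append(instruction)
        # A killed instruction is simply dropped in this model.


tracker = SpeculativeIssueTracker()
tracker.issue(0, "vadd.vv v1, v2, v3")
tracker.issue(1, "vmul.vv v4, v1, v1")
tracker.dispatch(0, kill=False)
tracker.dispatch(1, kill=True)
print(tracker.confirmed)
```

In this model the killed multiply never reaches the confirmed list, which is the property that makes speculative issue safe.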

ISSUE, STORE, and MASK-INDEX use a credit system for handshaking between the VPU and the scalar core. For each vector instruction, several OVI sub-interfaces may need to be considered, because some can change the instruction's behavior or carry its results. [OVI handshaking and instruction effects]
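
A credit-based handshake generally works by letting the sender transmit only while it holds credits, with the receiver returning a credit each time it frees a buffer slot. The following is a minimal sketch of that general idea. The class, its method names, and the initial credit count of 2 are hypothetical; the cited work does not give the actual credit counts or signal names.

```python
class CreditChannel:
    """Sender-side view of a credit-based handshake."""

    def __init__(self, initial_credits):
        self.credits = initial_credits

    def can_send(self):
        return self.credits > 0

    def send(self, payload):
        # Each transfer consumes one credit; sending without credit is an error.
        if not self.can_send():
            raise RuntimeError("no credits available")
        self.credits -= 1
        return payload

    def return_credit(self):
        # The receiver hands a credit back once it has freed a buffer slot.
        self.credits += 1


issue = CreditChannel(initial_credits=2)
issue.send("vle.v v1, (a0)")
issue.send("vadd.vv v2, v1, v1")
blocked = not issue.can_send()   # both credits are now in flight
issue.return_credit()
print(blocked, issue.can_send())
```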

Verification approach

The VPU was verified using a Universal Verification Methodology (UVM) environment intended to be modular, scalable, reusable, and shareable among project partners. The verification team considered separate constrained-random environments for individual VPU submodules, but chose interface-level verification around OVI because submodule-level verification would have required excessive effort and final specifications were not ready for all submodules. [UVM verification strategy]

The UVM environment created one agent for each semi-independent OVI sub-interface. For example, the ISSUE sub-interface agent contains a sequencer, driver, and monitor connected to the virtual interface. Virtual sequences create interface-specific transactions, monitors observe interface state and return information to virtual sequences, and UVM events synchronize communication among sub-interfaces. [UVM environment structure]

Because of the strong dependencies among OVI sub-interfaces, the environment randomized only the instructions fed to the ISSUE sub-interface and made the other sub-interfaces react according to the driven instructions. [Constrained-random strategy]
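
The idea of randomizing only the ISSUE stream and deriving the behavior of the other sub-interfaces from it can be sketched as follows. The opcode groups are a tiny hypothetical subset, the mapping from opcode to active sub-interfaces is an assumption based on the role descriptions above, and masked or indexed accesses (which would also involve MASK-INDEX) are left out.

```python
import random

# Hypothetical opcode groups; the real instruction set is far larger.
ARITH = ["vadd.vv", "vfadd.vv"]
LOADS = ["vle.v"]
STORES = ["vse.v"]


def reactive_subinterfaces(opcode):
    """Sub-interfaces that must react to an issued instruction, besides ISSUE."""
    active = ["DISPATCH", "COMPLETED"]
    if opcode in LOADS:
        active += ["MEMOP", "LOAD"]
    elif opcode in STORES:
        active += ["MEMOP", "STORE"]
    return active


rng = random.Random(1)   # fixed seed so a failing stream can be replayed
stream = [rng.choice(ARITH + LOADS + STORES) for _ in range(5)]
plan = [(op, reactive_subinterfaces(op)) for op in stream]
for op, active in plan:
    print(op, active)
```

Only `stream` is random; everything in `plan` follows deterministically from it, which mirrors the constraint that the other sub-interfaces cannot be randomized independently.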

Reference model and co-simulation

The verification infrastructure used Spike, the RISC-V ISA simulator, for co-simulation in the UVM environment. Spike served two roles: it executed scalar instructions and provided vector instructions to the UVM environment in program order, and it acted as the golden reference model against which the results of the VPU design under test were checked. [Spike roles]

The team modified Spike to accept SystemVerilog Direct Programming Interface (DPI) calls, run the simulation until the next vector instruction is executed, return reference results to the UVM environment, allow Spike memory to be read, and allow reduction results to be forced into Spike so that unordered floating-point reductions do not cause divergence. [Spike modifications]

A UVM scoreboard compared VPU results against reference-model results. When an instruction completed, the scoreboard compared the VPU outputs with values extracted from Spike, including destination vector-register data when needed. [Scoreboard comparison]

For unordered floating-point reductions, the VPU used a different reduction algorithm from Spike, which was permitted by the RVV specification. To avoid false mismatches and later divergence in Spike register state, the verification team created an independent C reference model implementing the same reduction algorithm as the VPU; matching results were then injected into Spike registers. [Floating-point reduction handling]
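
The reason two correct implementations can disagree is that floating-point addition is not associative, so the order of accumulation affects the rounded result. The sketch below contrasts a left-to-right sum with a pairwise (tree) sum. Neither function is claimed to be the algorithm used by Spike or by the VPU; they are just two legal orderings chosen to make the effect visible.

```python
def sequential_sum(values):
    """Left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def tree_sum(values):
    """Pairwise (tree) accumulation; assumes at least one value."""
    values = list(values)
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])   # odd element carries over unchanged
        values = paired
    return values[0]


data = [1e16, 1.0, -1e16, 1.0]
print(sequential_sum(data), tree_sum(data))
```

With this input, the two orderings lose different low-order bits and produce different sums. A scoreboard comparing bit-exact results would flag a mismatch even though both are valid unordered reductions.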

Test generation, CI, and coverage

The verification infrastructure used RISCV-DV, a SystemVerilog/UVM-based open-source RISC-V instruction generator, to generate random RISC-V assembly tests with vector instructions. Because RISCV-DV implemented a later RVV version than 0.7.1, the team adapted the required parts to RVV 0.7.1. [RISCV-DV use]

Reported RISCV-DV additions included generation of vsetvli instructions, modification of memory-operation generation to allow changes of element width and vector length, an option to select initialization patterns for data pages, constraints on memory addresses to avoid memory exceptions, and adaptation to RVV 0.7.1. [RISCV-DV adaptations]

The verification environment included more than 50 SystemVerilog Assertions focused on OVI behavior, with most assertions targeting memory-related sub-interfaces. [OVI assertions]

The continuous-integration infrastructure used Jenkins pipelines to generate new random tests, re-run failed tests, select regression sets based on coverage, and run regression suites both on candidate changes and as weekly large-set checks. [CI pipelines]
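
The cited work does not describe how regression sets were chosen from coverage data. One common approach, assumed here purely for illustration, is a greedy selection that repeatedly picks the test adding the most uncovered bins. The test names and coverage bins below are made up.

```python
def select_regression(coverage_by_test):
    """Greedily pick tests until no remaining test adds new coverage bins."""
    remaining = dict(coverage_by_test)
    covered = set()
    selected = []
    while remaining:
        # Pick the test contributing the most not-yet-covered bins
        # (ties go to the alphabetically first test name).
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            break
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered


tests = {
    "t0": {"vadd", "vle"},
    "t1": {"vle"},
    "t2": {"vse", "vadd", "vmul"},
    "t3": {"vmul"},
}
selected, covered = select_regression(tests)
print(selected, sorted(covered))
```

Here two of the four tests are enough to reach every bin, so the other two can be dropped from the regression set without losing coverage.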

The verification work reports that the environment was used for about a year, found 3005 errors, and reached 95.79% average functional coverage. Reported code coverage averaged 72.64%, including 90.90% statement coverage and 49.83% toggle coverage. [Verification results]

Memory-operation verification

Memory operations were identified as one of the most delicate parts of the design. The VPU does not directly access memory; instead, it reads and writes data through the scalar core using the MEMOP, LOAD, STORE, and MASK-INDEX sub-interfaces, which requires substantial communication among sub-interfaces. [Memory verification complexity]

For load operations, the expected memory data was obtained from Spike and written into a memory model before instruction execution, then sent to the VPU through the LOAD sub-interface. For store operations, the memory contents were captured before execution so that masked operations could be checked and undesired writes detected; after the VPU sent the store data, the stored values were compared with Spike. [Memory operation checking]
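
The need for pre-execution memory contents in store checking can be shown with a small model: without the "before" image, a write to a masked-off element cannot be told apart from data that was already there. The sketch below assumes a unit-stride store of 8-byte elements and uses a plain dictionary as the memory model. The function names and the injected bug are hypothetical.

```python
def apply_masked_store(memory, base, data, mask, elem_bytes=8):
    """Return a copy of memory after a unit-stride masked store."""
    result = dict(memory)
    for i, (value, active) in enumerate(zip(data, mask)):
        if active:
            result[base + i * elem_bytes] = value
    return result


def undesired_writes(before, after, base, mask, elem_bytes=8):
    """Addresses that changed although their mask bit was off."""
    masked_off = {base + i * elem_bytes for i, m in enumerate(mask) if not m}
    return sorted(a for a in masked_off if before.get(a) != after.get(a))


before = {0x100: 10, 0x108: 11, 0x110: 12, 0x118: 13}
mask = [1, 0, 1, 0]
expected = apply_masked_store(before, 0x100, [20, 21, 22, 23], mask)

# Hypothetical buggy result: element 1 was written despite its mask bit being 0.
dut = dict(expected)
dut[0x108] = 21

print(undesired_writes(before, expected, 0x100, mask))
print(undesired_writes(before, dut, 0x100, mask))
```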

Masked memory operations required outgoing VPU transactions carrying masks or indexes, which the environment used to execute and compare the instruction against Spike. This comparison helped identify mask-related memory-instruction errors. [Masked memory checking]

OVI retries added further verification complexity. A retry occurs when the VPU cannot handle all loaded cache lines sent by the scalar core; the instruction completes with a vstart value indicating the first element not written to the vector registers and must then be re-executed from that element. The verification work identifies retries as one of the primary sources of VPU errors. [OVI retries]
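
The retry mechanism amounts to a loop that resumes a load from the reported vstart until every element has been written. The following sketch models only that control flow. The function name and the `accept_limits` parameter (how many elements the model accepts per attempt) are invented stand-ins for the VPU running out of room for incoming cache lines.

```python
def load_with_retries(memory_elems, accept_limits):
    """Re-issue a vector load from vstart until every element is written.

    accept_limits[k] is how many elements the model accepts on attempt k;
    the last entry is reused if more attempts are needed.
    """
    vl = len(memory_elems)
    vreg = [None] * vl
    vstart = 0
    attempts = 0
    while vstart < vl:
        limit = accept_limits[min(attempts, len(accept_limits) - 1)]
        if limit <= 0:
            raise ValueError("each attempt must accept at least one element")
        end = min(vl, vstart + limit)
        vreg[vstart:end] = memory_elems[vstart:end]
        vstart = end          # first element not yet written
        attempts += 1
    return vreg, attempts


vreg, attempts = load_with_retries(list(range(8)), accept_limits=[3, 2, 8])
print(vreg, attempts)
```

A checker built around this kind of loop has to verify both that elements below vstart are preserved across attempts and that the final register contents match the reference, which hints at why retries were a rich source of errors.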