Source fd7abd23... — STIMSMITH

SOURCE ARCHIVE

SHA256: fd7abd23170afcc333204e9084efaac8350f7a5975cf0b6164e0231f1461fdef

URL: https://riscv-europe.org/summit/2025/media/proceedings/2025-05-15-RISC-V-Summit-Europe-P2.2.06-RUGG-abstract.pdf

TYPE: application/pdf

SIZE: 442.7 KB

FETCHED: 6/7/2026, 10:15:27 PM

EXTRACTOR: liteparse

CHARS: 13,274

EXTRACTED CONTENT

Who tests the TestRIG? Tooling for randomised tandem verification

Peter Rugg, Alexandre Joannou, Jonathan Woodruff, Franz A. Fuchs, Simon W. Moore
University of Cambridge, first.last@cl.cam.ac.uk

Abstract

TestRIG is a framework to test RISC-V implementations first presented at the RISC-V Summit in Zurich in 2019. Since then, the ecosystem has grown, with multiple new implementations integrated and industrial interest. This presentation discusses some improvements to the ecosystem, including mutation-based coverage tooling, features for generating static test suites, and a single-implementation mode that enables more traditional fuzzing. The developments are motivated by testing the Toooba processor, including the CHERI security extensions. This work helps to evolve TestRIG into a tool suite that can increasingly improve assurance in RISC-V designs.

RISC-V Summit Europe, Paris, 12-15th May 2025

Background

TestRIG is an ecosystem for cross-verifying RISC-V implementations using a standard RVFI-DII interface [1]. Verification Engines connect to the implementations over this interface: QuickCheckVEngine uses Haskell’s QuickCheck library to generate tests and automatically shrink any divergences to a minimal reproducer. The RISC-V golden Sail model implements RVFI-DII, allowing implementations to be compared against this (hopefully) correct-by-definition executable simulator [2]. The RISC-V community is in the process of standardising a CHERI extension, adding unforgeable hardware capabilities for memory safety and compartmentalisation [3]. TestRIG is in use to test CHERI in the Toooba and CVA6 processors.

Since initial publication, the TestRIG infrastructure has seen increasing community engagement, including users and contributors from Microsoft Research, lowRISC, and SCI Semiconductor. The repository now links to 10 RVFI-DII-extended implementations, and has several forks from other members of the community. Improved support has been added for RISC-V compressed instructions and other extensions.

Figure 1: Auto-generated report showing the results of mutation-based testing of a Sail CHERI function by removing lines of Sail code (just one type of mutation out of several). Blue lines resulted in a model that did not build when deleted, green shows successful detection (the user can click to link to the failing test), and red shows a case where no counterexample was detected. In this case, the user can see that they need to direct random test generation to produce more sealed capabilities. Traditional code coverage would show that this line was run, hiding the blind-spot!

Coverage

An important gap in TestRIG so far is a means to measure the effectiveness of Verification Engines. This section introduces a tool to automatically measure this using mutation-based testing.

Coverage measurement is particularly important for directed random verification, as it is otherwise difficult to determine the quality of the distribution of tests produced. Traditional coverage measures properties of the simulator alone, for example whether certain lines of code have been run, or certain state configurations reached. This leaves a gap in assurance: even if code has been run, it may contain bugs that would not be visible in any of the checked outputs of the test. For example, Figure 1 shows a case where a CHERI validity tag gets conditionally cleared: a test may cause this code to be run, but only on already-untagged capabilities, or the capability may be used later in the test, hiding the error.

The Sail model enables us to approach the problem differently: measuring mutation adequacy [4] of the TestRIG generators. By introducing artificial bugs into the Sail model, we can assert that the tests definitely do catch those types of bugs. The tests will be run comparing the mutated Sail model against the original, detecting any divergences in outputs. Due to automatic counterexample shrinking, these divergences can typically be made very concise. For example, we can hardcode an if condition to true. Running this under TestRIG against another unmodified version of the Sail model confirms that the tests relied on behaviour where the else was taken. Figure 1 demonstrates the benefits when verifying a CHERI core.

We have implemented a framework to perform this work as scripts within the TestRIG repository. Users can implement transformations over the Sail model as Python classes that implement two functions: one to find points to transform (e.g. each if clause in the model) and one to perform the transformations (e.g. replacing the condition with a hard-coded false). The framework then manages performing the transformations on copies of the Sail model, building them, and running a Verification Engine. Transformations and results of runs are tracked in an SQLite database and can be displayed in HTML to allow easy visual inspection, as seen in Figure 1.

While the scripts currently use text-based transformations written in Python, we would ideally hook in to the Sail compiler to perform transformations on the syntax tree. Inlining function definitions could also make coverage measurements more meaningful. The framework could also be adapted to allow mutations to the RTL of a particular implementation, testing for microarchitectural coverage rather than just model coverage [5].

Test suite generation

There is a useful side effect of the above coverage approach, combined with QuickCheckVEngine’s existing shrinking mechanism. Since only a single difference is introduced to the model at a time, the counterexample produced is very targeted to that line of code. With counterexample shrinking, this produces a minimal test case for that line of the architecture, relying on the minimal features needed to test that behaviour. Repeating this allows us to build up a library of traces that test each line of the model. These tests are typically very short for the types of coverage examined so far: a single-digit number of instructions, as opposed to hundreds required for traditional ISA tests. This makes it much easier to diagnose a failure. We conjecture that such tests could form the basis of a comprehensive architectural compatibility suite.
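The two-function shape of a transformation class described under Coverage (one function to find points, one to perform the transformation) can be illustrated with a minimal sketch. This is not the code from the TestRIG repository: the class name `IfConditionToFalse`, the method names `find_points` and `transform`, the helper `generate_mutants`, and the Sail-like snippet are all hypothetical, and the regular expression is a deliberately naive text match that ignores nested conditionals and comments.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MutationPoint:
    """Character span of the source text that one mutation rewrites."""
    start: int
    end: int


class IfConditionToFalse:
    """Hypothetical text-based mutation: hard-code an `if` condition to false."""

    # Naive match of `if <cond> then`; assumes no nested `if` inside <cond>.
    _pattern = re.compile(r"\bif\b(?P<cond>.+?)\bthen\b", re.DOTALL)

    def find_points(self, source: str) -> List[MutationPoint]:
        """Return one mutation point per `if ... then` found in the source."""
        return [MutationPoint(m.start("cond"), m.end("cond"))
                for m in self._pattern.finditer(source)]

    def transform(self, source: str, point: MutationPoint) -> str:
        """Return a copy of the source with the condition at `point` replaced."""
        return source[:point.start] + " false " + source[point.end:]


def generate_mutants(source: str, mutation: IfConditionToFalse) -> List[str]:
    """Produce one mutated copy per point, so each mutant differs in one place."""
    return [mutation.transform(source, p) for p in mutation.find_points(source)]


# Made-up Sail-like text for illustration only; not taken from the Sail model.
SAIL_SNIPPET = """\
let a = if isSealed(cap) then clearTag(cap) else cap;
let b = if x == y then 1 else 0;
"""

mutants = generate_mutants(SAIL_SNIPPET, IfConditionToFalse())
for index, mutant in enumerate(mutants):
    print(f"--- mutant {index} ---")
    print(mutant)
```

Each mutant differs from the original in exactly one condition, which is what makes a shrunk counterexample specific to that line. Building each mutant, running a Verification Engine against the unmodified model, and recording the outcome are left out of this sketch.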
Debugging lockups

An unrelated TestRIG development was motivated by bugs in the Toooba processor causing it to lock up and not retire instructions. This is one of the worst possible failure modes for a processor, likely requiring a hardware reset to recover. We discovered Toooba could lock up by mis-decoding some illegal instructions. To ensure we had caught all the relevant cases, we added a mode to TestRIG to allow a single implementation to be run alone, without needing to compare to a model. Simply checking that an RVFI report is received for every DII instruction within a timeout suffices to check for lockup conditions, but we also allow templates to specify asserts that can check for arbitrary properties of the resulting trace.

Running as a single implementation is important, as we would like to be able to check for lockup conditions without needing to align the implementation and model on true or all implementation-defined cases of the specification. This then allows us to inject a string of completely uniform 32-bit instructions into the processor. DII is very useful here, as otherwise managing control-flow in the processor would be difficult. This reproduced the decode issues, and confirmed that the problem was resolved following our fixes to Toooba.

This approach surprisingly also found a subtle and rare branch prediction issue in Toooba that could also cause a lockup. The fetch stage could get stuck in a loop, incorrectly predicting that the first instruction is a compressed jump, then redirecting without correctly retraining the branch predictor due to an associativity issue. This shows that the methodology was able to discover behaviours relying on deep and rare conditions. The framework also identified a fatal assert in a version of the Sail model.

Conclusion

We have shown several new tools that add additional capabilities to the TestRIG framework. This closes key gaps, including validation of the tests generated, support for generating minimal static unit tests, and extra tooling for catching lockup bugs. We hope that processor implementers can see ever-greater benefits from joining the ecosystem. All the work is open-source under permissive licenses: we encourage everyone to use it and contribute suggestions and improvements!

References

[1] Alexandre Joannou et al. “Randomized Testing of RISC-V CPUs Using Direct Instruction Injection”. In: IEEE Design & Test 41.1 (2024), pp. 40–49. doi: 10.1109/MDAT.2023.3262741.
[2] Alasdair Armstrong et al. “ISA semantics for ARMv8-A, RISC-V, and CHERI-MIPS”. In: Proc. ACM Program. Lang. 3.POPL (Jan. 2019). doi: 10.1145/3290384. url: https://doi.org/10.1145/3290384.
[3] RISC-V CHERI extension TG. RISC-V Specification for CHERI Extensions. https://github.com/riscv/riscv-cheri.
[4] Hong Zhu, Patrick A. V. Hall, and John H. R. May. “Software unit test coverage and adequacy”. In: ACM Comput. Surv. 29.4 (Dec. 1997), pp. 366–427. issn: 0360-0300. doi: 10.1145/267580.267590. url: https://doi.org/10.1145/267580.267590.
[5] Y. Serrestou, V. Beroulle, and C. Robach. “Functional Verification of RTL Designs driven by Mutation Testing metrics”. In: 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007). 2007, pp. 222–227. doi: 10.1109/DSD.2007.4341472.
