SOURCE ARCHIVE
EXTRACTED CONTENT
13,274 chars Who tests the TestRIG? Tooling for randomised
tandem verification
Peter Rugg, Alexandre Joannou, Jonathan Woodruff, Franz A. Fuchs, Simon W. Moore
University of Cambridge, first.last@cl.cam.ac.uk
Abstract
TestRIG is a framework to test RISC-V implementations first presented at the RISC-V Summit in Zurich in
2019. Since then, the ecosystem has grown, with multiple new implementations integrated and industrial interest.
This presentation discusses some improvements to the ecosystem, including mutation-based coverage tooling,
features for generating static test suites, and a single-implementation mode that enables more traditional fuzzing.
The developments are motivated by testing the Toooba processor, including the CHERI security extensions. This
work helps to evolve TestRIG into a tool suite that can increasingly improve assurance in RISC-V designs.
Background rs2)) = {
let mode execution mode capIssealed(cap));
let cap ==clearTaglf(cap, encdec(X(rs2)[0 .. 01);
TestRIG is an ecosystem for cross-verifying RISC-V let hasMode = not (permshal
mode) else cap:
implementations using a standard RVFI-DII inter- formed (cap)) & canX(cap);
face [1]. Verification Engines connect to the imple- SUCCESS
mentations over this interface: QuickCheckVEngine uses Haskell’s QuickCheck library to generate tests ~~ Figure 1: Auto-generated report showing the results of and automatically shrink any divergences to a mini- mutation-based testing of a Sail CHERI function by re- mal reproducer. The RISC-V golden Sail model im- moving lines of Sail code (just one type of mutation out of plements RVFI-DII, allowing implementations to be several). Blue lines resulted in a model that did not build compared against this (hopefully) correct-by-definition when deleted, green shows successful detection (the user can click to link to the failing test), and red shows a case where executable simulator [2]. The RISC-V community is no counterexample was detected. In this case, the user in the process of standardising a CHERI extension, can see that they need to direct random test generation to adding unforgeable hardware capabilities for memory produce more sealed capabilities. Traditional code coverage safety and compartmentalisation [3]. TestRIG is in use ~~ would show that this line was run, hiding the blind-spot! to test CHERI in the Toooba and CVA6 processors. Since initial publication, the TestRIG infrastructure has seen increasing community engagement, includ- has been run, it may contain bugs that would not ing users and contributors from Microsoft Research, be visible in any of the checked outputs of the test. lowRISC, and SCI Semiconductor. The repository now ~~ For example, Figure 1 shows a case where a CHERI links to 10 RVFI-DII-extended implementations, and validity tag gets conditionally cleared: a test may cause has several forks from other members of the commu- this code to be run, but only on already-untagged nity. Improved support has been added for RISC-V capabilities, or the capability may be used later in the compressed instructions and other extensions. test, hiding the error. The Sail model enables us to approach the problem differently: measuring mutation adequacy [4] of the Coverage TestRIG generators. By introducing artificial bugs into the Sail model, we can assert that the tests definitely An important gap in TestRIG so far is a means to do catch those types of bugs. The tests will be run measure the effectiveness of Verification Engines. This ~~ comparing the mutated Sail model against the original, section introduces a tool to automatically measure this detecting any divergences in outputs. Due to auto- using mutation-based testing. matic counterexample shrinking, these divergences can Coverage measurement is particularly important for typically be made very concise. For example, we can directed random verification, as it is otherwise difficult ~~ hardcode an if condition to true. Running this under to determine the quality of the distribution of tests ~~ TestRIG against another unmodified version of the produced. Traditional coverage measures properties of Sail model confirms that the tests relied on behaviour the simulator alone, for example whether certain lines where the else was taken. Figure 1 demonstrates the of code have been run, or certain state configurations ~~ benefits when verifying a CHERI core. reached. This leaves a gap in assurance: even if code We have implemented a framework to perform this
RISC-V Summit Europe, Paris, 12-15th May 2025 1
work as scripts within the TestRIG repository. Users allow templates to specify asserts that can check for can implement transformations over the Sail model arbitrary properties of the resulting trace. Running as Python classes that implement two functions: one as a single implementation is important, as we would to find points to transform (e.g. each if clause in the like to be able to check for lockup conditions without model) and one to perform the transformations (e.g. needing to align the implementation and model on replacing the condition with a hard-coded true or all implementation-defined cases of the specification. false). The framework then manages performing the This then allows us to inject a string of completely transformations on copies of the Sail model, building uniform 32-bit instructions into the processor. DII is them, and running a Verification Engine. Transfor- very useful here, as otherwise managing control-flow mations and results of runs are tracked in an SQLite in the processor would be difficult. This reproduced database and can be displayed in HTML to allow easy the decode issues, and confirmed that the problem was visual inspection, as seen in Figure 1. While the scripts resolved following our fixes to Toooba. currently use text-based transformations written in This approach surprisingly also found a subtle and Python, we would ideally hook in to the Sail compiler rare branch prediction issue in Toooba that could also to perform transformations on the syntax tree. In- cause a lockup. The fetch stage could get stuck in a lining function definitions could also make coverage loop, incorrectly predicting that the first instruction is measurements more meaningful. a compressed jump, then redirecting without correctly The framework could also be adapted to allow mu- retraining the branch predictor due to an associativity tations to the RTL of a particular implementation, issue. This shows that the methodology was able to testing for microarchitectural coverage rather than discover behaviours relying on deep and rare condi- just model coverage [5]. tions. The framework also identified a fatal assert in a version of the Sail model. Test suite generation Conclusion There is a useful side effect of the above coverage ap- proach, combined with QuickCheckVEngine’s existing We have shown several new tools that add additional shrinking mechanism. Since only a single difference is capabilities to the TestRIG framework. This closes key introduced to the model at a time, the counterexample gaps, including validation of the tests generated, sup- produced is very targeted to that line of code. With port for generating minimal static unit tests, and extra counterexample shrinking, this produces a minimal tooling for catching lockup bugs. We hope that pro- test case for that line of the architecture, relying on cessor implementers can see ever-greater benefits from the minimal features needed to test that behaviour. joining the ecosystem. All the work is open-source Repeating this allows us to build up a library of traces under permissive licenses: we encourage everyone to that test each line of the model. These tests are typ- use it and contribute suggestions and improvements! ically very short for the types of coverage examined so far: a single-digit number of instructions, as op- References posed to hundreds required for traditional ISA tests. This makes it much easier to diagnose a failure. We [1] Alexandre Joannou et al. “Randomized Testing of RISC-V conjecture that such tests could form the basis of a CPUs Using Direct Instruction Injection”. In: IEEE Design comprehensive architectural compatibility suite. & Test 41.1 (2024), pp. 40–49. doi: 10.1109/MDAT.2023. 3262741. Debugging lockups [2] Alasdair Armstrong et al. “ISA semantics for ARMv8- a, RISC-v, and CHERI-MIPS”. In: Proc. ACM Program. Lang. 3.POPL (Jan. 2019). doi: 10.1145/3290384. url: An unrelated TestRIG development was motivated https://doi.org/10.1145/3290384. by bugs in the Toooba processor causing it to lockup [3] RISC-V CHERI extension TG. RISC-V Specification for and not retire instructions. This is one of the worst CHERI Extensions. https://github.com/riscv/riscv- possible failure modes for a processor, likely requiring [4] cheri. a hardware reset to recover. We discovered Toooba Hong Zhu, Patrick A. V. Hall, and John H. R. May. “Soft- ware unit test coverage and adequacy”. In: ACM Comput. could lockup by mis-decoding some illegal instructions. Surv. 29.4 (Dec. 1997), pp. 366–427. issn: 0360-0300. doi: To ensure we had caught all the relevant cases, we 10.1145/267580.267590. url: https://doi.org/10.1145/ added a mode to TestRIG to allow a single implemen- 267580.267590. tation to be run alone, without needing to compare [5] Y. Serrestou, V. Beroulle, and C. Robach. “Functional to a model. Simply checking that an RVFI report is Verification of RTL Designs driven by Mutation Testing metrics”. In: 10th Euromicro Conference on Digital System received for every DII instruction within a timeout Design Architectures, Methods and Tools (DSD 2007). suffices to check for lockup conditions, but we also 2007, pp. 222–227. doi: 10.1109/DSD.2007.4341472.
2 RISC-V Summit Europe, Paris, 12-15th May 2025