Source a5d97e9c... — STIMSMITH

SOURCE ARCHIVE

SHA256: a5d97e9cf93f2aa8c469595903b6bcbcbdbf264c907986805ad5a61c84374af2

URL: https://www.cl.cam.ac.uk/research/security/ctsrd/pdfs/202402-testrig.pdf

TYPE: application/pdf

SIZE: 183.0 KB

FETCHED: 6/3/2026, 8:31:30 AM

EXTRACTOR: liteparse

CHARS: 46,883

EXTRACTED CONTENT

46,883 chars

IEEE DESIGN AND TEST, 2023 1

Randomized Testing of RISC-V CPUs using Direct Instruction Injection

Alexandre Joannou, Peter Rugg, Jonathan Woodruff, Franz A. Fuchs, Marno van der Maas, Matthew Naylor, Michael Roe, Robert N. M. Watson, Peter G. Neumann, Simon W. Moore

     I. INTRODUCTION                                             used TestRIG to test many standard RISC-V extensions, and
 TestRIG (Testing with Random Instruction Generation) is a       the experimental CHERI security extension.
testing framework for RISC-V implementations. The RISC-V                    We found TestRIG to be easier to use than unit tests,

community has standardized a formal model of the architecture since instructions can be tested as they are implemented in the Sail language1, giving a human-readable specification without supporting a full testing framework. We also found that can also be used for simulation and verification. The that TestRIG gave more thorough test coverage due to random Sail language is designed for this purpose by allowing in- generation replacing developer effort to explore possibilities. It struction semantics to be described conveniently (for exam- is effective at detecting not just issues in instruction semantics, ple, by supporting variable bit-widths). Ideally, a RISC-V but also in the pipeline and the data caches. As a result, implementor could formally prove equivalence between their TestRIG has completely replaced our instruction-set level unit implementation and the Sail model, but proof tools are not testing for development. yet sufficiently automated to be routinely used on the whole- processor level. They instead focus on equivalence between II. THE DREAM – MODEL-BASED VERIFICATION combinational logic functions, with verification of a full out- Architectural extensions are traditionally specified starting of-order microarchitecture remaining an open problem. As a with a prose specification, and then four implementations are pragmatic compromise, we use TestRIG to check equivalence produced largely independently: between the model and an implementation by generating ran- 1) Assembler dom instruction sequences, executing the same sequences on 2) Executable model (simulator) the model and the implementation under test, and comparing 3) Instruction-level unit-test suite execution traces (tandem execution). This approach does not 4) Hardware implementation prove equivalence but can demonstrate divergence, and is While this ensemble of implementation efforts is laborious usable in all stages of development. when done once, its greatest cost lies in discouraging design dardTestRIG uses the RISC-V Formal Interface (RVFI) stan-2 exploration; design changes require consistency among five to observe the change in state after each instruction independent code bases. of the implementation under test, and uses a novel technique A formal, executable instruction-set architecture (ISA) spec- that we are calling Direct Instruction Injection (DII) for test ification can greatly simplify this workflow. We use the injection. In normal program execution, the next instruction Sail [3] domain-specific language, which features human- is fetched from program memory at an address determined readable syntax. Sail excerpts serve as pseudocode in our ISA by the program counter. With Direct Instruction Injection, the documents [15]. Sail also produces a simulator (item 2), and next instruction to be executed is provided by the test harness, will eventually provide verification (item 3) and the assembler regardless of the CPU’s program counter. We are not testing completed, fabricated chips. Rather, (item 1) to be derived from it automatically. we are comparing executable formal models, software ISA simulators and simulated execution of hardware designs. This A. Model-based Formal Verification requires us to instrument the CPU design with an additional Formal verification tools for RISC-V have often used the interface for Direct Instruction Injection used by the test RVFI tracing (see Section IV) interface along with tools harness during tandem verification. like Cadence’s JasperGold, to prove that a series of traces We have added the Direct Instruction Injection interface to from a simple HDL model is equivalent to a series of traces the Sail RISC-V formal model3, and to two high-performance from a pipelined HDL implementation. Unfortunately, these emulators: Spike4, and QEMU5. We have also instrumented tools can handle only in-order pipelines, and require specialist four RISC-V processor implementations with RVFI-DII, span- knowledge. As a result, the formal-verification approach does ning from embedded to superscalar implementations. We have not yet replace functional testing for entire processors. 1https://github.com/riscv/sail-riscv 23https://github.com/SymbioticEDA/riscv-formal B. Model-based Random Testing https://github.com/CTSRD-CHERI/sail-riscv 4https://github.com/riscv-software-src/riscv-isa-sim While formally proving equivalence for complex microar- 5https://github.com/CTSRD-CHERI/qemu chitectures has been elusive, pragmatic tools have used other

IEEE DESIGN AND TEST, 2023                                                                                                                                       2

                                         ways to detect divergence from a model. These approaches                       Verification Engine (VEngine)
cannot                                      prove       equivalence between a formal model and an        Consume Execution Trace(s)     Generate Instructions
implementation, but can refute it with counterexamples.
                                        For example, directed-random test-sequence generation has                           DII
                                          been used to debug pipeline and memory bugs, as well as        RVFI
                                       to uncover unexpected divergences in implementation behav-
                                  ior [1], [13]. There exist multiple test generators for RISC-V,
e.g., RISC-V RTG [14], but RISCV-DV6                                             remains the most            Socket     …    Socket          Socket
                                        advanced such sequence generator for RISC-V, and it works
                                 well for these use cases, particularly where detailed traces can            RISC-V     …    RISC-V          RISC-V
                                         be compared. RISCV-DV generates assembly programs, ready        Implementation  Implementation  Implementation

to be converted to in-memory images for execution. RISCV- DV includes a number of test generators for RV32IMAFDC Fig. 1. An illustrative example of the TestRIG ecosystem with a Verification and RV64IMAFDC – including support for page-table in- Engine communicating with any two RISC-V implementations over sockets. teractions, privileged CSR use, and handling traps/interrupts. The Verification Engine injects instruction sequences and compares the execution traces until it finds a divergence. These generated test programs are executed on both a golden model and a processor in development. A RISCV-DV test framework would typically detect a divergence by comparing III. TESTRIG the execution traces. Figure 1 gives an overview of the modular TestRIG ecosys- Although randomly generating tests is a promising ap- tem. In TestRIG, an interactive Verification Engine (VEngine) proach, it can have several drawbacks:• stimulates RISC-V implementations over RVFI-DII sockets, Automatically generated counterexamples can be long which are detailed in Section IV. An RVFI-DII compatible and convoluted, while hand-written tests can be made RISC-V implementation can reset, consume instruction se- The generator must ensure that useful instructions are A VEngine can drive one or more RVFI-DII compatible • short and easy to understand. quences, and report execution traces via its RVFI-DII interface. found at the target of each randomly generated branch. implementations; a VEngine might have an internal RISC- Automated reduction of failing test cases has previously V model, similar to PyH2P, or could drive two independent been used in software testing. For example, C-Reduce [10] implementations and compare their RVFI traces, as we have can take a program that triggers a bug in a C compiler and done with QCVEngine, which is presented in Section V. reduce it to a minimal example that triggers the bug. VEngine instruction sequences could be loaded from disk, PyH2P [7] applies automated test case reduction randomly generated randomly, or produced with interactive architecture- generated RISC-V instruction sequences. PyH2P often pro- driven state-space exploration. duces test sequences that contain less than 5 instructions, with The RVFI-DII bytestream interface allows models and im- every instruction being meaningful for reproducing the error. plementations written in various languages to communicate Nevertheless, PyH2P has three shortcomings: through widely supported networking sockets. QCVEngine is 1) PyH2P does not perform full trace comparison with its written in Haskell, and the Sail RISC-V model is written internal PyMTL3 model, but only with final memory in the Sail domain-specific language (either interpreted by and register state. an OCaml program or compiled into C). Spike and QEMU 2) PyH2P has difficulty shrinking through branches, as it are RISC-V simulators written in C and C++. TestRIG also must produce a valid in-memory program. supports hardware implementations like RVBS, Ibex, Piccolo, 3) PyH2P does not use community-standard interfaces that Flute, and Toooba, which are written in either SystemVerilog have been proven across a range of implementations. or Bluespec. RVBS7 is a reference implementation, Ibex8 PyH2P points in an encouraging direction, and TestRIG ma- and Piccolo9 are simple 32-bit implementations, Flute10 is tures the approach, proposing a standardized communication a 5-stage in-order pipeline processor implementing RV64, interface so that verification engines (VEngines), models, and and Toooba11 is a RISC-V 64-bit superscalar out-of-order implementations are interchangeable and can be improved in- processor. dependently. Additionally, instruction injection allows straight- Participants in the TestRIG ecosystem are expected to be forward shrinking of sequences with branches. This has al- identical in every architecturally visible way. Besides a RVFI- lowed us to completely replace instruction-level unit tests for DII interface, TestRIG requires 8 MiB of memory accessible at the sophisticated CHERI extension [16], greatly improving address 0x80000000 (all other addresses returning an access both productivity and assurance, and enabling extension of an fault), and must support resetting to a known state (zeroed array of simulators and processors more efficiently than the registers, known default values for RISC-V control and status CHERI implementations on MIPS or ARM. Symbolic QED [12] is another approach that generates 7 minimal tests for verification (including post-silicon) using a 8 https://github.com/CTSRD-CHERI/RVBS https://github.com/CTSRD-CHERI/ibex formal model of the pipeline. 9 https://github.com/CTSRD-CHERI/Piccolo 10 https://github.com/CTSRD-CHERI/Flute 6https://github.com/google/riscv-dv 11 https://github.com/CTSRD-CHERI/Toooba

IEEE DESIGN AND TEST, 2023                                                                                                         3

registers, zeroed 8 MiB of memory) upon injection of a “reset”     remove the instruction cache entirely (but not address trans-
DII packet.                                                        lation of the PC, which is architecturally visible) to ensure
                                                                   maximal pipeline packing, or can exercise the instruction
                   IV. RVFI-DII                                    cache and replace the bytes of the instruction after they have
   To participate in the TestRIG verification ecosystem, imple-    been fetched. RISC-V compressed instructions present another
mentations must be extended with RVFI-DII instrumentation.         choice: to substitute picked instructions before decode, or
To ease development, we distribute data structures and libraries   inject 16-bit instruction fragments from DII to exercise the
in several languages to facilitate RVFI-DII connections over       picking logic. The simple single-issue design of Piccolo and
TCP ports.                                                         Flute enabled us to replace the cache entirely with a DII queue
   The RISC-V Formal Interface (RVFI), specified by Claire         that delivered one instruction every cycle, either compressed
Wolf, is an existing trace format for formal verification us-      or uncompressed. For superscalar Toooba, we began with
ing symbolic instructions. RVFI exposes select architecturally     unmodified instruction-cache access, substituting the vector
significant signals such as the instruction encoding and any       of picked instructions before decoding. In an effort to debug
memory address or value, as well as the indices and values of      instruction picking itself, we later moved to bypassing the
the operand and writeback registers.                               instruction cache and providing 16-bit instruction fragments
   TestRIG extends RVFI with Direct Instruction Injection          to the pipeline, relying on the instruction picker and decode
(DII). DII is for instruction input, RVFI is for trace output,     to reconstitute the correct DII instruction sequence.
and RVFI-DII supports full interactive verification. Interactive           Canceled instructions present a further challenge to DII.
verification enables automated simplification and shrinking, as    Synchronization is required when instructions are dropped
discussed in Section V-A. Existing RISC-V cores that imple-        in the pipeline, as RVFI-DII requires a single RVFI trace
ment RVFI can be augmented to participate in the TestRIG           entry for each DII instruction injected. While adding RVFI-
ecosystem by implementing DII, and conversely RVFI-DII             DII to Flute, we arrived at a mature design that attaches a
designs may benefit from RVFI formal verification tooling.         sequence ID to each RVFI instruction and carries it with the

Not all architectural updates are reported in the RVFI PC through the pipeline. Instruction Fetch actively requests interface, e.g., floating-point registers and extended CHERI ca- each instruction ID from the DII sequence (as with PC requests pability registers. While this is a limitation, PyH2P relies only to the cache), allowing pipeline redirects to work naturally. We on final register and memory state and is still able to usefully adapted this approach to Toooba by adding superscalar fetch detect divergence. We found that occasional instructions that and assigning IDs to compressed instruction fragments. This move unexposed values into RVFI-visible state could produce more capable DII unit is available in our RVFI-DII libraries12, sufficiently succinct counterexamples. This strategy was also and has been backported to Flute. While DII instrumentation used in RVFI formal verification efforts. may appear daunting, we have found that beginning with this An RVFI interface exports internal signals of an RTL de- mature strategy greatly reduces both implementation effort and sign, or internal variables of a simulator or emulator. For more design disturbance. In retrospect, the few hours invested in this complex RTL designs, such as pipelined or superscalar mi- implementation have greatly streamlined the otherwise much croarchitectures, extracting the appropriate values may require longer testing phase. preserving state for an RVFI report in a commit/write-back stage that did not previously have access to them. Extending V. QUICKCHECK VENGINE the superscalar Toooba core for RVFI-DII required two extra Our TestRIG Verification Engine, QCVEngine, leverages records for each instruction in the Reorder Buffer.As these Haskell’s QuickCheck library [5]. Due to the simplicity of records are present only when built for simulation with RVFI, Direct Instruction Injection execution, which decouples the this is not a physical overhead for the design. instruction stream from control flow, QCVEngine can use DII directly specifies the instruction sequence expected in unmodified QuickCheck utilities to generate, compare, and the output trace, and does not associate instructions with shrink instruction sequences. memory addresses. This requires custom pipeline instrumen- QuickCheck receives a function with a pass/fail return value, tation, but enables greatly simplified sequence generation and generates inputs in search of a failure. To facilitate this, we and shrinking, as the program counter does not affect the construct a function that receives a list of instructions, sends instruction stream. these over two DII sockets, collects RVFI traces back from A DII interface receives a reset command followed by a these sockets, asserts that they match, and returns the result. sequence of instructions. A Bluespec implementation of this We then provide a set of generators of arbitrary instruction interface is shown below: sequences that are used by QuickCheck to produce inputs to typedef struct { this function. Bool rvfi_cmd; // Instruction or reset command? Bit#(10) rvfi_time; // Time to inject instruction We use convenience functions to define instructions in a } Bit#(32) rvfi_insn; // Instruction word (32/16 bit) syntax closely resembling the RISC-V ISA manual, and pro- RVFI_DII_Instruction vide tailored generators for each instruction field to promote For an emulator, this interface simply replaces each fetched register reuse. QuickCheck automatically discovers and uses instruction with an encoding from the DII queue. For RTL designs, DII support is more complex. An RTL design can 12 https://github.com/CTSRD-CHERI/BSV-RVFI-DII

IEEE DESIGN AND TEST, 2023                                                                                                                      4

these generators through the type system and uses them to con- as expected behavior updates naturally with the model as struct arbitrary instruction sequences. We also provide targeted the instruction set evolves. We have also used this feature generators for simple subsets of the instruction set, as well as to replay recorded test-suite examples (including riscv-tests generators that leverage templates of varying complexity to and RISCV-DV), adding full trace-equivalence check with reach deeper states, including virtual memory mappings and shrinking. This feature has also allowed us to capture traces cache conflicts. Templates are a common tool for random test of an operating system booting on the model implementation, generators; for example, IBM’s Genesys-Pro [1] is built on which we could then use to aid bring-up of the same operating templates to intelligently solve for desired deep states. system on implementations, with instruction shrinking rapidly highlighting any problems. A. Smart Shrinking While Direct Instruction Injection allows us to primarily C. Non-shrinkable Sequences rely on QuickCheck’s builtin shrinking strategies, we aug- Sequences can be annotated as non-shrinkable. This has mented these with smart shrinking functions that not only been used to force initialization to cover divergences in initial eliminate instructions, but intelligently transform them to state. For example, one implementation did not initialize simplify the sequence. floating-point registers, which produced trivial counterexam- Once a counterexample is found by QCVEngine, ples. A non-shrinkable initialization sequence allowed us to QuickCheck uses a builtin list-shrinking function that progress to interesting divergences in exception conditions and removes sequences from the list and tests again, hoping to rounding modes. eliminate instructions with no relevance to the errant behavior. Illustratively, here is an initial counterexample found for an D. Assertions artificial hardware bug where the LSB of the add instruction’s result (but not addi’s) is stuck at zero: Sequences can include assertions – e.g., that the value addi x7, x4, 123 # Generate odd immediate written by the previous instruction was non-zero. These make addi x5, x3, 42 # Generate even immediate it possible to fail without a divergence. Unusually, sequences addi x6, x7, 0 # Move x7 to x6 with assertions do not require tandem verification to discover xori x1, x5, 745 # Irrelevant add x1, x5, x6 # Perform buggy add a failure, and we have used these to test the limits of The builtin list shrinking results in: implementation-defined behavior. addi x7, x4, 123 # Generate odd immediate VI. E VALUATION addi x6, x7, 0 # Move x7 to x6 add x1, x5, x6 # Perform buggy add A. A Coverage Study The middle instruction can also be eliminated if the final Architectural coverage is the first metric for basic verifica- add takes register x7 as an operand directly. To automate tion. We evaluated coverage of the RISC-V architecture using this functionality, we further augment shrinking to intelli- sailcov13, which measured how many branches of the RISC- gently propagate an instruction’s output register to future input V Sail model were explored during a run. We compared our operands. This allows another pass of list-shrinking to further TestRIG QCVEngine against the RISC-V test suite (riscv- reduce the counterexample: tests14) and the RISCV-DV generator. addi x7, x4, 123 # Generate odd immediate For our coverage study we conduct two runs of each add x1, x5, x7 # Perform buggy add testing framework (QCVEngine, riscv-tests, and RISCV-DV) We also add a library of simplifications to be used during for RV32IMC and RV64IMAFDCZicsr. For RV32IMC, we shrinking. These eliminate esoteric instructions from the trace take the Sail RISC-V model coverage of the I, M, and C that perform mundane functions and distract from the root extension instructions as well as the coverage of the general- cause of the failure. For example, memory operations often purpose registers. For RV64IMAFDCZicsr, we measure the trap; thus, we might attempt to simplify a memory operation coverage of I, M, A, F, D, C, and CSR instructions as well to an ecall, an instruction that only traps, to make the error as the coverage of the general-purpose and floating-point more obvious. registers. For riscv-tests, we measure the coverage of the Sail Any shrinking or simplification is safe to try for model- RISC-V model running the test binaries. For RISCV-DV, we based testing; any change that still diverges is kept. In rare produce TestRIG traces from the Spike simulator executing circumstances, the shrinking may reveal an alternative bug, the tests and replay them through RVFI-DII while measuring obscuring the original, but still producing a useful result. the coverage of the Sail RISC-V model. For QCVEngine, we configure it with the two architecture strings and let it B. Sequence Import/Export run 500 sequences of each generator. The RV32IMC results Instruction traces can be converted to (and from) a human- are similar across all three testing frameworks, indicating that readable format both for terminal reporting, and for reading QCVEngine can support a suitable alternative to unit testing and writing trace files – individually or in bulk from a direc- and torture testing, at least with respect to breadth of coverage. tory. This has enabled us to collect a library of regression tests The RV64IMAFDCZicsr results more variance, but all three to quickly check all previous counterexamples. Unlike hand- 13 written tests with assertions, these do not require maintenance, 14 https://github.com/rems-project/sail/tree/sail2/sailcov https://github.com/riscv-software-src/riscv-tests

IEEE DESIGN AND TEST, 2023    5


       5                                                               to test these cases. We have applied TestRIG not only to RISC-
     4.5                                                               V, but also to CHERI-RISC-V (our security extension, noted
       4                                                               below).
     3.5
       3                                                               A. CHERI Introduction
    2.52                                                                                         Capability Hardware Enhanced RISC Instructions (CHERI)
     1.5                                                               is a security extension of conventional Instruction Set Ar-
       1                                                               chitectures that adds capabilities                     – unforgeable and bounded
     0.5                                                               tokens. A capability is a fat pointer [8] containing the address
       0                                                               and metadata including permissions and bounds information.
    QCVEngine riscv-tests     riscv-dv                                 Furthermore, validity of capabilities is ensured by a hidden
                                                                       tag. A capability authorizes access to a region of memory, and
Fig. 2.     Counterexample size complexity                             no data or instruction access is possible without a valid ca-
                                                                       pability. Furthermore, all capability operations are monotonic
                                                                       and therefore cannot increase the privileges a capability grants.
remain comparable except in floating point register coverage,          As a result, CHERI enforces spatial safety, enables temporal
in which RISCV-DV excels and CSR instructions in which                 safety, and supports fine-grained software compartmentaliza-
QCVEngine excels. QCVEngine chooses from a subset of                   tion.
floating-point registers to increase the probability of operand
reuse, at a cost to overall coverage. As failures based on             B. Architectural Bug
register number are rare, we have made a trade-off between                                     When developing the compressed encoding of CHERI ca-
increasing the probability of finding violations that require          pabilities, we had a bug that unnecessarily cleared the tag
multiple permutations and increasing the overall FD register           of a pointer when setting an address that wrapped the address
coverage.                                                              space. This bug was found with this shrunken counterexample:
                                                                       cSetBoundsImmediate x3, x1, 1106 #                Set a short bound
B. Counterexample Complexity                                           cIncOffsetImmediate x2, x4, -197 #                Small negative integer
    cSetAddr x4, x3, x2 # Set the integer as the address

Counterexample complexity is another useful metric. Our cGetTag x1, x4 # XXX Untagged archive of QCVEngine traces comprises 3509 shrunken coun- While this case may have been covered in an extensive terexamples discovered during development of our CHERI unit-test suite composed at significant effort, our TestRIG processor extensions. Figure 2 shows the distribution of our generator required only a list of CHERI instructions to produce archive of counterexample lengths versus riscv-test and riscv- a counterexample far more compact than most hand-written dv trace lengths, which do not allow shrinking. The me- tests. dian value is 3 instructions, and the third quartile is only 5 instructions. The median for riscv-tests 561 instructions, C. Microarchitectural Bug which is more than 3 times the maximum counterexample found by QCVEngine, and the median riscv-dv sequence is We have also used TestRIG for discovering microarchitec- 15339. QCVEngine’s small counterexample size is facilitated tural vulnerabilities in transient execution. One such generator by Direction Instruction Injection and smart shrinking as de- produces a sequence of arbitrary instructions, followed by scribed previously. Single digit counterexample length greatly an assertion that no additional cache misses were counted, accelerated our discovery of failures and development of fixes which would indicate a transient violation of the capability compared to even a traditional unit test suite. system. The following shrunken counterexample demonstrated a vulnerability in cSetBoundsImmediate: VII. ILLUSTRATIVE CASES .noshrink ... # initialize counters TestRIG is useful for a broader range of verification than ... # bound x31 to 8 bytes instruction-level unit tests and improves productivity in all .shrink# Illegally increase the bound on a pointer cases. Architectural bugs, which are traditionally targeted cSetBoundsImmediate# x31, x4, 797 by hand-written test suites, are usually discovered quickly Load through this illegal capability lb.cap lb.cap x31, x31[0] with TestRIG. Microarchitectural mistakes, such as register ... # Delay for counter to propagate forwarding or pipeline-flush problems, are also discovered .noshrink csrrs x30, hpmcounter3 (0xc03), x0 # Read L1 cache miss quickly and deterministically in TestRIG, but are difficult to .assert rd_wdata == 0x0 "" anticipate and target in unit-test suites. Memory mistakes, such Because CHERI only allows bounds to be reduced, the as cache bugs or memory speculation failures, have also been cSetBoundsImmediate instruction is illegal and throws an ex- discovered efficiently with targeted generators, and are notori- ception due to attempting to enlarge the bounds. Nevertheless, ously hard to discover using static unit tests. Finally, TestRIG the capability that would have been produced is forwarded in has found unexpected interactions, where architectural features the pipeline during the flush and causes a cache fill that could interact in unforeseen ways, while unit-test suites are unlikely lead to side-channel attacks.

log 10 ( instruction count )

IEEE DESIGN AND TEST, 2023                                                                                                            6

              This sequence uses both .noshrink and .assert. The for-          Pipeline Performance: Similarly, a higher-level model of
mer is required to initialize the state of the counters so that the        pipeline scheduling and performance could be used to analyze
final assertion on the L1 cache miss counter is deterministic.         the timing of instruction traces committed in a pipeline to dis-
                                                                             cover performance bugs and track performance improvements.
D. Cache Bug                                                          The high level of control possible with direct instruction injec-
                                                                         tion should enable precise detection of performance anomalies.
            We received Flute as a working in-order RV64G design, and  Model-derived Engine:               TestRIG’s modular design en-

discovered that the data cache was direct-mapped and 4 KiB, ables a variety of engines to drive RVFI-DII compatible RISC- rather than 2-way associative and 8 KiB – as specified. An V implementations. With QCVEngine, the test maintenance experiment with parameters confirmed that the 2-way cache burden is greatly simplified, but not entirely eliminated. Past could not boot the operating system. This bug had not been experience suggests that even deep-state tests can be automat- found with the unit-test suite, so we used a generator that ically generated from a model specification, as with IBM’s constructs addresses within the TestRIG memory range (see Genesys [1]. Previous CHERI work used tests generated from Section III), as well as random loads and stores. This generator a formal model of our CHERI-MIPS ISA (written in the L3 [6] quickly discovered the bug with the following shortened specification language), compiling from L3 to HOL4, and then sequence, after 42 tests and 20 rounds of shrinking: using constraint solving to automatically generate instruction lui x1, 262148 sequences to reach a desired state without triggering undefined slli x1, x1, 1 behavior. This approach has also been applied to the CHERI lui x20, 262148 # Value used as data ori x3, x1, 1 # A page address ARM Morello instruction set starting from a Sail model [4], lui x2, 262148 [11]. Brian Campbell, a key contributor to this work, has also slli x2, x2, 1 # The same page address lhu x4, x3[1] # Load from address begun on a Sail-OCaml VEngine with direct access to the data sh x20, x2[2] # Store to an overlapping byte structures of our Sail RISC-V model. This eliminates indepen- lhu x2, x3[1] # Divergence on reloading dent encodings in the VEngine, and we expect this approach to The final sequence contains only three memory operations: be taken further to automate generation of templates that target two loads with a single store in between, all to overlapping specific deep states in the architectural model using constraint addresses. This counterexample was found less than 10 sec- solving. onds into the TestRIG run, and was fixed within the hour. The fix is reproduced below: IX. COUNTEREXAMPLE-DRIVEN DEVELOPMENT Bool hit = False; TestRIG’s model-based testing leads to counterexample- for (Integer way = 0; way < num_ways; way = way + 1) driven development, an advancement over test-driven devel- begin Bool hit_at_way = (tags[way].state != EMPTY) opment, a widely known technique of software engineering. hit = hit || && (tags[way].tag == pa_tag); Typical test-driven development for processor design requires hit_at_way; if (hit_at_way) // XXX This line was missing! a basic working design before architectural unit tests can be end way_hit = fromInteger (way); used. Counterexample-driven development using TestRIG can automatically provide reduced stimulus for the most basic While this bug was trivial to resolve with a TestRIG features and can carry development all the way to advanced counterexample, it had escaped the entire development process interactions. The CHERI extension to Ibex was a striking of the Flute processor. It was not found with the RISC-V unit- example. After extending Ibex with RVFI-DII support, a test suite and was overwhelmingly difficult to debug from a summer intern was able to independently add full CHERI full software trace. functionality to Ibex in a month, due to the tight cycle of reduced counterexamples provided by QCVEngine. VIII. FUTURE OF TESTRIG X. CONCLUSION Despite having an array of models, simulators, and imple- We have collated all the current TestRIG-compatible im- mentations supporting RVFI-DII, the generators of our initial plementations and verification engines into the open-source TestRIG verification engine, QCVEngine, are still rudimentary. TestRIG repository15. This repository includes documentation QCVEngine Generators: The Haskell infrastructure in that has been followed and improved multiple times by new QCVEngine supports rich and complex generators. How- users. TestRIG accelerates development at all stages, providing ever, the generators for virtual memory, cache testing, and a tighter debugging loop than we have experienced in any floating-point operations can be enriched with more intelligent other processor development paradigm. We expect TestRIG to directed-random templates for reaching deeper states. lead the way to a standardized testing framework for RISC- Memory Concurrency Testing: TestRIG should support V that leverages instrumentation of open implementations, to memory-model testing. RVFI-DII instruction streams injected greatly simplify verification. Such a framework improves upon with specified timestamps into multiple shared memory cores traditional instruction-set-level unit testing in every way, and should allow precise stimulation of concurrency behaviors. subsumes specialized random test generators into a cohesive These would require a more advanced verification engine that community of easy-to-use verification tools. tests RVFI traces not only for equivalence, but also against higher-level memory-model semantics – as in Axe [9]. 15 https://github.com/CTSRD-CHERI/TestRIG

IEEE DESIGN AND TEST, 2023 7

      ACKNOWLEDGEMENTS
            Approved for public release; distribution is unlimited. Spon-
                   sored by the Defense Advanced Research Projects Agency
                    (DARPA), under contract HR0011-18-C-0016 (“ECATS”) as
                     part of the DARPA SSITH research program. The views,
             opinions, and/or findings contained in this report are those
             of the authors and should not be interpreted as representing
          the official views or policies, either expressed or implied, of
              the Department of Defense or the U.S. Government. This work
          was partially supported by the Innovate UK, Industrial Strategy
               Challenge Fund (ISCF) under the Digital Security by Design
                  (DSbD) Programme, to deliver a DSbDtech enabled digital
             platform (grant 105694). For the purpose of open access, the
                  authors have applied a Creative Commons Attribution (CC

BY) license to the accepted version of this manuscript.

                           REFERENCES

[1] Allon Adir et al. Genesys-pro: Innovations in test program generation for functional processor verification. IEEE Design & Test of Computers, 21(2), 2004. [2] Merav Aharoni, Sigal Asaf, Laurent Fournier, Anotoly Koifman, and Raviv Nagel. FPgen–a test generation framework for datapath floating point verification. In Eighth IEEE International High-Level Design Validation and Test Workshop, pages 17–22, 2003. [3] Alasdair Armstrong et al. Isa semantics for armv8-a, risc-v, and cheri- mips. Proc. ACM Program. Lang., 3(POPL), jan 2019. [4] Thomas Bauereiss et al. Verified security for the morello capability- enhanced prototype arm architecture. Technical report, University of Cambridge, Computer Laboratory, 2021. [5] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. Acm sigplan notices, 46(4), 2011. [6] Anthony Fox. Directions in ISA specification. In ITP, 2012. [7] Shunning Jiang et al. PyH2: Using PyMTL3 to create productive and open-source hardware testing methodologies. IEEE Design & Test, 38(2), 2020. [8] Trevor Jim et al. Cyclone: A safe dialect of C. In ATEC 2002, Berkeley, CA, USA. USENIX. [9] Matthew Naylor et al. A consistency checker for memory subsystem traces. In FMCAD 2016. IEEE. [10] John Regehr, Yang Chen, Pascal Cuoq, Eric Eide, Chucky Ellison, and Xuejun Yang. Test-case reduction for C compiler bugs. In 33rd ACM SIGPLAN Conference om Programming Language Design and Implementation (PLDI 2012), 2012. [11] Peter Sewell. Engineering with full-scale formal architecture: Morello, cheri, armv8-a, and risc-v. In FMCAD 2021. IEEE. [12] Eshan Singh, David Lin, Clark Barrett, and Subhasish Mitra. Logic bug detection and localization using symbolic quick error detection. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018. [13] Shajid Thiruvathodi and Deepak Yeggina. A random instruction se- quence generator for arm based systems. In 2014 15th International Microprocessor Test and Verification Workshop. IEEE. [14] Dai Duong Tran, Thi Giang Truong, Truong Giang Do, and The Duc Do. Risc-v random test generator. In 2021 15th International Conference on Advanced Computing and Applications (ACOMP), pages 150–155, November 2021. [15] Robert N. M. Watson et al. Capability Hardware Enhanced RISC Instructions: CHERI Instruction-Set Architecture (Version 8). Technical Report UCAM-CL-TR-951, University of Cambridge, Computer Labo- ratory, October 2020. [16] Jonathan Woodruff et al. The CHERI capability model: Revisiting RISC in an age of risk. In ISCA 2014, Minneapolis, MN, USA. IEEE.