URL: https://yuex1994.github.io/files/aspdac22.pdf

Generalizing Tandem Simulation: Connecting High-level and RTL Simulation Models

Yue Xing, Aarti Gupta, and Sharad Malik
Princeton University, Princeton, USA

yuex@princeton.edu, aartig@cs.princeton.edu, sharad@princeton.edu

Abstract: Simulation-based testing has been the workhorse of hardware implementation validation. For processors, tandem simulation improves test and debug efficiency by cross-level simulating the Instruction Set Architecture (ISA) and RTL models, and comparing architectural-state variables at the end of each instruction rather than at the end of the whole trace. Further, the simulation may start with the ISA model and switch to the RTL model at some point by transferring the values of the architectural variables, thus speeding up the “warm-up” phase. However, thus far tandem simulation has been limited to processor designs as other SoC components lack high-level ISA models and thus the notion of instructions. Even for processors, significant manual effort is required in connecting the two models and constructing the necessary controller to synchronize/check/swap between them. This paper leverages the recently proposed Instruction-level Abstractions (ILAs) for generalizing tandem simulation to accelerators. Further, we use the refinement-map that is part of the ILA verification methodology to automate the connection between the ILA and the RTL simulation models for both processors and accelerators. We provide seven case studies to demonstrate the practical applicability of our methodology.

This work was supported by the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA. This research is also funded in part by NSF award number 1628926, XPS: FULL: Hardware Software Abstractions: Addressing Specification and Verification Gaps in Accelerator-Oriented Parallelism, and the DARPA POSH Program Project: Upscale: Scaling up formal tools for POSH Open Source Hardware.

I. INTRODUCTION

Modern System-on-Chips (SoCs) comprise CPUs/GPUs and an increasing number of specialized hardware components broadly referred to as accelerators. SoC design flow starts with high-level design models [1,2] for the various components which are then refined into low-level implementations, typically at the Register-Transfer Level (RTL). To ensure the correctness of a low-level implementation, its equivalence against the high-level model is checked using formal verification or simulation-based testing. While formal verification provides guarantees of correctness, the state explosion problem hinders its use on large designs in practice. Thus, simulation-based testing is more generally applied to ensure the conformance of a low-level implementation with a high-level specification.

In general, conformance testing refers to applying the same sequence of test stimulus to the execution models (EM) of the high-level specification and the low-level implementation and determining compliance by comparing their traces at the end of simulation time [3]. This can be inefficient, since the first mismatch generally happens much earlier than the end of simulation and is often enough for debugging. To address this inefficiency, an alternative approach – tandem simulation (or RTL co-simulation with ISA simulator) – has been proposed for processor designs [4, 5]. It combines the instruction-level execution model (ILEM) based on the processor ISA, and the RTL-based execution model (RTEM). The ILEM and RTEM are combined into a cross-level execution model (CLEM), and the simulation is executed instruction-by-instruction. At the end of each instruction, the corresponding architectural variables (AV) are checked (AV-Check). This check ensures that the instruction-level architectural variables (ILAVs) and the corresponding RTL architectural variables (RTAVs) are equivalent. Any deviation signifies a potential bug, which can be analyzed with nearby instructions for debugging. The AV-Check can also be invoked at specific intervals or checkpoints, to further reduce the performance overhead of comparison. In addition, tandem simulation allows swapping in values from ILAVs to RTAVs (AV-Swap), which can be leveraged to jump-start the RTEM in the middle of an ILEM simulation. This can significantly reduce simulation time by leaving the “warm-up” part of the test to only the ILEM.

While tandem simulation has been used for processors, other SoC components, especially accelerators, are not thought of as having instruction-level models, which limits the use of tandem simulation to processor designs. Further, tandem simulation thus far requires customization in that human input is needed to establish the connection between the ILEM and RTEM to apply the AV-Check and AV-Swap. This lack of automation also limits practical application.

This paper addresses the above two gaps by leveraging the recently proposed Instruction-Level Abstraction (ILA) to extend tandem simulation to accelerators, and by using a refinement map that is specified by a user as part of the formal verification methodology [6–8] to enable automation. The ILA has been recently proposed for formally modeling accelerators [9]. Similar to the ISA for processors, an ILA models an accelerator in terms of a set of architectural variables and “instructions” that update the values of these variables. The ILA model can be used to automatically generate an ILEM, which can then be used to perform tandem simulation with an RTL implementation. Further, we automate the tandem simulation flow by using the refinement map, which specifies: (i) the correspondence between ILAV and RTAV, i.e., what to compare for verification, and (ii) the instruction start and finish conditions, i.e., when to compare for verification. Thus, it provides the required information for monitoring the RTL implementation and checking its compliance against the ILA model.

Since the ILA generalizes the ISA, the proposed ILA-based tandem simulation methodology applies uniformly to both accelerators and processors. However, there are a couple of challenges in extending and automating tandem simulation:

• Challenge 1 – Testbenches: The ILEM and RTEM require testbenches in different forms – instruction-by-instruction vs. cycle-by-cycle. Tailoring testbenches for different levels takes extra effort and needs to be automated.
• Challenge 2 – AV-Swapping and micro-architectural variables: The implementation has more variables (micro-architectural variables) than the ILA model. Thus, when swapping AVs from the ILEM to the RTEM, these extra (micro-architectural) variables need to be set properly.

We propose solutions to these challenges based on use of ILAs and refinement maps. We demonstrate the strength of our methodology on seven case studies, covering four hardware accelerator designs (AES-block [10], AES-round [10], GaussianBlur [11], FlexNLP [12]) and three RISC-V processor designs (Pico [13], Piccolo [14], Rocket Core [15]). In particular, GaussianBlur, FlexNLP, Piccolo, and Rocket Core have been integrated into various SoC designs in the broad community, demonstrating that our method is applicable to practical designs. We report the instruction-by-instruction checking time and AV-Swapping time, which are both negligible relative to the simulation time of the ILEM and RTEM. We also provide results for cases when RTL designs are buggy and also for jump-starting, demonstrating the advantage of our methodology in improving debugging and simulation speed.

To summarize, our paper makes the following contributions:

• We extend the tandem simulation methodology to accelerators using ILAs as high-level reference models.
• We describe a fully automated flow to apply tandem simulation to processors and accelerators, by leveraging a refinement map (often available in formal verification).
• We demonstrate the effectiveness of this methodology through seven case studies, including several practical designs at scale.

II. BACKGROUND

A. Tandem Simulation for Processors

We focus on the following three scenarios of tandem simulation and demonstrate how they can be automatically applied.

Scenario 1: Instruction-by-instruction checking. Traditional conformance testing applies checking (e.g. AV-Check) at the end of the trace. In contrast, in this scenario AV-Check is applied at the end of each instruction, as shown in Figure 1.

Scenario 2: Checking at checkpoints. The AV-Check is done at predefined points [16] (e.g. Figure 1 applies AV-Check for every 1K instructions). This potentially reduces the performance overhead in doing AV-Check at each instruction.

Scenario 3: Jump-starting implementation simulation. This is useful when a long test stimulus contains both important and unimportant sections (e.g., a warm-up phase in Figure 1). Simulating the unimportant sections only on the ILEM improves the overall simulation speed.

[Fig. 1. Three Scenarios for Tandem Simulation. S1: Instruction-by-instruction checking (AV-Check at each instruction finish in the RTEM); S2: Checking at checkpoints (AV-Check after instr 1…1000); S3: Jump-starting implementation simulation (warm-up instrs on the ILEM, AV-Swap, then important instrs on the RTEM).]

B. Instruction Level Abstraction (ILA)

Recently, the ILA was introduced as a uniform instruction-level formal model for both processors and accelerators [9].¹ Similar to the ISA, the ILA for accelerators specifies a set of instructions and AVs. (The ISA can be viewed as a special case of an ILA.) Each instruction is a command at the interface of the accelerator. For instance, accelerators that are accessed through MMIO (Memory-Mapped Input/Output) are controlled by loads/stores issued by the SW/FW (firmware) on the host processor. The ILA model considers these load/stores appearing at the interface as “instructions” for the accelerator.

Formally, an ILA model [9] is defined as a five-element tuple 〈S, W, S0, D, N〉, where S and W are the sets of state variables and inputs, and S0 denotes initial values. The set of instructions I is defined by sets D and N, which represent the decode functions (the triggered condition) and the state update functions, respectively. An example fragment of an ILA model of a cryptographic (AES) accelerator is shown in Figure 2a (this figure is similar to an ILA example figure from [17], the AES example is from [9]). The AVs include the encryption key, the text length etc. The inputs are the MMIO interface signals. Figure 2a shows the list of instructions and the definition of the instruction SET_KEY. As the state update function is a state transition function for the architectural variables, this lends itself to direct translation to an ILEM for co-simulation.

The ILA allows for hierarchy to model complex instructions using child-instructions defined in a child-ILA (like micro-instructions for complex processor instructions), e.g., the START_ENCRYPT instruction in AES is described using child-instructions for loading, encrypting, and storing the data.

Fig. 2. AES Example

(a) ILA model for AES

  W    Input                     addr_in, data_in, cmd
  S    Architectural Variables   key, length, addr, status, data_mem, output_data
  S0   All AVs are initialized as 0
  Instructions (I)
  I0   set_key        Set encryption key
         decode function:        D0 = (addr_in == 0xff10) && (cmd == 2)
         state update function:  N0: key = data_in
  I1   set_length     Set length of text to encrypt
  I2   set_address    Set address of text to encrypt
  I3   start encrypt  child instructions:
         • I0ᶜ Load data block
         • I1ᶜ Encrypt data
         • I2ᶜ Store data block
  I4   get_status     Poll status for output

(b) AES Refinement Map

  AV map
    AES-ILA        AES-RTL
    key            top.aes_key.reg_out
    length         top.aes_length.reg_out
    addr           top.aes_addr.reg_out
    status         top.status_reg
    data_mem       top.xram.mem
                     wr_en   top.xram_wr
                     addr    top.xram_addr
                     data    top.xram_data_in
    output_data    top.out_data_reg

  instruction map
    Instruction      start condition   finish condition
    set key          decode            1 cycle
    set length       decode            1 cycle
    set addr         decode            1 cycle
    start encrypt    decode            status == 0
    get status       decode            1 cycle

  interface map
    AES-ILA   AES-RTL
    cmd       wr

¹ This section provides an overview of the ILA modeling and verification methodology and uses examples and figures for exposition here that are similar to those in previous ILA papers (e.g. [9, 17]) with appropriate attribution.

C. Refinement Map

A key issue limiting automation of tandem simulation even for processors is the lack of a general approach that connects
the ILEM/RTEM and checks the corresponding AVs at the end of instructions. We address this by leveraging the notion of a refinement map, used in formal verification for processors [6,7] and accelerators [8]. As sketched in Figure 3 (similar to a figure in [17]), the ILA refinement map (template shown as black text; refinement map info shown as red text) defines two main fields:

(i) Architectural Variable (AV) map: defines the mapping from the ILAVs to the corresponding RTAVs. This provides the information about what to check, e.g., RTAV1 in Figure 3 corresponds to ILAV1, and thus they are checked for equivalence.

(ii) Instruction map: defines the time or condition when each instruction starts (e.g., decode function is true) and finishes (e.g., after n cycles, or after a commit variable is true) in the RTL implementation. This indicates the correspondence points at which the RTAVs should be checked against the ILAVs.

The above two fields specify the key information needed for verification – what to check and when to check. The refinement map also uses the following optional fields as needed:

(iii) Interface map: provides the correspondence between the ILA inputs and RTL inputs (when not identical).

(iv) Checkpoint map and (v) “cold start” map: are not part of the original ILA refinement map [8] and have been added as part of this work to support tandem simulation. Their use will be discussed in §III.

Fig. 3. Refinement Map (sketch)

  (i) AV map
    ILA       RTL
    ILAV1     RTAV1
    ILAV2     RTAV2
              (memory AV relates to several RTAVs for memory updates)
                wr_en   RTL variable x
                addr    RTL variable y
                data    RTL variable z
    …         …

  (ii) instruction map
    Instruction   start condition        finish condition
    instr1        decode                 n cycles
    instr2        decode (&& bool-expr)  bool-expr, e.g. (commit == 1)

  (iii) interface map
    ILA         RTL
    ILA-input   RTL-input

  (iv) checkpoint map
    period      p (Every p instructions)
    sequence    [t1, t2, t3, …]
    condition   e.g. cmd == 0x1

  (v) “cold start” map
    pre-swap cycle/sequence   e.g. (1) reset (m1 cycles)
                              e.g. (2) reset = [1, 0, …], global start = [0, 1, 0…]
    swap cycle                e.g. m2 cycles

Figure 2b (similar to a figure in [17]) shows an example refinement map used for AES-ILA and AES-RTL (partially derived from [8]). It shows that, for instance, top.aes_key.reg_out from AES-Block implementation corresponds to AES-ILA’s key. The set key row shows that the instruction starts when the corresponding decode function is true and it finishes after executing one RTL cycle.

III. GENERALIZED TANDEM SIMULATION

In this section, we first introduce ILEM generation, followed by an overview of the proposed methodology for automating tandem simulation. Then we show how the specific challenges discussed in §I are addressed.

A. ILAtor: Automatic Generation of an ILEM

The ILA model is written in a domain-specific language embedded in C++, supported in the ILAng platform [8]. We have developed a tool named ILAtor to automatically generate an ILEM from an ILA model. (The name ILAtor is based on the corresponding tool Verilator [18], which generates an RTEM from a Verilog RTL model.) The ILEM of an ILA model is similar to that of an executable ISA-level processor model. As shown in the upper part of Figure 4a, the ILEM does the following: when a new input instruction is presented, it executes the instruction whose decode function evaluates to true, i.e., its state update function will be applied. The lower part shows the execution model of child-ILAs, which is defined similar to that for the ILA.

Fig. 4. ILEM generated from ILA

  (a) ILA execution model
    ILA:        inputs (W); architectural variables (S);
                decode func. (D): D0, D1, D...;
                state updates: N0(S, W), N1(S, W), N...(S, W)
    child-ILA:  child variables (Sᶜ);
                child decode func. (Dᶜ): D0c, D1c, D...c;
                state updates: N0c(S, Sᶜ), N1c(S, Sᶜ), N...c(S, Sᶜ)

  (b) Kernel template
    kernel() {
      // Below are for parent-ILA
      if (D0)
        instr0.update();
      if (D1)
        instr1.update();
      ...
      // Below are for child-ILAs
      do {
        if (D0c)
          child_i0.update();
        ...
      } until (no executable child instr)
    }

The generated ILEM execution kernel is a single thread program representing the ILA execution semantics. ILAtor synthesizes the ILEM in both C and SystemC (as needed) so that it can be easily integrated with RTEM for tandem simulation. The inputs (W) and AVs (S) of an ILA directly correspond to the I/O and member variables of ILEM. For instructions, ILAtor uses the program template shown in Figure 4b to automatically generate the execution kernel. It decodes and executes instructions as defined in its ILA.

B. Methodology Overview

[Fig. 5. Tandem Simulation Flow. Tool's input: ILA Model, Refinement Map, RTL Impl. Tools: ILAtor, Tandem Generator (tools developed in this paper), Existing RTL Model Compiler* (* Verilator for Verilog; g++ for SystemC). Auto-generated executables: ILEM testbench, instruction-level executable model (ILEM), AV-Comparator (Scenarios 1-2), Instr. Monitor (Scenarios 1-3), AV-Converter (Scenario 3), RTL executable model (RTEM), RTEM testbench; together they form the Cross-Level Executable model (CLEM).]

Figure 5 shows the flow of tandem simulation. The ILEM is generated by the ILAtor, and the RTEM is generated by an RTL simulator-generator (e.g. Verilator [18]). Our tandem tool creates three additional blocks – an instruction monitor, an AV-Comparator, and an AV-Converter – from the refinement map. The instruction monitor uses the instruction map to detect instruction boundaries (if any instruction starts or finishes) in the RTEM. Depending on the scenario, it will invoke the AV-Comparator (Scenario 1-2, Figure 1) for checking AVs, or the AV-Converter (Scenario 3, Figure 1) for swapping AVs and jump-start. Both of these are based on the AV map, which provides the correspondence between ILAVs and RTAVs.

Our methodology augments the refinement map with the checkpoint map (as in Figure 3) to support the following three types of checkpoints for Scenario 2. (1) Checkpoint period (P): invokes checking for every P instructions, (2) Checkpoint sequence ([t1, t2, ...]): invokes checking at the tn-th instruction, and (3) Checkpoint condition (C): invokes checking when condition C holds. According to the refinement map, the tandem generator will augment the instruction monitor block to appropriately invoke the AV-Comparator.

A testbench (either for ILEM or RTEM, as in Figure 5) is needed to drive the overall tandem simulation and we assume that such a testbench is given.

C. Challenge 1: Single ILEM Testbench

Unlike processors which fetch instructions from memory, accelerators receive commands/instructions at their interface. Thus, for accelerators the ILEM and RTEM require testbenches in different forms – for ILEM it is usually a sequence of instruction inputs, while for RTEM it is the cycle-by-cycle input
stimulus. The ILEM executes one instruction in a step, while RTEM typically executes an instruction in multiple cycles – during these cycles, the RTEM may block a following instruction if it is not ready to process it.

This tailoring of testbenches for different levels requires additional effort and thus needs automation. In the case when only the ILEM testbench is available, we automate this similar to how processor instructions are simulated from instruction memory. We add an auxiliary “program counter” to the accelerator ILEM and RTEM for accessing an external memory that stores the test instruction sequence. For ILEM, the program counter simply increments by one in each step. For RTEM, the program counter is guarded by the “start condition” (from the refinement map) of the current instruction it points to, i.e., the current instruction will be executed only when the start condition is true. Thus, both ILEM and RTEM run the same test instructions from the ILEM testbench.

In the other direction, when only the RTEM testbench is available, the tandem simulation of Scenarios 1 and 2 can be directly automated since the RTEM is monitored for the instruction it executes. For Scenario 3, an ILEM testbench is still needed, since the ILEM has to execute independently (e.g., prior to the jump-start) without monitoring the RTEM.

D. Challenge 2: Jump-Starting RTEM Simulation

Jump-starting requires the conversion of AVs: from RTL to ILA, and from ILA to RTL. The former is straightforward, since the AV map contains all the information about restoring ILAVs from RTL variables. The other direction is more challenging because the ILA is an abstracted model, and there are RTL micro-architectural variables that are not in the ILA, such as internal counters and pipeline registers. Their values need to be handled carefully.

We address this similar to processor tandem simulation [4,5] by applying a “cold start” to set the RTL micro-architectural variables to their reset values. We then use the AV map to set the RTAVs with the corresponding ILAVs. We automate this with the additional “cold start” map field of the refinement map. In the “cold start” map (Figure 3), the pre-swap cycle/sequence section specifies the input sequence for RTEM reset; the swap cycle describes the holding time for swapped RTAVs to account for designs where RTAVs take multiple cycles to propagate to micro-architectural variables. As shown in example (1) in the pre-swap cycle/sequence section, one can assert reset for a couple of cycles. We also support specifying a general sequence to RTEM input pins, as shown in example (2) for reset and global start.

IV. CASE STUDIES

We applied the proposed tandem simulation methodology to seven case studies, including four accelerator and three processor implementations. We evaluated the following three aspects for all seven designs – (1) the performance (runtime) of each simulated component (ILEM, RTEM, etc.), (2) the simulation speedup with jump-starting, and (3) the improvement of bug detection with instruction-by-instruction AV-Check.

We conducted the experiments on a 3.4 GHz 24-core Intel Xeon server with 62 GB of RAM, running Ubuntu 16.04. We used Verilator v4.1 [18] and SystemC library 2.3.3 [19] for Verilog and SystemC simulation. The open-source ILAng [8] was used for ILA modeling, and we developed ILAtor and the tandem generator for applying the automatic tandem simulation.² We also used the base refinement map (described in JSON format) from ILAng, and added extra fields ((iv) and (v), colored blue in Figure 3) to support tandem simulation.

² Source code is available at https://github.com/yuex1994/ASPDAC-tandem

The statistics of the ILA model, RTL implementation, refinement map and tandem simulation times are reported in Table I. As the ILA is a higher-level model than RTL, the ILA size (in lines of code, LoC) is smaller than RTL size in all designs. We view the LoC as a rough measure of design complexity or designer effort. Note also that the refinement map size is much smaller than the RTL size, indicating that the human effort in developing a refinement map is much smaller. GaussianBlur, Piccolo and Rocket Core were originally generated from High-level Synthesis (HLS)/Hardware Generator – Halide [11], Bluespec [20] and Chisel [21], respectively. We also report the HLS/Generator code size for them.

A. Overview of Case Studies

1. Advanced Encryption Standard (AES): We consider two accelerator implementations for AES [10], implemented in Verilog and C respectively, which implement a block-based and a round-based algorithm, respectively. We use the same AES ILA model [9] (introduced in §II) for both implementations with individual refinement maps. This case demonstrates that different refinement maps enable different RTEMs to be tandem simulated with the same ILEM.

2. GaussianBlur: This case study is of a stencil image processing accelerator for GaussianBlur (GB) [11], synthesized from Halide description using HLS. It demonstrates that our methodology can handle HLS-synthesized implementations.

3. FlexNLP: FlexNLP [12] is designed for machine learning applications with RNN models with attention mechanisms. The design is implemented in 18k lines (excluding the Mentor library code) of synthesizable SystemC. It demonstrates our methodology’s strength in handling practical scale designs.

4. Pico, Piccolo and Rocket Core: We have applied our methodology to three RISC-V processor implementations – Pico, a multi-cycle design, and Piccolo and Rocket Core, both pipelined designs. Similar to AES, this case also uses a single RISC-V ILA [9], and uses three different refinement maps for the three implementations.

B. Runtime Evaluation and Simulation Speedups

We applied the three tandem simulation scenarios in all seven case studies and evaluated the simulation speed. We have further broken down the simulation time for each tandem simulation component as presented in Table I. For designs with an available testbench, such as FlexNLP, we use the given testbench to drive the simulation. Other designs are driven by randomly generated test input sequences. Most designs successfully pass the tests except for AES-round, where we identified a bug. The bug happens in an inclusive loop boundary which should have been exclusive and causes encrypting an extra data block in some tests. Our method detects the bug right after the “start encryption” instruction which causes the state deviation, in about 0.5s (after running about a third of the test sequence). For this design, we used the bug-fixed version in the other experiments measuring simulation time.

TABLE I
Case Studies – Statistics of ILA Models, RTL Designs, Refinement Maps and Simulation Time

              |              Design Statistics                    |                     Simulation Time Breakdown
Design        | ILA Size  # of Arch.     RTL Size       Ref-map    | RTEM        ILEM        S1          S2-type1    S2-type2    S2-type3    S3
              | (LoC)     Variable bits  (LoC)          Size (LoC) | (μs/instr)  (μs/instr)  (μs/instr)  (μs/instr)  (μs/instr)  (μs/instr)  (μs)
AES (block)   | 236       298            1078           73         | 387         7.3         0.25        0.033       0.033       0.033       74.1
AES (round)   | 236       298            321            62         | 7.49        7.1         0.64        0.2         0.22        0.22        80.9
GaussianBlur  | 285       621            11375 (1325†)  147        | 3.3         1.2         0.19        0.066       0.063       0.068       14
FlexNLP       | 5807      5008           18338          459        | 2999        262         17.6        0.083       0.071       0.21        16694
Pico          | 584       1056           2014           208        | 0.97        0.29        0.084       0.024       0.019       0.02        0.4
Piccolo       | 584       1056           6063 (4122†)   223        | 4.5         0.3         0.26        0.022       0.019       0.019       789
Rocket Core   | 584       1056           13468 (3856†)  213        | 101         0.29        0.85        0.029       0.025       0.026       652
† Lines of Code for HLS/Hardware Generator.

The simulation time for RTEM and ILEM is reported in the first two columns of Table I. It is averaged over the number of instructions for each test, thus making it a per-instruction simulation time. As a higher-level model, the ILEM generally runs much faster than the RTEM, with the speedup ranging from 3X (e.g., GB) to 300X (e.g., Rocket). One exception is for AES-round – its C implementation is already very abstract, thus leaving little room for ILEM speedup.

The third column (S1) demonstrates the runtime overhead of Scenario 1, which includes the per-instruction time for monitoring the RTEM for the instruction boundary and checking the RTAVs against ILAVs after every instruction. For practical designs (e.g., FlexNLP, Rocket Core), this time is within 1% of the RTEM simulation time.

The fourth to sixth columns (S2-type1, S2-type2, S2-type3) show the per-instruction time for monitoring each type of checkpoint, respectively. This can be regarded as the simulation overhead for Scenario 2. As shown in Table I, these take much less time (less than 10% of that for S1), demonstrating speedups in comparison at check-points only, rather than after every instruction.

[Fig. 6. Simulation Speedup (logarithmic y-axis) for jump-starting x% of the input instruction sequence on ILEM. x-axis: jump-started input sequence length as a percentage (the percentage of instructions simulated by ILEM), 0% to 100%; y-axis: speedup, 0.5 to 32; series: FlexNLP, AES (block), AES (round), GB, Pico, Piccolo, Rocket, Max Speedup.]
               The last column (S3) lists the AV-Swapping time from ILEM              0.3                                       0.21
                                                                                      0.2                                                  0.18
to RTEM, which is the runtime overhead in Scenario 3.                 It              0.1    0.050.040.10              0.11                        0.02  0.11 0.09   0.100.09 0.040.04
presents the one-time overhead of applying AV-Swapping, not                            0                                                                               0.03
per-instruction. It varies significantly across the designs and is                           AES-block AES-round       GB    FlexNLP   Pico                        Piccolo   Rocket
                                                                                                                            Case Study
determined roughly by the number of AVs and the “cold start”               Fig. 7.    Bug detection time (normalized to the simulation-to-the-end time) for

length. Among all case studies, the swapping time is within various design cases the time required for executing several to several hundred in- is very close to the upper bound. They all achieve more than structions on the RTEM. Thus, this overhead is negligible in 10X practical tests that have millions of instructions, as long as AV- speedup at the 95% fraction point. However, due to the Swapping is not invoked very frequently. C implementation of the AES-round being very abstract, jump- C. Simulation Speedup with Jump-Starting starting it achieves no speedup (speedup is less than one). D. Improvement in Bug Detection We conducted experiments to evaluate the effectiveness of We also studied the improvement in bug detection time by jump-starting in long input sequences. We divided each test measuring the elapsed time to detect bugs using tandem sim- into two parts – a “warm-up” phase and an important phase. ulation. As mentioned in §IV-B, most available designs are We considered different fractions of the test inputs in the warm- bug-free. So, we set up the experiment by inserting a bug in up and important phases. For example, we considered the first each design. Specifically, we consider three types of bugs: a 5%, 15%, ... as warm-up, and the rest as important phase. Fig- “condition bug” changes a value/condition in a conditional (an ure 6 presents the simulation speedup for different fractions, in if-then-else or case) statement; a “data bug” changes a value comparison to no jump-start. in a computation; an “expression bug” changes a logic opera- Note that many designs have a significant speedup – more tor (e.g., from AND/OR to XOR). We inserted a bug of each than 2X with 80% jump-started instructions – and the speedup type separately, leading to three buggy variations per design. increases as a higher fraction of instructions are jump-started. 
Further, there are tens/hundreds of candidate locations for bug The dashed line in the figure plots a theoretical maximum insertion – we randomly picked one for our experiments. For speedup for a given fraction, which is computed by assuming AES-round, the bug identified in §IV-B belongs to the “con- the ILEM simulation takes no time (and RTEM simulates dif- ditional bug” category and is also used here. The test inputs ferent test inputs with a constant speed). For example, when are randomly generated (as in §IV-B), and are long enough to 95% of the test inputs are jump-started, the simulation time is detect the bugs. at least simulating the remaining 5% on RTEM. Therefore, the We evaluated two debug strategies: 1) traditional confor- upper bound of the speedup is 0.05∗TTRT EM = 20. As seen mance testing – which runs the test to the end and then com- RT EM pares the ILEM and RTEM results, and 2) tandem simulation in Figure 6, the speedup of AES-block, FlexNLP, and Rocket
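The difference between the two strategies can be illustrated with a small Python sketch. It is only a toy under stated assumptions, not the setup used in our experiments: the single 8-bit variable `acc`, the functions `ilem_step` and `buggy_rtem_step`, the sample `PROGRAM`, and the use of instruction counts as a proxy for elapsed time are all hypothetical. The injected fault imitates an “expression bug” by replacing OR with XOR.

```python
from typing import Callable, List, Optional

# Hypothetical toy models with one 8-bit architectural variable, "acc".
# Each "instruction" is just an integer operand. None of these names
# come from the paper; they exist only for illustration.
Step = Callable[[int, int], int]


def ilem_step(acc: int, operand: int) -> int:
    """Reference (instruction-level) behavior: OR the operand into acc."""
    return (acc | operand) & 0xFF


def buggy_rtem_step(acc: int, operand: int) -> int:
    """RTL stand-in with an injected expression bug: OR replaced by XOR."""
    return (acc ^ operand) & 0xFF


def conformance_detection(program: List[int],
                          ref: Step = ilem_step,
                          dut: Step = buggy_rtem_step) -> Optional[int]:
    """Run both models to the end, then compare once.

    Returns the number of instructions executed before the mismatch is
    seen (always the full program length), or None if the final values agree.
    """
    ref_acc = dut_acc = 0
    for operand in program:
        ref_acc = ref(ref_acc, operand)
        dut_acc = dut(dut_acc, operand)
    return len(program) if ref_acc != dut_acc else None


def tandem_detection(program: List[int],
                     ref: Step = ilem_step,
                     dut: Step = buggy_rtem_step) -> Optional[int]:
    """Step both models together and compare after every instruction.

    The per-instruction comparison plays the role of the AV-Check.
    Returns the number of instructions executed at the first mismatch,
    or None if the models never diverge.
    """
    ref_acc = dut_acc = 0
    for count, operand in enumerate(program, start=1):
        ref_acc = ref(ref_acc, operand)
        dut_acc = dut(dut_acc, operand)
        if ref_acc != dut_acc:
            return count
    return None


def normalized_detection(program: List[int]) -> Optional[float]:
    """Tandem detection point divided by the run-to-the-end detection point."""
    end = conformance_detection(program)
    early = tandem_detection(program)
    if end is None or early is None:
        return None
    return early / end


PROGRAM = [1, 2, 4, 1, 8, 16, 32, 64, 128, 2]

if __name__ == "__main__":
    print(conformance_detection(PROGRAM), tandem_detection(PROGRAM),
          normalized_detection(PROGRAM))
```

In this toy run the two models first diverge at the fourth of ten instructions, so the per-instruction check stops at a normalized detection point of 0.4, while the run-to-the-end comparison has to execute all ten instructions.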


Tandem simulation runs the test instruction-by-instruction and applies the AV-Check at the end of each instruction. We applied these two strategies to each bug variant of the designs and measured their bug detection time. We normalized the bug detection time of the second strategy by that of the first strategy and plotted it in Figure 7. (The absolute simulation time for the first strategy ranges from 1 to 15 seconds for different design variants.) The normalized numbers demonstrate that tandem simulation often detects the bug well before conformance testing would finish the test. In many cases, it finds the bugs in less than 10% of the full test time, and in most cases in less than 40%. An outlier is a data bug in FlexNLP, where the buggy data is used only in a very late stage of the test program.

E. Summary

In summary, our experimental results demonstrate that:
• Tandem simulation for all three proposed scenarios can be effectively automated using the ILA model and refinement map from the ILA verification methodology for processors and accelerators.
• The overhead of the extra components introduced by our automatic tandem simulation methodology (i.e., AV-Comparator, Instruction Monitor, and AV-Converter) is negligible compared to RTEM simulation time.
• There is a significant simulation speedup by jump-starting unimportant/“warm-up” phases.
• The instruction-by-instruction checking detects bugs earlier than run-to-the-end methods.

V. RELATED WORK

The idea of tandem simulation was proposed in BlueSpec’s toolchain [4] for various RISC-V processor implementations [14]. Similarly, the BlackParrot project integrated the Dromajo RISC-V ISA co-simulator [22] with their RTL simulation, effectively providing tandem simulation capability [5]. Earlier designs, such as the IBM Power processors, were also validated using instruction-by-instruction checking [23]. These methods are limited to processor designs, and are manually done with no systematic methodology. In contrast, our proposed method leverages the ILA model for extending tandem simulation to accelerators and leverages the refinement map for automation.

Past work has also explored the idea of co-simulating cross-level models, especially between the transaction-level model (TLM) and RTL [24, 25]. These techniques utilize a transactor (either manually or automatically generated) to refine the existing TLM test inputs or functional assertions into RTL simulation, where the RTL can be verified by checking the test output or triggered assertions. Unlike these works, our tandem simulation approach is based on the RTL model being a refinement of the ILA model, and thus focuses on the architectural state variables and checks them at the granularity of instructions. This provides the key benefit of bug detection and early termination, while the TLM/RTL co-simulation generally requires finishing the whole test before checking. It also enables jump-starting through AV-Swapping, which is a harder task for TLM-to-RTL cross-level simulation.

VI. CONCLUSIONS

In this paper, we generalize the notion of instruction-level and RTL tandem simulation to include accelerators in addition to processors. We propose an automatic flow for tandem simulation by leveraging the refinement map, which is used by designers for formal verification. We discussed the challenges in this generalization and proposed a methodology that uses the ILA model for automatically generating an instruction-level execution model, and adapts its associated refinement map for automating the comparison checks and jump-starting in tandem simulation. We applied this methodology to seven design case studies, including several processor and accelerator designs. The evaluation results demonstrate the effectiveness of the proposed tandem simulation methodology in improving simulation speed and enabling earlier bug detection.

REFERENCES

[1] T. Grötker, S. Liao, G. Martin, and S. Swan, System Design with SystemC™. Springer Science & Business Media, 2007.
[2] L. Cai and D. Gajski, “Transaction Level Modeling: An Overview,” in 1st CODES+ISSS, 2003, pp. 19–24.
[3] P. Herber, M. Pockrandt, and S. Glesner, “Automated conformance evaluation of SystemC designs using timed automata,” in 15th IEEE European Test Symposium, 2010, pp. 188–193.
[4] R. Nikhil and D. Rad, “RISC-V at Bluespec,” 2015, [Online]. Available: https://riscv.org/wp-content/uploads/2015/01/riscv-bluespec-workshop-jan2015.pdf, accessed on: 2020-11, (slides from 1st RISC-V Workshop).
[5] D. Petrisko, F. Gilani, M. Wyse, T. Jung, S. Davidson, P. Gao et al., “BlackParrot: An Agile Open Source RISC-V Multicore for Accelerator SoCs,” IEEE Micro, vol. 40, no. 4, pp. 93–102, 2020.
[6] J. R. Burch and D. L. Dill, “Automatic verification of pipelined microprocessor control,” in CAV, 1994, pp. 68–80.
[7] P. Manolios and S. K. Srinivasan, “Automatic verification of safety and liveness for pipelined machines using WEB refinement,” TODAES, vol. 13, no. 3, pp. 1–19, 2008.
[8] B.-Y. Huang, H. Zhang, A. Gupta, and S. Malik, “Ilang: a modeling and verification platform for SoCs using instruction-level abstractions,” in TACAS, 2019, pp. 351–357.
[9] B.-Y. Huang, H. Zhang, P. Subramanyan, Y. Vizel, A. Gupta, and S. Malik, “Instruction-Level Abstraction (ILA): A Uniform Specification for System-on-Chip (SoC) Verification,” TODAES, vol. 24, no. 1, pp. 1–24, 2018.
[10] H. Hsing, “OpenCores.org: Tiny AES,” 2014, [Online]. Available: https://opencores.org/project/tiny_aes, accessed on: 2020-04.
[11] J. Pu, S. Bell, X. Yang, J. Setter, S. Richardson, J. Ragan-Kelley, and M. Horowitz, “Programming Heterogeneous Systems from an Image Processing DSL,” TACO, vol. 14, no. 3, pp. 1–25, 2017.
[12] T. Tambe, E.-Y. Yang, Z. Wan, Y. Deng, V. J. Reddi, A. Rush et al., “Algorithm-Hardware Co-Design of Adaptive Floating-Point Encodings for Resilient Deep Learning Inference,” in DAC, 2020, pp. 1–6.
[13] C. Wolf, “PicoRV32,” 2020, [Online]. Available: https://github.com/cliffordwolf/picorv32, accessed on: 2020-11.
[14] Bluespec, Inc., “BlueSpec RISC-V designs,” 2020, [Online]. Available: https://github.com/bluespec, accessed on: 2020-11.
[15] K. Asanovic, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio et al., “The rocket chip generator,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17, 2016.
[16] S. Kraemer, R. Leupers, D. Petras, and T. Philipp, “A checkpoint/restore framework for SystemC-based virtual platforms,” in International Symposium on SoC, 2009, pp. 161–167.
[17] Y. Xing, H. Lu, A. Gupta, and S. Malik, “Leveraging processor modeling and verification for general hardware modules,” in DATE, 2021, pp. 1130–1135.
[18] W. Snyder, D. Galbi, and P. Wasson, “Verilator,” 2009, [Online]. Available: https://www.veripool.org/projects/verilator, accessed on: 2020-11.
[19] Mentor, “Accellera Systems Initiative,” 2020, [Online]. Available: https://www.accellera.org/downloads/standards/systemc, accessed on: 2020-11.
[20] R. Nikhil, “Bluespec System Verilog: efficient, correct RTL from high level specifications,” in MEMOCODE, 2004, pp. 69–70.
[21] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis et al., “Chisel: constructing hardware in a scala embedded language,” in DAC, 2012, pp. 1212–1221.
[22] Esperanto Technology, “Dromajo,” [Online]. Available: https://github.com/chipsalliance/dromajo/, accessed on: 2020-11.
[23] D. W. Victor, J. M. Ludden, R. D. Peterson, B. S. Nelson, W. K. Sharp, J. K. Hsu et al., “Functional verification of the POWER5 microprocessor and POWER5 multiprocessor systems,” IBM Journal of Research and Development, vol. 49, no. 4.5, pp. 541–553, 2005.
[24] N. Bombieri, F. Fummi, and G. Pravadelli, “On the evaluation of transactor-based verification for reusing TLM assertions and testbenches at RTL,” in DATE, 2006, pp. 1–6.
[25] M. Chen and P. Mishra, “Assertion-based functional consistency checking between TLM and RTL models,” in 26th VLSID, 2013, pp. 320–325.