URL: https://jyhuang91.github.io/papers/tc2021-rvdfi.pdf

IEEE TRANSACTIONS ON COMPUTERS

RVDFI: A RISC-V Architecture with Security Enforcement by High Performance Complete Data-Flow Integrity

Lang Feng∗, Member, IEEE, Jiayi Huang∗, Member, IEEE, Luyi Li, Haochen Zhang, Zhongfeng Wang†, Fellow, IEEE

 Abstract—With the rapid revolution of open-source hardware, RISC-V architecture has been prevalent in both academic research and
 industrial developments. Due to the increasing threats of information leakage, it is imperative to provide a secure RISC-V ecosystem to
 defend against malicious software exploits. Toward this goal, data-flow integrity (DFI) is employed as a strict security policy for
 enforcing the legitimacy of each data access, thereby filtering out most of the attack exploits. However, due to the intensive
 computations needed by DFI, only a few proposals have successfully implemented even partial DFI with low performance overhead.
 Moreover, no previous study enforces the complete DFI policy on a real hardware platform, and existing designs trade security
 strength for performance efficiency. To provide the RISC-V architecture with high security enforcement and low performance overhead, we
 leverage the open-source Rocket Chip and propose RVDFI, the first complete DFI implementation based on the RISC-V architecture, with
 only 17.8% performance overhead on average and 3.9% at minimum, far less than the 166.3% overhead incurred by the previous
 complete DFI implementation.

 Index Terms—Data-Flow Integrity, RISC-V, Computer Architecture, Security, Rocket Chip.

∗Both authors contribute equally to this paper.
†The corresponding author.
Digital Object Identifier no. 10.1109/TC.2021.3133701
Author's version. 0018-9340 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1 INTRODUCTION

In recent years, rapid open-source hardware developments have broken the barriers in semiconductor design, paving the way for the next wave of computing design and innovation. Various efforts across communities have been made to reshape the ecosystem of open-source hardware, including electronic design automation (EDA) [1], [2], agile development methodology [3], [4], open instruction set architecture (ISA) [5], and so on. To eliminate the access barrier posed by commercial proprietary EDA software, several EDA tools have been open sourced for free usage, such as Verilog to Routing [1] and Icarus Verilog [2]. Additionally, many artificial intelligence hardware accelerator designs and generators are openly available, such as NVDLA [6], Gemmini [3], and DNNweaver [4]. In the field of processor design, the open RISC-V ISA [5] has been revolutionizing processor development in both academia and industry [7], [8], [9], [10], [11], [12].

In 2010, RISC-V [5] was designed and later became an open ISA standard, which has led to various hardware developments covering the fields of artificial intelligence [12], [13], high performance computing [8], [14], cryptography [11], etc. However, unlike other architectures, the security enforcement on RISC-V is still in its infancy. For world-leading companies such as Intel, AMD and ARM, their processors are equipped with security modules, such as Intel CET [15], AMD SEV-SNP [16], and ARM CoreSight [17], to provide both security defense and performance efficiency. On the security side, software programs have become more complex and tend to expose more vulnerabilities for wider attack surfaces, which requires strong protection. Although such vulnerabilities can be defended against with software mechanisms, these mechanisms usually incur considerable performance overhead. Therefore, it is imperative to provide security support for RISC-V to enforce security while maintaining performance efficiency.

Data-flow integrity (DFI) is a strict security policy that enforces the legitimacy of all memory access instructions [18]. Since most software attacks need to access at least one piece of data in memory, DFI can be used to identify illegal data accesses. Thus, most software attacks, including control-data attacks [19] and non-control-data attacks [20], [21], can be detected. However, as around 30% of the instructions in a typical software program are memory accesses such as load and store, DFI verification can be frequent and consume intensive resources. Due to this difficulty, there were only a few follow-on proposals after DFI was proposed and implemented in software in 2006 [18]. If complete DFI is defined as the DFI policy proposed in the seminal work [18], all the previous designs either enforce only partial DFI or incur large performance overhead. Moreover, none of them implements complete DFI verification on a real hardware platform.

To mitigate the performance overhead of DFI verification, hardware-assisted approaches have been investigated in recent work [9], [22]. However, security strength has been traded for performance to make the previous designs practical. Although a recent study [23] leverages near-memory processing (NMP) and realizes complete DFI, it incurs high performance overhead and hardware resource consumption. Furthermore, NMP may not be widely accessible, especially for IoT devices. In contrast, our work explores various architectural enhancements along with the
compilation flow. The proposed enhancements realize complete DFI verification without security loss and with reasonable hardware resource consumption, while achieving less than 1/9 of the performance overhead of the software-based complete DFI implementation.

To provide strong security protection on RISC-V, our work proposes RVDFI, a RISC-V processor design equipped with high security enforcement by performing DFI verification at runtime. With DFI verification, the performance of the proposed RISC-V processor is still comparable with an unsecured baseline. To the best of our knowledge, RVDFI is the first design with complete DFI enforcement and less than 20% (17.8%) performance overhead, and is developed based on the open-source RISC-V based Rocket Chip SoC. In summary, the contributions of this paper are as follows:
• Architectural Support: We propose a RISC-V architecture extension and instrumentation approach for transmitting DFI related information and enabling DFI capability for RISC-V processors.
• Microarchitecture: A dedicated DFI verification module is designed to provide light-weight acceleration for the simple enforcement logic, which frees the processor pipeline from the DFI burden for useful work, thereby reducing performance overhead.
• Enhancements: Enhancements are proposed to improve the security and further reduce performance loss, including the support for function return and library protection, the hardware design for dynamic redundant load pruning, and dedicated DFI caches.
• Evaluation: We evaluate the performance using the SPEC CPU2006 benchmark suite [24]. The results show that RVDFI only incurs 17.8% performance overhead on average, which is less than 1/9 of the performance overhead of the software-DFI [18] baseline. The security analysis also shows that complete DFI effectively defends against 156 control-data attacks generated from the RIPE suite and two real-world non-control-data attacks. The comparisons with previous work show that RVDFI has high security and low hardware resource consumption.

In the following sections, we first introduce the preliminaries of DFI, RISC-V, and the threat model in Section 2. Next, we discuss the related work in Section 3. The basic RVDFI architecture for DFI verification is proposed in Section 4. Then, Section 5 further introduces various enhancements for improving security and performance. Experimental results are analyzed in Section 6. Finally, Section 7 concludes this paper.

2 PRELIMINARIES

2.1 Data-Flow Integrity

Data-flow integrity (DFI) is a policy for ensuring the legitimacy of each data access. Since most software attacks are based on data modification, DFI can identify a wide range of attacks. For example, control-data attacks such as return-oriented programming (ROP) and jump-oriented programming (JOP) can be detected. One example of a control-data attack is an attack on indirect branches. Attackers may change the indirect branches' targets illegally, thereby making the processor execute illegal instructions. This illegal data modification cannot pass the DFI verification, and the attacks can be detected. Besides, non-control-data attacks such as Heartbleed [21] and the vulnerability in Nullhttpd [20] can also be identified by DFI verification since these attacks are performed by illegally modifying or reading the data. Compared with control-flow integrity (CFI) [25], which can only protect against control-data attacks and can be bypassed by non-control-data attacks, DFI can enforce the security of both control and non-control data.

To formalize DFI, given a program, each instruction is assigned an identifier (ID). We call the instructions that can write data to the memory the write instructions, such as the store instruction, and the instructions that can read data from the memory the read instructions, such as the load instruction. Note that the instruction here is a general concept. Either each statement of an assembly code or a C program can be called an instruction. When a read instruction with ID A reads data from the memory, the data needs to have been written by one of the legal write instructions with IDs V₁, V₂, ..., Vₙ. The set of IDs {V₁, V₂, ..., Vₙ} is called the reaching definition set (RDSet) of instruction A. Each read instruction has its own RDSet, which includes the IDs of all the legal write instructions of this read instruction. The RDSet of each read instruction is generated by static analysis of the program. Besides, DFI needs a table that records the ID of the latest write instruction writing to each piece of data. This table is called the reaching definition table (RDTable). Each time a write instruction writes data, the RDTable needs to be updated. When a read instruction reads data, the latest write instruction ID of the data is read from the RDTable, and the obtained ID is compared with each ID in the RDSet of this read instruction. If one of the IDs in the RDSet matches, DFI verification passes; otherwise, there is a DFI violation.

DFI can be illustrated by the example shown in Fig. 1. Assume the ID of each instruction in this example is the line number. After static analysis, the RDSet of instruction 8 is {6} since printf reads the data written by instruction 6. Similarly, the RDSet of instruction 11 is {10}; the RDSet of instruction 13 is {5}; the RDSet of instruction 16 is {10, 13}. Therefore, if the variable len at line 11 is too large, there will be a buffer overflow and data can be illegally written by instruction 11. If so, when instruction 13 (both a write and a read instruction) is executed, the latest write instruction recorded in the RDTable for the data it reads is 11, which is not in {5}, and a DFI violation occurs.

 1  int data[32];
 2  int data2[32];
 3  int data3[32];
 4  ...
 5  data[pos] = func(pos);
 6  data2[pos2] = func(pos2);
 7  for (i = 0; i < 32; i++)
 8      printf("%d\n", data2[i]);
 9  for (i = 0; i < 32; i++)
10      data3[i] = i;
11  memcpy(data2, data3, len);
12  for (i = 0; i < len2; i++) {
13      data3[i] = data[i];
14  }
15  for (i = 0; i < 32; i++)
16      printf("%d\n", data3[i]);

Fig. 1. The code example for illustrating DFI.

2.2 RISC-V Architecture and Motivation

Our design RVDFI is based on an open-source RISC-V based SoC, Rocket Chip [7]. However, the key idea of
our design can be easily applied to other processors. The core architecture of Rocket Chip, which is called the Rocket tile, is shown in Fig. 2. It is an in-order processor with a 5-stage pipeline, where IF, ID, EX, MEM, WB stand for "instruction fetch", "instruction decode", "execute", "memory access", and "writeback", respectively. Besides, Icache and Dcache represent instruction cache and data cache, respectively. Different from a typical processor, Rocket Chip contains a Rocket Custom Coprocessor (RoCC), which can be customized with specific functions and controlled by customized instructions provided by the Rocket Chip.

[Figure 2: block diagram of the Rocket tile, showing the core's pipeline stages (IF, ID, EX, MEM, WB), the RoCC, the Icache, and the Dcache.]
Fig. 2. The architecture of Rocket tile.

For software-DFI [18], general-purpose machine instructions are used for DFI verification, including storing/loading/searching the RDTable and DFI checking, which checks whether the latest write instruction's ID is in the RDSet. A previous study [23] revealed that most of the performance overhead is caused by DFI checking, which contains many comparison and branch instructions. Motivated by this, we move the checking tasks from the processor core to a dedicated hardware module. This frees the core from the computation resources required by DFI checking. The dedicated hardware module needs to have three functions: it can communicate with the core to monitor the executed and committed instructions, it can access the memory to reach the RDTable and RDSets, and it can be customized to perform DFI verification. Therefore, RoCC is leveraged since it satisfies these architectural requirements.

2.3 Threat Model and Assumptions

In this paper, we address memory corruption based attacks. Therefore, we follow the typical threat model of most related work. We assume the software may have one or more vulnerabilities that can be exploited to attack the system through data alteration. Once the attack succeeds, the attacker would be able to read and write any memory locations in the system. This would allow attackers to perform different measurements or take various actions, which are not limited to any certain types as this capability empowers many attack vectors. The vulnerabilities can come from different places, including the operating system kernel, hypervisor, library code, user programs and so on. As long as the binary is generated by the DFI software analysis and compilation workflow, it would be DFI-capable, and thereby under the protection of RVDFI. As a hardware-based solution, we assume the hardware is trusted and bug free. So attacks that exploit hardware vulnerabilities, such as the rowhammer [26] attack and cache side-channel attacks [27], are out of scope. These attacks can be mitigated by other hardware-based protection techniques [28], [29].

As DFI is a hybrid approach involving both static software analysis and runtime protection, RVDFI requires the software analysis to assign the instruction IDs and generate the data-flow graph (RDSets) of a program. In addition, since RVDFI extends the instruction set to embed DFI capability, it also needs compilation support for code generation. We assume the DFI software toolchain is also trustworthy and bug free.

3 PREVIOUS WORK

3.1 RISC-V Architecture

Along with RISC-V [5] developments, coherent RISC-V based system-on-chip (SoC) generators such as Rocket Chip [7] and BlackParrot [30] have been proposed to foster agile RISC-V designs. Recently, several SoC prototypes have been designed with agile development methodology by leveraging the open-source RISC-V ecosystem [8], [14]. For example, Celerity [14] features five general-purpose Rocket cores for controlling a massively parallel 496-core tiled manycore array and ten ultra-low-power cores to achieve both performance and energy efficiency. HammerBlade [8] is another highly programmable and energy efficient manycore RISC-V fabric for accelerating mixed sparse/dense computation through heterogeneity. Besides, the RISC-V architecture has been employed to accelerate machine learning workloads. A RISC-V based multicore scheduling approach [13] was proposed for accelerating deep neural networks. XpulpNN [12] is also based on RISC-V and extends the ISA to realize low bitwidth quantized neural networks with low power consumption. In addition, RISC-V is also leveraged for post-quantum cryptography (PQC), as in RISQ-V [11]. In the security field, defense mechanisms have been developed and demonstrated based on RISC-V, such as hardware-assisted data-flow isolation (HDFI) [9] and tagged memory supported data-flow integrity (TMDFI) [22], which will be described in detail shortly. Meanwhile, RISC-V is also widely used in industrial developments, such as the NVIDIA Falcon controller [10] and the Google RISCV-DV [31] instruction generator.

3.2 DFI Variants

There have been a few proposals that follow up on DFI to reduce its performance overhead while lowering its criteria to partial DFI [9], [32], [33]. Instead of maintaining the full data-flow, write integrity testing (WIT) ensures that each object can only be modified by particular write instructions [32]. However, since WIT only enforces the integrity of write instructions, an unsafe read instruction can read more bytes than the programmer intended, and consequently lead to information leakage. Attacks such as Heartbleed [21] can bypass WIT. In contrast, RVDFI can defend against such attacks since DFI enforces the integrity of both read and write instructions. Song et al. proposed an access control system, Kenali, with DFI [33]. But it only applies to the operating system kernel, leaving a wide surface in user code and data for exploits. The most recent work PIM-DFI [23] leverages near-memory processing (NMP) for DFI verification offloading. However, it incurs an average of 36.4% performance overhead, which is about 1× more than (roughly twice) that of RVDFI.
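For reference when comparing these variants, the complete DFI check formalized in Section 2.1 can be modeled in a few lines of C. The sketch below is a software illustration only, not the RVDFI hardware or the software-DFI implementation of [18]. It assumes one 16-bit RDTable entry per 4-byte-aligned word, and it replays the overflow scenario of Fig. 1 for a single element. The memory size, the array layout (data placed directly above data2), the use of ID 0 for "never written", and all function names are hypothetical choices made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS  64u   /* hypothetical memory: 64 four-byte words         */
#define DATA2_BASE 0u    /* assumed layout: data2[32] at bytes 0..127       */
#define DATA_BASE  128u  /* assumed layout: data[32] right above data2      */

/* RDTable model: ID of the latest writer of each 4-byte word.
 * ID 0 is used here to mean "never written" (an assumption). */
static uint16_t rdtable[MEM_WORDS];

/* Write side: record `id` as the latest writer of each word in
 * [addr, addr + len). */
static void dfi_on_write(uint16_t id, uint32_t addr, uint32_t len) {
    for (uint32_t a = addr; a < addr + len; a += 4) {
        assert((a >> 2) < MEM_WORDS);
        rdtable[a >> 2] = id;
    }
}

/* Read side: pass only if the latest writer of `addr` is in the RDSet
 * of the reading instruction. */
static bool dfi_on_read(uint32_t addr, const uint16_t *rdset, size_t n) {
    uint16_t last_writer = rdtable[addr >> 2];
    for (size_t i = 0; i < n; i++) {
        if (rdset[i] == last_writer) {
            return true;
        }
    }
    return false;
}

/* Replays Fig. 1 for data[0]: instruction 5 writes it, instruction 11
 * (memcpy into data2) writes `memcpy_len` bytes, and instruction 13
 * reads it with RDSet {5}. Instructions 6 and 10 are omitted because
 * they do not affect this read. Returns true if the read passes. */
static bool fig1_read_passes(uint32_t memcpy_len) {
    static const uint16_t rdset_13[] = {5};
    for (size_t i = 0; i < MEM_WORDS; i++) {
        rdtable[i] = 0;
    }
    dfi_on_write(5, DATA_BASE, 4);            /* ID 5: data[0] = ...    */
    dfi_on_write(11, DATA2_BASE, memcpy_len); /* ID 11: memcpy(data2..) */
    return dfi_on_read(DATA_BASE, rdset_13, 1);
}
```

Under these assumptions, fig1_read_passes(128) models a memcpy that stays inside data2, so the read at instruction 13 passes, while fig1_read_passes(132) models a 4-byte overflow into data[0], so the latest writer becomes 11 and the check reports a violation. The linear scan over the RDSet is the comparison-and-branch work that Section 2.2 identifies as the main cost of software-DFI.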

Moreover, PIM-DFI requires NMP, 238,333 LUTs and 39,994 FFs when implemented on the same platform as RVDFI, which is an order of magnitude more hardware resources than RVDFI. Another example is hardware-assisted data-flow isolation (HDFI) [9]. Similar to the DFI policy, the policy enforced by HDFI also requires that the data read by each instruction can only be written by certain instructions. However, HDFI only separates the memory into two regions, which are sensitive and non-sensitive. In contrast, the complete DFI enforced by RVDFI uses 16-bit IDs, which is enough for even large software programs [18], with much finer granularity and thus higher security. The weakness of HDFI has been discussed in prior work [23], which shows that some attacks that are missed by HDFI can be detected by the complete DFI policy. Consequently, RVDFI is 32768× finer-grained than HDFI. Similarly, tagged memory supported data-flow integrity (TMDFI) [22] also has low granularity, since it only uses 8 bits for the ID and only supports 256 different IDs. As shown in [23], a typical program, such as a benchmark in SPEC CPU2006, needs more than 1000 or even 10000 IDs, which is orders of magnitude more than what TMDFI can provide. Therefore, RVDFI is 256× finer-grained than TMDFI. Moreover, TMDFI has more than 39% performance overhead, which is about 2× that of RVDFI.

3.3 Control-Flow Integrity

Control-flow integrity (CFI) [25] is another security policy different from DFI. CFI enforces the legitimacy of each transition between the instruction sequences. For example, CFI requires that each branch instruction in a program should only jump to one of the legal targets generated by static analysis. CFI was first proposed with a software implementation [25], and later assisted with hardware approaches [34], [35] to reduce the performance overhead. Intel proposed control-flow enforcement technology (CET) [15] to enforce CFI, which is a coarse-grained implementation compared to Griffin [34]. Griffin is a CFI design that uses Intel Processor Tracing to generate control-flow traces, which are used for CFI verification in software. Lee et al. [35] proposed using the ARM Program Trace Macrocell to generate the control-flow traces, which are sent to an FPGA through the ARM Trace Port Interface Unit for CFI verification. In comparison, CFI is able to detect control-data attacks but not non-control-data attacks, while DFI can identify both types of attacks [18].

3.4 Hardware-based Memory Protection

To reduce the high overhead of software memory protection mechanisms, several hardware-based approaches have been proposed [36], [37], [38], [39]. CHERI [36] uses a one-bit tag to indicate whether a memory address stores a valid capability, where 256 bits are used to describe the capability of the stored fat pointer, which can be used for bounds checking. A recent study has shown that CHERI fails to protect intra-object data since the bounds checking is at coarse object granularity [40]. Recently, BOGO [37] leverages the memory protection extension (MPX) to provide temporal memory safety in addition to MPX's spatial memory safety. AOS [38] is a low-overhead always-on hardware-assisted approach to protecting heap memory safety with bounds checking, which leaves the stack data as an attack surface. All the above approaches focus on pointers for coarse-grained bounds checking, which is limited for fine-grained data protection such as generic and intra-object data. Therefore, they fall short in fine-grained data protection compared to RVDFI, which provides instruction-level data access control covering all data accesses. PHMon is a programmable hardware monitor that can implement different security policies [39]. Despite its flexibility, its generality can incur high overhead for simple checks and miss the opportunities for runtime optimizations offered by RVDFI, which are critical to achieving significant overhead reduction. In contrast to the above solutions, RVDFI provides both fine-grained data protection and low runtime overhead through a complete DFI implementation on the RISC-V architecture.

4 BASIC RVDFI ARCHITECTURE

In this section, we introduce the basic RVDFI architecture. The overall DFI verification flow is first presented. Then, two important aspects are described, which are the static analysis for facilitating runtime DFI verification, and the DFI-related information transmission between the processor core and RoCC.

4.1 DFI Verification Flow

The overview of the basic RVDFI architecture and the DFI verification flow is shown in Fig. 3, where the DFI verification is performed in RoCC and managed by the proposed DFI controller. The software program that needs to be verified by DFI is called the target program. The verification flow is separated into offline and online parts. In the offline part, the target program is analyzed and instrumented during the compilation flow. In the online part, the DFI verification is performed to check the execution of the target program at runtime. During the execution of the target program, when a memory instruction that needs to be verified is committed, a DFI-request is raised by the core and sent to the DFI controller for verification. The instruction that initiates the DFI-request is called the corresponding instruction of this DFI-request. RoCC stalls the core if a new DFI-request is raised while RoCC is busy. "Other" in Fig. 3 stands for all the other signals needed by RoCC to complete DFI verification. The circled numbers in Fig. 3 show the steps of the DFI verification flow, which are described as follows (⓪ is not shown in the figure):

[Figure 3: the offline part (target program, static analysis, code generation, RDSets) and the online part (the core pipeline IF, ID, EX, MEM, WB with Icache and Dcache; RoCC with the DFI controller, connected by the DFI-Request, IDs, Other, and Stall signals; memory holding the target program, RDSets, and RDTable), annotated with steps ① to ⑤.]
Fig. 3. The DFI verification flow for the basic RVDFI system.

⓪ The operating system (OS) is modified to reserve a part of physical memory dedicated for RDSets and RDTable.
  The core can check access bounds to ensure that no user instruction can write to the reserved memory.
① Given the source code of the target program, static analysis is performed to assign an ID to each memory instruction and generate the RDSet for each of them.
② The target program is instrumented for sending DFI-requests and related information to RoCC through its customized instruction.
③ When the target program is loaded by the OS, its RDSets are also loaded into the memory.
④ The instructions of the target program are executed on the core, and DFI-requests are raised by memory instructions at the commit stage and sent to RoCC at runtime.
⑤ During the execution of the target program, when a DFI-request is raised, the DFI controller checks the ID (A), type (write/read) and the target address (T_addr) of the corresponding instruction. If the type is write, the DFI controller updates the RDTable to record that the latest instruction writing to T_addr is A. Otherwise, if the type is read, the DFI controller first reads the ID (B) of the latest instruction writing to T_addr from the RDTable. Then, the DFI controller reads instruction A's RDSet, and checks if B is in the RDSet. If so, DFI verification passes; otherwise, a violation is detected and an exception is raised.

Therefore, RVDFI is able to enforce DFI at runtime by following the above verification flow.

4.2 Static Analysis and RDSets/RDTable Formats

To facilitate DFI verification of a target program, we use the LLVM [41] compiler infrastructure and the LLVM-based static value-flow analysis (SVF) framework [42] to generate the RDSets through static analysis. First, LLVM [41] is used to compile the target program into the intermediate representation (IR) code. Then we apply Andersen's algorithm (field-sensitive, context- and flow-insensitive) to perform the static analysis on the IR and generate the def-use chains, based on which the RDSet for each read instruction is generated. The RDSet of a read instruction (use) consists of all the write instructions (defs) of the variables that can flow to it. The same RDSets are used in both the software-DFI [18] baseline and RVDFI. Note that the main focus of this work is to reduce DFI performance overhead given the static analysis results. More precise static analysis is beyond our scope and we leave it for future exploration.

[Figure 4: RDSmap entries at 0x2400 (0, 5), 0x2420 (35, 36), and 0x2428 (36, 39); RDSets rows at 0x2500 (3, 4, 7, 9), 0x2548 (0, 1, 2, 6), and 0x2550 (7, 8, 3, 1).]
Fig. 4. An example of finding a RDSet using RDSmap.

The RDSets are loaded into the memory when the target program is loaded by the OS upon execution. Since the RDSets have variable sizes, a fixed RDSet size needs to support the largest set and can create significant memory fragmentation. Instead, we adopt an indirect access approach via a RDSmap, which has one entry per load instruction ID. Each 64-bit RDSmap entry records the corresponding load ID's RDSet bounds in the RDSets memory region, where the higher and lower 32 bits are used for the upper and lower bounds of the RDSet in the RDSets region, respectively. Fig. 4 shows an example of finding an RDSet using the RDSmap. Suppose the RDSmap starts at address 0x2400, and the load ID is 5, whose RDSet bounds are stored at the sixth entry of RDSmap at address 0x2428 (0x2400+8×5). The entry tells that the RDSet of the load instruction is the 36th to 38th IDs in RDSets, that is {0,1,2}. Same as RDSets, RDSmap is also static information and generated at compile time.

Similar to the .data and .text sections, RDSmap and RDSets can also be lowered to the binary as .rdsmap and .rdsets sections that can be loaded into the memory during program loading time. In our evaluation, we store them in a file and load them to a dedicated memory region before the program starts.

The RDTable is maintained in the physical memory, and each memory location of the remaining physical memory has a corresponding table entry. When one page is shared by multiple processes, the shared page should be read-only, and the corresponding entries in the RDTable will not be updated. When a shared page is about to be written, copy-on-write in the OS will copy the shared page to a new physical page and assign it to the writer process. The new physical page corresponds to a different region in the RDTable, so there is no conflict between processes. Similar to the seminal complete DFI implementation [18], the data can be 4-byte aligned. Therefore, the n-th entry of the RDTable can record the ID of the latest instruction writing to the physical target address n << 2. Since each ID costs 16 bits (2 bytes), in this case, if the size of the physical memory is $N$, the RDTable size is $\frac{N}{4} \times 2 = \frac{N}{2}$. Similarly, if the data is 8-byte aligned, the RDTable size is $\frac{N}{4}$. Note that the RDTable cannot be tampered with by user programs, but it can be updated by the proposed DFI controller.

4.3 Information Transmission and Instrumentation

Since DFI verification is performed in RoCC, all the information related to DFI needs to be provided to RoCC. According to Section 2.1, to verify DFI of a memory instruction in a target program, a DFI-request is issued by the core and the following pieces of information of the corresponding instruction are needed:
• Iid: The ID of the corresponding instruction, which is static information only related to the corresponding instruction itself.
• Itype: The type (write/read) of the corresponding instruction, which is also static information.
• Itaddr: The target address of the corresponding instruction, which is dynamic information and can only be obtained from the core when the write/read instruction is executed and committed.
• Iwid: The ID of the latest write instruction writing the data that is read by the corresponding instruction, if the corresponding instruction is a read instruction. This is also dynamic information.
• Irds: The RDSet of the corresponding instruction, if the corresponding instruction is a read instruction. This is static information generated at compile time and retrieved by the RoCC at runtime.

4.3.1 Transmitting Iid and Itype

With the offline optimization for reducing the number of IDs [18], 16 bits are enough for encoding the ID of each instruction, so Iid and Itype only need 17 bits to be encoded, with 16 bits for Iid and 1 bit for Itype. Due to the small size and static nature of Iid and Itype, instead of storing Iid and Itype in the memory, we can encode them along with the corresponding instruction. We implement this by leveraging one of RISC-V's customized instructions, custom0, which can be used to control RoCC. When a custom0 finishes execution in the core's pipeline and retires, its instruction body is sent to RoCC by the core. We directly instrument a custom0 right after each write/read instruction, and use the body of custom0 to encode Iid and Itype as the extended additional bits.

We adjust the format of custom0 to encode Iid and Itype in the instruction body. The original format is shown in Fig. 5. For the white row in Fig. 5, funct7 is a 7-bit indicator whose meaning can be freely defined by the designers. RoCC can write any data back to the destination register rd if xd is 1. Besides, rs1 and rs2 are the source registers, and the core can pass the data in rs1/rs2 to RoCC if xs1/xs2 is 1, respectively. We change the format of custom0 to the gray row in Fig. 5, by leveraging funct7, rd, rs1 and rs2 to encode Iid and Itype. The reason why we skip xd, xs1 and xs2 is that there are some constraints on their values, so they cannot take the arbitrary values that DFI may require.

Original (white) format:
  Bits   | 31-25  | 24-20 | 19-15 | 14 | 13  | 12  | 11-7 | 6-0
  Field  | funct7 | rs2   | rs1   | xd | xs1 | xs2 | rd   | opcode
New (gray) format:
  Bits   | 31-27 | 26   | 25     | 24-15    | 14 | 13 | 12 | 11-7    | 6-0
  Field  | 0     | type | ID[15] | ID[14:5] | 0  | 0  | 0  | ID[4:0] | opcode

Fig. 5. The original/new (white/gray) format of custom0 instruction.

Since the static analysis of our work is based on LLVM IR, the only write/read instruction is store/load. A pseudo code example of the instrumentation can be found in Fig. 6, where the code at line 2/4 stands for store/load data to/from address 0x90/0x70, respectively. In Fig. 6, right after each store and load, a custom0 is instrumented. The instrumentation of store is named Inmet_St, while that of load is named Inmet_Ld. We call the instruction to which a custom0 relates the corresponding instruction of the custom0. For example, the store instruction at line 2 is the corresponding instruction of the custom0 at line 3. When RoCC receives the body of a custom0, a DFI-request is raised by the core for DFI verification.

1  ...
2  st a2 (0x90)
3  custom0 (Inmet_St, with the ID and the type "store")
4  ld a0 (0x70)
5  custom0 (Inmet_Ld, with the ID and the type "load")
6  ld a1 (0x80)
7  custom0 (Inmet_Ld, with the ID and the type "load")
8  ...

Fig. 6. A pseudo code example of the instrumentation for getting Iid, Itype and Itaddr.

By adding Inmet_St and Inmet_Ld, when custom0 is committed and its body is transmitted to RoCC, RoCC can get the ID (Iid) and the type (Itype) of the latest committed store or load.

An alternative approach to transmitting Iid and Itype is to encode them along with each write/read instruction. This may require expanded instruction-length encoding, which we leave for future work.

4.3.2 Transmitting Itaddr

Itaddr can be obtained at the MEM stage while executing a store/load, since Itaddr is used to access Dcache for the data. However, according to Inmet_St and Inmet_Ld, a store/load accesses the data before the following custom0 reaches RoCC. Therefore, when RoCC receives Iid and Itype by receiving custom0 from the core, RoCC needs to be able to obtain the target address (Itaddr) of the latest store/load that is right before the custom0. To realize this, we make the architecture change shown in Fig. 7 to pass Itaddr along the CPU pipeline.

[Figure 7: the modified Rocket Chip pipeline (EX, MEM, WB) with two added pipeline registers that carry the target address to a target address buffer inside RoCC, next to the DFI controller and the DFI-Request, ID/type, and Stall signals; the example traces ld a0 0x70, custom0, and ld a1 0x80 over clock cycles 1 to 4, with target addresses 0x70 and 0x80 shown at the bottom.]
Fig. 7. The modified Rocket Chip and the example for transmitting Itype.

In Fig. 7, two pipeline registers are added to pass the target address from the MEM stage to the commit stage, then the target address is transmitted to RoCC. A target address buffer is added inside RoCC to record the latest target address. With the two registers and the target address buffer, it is ensured that the target address of the corresponding instruction and the custom0 are synchronized and can reach the DFI controller at the same time. An example is also shown in Fig. 7, where the red numbers stand for the clock cycles, and the vertical dotted lines separate the stages. The numbers at the bottom of Fig. 7 are the target addresses. The executing code is lines 4-6 in Fig. 6. It is shown that when the custom0 reaches RoCC at cycle 4, the DFI controller is also able to fetch the target address of the corresponding ld a0 0x70 from the target address buffer. With this modification and instrumentation, RoCC can obtain the correct target address of the latest memory access instruction before custom0.

4.3.3 Transmitting Iwid and Irds

As described in Section 2.1, Iwid is stored in the RDTable, and the RDTable is initially empty. The RDTable is updated when a new memory write instruction is committed. As discussed in Section 4.2, the RDTable is stored in the memory. RoCC has a memory interface to Dcache, and thus can access the RDTable. For information Irds, it is

generated through the offline static analysis on the target program, and loaded into the memory before the target program starts. Therefore, RoCC can access Irds from the memory while performing DFI verification.

5 ENHANCEMENTS ON RVDFI SYSTEM

Although the basic RVDFI architecture moves the computation of DFI verification from the core to RoCC and relieves the large performance overhead, the performance loss is still not negligible. When there is a new DFI-request, if RoCC is performing DFI verification of the previous DFI-request, the core has to be stalled until the previous DFI verification is finished. Otherwise, the new DFI-request would be missed if the core proceeds with execution, and security would be reduced. To mitigate this problem, we can either temporarily store the DFI-requests to avoid stalling, or increase the speed of the DFI verification performed in RoCC. Besides the performance, the security of the basic RVDFI architecture can also be improved by supporting function return and dynamically linked libraries.

We propose several approaches to further enhance the RVDFI architecture for supporting complete DFI verification with better performance efficiency. The enhanced RVDFI is shown in Fig. 8, with the additional connections between the registers in the core and RoCC, a FIFO, a load pruning buffer, and a few dedicated DFI caches. The details are introduced in the following subsections.

[Fig. 8 diagram: the core (MEM and WB stages, registers sp, a0, a1, a2, ..., and Dcache) connected to RoCC, which contains the FIFO (ID, type), the load pruning buffer, the target address buffer, the DFI controller with the Stall signal, and the RDSmap, RDSet and RDTable caches; the library target addresses and library length are passed from the registers to RoCC.]

Fig. 8. The modified Rocket Chip for DFI verification with the enhancements, which is the final version of RVDFI.

5.1 Supporting Function Return and Library

As described in Section 4.2, the static analysis and instrumentation are performed based on the LLVM IR. Function returns and library function calls both contain memory accesses. However, these memory accesses are usually implicit in the IR without explicit load or store available. Therefore, the proposed instrumentation in Section 4.3 cannot protect function returns and library function calls. We further augment the instrumentation technique and the Rocket Chip architecture to support the DFI verification for function returns and libraries. Although previous works such as [18], [23] also support this, we propose an alternative approach with a more efficient design for RISC-V architectures to achieve less instruction instrumentation.

5.1.1 Function Return Protection

There are two instructions related to function return protection, which are calls and rets. When a call is executed, the return address of the function being called is written on the stack. When ret is executed, the return address is read from the stack and the program counter of the core is changed to the return address. DFI requires that the return address of a function can only be written by the function call that calls this function. To enforce this policy, the memory address of the return address needs to be obtained by RoCC when each call and each ret are executed.

For RISC-V, the memory address of the return address is related to sp, which is the stack pointer register. After each function is called, there are 5 steps before it returns:
1) The call is executed.
2) The return address is stored at sp-4, and the value of sp is decreased to enlarge the stack.
3) The instructions in the function body are executed.
4) The value of sp is increased to change the size of the stack back to that before this function is called, and the return address at sp-4 is read by the core.
5) The ret is executed.
In this case, right before step 1 and right before step 5, RoCC needs to obtain sp-4, which is the memory address where the return address is stored.

To transmit Iid and Itype, and inform RoCC of obtaining the memory address of the return address right before step 1 and 5, we instrument the target program with an additional custom0 right before each call (Inmet Call) and ret (Inmet Ret). When RoCC receives Inmet Call and Inmet Ret, it obtains sp-4 immediately, which is the target address of implicit store and load, respectively.

When RoCC receives an Inmet Call, it updates the RDTable entry according to the address sp-4 with a special ID, -1, which is specifically used for function calls. When RoCC receives Inmet Ret, it reads from the RDTable entry according to the address sp-4, and checks if the data is -1. If so, DFI verification passes. If another illegal store writes data to the return address at sp-4, RoCC updates the corresponding entry of RDTable from -1 to the ID of the store. In this case, DFI verification would fail when the function returns.

5.1.2 Library Protection

For a typical program, there may exist multiple dynamically linked library functions, such as memcpy, memset, etc. Usually the instructions of dynamically linked library functions are not analyzed during static analysis. However, many attacks can happen in libraries such as the libc library [43]. Therefore, it is essential to also enforce DFI for library functions. For each library function, the information of the pointer arguments, including their memory operation types, target addresses and the associated memory range that may be accessed, is used during static analysis. This information can also be propagated to the hardware through an extended custom0 with a new encoding. This new custom0 is instrumented right before each library function call, after function arguments are loaded into registers (Inmet Lib). The custom0 of Inmet Lib can specify whether the library function writes/reads data to/from the memory, and the memory access range. When RoCC receives a DFI-request of Inmet Lib, the DFI controller fetches the needed arguments and performs DFI enforcement by verifying the definition IDs of all the memory locations that are pointed by the read pointer

and its range is in the RDSet of the call instruction. This procedure is similar to checking n loads for a memory range of n. Similarly, the RDTable entries of the write pointer and its memory range are updated with the ID of the current call. Note that when a library function needs to load data, sometimes the memory access range can be huge, which may lead to long detection latency. To prevent the core from executing instruction streams that may contain attacks in this detection window, RoCC stalls the core when it begins to process a DFI-request corresponding to a library function that loads data, until this DFI-request is processed.

Complicated library functions, especially system calls, can be supported similarly to the kernel DFI approach in Kenali [33], which is out of the scope of this work.

5.1.3 Instrumentation and Architecture Enhancements

To support both function return and library protections, we further extend the custom0 instruction in Fig. 5, and the encoding format is shown in Fig. 9, where the white row is for store and load instrumentation, and the gray row is for function return and library call protections. Whether the format is Inmet St or Ld and Inmet Call, Ret or Lib is decided by the 3rd most significant bit (0 for St/Ld and 1 for Call/Ret/Lib). In Fig. 9, “r”, “w”, “ret” represent if the corresponding instruction reads data, writes data, and is a function return or not, respectively.

    Bits                  31-27   26     25       24-15      14   13   12   11-7      6-0
    St/Ld (white)         0       type   ID[15]   ID[14:5]   0    0    0    ID[4:0]   opcode

    Bits                  31   30   29   28   27   26    25       24-15      14   13   12   11-7      6-0
    Call/Ret/Lib (gray)   0    0    1    r    w    ret   ID[15]   ID[14:5]   0    0    0    ID[4:0]   opcode

Fig. 9. The format of new custom0 instruction (gray) for function return and library protection.

An instrumentation example for function returns and libraries is shown in Fig. 10. Inmet Call and Inmet Lib should be right before the call, while Inmet Ret should be right before the ret, according to the instrumentation policy.

     1 def func () {
     2 ...
     3 add sp,sp,0x80
     4 custom0 (Inmet Ret)
     5 ret
     6 }
     7 def func2 () {
     8 ...
     9 ld a0 8(sp)
    10 add a1,a4,6
    11 ld a2 64(sp)
    12 custom0 (Inmet Lib)
    13 call memcpy
    14 custom0 (Inmet Call)
    15 call func
    16 ld a0 (0x70)
    17 custom0 (Inmet Ld)
    18 st a2 (0x90)
    19 custom0 (Inmet St)
    20 ...
    21 }

Fig. 10. A pseudo code example of the instrumentation for enabling DFI verification for function returns and libraries.

Besides, Rocket Chip is further modified to support function return and library protections in Fig. 8. Since the arguments of a library function call are stored in the registers starting from a0, the values of sp, a0, a1, a2, and other successive registers are passed to RoCC and RoCC fetches the values of these registers upon receiving a custom0 of Inmet Call, Ret, and Lib.

5.2 DFI-Request FIFO

As described in Section 4.1, the major cause of performance overhead is core stalling when the core raises a new DFI-request while RoCC is busy with the previous DFI-request. Although dropping the new request can avoid stalling the core, the security can be compromised. Since RoCC may be idle during the processor computation phase when there are no memory instructions for DFI verification, it can result in a free time slack. Based on this, we introduce a FIFO inside RoCC to store the incoming DFI-requests, and send them to the DFI controller if it is free. Once the FIFO is full, the core is stalled. The FIFO is not only for temporarily avoiding stalling the core when there is a new DFI-request; more importantly, it also enables a DFI-request to use the free time slacks. This can further reduce the chance of stalling the core and increase the DFI controller utilization, thereby reducing the performance overhead.

5.3 Dynamic Redundant Load Pruning Buffer

In the seminal software-DFI work [18], offline optimizations for pruning redundant DFI verification were proposed to reduce the performance overhead. However, these offline optimizations are conservative due to static analysis. Without the runtime information, they lose the opportunities of pruning for further reducing performance loss. Although previous work PIM-DFI [23] has suggested the runtime optimizations, its optimization implementations incur significant area overhead, as shown in the result analysis in Section 6.6. In RVDFI, besides implementing all the offline optimizations in work [18] during static analysis, we also propose a light-weight hardware design named load pruning buffer for dynamic redundant load pruning, to further prune the redundant DFI-requests of Inmet Ld at runtime.

[Fig. 11 diagram: the load pruning buffer contents and its Y/N output in panels (a)-(e). The new DFI-request (ID, target address, type) at the top of each panel is (a) 0x24, 0x54, ld; (b) 0x7c, 0x15, ld; (c) 0x36, 0x15, st; (d) 0x67, 0xda, ld; (e) 0x89, 0xa2, ld.]

Fig. 11. An example of dynamic redundant load pruning.

[Fig. 12 diagram: a chain of D flip-flops clocked by clk, holding ID and target address pairs, with a multiplexer controlled by “S/L” in front of each flip-flop, comparators for “Same IDs” and “Same Target Addresses”, and an OR tree producing the “Redundant” output.]

Fig. 12. The hardware structure for dynamic redundant load pruning.

If there are two DFI-requests (L, M) of Inmet Ld with the same ID and target address, and between L and M there is no other DFI-request of Inmet St with the same target address, nor other DFI-request of Inmet Lib, then DFI-request M is redundant. Such a runtime pruning is enabled with the proposed load pruning buffer, where each entry is a pair of the target address and ID of an Inmet Ld. As shown in Fig. 8, the load pruning buffer is added between the FIFO and the DFI controller. Once there is a DFI-request output from the FIFO, the load pruning buffer can decide if this DFI-request is redundant or not.
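The redundancy rule for Inmet Ld requests can be sketched in software as follows. This is a minimal behavioural model, not the hardware design: the class name, the `capacity` parameter and the string tags `"ld"`, `"st"` and `"lib"` are hypothetical, and clearing the whole buffer on an Inmet Lib request is an assumption made here to respect the "nor other DFI-request of Inmet Lib" condition. Inmet Call and Inmet Ret requests are not modelled.

```python
from collections import deque


class LoadPruningBuffer:
    """Behavioural sketch of a load pruning buffer (names are hypothetical)."""

    def __init__(self, capacity=4):
        # Each entry is an (ID, target address) pair of an earlier Inmet Ld.
        # With maxlen set, the oldest entry is dropped once the buffer is full.
        self.entries = deque(maxlen=capacity)

    def observe(self, kind, req_id, addr):
        """Return True if the request is a redundant Inmet Ld and can be dropped."""
        if kind == "ld":
            redundant = (req_id, addr) in self.entries
            # This sketch records every load, whether it hit or not.
            self.entries.append((req_id, addr))
            return redundant
        if kind == "st":
            # A store becomes the newest definition of addr, so entries
            # with the same target address are stale and are removed.
            kept = [e for e in self.entries if e[1] != addr]
            self.entries = deque(kept, maxlen=self.entries.maxlen)
            return False
        if kind == "lib":
            # Assumption: a library call may touch any address, so forget everything.
            self.entries.clear()
            return False
        return False


buf = LoadPruningBuffer(capacity=4)
results = [
    buf.observe("ld", 0x24, 0x54),  # first occurrence, must be verified
    buf.observe("ld", 0x24, 0x54),  # same ID and address, redundant
    buf.observe("st", 0x36, 0x54),  # store to the same address invalidates it
    buf.observe("ld", 0x24, 0x54),  # must be verified again
]
print(results)  # prints [False, True, False, False]
```

A full buffer only loses pruning opportunities: an evicted load is simply verified again, so the model never reports a needed check as redundant.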

An example is shown in Fig. 11, where the large gray rectangles stand for the load pruning buffer. The ID and the target address of a new DFI-request are shown at the top of each subfigure, with st representing Inmet St, and ld representing Inmet Ld. In Fig. 11(a), the new DFI-request of Inmet Ld with ID 0x24 and target address 0x54 is sent from the core, and they are compared with the valid buffer entries. Since there is no match, the buffer outputs an “N” to indicate the new DFI-request is not redundant, and its information is pushed into the buffer as shown in Fig. 11(b). In Fig. 11(b), the information of another new DFI-request is simultaneously compared with all the valid buffer entries with a hit. The buffer outputs “Y” to indicate the DFI-request is redundant, and the DFI-request is ignored by the DFI controller. At the same time, the information of this DFI-request is pushed into the buffer. In Fig. 11(c), the target address of the DFI-request of Inmet St matches some entries on target addresses, meaning the corresponding store of this DFI-request will change the data-flow and become the most recent definition of the target address. Therefore, the matched entries are stale and are cleared from the buffer, as shown in Fig. 11(d). When the buffer is full and there is another new request, the oldest one is shifted out from the buffer as shown in Fig. 11(d) to (e). Note that the overflow does not cause any false alarm since it only reduces the opportunity of pruning redundant DFI processing.

Fig. 12 shows the load pruning buffer design, whose core part is a shift register-like structure realized by a chain of D flip-flops (DFFs). Each circle with “=” is a pair of two comparators, which compare the IDs and the target addresses, respectively. Each comparator outputs 1 if two IDs (or two target addresses) are the same. “S/L” equals 0/1 if the input DFI-request's corresponding instruction is a store/load. The “S/L” is used as the higher control bit of the multiplexer. For each pair of comparators, the results of ID and target address comparison are passed to an AND gate, and all the results of such AND gates are passed to a tree of OR gates. If any port Q of the DFFs has the same ID and target address as the input, the output “Redundant” would be 1, indicating that it is redundant to verify this DFI-request, which can be simply dropped. If “S/L” is 1 (for a load), in each clock cycle, one input can be pushed into the leftmost DFF and all the other data at port Qs are shifted right by 1. However, if “S/L” is 0 (for a store), when the port Q of one DFF has the same target address as the input, this DFF is reset, which is the procedure in Fig. 11(c)-(d). Otherwise, the DFF keeps its value.

5.4 Dedicated Cache

According to the analysis in the section of experiments, in Fig. 16, one can observe that most of the performance overhead of RVDFI is from memory access. Note that in software-based DFI, most of the performance overhead is due to DFI checking but not memory access [23]. This is because RVDFI has moved the DFI checking to RoCC, which is specialized for DFI verification, thereby greatly reducing the checking overhead. Therefore, memory access becomes the top performance overhead factor.

Although RoCC can access level-1 (L1) Dcache to reduce the memory access latency, increasing the size of the Dcache is ineffective, since it can interfere with the memory accesses of the core and affect performance. Consequently, as shown in Fig. 8, level-0 (L0) dedicated caches backed by the L1 Dcache are proposed for RoCC to mitigate the Dcache access contention with the processor core. The effectiveness of our L0 dedicated caches is shown in Section 6.3, where even using a larger Dcache still has almost 50% more performance overhead than using the dedicated caches with a much smaller cache size in total.

Fig. 13. The relationship of the dedicated cache's cacheline size, the access time, and the hit rate, for SPEC CPU 2006 429.mcf benchmark.

As shown in Fig. 8, we added three caches inside RoCC for RDTable, RDSmap, and RDSet, respectively. Note that the RDSmap and RDSet caches are read-only since the two pieces of information are static, while the RDTable cache involves both read and write since it contains dynamic information. For the RDTable cache, a write non-allocate and write through policy is adopted for cache write. The cacheline size of each dedicated cache is also an important design parameter for caching performance. We tested the relationship between the cacheline size, the access time, and the hit rate of RDTable using the 429.mcf benchmark, when the cache size is fixed. The results are shown in Fig. 13, and the access time is normalized to the access time of the 2-byte cacheline configuration. It is shown that the access time is the least when the cacheline is 8 bytes, which is the datapath width between the L0 dedicated caches and the L1 Dcache. The reason is, if the cacheline size is too small, the cache hit rate can decrease, as shown in Fig. 13. If the cacheline size is too large, during one cache miss, RoCC needs to access the Dcache multiple times to fill the cacheline, which can increase the miss penalty. Therefore, we choose 8 bytes (64 bits) as the cacheline size of each dedicated cache.

Besides, since the cacheline of the RDSet cache contains four 16-bit IDs, DFI checking can be parallelized. When there is a load, DFI verification needs to compare the ID of the latest store writing to the loaded data with the IDs in the load's RDSet. Instead of reading one ID each time, RoCC can read 4 IDs and execute 4 comparisons in parallel, further reducing performance overhead.

Miss status holding registers (MSHRs) can also help reduce performance overhead. Since both the core and RoCC can access the Dcache, we add MSHRs to the Dcache, which provides the opportunities for the core (and RoCC) to access the Dcache when RoCC (and the core) is waiting for the response from memory under a cache miss in the Dcache.

6 EXPERIMENTAL RESULTS

6.1 Experiment Setup

RVDFI is developed based on the Rocket Chip, which is a RISC-V based SoC generator [7]. The full system is based

on the Freedom project [44]. The 64-bit core of RVDFI has a 5-stage pipeline, with 16KB instruction cache and 16KB data cache. The memory size of RVDFI is 2GB. The system is prototyped on the HyperSilicon VeriTiger-H4000T FPGA platform with a Xilinx Virtex UltraScale XCVU440FLGA2892 FPGA. We use LLVM [41] for software program compilation and SVF [42] for static analysis. A summary of code changes in this work is shown in Table 1.

    TABLE 1
    Summary of code changes for RVDFI.
                            Language    Lines of Code
    Static Analysis Tool    C           ∼500
    Instrumentation Tool    Python      ∼4800
    Linux System            C           4
    Rocket Chip             Chisel      ∼1700

For security analysis, we use the RIPE suite [19], the Heartbleed attack [45], and the attack code for Nullhttpd [20] to evaluate RVDFI. In addition, we use the SPEC CPU2006 benchmark suite for performance evaluation [24].

6.2 Security Analysis

In this section, we analyze the security of RVDFI. The experiments of control-data attacks, such as return-oriented programming (ROP) and jump-oriented programming (JOP) (indirect branches modification), and non-control-data attacks are discussed.

6.2.1 Control-Data Attacks

RIPE suite [19] is the dominant benchmark of control-data attacks. As discussed in Section 2.1, control-flow attacks can also be identified by DFI since these attacks need to modify the data in the memory. In the experiments, we tested 156 attacks including return-oriented programming (ROP) attacks and jump-oriented programming (JOP) attacks. These attacks change the targets of the indirect branches (such as function pointers), or the return address stored on the stack, to tamper with the control-flow. Results show that RVDFI can detect all the attacks. Besides, we modified RIPE to disable the activation of the attacks, and RVDFI does not report false alarms in this case.

6.2.2 Non-Control-Data Attacks

We also tested two kinds of non-control-data attacks, one for data leaks and the other for illegal data modification. Heartbleed (CVE-2014-0160) [21] is a vulnerability in the OpenSSL cryptography library. When a message, including the payload and the length of the payload, is sent to a server, the server echoes back the message with the claimed length. However, it does not check if the actual payload length is the same as the claimed one. As such, an attacker may send a message claiming a length that is larger than the actual payload length. Then, the server sends back not only the original payload but also some additional data, which might be private sensitive data, to fulfill the claimed length. Consequently, the sensitive data is stolen by the attacker. We use the proof-of-concept code based on [45] for the attack, which is successfully detected by RVDFI as the data to be loaded for sending back must not have been most recently written by an instruction that is not from the sender. An attack-free transaction, where the actual payload length conforms to the claimed one, is also tested and no false alarm is reported.

Nullhttpd is an HTTP server that has a heap overflow vulnerability (CVE-2002-1496) [20]. If the server receives a POST request with negative content length L, it should not process the request. However, the server continues to process and allocates a buffer of L + 1024 bytes, which is less than 1024 bytes. Later, the server writes data of 1024 bytes into the buffer, and therefore buffer overflow occurs. The experiment shows that RVDFI successfully detects such a buffer overflow. When some load instructions attempt to access the data written by overflow, it is found that the data is not written by any instructions in the RDSet of the load instruction. An experiment is also conducted to confirm that RVDFI does not produce false alarms in this context.

6.2.3 Detection Latency

Another metric for evaluating security is the detection latency. The latency is defined as the time interval between the moment each DFI-request containing the DFI checking task (by Inmet Ld, Inmet Ret or Inmet Lib) is raised and the moment its processing is finished, excluding the time when the core is stalled. The latency indicates the cycles that can be used by the core to execute instructions during the DFI checking. The results are shown in Fig. 14, which shows that RVDFI only incurs an average 20-cycle latency. This is short enough to prevent an effective attack from being successfully executed before the DFI violation is detected. Besides, the latency varies across different benchmarks, which is not directly related to the performance overhead. These variations are related to the benchmark characteristics and may result from different memory instruction densities and different RDSet sizes. Note that RVDFI defends against software vulnerabilities and enforces DFI for committed instructions. The detection window has little to no impact on transient execution attacks. The execution of the instrumented DFI related instructions may in return reduce the window of transient execution attacks.

Fig. 14. The average DFI detection latency of each benchmark.

6.3 Performance Overhead

Table 2 shows the performance overhead. The baseline of Columns NR1 and NR2 (Soft-NoMSHR and RVDFI-NoMSHR) is running the uninstrumented target program on the unmodified Rocket Chip without MSHR, and the baseline of Columns 1–9 (Soft-MSHR, Basic-RVDFI, partially enhanced RVDFI variants and fully enhanced RVDFI) is running the uninstrumented target program on the unmodified Rocket Chip with MSHR. The main result of the proposed RVDFI is at Column 9, with each dedicated cache 8KB, and 24KB in total. As shown in Column 9, RVDFI only incurs 17.8% performance overhead, while the previous complete DFI work based on software implementation at

[Fig. 14 plot: y-axis, average DFI detection latency (cycles); x-axis, benchmarks 401.bzip2, 429.mcf, 433.milc, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 470.lbm, 473.astar, and the average.]
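The check that flags the attacks in Section 6.2 can be illustrated with a small software model of the RDTable and RDSet bookkeeping. This is a sketch under stated assumptions, not the RoCC implementation: the class and method names, the instruction IDs and the addresses are all hypothetical, and treating an address with no recorded writer as a violation is a choice made only for this sketch.

```python
CALL_ID = -1  # the special ID that the text reserves for function calls


class DfiMonitor:
    """Software sketch of DFI verification (all names are hypothetical)."""

    def __init__(self, rdsets):
        # rdsets: load ID -> set of store IDs allowed to define the loaded data.
        self.rdsets = rdsets
        # rdtable: target address -> ID of the latest writer; initially empty.
        self.rdtable = {}

    def store(self, store_id, addr):
        # Inmet St: record this store as the latest definition of addr.
        self.rdtable[addr] = store_id

    def load(self, load_id, addr):
        # Inmet Ld: pass only if the latest writer is in the load's RDSet.
        # Assumption: an address with no recorded writer fails the check.
        return self.rdtable.get(addr) in self.rdsets[load_id]

    def call(self, ret_slot):
        # Inmet Call: mark the return-address slot with the special ID.
        self.rdtable[ret_slot] = CALL_ID

    def ret(self, ret_slot):
        # Inmet Ret: pass only if the slot was last written by a call.
        return self.rdtable.get(ret_slot) == CALL_ID


# Hypothetical program: load 11 may only read data defined by store 3.
mon = DfiMonitor({11: {3}})
mon.store(3, 0x90)
ok_before = mon.load(11, 0x90)   # legitimate definition, check passes
mon.store(2, 0x90)               # overflowing store 2 overwrites the data
ok_after = mon.load(11, 0x90)    # store 2 is not in the RDSet, check fails

mon.call(0x7FFC)
ret_ok = mon.ret(0x7FFC)         # untouched return address, check passes
mon.store(2, 0x7FFC)             # illegal store to the return-address slot
ret_bad = mon.ret(0x7FFC)        # entry is no longer -1, check fails
print(ok_before, ok_after, ret_ok, ret_bad)  # prints True False True False
```

The overflow case mirrors the Nullhttpd description: the violation is reported when a later load reads the overwritten data, not at the moment of the overflowing store.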


TABLE 2
Performance overhead of SPEC CPU 2006 benchmark and hardware resource consumption.
(†The percentage is calculated compared with Column NR1. ‡RDSet, RDSmap, and RDTable caches are implemented.)

Scheme                Soft [18]     RVDFI         Soft [18]     Partial RVDFI Partial RVDFI Partial RVDFI Partial RVDFI Partial RVDFI Partial RVDFI Partial RVDFI RVDFI
Scheme Name           Soft-NoMSHR   RVDFI-NoMSHR  Soft-MSHR     Basic-RVDFI   RVDFI-FIFO    RVDFI-LdPr    RVDFI-RDSet   RVDFI-RDSmap  RVDFI-RDTable RVDFI-64KB D$ RVDFI
Column ID             NR1           NR2           1             2             3             4             5             6             7             8             9
MSHR                  ×             ×             √             √             √             √             √             √             √             √             √
FIFO                  -             √             -             ×             √             ×             ×             ×             ×             √             √
Load Pruning Buffer   -             √             -             ×             ×             √             ×             ×             ×             √             √
Dedicated Cache       -             24KB          -             ×             ×             ×             RDSet 8KB     RDSmap 8KB    RDTable 8KB   Dcache +48KB  24KB‡
# of LUTs             50025         62376         59163         63676         63927         66293         65692         65515         65858         67315         71059
                                    24.7%†                      7.6%          8.1%          12.1%         11.0%         10.7%         11.3%         13.8%         20.1%
# of FFs              38571         49853         41981         48373         48047         52993         48394         48373         48388         54349         53209
                                    29.2%†                      15.2%         14.4%         26.2%         15.3%         15.2%         15.3%         29.5%         26.7%
# of BRAMs            81            81            81            81            81            81            81            81            81            117 (44.44%)  81
401.bzip2             244.6%        26.8%         235.6%        78.9%         77.1%         75.0%         33.1%         69.5%         69.4%         66.2%         16.2%
429.mcf               130.1%        38.3%         119.0%        29.6%         25.8%         28.2%         22.5%         25.5%         26.4%         10.8%         17.0%
433.milc              264.6%        31.0%         273.3%        212.5%        207.7%        169.1%        73.8%         187.3%        198.9%        101.7%        24.7%
445.gobmk             271.0%        36.3%         276.0%        54.4%         42.6%         53.2%         43.9%         42.0%         46.7%         10.8%         26.9%
456.hmmer             43.3%         3.9%          43.4%         6.1%          6.0%          6.1%          6.2%          5.0%          6.0%          5.8%          3.9%
458.sjeng             181.4%        15.6%         180.2%        28.9%         24.0%         28.4%         23.4%         22.5%         23.5%         2.1%          11.9%
462.libquantum        55.7%         20.1%         54.5%         13.1%         13.8%         13.4%         13.4%         13.4%         13.2%         12.4%         12.9%
470.lbm               113.2%        49.5%         128.5%        36.3%         29.1%         36.4%         34.3%         31.3%         36.1%         19.4%         25.1%
473.astar             150.9%        35.1%         186.0%        31.5%         26.7%         35.5%         33.2%         30.1%         30.6%         11.4%         21.9%
Average               161.6%        28.5%         166.3%        54.6%         50.3%         49.5%         31.5%         47.4%         50.1%         26.7%         17.8%
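
The derived figures quoted for Table 2 (the ratio between software DFI and RVDFI, the percentage-point reductions of individual enhancements, and the LUT increases) can be rechecked with a few lines of arithmetic. The following is a minimal sketch, assuming the values are copied by hand from the "Average" and "# of LUTs" rows; the dictionary and helper names are illustrative and do not come from the paper.

```python
# Average performance overhead (%) per column, copied from Table 2.
AVG_OVERHEAD = {"1": 166.3, "2": 54.6, "3": 50.3, "4": 49.5, "8": 26.7, "9": 17.8}

# LUT counts per column, copied from Table 2.
# NR1 and 1 run on the unmodified Rocket Chip (without and with MSHR).
LUTS = {"NR1": 50025, "NR2": 62376, "1": 59163, "9": 71059}


def pp_reduction(before: float, after: float) -> float:
    """Reduction in percentage points between two overhead values."""
    return round(before - after, 1)


def percent_increase(new: float, base: float) -> float:
    """Relative increase of `new` over `base`, in percent."""
    return round(100.0 * (new - base) / base, 1)


# Software DFI (Column 1) versus fully enhanced RVDFI (Column 9).
soft_vs_rvdfi = AVG_OVERHEAD["1"] / AVG_OVERHEAD["9"]

# Individual enhancements relative to Basic-RVDFI (Column 2).
fifo_gain = pp_reduction(AVG_OVERHEAD["2"], AVG_OVERHEAD["3"])
ldpr_gain = pp_reduction(AVG_OVERHEAD["2"], AVG_OVERHEAD["4"])

# Larger Dcache (Column 8) versus dedicated caches (Column 9).
dcache_penalty = percent_increase(AVG_OVERHEAD["8"], AVG_OVERHEAD["9"])

# LUT increase with and without MSHR.
lut_mshr = percent_increase(LUTS["9"], LUTS["1"])
lut_nomshr = percent_increase(LUTS["NR2"], LUTS["NR1"])

print(f"software DFI / RVDFI overhead ratio: {soft_vs_rvdfi:.2f}x")
print(f"FIFO gain: {fifo_gain} pp, load pruning gain: {ldpr_gain} pp")
print(f"64KB Dcache overhead relative to RVDFI: +{dcache_penalty}%")
print(f"LUT increase: {lut_mshr}% (MSHR), {lut_nomshr}% (no MSHR)")
```

Keeping percentage points (differences of overheads) separate from relative percentages (ratios of overheads) avoids mixing the two kinds of comparison that the evaluation uses side by side.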

Column 1 incurs 166.3% overhead, which is more than 9× that of RVDFI. The effect of each enhancement is also investigated. The Basic-RVDFI implementation incurs nearly 55% performance loss (Column 2). With the FIFO introduced (Column 3), the overhead of RVDFI-FIFO is reduced by 4.3 percentage points (pp). Besides, pruning the load instructions at runtime in RVDFI-LdPr (Column 4) also reduces the overhead by more than 5pp. When introducing the RDSet cache in RVDFI-RDSet (Column 5), the overhead is greatly reduced to 31.5%, which demonstrates the effectiveness of the RDSet cache. Similarly, the RDSmap cache enhancement in RVDFI-RDSmap (Column 6) and the RDTable cache enhancement in RVDFI-RDTable (Column 7) reduce the overhead to 47.4% and 50.1%, respectively. The results show that each individual enhancement (Columns 3–7) can effectively reduce the performance loss, which results in a low-overhead design when combining them together for a fully enhanced RVDFI (Column 9). We also remove the dedicated caches and simply increase the L1 Dcache size by 48KB (Column 8), which is twice the total size of the dedicated caches (24KB) in the complete RVDFI design (Column 9). Although a larger Dcache is used, the performance overhead is even 50% worse than that of RVDFI, which demonstrates the effectiveness of the proposed dedicated L0 caches. The existence of MSHRs can also affect the experiment setup of the system and have impacts on performance, especially for RVDFI, in which both the core and RoCC access the L1 Dcache. Therefore, we separately conduct the experiments with MSHRs enabled and disabled. The results show that MSHRs do not affect the performance overhead of software-DFI (Column NR1). The difference between Column NR1 and Column 1 is only due to noise, as indicated by the marginal variations across benchmarks. However, MSHR-enabled RVDFI (Column 9) can improve the performance compared to its MSHR-disabled counterpart (RVDFI-NoMSHR in NR2) for all the benchmarks. This is because the non-blocking Dcache with MSHRs can eliminate the penalty of subsequent cache access under miss caused by either the core or RoCC.

Fig. 15. The percentages of the time cost breakdowns of different DFI verification steps. (Legend: Update RDTable, Load RDSmap, Load RDTable, Load RDSet and Check.)

Fig. 16. The cache hit rates of different dedicated caches. (Legend: RDTable Cache, RDSmap Cache, RDSet Cache.)

To analyze the cause of the performance overhead of RVDFI, the DFI verification time of RVDFI with all the enhancements is broken down and illustrated in Fig. 15. As shown, although the breakdown varies across benchmarks, most of the time is spent on accessing the RDTable. This is due to the relatively low hit rate of the RDTable cache as shown in Fig. 16, where the RDTable cache shows the lowest hit rate among the dedicated caches with the same cache size. For some benchmarks, such as 429.mcf, 462.libquantum and 470.lbm, the hit rates of the RDTable cache are lower than 50%. Therefore, RDTable accesses contribute most to the DFI verification time of RVDFI.

Besides, the time cost breakdowns of the setups with individual dedicated caches are shown in Fig. 17. Compared with Basic-RVDFI with no dedicated cache, implementing the RDSet cache inside RVDFI can greatly reduce










[Axis labels for Figs. 15 and 16: x-axis "Benchmark" (401.bzip2, 429.mcf, 433.milc, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 470.lbm, 473.astar, Avg.); y-axes "DFI Time Cost Breakdown (%)" and "Hit Rate (%)".]


the percentage of the time cost on loading RDSet. Besides, implementing the RDSmap or the RDTable cache can also reduce the percentage of access time of RDSmap or RDTable, respectively. However, RDTable access time is the most challenging one to reduce.

Fig. 17. The percentages of the time cost by different DFI verification steps of different setups. (Annotated values visible in the figure: 19.33%, 5.59%.)

6.4 Hardware Resource and Memory Consumption

The hardware resource consumption is also evaluated and listed in Table 2. It shows that Basic-RVDFI needs 7.6% more look-up tables (LUTs) and 15.2% more flip-flops (FFs), compared with the unmodified Rocket Chip. Each enhancement costs at most around 7,000 LUTs and around 11,000 FFs. The final RVDFI implementation (Column 9) consumes 20.1% more LUTs and 26.7% more FFs. Without MSHR, the LUT and FF consumption of RVDFI are 62,376 and 49,853, which are 24.7% and 29.2% more than the original Rocket Chip without MSHR, respectively.

According to Section 4.2, the memory overhead is 50% and 25%, when the data is 4-byte and 8-byte aligned, respectively, for complete DFI. Although the memory overhead is not low, for some security critical applications such as military and finance applications, memory overhead is less critical while security is one of the top priorities. Compared with other complete DFI work [18], [23], RVDFI realizes much higher performance without more memory overhead.

6.5 Binary Size Overhead and RDSets Size Analysis

Since we instrument the target program for DFI enforcement, the size of the executable binary can increase. Fig. 18 shows the binary size overhead of both software-DFI and RVDFI. It shows that the binary size overhead of software-DFI is more than 125% on average while that of RVDFI is negligible, because the instrumentation of RVDFI only adds at most 1 instruction for each memory access (or call, return) instruction, while software-DFI needs many more computations including comparison, addition, shifting, and branching. Fig. 18 also depicts the size of the RDSets (including the RDSmaps) of each benchmark. The average size is only around 200KB and the maximum is around 1400KB.

Fig. 18. The executable binary size overhead and RDSets sizes of different benchmarks.

6.6 Comparison with Previous Hardware-based DFI

In this subsection, we compare RVDFI with well-known previous hardware-based DFI enforcement schemes. The comparison is shown in Table 3, and the details are discussed in the following:

TABLE 3
Comparison with previous hardware-based DFI enforcement.
†The granularity compared with complete DFI.
‡The result is not reported in the corresponding reference.

              DFI Enforcement  Performance  Hardware Resource Consumption  Memory Overhead
Method        Completeness     Overhead     LUT        FF                  4B aligned  8B aligned
HDFI [9]      1/32768†         <2%          -‡         -‡                  3.1%        1.6%
TMDFI [22]    1/256†           ∼39%         -‡         -‡                  25.0%       12.5%
PIM-DFI [23]  Complete         ∼36%         238,333    39,994              50.0%       25.0%
RVDFI         Complete         ∼18%         11,896     11,228              50.0%       25.0%

HDFI [9]: Compared with RVDFI, although the performance overhead of HDFI is lower (<2%) because it uses only a 1-bit tag for each data item, its security strength is much weaker and it can be easily defeated under the attack model discussed in [23]. Since a complete DFI implementation such as RVDFI uses 16 bits to separate the memory regions, the memory overhead of HDFI is 1/16 of that of RVDFI, but the data region protection of RVDFI is 32768× finer-grained than HDFI.

TMDFI [22]: RVDFI is 256× finer-grained than TMDFI, since TMDFI can only separate the memory into 256 regions. Although TMDFI incurs half the memory overhead, it incurs a much higher performance overhead (39%) than RVDFI (18%).

PIM-DFI [23]: For security, both schemes realize complete DFI. However, PIM-DFI has 36.4% performance overhead while RVDFI only has 17.8% overhead. In terms of memory usage, both schemes incur the same overhead as that of software-DFI [18]. PIM-DFI's most significant disadvantage is that it requires either a PIM processor or a normal CPU core, in addition to an extra 238,333 LUTs and 39,994 FFs when implemented on the same platform as RVDFI. In contrast, RVDFI only needs 11,896 LUTs and 11,228 FFs to realize DFI verification, which is roughly 20× fewer LUTs and 3.6× fewer FFs. Therefore, RVDFI is more efficient in both performance and hardware resources than PIM-DFI. Another difference is that PIM-DFI is evaluated using simulation while RVDFI is a real hardware prototype.

6.7 Sensitivity Study

We also studied the detailed effectiveness of each enhancement by varying their sizes. Fig. 19 shows the normalized performance overhead compared to the baseline, which is an unmodified Rocket Chip with MSHR (Column 1 in Table 2). It shows that by adding the enhancements and the hardware resources, performance overhead can be mitigated. Specifically, increasing the RDSet cache size has the most positive impact.
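
The memory-overhead and granularity entries of Table 3 are consistent with a simple tag-width calculation. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes one identifier of `tag_bits` bits is stored per aligned data word, with 16 bits for complete DFI and 1 bit for HDFI as stated in the text, and 8 bits for TMDFI inferred from its 256 regions. The function names are hypothetical.

```python
def memory_overhead_pct(tag_bits: int, word_bytes: int) -> float:
    """Tag storage as a percentage of the data it protects.

    Assumes one tag of `tag_bits` bits per aligned word of `word_bytes` bytes.
    """
    return 100.0 * tag_bits / (8 * word_bytes)


def granularity_ratio(tag_bits: int, complete_bits: int = 16) -> int:
    """How many times more regions a complete-DFI tag can distinguish."""
    return 2 ** complete_bits // 2 ** tag_bits


# Tag widths: 16 and 1 are stated in the text; 8 is inferred from "256 regions".
TAG_BITS = {"HDFI": 1, "TMDFI": 8, "RVDFI": 16}

for name, bits in TAG_BITS.items():
    overhead_4b = memory_overhead_pct(bits, 4)
    overhead_8b = memory_overhead_pct(bits, 8)
    ratio = granularity_ratio(bits)
    print(f"{name}: {overhead_4b:.1f}% (4B aligned), "
          f"{overhead_8b:.1f}% (8B aligned), "
          f"complete DFI is {ratio}x finer-grained")
```

Under these assumptions a wider tag buys finer region separation at a proportional memory cost, which matches the trade-off between HDFI, TMDFI, and the complete DFI schemes in the table.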


Fig. 19. The effectiveness of the enhancements.

Fig. 20. The load pruning buffer hit rates under different buffer sizes. (Legend: 2 Entries, 16 Entries, 64 Entries, 128 Entries; y-axis: Load Pruning Buf Hit Rate (%), 0 to 35; x-axis: 401.bzip2, 429.mcf, 433.milc, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 470.lbm, 473.astar, Avg.)

For dynamic redundant load pruning, we use “hit rate” to represent the ratio of the number of the redundant DFI-requests identified by the load pruning buffer over the total number of Inmet Ld DFI-requests. The pruning hit rate increases as the load pruning buffer size increases, as shown in Fig. 20. After the point where the load pruning buffer has 64 entries, the hit rate only increases marginally. Therefore, the load pruning buffer is implemented with 64 entries in the main RVDFI results (Column 9 of Table 2). Besides, the hit rates vary from benchmark to benchmark. Although the average hit rate is around 6%, the effectiveness of the load pruning buffer is relatively higher for some benchmarks, such as 433.milc and 401.bzip2. Therefore, the dynamic redundant load pruning can increase the chance to reduce the performance overhead for certain programs.

7 CONCLUSIONS

In this paper, a secure RISC-V architecture named RVDFI is proposed, which enables hardware-assisted complete DFI verification through specialized DFI verification architecture design. The system stacks consisting of compilation, customized instruction instrumentation, and operating system are augmented to enable a secure DFI-capable RISC-V SoC. In addition, several enhancements are proposed to improve the security and reduce the performance overhead, including the DFI request FIFO, the load pruning buffer, the dedicated DFI caches, etc. The evaluation shows that RVDFI not only realizes complete DFI that can detect both control-data and non-control-data attacks, but also is practical due to its low performance overhead. In summary, RVDFI is the first RISC-V architecture with complete DFI verification that incurs only 17.8% performance overhead.

REFERENCES

[1] Verilog to Routing, https://verilogtorouting.org/, 2012.
[2] Icarus Verilog, http://iverilog.icarus.com/, 1998.
[3] H. Genc, A. Haj-Ali, V. Iyer, A. Amid, H. Mao, J. Wright, C. Schmidt, J. Zhao, A. Ou, M. Banister, Y. S. Shao, B. Nikolic, I. Stoica, and K. Asanovic, “Gemmini: An Agile Systolic Array Generator Enabling Systematic Evaluations of Deep-Learning Architectures,” arXiv preprint, 2019.
[4] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, “From High-Level Deep Neural Models to FPGAs,” IEEE/ACM International Symposium on Microarchitecture, pp. 1–12, 2016.
[5] RISC-V: The Free and Open RISC Instruction Set Architecture, https://riscv.org/, 2010.
[6] NVDLA, http://nvdla.org/, 2018.
[7] K. Asanović, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz, S. Karandikar, B. Keller, D. Kim, J. Koenig, Y. Lee, E. Love, M. Maas, A. Magyar, H. Mao, M. Moreto, A. Ou, D. A. Patterson, B. Richards, C. Schmidt, S. Twigg, H. Vo, and A. Waterman, “The Rocket Chip Generator,” EECS Department, University of California, Berkeley, Tech. Rep., 2016.
[8] HammerBlade RISC-V Manycore, https://riscv.org/news/2020/07/the-hammerblade-risc-v-manycore-a-programmable-scalable-risc-v-fabric-michael-taylor-and-max-h-ruttenberg-fosdem/, 2020.
[9] C. Song, H. Moon, M. Alam, I. Yun, B. Lee, T. Kim, W. Lee, and Y. Paek, “HDFI: Hardware-Assisted Data-Flow Isolation,” IEEE Symposium on Security and Privacy, pp. 1–17, 2016.
[10] RISC-V in NVIDIA, https://riscv.org/wp-content/uploads/2017/05/Tue1345pm-NVIDIA-Sijstermans.pdf, 2017.
[11] T. Fritzmann, G. Sigl, and J. Sepúlveda, “RISQ-V: Tightly Coupled RISC-V Accelerators for Post-Quantum Cryptography,” IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2020, no. 4, pp. 239–280, 2020.
[12] A. Garofalo, G. Tagliavini, F. Conti, D. Rossi, and L. Benini, “XpulpNN: Accelerating Quantized Neural Networks on RISC-V Processors Through ISA Extensions,” Design, Automation and Test in Europe Conference, pp. 186–191, 2020.
[13] Y. Zhang, B. Du, L. Zhang, and J. Wu, “Parallel DNN Inference Framework Leveraging a Compact RISC-V ISA-Based Multi-Core System,” ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 627–635, 2020.
[14] S. Davidson, S. Xie, C. Torng, K. Al-Hawai, A. Rovinski, T. Ajayi, L. Vega, C. Zhao, R. Zhao, S. Dai, A. Amarnath, B. Veluri, P. Gao, A. Rao, G. Liu, R. K. Gupta, Z. Zhang, R. Dreslinski, C. Batten, and M. B. Taylor, “The Celerity Open-Source 511-Core RISC-V Tiered Accelerator Fabric: Fast Architectures and Design Methodologies for Fast Chips,” IEEE Micro, vol. 38, no. 2, pp. 30–41, 2018.
[15] Intel CET, https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf, 2019.
[16] AMD Secure Encrypted Virtualization, https://www.amd.com/en/processors/amd-secure-encrypted-virtualization, 2020.
[17] CoreSight Program Flow Trace, http://infocenter.arm.com/help/topic/com.arm.doc.ihi0035b/IHI0035B cs pft v1 1 architecture spec.pdf, 2011.
[18] M. Castro, M. Costa, and T. Harris, “Securing Software by Enforcing Data-Flow Integrity,” Symposium on Operating Systems Design and Implementation, pp. 147–160, 2006.
[19] J. Wilander, N. Nikiforakis, Y. Younan, M. Kamkar, and W. Joosen, “RIPE: Runtime Intrusion Prevention Evaluator,” Computer Security Applications Conference, pp. 41–50, 2011.
[20] Null HTTPd Remote Heap Overflow Vulnerability, https://www.securityfocus.com/bid/5774.
[21] The Heartbleed Bug, http://heartbleed.com/.
[22] T. Liu, G. Shi, L. Chen, F. Zhang, Y. Yang, and J. Zhang, “TMDFI: Tagged Memory Assisted for Fine-Grained Data-Flow Integrity Towards Embedded Systems Against Software Exploitation,” IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ IEEE International Conference On Big Data Science And Engineering, pp. 545–550, 2018.
[23] L. Feng, J. Huang, J. Huang, and J. Hu, “Toward Taming the Overhead Monster for Data-Flow Integrity,” arXiv preprint, 2021.
[24] SPEC CPU 2006 Benchmark, https://www.spec.org/cpu2006/.
[25] M. Abadi, M. Budiu, U. Erlingsson, and J. Ligatti, “Control-flow Integrity,” ACM Conference on Computer and Communications Security, pp. 340–353, 2005.
[26] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ACM SIGARCH Computer Architecture News, vol. 42, no. 3, pp. 361–372, 2014.
[27] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, “Spectre Attacks: Exploiting Speculative Execution,” IEEE Symposium on Security and Privacy, pp. 1–19, 2019.
[28] Y. Park, W. Kwon, E. Lee, T. J. Ham, J. H. Ahn, and J. W. Lee, “Graphene: Strong yet Lightweight Row Hammer Protection,” IEEE/ACM International Symposium on Microarchitecture, pp. 1–13, 2020.
[29] M. Yan, J. Choi, D. Skarlatos, A. Morrison, C. Fletcher, and J. Torrellas, “InvisiSpec: Making Speculative Execution Invisible in the Cache Hierarchy,” IEEE/ACM International Symposium on Microarchitecture, pp. 428–441, 2018.
[30] D. Petrisko, F. Gilani, M. Wyse, D. C. Jung, S. Davidson, P. Gao, C. Zhao, Z. Azad, S. Canakci, B. Veluri, T. Guarino, A. Joshi, M. Oskin, and M. B. Taylor, “BlackParrot: An Agile Open-Source RISC-V Multicore for Accelerator SoCs,” IEEE Micro, vol. 40, no. 4, pp. 93–102, 2020.
[31] RISCV-DV, https://github.com/google/riscv-dv, 2019.
[32] P. Akritidis, C. Cadar, C. Raiciu, M. Costa, and M. Castro, “Preventing Memory Error Exploits with WIT,” IEEE Symposium on Security and Privacy, pp. 263–277, 2008.
[33] C. Song, B. Lee, K. Lu, W. R. Harris, T. Kim, and W. Lee, “Enforcing Kernel Security Invariants with Data Flow Integrity,” Network and Distributed System Security Symposium, pp. 1–15, 2016.
[34] X. Ge, W. Cui, and T. Jaeger, “GRIFFIN: Guarding Control Flows Using Intel Processor Trace,” ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 585–598, 2017.
[35] Y. Lee, J. Lee, I. Heo, D. Hwang, and Y. Paek, “Using CoreSight PTM to Integrate CRA Monitoring IPs in an ARM-Based SoC,” ACM Transactions on Design Automation of Electronic Systems, vol. 22, no. 3, pp. 52:1–52:25, 2017.
[36] R. N. M. Watson, J. Woodruff, P. G. Neumann, S. W. Moore, J. Anderson, D. Chisnall, N. Dave, B. Davis, K. Gudka, B. Laurie, S. J. Murdoch, R. Norton, M. Roe, S. Son, and M. Vadera, “CHERI: A Hybrid Capability-System Architecture for Scalable Software Compartmentalization,” IEEE Symposium on Security and Privacy, pp. 20–37, 2015.
[37] T. Zhang, D. Lee, and C. Jung, “BOGO: Buy Spatial Memory Safety, Get Temporal Memory Safety (Almost) Free,” International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 631–644, 2019.
[38] Y. Kim, J. Lee, and H. Kim, “Hardware-based always-on heap memory safety,” IEEE/ACM International Symposium on Microarchitecture, pp. 1153–1166, 2020.
[39] L. Delshadtehrani, S. Canakci, B. Zhou, S. Eldridge, A. Joshi, and M. Egele, “PHMon: A Programmable Hardware Monitor and Its Security Use Cases,” USENIX Security Symposium, pp. 807–824, 2020.
[40] N. Joly, S. ElSherei, and S. Amar, “Security Analysis of CHERI ISA,” https://github.com/microsoft/MSRC-Security-Research/blob/master/papers/2020/Security%20analysis%20of%20CHERI%20ISA.pdf, 2020.
[41] LLVM, https://llvm.org/.
[42] Y. Sui and J. Xue, “SVF: Interprocedural Static Value-flow Analysis in LLVM,” International Conference on Compiler Construction, pp. 265–266, 2016.
[43] H. Shacham, “The Geometry of Innocent Flesh on the Bone: Return-into-libc without Function Calls (on the x86),” ACM Conference on Computer and Communications Security, pp. 552–561, 2007.
[44] Freedom, https://github.com/sifive/freedom, 2016.
[45] The Source Code for Triggering Heartbleed Bug, https://github.com/mykter/afl-training/tree/master/challenges/heartbleed.

Lang Feng received his B.E. degree in electronic science and technology (microelectronic technology) from University of Electronic Science and Technology of China, Chengdu, China, in 2016, and his Ph.D. degree in computer engineering from Texas A&M University, College Station, in 2020. In Nov. 2020, he joined the School of Electronic Science and Engineering of Nanjing University, where he is an associate research fellow. His research interests are computer architecture, security, etc.

Jiayi Huang (Member, IEEE) received the BEng degree in information and communication engineering from Zhejiang University, China, in 2014, and the PhD degree in computer engineering from Texas A&M University, in 2020. He is currently a postdoctoral researcher with the Department of Electrical and Computer Engineering, UC Santa Barbara. His research interests include computer architecture, computer systems, and security. He is a member of the ACM and the IEEE Computer Society.

Luyi Li is currently working towards the B.E. degree with integrated circuit design and integrated system from Nanjing University, Nanjing, China. His research interests focus on hardware acceleration, computer architecture, security, etc.

Haochen Zhang is currently working towards the B.E. degree with integrated circuit design and integrated system from Nanjing University, Nanjing, China. His research interests are computer architecture, security, etc.

Zhongfeng Wang (Fellow, IEEE) received both B.E. and M.S. degrees from Tsinghua University. He obtained the Ph.D. degree from the University of Minnesota, Minneapolis, in 2000. He has been working for Nanjing University, China, as a Distinguished Professor since 2016. Previously he worked for Broadcom Corporation, California, from 2007 to 2016 as a leading VLSI architect. Before that, he worked for Oregon State University and National Semiconductor Corporation. Dr. Wang is a world-recognized expert on Low-Power High-Speed VLSI Design for Signal Processing Systems. He has published over 200 technical papers with multiple best paper awards received from the IEEE technical societies. In the current record, he has had many papers ranking among top 25 most (annually) downloaded manuscripts in IEEE Trans. on VLSI Systems. In the past, he has served as Associate Editor for IEEE Trans. on TCAS-I, T-CAS-II, and T-VLSI for many terms.