STIMSMITH

Benchmarks

Concept WIKI v1 · 5/27/2026

Benchmarks are evaluation artifacts used to measure and compare systems. In the CPU verification sources cited here, they appear as performance-oriented patterns and as one part of a RISC-V core verification strategy, alongside compliance tests, direct tests, random instruction generation, and simulation-based checking. Broader benchmark research also shows that a benchmark's quality, runnability, documentation, and auditability affect whether its results are reliable and comparable.

Benchmarks are evaluation infrastructures or workloads used to assess systems and enable systematic comparison. In hardware and CPU verification contexts, benchmarks can be part of a performance verification plan: they provide patterns used to measure processor performance aspects and identify bottlenecks.

Role in CPU verification

In a CPU verification plan, benchmarks belong to the performance verification plan, which is distinct from the architecture-compliance and microarchitecture test plans. The cited RISC-V verification thesis describes performance verification as a plan that focuses on performance aspects and bottlenecks and captures the patterns or benchmarks used to measure the CPU. Examples named in that context include SPECint, lmbench, and Dhrystone.

The same thesis presents a UVM-based verification infrastructure for a RISC-V core. In that infrastructure, benchmarks are used together with open-source RISC-V toolchain components, the RISC-V compliance test suite, a random instruction generator, and direct tests. The thesis states that these tests and benchmarks are intended to evaluate the functional performance and correctness of the RISC-V core under multiple scenarios and conditions.

Examples in the RISC-V thesis

The thesis table of contents places a Benchmarks section in the experimental evaluation chapter, with subsections for Dhrystone and CoreMark. This supports treating Dhrystone and CoreMark as the benchmark examples used in that work.
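As a rough illustration of how Dhrystone and CoreMark runs are commonly turned into comparable numbers, the sketch below normalizes a run by clock frequency. It is not taken from the thesis. The `BenchmarkRun` record, its field names, and the sample values are hypothetical. The only external fact assumed is the conventional Dhrystone baseline of 1757 Dhrystones per second for 1 DMIPS.

```python
from dataclasses import dataclass

# Conventional baseline: 1757 Dhrystones per second corresponds to 1 DMIPS.
DHRYSTONES_PER_SECOND_PER_DMIPS = 1757.0


@dataclass(frozen=True)
class BenchmarkRun:
    """Hypothetical record of one benchmark run (names are illustrative)."""
    iterations: int    # benchmark loop iterations completed
    cycles: int        # core clock cycles elapsed during those iterations
    clock_hz: float    # clock frequency assumed for the core under test


def iterations_per_second(run: BenchmarkRun) -> float:
    """Convert a cycle count into iterations per second at the given clock."""
    if run.cycles <= 0 or run.clock_hz <= 0:
        raise ValueError("cycles and clock_hz must be positive")
    seconds = run.cycles / run.clock_hz
    return run.iterations / seconds


def dmips_per_mhz(run: BenchmarkRun) -> float:
    """Dhrystone score normalized by the baseline and by clock frequency."""
    dmips = iterations_per_second(run) / DHRYSTONES_PER_SECOND_PER_DMIPS
    return dmips / (run.clock_hz / 1e6)


def coremark_per_mhz(run: BenchmarkRun) -> float:
    """CoreMark score (iterations per second) normalized by clock frequency."""
    return iterations_per_second(run) / (run.clock_hz / 1e6)


# Made-up sample values, used only to exercise the arithmetic.
dhrystone_run = BenchmarkRun(iterations=1000, cycles=500_000, clock_hz=100e6)
coremark_run = BenchmarkRun(iterations=10, cycles=4_000_000, clock_hz=100e6)

print(f"DMIPS/MHz:    {dmips_per_mhz(dhrystone_run):.3f}")
print(f"CoreMark/MHz: {coremark_per_mhz(coremark_run):.3f}")
```

Because both scores are divided by the clock frequency, they depend only on iterations per cycle, which is why per-MHz figures are often preferred when the measurements come from simulation rather than silicon.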

Reliability and quality considerations

Recent benchmark-focused research in LLM safety reports that benchmark repositories can have runnability and documentation problems: one study of 31 LLM safety benchmarks found that only 39% of benchmark repositories ran without modification and only 16% provided flawless installation guides. The same study argues that ad-hoc modifications needed to run benchmarks can make downstream evaluations less comparable.

Another line of recent work frames complex benchmarks themselves as objects that may require auditing. BenchGuard proposes using frontier LLMs to audit task-oriented, execution-based agent benchmarks and reports author-confirmed benchmark issues, including fatal errors that made tasks unsolvable in one audited benchmark. Together, these results reinforce that benchmark results depend not only on the system being tested but also on the quality, documentation, and correctness of the benchmark artifacts.

CITATIONS

[1] Benchmarks are evaluation infrastructures used to identify trends and support systematic comparisons. Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks
[2] CPU performance verification plans can capture patterns or benchmarks used to measure processor performance aspects and bottlenecks, with examples including SPECint, lmbench, and Dhrystone. UVM based design verification of a RISC-V CPU core - POLITesi
[3] The RISC-V verification thesis uses benchmarks together with direct tests, a random instruction generator, RISC-V toolchain components, and the RISC-V compliance test suite to evaluate functional performance and correctness under different scenarios. UVM based design verification of a RISC-V CPU core - POLITesi
[4] The thesis experimental evaluation chapter contains a Benchmarks section with Dhrystone and CoreMark subsections. UVM based design verification of a RISC-V CPU core - POLITesi
[5] A 2026 study of 31 LLM safety benchmarks found that only 39% of benchmark repositories ran without modification, only 16% had flawless installation guides, and ad-hoc modifications can reduce comparability of downstream evaluations. Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks
[6] BenchGuard proposes automated auditing of task-oriented, execution-based agent benchmarks and found author-confirmed issues, including fatal errors that made some benchmark tasks unsolvable. BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks