Benchmarks
Benchmarks are evaluation infrastructures or workloads used to assess systems and enable systematic comparison. In hardware and CPU verification contexts, benchmarks can be part of a performance verification plan: they supply the workload patterns used to measure aspects of processor performance and to identify bottlenecks.
Role in CPU verification
In a verification plan for a CPU, benchmarks are distinct from architecture-compliance and microarchitecture test plans. The cited RISC-V verification thesis describes performance verification as a plan focused on performance aspects and bottlenecks, capturing the benchmark patterns used to measure a CPU. Examples named in that context include SPECint, lmbench, and Dhrystone.
The same thesis presents a UVM-based verification infrastructure for a RISC-V core. In that infrastructure, benchmarks are used together with open-source RISC-V toolchain components, the RISC-V compliance test suite, a random instruction generator, and direct tests. The thesis states that these tests and benchmarks are intended to evaluate the functional performance and correctness of the RISC-V core under multiple scenarios and conditions.
Examples in the RISC-V thesis
The thesis table of contents places Benchmarks in the experimental evaluation chapter, with subsections for Dhrystone and CoreMark. This supports treating Dhrystone and CoreMark as benchmark examples in that work.
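How a raw run of Dhrystone or CoreMark becomes a comparable figure can be sketched as follows. This is a minimal illustration, not the thesis's method: how the thesis reports its scores is not stated here, so the sketch assumes the widely used conventions of DMIPS/MHz for Dhrystone (Dhrystones per second divided by 1757, the VAX 11/780 reference score, then by clock frequency) and CoreMark/MHz (iterations per second divided by clock frequency). The function names and input values are hypothetical.

```python
# Conventional reference score: a VAX 11/780 is taken to run 1757 Dhrystones/s.
VAX_11_780_DHRYSTONES_PER_SECOND = 1757.0


def iterations_per_second(iterations: int, elapsed_seconds: float) -> float:
    """Raw throughput of a benchmark loop (hypothetical helper)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return iterations / elapsed_seconds


def dmips_per_mhz(iterations: int, elapsed_seconds: float, clock_mhz: float) -> float:
    """Dhrystone score normalized to the VAX reference and to clock frequency."""
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    dhrystones = iterations_per_second(iterations, elapsed_seconds)
    return dhrystones / VAX_11_780_DHRYSTONES_PER_SECOND / clock_mhz


def coremark_per_mhz(iterations: int, elapsed_seconds: float, clock_mhz: float) -> float:
    """CoreMark score (iterations per second) normalized to clock frequency."""
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    return iterations_per_second(iterations, elapsed_seconds) / clock_mhz


if __name__ == "__main__":
    # Made-up inputs for illustration only; they are not measurements from any core.
    print(dmips_per_mhz(iterations=1_757_000, elapsed_seconds=10.0, clock_mhz=100.0))
    print(coremark_per_mhz(iterations=25_000, elapsed_seconds=10.0, clock_mhz=100.0))
```

Normalizing by clock frequency separates the core's per-cycle efficiency from the clock rate of a particular simulation or implementation, which is usually what matters when comparing core designs.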
Reliability and quality considerations
Recent benchmark-focused research in LLM safety reports that benchmark repositories can have runnability and documentation problems: one study of 31 LLM safety benchmarks found that only 39% of benchmark repositories ran without modification and only 16% provided flawless installation guides. The same study argues that ad-hoc modifications needed to run benchmarks can make downstream evaluations less comparable.
Another line of recent work frames complex benchmarks themselves as objects that may require auditing. BenchGuard proposes using frontier LLMs to audit task-oriented, execution-based agent benchmarks and reports author-confirmed benchmark issues, including fatal errors that made tasks unsolvable in one audited benchmark. Together, these results reinforce that benchmark results depend not only on the system being tested but also on the quality, documentation, and correctness of the benchmark artifacts.