STIMSMITH

Coverage-Guided Fuzzing

Concept WIKI v11 · 6/14/2026

Coverage-guided fuzzing (CGF) is a fuzz-testing technique that uses coverage feedback to steer test generation toward program behavior that has not yet been exercised. The cited sources describe CGF as effective and credit it with finding many bugs, but they also note that maximizing code coverage alone does not necessarily maximize fault detection. Applications covered here include mutation-score-guided CGF, coverage-guided testing of LLM-based multi-agent systems, and hardware fuzzing workflows such as GoldenFuzz, which evaluates condition, line, and FSM coverage on RISC-V cores.

Coverage-guided fuzzing (CGF) is a fuzz-testing technique in which coverage feedback guides the generation or selection of test inputs. The cited study describes CGF as an effective technique that has detected many bugs in software applications, and as one that focuses on maximizing code coverage in order to reveal more bugs during fuzzing.[C1]
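
The feedback loop described above can be illustrated with a minimal sketch. It is not taken from any of the cited tools: the toy target, the branch labels, the single-byte mutator, and the function names are all assumptions made for illustration. An input is kept in the corpus only when it reaches a branch that no earlier input reached.

```python
import random

def target(data: bytes, trace: set) -> None:
    """Toy program under test (hypothetical); records the branches it executes."""
    trace.add("entry")
    if data[:1] == b"F":
        trace.add("b1")
        if data[1:2] == b"U":
            trace.add("b2")
            if data[2:3] == b"Z":
                trace.add("b3")

def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one random byte, or append one byte if the input is empty."""
    buf = bytearray(data)
    if buf:
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    else:
        buf.append(rng.randrange(256))
    return bytes(buf)

def fuzz(seeds, iterations=100_000, seed=0):
    """Run the coverage-guided loop; seeds must be a non-empty list of bytes."""
    rng = random.Random(seed)
    corpus = list(seeds)
    covered = set()
    for s in corpus:
        target(s, covered)              # coverage of the initial seeds
    for _ in range(iterations):
        child = mutate(rng.choice(corpus), rng)
        trace = set()
        target(child, trace)
        if not trace <= covered:        # feedback: the child reached a new branch
            covered |= trace
            corpus.append(child)        # keep it as a parent for later mutations
    return corpus, covered
```

Calling fuzz([b"AAA"]) grows the corpus one branch at a time. Without the feedback check, reaching the innermost branch would require guessing all three bytes in a single mutation chain that is never saved.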

A central limitation is that higher coverage does not necessarily imply better fault-detection capability. The mutation-testing study explains that triggering a bug requires not only exercising a specific program path but also reaching an interesting program state along that path.[C1]
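
A small sketch makes the gap between path coverage and program state concrete. The function remaining_share and the helper lines_executed are hypothetical names introduced here; the helper uses Python's sys.settrace to record which lines of the function run. Two inputs execute exactly the same lines, yet only one of them reaches the state that fails.

```python
import sys

def remaining_share(total: int, percent_used: int) -> int:
    remaining = 100 - percent_used
    return total * 100 // remaining     # fails only when percent_used == 100

def lines_executed(func, *args):
    """Return (line offsets executed inside func, exception raised or None)."""
    code = func.__code__
    lines = set()

    def tracer(frame, event, arg):
        if frame.f_code is code and event == "line":
            lines.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    error = None
    try:
        func(*args)
    except Exception as exc:
        error = exc
    finally:
        sys.settrace(previous)          # restore whatever tracer was active
    return lines, error

ok_lines, ok_error = lines_executed(remaining_share, 200, 50)
bad_lines, bad_error = lines_executed(remaining_share, 200, 100)
```

Here ok_lines and bad_lines are equal, so a line-coverage signal cannot tell the two inputs apart, while bad_error holds a ZeroDivisionError and ok_error is None.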

Feedback beyond raw code coverage

Because code coverage alone can be an imperfect proxy for bug discovery, one line of work augments CGF with mutation testing. The paper "Investigating Coverage Guided Fuzzing with Mutation Testing" proposes using mutation scores as feedback, so that fuzzing is guided toward detecting bugs rather than only covering code. In its evaluation, the authors use Zest as the baseline, build two modified techniques on top of it, and report improvements in both code coverage and bug detection across five benchmarks.[C1]
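
The idea of mutation-score feedback can be sketched as follows. This is not the paper's algorithm or its Zest-based implementation: the function under test, the three hand-written mutants, and the selection rule are assumptions chosen to keep the example small. An input "kills" a mutant when the mutant's output differs from the original's, and an input is kept only if it kills a mutant that no previously kept input killed.

```python
def is_adult(age: int) -> bool:
    """Original function under test (hypothetical)."""
    return age >= 18

# Hand-written mutants standing in for tool-generated ones (hypothetical).
MUTANTS = {
    "ge_to_gt": lambda age: age > 18,
    "ge_to_le": lambda age: age <= 18,
    "always_true": lambda age: True,
}

def killed_by(value, mutants=MUTANTS, original=is_adult):
    """Names of the mutants whose output differs from the original on value."""
    return {name for name, mutant in mutants.items()
            if mutant(value) != original(value)}

def mutation_score(inputs, mutants=MUTANTS, original=is_adult):
    """Fraction of mutants killed by at least one input."""
    killed = set()
    for value in inputs:
        killed |= killed_by(value, mutants, original)
    return len(killed) / len(mutants)

def select_by_mutation_feedback(candidates, mutants=MUTANTS, original=is_adult):
    """Keep a candidate only if it kills a mutant no kept input has killed."""
    corpus, killed = [], set()
    for value in candidates:
        new_kills = killed_by(value, mutants, original) - killed
        if new_kills:
            corpus.append(value)
            killed |= new_kills
    return corpus, killed
```

With candidates [30, 40, 5, 18], the input 40 is discarded because it kills nothing new, while the boundary value 18 is kept even though it runs the same code as 30: it is the only input that kills the ge_to_gt mutant.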

Coverage-guided fuzzing for LLM-based multi-agent systems

FLARE extends CGF to LLM-based multi-agent systems (MAS). It takes MAS source code as input, extracts specifications and behavioral spaces from the agent definitions, builds test oracles, and conducts coverage-guided fuzzing to expose failures. It then analyzes execution logs to determine whether each test passes and to generate failure reports.[C2]

In the reported evaluation on 16 open-source MAS applications, FLARE achieved 96.9% inter-agent coverage and 91.1% intra-agent coverage, outperforming baselines by 9.5% and 1.0%, respectively, and uncovered 56 previously unknown MAS-specific failures.[C2]
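
To give a feel for what inter-agent and intra-agent coverage could measure, the sketch below treats both as simple set ratios. This is an assumption for illustration only: the source does not give FLARE's definitions, and the log record format, agent names, and function names here are hypothetical. Inter-agent coverage is taken as the share of declared agent-to-agent handoffs seen in a log, and intra-agent coverage as the share of declared per-agent behaviors seen.

```python
def coverage_ratio(observed, declared):
    """Share of declared items that were observed (0.0 if nothing is declared)."""
    declared = set(declared)
    if not declared:
        return 0.0
    return len(set(observed) & declared) / len(declared)

def agent_coverage(log, declared_edges, declared_behaviors):
    """log: iterable of (sender, receiver, behavior) records (hypothetical format)."""
    records = list(log)
    seen_edges = {(sender, receiver) for sender, receiver, _ in records}
    seen_behaviors = {(sender, behavior) for sender, _, behavior in records}
    return {
        "inter_agent": coverage_ratio(seen_edges, declared_edges),
        "intra_agent": coverage_ratio(seen_behaviors, declared_behaviors),
    }

# Hypothetical three-agent system and one execution log.
EDGES = {("planner", "coder"), ("coder", "reviewer"),
         ("reviewer", "coder"), ("reviewer", "planner")}
BEHAVIORS = {("planner", "decompose"), ("planner", "replan"),
             ("coder", "write"), ("reviewer", "reject"), ("reviewer", "approve")}
LOG = [("planner", "coder", "decompose"),
       ("coder", "reviewer", "write"),
       ("reviewer", "coder", "reject")]

report = agent_coverage(LOG, EDGES, BEHAVIORS)
```

Under these assumed definitions, a fuzzer would prefer inputs whose logs add a previously unseen handoff or behavior, in the same way a conventional fuzzer prefers inputs that reach a new branch.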

Hardware fuzzing example: GoldenFuzz

GoldenFuzz illustrates CGF-style coverage evaluation in a RISC-V hardware-fuzzing setting. It evaluates condition coverage, line coverage, and FSM coverage across three RISC-V cores: RocketChip, BOOM, and CVA6. For DifuzzRTL and TheHuzz, the cited excerpt states that the analysis is limited to condition coverage on RocketChip due to page limits.[C3]

GoldenFuzz uses the Spike simulator as a golden reference model during profiling, while the device-under-test implementations include RocketChip, BOOM, and CVA6. Its vulnerability-detection workflow combines Synopsys VCS hardware simulation traces with Spike reference traces. VCS records register updates and memory operations at instruction boundaries, while Spike produces expected register and memory states for RISC-V binaries.[C3]

GoldenFuzz identifies potential bugs or vulnerabilities by comparing VCS hardware traces against Spike execution traces. Its mismatch detector checks discrepancies in register values, memory addresses, and memory contents; any mismatch is flagged for manual confirmation by the user.[C3]
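
The comparison step can be sketched as a differential check over two instruction-level traces. The Step record and the find_mismatches function below are hypothetical and do not reflect the actual VCS or Spike log formats or the GoldenFuzz implementation; the trace-length check is also an addition not mentioned in the source. The sketch only shows the three kinds of discrepancy named above: register values, memory addresses, and memory contents.

```python
from dataclasses import dataclass, field
from itertools import zip_longest
from typing import Optional

@dataclass
class Step:
    """State recorded at one instruction boundary (hypothetical format)."""
    pc: int                                    # kept for reporting, not compared
    regs: dict = field(default_factory=dict)   # register name -> value written
    mem_addr: Optional[int] = None             # address of the memory access, if any
    mem_data: Optional[int] = None             # data read or written, if any

def find_mismatches(dut_trace, ref_trace):
    """Return (step index, kind, DUT value, reference value) for each discrepancy."""
    mismatches = []
    for index, (dut, ref) in enumerate(zip_longest(dut_trace, ref_trace)):
        if dut is None or ref is None:
            # Assumption: traces of different length are reported once, then we stop.
            mismatches.append((index, "trace length", len(dut_trace), len(ref_trace)))
            break
        for reg in sorted(set(dut.regs) | set(ref.regs)):
            if dut.regs.get(reg) != ref.regs.get(reg):
                mismatches.append((index, f"register {reg}",
                                   dut.regs.get(reg), ref.regs.get(reg)))
        if dut.mem_addr != ref.mem_addr:
            mismatches.append((index, "memory address", dut.mem_addr, ref.mem_addr))
        elif dut.mem_data != ref.mem_data:
            mismatches.append((index, "memory content", dut.mem_data, ref.mem_data))
    return mismatches
```

Each returned tuple is a candidate for the manual confirmation step: a mismatch indicates that the design under test and the reference model disagree, not that the design is necessarily the one at fault.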

Key takeaway

Taken together, the cited sources support a view of CGF as a broadly applicable, feedback-driven testing approach. They also show that the choice of feedback metric matters: raw coverage helps exploration, but finding bugs may require additional signals such as mutation scores, agent-interaction coverage, hardware condition/line/FSM coverage, or reference-model mismatches.[C1][C2][C3]

CITATIONS

[1] Coverage-guided fuzzing is an effective testing technique that focuses on maximizing code coverage to reveal bugs, but higher coverage does not necessarily imply better fault detection because bug triggering also depends on reaching relevant program states. Source: "Investigating Coverage Guided Fuzzing with Mutation Testing".
[2] Mutation-testing-enhanced CGF uses mutation scores as feedback; the cited work used Zest as a baseline, built two modified variants, evaluated on five benchmarks, and reported improvements in code coverage and bug detection. Source: "Investigating Coverage Guided Fuzzing with Mutation Testing".
[3] FLARE applies coverage-guided fuzzing to LLM-based multi-agent systems by extracting specifications and behavioral spaces from source code, building test oracles, fuzzing for failures, analyzing execution logs, and reporting failures. Source: "FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems".
[4] FLARE reported 96.9% inter-agent coverage, 91.1% intra-agent coverage, baseline improvements of 9.5% and 1.0%, and 56 previously unknown failures on 16 open-source applications. Source: "FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems".
[5] GoldenFuzz measures condition, line, and FSM coverage across RocketChip, BOOM, and CVA6; in the cited excerpt, the analysis of DifuzzRTL and TheHuzz is limited to condition coverage on RocketChip. Source: "GoldenFuzz: Generative Golden Reference Hardware Fuzzing".
[6] GoldenFuzz uses Spike as the golden reference model during profiling and combines Synopsys VCS hardware simulation traces with Spike reference traces for vulnerability detection on RocketChip, BOOM, and CVA6. Source: "GoldenFuzz: Generative Golden Reference Hardware Fuzzing".
[7] GoldenFuzz flags potential bugs or vulnerabilities by comparing VCS and Spike traces for discrepancies in register values, memory addresses, and memory contents, with manual confirmation by the user. Source: "GoldenFuzz: Generative Golden Reference Hardware Fuzzing".

VERSION HISTORY

v11 · 6/14/2026 · gpt-5.5 (current)
v10 · 6/11/2026 · minimax/minimax-m3
v9 · 6/11/2026 · minimax/minimax-m3
v8 · 6/11/2026 · minimax/minimax-m3
v7 · 6/10/2026 · minimax/minimax-m3
v6 · 6/8/2026 · minimax/minimax-m3
v5 · 6/3/2026 · minimax/minimax-m3
v4 · 5/29/2026 · gpt-5.5
v3 · 5/28/2026 · gpt-5.5
v2 · 5/28/2026 · gpt-5.5
v1 · 5/24/2026 · gpt-5.5