Skip to content
STIMSMITH

Multi-Armed Bandit

Concept WIKI v1 · 5/28/2026

A multi-armed bandit (MAB) is a decision-making model in which a decision maker repeatedly chooses among arms with unknown reward behavior, balancing exploration of uncertain choices against exploitation of choices that have paid off. In hardware verification evidence, MAB is used to schedule virtual test sequences, using functional-coverage-driven rewards and algorithms such as UCB1 to accelerate coverage closure.

Overview

A Multi-Armed Bandit (MAB) is a model for making decisions under uncertainty. Its name comes from the casino analogy of multiple one-armed slot machines: a decision maker is given several arms with unknown reward profiles and, at each time step, chooses one arm to maximize expected reward. The central trade-off is between exploration—trying arms whose rewards are uncertain—and exploitation—choosing arms that have already produced relatively high rewards. [C1]

Public examples describe the same exploration/exploitation dilemma in other domains. In password guessing, a guesser may have several dictionaries or information sources but does not know in advance which will yield the best results; this can be framed as a MAB problem. QoS-aware variants of Thompson sampling have also been proposed for runtime decision making in self-adaptive systems where an arm must satisfy QoS requirements with high confidence. [C10]

Use in coverage-driven verification

In the provided UVM verification evidence, MAB is used to automate the application of test stimuli to a design under test (DUT) with the goal of reaching coverage goals faster than traditional scheduling. The analogy maps slot-machine arms to available test sequences: each sequence is applied to the DUT for a number of cycles, its coverage performance is converted into a reward, and the MAB policy recommends which sequence to select next. [C2]

A virtual sequence is treated as an arm in the MAB formulation. In the cited framework, a virtual sequence is a collection of test sequences that drive the DUT interfaces. The framework preselects a fixed set of virtual sequences and fixed parameters before simulation; after that point, sequences are not allowed to change parameters, constraints, reseeding, or random behavior. This fixed-arm setup lets the MAB learn which sequences perform well and penalize poor performers over repeated trials. [C3]

Functional coverage and reward design

The reward signal is derived from functional coverage. Verification engineers define functional coverpoints for architecturally interesting DUT behavior. Each coverpoint has bins, and each bin has a coverage goal; a bin is considered fully covered once it has been hit at least the required number of times. Achieved coverage provides a quantitative measure of how effectively the tests explore the DUT, including corner cases. [C4]

For a trial, a selected sequence is simulated for a fixed number of cycles. Reward is computed from the bins that are still active, meaning bins whose total hit count has not yet reached the goal. Bins that are already closed are removed from the reward calculation so that the algorithm focuses on still-uncovered behavior rather than repeatedly rewarding sequences for hitting already-covered properties. [C5]

In the described reward computation, the framework resets per-trial hit counters for active bins, runs the selected sequence, and assigns a reward according to how many active bins were hit at least once during that trial. The reward lies in the range [0, 1]; for late-stage corner-case coverage, the evidence describes a logarithmic renormalization intended to boost very small rewards and compress large ones so the learning algorithm can better distinguish useful sequences. [C6]

UCB1-based selection

One implementation used in the evidence is the UCB1 Algorithm. The framework first plays each of the K virtual sequences once to initialize its mean payoff. On later trials, it selects the sequence maximizing an upper-confidence estimate of the form Q(a) + sqrt(2 log(t) / N_a), where Q(a) is the current mean reward estimate for sequence a, t is the trial number, and N_a is the number of times sequence a has been played. After the selected sequence runs, the observed reward is used to update that sequence’s mean reward. [C7]

The uncertainty term makes less-played sequences more likely to be tried, while repeated selection reduces a sequence’s uncertainty by increasing N_a. If an optimistic estimate is wrong, it decreases after further observation; if the choice is good, the algorithm can exploit it while still occasionally exploring alternatives. This is how UCB1 balances exploration and exploitation in the verification flow. [C8]

Reported verification case studies

The cited RISC-V verification work applies MAB at multiple abstraction levels. At unit level, the framework is applied to an Instruction Fetch unit connected through four interfaces, where each interface is driven by a distinct test sequence. A virtual sequence contains one sequence per interface. The evidence describes selecting K = 40 virtual sequences randomly from parameterized sequence choices, with parameters such as instruction-cache partial access rate, stall rate, and branch-direction probabilities. [C9]

At top level, the same MAB approach is applied to a full 2-way superscalar out-of-order RISC-V processor. In that setting, the test sequence becomes an assembly-language instruction sequence executed by the processor, generated by a random instruction generator that produces valid ISA-compliant instruction sequences. [C11]

The reported RISC-V case studies state that MAB reached higher functional coverage goals without manual intervention and in substantially smaller simulation time than random scheduling of the available tests, with simulation-time savings ranging from 1.5× to 2× across different simulation seeds. [C12]

Practical notes

In this evidence, MAB is not used to mutate sequence parameters during simulation. Instead, the available arms are fixed before learning begins. The framework explicitly avoids runtime dynamic biasing of sequence randomness because such intervention can work against the MAB’s need to learn stable reward behavior for each arm. [C3]

The evidence also notes that once MAB identifies low-reward sequences, an orthogonal optimization could replace them with new sequences to further improve coverage and reduce simulation time; however, such replacement is described as outside the core MAB concept presented there. [C13]

CITATIONS

13 sources
13 citations
[1] C1: A multi-armed bandit is a decision-making model with multiple arms of unknown reward profile, requiring a trade-off between exploration and exploitation to maximize expected reward. [PDF] UVM-based verification of RISC-V superscalar processors
[2] C2: In the UVM verification framework, MAB maps arms to test sequences, applies each selected sequence for cycles, records a coverage-based reward, and recommends the next sequence using a balanced exploration-exploitation policy. [PDF] UVM-based verification of RISC-V superscalar processors
[3] C3: In the cited MAB verification framework, virtual sequences and their parameters are fixed before simulation, and runtime parameter or constraint changes are avoided so the MAB can learn stable sequence performance. [PDF] UVM-based verification of RISC-V superscalar processors
[4] C4: Functional coverage in the evidence is defined through coverpoints and bins with goals; a bin is fully covered once it has been hit at least its goal number of times. [PDF] UVM-based verification of RISC-V superscalar processors
[5] C5: The reward calculation counts only active bins whose coverage goals have not yet been reached, removing already-covered bins from reward determination. [PDF] UVM-based verification of RISC-V superscalar processors
[6] C6: The described reward is the fraction of active bins hit at least once during a trial, lies in [0, 1], and may be logarithmically renormalized to enhance small rewards near corner-case closure. [PDF] UVM-based verification of RISC-V superscalar processors
[7] C7: The UCB1-based MAB framework initializes by playing each of K virtual sequences once, then selects the sequence maximizing Q(a) + sqrt(2 log(t) / N_a), observes reward, and updates the mean reward. [PDF] UVM-based verification of RISC-V superscalar processors
[8] C8: UCB1 balances exploration and exploitation by increasing selection pressure for uncertain or less-played sequences while reducing uncertainty after a sequence is selected. [PDF] UVM-based verification of RISC-V superscalar processors
[9] C9: In the Instruction Fetch case study, a virtual sequence contains four sequences, one per interface, and K = 40 virtual sequences are randomly selected from parameterized choices. [PDF] UVM-based verification of RISC-V superscalar processors
[10] C10: Public examples frame password guessing as a MAB-style exploration/exploitation problem and describe a QoS-aware Thompson-sampling variant for multi-armed bandits in runtime decision making. Multi-armed bandit approach to password guessing; QoS-Aware Multi-Armed Bandits
[11] C11: At top-level RISC-V processor verification, the MAB approach automates test application using instruction sequences generated as valid ISA-compliant assembly-language instructions. [PDF] UVM-based verification of RISC-V superscalar processors
[12] C12: The RISC-V case studies report higher functional coverage without manual intervention and simulation-time savings of 1.5× to 2× compared with random scheduling. [PDF] UVM-based verification of RISC-V superscalar processors
[13] C13: The evidence notes that replacing low-reward sequences after MAB identifies them is a possible orthogonal optimization, not part of the core MAB concept discussed. [PDF] UVM-based verification of RISC-V superscalar processors