Multi-Armed Bandit Wiki

Overview

A multi-armed bandit (MAB) is a model for sequential decision making under uncertainty. The name comes from a casino analogy with a row of one-armed slot machines: a decision maker faces several arms with unknown reward profiles and, at each time step, chooses one arm with the aim of maximizing expected reward. The central trade-off is between exploration (trying arms whose rewards are uncertain) and exploitation (choosing arms that have already produced relatively high rewards). [C1]

Public examples describe the same exploration/exploitation dilemma in other domains. In password guessing, a guesser may have several dictionaries or information sources without knowing in advance which will yield the best results, which can be framed as a MAB problem. Quality-of-service (QoS) aware variants of Thompson sampling have also been proposed for runtime decision making in self-adaptive systems, where a chosen arm must satisfy QoS requirements with high confidence. [C10]

Use in coverage-driven verification

In the provided UVM verification evidence, MAB is used to automate the application of test stimuli to a design under test (DUT) with the goal of reaching coverage goals faster than traditional scheduling. The analogy maps slot-machine arms to available test sequences: each sequence is applied to the DUT for a number of cycles, its coverage performance is converted into a reward, and the MAB policy recommends which sequence to select next. [C2]

A virtual sequence is treated as an arm in the MAB formulation. In the cited framework, a virtual sequence is a collection of test sequences that drive the DUT interfaces. The framework preselects a fixed set of virtual sequences with fixed parameters before simulation; after that point, a sequence may not change its parameters or constraints, be reseeded, or otherwise alter its random behavior. This fixed-arm setup lets the MAB learn which sequences perform well and penalize poor performers over repeated trials. [C3]

Functional coverage and reward design

The reward signal is derived from functional coverage. Verification engineers define functional coverpoints for architecturally interesting DUT behavior. Each coverpoint has bins, and each bin has a coverage goal; a bin is considered fully covered once it has been hit at least the required number of times. Achieved coverage provides a quantitative measure of how effectively the tests explore the DUT, including corner cases. [C4]

For a trial, a selected sequence is simulated for a fixed number of cycles. Reward is computed from the bins that are still active, meaning bins whose total hit count has not yet reached the goal. Bins that are already closed are removed from the reward calculation so that the algorithm focuses on still-uncovered behavior rather than repeatedly rewarding sequences for hitting already-covered properties. [C5]
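The bin bookkeeping described above can be sketched as follows. This is a minimal illustration, not the cited framework's code: the `Bin` class, its field names, and the example bin names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    """One functional-coverage bin (hypothetical model, not a UVM API)."""
    name: str
    goal: int            # hits required before the bin counts as fully covered
    total_hits: int = 0  # hits accumulated over all trials so far

    @property
    def is_active(self) -> bool:
        # A bin stays active until its total hit count reaches the goal.
        return self.total_hits < self.goal


def active_bins(bins):
    """Return only the bins that should still contribute to the reward."""
    return [b for b in bins if b.is_active]


# Hypothetical bins: the first one is already closed.
bins = [
    Bin("fetch_stall", goal=2, total_hits=2),
    Bin("branch_taken", goal=4, total_hits=1),
    Bin("icache_partial", goal=1, total_hits=0),
]
print([b.name for b in active_bins(bins)])  # ['branch_taken', 'icache_partial']
```

Filtering before the reward is computed means a sequence gets no credit for hitting `fetch_stall` again, which matches the stated goal of focusing on still-uncovered behavior.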

In the described reward computation, the framework resets the per-trial hit counters of the active bins, runs the selected sequence, and assigns a reward based on how many active bins were hit at least once during that trial. The reward lies in the range [0, 1]. For late-stage corner-case coverage, the evidence describes a logarithmic renormalization intended to boost very small rewards and compress large ones so that the learning algorithm can better distinguish useful sequences. [C6]
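A rough sketch of such a reward computation is shown below. Two points are assumptions rather than details taken from the evidence: the reward is taken to be the fraction of active bins hit at least once, and the renormalization uses the form log(1 + c·r) / log(1 + c) with a hypothetical constant `scale`. The function names are illustrative only.

```python
import math


def trial_reward(trial_hits, active_names):
    """Assumed reward: fraction of active bins hit at least once this trial.

    trial_hits maps bin name -> hits observed during this trial only.
    Hits on bins that are not in active_names are ignored.
    """
    if not active_names:
        return 0.0
    hit = sum(1 for name in active_names if trial_hits.get(name, 0) >= 1)
    return hit / len(active_names)


def log_renormalize(reward, scale=100.0):
    """Assumed renormalization: maps [0, 1] onto [0, 1], lifting small values.

    The exact formula used in the cited work is not given here; this is
    one simple function with the described boosting/compressing shape.
    """
    return math.log1p(scale * reward) / math.log1p(scale)


active = ["a", "b", "c", "d"]
hits = {"a": 3, "c": 1, "z": 5}  # "z" is a closed bin and does not count
r = trial_reward(hits, active)
print(r)  # 0.5
print(log_renormalize(0.01) > 0.01)  # True: a very small reward is boosted
```

With this shape, two late-stage sequences whose raw rewards are both close to zero end up further apart after renormalization, which is the effect the evidence attributes to the logarithmic step.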

UCB1-based selection

One implementation used in the evidence is the UCB1 algorithm. The framework first plays each of the K virtual sequences once to initialize each sequence’s mean payoff estimate. On later trials, it selects the sequence a that maximizes an upper-confidence estimate of the form Q(a) + sqrt(2 log(t) / N_a), where Q(a) is the current mean reward estimate for sequence a, t is the trial number, and N_a is the number of times sequence a has been played. After the selected sequence runs, the observed reward is used to update that sequence’s mean reward. [C7]
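The selection rule can be sketched in a few lines. This is a generic UCB1 sketch and not the framework's implementation: the class name, the choice to count t as the number of completed trials, and the fixed per-arm rewards standing in for DUT simulation are all assumptions made for illustration.

```python
import math


class UCB1:
    """Minimal UCB1 policy over a fixed set of arms (virtual sequences)."""

    def __init__(self, num_arms):
        self.counts = [0] * num_arms   # N_a: times each arm was played
        self.means = [0.0] * num_arms  # Q(a): running mean reward per arm
        self.t = 0                     # trials completed so far

    def select(self):
        # Initialization phase: play every arm once before using the bound.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        # Afterwards pick the arm with the largest upper-confidence estimate.
        return max(
            range(len(self.counts)),
            key=lambda a: self.means[a]
            + math.sqrt(2.0 * math.log(self.t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        # Incremental mean: Q <- Q + (reward - Q) / N_a
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Stand-in for simulation: each arm returns a fixed, hypothetical reward.
fake_rewards = [0.1, 0.5, 0.3]
policy = UCB1(num_arms=len(fake_rewards))
for _ in range(200):
    arm = policy.select()
    policy.update(arm, fake_rewards[arm])
print(policy.counts)  # arm 1 is played most often
```

In a real flow the `fake_rewards` lookup would be replaced by running the chosen virtual sequence for a fixed number of cycles and converting its coverage into a reward.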

The uncertainty term makes less-played sequences more likely to be tried, while repeated selection reduces a sequence’s uncertainty by increasing N_a. If an optimistic estimate is wrong, it decreases after further observation; if the choice is good, the algorithm can exploit it while still occasionally exploring alternatives. This is how UCB1 balances exploration and exploitation in the verification flow. [C8]
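The behaviour of the uncertainty term can be checked numerically. The helper name `exploration_bonus` is hypothetical; it evaluates the sqrt(2 log(t) / N_a) term from the UCB1 expression.

```python
import math


def exploration_bonus(t, n_a):
    """Uncertainty term of UCB1 for an arm played n_a times by trial t."""
    return math.sqrt(2.0 * math.log(t) / n_a)


# At a fixed trial number, the bonus shrinks as an arm is played more often.
for n_a in (1, 10, 50):
    print(n_a, round(exploration_bonus(100, n_a), 3))

# For an arm that is left unplayed, the bonus grows slowly as t increases,
# so a neglected sequence is eventually tried again.
for t in (10, 100, 1000):
    print(t, round(exploration_bonus(t, 5), 3))
```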

Reported verification case studies

The cited RISC-V verification work applies MAB at multiple abstraction levels. At unit level, the framework is applied to an Instruction Fetch unit connected through four interfaces, where each interface is driven by a distinct test sequence. A virtual sequence contains one sequence per interface. The evidence describes selecting K = 40 virtual sequences randomly from parameterized sequence choices, with parameters such as instruction-cache partial access rate, stall rate, and branch-direction probabilities. [C9]
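Drawing a fixed arm set from parameterized choices might look like the sketch below. The parameter value lists, the field names, and the decision to draw distinct combinations are hypothetical; the evidence only states that K = 40 virtual sequences are selected randomly.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an arm's parameters never change after selection
class VirtualSequence:
    icache_partial_access_rate: float
    stall_rate: float
    branch_taken_prob: float


# Hypothetical parameter grids; the actual values are not given in the evidence.
PARTIAL_ACCESS_RATES = [0.0, 0.1, 0.25, 0.5, 0.75]
STALL_RATES = [0.0, 0.05, 0.2, 0.5, 0.8]
BRANCH_TAKEN_PROBS = [0.1, 0.5, 0.9]


def sample_virtual_sequences(k=40, seed=0):
    """Pick k distinct parameter combinations to serve as the fixed arms."""
    rng = random.Random(seed)  # seeded so the arm set is reproducible
    combos = list(
        itertools.product(PARTIAL_ACCESS_RATES, STALL_RATES, BRANCH_TAKEN_PROBS)
    )
    return [VirtualSequence(*c) for c in rng.sample(combos, k)]


arms = sample_virtual_sequences()
print(len(arms))  # 40
```

Because the arm set is built once before simulation, the bandit always sees the same 40 choices, which is consistent with the fixed-arm setup the framework relies on.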

At top level, the same MAB approach is applied to a full 2-way superscalar out-of-order RISC-V processor. In that setting, the test sequence becomes an assembly-language instruction sequence executed by the processor, generated by a random instruction generator that produces valid ISA-compliant instruction sequences. [C11]

The reported RISC-V case studies state that MAB reached higher functional coverage goals without manual intervention and in substantially less simulation time than random scheduling of the available tests, with simulation-time savings ranging from 1.5× to 2× across different simulation seeds. [C12]

Practical notes

In this evidence, MAB is not used to mutate sequence parameters during simulation. Instead, the available arms are fixed before learning begins. The framework explicitly avoids runtime dynamic biasing of sequence randomness because such intervention can work against the MAB’s need to learn stable reward behavior for each arm. [C3]

The evidence also notes that once MAB identifies low-reward sequences, an orthogonal optimization could replace them with new sequences to further improve coverage and reduce simulation time; however, such replacement is described as outside the core MAB concept presented there. [C13]