Multi-Armed Bandit

Concept

A multi-armed bandit (MAB) is a decision-making model in which a decision maker repeatedly chooses among arms with unknown reward behavior, balancing exploration of uncertain choices against exploitation of choices that have paid off. In the hardware verification evidence, MAB schedules virtual test sequences, using rewards derived from functional coverage and algorithms such as UCB1 to accelerate coverage closure.

First seen 5/28/2026

Last seen 5/28/2026

Evidence 7 chunks

Wiki v1

WIKI

Overview

A Multi-Armed Bandit (MAB) is a model for making decisions under uncertainty. Its name comes from the casino analogy of multiple one-armed slot machines: a decision maker is given several arms with unknown reward profiles and, at each time step, chooses one arm to maximize expected reward. The central trade-off is between exploration—trying arms whose rewards are uncertain—and exploitation—choosing arms that have already produced relatively high rewards. [C1]
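
The exploration/exploitation trade-off can be made concrete with a minimal Python sketch of UCB1, using the selection rule quoted in C7: play every arm once, then pick the arm maximizing Q(a) + sqrt(2 log(t) / N_a). Everything else here is an illustrative assumption rather than the cited framework's code: the function names, the Bernoulli reward model, and the use of the current trial index as t. In the verification setting the arms would be virtual sequences and the reward would be coverage-based.

```python
import math
import random


def ucb1_select(mean_reward, pulls, t):
    """Return the index of the arm maximizing Q(a) + sqrt(2 * ln(t) / N_a)."""
    # Initialization phase: any arm that has never been played goes first.
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm
    scores = [
        q + math.sqrt(2.0 * math.log(t) / n)
        for q, n in zip(mean_reward, pulls)
    ]
    # On ties, max() keeps the lowest arm index.
    return max(range(len(scores)), key=scores.__getitem__)


def run_ucb1(reward_fns, trials):
    """Play `trials` rounds; reward_fns[a]() returns a reward in [0, 1]."""
    k = len(reward_fns)
    mean_reward = [0.0] * k  # Q(a): running mean reward per arm
    pulls = [0] * k          # N_a: number of times each arm was played
    for t in range(1, trials + 1):
        arm = ucb1_select(mean_reward, pulls, t)
        reward = reward_fns[arm]()
        pulls[arm] += 1
        # Incremental update of the running mean.
        mean_reward[arm] += (reward - mean_reward[arm]) / pulls[arm]
    return mean_reward, pulls


def make_bernoulli_arm(p, rng):
    """Hypothetical arm paying 1.0 with probability p, else 0.0."""
    return lambda: 1.0 if rng.random() < p else 0.0


rng = random.Random(0)  # fixed seed so the run is repeatable
arms = [make_bernoulli_arm(p, rng) for p in (0.2, 0.5, 0.8)]
means, pulls = run_ucb1(arms, trials=2000)
print("estimated means:", [round(m, 3) for m in means])
print("pull counts:", pulls)
```

The square-root term shrinks as an arm is played more often, so rarely played arms keep getting revisited (exploration) while arms with a high running mean are favored overall (exploitation).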

Public examples describe the same exploration/exploitation dilemma in other domains. In password guessing, a guesser may have several dictionaries or information sources but does not know in advance which will yield the best results; this can be framed as a MAB problem. Quality-of-service (QoS) aware variants of Thompson sampling have also been proposed for runtime decision making in self-adaptive systems, where the chosen arm must satisfy QoS requirements with high confidence. [C10]


RELATIONSHIPS

4 connections

UCB1 Algorithm ← implements 98% 2e

UCB1 is the specific multi-armed bandit algorithm implemented in the verification framework.

Reward Function uses → 95% 2e

The MAB framework uses a reward function to guide test sequence selection.

virtual sequence uses → 95% 1e

The MAB framework treats virtual sequences as the arms of the bandit.

Functional Coverage uses → 95% 1e

The MAB framework targets functional coverage goals as its optimization objective.

LINKED ENTITIES

4 links

UCB1 Algorithm IMPLEMENTS Extracted graph relationship

Reward Function USES Extracted graph relationship

virtual sequence USES Extracted graph relationship

Functional Coverage USES Extracted graph relationship

CITATIONS

13 sources

[1] C1: A multi-armed bandit is a decision-making model with multiple arms of unknown reward profile, requiring a trade-off between exploration and exploitation to maximize expected reward. Source: UVM-based verification of RISC-V superscalar processors (PDF).

[2] C2: In the UVM verification framework, MAB maps arms to test sequences, applies each selected sequence for a number of cycles, records a coverage-based reward, and recommends the next sequence using a balanced exploration-exploitation policy. Source: UVM-based verification of RISC-V superscalar processors (PDF).

[3] C3: In the cited MAB verification framework, virtual sequences and their parameters are fixed before simulation, and runtime parameter or constraint changes are avoided so the MAB can learn stable sequence performance. Source: UVM-based verification of RISC-V superscalar processors (PDF).

[4] C4: Functional coverage in the evidence is defined through coverpoints and bins with goals; a bin is fully covered once it has been hit at least its goal number of times. Source: UVM-based verification of RISC-V superscalar processors (PDF).

[5] C5: The reward calculation counts only active bins, meaning bins whose coverage goals have not yet been reached; already-covered bins are removed from reward determination. Source: UVM-based verification of RISC-V superscalar processors (PDF).

[6] C6: The described reward is the fraction of active bins hit at least once during a trial, lies in [0, 1], and may be logarithmically renormalized to enhance small rewards near corner-case closure. Source: UVM-based verification of RISC-V superscalar processors (PDF).

[7] C7: The UCB1-based MAB framework initializes by playing each of K virtual sequences once, then selects the sequence maximizing Q(a) + sqrt(2 log(t) / N_a), observes the reward, and updates the mean reward. Source: UVM-based verification of RISC-V superscalar processors (PDF).

[8] C8: UCB1 balances exploration and exploitation by increasing selection pressure for uncertain or less-played sequences while reducing uncertainty after a sequence is selected. Source: UVM-based verification of RISC-V superscalar processors (PDF).

[9] C9: In the Instruction Fetch case study, a virtual sequence contains four sequences, one per interface, and K = 40 virtual sequences are randomly selected from parameterized choices. Source: UVM-based verification of RISC-V superscalar processors (PDF).

[10] C10: Public examples frame password guessing as a MAB-style exploration/exploitation problem and describe a QoS-aware Thompson-sampling variant for multi-armed bandits in runtime decision making. Sources: Multi-armed bandit approach to password guessing; QoS-Aware Multi-Armed Bandits.

[11] C11: At top-level RISC-V processor verification, the MAB approach automates test application using instruction sequences generated as valid ISA-compliant assembly-language instructions. Source: UVM-based verification of RISC-V superscalar processors (PDF).

[12] C12: The RISC-V case studies report higher functional coverage without manual intervention and simulation-time savings of 1.5× to 2× compared with random scheduling. Source: UVM-based verification of RISC-V superscalar processors (PDF).

[13] C13: The evidence notes that replacing low-reward sequences after MAB identifies them is a possible orthogonal optimization, not part of the core MAB concept discussed. Source: UVM-based verification of RISC-V superscalar processors (PDF).
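
The coverage-based reward summarized in C4 to C6 can be sketched in Python as follows. This is an illustration under stated assumptions, not the cited framework's implementation: bins are modeled as dictionary keys, the active set is taken from the hit counts before the trial, a trial with no active bins left is given reward 0.0, and the logarithmic renormalization uses a hypothetical formula, log(1 + s*r) / log(1 + s), because the evidence does not state the exact form.

```python
import math


def active_bins(hits, goals):
    """Bins whose cumulative hit count is still below the coverage goal."""
    return {b for b, goal in goals.items() if hits.get(b, 0) < goal}


def trial_reward(trial_hits, hits, goals):
    """Fraction of active bins hit at least once in this trial, in [0, 1]."""
    active = active_bins(hits, goals)
    if not active:
        return 0.0  # assumption: nothing left to cover means no reward
    hit_now = sum(1 for b in active if trial_hits.get(b, 0) > 0)
    return hit_now / len(active)


def log_renormalize(reward, scale=100.0):
    """Hypothetical rescaling that maps [0, 1] to [0, 1] and lifts small rewards."""
    return math.log1p(scale * reward) / math.log1p(scale)


def record_trial(trial_hits, hits):
    """Fold one trial's hit counts into the cumulative counts."""
    for b, n in trial_hits.items():
        hits[b] = hits.get(b, 0) + n


# Illustrative data: bin "a" has already met its goal, so it is not active.
goals = {"a": 2, "b": 1, "c": 3}
hits = {"a": 2, "b": 0, "c": 1}
trial_hits = {"a": 5, "c": 1}  # hits observed while one sequence ran

reward = trial_reward(trial_hits, hits, goals)
record_trial(trial_hits, hits)
print("raw reward:", reward)
print("renormalized reward:", round(log_renormalize(reward), 3))
```

Excluding covered bins keeps the reward focused on what is still missing: the five extra hits on bin "a" contribute nothing, while the single hit on the still-active bin "c" does.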