Skip to content
STIMSMITH

UCB1 Algorithm

Technique WIKI v1 · 5/28/2026

UCB1 is an Upper Confidence Bound multi-armed bandit algorithm used to select among candidate virtual sequences by balancing exploitation of sequences with high observed reward and exploration of less-used sequences. In the cited UVM-based RISC-V verification flow, UCB1 selects one virtual sequence per trial, updates rewards from functional coverage data, and can reach the same coverage as random sequence selection in substantially fewer trials.

Overview

The UCB1 Algorithm is described in the provided evidence as an Upper Confidence Bound multi-armed bandit algorithm used to orchestrate simulation tests. In the verification flow, the available virtual sequences are treated like bandit "arms" or slot machines, and the algorithm selects which sequence to run based on rewards observed in previous trials. Its goal is to maximize and speed up functional coverage while still trying sequences that have not been used often enough to rule out high payoff potential.

Role in a MAB-based verification flow

In the cited UVM-based verification methodology, the framework first selects a set of K virtual sequences, defines coverage properties, and defines a reward function. UCB1 then controls execution of those virtual sequences during simulation. Based on collected rewards, the multi-armed-bandit selection process chooses one virtual sequence to apply to the design under test and gathers coverage data as the observed payoff; a scoreboard checks for functional specification violations during simulation.

The evidence describes the following operating assumptions for this use:

  • each sequence has an unknown reward distribution and unknown expected reward;
  • rewards from successive plays of the same sequence are assumed independent;
  • rewards across different sequences are also assumed independent;
  • only one virtual sequence runs in each round;
  • the selected virtual sequence runs uncontrolled, with no reseeding or parameter tuning;
  • a non-selected sequence remains frozen and contributes no reward while idle.

Selection rule

The UCB1 procedure in the evidence is:

  1. Run each of the K virtual sequences once for M cycles to initialize its mean payoff.
  2. For each later trial t, select the sequence a with the largest upper-confidence estimate, combining the sequence's current mean reward Q(a) with an uncertainty term that depends on log(t) and the number of times that sequence has been played, N_a.
  3. Observe the reward for the selected sequence and update that sequence's mean reward.

The uncertainty term is what drives exploration. When a sequence is selected, N_a increases and its uncertainty decreases. When other sequences are selected, t increases while N_a for the idle sequence does not, so the idle sequence's uncertainty grows. The logarithm makes these increases smaller over time. As a result, all sequences are eventually selected, while sequences with consistently better rewards are exploited more often.

Reward and coverage use

In the cited verification framework, rewards are derived from coverage progress. The framework measures how many active coverage bins a sequence hits during a trial. For example, if four bins remain active and a sequence hits three of them at least once in a trial, it receives a reward of 3/4. Rewards are in the range [0, 1].

The evidence also notes a practical issue near the end of simulation: when only hard-to-hit corner-case bins remain, a useful sequence may hit only one bin out of many active bins, producing a very small reward. To make learning more effective in this case, the reward function is renormalized in [0, 1] using a logarithmic transformation that boosts small rewards and compresses large ones.

Example application: RISC-V instruction fetch verification

The cited work applies the methodology to the instruction fetch unit of a RISC-V superscalar processor. The instruction fetch unit is connected to the rest of the processor through four interfaces. During simulation, each interface is driven by a distinct test sequence, and a virtual sequence contains four sequences total, one per interface. The test sequences are generated by independent constrained-random generators customized by parameters that model event probabilities.

To limit the search space, the methodology constrains each parameter to at most three values and selects K = 40 virtual sequences randomly from the available choices. Coverage goals are then defined using functional coverpoints and bins. In the cited experiment, coverpoints are considered fully covered when each bin has been hit at least 100 times.

Reported results

The evidence compares random selection of the same 40 virtual sequences against the MAB framework using UCB1 and the proposed reward function. Random selection was run for 5000 trials across five seeds and reached coverage between 82% and 91%. For each seed, the UCB1-based framework was then measured by how many trials it needed to reach the same coverage level.

Reported results show that UCB1 reached the same coverage much faster: 1507, 1945, 2372, 1525, and 2590 trials for the five seeds, averaging 1988 trials versus 5000 random trials. The evidence summarizes this as about a 2× reduction in the number of trials needed to reach the same coverage goal, with average savings of 60% in the table.

When allowed to run for the same 5000 trials as random selection, UCB1 reached 88% to 96% coverage across the five seeds. The evidence cautions that UCB1 does not improve the inherent quality of the selected sequence set; instead, it identifies the sequence set's coverage potential more quickly than traditional random application.

Practical interpretation

For constrained-random verification, UCB1 is useful when many candidate virtual sequences exist and their coverage effectiveness is initially unknown. It learns which sequences contribute more coverage and schedules them more often, while still periodically exploring low-use sequences. This makes it suitable for accelerating functional coverage closure without changing the sequence set itself.

CITATIONS

9 sources
9 citations
[1] UCB1 is an Upper Confidence Bound multi-armed bandit algorithm used to exploit effective sequences while exploring less-used sequences for potentially higher payoff. [PDF] UVM-based verification of RISC-V superscalar processors
[2] The MAB verification framework selects one virtual sequence based on collected reward, applies it to the DUT, gathers coverage data, and uses a scoreboard to check functional specifications. [PDF] UVM-based verification of RISC-V superscalar processors
[3] The UCB1 selection procedure initializes rewards by playing each virtual sequence once, then selects the sequence maximizing a mean reward plus uncertainty estimate, observes the trial reward, and updates the mean reward. [PDF] UVM-based verification of RISC-V superscalar processors
[4] The uncertainty term decreases when a sequence is selected, increases for idle sequences as trials advance, and helps UCB1 balance exploration and exploitation. [PDF] UVM-based verification of RISC-V superscalar processors
[5] Rewards in the cited framework are based on active coverage-bin hits, are bounded in [0, 1], and are logarithmically renormalized to make small rewards more distinguishable near hard-to-cover corner cases. [PDF] UVM-based verification of RISC-V superscalar processors
[6] In the RISC-V instruction fetch example, a virtual sequence contains four sequences, one per interface, and the framework uses parameterized constrained-random sequences to mimic interface behavior. [PDF] UVM-based verification of RISC-V superscalar processors
[7] The cited experiment selects K = 40 virtual sequences and treats coverpoints as fully covered when bins are hit at least 100 times. [PDF] UVM-based verification of RISC-V superscalar processors
[8] Across five seeds, random sequence selection for 5000 trials reached 82% to 91% coverage, while UCB1 reached the same coverage in 1507 to 2590 trials, averaging 1988 trials and about a 60% trial saving. [PDF] UVM-based verification of RISC-V superscalar processors
[9] When UCB1 ran for the same 5000 trials as random selection, it reached 88% to 96% coverage, and the evidence states that UCB1 identifies the potential of the selected sequence set faster rather than improving the sequence set's quality. [PDF] UVM-based verification of RISC-V superscalar processors