UCB1 Algorithm Wiki

Overview

The UCB1 Algorithm is described in the provided evidence as an Upper Confidence Bound multi-armed bandit algorithm used to orchestrate simulation tests. In the verification flow, the available virtual sequences are treated like bandit "arms" or slot machines, and the algorithm selects which sequence to run based on rewards observed in previous trials. Its goal is to maximize and speed up functional coverage while still trying sequences that have not been used often enough to rule out high payoff potential.

Role in a MAB-based verification flow

In the cited UVM-based verification methodology, the framework first selects a set of K virtual sequences, defines coverage properties, and defines a reward function. UCB1 then controls execution of those virtual sequences during simulation. Based on collected rewards, the multi-armed-bandit selection process chooses one virtual sequence to apply to the design under test and gathers coverage data as the observed payoff; a scoreboard checks for functional specification violations during simulation.

The evidence describes the following operating assumptions for this use:

each sequence has an unknown reward distribution and unknown expected reward;
rewards from successive plays of the same sequence are assumed independent;
rewards across different sequences are also assumed independent;
only one virtual sequence runs in each round;
the selected virtual sequence runs uncontrolled, with no reseeding or parameter tuning;
a non-selected sequence remains frozen and contributes no reward while idle.

Selection rule

The UCB1 procedure in the evidence is:

Run each of the K virtual sequences once for M cycles to initialize its mean payoff.
For each later trial t, select the sequence a with the largest upper-confidence estimate, combining the sequence's current mean reward Q(a) with an uncertainty term that depends on log(t) and the number of times that sequence has been played, N_a.
Observe the reward for the selected sequence and update that sequence's mean reward.

The uncertainty term is what drives exploration. When a sequence is selected, N_a increases and its uncertainty decreases. When other sequences are selected, t increases while N_a for the idle sequence does not, so the idle sequence's uncertainty grows. The logarithm makes these increases smaller over time. As a result, all sequences are eventually selected, while sequences with consistently better rewards are exploited more often.

Reward and coverage use

In the cited verification framework, rewards are derived from coverage progress. The framework measures how many active coverage bins a sequence hits during a trial. For example, if four bins remain active and a sequence hits three of them at least once in a trial, it receives a reward of 3/4. Rewards are in the range [0, 1].

The evidence also notes a practical issue near the end of simulation: when only hard-to-hit corner-case bins remain, a useful sequence may hit only one bin out of many active bins, producing a very small reward. To make learning more effective in this case, the reward function is renormalized in [0, 1] using a logarithmic transformation that boosts small rewards and compresses large ones.

Example application: RISC-V instruction fetch verification

The cited work applies the methodology to the instruction fetch unit of a RISC-V superscalar processor. The instruction fetch unit is connected to the rest of the processor through four interfaces. During simulation, each interface is driven by a distinct test sequence, and a virtual sequence contains four sequences total, one per interface. The test sequences are generated by independent constrained-random generators customized by parameters that model event probabilities.

To limit the search space, the methodology constrains each parameter to at most three values and selects K = 40 virtual sequences randomly from the available choices. Coverage goals are then defined using functional coverpoints and bins. In the cited experiment, coverpoints are considered fully covered when each bin has been hit at least 100 times.

Reported results

The evidence compares random selection of the same 40 virtual sequences against the MAB framework using UCB1 and the proposed reward function. Random selection was run for 5000 trials across five seeds and reached coverage between 82% and 91%. For each seed, the UCB1-based framework was then measured by how many trials it needed to reach the same coverage level.

Reported results show that UCB1 reached the same coverage much faster: 1507, 1945, 2372, 1525, and 2590 trials for the five seeds, averaging 1988 trials versus 5000 random trials. The evidence summarizes this as about a 2× reduction in the number of trials needed to reach the same coverage goal, with average savings of 60% in the table.

When allowed to run for the same 5000 trials as random selection, UCB1 reached 88% to 96% coverage across the five seeds. The evidence cautions that UCB1 does not improve the inherent quality of the selected sequence set; instead, it identifies the sequence set's coverage potential more quickly than traditional random application.

Practical interpretation

For constrained-random verification, UCB1 is useful when many candidate virtual sequences exist and their coverage effectiveness is initially unknown. It learns which sequences contribute more coverage and schedules them more often, while still periodically exploring low-use sequences. This makes it suitable for accelerating functional coverage closure without changing the sequence set itself.

UCB1 Algorithm