STIMSMITH

Mann-Whitney U Test

Concept WIKI v1 · 5/26/2026

The Mann-Whitney U Test is a non-parametric statistical test that does not assume a normal distribution and is reported as suitable for small sample sizes. In the provided evidence, it is used as a one-tailed test to evaluate whether differences between Vanilla AFL and Enhanced AFL fuzzing results are statistically significant.

Definition

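The Mann-Whitney U Test is described in the provided evidence as a non-parametric test. It compares two independent samples by rank rather than by raw value, so it does not assume that the data are normally distributed. For that reason the source treats it as usable with small sample sizes. [Definition]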
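As a rough illustration of what the test computes, the U statistic can be obtained by counting, over all cross-sample pairs, how often a value from one sample exceeds a value from the other, with ties counting one half. The sketch below assumes this pairwise-counting formulation. The function name `mann_whitney_u` and the sample values are hypothetical and do not come from the cited study.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) by pairwise counting; ties contribute 0.5."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # The two U values always sum to the number of cross-sample pairs.
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Hypothetical measurements, for illustration only.
group_a = [3, 5, 8, 9]
group_b = [7, 10, 12, 15]

u_a, u_b = mann_whitney_u(group_a, group_b)
# The smaller of the two values is the one compared against a critical-value table.
u_observed = min(u_a, u_b)
```

Pairwise counting is quadratic in the sample sizes, which is acceptable for the small samples this test is typically applied to; a rank-sum formulation gives the same result.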

Use in statistical comparison

In the cited processor-verification study, the test is used in a one-tailed form to compare fuzzing results produced by Vanilla AFL and Enhanced AFL. The study applies the test to two reported metrics: #Queue and #Unique-Crash. [Use in study]

#Queue comparison

The study defines #Queue values as the number of test vectors that increase coverage and cause no execution mismatch in co-simulation. Enhanced AFL generated fewer #Queue test vectors on average than Vanilla AFL, which the authors state would be preferable if the same coverage could be achieved with fewer test vectors. [Queue metric]

The authors analyzed the #Queue values using a one-tailed Mann-Whitney U Test. They report a critical U-value of 34 at the 95% confidence level, an observed U-value of 60, a z-score of 0, and a p-value of 0.5. The observed U-value exceeds the critical value, and the authors conclude that the apparent improvement is not statistically significant and that the divergence is negligible. [Queue test result]

#Unique-Crash comparison

The study defines #Unique-Crash values as the number of unique test vectors that cause an execution mismatch. Enhanced AFL generated more #Unique-Crash test vectors on average than Vanilla AFL. [Unique-crash metric]

The authors again used a one-tailed Mann-Whitney U Test for the #Unique-Crash values. They report a critical U-value of 25 at the 99% confidence level, an observed U-value of 17, a z-score of -2.8236, and a p-value of 0.0024. The observed U-value falls below the critical value, and on this basis the study describes Enhanced AFL as better at detecting errors to a highly significant degree. [Unique-crash test result]
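The reported z-scores and p-values are consistent with the standard normal approximation to U, using a continuity correction. The sketch below assumes 11 runs per fuzzer. That group size is not stated on this page and is inferred here only because it reproduces the reported figures. The function name `u_to_z_and_p` is hypothetical.

```python
import math


def u_to_z_and_p(u, n1, n2):
    """Normal approximation to U with continuity correction.

    Returns (z, p) where p is the lower-tail probability P(Z <= z).
    """
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    diff = u - mean_u
    # Continuity correction: move U half a unit toward the mean.
    if diff > 0:
        diff -= 0.5
    elif diff < 0:
        diff += 0.5
    z = diff / sd_u
    # Standard normal CDF expressed through the complementary error function.
    p = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p


# ASSUMPTION: 11 runs per fuzzer (not stated in the source text).
N_RUNS = 11

z_queue, p_queue = u_to_z_and_p(60, N_RUNS, N_RUNS)   # reported: z = 0, p = 0.5
z_crash, p_crash = u_to_z_and_p(17, N_RUNS, N_RUNS)   # reported: z = -2.8236, p = 0.0024
```

With a different group size the same U-values would give different z-scores, so the agreement here should be read as a plausibility check rather than as confirmation of the study's setup.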

Interpretation in the provided evidence

The evidence illustrates the Mann-Whitney U Test being used to distinguish between an apparent difference that is not statistically significant and a difference considered highly significant by the authors. In the #Queue case, the test result led to the conclusion that the observed difference was negligible; in the #Unique-Crash case, the test result supported the conclusion that Enhanced AFL detected errors better than Vanilla AFL. [Interpretation]
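Both outcomes follow the usual critical-value rule for this test: the observed U is significant when it is less than or equal to the tabulated critical value. The following minimal sketch applies that rule to the figures reported for the two metrics. The function name `is_significant` is hypothetical.

```python
def is_significant(observed_u, critical_u):
    """Critical-value rule: reject the null hypothesis when U <= critical U."""
    return observed_u <= critical_u


# Figures as reported in the study for each metric.
queue_significant = is_significant(observed_u=60, critical_u=34)   # 95% level
crash_significant = is_significant(observed_u=17, critical_u=25)   # 99% level
```

Note that the direction of the comparison is the opposite of many other tests, where a larger statistic indicates significance.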

CITATIONS

[1] Definition: the Mann-Whitney U Test is non-parametric, assumes no normal distribution, and works for small sample sizes. Source: Efficient Cross-Level Processor Verification using Coverage-guided Fuzzing
[2] Use in study: the cited study used the one-tailed Mann-Whitney U Test to compare Vanilla AFL and Enhanced AFL fuzzing results. Source: Efficient Cross-Level Processor Verification using Coverage-guided Fuzzing
[3] Queue metric: #Queue values are test vectors that increase coverage and cause no execution mismatch; Enhanced AFL generated fewer #Queue test vectors on average. Source: Efficient Cross-Level Processor Verification using Coverage-guided Fuzzing
[4] Queue test result: for #Queue, the study reports a 95% critical U threshold of 34, U=60, z=0, p=0.5, and concludes the improvement is not statistically significant. Source: Efficient Cross-Level Processor Verification using Coverage-guided Fuzzing
[5] Unique-crash metric: #Unique-Crash values are unique test vectors that cause an execution mismatch, and Enhanced AFL generated more of them on average. Source: Efficient Cross-Level Processor Verification using Coverage-guided Fuzzing
[6] Unique-crash test result: for #Unique-Crash, the study reports a 99% critical U threshold of 25, U=17, z=-2.8236, p=0.0024, and concludes Enhanced AFL is highly significantly better at detecting errors. Source: Efficient Cross-Level Processor Verification using Coverage-guided Fuzzing
[7] Interpretation: the evidence shows the test being used to classify one apparent difference as not statistically significant and another result as highly significant. Source: Efficient Cross-Level Processor Verification using Coverage-guided Fuzzing