Mann-Whitney U Test
Definition
The provided evidence describes the Mann-Whitney U Test as a non-parametric test. It does not assume that the data are normally distributed and can therefore be used with small sample sizes. [Definition]
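As an illustration of why no distributional assumption is needed, the U statistic can be computed purely from the ordering of the observations. The following is a minimal sketch, not the procedure used in the cited study; the function name `mann_whitney_u` and the sample values are hypothetical and chosen only for illustration.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples.

    U_a counts the pairs (a, b) in which a is larger than b,
    with ties counted as one half. Only the ordering of the
    values matters, not their distribution.
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # The two statistics always sum to the number of pairs.
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Hypothetical measurements, not data from the study.
u_a, u_b = mann_whitney_u([3, 5, 8], [4, 6, 9, 10])
observed_u = min(u_a, u_b)  # table lookups conventionally use the smaller U
```

With these hypothetical inputs, 3 of the 12 pairs have the first-sample value larger, so the smaller U is 3.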
Use in statistical comparison
In the cited processor-verification study, the test is used in a one-tailed form to compare fuzzing results produced by Vanilla AFL and Enhanced AFL. The study applies the test to two reported metrics: #Queue and #Unique-Crash. [Use in study]
#Queue comparison
The study defines #Queue values as the number of test vectors that increase coverage and cause no execution mismatch in co-simulation. Enhanced AFL generated fewer #Queue test vectors on average than Vanilla AFL, which the authors state would be preferable if the same coverage could be achieved with fewer test vectors. [Queue metric]
The authors analyzed the #Queue values using a one-tailed Mann-Whitney U Test. They report a critical U-value of 34 at the 95% confidence level, an observed U-value of 60, a z-score of 0, and a p-value of 0.5. Because the observed U-value exceeds the critical value, they conclude that the apparent improvement is not statistically significant and that the divergence is negligible. [Queue test result]
#Unique-Crash comparison
The study defines #Unique-Crash values as the number of unique test vectors that cause an execution mismatch. Enhanced AFL generated more #Unique-Crash test vectors on average than Vanilla AFL. [Unique-crash metric]
The authors again used a one-tailed Mann-Whitney U Test for the #Unique-Crash values. They report a critical U-value of 25 at the 99% confidence level, an observed U-value of 17, a z-score of -2.8236, and a p-value of 0.0024. Because the observed U-value falls below the critical value, the study states that Enhanced AFL is highly significantly better at detecting errors. [Unique-crash test result]
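The reported z-scores and p-values can be related to the U-values through the normal approximation of the U distribution. The sketch below is a hedged illustration only: the group size of 11 runs per fuzzer is an assumption that is not stated in this section, and the continuity correction of 0.5 is likewise assumed. These choices happen to reproduce the reported figures, but the study's actual calculation may differ.

```python
import math


def u_normal_approximation(u, n1, n2):
    """One-tailed (lower-tail) normal approximation for a U statistic.

    Returns (z, p). A continuity correction of 0.5 is applied,
    which is an assumption of this sketch.
    """
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u + 0.5 - mean_u) / sd_u
    # Standard normal CDF expressed through the error function.
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, p


ASSUMED_RUNS = 11  # hypothetical group size, not taken from the source

z_queue, p_queue = u_normal_approximation(60, ASSUMED_RUNS, ASSUMED_RUNS)
z_crash, p_crash = u_normal_approximation(17, ASSUMED_RUNS, ASSUMED_RUNS)
```

Under these assumptions, the #Queue values give a z-score of 0 and a p-value of 0.5, and the #Unique-Crash values give a z-score of about -2.8236 and a p-value of about 0.0024, matching the reported results.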
Interpretation in the provided evidence
The evidence illustrates the Mann-Whitney U Test being used to distinguish between an apparent difference that is not statistically significant and a difference considered highly significant by the authors. In the #Queue case, the test result led to the conclusion that the observed difference was negligible; in the #Unique-Crash case, the test result supported the conclusion that Enhanced AFL detected errors better than Vanilla AFL. [Interpretation]
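The two conclusions follow from the usual table-based decision rule, in which the difference is treated as significant when the observed U-value is at or below the critical value for the chosen confidence level. The sketch below applies that rule to the figures reported above; the function name `is_significant` is hypothetical.

```python
def is_significant(observed_u, critical_u):
    """Table-based rule: significant when observed U <= critical U."""
    return observed_u <= critical_u


# Reported values: (observed U, critical U).
queue_significant = is_significant(60, 34)  # #Queue, 95% level
crash_significant = is_significant(17, 25)  # #Unique-Crash, 99% level
```

Here `queue_significant` evaluates to `False` and `crash_significant` to `True`, consistent with the authors' conclusions.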