Testcase Generation Wiki

Overview

Testcase generation refers to producing test cases or testsets that can be executed to validate a system. In the work summarized here, the concept appears in two main contexts: software test-driven development (TDD), where requirements serve as the input for text-to-testcase generation, and instruction-set-simulator (ISS) verification, where generated instruction programs are used to expose simulator errors.

Text-to-testcase generation for software

In TDD, test cases are expected to be written from requirements before the implementation code exists. The cited work notes that many automated test-case generation approaches take source code as input, which does not fully support TDD because the code is not yet available. Recent work therefore studies text-to-testcase generation, where natural-language requirements are the input.

Two examples are:

Enhancing Large Language Models for Text-to-Testcase Generation: a GPT-3.5-based approach fine-tuned on a curated dataset with prompt design. In the reported evaluation over five large-scale open-source projects, it generated 7,000 test cases and achieved 78.5% syntactic correctness, 67.09% requirement alignment, and 61.7% code coverage.
PyTester: Deep Reinforcement Learning for Text-to-Testcase Generation: a deep reinforcement learning approach intended to generate syntactically correct, executable, complete, and effective test cases that align with a natural-language requirement. On the APPS benchmark, the authors report that PyTester, despite using a small language model, outperformed larger models such as GPT-3.5, StarCoder, and InCoder.
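
Syntactic correctness, one of the properties both approaches report on, can be illustrated with a minimal sketch. It assumes the generated tests are Python source strings and treats a test as syntactically correct when it parses; the exact metric definitions used in the papers are not given here, and the requirement, the candidate tests, and the function names below are hypothetical.

```python
import ast

def is_syntactically_correct(test_source):
    """Return True if the generated test parses as Python source."""
    try:
        ast.parse(test_source)
        return True
    except SyntaxError:
        return False

def syntactic_correctness_rate(generated_tests):
    """Fraction of generated tests that parse without a syntax error."""
    if not generated_tests:
        return 0.0
    passed = sum(is_syntactically_correct(t) for t in generated_tests)
    return passed / len(generated_tests)

# Hypothetical requirement and model outputs, for illustration only.
requirement = "add(a, b) returns the sum of two integers."
candidates = [
    "def test_add():\n    assert add(2, 3) == 5\n",
    "def test_add(:\n    assert add(2, 3) == 5\n",  # malformed signature
]
rate = syntactic_correctness_rate(candidates)
```

Parsing only establishes that a test is well-formed. Executability, requirement alignment, and coverage would each need a separate check, for example running the test against an implementation.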

Testcase generation for ISS verification

In RISC-V ISS verification, generated testsets can be used to compare an implementation under test against reference simulators. The coverage-guided fuzzing approach in Verifying Instruction Set Simulators using Coverage-guided Fuzzing has two phases: first, the fuzzer generates a testset; then each generated testcase is executed on the ISS under test and reference ISSs, and their results are compared.
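
The second phase, running each testcase on several simulators and comparing the results, can be sketched as follows. This is a toy model under stated assumptions, not the paper's implementation: the two simulators are hypothetical Python functions over a four-entry register file, the instruction format is invented, and the simulator under test carries a deliberately seeded bug.

```python
MASK32 = 0xFFFFFFFF

def run_reference(program, regs):
    """Toy reference simulator: 32-bit ADD and SUB on a small register file."""
    regs = list(regs)
    for op, rd, rs1, rs2 in program:
        if op == "add":
            regs[rd] = (regs[rs1] + regs[rs2]) & MASK32
        elif op == "sub":
            regs[rd] = (regs[rs1] - regs[rs2]) & MASK32
    return regs

def run_under_test(program, regs):
    """Toy simulator under test with a seeded bug: SUB does not wrap to 32 bits."""
    regs = list(regs)
    for op, rd, rs1, rs2 in program:
        if op == "add":
            regs[rd] = (regs[rs1] + regs[rs2]) & MASK32
        elif op == "sub":
            regs[rd] = regs[rs1] - regs[rs2]  # bug: missing & MASK32
    return regs

def find_mismatches(testset, initial_regs):
    """Return the indices of testcases whose final register state differs."""
    mismatches = []
    for index, program in enumerate(testset):
        expected = run_reference(program, initial_regs)
        actual = run_under_test(program, initial_regs)
        if expected != actual:
            mismatches.append(index)
    return mismatches

initial_regs = [0, 1, 2, 3]   # same starting state for both simulators
testset = [
    [("add", 1, 2, 3)],       # 2 + 3, no wraparound
    [("sub", 1, 2, 3)],       # 2 - 3 underflows and exposes the seeded bug
]
mismatches = find_mismatches(testset, initial_regs)
```

Only the second testcase is flagged, because the seeded bug is visible only when the subtraction underflows.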

The paper describes testcases with controlled setup and teardown behavior: registers are initialized to predefined values so implementations start in the same state, and a suffix writes register values to a predefined memory region so execution results can be dumped and compared. During generation, the ISS under test emits execution feedback; if a testcase increases coverage, it is added to the fuzzer testset.
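
The feedback loop in the generation phase can be sketched in a few lines. The sketch assumes a toy stand-in for the ISS under test that reports a set of coverage points, a hypothetical 3-bit opcode field, and a generic byte-level mutator; it does not reproduce libFuzzer or the paper's specialized mutator, and it omits the register-initialization prefix and the register-dump suffix.

```python
import random

def execute_with_coverage(testcase):
    """Toy stand-in for the ISS under test: return the coverage points hit."""
    covered = set()
    for byte in testcase:
        opcode = byte & 0x07                      # hypothetical 3-bit opcode field
        covered.add(("opcode", opcode))
        if opcode == 0 and byte & 0x80:
            covered.add(("opcode0", "high-bit"))  # a branch inside one handler
    return covered

def mutate(testcase, rng):
    """Apply one random byte-level mutation: bit flip, insertion, or deletion."""
    data = bytearray(testcase)
    choice = rng.randrange(3)
    if choice == 0 and data:
        data[rng.randrange(len(data))] ^= 1 << rng.randrange(8)
    elif choice == 1:
        data.insert(rng.randrange(len(data) + 1), rng.randrange(256))
    elif len(data) > 1:
        del data[rng.randrange(len(data))]
    return bytes(data)

def generate_testset(seed, iterations, rng_seed=0):
    """Keep a mutated testcase only if it reaches a new coverage point."""
    rng = random.Random(rng_seed)
    testset = [seed]
    total_coverage = execute_with_coverage(seed)
    for _ in range(iterations):
        candidate = mutate(rng.choice(testset), rng)
        feedback = execute_with_coverage(candidate)
        if not feedback <= total_coverage:        # candidate covered something new
            testset.append(candidate)
            total_coverage |= feedback
    return testset, total_coverage

testset, coverage = generate_testset(seed=b"\x01", iterations=2000)
```

Because a candidate is retained only when it adds coverage, the testset stays small: in this toy model it can never exceed the number of distinct coverage points.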

The same work extends coverage-guided fuzzing with functional coverage and a specialized mutator tailored to ISS verification. Functional coverage is presented as a complement to code coverage, especially for computational errors that depend on operand values and structure. The authors implemented the approach on top of LLVM libFuzzer and evaluated it on three publicly available RISC-V ISSs.
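
To illustrate why functional coverage can complement code coverage, the sketch below classifies executions of a single ADD instruction by operand values and register structure. The bins are hypothetical and chosen only for illustration; they are not the coverage metrics defined in the paper.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def value_bin(value):
    """Classify an operand value into a coarse, hypothetical bin."""
    if value == 0:
        return "zero"
    if value == INT32_MIN:
        return "min"
    if value == INT32_MAX:
        return "max"
    return "negative" if value < 0 else "positive"

def structure_bin(rd, rs1, rs2):
    """Classify how destination and source registers overlap."""
    if rs1 == rs2:
        return "rs1==rs2"
    if rd in (rs1, rs2):
        return "rd==source"
    return "distinct"

def functional_coverage(executions):
    """Collect the value and structure bins hit by a list of executions."""
    covered = set()
    for rd, rs1, rs2, a, b in executions:
        covered.add(("values", value_bin(a), value_bin(b)))
        covered.add(("structure", structure_bin(rd, rs1, rs2)))
    return covered

# Hypothetical ADD executions: (rd, rs1, rs2, value of rs1, value of rs2).
executions = [
    (1, 2, 3, 5, 7),
    (1, 2, 3, 9, 4),          # same bins as the first execution
    (1, 1, 2, INT32_MAX, 1),  # overflow-prone operands, rd aliases rs1
]
covered = functional_coverage(executions)
```

All three executions would exercise the same ADD handler in a simulator, so code coverage alone could not tell them apart, while the third one adds two new functional bins.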

Relationship to RISC-V Torture and directed tests

The ISS paper compares coverage-guided fuzzing against the official RISC-V ISA tests and the RISC-V Torture testcase generator. The official RISC-V ISA tests are characterized as hand-written directed tests and therefore do not require a generation step. RISC-V Torture generates random tests. However, the paper reports that increasing the Torture testset from 1,000 to 10,000 tests only slightly increased coverage: Torture receives no execution feedback, so each test is generated independently of the previous ones.
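
A small calculation shows why independent random generation can saturate. It uses an idealized model that is not taken from the paper: each test is assumed to hit a given rare corner case independently with a hypothetical probability p, so the chance that at least one of n tests hits it is 1 - (1 - p)^n.

```python
def hit_probability(p, n):
    """Chance that at least one of n independent tests hits a case of probability p."""
    return 1.0 - (1.0 - p) ** n

p = 1e-6  # hypothetical per-test probability of reaching a rare corner case
small = hit_probability(p, 1_000)
large = hit_probability(p, 10_000)
```

Under this assumption, a tenfold larger testset raises the chance of reaching the corner case from roughly 0.1% to roughly 1%, which is still low. A generator that uses feedback can instead build on testcases that already came close.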

By contrast, the coverage-guided fuzzer uses execution feedback and is not constrained to a fixed instruction subset. The paper reports that it detected all previously shown errors and found six additional errors across ISS-UT, Spike, and Forvis. The conclusion states that fuzzing is useful for triggering corner cases and error cases and can complement other testcase generation techniques.

Key considerations

The testcase generation approaches described above differ along several axes:

Input: natural-language requirements in text-to-testcase generation, source code in some traditional automated approaches, or instruction encodings/programs in ISS fuzzing.
Feedback: coverage-guided fuzzing uses execution feedback to retain coverage-increasing testcases, while RISC-V Torture is described as generating tests independently without execution feedback.
Oracle or comparison mechanism: ISS verification can compare register/memory results between an ISS under test and reference ISSs.
Coverage goals: structural/code coverage can be supplemented with functional coverage to reach operand- and instruction-structure-related behaviors.
Limitations: generated tests may reveal mismatches that are not necessarily bugs, especially when the generator explores illegal or underspecified instruction sequences.