Metatesting your property-based tests
Cross-posted at the G-Research official blog.
Property-based testing is a great technique for getting the computer to do much of the drudgery when you’re writing tests. But it’s pretty easy to accidentally test much less than you thought. Here, we’ll go over an extremely easy technique you can use to automatically test your own test coverage.
This post is written in F#, an impure functional-first language. F#’s impurity is actually what makes this technique so easy; in a pure language like Haskell, the same goal has to be achieved using more machinery.
What is property-based testing?
“Property-based testing” refers to any method of testing which focuses around defining properties that your programs obey rather than specific test cases your programs satisfy. That is, the programmer writes down some kind of general desired fact about the program, and the computer creates test cases to try and disprove that fact.
The first and most popular implementation of a property-based testing library is Haskell’s QuickCheck. QuickCheck has many ports into other languages, and the one we’ll be using is FsCheck for F#.
Example
Here’s a little bare-bones implementation of an “interval set”: a set of integers, but stored in a space-efficient way if there are many consecutive runs of integers in the set.
The defining example is as follows.
|
|
(I’ve used a list
here rather than a Set
as the underlying data structure, just to make the code a bit shorter.)
Let’s say we want to test a small public API.
|
|
Here’s a small, trivial implementation - which does have a correctness bug, since this is a post about testing! To make the code smaller, it also has efficiency bugs: for example, it doesn’t even try to join intervals together if they grow to overlap each other, and it’s also possible for the same integer to appear twice in the interval-set.
|
|
Here are some tests, which all pass, so we commit the code and move on with our lives.
|
|
Then eventually someone comes along and points out that the following test fails!
|
|
Property-based testing is an answer to this problem of incomplete test coverage. It allows us to write more comprehensive tests with less drudgery: we specify directly the properties we wish to hold, rather than listing specific examples of those properties.
The obvious property we want here is that checking a list of integers for containment should be the same as testing the interval set for containment.
|
|
And FsCheck can check this for us:
|
|
After one run, I get this output:
|
|
FsCheck has found a failing case and has even found another one which is nice and small for us.
The bug
By the way, the bug was in the recursion for add'
:
|
|
The silent difficulty with property-based testing
How do we know we ever hit any interesting behaviour?
In the example above, our property is of the form “do something, then check whether the result is equal to a List.contains
call”.
But what if that List.contains
never outputs true
?
Then we’re just generating lots of uninteresting tests which all look like “is 59
in the set {1, 2, 3}
?”, and we’re never actually testing the important cases!
Happily, though, F# is impure, so we can actually assert whether this is happening extremely easily.
|
|
Now, the call site looks like this:
|
|
We’re now asserting at test time that at least 10% of our cases are indeed the “list does contain” case.
And in fact this test now becomes flaky!
|
|
We have successfully turned our own uncertainty into a flaky test.
What to do about it
We need to fix the flaky test.
One way to do this is to manipulate the generators FsCheck is using to construct test cases.
We can make a generator of int list * int
, where the list contains the int
about 50% of the time.
(I won’t go into how to do this now; the FsCheck documentation on Test Data is pretty comprehensive.)
Then when we use this new generator, we’re extremely likely to pass the “more than 10% of our generated test cases are of the positive case” check.
Indeed, we’re taking many independently-sampled cases (by default, FsCheck generates 100 of them), and we’ve tweaked the distribution so that these samples are of the positive case with 50% probability, and to pass the check we only need 10% of them to be testing the positive case; so the probability of this test failing after the generator is set up correctly is actually around \(1.5 \times 10^{-17}\) (being \(\mathbb{P}(X \le 10)\) when \(X\) follows the binomial distribution on 100 trials with probability \(\frac{1}{2}\) of success).
Depending on how many property-based tests you have, and how frequently they are run, you may want to make spurious failure less likely even than this; you can do so by skewing the generator even further towards positive cases, or by increasing the number of trials FsCheck runs.
How would you approach this in Haskell?
Haskell is a pure language, so you can’t just do the cheap answer we did here of mutating some external state from within the property.
Instead, QuickCheck comes with the function checkCoverage
, which you can set up with conditions you want to assert about test distributions.
Note that checkCoverage
is smarter than our cheap solution, though, because it does statistical tests to try and minimise the chance of spurious failures; above, I just did the calculation to discover the probability of spurious failure, but it’s certainly not always so easy.
FsCheck’s in-built support
FsCheck does have some methods (documented as FsCheck: Observing Test Case Distribution) which let you find things out about test distributions.
For example, Prop.trivial
allows you to mark an outcome as being “trivial” or as falling into one or more categories, and you can then get FsCheck to tell you the distribution of test cases across the various categories.
However, I honestly think it’s easier and more readable to do it yourself using the cheap method from this post.
Conclusion
When I implemented automated coverage checks this way in one of our internal projects, I discovered several parts of our API surface that just weren’t tested at all by property-based tests. It was trivial to fix up the test generator so that those parts were hit, but if we hadn’t put in this automated instrumentation, we would probably never have found out until our users started submitting bug reports about those under-tested parts of our API.
If you’re using property-based testing (which you should be!), you should pay attention to the distribution of test cases you’re getting from the generator library. Check that the tests are actually as comprehensive as you hoped. It’s extremely easy to do this in F#, and it takes no knowledge of the FsCheck library to do it; anyone who speaks only a little F# can come in and add this metatesting to an existing property-based test suite, even if they know nothing about how FsCheck works.