Closed
Description
Currently, we use the same mutation engine for `string` and `[]byte`. This tends to generate a lot of invalid UTF-8 strings that aren't usable for many use cases. While invalid UTF-8 is likely to turn up many shallow parser bugs, it may make the mutator less effective at finding more subtle, deeper bugs.
We should have an option to make the mutator only generate UTF-8. Some ideas:
- Create a `UTF8String` defined type. A fuzz function that accepts that as a parameter would only get valid UTF-8 strings.
- Only provide valid UTF-8 strings for `string` parameters. A function could request `[]byte` for random bytes, and that can still be converted to `string`.
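
As a minimal sketch of what targets have to do today (assuming Go 1.18+ fuzzing; `Parse` is a placeholder for the code under test), a target that needs valid UTF-8 currently has to filter for it by hand:

```go
package parser_test

import (
	"testing"
	"unicode/utf8"
)

// Parse stands in for the code under test.
func Parse(s string) {}

// Today a target that only cares about valid UTF-8 has to filter inputs
// itself; either idea above would let the mutator guarantee this instead.
func FuzzParse(f *testing.F) {
	f.Fuzz(func(t *testing.T, b []byte) {
		if !utf8.Valid(b) {
			t.Skip()
		}
		Parse(string(b))
	})
}
```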
cc @golang/fuzzing @findleyr
dsnet commented on Jun 22, 2021
IIUC, one of the eventual goals is to be able to fuzz structs. If so, I'm not sure either idea works as well for higher-order types, where the field type is already a `string` and can't be easily changed.

Perhaps there can be a method hung off of `testing.F` that configures parameters about the mutator?
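
For concreteness only, one possible shape of such a knob; `SetMutatorOptions`, the `fuzz` package, and `UTF8Strings` are invented names here, not an existing or proposed API:

```go
// Hypothetical sketch; testing.F has no such method today.
func FuzzParse(f *testing.F) {
	// Ask the mutator to only generate valid UTF-8 for string parameters,
	// including string fields of structs.
	f.SetMutatorOptions(fuzz.Options{UTF8Strings: true})

	f.Fuzz(func(t *testing.T, s string) {
		Parse(s)
	})
}
```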
findleyr commented on Jun 22, 2021
I like this idea. We could also have `ASCIIString`. I think it would be convenient to have such a mechanism, for exactly the reasons you list, but I'm wary of making it the default. Despite my comment about the readability of non-UTF-8 string examples, consuming arbitrary bytes DID turn up a bug that wouldn't have been found with valid characters alone (#46855).
I wonder if fields could be explicitly ignored by the mutator, perhaps with a struct tag, so that they may be substituted by other means. I.e.

```go
type S struct {
	Count int
	Input string `fuzz:"ignore"`
}

func f(t *testing.T, s S, input fuzz.UTF8String) {
	s.Input = string(input)
	...
}
```
Edit: I already dislike this last suggestion, at least as I manifested it. Fuzzing, being testing, should not add struct tags to non-test objects.
CAFxX commented on Jun 22, 2021
Or, we could start by trying to have the mutator automatically balance exploration and exploitation based on which kind of string (e.g. all valid UTF-8 runes, a mix of valid and invalid UTF-8 runes, binary data) yields the best results, in terms of defects discovered, on that specific target.
So e.g. if for that target the fuzzer detects that strings of all valid UTF-8 runes unearth more defects, it will progressively bias the mutator to generate more of those and fewer of the other types of strings. The benefit of this approach is that it requires no changes to the code being fuzzed and no additional knobs (although it may be slightly slower).
Later on, if needed, we could easily add a way to disable this auto-tuning by passing an explicit configuration as suggested by @dsnet.
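
A toy sketch of what such auto-tuning could look like, independent of the real mutator's internals; the strategy type and the additive weighting scheme are illustrative assumptions, not how the actual fuzzer is implemented:

```go
package fuzztuning

import "math/rand"

// stringStrategy stands in for one way of mutating a string
// (all valid UTF-8, mixed valid/invalid, raw binary, ...).
type stringStrategy func(s string) string

// tuner picks strategies in proportion to how often each one has led to
// new coverage or defects, so the mix drifts toward whatever works best
// for the specific target being fuzzed.
type tuner struct {
	strategies []stringStrategy
	weights    []float64 // one weight per strategy, all starting equal
}

// pick returns the index of a strategy, chosen with probability
// proportional to its current weight.
func (t *tuner) pick(r *rand.Rand) int {
	total := 0.0
	for _, w := range t.weights {
		total += w
	}
	x := r.Float64() * total
	for i, w := range t.weights {
		x -= w
		if x < 0 {
			return i
		}
	}
	return len(t.weights) - 1
}

// reward is called when strategy i produced an input that expanded
// coverage or triggered a crash, biasing future picks toward it.
func (t *tuner) reward(i int) {
	t.weights[i]++
}
```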
rolandshoemaker commented on Jun 22, 2021
Essentially this is the concept behind some of the advanced fuzzing strategies discussed in #46507. Implementing input prioritization methods which focus on inputs which produce more coverage (among other characteristics) naturally biases towards mutations that produce inputs that the program understands, without actually having to alter the mutator at all.
That said, there have been some discussions previously about biasing certain mutators, i.e. in order to prioritize mutators which reduce the size of inputs over those that increase it, etc. In a similar vein we may want to bias the mutators for strings to produce inputs with valid UTF-8 over invalid UTF-8. I'd like to do some evaluation with a string-based target to see how much of an effect on coverage this produces.
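
One cheap way to bias a string mutator toward valid UTF-8, sketched under the assumption that mutation happens at the byte level first (again, not the actual mutator code), is to repair invalid sequences after mutating:

```go
package fuzztuning

import "strings"

// repairUTF8 replaces any invalid byte sequences in a mutated string with
// the Unicode replacement character. Applying it to only a fraction of
// mutations would bias the corpus toward valid UTF-8 without ruling out
// invalid inputs entirely.
func repairUTF8(mutated string) string {
	return strings.ToValidUTF8(mutated, "\uFFFD")
}
```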
katiehockman commented on Jun 23, 2021
I don't think we should do this. There is no property in the language which states that strings must be UTF-8, and we shouldn't special-case this for fuzzing.
Rob made an important point:
This stuck out to me because it demonstrates one of the biggest benefits of fuzzing: the mutator is able to generate inputs to `f.Fuzz` that are vastly different from what you may have assumed is "valid input". Limiting to UTF-8 by default would take away from this.
A few general thoughts:
katiehockman commented on Jun 29, 2021
Is there anything else to discuss on this issue, or are we okay with closing it?
jayconrod commented on Jun 29, 2021
Let's close. I think we're in agreement that UTF-8 by default is not a good idea. Custom mutators or defined types are appealing, but let's open an issue for those when we have a design.