Closed
Description
Currently, we use the same mutation engine for `string` and `[]byte`. This tends to generate a lot of invalid UTF-8 strings that aren't usable for many use cases. While invalid UTF-8 is likely to turn up many shallow parser bugs, it may make the mutator less effective at finding more subtle, deeper bugs.
We should have an option to make the mutator only generate UTF-8. Some ideas:
- Create a `UTF8String` defined type. A fuzz function that accepts that as a parameter would only get valid UTF-8 strings.
- Only provide valid UTF-8 strings for `string` parameters. A function could request `[]byte` for random bytes, and that can still be converted to `string`.
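
As a minimal sketch of what targets have to do today (assuming Go 1.18+ fuzzing; `Parse` is a placeholder for the code under test), a target that needs valid UTF-8 currently has to filter for it by hand:

```go
package parser_test

import (
	"testing"
	"unicode/utf8"
)

// Parse stands in for the code under test.
func Parse(s string) {}

// Today a target that only cares about valid UTF-8 has to filter inputs
// itself; either idea above would let the mutator guarantee this instead.
func FuzzParse(f *testing.F) {
	f.Fuzz(func(t *testing.T, b []byte) {
		if !utf8.Valid(b) {
			t.Skip()
		}
		Parse(string(b))
	})
}
```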
cc @golang/fuzzing @findleyr
dsnet commented on Jun 22, 2021
IIUC, one of the eventual goals is to be able to fuzz structs. If so, I'm not sure either idea works as well for higher-order types, where the field type is already a `string` and can't be easily changed.

Perhaps there can be a method hung off of `testing.F` that configures parameters about the mutator?
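
For concreteness only, one possible shape of such a knob; `SetMutatorOptions`, the `fuzz` package, and `UTF8Strings` are invented names here, not an existing or proposed API:

```go
// Hypothetical sketch; testing.F has no such method today.
func FuzzParse(f *testing.F) {
	// Ask the mutator to only generate valid UTF-8 for string parameters,
	// including string fields of structs.
	f.SetMutatorOptions(fuzz.Options{UTF8Strings: true})

	f.Fuzz(func(t *testing.T, s string) {
		Parse(s)
	})
}
```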
findleyr commented on Jun 22, 2021
I like this idea. We could also have `ASCIIString`. I think it would be convenient to have such a mechanism, for exactly the reasons you list, but I'm wary of making it the default. Despite my comment about the readability of non-UTF-8 string examples, consuming arbitrary bytes DID turn up a bug that wouldn't have been found with valid characters alone (#46855).
I wonder if fields could be explicitly ignored by the mutator, perhaps with a struct tag, so that they may be substituted by other means. I.e.

```go
type S struct {
	Count int
	Input string `fuzz:"ignore"`
}

func f(t *testing.T, s S, input fuzz.UTF8String) {
	s.Input = string(input)
	...
}
```
Edit: I already dislike this last suggestion, at least as I manifested it. Fuzzing, being testing, should not add struct tags to non-test objects.
CAFxX commented on Jun 22, 2021
Or, we could start by trying to have the mutator automatically balance exploration and exploitation based on which kind of string (e.g. all valid UTF-8 runes, a mix of valid and invalid UTF-8 runes, binary data) yields the best results, in terms of defects discovered, on that specific target.
So e.g. if for that target the fuzzer detects that strings of all valid UTF-8 runes unearth more defects, it will progressively bias the mutator to generate more of those and fewer of the other types of strings. The benefit of this approach is that it requires no changes to the code being fuzzed and no additional knobs (although it may be slightly slower).
Later on, if needed, we could easily add a way to disable this auto-tuning by passing an explicit configuration as suggested by @dsnet.
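
A toy sketch of what such auto-tuning could look like, independent of the real mutator's internals; the strategy type and the additive weighting scheme are illustrative assumptions, not how the actual fuzzer is implemented:

```go
package fuzztuning

import "math/rand"

// stringStrategy stands in for one way of mutating a string
// (all valid UTF-8, mixed valid/invalid, raw binary, ...).
type stringStrategy func(s string) string

// tuner picks strategies in proportion to how often each one has led to
// new coverage or defects, so the mix drifts toward whatever works best
// for the specific target being fuzzed.
type tuner struct {
	strategies []stringStrategy
	weights    []float64 // one weight per strategy, all starting equal
}

// pick returns the index of a strategy, chosen with probability
// proportional to its current weight.
func (t *tuner) pick(r *rand.Rand) int {
	total := 0.0
	for _, w := range t.weights {
		total += w
	}
	x := r.Float64() * total
	for i, w := range t.weights {
		x -= w
		if x < 0 {
			return i
		}
	}
	return len(t.weights) - 1
}

// reward is called when strategy i produced an input that expanded
// coverage or triggered a crash, biasing future picks toward it.
func (t *tuner) reward(i int) {
	t.weights[i]++
}
```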
rolandshoemaker commented on Jun 22, 2021
Essentially this is the concept behind some of the advanced fuzzing strategies discussed in #46507. Implementing input prioritization methods which focus on inputs which produce more coverage (among other characteristics) naturally biases towards mutations that produce inputs that the program understands, without actually having to alter the mutator at all.
That said, there have been some discussions previously about biasing certain mutators, i.e. in order to prioritize mutators which reduce the size of inputs over those that increase it, etc. In a similar vein we may want to bias the mutators for strings to produce inputs with valid UTF-8 over invalid UTF-8. I'd like to do some evaluation with a string-based target to see how much of an effect on coverage this produces.
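
One cheap way to bias a string mutator toward valid UTF-8, sketched under the assumption that mutation happens at the byte level first (again, not the actual mutator code), is to repair invalid sequences after mutating:

```go
package fuzztuning

import "strings"

// repairUTF8 replaces any invalid byte sequences in a mutated string with
// the Unicode replacement character. Applying it to only a fraction of
// mutations would bias the corpus toward valid UTF-8 without ruling out
// invalid inputs entirely.
func repairUTF8(mutated string) string {
	return strings.ToValidUTF8(mutated, "\uFFFD")
}
```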
katiehockman commented on Jun 23, 2021
I don't think we should do this. There is no property in the language which states that strings must be UTF-8, and we shouldn't special-case this for fuzzing.
Rob made an important point:
This stuck out to me because it demonstrates one of the biggest benefits of fuzzing: the mutator is able to generate inputs to `f.Fuzz` that are vastly different from what you may have assumed is "valid input". Limiting to UTF-8 by default would take away from this.
A few general thoughts:
katiehockman commented on Jun 29, 2021
Is there anything else to discuss on this issue, or are we okay with closing it?
jayconrod commented on Jun 29, 2021
Let's close. I think we're in agreement that UTF-8 by default is not a good idea. Custom mutators or defined types are appealing, but let's open an issue for those when we have a design.