-
Notifications
You must be signed in to change notification settings - Fork 356
Two possible improvements to type coercion #379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@shmax - type coercion is yours; what do you think of this? IMO both are excellent ideas. |
Will definitely look into these issues this weekend. My first 10-second impression is that coercing to null for a string would violate the type (as null is not a string), so at best it would only be valid if you had some kind of mixed type in your schema. Not sure about the second point, but I can tell you that the current logic we have in place doesn't really have any opinion about the order, so adding some logic like you suggest probably would be no worse. More soon. Thanks for the contribution. |
Re 10 second impression - isn't that the point of the coercion feature in the first place? If coercing |
Hmm, well I don't think he really gave us enough detail about what he wants for the null case, but I hope we can agree that coercing string "null" to actual null for type string is a non-starter, because again, you would be coercing to something that is the wrong type. And further, I think you would be asking for a lot of bugs in your client app. Suppose you have a form for validating user names on your social site and a user happens to want to call himself "null". Perfectly legitimate string, but it would get converted to null by the coercion feature and cause a black hole somewhere. The only coercion case involving strings and the type "null" that I can think of that would make sense would be to coerce empty string to type null. Our pals over at ajv seem to agree: http://take.ms/QYQEo Anyway, I would be happy to add that feature if it would help. Will get back to you on the other point. |
Oh, I certainly wasn't suggesting that - we agree there completely! If the schema demands a string, and the document being validated supplies a string, then it should be left well alone. What I meant (and what I hope the OP was asking for) was that if the schema demands a null (i.e. defines something with the explicit type |
For the record, I disagree with ajv on not coercing |
Do we care about coercion reversibility? That could be an argument against coercing |
Hmm, I just don't know. You could be right. One thing to consider is that the stringification of echo http_build_query(array(
'foo'=>'bar',
'baz'=>'boom',
'cow'=>null
));
// "foo=bar&baz=boom but if you want to force it to a string, it comes out in caps: var_export(null); // "NULL" Javascript does: console.log(encodeURIComponent(null)); // "null" Python (not an expert, apologies): from urllib.parse import urlencode
url = urlencode({'pram1': 'foo', 'param2': None})
print(url) # "param2=None&pram1=foo" So I guess I wonder if teaching json-schema about "null" might be pushing things a little too far away from language agnosticism... |
I do feel that this one is your call - it's 'your' feature, you did all the initial research and work around it, and it's not something I use myself. My personal opinion is that we should coerce when (and only when) the schema defines a type, validation would fail without coercion, and the mapping from the input data to the correct value of the schema type is obvious. But I'd like to defer to your preference here, whatever that may be.
That's a really good point, although again I feel it falls into the same basket as coercing |
That was exactly my point. "null" and "NULL' are equivalent, so if we support coercing "null" then sooner or later someone is going to ask us to coerce 'NULL', and then after that someone will ask for "None", then "nil", and on, and on. Now we have this wad of strings that we will consider to be "null", thus weakening the "null" type's ability to do its job (suppose you're on a Python box, and a request comes in from some PHP client with "null". Rather than catch what is clearly an invalid value in the Python environment, it passes coercion with flying colors).
Because "true" and "false" are fairly universal. I don't know of any major (non-deliberately oddball) languages that don't have them as keywords. "null" is a keyword in some languages, but not all, and even the ones that do use it don't necessarily stringify it when building query params. Rather than beat it to death, why don't we start with coercing empty string to null, and go from there? |
I'll try to explain my use case a bit. Suppose you have a list of entities, whose attributes comprises the (integer) attribute "rank". The list is accessible at entities.html and by default lists all the entities - i.e. there is no filter set on "rank". You can optionally set or remove a filter on "rank" through a GET form, which generates the following urls:
I'd like to coerce and validate the query string using the following schema: {
"properties": {
"rankisgreaterthan": {
"default": null,
"type": ["null", "integer"]
}
},
"required": ["rankisgreaterthan"],
"type": "object"
} For this to work empty-string has to be coerced to null. There is no ambiguity, however, because null is explicitely listed among the expected types. |
Okay, so that's coercing empty string to null, which should be fine, right? |
I just read your last comments. I concur with your preference for '' :) |
This sounds like a great idea - you're making really good points re the various strings that might represent null, so maybe that particular case is best left for now, until it's had more thought / someone specifically asks for it. |
@gtuccini I'm still mulling this one. I guess I'm a little on the fence. Let's say we have a form on a website called "age". You can either enter a number, like "46", or a string representing your birth date, such as "1/1/1971". So, in my schema, I have this:
On my back end I turn the coercion option on and validate the form. Post-coercion, I'm expecting either an integer Can you provide any realistic counter examples? |
A user enters Personally, I feel that coercing an already-valid type opens up far too much room for ambiguity and getting things wrong - there's a point where it's more appropriate to validate this kind of thing using If you feel strongly in favour of coercing already-valid types though, what about adding e.g. |
Heh, okay, so you're shooting holes in my sample test case. Fair enough. Let's try a different one. We have a field called "month". You can enter a number, such as "1", or a name, such as "February". We have the same situation, and we don't have to bring |
Another idea would be to stop validating once a type validates (it might already work this way, I'll have to check). Then the schema designer can sort of game the coercion rules to his taste by putting his list of types in the order that he would prefer them to have precedence. In other words, you could use a system like this to short-circuit in your first example by putting "string" first. Someone like me who would rather coerce boolean straightaway would put it first in the list. |
That still requires an understanding of what a month is, which JSON schema does not have. How do you propose to distinguish between your month example and the age example? Because you cannot make a rational decision about what to do with the value unless you can clearly and consistently know what the user is expecting to receive. In lieu of such understanding, assuming that schema-compliance is sufficient seems reasonable, and reduces the potential for confusion.
I'll need to think about this one a bit more, but my initial instinct is deeply opposed to that kind of thing. IMO extending the schema to include a |
To clarify - nothing wrong with stopping once a type validates; it's the "gaming things by tweaking the order" that I have an issue with. |
A thought - do you know whether there has been anything regarding type-coercion proposed for the official spec? Because if you want to have this kind of thing widely adopted, and it hasn't already been discussed, proposing it there may be worthwhile. |
I don't follow. Post-coercion, I'm expecting |
You're deeply opposed to lists of things, or lists of things in preferential order? |
I'll try to rephrase. If the user has supplied a schema that expects a type of either If the value supplied is something other than an integer or a string, but can be coerced in an obvious way to one of those types to avoid failing validation, then this is a good thing, and as it's behind If the value supplied is an integer or a string (i.e. what the user asked for), and we mess with it anyway, then we're second-guessing the user's own business logic, which we should never, ever do. They have not told us what they intend to do with the values, there is no way for us to infer what they want to do with the values, and understanding their business logic is well outside the scope of what JSON schema is for. Randomly mutating valid data without being asked, especially when we don't have a reasonable basis for doing so, is not something a validator should ever do, under any circumstances. |
I'm deeply opposed to "gaming the system" by trying to hack business logic into a schema-validator via the element order. |
Really? I guess I'm having trouble mustering any outrage over it, mainly because it's already happening in some form; I believe we're walking over the types in the order that they're encountered, and the first one that can be coerced will be coerced, which will affect the following type's chances for doing further coercion. The nice thing about the idea is that I believe both of our use cases can be supported (if we stop looping over our types once something validates--really need to look at the code), at least in my month example. If I want my integer coercion to always trump the string, then I make sure it's first in the list, then things work the way I like. If @gtuccini wants it the other way, then he just puts |
"Because we're already doing it" isn't synonymous with "we should be doing it" - IMO if we find a problem (and I feel that this one is quite serious), we should fix it, just like any other bug.
Then we shouldn't be doing that. It's violating what the schema says is acceptable - if the schema says something is OK, then we should not be second-guessing it.
This is just taking advantage of a bug to obtain intended behavior. It works, but there be dragons down that road - once you start doing it, then you can't ever fix stuff, because you don't know what behavior might be relying on that bug. It's better to fix the bug and implement the desired behavior properly. |
Well, I'm not totally sure I agree yet that there is a bug. I don't have a huge sample set to work with here, or anything, but the way we have things now aligns with the two other coercive validation libraries I'm familiar with. Can we dial back the intensity and outrage a little bit and just sort of take it easy until we have a few more use cases from @gtuccini? It is still my day off, after all :) |
Certainly. I feel strongly about it, but quite happy to let this sit for a while :-). |
Wha?! Actually, I was hoping it would take all night so I could go to karaoke and maybe get some pizza. But let's take a look-see... |
I haven't done the additional coercions from the ajv matrix yet - working on that now. But the bug should be fixed, and there's a new test-case that checks for it. |
I still don't necessarily agree that we have a bug, here. Let's just say that the behavior has changed as agreed, okay? |
OK - I have changed the title. How's that? |
Have also updated the description. |
I don't know how typical my project is, but I just did a little accounting, and out of 151 schema files I could only find two instances where I have an array of types and one of the possible values is a string: "reissueOf" => [
"type" => [
"number",
"string"
],
"description" => "Id of some other sku that this toy is a reissue of, if any."
], In this case what I'm validating is a numeric id for a record in a MYSQL table, so it really doesn't make any difference if it gets coerced or not. As a matter of fact, there's no reason to keep the "string" option in there. So no big deal in this case. Just listing one real use case. Would be cool to see others. |
I have a lot of schemata like this. {
"properties": {
"code": {
"default": null,
"format": "taxCode",
"type": ["null", "string"]
},
"columns": {
"default": ["name", "surname"],
"items": {
"enum": ["code", "id", "name", "surname"]
},
"type": "array"
},
"direction": {
"default": "ASC",
"enum": ["ASC", "DESC"]
},
"id": {
"default": null,
"type": ["integer", "null"]
},
"limit": {
"default": 100,
"enum": [10, 100, 1000],
"type": "integer"
},
"name": {
"default": null,
"type": ["null", "string"]
},
"offset": {
"default": 0,
"type": "integer"
},
"orderBy": {
"default": "surname",
"enum": ["code", "id", "name", "surname"]
},
"surname": {
"default": null,
"type": ["null", "string"]
}
},
"required": ["code", "columns", "id", "limit", "name", "offset", "orderBy", "surname"],
"type": "object"
} The subject comes from the query string, which is generated by a GET form. Currently I'm just converting all the empty strings to null prior to performing the validation. To use the new features I'd need either a "typePreference" attribute or the equivalent of an "if": "name": {
"anyOf": [
{
"minLength": 1,
"type": "string"
},
{
"type": "null"
}
]
} or, through a more general construct: "name": {
"anyOf": [
{
"not": {
"enum": [""]
},
"type": "string"
},
{
"type": "null"
}
]
} Am I right? |
That is a bit troubling. We started by providing the ability to coerce empty string to null, but then our second change rendered the first change moot. At least for this use case, it sounds like you would have been happier without the second change. I'm wondering if we should have taken your earlier suggestion to have something along the lines of a |
Oh, I do have one question: why bother specifying the "null" type for your string? When you're processing your data after validation, empty string and null are equivalent when it comes to a truthiness check, so I don't see any reason to want null, and if you keep your type as "string" you can do concatenation and other operations in good faith without having to check for non-null... |
The value eventually ends in the arguments of a prepared statement which uses "IS NULL", so its type is relevant. I'll give you some more informations as soon as I'm be back from work. |
I see. And changing the prepared statement to do a truthiness check is not an option?
|
Unfortunately I'm not in charge of writing the code which interacts with the db :). |
@shmax @gtuccini Having read the above, I think that maybe an early / opportunistic coerce flag is in order. Such cases are hopefully rare, but @gtuccini won't be the only one to run into this kind of issue, so having an option to obtain the previous behavior when asked for is probably a good thing. I feel very strongly that the not-coercing-valid-values is the right thing to be doing, but as long as it's the default, I see no issue with also supporting the old behavior (behind a flag). @shmax - would you consider it an acceptable solution if I added such a flag to the current PR? Or would you like to discuss further before deciding what to do? |
Sounds good to me. While you're at it and totally on fire, what do you think about breaking up the coercive tests into smaller, more meaningful chunks? |
@shmax I've been thinking about it, and I'm concerned that implementing something like Thoughts? |
Agreed. I don't think we should be adding new schema keywords. |
@erayd The flag is definitely enough and, all in all, I like it better than a non standard keyword. Thanks for getting this done :) However, I think that a standard "typePreference" (or rather "defaultType") keyword would be useful to support type coercion and to inform the ui about the input widget to show by default when multiple types are allowed. Would you find it useful if I proposed the keyword (and, more broadly, to add some recommendations about type coercion) at https://github.com/json-schema-org/json-schema-spec? |
@gtuccini Having a It's not something I want to push - I have enough on my plate already - but if you want to take ownership of that and propose it, by all means go ahead :-). I'm quite happy to implement it if they include it in the spec. |
* Improve performance - don't loop over everything if already valid * Don't coerce already-valid types (fixes #379) * Add remaining coercion cases & rewrite tests * Add all remaining coercion cases from ajv matrix * Rewrite the coercion tests to tidy things up a bit * Add CHECK_MODE_EARLY_COERCE If set, falls back to the old behavior of coercing to the first compatible type, regardless of whether another already-valid type might be available. * Add multiple-type test that requires coercion * \JSON_PRETTY_PRINT doesn't exist in PHP-5.3, so work around this * Various PR cleanup stuff * Fix whitespace * Turn $early into $extraFlags * Change "string" to "ABC" in string test * Update README.md description of CHECK_MODE_EARLY_COERCE * Move loop after complex tests definition * Move test #39 to grid #15
* Improve performance - don't loop over everything if already valid * Don't coerce already-valid types (fixes jsonrainbow#379) * Add remaining coercion cases & rewrite tests * Add all remaining coercion cases from ajv matrix * Rewrite the coercion tests to tidy things up a bit * Add CHECK_MODE_EARLY_COERCE If set, falls back to the old behavior of coercing to the first compatible type, regardless of whether another already-valid type might be available. * Add multiple-type test that requires coercion * \JSON_PRETTY_PRINT doesn't exist in PHP-5.3, so work around this * Various PR cleanup stuff * Fix whitespace * Turn $early into $extraFlags * Change "string" to "ABC" in string test * Update README.md description of CHECK_MODE_EARLY_COERCE * Move loop after complex tests definition * Move test jsonrainbow#39 to grid jsonrainbow#15
@bighappyface Can this issue be closed? The fix is merged in 6.0.0-dev, and will not be backported to avoid compatibility-breaking changes. |
It would be useful to have a coercible string representation for null. "null" or '' would work just fine.
I think also that the coercion should happen only when the subject doesn't match any of the allowed types.
Currently, given the following subject
and the following schema
"value" is coerced to boolean, even if its type -- string -- is among the allowed ones.
What do you think?
The text was updated successfully, but these errors were encountered: