rust-url should perform percent-encoding normalization on input #149
This came up when discussing #148, and is almost certainly dependent on that PR.
(Also, I'd be interested in implementing this if the maintainers agree with it.)
In general, rust-url implements https://url.spec.whatwg.org/, not RFC 3986. For parsing the path component specifically, the algorithm specified at https://url.spec.whatwg.org/#path-state percent-encodes some characters if they’re not already encoded in the input, but leaves existing percent-encoded sequences unchanged. The introduction of https://tools.ietf.org/html/rfc3986#section-6 discusses the many ways to define URL equivalence.
https://url.spec.whatwg.org/#url-equivalence defines one specific algorithm, which is what is implemented here. Do you think it would be useful to have additional methods for other equivalence algorithms?
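The path-state behavior described above can be illustrated with a minimal sketch. This is not rust-url's actual implementation, and `is_path_safe` is a simplified stand-in for the spec's path percent-encode set; it only shows the key property that new unsafe characters get encoded while existing `%XX` sequences pass through untouched.

```rust
// Simplified stand-in for the spec's path percent-encode set (hypothetical,
// not the exact set from https://url.spec.whatwg.org/#path-state).
fn is_path_safe(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b"/-._~!$&'()*+,;=:@%".contains(&b)
}

// Sketch of the described behavior: percent-encode characters outside the
// safe set, but leave the input's existing %XX sequences as they are.
fn encode_path(input: &str) -> String {
    let mut out = String::new();
    for &b in input.as_bytes() {
        if is_path_safe(b) {
            // '%' is in the safe set, so existing %XX sequences are copied
            // through verbatim rather than double-encoded or normalized.
            out.push(b as char);
        } else {
            out.push_str(&format!("%{:02X}", b));
        }
    }
    out
}

fn main() {
    // The space gets encoded, but the pre-existing %6F is left unchanged
    // even though it encodes the unreserved character 'o'.
    assert_eq!(encode_path("/a b/%6F"), "/a%20b/%6F");
    println!("{}", encode_path("/a b/%6F"));
}
```

This is exactly why `/a b/%6F` and `/a%20b/o` can parse to unequal URLs despite identifying the same resource under RFC 3986 equivalence.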
In a web server where I map URL paths to actual filepaths, I need to normalize the percent-encoding.
For a web server, though, don’t you want to decode all percent-encoding rather than normalize it?
Decoding a particular set of octets is just what percent-encoding normalization is. Decoding all characters is technically incorrect, since you might e.g. decode %2F into a path separator.
But you do want to decode most reserved characters.
My mental model is that slash-separated components of an URL’s path should be URL-decoded separately. Then maybe the serialization to a filesystem path should "fail" (which might be where Apache responds with 404) when components contain "forbidden" characters? That’s not only the separators but also NUL, and on Windows some other punctuation.
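That mental model can be sketched roughly as follows. `decode_segment` is a hypothetical helper, not a rust-url API: it percent-decodes one slash-separated segment on its own and fails when the decoded bytes contain a "forbidden" byte (here just NUL and `/`; a real server would check more, especially on Windows).

```rust
// Hypothetical helper: decode a single path segment, rejecting decoded
// bytes that would change the path structure or are forbidden in names.
fn decode_segment(seg: &str) -> Option<Vec<u8>> {
    let bytes = seg.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            // A '%' must be followed by two hex digits; otherwise fail.
            let hi = *bytes.get(i + 1)?;
            let lo = *bytes.get(i + 2)?;
            let hex = |c: u8| (c as char).to_digit(16).map(|d| d as u8);
            out.push(hex(hi)? * 16 + hex(lo)?);
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    // Reject decoded separators and NUL rather than letting them alter the
    // path structure -- this is where a server might answer 404.
    if out.contains(&b'/') || out.contains(&0) {
        return None;
    }
    Some(out)
}

fn main() {
    // "%2Fetc" decodes to "/etc", which would escape the segment: reject.
    assert_eq!(decode_segment("%2Fetc"), None);
    // "%6Fk" decodes harmlessly to "ok".
    assert_eq!(decode_segment("%6Fk"), Some(b"ok".to_vec()));
}
```

The point of decoding per segment is that an encoded `%2F` can never be confused with a real separator, because separators are split on before any decoding happens.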
Right, so maybe that's not such a good use case (I currently do it like nginx). I still think the example given in the OP, and its result, is basically a bug.
Again, https://tools.ietf.org/html/rfc3986#section-6 defines various ways equivalence could be implemented, and says “it depends”. https://url.spec.whatwg.org/#url-equivalence picks one. Sometimes you want a different algorithm, but I don’t think that’s a good enough reason to change the default.
You're completely right. Would it be sensible to implement some of those normalization methods as part of this crate?
Sure.
I’m not sure what you mean here. Rust doesn’t provide an “initializer” that can run code whenever a struct is constructed. But the typical way to get a `Url` is by parsing a string.
What’s a bug is really a question of perspective. My primary motivation is to make Servo compatible with the web. If some random website works in other browsers but not in Servo, guess who’s gonna be blamed. Even if it’s because the other browsers happen to agree on a ridiculous behavior, and what Servo does is “obviously superior”.
Interesting that you mention other browsers: Firefox seems to automatically correct entered URLs using percent-encoding normalization (but not case normalization).
Note that the URL bar has much more magic in it than plain URL parsing.
I've edited my previous post accordingly. However, mitmproxy reveals that the normalization only occurs in Firefox's networking dev tool and the URL bar, not in the actual HTTP request.
The status bar (the tooltip when hovering over a link) shows yet another encoding behavior regarding the slash, different from both the URL bar and the actual HTTP request. Its encoding behavior is equivalent to the Firefox dev tools'.
Yes, that's what I meant. I'm now sufficiently confused to not know what the 90% use case is. I guess the current behavior of not fiddling too much with normalization during parsing makes sense, although I think it'd be valuable to "nudge" users toward a sensible normalization strategy if they wish to normalize. Concretely, I have a specific API in mind.
File-serving isn't the only game in town. Sometimes the distinction matters.
Percent-encoding doesn't conflate %2f and /.
@untitaker I don't understand your comment. Will you please elaborate?
@cmbrandenburg Percent-encoding normalization is about decoding characters that never need to be encoded (the unreserved characters). For example, an unreserved character like `o` may be unnecessarily encoded as `%6F`. Percent-encoding normalization is about converting the latter form into the former one.
All of this is specified in https://tools.ietf.org/html/rfc3986#section-6.2.2.2.
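For concreteness, here is a minimal sketch of that normalization (a hypothetical helper, not a rust-url API): decode `%XX` sequences that encode unreserved characters, and uppercase the hex digits of the sequences that remain (per RFC 3986 §6.2.2.1).

```rust
// RFC 3986 §2.3 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~".
fn is_unreserved(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b"-._~".contains(&b)
}

// Sketch of RFC 3986 §6.2.2.2 percent-encoding normalization (plus the
// §6.2.2.1 uppercasing of hex digits). ASCII-only for simplicity.
fn normalize_percent_encoding(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = String::new();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            let hex = |c: u8| (c as char).to_digit(16);
            if let (Some(hi), Some(lo)) = (hex(bytes[i + 1]), hex(bytes[i + 2])) {
                let decoded = (hi * 16 + lo) as u8;
                if is_unreserved(decoded) {
                    // Unreserved: never needs encoding, so decode it.
                    out.push(decoded as char);
                } else {
                    // Reserved or other: keep encoded, uppercase hex digits.
                    out.push_str(&format!("%{:02X}", decoded));
                }
                i += 3;
                continue;
            }
        }
        out.push(bytes[i] as char);
        i += 1;
    }
    out
}

fn main() {
    // %6F encodes 'o' (unreserved): decoded. %2f encodes '/' (reserved):
    // kept encoded, but with uppercased hex digits.
    assert_eq!(normalize_percent_encoding("f%6F%2fo"), "fo%2Fo");
}
```

Note this deliberately leaves `%2F` encoded, matching the earlier point in this thread that decoding everything would conflate an encoded slash with a real path separator.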
@untitaker Ah, now I understand. My original comment was in response to @SimonSapin's earlier comment.
My point is to remind everyone that there exist Web APIs out there that do assign consistent meaning to such percent-encoded sequences. But now I see that my #154 is only tangentially related to this issue.
I’m gonna close this as “working as intended” per https://url.spec.whatwg.org/. Feel free to file an issue at https://github.com/whatwg/url about changing the spec. You may have a case here since at least Chromium does some percent-decoding: http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4033 (Interop with existing browsers and "describing reality" are much stronger arguments in WHATWG-land than "ridiculous". Many ridiculous things on the web can’t be removed/changed without breaking lots of existing content.)
(unless normalization is out of scope for this lib)

https://tools.ietf.org/html/rfc3986#section-6.2.2.2

`o` is an unreserved character, so its percent-encoded form `%6F` should be equivalent.