rust-url should perform percent-encoding normalization on input #149

untitaker · 2015-11-28T23:15:14Z

(unless normalization is out of scope for this lib)

let a = Url::parse("http://example.com/l%6Fgin/");
let b = Url::parse("http://example.com/login/");
a == b  # false

https://tools.ietf.org/html/rfc3986#section-6.2.2.2

o is a unreserved character, so its percent-encoded form %6F should be equivalent.

The text was updated successfully, but these errors were encountered:

untitaker · 2015-11-28T23:15:51Z

This came up when discussing #148, and is almost certainly dependent on that PR

untitaker · 2015-11-29T14:23:33Z

(also I'd be interested in implementing this if the maintainers agree with it)

SimonSapin · 2015-12-04T20:27:26Z

In general, rust-url implements https://url.spec.whatwg.org/, not RFC 3986.

For parsing the path component specifically, the algorithm specified at https://url.spec.whatwg.org/#path-state percent-encodes some characters if they’re not encoded in the input, but leaves existing percent-encoded sequences unchanged.

The introduction of https://tools.ietf.org/html/rfc3986#section-6 discusses the many ways to define URL equivalence:

URI comparison is performed for some particular purpose. Protocols
or implementations that compare URIs for different purposes will
often be subject to differing design trade-offs in regards to how
much effort should be spent in reducing aliased identifiers.

https://url.spec.whatwg.org/#url-equivalence defines one specific algorithm, which is what is implemented in PartialEq for Url (the == operator).

Do you think it would be useful to have additional methods for Url equivalence with more normalization? How should they behave exactly?

untitaker · 2015-12-04T21:31:37Z

A percent_normalize(&mut self) would be useful I think:

These URIs should be normalized by decoding any percent-encoded octet that corresponds to an unreserved character, as described in Section 2.3.

So percent_normalize would do that normalization in-place (for all URL parts), just because that comes with no new memory allocations (potentially).

In a web server where I map URL paths to actual filepaths, I need to normalize /a%40b/ into the same thing as /a@b/. Either of them are valid paths, and I encounter either of them in interaction with real clients. I currently deal with this by running percent_decode on the serialized path.

SimonSapin · 2015-12-04T21:55:39Z

I was thinking of a equivalent(&self, other: &Url) -> bool comparison, but in place normalization would be possible too.

For a web server, though, don’t you want to decode all percent-encoding rather than normalize it?

untitaker · 2015-12-04T21:59:37Z

Decoding a particular set of octets is just what percent-encoding normalization is. Decoding all characters is technically incorrect since you might e.g. decode %2f into / and change the semantics of the path.

SimonSapin · 2015-12-04T22:28:22Z

But you do want to decode most reserved characters like %20 to , right?

/ is kind of a special case. I really don’t know what should happen if a path component contains %2f in a web server that maps URL paths directly to a filesystem, or if it’s defined anywhere. Implementations seem to disagree: nginx happily interprets %2f as directory separator, while I didn’t manage to get a non-404 response out of Apache with %2f in the path.

My mental model is that slash-separated components of an URL’s path should be URL-decoded separately.

Then maybe the serialization to a filesystem path should "fail" (which might be where Apache responds with 404) when components contain "forbidden" characters? That’s not only the separators but also NUL, and on Windows some other punctuation.

untitaker · 2015-12-05T17:44:54Z

Right, so maybe that's not such a good usecase (I currently do it like nginx).

I still think the example given in the OP, and its result, is basically a bug.
In another context (WebDAV) I do NOT need to map to the FS, but still need to
normalize URLs.

On Fri, Dec 04, 2015 at 02:28:23PM -0800, Simon Sapin wrote:

But you do want to decode most reserved characters like %20 to , right?

/ is kind of a special case. I really don’t know what should happen if a path component contains %2f in a web server that maps URL paths directly to a filesystem, or if it’s defined anywhere. Implementations seem to disagree: nginx happily interprets %2f as directory separator, while I didn’t manage to get a non-404 response out of Apache with %2f in the path.

My mental model is that slash-separated components of an URL’s path should be URL-decoded separately.

Then maybe the serialization to a filesystem path should "fail" (which might be where Apache responds with 404) when components contain "forbidden" characters? That’s not only the separators but also NUL, and on Windows some other punctuation.

Reply to this email directly or view it on GitHub:
#149 (comment)

SimonSapin · 2015-12-05T18:04:54Z

I still think the example given in the OP, and its result, is basically a bug.

Again, https://tools.ietf.org/html/rfc3986#section-6 defines various ways equivalence could be implemented, and says “it depends”. https://url.spec.whatwg.org/#url-equivalence picks one, and PartialEq for Url implements that one.

Sometimes you want a different algorithm, but I don’t think that’s a good enough reason to change == because sometimes not. But I’m not opposed to adding more algorithms, as methods. Or if you really want to use the == operator it could be a wrapper that defines its own equality, like https://crates.io/crates/unicase wraps around String and makes equality ASCII case-insensitive.

untitaker · 2015-12-05T18:15:13Z

You're completely right.

Would it be sensible to implement some of those normalization methods as &mut self, and apply them by default during struct initialization, with flags to turn that behavior off? I'm aiming for a sensible default behavior here that fits the 90% usecase, to avoid ridiculous bugs in user code such as %3a != %3A.

SimonSapin · 2015-12-05T18:33:25Z

Would it be sensible to implement some of those normalization methods as &mut self

Sure.

and apply them by default during struct initialization

I’m not sure what you mean here. Rust doesn’t provide “initializer” that can run code when the Url { … } syntax is used.

But the typical way to get a Url is not to use that syntax directly but to use the parser. The parser’s algorithm is defined precisely at https://url.spec.whatwg.org/#url-parsing. If there’s a good reason to change it, we should discuss changing the spec rather than unilaterally deviating from it.

to avoid ridiculous bugs

What’s a bug is really a question of perspective. My primary motivation is to make Servo compatible with the web. If some random website works in other browsers but not in Servo, guess who’s gonna be blamed. Even if it’s because the other browsers happen to agree on a ridiculous behavior, and what Servo does is “obviously superior”.

untitaker · 2015-12-05T18:41:37Z

Interesting that you mention other browsers, Firefox seems to automatically correct entered URLs using percent-encoding normalization (but not case normalization):

Percent-encoding normalization is done: Try to visit https://httpbin.org/get?foo=%4a. Firefox automatically corrects this to https://httpbin.org/get?foo=J (not a redirect it seems)
Case normalization is not done: https://httpbin.org/get?foo=%2f results in a different request than https://httpbin.org/get?foo=%2F

SimonSapin · 2015-12-05T18:45:20Z

Note that the URL bar has much more magic than, for example, parsing <a href="…">.

untitaker · 2015-12-05T19:00:43Z

I've edited my previous post such that <a> can be tested, and the results seem to be the same.

However, mitmproxy reveals that the normalization only occurs for Firefox' networking dev tool and the URL bar, not the actual HTTP request.

untitaker · 2015-12-05T19:05:22Z

Also the status bar (tooltip when hovering over link) also shows different encoding behavior regarding the slash than either the URL bar or the actual HTTP request made. Its encoding behavior is equivalent to the Firefox dev tools.

untitaker · 2015-12-05T19:16:05Z

BTW

But the typical way to get a Url is not to use that syntax directly but to use the parser.

Yes, that's what I meant.

I'm now sufficiently confused to not know what the 90% usecase is. I guess the current behavior of not fiddling too much with normalization during parsing makes sense, although I think it'd be valuable to "nudge" the user into a sensible normalization strategy if the user wishes to do so. Concretely, the API I currently imagine is:

Implement all normalization strategies as &mut self methods.
Have a normalize(&mut self) that calls a subset of those normalization methods. Not sure what should be part of that set, but probably case-normalization and percent-encoding.

cmbrandenburg · 2015-12-23T13:38:04Z

File-serving isn't the only game in town. Sometimes %2F in a path component has a valid, consistent meaning, such as with CouchDB servers. Consequently, I filed #154 to note some other problems with percent-encoding.

untitaker · 2015-12-23T14:18:16Z

Percent encoding doesn't conflate %2f and /

On 23 December 2015 14:38:04 CET, "Craig M. Brandenburg" [email protected] wrote:

File-serving isn't the only game in town. Sometimes %2F in a path
component has a valid, consistent meaning, such as with CouchDB
servers. Consequently, I filed #154 to note some other problems with
percent-encoding.

Reply to this email directly or view it on GitHub:
#149 (comment)

Sent from my phone. Please excuse my brevity.

cmbrandenburg · 2015-12-23T16:07:07Z

@untitaker I don't understand your comment. Will you please elaborate?

untitaker · 2015-12-23T17:24:11Z

@cmbrandenburg Percent-encoding normalization is about decoding characters that never need to be encoded (unreserved in RFC 3986). E.g., this path:

/a/b/c

may be unnecessarily encoded like:

/%78/b/c

Percent-encoding normalization is about converting the latter form into the former one.

/ is not a unreserved character. Decoding it leads to a different URI. I only mentioned it because Simon asked me about my usecase.

All of this is specified in https://tools.ietf.org/html/rfc3986#section-6.2.2.2

cmbrandenburg · 2015-12-23T18:29:05Z

@untitaker Ah, now I understand.

My original comment was in response to @SimonSapin's comment:

/ is kind of a special case. I really don’t know what should happen if a path component contains %2f in a web server that maps URL paths directly to a filesystem, or if it’s defined anywhere. Implementations seem to disagree: nginx happily interprets %2f as directory separator, while I didn’t manage to get a non-404 response out of Apache with %2f in the path.

My point is to remind everyone that there exist Web APIs out there that do have consistent meaning for a / character within a path component.

But now I see that my #154 is only tangentially related to this issue.

SimonSapin · 2016-04-04T08:28:09Z

I’m gonna close this as “working as intended” per https://url.spec.whatwg.org/. Feel free to file an issue at https://github.com/whatwg/url about changing the spec. You may have a case here since at least Chromium does some percent-decoding: http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4033 (Interop with existing browsers and "describing reality" are much stronger arguments in WHATWG-land than "ridiculous". Many ridiculous things on the web can’t be removed/changed without breaking lots of existing content.)

untitaker mentioned this issue Nov 28, 2015

Add OAUTH_SIGNATURE EncodeSet #148

Closed

cmbrandenburg mentioned this issue Dec 23, 2015

DEFAULT_ENCODE_SET doesn't percent-encode / and % #154

Closed

SimonSapin closed this as completed Apr 4, 2016

rust-url should perform percent-encoding normalization on input #149

rust-url should perform percent-encoding normalization on input #149

Comments

untitaker commented Nov 28, 2015

untitaker commented Nov 28, 2015

Uh oh!

untitaker commented Nov 29, 2015

Uh oh!

SimonSapin commented Dec 4, 2015

Uh oh!

untitaker commented Dec 4, 2015

Uh oh!

SimonSapin commented Dec 4, 2015

Uh oh!

untitaker commented Dec 4, 2015

Uh oh!

SimonSapin commented Dec 4, 2015

Uh oh!

untitaker commented Dec 5, 2015

Uh oh!

SimonSapin commented Dec 5, 2015

Uh oh!

untitaker commented Dec 5, 2015

Uh oh!

SimonSapin commented Dec 5, 2015

Uh oh!

untitaker commented Dec 5, 2015

Uh oh!

SimonSapin commented Dec 5, 2015

Uh oh!

untitaker commented Dec 5, 2015

Uh oh!

untitaker commented Dec 5, 2015

Uh oh!

untitaker commented Dec 5, 2015

Uh oh!

cmbrandenburg commented Dec 23, 2015

Uh oh!

untitaker commented Dec 23, 2015

Uh oh!

cmbrandenburg commented Dec 23, 2015

Uh oh!

untitaker commented Dec 23, 2015

Uh oh!

cmbrandenburg commented Dec 23, 2015

Uh oh!

SimonSapin commented Apr 4, 2016

Uh oh!