Skip to content

rust-url should perform percent-encoding normalization on input #149

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
untitaker opened this issue Nov 28, 2015 · 22 comments
Closed

rust-url should perform percent-encoding normalization on input #149

untitaker opened this issue Nov 28, 2015 · 22 comments

Comments

@untitaker
Copy link
Contributor

(unless normalization is out of scope for this lib)

let a = Url::parse("http://example.com/l%6Fgin/");
let b = Url::parse("http://example.com/login/");
a == b  # false

https://tools.ietf.org/html/rfc3986#section-6.2.2.2

o is a unreserved character, so its percent-encoded form %6F should be equivalent.

@untitaker
Copy link
Contributor Author

This came up when discussing #148, and is almost certainly dependent on that PR

@untitaker
Copy link
Contributor Author

(also I'd be interested in implementing this if the maintainers agree with it)

@SimonSapin
Copy link
Member

In general, rust-url implements https://url.spec.whatwg.org/, not RFC 3986.

For parsing the path component specifically, the algorithm specified at https://url.spec.whatwg.org/#path-state percent-encodes some characters if they’re not encoded in the input, but leaves existing percent-encoded sequences unchanged.

The introduction of https://tools.ietf.org/html/rfc3986#section-6 discusses the many ways to define URL equivalence:

URI comparison is performed for some particular purpose. Protocols
or implementations that compare URIs for different purposes will
often be subject to differing design trade-offs in regards to how
much effort should be spent in reducing aliased identifiers.

https://url.spec.whatwg.org/#url-equivalence defines one specific algorithm, which is what is implemented in PartialEq for Url (the == operator).

Do you think it would be useful to have additional methods for Url equivalence with more normalization? How should they behave exactly?

@untitaker
Copy link
Contributor Author

A percent_normalize(&mut self) would be useful I think:

These URIs should be normalized by decoding any percent-encoded octet that corresponds to an unreserved character, as described in Section 2.3.

So percent_normalize would do that normalization in-place (for all URL parts), just because that comes with no new memory allocations (potentially).

In a web server where I map URL paths to actual filepaths, I need to normalize /a%40b/ into the same thing as /a@b/. Either of them are valid paths, and I encounter either of them in interaction with real clients. I currently deal with this by running percent_decode on the serialized path.

@SimonSapin
Copy link
Member

I was thinking of a equivalent(&self, other: &Url) -> bool comparison, but in place normalization would be possible too.

For a web server, though, don’t you want to decode all percent-encoding rather than normalize it?

@untitaker
Copy link
Contributor Author

Decoding a particular set of octets is just what percent-encoding normalization is. Decoding all characters is technically incorrect since you might e.g. decode %2f into / and change the semantics of the path.

@SimonSapin
Copy link
Member

But you do want to decode most reserved characters like %20 to , right?

/ is kind of a special case. I really don’t know what should happen if a path component contains %2f in a web server that maps URL paths directly to a filesystem, or if it’s defined anywhere. Implementations seem to disagree: nginx happily interprets %2f as directory separator, while I didn’t manage to get a non-404 response out of Apache with %2f in the path.

My mental model is that slash-separated components of an URL’s path should be URL-decoded separately.

Then maybe the serialization to a filesystem path should "fail" (which might be where Apache responds with 404) when components contain "forbidden" characters? That’s not only the separators but also NUL, and on Windows some other punctuation.

@untitaker
Copy link
Contributor Author

Right, so maybe that's not such a good usecase (I currently do it like nginx).

I still think the example given in the OP, and its result, is basically a bug.
In another context (WebDAV) I do NOT need to map to the FS, but still need to
normalize URLs.

On Fri, Dec 04, 2015 at 02:28:23PM -0800, Simon Sapin wrote:

But you do want to decode most reserved characters like %20 to , right?

/ is kind of a special case. I really don’t know what should happen if a path component contains %2f in a web server that maps URL paths directly to a filesystem, or if it’s defined anywhere. Implementations seem to disagree: nginx happily interprets %2f as directory separator, while I didn’t manage to get a non-404 response out of Apache with %2f in the path.

My mental model is that slash-separated components of an URL’s path should be URL-decoded separately.

Then maybe the serialization to a filesystem path should "fail" (which might be where Apache responds with 404) when components contain "forbidden" characters? That’s not only the separators but also NUL, and on Windows some other punctuation.


Reply to this email directly or view it on GitHub:
#149 (comment)

@SimonSapin
Copy link
Member

I still think the example given in the OP, and its result, is basically a bug.

Again, https://tools.ietf.org/html/rfc3986#section-6 defines various ways equivalence could be implemented, and says “it depends”. https://url.spec.whatwg.org/#url-equivalence picks one, and PartialEq for Url implements that one.

Sometimes you want a different algorithm, but I don’t think that’s a good enough reason to change == because sometimes not. But I’m not opposed to adding more algorithms, as methods. Or if you really want to use the == operator it could be a wrapper that defines its own equality, like https://crates.io/crates/unicase wraps around String and makes equality ASCII case-insensitive.

@untitaker
Copy link
Contributor Author

You're completely right.

Would it be sensible to implement some of those normalization methods as &mut self, and apply them by default during struct initialization, with flags to turn that behavior off? I'm aiming for a sensible default behavior here that fits the 90% usecase, to avoid ridiculous bugs in user code such as %3a != %3A.

@SimonSapin
Copy link
Member

Would it be sensible to implement some of those normalization methods as &mut self

Sure.

and apply them by default during struct initialization

I’m not sure what you mean here. Rust doesn’t provide “initializer” that can run code when the Url { … } syntax is used.

But the typical way to get a Url is not to use that syntax directly but to use the parser. The parser’s algorithm is defined precisely at https://url.spec.whatwg.org/#url-parsing. If there’s a good reason to change it, we should discuss changing the spec rather than unilaterally deviating from it.

to avoid ridiculous bugs

What’s a bug is really a question of perspective. My primary motivation is to make Servo compatible with the web. If some random website works in other browsers but not in Servo, guess who’s gonna be blamed. Even if it’s because the other browsers happen to agree on a ridiculous behavior, and what Servo does is “obviously superior”.

@untitaker
Copy link
Contributor Author

Interesting that you mention other browsers, Firefox seems to automatically correct entered URLs using percent-encoding normalization (but not case normalization):

@SimonSapin
Copy link
Member

Note that the URL bar has much more magic than, for example, parsing <a href="…">.

@untitaker
Copy link
Contributor Author

I've edited my previous post such that <a> can be tested, and the results seem to be the same.

However, mitmproxy reveals that the normalization only occurs for Firefox' networking dev tool and the URL bar, not the actual HTTP request.

@untitaker
Copy link
Contributor Author

Also the status bar (tooltip when hovering over link) also shows different encoding behavior regarding the slash than either the URL bar or the actual HTTP request made. Its encoding behavior is equivalent to the Firefox dev tools.

@untitaker
Copy link
Contributor Author

BTW

But the typical way to get a Url is not to use that syntax directly but to use the parser.

Yes, that's what I meant.

I'm now sufficiently confused to not know what the 90% usecase is. I guess the current behavior of not fiddling too much with normalization during parsing makes sense, although I think it'd be valuable to "nudge" the user into a sensible normalization strategy if the user wishes to do so. Concretely, the API I currently imagine is:

  • Implement all normalization strategies as &mut self methods.
  • Have a normalize(&mut self) that calls a subset of those normalization methods. Not sure what should be part of that set, but probably case-normalization and percent-encoding.

@cmbrandenburg
Copy link
Contributor

File-serving isn't the only game in town. Sometimes %2F in a path component has a valid, consistent meaning, such as with CouchDB servers. Consequently, I filed #154 to note some other problems with percent-encoding.

@untitaker
Copy link
Contributor Author

Percent encoding doesn't conflate %2f and /

On 23 December 2015 14:38:04 CET, "Craig M. Brandenburg" [email protected] wrote:

File-serving isn't the only game in town. Sometimes %2F in a path
component has a valid, consistent meaning, such as with CouchDB
servers. Consequently, I filed #154 to note some other problems with
percent-encoding.


Reply to this email directly or view it on GitHub:
#149 (comment)

Sent from my phone. Please excuse my brevity.

@cmbrandenburg
Copy link
Contributor

@untitaker I don't understand your comment. Will you please elaborate?

@untitaker
Copy link
Contributor Author

@cmbrandenburg Percent-encoding normalization is about decoding characters that never need to be encoded (unreserved in RFC 3986). E.g., this path:

/a/b/c

may be unnecessarily encoded like:

/%78/b/c

Percent-encoding normalization is about converting the latter form into the former one.

/ is not a unreserved character. Decoding it leads to a different URI. I only mentioned it because Simon asked me about my usecase.

All of this is specified in https://tools.ietf.org/html/rfc3986#section-6.2.2.2

@cmbrandenburg
Copy link
Contributor

@untitaker Ah, now I understand.

My original comment was in response to @SimonSapin's comment:

/ is kind of a special case. I really don’t know what should happen if a path component contains %2f in a web server that maps URL paths directly to a filesystem, or if it’s defined anywhere. Implementations seem to disagree: nginx happily interprets %2f as directory separator, while I didn’t manage to get a non-404 response out of Apache with %2f in the path.

My point is to remind everyone that there exist Web APIs out there that do have consistent meaning for a / character within a path component.

But now I see that my #154 is only tangentially related to this issue.

@SimonSapin
Copy link
Member

I’m gonna close this as “working as intended” per https://url.spec.whatwg.org/. Feel free to file an issue at https://github.com/whatwg/url about changing the spec. You may have a case here since at least Chromium does some percent-decoding: http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4033 (Interop with existing browsers and "describing reality" are much stronger arguments in WHATWG-land than "ridiculous". Many ridiculous things on the web can’t be removed/changed without breaking lots of existing content.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants