Add support for encoding_rs behind a new feature flag #262

Closed · wants to merge 1 commit

Conversation

@hsivonen (Collaborator) commented Dec 21, 2016

This is an implementation of the design discussed f2f with @SimonSapin and briefly with @Manishearth: Introducing optional support for encoding_rs as the query encoding conversion back end behind a new feature flag (query_encoding_rs).

Observable changes:

  • Only WHATWG Encoding Standard encodings can be specified as the query encoding. (Consequence of what the Encoding type means in encoding_rs.)
  • When generating the _charset_ parameter, the name of the encoding is the canonical name per current spec, so the name can now contain upper-case letters.
  • Per current spec, to_output_encoding() returns UTF-8 for the replacement encoding.
  • Spec updates to the encodings themselves.
  • The meaning of EncodingRef in the outer API changes from &'static encoding::types::Encoding to &'static encoding_rs::Encoding.
  • Performance profile of encoding_rs.
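
As a purely illustrative sketch (not the PR's actual diff), the cfg-gated surface might look roughly like this, with each feature selecting which crate the public EncodingRef type comes from:

```rust
// Illustrative sketch only, not the PR's exact code: each feature selects
// which back end the crate's public EncodingRef type refers to.
#[cfg(feature = "query_encoding")]
extern crate encoding;
#[cfg(feature = "query_encoding_rs")]
extern crate encoding_rs;

// With the old back end: a trait-object reference into rust-encoding
// (written `&'static encoding::types::Encoding` in 2016-era Rust).
#[cfg(feature = "query_encoding")]
pub type EncodingRef = &'static dyn encoding::types::Encoding;

// With the new back end: a reference to an encoding_rs Encoding struct.
#[cfg(feature = "query_encoding_rs")]
pub type EncodingRef = &'static encoding_rs::Encoding;
```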


.project
.settings
*~
*.bk

Contributor

Seems like you should add all of that to a global ignore file anyway, or you are going to need to add it in every project all the time.

@hsivonen (Collaborator, Author)

Thanks. I removed that commit from the pull request (and also rebased the pull request).

@emilio (Member) commented Dec 28, 2016

r? @SimonSapin

@SimonSapin (Member)

Thank you for submitting this Henri. I haven’t been in a hurry to merge it because as far as I know nothing is blocked on it. Is it?

On the other hand, there is a small concern that Cargo features are supposed to be purely additive. If two crates A and B both depend on a third crate C and enable different sets of C's optional features, Cargo builds C with the union of those features. What should happen when both the query_encoding and query_encoding_rs features are enabled? At the same time this is very theoretical, since probably only Servo will ever want to use either. Still, I haven't decided yet what to do about this.

I'd also like to remove support for rust-encoding in rust-url eventually, but that would be a breaking change. And since so much of the Rust ecosystem has come to (indirectly) depend on this crate, the pain of upgrading the world requires a stronger justification than that.

@Manishearth (Member)

For additivity to work I would suggest exposing a mod query_encoding and a mod query_encoding_rs, and, when both are not enabled, re-exporting their contents. It is up to crates to use proper hygiene.
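
One possible reading of that, as a minimal sketch (hypothetical module contents, not rust-url's real items):

```rust
// Hypothetical sketch: each back end lives in its own feature-gated module,
// and the crate root only re-exports when exactly one of the two features is
// enabled, so enabling both stays unambiguous.
#[cfg(feature = "query_encoding")]
pub mod query_encoding {
    // rust-encoding-backed items would live here.
}

#[cfg(feature = "query_encoding_rs")]
pub mod query_encoding_rs {
    // encoding_rs-backed items would live here.
}

#[cfg(all(feature = "query_encoding", not(feature = "query_encoding_rs")))]
pub use self::query_encoding::*;

#[cfg(all(feature = "query_encoding_rs", not(feature = "query_encoding")))]
pub use self::query_encoding_rs::*;
```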

@SimonSapin (Member)

@Manishearth This is not just a module exposing new items; it affects the behavior of methods that exist either way.

@Manishearth (Member)

Perhaps we can use a trait here? So that the methods all act on T: EncodingOverride?

@SimonSapin (Member)

Yes, another approach could be not to code for either encoding library in rust-url itself and instead have traits that users can implement. But this also would likely be a breaking change.
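
A rough sketch of what such a trait could look like (names and signatures are illustrative only, not an actual rust-url API):

```rust
// Hypothetical trait-based design: rust-url would be generic over an override
// instead of hard-coding either encoding crate. Utf8Override is the built-in
// default that needs no external library.
use std::borrow::Cow;

pub trait EncodingOverride {
    /// Encode a query-string value into the target encoding's bytes.
    fn encode<'a>(&self, input: &'a str) -> Cow<'a, [u8]>;
}

pub struct Utf8Override;

impl EncodingOverride for Utf8Override {
    fn encode<'a>(&self, input: &'a str) -> Cow<'a, [u8]> {
        Cow::Borrowed(input.as_bytes())
    }
}
```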

@hsivonen (Collaborator, Author) commented Jan 4, 2017

I haven’t been in a hurry to merge it because as far as I know nothing is blocked on it. Is it?

Nothing is blocking on this at the moment.

On the other hand there is a small concern that Cargo features are supposed to be purely additive. If two crates A and B both depend on a third crate C and enable different sets of C’s optional features, Cargo build C with the union of the features. What should happen when both query_encoding and query_encoding_rs features are enabled?

Having a dependency graph that includes both encoding_rs and rust-encoding (the real one, not the encoding_rs_compat fork) is likely an unwanted situation: You end up with a bloated binary if you ship the lookup tables for both. For that reason, I thought it was a feature rather than a bug that enabling both query_encoding and query_encoding_rs would fail to compile.
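
(Purely as an illustration, not something in this PR: one hypothetical way to make that conflict an explicit, readable error, using the compile_error! macro from later stable Rust releases, would be a guard like this.)

```rust
// Hypothetical guard, not part of this PR: turn the feature conflict into a
// single readable build error instead of whatever errors fall out of
// compiling both back ends at once.
#[cfg(all(feature = "query_encoding", feature = "query_encoding_rs"))]
compile_error!("enable at most one of `query_encoding` and `query_encoding_rs`");
```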

Now that I think about this more: encoding_rs_compat is all about allowing apps to migrate to shipping only encoding_rs's lookup tables without forcing them to migrate away from all uses of the rust-encoding API at the same time. From this point of view, I can see why it could be desirable for rust-url to expose both &'static encoding::types::Encoding-typed and &'static encoding_rs::Encoding-typed API surface simultaneously. (However, if the former just used the existing code to call into encoding_rs_compat, the result would be less efficient for non-UTF-8 ASCII-compatible encodings than doing the mapping from &'static encoding::types::Encoding to &'static encoding_rs::Encoding up front, since encoding_rs borrows also for non-UTF-8 ASCII-compatible encodings when the input is ASCII-only.)

It would be a bit ugly, since Rust doesn't have method overloading, but it would be doable for rust-url to expose differently-named API surfaces when taking &'static encoding::types::Encoding and when taking &'static encoding_rs::Encoding.

(The main problem with citing encoding_rs_compat as a solution for anything is that it can only be used in non-crates.io top-level Cargo artifacts, because to use it, you have to abuse a Cargo feature that's meant for development only and that's banned in crates uploaded to crates.io. This might prevent rust-url from having code that maps &'static encoding::types::Encoding to &'static encoding_rs::Encoding for efficiency when both encoding APIs are present in the dependency graph but only the encoding_rs lookup tables are.)
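
(For what it's worth, the up-front mapping mentioned above could be sketched roughly as follows, going through the shared Encoding Standard labels; this is an illustration under the assumption that both crates are in the dependency graph, not code from this PR.)

```rust
// Hypothetical up-front mapping from a rust-encoding handle to an encoding_rs
// handle via the Encoding Standard label; illustrative only.
extern crate encoding;
extern crate encoding_rs;

fn to_encoding_rs(enc: encoding::types::EncodingRef) -> Option<&'static encoding_rs::Encoding> {
    // whatwg_name() is None for encodings outside the Encoding Standard,
    // which is exactly the case that has no encoding_rs counterpart.
    enc.whatwg_name()
        .and_then(|name| encoding_rs::Encoding::for_label(name.as_bytes()))
}
```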

@nox (Contributor) commented Jan 4, 2017

There is something I still don't understand: why couldn't encoding be patched to also support UTF-16, instead of a brand new crate?

@hsivonen (Collaborator, Author) commented Jan 4, 2017

There is something I still don't understand: why couldn't encoding be patched to also support UTF-16, instead of a brand new crate?

UTF-16 support isn't the only outward difference between encoding_rs and rust-encoding.

For the streaming API, encoding_rs takes a caller-allocated output buffer whereas rust-encoding takes a caller-supplied object that provides a method for rust-encoding to call. The encoding_rs approach is more efficient and more FFI-friendly (which is, of course, part of the Gecko focus of encoding_rs).
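
A minimal sketch of that caller-allocated-buffer pattern with encoding_rs's streaming Decoder (the encoding and buffer size are arbitrary choices for the example):

```rust
// Streaming decode into a fixed, caller-owned output buffer. When the buffer
// fills up, the caller drains it and hands the same buffer back.
extern crate encoding_rs;
use encoding_rs::{CoderResult, SHIFT_JIS};

fn decode_stream(chunks: &[&[u8]]) -> String {
    let mut decoder = SHIFT_JIS.new_decoder();
    let mut out = String::new();
    let mut buf = [0u8; 1024]; // caller-allocated, reused across calls
    for (i, chunk) in chunks.iter().enumerate() {
        let last = i == chunks.len() - 1;
        let mut input: &[u8] = *chunk;
        loop {
            let (result, read, written, _had_errors) =
                decoder.decode_to_utf8(input, &mut buf, last);
            // decode_to_utf8 only writes complete, valid UTF-8 sequences.
            out.push_str(std::str::from_utf8(&buf[..written]).unwrap());
            input = &input[read..];
            match result {
                CoderResult::InputEmpty => break,    // this chunk is consumed
                CoderResult::OutputFull => continue, // drain and reuse the buffer
            }
        }
    }
    out
}
```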

For the non-streaming Rust API, rust-encoding always copies, but encoding_rs borrows when possible (valid UTF-8 to UTF-8 decode, ASCII-only in ASCII-compatible encoding to UTF-8 decode, UTF-8 to UTF-8 encode and ASCII-only to ASCII-compatible encoding encode).
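
A small sketch of that borrow-when-possible behaviour with encoding_rs's non-streaming API (the inputs are arbitrary examples):

```rust
// ASCII-only input in an ASCII-compatible encoding decodes to a borrowed Cow;
// input that actually needs conversion produces an owned String.
extern crate encoding_rs;
use std::borrow::Cow;
use encoding_rs::WINDOWS_1252;

fn is_borrowed(c: &Cow<str>) -> bool {
    match *c {
        Cow::Borrowed(_) => true,
        Cow::Owned(_) => false,
    }
}

fn main() {
    let (ascii, _, _) = WINDOWS_1252.decode(b"hello"); // ASCII-only: no copy
    assert!(is_borrowed(&ascii));

    let (converted, _, _) = WINDOWS_1252.decode(b"caf\xE9"); // 0xE9 = 'é': must convert
    assert!(!is_borrowed(&converted));
    assert_eq!(&converted[..], "café");
}
```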

Implementation-wise, it wouldn't be a matter of patching in UTF-16. Getting encoding_rs-like decode performance involves different UTF-8-targeting internals, too. (And different outward API as described above, but the encoding_rs internals are faster for decode even when using the rust-encoding outward API via encoding_rs_compat.)

As minor points: The rust-encoding streaming API can't signal ISO-2022-JP unmappables in a spec-compliant way (the spec changed after the rust-encoding API was designed). Additionally, after rust-encoding had been released, the spec changed the case of the encoding names to include upper-case letters in IANA and browser tradition, which from the Rust perspective would be a breaking change for rust-encoding.

@nox (Contributor) commented Jan 4, 2017

Well rust-encoding isn't 1.0.0 so breaking changes aren't a problem anyway, and it means the crate is not finalised and we can theoretically pursue any breaking improvement and whatnot.

Did we discuss such changes with the rust-encoding maintainers before forking their crate? If not, why?

@nox (Contributor) commented Jan 4, 2017

For the streaming API, encoding_rs takes a caller-allocated output buffer whereas rust-encoding takes a caller-supplied object that provides a method for rust-encoding to call. The encoding_rs approach is more efficient and more FFI-friendly (which is, of course, part of the Gecko focus of encoding_rs).

For FFI (and also efficiency concerns), couldn't you pass a caller-supplied object that fills a caller-allocated output buffer, and let rustc optimise away that intermediate object?

For the non-streaming Rust API, rust-encoding always copies, but encoding_rs borrows when possible (valid UTF-8 to UTF-8 decode, ASCII-only in ASCII-compatible encoding to UTF-8 decode, UTF-8 to UTF-8 encode and ASCII-only to ASCII-compatible encoding encode).

I don't see why this wouldn't be a welcome change in rust-encoding.

Implementation-wise, it wouldn't be a matter of patching in UTF-16. Getting encoding_rs-like decode performance involves different UTF-8-targeting internals, too. (And different outward API as described above, but the encoding_rs internals are faster for decode even when using the rust-encoding outward API via encoding_rs_compat.)

Interesting, did you document the differences precisely somewhere and I missed that?

@hsivonen (Collaborator, Author) commented Jan 4, 2017

Well rust-encoding isn't 1.0.0 so breaking changes aren't a problem anyway, and it means the crate is not finalised and we can theoretically pursue any breaking improvement and whatnot.

Theoretically. In practice, having rust-encoding 1.0.0 make the sort of API changes that I think are appropriate for the goals of encoding_rs would be as big a change for dependent crates to adapt to as switching from rust-encoding 0.2.x to encoding_rs.

Did we discuss such changes with the rust-encoding maintainers before forking their crate? If not, why?

There was some discussion, but not about the encoding_rs_compat fork per se.

I thought having different internals and a different API would be appropriate for the Gecko use case. Initially I had no code (obviously). Showing up to a project with no code and saying I want to redesign both the internals and the API would have been rather presumptuous, and it would be quite appropriate for the author of a library not to agree to replacing the internals and externals of the library because a random person on the Internet shows up and says so.

The way to empirically show that my design had merit was to implement it. Since the implementation shares no code with rust-encoding, it doesn't seem bad for it to be in a fresh repo.

It is somewhat impolite to show up with a competing crate instead of improving the existing one, but before being able to show that the change is an improvement, I thought it would have been even more impolite to suggest a re-do of both the internals and externals of the existing crate (and to do so from the special Gecko point of view).

For FFI (and also efficiency concerns), couldn't you pass a caller-supplied object that fills a caller-allocated output buffer, and let rustc optimise away that intermediate object?

How would that be an improvement over the caller just passing &mut [u8], &mut str or &mut [u16] as appropriate?

Interesting, did you document the differences precisely somewhere and I missed that?

Main API differences:

  • No backtracking in the streaming API. Given sufficient output buffer space, the input buffer is consumed fully no matter where the end of the input buffer occurs relative to the structure of the encoding.

  • Caller-supplied output buffers in the streaming API.

  • Borrows when possible in the non-streaming API.

  • BOM handling in both the streaming and non-streaming API.

  • No heap allocations (except for the output buffers in the non-streaming API in the non-borrow case).

  • Unmappables signaled as returned char instead of references to the input.

  • No traits, no virtual calls.

Main current internal differences:

  • Handling runs of ASCII using SSE2 or, in the absence of SSE2, with multiple bytes at a time in a plain ALU register.

  • As of today, use of trailing_zeros() for performance. (The above point is a prerequisite for this one.)

  • No non-inline function calls inside the conversion loops. Especially no virtual calls in the conversion loops!

  • No encode-specific data structures. (It might make sense to relax this a little and have some encode-specific data.)

Upcoming differences:

  • Smaller CJK data tables even for decode due to making use of the structure of the indices.

There's a performance comparison reflecting the status as of the end of November, which doesn't include the gains from the trailing_zeros() change that landed today.
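
(As a loose illustration of the ASCII-run and trailing_zeros() points above, not encoding_rs's actual code, and using today's standard-library helpers, the core trick looks something like this.)

```rust
// Scan 8 bytes at a time; the high bit of each byte marks non-ASCII, and
// trailing_zeros() locates the first such byte in the little-endian word.
fn first_non_ascii(bytes: &[u8]) -> Option<usize> {
    let mut i = 0;
    while i + 8 <= bytes.len() {
        let mut word = [0u8; 8];
        word.copy_from_slice(&bytes[i..i + 8]);
        let word = u64::from_le_bytes(word);
        let high_bits = word & 0x8080_8080_8080_8080;
        if high_bits != 0 {
            return Some(i + (high_bits.trailing_zeros() / 8) as usize);
        }
        i += 8;
    }
    // Handle the tail that is shorter than one word.
    bytes[i..].iter().position(|&b| b >= 0x80).map(|p| i + p)
}
```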

(Additionally, there seems to be a difference in attitude towards the Encoding Standard between rust-encoding and encoding_rs. rust-encoding gives encodings potentially non-Encoding Standard names and qualifies the Encoding Standard names as "WHATWG names", suggesting that rust-encoding doesn't recognize the Encoding Standard as the only or even the primary source of naming. rust-encoding also supports non-Encoding Standard encodings. By contrast, encoding_rs is strictly an implementation of the Encoding Standard. No other encodings. No other naming schemes.)

@nox (Contributor) commented Jan 4, 2017

Theoretically. In practice, having rust-encoding 1.0.0 make the sort of API changes that I think are appropriate for the goals of encoding_rs would be as big a change for dependent crates to adapt to as switching from rust-encoding 0.2.x to encoding_rs.

This is what <1.0.0 versions are for: big breaking changes before stabilisation. There is plenty of precedent for that in the Rust ecosystem, the biggest being libc. I'm also not sure about your claim, given that you managed to do encoding_rs_compat.

The way to empirically show that my design had merit was to implement it. Since the implementation shares no code with rust-encoding, it doesn't seem bad for it to be in a fresh repo.

Serde is an example of a project that uses RFCs very well, and I would have loved to see an RFC for that design change of the encoding crate.

It is somewhat impolite to show up with a competing crate instead of improving the existing one, but before being able to show that the change is an improvement, I thought it would have been even more impolite to suggest a re-do of both the internals and externals of the existing crate (and to do so from the special Gecko point of view).

The impoliteness is more about publishing encoding_rs next to encoding, especially when @lifthrasiir put emphasis on the fact that encoding is a Rust-only library, and encoding_rs is the one with FFI concerns in mind.

The name of the crate encoding_rs_compat also doesn't help.

No traits, no virtual calls.

Could encoding be made to use traits without trait objects, thus eliminating virtual calls? I think yes, given that you managed to get rid of traits altogether.

How would that be an improvement over the caller just passing &mut [u8], &mut str or &mut [u16] as appropriate?

The improvement would be that the caller could also pass something similar to the current API in encoding, but I may be misunderstanding how you use that passed buffer anyway. Will look at the code more. :)

(Additionally, there seems to be a difference in attitude towards the Encoding Standard between rust-encoding and encoding_rs. rust-encoding gives encodings potentially non-Encoding Standard names and qualifies the Encoding Standard names as "WHATWG names", suggesting that rust-encoding doesn't recognize the Encoding Standard as the only or even the primary source of naming. rust-encoding also supports non-Encoding Standard encodings. By contrast, encoding_rs is strictly an implementation of the Encoding Standard. No other encodings. No other naming schemes.)

Would you mind if we removed the bogus names that aren't part of the spec, @lifthrasiir?

@nox (Contributor) commented Jan 4, 2017

According to lifthrasiir/rust-encoding#92 (comment), it doesn't seem like all these changes couldn't be incorporated into encoding, which is nice.

@hsivonen (Collaborator, Author) commented Jan 4, 2017

Theoretically. In practice, having rust-encoding 1.0.0 make the sort of API changes that I think are appropriate for the goals of encoding_rs would be as big a change for dependent crates to adapt to as switching from rust-encoding 0.2.x to encoding_rs.

This is what <1.0.0 versions are for: big breaking changes before stabilisation. There is plenty of precedent for that in the Rust ecosystem, the biggest being libc. I'm also not sure about your claim, given that you managed to do encoding_rs_compat.

If you use encoding_rs_compat, you get the performance benefits (for decode; performance regression for encode) of the encoding_rs internals but you don't get the API benefits. For non-streaming, you don't get borrows. For streaming, there's an extra copy if you implement StringWriter and ByteWriter on something other than String and Vec<u8>, which is passable for a compatibility layer but would violate the principle of least surprise in an API that's not positioned as a mere compatibility layer.

The impoliteness is more about publishing encoding_rs next to encoding, especially when @lifthrasiir put emphasis on the fact that encoding is a Rust-only library, and encoding_rs is the one with FFI concerns in mind.

Gecko rather fundamentally has FFI concerns. It seemed inappropriate to try to push those concerns onto a library that puts the emphasis on being a Rust-only library. But I don't think it follows that Gecko shouldn't be able to address Gecko-relevant concerns. As for publishing encoding_rs on crates.io, it seems weird not to publish a Free Software crate that could potentially be useful outside Gecko.

The name of the crate encoding_rs_compat also doesn't help.

What's the concern? The name in Cargo.toml is encoding, because that's the only name that Cargo allows while achieving the compatibility goals. The name of the repo is encoding_rs plus something rather than rust-encoding plus something a) to avoid the appearance of using the name rust-encoding for something that's more than lightly patched rust-encoding (i.e. the appearance of using the name inappropriately) and b) to avoid including "rust" in the name, so as not to have to consider how the name would interact with the Rust trademark policy. Beyond that, the name isn't particularly carefully chosen.

The purpose of encoding_rs_compat is to allow Gecko to depend on crates from the Rust ecosystem (i.e. crates that depend on rust-encoding) without having to ship two sets of data tables in libxul. From my point of view, this PR is about making encoding_rs_compat unnecessary for the purpose of using (query encoding-enabled) rust-url in Gecko in the future. I made the PR in December rather than when the PR would be more immediately Gecko-relevant, because I happened to talk about rust-url and encoding_rs with @SimonSapin in December.

Could encoding be made to use traits without trait objects, thus eliminating virtual calls? I think yes, given that you managed to get rid of traits altogether.

AFAICT, there are three sources of virtual calls in encoding:

  1. The set of encodings is extensible. That is, Encoding in encoding is a trait that allows other crates to implement more encodings using the same interface. encoding_rs is deliberately unextensible, so its Encoding can be a struct. (encoding_rs is unextensible because a) it doesn't need to be extensible to address its Gecko use cases, b) to make the point of treating legacy encodings as a closed set that shouldn't be expanded and c) to provide type-safety for reasoning about the characteristics of the encodings (i.e. when you have an &'static encoding_rs::Encoding, you know it doesn't have characteristics alien to what you can see in the Encoding Standard). I realize that there are non-Web use cases for non-Encoding Standard legacy encodings. Catering to those use cases is not a goal for encoding_rs. If someone who has those needs finds encoding_rs useful for a subset of their needs, they are welcome to create an outer crate appropriate for their uses that delegates some subset of tasks to encoding_rs internally.)

  2. ByteWriter and StringWriter. encoding_rs uses caller-allocated buffers instead.

  3. The internals for single-byte encodings. encoding_rs exposes the raw lookup table to code that's common for single-byte encodings instead of exposing a lookup function/method.
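
(To illustrate point 3, here is a made-up sketch of that table-driven shape, with invented types and data, as opposed to calling a per-encoding lookup method.)

```rust
// Invented illustration of data-driven single-byte decode: one shared routine
// indexes a per-encoding table directly, so there is no virtual lookup call.
struct SingleByteData {
    // Maps bytes 0x80..=0xFF to Unicode scalar values; 0 marks an unmapped byte.
    high_half: [u16; 128],
}

fn decode_single_byte(data: &SingleByteData, input: &[u8], out: &mut String) {
    for &b in input {
        if b < 0x80 {
            out.push(b as char); // ASCII passes straight through
        } else {
            let u = data.high_half[(b - 0x80) as usize];
            let c = if u == 0 {
                '\u{FFFD}' // unmapped byte
            } else {
                std::char::from_u32(u as u32).unwrap_or('\u{FFFD}')
            };
            out.push(c);
        }
    }
}
```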

@nox (Contributor) commented Jan 4, 2017

As for publishing encoding_rs on crates.io, it seems weird not to publish a Free Software crate that could potentially be useful outside Gecko.

My point was that it seems weird to publish there a crate with a name very similar to another crate that does pretty much the same thing differently, namely encoding. And we will never be able to clear that confusion because we can't unpublish from crates.io, even if we introduce the same changes back in encoding. I don't care about having 42 implementations of the same thing, but given that one is specifically for Gecko concerns, gecko_encoding would have been better. Unfortunately that ship has sailed anyway.

The purpose of encoding_rs_compat is to allow Gecko to depend on crates from the Rust ecosystem (i.e. crates that depend on rust-encoding) without having to ship two sets of data tables in libxul.

I see.

From my point of view, this PR is about making encoding_rs_compat unnecessary for the purpose of using (query encoding-enabled) rust-url in Gecko in the future.

If some third-party crate A ends up using encoding_rs and rust-url, and some other third-party crate B uses encoding and rust-url, we still need to be able to use A and B together.

I'm more and more of the opinion that we should remove support for both of them and publish rust-url 2.0.0 with a rust-url-encoding crate.

The set of encodings is extensible.

That it is extensible doesn't mean it must be done through virtual calls. What could go wrong with still having an Encoding trait that is not used as a trait object in your crate?

@lifthrasiir would you be against making the set of encodings unextensible, otherwise?

ByteWriter and StringWriter. encoding_rs uses caller-allocated buffers instead.

Again, these could be plain old traits not used as trait objects, and a caller-allocated buffer could implement them, no?
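
(A sketch of that idea, with invented names rather than rust-encoding's actual ByteWriter trait: the trait stays, but it is used as a generic bound, so dispatch is static, and a caller-allocated fixed buffer can implement it.)

```rust
// Illustrative only: a writer trait used via a generic bound (monomorphized,
// no vtable), implemented for a fixed, caller-allocated buffer.
trait ByteSink {
    fn write_bytes(&mut self, bytes: &[u8]);
}

struct FixedBuf<'a> {
    buf: &'a mut [u8],
    len: usize,
}

impl<'a> ByteSink for FixedBuf<'a> {
    fn write_bytes(&mut self, bytes: &[u8]) {
        // Panics if the caller's buffer is too small; kept simple for the sketch.
        let end = self.len + bytes.len();
        self.buf[self.len..end].copy_from_slice(bytes);
        self.len = end;
    }
}

// Generic over the sink: the call below is resolved statically, not virtually.
fn convert_into<S: ByteSink>(input: &[u8], out: &mut S) {
    out.write_bytes(input); // stand-in for real conversion work
}
```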

@nox (Contributor) commented Jan 4, 2017

Note that if we merge that PR, and then later release encoding_rs 0.3 (or even encoding 0.3 with the current feature flag), rust-url will need a breaking bump to 2.0.0 anyway.

@hsivonen (Collaborator, Author) commented Jan 4, 2017

My point was that it seems weird to publish there a crate with a name very similar to another crate that does pretty much the same thing differently, namely encoding.

The obvious good name for a crate that implements the Encoding Standard is encoding. That was taken, so encoding_rs was the simplest way to have something else where the difference is idiomatic to the ecosystem. (When the Cargo.toml name doesn't include rs, -rs seems more idiomatic than _rs, but since rs needed to be part of the Cargo.toml name, I renamed it from encoding-rs to encoding_rs in order to make the name something that's allowed in extern crate.)

And we will never be able to clear that confusion because we can't unpublish from crates.io, even if we introduce the same changes back in encoding.

My apologies. encoding_rs is the first crate I've developed, and I didn't think through the implications to that extent. I just thought that publishing early and often was good. I also didn't plan with the presumption of rust-encoding adopting Gecko-motivated API changes and breaking compatibility with its existing users.

If some third-party crate A ends up using encoding_rs and rust-url, and some other third-party crate B uses encoding and rust-url, we still need to be able to use A and B together.

One option is not merging this PR and relying on encoding_rs_compat in Gecko as the rust-url encoding back end. (I.e. fewer borrows.)

That it is extensible doesn't mean it must be done through virtual calls. What could go wrong with still having an Encoding trait that is not used as a trait object in your crate?

Maybe I don't know the details of type erasure well enough, but I thought that if you want third parties to see &Encoding (or, preferably, &'static Encoding) and Encoding is a trait, you end up with a vtable. Is that not the case? (Not to suggest that the outer calls are perf-critical or that the enum-based dispatch pattern encoding_rs uses is any faster.)

ByteWriter and StringWriter. encoding_rs uses caller-allocated buffers instead.

Again, these could be plain old traits not used as trait objects, and a caller-allocated buffer could implement them, no?

Traits vs. trait objects aside, the semantics of ByteWriter and StringWriter are that the converter can tell the buffer to grow. In my experience from Gecko and java.nio.charset, that's not a desirable semantic in a streaming API. A fixed-size intermediate buffer is a common pattern when using a streaming API. (I.e. when the fixed-size buffer fills up, you process the contents and let the converter overwrite it the next time instead of allocating a new buffer.) Granted, with Stylo, the main Gecko use of this pattern will go away. (Converting an entire CSS file in one go with the opportunity to borrow will make more sense.)

Another use of a streaming API is gathering the output into a growable list of fixed-size buffers. This is the HTML parser case in Gecko. With a StringWriter-style interface, either the buffers would be filled significantly less precisely or there'd be an additional copy (compared to an encoding_rs/uconv/java.nio.charset-style interface with caller-allocated buffers passed as an argument to the conversion method).
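
(A rough sketch of that gather-into-fixed-size-buffers pattern with encoding_rs, again with arbitrary choices of encoding and chunk size, purely to illustrate the shape.)

```rust
// Decode a byte stream into a growable list of fixed-size UTF-8 chunks,
// filling each chunk before starting the next one.
extern crate encoding_rs;
use encoding_rs::{CoderResult, SHIFT_JIS};

const CHUNK: usize = 4096;

fn decode_to_chunks(mut src: &[u8]) -> Vec<Vec<u8>> {
    let mut decoder = SHIFT_JIS.new_decoder();
    let mut chunks: Vec<Vec<u8>> = Vec::new();
    loop {
        let mut buf = vec![0u8; CHUNK];
        let (result, read, written, _) = decoder.decode_to_utf8(src, &mut buf, true);
        buf.truncate(written);
        chunks.push(buf);
        src = &src[read..];
        match result {
            CoderResult::InputEmpty => break,    // all input consumed and flushed
            CoderResult::OutputFull => continue, // start the next fixed-size chunk
        }
    }
    chunks
}
```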

@nox (Contributor) commented Jan 4, 2017

My apologies. encoding_rs is the first crate I've developed, and I didn't think through the implications to that extent. I just thought that publishing early and often was good. I also didn't plan with the presumption of rust-encoding adopting Gecko-motivated API changes and breaking compatibility with its existing users.

No need for apologies; I should have prefaced all of that with a disclaimer saying these are just lamentations from someone who cares a bit too much about things. In the end, if we manage to just move your changes into a new version of encoding, we can publish a new encoding_rs version with an all-caps description telling people to use encoding instead.

ACK the rest of your reply.

@bors-servo (Contributor)

☔ The latest upstream changes (presumably 3e5541e) made this pull request unmergeable. Please resolve the merge conflicts.

@SimonSapin (Member)

My current plan is to deprecate the current rust-encoding support in rust-url (but not remove it until we have stronger reasons to do a rust-url 2.0), and instead add a mechanism where users of the url crate can implement a trait to provide non-UTF-8 encodings support. It’s such a niche feature that the loss of convenience isn’t a problem. This plan makes this PR obsolete, so don’t spend time rebasing it :) Thanks anyway for submitting it!

Later (when other Servo dependencies have gotten a similar treatment), I’ll switch Servo entirely from rust-encoding to encoding_rs.

@Hoverbear

I'm closing this as it seems like we're taking a different path:

This plan makes this PR obsolete, so don’t spend time rebasing it :) Thanks anyway for submitting it!

Thank you!

@Hoverbear closed this May 17, 2017
talklittle added a commit to talklittle/rust-url that referenced this pull request Apr 5, 2018
Applies when using `encoding_rs` via the `query_encoding_2` feature.

Code originally by @hsivonen in
servo#262