Switch from kuchiki to LOL_HTML #930

jyn514 · 2020-08-02T15:27:29Z

Closes #876

My styles haven't been working since ~forever, so if someone else could check that this looks right, that would be great. That said, it looks approximately the same before and after to me.

Ideas for testing this are also welcome - the previous tests were only testing the parsing, not the rendered code, so they're not very useful here.

Before:

After:

r? @Kixiron
cc @nataliescottdavidson

Before, it was generating code like this:

```html
<div class="rustdoc mod" container-rustdoc="" id="rustdoc_body_wrapper" tabindex="-1">
```

Now it generates code like this:

```html
<div class="rustdoc mod container-rustdoc" id="rustdoc_body_wrapper" tabindex="-1">
```
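The difference above comes down to merging a class name into the existing `class` attribute instead of emitting a separate empty attribute. Below is a minimal std-only sketch of that merge. `append_class` is a hypothetical helper, not a function from this PR; with lol_html the same logic would presumably run inside an element handler that reads and rewrites the attribute.

```rust
/// Hypothetical helper: compute the new value of a `class` attribute after
/// adding `extra`, without duplicating a class that is already present.
fn append_class(existing: Option<&str>, extra: &str) -> String {
    match existing {
        // The class is already listed: leave the attribute untouched.
        Some(classes) if classes.split_whitespace().any(|c| c == extra) => classes.to_string(),
        // Other classes exist: append with a single separating space.
        Some(classes) if !classes.trim().is_empty() => {
            format!("{} {}", classes.trim_end(), extra)
        }
        // No attribute, or an empty one: the new class is the whole value.
        _ => extra.to_string(),
    }
}
```

Calling it with `Some("rustdoc mod")` and `"container-rustdoc"` yields the `class` value shown in the "after" snippet.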

jyn514 · 2020-08-02T15:32:52Z

Source before: https://gist.github.com/jyn514/421101e65a28bb67720f0fd31628b4c6
Source after: https://gist.github.com/jyn514/08ee9831ad5cb853cb528adb6aa2ea37

jyn514 · 2020-08-02T15:33:54Z

Going to try this on the 300 MB file and see what happens.

- Inline `navigation.html` into `body.html`
- Remove `page.html`

jyn514 · 2020-08-02T16:11:31Z

Once cloudflare/lol-html#56 is fixed, we could also remove SizedBuffer and use MemorySettings instead (technically we could do it earlier but I'm lazy). That can wait for a follow-up PR though.

The way the new LOL rewriter works requires a valid <head> and <body> tag. I think it's pretty safe to assume any HTML generated by rustdoc will have at least those. However, much of the test suite did not, because it was using random content like `b"lah"`. This adds a default HTML content which both makes it easier to write tests and makes the content valid HTML by default. A new function `rustdoc_file_with` was added in case you still need the old behavior.
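For readers unfamiliar with the test helpers, the default-content idea can be sketched roughly as follows. Only the name `rustdoc_file_with` comes from this PR; `FakeRelease`, `DEFAULT_CONTENT`, the field layout, and the signatures here are assumptions made for illustration and may differ from the real test code.

```rust
/// Assumed default: a minimal document that has both <head> and <body>.
const DEFAULT_CONTENT: &[u8] =
    b"<html><head></head><body>default content for tests</body></html>";

/// Hypothetical stand-in for the test builder used by the test suite.
#[derive(Default)]
struct FakeRelease {
    /// (path, file contents) pairs that the fake release will serve.
    rustdoc_files: Vec<(String, Vec<u8>)>,
}

impl FakeRelease {
    /// Add a rustdoc file with the valid default HTML content.
    fn rustdoc_file(self, path: &str) -> Self {
        self.rustdoc_file_with(path, DEFAULT_CONTENT)
    }

    /// Add a rustdoc file with caller-supplied content (the old behavior).
    fn rustdoc_file_with(mut self, path: &str, content: &[u8]) -> Self {
        self.rustdoc_files.push((path.to_string(), content.to_vec()));
        self
    }
}
```

With this shape, most tests only name a path, and the few that care about exact bytes opt in through `rustdoc_file_with`.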

jyn514 · 2020-08-02T17:46:14Z

> Going to try this on the 300 MB file and see what happens.

It worked 🎉 🎉 🎉 Took about a minute to serve then another 5 to load but that's to be expected :)

jyn514 · 2020-08-02T17:48:20Z

It used ~350 MB of memory to serve which seems about right.

- Remove commented-out tests and code
- Fix bad comment
- Remove trailing whitespace

Kixiron

This looks like a really great change, makes me wonder if we shouldn't add a time limit of some sort later on down the line to prevent us getting swamped (alternatively having a higher size cap would probably work, letting us not even try to parse files that are over x MB). You may need to update the benchmarks, and I'd be really interested to know if there's a speedup of any sort after the switch

src/web/rustdoc.rs

jyn514 · 2020-08-02T19:21:27Z

> makes me wonder if we shouldn't add a time limit of some sort later on down the line to prevent us getting swamped (alternatively having a higher size cap would probably work, letting us not even try to parse files that are over x MB)

For now I left the cap at 5 MB. I planned to make a follow-up PR removing SizedBuffer in favor of memory limits, but since you mentioned it I'll make it part of this one.
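As a rough illustration of the size-cap approach, a writer that refuses to grow past a fixed limit could look like the sketch below. The name `SizedBuffer` is taken from this thread, but the fields, constructor, and error message are assumptions, not the actual docs.rs implementation.

```rust
use std::io::{self, Write};

/// Sketch of a capped in-memory buffer: writes that would exceed `limit`
/// fail instead of allocating more memory.
struct SizedBuffer {
    inner: Vec<u8>,
    limit: usize,
}

impl SizedBuffer {
    fn new(limit: usize) -> Self {
        Self { inner: Vec::new(), limit }
    }

    fn into_inner(self) -> Vec<u8> {
        self.inner
    }
}

impl Write for SizedBuffer {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Reject the whole write if it would push us past the cap.
        if self.inner.len() + buf.len() > self.limit {
            return Err(io::Error::new(io::ErrorKind::Other, "file is too big"));
        }
        self.inner.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

A cap like this bounds the size of the rendered output, whereas a parser-level memory limit bounds what the rewriter itself may allocate; the two are complementary.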

> You may need to update the benchmarks, and I'd be really interested to know if there's a speedup of any sort after the switch

Good point, I forgot to update them. The benchmarks are not going to be one-to-one though - before we only measured the parse time, while now it measures both parsing and templating (and that's sort of a fundamental part of this change, there's not a way to do one without the other).

- Add a memory limit for the parser
- Abstract most of rendering into a method on `RustdocPage`
- Use bytes for parsing to avoid validating UTF8-encoding twice

Kixiron

LGTM modulo nits, great job!

src/utils/html.rs

src/web/rustdoc.rs

jyn514 · 2020-08-02T20:35:19Z

@Kixiron how set are you on having benchmarks? So far I have

    let html = std::fs::read_to_string("benches/struct.CaptureMatches.html").unwrap();
    let page = RustdocPage {

    };
    let config = unimplemented!();
    let conn = Connection::connect(config.database_url.as_str(), postgres::TlsMode::None)?;
    let ctx = Context::from_serialize(&page).unwrap();
    let templates = TemplateData::new(conn).unwrap();

and it shows no signs of slowing down, I just need more and more things to be public.

Co-authored-by: Chase Wilson

Kixiron · 2020-08-02T20:44:38Z

Benches don't matter a ton, I can follow-up with them if need be

- Raise size limit from 5 MB to 50 MB
- Use DOCSRS_MAX_PARSE_MEMORY to configure the max memory, defaulting to 350 MB
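A minimal sketch of reading such a setting is shown below. Only the variable name `DOCSRS_MAX_PARSE_MEMORY` and the 350 MB default come from the commit message; the helper names, the choice to interpret the value as a byte count, and the 1024-based megabyte are assumptions.

```rust
use std::num::ParseIntError;

/// Assumed default: 350 MB expressed in bytes (1 MB taken as 1024 * 1024).
const DEFAULT_MAX_PARSE_MEMORY: usize = 350 * 1024 * 1024;

/// Hypothetical helper: interpret the raw setting as a byte count and fall
/// back to the default when the variable is unset.
fn max_parse_memory(raw: Option<&str>) -> Result<usize, ParseIntError> {
    match raw {
        Some(value) => value.trim().parse::<usize>(),
        None => Ok(DEFAULT_MAX_PARSE_MEMORY),
    }
}

/// Read the limit from the process environment.
fn max_parse_memory_from_env() -> Result<usize, ParseIntError> {
    let raw = std::env::var("DOCSRS_MAX_PARSE_MEMORY").ok();
    max_parse_memory(raw.as_deref())
}
```

Splitting the parsing from the environment lookup keeps the fallback logic testable without mutating process-wide state.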

jyn514 · 2020-08-02T20:54:19Z

I also raised the default HTML size limit from 5 MB to 50 MB. Since I tested on a 300 MB file this shouldn't give the prod server any problems.

inikulin · 2020-08-02T21:54:36Z

@jyn514 If I understand the context correctly, then 5 MB should be enough. Lol-html only needs to allocate memory for really long tags (without content, just start tag tokens). So if you don't have, and don't plan to have, start tags longer than 5 MB, then it should be more than enough for the foreseeable future. (Just in case: increasing the buffer size doesn't boost performance.)

jyn514 · 2020-08-02T22:25:12Z

> @jyn514 If I understand the context correctly, then 5 MB should be enough. Lol-html only needs to allocate memory for really long tags (without content, just start tag tokens). So if you don't have, and don't plan to have, start tags longer than 5 MB, then it should be more than enough for the foreseeable future. (Just in case: increasing the buffer size doesn't boost performance.)

Wait, did I read that right? Lol-html will only allocate as much memory as the size of the start tag, even for multi-megabyte files? That's amazing!

inikulin · 2020-08-02T23:03:44Z

@jyn514 yep, it's a streaming parser, so we allocate as little as possible
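To make the memory behavior concrete, here is a toy streaming scanner that buffers bytes only while it is inside a `<...>` tag and forwards everything else straight to the output. This is an illustration of the general idea, not lol-html's actual algorithm: `TagScanner` and its fields are invented for this sketch, and it ignores comments, scripts, and `>` characters inside quoted attribute values.

```rust
use std::io::{self, Write};

/// Toy scanner: peak memory is bounded by the longest tag, not the document.
struct TagScanner {
    /// Bytes of the tag currently being read; empty while in text content.
    tag_buf: Vec<u8>,
    /// Maximum number of bytes a single tag may occupy.
    limit: usize,
    /// Largest buffer size observed so far, kept for inspection.
    max_buffered: usize,
}

impl TagScanner {
    fn new(limit: usize) -> Self {
        Self { tag_buf: Vec::new(), limit, max_buffered: 0 }
    }

    /// Feed one chunk of input; chunks may split a tag at any byte.
    fn feed<W: Write>(&mut self, chunk: &[u8], sink: &mut W) -> io::Result<()> {
        for &byte in chunk {
            if self.tag_buf.is_empty() && byte != b'<' {
                // Text content is forwarded immediately and never buffered.
                sink.write_all(&[byte])?;
                continue;
            }
            if self.tag_buf.len() == self.limit {
                return Err(io::Error::new(io::ErrorKind::Other, "tag exceeds memory limit"));
            }
            self.tag_buf.push(byte);
            self.max_buffered = self.max_buffered.max(self.tag_buf.len());
            if byte == b'>' {
                // The tag is complete: emit it and release the buffer.
                sink.write_all(&self.tag_buf)?;
                self.tag_buf.clear();
            }
        }
        Ok(())
    }
}
```

Under this model a multi-megabyte page with short tags needs only a few bytes of buffer, which is consistent with a small limit still handling a very large file.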

jyn514 · 2020-08-03T02:57:48Z

Switched the limit to 5 MB and it still loaded the 300 MB file fine 🎉

jyn514 · 2020-08-03T15:36:23Z

Aha, I was looking at the wrong screen! Those are some nice numbers :D

inikulin · 2020-08-04T16:50:02Z

@jyn514 btw, I believe you can push the memory limit even lower, to 3-5 KB; in your case that should be enough

jyn514 added 2 commits August 2, 2020 11:04

[WIP] Initial try at LOL HTML rewriter

d7a702e

jyn514 added 2 commits August 2, 2020 11:53

Add missing html files

22b8bc8

Remove unused files

7889db9


This was referenced Aug 2, 2020

Restrict HTML emission to specific nodes cloudflare/lol-html#40

Open

Log when a file is not served because it was too big #932

Merged

jyn514 added A-backend Area: Webserver backend P-medium Medium priority labels Aug 2, 2020

Cleanup

d9604a2


jyn514 added the S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed label Aug 2, 2020

Kixiron reviewed Aug 2, 2020


Use HtmlRewriter instead of rewrite_str

adb8929


Kixiron approved these changes Aug 2, 2020


jyn514 and others added 3 commits August 2, 2020 16:36

Use log instead of println in tests

b39d7c3

Document rewrite_lol

b888c84

Add whitespace

cbe962b


Make parse memory configurable and raise HTML size limit

ec5ca48


Improve comment

f78472b

jyn514 force-pushed the lol-no branch from 9b7f195 to f78472b Compare August 2, 2020 20:54

Kixiron approved these changes Aug 2, 2020


Only allot 5 MB for LOL parser

2dc045d

jyn514 added S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Aug 3, 2020

jyn514 merged commit 482d566 into rust-lang:master Aug 3, 2020

jyn514 deleted the lol-no branch August 3, 2020 15:03


jyn514 mentioned this pull request Aug 4, 2020

Header bar font is different on front page and rustdoc files #935

Closed

Kixiron mentioned this pull request Aug 4, 2020

[Feature Request] Allow memory introspection of HtmlRewriter cloudflare/lol-html#61

Open

jyn514 mentioned this pull request Aug 4, 2020

Log HTML rewriting errors #936

Merged

Nemo157 mentioned this pull request Aug 5, 2020

Cleanup from LOL HTML rewrite #937

Merged

This was referenced Aug 5, 2020

Search pages for some crates look broken #940

Closed

Fix broken tracing documentation formatting tokio-rs/tracing#881

Merged

Font rendering got weird recently #945

Closed

jyn514 mentioned this pull request Aug 15, 2020

Avoid creating two <title>s #971

Closed

jyn514 mentioned this pull request Aug 26, 2020

Styles are broken on old rustdoc pages #1005

Closed

jyn514 mentioned this pull request Sep 2, 2020

Rendering messed up for old crates #1027

Closed

jyn514 removed the S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it label Sep 20, 2020
