Description
It takes 750 ms to deserialize these types, while json.load in Python takes 300 ms.
Reported by @mitsuhiko in IRC.
Activity
mitsuhiko commented on Oct 12, 2016
These are some smaller files from Sentry that should show the same behavior. The files I'm working with are about four to six times the size, but unfortunately I cannot publicly share them.
maps.zip
dtolnay commented on Oct 13, 2016
I took the larger of the two files in the zip (vendor.js.map) and extended the "mappings" and "sourcesContent" fields so that each is four copies of itself. The result is 23 MiB. vendor2.zip
Parsing unbuffered directly from the file takes 4710ms. We expect this to be slow.
Parsing from a buffered reader takes 562ms. I assume this is what @mitsuhiko was running.
Parsing from a string (including reading the file to a string!) takes 55ms. This is the case that I optimized a while back.
Parsing from a vec is the same at 55ms.
Note that in all of these cases parsing to RawSourceMap vs parsing to serde_json::Value takes exactly the same time because the JSON is dominated by large strings.
Parsing in Python takes 248ms in Python 2.7.12 and 186ms in Python 3.5.2. Both Pythons are reading the file into memory as a string first. The read happens here. So Python is doing a slower version of what Rust is doing in 55ms.
I also tried the other json crate for good measure which takes 77ms (still impressive compared to Python).
And of course I tried RapidJSON, which may be the fastest C/C++ JSON parser. Don't mind the nasty but actually really fast reading of the file into a std::string; it only takes 6ms. Using clang++ 3.8.0 with -O3 parsing takes 110ms, and using g++ 5.4.0 with -O3 it takes 67ms.
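For reference, a minimal sketch of the Rust measurement setups described above (not the original benchmark code, which isn't included in this thread). The file path is an assumption based on the attached vendor2.zip, and it parses into serde_json::Value since the comment notes that RawSourceMap and Value take the same time:

```rust
// Cargo.toml: serde_json = "1"
use std::fs::File;
use std::io::{BufReader, Read};
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = "vendor2.js.map"; // hypothetical: the test file extracted from vendor2.zip

    // 1. Unbuffered: from_reader pulls bytes straight from the File.
    let t = Instant::now();
    let _v: serde_json::Value = serde_json::from_reader(File::open(path)?)?;
    println!("from_reader(File):      {:?}", t.elapsed());

    // 2. Buffered: same API, but the File is wrapped in a BufReader.
    let t = Instant::now();
    let _v: serde_json::Value = serde_json::from_reader(BufReader::new(File::open(path)?))?;
    println!("from_reader(BufReader): {:?}", t.elapsed());

    // 3. Read the whole file into memory first, then parse the in-memory string.
    let t = Instant::now();
    let mut s = String::new();
    File::open(path)?.read_to_string(&mut s)?;
    let _v: serde_json::Value = serde_json::from_str(&s)?;
    println!("read + from_str:        {:?}", t.elapsed());

    Ok(())
}
```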
Conclusion
- Parsing from a string beats json.load in Python by 3x when making a fair comparison.
- from_reader could likely be made faster for inputs that implement std::io::Seek, which both File and BufReader do.

Issue retitled from "Parsing 20 MB RawSourceMap is slow" to "Parsing 20MB file using from_reader is slow".

dtolnay commented on Oct 13, 2016
For those wondering, bincode takes 14ms.
dtolnay commented on Oct 13, 2016
Comments from @erickt in IRC:
dimfeld commented on Nov 5, 2016
Depending on your timeline for improving the speed of from_reader, what do you think about mentioning this in the docs as a first step?

I just encountered this problem with a 45MB JSON file that was taking about 25 seconds to load using from_reader. Since I'm new to Rust I didn't think to use a BufReader at first, and that brought the time down to about 1.5 seconds. But as you mentioned here, reading it all into memory in advance and using from_slice was faster still, at 350ms or so.

I don't think the BufReader technique necessarily needs to be documented here, since that's not specific to this crate and is more of a newbie thing, but the vast speed difference between from_slice and from_reader seems worth mentioning if it's not going to change soon. Any thoughts?

edit: If you agree this is a good idea, I'll be glad to submit a PR.
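A minimal sketch of the workaround described in this comment (read the whole file into memory up front, then parse the bytes with from_slice); the file path is hypothetical:

```rust
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the entire file into memory first (hypothetical path),
    // then parse the in-memory bytes instead of streaming from a reader.
    let bytes = std::fs::read("data.json")?;
    let value: Value = serde_json::from_slice(&bytes)?;
    println!("top-level keys: {}", value.as_object().map_or(0, |m| m.len()));
    Ok(())
}
```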
mitsuhiko commented on Nov 5, 2016
The problem with BufRead is that Rust does not provide a way to tell a Read apart from a BufRead in a generic interface. Ideally serde could auto-wrap in a BufReader if only a Read is supplied :(

oli-obk commented on Nov 5, 2016
We could use specialization for Seek and BufReader. With Seek we can detect the size and then choose between slice, buf, or read processing.
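A rough sketch of that idea; the helper below is not part of serde_json, and its name is made up for illustration. It uses Seek to learn the remaining length, reads everything into a pre-sized Vec, and then parses the bytes with from_slice:

```rust
use std::io::{self, Read, Seek, SeekFrom};

use serde::de::DeserializeOwned;

// Hypothetical helper (not part of serde_json's API): use Seek to learn the
// input size up front, read everything into one pre-sized Vec, and then parse
// the in-memory bytes with from_slice instead of streaming through from_reader.
fn from_seekable_reader<R, T>(mut reader: R) -> io::Result<T>
where
    R: Read + Seek,
    T: DeserializeOwned,
{
    // Remaining length = distance from the current position to the end.
    let here = reader.seek(SeekFrom::Current(0))?;
    let end = reader.seek(SeekFrom::End(0))?;
    reader.seek(SeekFrom::Start(here))?;

    let mut bytes = Vec::with_capacity((end - here) as usize);
    reader.read_to_end(&mut bytes)?;
    // serde_json::Error converts into io::Error, so `?` works here too.
    Ok(serde_json::from_slice(&bytes)?)
}

fn main() -> io::Result<()> {
    let file = std::fs::File::open("vendor2.js.map")?; // hypothetical test file
    let value: serde_json::Value = from_seekable_reader(file)?;
    println!("parsed {} top-level keys", value.as_object().map_or(0, |m| m.len()));
    Ok(())
}
```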
Cross-referenced: serde_json::from_reader (indiv0/xkcd-rs#2)

bouk commented on Nov 29, 2017
I'm taking a stab at implementing the BufRead support using specialization, which would make it nightly-only for now, although I guess we could add a from_bufread. I think with fill_buf we could do without any kind of copying for the (presumably) common case where a string fits in the buffer of the BufReader.

I'll create a PR for discussion when I have something to show.
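For context, a tiny sketch of the fill_buf/consume mechanism being referred to; the JSON literal is made up. fill_buf exposes the BufReader's internal buffer as a slice, which is what would let a BufRead-aware parser borrow string data instead of copying it byte by byte:

```rust
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let json = br#"{"name":"vendor.js","mappings":"AAAA;AACA"}"#;
    let mut reader = BufReader::new(&json[..]);

    // fill_buf returns a slice into the reader's internal buffer without
    // copying; if a whole string token happens to sit inside this slice,
    // a BufRead-aware parser could hand it out as a borrowed &str.
    let buf = reader.fill_buf()?;
    println!("buffered {} bytes: {}", buf.len(), String::from_utf8_lossy(buf));

    // consume reports how many of the buffered bytes were actually used,
    // so the next fill_buf continues from the right place.
    let used = buf.len();
    reader.consume(used);
    Ok(())
}
```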
bouk commented on Dec 7, 2017
All right I have an absolutely terrible but working PoC. It required a lot of 'open-heart surgery' on the project to make all the lifetimes and stuff work (you can't return a buffer from a BufRead) but you can look at the result here: https://github.com/bouk/json/tree/buf-read-keys (I think a rewrite of the whole read.rs file would be the most prudent course of action). Again, it's a PoC, the code is 💩.
Anyways, for the result: with this script:
I get 450ms parse time on the current master, but on my branch it's brought down to ~100ms with the buffer optimizations. So, a 4-5x speed up is what we can expect here. Like I mentioned before, a lot of assumptions need to be rethought, like the Reference enum which I couldn't get working properly and which doesn't even lead to improvements in the default json Value parser, as borrowed strings aren't used (but they could be useful for other types I guess).
So, to conclude: definitely possible and worthwhile, look at my untested and broken implementation for inspiration, but there is more work required.
EDIT: OK, I take it back, it's slightly nicer now. Not much, but some.