Astropy CSV table reader using pyarrow #17706
Conversation
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
👋 Thank you for your draft pull request! Did you know that you can use …
Thanks! I want to benchmark this, but does that mean we need to install pyarrow in https://github.com/astropy/astropy/blob/main/.github/workflows/ci_benchmark.yml?
There are one-time benchmarks here: #16869 (comment). These demonstrate that pyarrow is significantly faster.
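For anyone who wants a rough local comparison, something along these lines works (a sketch only; "big_table.csv" is a placeholder file name, and no numbers are claimed here):

```python
# Rough local timing sketch: pyarrow's CSV reader vs. astropy's fast CSV reader.
import time

import pyarrow.csv as pacsv
from astropy.table import Table

t0 = time.perf_counter()
pa_table = pacsv.read_csv("big_table.csv")
print(f"pyarrow.csv.read_csv: {time.perf_counter() - t0:.2f} s")

t0 = time.perf_counter()
tbl = Table.read("big_table.csv", format="ascii.fast_csv")
print(f"astropy fast reader:  {time.perf_counter() - t0:.2f} s")
```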
Nice! I like the general idea; my only major comment is that I'm not sure one should add the commented-line skipper at this initial stage.
For follow-up, I guess, would be to make this the default "first try" if pyarrow is available, and then deprecate the fast reader?
It does seem Table.{from,to}_pyarrow methods would be reasonable, but better as follow-up.
Since @dhomeier has shown pyarrow to be significantly faster, it would be good to have it for the biggest tables. And this is a relatively thin wrapper just to match the API we are used to, so why not? For smaller tables we have other established solutions which are more flexible (not least our own pure-Python readers and our own C reader). How many GB-sized tables are there in the wild with commented lines that are not in the header?

I'm just worried about user confusion along the lines of "It's reading this table just fine, and that table that's almost identical (but with comment lines) crashes with a Python out-of-memory error". Of course, that only applies to the biggest tables of them all. For CSV files in the 0.5-1 GB range, this would probably still be faster AND would fit into memory (and maybe not be too slow) on modern machines. So it's a trade-off.
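To make the memory concern concrete, here is a sketch of what a commented-line skipper has to do, given that pyarrow.csv itself has no comment-character option (the function name is just for illustration, not the PR's code):

```python
import io

import pyarrow.csv as pacsv


def read_csv_skip_comments(path, comment=b"#"):
    # Filter out comment lines in Python, then hand pyarrow an in-memory buffer.
    # The entire decommented file is held in memory at once, which is exactly
    # the out-of-memory worry for multi-GB tables.
    with open(path, "rb") as fh:
        data = b"".join(line for line in fh if not line.lstrip().startswith(comment))
    return pacsv.read_csv(io.BytesIO(data))
```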
This may be a bit too technical for a first round of review, but I wanted to get this kind of feedback in early too so it doesn't grow into too much of a pain later: here are a couple of suggestions and comments, mostly about type annotations and internal consistency.
Thanks for the great comments! I think I've addressed them all, or at least responded. Sounds like I have agreement to keep going ahead on this and start working on tests, docs etc?
cb44ed7 to 054856d (force-push)
@mhvk @neutrinoceros - I've addressed all the first round of comments.
I lost track of the feature freeze date until the recent announcement, but I hope this can get into this release. I'm planning to add testing and documentation on Monday so I would be very grateful if you are able to make time to look at this again early this week.
One unfortunate thing I just noticed is that my current profiling shows this full implementation is only about a factor of 2 faster than the fast ASCII reader.
It turns out that converting an object array of strings into a numpy string array is the bottleneck.
This may well go away eventually, with numpy's new string dtype.
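For illustration (a toy example, not the PR's code), the conversion in question looks like this:

```python
import pyarrow as pa

# pyarrow string columns convert to numpy as object arrays of Python str ...
str_col = pa.chunked_array([pa.array(["alpha", "beta", "gamma"] * 1000)])
obj_arr = str_col.to_numpy()
print(obj_arr.dtype)      # object

# ... and building a fixed-width numpy string array from that is a
# per-element copy, which is the slow step being profiled here.
fixed = obj_arr.astype(str)
print(fixed.dtype)        # e.g. <U5
```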
Profiling the new implementation
For a table with two commented lines at the top of the file:
For a real-world example, reading the first Gaia ECSV source file here: https://cdn.gea.esac.esa.int/Gaia/gdr3/gaia_source/ (after unzipping) gives:
@taldcroft - this mostly looks good, and quite a few of my comments are really just nitpicks. Though I think I found a bug that more detailed tests would surely have found too...
p.s. The GAIA example shows this is really worth it, great!
astropy/io/misc/pyarrow/csv.py (outdated)

    Notes
    -----
    - If the input array is of string type, it delegates the conversion to …
This comment is out of date and can be removed.
@mhvk - I added a suite of tests that covers all the ….

A significant addition since you last looked is support for date, time, and timestamp types. Currently these result in numpy datetime64 arrays, or an object array of datetime.time objects (for pure times like 12:34:45.233). An obvious idea is an option to allow converting to a ….

Apart from that, I think this is mostly feature complete, at least for this release. So it's just the docs, change log, and What's New.
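As a quick illustration of the inference described above (a toy example under default pyarrow type inference, not the PR's code):

```python
import io

import pyarrow as pa
import pyarrow.csv as pacsv

csv_bytes = b"when,value\n2025-01-01T12:34:45.233,1.5\n2025-01-02T00:00:00,2.5\n"
tbl = pacsv.read_csv(io.BytesIO(csv_bytes))

when = tbl["when"]
print(pa.types.is_timestamp(when.type))   # True: inferred as a timestamp column
print(when.to_numpy().dtype)              # a numpy datetime64 dtype
```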
This looks rather nice! One real comment left, with the suggestion to do the filling on the numpy side for speed and to save memory. But even that is not serious.
A problem, though, is that most CI runs errored:
    TypeError: ChunkedArray.to_numpy() takes no keyword arguments
EDIT: if this is a pyarrow version problem, one could probably set a higher minimum pyarrow version, since it is new for astropy.
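For reference, a minimal compatibility shim would look like the following, assuming the failing call passed `zero_copy_only` (which older pyarrow's `ChunkedArray.to_numpy()` does not accept); raising the minimum pyarrow version, as suggested, avoids the need for it:

```python
import pyarrow as pa


def chunked_to_numpy_compat(arr: pa.ChunkedArray):
    try:
        return arr.to_numpy(zero_copy_only=False)
    except TypeError:
        # Older pyarrow: to_numpy() accepts no keyword arguments; the
        # plain call always copies, so the result is equivalent here.
        return arr.to_numpy()
```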
astropy/io/misc/pyarrow/csv.py (outdated)

    if pa.types.is_integer(arr.type) or pa.types.is_floating(arr.type):
        fill_value = 0
    elif is_string:
        is_string = True
This is not needed, right?
astropy/io/misc/pyarrow/csv.py (outdated)

    elif pa.types.is_timestamp(arr.type):
        fill_value = pa.scalar(datetime.datetime(2000, 1, 1), type=arr.type)
    else:
        raise TypeError(f"unsupported PyArrow array type: {arr.type}")
Are there any unsupported PyArrow types left? If not, the above logic is just for fill_value, which is not needed if there are no masked elements, so it could be moved inside the masked branch.
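A sketch of that suggestion (a hypothetical helper, not the PR's actual code): compute the fill value only when the column actually has nulls, and do the filling before building the masked array:

```python
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc


def chunked_to_numpy(arr: pa.ChunkedArray) -> np.ndarray:
    if arr.null_count == 0:
        # No masked elements, so no fill value is needed at all.
        return arr.to_numpy()

    # Masked branch: pick a fill value appropriate for the column type.
    if pa.types.is_integer(arr.type) or pa.types.is_floating(arr.type):
        fill_value = 0
    elif pa.types.is_string(arr.type):
        fill_value = ""
    else:
        raise TypeError(f"unsupported PyArrow array type: {arr.type}")

    mask = pc.is_null(arr).to_numpy()
    data = pc.fill_null(arr, fill_value).to_numpy()
    return np.ma.MaskedArray(data, mask=mask)
```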
088da14 to 5a378cf (force-push)
@mhvk - I think this is ready for final review.
This is not supported in older pyarrow and does not seem to be required based on testing.
ca53b32 to 1ef8037 (force-push)
@mhvk - thanks so much for the review; as always, this PR ended up in a much better place! I've addressed your minor docs comments and auto-merge (squash) is now enabled. 🤞 it gets in without another What's New merge conflict.
Description
This pull request is an implementation of a fast CSV reader for astropy that uses pyarrow.csv.read_csv. This was discussed in #16869. In particular, a speed-up of around 10x over pandas and the astropy fast reader is noted in this profiling by @dhomeier.
The goal for this reader is to make an interface that will be familiar to astropy io.ascii users, while exposing some additional features brought by pyarrow read_csv. The idea is to keep the interface clean and consistent with astropy.

A quick demonstration notebook that you can use to play with this is at: https://gist.github.com/taldcroft/ac15bc516a7bf7c76f9eec644c787298
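For orientation, this is roughly what the thin wrapper boils down to (a hand-rolled sketch using only the pyarrow call named above; the PR's actual public API, option names, and masking/dtype handling are more complete):

```python
import pyarrow.csv as pacsv
from astropy.table import Table


def read_csv_to_table(path):
    pa_table = pacsv.read_csv(path)
    # Convert each pyarrow column to numpy and build an astropy Table.
    # (String columns come back as object arrays here; the real reader
    # handles string dtypes and masked values more carefully.)
    cols = {name: pa_table[name].to_numpy() for name in pa_table.column_names}
    return Table(cols)


# t = read_csv_to_table("example.csv")
```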
Fixes #16869
Related
pandas-dev/pandas#54466
Please DO squash and merge; the individual commits are not valuable here.