
Astropy CSV table reader using pyarrow #17706

Merged: 43 commits merged into astropy:main from io-pyarrow-csv on Apr 25, 2025

Conversation

@taldcroft (Member) commented Feb 1, 2025

Description

This pull request is an implementation of a fast CSV reader for astropy that uses pyarrow.csv.read_csv. This was discussed in #16869. In particular, a speed-up of around 10x over pandas and the astropy fast reader is noted in this profiling by @dhomeier.

The goal for this reader is to make an interface that will be familiar to astropy io.ascii users, while exposing some additional features brought by pyarrow read_csv. The idea is to keep the interface clean and consistent with astropy.

A quick demonstration notebook that you can use to play with this is at: https://gist.github.com/taldcroft/ac15bc516a7bf7c76f9eec644c787298
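For orientation, a minimal usage sketch (the format string matches the profiling examples later in this thread; the file name is illustrative):

from astropy.table import Table

# New reader: delegates parsing to pyarrow.csv.read_csv under the hood.
t = Table.read("data.csv", format="pyarrow.csv")

# Existing fast C reader, for comparison.
t_fast = Table.read("data.csv", format="ascii.csv", fast_reader="force")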

Fixes #16869

Related

pandas-dev/pandas#54466

Please DO squash and merge; the individual commits are not valuable here.

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

github-actions bot (Contributor) commented Feb 1, 2025

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

github-actions bot (Contributor) commented Feb 1, 2025

👋 Thank you for your draft pull request! Did you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@pllim added this to the v7.1.0 milestone Feb 3, 2025
@taldcroft requested review from hamogu, mhvk and dhomeier February 6, 2025 16:54
@pllim (Member) left a comment

Thanks! I want to benchmark this but does that mean we need to install pyarrow in https://github.com/astropy/astropy/blob/main/.github/workflows/ci_benchmark.yml ?

@taldcroft (Member, Author)

> Thanks! I want to benchmark this but does that mean we need to install pyarrow in https://github.com/astropy/astropy/blob/main/.github/workflows/ci_benchmark.yml ?

There are one-time benchmarks here: #16869 (comment). These demonstrate that pyarrow read_csv() is about a factor of 10 faster than the other readers.

@mhvk (Contributor) left a comment

Nice! I like the general idea; my only major comment is that I'm not sure one should add the commented-line skipper at this initial stage.

As a follow-up, I guess, one could make this the default "first try" if pyarrow is available, and then deprecate the fast reader?

It does seem Table.{from,to}_pyarrow methods would be reasonable, but better as follow-up.

@hamogu (Member) commented Feb 7, 2025

Since @dhomeier has shown pyarrow to be significantly faster, it would be good to have it for the biggest tables. And this is a relatively thin wrapper just to match the API we are used to, so why not?
I do wonder (similarly to @mhvk) how far it makes sense to go in adding capabilities that are not native to pyarrow (e.g. comment characters). Is the pure-Python preprocessing worth it at all? Would that dilute the advertised point "this is super fast and super memory-efficient, so use it for tables in the GB range"?

For smaller tables we have other established solutions which are more flexible (not least our own pure-Python readers and our own C reader). How many GB-sized tables are there in the wild with commented lines that are not in the header? I'm just worried about user confusion along the lines of "It's reading this table just fine, but that almost-identical table (with comment lines) crashes with a Python out-of-memory error". Of course, that only applies to the biggest tables of them all. For CSV files in the 0.5-1 GB range, this would probably still be faster AND would fit into memory on modern machines. So it's a trade-off.
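To make the memory trade-off concrete, here is a rough sketch of the kind of pure-Python comment-stripping preprocessing under discussion (a hypothetical helper, not the PR's actual code):

import io

import pyarrow.csv

def read_csv_skip_comments(path, comment=b"#"):
    # Strip comment lines in Python before handing the data to pyarrow.
    # Note: this buffers the entire cleaned file in memory, which is
    # exactly the concern raised above for multi-GB tables.
    with open(path, "rb") as fh:
        data = b"".join(line for line in fh if not line.startswith(comment))
    return pyarrow.csv.read_csv(io.BytesIO(data))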

@neutrinoceros (Contributor) left a comment

This may be a bit too technical for a first round of review, but I wanted to get this kind of feedback in early so it doesn't grow into too much of a pain later:
Here are a couple suggestions and comments mostly about type annotations and internal consistency.

@taldcroft (Member, Author) left a comment

Thanks for the great comments! I think I've addressed them all, or at least responded. It sounds like I have agreement to keep going on this and start working on tests, docs, etc.?

@taldcroft (Member, Author) left a comment

@mhvk @neutrinoceros - I've addressed all the first round of comments.

I lost track of the feature freeze date until the recent announcement, but I hope this can get into this release. I'm planning to add testing and documentation on Monday so I would be very grateful if you are able to make time to look at this again early this week.

@taldcroft marked this pull request as ready for review April 20, 2025 10:13
@taldcroft (Member, Author) commented Apr 20, 2025

One unfortunate thing I just noticed is that my current profiling shows that this full implementation is only about 2x faster than the fast ASCII reader.

# Assumed imports for this IPython snippet:
from astropy.table import Table
from astropy.table.table_helpers import simple_table

dat = simple_table(size=100000, cols=20)
dat.write("junk.csv", format="ascii.basic", overwrite=True, delimiter=",")

%timeit Table.read("junk.csv", format="ascii.csv", fast_reader="force", delimiter=",", guess=False)
# 103 ms

%timeit Table.read("junk.csv", format="pyarrow.csv")
# 51 ms

It turns out that converting an object array of strings into a numpy string array is the bottleneck.
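A rough illustration of where the time goes (assumed shape of the conversion, not the PR's exact code):

import numpy as np
import pyarrow as pa

arr = pa.array(["alpha", "beta", "gamma"] * 100_000)

# Cheap: pyarrow hands back an object array of Python strings.
obj = arr.to_numpy(zero_copy_only=False)

# Slow: repacking into a fixed-width numpy string array ('<U5' here)
# requires scanning and copying every element.
fixed = obj.astype("U")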

@mhvk (Contributor) commented Apr 20, 2025

> It turns out that converting an object array of strings into a numpy string array is the bottleneck.

This may well go away eventually with numpy's new StringDType - there's talk about trying to have a copy-less version with pyarrow. Anyway, a factor of 2 is still nice! And presumably a lot better for mostly numeric data?
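For reference, a minimal sketch of that idea (requires numpy >= 2.0; a zero-copy bridge from pyarrow remains speculative):

import numpy as np

# Variable-width UTF-8 strings, with no fixed-width padding per element.
s = np.array(["alpha", "beta", "gamma"], dtype=np.dtypes.StringDType())
print(s.dtype)  # StringDType()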

@taldcroft (Member, Author)

Profiling the new implementation pyarrow.csv vs fast io.ascii gives:

  • About 30% faster for all string columns
  • About 16x faster for all int and float columns

For a table with two commented lines at the top of the file:

  • About 5x faster for all int, float columns, with no change in memory use.

@taldcroft (Member, Author) commented Apr 21, 2025

For a real-world example, reading the first Gaia ECSV source file here: https://cdn.gea.esac.esa.int/Gaia/gdr3/gaia_source/ (after unzipping) gives:

  • fast io.ascii: 10.1 s (using the CSV reader just ignoring comments)
  • pyarrow.csv: 1.5 s
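Roughly what was timed, for reproducibility (file name and reader options are assumptions; the unzipped Gaia file is a CSV with a large commented ECSV header):

from astropy.table import Table

%timeit Table.read("GaiaSource_000000-003111.csv", format="ascii.csv", guess=False)  # ~10.1 s
%timeit Table.read("GaiaSource_000000-003111.csv", format="pyarrow.csv")  # ~1.5 s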

@mhvk (Contributor) left a comment

@taldcroft - this mostly looks good, and quite a few of my comments are really just nitpicks. Though I think I found a bug that more detailed tests would surely have found too...

p.s. The GAIA example shows this is really worth it, great!


Docstring excerpt under review:

Notes
-----
- If the input array is of string type, it delegates the conversion to

Review comment (Contributor):
This comment is out of date and can be removed.

@taldcroft (Member, Author)

@mhvk - I added a suite of tests that covers all the read_csv arguments and a few more things.

A significant addition since you last looked is support for date, time, timestamp types. Currently these result in numpy datetime64 arrays, or an object array of datetime.time objects (for pure times like 12:34:45.233). An obvious idea is an option to allow converting to a Time mixin column where applicable.
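A small sketch of the described behavior (assumptions: the reader accepts file-like input, as pyarrow.csv.read_csv does, and the exact datetime64 unit may differ):

import io

from astropy.table import Table
from astropy.time import Time

csv = io.BytesIO(b"obs,flux\n2025-01-01T00:00:00,1.0\n2025-01-02T12:30:00,2.5\n")
t = Table.read(csv, format="pyarrow.csv")
print(t["obs"].dtype)  # a numpy datetime64 dtype, per the above

# The suggested follow-up: let the caller (or an option) convert to a
# Time mixin column.
t["obs"] = Time(t["obs"])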

Apart from that, I think this is mostly feature complete at least for this release. So it's just the docs, change log, what's new.

@mhvk (Contributor) left a comment

This looks rather nice! One real comment left, with the suggestion to do the filling on the numpy side for speed and to save memory. But even that is not serious.

A problem though is that most CI runs errored:

TypeError: ChunkedArray.to_numpy() takes no keyword arguments

EDIT: if this is a pyarrow version problem, one could probably set the minimum pyarrow version higher, since the dependency is new for astropy.
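One possible compatibility shim for that error (an assumption about the cause; bumping the minimum pyarrow version, as suggested, may be the simpler fix):

import pyarrow as pa

def chunked_to_numpy(chunked):
    try:
        # Newer pyarrow accepts the keyword on ChunkedArray directly.
        return chunked.to_numpy(zero_copy_only=False)
    except TypeError:
        # Older pyarrow: flatten to a single Array, whose to_numpy()
        # has long accepted the zero_copy_only keyword.
        return chunked.combine_chunks().to_numpy(zero_copy_only=False)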

Code excerpt under review:

if pa.types.is_integer(arr.type) or pa.types.is_floating(arr.type):
    fill_value = 0
elif is_string:
    is_string = True

Review comment (Contributor):

This is not needed, right?

Code excerpt under review:

elif pa.types.is_timestamp(arr.type):
    fill_value = pa.scalar(datetime.datetime(2000, 1, 1), type=arr.type)
else:
    raise TypeError(f"unsupported PyArrow array type: {arr.type}")

Review comment (Contributor):

Are there any unsupported PyArrow types left? If not, the above logic is just for fill_value, which is not needed if there are no masked elements, so it could be moved inside the masked branch.
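A sketch of that restructuring (function and variable names are assumptions based on the excerpts above):

import datetime

import pyarrow as pa

def get_fill_value(arr):
    # Called only from the masked branch, i.e. when arr.null_count > 0,
    # so fully-valid columns skip the type dispatch entirely.
    if pa.types.is_integer(arr.type) or pa.types.is_floating(arr.type):
        return 0
    if pa.types.is_string(arr.type):
        return ""
    if pa.types.is_timestamp(arr.type):
        return pa.scalar(datetime.datetime(2000, 1, 1), type=arr.type)
    raise TypeError(f"unsupported PyArrow array type: {arr.type}")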

@taldcroft (Member, Author)

@mhvk - I think this is ready for final review.

  • Fixed the annoying Windows-only problems.
  • Fixed a real bug in the comment handling and added a number of tests.
  • Fixed most of the coverage issues. The remaining 3 bits of uncovered code are OK (import check and two data types where the code is obvious).
  • Added a lot more documentation.

@taldcroft enabled auto-merge (squash) April 25, 2025 18:20
@taldcroft (Member, Author)

@mhvk - thanks so much for the review, as always this PR ended up in a much better place! I've addressed your minor docs comments and auto-merge squash is now enabled. 🤞 it gets in without another What's New merge conflict.

@taldcroft merged commit 5ce134d into astropy:main Apr 25, 2025
24 of 27 checks passed
@taldcroft deleted the io-pyarrow-csv branch April 25, 2025 18:36
Successfully merging this pull request may close these issues:

  • Consider using pyarrow under the hood for fast ASCII reading (#16869)