Reading/writing of W3C-style embedded metadata in CSV, TSV files #25379


Open
nedclimaterisk opened this issue Feb 20, 2019 · 3 comments
Labels
Enhancement IO CSV read_csv, to_csv

Comments

@nedclimaterisk

Related to #2485

The W3C has a Tabular Data Model recommendation that covers embedded metadata, including arbitrary text data as well as column-specific metadata such as column data types.

It would be very nice if pandas could read metadata like this. The recommendation has a section with an example of CSV/TSV header metadata that might make a good starting point. The full recommendation seems somewhat vague, but perhaps that means pandas could help to define some more specific standards.

Perhaps a YAML header behind # characters, where some known variable names (e.g. datatype) are captured for use in reading the rest of the file, where remaining unused YAML data is added to a df.metadata dictionary?
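A minimal sketch of that idea, using only the Python standard library. It assumes a simplified header of flat `key: value` lines behind `#` characters rather than full YAML. The helper name `split_commented_header`, the `KNOWN_KEYS` set, and the comma-separated `datatype` value are all hypothetical, not part of pandas or the W3C recommendation.

```python
import csv
import io

# Hypothetical set of header keys a reader would act on directly.
KNOWN_KEYS = {"datatype"}


def split_commented_header(text):
    """Split text into (known, metadata, body).

    Assumes a flat `key: value` header behind leading `#` characters,
    which is a simplification of full YAML.
    """
    known, metadata, body_lines = {}, {}, []
    in_header = True
    for line in text.splitlines():
        if in_header and line.startswith("#"):
            key, sep, value = line.lstrip("#").partition(":")
            if sep:  # skip comment lines that are not key: value pairs
                target = known if key.strip() in KNOWN_KEYS else metadata
                target[key.strip()] = value.strip()
            continue
        # The first non-comment line ends the header.
        in_header = False
        body_lines.append(line)
    return known, metadata, "\n".join(body_lines)


sample = """\
# publisher: Example Org
# datatype: string, integer
name,count
a,1
b,2
"""

known, metadata, body = split_commented_header(sample)
datatypes = [t.strip() for t in known["datatype"].split(",")]
rows = list(csv.reader(io.StringIO(body)))
```

Here `metadata` plays the role of the proposed `df.metadata` dictionary (an attribute pandas does not have today), while `datatypes` is the part a reader could turn into an explicit `dtype` mapping for `pandas.read_csv`.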

@WillAyd WillAyd added Enhancement IO CSV read_csv, to_csv labels Feb 20, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Feb 20, 2019
@WillAyd
Member

WillAyd commented Feb 20, 2019

Thanks - I wasn't even aware of this. I think this is an interesting idea and would agree that the datatype annotations seem like a logical starting point.

PRs are always welcome if you have an idea on how to implement it.

@jbrockmendel
Member

how common is this format in the wild?

@naught101

Probably not very common at all, but it's a recommended spec, CSV metadata management is a real PITA, and this seems to solve it. Getting it added to the most popular CSV manipulation library around would really help make it more common, I reckon.

There are also potential side-benefits, for example the #datatype declaration would allow immediate inference of datatypes without having to scan the first 100 lines of the CSV.
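A rough sketch of that side-benefit, again with only the standard library: a declared `#datatype` line drives the conversion of every cell, so nothing has to be sampled or guessed. The line layout (`#datatype` followed by one type name per column, comma-separated), the `CONVERTERS` mapping, and the function name are assumptions for illustration, not an existing pandas feature.

```python
import csv
import io

# Hypothetical mapping from declared type names to Python converters.
CONVERTERS = {"string": str, "integer": int, "number": float}


def read_with_declared_types(text):
    """Read CSV text whose first line is `#datatype,<type>,<type>,...`."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("#datatype"):
        raise ValueError("expected a #datatype line first")
    type_names = lines[0].split(",")[1:]
    converters = [CONVERTERS[name.strip()] for name in type_names]
    reader = csv.reader(io.StringIO("\n".join(lines[1:])))
    header = next(reader)
    # Each cell is converted by its column's declared type; no sampling needed.
    records = [
        {col: conv(cell) for col, conv, cell in zip(header, converters, row)}
        for row in reader
    ]
    return header, records


sample = (
    "#datatype,string,integer,number\n"
    "site,year,temp\n"
    "A,2019,21.5\n"
    "B,2018,19.0\n"
)
header, records = read_with_declared_types(sample)
```

In pandas the same declared types could be translated into the `dtype` argument of `read_csv`, so the parser is told the column types up front instead of inferring them.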

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022