Skip to content

ENH: IO support for R data files with pandas.read_rdata and DataFrame.to_rdata #40884

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from

Conversation

ParfaitG
Copy link
Contributor

@ParfaitG ParfaitG commented Apr 11, 2021


Note: This PR relies on a new dependency, pyreadr.

@ParfaitG ParfaitG changed the title Rdata io ENH: IO support for R data files with pandas.read_rdata and DataFrame.to_rdata Apr 11, 2021
@@ -360,6 +360,8 @@ zlib Compression for HDF5
fastparquet 0.4.0 Parquet reading / writing
pyarrow 0.15.0 Parquet, ORC, and feather reading / writing
pyreadstat SPSS files (.sav) reading
pyreadr R files (.RData, .rda, .rds) reading / writing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ParfaitG I didn't follow the entire thread, but we do not want to add these deps generally. IIRC you had a much simpler way (to just link in the c-code to read the format). that would be much more prefereable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR relies on a new dependency, pyreadr, (available in conda) for default engine option. Please advise how to add to builds for pytests.

this would be ok. we do not want to add r as a dep for even testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to maintenance and licensing, the consensus was to use pyreadr as a soft dep like pyreadstat for read_spss. Understood about not adding R as dep. For rscript engine, I am suggesting we use R via subprocess call similar to backends in io.clipboard. (Can serve as use case for Python/R arrow project). But this PR is set up to easily remove either engine (i.e., separate classes and tests).

However, unless I am mistaken the CI tests does have R installed. I am getting results and fixing fails in test_rscript.py on Linux/Windows/Mac builds which checks for Rscript (nothing yet for test_pyreadr.py).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyreadr as a soft dep like pyreadstat for read_spss.

yes this is ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For rscript engine, I am suggesting we use R via subprocess call similar to backends in io.clipboard. (Can serve as use case for Python/R arrow project). But this PR is set up to easily remove either engine (i.e., separate classes and tests).

I really don't want to complicate our CI any more. So i don't want this. Instead use simple frames for the expected return values).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Will adjust code and docs to exclusively run pyreadr. Should I add pyreadr entries to the three yaml files in pandas/tree/master/ci/deps where pyreadstat is also included? Also, because .RData and .rda can potentially have more than one named data frame (unlike .rds), we may have to return a dict of DataFrames like pyreadr does.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Will adjust code and docs to exclusively run pyreadr. Should I add pyreadr entries to the three yaml files in pandas/tree/master/ci/deps where pyreadstat is also included? Also, because .RData and .rda can potentially have more than one named data frame (unlike .rds), we may have to return a dict of DataFrames like pyreadr does.

yes you want to add it by default so its used. but also one build should not have it so that if its not installed it skips properly.

we may have to return a dict of DataFrames like pyreadr does.

that would be ok, we do this in read_html, just clearly document & make the signature reflect this.

@jreback jreback added the IO Data IO issues that don't fit into a more specific label label Apr 12, 2021
@ParfaitG
Copy link
Contributor Author

@twoertwein - If you have a chance, please advise on io handling in this new rdata module. R data files are binary types. Handling is largely borrowed from the recent io xml. But here, buffers need to be saved to disk for parsing and writing.

filepath_or_buffer = (
handle_obj.handle.read()
if hasattr(handle_obj.handle, "read")
else handle_obj.handle
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is handle_obj.handle used and does it return the content?

Copy link
Contributor Author

@ParfaitG ParfaitG Apr 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, all valid file/URL paths and buffers render in handle_obj.handle. While buffers return with read(), stringified paths return unchanged. I have tests for both file and file-like types:

def test_read_rds_file(datapath):
    filename = datapath("io", "data", "rdata", "ghg_df.rds")
    r_df = read_rdata(filename)
    ...

def test_bytes_read_rda(datapath):
    filename = datapath("io", "data", "rdata", "env_data_dfs.rda")

    with open(filename, "rb") as f:
        r_dfs = read_rdata(f, file_format="rda")
    ...

def test_bytesio_rds(datapath):
    filename = datapath("io", "data", "rdata", "sea_ice_df.rds")

    with open(filename, "rb") as f:
        with BytesIO(f.read()) as b_io:
            r_df = read_rdata(b_io, file_format="rds")
    ...

R data files can also have an ASCII (i.e., text) version format (see docs), so I accommodated this as well with different modes within code.

with _preprocess_data(handle_data) as r_data:
    mode = "wb" if isinstance(r_data, io.BytesIO) else "w"
    with open(r_temp, mode) as f:
        f.write(r_data.read())

Actually reading docs closer they indicate ASCII is still a binary file. So, I will adjust above to handle mode here only for binary, wb. And advise users in docs to only use treat R data in rb modes for file-like objects.

Saved R objects are binary files, even those saved with ascii = TRUE, so ensure that they are transferred without conversion of end-of-line markers and of 8-bit characters. The lines are delimited by LF on all platforms.

)

with TemporaryDirectory() as tmp_dir:
r_temp = os.path.join(tmp_dir, "rdata.rda")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could that create problems when multiple users try to read/write R data (or the same user in multiple processes)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe TempDirectory runs on users' local machine and whichever user saves to final path last will see those changes. According to Python docs, TempDirectory runs same rules as tempfile.mkdtemp:

Creates a temporary directory in the most secure manner possible. There are no race conditions in the directory’s creation. The directory is readable, writable, and searchable only by the creating user ID.

Consequently, no other user will have access to that directory and files during processing. Also, paths are pretty unique like for Unix: i.e., /tmp/tmpXKJXHZ to indicate each call points to different temp directory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, perfect! I thought it would just return /tmp.

@ParfaitG
Copy link
Contributor Author

Rdata_IO_tools

@ParfaitG ParfaitG requested a review from jreback April 16, 2021 17:25
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks really good a couple of questions / comments

The top-level function ``read_rdata`` will read the native serialization types
in the R language and environment. For .RData and its synonymous shorthand, .rda,
that can hold multiple R objects, method will return a ``dict`` of ``DataFrames``.
For .rds types that only contains a single R object, method will return a single
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reference that you can point to for this format (e.g. docs)

select_frames: Optional[List[str]] = None,
rownames: bool = True,
storage_options: StorageOptions = None,
) -> Union[DataFrame, Dict[str, DataFrame]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, shouldn't this always return a dict-of-frames? e.g. when does it and when does it not?

commands. Default 'infer' will use extension in file name to
to determine the format type.

select_frames : list, default None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default is to return all?

Selected names of DataFrames to return from R rda and RData types that
can contain multiple objects.

rownames : bool, default True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, maybe call this index=True?

Returns
-------
DataFrame or dict of DataFrames
Depends on R data type where rds formats returns a single DataFrame and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you clarify when this happens


See Also
--------
read_sas : Read SAS datasets into DataFrame.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you lited read_parquet / read_feather above (ok to add more, but would add in both places the same list)


Notes
-----
Any R data file that contains a non-data.frame object may raise parsing errors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a link to the R references here



def test_read_rds_non_df(datapath):
from pyreadr import custom_errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this import to the top

@jreback
Copy link
Contributor

jreback commented Apr 20, 2021

pyreadr uses this license

License: GNU Affero General Public License v3 or later (AGPLv3+) (AGPLv3)

does this matter to us since we are not including the actual code (and just using it).

cc @pandas-dev/pandas-core

@bashtage
Copy link
Contributor

INAL but from what I can tell simply calling code in another package without copying it doesn't require adopting the stricter license.

@shoyer
Copy link
Member

shoyer commented Apr 21, 2021

With regards to the licensing issue, the short answer is that nobody knows if importing a GPL licensed library creates a derivative work that also must be GPL licensed.

The Free Software Foundation (authors of GPL) claims that it does. This is a rather questionable interpretation of copyright law, so many ignore it (especially in the R ecosystem), but really at their own risk. Regardless of legal concerns, at the very least it's rude, because it goes against the desires of whoever wrote the original GPL licensed software. See this article for a nice summary: https://tech.popdata.org/the-gpl-license-and-linking-still-unclear-after-30-years/

So unfortunately I don't think pandas can accept this PR. The risk of pandas being possibly AGPL licensed would stop some companies from using it (and likely lead to a fork), due to concern about this exact same "viral" aspect of the (A)GPL requiring that all code that uses pandas also be open source.

(Note that AGPL and GPL are quite similar, except AGPL was designed to be even more restrictive, requiring even software used over a network to make its code available. At Google, for example, we are allowed to use GPL licensed code in many cases but are strictly prohibited from using or even running AGPL code.)

@toobaz
Copy link
Member

toobaz commented Apr 21, 2021

See this article for a nice summary: https://tech.popdata.org/the-gpl-license-and-linking-still-unclear-after-30-years/

That article covers the "most provocatively borderline case" of mixing BSD and GPL case: one in which you are writing a GPL thin library strongly embedded with an existing GPLed software precisely to circumvent the virality of the GPL while using a GPL covered library. In practice, the use pandas would make of the pyreader library would be intrinsically not very "intimate", to use the vocabulary of the FSF.

This said, if, as I suspect, the code in the PR could be released in a separate (GPLed) software (one that would import pandas, rather than the opposite) without much effort, then it is probably worth it, just to avoid any legal risk whatsoever.

@bashtage
Copy link
Contributor

bashtage commented Apr 21, 2021

I agree that even the possibility of the worries that @shoyer wrote about is enough of a case to impose an embargo the import of (A)GPL libraries (anything similar other than LGPL).

It is probably even more convoluted in the case of Python since there is linking, only very loosely coupled components that duck type each other.

@ParfaitG
Copy link
Contributor Author

Thanks, all. Interestingly, pyreadr uses pandas as dependency even imports numpy (see its PyPi metadata). Not sure of the implications of that re licensing. I have been exploring alternative routes to directly interface with the C library, librdata, which uses an MIT license. But we would need a different cython interface than pyreadr runs. As I explore options, I can table this PR for now.

@shoyer
Copy link
Member

shoyer commented Apr 21, 2021

Thanks, all. Interestingly, pyreadr uses pandas as dependency even imports numpy (see its PyPi metadata). Not sure of the implications of that re licensing.

(A)GPL libraries depending on BSD libraries like pandas/numpy is OK, just not the other way around.

@shoyer
Copy link
Member

shoyer commented Apr 21, 2021

Thanks, all. Interestingly, pyreadr uses pandas as dependency even imports numpy (see its PyPi metadata). Not sure of the implications of that re licensing. I have been exploring alternative routes to directly interface with the C library, librdata, which uses an MIT license. But we would need a different cython interface than pyreadr runs. As I explore options, I can table this PR for now.

You could also try convince the authors of pyreadr to relicense from AGPL to LGPL, in which case we could accept it as a dependency. But that's really up to them.

Thanks for putting in the effort with this PR!

@jreback
Copy link
Contributor

jreback commented Apr 22, 2021

ok, unfortunately I think we need to put this PR on hold @ParfaitG thanks again for all of the work on this :->

please ping if the original author changes the license or can find another way to read the data.

@jreback jreback closed this Apr 22, 2021
@jreback jreback added this to the No action milestone Apr 22, 2021
@ParfaitG
Copy link
Contributor Author

ParfaitG commented May 2, 2021

@jreback - I may have a working solution without subprocess or pyreadr using an original Cython interface to the librdata library (its MIT license to be added under pandas' licenses). I got the C module to compile in Linux and Windows and cythonize in pandas' setup.py.

@bashtage
Copy link
Contributor

bashtage commented May 5, 2021

@ParfaitG if you get the interface working and want to have it considered, you should open a new PR.

@ParfaitG ParfaitG deleted the rdata_io branch May 5, 2021 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: IO support for R data files with pandas.read_rdata and DataFrame.to_rdata
6 participants