ENH: IO support for R data files with pandas.read_rdata and DataFrame.to_rdata #40884
Conversation
@@ -360,6 +360,8 @@ zlib Compression for HDF5
fastparquet 0.4.0 Parquet reading / writing
pyarrow 0.15.0 Parquet, ORC, and feather reading / writing
pyreadstat SPSS files (.sav) reading
pyreadr R files (.RData, .rda, .rds) reading / writing
@ParfaitG I didn't follow the entire thread, but we do not want to add these deps generally. IIRC you had a much simpler way (to just link in the C code to read the format). That would be much more preferable.
This PR relies on a new dependency, pyreadr (available in conda), as the default engine option. Please advise how to add it to the builds for pytest.
This would be OK. We do not want to add R as a dep, even for testing.
Due to maintenance and licensing, the consensus was to use pyreadr as a soft dep like pyreadstat for read_spss. Understood about not adding R as a dep. For the rscript engine, I am suggesting we use R via a subprocess call, similar to the backends in io.clipboard. (Can serve as a use case for the Python/R arrow project.) But this PR is set up to easily remove either engine (i.e., separate classes and tests).

However, unless I am mistaken, the CI does have R installed. I am getting results and fixing failures in test_rscript.py on the Linux/Windows/Mac builds, which checks for Rscript (nothing yet for test_pyreadr.py).
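For reference, the proposed rscript engine could look roughly like this: shell out to Rscript (similar in spirit to the subprocess backends in io.clipboard) and have R dump the .rds contents to CSV for pandas to read. The helper name and the intermediate-CSV approach here are my assumptions, not the PR's actual code:

```python
import os
import shutil
import subprocess
import tempfile


def read_rds_via_rscript(path: str):
    """Hypothetical sketch: read an .rds file by delegating to Rscript."""
    import pandas as pd  # soft dep on pandas kept local for this sketch

    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        out_csv = os.path.join(tmp, "out.csv")
        # Have R deserialize the object and round-trip it through CSV.
        r_code = f'write.csv(readRDS("{path}"), "{out_csv}", row.names = FALSE)'
        subprocess.run(["Rscript", "-e", r_code], check=True)
        return pd.read_csv(out_csv)
```

This is exactly the pattern that makes CI harder, which is why the thread moves away from it.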
pyreadr as a soft dep like pyreadstat for read_spss.
Yes, this is OK.
For rscript engine, I am suggesting we use R via subprocess call similar to backends in io.clipboard. (Can serve as use case for Python/R arrow project). But this PR is set up to easily remove either engine (i.e., separate classes and tests).
I really don't want to complicate our CI any more, so I don't want this. Instead, use simple frames for the expected return values.
Got it. Will adjust code and docs to exclusively run pyreadr. Should I add pyreadr entries to the three yaml files in pandas/tree/master/ci/deps where pyreadstat is also included? Also, because .RData and .rda can potentially hold more than one named data frame (unlike .rds), we may have to return a dict of DataFrames like pyreadr does.
Got it. Will adjust code and docs to exclusively run pyreadr. Should I add pyreadr entries to the three yaml files in pandas/tree/master/ci/deps where pyreadstat is also included? Also, because .RData and .rda can potentially have more than one named data frame (unlike .rds), we may have to return a dict of DataFrames like pyreadr does.
Yes, you want to add it by default so it's used. But one build should not have it, so that the tests skip properly if it's not installed.
we may have to return a dict of DataFrames like pyreadr does.
That would be OK; we do this in read_html. Just clearly document it and make the signature reflect this.
@twoertwein - If you have a chance, please advise on io handling in this new rdata module. R data files are binary types. Handling is largely borrowed from the recent io xml module. But here, buffers need to be saved to disk for parsing and writing.
filepath_or_buffer = (
    handle_obj.handle.read()
    if hasattr(handle_obj.handle, "read")
    else handle_obj.handle
)
When is handle_obj.handle used, and does it return the content?
IIUC, all valid file/URL paths and buffers render in handle_obj.handle. While buffers return content with read(), stringified paths return unchanged. I have tests for both file and file-like types:
def test_read_rds_file(datapath):
    filename = datapath("io", "data", "rdata", "ghg_df.rds")
    r_df = read_rdata(filename)
    ...

def test_bytes_read_rda(datapath):
    filename = datapath("io", "data", "rdata", "env_data_dfs.rda")
    with open(filename, "rb") as f:
        r_dfs = read_rdata(f, file_format="rda")
    ...

def test_bytesio_rds(datapath):
    filename = datapath("io", "data", "rdata", "sea_ice_df.rds")
    with open(filename, "rb") as f:
        with BytesIO(f.read()) as b_io:
            r_df = read_rdata(b_io, file_format="rds")
    ...
R data files can also have an ASCII (i.e., text) version format (see docs), so I accommodated this as well with different modes within code.
with _preprocess_data(handle_data) as r_data:
    mode = "wb" if isinstance(r_data, io.BytesIO) else "w"
    with open(r_temp, mode) as f:
        f.write(r_data.read())
Actually, reading the docs closer, they indicate ASCII is still a binary file. So, I will adjust the above to handle only the binary mode, wb, and advise users in the docs to only use R data in rb mode for file-like objects.
Saved R objects are binary files, even those saved with ascii = TRUE, so ensure that they are transferred without conversion of end-of-line markers and of 8-bit characters. The lines are delimited by LF on all platforms.
with TemporaryDirectory() as tmp_dir:
    r_temp = os.path.join(tmp_dir, "rdata.rda")
Could that create problems when multiple users try to read/write R data (or the same user in multiple processes)?
I believe TemporaryDirectory runs on the user's local machine, and whichever user saves to the final path last will see those changes. According to the Python docs, TemporaryDirectory follows the same rules as tempfile.mkdtemp:

Creates a temporary directory in the most secure manner possible. There are no race conditions in the directory's creation. The directory is readable, writable, and searchable only by the creating user ID.

Consequently, no other user will have access to that directory and its files during processing. Also, paths are pretty unique, e.g., /tmp/tmpXKJXHZ on Unix, to indicate each call points to a different temp directory.
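The uniqueness claim is easy to check in a couple of lines:

```python
from tempfile import TemporaryDirectory

# Each call creates a fresh, uniquely named directory that is readable
# only by the creating user, so concurrent calls never share a path.
with TemporaryDirectory() as d1, TemporaryDirectory() as d2:
    paths_distinct = d1 != d2  # True: not a shared /tmp
```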
Okay, perfect! I thought it would just return /tmp.
Looks really good; a couple of questions / comments.
The top-level function ``read_rdata`` will read the native serialization types
in the R language and environment. For .RData and its synonymous shorthand, .rda,
which can hold multiple R objects, the method will return a ``dict`` of ``DataFrames``.
For .rds types that contain only a single R object, the method will return a single
is there a reference that you can point to for this format (e.g. docs)
select_frames: Optional[List[str]] = None,
rownames: bool = True,
storage_options: StorageOptions = None,
) -> Union[DataFrame, Dict[str, DataFrame]]:
hmm, shouldn't this always return a dict-of-frames? e.g. when does it and when does it not?
commands. Default 'infer' will use the extension in the file name
to determine the format type.

select_frames : list, default None
default is to return all?
Selected names of DataFrames to return from R rda and RData types that
can contain multiple objects.

rownames : bool, default True
hmm, maybe call this index=True?
Returns
-------
DataFrame or dict of DataFrames
    Depends on the R data type, where the rds format returns a single DataFrame and
can you clarify when this happens
See Also
--------
read_sas : Read SAS datasets into DataFrame.
I think you listed read_parquet / read_feather above (OK to add more, but I would add the same list in both places).
Notes
-----
Any R data file that contains a non-data.frame object may raise parsing errors.
can you add a link to the R references here
def test_read_rds_non_df(datapath):
    from pyreadr import custom_errors
can you move this import to the top
Does this matter to us, since we are not including the actual code (just using it)? cc @pandas-dev/pandas-core
IANAL, but from what I can tell, simply calling code in another package without copying it doesn't require adopting the stricter license.
With regards to the licensing issue, the short answer is that nobody knows if importing a GPL licensed library creates a derivative work that also must be GPL licensed. The Free Software Foundation (authors of the GPL) claims that it does. This is a rather questionable interpretation of copyright law, so many ignore it (especially in the R ecosystem), but really at their own risk. Regardless of legal concerns, at the very least it's rude, because it goes against the desires of whoever wrote the original GPL licensed software. See this article for a nice summary: https://tech.popdata.org/the-gpl-license-and-linking-still-unclear-after-30-years/

So unfortunately I don't think pandas can accept this PR. The risk of pandas being possibly AGPL licensed would stop some companies from using it (and likely lead to a fork), due to concern about this exact same "viral" aspect of the (A)GPL requiring that all code that uses pandas also be open source.

(Note that AGPL and GPL are quite similar, except AGPL was designed to be even more restrictive, requiring even software used over a network to make its code available. At Google, for example, we are allowed to use GPL licensed code in many cases but are strictly prohibited from using or even running AGPL code.)
That article covers the "most provocatively borderline case" of mixing BSD and GPL code: one in which you are writing a thin GPL library strongly embedded with existing GPLed software precisely to circumvent the virality of the GPL while using a GPL-covered library. In practice, the use pandas would make of the […]

This said, if, as I suspect, the code in the PR could be released in a separate (GPLed) software (one that would import pandas, rather than the opposite) without much effort, then it is probably worth it, just to avoid any legal risk whatsoever.
I agree that even the possibility of the worries that @shoyer wrote about is enough of a case to impose an embargo on the import of (A)GPL libraries (anything similar other than LGPL). It is probably even more convoluted in the case of Python, since there is no linking, only very loosely coupled components that duck type each other.
Thanks, all. Interestingly, […]
(A)GPL libraries depending on BSD libraries like pandas/numpy is OK, just not the other way around.

You could also try to convince the authors of pyreadr to relicense from AGPL to LGPL, in which case we could accept it as a dependency. But that's really up to them. Thanks for putting in the effort with this PR!
OK, unfortunately I think we need to put this PR on hold. @ParfaitG, thanks again for all of the work on this :-> Please ping if the original author changes the license or can find another way to read the data.
@jreback - I may have a working solution without subprocess or pyreadr, using an original Cython interface to the […]

@ParfaitG if you get the interface working and want to have it considered, you should open a new PR.
pandas.read_rdata and DataFrame.to_rdata #40287
Note: This PR relies on a new dependency, pyreadr.