ENH: IO support for R data files with `pandas.read_rdata` and `DataFrame.to_rdata` #40884

ParfaitG · 2021-04-11T19:42:33Z

closes ENH: IO support for R data files with pandas.read_rdata and DataFrame.to_rdata #40287
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry
install.rst entry
io.rst entry

Note: This PR relies on a new dependency, pyreadr.

jreback · 2021-04-12T13:22:21Z

doc/source/getting_started/install.rst

@@ -360,6 +360,8 @@ zlib                                         Compression for HDF5
 fastparquet               0.4.0              Parquet reading / writing
 pyarrow                   0.15.0             Parquet, ORC, and feather reading / writing
 pyreadstat                                   SPSS files (.sav) reading
+pyreadr                                      R files (.RData, .rda, .rds) reading / writing


@ParfaitG I didn't follow the entire thread, but we do not want to add these deps generally. IIRC you had a much simpler way (to just link in the C code to read the format). That would be much more preferable.

This PR relies on a new dependency, pyreadr, (available in conda) for default engine option. Please advise how to add to builds for pytests.

this would be ok. we do not want to add R as a dep, even for testing.

Due to maintenance and licensing, the consensus was to use pyreadr as a soft dep like pyreadstat for read_spss. Understood about not adding R as dep. For rscript engine, I am suggesting we use R via subprocess call similar to backends in io.clipboard. (Can serve as use case for Python/R arrow project). But this PR is set up to easily remove either engine (i.e., separate classes and tests).

However, unless I am mistaken, the CI tests do have R installed. I am getting results and fixing failures in test_rscript.py, which checks for Rscript, on the Linux/Windows/Mac builds (nothing yet for test_pyreadr.py).

pyreadr as a soft dep like pyreadstat for read_spss.

yes this is ok

For rscript engine, I am suggesting we use R via subprocess call similar to backends in io.clipboard. (Can serve as use case for Python/R arrow project). But this PR is set up to easily remove either engine (i.e., separate classes and tests).

I really don't want to complicate our CI any more, so I don't want this. Instead, use simple frames for the expected return values.

Got it. Will adjust code and docs to exclusively run pyreadr. Should I add pyreadr entries to the three yaml files in pandas/tree/master/ci/deps where pyreadstat is also included? Also, because .RData and .rda can potentially have more than one named data frame (unlike .rds), we may have to return a dict of DataFrames like pyreadr does.

Got it. Will adjust code and docs to exclusively run pyreadr. Should I add pyreadr entries to the three yaml files in pandas/tree/master/ci/deps where pyreadstat is also included? Also, because .RData and .rda can potentially have more than one named data frame (unlike .rds), we may have to return a dict of DataFrames like pyreadr does.

yes, you want to add it by default so it's used. but one build should also not have it, so that if it's not installed it skips properly.

we may have to return a dict of DataFrames like pyreadr does.

that would be ok, we do this in read_html, just clearly document & make the signature reflect this.
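The "skips properly when not installed" requirement for a soft dependency can be pictured with a small standard-library sketch. This is not the pandas implementation: `has_optional_dependency` and `require_optional_dependency` are hypothetical helper names, and only the package name pyreadr comes from this thread.

```python
import importlib
import importlib.util


def has_optional_dependency(name: str) -> bool:
    """Return True if a top-level module called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


def require_optional_dependency(name: str, feature: str):
    """Import `name` lazily, raising a readable error if it is missing."""
    if not has_optional_dependency(name):
        raise ImportError(
            f"{feature} requires the optional dependency '{name}'."
        )
    return importlib.import_module(name)


# A test module could branch on the check instead of failing at import time.
PYREADR_AVAILABLE = has_optional_dependency("pyreadr")
```

With a check like this, the build without pyreadr would skip the R data tests, while every other build would exercise them.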

ParfaitG · 2021-04-14T12:29:53Z

@twoertwein - If you have a chance, please advise on io handling in this new rdata module. R data files are binary types. Handling is largely borrowed from the recent io xml. But here, buffers need to be saved to disk for parsing and writing.

twoertwein · 2021-04-14T13:00:25Z

pandas/io/rdata.py

+            filepath_or_buffer = (
+                handle_obj.handle.read()
+                if hasattr(handle_obj.handle, "read")
+                else handle_obj.handle


When is handle_obj.handle used and does it return the content?

IIUC, all valid file/URL paths and buffers render in handle_obj.handle. While buffers return with read(), stringified paths return unchanged. I have tests for both file and file-like types:

def test_read_rds_file(datapath):
    filename = datapath("io", "data", "rdata", "ghg_df.rds")
    r_df = read_rdata(filename)
    ...

def test_bytes_read_rda(datapath):
    filename = datapath("io", "data", "rdata", "env_data_dfs.rda")
    with open(filename, "rb") as f:
        r_dfs = read_rdata(f, file_format="rda")
    ...

def test_bytesio_rds(datapath):
    filename = datapath("io", "data", "rdata", "sea_ice_df.rds")
    with open(filename, "rb") as f:
        with BytesIO(f.read()) as b_io:
            r_df = read_rdata(b_io, file_format="rds")
    ...

R data files can also have an ASCII (i.e., text) version format (see docs), so I accommodated this as well with different modes within code.

with _preprocess_data(handle_data) as r_data:
    mode = "wb" if isinstance(r_data, io.BytesIO) else "w"
    with open(r_temp, mode) as f:
        f.write(r_data.read())

Actually, reading the docs more closely, they indicate that ASCII is still a binary file. So, I will adjust the above to handle only the binary mode, wb, here, and advise users in the docs to only open R data in rb mode for file-like objects.

Saved R objects are binary files, even those saved with ascii = TRUE, so ensure that they are transferred without conversion of end-of-line markers and of 8-bit characters. The lines are delimited by LF on all platforms.
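The binary-only handling described above could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: `save_to_temp_file` is a hypothetical helper, and the only details taken from the thread are the `hasattr(..., "read")` check, the `rdata.rda` temp file name, and the `wb` mode.

```python
import io
import os
from tempfile import TemporaryDirectory


def save_to_temp_file(source, tmp_dir: str) -> str:
    """Return an on-disk path for `source`, a path string or a binary buffer."""
    if not hasattr(source, "read"):
        # Already a path: nothing needs to be copied.
        return os.fspath(source)

    content = source.read()
    if not isinstance(content, bytes):
        # Text-mode handles would risk end-of-line conversion.
        raise TypeError("R data buffers must be opened in binary mode ('rb').")

    r_temp = os.path.join(tmp_dir, "rdata.rda")
    with open(r_temp, "wb") as f:  # always binary, never text mode
        f.write(content)
    return r_temp


# Usage: a buffer is written to disk, while a path string passes through.
with TemporaryDirectory() as tmp_dir:
    buffer_path = save_to_temp_file(io.BytesIO(b"\x00\x01binary"), tmp_dir)
    plain_path = save_to_temp_file("some/file.rds", tmp_dir)
```

Rejecting non-bytes content early is one way to surface a text-mode handle before any parser sees the file.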

twoertwein · 2021-04-14T13:05:33Z

pandas/io/rdata.py

+        )
+
+        with TemporaryDirectory() as tmp_dir:
+            r_temp = os.path.join(tmp_dir, "rdata.rda")


Could that create problems when multiple users try to read/write R data (or the same user in multiple processes)?

I believe TemporaryDirectory runs on the user's local machine, and whichever user saves to the final path last will see those changes. According to the Python docs, TemporaryDirectory follows the same rules as tempfile.mkdtemp:

Creates a temporary directory in the most secure manner possible. There are no race conditions in the directory’s creation. The directory is readable, writable, and searchable only by the creating user ID.

Consequently, no other user will have access to that directory and its files during processing. Also, paths are unique per call, e.g., /tmp/tmpXKJXHZ on Unix, so each call points to a different temp directory.

okay, perfect! I thought it would just return /tmp.
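A short sketch of the point settled above: two temporary directories that are open at the same time get distinct paths, so a fixed file name such as `rdata.rda` inside each one cannot collide. The helper name `write_in_two_temp_dirs` is made up for illustration.

```python
import os
from tempfile import TemporaryDirectory


def write_in_two_temp_dirs(first: bytes, second: bytes):
    """Write two payloads to 'rdata.rda' in two simultaneously open temp dirs."""
    with TemporaryDirectory() as dir_a, TemporaryDirectory() as dir_b:
        path_a = os.path.join(dir_a, "rdata.rda")
        path_b = os.path.join(dir_b, "rdata.rda")

        with open(path_a, "wb") as f:
            f.write(first)
        with open(path_b, "wb") as f:
            f.write(second)

        # Read both back while the directories still exist.
        with open(path_a, "rb") as f:
            read_a = f.read()
        with open(path_b, "rb") as f:
            read_b = f.read()

    # Both directories are removed once the with-block exits.
    return dir_a, dir_b, read_a, read_b


dir_a, dir_b, read_a, read_b = write_in_two_temp_dirs(b"first", b"second")
```

Each payload is read back unchanged from its own directory, and neither directory survives the call.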

ParfaitG · 2021-04-16T03:35:30Z

jreback

looks really good a couple of questions / comments

jreback · 2021-04-20T23:35:12Z

doc/source/user_guide/io.rst

+The top-level function ``read_rdata`` will read the native serialization types
+in the R language and environment. For .RData and its synonymous shorthand, .rda,
+that can hold multiple R objects, method will return a ``dict`` of ``DataFrames``.
+For .rds types that only contains a single R object, method will return a single


is there a reference that you can point to for this format (e.g. docs)

jreback · 2021-04-20T23:37:03Z

pandas/io/rdata.py

+    select_frames: Optional[List[str]] = None,
+    rownames: bool = True,
+    storage_options: StorageOptions = None,
+) -> Union[DataFrame, Dict[str, DataFrame]]:


hmm, shouldn't this always return a dict-of-frames? e.g. when does it and when does it not?
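One way to picture the rule in the io.rst text (rda/RData gives a dict, rds gives a single frame) is the sketch below. It uses plain Python objects as stand-ins for DataFrames, and both helper names are hypothetical. Treating `select_frames=None` as "return everything" is an assumption that this thread does not confirm.

```python
import os
from typing import Dict, List, Optional, Union

Frame = object  # stand-in for pandas.DataFrame in this sketch


def infer_file_format(path: str) -> str:
    """Map a file extension to 'rda' or 'rds' (the 'infer' behaviour)."""
    ext = os.path.splitext(path)[1].lower()
    if ext in (".rda", ".rdata"):
        return "rda"
    if ext == ".rds":
        return "rds"
    raise ValueError(f"Cannot infer R data format from extension {ext!r}.")


def shape_result(
    frames: Dict[str, Frame],
    file_format: str,
    select_frames: Optional[List[str]] = None,
) -> Union[Frame, Dict[str, Frame]]:
    """Return one frame for rds, or a dict of named frames for rda."""
    if file_format == "rds":
        # An .rds file holds exactly one object, so unwrap it.
        (only_frame,) = frames.values()
        return only_frame
    if select_frames is None:
        # Assumption: no selection means every frame is returned.
        return dict(frames)
    return {name: frames[name] for name in select_frames}
```

Always returning a dict, as the question above suggests, would amount to dropping the `rds` branch so callers never have to check the return type.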

jreback · 2021-04-20T23:37:18Z

pandas/io/rdata.py

+        commands. Default 'infer' will use extension in file name to
+        to determine the format type.
+
+    select_frames : list, default None


default is to return all?

jreback · 2021-04-20T23:37:46Z

pandas/io/rdata.py

+        Selected names of DataFrames to return from R rda and RData types that
+        can contain multiple objects.
+
+    rownames : bool, default True


hmm, maybe call this index=True?

jreback · 2021-04-20T23:38:12Z

pandas/io/rdata.py

+    Returns
+    -------
+    DataFrame or dict of DataFrames
+        Depends on R data type where rds formats returns a single DataFrame and


can you clarify when this happens

jreback · 2021-04-20T23:38:44Z

pandas/io/rdata.py

+
+    See Also
+    --------
+    read_sas : Read SAS datasets into DataFrame.


i think you listed read_parquet / read_feather above (ok to add more, but I would add the same list in both places)

jreback · 2021-04-20T23:39:03Z

pandas/io/rdata.py

+
+    Notes
+    -----
+    Any R data file that contains a non-data.frame object may raise parsing errors.


can you add a link to the R references here

jreback · 2021-04-20T23:40:10Z

pandas/tests/io/test_rdata.py

+
+
+def test_read_rds_non_df(datapath):
+    from pyreadr import custom_errors


can you move this import to the top

jreback · 2021-04-20T23:43:49Z

pyreadr uses this license

License: GNU Affero General Public License v3 or later (AGPLv3+) (AGPLv3)

does this matter to us since we are not including the actual code (and just using it).

cc @pandas-dev/pandas-core

bashtage · 2021-04-20T23:51:14Z

IANAL, but from what I can tell, simply calling code in another package without copying it doesn't require adopting the stricter license.

shoyer · 2021-04-21T05:33:06Z

With regards to the licensing issue, the short answer is that nobody knows if importing a GPL licensed library creates a derivative work that also must be GPL licensed.

The Free Software Foundation (authors of GPL) claims that it does. This is a rather questionable interpretation of copyright law, so many ignore it (especially in the R ecosystem), but really at their own risk. Regardless of legal concerns, at the very least it's rude, because it goes against the desires of whoever wrote the original GPL licensed software. See this article for a nice summary: https://tech.popdata.org/the-gpl-license-and-linking-still-unclear-after-30-years/

So unfortunately I don't think pandas can accept this PR. The risk of pandas being possibly AGPL licensed would stop some companies from using it (and likely lead to a fork), due to concern about this exact same "viral" aspect of the (A)GPL requiring that all code that uses pandas also be open source.

(Note that AGPL and GPL are quite similar, except AGPL was designed to be even more restrictive, requiring even software used over a network to make its code available. At Google, for example, we are allowed to use GPL licensed code in many cases but are strictly prohibited from using or even running AGPL code.)

toobaz · 2021-04-21T16:16:55Z

See this article for a nice summary: https://tech.popdata.org/the-gpl-license-and-linking-still-unclear-after-30-years/

That article covers the "most provocatively borderline case" of mixing BSD and GPL: one in which you are writing a thin library strongly embedded with existing GPLed software precisely to circumvent the virality of the GPL while using a GPL-covered library. In practice, the use pandas would make of the pyreadr library would be intrinsically not very "intimate", to use the vocabulary of the FSF.

This said, if, as I suspect, the code in the PR could be released in a separate (GPLed) software (one that would import pandas, rather than the opposite) without much effort, then it is probably worth it, just to avoid any legal risk whatsoever.

bashtage · 2021-04-21T16:20:09Z

I agree that even the possibility of the worries that @shoyer wrote about is enough of a case to impose an embargo on the import of (A)GPL libraries (or anything similar, other than LGPL).

It is probably even more convoluted in the case of Python since there is no linking, only very loosely coupled components that duck type each other.

ParfaitG · 2021-04-21T17:01:50Z

Thanks, all. Interestingly, pyreadr uses pandas as dependency even imports numpy (see its PyPi metadata). Not sure of the implications of that re licensing. I have been exploring alternative routes to directly interface with the C library, librdata, which uses an MIT license. But we would need a different cython interface than pyreadr runs. As I explore options, I can table this PR for now.

shoyer · 2021-04-21T17:43:59Z

Thanks, all. Interestingly, pyreadr uses pandas as dependency even imports numpy (see its PyPi metadata). Not sure of the implications of that re licensing.

(A)GPL libraries depending on BSD libraries like pandas/numpy is OK, just not the other way around.

shoyer · 2021-04-21T17:46:45Z

Thanks, all. Interestingly, pyreadr uses pandas as dependency even imports numpy (see its PyPi metadata). Not sure of the implications of that re licensing. I have been exploring alternative routes to directly interface with the C library, librdata, which uses an MIT license. But we would need a different cython interface than pyreadr runs. As I explore options, I can table this PR for now.

You could also try to convince the authors of pyreadr to relicense from AGPL to LGPL, in which case we could accept it as a dependency. But that's really up to them.

Thanks for putting in the effort with this PR!

jreback · 2021-04-22T22:35:21Z

ok, unfortunately I think we need to put this PR on hold @ParfaitG thanks again for all of the work on this :->

please ping if the original author changes the license or can find another way to read the data.

ParfaitG · 2021-05-02T18:56:06Z

@jreback - I may have a working solution without subprocess or pyreadr using an original Cython interface to the librdata library (its MIT license to be added under pandas' licenses). I got the C module to compile in Linux and Windows and cythonize in pandas' setup.py.

bashtage · 2021-05-05T09:16:32Z

@ParfaitG if you get the interface working and want to have it considered, you should open a new PR.

ParfaitG added 2 commits April 11, 2021 12:53

ENH: Add IO support for R data files with pandas.read_rdata and DataF…

d1d3e4f

…rame.to_rdata

Fix rebase issues in whatsnew and type style in frame.py

de848dd

ParfaitG changed the title ~~Rdata io~~ ENH: IO support for R data files with pandas.read_rdata and DataFrame.to_rdata Apr 11, 2021

ParfaitG added 5 commits April 11, 2021 16:17

Fix skipif logic for test params, move package checks, add to test_api

3379fa1

Refactor from built-in filter, add encoding to subprocess and locale …

966cb78

…skip

Fix tests for OS newline and mypy, mark xfail, use default mode in io…

22c7ade

….rst

Added needed test skips and fixed io docs ref in whatsnew

8b1aa9c

Merge remote-tracking branch 'upstream/master' into rdata_io

41f817f

jreback requested changes Apr 12, 2021

jreback added the IO Data label (IO issues that don't fit into a more specific label) Apr 12, 2021

ParfaitG added 3 commits April 13, 2021 22:57

Remove rscript implementation from code, tests, and docs

2341dff

Merge remote-tracking branch 'upstream/master' into rdata_io

1f8f033

Fix duplicate entry in ci dep yaml

a5983e0

twoertwein reviewed Apr 14, 2021

ParfaitG added 2 commits April 15, 2021 22:29

Refactor to handle binary content, add datetime notes in docs

e78bf6e

Merge remote-tracking branch 'upstream/master' into rdata_io

1475281

ParfaitG requested a review from jreback April 16, 2021 17:25

Merge remote-tracking branch 'upstream/master' into rdata_io

7e0c152

jreback requested changes Apr 20, 2021

jreback closed this Apr 22, 2021

jreback added this to the No action milestone Apr 22, 2021

ParfaitG deleted the rdata_io branch May 5, 2021 18:38

ParfaitG mentioned this pull request May 8, 2021

ENH: IO support for R data files with C extension #41386

Closed

5 tasks


