ENH: IO support for R data files with pandas.read_rdata and DataFrame.to_rdata #40884
Closed
13 commits
d1d3e4f ENH: Add IO support for R data files with pandas.read_rdata and DataF… (ParfaitG)
de848dd Fix rebase issues in whatsnew and type style in frame.py (ParfaitG)
3379fa1 Fix skipif logic for test params, move package checks, add to test_api (ParfaitG)
966cb78 Refactor from built-in filter, add encoding to subprocess and locale … (ParfaitG)
22c7ade Fix tests for OS newline and mypy, mark xfail, use default mode in io… (ParfaitG)
8b1aa9c Added needed test skips and fixed io docs ref in whatsnew (ParfaitG)
41f817f Merge remote-tracking branch 'upstream/master' into rdata_io (ParfaitG)
2341dff Remove rscript implementation from code, tests, and docs (ParfaitG)
1f8f033 Merge remote-tracking branch 'upstream/master' into rdata_io (ParfaitG)
a5983e0 Fix duplicate entry in ci dep yaml (ParfaitG)
e78bf6e Refactor to handle binary content, add datetime notes in docs (ParfaitG)
1475281 Merge remote-tracking branch 'upstream/master' into rdata_io (ParfaitG)
7e0c152 Merge remote-tracking branch 'upstream/master' into rdata_io (ParfaitG)
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,4 +25,5 @@ dependencies: | |
- flask | ||
- tabulate | ||
- pyreadstat | ||
- pyreadr | ||
- pip |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,4 +33,5 @@ dependencies: | |
- pip: | ||
- cython>=0.29.21 | ||
- pyreadstat | ||
- pyreadr | ||
- pyxlsb |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -37,6 +37,7 @@ dependencies: | |
- xlsxwriter | ||
- xlwt | ||
- pyreadstat | ||
- pyreadr | ||
- pip | ||
- pip: | ||
- pyxlsb |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,6 +31,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like | |
binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>` | ||
binary;`ORC Format <https://orc.apache.org/>`__;:ref:`read_orc<io.orc>`; | ||
binary;`Msgpack <https://msgpack.org/>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>` | ||
binary;`R <https://www.r-project.org/>`__;:ref:`read_rdata<io.rdata_reader>`;:ref:`to_rdata<io.rdata_writer>` | ||
binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>` | ||
binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`; | ||
binary;`SPSS <https://en.wikipedia.org/wiki/SPSS>`__;:ref:`read_spss<io.spss_reader>`; | ||
|
@@ -5903,6 +5904,289 @@ respective functions from ``pandas-gbq``. | |
|
||
Full documentation can be found `here <https://pandas-gbq.readthedocs.io/>`__. | ||
|
||
|
||
.. _io.rdata: | ||
|
||
R data format | ||
------------- | ||
|
||
.. _io.rdata_reader: | ||
|
||
Reading R data | ||
'''''''''''''' | ||
|
||
.. versionadded:: 1.3.0 | ||
|
||
The top-level function ``read_rdata`` reads the native serialization types | ||
of the R language and environment. For .RData and its synonymous shorthand, .rda, | ||
which can hold multiple R objects, the method returns a ``dict`` of ``DataFrames``. | ||
For .rds files, which contain only a single R object, the method returns a single | ||
``DataFrame``. | ||
Review comment: is there a reference that you can point to for this format (e.g. docs)? |
||
|
||
.. note:: | ||
|
||
Since any R object can be saved in these types, this method will only return | ||
data.frame objects or objects coercible to data.frames including matrices, | ||
tibbles, and data.tables and to some extent, arrays. | ||
|
||
For example, consider the following data.frames generated in R from environmental | ||
data samples drawn from US EPA, UK BGCI, and NOAA public data: | ||
|
||
.. code-block:: r | ||
|
||
ghg_df <- data.frame( | ||
gas = c("Carbon dioxide", "Methane", "Nitrous oxide", | ||
"Fluorinated gases", "Total"), | ||
year = c(2018, 2018, 2018, 2018, 2018), | ||
emissions = c(5424.88150213288, 634.457127078267, 434.528555376666, | ||
182.782432461777, 6676.64961704959), | ||
row.names = c(141:145), | ||
stringsAsFactors = FALSE | ||
) | ||
|
||
saveRDS(ghg_df, file="ghg_df.rds") | ||
|
||
plants_df <- data.frame( | ||
plant_group = c("Pteridophytes", "Pteridophytes", "Pteridophytes", | ||
"Pteridophytes", "Pteridophytes"), | ||
status = c("Data Deficient", "Extinct", "Not Threatened", | ||
"Possibly Threatened", "Threatened"), | ||
count = c(398, 65, 1294, 408, 1275), | ||
row.names = c(16:20), | ||
stringsAsFactors = FALSE | ||
) | ||
|
||
saveRDS(plants_df, file="plants_df.rds") | ||
|
||
sea_ice_df <- data.frame( | ||
year = c(2016, 2017, 2018, 2019, 2020), | ||
mo = c(12, 12, 12, 12, 12), | ||
data.type = c("Goddard", "Goddard", "Goddard", "Goddard", "NRTSI-G"), | ||
region = c("S", "S", "S", "S", "S"), | ||
extent = c(8.28, 9.48, 9.19, 9.41, 10.44), | ||
area = c(5.51, 6.23, 5.59, 6.59, 6.5), | ||
row.names = c(1012:1016), | ||
stringsAsFactors = FALSE | ||
) | ||
|
||
saveRDS(sea_ice_df, file="sea_ice_df.rds") | ||
|
||
save(ghg_df, plants_df, sea_ice_df, file="env_data_dfs.rda") | ||
|
||
With ``read_rdata``, you can read the .rds and .rda files above: | ||
|
||
.. ipython:: python | ||
:suppress: | ||
|
||
rel_path = os.path.join("..", "pandas", "tests", "io", "data", "rdata") | ||
file_path = os.path.abspath(rel_path) | ||
|
||
.. ipython:: python | ||
|
||
rds_file = os.path.join(file_path, "ghg_df.rds") | ||
ghg_df = pd.read_rdata(rds_file).tail() | ||
ghg_df | ||
|
||
rda_file = os.path.join(file_path, "env_data_dfs.rda") | ||
env_dfs = pd.read_rdata(rda_file) | ||
{k: df.tail() for k, df in env_dfs.items()} | ||
|
||
To ignore the row names of the data.frame, pass ``rownames=False``: | ||
|
||
.. ipython:: python | ||
|
||
rds_file = os.path.join(file_path, "plants_df.rds") | ||
plants_df = pd.read_rdata(rds_file, rownames=False).tail() | ||
plants_df | ||
|
||
|
||
To select specific objects in .rda, pass a list of names into ``select_frames``: | ||
|
||
.. ipython:: python | ||
|
||
rda_file = os.path.join(file_path, "env_data_dfs.rda") | ||
env_dfs = pd.read_rdata(rda_file, select_frames=["sea_ice_df"]) | ||
env_dfs | ||
|
||
To read from a file-like object, pass its binary content to ``path_or_buffer`` together with ``file_format``: | ||
|
||
.. ipython:: python | ||
|
||
rds_file = os.path.join(file_path, "plants_df.rds") | ||
with open(rds_file, "rb") as f: | ||
plants_df = pd.read_rdata(f.read(), file_format="rds") | ||
|
||
plants_df | ||
|
||
To read from a URL, pass the link directly to the method: | ||
|
||
.. ipython:: python | ||
|
||
url = ("https://github.com/hadley/nycflights13/" | ||
"blob/master/data/airlines.rda?raw=true") | ||
|
||
airlines = pd.read_rdata(url, file_format="rda") | ||
airlines | ||
|
||
To read from an Amazon S3 bucket, point to the storage path. Note that R data | ||
encoded in anything other than UTF-8 is currently not supported: | ||
|
||
.. code-block:: ipython | ||
|
||
In [608]: ghcran = pd.read_rdata("s3://public-r-data/ghcran.Rdata") | ||
... | ||
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 45: invalid continuation byte | ||
|
||
Also, if an R data file does not contain any data frame object, a parsing error | ||
occurs: | ||
|
||
.. code-block:: ipython | ||
|
||
In [608]: rds_file = os.path.join(file_path, "env_data_non_dfs.rda") | ||
... | ||
LibrdataError: Invalid file, or file has unsupported features | ||
|
||
|
||
Note that R's ``Date`` (without a time component) translates to ``object`` dtype | ||
in pandas, and R's date/time type, ``POSIXct``, translates to UTC time | ||
in pandas. | ||
|
||
.. ipython:: python | ||
|
||
ppm_df = pd.read_rdata(os.path.join(file_path, "ppm_df.rds")) | ||
ppm_df.head() | ||
ppm_df.tail() | ||
ppm_df.dtypes | ||
|
||
.. _io.rdata_writer: | ||
|
||
Writing R data | ||
'''''''''''''' | ||
|
||
.. versionadded:: 1.3.0 | ||
|
||
The method :func:`~pandas.core.frame.DataFrame.to_rdata` will write a DataFrame | ||
into R data files (.RData, .rda, and .rds). | ||
|
||
To write a single DataFrame as .rds, pass a file path or buffer to the method: | ||
|
||
.. ipython:: python | ||
|
||
plants_df.to_rdata("plants_df.rds") | ||
|
||
To write a single DataFrame as .RData or .rda, pass a file path or buffer to the method | ||
and optionally give the object a name with ``rda_name``: | ||
|
||
.. ipython:: python | ||
|
||
ghg_df.to_rdata("ghg_df.rda", rda_name="ghg_df") | ||
|
||
While RData and rda types can hold multiple R objects, this method currently | ||
only supports writing out a single DataFrame. | ||
|
||
You can also write to a buffer and read its content back: | ||
|
||
.. ipython:: python | ||
|
||
with BytesIO() as b_io: | ||
env_dfs["sea_ice_df"].to_rdata(b_io, file_format="rda", index=False) | ||
print( | ||
pd.read_rdata( | ||
b_io.getvalue(), | ||
file_format="rda", | ||
rownames=False, | ||
)["pandas_dataframe"].tail() | ||
) | ||
|
||
The DataFrame index does not map to R rownames. By default, ``index=True`` | ||
writes the index as a named column, or as multiple columns for a MultiIndex. | ||
|
||
.. ipython:: python | ||
|
||
ghg_df.rename_axis(None).to_rdata("ghg_df.rds") | ||
|
||
pd.read_rdata("ghg_df.rds").tail() | ||
|
||
To ignore the index, use ``index=False``: | ||
|
||
.. ipython:: python | ||
|
||
ghg_df.rename_axis(None).to_rdata("ghg_df.rds", index=False) | ||
|
||
pd.read_rdata("ghg_df.rds").tail() | ||
|
||
By default, these R serialized types are compressed with the gzip, bzip2, | ||
or xz algorithm. As in R, the default in this method is "gzip" or | ||
"gz". Notice the difference in size between compressed and uncompressed files: | ||
|
||
.. ipython:: python | ||
|
||
plants_df.to_rdata("plants_df_gz.rds") | ||
plants_df.to_rdata("plants_df_bz2.rds", compression="bz2") | ||
plants_df.to_rdata("plants_df_xz.rds", compression="xz") | ||
plants_df.to_rdata("plants_df_non_comp.rds", compression=None) | ||
|
||
os.stat("plants_df_gz.rds").st_size | ||
os.stat("plants_df_bz2.rds").st_size | ||
os.stat("plants_df_xz.rds").st_size | ||
os.stat("plants_df_non_comp.rds").st_size | ||
|
||
Like other IO methods, ``storage_options`` is supported for writing to remote storage platforms: | ||
|
||
.. code-block:: ipython | ||
|
||
ghg_df.to_rdata( | ||
"s3://path/to/my/storage/pandas_df.rda", | ||
storage_options={"user": "xxx", "password": "???"} | ||
) | ||
|
||
.. ipython:: python | ||
:suppress: | ||
|
||
os.remove("ghg_df.rds") | ||
os.remove("ghg_df.rda") | ||
os.remove("plants_df.rds") | ||
os.remove("plants_df_gz.rds") | ||
os.remove("plants_df_bz2.rds") | ||
os.remove("plants_df_xz.rds") | ||
os.remove("plants_df_non_comp.rds") | ||
|
||
Once exported, a DataFrame written as .rds can be read back in R with ``readRDS``, | ||
and one written as .rda can be loaded in R with ``load``: | ||
|
||
.. code-block:: r | ||
|
||
plants_df <- readRDS("plants_df.rds") | ||
plants_df | ||
plant_group status count | ||
16 Pteridophytes Data Deficient 398 | ||
17 Pteridophytes Extinct 65 | ||
18 Pteridophytes Not Threatened 1294 | ||
19 Pteridophytes Possibly Threatened 408 | ||
20 Pteridophytes Threatened 1275 | ||
|
||
load("ghg_df.rda") | ||
|
||
mget(list=ls()) | ||
$ghg_df | ||
gas year emissions | ||
141 Carbon dioxide 2018 5424.8815 | ||
142 Methane 2018 634.4571 | ||
143 Nitrous oxide 2018 434.5286 | ||
144 Fluorinated gases 2018 182.7824 | ||
145 Total 2018 6676.6496 | ||
|
||
For more information on the underlying ``pyreadr`` package, including notes on | ||
support and limitations, see the main page of `pyreadr`_. For more information on R | ||
serialization data types, see the docs on `rds`_ and `rda`_ data files. | ||
|
||
.. _pyreadr: https://github.com/ofajardo/pyreadr | ||
|
||
.. _rds: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS | ||
|
||
.. _rda: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/save | ||
|
||
|
||
.. _io.stata: | ||
|
||
Stata format | ||
|
@@ -5958,6 +6242,7 @@ outside of this range, the variable is cast to ``int16``. | |
115 dta file format. Attempting to write *Stata* dta files with strings | ||
longer than 244 characters raises a ``ValueError``. | ||
|
||
|
||
.. _io.stata_reader: | ||
|
||
Reading from Stata format | ||
|
@ParfaitG I didn't follow the entire thread, but we do not want to add these deps generally. IIRC you had a much simpler way (just linking in the C code to read the format). That would be much preferable.
This would be OK. We do not want to add R as a dep, even for testing.
Due to maintenance and licensing, the consensus was to use `pyreadr` as a soft dep like `pyreadstat` for `read_spss`. Understood about not adding R as a dep. For the rscript engine, I am suggesting we use R via a `subprocess` call, similar to the backends in `io.clipboard`. (This can serve as a use case for the Python/R arrow project.) But this PR is set up to easily remove either engine (i.e., separate classes and tests).

However, unless I am mistaken, the CI tests do have R installed. I am getting results and fixing failures in `test_rscript.py` on the Linux/Windows/Mac builds, which check for `Rscript` (nothing yet for `test_pyreadr.py`).
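A minimal sketch of the soft-dependency pattern mentioned above, in which the optional package is imported only when the reader is called. The helper `import_soft_dependency` and the function `read_rdata_sketch` are hypothetical names for illustration and are not the pandas internals; `pyreadr.read_r` is the reader provided by the `pyreadr` package.

```python
import importlib
from types import ModuleType


def import_soft_dependency(name: str, feature: str) -> ModuleType:
    """Import an optional package only when a feature actually needs it.

    Hypothetical helper for illustration, not the pandas internal API.
    """
    try:
        return importlib.import_module(name)
    except ImportError as err:
        # Re-raise with an actionable message instead of a bare failure.
        raise ImportError(
            f"Missing optional dependency '{name}', required by {feature}. "
            "Install it with pip or conda."
        ) from err


def read_rdata_sketch(path: str):
    """Hypothetical reader: the import happens at call time, not at module load."""
    pyreadr = import_soft_dependency("pyreadr", "read_rdata")
    # pyreadr.read_r returns a dict-like mapping of object names to DataFrames.
    return pyreadr.read_r(path)
```

With this layout, importing the module that defines the reader never fails when `pyreadr` is absent; only a call to the reader raises, and the error names the missing package.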
yes this is ok
I really don't want to complicate our CI any more, so I don't want this. Instead, use simple frames for the expected return values.
Got it. Will adjust code and docs to exclusively run `pyreadr`. Should I add `pyreadr` entries to the three yaml files in pandas/tree/master/ci/deps where `pyreadstat` is also included? Also, because .RData and .rda can potentially have more than one named data frame (unlike .rds), we may have to return a `dict` of DataFrames like `pyreadr` does.
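One way the format-dependent return shape raised above could be expressed is sketched below. Both function names, the extension mapping, and the case-insensitive matching are assumptions made for illustration; only the `file_format` values `"rds"` and `"rda"` and the file extensions come from the docs in this PR.

```python
import os
from typing import Any, Dict, Mapping, Optional, TypeVar, Union

T = TypeVar("T")

# Extensions named in the docs; the mapping itself is illustrative.
_EXT_TO_FORMAT = {".rdata": "rda", ".rda": "rda", ".rds": "rds"}


def infer_file_format(path: str, file_format: Optional[str] = None) -> str:
    """Return "rda" or "rds", preferring an explicit ``file_format``."""
    if file_format is not None:
        if file_format not in ("rda", "rds"):
            raise ValueError(f"Unsupported file_format: {file_format!r}")
        return file_format
    # Assumption: extensions are matched case-insensitively.
    ext = os.path.splitext(path)[1].lower()
    if ext not in _EXT_TO_FORMAT:
        raise ValueError(f"Cannot infer file format from {path!r}")
    return _EXT_TO_FORMAT[ext]


def shape_result(objects: Mapping[Any, T], file_format: str) -> Union[T, Dict[Any, T]]:
    """Return the single object for rds, or a dict of named objects for rda."""
    if file_format == "rds":
        if len(objects) != 1:
            raise ValueError("An rds file is expected to hold exactly one object")
        return next(iter(objects.values()))
    return dict(objects)
```

Keeping the shape decision in one small function makes the two return types easy to document in the signature, which is the concern raised in the thread.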
Yes, you want to add it by default so it's used. But one build should also not have it, so that if it's not installed it skips properly.

That would be OK; we do this in `read_html`. Just clearly document it and make the signature reflect this.
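The "skips properly" behaviour can be sketched with the standard library alone, as below. This is an illustration under stated assumptions, not the test code from this PR: the class and test names are hypothetical, and the round trip uses `pyreadr.write_rds` and `pyreadr.read_r` directly with a small hand-built frame as the expected value.

```python
import importlib.util
import unittest

# True only when the optional package can be imported in this environment.
HAS_PYREADR = importlib.util.find_spec("pyreadr") is not None


@unittest.skipUnless(HAS_PYREADR, "pyreadr not installed")
class TestReadRdataSketch(unittest.TestCase):
    def test_rds_round_trip(self):
        # Imports live inside the test so the module loads without pyreadr.
        import os
        import tempfile

        import pandas as pd
        import pyreadr

        # A simple hand-built frame serves as the expected value.
        expected = pd.DataFrame(
            {"gas": ["Carbon dioxide", "Methane"], "year": [2018, 2018]}
        )
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "ghg_df.rds")
            pyreadr.write_rds(path, expected)
            # For rds files, pyreadr stores the single object under the key None.
            result = pyreadr.read_r(path)[None]

        self.assertEqual(list(result.columns), list(expected.columns))
        self.assertEqual(result["gas"].tolist(), expected["gas"].tolist())
```

On a build without `pyreadr`, the whole class is reported as skipped rather than failed; on a build with it, the round trip runs. The tests can be run with `python -m unittest` against the containing module.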