
ENH: IO support for R data files with pandas.read_rdata and DataFrame.to_rdata #40884


Closed
wants to merge 13 commits into from
1 change: 1 addition & 0 deletions ci/deps/actions-37.yaml
Expand Up @@ -25,4 +25,5 @@ dependencies:
- flask
- tabulate
- pyreadstat
- pyreadr
- pip
1 change: 1 addition & 0 deletions ci/deps/azure-macos-37.yaml
Expand Up @@ -33,4 +33,5 @@ dependencies:
- pip:
- cython>=0.29.21
- pyreadstat
- pyreadr
- pyxlsb
1 change: 1 addition & 0 deletions ci/deps/azure-windows-37.yaml
Expand Up @@ -37,6 +37,7 @@ dependencies:
- xlsxwriter
- xlwt
- pyreadstat
- pyreadr
- pip
- pip:
- pyxlsb
1 change: 1 addition & 0 deletions doc/source/getting_started/install.rst
Expand Up @@ -360,6 +360,7 @@ zlib Compression for HDF5
fastparquet 0.4.0 Parquet reading / writing
pyarrow 0.15.0 Parquet, ORC, and feather reading / writing
pyreadstat SPSS files (.sav) reading
pyreadr R files (.RData, .rda, .rds) reading / writing
Contributor

@ParfaitG I didn't follow the entire thread, but we do not want to add these deps generally. IIRC you had a much simpler way (to just link in the C code to read the format). That would be much more preferable.

Contributor

> This PR relies on a new dependency, pyreadr, (available in conda) for default engine option. Please advise how to add to builds for pytests.

this would be ok. we do not want to add r as a dep for even testing.

Contributor Author

Due to maintenance and licensing, the consensus was to use pyreadr as a soft dep like pyreadstat for read_spss. Understood about not adding R as dep. For rscript engine, I am suggesting we use R via subprocess call similar to backends in io.clipboard. (Can serve as use case for Python/R arrow project). But this PR is set up to easily remove either engine (i.e., separate classes and tests).

However, unless I am mistaken, the CI builds do have R installed. I am getting results and fixing failures in test_rscript.py, which checks for Rscript, on the Linux/Windows/Mac builds (nothing yet for test_pyreadr.py).

Contributor

> pyreadr as a soft dep like pyreadstat for read_spss.

yes this is ok

Contributor

> For rscript engine, I am suggesting we use R via subprocess call similar to backends in io.clipboard. (Can serve as use case for Python/R arrow project). But this PR is set up to easily remove either engine (i.e., separate classes and tests).

I really don't want to complicate our CI any more, so I don't want this. Instead, use simple frames for the expected return values.

Contributor Author

Got it. Will adjust code and docs to exclusively run pyreadr. Should I add pyreadr entries to the three yaml files in pandas/tree/master/ci/deps where pyreadstat is also included? Also, because .RData and .rda can potentially have more than one named data frame (unlike .rds), we may have to return a dict of DataFrames like pyreadr does.

Contributor

> Got it. Will adjust code and docs to exclusively run pyreadr. Should I add pyreadr entries to the three yaml files in pandas/tree/master/ci/deps where pyreadstat is also included? Also, because .RData and .rda can potentially have more than one named data frame (unlike .rds), we may have to return a dict of DataFrames like pyreadr does.

yes you want to add it by default so it's used. but also one build should not have it so that if it's not installed it skips properly.

> we may have to return a dict of DataFrames like pyreadr does.

that would be ok, we do this in read_html, just clearly document & make the signature reflect this.
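
The soft-dependency pattern agreed on in this thread (use pyreadr when it is installed, fail or skip cleanly when it is not) can be sketched with the standard library alone. The helper name `import_optional` is hypothetical and is not pandas' internal helper; this is only an illustration of the idea.

```python
import importlib
import importlib.util


def import_optional(name, purpose):
    """Import an optional module, or raise ImportError with a clear hint.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # find_spec returns None when a top-level module is not installed.
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"Missing optional dependency '{name}' required for {purpose}. "
            "Install it with pip or conda."
        )
    return importlib.import_module(name)


# A module that is always available imports normally.
json_mod = import_optional("json", "the demo")

# A missing module yields an actionable message instead of a bare failure.
try:
    import_optional("surely_not_installed_pkg", "reading R data files")
except ImportError as err:
    message = str(err)
```

In a test suite, the same kind of check lets the one build without the dependency skip the related tests rather than fail, for example with `pytest.importorskip`.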

========================= ================== =============================================================

Access data in the cloud
285 changes: 285 additions & 0 deletions doc/source/user_guide/io.rst
Expand Up @@ -31,6 +31,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
binary;`ORC Format <https://orc.apache.org/>`__;:ref:`read_orc<io.orc>`;
binary;`Msgpack <https://msgpack.org/>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
binary;`R <https://www.r-project.org/>`__;:ref:`read_rdata<io.rdata_reader>`;:ref:`to_rdata<io.rdata_writer>`
binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
binary;`SPSS <https://en.wikipedia.org/wiki/SPSS>`__;:ref:`read_spss<io.spss_reader>`;
Expand Down Expand Up @@ -5903,6 +5904,289 @@ respective functions from ``pandas-gbq``.

Full documentation can be found `here <https://pandas-gbq.readthedocs.io/>`__.


.. _io.rdata:

R data format
-------------

.. _io.rdata_reader:

Reading R data
''''''''''''''

.. versionadded:: 1.3.0

The top-level function ``read_rdata`` reads the native serialization types
of the R language and environment. For .RData and its synonymous shorthand, .rda,
which can hold multiple R objects, the function returns a ``dict`` of ``DataFrames``.
For .rds files, which contain only a single R object, the function returns a single
Contributor

is there a reference that you can point to for this format (e.g. docs)

``DataFrame``.
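
As a rough illustration of how the return type follows from the file type, the sketch below maps a file extension to the ``"rda"`` or ``"rds"`` values used for ``file_format`` in the examples in this section. The function name `infer_rdata_format` and the case-insensitive matching are assumptions made for this sketch, not the PR's implementation.

```python
import os

# Extensions named in the text. Matching is case-insensitive (an assumption)
# so that ".RData" and ".Rdata" are both treated as the rda format.
_RDA_EXTENSIONS = {".rdata", ".rda"}
_RDS_EXTENSIONS = {".rds"}


def infer_rdata_format(path):
    """Return "rda" or "rds" based on the file extension (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lower()
    if ext in _RDA_EXTENSIONS:
        return "rda"
    if ext in _RDS_EXTENSIONS:
        return "rds"
    raise ValueError(f"cannot infer R data format from extension {ext!r}")


def expected_result(path):
    """Describe what a reader would return for this file type."""
    if infer_rdata_format(path) == "rda":
        return "dict of DataFrames"
    return "single DataFrame"
```

Content read from a buffer or raw bytes has no extension to inspect, which is why the buffer examples in this section pass ``file_format`` explicitly.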

.. note::

Since any R object can be saved in these types, this function returns only
data.frame objects or objects coercible to data.frames, including matrices,
tibbles, data.tables and, to some extent, arrays.

For example, consider the following data.frames generated in R from environmental
data samples taken from US EPA, UK BGCI, and NOAA public data:

.. code-block:: r

ghg_df <- data.frame(
gas = c("Carbon dioxide", "Methane", "Nitrous oxide",
"Fluorinated gases", "Total"),
year = c(2018, 2018, 2018, 2018, 2018),
emissions = c(5424.88150213288, 634.457127078267, 434.528555376666,
182.782432461777, 6676.64961704959),
row.names = c(141:145),
stringsAsFactors = FALSE
)

saveRDS(ghg_df, file="ghg_df.rds")

plants_df <- data.frame(
plant_group = c("Pteridophytes", "Pteridophytes", "Pteridophytes",
"Pteridophytes", "Pteridophytes"),
status = c("Data Deficient", "Extinct", "Not Threatened",
"Possibly Threatened", "Threatened"),
count = c(398, 65, 1294, 408, 1275),
row.names = c(16:20),
stringsAsFactors = FALSE
)

saveRDS(plants_df, file="plants_df.rds")

sea_ice_df <- data.frame(
year = c(2016, 2017, 2018, 2019, 2020),
mo = c(12, 12, 12, 12, 12),
data.type = c("Goddard", "Goddard", "Goddard", "Goddard", "NRTSI-G"),
region = c("S", "S", "S", "S", "S"),
extent = c(8.28, 9.48, 9.19, 9.41, 10.44),
area = c(5.51, 6.23, 5.59, 6.59, 6.5),
row.names = c(1012:1016),
stringsAsFactors = FALSE
)

saveRDS(sea_ice_df, file="sea_ice_df.rds")

save(ghg_df, plants_df, sea_ice_df, file="env_data_dfs.rda")

With ``read_rdata``, you can read the .rds and .rda files above:

.. ipython:: python
:suppress:

rel_path = os.path.join("..", "pandas", "tests", "io", "data", "rdata")
file_path = os.path.abspath(rel_path)

.. ipython:: python

rds_file = os.path.join(file_path, "ghg_df.rds")
ghg_df = pd.read_rdata(rds_file).tail()
ghg_df

rda_file = os.path.join(file_path, "env_data_dfs.rda")
env_dfs = pd.read_rdata(rda_file)
{k: df.tail() for k, df in env_dfs.items()}

To ignore the row names of the data.frame, use the option ``rownames=False``:

.. ipython:: python

rds_file = os.path.join(file_path, "plants_df.rds")
plants_df = pd.read_rdata(rds_file, rownames=False).tail()
plants_df


To select specific objects in an .rda file, pass a list of names to ``select_frames``:

.. ipython:: python

rda_file = os.path.join(file_path, "env_data_dfs.rda")
env_dfs = pd.read_rdata(rda_file, select_frames=["sea_ice_df"])
env_dfs
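
Conceptually, selecting frames amounts to filtering the name-to-object mapping stored in the .rda file. The sketch below shows that idea on a plain ``dict``. The helper name `select_named` is hypothetical, and raising ``KeyError`` for unknown names is an assumption; the PR text does not say how missing names are handled.

```python
def select_named(frames, names=None):
    """Return only the requested entries of a name -> object mapping.

    Hypothetical helper; the handling of missing names is an assumption.
    """
    if names is None:
        # No selection requested: return a shallow copy of everything.
        return dict(frames)
    missing = [name for name in names if name not in frames]
    if missing:
        raise KeyError(f"objects not found: {missing}")
    # Preserve the order in which the caller listed the names.
    return {name: frames[name] for name in names}


# Stand-in values; in practice these would be DataFrames.
env = {"ghg_df": 1, "plants_df": 2, "sea_ice_df": 3}
picked = select_named(env, ["sea_ice_df"])
```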

To read from a file-like object, pass its content as the ``path_or_buffer``
argument together with ``file_format``:

.. ipython:: python

rds_file = os.path.join(file_path, "plants_df.rds")
with open(rds_file, "rb") as f:
    plants_df = pd.read_rdata(f.read(), file_format="rds")

plants_df

To read from a URL, pass the link directly to the function:

.. ipython:: python

url = ("https://github.com/hadley/nycflights13/"
"blob/master/data/airlines.rda?raw=true")

airlines = pd.read_rdata(url, file_format="rda")
airlines

To read from an Amazon S3 bucket, pass the storage path. Note that R data encoded
in anything other than UTF-8 is currently not supported, as this example shows:

.. code-block:: ipython

In [608]: ghcran = pd.read_rdata("s3://public-r-data/ghcran.Rdata")
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 45: invalid continuation byte
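
The error above is what Python raises when bytes in a single-byte encoding are decoded as UTF-8. A minimal reproduction with the standard library follows. The choice of Latin-1 is an assumption for illustration; the actual encoding of the file above is not stated.

```python
# "Québec" encoded as Latin-1: the "é" is the single byte 0xe9.
raw = b"Qu\xe9bec"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # In UTF-8, 0xe9 starts a three-byte sequence, but the next byte ("b")
    # is not a continuation byte, so decoding fails at that position.
    reason = err.reason
    position = err.start

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode("latin-1")
```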

Also, if an R data file does not contain any data frame object, a parsing error
occurs:

.. code-block:: ipython

In [609]: rda_file = os.path.join(file_path, "env_data_non_dfs.rda")

In [610]: pd.read_rdata(rda_file)
...
LibrdataError: Invalid file, or file has unsupported features


.. _io.rdata_writer:

Note that R's ``Date`` type (which has no time component) translates to the
``object`` dtype in pandas, and R's date/time type, ``POSIXct``, translates to
UTC time in pandas.

.. ipython:: python

ppm_df = pd.read_rdata(os.path.join(file_path, "ppm_df.rds"))
ppm_df.head()
ppm_df.tail()
ppm_df.dtypes
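
For background, R stores a ``Date`` as a number of days since 1970-01-01 and a ``POSIXct`` as a number of seconds since 1970-01-01 UTC. The sketch below converts such raw values with the standard library. The function names are hypothetical, and this illustrates the stored representation only, not how the PR or ``pyreadr`` performs the conversion.

```python
from datetime import date, datetime, timedelta, timezone

R_EPOCH = date(1970, 1, 1)


def r_date_to_python(days):
    """Convert an R Date value (days since 1970-01-01) to a date."""
    return R_EPOCH + timedelta(days=days)


def r_posixct_to_utc(seconds):
    """Convert an R POSIXct value (seconds since the epoch) to an aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


d = r_date_to_python(18262)        # 18262 days after the epoch
ts = r_posixct_to_utc(1577836800)  # 18262 * 86400 seconds after the epoch
```

Because a date without a time component has no direct counterpart in a UTC timestamp, keeping it as a plain Python object is one way to avoid inventing a time of day.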

Writing R data
''''''''''''''

.. versionadded:: 1.3.0

The method :func:`~pandas.core.frame.DataFrame.to_rdata` writes a DataFrame
to an R data file (.RData, .rda, or .rds).

To write a single DataFrame in the rds format, pass a file path or buffer to the method:

.. ipython:: python

plants_df.to_rdata("plants_df.rds")

To write a single DataFrame in the RData or rda format, pass a file path or buffer
to the method and optionally give the object a name:

.. ipython:: python

ghg_df.to_rdata("ghg_df.rda", rda_name="ghg_df")

While RData and rda types can hold multiple R objects, this method currently
only supports writing out a single DataFrame.

You can also write to a buffer and read back its content:

.. ipython:: python

with BytesIO() as b_io:
    env_dfs["sea_ice_df"].to_rdata(b_io, file_format="rda", index=False)
    print(
        pd.read_rdata(
            b_io.getvalue(),
            file_format="rda",
            rownames=False,
        )["pandas_dataframe"].tail()
    )

The DataFrame index does not map to R row names. Instead, with the default
``index=True``, the index is written as a named column, or as multiple columns
for a MultiIndex.

.. ipython:: python

ghg_df.rename_axis(None).to_rdata("ghg_df.rds")

pd.read_rdata("ghg_df.rds").tail()

To ignore the index, use ``index=False``:

.. ipython:: python

ghg_df.rename_axis(None).to_rdata("ghg_df.rds", index=False)

pd.read_rdata("ghg_df.rds").tail()

These R serialized types are compressed by default, using the gzip, bzip2, or xz
algorithm. As in R, the default compression in this method is "gzip" (or "gz").
Note the difference in size between the compressed and uncompressed files:

.. ipython:: python

plants_df.to_rdata("plants_df_gz.rds")
plants_df.to_rdata("plants_df_bz2.rds", compression="bz2")
plants_df.to_rdata("plants_df_xz.rds", compression="xz")
plants_df.to_rdata("plants_df_non_comp.rds", compression=None)

os.stat("plants_df_gz.rds").st_size
os.stat("plants_df_bz2.rds").st_size
os.stat("plants_df_xz.rds").st_size
os.stat("plants_df_non_comp.rds").st_size
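
The three algorithms are all available in the Python standard library, so the size comparison and the identification of a compressed stream can be sketched without any R tooling. The payload below is made-up repetitive text, not an R data file, and `detect_compression` is a hypothetical helper that checks the well-known leading magic bytes of each format.

```python
import bz2
import gzip
import lzma

# Made-up, highly repetitive payload so that every algorithm shrinks it.
payload = b"Pteridophytes,Threatened,1275\n" * 1000

compressed = {
    "gzip": gzip.compress(payload),
    "bz2": bz2.compress(payload),
    "xz": lzma.compress(payload),  # lzma defaults to the .xz container
}

# Leading magic bytes of each compressed format.
MAGIC = {"gzip": b"\x1f\x8b", "bz2": b"BZh", "xz": b"\xfd7zXZ\x00"}


def detect_compression(data):
    """Return the compression name for a byte string, or None if uncompressed."""
    for name, magic in MAGIC.items():
        if data.startswith(magic):
            return name
    return None


sizes = {name: len(blob) for name, blob in compressed.items()}
sizes["none"] = len(payload)
```

Which algorithm produces the smallest file depends on the data, so the sizes are best compared on the actual content rather than assumed.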

As with other IO methods, ``storage_options`` can be passed to write to remote storage:

.. code-block:: ipython

ghg_df.to_rdata(
    "s3://path/to/my/storage/pandas_df.rda",
    storage_options={"key": "xxx", "secret": "???"},
)

.. ipython:: python
:suppress:

os.remove("ghg_df.rds")
os.remove("ghg_df.rda")
os.remove("plants_df.rds")
os.remove("plants_df_gz.rds")
os.remove("plants_df_bz2.rds")
os.remove("plants_df_xz.rds")
os.remove("plants_df_non_comp.rds")

Once exported, a single DataFrame in an .rds file can be read back in R with
``readRDS``, and the objects in an .rda file can be loaded with ``load``:

.. code-block:: r

plants_df <- readRDS("plants_df.rds")
plants_df
     plant_group              status count
16 Pteridophytes      Data Deficient   398
17 Pteridophytes             Extinct    65
18 Pteridophytes      Not Threatened  1294
19 Pteridophytes Possibly Threatened   408
20 Pteridophytes          Threatened  1275

load("ghg_df.rda")

mget(list=ls())
$ghg_df
                  gas year emissions
141    Carbon dioxide 2018 5424.8815
142           Methane 2018  634.4571
143     Nitrous oxide 2018  434.5286
144 Fluorinated gases 2018  182.7824
145             Total 2018 6676.6496

For more information on the underlying ``pyreadr`` package, including notes on
support and limitations, see the main page of `pyreadr`_. For more information on
R serialization data types, see the docs on `rds`_ and `rda`_ data files.

.. _pyreadr: https://github.com/ofajardo/pyreadr

.. _rds: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS

.. _rda: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/save


.. _io.stata:

Stata format
Expand Down Expand Up @@ -5958,6 +6242,7 @@ outside of this range, the variable is cast to ``int16``.
115 dta file format. Attempting to write *Stata* dta files with strings
longer than 244 characters raises a ``ValueError``.


.. _io.stata_reader:

Reading from Stata format