pandas-dev · ParfaitG · Apr 11, 2021 · Apr 11, 2021 · Apr 11, 2021 · Apr 12, 2021
diff --git a/ci/deps/actions-37.yaml b/ci/deps/actions-37.yaml
@@ -25,4 +25,5 @@ dependencies:
   - flask
   - tabulate
   - pyreadstat
+  - pyreadr
   - pip
diff --git a/ci/deps/azure-macos-37.yaml b/ci/deps/azure-macos-37.yaml
@@ -33,4 +33,5 @@ dependencies:
   - pip:
     - cython>=0.29.21
     - pyreadstat
+    - pyreadr
     - pyxlsb
diff --git a/ci/deps/azure-windows-37.yaml b/ci/deps/azure-windows-37.yaml
@@ -37,6 +37,7 @@ dependencies:
   - xlsxwriter
   - xlwt
   - pyreadstat
+  - pyreadr
   - pip
   - pip:
     - pyxlsb
diff --git a/doc/source/getting_started/install.rst b/doc/source/getting_started/install.rst
@@ -360,6 +360,7 @@ zlib                                         Compression for HDF5
 fastparquet               0.4.0              Parquet reading / writing
 pyarrow                   0.15.0             Parquet, ORC, and feather reading / writing
 pyreadstat                                   SPSS files (.sav) reading
+pyreadr                                      R files (.RData, .rda, .rds) reading / writing
 ========================= ================== =============================================================
 
 Access data in the cloud

diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst
@@ -31,6 +31,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
     binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
     binary;`ORC Format <https://orc.apache.org/>`__;:ref:`read_orc<io.orc>`;
     binary;`Msgpack <https://msgpack.org/>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
+    binary;`R <https://www.r-project.org/>`__;:ref:`read_rdata<io.rdata_reader>`;:ref:`to_rdata<io.rdata_writer>`
     binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
     binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
     binary;`SPSS <https://en.wikipedia.org/wiki/SPSS>`__;:ref:`read_spss<io.spss_reader>`;
@@ -5903,6 +5904,289 @@ respective functions from ``pandas-gbq``.
 
 Full documentation can be found `here <https://pandas-gbq.readthedocs.io/>`__.
 
+
+.. _io.rdata:
+
+R data format
+-------------
+
+.. _io.rdata_reader:
+
+Reading R data
+''''''''''''''
+
+.. versionadded:: 1.3.0
+
+The top-level function ``read_rdata`` reads the native serialization formats
+of the R language and environment. For .RData and its synonymous shorthand, .rda,
+which can hold multiple R objects, the function returns a ``dict`` of ``DataFrames``.
+For .rds files, which contain only a single R object, the function returns a single
+``DataFrame``.
+
+.. note::
+
+   Since any R object can be saved in these formats, this function returns only
+   data.frame objects or objects coercible to data.frames, including matrices,
+   tibbles, and data.tables, and to some extent arrays.
+
+For example, consider the following data.frames generated in R from environmental
+data samples drawn from US EPA, UK BGCI, and NOAA public data:
+
+.. code-block:: r
+
+   ghg_df <- data.frame(
+     gas = c("Carbon dioxide", "Methane", "Nitrous oxide",
+             "Fluorinated gases", "Total"),
+     year = c(2018, 2018, 2018, 2018, 2018),
+     emissions = c(5424.88150213288, 634.457127078267, 434.528555376666,
+                   182.782432461777, 6676.64961704959),
+     row.names = c(141:145),
+     stringsAsFactors = FALSE
+   )
+
+   saveRDS(ghg_df, file="ghg_df.rds")
+
+   plants_df <- data.frame(
+     plant_group = c("Pteridophytes", "Pteridophytes", "Pteridophytes",
+                     "Pteridophytes", "Pteridophytes"),
+     status = c("Data Deficient", "Extinct", "Not Threatened",
+                "Possibly Threatened", "Threatened"),
+     count = c(398, 65, 1294, 408, 1275),
+     row.names = c(16:20),
+     stringsAsFactors = FALSE
+   )
+
+   saveRDS(plants_df, file="plants_df.rds")
+
+   sea_ice_df <- data.frame(
+     year = c(2016, 2017, 2018, 2019, 2020),
+     mo = c(12, 12, 12, 12, 12),
+     data.type = c("Goddard", "Goddard", "Goddard", "Goddard", "NRTSI-G"),
+     region = c("S", "S", "S", "S", "S"),
+     extent = c(8.28, 9.48, 9.19, 9.41, 10.44),
+     area = c(5.51, 6.23, 5.59, 6.59, 6.5),
+     row.names = c(1012:1016),
+     stringsAsFactors = FALSE
+   )
+
+   saveRDS(sea_ice_df, file="sea_ice_df.rds")
+
+   save(ghg_df, plants_df, sea_ice_df, file="env_data_dfs.rda")
+
+With ``read_rdata``, you can read the .rds or .rda files created above:
+
+.. ipython:: python
+   :suppress:
+
+   rel_path = os.path.join("..", "pandas", "tests", "io", "data", "rdata")
+   file_path = os.path.abspath(rel_path)
+
+.. ipython:: python
+
+   rds_file = os.path.join(file_path, "ghg_df.rds")
+   ghg_df = pd.read_rdata(rds_file).tail()
+   ghg_df
+
+   rda_file = os.path.join(file_path, "env_data_dfs.rda")
+   env_dfs = pd.read_rdata(rda_file)
+   {k: df.tail() for k, df in env_dfs.items()}
+
+To ignore the row names of a data.frame, use the option ``rownames=False``:
+
+.. ipython:: python
+
+   rds_file = os.path.join(file_path, "plants_df.rds")
+   plants_df = pd.read_rdata(rds_file, rownames=False).tail()
+   plants_df
+
+
+To select specific objects in an .rda file, pass a list of names into ``select_frames``:
+
+.. ipython:: python
+
+   rda_file = os.path.join(file_path, "env_data_dfs.rda")
+   env_dfs = pd.read_rdata(rda_file, select_frames=["sea_ice_df"])
+   env_dfs
+
+To read from a file-like object, pass the object's content as the ``path_or_buffer``
+argument along with ``file_format``:
+
+.. ipython:: python
+
+   rds_file = os.path.join(file_path, "plants_df.rds")
+   with open(rds_file, "rb") as f:
+       plants_df = pd.read_rdata(f.read(), file_format="rds")
+
+   plants_df
+
+To read from a URL, pass the link directly into the function:
+
+.. ipython:: python
+
+   url = ("https://github.com/hadley/nycflights13/"
+          "blob/master/data/airlines.rda?raw=true")
+
+   airlines = pd.read_rdata(url, file_format="rda")
+   airlines
+
+To read from an Amazon S3 bucket, point to the storage path. Note that R data
+encoded in anything other than UTF-8 is currently not supported, as this example shows:
+
+.. code-block:: ipython
+
+   In [608]: ghcran = pd.read_rdata("s3://public-r-data/ghcran.Rdata")
+   ...
+   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 45: invalid continuation byte
+
+Also, if an R data file does not contain any data frame object, a parsing error
+will occur:
+
+.. code-block:: ipython
+
+   In [608]: rda_file = os.path.join(file_path, "env_data_non_dfs.rda")
+
+   In [609]: pd.read_rdata(rda_file)
+   ...
+   LibrdataError: Invalid file, or file has unsupported features
+
+
+R's ``Date`` type (which has no time component) translates to ``object`` dtype
+in pandas, and R's date/time type, ``POSIXct``, translates to UTC time
+in pandas:
+
+.. ipython:: python
+
+   ppm_df = pd.read_rdata(os.path.join(file_path, "ppm_df.rds"))
+   ppm_df.head()
+   ppm_df.tail()
+   ppm_df.dtypes
+
+
+.. _io.rdata_writer:
+
+Writing R data
+''''''''''''''
+
+.. versionadded:: 1.3.0
+
+The method :func:`~pandas.core.frame.DataFrame.to_rdata` writes a DataFrame
+into an R data file (.RData, .rda, or .rds).
+
+To write a single DataFrame in rds format, pass a file path or buffer to the method:
+
+.. ipython:: python
+
+   plants_df.to_rdata("plants_df.rds")
+
+To write a single DataFrame in RData or rda format, pass a file path or buffer to
+the method and optionally give the object a name with ``rda_name``:
+
+.. ipython:: python
+
+   ghg_df.to_rdata("ghg_df.rda", rda_name="ghg_df")
+
+While RData and rda types can hold multiple R objects, this method currently
+only supports writing out a single DataFrame.
+
+You can also write to a buffer and read its content back:
+
+.. ipython:: python
+
+    with BytesIO() as b_io:
+        env_dfs["sea_ice_df"].to_rdata(b_io, file_format="rda", index=False)
+        print(
+            pd.read_rdata(
+                b_io.getvalue(),
+                file_format="rda",
+                rownames=False,
+            )["pandas_dataframe"].tail()
+        )
+
+Although the DataFrame index does not map to R rownames, the default ``index=True``
+writes it out as a named column, or as multiple columns for a MultiIndex:
+
+.. ipython:: python
+
+    ghg_df.rename_axis(None).to_rdata("ghg_df.rds")
+
+    pd.read_rdata("ghg_df.rds").tail()
+
+To ignore the index, use ``index=False``:
+
+.. ipython:: python
+
+    ghg_df.rename_axis(None).to_rdata("ghg_df.rds", index=False)
+
+    pd.read_rdata("ghg_df.rds").tail()
+
+By default, these R serialized formats are compressed with the gzip, bzip2,
+or xz algorithm. As in R, the default compression in this method is "gzip" or
+"gz". Note the difference in size between compressed and uncompressed files:
+
+.. ipython:: python
+
+   plants_df.to_rdata("plants_df_gz.rds")
+   plants_df.to_rdata("plants_df_bz2.rds", compression="bz2")
+   plants_df.to_rdata("plants_df_xz.rds", compression="xz")
+   plants_df.to_rdata("plants_df_non_comp.rds", compression=None)
+
+   os.stat("plants_df_gz.rds").st_size
+   os.stat("plants_df_bz2.rds").st_size
+   os.stat("plants_df_xz.rds").st_size
+   os.stat("plants_df_non_comp.rds").st_size
+
+Like other IO methods, ``to_rdata`` accepts ``storage_options`` for writing to remote storage platforms:
+
+.. code-block:: ipython
+
+   ghg_df.to_rdata(
+       "s3://path/to/my/storage/pandas_df.rda",
+       storage_options={"user": "xxx", "password": "???"}
+   )
+
+.. ipython:: python
+   :suppress:
+
+   os.remove("ghg_df.rds")
+   os.remove("ghg_df.rda")
+   os.remove("plants_df.rds")
+   os.remove("plants_df_gz.rds")
+   os.remove("plants_df_bz2.rds")
+   os.remove("plants_df_xz.rds")
+   os.remove("plants_df_non_comp.rds")
+
+Once exported, an .rds file can be read back in R with ``readRDS``, and an .rda file
+can be loaded with ``load``:
+
+.. code-block:: r
+
+   plants_df <- readRDS("plants_df.rds")
+   plants_df
+        plant_group              status count
+   16 Pteridophytes      Data Deficient   398
+   17 Pteridophytes             Extinct    65
+   18 Pteridophytes      Not Threatened  1294
+   19 Pteridophytes Possibly Threatened   408
+   20 Pteridophytes          Threatened  1275
+
+   load("ghg_df.rda")
+
+   mget(list=ls())
+   $ghg_df
+                     gas year emissions
+   141    Carbon dioxide 2018 5424.8815
+   142           Methane 2018  634.4571
+   143     Nitrous oxide 2018  434.5286
+   144 Fluorinated gases 2018  182.7824
+   145             Total 2018 6676.6496
+
+For more information on the underlying ``pyreadr`` package, including notes on
+support and limitations, see the main page of `pyreadr`_. For more information on R
+serialization data types, see the docs on `rds`_ and `rda`_ data files.
+
+.. _pyreadr: https://github.com/ofajardo/pyreadr
+
+.. _rds: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS
+
+.. _rda: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/save
+
+
 .. _io.stata:
 
 Stata format
@@ -5958,6 +6242,7 @@ outside of this range, the variable is cast to ``int16``.
   115 dta file format. Attempting to write *Stata* dta files with strings
   longer than 244 characters raises a ``ValueError``.
 
+
 .. _io.stata_reader:
 
 Reading from Stata format
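
A rough, standalone illustration of the compression note in the `to_rdata` documentation added above. This sketch does not call `to_rdata` or `pyreadr`; it only compares the gzip, bzip2, and xz algorithms using Python's standard library on a made-up payload (an assumption for illustration, not the byte layout R actually writes), so the absolute sizes will not match real .rds files:

```python
import bz2
import gzip
import lzma

# Hypothetical payload standing in for a serialized data frame; it is NOT
# the byte layout that R or pyreadr actually writes.
payload = b"Pteridophytes,Threatened,1275\n" * 1000

# Map the names used by the ``compression`` argument in the docs to the
# corresponding standard-library compressors.
compressors = {
    "gzip": gzip.compress,
    "bz2": bz2.compress,
    "xz": lzma.compress,  # lzma.compress uses the .xz container by default
}

# Record the uncompressed size first, then the size after each algorithm.
sizes = {"none": len(payload)}
for name, compress in compressors.items():
    sizes[name] = len(compress(payload))

for name, size in sizes.items():
    print(f"{name:>4}: {size} bytes")
```

On repetitive input like this, all three compressed sizes come out below the uncompressed size; which algorithm produces the smallest output depends on the data.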