ENH: IO support for R data files with pandas.read_rdata and DataFrame.to_rdata (#40287)
Are R data files commonly used for data exchange? One of the arguments for SAS and Stata is that it is unfortunately common to see organizations publishing datasets in those formats. I can't say I've ever seen RDS used in this way.
@ParfaitG Interesting proposal. We would have to be a bit careful with naming here, as `pandas.read_rds` and `DataFrame.to_rds` risk a name collision.

Good point @bashtage! Perhaps since R is a programming language or environment, and not traditional software with proprietary types, the .rds format is not traditionally used in data exchange; raw data and code would be enough to reproduce end-use data. However, anecdotally, many advanced useRs on internal teams use this format to save cleaned data, plotting data, modeling results, etc., since it is a binary, compressed serialization type that avoids parsing text files and detecting types. Also, the .rda format is the dominant data storage format in R packages, and this need is routinely asked about on StackOverflow.

@dsaxton, I didn't think of that name collision. We can call it `pandas.read_rdata` and `DataFrame.to_rdata`.
I think supporting R data files is reasonable. The big question would be how support should be added. Would it be better to take a soft dep on pyreadr, like pandas does for most IO (e.g., openpyxl for Excel)? This way it will work as expected if this library is available. It saves the cost of maintaining a vendored code snippet and keeping it synced upstream. The downside is that new releases of a soft dep can break CI.
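For context, the soft-dep pattern means the IO reader only imports its backend lazily and fails with an actionable message when it is missing. A simplified sketch of that guard (the helper name and error wording here are illustrative, not pandas' exact internals):

```python
import importlib


def import_optional_dependency(name, extra=""):
    """Try to import an optional IO backend; raise a friendly error if missing.

    Simplified sketch of the guard pandas uses for soft dependencies
    such as openpyxl (Excel) or pyreadstat (SPSS).
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        raise ImportError(
            f"Missing optional dependency '{name}'. {extra} "
            f"Use pip or conda to install {name}."
        ) from None


# a hypothetical read_rdata would then start with, e.g.:
# pyreadr = import_optional_dependency(
#     "pyreadr", extra="pyreadr is required to read R data files.")
```

With this arrangement pandas itself imports cleanly whether or not the backend is installed; only calling the R-file reader requires it.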
Thanks, @bashtage; my thought is to bypass pyreadr. Specifically, with no external dependencies, the pandas plan would include:

Now, would the pandas team be open to a new IO C extension?
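One concrete piece of a no-dependency route is detecting which compression wrapper `saveRDS` used, since the payload may be gzip (the default), bzip2, or xz before the serialization stream itself is parsed. A stdlib-only sketch, where the function names and the `"none"` sentinel are mine:

```python
import bz2
import gzip
import lzma

# magic bytes for the three compressors saveRDS can emit
_MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
}


def sniff_rds_compression(header: bytes) -> str:
    """Return the compression wrapper of an .rds payload from its first bytes."""
    for magic, name in _MAGIC.items():
        if header.startswith(magic):
            return name
    return "none"  # uncompressed stream (e.g. starts with b"X\n" for XDR)


def decompress_rds(raw: bytes) -> bytes:
    """Unwrap the compression layer, leaving the raw R serialization stream."""
    kind = sniff_rds_compression(raw)
    if kind == "gzip":
        return gzip.decompress(raw)
    if kind == "bzip2":
        return bz2.decompress(raw)
    if kind == "xz":
        return lzma.decompress(raw)
    return raw
```

All three codecs ship with CPython, so this layer genuinely adds no external dependency; the hard part remains decoding the serialization stream underneath.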
pyreadr developer here. I personally would suggest using pyreadr as a soft dep. It is not correct that the rds and rda formats do not change; they do with major and minor versions of R, and these changes are undocumented (see for example here, here, here). And we are still improving, as we cannot read all existing features, since again everything is undocumented. That means if you do your own code base, you will have to maintain it (maintenance would be completely on your side, since I don't have capacity to maintain two code bases). I also develop pyreadstat; pandas uses it as a soft dep for read_spss and that approach seems to be working really well. Of course pyreadr is an open-source project, so you are free to take the code. However, take into account that the license of pyreadr is very restrictive. I am not sure what kind of license pandas has, but you have to ensure that the restrictions for these pieces of code, even if they become detached from pyreadr, stay as strict as they are now. You will also need to distribute the pyreadr license and attached licenses together with the pandas license. I will also ask you to do the first commit with my GitHub handle so that I appear as a contributor to the repo.
Thank you, @ofajardo, for your input! First, do be aware you can take the lead on a PR for this proposed IO module. As an author who relies on pandas, why not become an original contributor? If using the soft-dep approach, you can follow a setup similar to pyreadstat's. Given your response, here are my thoughts:
With that said, thank you for authoring various data exchange packages in the pandas ecosystem over the years! Judging from the SO posts above, many users have been grateful. I am looking into other solutions to build this specific IO support and may have a different approach in mind.
It is important to acknowledge that there is a non-trivial developer cost to streamlining. There are three options here:
More IO formats rely on a soft dependency than don't. An incomplete list of formats that require one:
I don't see how this argues against a soft dependency. pandas could take it as a soft dep and still provide a uniform API on top, including building any missing features, or converting between what the dep prefers and what pandas prefers.
There have been about a page full of commits in the past year; if these are all necessary, then it seems to drift around a bit.
I don't think this is much of an argument against the soft-dep approach. Each package is allowed to have its own accepted code style. NumPy is pretty far from "full" PEP 8, yet no one suggests not building on NumPy.
You also seem to acknowledge pyreadr in your Cython above. You cannot use any code from pyreadr, since it is GPL-licensed. A vendored version would need a clean-sheet implementation that directly wraps the C library without using code from pyreadr.
hey @ParfaitG thanks for your thoughtful answer! I am still aligned with @bashtage's thinking that a soft dep is better in this (and other IO) case(s), and in general that modular is better than monolithic. It seems there are enough successful examples of this approach in pandas, as @bashtage has pointed out, to demonstrate that it works very well. But that's just my humble opinion, and I am not a pandas dev, so up to you guys to decide! In case you would like to go for a soft dep, you have my full collaboration to make changes in pyreadr to align and better integrate with pandas, including cleaning the code to make it more PEP-conforming, either doing it myself or accepting PRs from others. As @bashtage suggests, notice that the license of pyreadr is AGPL, so it probably clashes with pandas and indeed you cannot take it. But writing a better wrapper for librdata from scratch (or some other approach, as you mentioned) should be no issue for you in case you decide to go for an internal module. Just a couple of other comments:
I actually am using the full librdata API, trying to be as comprehensive as possible. Librdata currently has a lot of limitations you won't be able to overcome unless you directly contribute to the librdata C code. If you include librdata as a hard dep, you will start getting issues around R lists not being read, S4 objects not being read, etc. (just check the pyreadr and librdata issues to see what I mean). I currently don't have capacity to work on those issues, but if you do, and you fix those things in librdata plus your internal module, that would be a step forward! And in case you decide for a soft dep and have ideas on how to improve pyreadr and would like to contribute, you would be very welcome! However, if you truly want to be in full control of the process and overcome the current limitations imposed by librdata, you should consider writing the converter truly from scratch, without relying on librdata.
I decided to support older versions of Python as much as possible, but I understand your disagreement with that. What I see in reality is that we do have old production servers with old CentOS that are still running Python 3.5 and 3.4, hence the motivation to keep backward compatibility at the expense of PEP styling.
I hope that data exchange between Python, R, and other technologies happens through libraries designed for that purpose, such as feather/arrow. R binary files are undocumented and fully supported only by R, with very low interoperability, so in my opinion they are a very poor solution for data exchange.
@ofajardo, FYI - re documentation, see the CRAN doc R Internals (updated 2021-03-05), section 1.8 Serialisation Formats. Also, see serialize.c (the C code underlying `saveRDS`).
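Per that section, an uncompressed version-2 stream opens with a two-byte format marker (`X\n` for binary XDR, `A\n` for ASCII, `B\n` for native binary), followed by three 32-bit big-endian integers: the serialization format version, the writing R version, and the minimal R version able to read it. A stdlib-only sketch of parsing just that documented header, where the type and field names are mine:

```python
import struct
from typing import NamedTuple, Tuple


class RdsHeader(NamedTuple):
    format: str                         # "XDR", "ASCII", or "binary"
    version: int                        # serialization format version (2 or 3)
    writer_r: Tuple[int, int, int]      # R version that wrote the stream
    min_reader_r: Tuple[int, int, int]  # minimal R version able to read it


def _unpack_r_version(v: int) -> Tuple[int, int, int]:
    # R packs a version x.y.z as x*65536 + y*256 + z
    return (v >> 16, (v >> 8) & 0xFF, v & 0xFF)


def read_rds_header(raw: bytes) -> RdsHeader:
    """Parse the documented header of an uncompressed R serialization stream."""
    markers = {b"X\n": "XDR", b"A\n": "ASCII", b"B\n": "binary"}
    fmt = markers.get(raw[:2])
    if fmt is None:
        raise ValueError("not an uncompressed R serialization stream")
    if fmt != "XDR":
        raise NotImplementedError("this sketch only handles the XDR (binary) layout")
    version, writer, minimal = struct.unpack(">3i", raw[2:14])
    return RdsHeader(fmt, version,
                     _unpack_r_version(writer), _unpack_r_version(minimal))
```

Everything past this header (the tagged SEXP tree) is where the undocumented drift lives, which is the crux of the maintenance debate above.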
Thank you @bashtage for your comments. Not to belabor this discussion (and I appreciate your time), how about a 4th option: a direct command line call to R, via `Rscript`, to access `readRDS` and `saveRDS`?

For option 4, consider this demo for .rds types, which converts data to .csv, respecting column names, row names/indexes, and data types, using temp files and a temp directory (like SAS's ...).

Read Rdata

```r
mtcars$now <- Sys.time()  # ADD TIME FOR DEMONSTRATION
saveRDS(mtcars, "mtcars.rds")
```

```python
from datetime import datetime
import os
from subprocess import Popen, PIPE
from tempfile import TemporaryDirectory

import pandas as pd


def read_rdata(rds_file):
    cmd = "Rscript"
    r_to_py_types = {'logical': 'bool', 'integer': 'int64', 'numeric': 'float64',
                     'character': 'str', 'factor': 'str', 'Date': 'date', 'POSIXct': 'date'}

    with TemporaryDirectory() as tmpdir:
        py_csv = os.path.join(tmpdir, "pydata.csv")

        # BUILD TEMP SCRIPT TO READ RDS AND OUTPUT TO CONSOLE
        r_code = os.path.join(tmpdir, "r_batch.R")
        with open(r_code, "w") as f:
            f.write("""args <- commandArgs(trailingOnly=TRUE)
df_r <- readRDS(args[length(args)-1])
write.csv(df_r, file=args[length(args)])
cat(paste(colnames(df_r), collapse=","), "|",
    paste(sapply(df_r, function(x) class(x)[1]), collapse=","),
    sep="")
""")

        # SET UP COMMAND LINE ARGS, RUN COMMAND, RECEIVE OUTPUT/ERROR
        cmds = [cmd, r_code, rds_file, py_csv]
        a = Popen(cmds, stdin=PIPE, stdout=PIPE, stderr=PIPE)
        output, error = a.communicate()
        if error:
            print(error.decode("UTF-8"))

        # MAP R CLASSES TO PANDAS DTYPES FROM CONSOLE OUTPUT
        r_hdrs = [h.split(",") for h in output.decode("UTF-8").split("|")]
        py_types = {n: r_to_py_types[d] for n, d in zip(*r_hdrs)}
        dt_cols = [col for col, d in py_types.items() if d == "date"]
        py_types = {k: v for k, v in py_types.items() if v != "date"}

        # IMPORT PANDAS DATA FRAME
        df = pd.read_csv(py_csv, index_col=0, dtype=py_types, parse_dates=dt_cols)

    return df


py_df = read_rdata("mtcars.rds")

print(py_df.dtypes)
# mpg             float64
# cyl             float64
# disp            float64
# hp              float64
# drat            float64
# wt              float64
# qsec            float64
# vs              float64
# am              float64
# gear            float64
# carb            float64
# now      datetime64[ns]
# dtype: object

print(py_df.head())
#                     mpg  cyl   disp     hp  drat     wt   qsec   vs   am  gear  carb                 now
# Mazda RX4          21.0  6.0  160.0  110.0  3.90  2.620  16.46  0.0  1.0   4.0   4.0 2021-03-22 10:58:34
# Mazda RX4 Wag      21.0  6.0  160.0  110.0  3.90  2.875  17.02  0.0  1.0   4.0   4.0 2021-03-22 10:58:34
# Datsun 710         22.8  4.0  108.0   93.0  3.85  2.320  18.61  1.0  1.0   4.0   1.0 2021-03-22 10:58:34
# Hornet 4 Drive     21.4  6.0  258.0  110.0  3.08  3.215  19.44  1.0  0.0   3.0   1.0 2021-03-22 10:58:34
# Hornet Sportabout  18.7  8.0  360.0  175.0  3.15  3.440  17.02  0.0  0.0   3.0   2.0 2021-03-22 10:58:34

print(py_df.tail())
# Lotus Europa       30.4  4.0   95.1  113.0  3.77  1.513  16.90  1.0  1.0   5.0   2.0 2021-03-22 10:58:34
# Ford Pantera L     15.8  8.0  351.0  264.0  4.22  3.170  14.50  0.0  1.0   5.0   4.0 2021-03-22 10:58:34
# Ferrari Dino       19.7  6.0  145.0  175.0  3.62  2.770  15.50  0.0  1.0   5.0   6.0 2021-03-22 10:58:34
# Maserati Bora      15.0  8.0  301.0  335.0  3.54  3.570  14.60  0.0  1.0   5.0   8.0 2021-03-22 10:58:34
# Volvo 142E         21.4  4.0  121.0  109.0  4.11  2.780  18.60  1.0  1.0   4.0   2.0 2021-03-22 10:58:34
```

Write Rdata

```python
def write_rdata(frame, rds_file):
    cmd = "Rscript"
    py_to_r_types = {'int32': 'integer', 'int64': 'integer', 'float64': 'numeric',
                     'object': 'character', 'bool': 'logical', 'datetime64[ns]': 'POSIXct'}
    # stringify dtypes before mapping to R classes
    r_types = ",".join(frame.reset_index().dtypes.astype(str).replace(py_to_r_types))

    with TemporaryDirectory() as tmpdir:
        py_csv = os.path.join(tmpdir, "py_df.csv")
        frame.to_csv(py_csv)

        # BUILD TEMP SCRIPT TO INPUT CSV AND SAVE RDS
        r_code = os.path.join(tmpdir, "r_batch.R")
        with open(r_code, "w") as f:
            f.write("""args <- commandArgs(trailingOnly=TRUE)
py_csv <- args[length(args)-2]
r_types <- strsplit(args[length(args)-1], ",")[[1]]
df_r <- read.csv(py_csv, colClasses=r_types)
df_r <- `row.names<-`(df_r[-1], df_r[[1]])
saveRDS(df_r, args[length(args)])
""")

        # SET UP COMMAND LINE ARGS, RUN COMMAND, RECEIVE OUTPUT/ERROR
        cmds = [cmd, r_code, py_csv, r_types, rds_file]
        a = Popen(cmds, stdin=PIPE, stdout=PIPE, stderr=PIPE)
        output, error = a.communicate()
        if error:
            print(error.decode("UTF-8"))

    return None


py_df = (pd.read_csv("https://github.com/raw/mwaskom/seaborn-data/master/mpg.csv")
           .assign(now=datetime.now()))  # ADD TIME FOR DEMONSTRATION

write_rdata(py_df, "mpg.rds")
```

```r
r_df <- readRDS("mpg.rds")

str(r_df)
# 'data.frame': 398 obs. of 10 variables:
#  $ mpg         : num 18 15 18 16 17 15 14 14 14 15 ...
#  $ cylinders   : int 8 8 8 8 8 8 8 8 8 8 ...
#  $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
#  $ horsepower  : num 130 165 150 150 140 198 220 215 225 190 ...
#  $ weight      : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
#  $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
#  $ model_year  : int 70 70 70 70 70 70 70 70 70 70 ...
#  $ origin      : chr "usa" "usa" "usa" "usa" ...
#  $ name        : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
#  $ now         : POSIXct, format: "2021-03-22 10:58:40" "2021-03-22 10:58:40" "2021-03-22 10:58:40" "2021-03-22 10:58:40" ...

head(r_df)
#   mpg cylinders displacement horsepower weight acceleration model_year origin                      name                 now
# 0  18         8          307        130   3504         12.0         70    usa chevrolet chevelle malibu 2021-03-22 10:58:40
# 1  15         8          350        165   3693         11.5         70    usa         buick skylark 320 2021-03-22 10:58:40
# 2  18         8          318        150   3436         11.0         70    usa        plymouth satellite 2021-03-22 10:58:40
# 3  16         8          304        150   3433         12.0         70    usa             amc rebel sst 2021-03-22 10:58:40
# 4  17         8          302        140   3449         10.5         70    usa               ford torino 2021-03-22 10:58:40
# 5  15         8          429        198   4341         10.0         70    usa          ford galaxie 500 2021-03-22 10:58:40

tail(r_df)
#     mpg cylinders displacement horsepower weight acceleration model_year origin             name                 now
# 392  27         4          151         90   2950         17.3         82    usa chevrolet camaro 2021-03-22 10:58:40
# 393  27         4          140         86   2790         15.6         82    usa  ford mustang gl 2021-03-22 10:58:40
# 394  44         4           97         52   2130         24.6         82 europe        vw pickup 2021-03-22 10:58:40
# 395  32         4          135         84   2295         11.6         82    usa    dodge rampage 2021-03-22 10:58:40
# 396  28         4          120         79   2625         18.6         82    usa      ford ranger 2021-03-22 10:58:40
# 397  31         4          119         82   2720         19.4         82    usa       chevy s-10 2021-03-22 10:58:40
```
I will get started on a PR that will read/write R data files with `pandas.read_rdata` and `DataFrame.to_rdata`.
Currently, pandas IO tools for binary files largely support the commercial statistical packages (SAS, Stata, SPSS). Interestingly, R binary types (.rds, .rda) are not included. Since many data science teams work across the open-source stacks, some pandas IO support for R data files may be worthwhile to pursue.

I know there is some history of pandas with rpy2. However, there may be a way to integrate an IO module for R data files without an optional dependency (i.e., pyreadr) by using a lightweight C library: librdata. Also, R's `saveRDS` uses compression types (`gzip`, `bzip2`, and `xz`) already handled in pandas IO.

Thanks to the authors of pyreadr and librdata (not unlike the `sas7bdat` authors for `read_sas` or the `PyDTA` authors for `read_stata`), I was able to implement a demo on an uncompressed rds type.

R

Python (using a Cython built module)

Parser

Writer

R