
Alternative serialization for text data #14462


Closed
mrocklin opened this issue Oct 20, 2016 · 5 comments
Labels
API Design · Closing Candidate · Dtype Conversions · Enhancement · IO Data · Performance

Comments

@mrocklin (Contributor)

Pandas DataFrames containing text columns are expensive to serialize. This affects dask.dataframe performance in multiprocessing or distributed settings.

Pickle is expensive

In particular, the current solution of using pickle.dumps for object-dtype columns can be needlessly expensive when all of the values in the column are text. In this case, fairly naive solutions like msgpack can be much faster.
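As a minimal illustration of that kind of comparison (my sketch, not code from the issue; requires the third-party msgpack package, and timings will vary by machine):

```python
import pickle
import timeit

import msgpack
import pandas as pd

# An object-dtype column that happens to contain only text.
s = pd.Series(["value-%d" % i for i in range(1_000_000)], dtype=object)

def via_pickle():
    return pickle.dumps(s.to_numpy(), protocol=pickle.HIGHEST_PROTOCOL)

def via_msgpack():
    # Only the raw strings are encoded; index/dtype metadata would have to be
    # carried separately in a real serializer.
    return msgpack.packb(s.tolist())

print("pickle :", timeit.timeit(via_pickle, number=3))
print("msgpack:", timeit.timeit(via_msgpack, number=3))
```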

Here is an old blogpost on the topic: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization

And an image (benchmark figure attached in the original issue).

Alternatives

There are naive solutions like msgpack (already in pandas) or encoding the text directly.

There are also more sophisticated solutions that would provide efficient packing, including for columns with many repeated elements (see the sketch below).
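As a rough sketch of that idea (my illustration, not code from the issue), dictionary encoding via pandas' own factorize ships each distinct string once plus a compact integer code array:

```python
import pandas as pd

col = pd.Series(["spam", "eggs", "spam", "spam", "eggs"] * 200_000, dtype=object)

# factorize() returns integer codes plus the array of unique values; serializing
# those two pieces is far smaller than serializing every repeated string.
codes, uniques = pd.factorize(col)

# Round-trip back to the original column on the receiving side.
restored = pd.Series(uniques.take(codes))
assert (restored == col).all()
```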

But sometimes objects are objects

One concern here is that sometimes the Python objects aren't text. I propose that pandas either checks the values each time or asks for forgiveness by catching the exception.
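A hedged sketch of that check-or-forgiveness idea (a hypothetical helper, not pandas API; uses msgpack as the fast text path):

```python
import pickle

import msgpack
import pandas as pd

def serialize_object_column(values):
    """Return (format_tag, payload) for a list of Python objects (hypothetical helper)."""
    try:
        # msgpack raises TypeError for objects it cannot encode (e.g. arbitrary classes),
        # so the fast path is attempted first and forgiveness is asked on failure.
        return "msgpack", msgpack.packb(values, use_bin_type=True)
    except TypeError:
        # Fall back to the general-purpose pickle path.
        return "pickle", pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)

texts = pd.Series(["a", "b", "c"], dtype=object)
mixed = pd.Series(["a", object(), 3.5], dtype=object)

print(serialize_object_column(texts.tolist())[0])   # "msgpack"
print(serialize_object_column(mixed.tolist())[0])   # "pickle"
```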

Anyway, I would find this valuable. It would help to reduce bottlenecks in dask.dataframe in some situations.

@jreback added the Performance and Dtype Conversions labels on Oct 21, 2016
@jreback (Contributor) commented Mar 3, 2017

I think it's reasonable, now that pyarrow 0.2 is out, that we could add convenience functions (marked experimental, hah!):
.to_arrow() and .from_arrow(), similar to .to_feather() and .from_feather().
xref dask/distributed#614
cc @wesm @mrocklin
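Those DataFrame methods never landed, but as a rough sketch, such convenience wrappers would presumably be thin layers over pyarrow's existing conversion API (the function names below are hypothetical):

```python
import pandas as pd
import pyarrow as pa

def to_arrow(df: pd.DataFrame) -> pa.Table:
    # pyarrow infers an Arrow schema from the pandas dtypes, including text columns.
    return pa.Table.from_pandas(df)

def from_arrow(table: pa.Table) -> pd.DataFrame:
    return table.to_pandas()

df = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})
assert from_arrow(to_arrow(df)).equals(df)
```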

@jreback added this to the 0.20.0 milestone on Mar 3, 2017
@jreback added the API Design and IO Data labels on Mar 3, 2017
@wesm (Member) commented Mar 3, 2017

FYI: I started working on combining the Feather and pyarrow codebases, which will also bring new Feather features (e.g. reading and writing to Python file objects).

Another aside: I'm also planning to eventually develop a native C++ JSON table parser using RapidJSON within the Arrow codebase. Schema inference is the most annoying part, but I would want this arrow_json component to pass the pandas JSON test suite. Eventually we can support schemas with unions (where some fields might contain multiple types). pandas 2.0 can acquire union types, too.

@jreback modified the milestones: 0.20.0, Next Major Release on Mar 23, 2017
@jbrockmendel (Member)

Has pyarrow solved this to everyone's satisfaction?

@jbrockmendel added the Closing Candidate label on Sep 22, 2020
@xhochy (Contributor) commented Sep 22, 2020

I guess pickle5 / pickle.dumps(protocol=5) and a new string type based on Arrow will fully solve this, but Feather may already be enough.
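For reference, a minimal sketch of the protocol-5 route (Python 3.8+); whether the column data actually travels out-of-band depends on the pandas/NumPy versions in use:

```python
import pickle

import pandas as pd

df = pd.DataFrame({"text": ["alpha", "beta"] * 100_000, "x": range(200_000)})

# Out-of-band buffers let large binary payloads be transferred without extra copies.
buffers = []
payload = pickle.dumps(df, protocol=5, buffer_callback=buffers.append)

# The receiving side passes the same buffers back in.
roundtripped = pickle.loads(payload, buffers=buffers)
assert roundtripped.equals(df)
```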

@mroeschke (Member)

I think the 'string[pyarrow]' dtype released in 1.2 might have sufficiently addressed this, so closing. We can follow up with a new issue if there are additional requirements not covered by that dtype.
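For reference, a quick illustration of that dtype (assuming a pandas version that ships the Arrow-backed string dtype and that pyarrow is installed):

```python
import pandas as pd

# Text is stored in an Arrow-backed array rather than as Python objects.
s = pd.Series(["spam", "eggs", None], dtype="string[pyarrow]")
print(s.dtype)          # Arrow-backed string dtype
print(s.str.upper())    # string methods operate on the Arrow-backed data
```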
