Description
Pandas DataFrames containing text columns are expensive to serialize. This affects dask.dataframe performance in multiprocessing or distributed settings.
Pickle is expensive
In particular, the current solution of using pickle.dumps for object dtype columns can be needlessly expensive when all of the values in the column are text. In that case fairly naive solutions, like msgpack, can be much faster.
Here is an old blogpost on the topic: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization
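To make the comparison concrete, here is a minimal stdlib-only sketch contrasting pickle.dumps against directly encoding an all-text column as length-prefixed UTF-8 bytes. The `encode_text`/`decode_text` helpers are hypothetical illustrations, not pandas API; relative timings will vary by Python version and data.

```python
import pickle
import time

# A "column" of short text values, standing in for an object-dtype pandas column.
column = ["record-%d" % (i % 1000) for i in range(100_000)]

def encode_text(values):
    """Naively pack a list of strings as length-prefixed UTF-8 bytes."""
    parts = []
    for v in values:
        b = v.encode("utf-8")
        parts.append(len(b).to_bytes(4, "little"))
        parts.append(b)
    return b"".join(parts)

def decode_text(data):
    """Invert encode_text: read a 4-byte length, then that many bytes."""
    out, i = [], 0
    while i < len(data):
        n = int.from_bytes(data[i:i + 4], "little")
        i += 4
        out.append(data[i:i + n].decode("utf-8"))
        i += n
    return out

t0 = time.perf_counter()
pickled = pickle.dumps(column)
t1 = time.perf_counter()
packed = encode_text(column)
t2 = time.perf_counter()

assert decode_text(packed) == column  # round-trip check
print("pickle: %.4fs, %d bytes" % (t1 - t0, len(pickled)))
print("naive:  %.4fs, %d bytes" % (t2 - t1, len(packed)))
```

A real implementation would do the packing in C or with a library like msgpack rather than a Python loop, but the byte layout is the same idea.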
Alternatives
There are naive solutions like msgpack (already in pandas) or encoding the text directly.
There are also more sophisticated solutions that would pack the data efficiently, including columns with many repeated elements.
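One such sophisticated approach for repeated elements is dictionary encoding, the same idea behind pandas Categoricals: store each unique string once plus an array of integer codes. The `dict_encode`/`dict_decode` names below are hypothetical sketch functions, not an existing API.

```python
def dict_encode(values):
    """Dictionary-encode a list of strings: unique values plus integer codes."""
    table = {}   # string -> code, in first-seen order
    codes = []
    for v in values:
        codes.append(table.setdefault(v, len(table)))
    uniques = list(table)  # dict preserves insertion order, so index == code
    return uniques, codes

def dict_decode(uniques, codes):
    """Invert dict_encode."""
    return [uniques[c] for c in codes]

# A column with heavy repetition: only 3 distinct strings in 9 values.
column = ["red", "green", "red", "blue", "red", "green", "red", "red", "blue"]
uniques, codes = dict_encode(column)
assert dict_decode(uniques, codes) == column
print(uniques)  # each distinct string stored once
print(codes)    # small integers, cheap to serialize as a fixed-width array
```

When repetition is high, serializing the uniques once and the codes as a compact integer array can beat re-encoding every string.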
But sometimes objects are objects
One concern here is that sometimes the Python objects aren't text. I propose that Pandas either checks each time whether every value in the column is text, or asks forgiveness by attempting the fast path and falling back to pickle on an exception.
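The check-then-fall-back idea can be sketched as follows. This is a hypothetical `serialize_column` helper, not pandas API; it uses the explicit "check each time" variant (an isinstance scan), with pickle as the fallback for mixed-object columns. An ask-forgiveness variant would instead attempt the text encoding and catch the resulting exception.

```python
import pickle

def serialize_column(values):
    """Use a fast text path when every value is a str; otherwise pickle.

    Returns a one-byte tag identifying the codec, plus the payload.
    """
    if all(isinstance(v, str) for v in values):  # the "check each time" step
        # Hypothetical fast text path: length-prefixed UTF-8 bytes.
        parts = []
        for v in values:
            b = v.encode("utf-8")
            parts.append(len(b).to_bytes(4, "little") + b)
        return b"T", b"".join(parts)
    # Objects are objects: fall back to the general (slower) solution.
    return b"P", pickle.dumps(values)

tag, payload = serialize_column(["a", "b", "c"])
print(tag)  # text path taken
tag, payload = serialize_column(["a", 3, None])
print(tag)  # pickle fallback taken
```

The isinstance scan adds an O(n) pass, but it is cheap relative to serialization itself and short-circuits on the first non-string value.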
Anyway, I would find this valuable. It would help to reduce bottlenecks in dask.dataframe in some situations.