Alternative serialization for text data #14462
Labels:
- API Design
- Closing Candidate: May be closeable, needs more eyeballs
- Dtype Conversions: Unexpected or buggy dtype conversions
- Enhancement
- IO Data: IO issues that don't fit into a more specific label
- Performance: Memory or execution speed performance
Pandas DataFrames containing text columns are expensive to serialize. This affects dask.dataframe performance in multiprocessing or distributed settings.
Pickle is expensive
In particular, the current solution of using `pickle.dumps` for object dtype columns can be needlessly expensive when all of the values in the column are text. In this case fairly naive solutions, like msgpack, can be much faster.

Here is an old blogpost on the topic: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization
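To make the cost concrete, here is a minimal sketch (standard library only, no msgpack dependency) contrasting pickling a list of Python string objects with a naive text-aware layout that stores one contiguous UTF-8 buffer plus per-item lengths. The column contents and the `decode` helper are hypothetical, for illustration only:

```python
import pickle

# A hypothetical object-dtype column holding only text.
values = ["city-%d" % (i % 1000) for i in range(100_000)]

# Current approach: pickle the Python objects one by one.
pickled = pickle.dumps(values)

# Naive text-aware approach: encode each string to UTF-8 and keep
# the lengths, so the single buffer can be split apart on the way back.
encoded = [v.encode("utf-8") for v in values]
lengths = [len(b) for b in encoded]
payload = b"".join(encoded)

def decode(payload, lengths):
    out, offset = [], 0
    for n in lengths:
        out.append(payload[offset:offset + n].decode("utf-8"))
        offset += n
    return out

# Round-trip check: the naive layout is lossless for all-text columns.
assert decode(payload, lengths) == values
```

The point is not that this exact layout is optimal, but that once a column is known to be all text, serialization reduces to encoding bytes rather than walking Python objects.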
(benchmark image from the blogpost omitted)
Alternatives
There are naive solutions like msgpack (already in pandas) or encoding the text directly.
There are more sophisticated solutions as well that would provide efficient packing, including with repeated elements.
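One such scheme is dictionary encoding, which handles repeated elements by storing each distinct string once plus an integer code per row (the same idea pandas already uses for `Categorical`). A sketch with hypothetical helper names, not a pandas API:

```python
def dict_encode(values):
    """Store each distinct value once; rows become small integer codes."""
    table, codes = {}, []
    for v in values:
        code = table.setdefault(v, len(table))
        codes.append(code)
    categories = sorted(table, key=table.get)  # in order of first appearance
    return categories, codes

def dict_decode(categories, codes):
    return [categories[c] for c in codes]

values = ["red", "blue", "red", "red", "green"]
categories, codes = dict_encode(values)
assert categories == ["red", "blue", "green"]
assert codes == [0, 1, 0, 0, 2]
assert dict_decode(categories, codes) == values
```

For columns with heavy repetition this pays twice: the payload shrinks, and each distinct string is encoded only once.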
But sometimes objects are objects
One concern here is that sometimes the Python objects in an object dtype column aren't text. I propose that pandas either check all values up front each time, or ask forgiveness: attempt the fast text path and fall back to pickle when an exception is raised.
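The ask-forgiveness variant could be sketched as follows. `serialize_object_column` is a hypothetical dispatcher, not an existing pandas function; it relies on non-string objects lacking an `.encode` method:

```python
import pickle

def serialize_object_column(values):
    """Try a text-only fast path; fall back to pickle on failure."""
    try:
        # Raises AttributeError as soon as a non-string value is hit.
        encoded = [v.encode("utf-8") for v in values]
        lengths = [len(b) for b in encoded]
        return ("utf8", lengths, b"".join(encoded))
    except AttributeError:
        # Mixed or non-text objects: keep the general pickle path.
        return ("pickle", None, pickle.dumps(values))

tag, _, _ = serialize_object_column(["a", "b"])
assert tag == "utf8"
tag, _, _ = serialize_object_column(["a", 3.14])
assert tag == "pickle"
```

The tag in the returned tuple tells the deserializer which path was taken, so the fallback stays transparent to callers.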
Anyway, I would find this valuable. It would help to reduce bottlenecks in dask.dataframe in some situations.