Description
Pandas DataFrames containing text columns are expensive to serialize. This affects dask.dataframe performance in multiprocessing or distributed settings.
Pickle is expensive
In particular, the current solution of using pickle.dumps for object dtype columns can be needlessly expensive when all of the values in the column are text. In that case fairly naive solutions, like msgpack, can be much faster.
Here is an old blogpost on the topic: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization
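To make the comparison concrete, here is a minimal stdlib-only sketch contrasting pickle.dumps against directly encoding an all-text column as length-prefixed UTF-8 bytes. The `encode_text`/`decode_text` helpers are hypothetical illustrations, not pandas API; relative timings will vary by Python version and data.

```python
import pickle
import time

# A "column" of short text values, standing in for an object-dtype pandas column.
column = ["record-%d" % (i % 1000) for i in range(100_000)]

def encode_text(values):
    """Naively pack a list of strings as length-prefixed UTF-8 bytes."""
    parts = []
    for v in values:
        b = v.encode("utf-8")
        parts.append(len(b).to_bytes(4, "little"))
        parts.append(b)
    return b"".join(parts)

def decode_text(data):
    """Invert encode_text: read a 4-byte length, then that many bytes."""
    out, i = [], 0
    while i < len(data):
        n = int.from_bytes(data[i:i + 4], "little")
        i += 4
        out.append(data[i:i + n].decode("utf-8"))
        i += n
    return out

t0 = time.perf_counter()
pickled = pickle.dumps(column)
t1 = time.perf_counter()
packed = encode_text(column)
t2 = time.perf_counter()

assert decode_text(packed) == column  # round-trip check
print("pickle: %.4fs, %d bytes" % (t1 - t0, len(pickled)))
print("naive:  %.4fs, %d bytes" % (t2 - t1, len(packed)))
```

A real implementation would do the packing in C or with a library like msgpack rather than a Python loop, but the byte layout is the same idea.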
Alternatives
There are naive solutions like msgpack (already in pandas) or encoding the text directly.
There are also more sophisticated solutions that would pack the data efficiently, including columns with many repeated elements.
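One such sophisticated approach for repeated elements is dictionary encoding, the same idea behind pandas Categoricals: store each unique string once plus an array of integer codes. The `dict_encode`/`dict_decode` names below are hypothetical sketch functions, not an existing API.

```python
def dict_encode(values):
    """Dictionary-encode a list of strings: unique values plus integer codes."""
    table = {}   # string -> code, in first-seen order
    codes = []
    for v in values:
        codes.append(table.setdefault(v, len(table)))
    uniques = list(table)  # dict preserves insertion order, so index == code
    return uniques, codes

def dict_decode(uniques, codes):
    """Invert dict_encode."""
    return [uniques[c] for c in codes]

# A column with heavy repetition: only 3 distinct strings in 9 values.
column = ["red", "green", "red", "blue", "red", "green", "red", "red", "blue"]
uniques, codes = dict_encode(column)
assert dict_decode(uniques, codes) == column
print(uniques)  # each distinct string stored once
print(codes)    # small integers, cheap to serialize as a fixed-width array
```

When repetition is high, serializing the uniques once and the codes as a compact integer array can beat re-encoding every string.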
But sometimes objects are objects
One concern here is that sometimes the Python objects aren't text. I propose that Pandas either checks each time whether every value in the column is text, or asks forgiveness by attempting the fast path and falling back to pickle on an exception.
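The check-then-fall-back idea can be sketched as follows. This is a hypothetical `serialize_column` helper, not pandas API; it uses the explicit "check each time" variant (an isinstance scan), with pickle as the fallback for mixed-object columns. An ask-forgiveness variant would instead attempt the text encoding and catch the resulting exception.

```python
import pickle

def serialize_column(values):
    """Use a fast text path when every value is a str; otherwise pickle.

    Returns a one-byte tag identifying the codec, plus the payload.
    """
    if all(isinstance(v, str) for v in values):  # the "check each time" step
        # Hypothetical fast text path: length-prefixed UTF-8 bytes.
        parts = []
        for v in values:
            b = v.encode("utf-8")
            parts.append(len(b).to_bytes(4, "little") + b)
        return b"T", b"".join(parts)
    # Objects are objects: fall back to the general (slower) solution.
    return b"P", pickle.dumps(values)

tag, payload = serialize_column(["a", "b", "c"])
print(tag)  # text path taken
tag, payload = serialize_column(["a", 3, None])
print(tag)  # pickle fallback taken
```

The isinstance scan adds an O(n) pass, but it is cheap relative to serialization itself and short-circuits on the first non-string value.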
Anyway, I would find this valuable. It would help to reduce bottlenecks in dask.dataframe in some situations.