
Alternative serialization for text data #14462

Closed
@mrocklin

Description


Pandas DataFrames containing text columns are expensive to serialize. This affects dask.dataframe performance in multiprocessing or distributed settings.

Pickle is expensive

In particular, the current solution of using pickle.dumps for object-dtype columns can be needlessly expensive when all of the values in the column are text. In that case, fairly naive solutions like msgpack can be much faster.
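To make the gap concrete, here is a minimal benchmark sketch (assuming the msgpack Python package is installed; exact numbers vary with the data and machine):

```python
import pickle
import timeit

import msgpack  # assumed installed; the package behind pandas' msgpack IO

# A column's worth of short text values stored as Python objects.
text = ["value-%d" % i for i in range(1000000)]

t_pickle = timeit.timeit(
    lambda: pickle.dumps(text, protocol=pickle.HIGHEST_PROTOCOL), number=3
)
t_msgpack = timeit.timeit(lambda: msgpack.packb(text), number=3)

print("pickle.dumps :", t_pickle)
print("msgpack.packb:", t_msgpack)
```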

Here is an old blogpost on the topic: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization

And an image: [benchmark figure from the blog post, not reproduced here]

Alternatives

There are naive solutions like msgpack (already in pandas) or encoding the text directly.
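For "encoding the text directly", here is a hypothetical sketch: concatenate the UTF-8 bytes of every value and keep an offsets array, so a million strings serialize as two contiguous buffers instead of a million pickled objects (the helper names are illustrative, not an existing pandas API):

```python
import numpy as np

def encode_text(values):
    # One contiguous UTF-8 buffer plus the element boundaries.
    encoded = [v.encode("utf-8") for v in values]
    offsets = np.cumsum([0] + [len(b) for b in encoded])
    return b"".join(encoded), offsets

def decode_text(buf, offsets):
    return [buf[start:stop].decode("utf-8")
            for start, stop in zip(offsets[:-1], offsets[1:])]

payload, offsets = encode_text(["foo", "bar", "quux"])
assert decode_text(payload, offsets) == ["foo", "bar", "quux"]
```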

There are also more sophisticated solutions that would pack the text efficiently, including handling repeated elements.
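One such scheme, sketched here as plain dictionary encoding (the same idea as pandas Categorical; again illustrative only): store each distinct string once, plus a small integer code per row, so repeated values cost a few bytes each.

```python
import numpy as np

def dict_encode(values):
    # Map each distinct value to a small integer code.
    table = {}
    codes = np.empty(len(values), dtype="int32")
    for i, v in enumerate(values):
        codes[i] = table.setdefault(v, len(table))
    categories = sorted(table, key=table.get)  # distinct values, in code order
    # `categories` can then go through a cheap text encoder,
    # and `codes` through a raw byte copy.
    return categories, codes

categories, codes = dict_encode(["a", "b", "a", "a", "b"])
assert categories == ["a", "b"]
assert codes.tolist() == [0, 1, 0, 0, 1]
```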

But sometimes objects are objects

One concern here is that sometimes the Python objects aren't text. I propose that pandas either check the values each time, or ask forgiveness by catching an exception and falling back to pickle.
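Either option could look roughly like this (hypothetical helper names, not a real pandas function; `encode_text` is the sketch from above):

```python
import pickle

def serialize_lbyl(values):
    # Check each time: scan once, then pick the fast path.
    if all(isinstance(v, str) for v in values):
        return "text", encode_text(values)
    return "pickle", pickle.dumps(values)

def serialize_eafp(values):
    # Ask forgiveness: try the fast path, fall back on any non-string.
    try:
        return "text", encode_text(values)  # non-str lacks .encode -> AttributeError
    except (AttributeError, TypeError):
        return "pickle", pickle.dumps(values)
```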

Anyway, I would find this valuable. It would help to reduce bottlenecks in dask.dataframe in some situations.


Labels

API Design, Closing Candidate (may be closeable, needs more eyeballs), Dtype Conversions (unexpected or buggy dtype conversions), Enhancement, IO Data (IO issues that don't fit into a more specific label), Performance (memory or execution speed performance)
