Create efficient binary storage format alternative to pickle #686
As a first order approximation, what do you think about:

import pickle, bz2

def pickle_compress(obj, path):
    # write a bz2-compressed pickle using the highest protocol
    pickle.dump(obj, bz2.BZ2File(path, 'w'), pickle.HIGHEST_PROTOCOL)

def unpickle_compress(path):
    # read back a bz2-compressed pickle
    return pickle.load(bz2.BZ2File(path, 'r')) |
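A quick round-trip check of the helpers above (the file name is just an example):

import pandas as pd

df = pd.DataFrame({'a': range(5)})
pickle_compress(df, 'frame.pkl.bz2')                    # write compressed pickle
assert unpickle_compress('frame.pkl.bz2').equals(df)    # read it back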
Unfortunately, this doesn't solve the "pickle problem" (i.e. that classes can't move to different modules). |
@wesm have you considered making HDF5 the preferred backend? You would get compression baked in. Plus, the random access support would make on-disk dataframe operations possible. |
I have, but the main problem is the deserialization speed for lots of small objects (this turns out to be a fairly important use case to a lot of users). I think using msgpack along with snappy or blosc (which uses the fastlz algorithm) for fast in-memory compression might be a good way to go. I should point out that using HDF5 (or at least PyTables) adds quite a bit of overhead for loading small objects. |
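A minimal sketch of that msgpack-plus-blosc idea, assuming the third-party msgpack and blosc packages are installed; the dict layout here is illustrative, not pandas' actual format:

import blosc
import msgpack
import pandas as pd

def dumps_series(s):
    # pack index, values and name with msgpack, then compress with blosc
    payload = {'index': list(s.index), 'values': s.tolist(), 'name': s.name}
    return blosc.compress(msgpack.packb(payload, use_bin_type=True), typesize=8)

def loads_series(buf):
    payload = msgpack.unpackb(blosc.decompress(buf), raw=False)
    return pd.Series(payload['values'], index=payload['index'], name=payload['name'])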
When you talk about loading many small objects, what do you mean exactly? I've definitely run into a lot of performance cul-de-sacs when it comes to PyTables, so I'm not sure which one you're referring to :/ I've resorted to keeping a DatetimeIndex in memory and translating the boolean arrays to lists of slices, since PyTables dies on long lists of indexes. Really quick, but it bypasses most PyTables features like in-kernel searches. |
As a benchmark, try loading 1000 Series objects, each containing 100 values and a random string index with labels of 10 or fewer characters. So we're talking roughly 8K of data per Series. |
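For reference, a rough sketch of that benchmark setup (sizes as described above; names are illustrative):

import random
import string
import numpy as np
import pandas as pd

def random_label(max_len=10):
    # random lowercase string of 1 to max_len characters
    n = random.randint(1, max_len)
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(n))

# 1000 Series of 100 values each, with random string labels (~8K of data apiece)
series_list = [
    pd.Series(np.random.randn(100), index=[random_label() for _ in range(100)])
    for _ in range(1000)
]
# time how long a candidate format takes to round-trip all 1000 objects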
related #3151 |
The metadata discussion in #2485 reached an impasse about having metadata. The pickle code in pandas is not my favorite bit; it would be nice to tackle it. |
I think in 0.12 we could add a generic |
How do you mean propagate? You have a way to do #2485 in a clean way across operations? Or |
pickle/hdf is easy; I would do it like name propagates in Series. Essentially it's an addition to the constructor (and maybe move name and make it a property of meta). I think it can propagate across most common operations. Of course, I'm not exactly sure what to do in a case like this:
and maybe a warning for clobbering? could also |
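One hypothetical way to experiment with constructor-level metadata is pandas' subclassing hooks (_metadata / __finalize__); this is only a sketch, not the proposal from #2485:

import pandas as pd

class MetaSeries(pd.Series):
    _metadata = ['meta']          # attributes pandas tries to carry across ops

    @property
    def _constructor(self):
        return MetaSeries

s = MetaSeries([1, 2, 3], name='x')
s.meta = {'units': 'kg'}
s2 = s * 2
# whether meta survives depends on the pandas version's __finalize__ coverage
print(getattr(s2, 'meta', None))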
#2485 (possibly the longest discussion ever?) convinced me there be dragons on the propagation side. Adding metadata to store/load only is indeed fairly straightforward, and useful for archiving. That's |
Re: msgpack, maybe Google protobuf or fb/Apache Avro should be considered? Perhaps dataframes in the browser may be a future consideration; msgpack is a |
What exactly is the goal here? To provide essentially a pickle replacement, or just to support saving/loading a large number of smaller objects? It's clear (to me at least!) that if you have large amounts of data, HDF5 is the way to go, so we are talking about small, fast storage of data (in binary)? Essentially a
Why is compression a requirement in this case in any event, or even a binary format? What is wrong with JSON / BSON? I am somewhat agnostic on msgpack / protobuf; avro looks good too. We need a use case to figure out what format suits. |
Yeah, for a ton of small objects using pickle is the best way. I would be willing to explore avro, especially since it's compatible with HDFS (S, not 5) |
http://www.slideshare.net/IgorAnishchenko/pb-vs-thrift-vs-avro +1 on avro - it's more Python-like (and no compiling of schemas) |
good read, http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html |
Please bake a serialization format version into this. |
Agreed re: versioning |
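A hedged sketch of what baking a version into the payload could look like: wrap the serialized body in a small envelope that records the format version, so readers can dispatch or fail loudly on newer files. The field names and version constant are illustrative (assuming msgpack), not pandas' actual format:

import msgpack

FORMAT_VERSION = 1

def write_payload(body_bytes, path):
    # record the format version alongside the serialized body
    envelope = {'version': FORMAT_VERSION, 'body': body_bytes}
    with open(path, 'wb') as f:
        f.write(msgpack.packb(envelope, use_bin_type=True))

def read_payload(path):
    with open(path, 'rb') as f:
        envelope = msgpack.unpackb(f.read(), raw=False)
    if envelope['version'] > FORMAT_VERSION:
        raise ValueError('file written by a newer format version: %d' % envelope['version'])
    return envelope['body']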
closed by #3525 |
Ideally it should support compression! Possibly using blosc or some other method.
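For illustration, a minimal blosc round-trip on a raw numpy buffer (assumes the blosc package is installed):

import blosc
import numpy as np

arr = np.random.randn(1000000)
compressed = blosc.compress(arr.tobytes(), typesize=arr.dtype.itemsize)
restored = np.frombuffer(blosc.decompress(compressed), dtype=arr.dtype)
assert np.array_equal(arr, restored)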