-
Notifications
You must be signed in to change notification settings - Fork 6
issue 578: include Axis and Group objects to Session.save and Session.load methods (HDF) #612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
issue 578: include Axis and Group objects to Session.save and Session.load methods (HDF) #612
Conversation
@gdementen I have implemented something (not definitive) for the HDF format. Could you please take a look (commit by commit) and tell me if I'm going in the right direction before to go further? |
larray/core/session.py
Outdated
List of arrays to add written as 'name'=array, ... | ||
Path to the file containing the session to load or | ||
list/tuple/dictionary containing couples (name, object). | ||
kwargs : dict of str, object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{str: object} dict
or something similar would be less ambiguous. Or use typing hints syntax (which I do not know)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas says it's dict of {str: object}
.
larray/core/session.py
Outdated
Path to the file containing the session to load or | ||
list/tuple/dictionary containing couples (name, object). | ||
kwargs : dict of str, object | ||
List of objects to add written as name=object, ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/List of o/O/ ?
larray/core/session.py
Outdated
@@ -191,7 +198,8 @@ def load(self, fname, names=None, engine='auto', display=False, **kwargs): | |||
This can be either the path to a single file, a path to a directory containing .csv files or a pattern | |||
representing several .csv files. | |||
names : list of str, optional | |||
List of arrays to load. If `fname` is None, list of paths to CSV files. | |||
List of objects to load. Objects must be of kind LArray or Axis or Group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Describing here the kind of objects to load seems odd. If you want to keep that precision, change the sentence to: List of objects (LArray, Axis or Group) to load.
?
larray/core/session.py
Outdated
Notes | ||
----- | ||
For CSV and Excel files, all Axis and Group objects are saved in the same CSV file/Sheet | ||
named __axis__(.csv) and __group__(.csv) respectively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- the wording is a bit odd. Maybe separate CSV and Excel files.
- I think it would make more sense to change "axis" to "axes" and "group" to "groups"
@@ -983,7 +983,7 @@ def apply(self, func, *args, **kwargs): | |||
|
|||
def summary(self, template=None): | |||
""" | |||
Returns a summary of the content of the session. | |||
Returns a summary of the content of the session (arrays only). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change is unfortunate. I think having all objects would be better. A first step would be to simply call str() on non-LArray objects and in the future we could imagine passing a dict of templates {type: template}, so that display of other types would be customizable too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope to solve issue #608 in the same release version as this PR. But, in the meantime, I prefer to specify that summary
only shows arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
larray/core/group.py
Outdated
@@ -1275,6 +1275,39 @@ def containing(self, substring): | |||
substring = substring.eval() | |||
return LGroup([v for v in self.eval() if substring in v], axis=self.axis) | |||
|
|||
def to_hdf(self, filepath, key, axis_key, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
key and axis_key should be optional. Default to self.name and self.axis.name if present.
larray/core/axis.py
Outdated
>>> a.to_hdf('test.h5', 'axes/a') # doctest: +SKIP | ||
""" | ||
key = _translate_key_hdf(key) | ||
s = pd.Series(data=self.labels, name=self.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we go via LArray instead of via Series?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the added value of using LArray since LArray use pandas internally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- less reliance on pandas, if we ever want to provide our own serializers
- different output format. This is both a good thing (arrays and axes use the same format) and a bad thing (the format is more verbose).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Considering the time to develop our own serializers divided by the maximum number of users we can expect, I'm not sure it will ever worth it. pandas makes some sanity checks that we don't need to implement ourselves.
- While I'm developing the code to import/export Axis and Group objects, I keep in mind that some day we will also include
metadata
to sessions (i.e. objects of basic types like string, int, float, boolean, list and dictionary). So, I'm not sure it is a good idea to go via LArray.
Why not using pytables
directly instead of pandas for non-LArray objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- In general, it is a good idea to rather use the higher level interface instead of the lower level one to do the same thing, because it simplifies maintenance. In some cases, the performance difference justifies calling the lower level API directly. Here I consider pandas to be the lower level API. That being said, in this case the different format is more important than the ease of maintenance so going via pandas is probably the better option.
- you are right but I don't see how it relates to my previous comment.
Why not using pytables directly instead of pandas for non-LArray objects?
We could but pandas is so prevalent, that we need to be able to read (and possibly write) the pandas dialect of hdf, which has a very specific and a bit odd way to store objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I will continue to rely on pandas. For metadata, there is a solution: http://pandas-docs.github.io/pandas-docs-travis/cookbook.html#hdfstore (see also https://stackoverflow.com/questions/29129095/save-additional-attributes-in-pandas-dataframe/29130146#29130146)
larray/inout/array.py
Outdated
@@ -492,8 +498,20 @@ def read_hdf(filepath_or_buffer, key, fill_value=np.nan, na=np.nan, sort_rows=Fa | |||
|
|||
key = _translate_key_hdf(key) | |||
df = pd.read_hdf(filepath_or_buffer, key, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename df to pd_obj? because AFAICT, it can also be a Series
larray/inout/array.py
Outdated
@@ -480,6 +481,11 @@ def read_hdf(filepath_or_buffer, key, fill_value=np.nan, na=np.nan, sort_rows=Fa | |||
sort_columns : bool, optional | |||
Whether or not to sort the columns alphabetically (sorting is more efficient than not sorting). | |||
Defaults to False. | |||
type : { 'LArray' | 'Axis' | 'Group' }, optional | |||
Type of object to be read. Defaults to 'LArray'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is nice to have this option, but this shouldn't be required for HDF. We should store the type in the HDF as extra metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
larray/core/group.py
Outdated
>>> a.to_hdf('test.h5', 'groups/a', 'a') # doctest: +SKIP | ||
""" | ||
key = _translate_key_hdf(key) | ||
s = pd.Series(data=self.eval(), name='{}@{}'.format(self.name, axis_key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we should test that the axis is in the HDF file, but that is probably going to cause an ordering problem (there is currently no guarantee that the group will be saved after the axis), so testing that the correct axis exists in the session seems like a good proxy. It would still be possible to save a group without its axis, but that would be much harder.
@gdementen you can skip the commit 3d439a6 (updated read_hdf() --> create a new HDFStore only if arg filepath_or_buffer is not already an HDFStore) during the review. |
larray/inout/array.py
Outdated
raise ValueError("Attribute 'axis_key' is missing for group {}".format(key)) | ||
axis = read_hdf(filepath_or_buffer, attrs['axis_key']) | ||
res = LGroup(key=pd_obj.values, name=name, axis=axis) | ||
if isinstance(filepath_or_buffer, pd.HDFStore): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skip this commit during review
larray/core/group.py
Outdated
>>> a.to_hdf('test.h5', 'groups/a', 'a') # doctest: +SKIP | ||
>>> a.to_hdf('test.h5', 'groups/a01') # doctest: +SKIP | ||
|
||
The associated axis is saved at the time with the group if not present in the HDF file yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand this sentence. I guess you hesitated between "with the group" and "at the same time than the group" and ended up with a mix of both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bad interactive rebase...
larray/inout/array.py
Outdated
name = None | ||
res = Axis(labels=pd_obj.values, name=name) | ||
if 'wildcard' not in dir(attrs): | ||
raise ValueError("Attribute 'wildcard' is missing for axis {}".format(key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a good reason for this exception? Since this case shouldn't happen in normal circumstances, unless the "raw"/original exception of accessing a missing attribute is awfully not understandable, I would let it pass.
larray/inout/array.py
Outdated
_type = attrs.type if 'type' in dir(attrs) else 'LArray' | ||
if _type == 'LArray': | ||
res = df_aslarray(pd_obj, sort_rows=sort_rows, sort_columns=sort_columns, fill_value=fill_value, parse_header=False) | ||
elif _type == 'Axis': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It kinda bothers me that this code is in the inout/array module. Not urgent to fix this but still... Here are a few potential solutions:
- defer to some function in a inout/axis module when type == 'Axis'
- rename inout/array to inout/core
- have inout/api with read_hdf which defers to functions in inout/array and inout/axis modules depending on type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's start with option 2 (rename inout/array to inout/core). It will be easy to move from option 2 to option 1 or 3 later if we changed our mind.
larray/inout/array.py
Outdated
if 'wildcard' not in dir(attrs): | ||
raise ValueError("Attribute 'wildcard' is missing for axis {}".format(key)) | ||
res._iswildcard = attrs['wildcard'] | ||
elif _type == 'Group': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
larray/inout/array.py
Outdated
if name == 'None': | ||
name = None | ||
if 'axis_key' not in dir(attrs): | ||
raise ValueError("Attribute 'axis_key' is missing for group {}".format(key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
larray/core/array.py
Outdated
self.to_frame().to_hdf(filepath, key, *args, **kwargs) | ||
with LHDFStore(filepath) as store: | ||
store.put(key, self.to_frame()) | ||
store.get_storer(key).attrs.type = 'LArray' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to call this 'Array' instead if we decide to implement #611 later.
@@ -732,3 +734,23 @@ def common_type(arrays): | |||
return np.dtype(('U' if need_unicode else 'S', max_size)) | |||
else: | |||
return object | |||
|
|||
|
|||
class LHDFStore(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need our own HDFStore instead of using the one from pandas directly? To avoid closing the store if we were given an open store?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because HDFStore.__init__
only accept string as path argument. Both pandas.pytables.to_hdf and pandas.pytables.read_hdf functions contains an if-else to test if argument path_or_buf
is already an HDFStore instance or not.
I chose to implement my own context manager ( which is actually based on the beginning of the function pandas.pytables.read_hdf )
@@ -68,7 +68,7 @@ | |||
from larray.core.axis import Axis, AxisReference, AxisCollection, X, _make_axis | |||
from larray.util.misc import (table2str, size2str, basestring, izip, rproduct, ReprString, duplicates, | |||
float_error_handler_factory, _isnoneslice, light_product, unique_list, common_type, | |||
renamed_to, deprecate_kwarg) | |||
renamed_to, deprecate_kwarg, LHDFStore) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should name it something more user-friendly and expose it to users. This can/should(?) come later though. We need to have something consistent for excel files, HDF files etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened a new issue for this (see #614). In my mind, LHDFStore was never meant to be public.
I'm not sure that our users will ever use the HDF format to do something else than save and load Session objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I will continue the conversation on the issue.
larray/core/array.py
Outdated
self.to_frame().to_hdf(filepath, key, *args, **kwargs) | ||
with LHDFStore(filepath) as store: | ||
store.put(key, self.to_frame()) | ||
store.get_storer(key).attrs.type = 'Array' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I think of it, this seems a bit odd. Shouldn't the LHDFStore do this (add the type attribute) for us automatically? We define our own class, let's take advantage of that :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an alternative is to make it a single method to store data with metadata, something like:
with LHDFStore(filepath) as store:
store.put(key, self.to_frame(), attrs={'type': 'Array'})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it somewhat bothers me to have to convert to a pandas object first. I know the LHDFStore is internal for now, but when we make it part of the API, this will need to be changed. We should be able to write something like:
arr = ndtest((2, 3))
with HDFStore(filepath) as store:
store.put('arr', arr)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record: see #614
Parameters | ||
---------- | ||
filepath : str | ||
Path where the hdf file has to be written. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is too obscure for users, but I guess it also supports (L)HDFStore here. It is probably premature to mention it now, but let us keep this in mind when we make LHDFStore part of the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #614
larray/core/axis.py
Outdated
filepath : str | ||
Path where the hdf file has to be written. | ||
key : str or Group, optional | ||
Key of the axis within the HDF file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is key clear enough for users? As the argument name it's fine, but some extra explanation in the description might help. s/Key/Key (path)/ ?
larray/core/axis.py
Outdated
Examples | ||
-------- | ||
>>> a = Axis("a=a0..a2") | ||
>>> a.to_hdf('test.h5', 'a') # doctest: +SKIP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the first example should be without any key specified.
larray/inout/string.py
Outdated
|
||
|
||
@deprecate_kwarg('nb_index', 'nb_axes', arg_converter=lambda x: x + 1) | ||
def from_lists(data, nb_axes=None, index_col=None, fill_value=np.nan, sort_rows=False, sort_columns=False, wide=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does not really belong in a string module. Maybe rename the module to misc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
larray/inout/utils.py
Outdated
_allowed_types = (LArray, Axis, Group) | ||
|
||
|
||
class FileHandler(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really an utility class (which should be usable in a lot of different contexts). This is a base class, so move this to a module named "base", "core", "abstractbases" or something like that.
larray/tests/test_axis.py
Outdated
age359 = age[['3', '5', '9']] | ||
age468 = age['4,6,8'] >> 'even' | ||
assert 5 in age | ||
assert not ('5' in age) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'5' not in age
is slightly more readable IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used unittest2pytest
tool. It works pretty well but sometimes lead to some ugly output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
larray/tests/test_axis.py
Outdated
assert 5 in age | ||
assert not ('5' in age) | ||
assert age2 in age | ||
assert not (age2bis in age) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above: age2bis not in age
larray/tests/test_axis.py
Outdated
assert not ('5' in age) | ||
assert age2 in age | ||
assert not (age2bis in age) | ||
assert not (age2ter in age) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idem... (and for all subsequent lines)
@gdementen If you don't mind, I think I prefer to make a different PR for each file extension (one for HDF, one for CSV, one for EXCEL and one for pickle)? |
@alixdamman sure! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Great work!
ddc4def
to
df3cb18
Compare
- updated unittests in test_session.py, test_axis.py, test_axiscollection.py and test_group.py --> changed syntax: nosetest --> pytest
…project#578) : - added to_hdf method to Axis and Group - updated read_hdf (inout/hdf.py) - updated documentation of Session's methods - updated doctests of Session.load and Session.save - added context manager LHDFStore (utils/misc.py) refactored package inout: created one module per file extension or external object type like in pandas/io/: new modules: - common.py - pandas.py - csv.py - excel.py - hdf.py - sas.py - misc.py - pickle.py renamed modules: - excel.py --> xw_excel.py deleted modules: - array.py
df3cb18
to
baa1cb9
Compare
It is normal if Travis crashes. For this issue, I would like to try the TDD approach.