issue 578: include Axis and Group objects to Session.save and Session.load methods (HDF) #612


Conversation

alixdamman (Collaborator)

It is expected that Travis crashes for now. For this issue, I would like to try the TDD approach.

@alixdamman alixdamman requested a review from gdementen March 26, 2018 14:30
alixdamman (Collaborator Author)

@gdementen I have implemented something (not definitive) for the HDF format. Could you please take a look (commit by commit) and tell me if I'm going in the right direction before I go further?

List of arrays to add written as 'name'=array, ...
Path to the file containing the session to load or
list/tuple/dictionary containing couples (name, object).
kwargs : dict of str, object
gdementen (Contributor)

{str: object} dict or something similar would be less ambiguous. Or use typing hints syntax (which I do not know)?

alixdamman (Collaborator Author)

pandas says it's dict of {str: object}.
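For reference, the typing-hints alternative mentioned above could look like this (a hypothetical signature for illustration only, not part of the PR):

    from typing import List, Optional

    def load(fname: str, names: Optional[List[str]] = None, engine: str = 'auto',
             display: bool = False, **kwargs: object) -> None:
        ...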

Path to the file containing the session to load or
list/tuple/dictionary containing couples (name, object).
kwargs : dict of str, object
List of objects to add written as name=object, ...
gdementen (Contributor)

s/List of o/O/ ?

@@ -191,7 +198,8 @@ def load(self, fname, names=None, engine='auto', display=False, **kwargs):
This can be either the path to a single file, a path to a directory containing .csv files or a pattern
representing several .csv files.
names : list of str, optional
List of arrays to load. If `fname` is None, list of paths to CSV files.
List of objects to load. Objects must be of kind LArray or Axis or Group.
gdementen (Contributor)

Describing here the kind of objects to load seems odd. If you want to keep that precision, change the sentence to: List of objects (LArray, Axis or Group) to load. ?

Notes
-----
For CSV and Excel files, all Axis and Group objects are saved in the same CSV file/Sheet
named __axis__(.csv) and __group__(.csv) respectively.
gdementen (Contributor)

  • the wording is a bit odd. Maybe separate CSV and Excel files.
  • I think it would make more sense to change "axis" to "axes" and "group" to "groups"

@@ -983,7 +983,7 @@ def apply(self, func, *args, **kwargs):

def summary(self, template=None):
"""
Returns a summary of the content of the session.
Returns a summary of the content of the session (arrays only).
gdementen (Contributor)

I think this change is unfortunate. I think having all objects would be better. A first step would be to simply call str() on non-LArray objects and in the future we could imagine passing a dict of templates {type: template}, so that display of other types would be customizable too.
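A minimal sketch of that idea (the templates argument and its handling are hypothetical, not code from this PR):

    from larray import LArray

    def summary(session, templates=None):
        # default template for arrays; any other object type falls back to str()
        if templates is None:
            templates = {LArray: '{name}: {axes_names}'}
        lines = []
        for name, obj in session.items():
            template = templates.get(type(obj))
            if template is not None:
                axes_names = ', '.join(str(n) for n in obj.axes.names)
                lines.append(template.format(name=name, axes_names=axes_names))
            else:
                lines.append('{}: {}'.format(name, obj))
        return '\n'.join(lines)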

alixdamman (Collaborator Author)

I hope to solve issue #608 in the same release version as this PR. But, in the meantime, I prefer to specify that summary only shows arrays.

gdementen (Contributor)

ok

@@ -1275,6 +1275,39 @@ def containing(self, substring):
substring = substring.eval()
return LGroup([v for v in self.eval() if substring in v], axis=self.axis)

def to_hdf(self, filepath, key, axis_key, **kwargs):
gdementen (Contributor)

key and axis_key should be optional. Default to self.name and self.axis.name if present.
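A sketch of what those defaults might look like (hypothetical code, with the rest of the method elided):

    def to_hdf(self, filepath, key=None, axis_key=None, **kwargs):
        # default the HDF key to the group's own name, and the axis key to
        # the name of the associated axis, when those names are available
        if key is None:
            if self.name is None:
                raise ValueError("a key must be given for anonymous groups")
            key = self.name
        if axis_key is None:
            if self.axis.name is None:
                raise ValueError("an axis_key must be given for anonymous axes")
            axis_key = self.axis.name
        ...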

>>> a.to_hdf('test.h5', 'axes/a') # doctest: +SKIP
"""
key = _translate_key_hdf(key)
s = pd.Series(data=self.labels, name=self.name)
gdementen (Contributor)

Can't we go via LArray instead of via Series?

alixdamman (Collaborator Author)

What would be the added value of using LArray, since LArray uses pandas internally?

gdementen (Contributor)

  1. less reliance on pandas, if we ever want to provide our own serializers
  2. different output format. This is both a good thing (arrays and axes use the same format) and a bad thing (the format is more verbose).

alixdamman (Collaborator Author)

  1. Considering the time to develop our own serializers divided by the maximum number of users we can expect, I'm not sure it will ever be worth it. pandas performs some sanity checks that we would otherwise have to implement ourselves.
  2. While developing the code to import/export Axis and Group objects, I keep in mind that some day we will also include metadata in sessions (i.e. objects of basic types like string, int, float, boolean, list and dictionary). So I'm not sure it is a good idea to go via LArray.

Why not use pytables directly instead of pandas for non-LArray objects?

gdementen (Contributor), Mar 30, 2018

  1. In general, it is a good idea to use the higher-level interface rather than the lower-level one to do the same thing, because it simplifies maintenance. In some cases, the performance difference justifies calling the lower-level API directly. Here I consider pandas to be the lower-level API. That being said, in this case the different format is more important than ease of maintenance, so going via pandas is probably the better option.
  2. You are right, but I don't see how it relates to my previous comment.

Why not use pytables directly instead of pandas for non-LArray objects?

We could, but pandas is so prevalent that we need to be able to read (and possibly write) the pandas dialect of HDF, which has a very specific and somewhat odd way of storing objects.


@@ -492,8 +498,20 @@ def read_hdf(filepath_or_buffer, key, fill_value=np.nan, na=np.nan, sort_rows=Fa

key = _translate_key_hdf(key)
df = pd.read_hdf(filepath_or_buffer, key, **kwargs)
gdementen (Contributor)

rename df to pd_obj? because AFAICT, it can also be a Series

@@ -480,6 +481,11 @@ def read_hdf(filepath_or_buffer, key, fill_value=np.nan, na=np.nan, sort_rows=Fa
sort_columns : bool, optional
Whether or not to sort the columns alphabetically (sorting is more efficient than not sorting).
Defaults to False.
type : { 'LArray' | 'Axis' | 'Group' }, optional
Type of object to be read. Defaults to 'LArray'.
gdementen (Contributor)

It is nice to have this option, but this shouldn't be required for HDF. We should store the type in the HDF as extra metadata.
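For illustration, storing and recovering such a type attribute with a plain pandas HDFStore could look like this (a sketch of the idea; the PR itself does it through LHDFStore, shown further down):

    import pandas as pd

    # writing: tag the stored object with its larray type
    with pd.HDFStore('test.h5') as store:
        store.put('a', pd.Series(['a0', 'a1', 'a2']))
        store.get_storer('a').attrs.type = 'Axis'

    # reading: recover the type without requiring a type= argument
    with pd.HDFStore('test.h5') as store:
        attrs = store.get_storer('a').attrs
        _type = getattr(attrs, 'type', 'LArray')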


>>> a.to_hdf('test.h5', 'groups/a', 'a') # doctest: +SKIP
"""
key = _translate_key_hdf(key)
s = pd.Series(data=self.eval(), name='{}@{}'.format(self.name, axis_key))
gdementen (Contributor), Mar 28, 2018

Ideally we should test that the axis is in the HDF file, but that is probably going to cause an ordering problem (there is currently no guarantee that the group will be saved after the axis), so testing that the correct axis exists in the session seems like a good proxy. It would still be possible to save a group without its axis, but that would be much harder.
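One way to guarantee the associated axis ends up in the file (a hypothetical sketch, illustrative only, assuming to_hdf accepts an open store):

    def _save_group(store, key, group, axis_key):
        # write the associated axis first if the file does not already
        # contain it, then write the group itself
        if axis_key not in store:
            group.axis.to_hdf(store, axis_key)
        group.to_hdf(store, key, axis_key)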

@alixdamman alixdamman requested a review from gdementen April 4, 2018 09:22
alixdamman (Collaborator Author)

@gdementen you can skip commit 3d439a6 (updated read_hdf() --> create a new HDFStore only if the filepath_or_buffer argument is not already an HDFStore) during the review.

raise ValueError("Attribute 'axis_key' is missing for group {}".format(key))
axis = read_hdf(filepath_or_buffer, attrs['axis_key'])
res = LGroup(key=pd_obj.values, name=name, axis=axis)
if isinstance(filepath_or_buffer, pd.HDFStore):
alixdamman (Collaborator Author)

skip this commit during review

>>> a.to_hdf('test.h5', 'groups/a', 'a') # doctest: +SKIP
>>> a.to_hdf('test.h5', 'groups/a01') # doctest: +SKIP

The associated axis is saved at the time with the group if not present in the HDF file yet
gdementen (Contributor)

I don't really understand this sentence. I guess you hesitated between "with the group" and "at the same time as the group" and ended up with a mix of both?

alixdamman (Collaborator Author)

Bad interactive rebase...

name = None
res = Axis(labels=pd_obj.values, name=name)
if 'wildcard' not in dir(attrs):
raise ValueError("Attribute 'wildcard' is missing for axis {}".format(key))
gdementen (Contributor)

Is there a good reason for this exception? Since this case shouldn't happen in normal circumstances, I would let it pass, unless the "raw"/original exception from accessing a missing attribute is awfully hard to understand.
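In other words, the suggested alternative (an illustrative snippet, not the PR's code) is simply:

    # accessing a missing pytables attribute already raises an
    # AttributeError on its own, so no explicit check is needed
    res._iswildcard = attrs.wildcard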

_type = attrs.type if 'type' in dir(attrs) else 'LArray'
if _type == 'LArray':
res = df_aslarray(pd_obj, sort_rows=sort_rows, sort_columns=sort_columns, fill_value=fill_value, parse_header=False)
elif _type == 'Axis':
gdementen (Contributor)

It kind of bothers me that this code is in the inout/array module. Not urgent to fix, but still... Here are a few potential solutions:

  • defer to some function in an inout/axis module when type == 'Axis'
  • rename inout/array to inout/core
  • have inout/api with read_hdf which defers to functions in the inout/array and inout/axis modules depending on type (see the sketch after this list).
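A rough sketch of what the third option could look like (a hypothetical inout/api.py; module and function names are purely illustrative, not actual larray code):

    # hypothetical dispatcher: pick the reader according to the requested type
    from larray.inout import array as inout_array
    from larray.inout import axis as inout_axis

    def read_hdf(filepath, key, type='LArray', **kwargs):
        readers = {'LArray': inout_array.read_hdf,
                   'Axis': inout_axis.read_axis_hdf,
                   'Group': inout_axis.read_group_hdf}
        return readers[type](filepath, key, **kwargs)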

alixdamman (Collaborator Author)

Let's start with option 2 (rename inout/array to inout/core). It will be easy to move from option 2 to option 1 or 3 later if we change our minds.

if 'wildcard' not in dir(attrs):
raise ValueError("Attribute 'wildcard' is missing for axis {}".format(key))
res._iswildcard = attrs['wildcard']
elif _type == 'Group':
gdementen (Contributor)

same as above

if name == 'None':
name = None
if 'axis_key' not in dir(attrs):
raise ValueError("Attribute 'axis_key' is missing for group {}".format(key))
gdementen (Contributor)

same as above

self.to_frame().to_hdf(filepath, key, *args, **kwargs)
with LHDFStore(filepath) as store:
store.put(key, self.to_frame())
store.get_storer(key).attrs.type = 'LArray'
gdementen (Contributor)

We might want to call this 'Array' instead if we decide to implement #611 later.

@@ -732,3 +734,23 @@ def common_type(arrays):
return np.dtype(('U' if need_unicode else 'S', max_size))
else:
return object


class LHDFStore(object):
gdementen (Contributor)

Why do we need our own HDFStore instead of using the one from pandas directly? To avoid closing the store if we were given an open store?

alixdamman (Collaborator Author)

Because HDFStore.__init__ only accepts a string as the path argument. Both the pandas.pytables.to_hdf and pandas.pytables.read_hdf functions contain an if-else to test whether the path_or_buf argument is already an HDFStore instance.

I chose to implement my own context manager (which is actually based on the beginning of the pandas.pytables.read_hdf function).
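Based on that description, a minimal sketch of such a context manager (assuming its only job is to open a store from a path, and to avoid closing a store it did not open; this is an illustration, not the PR's exact code):

    import pandas as pd

    class LHDFStore(object):
        # context manager around pd.HDFStore which also accepts an already
        # open store, in which case the store is not closed on exit
        def __init__(self, filepath_or_buffer, **kwargs):
            if isinstance(filepath_or_buffer, pd.HDFStore):
                if not filepath_or_buffer.is_open:
                    raise IOError('The HDFStore must be open for reading.')
                self.store = filepath_or_buffer
                self.close_store = False
            else:
                self.store = pd.HDFStore(filepath_or_buffer, **kwargs)
                self.close_store = True

        def __enter__(self):
            return self.store

        def __exit__(self, exc_type, exc_value, traceback):
            if self.close_store:
                self.store.close()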

@alixdamman alixdamman requested a review from gdementen April 5, 2018 15:09
@@ -68,7 +68,7 @@
from larray.core.axis import Axis, AxisReference, AxisCollection, X, _make_axis
from larray.util.misc import (table2str, size2str, basestring, izip, rproduct, ReprString, duplicates,
float_error_handler_factory, _isnoneslice, light_product, unique_list, common_type,
renamed_to, deprecate_kwarg)
renamed_to, deprecate_kwarg, LHDFStore)
gdementen (Contributor)

I think we should name it something more user-friendly and expose it to users. This can/should(?) come later though. We need to have something consistent for excel files, HDF files etc.

alixdamman (Collaborator Author), Apr 6, 2018

I opened a new issue for this (see #614). In my mind, LHDFStore was never meant to be public.
I'm not sure our users will ever use the HDF format for anything other than saving and loading Session objects.

gdementen (Contributor)

Good point. I will continue the conversation on the issue.

self.to_frame().to_hdf(filepath, key, *args, **kwargs)
with LHDFStore(filepath) as store:
store.put(key, self.to_frame())
store.get_storer(key).attrs.type = 'Array'
gdementen (Contributor)

When I think of it, this seems a bit odd. Shouldn't the LHDFStore do this (add the type attribute) for us automatically? We define our own class, let's take advantage of that :)

gdementen (Contributor)

An alternative is to make it a single method that stores data together with metadata, something like:

        with LHDFStore(filepath) as store:
            store.put(key, self.to_frame(), attrs={'type': 'Array'})
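A sketch of how that could be implemented on LHDFStore (hypothetical; it assumes LHDFStore proxies the underlying pandas store as self.store):

    def put(self, key, value, attrs=None, **kwargs):
        # delegate to the wrapped pandas store, then attach the extra
        # metadata on the storer in the same call
        self.store.put(key, value, **kwargs)
        if attrs is not None:
            storer_attrs = self.store.get_storer(key).attrs
            for name, val in attrs.items():
                setattr(storer_attrs, name, val)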

gdementen (Contributor)

Also, it somewhat bothers me to have to convert to a pandas object first. I know LHDFStore is internal for now, but when we make it part of the API, this will need to change. We should be able to write something like:

arr = ndtest((2, 3))
with HDFStore(filepath) as store:
    store.put('arr', arr)

alixdamman (Collaborator Author)

For the record: see #614

Parameters
----------
filepath : str
Path where the hdf file has to be written.
gdementen (Contributor)

Maybe this is too obscure for users, but I guess it also supports an (L)HDFStore here. It is probably premature to mention it now, but let us keep this in mind when we make LHDFStore part of the API.

alixdamman (Collaborator Author)

See #614

filepath : str
Path where the hdf file has to be written.
key : str or Group, optional
Key of the axis within the HDF file.
gdementen (Contributor)

Is key clear enough for users? As the argument name it's fine, but some extra explanation in the description might help. s/Key/Key (path)/ ?

Examples
--------
>>> a = Axis("a=a0..a2")
>>> a.to_hdf('test.h5', 'a') # doctest: +SKIP
gdementen (Contributor)

IMO the first example should be without any key specified.



@deprecate_kwarg('nb_index', 'nb_axes', arg_converter=lambda x: x + 1)
def from_lists(data, nb_axes=None, index_col=None, fill_value=np.nan, sort_rows=False, sort_columns=False, wide=True):
gdementen (Contributor)

This does not really belong in a string module. Maybe rename the module to misc?

alixdamman (Collaborator Author)

Sure

_allowed_types = (LArray, Axis, Group)


class FileHandler(object):
gdementen (Contributor)

Not really a utility class (a utility class should be usable in a lot of different contexts). This is a base class, so move it to a module named "base", "core", "abstractbases" or something like that.

age359 = age[['3', '5', '9']]
age468 = age['4,6,8'] >> 'even'
assert 5 in age
assert not ('5' in age)
gdementen (Contributor)

'5' not in age is slightly more readable IMO

alixdamman (Collaborator Author)

I used the unittest2pytest tool. It works pretty well but sometimes leads to some ugly output.

gdementen (Contributor)

ok

assert 5 in age
assert not ('5' in age)
assert age2 in age
assert not (age2bis in age)
gdementen (Contributor)

same as above: age2bis not in age

assert not ('5' in age)
assert age2 in age
assert not (age2bis in age)
assert not (age2ter in age)
gdementen (Contributor)

idem... (and for all subsequent lines)

alixdamman (Collaborator Author)

@gdementen If you don't mind, I would prefer to make a separate PR for each file extension (one for HDF, one for CSV, one for Excel and one for pickle).

gdementen (Contributor)

@alixdamman sure!

@alixdamman alixdamman requested a review from gdementen April 6, 2018 15:04
gdementen (Contributor) left a review comment

LGTM. Great work!

@alixdamman alixdamman force-pushed the save_load_session_include_axis_group_objs_578 branch from ddc4def to df3cb18 on April 9, 2018 07:58
- updated unittests in test_session.py, test_axis.py, test_axiscollection.py and test_group.py --> changed syntax: nosetest --> pytest
…project#578):

- added to_hdf method to Axis and Group
- updated read_hdf (inout/hdf.py)
- updated documentation of Session's methods
- updated doctests of Session.load and Session.save
- added context manager LHDFStore (utils/misc.py)

refactored package inout: created one module per file extension or external object type like in pandas/io/:

new modules:
  - common.py
  - pandas.py
  - csv.py
  - excel.py
  - hdf.py
  - sas.py
  - misc.py
  - pickle.py

renamed modules:
  - excel.py --> xw_excel.py

deleted modules:
  - array.py
@alixdamman alixdamman force-pushed the save_load_session_include_axis_group_objs_578 branch from df3cb18 to baa1cb9 on April 9, 2018 08:02
@alixdamman alixdamman merged commit eb1c935 into larray-project:master Apr 9, 2018
@alixdamman alixdamman changed the title issue 578: include Axis and Group objects to Session.save and Session.load methods issue 578: include Axis and Group objects to Session.save and Session.load methods (HDF) Apr 9, 2018