issue 578: include Axis and Group objects to Session.save and Session.load methods (HDF) #612


Conversation

alixdamman (Collaborator)

It is expected that Travis crashes for now. For this issue, I would like to try the TDD approach.

@alixdamman alixdamman requested a review from gdementen March 26, 2018 14:30
alixdamman (Collaborator Author)

@gdementen I have implemented something (not definitive) for the HDF format. Could you please take a look (commit by commit) and tell me if I'm going in the right direction before I go further?

List of arrays to add written as 'name'=array, ...
Path to the file containing the session to load or
list/tuple/dictionary containing couples (name, object).
kwargs : dict of str, object
gdementen (Contributor)

{str: object} dict or something similar would be less ambiguous. Or use typing hints syntax (which I do not know)?

alixdamman (Collaborator Author)

pandas says it's dict of {str: object}.
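For reference, the typing-hints alternative mentioned above could look like this (a hypothetical signature for illustration only, not part of the PR):

    from typing import List, Optional

    def load(fname: str, names: Optional[List[str]] = None, engine: str = 'auto',
             display: bool = False, **kwargs: object) -> None:
        ...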

Path to the file containing the session to load or
list/tuple/dictionary containing couples (name, object).
kwargs : dict of str, object
List of objects to add written as name=object, ...
gdementen (Contributor)

s/List of o/O/ ?

@@ -191,7 +198,8 @@ def load(self, fname, names=None, engine='auto', display=False, **kwargs):
This can be either the path to a single file, a path to a directory containing .csv files or a pattern
representing several .csv files.
names : list of str, optional
List of arrays to load. If `fname` is None, list of paths to CSV files.
List of objects to load. Objects must be of kind LArray or Axis or Group.
gdementen (Contributor)

Describing here the kind of objects to load seems odd. If you want to keep that precision, change the sentence to: List of objects (LArray, Axis or Group) to load. ?

Notes
-----
For CSV and Excel files, all Axis and Group objects are saved in the same CSV file/Sheet
named __axis__(.csv) and __group__(.csv) respectively.
gdementen (Contributor)

  • the wording is a bit odd. Maybe separate CSV and Excel files.
  • I think it would make more sense to change "axis" to "axes" and "group" to "groups"

@@ -983,7 +983,7 @@ def apply(self, func, *args, **kwargs):

def summary(self, template=None):
"""
Returns a summary of the content of the session.
Returns a summary of the content of the session (arrays only).
gdementen (Contributor)

I think this change is unfortunate. I think having all objects would be better. A first step would be to simply call str() on non-LArray objects and in the future we could imagine passing a dict of templates {type: template}, so that display of other types would be customizable too.
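A minimal sketch of that idea (the templates argument and its handling are hypothetical, not code from this PR):

    from larray import LArray

    def summary(session, templates=None):
        # default template for arrays; any other object type falls back to str()
        if templates is None:
            templates = {LArray: '{name}: {axes_names}'}
        lines = []
        for name, obj in session.items():
            template = templates.get(type(obj))
            if template is not None:
                axes_names = ', '.join(str(n) for n in obj.axes.names)
                lines.append(template.format(name=name, axes_names=axes_names))
            else:
                lines.append('{}: {}'.format(name, obj))
        return '\n'.join(lines)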

alixdamman (Collaborator Author)

I hope to solve issue #608 in the same release version as this PR. But, in the meantime, I prefer to specify that summary only shows arrays.

gdementen (Contributor)

ok

@@ -1275,6 +1275,39 @@ def containing(self, substring):
substring = substring.eval()
return LGroup([v for v in self.eval() if substring in v], axis=self.axis)

def to_hdf(self, filepath, key, axis_key, **kwargs):
gdementen (Contributor)

key and axis_key should be optional. Default to self.name and self.axis.name if present.
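A sketch of what those defaults might look like (hypothetical code, with the rest of the method elided):

    def to_hdf(self, filepath, key=None, axis_key=None, **kwargs):
        # default the HDF key to the group's own name, and the axis key to
        # the name of the associated axis, when those names are available
        if key is None:
            if self.name is None:
                raise ValueError("a key must be given for anonymous groups")
            key = self.name
        if axis_key is None:
            if self.axis.name is None:
                raise ValueError("an axis_key must be given for anonymous axes")
            axis_key = self.axis.name
        ...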

>>> a.to_hdf('test.h5', 'axes/a') # doctest: +SKIP
"""
key = _translate_key_hdf(key)
s = pd.Series(data=self.labels, name=self.name)
gdementen (Contributor)

Can't we go via LArray instead of via Series?

alixdamman (Collaborator Author)

What would be the added value of using LArray, since LArray uses pandas internally?

gdementen (Contributor)

  1. less reliance on pandas, if we ever want to provide our own serializers
  2. different output format. This is both a good thing (arrays and axes use the same format) and a bad thing (the format is more verbose).

alixdamman (Collaborator Author)

  1. Considering the time to develop our own serializers divided by the maximum number of users we can expect, I'm not sure it will ever be worth it. pandas performs some sanity checks that we would otherwise have to implement ourselves.
  2. While developing the code to import/export Axis and Group objects, I keep in mind that some day we will also include metadata in sessions (i.e. objects of basic types like string, int, float, boolean, list and dictionary). So I'm not sure it is a good idea to go via LArray.

Why not use pytables directly instead of pandas for non-LArray objects?

gdementen (Contributor), Mar 30, 2018

  1. In general, it is a good idea to use the higher-level interface rather than the lower-level one to do the same thing, because it simplifies maintenance. In some cases, the performance difference justifies calling the lower-level API directly. Here I consider pandas to be the lower-level API. That being said, in this case the different format is more important than ease of maintenance, so going via pandas is probably the better option.
  2. You are right, but I don't see how it relates to my previous comment.

Why not use pytables directly instead of pandas for non-LArray objects?

We could, but pandas is so prevalent that we need to be able to read (and possibly write) the pandas dialect of HDF, which has a very specific and somewhat odd way of storing objects.


@@ -492,8 +498,20 @@ def read_hdf(filepath_or_buffer, key, fill_value=np.nan, na=np.nan, sort_rows=Fa

key = _translate_key_hdf(key)
df = pd.read_hdf(filepath_or_buffer, key, **kwargs)
gdementen (Contributor)

rename df to pd_obj? because AFAICT, it can also be a Series

@@ -480,6 +481,11 @@ def read_hdf(filepath_or_buffer, key, fill_value=np.nan, na=np.nan, sort_rows=Fa
sort_columns : bool, optional
Whether or not to sort the columns alphabetically (sorting is more efficient than not sorting).
Defaults to False.
type : { 'LArray' | 'Axis' | 'Group' }, optional
Type of object to be read. Defaults to 'LArray'.
gdementen (Contributor)

It is nice to have this option, but this shouldn't be required for HDF. We should store the type in the HDF as extra metadata.
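For illustration, storing and recovering such a type attribute with a plain pandas HDFStore could look like this (a sketch of the idea; the PR itself does it through LHDFStore, shown further down):

    import pandas as pd

    # writing: tag the stored object with its larray type
    with pd.HDFStore('test.h5') as store:
        store.put('a', pd.Series(['a0', 'a1', 'a2']))
        store.get_storer('a').attrs.type = 'Axis'

    # reading: recover the type without requiring a type= argument
    with pd.HDFStore('test.h5') as store:
        attrs = store.get_storer('a').attrs
        _type = getattr(attrs, 'type', 'LArray')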


>>> a.to_hdf('test.h5', 'groups/a', 'a') # doctest: +SKIP
"""
key = _translate_key_hdf(key)
s = pd.Series(data=self.eval(), name='{}@{}'.format(self.name, axis_key))
gdementen (Contributor), Mar 28, 2018

Ideally we should test that the axis is in the HDF file, but that is probably going to cause an ordering problem (there is currently no guarantee that the group will be saved after the axis), so testing that the correct axis exists in the session seems like a good proxy. It would still be possible to save a group without its axis, but that would be much harder.
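One way to guarantee the associated axis ends up in the file (a hypothetical sketch, illustrative only, assuming to_hdf accepts an open store):

    def _save_group(store, key, group, axis_key):
        # write the associated axis first if the file does not already
        # contain it, then write the group itself
        if axis_key not in store:
            group.axis.to_hdf(store, axis_key)
        group.to_hdf(store, key, axis_key)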

@alixdamman alixdamman requested a review from gdementen April 4, 2018 09:22
alixdamman (Collaborator Author)

@gdementen you can skip commit 3d439a6 (updated read_hdf() --> create a new HDFStore only if the filepath_or_buffer argument is not already an HDFStore) during the review.

raise ValueError("Attribute 'axis_key' is missing for group {}".format(key))
axis = read_hdf(filepath_or_buffer, attrs['axis_key'])
res = LGroup(key=pd_obj.values, name=name, axis=axis)
if isinstance(filepath_or_buffer, pd.HDFStore):
alixdamman (Collaborator Author)

skip this commit during review

>>> a.to_hdf('test.h5', 'groups/a', 'a') # doctest: +SKIP
>>> a.to_hdf('test.h5', 'groups/a01') # doctest: +SKIP

The associated axis is saved at the time with the group if not present in the HDF file yet
gdementen (Contributor)

I don't really understand this sentence. I guess you hesitated between "with the group" and "at the same time as the group" and ended up with a mix of both?

alixdamman (Collaborator Author)

Bad interactive rebase...

name = None
res = Axis(labels=pd_obj.values, name=name)
if 'wildcard' not in dir(attrs):
raise ValueError("Attribute 'wildcard' is missing for axis {}".format(key))
gdementen (Contributor)

Is there a good reason for this exception? Since this case shouldn't happen in normal circumstances, I would let it pass, unless the "raw"/original exception from accessing a missing attribute is awfully hard to understand.
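In other words, the suggested alternative (an illustrative snippet, not the PR's code) is simply:

    # accessing a missing pytables attribute already raises an
    # AttributeError on its own, so no explicit check is needed
    res._iswildcard = attrs.wildcard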

_type = attrs.type if 'type' in dir(attrs) else 'LArray'
if _type == 'LArray':
res = df_aslarray(pd_obj, sort_rows=sort_rows, sort_columns=sort_columns, fill_value=fill_value, parse_header=False)
elif _type == 'Axis':
gdementen (Contributor)

It kind of bothers me that this code is in the inout/array module. Not urgent to fix, but still... Here are a few potential solutions:

  • defer to some function in an inout/axis module when type == 'Axis'
  • rename inout/array to inout/core
  • have inout/api with read_hdf which defers to functions in the inout/array and inout/axis modules depending on type (see the sketch after this list).
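A rough sketch of what the third option could look like (a hypothetical inout/api.py; module and function names are purely illustrative, not actual larray code):

    # hypothetical dispatcher: pick the reader according to the requested type
    from larray.inout import array as inout_array
    from larray.inout import axis as inout_axis

    def read_hdf(filepath, key, type='LArray', **kwargs):
        readers = {'LArray': inout_array.read_hdf,
                   'Axis': inout_axis.read_axis_hdf,
                   'Group': inout_axis.read_group_hdf}
        return readers[type](filepath, key, **kwargs)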

alixdamman (Collaborator Author)

Let's start with option 2 (rename inout/array to inout/core). It will be easy to move from option 2 to option 1 or 3 later if we change our minds.

if 'wildcard' not in dir(attrs):
raise ValueError("Attribute 'wildcard' is missing for axis {}".format(key))
res._iswildcard = attrs['wildcard']
elif _type == 'Group':
gdementen (Contributor)

same as above

if name == 'None':
name = None
if 'axis_key' not in dir(attrs):
raise ValueError("Attribute 'axis_key' is missing for group {}".format(key))
gdementen (Contributor)

same as above

self.to_frame().to_hdf(filepath, key, *args, **kwargs)
with LHDFStore(filepath) as store:
store.put(key, self.to_frame())
store.get_storer(key).attrs.type = 'LArray'
gdementen (Contributor)

We might want to call this 'Array' instead if we decide to implement #611 later.

@@ -732,3 +734,23 @@ def common_type(arrays):
return np.dtype(('U' if need_unicode else 'S', max_size))
else:
return object


class LHDFStore(object):
gdementen (Contributor)

Why do we need our own HDFStore instead of using the one from pandas directly? To avoid closing the store if we were given an open store?

alixdamman (Collaborator Author)

Because HDFStore.__init__ only accepts a string as the path argument. Both the pandas.pytables.to_hdf and pandas.pytables.read_hdf functions contain an if-else to test whether the path_or_buf argument is already an HDFStore instance.

I chose to implement my own context manager (which is actually based on the beginning of the pandas.pytables.read_hdf function).
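Based on that description, a minimal sketch of such a context manager (assuming its only job is to open a store from a path, and to avoid closing a store it did not open; this is an illustration, not the PR's exact code):

    import pandas as pd

    class LHDFStore(object):
        # context manager around pd.HDFStore which also accepts an already
        # open store, in which case the store is not closed on exit
        def __init__(self, filepath_or_buffer, **kwargs):
            if isinstance(filepath_or_buffer, pd.HDFStore):
                if not filepath_or_buffer.is_open:
                    raise IOError('The HDFStore must be open for reading.')
                self.store = filepath_or_buffer
                self.close_store = False
            else:
                self.store = pd.HDFStore(filepath_or_buffer, **kwargs)
                self.close_store = True

        def __enter__(self):
            return self.store

        def __exit__(self, exc_type, exc_value, traceback):
            if self.close_store:
                self.store.close()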

@alixdamman alixdamman requested a review from gdementen April 5, 2018 15:09
@@ -68,7 +68,7 @@
from larray.core.axis import Axis, AxisReference, AxisCollection, X, _make_axis
from larray.util.misc import (table2str, size2str, basestring, izip, rproduct, ReprString, duplicates,
float_error_handler_factory, _isnoneslice, light_product, unique_list, common_type,
renamed_to, deprecate_kwarg)
renamed_to, deprecate_kwarg, LHDFStore)
gdementen (Contributor)

I think we should name it something more user-friendly and expose it to users. This can/should(?) come later though. We need to have something consistent for excel files, HDF files etc.

alixdamman (Collaborator Author), Apr 6, 2018

I opened a new issue for this (see #614). In my mind, LHDFStore was never meant to be public.
I'm not sure our users will ever use the HDF format for anything other than saving and loading Session objects.

gdementen (Contributor)

Good point. I will continue the conversation on the issue.

self.to_frame().to_hdf(filepath, key, *args, **kwargs)
with LHDFStore(filepath) as store:
store.put(key, self.to_frame())
store.get_storer(key).attrs.type = 'Array'
gdementen (Contributor)

When I think of it, this seems a bit odd. Shouldn't the LHDFStore do this (add the type attribute) for us automatically? We define our own class, let's take advantage of that :)

gdementen (Contributor)

An alternative is to make it a single method that stores data together with metadata, something like:

        with LHDFStore(filepath) as store:
            store.put(key, self.to_frame(), attrs={'type': 'Array'})
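A sketch of how that could be implemented on LHDFStore (hypothetical; it assumes LHDFStore proxies the underlying pandas store as self.store):

    def put(self, key, value, attrs=None, **kwargs):
        # delegate to the wrapped pandas store, then attach the extra
        # metadata on the storer in the same call
        self.store.put(key, value, **kwargs)
        if attrs is not None:
            storer_attrs = self.store.get_storer(key).attrs
            for name, val in attrs.items():
                setattr(storer_attrs, name, val)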

gdementen (Contributor)

Also, it somewhat bothers me to have to convert to a pandas object first. I know LHDFStore is internal for now, but when we make it part of the API, this will need to change. We should be able to write something like:

arr = ndtest((2, 3))
with HDFStore(filepath) as store:
    store.put('arr', arr)

alixdamman (Collaborator Author)

For the record: see #614

Parameters
----------
filepath : str
Path where the hdf file has to be written.
gdementen (Contributor)

Maybe this is too obscure for users, but I guess it also supports an (L)HDFStore here. It is probably premature to mention it now, but let us keep this in mind when we make LHDFStore part of the API.

alixdamman (Collaborator Author)

See #614

filepath : str
Path where the hdf file has to be written.
key : str or Group, optional
Key of the axis within the HDF file.
gdementen (Contributor)

Is key clear enough for users? As the argument name it's fine, but some extra explanation in the description might help. s/Key/Key (path)/ ?

Examples
--------
>>> a = Axis("a=a0..a2")
>>> a.to_hdf('test.h5', 'a') # doctest: +SKIP
gdementen (Contributor)

IMO the first example should be without any key specified.



@deprecate_kwarg('nb_index', 'nb_axes', arg_converter=lambda x: x + 1)
def from_lists(data, nb_axes=None, index_col=None, fill_value=np.nan, sort_rows=False, sort_columns=False, wide=True):
gdementen (Contributor)

This does not really belong in a string module. Maybe rename the module to misc?

alixdamman (Collaborator Author)

Sure

_allowed_types = (LArray, Axis, Group)


class FileHandler(object):
gdementen (Contributor)

Not really a utility class (a utility class should be usable in a lot of different contexts). This is a base class, so move it to a module named "base", "core", "abstractbases" or something like that.

age359 = age[['3', '5', '9']]
age468 = age['4,6,8'] >> 'even'
assert 5 in age
assert not ('5' in age)
gdementen (Contributor)

'5' not in age is slightly more readable IMO

alixdamman (Collaborator Author)

I used the unittest2pytest tool. It works pretty well but sometimes leads to some ugly output.

gdementen (Contributor)

ok

assert 5 in age
assert not ('5' in age)
assert age2 in age
assert not (age2bis in age)
gdementen (Contributor)

same as above: age2bis not in age

assert not ('5' in age)
assert age2 in age
assert not (age2bis in age)
assert not (age2ter in age)
gdementen (Contributor)

idem... (and for all subsequent lines)

alixdamman (Collaborator Author)

@gdementen If you don't mind, I would prefer to make a separate PR for each file extension (one for HDF, one for CSV, one for Excel and one for pickle).

gdementen (Contributor)

@alixdamman sure!

@alixdamman alixdamman requested a review from gdementen April 6, 2018 15:04
gdementen (Contributor) left a review comment

LGTM. Great work!

@alixdamman alixdamman force-pushed the save_load_session_include_axis_group_objs_578 branch from ddc4def to df3cb18 on April 9, 2018 07:58
- updated unittests in test_session.py, test_axis.py, test_axiscollection.py and test_group.py --> changed syntax: nosetest --> pytest
…project#578):

- added to_hdf method to Axis and Group
- updated read_hdf (inout/hdf.py)
- updated documentation of Session's methods
- updated doctests of Session.load and Session.save
- added context manager LHDFStore (utils/misc.py)

refactored package inout: created one module per file extension or external object type like in pandas/io/:

new modules:
  - common.py
  - pandas.py
  - csv.py
  - excel.py
  - hdf.py
  - sas.py
  - misc.py
  - pickle.py

renamed modules:
  - excel.py --> xw_excel.py

deleted modules:
  - array.py
@alixdamman alixdamman force-pushed the save_load_session_include_axis_group_objs_578 branch from df3cb18 to baa1cb9 on April 9, 2018 08:02
@alixdamman alixdamman merged commit eb1c935 into larray-project:master Apr 9, 2018
@alixdamman alixdamman changed the title issue 578: include Axis and Group objects to Session.save and Session.load methods issue 578: include Axis and Group objects to Session.save and Session.load methods (HDF) Apr 9, 2018