Tutorial #647

Merged
merged 3 commits into from
Aug 24, 2018
Conversation

alixdamman
Collaborator

I would like to update and reorganize the IO section of the tutorial.
But first of all, I would like to agree on the table of contents.

I assume that issue #155 will be done before this PR is merged

Contributor

@gdementen left a comment

looks good so far. Please keep the example arrays/axes as small as possible (e.g. 2x3 or something like that)


##### Basic Options

##### Specifying the Number of Axes
Contributor

will need #648


##### Specifying Position in Sheet (open_excel only)

#### Dumping an Array to an Excel Sheet
Contributor

unsure where to place this but we need to show dumping and reading with header or without headers, including reconstructing an array with axes labels in arbitrary positions

## Load and Dump Sessions


### Loading a Session from an Excel Sheet
Contributor

s/Sheet/file|Workbook/?


#### Specifying Objects To Be Loaded

### Dumping a Session to an Excel Sheet
Contributor

idem



- markdown: |
## Load and Dump Sessions
Contributor

what about load and dump sessions from .hdf?

@alixdamman
Collaborator Author

reminder: add a note about labels like 102E3 in Excel which are converted to integers by xlwings.
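
A label such as 102E3 is also valid scientific notation, which is why a tool reading the cell can treat it as a number. The snippet below is only a pure-Python illustration of that numeric reading; it does not call xlwings or Excel and says nothing about how either one decides on the cell type.

```python
# '102E3' can be parsed as scientific notation: 102 * 10**3
label = "102E3"

as_number = float(label)     # numeric interpretation of the label text
as_integer = int(as_number)  # the value once it is converted to an integer

print(as_number)   # 102000.0
print(as_integer)  # 102000

# the original label text cannot be recovered from the integer
print(str(as_integer) == label)  # False
```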

@gdementen
Contributor

I always prefer fixing issues to documenting them.

@alixdamman alixdamman requested a review from gdementen June 20, 2018 16:22
id: 1

- markdown: |
A new file is created if it does not exist yet.
If the file already exists, a new sheet is added at the end the existing ones.
Contributor

if that sheet does not already exist

- code: |
arr = ndtest((3, 3))

# 1. reset the file Excel file 'arrays.xlsx' (the sheet 'array_2D' is removed)
Contributor

s/the file Excel file 'arrays.xlsx'/'arrays.xlsx'/ (or at least remove one "file")

arr = ndtest((3, 3))

# 1. reset the file Excel file 'arrays.xlsx' (the sheet 'array_2D' is removed)
# 2. save the array arr in the sheet 'array_3D'
Contributor

use another name than array_3D as this is confusing since the array is 2D

- markdown: |
##### Specifying Position in Sheet

By default, array are dumped starting at cell 'A1'. Using the argument ``position`` it is possible to change the top left cell of the dumped array. This can be useful when several arrays must be dumped in the same sheet:
Contributor

s/array/arrays/
s/Using the argument position it is/Using the position argument, it is/

- markdown: |
A new file is created if it does not exist yet.
If the file already exists, a new sheet is added at the end the existing ones.
To reset an Excel file, you simply set the flag `overwrite_file` as True:
Contributor

s/flag overwrite_file/overwrite_file argument

# dump the array 'arr2' in sheet 'arrays' starting at cell 'A5'
wb['arrays']['A5'] = arr2.dump()
# dump the array 'arr3' in sheet 'arrays' starting at cell 'A9'
wb['arrays']['A9'] = arr3.dump()
Contributor

also show that if you have many arrays to store on the same sheet, you can store the sheet in a variable and use that?

sheet = wb['arrays']
sheet['A5'] = arr2.dump()

# add a new sheet 'arrays' and dump the array 'arr' starting at cell 'A1'
wb['arrays'] = arr.dump()
# dump the array 'arr2' in sheet 'arrays' starting at cell 'A5'
wb['arrays']['A5'] = arr2.dump()
Contributor

please also show somewhere (here?) how to create a new empty/blank sheet (this question has come up several times in the past):
wb['new_sheet'] = ''

- markdown: |
##### Specifying Wide VS Narrow format

Like with the ``to_excel`` method, it possible to export the data in a ``narrow`` format by setting the argument ``wide`` to False:
Contributor

it is possible
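
For readers unfamiliar with the two layouts, here is a rough pure-Python picture of the difference. It is a sketch with toy values and a hypothetical helper (`wide_to_narrow`); it does not use larray, and the exact columns larray writes when `wide` is False are not shown here and may differ.

```python
def wide_to_narrow(years, rows):
    """Turn one row per city into one row per (city, year) pair."""
    narrow = []
    for city, *values in rows:
        for year, value in zip(years, values):
            narrow.append((city, year, value))
    return narrow


# wide layout: one row per city, one column per year (toy values)
years = [2010, 2011, 2012]
wide_rows = [("Brussels", 0, 1, 2),
             ("Paris", 3, 4, 5)]

# narrow layout: one row per individual value
narrow_rows = wide_to_narrow(years, wide_rows)
for row in narrow_rows:
    print(row)
```

The wide table has 2 rows of 3 values; the narrow one has 6 rows of 1 value each, with the year moved from the header into its own column.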

- markdown: |
##### Exporting only data

To export only data in Excel sheets, you can set argument ``header`` of the method ``dump`` to false or not calling ``dump``:
Contributor

s/argument header/header argument
s/method dump/dump method/
s/false/False
s/not calling/avoid calling/


with open_excel('new_excel_file.xlsx') as wb:
# export only data
wb['data_only'] = arr.dump(header=False)
Contributor

@gdementen, Jun 29, 2018

When users export only data, they are usually in one of two situations:

  1. they want to fill data in a "predefined/template" file.
  2. they want to export data and axes in a specific way. For example, two 2d tables on top of each other with a single "time" axis in columns but potentially blank lines between the two tables.

In both cases, they need to write at a precise position.

  1. The first situation is probably hard to demonstrate, but it is very frequent, so it should at least be mentioned. If you can come up with a simple example to demonstrate this, it would be nice. Honestly, I don't see how we could make this simple enough and still interesting, but I think our users would appreciate it if you managed that. Maybe first writing the array with headers, then saying this is the template to fill and updating only the values would be enough, or maybe using the specifically formatted sheet done for 2. and updating the values would be more telling.
  2. The second situation could be demonstrated, and if so, needs to include an example of how to output axis labels vertically instead of horizontally (sheet[idx].options(transpose=True).value = axis.labels)

Collaborator Author

Concerning the second situation, I'm thinking of an additional column 'description' next to a vertical axis column. What do you think?

Contributor

where would that info (the descriptions) come from?

@gdementen
Contributor

Sorry for the slow review. As usual, I started my review, then got sidetracked for long enough that I forgot about it and started something else... When I don't review something for a day or two, feel free to ping me.

@alixdamman alixdamman requested a review from gdementen July 18, 2018 15:24
@@ -46,60 +46,73 @@ cells:

- markdown: |
A new file is created if it does not exist yet.
If the file already exists, a new sheet is added at the end the existing ones.
To reset an Excel file, you simply set the flag `overwrite_file` as True:
If the file already exists, a new sheet is added at the end the existing ones if that sheet does not already exists.
Contributor

s/at the end the/after/

@@ -132,7 +145,8 @@ cells:
- markdown: |
##### Basic Usage

``open_excel`` must be used with the Python keyword ``with``. If the Excel file doesn't exist, the argument ``overwrite_file`` must set to True. The method ``save`` must be called to actually write data in the Excel file.
``open_excel`` should be used with the Python keyword ``with``, which ensures that the file is properly closed even if an error occurs.
If the Excel file doesn't exist, the argument ``overwrite_file`` must set to True. The method ``save`` must be called to actually write data in the Excel file.
Contributor

s/argument overwrite_file/overwrite_file argument/
s/must set/must be set/
s/method save/save method/
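
The revised sentence says that `with` closes the file even if an error occurs. A minimal sketch of that guarantee follows, using a hypothetical `FakeWorkbook` class that is not part of larray; it only mimics the context-manager protocol, so it illustrates the Python mechanism rather than what `open_excel` does internally.

```python
class FakeWorkbook:
    """Hypothetical stand-in for a workbook object (not part of larray)."""

    def __init__(self):
        self.saved = False
        self.closed = False

    def save(self):
        self.saved = True

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # runs whether the block finished normally or raised
        self.close()
        return False  # do not swallow the exception


wb = FakeWorkbook()
try:
    with wb:
        # an error occurs before save() is reached
        raise RuntimeError("something went wrong")
except RuntimeError:
    pass

print(wb.closed)  # True: the workbook was closed despite the error
print(wb.saved)   # False: save() was never called, so nothing was written
```

This also shows why the text insists on calling `save` explicitly: leaving the `with` block closes the object but does not, in this sketch, save anything.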

To export only data in Excel sheets, you can set argument ``header`` of the method ``dump`` to false or not calling ``dump``:
|city | continent | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 |
| ------- | ------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
|Brussel | Europe | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Contributor

Nice example!! I would have used fewer years though. Beyond 4 or 5, it doesn't add anything to the example and it makes it unnecessarily big.

# export only data
wb['data_only'] = arr.dump(header=False)
with open_excel('pop_projection.xlsx', overwrite_file=True) as wb:
# create new sheet 'pop_2015_2025'
Contributor

s/new/new empty/

with open_excel('pop_projection.xlsx', overwrite_file=True) as wb:
# create new sheet 'pop_2015_2025'
wb['pop_2015_2025'] = []
sh = wb['pop_2015_2025']
Contributor

Add a comment for this line too (like in the previous example)? It doesn't hurt to repeat some stuff :)

wb['pop_2015_2025'] = []
sh = wb['pop_2015_2025']
# export column names
sh['A1'] = ['city', 'continent'] + list(time.labels)
Contributor

maybe split this line in two to avoid having to convert to list?

sh['A1'] = ['city', 'continent']
sh['C1'] = time.labels

@alixdamman alixdamman requested a review from gdementen July 19, 2018 12:47
Contributor

@gdementen left a comment

this is progressing nicely...

- code: |
# first of all, import the LArray library
from larray import *

# then, let's define a function to get full path of any example file
def get_filepath(fname):
Contributor

maybe we should add a function like that to larray itself?

Collaborator Author

I don't really think that users will play with example files outside the tutorial.

Contributor

maybe not, but it would make the tutorial and at least one doctest a bit more readable

Collaborator Author

OK. It doesn't hurt anyway.

pop = ndtest((city, time))

with open_excel('pop_projection.xlsx', overwrite_file=True) as wb:
# create new sheet 'pop_2015_2025'
# create new empty sheet 'pop_2015_2025'
Contributor

2020

wb['pop_2015_2025'] = []
# store sheet 'pop_2015_2025' in a temporary variable sh
Contributor

2020

pop = ndtest((city, time))

with open_excel('pop_projection.xlsx', overwrite_file=True) as wb:
# create new sheet 'pop_2015_2025'
# create new empty sheet 'pop_2015_2025'
wb['pop_2015_2025'] = []
Contributor

2020

wb['pop_2015_2025'] = []
# store sheet 'pop_2015_2025' in a temporary variable sh
sh = wb['pop_2015_2025']
Contributor

2020

a0,b1,2,3
a1,b0,4,5
a1,b1,6,7
>>> # read the array stored in the CSV file as it
Contributor

"as is", not "as it" (it was correct before ;-))

a [4]: 'a0' 'a0' 'a1' 'a1'
{1} [3]: 'b' 'c0' 'c1'
dtype: object
memory used: 96 bytes
Contributor

adding a comment somewhere to say that this is not what we want/need/expected would help I think

a0 0 1 2
a1 3 4 5
a2 6 7 8
>>> # read the array stored in the sheet 'missing_axis_name' as it
Contributor

as is

@@ -80,11 +80,6 @@ def test_setitem(self):
assert wb.sheet_names() == ['sheet1', 'sheet2', 'sheet3']
assert wb['sheet2']['A1'].value == 'sheet1 content'

with open_excel(visible=False, app="new") as wb2:
Contributor

Please keep this test intact. It must raise. It should be fixed in master (commit c8aac0b). I guess you haven't rebased yet.

@alixdamman alixdamman requested a review from gdementen July 19, 2018 14:31
@@ -117,8 +117,18 @@ def read_excel(filepath, sheet=0, nb_axes=None, index_col=None, fill_value=np.na

Specify the number of axes of the output array (useful when the name of the last axis is implicit)

>>> # read the array stored in the sheet 'missing_axis_name' as it
>>> # The content of the sheet 'missing_axis_name' is:
Contributor

remove the >>> #

@alixdamman alixdamman requested a review from gdementen July 19, 2018 15:27
Contributor

@gdementen left a comment

Could you please ask for reviews a bit less often? This is getting a bit silly.


Notes
-----
A ValueError is raised if provided filename does not represent an existing example file.
Contributor

the provided



def test_get_example_filepath():
with pytest.raises(ValueError, message="Example file non_existing_example_file.xlsx does not exist. "
Contributor

This will be annoying. We will need to change the test each time we add or rename an example file. It would be better to use a pattern to only check the beginning of the error message.
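
One way to check only the beginning of the message is sketched below with the standard library. The `get_example_filepath` stub, the `AVAILABLE` list and the wording after "does not exist." are assumptions made for illustration; only the start of the message comes from the test under review. With pytest, the `match` argument of `pytest.raises` performs a comparable regular-expression search on the message.

```python
import re

# Hypothetical list of example files; the real list lives in the PR.
AVAILABLE = ["examples.xlsx", "examples.h5"]


def get_example_filepath(fname):
    # Hypothetical stub standing in for the function under review.
    if fname not in AVAILABLE:
        raise ValueError("Example file {} does not exist. Available example files are: {}"
                         .format(fname, AVAILABLE))
    return fname


def test_get_example_filepath():
    # Only the start of the message is checked, so adding or renaming
    # example files does not break the test.
    pattern = r"Example file non_existing_example_file\.xlsx does not exist\."
    try:
        get_example_filepath("non_existing_example_file.xlsx")
    except ValueError as exc:
        assert re.match(pattern, str(exc)) is not None
    else:
        raise AssertionError("ValueError was not raised")


test_get_example_filepath()
```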


Examples
--------
>>> fpath = get_example_filepath('examples.xlsx')
"""
fpath = os.path.abspath(os.path.join(EXAMPLE_FILES_DIR, fname))
if not os.path.isfile(fpath):
if not (os.path.isfile(fpath) or os.path.isdir(fpath)):
Contributor

use os.path.exists instead?
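
A quick check of the suggestion, as a sketch: for ordinary files and directories, `os.path.exists` gives the same answer as the `isfile(...) or isdir(...)` condition in the diff. The helper name `is_file_or_dir` and the temporary file names are made up for the demonstration; less common cases (sockets, devices and the like) are not covered here.

```python
import os
import tempfile


def is_file_or_dir(path):
    # the condition used in the diff
    return os.path.isfile(path) or os.path.isdir(path)


with tempfile.TemporaryDirectory() as tmpdir:
    existing_file = os.path.join(tmpdir, "examples.xlsx")
    open(existing_file, "w").close()  # create an empty file
    missing_file = os.path.join(tmpdir, "missing.xlsx")

    # compare both checks on a directory, a file and a missing path
    results = [(is_file_or_dir(path), os.path.exists(path))
               for path in (tmpdir, existing_file, missing_file)]

print(results)  # [(True, True), (True, True), (False, False)]
```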

2 2 3
3 4 5
city\\time 2010 2011 2012
Brussel 0 1 2
Contributor

s/Brussel/Brussels/


- markdown: |
## Load and Dump Sessions

One of the main advantage of grouping arrays in session objects is that you can load and save all them in one shot.
Contributor

  • advantages
  • all of them

Contributor

@gdementen left a comment

Unsure where, but I think we should mention the data source somewhere. Possibly using metadata.

| Paris | male | 10 | 11 | 12 |
| city | gender\time | 2010 | 2011 | 2012 |
| -------- | ----------- | ---- | ---- | ---- |
| Brussels | female | 0 | 1 | 2 |
Contributor

Use eurostat data here too?

Collaborator Author

Yes, it's in the pipe. Give me some time.

A1,BI,F,FO,0,0,0
A1,A0,H,BE,0,0,0
geo,gender\\time,2015,2016,2017
BE,M,5524068.0,5569264.0,5589272.0
Contributor

wouldn't using longer labels make things more readable? (BE -> Belgium, M -> Male, ...)?

@@ -0,0 +1,7 @@
geo,gender\time,2013,2014,2015
BE,M,64371,64173,62561
Contributor

These seem to be birth numbers, not fertility (which I assume are usually given as rates).

Collaborator Author

@alixdamman, Jul 24, 2018

The dataset comes from Database by themes > Population and social conditions > Demography and migration (demo) > Fertility ( http://ec.europa.eu/eurostat/data/database ). But yes, it represents births.

@alixdamman
Collaborator Author

Give me some time. I'm not ready. I didn't ask for any review this time.
Since the data used for examples have completely changed, can I exceptionally rebase this PR?

@alixdamman alixdamman requested a review from gdementen July 24, 2018 08:38
@alixdamman
Collaborator Author

Could you please review only the last commit?

@@ -178,6 +178,17 @@ def read_excel(filepath, sheet=0, nb_axes=None, index_col=None, fill_value=np.na
geo\\time 2013 2014 2015
Belgium 11137974 11180840 11237274
France 65600350 65942267 66456279

Extract array from a given range (useful when several arrays are stored in the same sheet)
Contributor

There are other cases where this could be useful (e.g. if there is some text before, after or next to the data you want to load), so I am unsure the parentheses add anything.

@@ -197,7 +208,9 @@ def read_excel(filepath, sheet=0, nb_axes=None, index_col=None, fill_value=np.na
.format(list(kwargs.keys())[0]))
from larray.inout.xw_excel import open_excel
with open_excel(filepath) as wb:
return wb[sheet].load(index_col=index_col, fill_value=fill_value, sort_rows=sort_rows,
if range is None:
range = slice(None, None, None)
Contributor

  • range = slice(None) is enough.
  • you could have range default to slice(None) directly since a slice object is readonly.
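
Both points can be checked in plain Python. The `load` function below is a hypothetical stand-in that indexes a list, not the larray method; its parameter is named `range` only to mirror the diff (it shadows the builtin inside the function).

```python
def load(cells, range=slice(None)):
    """Hypothetical loader: return the requested part of `cells`."""
    # slice(None) selects everything, so no `if range is None` branch is needed
    return cells[range]


cells = ["geo", "Belgium", "France"]

# slice(None) is the same slice as slice(None, None, None)
same = slice(None) == slice(None, None, None)

whole = load(cells)                # default: everything
part = load(cells, slice(1, None))  # explicit range: skip the first item

# slice attributes are read-only, so sharing one default object is safe
try:
    slice(None).start = 0
    is_readonly = False
except AttributeError:
    is_readonly = True

print(same, whole, part, is_readonly)
```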

@@ -61,7 +61,7 @@ def read_excel(filepath, sheet=0, nb_axes=None, index_col=None, fill_value=np.na
Engine to use to read the Excel file. If None (default), it will use 'xlwings' by default if the module is
installed and relies on Pandas default reader otherwise.
range : str, optional
Range in which array is stored. Used only if engine is 'xlwings'. Defaults to None.
Range in which array is stored. Used only if engine is 'xlwings'. Defaults to slice(None).
Contributor

should explain what that slice(None) means. Something like "(which means the entire sheet)"



- code: |
# create a session with two arrays
session = Session([('arr1', ndtest((3, 3))), ('arr2', ndtest((2, 2, 2)))])
Contributor

use an empty session instead?

Collaborator Author

The idea is to show that the load method adds new items to a session and does not delete the existing items.

Contributor

I would rather use an empty session (using Session()) and use .load several times on it IF THAT MAKES SENSE given the current files (if you have distinct arrays in different files), because this is closer to what people actually do. If that's not the case, leave the thing as it is.

### Dumping Sessions (CSV, Excel, HDF5)
The ``load`` method offers some options:

1) Using the ``names`` argument, you can specify which items will be loaded:
Contributor

s/will be loaded/to load/


session = Session()

# using names, you can select the items to load
Contributor

s/# .../# use the names argument to only load births and deaths arrays/

# save session to an Excel file
session.save('population.xlsx')

with open_excel('population.xlsx') as wb:
Contributor

add a comment here. e.g. # check the sheets contained in the file

- markdown: |
The ``save`` method has several arguments:

1) Using the ``names`` argument, you can specify which items will be loaded:
Contributor

s/loaded/saved/ or dumped

# using names, you can select the items to dump
session.save('population.xlsx', names=['births', 'deaths'])

with open_excel('population.xlsx') as wb:
Contributor

add a comment

pop = read_csv('./population/pop.csv')
ses_pop = Session([('pop', pop)])

# by setting overwrite to False, the destination file is updated instead of overwritten
Contributor

maybe we should specify what we mean by updated, i.e. that existing data in the file is left intact if it is not in the session, but if an array exists in both the file and the session, it is completely overwritten. My wording is bad, but it might not be obvious that arrays are completely replaced and not somehow merged
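
The behaviour described in this comment resembles a dictionary update, which may be a convenient way to word it. The sketch below is an analogy only: plain dicts with placeholder strings stand in for the file and the session, and no larray code is involved.

```python
# what the file already contains (placeholder values)
file_content = {"births": "births from file", "deaths": "deaths from file"}

# what the session holds when it is saved with overwrite=False
session = {"deaths": "deaths from session", "pop": "pop from session"}

# update key by key: each value is replaced as a whole, never merged
updated = dict(file_content)
updated.update(session)

print(updated["births"])  # births from file (not in the session: left intact)
print(updated["deaths"])  # deaths from session (in both: fully replaced)
print(updated["pop"])     # pop from session (only in the session: added)
```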

# by setting overwrite to False, the destination file is updated instead of overwritten
ses_pop.save('population.xlsx', overwrite=False)

with open_excel('population.xlsx') as wb:
Contributor

add comment again?

@@ -61,7 +61,8 @@ def read_excel(filepath, sheet=0, nb_axes=None, index_col=None, fill_value=np.na
Engine to use to read the Excel file. If None (default), it will use 'xlwings' by default if the module is
installed and relies on Pandas default reader otherwise.
range : str, optional
Range in which array is stored. Used only if engine is 'xlwings'. Defaults to slice(None).
Range in which the array is stored. Used only if engine is 'xlwings'. If slice(None) (default), the range
Contributor

Range to load the array from (only supported for the 'xlwings' engine). Defaults to slice(None) which loads the whole sheet, ignoring blank cells in the bottom right corner.

Contributor

@gdementen left a comment

LGTM. Probably needs a mention in the changelog.

@alixdamman alixdamman force-pushed the tutorial branch 3 times, most recently from 3800cfe to 94d0888 Compare August 23, 2018 10:00
@alixdamman alixdamman force-pushed the tutorial branch 3 times, most recently from ecb7e39 to 71b880d Compare August 24, 2018 07:53
@alixdamman alixdamman merged commit 0dc8d93 into larray-project:master Aug 24, 2018
2 participants