diff --git a/doc/source/api.rst b/doc/source/api.rst
index 8886c0cf4..9648c6e0d 100644
--- a/doc/source/api.rst
+++ b/doc/source/api.rst
@@ -537,6 +537,7 @@ Miscellaneous
aslarray
from_frame
+ get_example_filepath
labels_array
union
stack
diff --git a/doc/source/changes/version_0_29.rst.inc b/doc/source/changes/version_0_29.rst.inc
index 74e689432..1fc8c6952 100644
--- a/doc/source/changes/version_0_29.rst.inc
+++ b/doc/source/changes/version_0_29.rst.inc
@@ -348,6 +348,9 @@ Miscellaneous improvements
* made the `from_series` function support series with multiindex (closes :issue:`465`)
+* completely rewrote the 'Load And Dump Arrays, Sessions, Axes And Groups' section of the tutorial
+ (closes :issue:`645`)
+
Fixes
-----
diff --git a/doc/source/tutorial/tutorial_IO.ipyml b/doc/source/tutorial/tutorial_IO.ipyml
index a04c7e539..0edaea2f4 100644
--- a/doc/source/tutorial/tutorial_IO.ipyml
+++ b/doc/source/tutorial/tutorial_IO.ipyml
@@ -1,251 +1,726 @@
cells:
- markdown: |
- # Load/Dump Arrays And Sessions From/To Files
+ # Load And Dump Arrays, Sessions, Axes And Groups
- markdown: |
- Import the LArray library:
+ LArray provides methods and functions to load and dump LArray, Session, Axis and Group objects to several formats such as Excel, CSV and HDF5. The HDF5 file format is designed to store and organize large amounts of data. It allows reading and writing data much faster than with CSV and Excel files.
- code: |
+ # first of all, import the LArray library
from larray import *
+ id: 0
+ metadata:
+ nbsphinx: hidden
+
+- markdown: |
+ ## Loading and Dumping Arrays
+
+
+- markdown: |
+ ### Loading Arrays - Basic Usage (CSV, Excel, HDF5)
+
+ To read an array from a CSV file, you must use the ``read_csv`` function:
+
+
+- code: |
+ csv_dir = get_example_filepath('examples')
+
+ # read the array pop from the file 'pop.csv'.
+ # The data of the array below is derived from a subset of the demo_pjan table from Eurostat
+ pop = read_csv(csv_dir + '/pop.csv')
+ pop
+
id: 1
- markdown: |
- ## Load from CVS, Excel or HDF5 files
+ To read an array from a sheet of an Excel file, you can use the ``read_excel`` function:
+
+
+- code: |
+ filepath_excel = get_example_filepath('examples.xlsx')
- Arrays can be loaded from CSV files
+ # read the array from the sheet 'pop' of the Excel file 'examples.xlsx'
+ pop = read_excel(filepath_excel, 'pop')
+ pop
+
+ id: 2
+
+- markdown: |
+ The ``open_excel`` function in combination with the ``load`` method allows you to load several arrays from the same Workbook without opening and closing it several times:
+
+
+- code: |
+ # open the Excel file 'examples.xlsx' and keep it open for the duration of the indented block.
+ # The Python keyword ``with`` ensures that the Excel file is properly closed even if an error occurs
+ with open_excel(filepath_excel) as wb:
+ # load the array 'pop' from the sheet 'pop'
+ pop = wb['pop'].load()
+ # load the array 'births' from the sheet 'births'
+ # The data of the array below is derived from a subset of the demo_fasec table from Eurostat
+ births = wb['births'].load()
+ # load the array 'deaths' from the sheet 'deaths'
+ # The data of the array below is derived from a subset of the demo_magec table from Eurostat
+ deaths = wb['deaths'].load()
- ```python
- # read_tsv is a shortcut when data are separated by tabs instead of commas (default separator of read_csv)
- # read_eurostat is a shortcut to read EUROSTAT TSV files
- household = read_csv('hh.csv')
- ```
+ # the Workbook is automatically closed when leaving the block defined by the with statement
+ print('pop:\n', pop)
+ print('\nbirths:\n', births)
+ print('\ndeaths:\n', deaths)
+
+ id: 3
+
+- markdown: |
+
+ **Warning:** ``open_excel`` only works on Windows and requires the ``xlwings`` library to be installed.
+
- markdown: |
- or Excel sheets
+ The `HDF5` file format is specifically designed to store and organize large amounts of data.
+ Reading and writing data in this file format is much faster than with CSV or Excel.
+ An HDF5 file can contain multiple arrays, each array being associated with a key.
+ To read an array from an HDF5 file, you must use the ``read_hdf`` function and provide the key associated with the array:
+
+
+- code: |
+ filepath_hdf = get_example_filepath('examples.h5')
- ```python
- # loads array from the first sheet if no sheet is given
- pop = read_excel('demography.xlsx', 'pop')
- ```
+ # read the array from the file 'examples.h5' associated with the key 'pop'
+ pop = read_hdf(filepath_hdf, 'pop')
+ pop
+ id: 4
- markdown: |
- or HDF5 files (HDF5 is file format designed to store and organize large amounts of data.
- An HDF5 file can contain multiple arrays.
+ ### Dumping Arrays - Basic Usage (CSV, Excel, HDF5)
- ```python
- mortality = read_hdf('demography.h5','qx')
- ```
+ To write an array in a CSV file, you must use the ``to_csv`` method:
+
+- code: |
+ # save the array pop in the file 'pop.csv'
+ pop.to_csv('pop.csv')
+
+ id: 5
- markdown: |
- See documentation of reading functions for more details.
+ To write an array to a sheet of an Excel file, you can use the ``to_excel`` method:
+
+- code: |
+ # save the array pop in the sheet 'pop' of the Excel file 'population.xlsx'
+ pop.to_excel('population.xlsx', 'pop')
+
+ id: 6
- markdown: |
- ### Load Sessions
+ Note that ``to_excel`` creates a new Excel file if it does not exist yet.
+ If the file already exists, a new sheet is added after the existing ones if that sheet does not already exist:
+
+- code: |
+ # add a new sheet 'births' to the file 'population.xlsx' and save the array births in it
+ births.to_excel('population.xlsx', 'births')
+
+ id: 7
- markdown: |
- The advantage of sessions is that you can load many arrays in one shot:
+ To reset an Excel file, you simply need to set the ``overwrite_file`` argument to True:
+
+
+- code: |
+ # 1. reset the file 'population.xlsx' (all sheets are removed)
+ # 2. create a sheet 'pop' and save the array pop in it
+ pop.to_excel('population.xlsx', 'pop', overwrite_file=True)
+
+ id: 8
+
+- markdown: |
+ The ``open_excel`` function in combination with the ``dump()`` method allows you to open a Workbook and to export several arrays at once. If the Excel file doesn't exist, the ``overwrite_file`` argument must be set to True.
- ```python
- # this load several arrays from a single Excel file (each array is stored on a different sheet)
- s = Session()
- s.load('test.xlsx')
- # or
- s = Session('test.xlsx')
+
+ **Warning:** The ``save`` method must be called at the end of the block defined by the *with* statement to actually write data in the Excel file, otherwise you will end up with an empty file.
+
+
+
+- code: |
+ # to create a new Excel file, argument overwrite_file must be set to True
+ with open_excel('population.xlsx', overwrite_file=True) as wb:
+ # add a new sheet 'pop' and dump the array pop in it
+ wb['pop'] = pop.dump()
+ # add a new sheet 'births' and dump the array births in it
+ wb['births'] = births.dump()
+ # add a new sheet 'deaths' and dump the array deaths in it
+ wb['deaths'] = deaths.dump()
+ # actually write data in the Workbook
+ wb.save()
+
+ # the Workbook is automatically closed when leaving the block defined by the with statement
+
+ id: 9
+
+- markdown: |
+ To write an array to an HDF5 file, you must use the ``to_hdf`` method and provide the key that will be associated with the array:
+
+
+- code: |
+ # save the array pop in the file 'population.h5' and associate it with the key 'pop'
+ pop.to_hdf('population.h5', 'pop')
+
+ id: 10
+
+- markdown: |
+ ### Specifying Wide VS Narrow format (CSV, Excel)
+
+ By default, all reading functions assume that arrays are stored in the ``wide`` format, meaning that their last axis is represented horizontally:
- # this load several arrays from a single HDF5 file (which is a very fast format)
- s = Session()
- s.load('test.h5')
- # or
- s = Session('test.h5')
- ```
+ | geo\time | 2013 | 2014 | 2015 |
+ | -------- | -------- | -------- | -------- |
+ | Belgium | 11137974 | 11180840 | 11237274 |
+ | France | 65600350 | 65942267 | 66456279 |
+
+ By setting the ``wide`` argument to False, reading functions will assume instead that arrays are stored in the ``narrow`` format, i.e. one column per axis plus one value column:
+
+ | geo | time | value |
+ | ------- | ---- | -------- |
+ | Belgium | 2013 | 11137974 |
+ | Belgium | 2014 | 11180840 |
+ | Belgium | 2015 | 11237274 |
+ | France | 2013 | 65600350 |
+ | France | 2014 | 65942267 |
+ | France | 2015 | 66456279 |
+
+- code: |
+ # set 'wide' argument to False to indicate that the array is stored in the 'narrow' format
+ pop_BE_FR = read_csv(csv_dir + '/pop_narrow_format.csv', wide=False)
+ pop_BE_FR
+
+ id: 11
+
+- code: |
+ # same for the read_excel function
+ pop_BE_FR = read_excel(filepath_excel, sheet='pop_narrow_format', wide=False)
+ pop_BE_FR
+
+ id: 12
- markdown: |
- ## Dump to CSV, Excel or HDF5 files
+ By default, writing functions will set the name of the column containing the data to 'value'. You can choose the name of this column by using the ``value_name`` argument. For example, using ``value_name='population'`` you can export the previous array as:
- Arrays can be dumped in CSV files
+ | geo | time | population |
+ | ------- | ---- | ---------- |
+ | Belgium | 2013 | 11137974 |
+ | Belgium | 2014 | 11180840 |
+ | Belgium | 2015 | 11237274 |
+ | France | 2013 | 65600350 |
+ | France | 2014 | 65942267 |
+ | France | 2015 | 66456279 |
+
+
+- code: |
+ # dump the array pop_BE_FR in a narrow format (one column per axis plus one value column).
+ # By default, the name of the column containing data is set to 'value'
+ pop_BE_FR.to_csv('pop_narrow_format.csv', wide=False)
- ```python
- household.to_csv('hh2.csv')
- ```
+ # same but replace 'value' by 'population'
+ pop_BE_FR.to_csv('pop_narrow_format.csv', wide=False, value_name='population')
+
+ id: 13
+
+- code: |
+ # same for the to_excel method
+ pop_BE_FR.to_excel('population.xlsx', 'pop_narrow_format', wide=False, value_name='population')
+
+ id: 14
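
The relationship between the ``wide`` and ``narrow`` layouts described above can be sketched in plain Python. This is only an illustration of the two layouts, not LArray's implementation: the helper `wide_to_narrow` is hypothetical and does not use LArray at all, and the data is the Belgium/France table shown earlier.

```python
# Plain-Python sketch (no LArray needed): turn a 'wide' table into 'narrow' rows.
# The helper name wide_to_narrow is hypothetical and only serves the illustration.
wide_header = ["geo\\time", "2013", "2014", "2015"]
wide_rows = [
    ["Belgium", 11137974, 11180840, 11237274],
    ["France", 65600350, 65942267, 66456279],
]


def wide_to_narrow(header, rows, value_name="value"):
    """Return (narrow_header, narrow_rows): one row per (row label, column label) pair."""
    # the first header cell holds the names of the last two axes, separated by '\'
    row_axis, col_axis = header[0].split("\\")
    col_labels = header[1:]
    narrow_header = [row_axis, col_axis, value_name]
    narrow_rows = [
        [row[0], col_label, value]
        for row in rows
        for col_label, value in zip(col_labels, row[1:])
    ]
    return narrow_header, narrow_rows


narrow_header, narrow_rows = wide_to_narrow(wide_header, wide_rows, value_name="population")
print(narrow_header)
for row in narrow_rows:
    print(row)
```

Each horizontal label of the wide table becomes one row of the narrow table, and the `value_name` parameter of the sketch plays the same role as the ``value_name`` argument of the writing methods.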
+
+- markdown: |
+ Like with the ``to_excel`` method, it is possible to export arrays in a ``narrow`` format using ``open_excel``.
+ To do so, you must set the ``wide`` argument of the ``dump`` method to False:
+- code: |
+ with open_excel('population.xlsx') as wb:
+ # dump the array pop_BE_FR in a narrow format:
+ # one column per axis plus one value column.
+ # Argument value_name can be used to change the name of the
+ # column containing the data (default name is 'value')
+ wb['pop_narrow_format'] = pop_BE_FR.dump(wide=False, value_name='population')
+ # don't forget to call save()
+ wb.save()
+
+ # in the sheet 'pop_narrow_format', data is written as:
+ # | geo | time | population |
+ # | ------- | ---- | ---------- |
+ # | Belgium | 2013 | 11137974 |
+ # | Belgium | 2014 | 11180840 |
+ # | Belgium | 2015 | 11237274 |
+ # | France | 2013 | 65600350 |
+ # | France | 2014 | 65942267 |
+ # | France | 2015 | 66456279 |
+
+ id: 15
+
- markdown: |
- or in Excel files
+ ### Specifying Position in Sheet (Excel)
- ```python
- # if the file does not already exist, it is created with a single sheet,
- # otherwise a new sheet is added to it
- household.to_excel('demography_2.xlsx', overwrite_file=True)
- # it is usually better to specify the sheet explicitly (by name or position) though
- household.to_excel('demography_2.xlsx', 'hh')
- ```
+ If you want to read an array from an Excel sheet which does not start at cell `A1` (when there is more than one array stored in the same sheet for example), you will need to use the ``range`` argument. Note that this argument is only available if you have the library ``xlwings`` installed.
+
+
+- code: |
+ # the 'range' argument must be used to load data not starting at cell A1.
+ # This is useful when there are several arrays stored in the same sheet
+ births = read_excel(filepath_excel, sheet='pop_births_deaths', range='A9:E15')
+ births
+ id: 16
- markdown: |
- or in HDF5 files
+ Using ``open_excel``, ranges are passed in brackets:
+
+
+- code: |
+ with open_excel(filepath_excel) as wb:
+ # store sheet 'pop_births_deaths' in a temporary variable sh
+ sh = wb['pop_births_deaths']
+ # load the array pop from range A1:E7
+ pop = sh['A1:E7'].load()
+ # load the array births from range A9:E15
+ births = sh['A9:E15'].load()
+ # load the array deaths from range A17:E23
+ deaths = sh['A17:E23'].load()
- ```python
- household.to_hdf('demography_2.h5', 'hh')
- ```
+ # the Workbook is automatically closed when leaving the block defined by the with statement
+ print('pop:\n', pop)
+ print('\nbirths:\n', births)
+ print('\ndeaths:\n', deaths)
+ id: 17
- markdown: |
- See documentation of writing methods for more details.
+ When exporting arrays to Excel files, data is written starting at cell `A1` by default. Using the ``position`` argument of the ``to_excel`` method, it is possible to specify the top left cell of the dumped data. This can be useful when you want to export several arrays in the same sheet for example:
+- code: |
+ filename = 'population.xlsx'
+ sheetname = 'pop_births_deaths'
+
+ # save the arrays pop, births and deaths in the same sheet 'pop_births_deaths'.
+ # The 'position' argument is used to shift the location of the second and third arrays to be dumped
+ pop.to_excel(filename, sheetname)
+ births.to_excel(filename, sheetname, position='A9')
+ deaths.to_excel(filename, sheetname, position='A17')
+
+ id: 18
+
- markdown: |
- ### Dump Sessions
+ Using ``open_excel``, the position is passed in brackets (this also allows you to add extra information):
+
+- code: |
+ with open_excel('population.xlsx') as wb:
+ # add a new sheet 'pop_births_deaths' and write 'population' in the first cell
+ # note: you can use wb['new_sheet_name'] = '' to create an empty sheet
+ wb['pop_births_deaths'] = 'population'
+ # store sheet 'pop_births_deaths' in a temporary variable sh
+ sh = wb['pop_births_deaths']
+ # dump the array pop in sheet 'pop_births_deaths' starting at cell A2
+ sh['A2'] = pop.dump()
+ # add 'births' in cell A10
+ sh['A10'] = 'births'
+ # dump the array births in sheet 'pop_births_deaths' starting at cell A11
+ sh['A11'] = births.dump()
+ # add 'deaths' in cell A19
+ sh['A19'] = 'deaths'
+ # dump the array deaths in sheet 'pop_births_deaths' starting at cell A20
+ sh['A20'] = deaths.dump()
+ # don't forget to call save()
+ wb.save()
+
+ # the Workbook is automatically closed when leaving the block defined by the with statement
+
+ id: 19
- markdown: |
- The advantage of sessions is that you can save many arrays in one shot:
+ ### Exporting data without headers (Excel)
+
+ In some cases, you may want to export only the data of an array, without its axes. For example, you may want to insert a new column containing extra information. As an example, suppose we want to add the capital city of each country to the array containing the total population by country:
+
+ | country | capital city | 2013 | 2014 | 2015 |
+ | ------- | ------------ | -------- | -------- | -------- |
+ | Belgium | Brussels | 11137974 | 11180840 | 11237274 |
+ | France | Paris | 65600350 | 65942267 | 66456279 |
+ | Germany | Berlin | 80523746 | 80767463 | 81197537 |
+
+ Assuming you have prepared an Excel sheet as below:
- ```python
- # this saves all the arrays in a single excel file (each array on a different sheet)
- s.save('test.xlsx')
+ | country | capital city | 2013 | 2014 | 2015 |
+ | ------- | ------------ | -------- | -------- | -------- |
+ | Belgium | Brussels | | | |
+ | France | Paris | | | |
+ | Germany | Berlin | | | |
- # this saves all the arrays in a single HDF5 file (which is a very fast format)
- s.save('test.h5')
- ```
+ you can then dump the data at the right place by setting the ``header`` argument of ``to_excel`` to False and specifying the position of the data in the sheet:
+
+
+- code: |
+ pop_by_country = pop.sum('gender')
+
+ # export only the data of the array pop_by_country starting at cell C2
+ pop_by_country.to_excel('population.xlsx', 'pop_by_country', header=False, position='C2')
+
+ id: 20
+- markdown: |
+ Using ``open_excel``, you can easily prepare the sheet and then export only the data at the right place, either by setting the ``header`` argument of the ``dump`` method to False or by not calling ``dump`` at all:
+
+
+- code: |
+ with open_excel('population.xlsx') as wb:
+ # create new empty sheet 'pop_by_country'
+ wb['pop_by_country'] = ''
+ # store sheet 'pop_by_country' in a temporary variable sh
+ sh = wb['pop_by_country']
+ # write extra information (description)
+ sh['A1'] = 'Population at 1st January by country'
+ # export column names
+ sh['A2'] = ['country', 'capital city']
+ sh['C2'] = pop_by_country.time.labels
+ # export countries as first column
+ sh['A3'].options(transpose=True).value = pop_by_country.geo.labels
+ # export capital cities as second column
+ sh['B3'].options(transpose=True).value = ['Brussels', 'Paris', 'Berlin']
+ # export only data of pop_by_country
+ sh['C3'] = pop_by_country.dump(header=False)
+ # or equivalently
+ sh['C3'] = pop_by_country
+ # don't forget to call save()
+ wb.save()
+
+ # the Workbook is automatically closed when leaving the block defined by the with statement
+
+ id: 21
- markdown: |
- ## Interact with Excel files
+ ### Specifying the Number of Axes at Reading (CSV, Excel)
+
+ By default, ``read_csv`` and ``read_excel`` search for the first cell of the header line containing the special character ``\`` in order to determine the number of axes of the array to read. The special character ``\`` is used to separate the names of the last two axes. If there is no special character ``\``, ``read_csv`` and ``read_excel`` will consider that the array to read has only one dimension. For an array stored as:
+
+ | geo | gender\time | 2013 | 2014 | 2015 |
+ | ------- | ----------- | -------- | -------- | -------- |
+ | Belgium | Male | 5472856 | 5493792 | 5524068 |
+ | Belgium | Female | 5665118 | 5687048 | 5713206 |
+ | France | Male | 31772665 | 31936596 | 32175328 |
+ | France | Female | 33827685 | 34005671 | 34280951 |
+ | Germany | Male | 39380976 | 39556923 | 39835457 |
+ | Germany | Female | 41142770 | 41210540 | 41362080 |
+
+ ``read_csv`` and ``read_excel`` will find the special character ``\`` in the second cell, meaning that three axes are expected (geo, gender and time).
+
+ Sometimes, you need to read an array for which the name of the last axis is implicit:
+
+ | geo | gender | 2013 | 2014 | 2015 |
+ | ------- | ------ | -------- | -------- | -------- |
+ | Belgium | Male | 5472856 | 5493792 | 5524068 |
+ | Belgium | Female | 5665118 | 5687048 | 5713206 |
+ | France | Male | 31772665 | 31936596 | 32175328 |
+ | France | Female | 33827685 | 34005671 | 34280951 |
+ | Germany | Male | 39380976 | 39556923 | 39835457 |
+ | Germany | Female | 41142770 | 41210540 | 41362080 |
+
+ In such a case, you have to tell ``read_csv`` and ``read_excel`` the number of axes of the output array by setting the ``nb_axes`` argument:
+
+
+- code: |
+ # read the 3 x 2 x 3 array stored in the file 'pop_missing_axis_name.csv' without using the 'nb_axes' argument.
+ pop = read_csv(csv_dir + '/pop_missing_axis_name.csv')
+ # shape and data type of the output array are not what we expected
+ pop.info
+
+ id: 22
+
+- code: |
+ # by setting the 'nb_axes' argument, you can indicate to read_csv the number of axes of the output array
+ pop = read_csv(csv_dir + '/pop_missing_axis_name.csv', nb_axes=3)
+
+ # give a name to the last axis
+ pop = pop.rename(-1, 'time')
+
+ # shape and data type of the output array are what we expected
+ pop.info
+ id: 23
+
+- code: |
+ # same for the read_excel function
+ pop = read_excel(filepath_excel, sheet='pop_missing_axis_name', nb_axes=3)
+ pop = pop.rename(-1, 'time')
+ pop.info
+
+ id: 24
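
The header rule described above can be sketched as a small stand-alone function. This is a minimal illustration of the rule as stated in this section, not LArray's actual implementation; the helper name `guess_nb_axes` is hypothetical.

```python
# Plain-Python sketch of the header rule: the first cell containing '\' names the
# last two axes, and every cell before it names one more axis.
# guess_nb_axes is a hypothetical helper, not part of LArray.
def guess_nb_axes(header_cells):
    """Guess the number of axes of an array from the cells of its header line."""
    for position, cell in enumerate(header_cells):
        if "\\" in cell:
            # 'position' cells precede this one (one axis each), plus the two axes named here
            return position + 2
    # no backslash found: the data is considered one-dimensional
    return 1


explicit_header = ["geo", "gender\\time", "2013", "2014", "2015"]
implicit_header = ["geo", "gender", "2013", "2014", "2015"]

print(guess_nb_axes(explicit_header))
print(guess_nb_axes(implicit_header))
```

The first header yields 3 axes while the second yields only 1, which is why the ``nb_axes`` argument has to be given explicitly when the name of the last axis is implicit.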
- markdown: |
- ### Write Arrays
+ ### NaNs and Missing Data Handling at Reading (CSV, Excel)
- Open an Excel file
+ Sometimes, there is no data available for some label combinations. In the example below, the rows corresponding to `France - Male` and `Germany - Female` are missing:
- ```python
- wb = open_excel('test.xlsx', overwrite_file=True)
- ```
+ | geo | gender\time | 2013 | 2014 | 2015 |
+ | ------- | ----------- | -------- | -------- | -------- |
+ | Belgium | Male | 5472856 | 5493792 | 5524068 |
+ | Belgium | Female | 5665118 | 5687048 | 5713206 |
+ | France | Female | 33827685 | 34005671 | 34280951 |
+ | Germany | Male | 39380976 | 39556923 | 39835457 |
+
+ By default, ``read_csv`` and ``read_excel`` will fill cells associated with missing label combinations with nans.
+ Be aware that, in that case, an int array will be converted to a float array.
+
+- code: |
+ # by default, cells associated with missing label combinations are filled with nans.
+ # In that case, the output array is converted to a float array
+ read_csv(csv_dir + '/pop_missing_values.csv')
+
+ id: 25
+
+- markdown: |
+ However, it is possible to choose which value to use to fill missing cells using the ``fill_value`` argument:
+
+
+- code: |
+ read_csv(csv_dir + '/pop_missing_values.csv', fill_value=0)
+
+ id: 26
+
+- code: |
+ # same for the read_excel function
+ read_excel(filepath_excel, sheet='pop_missing_values', fill_value=0)
+
+ id: 27
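
The filling behaviour described above can be illustrated without LArray. The sketch below is only an assumption-level model of the idea (complete the missing label combinations, then fill them): the helper `complete_rows` is hypothetical, and the data is the table with the missing `France - Male` and `Germany - Female` rows.

```python
# Plain-Python sketch: complete missing (geo, gender) combinations with a fill value.
# complete_rows is a hypothetical helper, not part of LArray.
from itertools import product
import math

rows = {
    ("Belgium", "Male"): [5472856, 5493792, 5524068],
    ("Belgium", "Female"): [5665118, 5687048, 5713206],
    ("France", "Female"): [33827685, 34005671, 34280951],
    ("Germany", "Male"): [39380976, 39556923, 39835457],
}


def complete_rows(rows, nb_columns, fill_value=math.nan):
    """Return a dict holding every (geo, gender) combination, filling the missing ones."""
    # collect the labels of each axis in order of first appearance
    geo_labels = list(dict.fromkeys(geo for geo, _ in rows))
    gender_labels = list(dict.fromkeys(gender for _, gender in rows))
    return {
        key: rows.get(key, [fill_value] * nb_columns)
        for key in product(geo_labels, gender_labels)
    }


filled_with_nan = complete_rows(rows, nb_columns=3)
filled_with_zero = complete_rows(rows, nb_columns=3, fill_value=0)

print(filled_with_nan[("France", "Male")])
print(filled_with_zero[("Germany", "Female")])
```

Because nan is a float, filling with it forces integer data to become float, which matches the conversion mentioned above; filling with `0` keeps the values as integers.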
- markdown: |
- Put an array in an Excel Sheet, **excluding** headers (labels)
+ ### Sorting Axes at Reading (CSV, Excel, HDF5)
- ```python
- # put arr at A1 in Sheet1, excluding headers (labels)
- wb['Sheet1'] = arr
- # same but starting at A9
- # note that Sheet1 must exist
- wb['Sheet1']['A9'] = arr
- ```
+ The ``sort_rows`` and ``sort_columns`` arguments of the reading functions allow you to sort rows and columns alphabetically:
+
+- code: |
+ # sort labels at reading --> Male and Female labels are inverted
+ read_csv(csv_dir + '/pop.csv', sort_rows=True)
+
+ id: 28
+
+- code: |
+ read_excel(filepath_excel, sheet='births', sort_rows=True)
+
+ id: 29
+
+- code: |
+ read_hdf(filepath_hdf, key='deaths', sort_rows=True)
+
+ id: 30
- markdown: |
- Put an array in an Excel Sheet, **including** headers (labels)
+ ### Metadata (HDF5)
+
+ Since version 0.29 of LArray, it is possible to add metadata to arrays:
+
+
+- code: |
+ pop.meta.title = 'Population at 1st January'
+ pop.meta.origin = 'Table demo_pjan from Eurostat'
- ```python
- # dump arr at A1 in Sheet2, including headers (labels)
- wb['Sheet2'] = arr.dump()
- # same but starting at A10
- wb['Sheet2']['A10'] = arr.dump()
- ```
+ pop.info
+ id: 31
- markdown: |
- Save file to disk
+ These metadata are automatically saved and loaded when working with the HDF5 file format:
+
+
+- code: |
+ pop.to_hdf('population.h5', 'pop')
- ```python
- wb.save()
- ```
+ new_pop = read_hdf('population.h5', 'pop')
+ new_pop.info
+
+ id: 32
+
+- markdown: |
+
+ **Warning:** Currently, metadata associated with arrays cannot be saved and loaded when working with CSV and Excel files.
+ This restriction does not apply however to metadata associated with sessions.
+
- markdown: |
- Close file
+ ## Loading and Dumping Sessions
- ```python
- wb.close()
- ```
+ One of the main advantages of grouping arrays, axes and groups in session objects is that you can load and save all of them in one shot. Like arrays, it is possible to associate metadata to a session. These can be saved and loaded in all file formats.
- markdown: |
- ### Read Arrays
+ ### Loading Sessions (CSV, Excel, HDF5)
- Open an Excel file
+ To load the items of a session, you have two options:
- ```python
- wb = open_excel('test.xlsx')
- ```
+ 1) Instantiate a new session and pass the path to the Excel/HDF5 file or to the directory containing CSV files to the Session constructor:
-- markdown: |
- Load an array from a sheet (assuming the presence of (correctly formatted) headers and only one array in sheet)
+- code: |
+ # create a new Session object and load all arrays, axes, groups and metadata
+ # from all CSV files located in the passed directory
+ csv_dir = get_example_filepath('population_session')
+ session = Session(csv_dir)
+
+ # create a new Session object and load all arrays, axes, groups and metadata
+ # stored in the passed Excel file
+ filepath_excel = get_example_filepath('population_session.xlsx')
+ session = Session(filepath_excel)
- ```python
- # save one array in Sheet3 (including headers)
- wb['Sheet3'] = arr.dump()
+ # create a new Session object and load all arrays, axes, groups and metadata
+ # stored in the passed HDF5 file
+ filepath_hdf = get_example_filepath('population_session.h5')
+ session = Session(filepath_hdf)
- # load array from the data starting at A1 in Sheet3
- arr = wb['Sheet3'].load()
- ```
+ print(session.summary())
+ id: 33
- markdown: |
- Load an array with its axes information from a range
+ 2) Call the ``load`` method on an existing session and pass the path to the Excel/HDF5 file or to the directory containing CSV files as first argument:
+
+
+- code: |
+ # create a session containing 3 axes, 2 groups and one array 'pop'
+ filepath = get_example_filepath('pop_only.xlsx')
+ session = Session(filepath)
- ```python
- # if you need to use the same sheet several times,
- # you can create a sheet variable
- sheet2 = wb['Sheet2']
+ print(session.summary())
+
+ id: 34
+
+- code: |
+ # call the load method on the previous session and add the 'births' and 'deaths' arrays to it
+ filepath = get_example_filepath('births_and_deaths.xlsx')
+ session.load(filepath)
- # load array contained in the 4 x 4 table defined by cells A10 and D14
- arr2 = sheet2['A10:D14'].load()
- ```
+ print(session.summary())
+ id: 35
- markdown: |
- ### Read Ranges (experimental)
+ The ``load`` method offers some options:
- Load an array (raw data) with no axis information from a range
+ 1) Using the ``names`` argument, you can specify which items to load:
+
+
+- code: |
+ session = Session()
+
+ # use the names argument to only load births and deaths arrays
+ session.load(filepath_hdf, names=['births', 'deaths'])
+
+ print(session.summary())
+
+ id: 36
+
+- markdown: |
+ 2) Setting the ``display`` argument to True, the ``load`` method will print a message each time a new item is loaded:
+
+
+- code: |
+ session = Session()
- ```python
- arr3 = wb['Sheet1']['A1:B4']
- ```
+ # with display=True, the load method will print a message
+ # each time a new item is loaded
+ session.load(filepath_hdf, display=True)
+ id: 37
- markdown: |
- in fact, this is not really an LArray ...
+ ### Dumping Sessions (CSV, Excel, HDF5)
+
+ To save a session, you need to call the ``save`` method. The first argument is the path to an Excel/HDF5 file or to a directory if items are saved to CSV files:
+
+
+- code: |
+ # save items of a session in CSV files.
+ # Here, the save method will create a 'population' directory in which CSV files will be written
+ session.save('population')
+
+ # save session to an HDF5 file
+ session.save('population.h5')
- ```python
- type(arr3)
+ # save session to an Excel file
+ session.save('population.xlsx')
- larray.io.excel.Range
- ```
+ # display the sheets contained in the file 'population.xlsx'
+ with open_excel('population.xlsx') as wb:
+ print(wb.sheet_names())
+
+ id: 38
+
+- markdown: |
+
+ **Note:** Concerning the CSV and Excel formats:
+
+ - all Axis objects are saved together in the same Excel sheet (CSV file) named ``__axes__(.csv)``
+ - all Group objects are saved together in the same Excel sheet (CSV file) named ``__groups__(.csv)``
+ - metadata is saved in one Excel sheet (CSV file) named ``__metadata__(.csv)``
+
+ These sheet (CSV file) names cannot be changed.
+
- markdown: |
- ... but it can be used as such
+ The ``save`` method has several arguments:
- ```python
- arr3.sum(axis=0)
- ```
+ 1) Using the ``names`` argument, you can specify which items to save:
+- code: |
+ # use the names argument to only save births and deaths arrays
+ session.save('population.xlsx', names=['births', 'deaths'])
+
+ # display the sheets contained in the file 'population.xlsx'
+ with open_excel('population.xlsx') as wb:
+ print(wb.sheet_names())
+
+ id: 39
+
- markdown: |
- ... and it can be used for other stuff, like setting the formula instead of the value:
+ 2) By default, dumping a session to an Excel or HDF5 file will overwrite it. By setting the ``overwrite`` argument to False, you can choose to update the existing Excel or HDF5 file:
+
+
+- code: |
+ pop = read_csv('./population/pop.csv')
+ ses_pop = Session([('pop', pop)])
- ```python
- arr3.formula = '=D10+1'
- ```
+ # by setting overwrite to False, the destination file is updated instead of overwritten.
+ # The items already stored in the file but not present in the session are left intact.
+ # In contrast, the items that exist in both the file and the session are completely overwritten.
+ ses_pop.save('population.xlsx', overwrite=False)
+
+ # display the sheets contained in the file 'population.xlsx'
+ with open_excel('population.xlsx') as wb:
+ print(wb.sheet_names())
+ id: 40
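
The difference between the default behaviour and ``overwrite=False`` can be modelled with plain dictionaries. This is a sketch of the rule as described in this section, not of how LArray writes files; `save_items` is a hypothetical helper and the string values merely stand in for arrays.

```python
# Plain-Python sketch of the 'overwrite' rule, using dicts in place of files and sessions.
# save_items is a hypothetical helper, not part of LArray.
def save_items(file_items, session_items, overwrite=True):
    """Return the items a file would hold after saving a session into it."""
    if overwrite:
        # default behaviour: the previous content of the file is discarded
        return dict(session_items)
    # overwrite=False: items only present in the file are kept,
    # items present in both are replaced by the session's version
    updated = dict(file_items)
    updated.update(session_items)
    return updated


file_items = {"births": "births (old)", "deaths": "deaths (old)", "pop": "pop (old)"}
session_items = {"pop": "pop (new)"}

print(save_items(file_items, session_items))
print(save_items(file_items, session_items, overwrite=False))
```

With the default, only `pop` remains in the destination; with `overwrite=False`, `births` and `deaths` are left intact and `pop` is replaced.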
- markdown: |
- In the future, we should also be able to set font name, size, style, etc.
+ 3) Setting the ``display`` argument to True, the ``save`` method will print a message each time an item is dumped:
+
+- code: |
+ # with display=True, the save method will print a message
+ # each time an item is dumped
+ session.save('population.xlsx', display=True)
+
+ id: 41
# The lines below here may be deleted if you do not need them.
# ---------------------------------------------------------------------------
@@ -273,4 +748,21 @@ nbformat_minor: 2
# ---------------------------------------------------------------------------
data:
- [{execution_count: null, outputs: []}, {execution_count: null, outputs: []}]
+ [{execution_count: null, outputs: []}, {execution_count: null, outputs: []}, {execution_count: null,
+ outputs: []}, {execution_count: null, outputs: []}, {execution_count: null, outputs: []},
+ {execution_count: null, outputs: []}, {execution_count: null, outputs: []}, {execution_count: null,
+ outputs: []}, {execution_count: null, outputs: []}, {execution_count: null, outputs: []},
+ {execution_count: null, outputs: []}, {execution_count: null, outputs: []}, {execution_count: null,
+ outputs: []}, {execution_count: null, outputs: []}, {execution_count: null, outputs: []},
+ {execution_count: null, outputs: []}, {execution_count: null, outputs: []}, {execution_count: null,
+ outputs: []}, {execution_count: null, outputs: []}, {execution_count: null, outputs: []},
+ {execution_count: null, outputs: []}, {execution_count: null, outputs: []}, {execution_count: null,
+ outputs: []}, {execution_count: null, outputs: []}, {execution_count: null, outputs: []},
+ {execution_count: null, outputs: []}, {execution_count: null, outputs: []}, {execution_count: null,
+ outputs: []}, {execution_count: null, outputs: []}, {execution_count: null, outputs: []},
+ {execution_count: null, outputs: []}, {execution_count: null, outputs: []}, {execution_count: null,
+ outputs: []}, {execution_count: null, outputs: []}, {execution_count: null, outputs: []},
+ {execution_count: null, outputs: []}, {execution_count: null, outputs: []}, {execution_count: null,
+ outputs: []}, {execution_count: null, outputs: []}, {execution_count: null, outputs: []},
+ {execution_count: null, outputs: []}, {execution_count: null, outputs: []}]
+
diff --git a/doc/source/tutorial/tutorial_IO.ipynb b/doc/source/tutorial/tutorial_IO.ipynb
index 7e47370ef..51bf03c3a 100644
--- a/doc/source/tutorial/tutorial_IO.ipynb
+++ b/doc/source/tutorial/tutorial_IO.ipynb
@@ -4,22 +4,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Load/Dump Arrays And Sessions From/To Files\n"
+ "# Load And Dump Arrays, Sessions, Axes And Groups\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Import the LArray library:\n"
+ "LArray provides methods and functions to load and dump LArray, Session, Axis and Group objects to several formats such as Excel, CSV and HDF5. The HDF5 file format is designed to store and organize large amounts of data. It allows reading and writing data much faster than with CSV and Excel files. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "nbsphinx": "hidden"
+ },
"outputs": [],
"source": [
+ "# first of all, import the LArray library\n",
"from larray import *"
]
},
@@ -27,312 +30,984 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Load from CVS, Excel or HDF5 files\n",
+ "## Loading and Dumping Arrays\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Loading Arrays - Basic Usage (CSV, Excel, HDF5)\n",
"\n",
- "Arrays can be loaded from CSV files\n",
+ "To read an array from a CSV file, you must use the ``read_csv`` function:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "csv_dir = get_example_filepath('examples')\n",
"\n",
- "```python\n",
- "# read_tsv is a shortcut when data are separated by tabs instead of commas (default separator of read_csv)\n",
- "# read_eurostat is a shortcut to read EUROSTAT TSV files\n",
- "household = read_csv('hh.csv')\n",
- "```\n"
+ "# read the array pop from the file 'pop.csv'.\n",
+ "# The data of the array below is derived from a subset of the demo_pjan table from Eurostat\n",
+ "pop = read_csv(csv_dir + '/pop.csv')\n",
+ "pop"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "or Excel sheets\n",
+ "To read an array from a sheet of an Excel file, you can use the ``read_excel`` function:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filepath_excel = get_example_filepath('examples.xlsx')\n",
"\n",
- "```python\n",
- "# loads array from the first sheet if no sheet is given\n",
- "pop = read_excel('demography.xlsx', 'pop')\n",
- "```\n"
+ "# read the array from the sheet 'pop' of the Excel file 'examples.xlsx'\n",
+ "pop = read_excel(filepath_excel, 'pop')\n",
+ "pop"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "or HDF5 files (HDF5 is file format designed to store and organize large amounts of data.\n",
- "An HDF5 file can contain multiple arrays.\n",
+ "The ``open_excel`` function in combination with the ``load`` method allows you to load several arrays from the same Workbook without opening and closing it several times:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# open the Excel file 'examples.xlsx' and keep it open as long as you stay inside the indented block.\n",
+ "# The Python keyword ``with`` ensures that the Excel file is properly closed even if an error occurs\n",
+ "with open_excel(filepath_excel) as wb:\n",
+ " # load the array 'pop' from the sheet 'pop' \n",
+ " pop = wb['pop'].load()\n",
+ " # load the array 'births' from the sheet 'births'\n",
+ " # The data of the array below is derived from a subset of the demo_fasec table from Eurostat\n",
+ " births = wb['births'].load()\n",
+ " # load the array 'deaths' from the sheet 'deaths'\n",
+ " # The data of the array below is derived from a subset of the demo_magec table from Eurostat\n",
+ " deaths = wb['deaths'].load()\n",
"\n",
- "```python\n",
- "mortality = read_hdf('demography.h5','qx')\n",
- "```\n"
+ "# the Workbook is automatically closed when leaving the block defined by the with statement\n",
+ "print('pop:\\n', pop)\n",
+ "print('\\nbirths:\\n', births)\n",
+ "print('\\ndeaths:\\n', deaths)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "See documentation of reading functions for more details.\n"
+ "**Warning:** ``open_excel`` only works on Windows and requires the ``xlwings`` library to be installed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Load Sessions\n"
+ "The `HDF5` file format is specifically designed to store and organize large amounts of data. \n",
+ "Reading and writing data in this file format is much faster than with CSV or Excel. \n",
+ "An HDF5 file can contain multiple arrays, each array being associated with a key.\n",
+ "To read an array from an HDF5 file, you must use the ``read_hdf`` function and provide the key associated with the array:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filepath_hdf = get_example_filepath('examples.h5')\n",
+ "\n",
+ "# read the array from the file 'examples.h5' associated with the key 'pop'\n",
+ "pop = read_hdf(filepath_hdf, 'pop')\n",
+ "pop"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "The advantage of sessions is that you can load many arrays in one shot:\n",
+ "### Dumping Arrays - Basic Usage (CSV, Excel, HDF5)\n",
"\n",
- "```python\n",
- "# this load several arrays from a single Excel file (each array is stored on a different sheet)\n",
- "s = Session()\n",
- "s.load('test.xlsx')\n",
- "# or \n",
- "s = Session('test.xlsx')\n",
+ "To write an array in a CSV file, you must use the ``to_csv`` method:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# save the array pop in the file 'pop.csv'\n",
+ "pop.to_csv('pop.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To write an array to a sheet of an Excel file, you can use the ``to_excel`` method:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# save the array pop in the sheet 'pop' of the Excel file 'population.xlsx' \n",
+ "pop.to_excel('population.xlsx', 'pop')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note that ``to_excel`` creates a new Excel file if it does not exist yet. \n",
+ "If the file already exists, a new sheet is added after the existing ones if that sheet does not already exist:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# add a new sheet 'births' to the file 'population.xlsx' and save the array births in it\n",
+ "births.to_excel('population.xlsx', 'births')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To reset an Excel file, you simply need to set the `overwrite_file` argument to True:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 1. reset the file 'population.xlsx' (all sheets are removed)\n",
+ "# 2. create a sheet 'pop' and save the array pop in it\n",
+ "pop.to_excel('population.xlsx', 'pop', overwrite_file=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The ``open_excel`` function in combination with the ``dump()`` method allows you to open a Workbook and to export several arrays at once. If the Excel file doesn't exist, the ``overwrite_file`` argument must be set to True.\n",
"\n",
- "# this load several arrays from a single HDF5 file (which is a very fast format)\n",
- "s = Session()\n",
- "s.load('test.h5')\n",
- "# or \n",
- "s = Session('test.h5')\n",
- "```\n"
+ "**Warning:** The ``save`` method must be called at the end of the block defined by the *with* statement to actually write data in the Excel file, otherwise you will end up with an empty file.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# to create a new Excel file, argument overwrite_file must be set to True\n",
+ "with open_excel('population.xlsx', overwrite_file=True) as wb:\n",
+ " # add a new sheet 'pop' and dump the array pop in it \n",
+ " wb['pop'] = pop.dump()\n",
+ " # add a new sheet 'births' and dump the array births in it \n",
+ " wb['births'] = births.dump()\n",
+ " # add a new sheet 'deaths' and dump the array deaths in it \n",
+ " wb['deaths'] = deaths.dump()\n",
+ " # actually write data in the Workbook\n",
+ " wb.save()\n",
+ " \n",
+ "# the Workbook is automatically closed when leaving the block defined by the with statement"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To write an array in an HDF5 file, you must use the ``to_hdf`` method and provide the key that will be associated with the array:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# save the array pop in the file 'population.h5' and associate it with the key 'pop'\n",
+ "pop.to_hdf('population.h5', 'pop')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Dump to CSV, Excel or HDF5 files\n",
+ "### Specifying Wide VS Narrow format (CSV, Excel)\n",
"\n",
- "Arrays can be dumped in CSV files\n",
+ "By default, all reading functions assume that arrays are stored in the ``wide`` format, meaning that their last axis is represented horizontally:\n",
"\n",
- "```python\n",
- "household.to_csv('hh2.csv')\n",
- "```\n"
+ "| geo\\time | 2013 | 2014 | 2015 |\n",
+ "| -------- | -------- | -------- | -------- |\n",
+ "| Belgium | 11137974 | 11180840 | 11237274 |\n",
+ "| France | 65600350 | 65942267 | 66456279 |\n",
+ "\n",
+ "By setting the ``wide`` argument to False, reading functions will assume instead that arrays are stored in the ``narrow`` format, i.e. one column per axis plus one value column:\n",
+ "\n",
+ "| geo | time | value |\n",
+ "| ------- | ---- | -------- |\n",
+ "| Belgium | 2013 | 11137974 |\n",
+ "| Belgium | 2014 | 11180840 |\n",
+ "| Belgium | 2015 | 11237274 |\n",
+ "| France | 2013 | 65600350 |\n",
+ "| France | 2014 | 65942267 |\n",
+ "| France | 2015 | 66456279 |\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# set 'wide' argument to False to indicate that the array is stored in the 'narrow' format\n",
+ "pop_BE_FR = read_csv(csv_dir + '/pop_narrow_format.csv', wide=False)\n",
+ "pop_BE_FR"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# same for the read_excel function\n",
+ "pop_BE_FR = read_excel(filepath_excel, sheet='pop_narrow_format', wide=False)\n",
+ "pop_BE_FR"
]
},
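+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The relationship between the two layouts can be sketched in plain Python, without LArray. This is only an illustration under simple assumptions (a two-axis table held in the hypothetical lists ``header`` and ``wide_rows``); it does not describe how ``read_csv`` or ``read_excel`` work internally:\n",
+ "\n",
+ "```python\n",
+ "# wide layout: one row per 'geo' label, one column per 'time' label\n",
+ "header = ['geo', '2013', '2014', '2015']\n",
+ "wide_rows = [\n",
+ "    ['Belgium', 11137974, 11180840, 11237274],\n",
+ "    ['France', 65600350, 65942267, 66456279],\n",
+ "]\n",
+ "\n",
+ "# narrow layout: one column per axis (geo, time) plus one value column\n",
+ "narrow_rows = [\n",
+ "    (row[0], time, value)\n",
+ "    for row in wide_rows\n",
+ "    for time, value in zip(header[1:], row[1:])\n",
+ "]\n",
+ "\n",
+ "for geo, time, value in narrow_rows:\n",
+ "    print(geo, time, value)\n",
+ "```\n",
+ "\n",
+ "Each cell of the wide table becomes one row of the narrow table, so the 2 x 3 table above yields 6 rows.\n"
+ ]
+ },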
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "or in Excel files\n",
+ "By default, writing functions will set the name of the column containing the data to 'value'. You can choose the name of this column by using the ``value_name`` argument. For example, using ``value_name='population'`` you can export the previous array as:\n",
"\n",
- "```python\n",
- "# if the file does not already exist, it is created with a single sheet,\n",
- "# otherwise a new sheet is added to it\n",
- "household.to_excel('demography_2.xlsx', overwrite_file=True)\n",
- "# it is usually better to specify the sheet explicitly (by name or position) though\n",
- "household.to_excel('demography_2.xlsx', 'hh')\n",
- "```\n"
+ "| geo | time | population |\n",
+ "| ------- | ---- | ---------- |\n",
+ "| Belgium | 2013 | 11137974 |\n",
+ "| Belgium | 2014 | 11180840 |\n",
+ "| Belgium | 2015 | 11237274 |\n",
+ "| France | 2013 | 65600350 |\n",
+ "| France | 2014 | 65942267 |\n",
+ "| France | 2015 | 66456279 |\n"
]
},
{
- "cell_type": "markdown",
+ "cell_type": "code",
+ "execution_count": null,
"metadata": {},
+ "outputs": [],
"source": [
- "or in HDF5 files\n",
+ "# dump the array pop_BE_FR in a narrow format (one column per axis plus one value column).\n",
+ "# By default, the name of the column containing data is set to 'value'\n",
+ "pop_BE_FR.to_csv('pop_narrow_format.csv', wide=False)\n",
"\n",
- "```python\n",
- "household.to_hdf('demography_2.h5', 'hh')\n",
- "```\n"
+ "# same but replace 'value' by 'population'\n",
+ "pop_BE_FR.to_csv('pop_narrow_format.csv', wide=False, value_name='population')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# same for the to_excel method\n",
+ "pop_BE_FR.to_excel('population.xlsx', 'pop_narrow_format', wide=False, value_name='population')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "See documentation of writing methods for more details.\n"
+ "As with the ``to_excel`` method, it is possible to export arrays in a ``narrow`` format using ``open_excel``. \n",
+ "To do so, you must set the ``wide`` argument of the ``dump`` method to False:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open_excel('population.xlsx') as wb:\n",
+ " # dump the array pop_BE_FR in a narrow format: \n",
+ " # one column per axis plus one value column.\n",
+ " # Argument value_name can be used to change the name of the \n",
+ " # column containing the data (default name is 'value')\n",
+ " wb['pop_narrow_format'] = pop_BE_FR.dump(wide=False, value_name='population')\n",
+ " # don't forget to call save()\n",
+ " wb.save()\n",
+ "\n",
+ "# in the sheet 'pop_narrow_format', data is written as:\n",
+ "# | geo | time | population |\n",
+ "# | ------- | ---- | ---------- |\n",
+ "# | Belgium | 2013 | 11137974 |\n",
+ "# | Belgium | 2014 | 11180840 |\n",
+ "# | Belgium | 2015 | 11237274 |\n",
+ "# | France | 2013 | 65600350 |\n",
+ "# | France | 2014 | 65942267 |\n",
+ "# | France | 2015 | 66456279 |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Dump Sessions\n"
+ "### Specifying Position in Sheet (Excel)\n",
+ "\n",
+ "If you want to read an array from an Excel sheet which does not start at cell `A1` (when there is more than one array stored in the same sheet for example), you will need to use the ``range`` argument. Note that this argument is only available if you have the library ``xlwings`` installed. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# the 'range' argument must be used to load data not starting at cell A1.\n",
+ "# This is useful when there are several arrays stored in the same sheet\n",
+ "births = read_excel(filepath_excel, sheet='pop_births_deaths', range='A9:E15')\n",
+ "births"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "The advantage of sessions is that you can save many arrays in one shot:\n",
+ "Using ``open_excel``, ranges are passed in brackets:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open_excel(filepath_excel) as wb:\n",
+ " # store sheet 'pop_births_deaths' in a temporary variable sh\n",
+ " sh = wb['pop_births_deaths']\n",
+ " # load the array pop from range A1:E7\n",
+ " pop = sh['A1:E7'].load()\n",
+ " # load the array births from range A9:E15\n",
+ " births = sh['A9:E15'].load()\n",
+ " # load the array deaths from range A17:E23\n",
+ " deaths = sh['A17:E23'].load()\n",
"\n",
- "```python\n",
- "# this saves all the arrays in a single excel file (each array on a different sheet)\n",
- "s.save('test.xlsx')\n",
+ "# the Workbook is automatically closed when leaving the block defined by the with statement\n",
+ "print('pop:\\n', pop)\n",
+ "print('\\nbirths:\\n', births)\n",
+ "print('\\ndeaths:\\n', deaths)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "When exporting arrays to Excel files, data is written starting at cell `A1` by default. Using the ``position`` argument of the ``to_excel`` method, it is possible to specify the top left cell of the dumped data. This can be useful when you want to export several arrays in the same sheet for example:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filename = 'population.xlsx'\n",
+ "sheetname = 'pop_births_deaths'\n",
"\n",
- "# this saves all the arrays in a single HDF5 file (which is a very fast format)\n",
- "s.save('test.h5')\n",
- "```\n"
+ "# save the arrays pop, births and deaths in the same sheet 'pop_births_deaths'.\n",
+ "# The 'position' argument is used to shift the location of the second and third arrays to be dumped\n",
+ "pop.to_excel(filename, sheetname)\n",
+ "births.to_excel(filename, sheetname, position='A9')\n",
+ "deaths.to_excel(filename, sheetname, position='A17')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Interact with Excel files\n"
+ "Using ``open_excel``, the position is passed in brackets (this also allows you to add extra information): \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open_excel('population.xlsx') as wb:\n",
+ " # add a new sheet 'pop_births_deaths' and write 'population' in the first cell\n",
+ " # note: you can use wb['new_sheet_name'] = '' to create an empty sheet\n",
+ " wb['pop_births_deaths'] = 'population'\n",
+ " # store sheet 'pop_births_deaths' in a temporary variable sh\n",
+ " sh = wb['pop_births_deaths']\n",
+ " # dump the array pop in sheet 'pop_births_deaths' starting at cell A2\n",
+ " sh['A2'] = pop.dump()\n",
+ " # add 'births' in cell A10\n",
+ " sh['A10'] = 'births'\n",
+ " # dump the array births in sheet 'pop_births_deaths' starting at cell A11 \n",
+ " sh['A11'] = births.dump()\n",
+ " # add 'deaths' in cell A19\n",
+ " sh['A19'] = 'deaths'\n",
+ " # dump the array deaths in sheet 'pop_births_deaths' starting at cell A20\n",
+ " sh['A20'] = deaths.dump()\n",
+ " # don't forget to call save()\n",
+ " wb.save()\n",
+ " \n",
+ "# the Workbook is automatically closed when leaving the block defined by the with statement"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Write Arrays\n",
+ "### Exporting data without headers (Excel)\n",
+ "\n",
+ "For some reason, you may want to export only the data of an array without its axes. For example, you may want to insert a new column containing extra information. As an exercise, let us say we want to add the capital city of each country present in the array containing the total population by country:\n",
+ "\n",
+ "| country | capital city | 2013 | 2014 | 2015 |\n",
+ "| ------- | ------------ | -------- | -------- | -------- |\n",
+ "| Belgium | Brussels | 11137974 | 11180840 | 11237274 |\n",
+ "| France | Paris | 65600350 | 65942267 | 66456279 |\n",
+ "| Germany | Berlin | 80523746 | 80767463 | 81197537 |\n",
+ "\n",
+ "Assuming you have prepared an excel sheet as below: \n",
"\n",
- "Open an Excel file\n",
+ "| country | capital city | 2013 | 2014 | 2015 |\n",
+ "| ------- | ------------ | -------- | -------- | -------- |\n",
+ "| Belgium | Brussels | | | |\n",
+ "| France | Paris | | | |\n",
+ "| Germany | Berlin | | | |\n",
"\n",
- "```python\n",
- "wb = open_excel('test.xlsx', overwrite_file=True)\n",
- "```\n"
+ "you can then dump the data at the right place by setting the ``header`` argument of ``to_excel`` to False and specifying the position of the data in the sheet:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pop_by_country = pop.sum('gender')\n",
+ "\n",
+ "# export only the data of the array pop_by_country starting at cell C2\n",
+ "pop_by_country.to_excel('population.xlsx', 'pop_by_country', header=False, position='C2')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Using ``open_excel``, you can easily prepare the sheet and then export only the data at the right place, either by setting the ``header`` argument of the ``dump`` method to False or by assigning the array directly without calling ``dump``:\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open_excel('population.xlsx') as wb:\n",
+ " # create new empty sheet 'pop_by_country'\n",
+ " wb['pop_by_country'] = ''\n",
+ " # store sheet 'pop_by_country' in a temporary variable sh\n",
+ " sh = wb['pop_by_country']\n",
+ " # write extra information (description)\n",
+ " sh['A1'] = 'Population at 1st January by country'\n",
+ " # export column names\n",
+ " sh['A2'] = ['country', 'capital city']\n",
+ " sh['C2'] = pop_by_country.time.labels\n",
+ " # export countries as first column\n",
+ " sh['A3'].options(transpose=True).value = pop_by_country.geo.labels\n",
+ " # export capital cities as second column\n",
+ " sh['B3'].options(transpose=True).value = ['Brussels', 'Paris', 'Berlin']\n",
+ " # export only data of pop_by_country\n",
+ " sh['C3'] = pop_by_country.dump(header=False)\n",
+ " # or equivalently\n",
+ " sh['C3'] = pop_by_country\n",
+ " # don't forget to call save()\n",
+ " wb.save()\n",
+ " \n",
+ "# the Workbook is automatically closed when leaving the block defined by the with statement"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Put an array in an Excel Sheet, **excluding** headers (labels)\n",
+ "### Specifying the Number of Axes at Reading (CSV, Excel)\n",
+ "\n",
+ "By default, ``read_csv`` and ``read_excel`` will search for the position of the first cell containing the special character ``\\`` in the header line in order to determine the number of axes of the array to read. The special character ``\\`` is used to separate the names of the last two axes. If there is no special character ``\\``, ``read_csv`` and ``read_excel`` will consider that the array to read has only one dimension. For an array stored as:\n",
+ "\n",
+ "| geo | gender\\time | 2013 | 2014 | 2015 |\n",
+ "| ------- | ----------- | -------- | -------- | -------- |\n",
+ "| Belgium | Male | 5472856 | 5493792 | 5524068 |\n",
+ "| Belgium | Female | 5665118 | 5687048 | 5713206 |\n",
+ "| France | Male | 31772665 | 31936596 | 32175328 |\n",
+ "| France | Female | 33827685 | 34005671 | 34280951 |\n",
+ "| Germany | Male | 39380976 | 39556923 | 39835457 |\n",
+ "| Germany | Female | 41142770 | 41210540 | 41362080 |\n",
+ "\n",
+ "``read_csv`` and ``read_excel`` will find the special character ``\\`` in the second cell, meaning that they expect three axes (geo, gender and time). \n",
+ "\n",
+ "Sometimes, you need to read an array for which the name of the last axis is implicit: \n",
+ "\n",
+ "| geo | gender | 2013 | 2014 | 2015 |\n",
+ "| ------- | ------ | -------- | -------- | -------- |\n",
+ "| Belgium | Male | 5472856 | 5493792 | 5524068 |\n",
+ "| Belgium | Female | 5665118 | 5687048 | 5713206 |\n",
+ "| France | Male | 31772665 | 31936596 | 32175328 |\n",
+ "| France | Female | 33827685 | 34005671 | 34280951 |\n",
+ "| Germany | Male | 39380976 | 39556923 | 39835457 |\n",
+ "| Germany | Female | 41142770 | 41210540 | 41362080 |\n",
+ "\n",
+ "In such a case, you will have to inform ``read_csv`` and ``read_excel`` of the number of axes of the output array by setting the ``nb_axes`` argument:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# read the 3 x 2 x 3 array stored in the file 'pop_missing_axis_name.csv' without using the 'nb_axes' argument.\n",
+ "pop = read_csv(csv_dir + '/pop_missing_axis_name.csv')\n",
+ "# shape and data type of the output array are not what we expected\n",
+ "pop.info"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# by setting the 'nb_axes' argument, you can indicate to read_csv the number of axes of the output array\n",
+ "pop = read_csv(csv_dir + '/pop_missing_axis_name.csv', nb_axes=3)\n",
"\n",
- "```python\n",
- "# put arr at A1 in Sheet1, excluding headers (labels)\n",
- "wb['Sheet1'] = arr\n",
- "# same but starting at A9\n",
- "# note that Sheet1 must exist\n",
- "wb['Sheet1']['A9'] = arr\n",
- "```\n"
+ "# give a name to the last axis\n",
+ "pop = pop.rename(-1, 'time')\n",
+ "\n",
+ "# shape and data type of the output array are what we expected\n",
+ "pop.info"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# same for the read_excel function\n",
+ "pop = read_excel(filepath_excel, sheet='pop_missing_axis_name', nb_axes=3)\n",
+ "pop = pop.rename(-1, 'time')\n",
+ "pop.info"
]
},
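+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The header rule described above can be sketched in plain Python. The helper ``guess_nb_axes`` below is hypothetical and written only for illustration; it is not part of LArray and may differ from what ``read_csv`` and ``read_excel`` actually do:\n",
+ "\n",
+ "```python\n",
+ "# the special character separating the names of the last two axes\n",
+ "BACKSLASH = chr(92)\n",
+ "\n",
+ "def guess_nb_axes(header):\n",
+ "    # hypothetical helper: guess the number of axes from the cells of a header line\n",
+ "    for position, cell in enumerate(header):\n",
+ "        if BACKSLASH in cell:\n",
+ "            # this cell holds the names of the last two axes\n",
+ "            return position + 2\n",
+ "    # no special character found: assume a one-dimensional array\n",
+ "    return 1\n",
+ "\n",
+ "explicit = ['geo', 'gender' + BACKSLASH + 'time', '2013', '2014', '2015']\n",
+ "implicit = ['geo', 'gender', '2013', '2014', '2015']\n",
+ "\n",
+ "print(guess_nb_axes(explicit))  # 3\n",
+ "print(guess_nb_axes(implicit))  # 1\n",
+ "```\n",
+ "\n",
+ "With the implicit header the rule yields a single axis, which is why the number of axes has to be given explicitly with ``nb_axes=3``.\n"
+ ]
+ },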
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Put an array in an Excel Sheet, **including** headers (labels)\n",
+ "### NaNs and Missing Data Handling at Reading (CSV, Excel)\n",
"\n",
- "```python\n",
- "# dump arr at A1 in Sheet2, including headers (labels)\n",
- "wb['Sheet2'] = arr.dump()\n",
- "# same but starting at A10\n",
- "wb['Sheet2']['A10'] = arr.dump()\n",
- "```\n"
+ "Sometimes, there is no data available for some label combinations. In the example below, the rows corresponding to `France - Male` and `Germany - Female` are missing:\n",
+ "\n",
+ "| geo | gender\\time | 2013 | 2014 | 2015 |\n",
+ "| ------- | ----------- | -------- | -------- | -------- |\n",
+ "| Belgium | Male | 5472856 | 5493792 | 5524068 |\n",
+ "| Belgium | Female | 5665118 | 5687048 | 5713206 |\n",
+ "| France | Female | 33827685 | 34005671 | 34280951 |\n",
+ "| Germany | Male | 39380976 | 39556923 | 39835457 |\n",
+ "\n",
+ "By default, ``read_csv`` and ``read_excel`` will fill cells associated with missing label combinations with nans. \n",
+ "Be aware that, in that case, an int array will be converted to a float array."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# by default, cells associated with missing label combinations are filled with nans.\n",
+ "# In that case, the output array is converted to a float array\n",
+ "read_csv(csv_dir + '/pop_missing_values.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "However, it is possible to choose which value to use to fill missing cells using the ``fill_value`` argument:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "read_csv(csv_dir + '/pop_missing_values.csv', fill_value=0)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# same for the read_excel function\n",
+ "read_excel(filepath_excel, sheet='pop_missing_values', fill_value=0)"
]
},
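+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make the idea of a missing label combination concrete, here is a small sketch in plain Python. It is an illustration only, using hypothetical names (``present``, ``filled``, ``missing``) and the 2013 values from the table above; it does not reflect how LArray fills arrays internally:\n",
+ "\n",
+ "```python\n",
+ "from itertools import product\n",
+ "\n",
+ "geo = ['Belgium', 'France', 'Germany']\n",
+ "gender = ['Male', 'Female']\n",
+ "\n",
+ "# rows actually present in the file (2013 values only, to keep the sketch short)\n",
+ "present = {\n",
+ "    ('Belgium', 'Male'): 5472856,\n",
+ "    ('Belgium', 'Female'): 5665118,\n",
+ "    ('France', 'Female'): 33827685,\n",
+ "    ('Germany', 'Male'): 39380976,\n",
+ "}\n",
+ "\n",
+ "fill_value = 0\n",
+ "# every label combination gets a cell; absent ones receive fill_value\n",
+ "filled = {key: present.get(key, fill_value) for key in product(geo, gender)}\n",
+ "missing = [key for key in product(geo, gender) if key not in present]\n",
+ "\n",
+ "print(missing)\n",
+ "```\n",
+ "\n",
+ "Using an integer ``fill_value`` such as 0 keeps the values as integers, whereas the default nan forces a conversion to float.\n"
+ ]
+ },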
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Save file to disk\n",
+ "### Sorting Axes at Reading (CSV, Excel, HDF5)\n",
"\n",
- "```python\n",
- "wb.save()\n",
- "```\n"
+ "The ``sort_rows`` and ``sort_columns`` arguments of the reading functions allow you to sort rows and columns alphabetically:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# sort labels at reading --> Male and Female labels are inverted\n",
+ "read_csv(csv_dir + '/pop.csv', sort_rows=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "read_excel(filepath_excel, sheet='births', sort_rows=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "read_hdf(filepath_hdf, key='deaths', sort_rows=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Close file\n",
+ "### Metadata (HDF5)\n",
"\n",
- "```python\n",
- "wb.close()\n",
- "```\n"
+ "Since version 0.29 of LArray, it is possible to add metadata to arrays:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pop.meta.title = 'Population at 1st January'\n",
+ "pop.meta.origin = 'Table demo_pjan from Eurostat'\n",
+ "\n",
+ "pop.info"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Read Arrays\n",
+ "These metadata are automatically saved and loaded when working with the HDF5 file format: "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pop.to_hdf('population.h5', 'pop')\n",
"\n",
- "Open an Excel file\n",
+ "new_pop = read_hdf('population.h5', 'pop')\n",
+ "new_pop.info"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Warning:** Currently, metadata associated with arrays cannot be saved and loaded when working with CSV and Excel files.\n",
+ "This restriction does not apply, however, to metadata associated with sessions."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Loading and Dumping Sessions\n",
"\n",
- "```python\n",
- "wb = open_excel('test.xlsx')\n",
- "```\n"
+ "One of the main advantages of grouping arrays, axes and groups in session objects is that you can load and save all of them in one shot. As with arrays, it is possible to associate metadata with a session. These can be saved and loaded in all file formats. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Load an array from a sheet (assuming the presence of (correctly formatted) headers and only one array in sheet)\n",
+ "### Loading Sessions (CSV, Excel, HDF5)\n",
"\n",
- "```python\n",
- "# save one array in Sheet3 (including headers)\n",
- "wb['Sheet3'] = arr.dump()\n",
+ "To load the items of a session, you have two options:\n",
"\n",
- "# load array from the data starting at A1 in Sheet3\n",
- "arr = wb['Sheet3'].load()\n",
- "```\n"
+ "1) Instantiate a new session and pass the path to the Excel/HDF5 file or to the directory containing CSV files to the Session constructor:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# create a new Session object and load all arrays, axes, groups and metadata \n",
+ "# from all CSV files located in the passed directory\n",
+ "csv_dir = get_example_filepath('population_session')\n",
+ "session = Session(csv_dir)\n",
+ "\n",
+ "# create a new Session object and load all arrays, axes, groups and metadata\n",
+ "# stored in the passed Excel file\n",
+ "filepath_excel = get_example_filepath('population_session.xlsx')\n",
+ "session = Session(filepath_excel)\n",
+ "\n",
+ "# create a new Session object and load all arrays, axes, groups and metadata\n",
+ "# stored in the passed HDF5 file\n",
+ "filepath_hdf = get_example_filepath('population_session.h5')\n",
+ "session = Session(filepath_hdf)\n",
+ "\n",
+ "print(session.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Load an array with its axes information from a range\n",
+ "2) Call the ``load`` method on an existing session and pass the path to the Excel/HDF5 file or to the directory containing CSV files as first argument:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# create a session containing 3 axes, 2 groups and one array 'pop'\n",
+ "filepath = get_example_filepath('pop_only.xlsx')\n",
+ "session = Session(filepath)\n",
"\n",
- "```python\n",
- "# if you need to use the same sheet several times,\n",
- "# you can create a sheet variable\n",
- "sheet2 = wb['Sheet2']\n",
+ "print(session.summary())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# call the load method on the previous session and add the 'births' and 'deaths' arrays to it\n",
+ "filepath = get_example_filepath('births_and_deaths.xlsx')\n",
+ "session.load(filepath)\n",
"\n",
- "# load array contained in the 4 x 4 table defined by cells A10 and D14\n",
- "arr2 = sheet2['A10:D14'].load()\n",
- "```\n"
+ "print(session.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Read Ranges (experimental)\n",
+ "The ``load`` method offers some options:\n",
"\n",
- "Load an array (raw data) with no axis information from a range\n",
+ "1) Using the ``names`` argument, you can specify which items to load:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "session = Session()\n",
+ "\n",
+ "# use the names argument to only load births and deaths arrays\n",
+ "session.load(filepath_hdf, names=['births', 'deaths'])\n",
+ "\n",
+ "print(session.summary())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "2) When the ``display`` argument is set to True, the ``load`` method prints a message each time a new item is loaded: "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "session = Session()\n",
"\n",
- "```python\n",
- "arr3 = wb['Sheet1']['A1:B4']\n",
- "```\n"
+ "# with display=True, the load method will print a message\n",
+ "# each time a new item is loaded\n",
+ "session.load(filepath_hdf, display=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "in fact, this is not really an LArray ...\n",
+ "### Dumping Sessions (CSV, Excel, HDF5)\n",
"\n",
- "```python\n",
- "type(arr3)\n",
+ "To save a session, call the ``save`` method. The first argument is the path to an Excel/HDF5 file, or to a directory if the items are saved to CSV files:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# save items of a session in CSV files.\n",
+ "# Here, the save method will create a 'population' directory in which CSV files will be written \n",
+ "session.save('population')\n",
+ "\n",
+ "# save session to an HDF5 file\n",
+ "session.save('population.h5')\n",
"\n",
- "larray.io.excel.Range\n",
- "```\n"
+ "# save session to an Excel file\n",
+ "session.save('population.xlsx')\n",
+ "\n",
+ "# display the sheets contained in the file 'population.xlsx'\n",
+ "with open_excel('population.xlsx') as wb:\n",
+ " print(wb.sheet_names())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ " **Note:** Concerning the CSV and Excel formats:\n",
+ "\n",
+ " - all Axis objects are saved together in the same Excel sheet (CSV file) named ``__axes__`` (``__axes__.csv``)\n",
+ " - all Group objects are saved together in the same Excel sheet (CSV file) named ``__groups__`` (``__groups__.csv``)\n",
+ " - metadata is saved in one Excel sheet (CSV file) named ``__metadata__`` (``__metadata__.csv``)\n",
+ "\n",
+ " These sheet (CSV file) names cannot be changed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "... but it can be used as such\n",
+ "The ``save`` method has several arguments:\n",
+ "\n",
+ "1) Using the ``names`` argument, you can specify which items to save:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# use the names argument to only save births and deaths arrays\n",
+ "session.save('population.xlsx', names=['births', 'deaths'])\n",
"\n",
- "```python\n",
- "arr3.sum(axis=0)\n",
- "```\n"
+ "# display the sheets contained in the file 'population.xlsx'\n",
+ "with open_excel('population.xlsx') as wb:\n",
+ " print(wb.sheet_names())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "... and it can be used for other stuff, like setting the formula instead of the value:\n",
+ "2) By default, dumping a session to an Excel or HDF5 file will overwrite it. By setting the ``overwrite`` argument to False, you can choose to update the existing Excel or HDF5 file: "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pop = read_csv('./population/pop.csv')\n",
+ "ses_pop = Session([('pop', pop)])\n",
"\n",
- "```python\n",
- "arr3.formula = '=D10+1'\n",
- "```\n"
+ "# by setting overwrite to False, the destination file is updated instead of overwritten.\n",
+ "# The items already stored in the file but not present in the session are left intact.\n",
+ "# In contrast, the items that exist in both the file and the session are completely overwritten.\n",
+ "ses_pop.save('population.xlsx', overwrite=False)\n",
+ "\n",
+ "# display the sheets contained in the file 'population.xlsx'\n",
+ "with open_excel('population.xlsx') as wb:\n",
+ " print(wb.sheet_names())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "In the future, we should also be able to set font name, size, style, etc.\n"
+ "3) When the ``display`` argument is set to True, the ``save`` method prints a message each time an item is dumped:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# with display=True, the save method will print a message\n",
+ "# each time an item is dumped\n",
+ "session.save('population.xlsx', display=True)"
]
}
],
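The effect of the ``overwrite`` argument described in the tutorial cells above can be pictured with a toy model. The sketch below is not LArray code: it stands in plain dictionaries for the file and the session, and the function name `save_items` is hypothetical. It only illustrates the documented rule that, with ``overwrite=False``, items present only in the file are kept while items present in both are replaced.

```python
def save_items(stored, session_items, overwrite=True):
    """Toy model: return the content of the file after saving `session_items`.

    `stored` and `session_items` are plain dicts mapping item names to values.
    This is an illustration only, not how LArray writes files.
    """
    if overwrite:
        # the previous content of the file is discarded
        return dict(session_items)
    # items only present in the file are kept, shared items are replaced
    updated = dict(stored)
    updated.update(session_items)
    return updated


file_content = {'births': 'births v1', 'deaths': 'deaths v1', 'pop': 'pop v1'}
session_content = {'pop': 'pop v2'}

replaced = save_items(file_content, session_content)
updated = save_items(file_content, session_content, overwrite=False)

print(sorted(replaced))
print(sorted(updated))
```

In this model, the default call leaves only ``pop`` in the file, whereas the ``overwrite=False`` call keeps ``births`` and ``deaths`` and replaces ``pop``.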
diff --git a/larray/core/array.py b/larray/core/array.py
index d6a7ac69b..aff5e735a 100644
--- a/larray/core/array.py
+++ b/larray/core/array.py
@@ -6285,7 +6285,7 @@ def to_excel(self, filepath=None, sheet=None, position='A1', overwrite_file=Fals
existing file, "Sheet1" otherwise. sheet can also refer to the position of the sheet
(e.g. 0 for the first sheet, -1 for the last one).
position : str or tuple of integers, optional
- Integer position (row, column) must be 1-based. Defaults to 'A1'.
+ Integer position (row, column) must be 1-based. Used only if engine is 'xlwings'. Defaults to 'A1'.
overwrite_file : bool, optional
Whether or not to overwrite the existing file (or just modify the specified sheet). Defaults to False.
clear_sheet : bool, optional
diff --git a/larray/example.py b/larray/example.py
index 340fc6836..3bf25c20e 100644
--- a/larray/example.py
+++ b/larray/example.py
@@ -1,12 +1,41 @@
import os
import larray as la
-__all__ = ['EXAMPLE_FILES_DIR', 'load_example_data']
+__all__ = ['get_example_filepath', 'load_example_data']
EXAMPLE_FILES_DIR = os.path.dirname(__file__) + '/tests/data/'
AVAILABLE_EXAMPLE_DATA = {
'demography': os.path.join(EXAMPLE_FILES_DIR, 'demography.h5')
}
+AVAILABLE_EXAMPLE_FILES = os.listdir(EXAMPLE_FILES_DIR)
+
+
+def get_example_filepath(fname):
+ """Return the absolute path to an example file, if it exists.
+
+ Parameters
+ ----------
+ fname : str
+ Filename of an existing example file.
+
+ Returns
+ -------
+ str
+ Absolute filepath to the example file.
+
+ Notes
+ -----
+ A ValueError is raised if the provided filename does not represent an existing example file.
+
+ Examples
+ --------
+ >>> fpath = get_example_filepath('examples.xlsx')
+ """
+ fpath = os.path.abspath(os.path.join(EXAMPLE_FILES_DIR, fname))
+ if not os.path.exists(fpath):
+ raise ValueError("Example file {} does not exist. "
+ "Available example files are: {}".format(fname, AVAILABLE_EXAMPLE_FILES))
+ return fpath
def load_example_data(name):
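The lookup-then-validate pattern used by `get_example_filepath` can be tried without installing LArray. The sketch below is a stand-in under stated assumptions: the function name `find_example_file` and its `base_dir` parameter are hypothetical (the real function always looks in `EXAMPLE_FILES_DIR`), and a temporary directory plays the role of the example data directory.

```python
import os
import tempfile


def find_example_file(base_dir, fname):
    """Hypothetical variant of get_example_filepath taking the directory as argument."""
    fpath = os.path.abspath(os.path.join(base_dir, fname))
    if not os.path.exists(fpath):
        # list what is available so that the error message is actionable
        available = sorted(os.listdir(base_dir))
        raise ValueError("Example file {} does not exist. "
                         "Available example files are: {}".format(fname, available))
    return fpath


# a throwaway directory standing in for the example data directory
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'pop.csv'), 'w') as f:
    f.write('geo,2013\nBelgium,11137974\n')

found = find_example_file(demo_dir, 'pop.csv')

try:
    find_example_file(demo_dir, 'missing.csv')
    error_message = None
except ValueError as e:
    error_message = str(e)

print(found)
print(error_message)
```

Listing the available files in the error message mirrors the diff above and makes a mistyped filename easy to spot.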
diff --git a/larray/inout/csv.py b/larray/inout/csv.py
index 4c9161ab5..a1813ed0e 100644
--- a/larray/inout/csv.py
+++ b/larray/inout/csv.py
@@ -18,6 +18,7 @@
from larray.inout.session import register_file_handler
from larray.inout.common import _get_index_col, FileHandler
from larray.inout.pandas import df_aslarray, _axes_to_df, _df_to_axes, _groups_to_df, _df_to_groups
+from larray.example import get_example_filepath
__all__ = ['read_csv', 'read_tsv', 'read_eurostat']
@@ -32,12 +33,14 @@ def read_csv(filepath_or_buffer, nb_axes=None, index_col=None, sep=',', headerse
Notes
-----
csv file format:
- arr,ages,sex,nat\time,1991,1992,1993
- A1,BI,H,BE,1,0,0
- A1,BI,H,FO,2,0,0
- A1,BI,F,BE,0,0,1
- A1,BI,F,FO,0,0,0
- A1,A0,H,BE,0,0,0
+
+ geo,gender\\time,2013,2014,2015
+ Belgium,Male,5472856,5493792,5524068
+ Belgium,Female,5665118,5687048,5713206
+ France,Male,31772665,31936596,32175328
+ France,Female,33827685,34005671,34280951
+ Germany,Male,39380976,39556923,39835457
+ Germany,Female,41142770,41210540,41362080
Parameters
----------
@@ -76,91 +79,105 @@ def read_csv(filepath_or_buffer, nb_axes=None, index_col=None, sep=',', headerse
Examples
--------
- >>> import os
- >>> from larray import EXAMPLE_FILES_DIR
- >>> fname = os.path.join(EXAMPLE_FILES_DIR, 'test2d.csv')
+ >>> csv_dir = get_example_filepath('examples')
+ >>> fname = csv_dir + '/pop.csv'
+
+ >>> # The data below is derived from a subset of the demo_pjan table from Eurostat
>>> read_csv(fname)
- a\\b b0 b1
- 1 0 1
- 2 2 3
- 3 4 5
+ geo gender\\time 2013 2014 2015
+ Belgium Male 5472856 5493792 5524068
+ Belgium Female 5665118 5687048 5713206
+ France Male 31772665 31936596 32175328
+ France Female 33827685 34005671 34280951
+ Germany Male 39380976 39556923 39835457
+ Germany Female 41142770 41210540 41362080
Missing label combinations
- >>> fname = os.path.join(EXAMPLE_FILES_DIR, 'missing_values_3d.csv')
+ >>> fname = csv_dir + '/pop_missing_values.csv'
>>> # let's take a look inside the CSV file.
- >>> # they are missing label combinations: (a=2, b=b0) and (a=3, b=b1)
+ >>> # there are missing label combinations: (France, Male) and (Germany, Female)
>>> with open(fname) as f:
... print(f.read().strip())
- a,b\c,c0,c1,c2
- 1,b0,0,1,2
- 1,b1,3,4,5
- 2,b1,9,10,11
- 3,b0,12,13,14
+ geo,gender\\time,2013,2014,2015
+ Belgium,Male,5472856,5493792,5524068
+ Belgium,Female,5665118,5687048,5713206
+ France,Female,33827685,34005671,34280951
+ Germany,Male,39380976,39556923,39835457
>>> # by default, cells associated with missing label combinations are filled with NaN.
>>> # In that case, an int array is converted to a float array.
>>> read_csv(fname)
- a b\c c0 c1 c2
- 1 b0 0.0 1.0 2.0
- 1 b1 3.0 4.0 5.0
- 2 b0 nan nan nan
- 2 b1 9.0 10.0 11.0
- 3 b0 12.0 13.0 14.0
- 3 b1 nan nan nan
+ geo gender\\time 2013 2014 2015
+ Belgium Male 5472856.0 5493792.0 5524068.0
+ Belgium Female 5665118.0 5687048.0 5713206.0
+ France Male nan nan nan
+ France Female 33827685.0 34005671.0 34280951.0
+ Germany Male 39380976.0 39556923.0 39835457.0
+ Germany Female nan nan nan
>>> # using argument 'fill_value', you can choose which value to use to fill missing cells.
>>> read_csv(fname, fill_value=0)
- a b\c c0 c1 c2
- 1 b0 0 1 2
- 1 b1 3 4 5
- 2 b0 0 0 0
- 2 b1 9 10 11
- 3 b0 12 13 14
- 3 b1 0 0 0
+ geo gender\\time 2013 2014 2015
+ Belgium Male 5472856 5493792 5524068
+ Belgium Female 5665118 5687048 5713206
+ France Male 0 0 0
+ France Female 33827685 34005671 34280951
+ Germany Male 39380976 39556923 39835457
+ Germany Female 0 0 0
Specify the number of axes of the output array (useful when the name of the last axis is implicit)
- >>> fname = os.path.join(EXAMPLE_FILES_DIR, 'missing_axis_name.csv')
+ >>> fname = csv_dir + '/pop_missing_axis_name.csv'
>>> # let's take a look inside the CSV file.
- >>> # The name of the second axis is missing.
+ >>> # The name of the last axis is missing.
>>> with open(fname) as f:
... print(f.read().strip())
- a,b0,b1,b2
- a0,0,1,2
- a1,3,4,5
- a2,6,7,8
+ geo,gender,2013,2014,2015
+ Belgium,Male,5472856,5493792,5524068
+ Belgium,Female,5665118,5687048,5713206
+ France,Male,31772665,31936596,32175328
+ France,Female,33827685,34005671,34280951
+ Germany,Male,39380976,39556923,39835457
+ Germany,Female,41142770,41210540,41362080
>>> # read the array stored in the CSV file as is
- >>> read_csv(fname)
- a\{1} b0 b1 b2
- a0 0 1 2
- a1 3 4 5
- a2 6 7 8
+ >>> arr = read_csv(fname)
+ >>> # we expected a 3 x 2 x 3 array with data of type int
+ >>> # but we got a 6 x 4 array with data of type object
+ >>> arr.info
+ 6 x 4
+ geo [6]: 'Belgium' 'Belgium' 'France' 'France' 'Germany' 'Germany'
+ {1} [4]: 'gender' '2013' '2014' '2015'
+ dtype: object
+ memory used: 192 bytes
>>> # using argument 'nb_axes', you can force the number of axes of the output array
- >>> read_csv(fname, nb_axes=2)
- a\{1} b0 b1 b2
- a0 0 1 2
- a1 3 4 5
- a2 6 7 8
+ >>> arr = read_csv(fname, nb_axes=3)
+ >>> # as expected, we have a 3 x 2 x 3 array with data of type int
+ >>> arr.info
+ 3 x 2 x 3
+ geo [3]: 'Belgium' 'France' 'Germany'
+ gender [2]: 'Male' 'Female'
+ {2} [3]: 2013 2014 2015
+ dtype: int64
+ memory used: 144 bytes
Read array saved in "narrow" format (wide=False)
- >>> fname = os.path.join(EXAMPLE_FILES_DIR, 'narrow_2d.csv')
+ >>> fname = csv_dir + '/pop_narrow_format.csv'
>>> # let's take a look inside the CSV file.
>>> # Here, data are stored in a 'narrow' format.
>>> with open(fname) as f:
... print(f.read().strip())
- a,b,value
- 1,b0,0
- 1,b1,1
- 2,b0,2
- 2,b1,3
- 3,b0,4
- 3,b1,5
+ geo,time,value
+ Belgium,2013,11137974
+ Belgium,2014,11180840
+ Belgium,2015,11237274
+ France,2013,65600350
+ France,2014,65942267
+ France,2015,66456279
>>> # to read arrays stored in 'narrow' format, you must pass wide=False to read_csv
>>> read_csv(fname, wide=False)
- a\\b b0 b1
- 1 0 1
- 2 2 3
- 3 4 5
+ geo\\time 2013 2014 2015
+ Belgium 11137974 11180840 11237274
+ France 65600350 65942267 66456279
"""
if not np.isnan(na):
fill_value = na
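The handling of missing label combinations shown in the `read_csv` docstring can be illustrated with the standard library alone. The sketch below does not use LArray or pandas: the helper `read_filled` is hypothetical, and it assumes a file with exactly two leading label columns, as in `pop_missing_values.csv`. It builds the full cartesian product of the labels and fills absent rows with `fill_value`.

```python
import csv
import io
from itertools import product

# same content as pop_missing_values.csv in the diff above
CSV_TEXT = """geo,gender\\time,2013,2014,2015
Belgium,Male,5472856,5493792,5524068
Belgium,Female,5665118,5687048,5713206
France,Female,33827685,34005671,34280951
Germany,Male,39380976,39556923,39835457
"""


def read_filled(text, fill_value=float('nan')):
    """Return {(geo, gender): values}, filling missing combinations with fill_value."""
    rows = list(csv.reader(io.StringIO(text)))
    header, records = rows[0], rows[1:]
    nb_columns = len(header) - 2
    data = {(geo, gender): [int(v) for v in values]
            for geo, gender, *values in records}
    # keep labels in order of first appearance
    geos = list(dict.fromkeys(geo for geo, _ in data))
    genders = list(dict.fromkeys(gender for _, gender in data))
    return {key: data.get(key, [fill_value] * nb_columns)
            for key in product(geos, genders)}


filled = read_filled(CSV_TEXT, fill_value=0)
for key, values in filled.items():
    print(key, values)
```

With the default `fill_value` (NaN), the missing rows would hold floats, which is why the docstring notes that an int array is converted to a float array in that case.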
diff --git a/larray/inout/excel.py b/larray/inout/excel.py
index 1f1f2bf19..9d3651986 100644
--- a/larray/inout/excel.py
+++ b/larray/inout/excel.py
@@ -21,14 +21,17 @@
from larray.inout.common import _get_index_col, FileHandler
from larray.inout.pandas import df_aslarray, _axes_to_df, _df_to_axes, _groups_to_df, _df_to_groups
from larray.inout.xw_excel import open_excel
+from larray.example import get_example_filepath
+
__all__ = ['read_excel']
+# TODO: remove '# doctest: +SKIP' next to arr.info once Python 2.7 support is dropped
@deprecate_kwarg('nb_index', 'nb_axes', arg_converter=lambda x: x + 1)
@deprecate_kwarg('sheetname', 'sheet')
def read_excel(filepath, sheet=0, nb_axes=None, index_col=None, fill_value=nan, na=nan,
- sort_rows=False, sort_columns=False, wide=True, engine=None, **kwargs):
+ sort_rows=False, sort_columns=False, wide=True, engine=None, range=slice(None), **kwargs):
"""
Reads excel file from sheet name and returns an LArray with the contents
@@ -61,6 +64,9 @@ def read_excel(filepath, sheet=0, nb_axes=None, index_col=None, fill_value=nan,
engine : {'xlrd', 'xlwings'}, optional
Engine to use to read the Excel file. If None (default), it will use 'xlwings' by default if the module is
installed and relies on Pandas default reader otherwise.
+ range : str or slice, optional
+ Range to load the array from (only supported for the 'xlwings' engine). Defaults to slice(None) which loads
+ the whole sheet, ignoring blank cells in the bottom right corner.
**kwargs
Returns
@@ -69,89 +75,125 @@ def read_excel(filepath, sheet=0, nb_axes=None, index_col=None, fill_value=nan,
Examples
--------
- >>> import os
- >>> from larray import EXAMPLE_FILES_DIR
- >>> fname = os.path.join(EXAMPLE_FILES_DIR, 'examples.xlsx')
+ >>> fname = get_example_filepath('examples.xlsx')
Read array from first sheet
+ >>> # The data below is derived from a subset of the demo_pjan table from Eurostat
>>> read_excel(fname)
- a a0 a1 a2
- 0 1 2
+ geo gender\\time 2013 2014 2015
+ Belgium Male 5472856 5493792 5524068
+ Belgium Female 5665118 5687048 5713206
+ France Male 31772665 31936596 32175328
+ France Female 33827685 34005671 34280951
+ Germany Male 39380976 39556923 39835457
+ Germany Female 41142770 41210540 41362080
Read array from a specific sheet
- >>> read_excel(fname, '2d')
- a\\b b0 b1
- 1 0 1
- 2 2 3
- 3 4 5
+ >>> # The data below is derived from a subset of the demo_fasec table from Eurostat
+ >>> read_excel(fname, 'births')
+ geo gender\\time 2013 2014 2015
+ Belgium Male 64371 64173 62561
+ Belgium Female 61235 60841 59713
+ France Male 415762 418721 409145
+ France Female 396581 400607 390526
+ Germany Male 349820 366835 378478
+ Germany Female 332249 348092 359097
Missing label combinations
- >>> # let's take a look inside the sheet 'missing_values'.
- >>> # they are missing label combinations: (a=2, b=b0) and (a=3, b=b1):
+ >>> # let's take a look inside the sheet 'pop_missing_values'.
+ >>> # there are missing label combinations: (France, Male) and (Germany, Female):
- a b\c c0 c1 c2
- 1 b0 0 1 2
- 1 b1 3 4 5
- 2 b1 9 10 11
- 3 b0 12 13 14
+ geo gender\\time 2013 2014 2015
+ Belgium Male 5472856 5493792 5524068
+ Belgium Female 5665118 5687048 5713206
+ France Female 33827685 34005671 34280951
+ Germany Male 39380976 39556923 39835457
>>> # by default, cells associated with missing label combinations are filled with NaN.
>>> # In that case, an int array is converted to a float array.
- >>> read_excel(fname, sheet='missing_values')
- a b\c c0 c1 c2
- 1 b0 0.0 1.0 2.0
- 1 b1 3.0 4.0 5.0
- 2 b0 nan nan nan
- 2 b1 9.0 10.0 11.0
- 3 b0 12.0 13.0 14.0
- 3 b1 nan nan nan
+ >>> read_excel(fname, sheet='pop_missing_values')
+ geo gender\\time 2013 2014 2015
+ Belgium Male 5472856.0 5493792.0 5524068.0
+ Belgium Female 5665118.0 5687048.0 5713206.0
+ France Male nan nan nan
+ France Female 33827685.0 34005671.0 34280951.0
+ Germany Male 39380976.0 39556923.0 39835457.0
+ Germany Female nan nan nan
>>> # using argument 'fill_value', you can choose which value to use to fill missing cells.
- >>> read_excel(fname, sheet='missing_values', fill_value=0)
- a b\c c0 c1 c2
- 1 b0 0 1 2
- 1 b1 3 4 5
- 2 b0 0 0 0
- 2 b1 9 10 11
- 3 b0 12 13 14
- 3 b1 0 0 0
+ >>> read_excel(fname, sheet='pop_missing_values', fill_value=0)
+ geo gender\\time 2013 2014 2015
+ Belgium Male 5472856 5493792 5524068
+ Belgium Female 5665118 5687048 5713206
+ France Male 0 0 0
+ France Female 33827685 34005671 34280951
+ Germany Male 39380976 39556923 39835457
+ Germany Female 0 0 0
Specify the number of axes of the output array (useful when the name of the last axis is implicit)
- >>> # read the array stored in the CSV file as it
- >>> read_excel(fname, sheet='missing_axis_name')
- a\{1} b0 b1 b2
- a0 0 1 2
- a1 3 4 5
- a2 6 7 8
+ The content of the sheet 'pop_missing_axis_name' is:
+
+ geo gender 2013 2014 2015
+ Belgium Male 5472856 5493792 5524068
+ Belgium Female 5665118 5687048 5713206
+ France Male 31772665 31936596 32175328
+ France Female 33827685 34005671 34280951
+ Germany Male 39380976 39556923 39835457
+ Germany Female 41142770 41210540 41362080
+
+ >>> # read the array stored in the sheet 'pop_missing_axis_name' as is
+ >>> arr = read_excel(fname, sheet='pop_missing_axis_name')
+ >>> # we expected a 3 x 2 x 3 array with data of type int
+ >>> # but we got a 6 x 4 array with data of type object
+ >>> arr.info # doctest: +SKIP
+ 6 x 4
+ geo [6]: 'Belgium' 'Belgium' 'France' 'France' 'Germany' 'Germany'
+ {1} [4]: 'gender' '2013' '2014' '2015'
+ dtype: object
+ memory used: 192 bytes
>>> # using argument 'nb_axes', you can force the number of axes of the output array
- >>> read_excel(fname, sheet='missing_axis_name', nb_axes=2)
- a\{1} b0 b1 b2
- a0 0 1 2
- a1 3 4 5
- a2 6 7 8
+ >>> arr = read_excel(fname, sheet='pop_missing_axis_name', nb_axes=3)
+ >>> # as expected, we have a 3 x 2 x 3 array with data of type int
+ >>> arr.info # doctest: +SKIP
+ 3 x 2 x 3
+ geo [3]: 'Belgium' 'France' 'Germany'
+ gender [2]: 'Male' 'Female'
+ {2} [3]: 2013 2014 2015
+ dtype: int64
+ memory used: 144 bytes
Read array saved in "narrow" format (wide=False)
- >>> # let's take a look inside the sheet 'narrow_2d'.
+ >>> # let's take a look inside the sheet 'pop_narrow_format'.
>>> # The data are stored in a 'narrow' format:
- a b value
- 1 b0 0
- 1 b1 1
- 2 b0 2
- 2 b1 3
- 3 b0 4
- 3 b1 5
+ geo time value
+ Belgium 2013 11137974
+ Belgium 2014 11180840
+ Belgium 2015 11237274
+ France 2013 65600350
+ France 2014 65942267
+ France 2015 66456279
>>> # to read arrays stored in 'narrow' format, you must pass wide=False to read_excel
- >>> read_excel(fname, 'narrow_2d', wide=False)
- a\\b b0 b1
- 1 0 1
- 2 2 3
- 3 4 5
+ >>> read_excel(fname, 'pop_narrow_format', wide=False)
+ geo\\time 2013 2014 2015
+ Belgium 11137974 11180840 11237274
+ France 65600350 65942267 66456279
+
+ Extract array from a given range (xlwings only)
+
+ >>> read_excel(fname, 'pop_births_deaths', range='A9:E15') # doctest: +SKIP
+ geo gender\\time 2013 2014 2015
+ Belgium Male 64371 64173 62561
+ Belgium Female 61235 60841 59713
+ France Male 415762 418721 409145
+ France Female 396581 400607 390526
+ Germany Male 349820 366835 378478
+ Germany Female 332249 348092 359097
"""
if not np.isnan(na):
fill_value = na
@@ -171,9 +213,10 @@ def read_excel(filepath, sheet=0, nb_axes=None, index_col=None, fill_value=nan,
.format(list(kwargs.keys())[0]))
from larray.inout.xw_excel import open_excel
with open_excel(filepath) as wb:
- return wb[sheet].load(index_col=index_col, fill_value=fill_value, sort_rows=sort_rows,
- sort_columns=sort_columns, wide=wide)
+ return wb[sheet][range].load(index_col=index_col, fill_value=fill_value, sort_rows=sort_rows,
+ sort_columns=sort_columns, wide=wide)
else:
+ # TODO: add support for range argument (using usecols, skiprows and nrows arguments of pandas.read_excel)
df = pd.read_excel(filepath, sheet, index_col=index_col, engine=engine, **kwargs)
return df_aslarray(df, sort_rows=sort_rows, sort_columns=sort_columns, raw=index_col is None,
fill_value=fill_value, wide=wide)
diff --git a/larray/inout/hdf.py b/larray/inout/hdf.py
index 2fc6776f9..b59e64232 100644
--- a/larray/inout/hdf.py
+++ b/larray/inout/hdf.py
@@ -14,6 +14,7 @@
from larray.inout.session import register_file_handler
from larray.inout.common import FileHandler
from larray.inout.pandas import df_aslarray
+from larray.example import get_example_filepath
__all__ = ['read_hdf']
@@ -47,42 +48,19 @@ def read_hdf(filepath_or_buffer, key, fill_value=nan, na=nan, sort_rows=False, s
Examples
--------
- >>> import os
- >>> from larray import EXAMPLE_FILES_DIR
- >>> fname = os.path.join(EXAMPLE_FILES_DIR, 'test.h5')
+ >>> fname = get_example_filepath('examples.h5')
Read array by passing its identifier (key) inside the HDF file
- >>> read_hdf(fname, '3d')
- a b\c c0 c1 c2
- 1 b0 0 1 2
- 1 b1 3 4 5
- 2 b0 6 7 8
- 2 b1 9 10 11
- 3 b0 12 13 14
- 3 b1 15 16 17
-
- Missing label combinations
-
- >>> # by default, cells associated with missing label combinations are filled with NaN.
- >>> # In that case, an int array is converted to a float array.
- >>> read_hdf(fname, key='missing_values')
- a b\c c0 c1 c2
- 1 b0 0.0 1.0 2.0
- 1 b1 3.0 4.0 5.0
- 2 b0 nan nan nan
- 2 b1 9.0 10.0 11.0
- 3 b0 12.0 13.0 14.0
- 3 b1 nan nan nan
- >>> # using argument 'fill_value', you can choose which value to use to fill missing cells.
- >>> read_hdf(fname, key='missing_values', fill_value=0)
- a b\c c0 c1 c2
- 1 b0 0 1 2
- 1 b1 3 4 5
- 2 b0 0 0 0
- 2 b1 9 10 11
- 3 b0 12 13 14
- 3 b1 0 0 0
+ >>> # The data below is derived from a subset of the demo_pjan table from Eurostat
+ >>> read_hdf(fname, 'pop')
+ geo gender\\time 2013 2014 2015
+ Belgium Male 5472856 5493792 5524068
+ Belgium Female 5665118 5687048 5713206
+ France Male 31772665 31936596 32175328
+ France Female 33827685 34005671 34280951
+ Germany Male 39380976 39556923 39835457
+ Germany Female 41142770 41210540 41362080
"""
if not np.isnan(na):
fill_value = na
diff --git a/larray/tests/data/births_and_deaths.xlsx b/larray/tests/data/births_and_deaths.xlsx
new file mode 100644
index 000000000..5fb2ba5e9
Binary files /dev/null and b/larray/tests/data/births_and_deaths.xlsx differ
diff --git a/larray/tests/data/examples.h5 b/larray/tests/data/examples.h5
new file mode 100644
index 000000000..7b8a21dc3
Binary files /dev/null and b/larray/tests/data/examples.h5 differ
diff --git a/larray/tests/data/examples.xlsx b/larray/tests/data/examples.xlsx
index 03811b799..e1f1c265d 100644
Binary files a/larray/tests/data/examples.xlsx and b/larray/tests/data/examples.xlsx differ
diff --git a/larray/tests/data/examples/births.csv b/larray/tests/data/examples/births.csv
new file mode 100644
index 000000000..b8136dea4
--- /dev/null
+++ b/larray/tests/data/examples/births.csv
@@ -0,0 +1,7 @@
+geo,gender\time,2013,2014,2015
+Belgium,Male,64371,64173,62561
+Belgium,Female,61235,60841,59713
+France,Male,415762,418721,409145
+France,Female,396581,400607,390526
+Germany,Male,349820,366835,378478
+Germany,Female,332249,348092,359097
diff --git a/larray/tests/data/examples/deaths.csv b/larray/tests/data/examples/deaths.csv
new file mode 100644
index 000000000..9b98fd078
--- /dev/null
+++ b/larray/tests/data/examples/deaths.csv
@@ -0,0 +1,7 @@
+geo,gender\time,2013,2014,2015
+Belgium,Male,53908,51579,53631
+Belgium,Female,55426,53176,56910
+France,Male,287410,282381,297028
+France,Female,281955,277054,296779
+Germany,Male,429645,422225,449512
+Germany,Female,464180,446131,475688
diff --git a/larray/tests/data/examples/pop.csv b/larray/tests/data/examples/pop.csv
new file mode 100644
index 000000000..4d1675f75
--- /dev/null
+++ b/larray/tests/data/examples/pop.csv
@@ -0,0 +1,7 @@
+geo,gender\time,2013,2014,2015
+Belgium,Male,5472856,5493792,5524068
+Belgium,Female,5665118,5687048,5713206
+France,Male,31772665,31936596,32175328
+France,Female,33827685,34005671,34280951
+Germany,Male,39380976,39556923,39835457
+Germany,Female,41142770,41210540,41362080
diff --git a/larray/tests/data/examples/pop_missing_axis_name.csv b/larray/tests/data/examples/pop_missing_axis_name.csv
new file mode 100644
index 000000000..c68fbe20e
--- /dev/null
+++ b/larray/tests/data/examples/pop_missing_axis_name.csv
@@ -0,0 +1,7 @@
+geo,gender,2013,2014,2015
+Belgium,Male,5472856,5493792,5524068
+Belgium,Female,5665118,5687048,5713206
+France,Male,31772665,31936596,32175328
+France,Female,33827685,34005671,34280951
+Germany,Male,39380976,39556923,39835457
+Germany,Female,41142770,41210540,41362080
diff --git a/larray/tests/data/examples/pop_missing_values.csv b/larray/tests/data/examples/pop_missing_values.csv
new file mode 100644
index 000000000..6bb7d53bf
--- /dev/null
+++ b/larray/tests/data/examples/pop_missing_values.csv
@@ -0,0 +1,5 @@
+geo,gender\time,2013,2014,2015
+Belgium,Male,5472856,5493792,5524068
+Belgium,Female,5665118,5687048,5713206
+France,Female,33827685,34005671,34280951
+Germany,Male,39380976,39556923,39835457
diff --git a/larray/tests/data/examples/pop_narrow_format.csv b/larray/tests/data/examples/pop_narrow_format.csv
new file mode 100644
index 000000000..44a7f925b
--- /dev/null
+++ b/larray/tests/data/examples/pop_narrow_format.csv
@@ -0,0 +1,7 @@
+geo,time,value
+Belgium,2013,11137974
+Belgium,2014,11180840
+Belgium,2015,11237274
+France,2013,65600350
+France,2014,65942267
+France,2015,66456279
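The relation between the 'narrow' layout of `pop_narrow_format.csv` and the wide result returned by `read_csv(fname, wide=False)` can be sketched with the standard library. The helper `narrow_to_wide` below is hypothetical and is not how LArray performs the conversion. It assumes the column names `geo`, `time` and `value` from the file above and simply pivots the `time` column into keys.

```python
import csv
import io

# same content as pop_narrow_format.csv in the diff above
NARROW_TEXT = """geo,time,value
Belgium,2013,11137974
Belgium,2014,11180840
Belgium,2015,11237274
France,2013,65600350
France,2014,65942267
France,2015,66456279
"""


def narrow_to_wide(text):
    """Pivot (geo, time, value) records into {geo: {time: value}}."""
    wide = {}
    for record in csv.DictReader(io.StringIO(text)):
        row = wide.setdefault(record['geo'], {})
        row[int(record['time'])] = int(record['value'])
    return wide


wide = narrow_to_wide(NARROW_TEXT)
for geo, values in wide.items():
    print(geo, values)
```

Each distinct `geo` label becomes one row and each distinct `time` label one column, which matches the 2 x 3 array shown in the docstring examples.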
diff --git a/larray/tests/data/missing_axis_name.csv b/larray/tests/data/missing_axis_name.csv
index 28c1d639e..3a69dc182 100644
--- a/larray/tests/data/missing_axis_name.csv
+++ b/larray/tests/data/missing_axis_name.csv
@@ -1,4 +1,5 @@
-a,b0,b1,b2
-a0,0,1,2
-a1,3,4,5
-a2,6,7,8
+a,b,c0,c1
+a0,b0,0,1
+a0,b1,2,3
+a1,b0,4,5
+a1,b1,6,7
diff --git a/larray/tests/data/pop_only.xlsx b/larray/tests/data/pop_only.xlsx
new file mode 100644
index 000000000..960b3fd3b
Binary files /dev/null and b/larray/tests/data/pop_only.xlsx differ
diff --git a/larray/tests/data/population_session.h5 b/larray/tests/data/population_session.h5
new file mode 100644
index 000000000..c4200608d
Binary files /dev/null and b/larray/tests/data/population_session.h5 differ
diff --git a/larray/tests/data/population_session.xlsx b/larray/tests/data/population_session.xlsx
new file mode 100644
index 000000000..79dd17636
Binary files /dev/null and b/larray/tests/data/population_session.xlsx differ
diff --git a/larray/tests/data/population_session/__axes__.csv b/larray/tests/data/population_session/__axes__.csv
new file mode 100644
index 000000000..c8e70545e
--- /dev/null
+++ b/larray/tests/data/population_session/__axes__.csv
@@ -0,0 +1,4 @@
+geo,gender,time
+Belgium,Male,2013
+France,Female,2014
+Germany,,2015
diff --git a/larray/tests/data/population_session/__groups__.csv b/larray/tests/data/population_session/__groups__.csv
new file mode 100644
index 000000000..a25e717f9
--- /dev/null
+++ b/larray/tests/data/population_session/__groups__.csv
@@ -0,0 +1,3 @@
+even_years@time,odd_years@time
+2014,2013
+,2015
diff --git a/larray/tests/data/population_session/births.csv b/larray/tests/data/population_session/births.csv
new file mode 100644
index 000000000..b8136dea4
--- /dev/null
+++ b/larray/tests/data/population_session/births.csv
@@ -0,0 +1,7 @@
+geo,gender\time,2013,2014,2015
+Belgium,Male,64371,64173,62561
+Belgium,Female,61235,60841,59713
+France,Male,415762,418721,409145
+France,Female,396581,400607,390526
+Germany,Male,349820,366835,378478
+Germany,Female,332249,348092,359097
diff --git a/larray/tests/data/population_session/deaths.csv b/larray/tests/data/population_session/deaths.csv
new file mode 100644
index 000000000..9b98fd078
--- /dev/null
+++ b/larray/tests/data/population_session/deaths.csv
@@ -0,0 +1,7 @@
+geo,gender\time,2013,2014,2015
+Belgium,Male,53908,51579,53631
+Belgium,Female,55426,53176,56910
+France,Male,287410,282381,297028
+France,Female,281955,277054,296779
+Germany,Male,429645,422225,449512
+Germany,Female,464180,446131,475688
diff --git a/larray/tests/data/population_session/pop.csv b/larray/tests/data/population_session/pop.csv
new file mode 100644
index 000000000..4d1675f75
--- /dev/null
+++ b/larray/tests/data/population_session/pop.csv
@@ -0,0 +1,7 @@
+geo,gender\time,2013,2014,2015
+Belgium,Male,5472856,5493792,5524068
+Belgium,Female,5665118,5687048,5713206
+France,Male,31772665,31936596,32175328
+France,Female,33827685,34005671,34280951
+Germany,Male,39380976,39556923,39835457
+Germany,Female,41142770,41210540,41362080
diff --git a/larray/tests/data/test.xlsx b/larray/tests/data/test.xlsx
index 8a957d23c..61b849a26 100644
Binary files a/larray/tests/data/test.xlsx and b/larray/tests/data/test.xlsx differ
diff --git a/larray/tests/data/test_narrow.xlsx b/larray/tests/data/test_narrow.xlsx
index 010c47986..8422ace79 100644
Binary files a/larray/tests/data/test_narrow.xlsx and b/larray/tests/data/test_narrow.xlsx differ
diff --git a/larray/tests/test_array.py b/larray/tests/test_array.py
index 9c374b3cb..17b60952f 100644
--- a/larray/tests/test_array.py
+++ b/larray/tests/test_array.py
@@ -2888,6 +2888,10 @@ def test_read_excel_xlwings():
expected[isnan(expected)] = 42
assert_array_equal(arr, expected)
+ # range
+ arr = read_excel(inputpath('test.xlsx'), 'position', range='D3:H9')
+ assert_array_equal(arr, io_3d)
+
#################
# narrow format #
#################
@@ -2910,6 +2914,10 @@ def test_read_excel_xlwings():
arr = read_excel(inputpath('test_narrow.xlsx'), 'unsorted', wide=False)
assert_array_equal(arr, io_unsorted)
+ # range
+ arr = read_excel(inputpath('test_narrow.xlsx'), 'position', range='D3:G21', wide=False)
+ assert_array_equal(arr, io_3d)
+
##############################
# invalid keyword argument #
##############################