
Commit 77db9aa

DOC: updated HDFStore docs for indexing support and better explanations on how to deal with strings in indexables/values
1 parent 93f75b3 commit 77db9aa

1 file changed

+33
-18
lines changed

doc/source/io.rst

@@ -1001,7 +1001,7 @@ Objects can be written to the file just like adding key-value pairs to a dict:
10011001
store['wp'] = wp
10021002
10031003
# the type of stored data
1004-
store.handle.root.wp._v_attrs.pandas_type
1004+
store.root.wp._v_attrs.pandas_type
10051005
10061006
store
10071007
@@ -1037,8 +1037,7 @@ Storing in Table format
10371037

10381038
``HDFStore`` supports another ``PyTables`` format on disk, the ``table`` format. Conceptually a ``table`` is shaped
10391039
very much like a DataFrame, with rows and columns. A ``table`` may be appended to in the same or other sessions.
1040-
In addition, delete & query type operations are supported. You can create an index with ``create_table_index``
1041-
after data is already in the table (this may become automatic in the future or an option on appending/putting a ``table``).
1040+
In addition, delete & query type operations are supported.
10421041

10431042
.. ipython:: python
10441043
:suppress:
@@ -1061,11 +1060,7 @@ after data is already in the table (this may become automatic in the future or a
10611060
store.select('df')
10621061
10631062
# the type of stored data
1064-
store.handle.root.df._v_attrs.pandas_type
1065-
1066-
# create an index
1067-
store.create_table_index('df')
1068-
store.handle.root.df.table
1063+
store.root.df._v_attrs.pandas_type
10691064
10701065
Hierarchical Keys
10711066
~~~~~~~~~~~~~~~~~
@@ -1090,8 +1085,7 @@ Storing Mixed Types in a Table
10901085
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10911086

10921087
Storing mixed-dtype data is supported. Strings are stored as fixed-width columns, using the maximum size of the appended column. Subsequent appends will truncate strings at this length.
1093-
Passing ``min_itemsize = { column_name : size }`` as a paremeter to append will set a larger minimum for the column. Storing ``floats, strings, ints, bools`` are currently supported.
1094-
Pass ``min_itemsize`` with a ``column_name`` of values to effect a minimum pre-allocation of space for strings in the dataset.
1088+
Passing ``min_itemsize = { 'values' : size }`` as a parameter to ``append`` will set a larger minimum for the string columns. Storing ``floats, strings, ints, bools`` is currently supported.
10951089

10961090
.. ipython:: python
10971091
@@ -1100,11 +1094,14 @@ Pass ``min_itemsize`` with a ``column_name`` of values to effect a minimum pre-a
11001094
df_mixed['int'] = 1
11011095
df_mixed['bool'] = True
11021096
1103-
store.append('df_mixed',df_mixed)
1097+
store.append('df_mixed', df_mixed, min_itemsize = { 'values' : 50 })
11041098
df_mixed1 = store.select('df_mixed')
11051099
df_mixed1
11061100
df_mixed1.get_dtype_counts()
11071101
1102+
# we have provided a minimum string column size
1103+
store.root.df_mixed.table
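
The fixed-width behaviour described above can be illustrated without ``PyTables``. What follows is a minimal sketch using only the standard-library ``struct`` module; it is not the storage code ``HDFStore`` uses, and the helper names ``pack_fixed`` and ``unpack_fixed`` and the chosen widths are hypothetical. It shows why a width fixed by the first write truncates longer strings later, and how reserving a larger minimum up front (the role ``min_itemsize`` plays) avoids that.

```python
import struct


def pack_fixed(value, width):
    """Store `value` in exactly `width` bytes (hypothetical helper)."""
    # The 's' format pads short values with NUL bytes and truncates long ones.
    return struct.pack('%ds' % width, value.encode('ascii'))


def unpack_fixed(raw):
    """Recover the string by stripping the NUL padding."""
    return raw.rstrip(b'\x00').decode('ascii')


first_batch = ['foo', 'bar']
# Width chosen from the longest string seen in the first write.
width = max(len(s) for s in first_batch)

# A longer string written later is cut down to `width` characters.
stored = unpack_fixed(pack_fixed('a_longer_string', width))
print(stored)

# Reserving a larger minimum up front keeps the later value intact.
reserved = max(width, 50)
kept = unpack_fixed(pack_fixed('a_longer_string', reserved))
print(kept)
```

The trade-off is space: every entry occupies the reserved width, whether or not it needs it.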
1104+
11081105
11091106
Querying a Table
11101107
~~~~~~~~~~~~~~~~
@@ -1136,6 +1133,23 @@ Queries are built up using a list of ``Terms`` (currently only **anding** of ter
11361133
store
11371134
store.select('wp',[ 'major_axis>20000102', ('minor_axis', '=', ['A','B']) ])
11381135
1136+
Indexing
1137+
~~~~~~~~
1138+
You can create an index for a table with ``create_table_index`` after data is already in the table (after an ``append``/``put`` operation). Creating a table index is **highly** encouraged. It will speed up your queries a great deal when you use a ``select`` with the indexed dimension as the ``where``. It is not done automagically now because you may want to index different axes than the default (except in the case of a DataFrame, where it almost always makes sense to index the ``index``).
1139+
1140+
.. ipython:: python
1141+
1142+
# create an index
1143+
store.create_table_index('df')
1144+
i = store.root.df.table.cols.index.index
1145+
i.optlevel, i.kind
1146+
1147+
# change an index by passing new parameters
1148+
store.create_table_index('df', optlevel = 9, kind = 'full')
1149+
i = store.root.df.table.cols.index.index
1150+
i.optlevel, i.kind
1151+
1152+
11391153
Delete from a Table
11401154
~~~~~~~~~~~~~~~~~~~
11411155

@@ -1152,36 +1166,37 @@ Notes & Caveats
11521166
- You can not append/select/delete to a non-table (table creation is determined on the first append, or by passing ``table=True`` in a put operation)
11531167
- ``HDFStore`` is **not-threadsafe for writing**. The underlying ``PyTables`` only supports concurrent reads (via threading or processes). If you need reading and writing *at the same time*, you need to serialize these operations in a single thread in a single process. You will corrupt your data otherwise. See the issue <https://github.com/pydata/pandas/issues/2397> for more information.
11541168

1155-
- ``PyTables`` only supports fixed-width string columns in ``tables``. The sizes of a string based indexing column (e.g. *column* or *minor_axis*) are determined as the maximum size of the elements in that axis or by passing the parameter ``min_itemsize`` on the first table creation (``min_itemsize`` can be an integer or a dict of column name to an integer). If subsequent appends introduce elements in the indexing axis that are larger than the supported indexer, an Exception will be raised (otherwise you could have a silent truncation of these indexers, leading to loss of information).
1169+
- ``PyTables`` only supports fixed-width string columns in ``tables``. The sizes of a string based indexing column (e.g. *columns* or *minor_axis*) are determined as the maximum size of the elements in that axis or by passing the parameter ``min_itemsize`` on the first table creation (``min_itemsize`` can be an integer or a dict of column name to an integer). If subsequent appends introduce elements in the indexing axis that are larger than the supported indexer, an Exception will be raised (otherwise you could have a silent truncation of these indexers, leading to loss of information). Just to be clear, this fixed-width restriction applies to **indexables** (the indexing columns) and **string values** in a mixed_type table.
11561170

11571171
.. ipython:: python
11581172
1159-
store.append('wp_big_strings', wp, min_itemsize = 30)
1173+
store.append('wp_big_strings', wp, min_itemsize = { 'minor_axis' : 30 })
11601174
wp = wp.rename_axis(lambda x: x + '_big_strings', axis=2)
11611175
store.append('wp_big_strings', wp)
11621176
store.select('wp_big_strings')
11631177

1178+
# we have provided a minimum minor_axis indexable size
1179+
store.root.wp_big_strings.table
1180+
11641181
Compatibility
11651182
~~~~~~~~~~~~~
11661183

11671184
0.10 of ``HDFStore`` is backwards compatible for reading tables created in a prior version of pandas,
1168-
however, query terms using the prior (undocumented) methodology are unsupported. You must read in the entire
1169-
file and write it out using the new format to take advantage of the updates.
1185+
however, query terms using the prior (undocumented) methodology are unsupported. ``HDFStore`` will issue a warning if you try to use a prior-version format file. You must read in the entire
1186+
file and write it out using the new format to take advantage of the updates. The group attribute ``pandas_version`` contains the version information.
11701187

11711188

11721189
Performance
11731190
~~~~~~~~~~~
11741191

1175-
- ``Tables`` come with a performance penalty as compared to regular stores. The benefit is the ability to append/delete and query (potentially very large amounts of data).
1192+
- ``Tables`` come with a writing performance penalty as compared to regular stores. The benefit is the ability to append/delete and query (potentially very large amounts of data).
11761193
Write times are generally longer as compared with regular stores. Query times can be quite fast, especially on an indexed axis.
11771194
- ``Tables`` can (as of 0.10.0) be expressed as different types.
11781195

11791196
- ``AppendableTable`` which is a similar table to past versions (this is the default).
11801197
- ``WORMTable`` (pending implementation) - is available to facilitate very fast writing of tables that are also queryable (but CANNOT support appends)
11811198

11821199
- To delete a lot of data, it is sometimes better to erase the table and rewrite it. ``PyTables`` tends to increase the file size with deletions
1183-
- In general it is best to store Panels with the most frequently selected dimension in the minor axis and a time/date like dimension in the major axis, but this is not required. Panels can have any major_axis and minor_axis type that is a valid Panel indexer.
1184-
- No dimensions are currently indexed automagically (in the ``PyTables`` sense); these require an explict call to ``create_table_index``
11851200
- ``Tables`` offer better performance when compressed after writing them (as opposed to turning on compression at the very beginning)
11861201
use the ``PyTables`` utility ``ptrepack`` to rewrite the file (it can also change compression methods)
11871202
- Duplicate rows can be written, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs)
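
The "last item wins" rule in the final note can be pictured with plain Python. This is only a conceptual sketch of the selection semantics, not how ``HDFStore`` implements them; the row layout and the function name ``select_unique`` are assumptions made for illustration.

```python
# Rows as (major, minor, value); the (major, minor) pair acts as the key.
rows = [
    ('2000-01-03', 'A', 1.0),
    ('2000-01-03', 'B', 2.0),
    ('2000-01-03', 'A', 9.0),  # duplicate key, written later
]


def select_unique(rows):
    """Keep one row per (major, minor) pair: the last one written."""
    latest = {}
    for major, minor, value in rows:
        # A later row with the same key overwrites the earlier one.
        latest[(major, minor)] = value
    return sorted(
        (major, minor, value) for (major, minor), value in latest.items()
    )


selected = select_unique(rows)
print(selected)
```

All three rows are still "written" in ``rows``; only the selection step collapses them to one row per key.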
