
Developed zfp python wrapper for numcodecs and is ready to be used #160


Closed
wants to merge 22 commits

Conversation

@halehawk
Contributor

halehawk commented Dec 21, 2018

Closes #117

I'd like to know how to add tests.
I'd also like to know how to add the zfp 0.5.4 C source code to c-blosc/internal-complibs.

TODO:

  • Unit tests and/or doctests in docstrings
  • tox -e py37 passes locally
  • tox -e py27 passes locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • tox -e docs passes locally
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@rabernat
Contributor

Hi @halehawk -- thanks so much for getting this started! We really appreciate your contribution and we look forward to working with you to get this pull request merged.

Since you are a new numcodecs contributor, I would like to explain some aspects of our process. The requirements for all PRs are outlined in the "TODO" list at the top. There are 8 items that must be satisfied (and currently 0 have been checked off). These requirements are all related to testing and documentation. Nearly every open source python package follows similar procedures. Since you asked for some help on tests, I'm going to assume you are unfamiliar with this process but interested in learning.

Numcodecs has a comprehensive test suite which helps us maintain the reliability of the package. These tests are located in the numcodecs/tests/ subdirectory. For example, the tests for blosc are in
https://github.com/zarr-developers/numcodecs/blob/master/numcodecs/tests/test_blosc.py.
You will need to add a new test file test_zfp.py which validates that your code does what it is supposed to do. This is significantly more work than just writing the code itself; however, it is a useful and necessary part of the development process. We will be happy to advise you on the details of testing once you have made a first pass at it.

The test suite is run using tox. You need to have tox and pytest installed in your local environment (full requirements are in requirements_test.txt). You run the tests from the command line using the commands tox -e py37 and tox -e py27. You should get used to running these commands from your local numcodecs development environment.

The tests are also run automatically in the cloud using two free services, Travis (for Unix) and AppVeyor (for Windows). As you can see above, those tests are currently failing on this PR. If you follow the Travis link, you will eventually see the following error:

==================================== ERRORS ====================================
________________ ERROR collecting numcodecs/tests/test_blosc.py ________________
ImportError while importing test module '/home/travis/build/zarr-developers/numcodecs/numcodecs/tests/test_blosc.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
numcodecs/tests/test_blosc.py:12: in <module>
    from numcodecs import blosc
numcodecs/compat_ext.pxd:4: in init numcodecs.blosc
    cdef class Buffer:
E   ModuleNotFoundError: No module named 'numcodecs.compat_ext'

This indicates that the tests could not be run at all because of an import error introduced by your changes.

The fraction of the package which is covered by tests is called the "coverage". We use the coveralls utility to measure our coverage. Currently coverage is at 100%: https://coveralls.io/github/zarr-developers/numcodecs?branch=master.

We will also want documentation on the new zfp functions. Our documentation is built by sphinx. You will want to add a new documentation page modeled after https://github.com/zarr-developers/numcodecs/blob/master/docs/blosc.rst.

I would recommend starting with the tests. Your first step will be to get the test suite running locally. Once all the old tests are running correctly and passing, then you can begin adding new tests.

To update your pull request, simply make more commits to your local branch and then run git push. This page will automatically update, and all the automated tests will re-run.

Please let us know if anything is unclear. If you are new to this process, it can be a bit overwhelming. However, these are very useful skills to learn for anyone involved in open source software, so I hope you bear with us and help finish this off right.

Thanks again for your excellent contribution!

@halehawk
Contributor Author

halehawk commented Dec 22, 2018 via email

@halehawk
Contributor Author

halehawk commented Dec 22, 2018 via email

    ext_modules = (blosc_extension() + zstd_extension() + lz4_extension() +
                   compat_extension() + vlen_extension())
    ext_modules = (blosc_extension() + lz4_extension() +  # zstd_extension() +
                   zfp_extension() + compat_extension() + vlen_extension())


@halehawk Nice work! I have a couple of comments.

First, you're making the assumption that the ndarray uses C ordering, which may or may not be the case. Perhaps Zarr enforces this? It would be better to verify that this is the case, perhaps by checking ndarray.flags or ndarray.strides (I'm no Python expert--there may be a better way).

Second, the default settings of tolerance=0.01, precision=32, and rate=16 are arbitrary and inconsistent with the default expert mode parameters (minbits, maxbits, maxprec, minexp). Absent such specifications, it would be better to use the zfp defaults. For instance, tolerance=0.01 could lose all data if exponents are small; I would expect tolerance=0 to be a more reasonable default value. precision=32 is similarly not an appropriate choice for double-precision data; I would suggest using ZFP_MAX_PREC, which is analogous to setting tolerance=0 and works for both floats and doubles. Finally, a default rate of 16 will also inevitably introduce some level of data loss. It is difficult to suggest a good default rate, as you wouldn't want to use more bits per value than the uncompressed representation requires. Perhaps the safest setting is rate=0, which I believe implies that the user has to override this value and set a rate meaningful to the application.

@rabernat
Contributor

I wrote a test_zfp.py, but when I try to run the test, it keeps saying it cannot import zfp. I tried several ways and can't make it work.

Then how do you know that any of your code works? How are you testing it now?

setup.py Outdated

# setup sources - use ZFP bundled in blosc
zfp_sources = glob('c-blosc/internal-complibs/zfp*/src/*.c')
include_dirs = [d for d in glob('c-blosc/internal-complibs/zfp*/*') if os.path.isdir(d)]
Contributor

I'm confused about this. zfp is not part of blosc. Why are you looking for it there?

@halehawk
Contributor Author

halehawk commented Dec 22, 2018 via email

@halehawk
Contributor Author

halehawk commented Dec 22, 2018 via email

@halehawk
Contributor Author

halehawk commented Dec 22, 2018 via email

@halehawk
Contributor Author

halehawk commented Dec 22, 2018 via email

@lindstro

zfp expects the dimensions to be specified in what's commonly referred to as Fortran order, i.e., x (the leftmost dimension) varies fastest, then y, then z, etc. The rightmost dimension varies slowest. This behavior can be modified by specifying strides (see the zfp_set_stride family of functions).

My concern was that the ndarray being compressed can be ordered in Python to follow a Fortran contiguous ordering (e.g., via numpy.asfortranarray or via order='F' in the array constructor), in which case I believe (although I could be wrong) the dimensions from ndarray.shape should be reversed. In other words, ndarray.shape is the same whether order='C' or order='F', but the strides (i.e., memory layout) will be different, and zfp needs to know the correct memory layout.

A more general solution is to consult ndarray.strides and set the zfp_field strides to match this. Then you don't have to explicitly test whether the order is C, F, or something else.

@jakirkham
Member

We have some utility functions (e.g. ensure_contiguous_ndarray), which take an ndarray view of any contiguous array-like object implementing the buffer protocol and flatten it. I expect that's what you will want to do here.

@jakirkham
Member

As to the C code, we will probably want to do the same thing we do with Blosc and include it as a Git submodule.

@halehawk
Contributor Author

I did the following:

    elif input_array.ndim == 2:
        zfp_field_set_size_2d(field, input_array.shape[1], input_array.shape[0])
    elif input_array.ndim == 3:
        zfp_field_set_size_3d(field, input_array.shape[2], input_array.shape[1], input_array.shape[0])
    elif input_array.ndim == 4:
        print(input_array.shape)
        zfp_field_set_size_4d(field, input_array.shape[3], input_array.shape[2], input_array.shape[1], input_array.shape[0])
    else:

This way, zfp can read a C-order array correctly.

After I decompress, I reshape the array to (nw, nz, ny, nx). I tested the accuracy and precision modes, and they worked fine.

ensure_contiguous_ndarray flattens the array, which zfp does not expect; zfp works better with higher-dimensional data.

I will try set_stride later.

@halehawk
Contributor Author

halehawk commented Dec 22, 2018 via email

@jakirkham
Member

Ah ok. Didn’t realize the compressor was doing something different for higher dimensional arrays. Your proposal makes sense.

IIRC under the hood Buffer is using ensure_contiguous_ndarray, but it should still work as the data is unchanged.

Basically yes. Would go ahead and make it a new Git submodule. If you’re unfamiliar with Git submodules, this doc may help.

@jakirkham
Member

Also, for errors we should raise exceptions instead of printing. For example, use raise TypeError(...) if the type is unsupported.

@lindstro

I did the following:

    elif input_array.ndim == 2:
        zfp_field_set_size_2d(field, input_array.shape[1], input_array.shape[0])
    elif input_array.ndim == 3:
        zfp_field_set_size_3d(field, input_array.shape[2], input_array.shape[1], input_array.shape[0])
    elif input_array.ndim == 4:
        print(input_array.shape)
        zfp_field_set_size_4d(field, input_array.shape[3], input_array.shape[2], input_array.shape[1], input_array.shape[0])
    else:

This works as long as the ndarray uses (the default) C order layout, but fails otherwise. What if the user attempts to compress an ndarray laid out in Fortran order?

My suggestion would be to keep your code the same, since C order will be the most common case for numpy arrays, but to also set strides to correctly handle other layouts. Note that ndarray.strides are measured in bytes, while zfp measures strides in number of scalars.

@lindstro

Do you mean linking to Peter's LLNL/zfp on GitHub directly? Then type and exec need to be changed.

Is there a Cython workaround for this? If not, then we can probably address this in a January release together with some other bug fixes.

@halehawk
Contributor Author

halehawk commented Dec 23, 2018 via email

@lindstro mentioned this pull request Dec 23, 2018
@halehawk
Contributor Author

halehawk commented Dec 28, 2018 via email

@jakirkham
Member

My recommendation (though it would be good to hear others' thoughts as well) would be to raise a TypeError for any data types that it cannot compress. In terms of getting the type, one can run ensure_ndarray and inspect the dtype of the resulting ndarray to figure out what it is. This technique will also catch bytes and bytearray, as those come back as uint8.

@rabernat
Contributor

rabernat commented Jan 9, 2019

I agree with @jakirkham's suggestion. If a type can't be compressed by a particular codec, an informative TypeError should be raised.

@halehawk - please let us know if you have any other questions or need help with getting your tests running. Feel free to push your work-in-progress branch to github so we can follow along.

@rabernat
Contributor

rabernat commented Jan 9, 2019

Also, given the rapid recent progress in LLNL/zfp#29, perhaps we should consider whether to try that implementation of zfp in python.

@lindstro

lindstro commented Mar 5, 2020

@halehawk Is this PR based on zfp's official Python bindings (available since 0.5.5 and released in May 2019) or your own wrappers?

@halehawk
Contributor Author

halehawk commented Mar 5, 2020 via email

@halehawk
Contributor Author

halehawk commented Mar 5, 2020 via email

@lindstro

lindstro commented Mar 5, 2020

Currently it is based on my wrappers. I need a tool to compress MPAS data and decompress it using zarr/numcodecs. But I am having trouble getting it to pass the CI build and tests now. If I use your bindings, can I include them in numcodecs?

Of course. zfp is licensed under BSD, so there are very few restrictions on what you can do with it. I figured zarr and numcodecs would be better served by integrating with the official zfpy implementation. Thanks for your help with this.

@rabernat
Contributor

rabernat commented Mar 6, 2020

Now that the official zfp python wrapper has been released, it seems like we should consider using that, rather than generating new python bindings ourselves. This will mean less maintenance work going forward. It might also fix our windows problem (or at least make it not our problem). @jakirkham - do you have a preference? How do we feel about introducing a new dependency?

The zfp python bindings are documented here: https://zfp.readthedocs.io/en/release0.5.5/python.html

However, @lindstro, I'm confused about the python package release status. On pip there is a separate pyzfp package, which does not appear to be the official bindings.

Ideally, if we were to include zfp as a dependency, we would need both a pip and conda-forge package.

If we do go this route, we should probably start fresh with a new PR.

@lindstro

lindstro commented Mar 6, 2020

@rabernat We have no association with pyzfp, which was developed independently. The conda zfpy package is based on the official zfpy API. We reserved the name for a corresponding pip package, but we have not had the time to develop it yet. We would gladly accept some help with this in case anyone is interested.

@jakirkham
Member

Trying to build a codec using zfpy seems like a good first step. No objection to adding the codec here. Though maybe we should consider making zfpy an optional dependency. WDYT @rabernat?

@rabernat
Contributor

rabernat commented Mar 6, 2020

Though maybe we should consider making zfpy an optional dependency.

Do we have any optional dependencies currently? What happens if you try to read a zarr dataset without the dependency installed? Is there a useful error message?

If so, I think we should make zfpy an optional dependency and write a very thin layer for its codec in numcodecs. We could also help with the pip packaging issue.

@halehawk
Contributor Author

halehawk commented Mar 6, 2020 via email

@jakirkham
Member

Yep these all seem like good ideas.

This is sort of how Blosc worked for people pip installing it. We initially made LZMA optional as part of Python 2 support. Though yeah this is a bit less common.

There is an error raised here if the codec is not available. We could improve this a bit to include some instructions on what to do to make it work. Agree this is important to have right.

Please let me know if you have any thoughts here 🙂

@halehawk
Contributor Author

halehawk commented Mar 6, 2020 via email

@rabernat
Contributor

rabernat commented Mar 7, 2020

So where is the lzma module? I only see it imported in lzma.py.

It is a totally separate package, like numpy. We don't include numpy code in this repo either. The required dependencies for numcodecs are listed here:

cython
numpy
msgpack
pytest

These are packages that will be automatically installed by pip if we try to install numcodecs.

Note that lzma is not one of them. The tests are configured to skip lzma tests if it is not installed:

    try:
        # noinspection PyProtectedMember
        from numcodecs.lzma import LZMA, _lzma
    except ImportError:  # pragma: no cover
        raise unittest.SkipTest("LZMA not available")

@halehawk
Contributor Author

halehawk commented Mar 9, 2020 via email

@lindstro

We have already reserved zfpy but would appreciate help with packaging.

@alimanfoo
Member

Hi folks, can I just check in on the status here: are we still waiting on zfpy to be available via PyPI? I see there is a 0.0.0 release on PyPI/zfpy, but conda-forge/zfpy is at 0.5.5.

Could you just release zfpy source tarballs to PyPI and not worry about any binaries (if people need them then recommend the conda-forge route)?

@lindstro

lindstro commented Apr 6, 2020

PyPI/zfpy is being worked on (see https://test.pypi.org/project/zfpy/0.5.5rc1/#files for current progress). Hopefully not too much longer.

@jakirkham
Member

jakirkham left a comment

Added a couple comments to firm things up here.

Comment on lines +4 to +6
[submodule "zfp"]
path = zfp
url = https://github.com/LLNL/zfp.git
Member

We shouldn't need the source code here since we can go the dependency route.

Suggested change
[submodule "zfp"]
path = zfp
url = https://github.com/LLNL/zfp.git

Please remove the submodule as well.

@@ -0,0 +1,17 @@
Zfp
Member

Not to bikeshed things, but are we wanting this to be zfp or zfpy? Ok either way, just wanted to check that we are happy with the naming before we ship this.

Contributor Author

In fact, PR #160 is not needed any more, because @lindstro has their own zfp wrapper, so I have PR #229 ready for you to approve and merge. @jakirkham

Member

Sorry, must have gotten turned around. Should we close this PR out then?

@jakirkham
Member

It looks like we have PyPI packages now. Are these good to use @lindstro?

@lindstro

lindstro commented Jun 2, 2020

@jakirkham I thought these packages were good to go, but after further testing it looks like the osx wheels don't include the libzfp dylib. As I understand, the solution to this is to use delocate. Does anyone have an example of how to do this on Travis?

The Python interface to zfp is called zfpy, so I'd go with zfpy.

@halehawk
Contributor Author

halehawk commented Jun 2, 2020

@lindstro @jakirkham I changed my pip install from test PyPI to the PyPI zfpy 0.5.5 in PR #229, and then it passed all checks, including the osx checks.

@jakirkham
Member

@jakirkham I thought these packages were good to go, but after further testing it looks like the osx wheels don't include the libzfp dylib. As I understand, the solution to this is to use delocate. Does anyone have an example of how to do this on Travis?

Unfortunately I do not. @matthew-brett is an expert in this area. Maybe he can advise?

The Python interface to zfp is called zfpy, so I'd go with zfpy.

+1

@matthew-brett

Sorry - I am confused - can I help here - or is all good?

@halehawk closed this Jun 2, 2020
@halehawk mentioned this pull request Jun 2, 2020
@lindstro

lindstro commented Jun 2, 2020

Sorry - I am confused - can I help here - or is all good?

@matthew-brett We're building osx wheels on Travis CI and deploying to PyPI. Our Python package is a couple of Cython wrappers around the zfp C/C++ library. Unfortunately, the osx wheels do not include the libzfp dylib. I ran across an SO post that suggested using delocate to force inclusion of the dylib, though I'm not sure how to set up our Travis scripts to make this happen.

I'm the developer of zfp and have very limited expertise when it comes to Python, PyPI, and Travis CI. I was hoping someone could point to an example of how to properly include dylibs when building wheels on Travis. Any pointers would be appreciated.

@jakirkham
Member

Sorry Matthew. The PR is out-of-date, but the packaging issue with zfpy remains, on which the newer PR will depend. If you are able to advise, that would be very helpful 🙂
