Avoid computing dask variables on __repr__ and __getattr__ #1532
Merged

16 commits (all by gimperiale):
37d9135  stop repr(Dataset) from resolving dask variables
885d5cd  stop DataArray.__getattr__('NOTEXIST') from computing dask variables
df9b57b  PEP8 fixes
426b12c  PEP8 fixes
5d9062e  Tests for __repr__, __getattr__ and __getstate__
555d2ca  int is coerced to int64 or int32 on different systems
1633a4a  What's New
e626a9b  Merge remote-tracking branch 'upstream/master'
2717ae9  Merge branch 'master' into dataset_repr
8c72da0  flake8 tweaks
4e2d081  print summary of dask arrays
05e02be  More compact printing for dask arrays
9fc6751  Merge remote-tracking branch 'upstream/master' into dataset_repr
963e26a  Merge remote-tracking branch 'upstream/master' into dataset_repr
9bc1c74  Cleared changelog and moved to breaking changes
7b1b265  Deduplicate code
Our current heuristic uses the _not_remote() helper function, so it doesn't display arrays loaded over a network (via opendap), which can often be quite slow. But it does display a summary of values from netCDF files on disk, which I do think is generally helpful and for which I haven't noticed any performance issues.

Based on the current definition of _in_memory, we wouldn't display any of these arrays (xarray/xarray/core/variable.py, lines 285 to 289 in 4a15cfa).

So instead of using _in_memory, I would suggest something like _not_remote(var) and not isinstance(var._data, dask_array_type) as the condition for showing values.
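A rough sketch of what that suggested check could look like (illustrative only -- `_not_remote` and `dask_array_type` below are stand-ins for the xarray internals referenced above, not the actual implementations):

```python
# Stand-in for xarray's tuple of dask array classes.
try:
    import dask.array as da
    dask_array_type = (da.Array,)
except ImportError:
    dask_array_type = ()


def _not_remote(var):
    # Stand-in for xarray's heuristic that detects remote (e.g. opendap) stores.
    return True


def _show_preview(var):
    """Preview values only for local data that is not wrapped in a dask array."""
    return _not_remote(var) and not isinstance(var._data, dask_array_type)
```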
@shoyer loading a NetCDF variable from disk every time you do __repr__ is a terrible idea if that variable has been compressed without chunking. If the variable is a single block of 100MB of zlib-compressed data, you will have to read it and decompress it every time.
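As a concrete (hypothetical) illustration of that layout, using the netCDF4 library directly: a variable written as one big zlib-compressed chunk forces a full read and decompression even when only a handful of values are previewed.

```python
import numpy as np
import netCDF4

n = 25_000_000  # ~100 MB of float32

# Write a variable compressed as a single huge chunk (the pathological case).
with netCDF4.Dataset("big.nc", "w") as nc:
    nc.createDimension("x", n)
    v = nc.createVariable("data", "f4", ("x",), zlib=True, chunksizes=(n,))
    v[:] = np.random.rand(n).astype("f4")

# Previewing even ten values now pays the full decompression cost every time.
with netCDF4.Dataset("big.nc") as nc:
    print(nc.variables["data"][:10])
```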
@shoyer also, your netcdf array might be sitting on a network file system on the opposite side of a narrowband VPN.
That's certainly possible, but in my experience very few people write 100MB chunks -- those are very large.

Let's summarize our options for Dataset.__repr__.

Reasons to show data from disk in __repr__:
- You don't need a separate preview() or load() command to see what's in a Dataset. You can simply print it at a console.

Reasons not to show data from disk in __repr__:
- It can be unexpectedly slow or expensive, e.g. the compression and network-filesystem cases described above.

Maybe we should solicit a few more opinions here before we change the default behavior?

Another possibility is to try loading data in a separate thread and timing out if it takes too long (say, more than 0.5 seconds), but that might open up its own set of performance issues (it's not easy to kill a thread, short of terminating a process); see the sketch below.
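For reference, a minimal sketch of that "load in a separate thread with a timeout" idea (not part of this PR, and subject to the caveat above that the worker thread cannot actually be killed):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=1)


def preview_with_timeout(load, timeout=0.5):
    """Return load() if it finishes within `timeout` seconds, else None.

    On timeout we simply stop waiting; the worker thread keeps running in the
    background, which is exactly the performance concern raised above.
    """
    future = _pool.submit(load)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return None  # caller falls back to a summary such as shape/dtype
```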
I think my vote would be to only print a preview of data that is in memory. For my uses, I typically have fill values in the first 10-20 data points, so the previous __repr__ didn't give me any information.

@pydata/xarray - anyone else have thoughts on this?
@shoyer - do we have results from your google poll on this issue yet?
Sounds like I was wrong -- the consensus is pretty clear that we should go ahead with this.
I'm not sure this sample size is going to give us statistically significant results, but I'm glad to see @delgadom and I are in agreement.
@crusaderky - are you up for implementing this?
I think the current implementation (in this PR) is actually already correct.
Yep - data is eagerly loaded from disk only for index coords on __init__ now.
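To illustrate the resulting behaviour (a sketch with a hypothetical file name; the exact summary text depends on the xarray version):

```python
import xarray as xr

# Opening with `chunks` wraps the data variables in dask arrays.
ds = xr.open_dataset("example.nc", chunks={"time": 100})

# Printing the Dataset now shows a compact dask summary for each lazy variable
# instead of computing its values; only index coordinates are loaded eagerly.
print(ds)

# Looking up a nonexistent attribute no longer triggers computation either.
getattr(ds, "NOTEXIST", None)
```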