Skip to content

Correct default fill_value for creation.create #966

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

benjeffery
Copy link
Contributor

Attempting to fix #965

I have added a small test in what I think is the right place. This bug might have been caught if the tests in test_core.py used creation.create instead of self.create_array. Maybe I am missing something, but I couldn't see any obvious reason why those tests have their own creation routine that uses storage.init_array (which has fill_value=None). I'm happy to make that change if it is the right thing to do, but it breaks a lot of the nbytes and hexdigest tests, so I didn't propose it in the PR.
Thanks!

TODO:

  • Add unit tests and/or doctests in docstrings
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@pep8speaks
Copy link

pep8speaks commented Feb 15, 2022

Hello @benjeffery! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 436:17: E711 comparison to None should be 'if cond is None:'

Comment last updated at 2022-02-16 15:16:58 UTC

@benjeffery benjeffery force-pushed the fix_unwritten_byte_array branch from 2a70286 to baf1315 Compare February 15, 2022 00:38
@@ -16,7 +16,7 @@


def create(shape, chunks=True, dtype=None, compressor='default',
fill_value=0, order='C', store=None, synchronizer=None,
fill_value=None, order='C', store=None, synchronizer=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any other effects from changing this from 0 to None?

Should we be updating fill_values elsewhere to match?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only other use of fill_value=0 I can see is creation.open_array. I assume this should probably be None too, but I'm not too familiar with all the effects here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open_array just calls init_array if needed, which already has fill_value=None. So making it None in open_array seems more consistent with the underlying function.

@jakirkham
Copy link
Member

Thanks Ben! 😄

Had a couple questions above 🙂

@benjeffery
Copy link
Contributor Author

Looks like I've broken a bunch of tests here too - I was only running a subset for speed locally. WIll fix up tomorrow.

@jakirkham
Copy link
Member

The tests that fail seem to expect 0 (as opposed to None). Wondering how much difference there is between these two choices.

@d-v-b
Copy link
Contributor

d-v-b commented Feb 15, 2022

Because of this check, arrays with a fill value of None cannot have empty chunks from the perspective of the write_empty_chunks machinery. So if None is the default fill value, then arrays will default to writing empty chunks, regardless of the state of the write_empty_chunks flag, which is a bit unfortunate. However, a sensible default fill value kind of depends on the datatype of the array (e.g., it's not clear that 0 is a good fill value for int16 data), so I can see this getting rather complicated.

@jakirkham
Copy link
Member

Interesting we might want to revisit that check then. Most of the code uses fill_value=None as the default. As Ben points out, this case where fill_value=0 is an exception.

Typically anywhere we have a fill_value, we pass it into normalize_fill_value. However that normally keeps None.

If no fill_value is set, zarr-python treats that as 0 (for example).

Generally though the Zarr v2 spec leaves this ambiguous:

If the “fill_value” field is null then the contents of the chunk are undefined.

@jni
Copy link
Contributor

jni commented Feb 15, 2022

Generally though the Zarr v2 spec leaves this ambiguous:

If the “fill_value” field is null then the contents of the chunk are undefined.

😲

@benjeffery benjeffery force-pushed the fix_unwritten_byte_array branch from baf1315 to 5a265c6 Compare February 16, 2022 15:16
@benjeffery
Copy link
Contributor Author

I've pushed up quick fixes to the tests here.

@joshmoore
Copy link
Member

Still a number errors of the form:

        # compare properties
        for k, v in expect_props.items():
>           assert v == getattr(copied, k)
E           assert None == 0
E            +  where 0 = getattr(<HDF5 dataset "baz": shape (100,), type "<i8">, 'fillvalue')

not rolling this into 2.11.1.

@joshmoore
Copy link
Member

@benjeffery: let me know if help is needed here.

@jni
Copy link
Contributor

jni commented Mar 8, 2022

Does this mean that numeric types will default to None also? I find that suboptimal, and I think 0 is a good default fill value for numeric types. So I would use normalize_fill_value to have numeric arrays default to 0.

However, I really don't have strong feelings about the issue here. But if numeric arrays will need to specify a fill_value, then some loud documentation needs to exist about that... Again, this is all based on my own default expectations, which were always sparse writing and 0 default, but might not match others'.

@benjeffery
Copy link
Contributor Author

I'm not sure my fixes here are the correct way to go about this, and that some of the ideas in #965 are the right way to do this.

@benjeffery benjeffery closed this Mar 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect default fill value causes byte arrays to become numeric when write_empty_chunks=False
6 participants