
dtype not preserved on round trip with xarray #117


Closed
itcarroll opened this issue Dec 9, 2021 · 9 comments

Comments

@itcarroll

itcarroll commented Dec 9, 2021

The data type I get when opening a TileDB Array with XArray does not match the data type in the TileDB ArraySchema. In the example below, I put in int16 and get back float32.

import tiledb
import xarray as xr
import numpy as np

index = tiledb.Dim(name='index', domain=(0, 3))
domain = tiledb.Domain(index)
var = tiledb.Attr(name='var', dtype=np.int16)
schema = tiledb.ArraySchema(domain=domain, attrs=[var], sparse=False)
tiledb.Array.create('dense_array0', schema)

with tiledb.open('dense_array0', 'w') as A:
    A[:] = np.array([5, 6, 7, 8], dtype=np.int16)

ds = xr.open_dataset('dense_array0', engine='tiledb')
ds['var'].dtype  # returns dtype('float32'), not int16

I have tiledb 0.11.3, libtiledb 2.5.2 and tiledb-cf 0.5.2 on the python:latest docker image.

@jp-dark
Collaborator

jp-dark commented Dec 9, 2021

Thank you for the bug report @itcarroll. I was able to reproduce the issue from your code snippet, and I am looking into a fix now.

@jp-dark
Collaborator

jp-dark commented Dec 9, 2021

@itcarroll I was able to track down the bug, and it is actually a bug in xarray that happens during the decode_cf_variable step. I opened xarray issue #6055 that you can follow.

The reason you may be seeing this here and not with other xarray backends is that we always set a fill value for the TileDB attributes. A temporary workaround is to set mask_and_scale=False in the open_dataset function.

import tiledb
import xarray as xr
import numpy as np

index = tiledb.Dim(name='index', domain=(0, 3))
domain = tiledb.Domain(index)
var = tiledb.Attr(name='var', dtype=np.int16)
schema = tiledb.ArraySchema(domain=domain, attrs=[var], sparse=False)
tiledb.Array.create('dense_array0', schema)

with tiledb.open('dense_array0', 'w') as A:
    A[:] = np.array([5, 6, 7, 8], dtype=np.int16)

ds = xr.open_dataset('dense_array0', mask_and_scale=False, engine='tiledb')
ds['var'].dtype
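
For readers without TileDB installed, the promotion can probably be reproduced with xarray alone. The sketch below is not the TileDB backend's code: it builds a hypothetical in-memory dataset whose int16 variable carries a _FillValue attribute (the value -1 is arbitrary) and runs it through xr.decode_cf, which is the decoding stage that open_dataset applies. The exact float width chosen may depend on the xarray version.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for what a backend hands to xarray: an int16
# variable with a _FillValue attribute (the value -1 is arbitrary).
raw = xr.Dataset(
    {
        "var": (
            "index",
            np.array([5, 6, 7, 8], dtype=np.int16),
            {"_FillValue": np.int16(-1)},
        )
    }
)

# Default decoding masks fill values, which promotes integers to float.
masked = xr.decode_cf(raw)

# With masking disabled, values and dtype are left as stored.
unmasked = xr.decode_cf(raw, mask_and_scale=False)

print(raw["var"].dtype)       # int16
print(masked["var"].dtype)    # a floating-point dtype
print(unmasked["var"].dtype)  # int16
```

With mask_and_scale=False the values equal to _FillValue are also returned unchanged, so any masking has to be done by the caller.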

@itcarroll
Author

Thanks for the quick investigation and follow-up with XArray. Makes sense to me if you want to close this issue.

jp-dark closed this as completed Dec 9, 2021

@itcarroll
Author

I think XArray's logic of promoting to float when the _FillValue attribute is set is reasonable, with setting mask_and_scale=False a good way to preserve the original data, both its type and the values matching _FillValue.

So I have to ask why TileDB always sets the _FillValue attribute? You seem to be introducing arbitrary metadata (which has consequences, apparently!). Any documentation on this?

@jp-dark
Collaborator

jp-dark commented Dec 10, 2021

TileDB uses a fill value when you write to an array without filling an entire tile. This is part of the TileDB API, whereas in NetCDF (what xarray was originally designed to handle) _FillValue is always a metadata convention.

I could default to NOT adding _FillValue as metadata unless the user specifically requests it. The _FillValue metadata would only be added with a TileDB specific parameter to the open_dataset function. If you think that would match user expectations better, I can reopen this issue and implement that.
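
To make the fill-value behaviour described above concrete, here is a rough NumPy-only sketch. It does not use the TileDB API; the 4-cell buffer, the name FILL, and the choice of the smallest int16 as the fill value are assumptions made purely for illustration. It shows why exposing the fill as _FillValue leads to a float result: masked cells become NaN, which an integer dtype cannot hold.

```python
import numpy as np

# Assumption for illustration only: the fill value is the smallest int16.
FILL = np.iinfo(np.int16).min

# A 4-cell "tile" starts out entirely filled; only two cells get written.
tile = np.full(4, FILL, dtype=np.int16)
tile[:2] = np.array([5, 6], dtype=np.int16)

# Reading back without masking keeps int16, fill cells included.
print(tile.dtype)  # int16

# Masking replaces fill cells with NaN. int16 cannot represent NaN,
# so the data is promoted to a float dtype first.
masked = np.where(tile == FILL, np.nan, tile.astype(np.float32))
print(masked.dtype)
```

Leaving the fill out of the metadata corresponds to the first read: the integers come back as stored and the caller decides whether any of them should count as missing.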

@itcarroll
Author

It keeps going deeper! I wouldn't change anything here, yet.

@jp-dark
Collaborator

jp-dark commented Jan 21, 2022

This issue got brought up again elsewhere. In the next release, the TileDB-xarray backend will default to not adding the _FillValue metadata unless the parameter encode_fill is set to True when opening a TileDB array with xarray.

@itcarroll
Author

I tested it out. Good solution.

BTW: Is it on purpose that if you DIDN'T write to all domain values in a tiledb attribute, open_dataset will error on loading the data? I would expect loading to quietly include the fill.

@jp-dark
Collaborator

jp-dark commented Jan 24, 2022

It should quietly include the fill. That is a separate bug that will also be fixed by PR #124 (also to be included in the next release).
