Cannot open zarr store stored in Azure blob file system #10209


Closed · 5 tasks done

lsim-aegeri opened this issue Apr 8, 2025 · 21 comments
Labels
bug topic-zarr Related to zarr storage library

Comments

@lsim-aegeri

What happened?

I'm trying to save data to a zarr store hosted on my Azure blob filesystem. I'm able to write the data just fine, but I'm unable to open the dataset afterwards: when I try to open it, I get an empty dataset with no variables. I believe this is related to the storage backend, because everything works when I write the data to my local filesystem instead. I also don't think this is a zarr version issue, because the same thing happens with zarr v2 and v3, and when I use consolidated metadata with zarr v2.

What did you expect to happen?

I should be able to read data from the zarr stores I write to my Azure blob filesystem.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np
import pandas as pd
import adlfs

ds = xr.Dataset(
    {"foo": (("x", "y"), np.random.rand(4, 5))},
    coords={
        "x": [10, 20, 30, 40],
        "y": pd.date_range("2000-01-01", periods=5),
        "z": ("x", list("abcd")),
    },
)

store = 'abfs://weatherblob/xr-test/test.zarr-v3'
fs = adlfs.AzureBlobFileSystem(
    account_name="abcdefg",
    sas_token="ABCDEFG",
)
ds.to_zarr(
    store, 
    storage_options=fs.storage_options, 
    mode='w', 
    consolidated=False, 
    zarr_format=3, 
)

ds_zarr = xr.open_zarr(
    store, 
    storage_options=fs.storage_options, 
    consolidated=False, 
    zarr_format=3, 
)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

ds.info returns: 
Size: 248B
Dimensions:  (x: 4, y: 5)
Coordinates:
  * x        (x) int64 32B 10 20 30 40
  * y        (y) datetime64[ns] 40B 2000-01-01 2000-01-02 ... 2000-01-05
    z        (x) <U1 16B 'a' 'b' 'c' 'd'
Data variables:
    foo      (x, y) float64 160B 0.768 0.1867 0.1145 ... 0.3455 0.7483 0.8156>

ds_zarr.info returns: 
Size: 0B
Dimensions:  ()
Data variables:
    *empty*>

Also, the blob storage bucket has data at the zarr's path. 
* All the .json and chunk files that I'd expect are there. 
* Active blobs: 9 blobs, 3.04 KiB (3,110 bytes).

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-1071-azure
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2025.3.1
pandas: 2.2.3
numpy: 2.1.3
scipy: 1.15.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: 3.0.6
cftime: None
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2025.3.0
distributed: 2025.3.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2025.3.2
cupy: None
pint: None
sparse: 0.16.0
flox: 0.10.2
numpy_groupies: 0.11.2
setuptools: None
pip: None
conda: None
pytest: None
mypy: None
IPython: 9.1.0
sphinx: None
adlfs: 2024.12.0

@lsim-aegeri lsim-aegeri added bug needs triage Issue that has not been reviewed by xarray team member labels Apr 8, 2025

welcome bot commented Apr 8, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@max-sixty
Collaborator

potentially have a look at the files in the bucket — how do they differ from the files that are written to a local filesystem? if we write to one and then copy to the other, does that work?

fwiw it seems unlikely this is an xarray issue (though not impossible)
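
one way to do that comparison, as a rough sketch (reusing the adlfs filesystem and placeholder credentials from the MVCE; the local target path is illustrative):

import adlfs

fs = adlfs.AzureBlobFileSystem(account_name="abcdefg", sas_token="ABCDEFG")

# list every object under the zarr prefix, with sizes, for a side-by-side comparison
remote = {path: info["size"] for path, info in fs.find("weatherblob/xr-test/test.zarr-v3", detail=True).items()}
print(remote)

# pull the store down so the files can be diffed byte-for-byte against a locally written copy
fs.get("weatherblob/xr-test/test.zarr-v3", "/tmp/test.zarr-v3-from-blob", recursive=True)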

@lsim-aegeri
Author

lsim-aegeri commented Apr 8, 2025

Yeah, I can't say for sure where the issue is coming from but I am getting the results I'd expect after downgrading to an older version of xarray.

INSTALLED VERSIONS

commit: None
python: 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-1071-azure
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.7.0
pandas: 2.2.3
numpy: 2.1.3
scipy: 1.15.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: 2.18.3
cftime: None
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2025.3.0
distributed: 2025.3.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2025.3.2
cupy: None
pint: None
sparse: 0.16.0
flox: 0.10.2
numpy_groupies: 0.11.2
setuptools: None
pip: None
conda: None
pytest: None
mypy: None
IPython: 9.1.0
sphinx: None

And when I write to a zarr v2 store:

ds.to_zarr(
    store, 
    storage_options=fs.storage_options, 
    mode='w', 
    consolidated=False, 
    zarr_version=2, 
)
ds_zarr = xr.open_zarr(
    store, 
    storage_options=fs.storage_options, 
    consolidated=False, 
    zarr_version=2, 
)
ds_zarr.info

I get what I'd expect.

Size: 248B
Dimensions:  (x: 4, y: 5)
Coordinates:
  * x        (x) int64 32B 10 20 30 40
  * y        (y) datetime64[ns] 40B 2000-01-01 2000-01-02 ... 2000-01-05
    z        (x) <U1 16B dask.array<chunksize=(4,), meta=np.ndarray>
Data variables:
    foo      (x, y) float64 160B dask.array<chunksize=(4, 5), meta=np.ndarray>>

I'd really prefer to use the latest versions though, because I'm very excited for zarr v3 and that requires more recent versions of xarray.

@max-sixty
Collaborator

potentially have a look at the files in the bucket — how do they differ from the files that are written to a local filesystem? if we write to one and then copy to the other, does that work?

could you look at this?

@lsim-aegeri
Author

Yes, working on it!

@lsim-aegeri
Author

lsim-aegeri commented Apr 8, 2025

Very weird.

  • When I write to blob they open incorrectly from blob.
  • When I write locally and then copy the files to blob they open correctly from blob.
  • When I write to blob, copy locally, then copy back to blob they open incorrectly from blob.
  • When I write to blob, copy locally, rename, then copy back to blob with the new name, they open correctly from blob.

I compared all the json files when writing locally and to blob and could not see any differences. I also checked file sizes and they were the same.

Do you know what to make of this?
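
(For reference, the blob-to-local-and-back round-trips above can be scripted with fsspec's recursive copy helpers; a rough sketch with illustrative paths, reusing fs from the MVCE:)

# blob -> local
fs.get("weatherblob/xr-test/test.zarr-v3", "/tmp/test.zarr-v3", recursive=True)
# local -> blob, under a new name
fs.put("/tmp/test.zarr-v3", "weatherblob/xr-test/test-renamed.zarr-v3", recursive=True)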

@lsim-aegeri
Author

lsim-aegeri commented Apr 8, 2025

But I think you're right, this probably isn't purely an xarray issue. I can also read the files if I write to blob, then clone within blob, and open from blob. It feels like something related to permissions? I'll investigate further.

I do wonder if a helpful change could be made to xarray to prevent this, given that reading and writing work fine with xarray v2024.7.0. Will follow up when I've identified the problem.

@oj-tooth

oj-tooth commented Apr 9, 2025

Hi,

Just wanted to follow up on this issue: I've also encountered this bug after writing a Zarr v3 store to the JASMIN s3-compatible object store in the UK. So far, I've tracked the error down to ZarrStore.open_group() in zarr.py (see here), which is called if you pass a string rather than a store to open_zarr(). From what I can tell, store = ZarrStore.open_group() is failing to behave as expected because zarr.open_group() returns a group with no members or arrays.

I can open the unconsolidated Zarr Store as expected by manually passing the store as follows:

store = zarr.storage.FsspecStore(fs=obj_store, path=dest)

xr.open_zarr(store, zarr_format=3, consolidated=False)

where obj_store is an async s3fs.S3FileSystem instance. Planning to look into this more tomorrow, so will post any updates then!
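
Putting that workaround together end to end looks roughly like this (a sketch only; the endpoint URL and bucket/prefix are placeholders, and anonymous access is an assumption):

import s3fs
import xarray as xr
import zarr

# async filesystem, since zarr's FsspecStore drives it from zarr's own event loop
obj_store = s3fs.S3FileSystem(
    anon=True,
    asynchronous=True,
    client_kwargs={"endpoint_url": "https://my_end_point_url"},
)
store = zarr.storage.FsspecStore(fs=obj_store, path="bucket/prefix")

ds = xr.open_zarr(store, zarr_format=3, consolidated=False)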

@oj-tooth

oj-tooth commented Apr 9, 2025

In case of interest, the following storage_options seem to have solved this issue for me by ensuring successful store creation in the open_zarr() call:

store_path = "s3://bucket/prefix"

ds = xr.open_zarr(fpath, zarr_format=3, consolidated=False, storage_options={"anon":True, "asynchronous":True, "client_kwargs":{"endpoint_url":"https://my_end_point_url"}})

I find it interesting, though, that reading the same store created with consolidated metadata does not require any storage_options, since behind the scenes zarr.open_consolidated() behaves as expected when given the complete (endpoint-included) URL.
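
In other words, the consolidated twin of the same store opens with nothing more than the following (a sketch of the observation above, using store_path as defined earlier):

# per the observation above, no storage_options needed for the consolidated store
ds_c = xr.open_zarr(store_path, zarr_format=3, consolidated=True)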

@TomNicholas
Member

TomNicholas commented Apr 11, 2025

Thank you all for diving into this. It's highly likely that this is a bug in the zarr-python library rather than in xarray. Has anyone tried to replicate the bug simply by opening the zarr store using zarr-python directly and listing the contents of the group using the zarr-python API? That would be my next move.
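
A minimal check along those lines might look like this (a sketch reusing the adlfs filesystem and store path from the MVCE; credentials are placeholders):

import adlfs
import zarr

fs = adlfs.AzureBlobFileSystem(account_name="abcdefg", sas_token="ABCDEFG")
store = zarr.storage.FsspecStore(fs=fs, path="weatherblob/xr-test/test.zarr-v3")

grp = zarr.open_group(store, mode="r", zarr_format=3, use_consolidated=False)
print(list(grp.array_keys()))  # expect ['foo', 'x', 'y', 'z']; empty if the bug reproduces
print(grp.info_complete())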

@TomNicholas TomNicholas added topic-zarr Related to zarr storage library and removed needs triage Issue that has not been reviewed by xarray team member labels Apr 11, 2025
@oj-tooth

oj-tooth commented Apr 11, 2025

Hi Tom,
Yes, I looked into this further yesterday and traced the issue to the use_consolidated=False case of the zarr-python function zarr.open_group(). I created a zarr v3 store both with and without consolidated metadata: the consolidated store's arrays can be accessed by passing use_consolidated=True to the call below (or via open_consolidated() as an alias), whereas use_consolidated=False on the unconsolidated store returns an empty group, and therefore an empty dataset when consumed by xarray.

zarr.open_group(url, mode='r', zarr_format=3, use_consolidated=False)

I'll drop an issue in zarr-python, as currently the way around this in xarray is to manually specify storage_options even though, in our case, the data should be read-only accessible for users at the URL.

@TomNicholas
Member

Thank you @oj-tooth ! Yes this makes some sense to me. Please raise an issue on zarr-python upstream and link to this issue! 🙏

@lsim-aegeri
Author

lsim-aegeri commented Apr 14, 2025

Based on the issue @oj-tooth opened, it seems like our problems are different. oj-tooth's issue was that they wanted to make zarr stores with unconsolidated metadata available over HTTP, which is impossible because listing directory contents doesn't work over HTTP.

I also tried the solution oj-tooth suggested:

import zarr
dsn = xr.open_zarr(
    zarr.storage.FsspecStore(fs=fs, path=store.split('://')[1]),
    consolidated=False,
    use_zarr_fill_value_as_mask=True,
)
dsn.compute()

I am still getting an empty result. Also, my open_zarr calls are returning empty datasets regardless of whether I'm using consolidated or unconsolidated metadata.

I'm about to test with zarr-python directly. I hope that sheds a little more light.

@lsim-aegeri
Author

I also get an empty result when reading with zarr-python directly.

zstore = zarr.storage.FsspecStore(fs=fs, path=store.split('://')[1])
grp = zarr.open_group(zstore, mode='r')
grp.info_complete()

Result:

Name        : 
Type        : Group
Zarr format : 3
Read-only   : False
Store type  : FsspecStore
No. members : 0
No. arrays  : 0
No. groups  : 0

I will open an issue there as well.

@oj-tooth

You're correct. Oversight on my part, as I thought I'd run into the same issue, but actually I just needed to learn more about the limits of reading zarr via HTTP requests. Sorry not to have been more helpful!

@lsim-aegeri
Author

lsim-aegeri commented Apr 14, 2025

Actually, when I tried to create a reproducible example using pure zarr-python I found I'm not having the same problem.

fs = adlfs.AzureBlobFileSystem(
    account_name="abcdefg",
    sas_token="abcdefg",
)
store = zarr.storage.FsspecStore(
    fs=fs, 
    path='weatherblob/xr-test/test_zarr_direct.zarr-v3'
)

# Create the array
root = zarr.create_group(store=store, zarr_format=3, overwrite=True)
z1 = root.create_array(name='foo', shape=(10000, 10000), chunks=(1000, 1000), dtype='int32')
z1[:] = np.random.randint(0, 100, size=(10000, 10000))

# Read it back
root_read = zarr.open_group(store=store, zarr_format=3, mode='r')
root_read['foo'][:]

Yields, as expected:

array([[76, 33, 87, ..., 12, 15,  2],
       [43, 81,  6, ..., 66, 53, 53],
       [84, 13, 85, ..., 37, 98, 89],
       ...,
       [16, 82, 66, ..., 34, 30, 10],
       [20, 62, 45, ..., 37,  9, 38],
       [49, 27, 90, ..., 16, 98, 69]], dtype=int32)

But then when I try to do the roundtrip with xarray I get an empty dataset.

ds = xr.Dataset(
    {"foo": xr.DataArray(root_read['foo'][:], dims=['x', 'y'])},
)
store_xr = zarr.storage.FsspecStore(
    fs=fs, 
    path='weatherblob/xr-test/test_zarr_direct_xr.zarr-v3'
)
ds.to_zarr(
    store_xr, 
    mode='w', 
    consolidated=False, 
    zarr_format=3, 
)
xr.open_zarr(
    store_xr, 
    consolidated=False, 
    zarr_format=3, 
).info

yields:

Dimensions:  ()
Data variables:
    *empty*>

This leads me to believe it's actually an issue somewhere in xarray. Or I guess it could still be an issue in zarr that only surfaces in the xarray round-trip and not the zarr-python round-trip, because xarray uses zarr-python in a way that's different from my example. I'm not well enough versed in the xarray read/write process to know for sure.

I do not have accounts with other cloud storage providers, so I cannot test this with GCP or AWS, unfortunately.

@lsim-aegeri
Author

You're correct. Oversight on my part, as I thought I'd run into the same issue, but actually I just needed to learn more about the limits of reading zarr via HTTP requests. Sorry not to have been more helpful!

All good! Even if it doesn't help solve this particular issue, I still learned something new!

@lsim-aegeri
Author

I have a little bit to add to the previous example. When I then try to read the store I wrote with xarray, I get some inconsistent results.

root_read_xr = zarr.open_group(store=store_xr, zarr_format=3, mode='r')
root_read_xr['foo'][:]

array([[93, 22, 81, ..., 96, 34, 60],
       [ 0, 13,  8, ...,  6, 87, 58],
       [24, 89, 93, ..., 16, 98, 83],
       ...,
       [73, 65, 51, ..., 89,  3, 14],
       [44, 23, 83, ..., 81, 14, 48],
       [60, 29, 85, ..., 21, 28, 32]], dtype=int32)

But:

root_read_xr.info_complete()

Name        : 
Type        : Group
Zarr format : 3
Read-only   : False
Store type  : FsspecStore
No. members : 0
No. arrays  : 0
No. groups  : 0

I'm interpreting this as: zarr-python itself cannot see any variables when listing the group, BUT zarr-python can read individual variables in the group when asked to do so directly.
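
A quick way to test that interpretation is to compare what the filesystem can list against direct reads of keys that must exist (reusing fs and store_xr from above; the key names assume the standard zarr v3 layout):

# what the filesystem can enumerate under the prefix
print(fs.ls(store_xr.path))

# direct reads of metadata keys that must exist if the write succeeded
print(fs.cat(store_xr.path + "/zarr.json")[:200])      # group metadata
print(fs.cat(store_xr.path + "/foo/zarr.json")[:200])  # array metadata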

@lsim-aegeri
Author

And then there's still the additional wrinkle that when I simply copy the dataset from one blob path to another I can open it just fine with xarray.

store_xr = zarr.storage.FsspecStore(
    fs=fs, 
    path='weatherblob/xr-test/test_zarr_direct_xr.zarr-v3'
)
store_xr_cp = zarr.storage.FsspecStore(
    fs=fs, 
    path='weatherblob/xr-test/test_zarr_direct_xr_cp.zarr-v3'
)
fs.copy(store_xr.path, store_xr_cp.path, recursive=True)

xr.open_zarr(
    store=f'abfs://{store_xr_cp.path}', 
    storage_options=store_xr_cp.fs.storage_options,
    consolidated=False, 
    zarr_format=3, 
).info

yields:

Dimensions:  (x: 10000, y: 10000)
Dimensions without coordinates: x, y
Data variables:
    foo      (x, y) int32 400MB dask.array<chunksize=(625, 625), meta=np.ndarray>>

I'm feeling quite lost here.

@lsim-aegeri
Author

I'm concurrently dealing with a separate issue reading data into memory with ds.compute(). Unfortunately I'm not able to reproduce this issue: it doesn't appear with the dummy data above, only when I try to read ECMWF EPS reforecast data I've downloaded. However, the error message I'm getting makes me suspicious there might be a relationship.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[24], line 1
----> 1 ds_n.isel(step=0).compute()

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/xarray/core/dataset.py:714, in Dataset.compute(self, **kwargs)
    690 """Manually trigger loading and/or computation of this dataset's data
    691 from disk or a remote source into memory and return a new dataset.
    692 Unlike load, the original dataset is left unaltered.
   (...)
    711 dask.compute
    712 """
    713 new = self.copy(deep=False)
--> 714 return new.load(**kwargs)

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/xarray/core/dataset.py:541, in Dataset.load(self, **kwargs)
    538 chunkmanager = get_chunked_array_type(*lazy_data.values())
    540 # evaluate all the chunked arrays simultaneously
--> 541 evaluated_data: tuple[np.ndarray[Any, Any], ...] = chunkmanager.compute(
    542     *lazy_data.values(), **kwargs
    543 )
    545 for k, data in zip(lazy_data, evaluated_data, strict=False):
    546     self.variables[k].data = data

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/xarray/namedarray/daskmanager.py:85, in DaskManager.compute(self, *data, **kwargs)
     80 def compute(
     81     self, *data: Any, **kwargs: Any
     82 ) -> tuple[np.ndarray[Any, _DType_co], ...]:
     83     from dask.array import compute
---> 85     return compute(*data, **kwargs)

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/dask/base.py:656, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    653     postcomputes.append(x.__dask_postcompute__())
    655 with shorten_traceback():
--> 656     results = schedule(dsk, keys, **kwargs)
    658 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/xarray/core/indexing.py:576, in ImplicitToExplicitIndexingAdapter.__array__(self, dtype, copy)
    574     return np.asarray(self.get_duck_array(), dtype=dtype, copy=copy)
    575 else:
--> 576     return np.asarray(self.get_duck_array(), dtype=dtype)

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/xarray/core/indexing.py:579, in ImplicitToExplicitIndexingAdapter.get_duck_array(self)
    578 def get_duck_array(self):
--> 579     return self.array.get_duck_array()

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/xarray/core/indexing.py:790, in CopyOnWriteArray.get_duck_array(self)
    789 def get_duck_array(self):
--> 790     return self.array.get_duck_array()

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/xarray/core/indexing.py:653, in LazilyIndexedArray.get_duck_array(self)
    649     array = apply_indexer(self.array, self.key)
    650 else:
    651     # If the array is not an ExplicitlyIndexedNDArrayMixin,
    652     # it may wrap a BackendArray so use its __getitem__
--> 653     array = self.array[self.key]
    655 # self.array[self.key] is now a numpy array when
    656 # self.array is a BackendArray subclass
    657 # and self.key is BasicIndexer((slice(None, None, None),))
    658 # so we need the explicit check for ExplicitlyIndexed
    659 if isinstance(array, ExplicitlyIndexed):

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/xarray/backends/zarr.py:223, in ZarrArrayWrapper.__getitem__(self, key)
    221 elif isinstance(key, indexing.OuterIndexer):
    222     method = self._oindex
--> 223 return indexing.explicit_indexing_adapter(
    224     key, array.shape, indexing.IndexingSupport.VECTORIZED, method
    225 )

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/xarray/core/indexing.py:1014, in explicit_indexing_adapter(key, shape, indexing_support, raw_indexing_method)
    992 """Support explicit indexing by delegating to a raw indexing method.
    993 
    994 Outer and/or vectorized indexers are supported by indexing a second time
   (...)
   1011 Indexing result, in the form of a duck numpy-array.
   1012 """
   1013 raw_key, numpy_indices = decompose_indexer(key, shape, indexing_support)
-> 1014 result = raw_indexing_method(raw_key.tuple)
   1015 if numpy_indices.tuple:
   1016     # index the loaded duck array
   1017     indexable = as_indexable(result)

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/xarray/backends/zarr.py:213, in ZarrArrayWrapper._getitem(self, key)
    212 def _getitem(self, key):
--> 213     return self._array[key]

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/core/array.py:2425, in Array.__getitem__(self, selection)
   2423     return self.vindex[cast(CoordinateSelection | MaskSelection, selection)]
   2424 elif is_pure_orthogonal_indexing(pure_selection, self.ndim):
-> 2425     return self.get_orthogonal_selection(pure_selection, fields=fields)
   2426 else:
   2427     return self.get_basic_selection(cast(BasicSelection, pure_selection), fields=fields)

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/_compat.py:43, in _deprecate_positional_args.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     41 extra_args = len(args) - len(all_args)
     42 if extra_args <= 0:
---> 43     return f(*args, **kwargs)
     45 # extra_args > 0
     46 args_msg = [
     47     f"{name}={arg}"
     48     for name, arg in zip(kwonly_args[:extra_args], args[-extra_args:], strict=False)
     49 ]

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/core/array.py:2867, in Array.get_orthogonal_selection(self, selection, out, fields, prototype)
   2865     prototype = default_buffer_prototype()
   2866 indexer = OrthogonalIndexer(selection, self.shape, self.metadata.chunk_grid)
-> 2867 return sync(
   2868     self._async_array._get_selection(
   2869         indexer=indexer, out=out, fields=fields, prototype=prototype
   2870     )
   2871 )

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/core/sync.py:163, in sync(coro, loop, timeout)
    160 return_result = next(iter(finished)).result()
    162 if isinstance(return_result, BaseException):
--> 163     raise return_result
    164 else:
    165     return return_result

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/core/sync.py:119, in _runner(coro)
    114 """
    115 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
    116 exception, the exception will be returned.
    117 """
    118 try:
--> 119     return await coro
    120 except Exception as ex:
    121     return ex

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/core/array.py:1287, in AsyncArray._get_selection(self, indexer, prototype, out, fields)
   1284         _config = replace(_config, order=self.metadata.order)
   1286     # reading chunks and decoding them
-> 1287     await self.codec_pipeline.read(
   1288         [
   1289             (
   1290                 self.store_path / self.metadata.encode_chunk_key(chunk_coords),
   1291                 self.metadata.get_chunk_spec(chunk_coords, _config, prototype=prototype),
   1292                 chunk_selection,
   1293                 out_selection,
   1294                 is_complete_chunk,
   1295             )
   1296             for chunk_coords, chunk_selection, out_selection, is_complete_chunk in indexer
   1297         ],
   1298         out_buffer,
   1299         drop_axes=indexer.drop_axes,
   1300     )
   1301 return out_buffer.as_ndarray_like()

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/core/codec_pipeline.py:464, in BatchedCodecPipeline.read(self, batch_info, out, drop_axes)
    458 async def read(
    459     self,
    460     batch_info: Iterable[tuple[ByteGetter, ArraySpec, SelectorTuple, SelectorTuple, bool]],
    461     out: NDBuffer,
    462     drop_axes: tuple[int, ...] = (),
    463 ) -> None:
--> 464     await concurrent_map(
    465         [
    466             (single_batch_info, out, drop_axes)
    467             for single_batch_info in batched(batch_info, self.batch_size)
    468         ],
    469         self.read_batch,
    470         config.get("async.concurrency"),
    471     )

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/core/common.py:68, in concurrent_map(items, func, limit)
     65     async with sem:
     66         return await func(*item)
---> 68 return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/core/common.py:66, in concurrent_map.<locals>.run(item)
     64 async def run(item: tuple[Any]) -> V:
     65     async with sem:
---> 66         return await func(*item)

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/core/codec_pipeline.py:251, in BatchedCodecPipeline.read_batch(self, batch_info, out, drop_axes)
    244 async def read_batch(
    245     self,
    246     batch_info: Iterable[tuple[ByteGetter, ArraySpec, SelectorTuple, SelectorTuple, bool]],
    247     out: NDBuffer,
    248     drop_axes: tuple[int, ...] = (),
    249 ) -> None:
    250     if self.supports_partial_decode:
--> 251         chunk_array_batch = await self.decode_partial_batch(
    252             [
    253                 (byte_getter, chunk_selection, chunk_spec)
    254                 for byte_getter, chunk_spec, chunk_selection, *_ in batch_info
    255             ]
    256         )
    257         for chunk_array, (_, chunk_spec, _, out_selection, _) in zip(
    258             chunk_array_batch, batch_info, strict=False
    259         ):
    260             if chunk_array is not None:

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/core/codec_pipeline.py:207, in BatchedCodecPipeline.decode_partial_batch(self, batch_info)
    205 assert self.supports_partial_decode
    206 assert isinstance(self.array_bytes_codec, ArrayBytesCodecPartialDecodeMixin)
--> 207 return await self.array_bytes_codec.decode_partial(batch_info)

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/abc/codec.py:198, in ArrayBytesCodecPartialDecodeMixin.decode_partial(self, batch_info)
    178 async def decode_partial(
    179     self,
    180     batch_info: Iterable[tuple[ByteGetter, SelectorTuple, ArraySpec]],
    181 ) -> Iterable[NDBuffer | None]:
    182     """Partially decodes a batch of chunks.
    183     This method determines parts of a chunk from the slice selection,
    184     fetches these parts from the store (via ByteGetter) and decodes them.
   (...)
    196     Iterable[NDBuffer | None]
    197     """
--> 198     return await concurrent_map(
    199         list(batch_info),
    200         self._decode_partial_single,
    201         config.get("async.concurrency"),
    202     )

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/core/common.py:68, in concurrent_map(items, func, limit)
     65     async with sem:
     66         return await func(*item)
---> 68 return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/core/common.py:66, in concurrent_map.<locals>.run(item)
     64 async def run(item: tuple[Any]) -> V:
     65     async with sem:
---> 66         return await func(*item)

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/codecs/sharding.py:506, in ShardingCodec._decode_partial_single(self, byte_getter, selection, shard_spec)
    503     shard_dict = shard_dict_maybe
    504 else:
    505     # read some chunks within the shard
--> 506     shard_index = await self._load_shard_index_maybe(byte_getter, chunks_per_shard)
    507     if shard_index is None:
    508         return None

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/codecs/sharding.py:718, in ShardingCodec._load_shard_index_maybe(self, byte_getter, chunks_per_shard)
    713     index_bytes = await byte_getter.get(
    714         prototype=numpy_buffer_prototype(),
    715         byte_range=RangeByteRequest(0, shard_index_size),
    716     )
    717 else:
--> 718     index_bytes = await byte_getter.get(
    719         prototype=numpy_buffer_prototype(), byte_range=SuffixByteRequest(shard_index_size)
    720     )
    721 if index_bytes is not None:
    722     return await self._decode_shard_index(index_bytes, chunks_per_shard)

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/storage/_common.py:124, in StorePath.get(self, prototype, byte_range)
    122 if prototype is None:
    123     prototype = default_buffer_prototype()
--> 124 return await self.store.get(self.path, prototype=prototype, byte_range=byte_range)

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/zarr/storage/_fsspec.py:245, in FsspecStore.get(self, key, prototype, byte_range)
    240     value = prototype.buffer.from_bytes(
    241         await self.fs._cat_file(path, start=byte_range.offset, end=None)
    242     )
    243 elif isinstance(byte_range, SuffixByteRequest):
    244     value = prototype.buffer.from_bytes(
--> 245         await self.fs._cat_file(path, start=-byte_range.suffix, end=None)
    246     )
    247 else:
    248     raise ValueError(f"Unexpected byte_range, got {byte_range}.")

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/adlfs/spec.py:1475, in AzureBlobFileSystem._cat_file(self, path, start, end, max_concurrency, **kwargs)
   1471 async with self.service_client.get_blob_client(
   1472     container=container_name, blob=blob
   1473 ) as bc:
   1474     try:
-> 1475         stream = await bc.download_blob(
   1476             offset=start,
   1477             length=length,
   1478             version_id=version_id,
   1479             max_concurrency=max_concurrency or self.max_concurrency,
   1480             **self._timeout_kwargs,
   1481         )
   1482     except ResourceNotFoundError as e:
   1483         raise FileNotFoundError(
   1484             errno.ENOENT, os.strerror(errno.ENOENT), path
   1485         ) from e

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/azure/core/tracing/decorator_async.py:114, in distributed_trace_async.<locals>.decorator.<locals>.wrapper_use_tracer(*args, **kwargs)
    112 span_impl_type = settings.tracing_implementation()
    113 if span_impl_type is None:
--> 114     return await func(*args, **kwargs)
    116 # Merge span is parameter is set, but only if no explicit parent are passed
    117 if merge_span and not passed_in_parent:

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/azure/storage/blob/aio/_blob_client_async.py:735, in BlobClient.download_blob(self, offset, length, encoding, **kwargs)
    717 options = _download_blob_options(
    718     blob_name=self.blob_name,
    719     container_name=self.container_name,
   (...)
    732     client=self._client,
    733     **kwargs)
    734 downloader = StorageStreamDownloader(**options)
--> 735 await downloader._setup()  # pylint: disable=protected-access
    736 return downloader

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/azure/storage/blob/aio/_download_async.py:317, in StorageStreamDownloader._setup(self)
    308 # pylint: disable-next=attribute-defined-outside-init
    309 self._initial_range, self._initial_offset = process_range_and_offset(
    310     initial_request_start,
    311     initial_request_end,
   (...)
    314     self._encryption_data
    315 )
--> 317 self._response = await self._initial_request()
    318 self.properties = cast("BlobProperties", self._response.properties)  # type: ignore [attr-defined]
    319 self.properties.name = self.name

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/azure/storage/blob/aio/_download_async.py:364, in StorageStreamDownloader._initial_request(self)
    362 self._file_size = parse_length_from_content_range(response.properties.content_range)
    363 if self._file_size is None:
--> 364     raise ValueError("Required Content-Range response header is missing or malformed.")
    365 # Remove any extra encryption data size from blob size
    366 self._file_size = adjust_blob_size_for_encryption(self._file_size, self._encryption_data)

ValueError: Required Content-Range response header is missing or malformed.

The last section of the message (copied below) suggests that the adlfs.AzureBlobFileSystem I'm passing in isn't able to read the file size of the blobs I'm saving. I think there's a chance all of this is happening because the filesystem somehow can't read metadata on the blobs until they've been copied. That would explain why I can't open the stores initially, since zarr needs to detect files in order to open the dataset, and it would also explain this other issue.

File ~/code/lsimpfendoerfer/weather-dagster/.venv/lib/python3.11/site-packages/azure/storage/blob/aio/_download_async.py:364, in StorageStreamDownloader._initial_request(self)
    362 self._file_size = parse_length_from_content_range(response.properties.content_range)
    363 if self._file_size is None:
--> 364     raise ValueError("Required Content-Range response header is missing or malformed.")
    365 # Remove any extra encryption data size from blob size
    366 self._file_size = adjust_blob_size_for_encryption(self._file_size, self._encryption_data)

I'm trying to track down if this is an issue with my permission boundaries, with some other aspect of my azure environment, or with something xarray/zarr is doing.
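
A minimal probe of that hypothesis would be a bare suffix-range read through the same filesystem; cat_file is the synchronous counterpart of the _cat_file call in the traceback, and the blob path below is illustrative:

# ask for only the last 16 bytes of one shard object; if Azure (or this SAS
# token / environment) rejects suffix ranges, this should raise the same
# Content-Range error as above
tail = fs.cat_file("weatherblob/xr-test/some-store.zarr-v3/foo/c/0/0", start=-16)
print(len(tail))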

@lsim-aegeri
Author

lsim-aegeri commented Apr 15, 2025

Ok I finally got somewhere. Everything is working fine with obstore.

import xarray as xr
import numpy as np
import pandas as pd
from zarr.storage import ObjectStore
from obstore.store import AzureStore

ds = xr.Dataset(
    {"foo": (("x", "y"), np.random.rand(4, 5))},
    coords={
        "x": [10, 20, 30, 40],
        "y": pd.date_range("2000-01-01", periods=5),
        "z": ("x", list("abcd")),
    },
)

azure_store = AzureStore(
    container_name="abcdefg", 
    prefix='xr-test/test10.zarr-v3', 
    account_name="abcdefg",
    sas_key="abcdefg",
)
objstore = ObjectStore(store=azure_store)
ds.to_zarr(
    store=objstore,
    mode='w', 
    consolidated=False, 
    zarr_format=3, 
)
dsn = xr.open_zarr(
    objstore,
    consolidated=False,
    use_zarr_fill_value_as_mask=True,
    zarr_format=3
).compute()

Returns exactly what I was expecting. Given this works perfectly, I have to imagine this was an issue somewhere inside adlfs/fsspec.

Using obstore also suggested an explanation for the second issue with Content-Range: it raised a clear error saying that Azure does not support suffix range requests. That must be what was happening: I was trying to use shards, and reading shards needs suffix byte-range requests (hence the Content-Range response header), which apparently aren't supported for Azure blob storage?

I'm going to close this issue because I don't think there's anything for xarray to do.
