Cannot open zarr store stored in Azure blob file system #10209
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! |
Potentially have a look at the files in the bucket — how do they differ from the files that are written to a local filesystem? If we write to one and then copy to the other, does that work? FWIW it seems unlikely this is an xarray issue (though not impossible). |
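For reference, a minimal sketch of that file comparison using fsspec; the container path, local path, and credentials below are placeholders:

import adlfs
import fsspec

fs = adlfs.AzureBlobFileSystem(account_name='abcdefg', sas_token='abcdefg')
local = fsspec.filesystem('file')

# Map each file's path relative to the store root to its size.
remote = {p.split('store.zarr/')[-1]: fs.size(p) for p in fs.find('container/store.zarr')}
local_ = {p.split('store.zarr/')[-1]: local.size(p) for p in local.find('/tmp/store.zarr')}

# Files present in one store but not the other:
print(sorted(set(remote) ^ set(local_)))
# Files whose sizes differ:
print({k: (remote[k], local_[k]) for k in remote.keys() & local_.keys() if remote[k] != local_[k]})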
Yeah, I can't say for sure where the issue is coming from, but I am getting the results I'd expect after downgrading to an older version of xarray (commit: None; xarray: 2024.7.0). When I write to a zarr v2 store:

ds.to_zarr(
store,
storage_options=fs.storage_options,
mode='w',
consolidated=False,
zarr_version=2,
)
ds_zarr = xr.open_zarr(
store,
storage_options=fs.storage_options,
consolidated=False,
zarr_version=2,
)
ds_zarr.info

I get what I'd expect.
I'd really prefer to use the latest versions though, because I'm very excited for zarr v3 and that requires more recent versions of xarray. |
Could you look at this? |
Yes, working on it! |
Very weird.
I compared all the JSON files when writing locally and to blob and could not see any differences. I also checked file sizes, and they were the same. Do you know what to make of this? |
But I think you're right, this probably isn't purely an xarray issue. I can also read the files if I write to blob, then clone within blob, and open from blob. It feels like something related to permissions? I'll investigate further. I do wonder if a helpful change could be made to xarray to prevent this, given that reading and writing work fine using xarray v2024.7.0. I'll follow up when I've identified the problem. |
Hi, just wanted to follow up on this issue: I've also encountered this bug after writing a Zarr v3 store to the JASMIN S3-compatible object store in the UK. So far, I've found that I can open the unconsolidated Zarr store as expected by manually passing the store as follows:

store = zarr.storage.FsspecStore(fs=obj_store, path=dest)
xr.open_zarr(store, zarr_format=3, consolidated=False)

where obj_store is the fsspec filesystem for the object store and dest is the path to the store within it. |
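For context, a hedged sketch of how such a store object might be constructed with s3fs; the endpoint, bucket, and prefix are placeholders:

import s3fs
import xarray as xr
import zarr

# zarr's FsspecStore expects an asynchronous fsspec filesystem.
obj_store = s3fs.S3FileSystem(
    anon=True,
    asynchronous=True,
    endpoint_url='https://my_end_point_url',
)
dest = 'bucket/prefix'  # placeholder path within the object store

store = zarr.storage.FsspecStore(fs=obj_store, path=dest)
ds = xr.open_zarr(store, zarr_format=3, consolidated=False)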
In case it's of interest, the following also works for me:

store_path = "s3://bucket/prefix"
ds = xr.open_zarr(
    store_path,
    zarr_format=3,
    consolidated=False,
    storage_options={
        "anon": True,
        "asynchronous": True,
        "client_kwargs": {"endpoint_url": "https://my_end_point_url"},
    },
)

I find it interesting, though, that reading the same store created with consolidated metadata does not require any of these extra storage_options. |
Thank you all for diving into this. It's highly likely that this is a bug in the zarr-python library rather than in xarray. Has anyone tried to replicate the bug simply by opening the zarr store using zarr-python directly and listing the contents of the group using the zarr-python API? That would be my next move. |
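A minimal sketch of that check, assuming an fsspec filesystem fs and a placeholder store path like those used earlier in this thread:

import zarr

store = zarr.storage.FsspecStore(fs=fs, path='container/store.zarr')  # placeholder
grp = zarr.open_group(store, mode='r', zarr_format=3, use_consolidated=False)

# Listing the group's members exercises the store's directory-listing code path,
# which is where the empty-dataset symptom would first show up.
print(list(grp.keys()))
print(grp.info_complete())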
Hi Tom, I tried opening the group directly with zarr-python:

zarr.open_group(url, mode='r', zarr_format=3, use_consolidated=False)

I'll drop an issue in zarr-python, as currently the way around this in xarray is to manually specify the store. |
Thank you @oj-tooth! Yes, this makes some sense to me. Please raise an issue on zarr-python upstream and link to this issue! 🙏 |
Based on the issue @oj-tooth opened, it seems like our problems are different: oj-tooth wanted to make zarr stores with unconsolidated metadata available over HTTP, which is impossible because listing directory contents doesn't work over HTTP. I also tried the solution oj-tooth suggested:

import zarr
dsn = xr.open_zarr(
zarr.storage.FsspecStore(fs=fs, path=store.split('://')[1]),
consolidated=False,
use_zarr_fill_value_as_mask=True,
)
dsn.compute()

I am still getting an empty result. Also, my open_zarr calls return empty datasets regardless of whether I'm using consolidated or unconsolidated metadata. I'm about to test with zarr-python directly; I hope that sheds a little more light. |
I also get an empty result when reading with zarr-python directly:

zstore = zarr.storage.FsspecStore(fs=fs, path=store.split('://')[1])
grp = zarr.open_group(zstore, mode='r')
grp.info_complete()

Result:
I will open an issue there as well. |
You're correct. Oversight on my part, as I thought I'd run into the same issue, but actually I just needed to learn more about the limits of reading zarr via HTTP requests. Sorry not to have been more helpful! |
Actually, when I tried to create a reproducible example using pure zarr-python, I found I'm not having the same problem:

import adlfs
import numpy as np
import zarr

fs = adlfs.AzureBlobFileSystem(
account_name="abcdefg",
sas_token="abcdefg",
)
store = zarr.storage.FsspecStore(
fs=fs,
path='weatherblob/xr-test/test_zarr_direct.zarr-v3'
)
# Create the array
root = zarr.create_group(store=store, zarr_format=3, overwrite=True)
z1 = root.create_array(name='foo', shape=(10000, 10000), chunks=(1000, 1000), dtype='int32')
z1[:] = np.random.randint(0, 100, size=(10000, 10000))
# Read it back
root_read = zarr.open_group(store=store, zarr_format=3, mode='r')
root_read['foo'][:]

Yields, as expected:
But then when I try to do the round trip with xarray, I get an empty dataset:

import xarray as xr

ds = xr.Dataset(
{"foo": xr.DataArray(root_read['foo'][:], dims=['x', 'y'])},
)
store_xr = zarr.storage.FsspecStore(
fs=fs,
path='weatherblob/xr-test/test_zarr_direct_xr.zarr-v3'
)
ds.to_zarr(
store_xr,
mode='w',
consolidated=False,
zarr_format=3,
)
xr.open_zarr(
store_xr,
consolidated=False,
zarr_format=3,
).info

yields:
This leads me to believe it's actually an issue somewhere in xarray. Or I guess it could still be an issue in zarr that's surfacing in the xarray round trip and not in the zarr-python round trip, because xarray may be using zarr-python in a way that differs from my example; I'm not well enough versed in the xarray read/write process to know for sure. I do not have accounts with other cloud storage providers, so I cannot test this with GCP or AWS, unfortunately. |
All good! Even if it doesn't help solve this particular issue, I still learned something new! |
I have a little bit to add to the previous example. When I then try to read the store I wrote with xarray, I get some inconsistent results.
But:
I'm interpreting this as meaning that zarr-python itself cannot see any variables in the group, BUT zarr-python can read individual variables in the group when asked to do so directly. |
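A minimal sketch of the two access paths being contrasted here, reusing zstore from the earlier snippet:

import zarr

grp = zarr.open_group(zstore, mode='r', zarr_format=3, use_consolidated=False)

# Path 1: enumerate the group's members. This relies on listing the store,
# and is what comes back empty.
print(list(grp.keys()))

# Path 2: look up a variable by name. This fetches its metadata document
# directly, without listing, and still works.
arr = grp['foo']
print(arr[:5, :5])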
And then there's still the additional wrinkle that when I simply copy the dataset from one blob path to another, I can open it just fine with xarray:

store_xr = zarr.storage.FsspecStore(
fs=fs,
path='weatherblob/xr-test/test_zarr_direct_xr.zarr-v3'
)
store_xr_cp = zarr.storage.FsspecStore(
fs=fs,
path='weatherblob/xr-test/test_zarr_direct_xr_cp.zarr-v3'
)
fs.copy(store_xr.path, store_xr_cp.path, recursive=True)
xr.open_zarr(
store=f'abfs://{store_xr_cp.path}',
storage_options=store_xr_cp.fs.storage_options,
consolidated=False,
zarr_format=3,
).info

yields:
I'm feeling quite lost here. |
I'm concurrently dealing with a separate issue reading data into memory with ds.compute(). Unfortunately, I'm not able to reproduce that issue here: it doesn't appear with the dummy data above, only when I try to read ECMWF EPS reforecast data I've downloaded. However, the error message I'm getting makes me suspicious there might be a relationship.
The last section of the message (copied below) suggests that the adlfs.AzureBlobFileSystem I'm passing in isn't able to read the file size of the blobs I'm saving. I think there's a chance all of this is happening because the filesystem somehow can't read metadata on the blobs until they've been copied. That would explain why I can't open the stores initially, because zarr needs to detect files to open the dataset, and it would also explain this other issue.
I'm trying to track down if this is an issue with my permission boundaries, with some other aspect of my azure environment, or with something xarray/zarr is doing. |
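One way to test that hypothesis, sketched with the store path from the earlier snippets, is to ask the filesystem for blob metadata directly:

# Probe whether adlfs can see the blobs and their sizes at all.
path = 'weatherblob/xr-test/test_zarr_direct_xr.zarr-v3'
print(fs.exists(f'{path}/zarr.json'))  # is the group metadata blob visible?
print(fs.info(f'{path}/zarr.json'))    # does info() report a size?
print(fs.find(path))                   # does a recursive listing see anything?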
OK, I finally got somewhere. Everything is working fine with obstore:

import xarray as xr
import numpy as np
import pandas as pd
from zarr.storage import ObjectStore
from obstore.store import AzureStore
ds = xr.Dataset(
{"foo": (("x", "y"), np.random.rand(4, 5))},
coords={
"x": [10, 20, 30, 40],
"y": pd.date_range("2000-01-01", periods=5),
"z": ("x", list("abcd")),
},
)
azure_store = AzureStore(
container_name="abcdefg",
prefix='xr-test/test10.zarr-v3',
account_name="abcdefg",
sas_key="abcdefg",
)
objstore = ObjectStore(store=azure_store)
ds.to_zarr(
store=objstore,
mode='w',
consolidated=False,
zarr_format=3,
)
dsn = xr.open_zarr(
objstore,
consolidated=False,
use_zarr_fill_value_as_mask=True,
zarr_format=3
).compute()

returns exactly what I was expecting. Given this works perfectly, I have to imagine this was an issue somewhere inside adlfs/fsspec. Using obstore also suggested an explanation for the second issue with Content-Range: it gave me a nice error saying that Azure does not support suffix range requests. This must be what was happening: I was trying to use shards, and reading shards needs suffix byte-range requests (the shard index sits at the end of each shard file), which apparently aren't supported by Azure blob storage. I'm going to close this issue because I don't think there's anything for xarray to do. |
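For what it's worth, one way to check whether a backend honours suffix range requests (reading the last N bytes of a blob) is a sketch like the following; the shard path is a placeholder:

# Placeholder path to a single shard file inside the store.
shard = 'weatherblob/xr-test/some-store.zarr-v3/foo/c/0/0'
try:
    tail = fs.cat_file(shard, start=-64)  # negative start = read the final 64 bytes
    print(f'suffix read OK, got {len(tail)} bytes')
except Exception as exc:
    print(f'suffix read failed: {exc}')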
What happened?
I'm trying to save data to a zarr store hosted on my Azure blob filesystem. I'm able to write the data just fine, but I am unable to open the dataset afterwards: when I try to open it, I get an empty dataset with no variables. I believe this is related to the storage backend, because everything works when I write the data to my local filesystem instead. I also don't think this is a zarr version issue, because the same thing happens with zarr v2 and v3, and when I use consolidated metadata with zarr v2.
What did you expect to happen?
I should be able to read data from the zarr stores I write to my Azure blob filesystem.
Minimal Complete Verifiable Example
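A minimal sketch of the failing round trip, assembled from the snippets in the thread above; the account details and path are placeholders:

import adlfs
import numpy as np
import xarray as xr
import zarr

fs = adlfs.AzureBlobFileSystem(account_name='abcdefg', sas_token='abcdefg')
store = zarr.storage.FsspecStore(fs=fs, path='weatherblob/xr-test/mvce.zarr-v3')

ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))})
ds.to_zarr(store, mode='w', consolidated=False, zarr_format=3)

# Expected: a dataset containing 'foo'; observed: an empty dataset.
print(xr.open_zarr(store, consolidated=False, zarr_format=3))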
MVCE confirmation
Relevant log output
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-1071-azure
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2025.3.1
pandas: 2.2.3
numpy: 2.1.3
scipy: 1.15.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: 3.0.6
cftime: None
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2025.3.0
distributed: 2025.3.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2025.3.2
cupy: None
pint: None
sparse: 0.16.0
flox: 0.10.2
numpy_groupies: 0.11.2
setuptools: None
pip: None
conda: None
pytest: None
mypy: None
IPython: 9.1.0
sphinx: None
adlfs: 2024.12.0