[v3] Caching from fsspec doesn't work with FSSpecStore #2988
Comments
@martindurant any ideas?
I can have a look, but probably not until next week. All filesystems should be asynchronous for zarr, and the cached filesystem should have been wrapped with the async wrapper. I suspect the async_impl or some other attribute is being fetched from the target of simplecache rather than from the instance itself.
Been having some issues with this as well; here are my findings. Just wanted to mention that wrapping the caching filesystem with AsyncFileSystemWrapper resolves the issue, at least partially, when I was having problems similar to the OP's.
Just some notes on what I was able to learn: Zarr 2.0 used fsspec's FSMap directly, but the new FsspecStore skips over the cache, always pulling the remote file if you plug a caching filesystem in directly. In Zarr 2.0, FSMap went through the synchronous read path, while the current zarr FsspecStore calls the async methods. The caching filesystems are not async, or at least don't have async functions to call (maybe coming soon, hopefully?). I think a complete async implementation inside WholeFileCacheFileSystem would perhaps be the more appropriate solution.
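The sync/async mismatch described above is easy to confirm: fsspec's caching filesystems are not AsyncFileSystem subclasses, so there is no coroutine for zarr's store to await. A minimal check (the memory target is just an example):

```python
import fsspec
from fsspec.asyn import AsyncFileSystem

# The caching filesystems shipped with fsspec are plain synchronous filesystems.
cached = fsspec.filesystem("simplecache", target_protocol="memory")

print(isinstance(cached, AsyncFileSystem))  # not an async filesystem
print(cached.async_impl)  # False: no async _cat_file for zarr to await
```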
Another follow-up on this: after some more investigation, I also realized that zarr 3.0's use of cat_file allows zarr to fetch specific byte ranges of a given file. I'm not sure this is supported by the caching filesystems, so I think a more custom solution is needed for my application.
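For reference, the byte-range behaviour mentioned here comes from cat_file's start/end parameters, which plain fsspec filesystems already support; the open question in this comment is whether the caching layer handles them. A small illustration, using an in-memory filesystem as a stand-in:

```python
import fsspec

mem = fsspec.filesystem("memory")
mem.pipe_file("/chunk", b"0123456789")

# zarr 3 issues byte-range requests through cat_file's start/end parameters.
print(mem.cat_file("/chunk", start=2, end=5))  # b"234"
```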
I don't think zarr reads partial files for native storage, unless there are special cases such as sharding or blocked compression. It should also make partial reads for uncompressed data, but does not; that's a longstanding request of mine. If you have indirection via ReferenceFileSystem, you can cache the virtual files of that, which is essentially what the old cache layer in zarr 2 would do. Getting this right in the context of async calls seems doable, but hard.
Yeah, Zarr used to support partial reads for certain codecs (blosc), and that would be a little more challenging to cache. However, for the basic case of whole-chunk reads, was anyone able to diagnose the root cause of the issue? We mainly used simplecache or filecache in the past, and it's broken with v3 stores.
Just closing the loop on this in case someone happens to have the same problem as I did. In the instances where I needed to manually cache items, I ended up implementing a lightweight async caching filesystem that supports the async calls zarr makes:

```python
# Assumed to live on a subclass of fsspec's caching filesystem; relies on
#   import typing
#   from fsspec.callbacks import DEFAULT_CALLBACK
#   from fsspec.utils import isfilelike
@typing.no_type_check
async def _cat_file(
    self,
    path,
    start=None,
    end=None,
    on_error="raise",
    callback=DEFAULT_CALLBACK,
    **kwargs,
):
    """Cat file; this is what zarr 3.0 calls at the moment."""
    getpath = None
    try:
        # Check if the file is in the cache.
        # Nothing fancy for chunked data: different byte ranges are stored
        # in different cache files even if they overlap.
        path_chunked = path
        if start:
            path_chunked += f"_{start}"
        if end:
            path_chunked += f"_{end}"
        detail = self._check_file(path_chunked)
        if not detail:
            storepath = await self._make_local_details(path_chunked)
            getpath = path
        else:
            detail, storepath = (
                detail if isinstance(detail, tuple) else (None, detail)
            )
    except Exception as e:
        if on_error == "raise":
            raise
        if on_error == "return":
            return e
    # If the file was not in the cache, fetch it with the base filesystem
    if getpath:
        resp = await self.fs._cat_file(getpath, start=start, end=end)
        # Save the response to the local cache file
        if isfilelike(storepath):
            outfile = storepath
        else:
            outfile = open(storepath, "wb")  # noqa: ASYNC101, ASYNC230
        try:
            outfile.write(resp)
            # TODO: progress reporting for the remote fetch
            # callback.relative_update(len(chunk))
        finally:
            if not isfilelike(storepath):
                outfile.close()
        self.save_cache()
    # Callback semantics are odd here: progress arguably belongs on the remote
    # fetch, but then how do we account for the local read?
    if start is None:
        start = 0
    callback.set_size(1)
    with open(storepath, "rb") as f:
        f.seek(start)
        if end is not None:
            out = f.read(end - start)
        else:
            out = f.read()
    callback.relative_update(1)
    return out
```

Definitely not the cleanest, but it got the job done. A more complete solution would be to either use the caching already built into the async FS store, or have a dedicated async simple-caching class. Just having an async caching filesystem would be better than fiddling with getting sync filesystems working, imo; async provides major speedups for my use case.
Zarr version: 3.0.6
Numcodecs version: 0.15.1
Python Version: 3.13
Operating System: Mac
Installation: using uv to install in a virtual environment
Description
We could initialize cached fsspec instances in Zarr v2 by passing protocol and storage options. This doesn't work in v3 anymore even though the cache gets initialized.
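A minimal sketch of the v2-style initialization described here: building a cached filesystem from a chained protocol plus per-protocol storage options. The URL and cache directory are hypothetical, with an in-memory target standing in for a real remote:

```python
import tempfile

import fsspec

# Hypothetical chained URL; in zarr v2 this protocol + storage-options pair
# could be handed to fsspec and the resulting mapper used directly.
fs, path = fsspec.core.url_to_fs(
    "simplecache::memory://bucket/array.zarr",
    simplecache={"cache_storage": tempfile.mkdtemp()},
)

# The outermost filesystem is the caching layer, as the description says
# ("the cache gets initialized") -- the v3 store just doesn't use it.
print(type(fs).__name__)
```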
Steps to reproduce

The fs property of root.store shows the correct cached fsspec instance. As I mentioned before, this works fine on v2.