[v3] Caching from fsspec doesn't work with FSSpecStore #2988

Open
tasansal opened this issue Apr 16, 2025 · 7 comments
Labels
bug Potential issues with the zarr-python library

Comments

@tasansal
Contributor

tasansal commented Apr 16, 2025

Zarr version

3.0.6

Numcodecs version

0.15.1

Python Version

3.13

Operating System

Mac

Installation

using uv to install in a virtual environment

Description

We could initialize cached fsspec instances in Zarr v2 by passing protocol and storage options. This doesn't work in v3 anymore even though the cache gets initialized.

Steps to reproduce

import zarr

remote_store = "gs://<bucket>/<prefix>"  # example remote store

root = zarr.open_consolidated(
    f"simplecache::{remote_store}",
    storage_options={"simplecache": {"cache_storage": "./test_cache"}},
)

%time _ = root["array"][3600, 2048:4096]  # example read
  1. This gives a warning about the store not being initialized with an async filesystem
  2. The cache directory gets created as expected
  3. The fs property of root.store shows the correct cached fsspec instance
  4. Reads always fetch from the remote and never use the cache

As I mentioned before, this works fine on v2.

@tasansal tasansal added the bug Potential issues with the zarr-python library label Apr 16, 2025
@d-v-b
Contributor

d-v-b commented Apr 16, 2025

@martindurant any ideas?

@martindurant
Member

I can have a look, but probably not until next week.

All filesystems should be asynchronous for zarr, and the cached filesystem should have been wrapped with the async wrapper. I suspect the async_impl or some other attribute is being fetched from the target of simplecache rather than from the cached instance itself.
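
A minimal sketch of such a check, assuming the root object from the reproduction above:

fs = root.store.fs                       # the filesystem zarr actually calls into
print(type(fs))                          # cached filesystem, wrapped or not?
print(getattr(fs, "async_impl", None))   # if this value comes from the simplecache target, the suspicion holds
meth = getattr(fs, "_cat_file", None)
print(getattr(meth, "__self__", None))   # which instance would really serve the async reads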

@NickGeneva

Been having some issues with this as well—here are my findings:

Just wanted to mention that wrapping the cache file system with AsyncFileSystemWrapper at least partially resolves the issue; I was having problems similar to the OP's.

So something like:

print(f"Zarr version: {zarr.__version__}")
cache_options = {"cache_storage": "cache"}
fs = gcsfs.GCSFileSystem(cache_timeout=-1, token="anon",access="read_only")
fs = WholeFileCacheFileSystem(fs=fs, **cache_options)
fs = AsyncFileSystemWrapper(fs)

zstore = zarr.storage.FsspecStore(fs, path=".....")
zarr_group = zarr.open(store=zstore, mode="r")

Just some notes on what I was able to learn:

Zarr 2.0 used fsspec's FSMap directly, but the new FsspecStore skips over the cache and always pulls the remote file if you just plug a caching file system in directly.

In Zarr 2.0, FSMap's __getitem__ uses the base file system's cat method, which the caching file systems do implement.

The current zarr FsspecStore instead calls the async _cat_file function, which the caching file systems don't have because they are sync, so the call just falls through to the base file system (see the sketch after these notes).

Caching file systems are not async, or at least don't have async functions to call (maybe coming soon, hopefully?).
The async wrapper works out of the box for WholeFileCacheFileSystem sometimes, but I had some issues with SimpleCacheFileSystem.
I think the issue may be somewhere in SimpleCacheFileSystem._open.
Additionally, using this wrapper outside of a minimal example seems to cause some odd behavior. Still debugging...

I think a complete async implementation inside WholeFileCacheFileSystem would perhaps be the more appropriate solution.
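
A minimal sketch of that gap, using fsspec's in-memory filesystem as a stand-in for the remote:

import fsspec
from fsspec.implementations.cached import WholeFileCacheFileSystem

remote = fsspec.filesystem("memory")  # stand-in for the real remote filesystem
cached = WholeFileCacheFileSystem(fs=remote, cache_storage="cache")

print(type(cached).async_impl)             # False: the caching layer is synchronous
print(hasattr(type(cached), "_cat_file"))  # False: no native async _cat_file to intercept reads

Since FsspecStore awaits _cat_file, the lookup resolves past the caching classes and the read goes straight to the wrapped remote filesystem.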

@NickGeneva

Another follow-up on this:

After some more investigation, I also realized that zarr 3.0's use of cat_file lets it fetch specific byte ranges of a given file. I'm not sure byte-range reads are supported by WholeFileCacheFileSystem (and thus SimpleCacheFileSystem?); at least it doesn't seem like it from what I can tell.
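
For reference, this is roughly the call shape in question; the sync counterpart of the _cat_file that zarr 3 awaits, with a placeholder path:

import gcsfs

fs = gcsfs.GCSFileSystem(token="anon")
# fetch only bytes [0, 4096) of one object; the whole-file caches would pull the entire file to serve this
first_bytes = fs.cat_file("gs://<bucket>/<prefix>/some_chunk", start=0, end=4096)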

I think a more custom solution is needed for my application.

@martindurant
Member

I don't think zarr reads partial files for native storage, except in special cases such as sharding or blocked compression. It should also make partial reads for uncompressed data, but does not; that is a longstanding request of mine.
You are right that wholefilecache and simplecache should read the whole file; there is also CachingFileSystem, which reads only the requested blocks, but (if I remember correctly; I might be wrong) only via the file-like interface (i.e., open()), not cat() (see the sketch below).

If you have indirection via ReferenceFileSystem, you can cache the virtual files of that, which is essentially what the old cache layer in zarr2 would do. Getting this right in the context of async calls seems doable but hard.
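
A minimal sketch of that file-like path, assuming anonymous GCS access and placeholder paths:

import fsspec

# "blockcache" is CachingFileSystem: it caches fixed-size blocks locally, but only
# when data is read through open(), not through cat()/cat_file()
fs = fsspec.filesystem(
    "blockcache",
    target_protocol="gs",
    target_options={"token": "anon"},
    cache_storage="./block_cache",
)
with fs.open("gs://<bucket>/<prefix>/some_file", block_size=2**20) as f:
    f.seek(4096)
    part = f.read(1024)  # only the blocks covering this range are fetched and cached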

@tasansal
Contributor Author

Yeah, Zarr used to support partial reads for certain codecs (blosc), and those would be a little more challenging to cache. However, for the basic case of whole-chunk reads, was anyone able to diagnose the root cause of the issue? We have mainly used simplecache for file caching in the past, and it is broken with v3 stores.

@NickGeneva

NickGeneva commented May 13, 2025

Just closing the loop on this in case someone happens to have the same problem as I did. In the instances where I needed to manually cache items, I ended up implementing a lightweight async caching file system that supports _cat_file, with a lot of code repeated from the simple caching file system. Basically an async file system with something like:

  # DEFAULT_CALLBACK comes from fsspec.callbacks; isfilelike from fsspec.utils
  @typing.no_type_check
  async def _cat_file(
      self,
      path,
      start=None,
      end=None,
      on_error="raise",
      callback=DEFAULT_CALLBACK,
      **kwargs,
  ):
      """Cat file, this is what is used inside Zarr 3.0 at the moment"""
      getpath = None
      try:
          # Check if file is in cache
          # This doesn't do anything fancy for chunked data: different chunk ranges
          # are stored in different cache files even if they overlap
          path_chunked = path
          if start:
              path_chunked += f"_{start}"
          if end:
              path_chunked += f"_{end}"

          detail = self._check_file(path_chunked)
          if not detail:
              storepath = await self._make_local_details(path_chunked)
              getpath = path
          else:
              detail, storepath = (
                  detail if isinstance(detail, tuple) else (None, detail)
              )
      except Exception as e:
          if on_error == "raise":
              raise
          if on_error == "return":
              return e  # nothing cached or fetched; avoid falling through to an unbound storepath

      # If file was not in cache, get it using the base file system
      if getpath:
          resp = await self.fs._cat_file(getpath, start=start, end=end)
          # Save to file
          if isfilelike(storepath):
              outfile = storepath
          else:
              outfile = open(storepath, "wb")  # noqa: ASYNC101, ASYNC230
          try:
              outfile.write(resp)
              # IDK yet
              # callback.relative_update(len(chunk))
          finally:
              if not isfilelike(storepath):
                  outfile.close()
          self.save_cache()

      # Callback handling is awkward here: the progress arguably belongs on the file fetch,
      # but then how do we deal with the read? I don't know.
      if start is None:
          start = 0

      callback.set_size(1)
      with open(storepath, "rb") as f:
          f.seek(start)
          if end is not None:
              out = f.read(end - start)
          else:
              out = f.read()
      callback.relative_update(1)
      return out

Definitely not the cleanest, but it got the job done. A more complete solution would be either caching built into the async FS store itself, or a dedicated async simple-caching class. Just having an async caching file system would be better than fiddling with getting sync file systems to work, IMO; async provides major speed-ups for my use case.
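
For illustration, a hypothetical skeleton of where a _cat_file like the one above could live, assuming fsspec's AsyncFileSystem as the base; the class name, protocol, and wiring are placeholders, and the cache bookkeeping helpers are only indicated:

from fsspec.asyn import AsyncFileSystem
from fsspec.callbacks import DEFAULT_CALLBACK  # used by the _cat_file above
from fsspec.utils import isfilelike            # used by the _cat_file above


class AsyncCachingFileSystem(AsyncFileSystem):
    """Hypothetical async caching layer: one local cache file per (path, start, end) request."""

    protocol = "asynccache"

    def __init__(self, fs, cache_storage="cache", **kwargs):
        super().__init__(**kwargs)
        self.fs = fs                    # the async remote filesystem being cached (e.g. gcsfs)
        self.storage = [cache_storage]  # local cache location, mirroring fsspec's cached.py
        # _check_file, _make_local_details and save_cache would be adapted from
        # fsspec.implementations.cached.SimpleCacheFileSystem, as described above

    # the _cat_file shown above goes here


# once the cache bookkeeping is in place, wiring into zarr would look roughly like:
#   cached = AsyncCachingFileSystem(fs=gcsfs.GCSFileSystem(token="anon"),
#                                   cache_storage="./test_cache")
#   store = zarr.storage.FsspecStore(cached, path="<bucket>/<prefix>")
#   group = zarr.open(store=store, mode="r")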
