Skip to content

instance- and dir-caches #243

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 10, 2020
Merged

instance- and dir-caches #243

merged 5 commits into from
Mar 10, 2020

Conversation

martindurant
Copy link
Member

@TomAugspurger what do you think, is this a simple way to solve things? I'm thinking every fsspec instance will have a dircache instance attached, but they are not required to use it (e.g., local never would). What exactly the cache contains is up to the implementation, but the structure in the docstring covers s3fs and gcsfs . When I plug it in, invalidate_cache() should maybe be moved to the new class, with an optional flag on whether directory parents should also be invalidated.

Copy link
Collaborator

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this replace #216?

import time


class DirCache(dict):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're changing the __init__ signature, I think it'd be better to compose this class with a ._cache method rather than inheriting from dict. You might inherit from MutableMapping instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll probably inherit from nothing in the end, since I don't expect calls to keys(), etc.

self.q(item)
return self._cache[item] # maybe raises KeyError

def __setitem__(self, key, value):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this validate the structure of value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not certain, I don't know if we want to force implementations here. Might make things simpler if we do.

Parameters to this class control listing expiry or indeed turn
caching off
"""
def __init__(self, usecache=True, expiry_time=False, max_size=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this max_size in terms of elements or bytes?

Evicting based on the number of elements doesn't seem especially useful to me, at least no compared to the size of the values. But perhaps s3fs or gcsfs already does that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is number of elements as it stands, because it's far easier to implement using LRU.
There is no such expiry anywhere yet; gcsfs has time-based expiry.

self._times = {}
if max_size:
self.q = lru_cache(max_size + 1)(lambda key: self._cache.pop(key, None))
self.use = usecache
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These names should match the names of the arguments to init.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

self._cache = {}
self._times = {}
if max_size:
self.q = lru_cache(max_size + 1)(lambda key: self._cache.pop(key, None))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit strange to have the max_size parameter affect the attribute where we actually store things. That leads to stuff like if self.max later on. Can we just always store it in the same place?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

._cache stored the listings. The q and times dicts hold only information on what to automatically evict, and are not accessed if eviction is not desired.

@martindurant
Copy link
Member Author

@TomAugspurger , this can be considered working, although the docs should be updated to explain this (and changes propagated to s3fs, gcsfs).

Did you have an opinion if invalidate_cache should live in the DirCache implementation?

@TomAugspurger
Copy link
Collaborator

Did you have an opinion if invalidate_cache should live in the DirCache implementation?

Do you mean a DirCache.invalidate_cache method? Seems reasonable (and the FileSystem instance would just call that).

@martindurant martindurant changed the title WIP: instance- and dir-caches instance- and dir-caches Mar 9, 2020
@martindurant
Copy link
Member Author

if invalidate_cache should live in the DirCache implementation?

I'm going against this in the end, because there can be different requirements in the implementations (e.g., on s3, the creation of a key may affect all parent dirs, but on ssh you must explicitly make those dirs first)

@martindurant martindurant merged commit 4032091 into fsspec:master Mar 10, 2020
@martindurant martindurant deleted the caches branch March 10, 2020 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants