remote: efficiently collect directories #2648
Comments
@MrOutis again, could you please provide an example of when it can benefit us? Am I correct that it will be a custom implementation for S3 to calculate a directory hash, for example when we need to use it as an external dependency and/or output? If that is the case, does it mean, for example, that it can be 1 vs N API calls via boto3? Do we have some numbers to compare? It's hard to make any decisions w/o any context.

Yes, @shcheklein, this would be a custom implementation for S3.

I would leave this here:

    def _collect_dir(self, path_info):
        # See: https://github.com/iterative/dvc/issues/2648
        root = path_info.path
        return [
            {
                self.PARAM_CHECKSUM: entry["ETag"].strip('"'),
                self.PARAM_RELPATH: os.path.relpath(entry["Key"], start=root),
            }
            for entry in self._list_objects(path_info)
        ]

It is the implementation I was thinking of for S3.
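
For readers who want to try the idea outside of DVC, here is a minimal standalone sketch of the same approach. It assumes a boto3 S3 client (or anything exposing the same `get_paginator("list_objects_v2")` interface); the function names `list_objects` and `collect_dir` and the `"etag"` / `"relpath"` keys are placeholders for illustration, not DVC's actual names. It uses `posixpath` rather than `os.path` because S3 keys always use `/`, which is a deliberate deviation from the snippet above.

```python
import posixpath


def list_objects(client, bucket, prefix):
    """Yield every object entry under `prefix`, following pagination."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no keys.
        yield from page.get("Contents", [])


def collect_dir(client, bucket, prefix):
    """Build per-file entries from the listing alone (no per-file request)."""
    root = prefix.rstrip("/")
    return [
        {
            # Placeholder keys; the ETag comes back wrapped in double quotes.
            "etag": entry["ETag"].strip('"'),
            "relpath": posixpath.relpath(entry["Key"], start=root),
        }
        for entry in list_objects(client, bucket, root + "/")
    ]
```

One caveat worth keeping in mind: an ETag is only the MD5 of the content for objects uploaded in a single part without KMS encryption, so it works as an opaque change marker rather than a general content hash.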

Sounds good @MrOutis! It would be interesting to see how much performance we are getting from overriding

Stale
Version: 0.62.1
Description: The current implementation of `_collect_dir` is an N+1 operation: it `walk`s the directory to list all the files and then, for each one, computes or requests its checksum (`get_file_checksum`).

dvc/dvc/remote/base.py
Lines 195 to 231 in 4171aac

The `state` saves us from getting all the checksums again (the `N` operation).

However, there are remotes like `S3` that have an operation to list the objects with their checksums and other stats (`list_objects`).

Let's discuss whether it makes sense to take advantage of this operation and replace the N+1 (`get_file_checksum(file) for file in walk(dir) if not state.get(file)`) with one that returns the list of files with some metadata already.

Related: #1654