Skip to content

remote: efficiently collect directories #2648

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ghost opened this issue Oct 22, 2019 · 5 comments
Closed

remote: efficiently collect directories #2648

ghost opened this issue Oct 22, 2019 · 5 comments
Labels
discussion requires active participation to reach a conclusion

Comments

@ghost
Copy link

ghost commented Oct 22, 2019

Version: 0.62.1

Description: The current implementation for _collect_dir is an N+1 operation, where it walks the directory to list all the files and then for each one compute/request its checksum (get_file_checksum).

dvc/dvc/remote/base.py

Lines 195 to 231 in 4171aac

def _collect_dir(self, path_info):
file_infos = set()
for root, _dirs, files in self.walk(path_info):
if DvcIgnore.DVCIGNORE_FILE in files:
raise DvcIgnoreInCollectedDirError(root)
file_infos.update(path_info / root / fname for fname in files)
checksums = {fi: self.state.get(fi) for fi in file_infos}
not_in_state = {
fi for fi, checksum in checksums.items() if checksum is None
}
new_checksums = self._calculate_checksums(not_in_state)
checksums.update(new_checksums)
result = [
{
self.PARAM_CHECKSUM: checksums[fi],
# NOTE: this is lossy transformation:
# "hey\there" -> "hey/there"
# "hey/there" -> "hey/there"
# The latter is fine filename on Windows, which
# will transform to dir/file on back transform.
#
# Yes, this is a BUG, as long as we permit "/" in
# filenames on Windows and "\" on Unix
self.PARAM_RELPATH: fi.relative_to(path_info).as_posix(),
}
for fi in file_infos
]
# Sorting the list by path to ensure reproducibility
return sorted(result, key=itemgetter(self.PARAM_RELPATH))

The state saves us from getting all the checksums again (the N operation).
However, there are remotes like S3 that have an operation to list the objects with their checksums and other stats (list_objects).

Let's discuss if it make sense to take advantage of this operation, and replace the N+1 (get_filechecksum(file) for file in walk(dir) if not state.get(file)) with the one that returns the list of files with some metadata already.

Related: #1654

@ghost ghost added the discussion requires active participation to reach a conclusion label Oct 22, 2019
@shcheklein
Copy link
Member

@MrOutis again, could you please provide an example when it can benefit us? Am I correct that it will be a custom implementation for S3 to calculate a directory hash, for example when we need to use it as an external dependency and/or output?

If that is the case, does it mean, for example that it can be 1 vs N api calls via boto3? do we have some numbers to compare? It's hard to make any decisions w/p any context.

@ghost
Copy link
Author

ghost commented Oct 22, 2019

Yes, @shcheklein , this would be a custom implementation for S3.
I'll provide a more concrete example with some numbers when the support for directories on S3 is fully implemented (#2619)

@ghost ghost mentioned this issue Oct 23, 2019
3 tasks
@ghost
Copy link
Author

ghost commented Oct 28, 2019

I would leave this here:

    def _collect_dir(self, path_info):
        # See: https://github.com/iterative/dvc/issues/2648
        root = path_info.path

        return [
            {
                self.PARAM_CHECKSUM: entry["ETag"].strip('"'),
                self.PARAM_RELPATH: os.path.relpath(entry["Key"], start=root),
            }
            for entry in self._list_objects(path_info)
        ]

It is the implementation I was thinking for S3

@efiop
Copy link
Contributor

efiop commented Nov 4, 2019

Sounds good @MrOutis ! It would be interesting to see how much performance we are getting from overriding _collect_dir like that.

@efiop
Copy link
Contributor

efiop commented May 3, 2021

Stale

@efiop efiop closed this as completed May 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion requires active participation to reach a conclusion
Projects
None yet
Development

No branches or pull requests

2 participants