remote: efficiently collect directories #2648

Closed
@ghost

Description

Version: 0.62.1

Description: The current implementation of _collect_dir is an N+1 operation: it walks the directory to list all the files, then computes or requests the checksum of each one individually (get_file_checksum).

dvc/dvc/remote/base.py

Lines 195 to 231 in 4171aac

    def _collect_dir(self, path_info):
        file_infos = set()
        for root, _dirs, files in self.walk(path_info):
            if DvcIgnore.DVCIGNORE_FILE in files:
                raise DvcIgnoreInCollectedDirError(root)
            file_infos.update(path_info / root / fname for fname in files)
        checksums = {fi: self.state.get(fi) for fi in file_infos}
        not_in_state = {
            fi for fi, checksum in checksums.items() if checksum is None
        }
        new_checksums = self._calculate_checksums(not_in_state)
        checksums.update(new_checksums)
        result = [
            {
                self.PARAM_CHECKSUM: checksums[fi],
                # NOTE: this is lossy transformation:
                #    "hey\there" -> "hey/there"
                #    "hey/there" -> "hey/there"
                # The latter is fine filename on Windows, which
                # will transform to dir/file on back transform.
                #
                # Yes, this is a BUG, as long as we permit "/" in
                # filenames on Windows and "\" on Unix
                self.PARAM_RELPATH: fi.relative_to(path_info).as_posix(),
            }
            for fi in file_infos
        ]
        # Sorting the list by path to ensure reproducibility
        return sorted(result, key=itemgetter(self.PARAM_RELPATH))

The state saves us from getting all the checksums again (the N operation).
However, there are remotes like S3 that have an operation to list the objects with their checksums and other stats (list_objects).

Let's discuss whether it makes sense to take advantage of this operation and replace the N+1 pattern (get_file_checksum(file) for file in walk(dir) if not state.get(file)) with a single listing call that returns the files together with their metadata.
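
To make the proposal concrete, here is a minimal sketch of how a listing-based collection could look. It is not DVC's implementation: the function name `collect_dir_from_listing`, the constants `PARAM_CHECKSUM` and `PARAM_RELPATH` (stand-ins for `self.PARAM_CHECKSUM` and `self.PARAM_RELPATH`), and the `pages` argument are all assumptions made for illustration. `pages` is expected to be shaped like the response pages of S3's ListObjectsV2 call, where each entry under `Contents` carries a `Key` and a quoted `ETag`. The sketch also leaves out the `.dvcignore` check that `_collect_dir` performs.

```python
from operator import itemgetter

# Assumed names, standing in for self.PARAM_CHECKSUM / self.PARAM_RELPATH.
PARAM_CHECKSUM = "etag"
PARAM_RELPATH = "relpath"


def collect_dir_from_listing(pages, prefix):
    """Build the directory entries from bulk-listing pages (hypothetical helper).

    `pages` is any iterable of dicts shaped like S3 ListObjectsV2 response
    pages. With boto3 they could come from, for example:
        boto3.client("s3").get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix)
    so the checksums arrive with the listing instead of one request per file.
    """
    # Normalize the prefix so that "data" does not also match "database/...".
    prefix = prefix.rstrip("/") + "/" if prefix else ""
    result = []
    for page in pages:
        # A page for an empty prefix has no "Contents" key at all.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Skip keys outside the prefix and "directory" placeholder objects.
            if not key.startswith(prefix) or key.endswith("/"):
                continue
            result.append(
                {
                    # S3 returns the ETag wrapped in double quotes.
                    PARAM_CHECKSUM: obj["ETag"].strip('"'),
                    PARAM_RELPATH: key[len(prefix):],
                }
            )
    # Sort by path for reproducibility, as _collect_dir does.
    return sorted(result, key=itemgetter(PARAM_RELPATH))
```

One caveat worth raising in the discussion: an S3 ETag is not always an MD5 of the content (multipart uploads and some encryption modes produce different values), so whether the listed ETag can replace the per-file checksum depends on how the remote defines its checksum.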

Related: #1654

Labels: discussion (requires active participation to reach a conclusion)
