Closed
Labels
discussion: requires active participation to reach a conclusion

Description
Version: 0.62.1
Description: The current implementation of _collect_dir is an N+1 operation: it walks the directory to list all the files, and then computes or requests the checksum of each one (get_file_checksum).
Lines 195 to 231 in 4171aac
    def _collect_dir(self, path_info):
        file_infos = set()
        for root, _dirs, files in self.walk(path_info):
            if DvcIgnore.DVCIGNORE_FILE in files:
                raise DvcIgnoreInCollectedDirError(root)
            file_infos.update(path_info / root / fname for fname in files)
        checksums = {fi: self.state.get(fi) for fi in file_infos}
        not_in_state = {
            fi for fi, checksum in checksums.items() if checksum is None
        }
        new_checksums = self._calculate_checksums(not_in_state)
        checksums.update(new_checksums)
        result = [
            {
                self.PARAM_CHECKSUM: checksums[fi],
                # NOTE: this is lossy transformation:
                #   "hey\there" -> "hey/there"
                #   "hey/there" -> "hey/there"
                # The latter is fine filename on Windows, which
                # will transform to dir/file on back transform.
                #
                # Yes, this is a BUG, as long as we permit "/" in
                # filenames on Windows and "\" on Unix
                self.PARAM_RELPATH: fi.relative_to(path_info).as_posix(),
            }
            for fi in file_infos
        ]
        # Sorting the list by path to ensure reproducibility
        return sorted(result, key=itemgetter(self.PARAM_RELPATH))
The state saves us from getting all the checksums again (the N operation) for files it already knows about.
However, there are remotes like S3 that have an operation to list the objects together with their checksums and other stats (list_objects).
Let's discuss whether it makes sense to take advantage of this operation and replace the N+1 pattern (get_file_checksum(file) for file in walk(dir) if not state.get(file)) with a single call that returns the list of files with some metadata already attached.
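To make the proposal concrete, here is a minimal sketch of what a listing-based collection could look like for S3. It is not DVC's implementation: the function name collect_dir_from_listing and the "checksum" / "relpath" keys are hypothetical, and `client` is assumed to be a boto3 S3 client (or anything exposing the same get_paginator("list_objects_v2") interface). It also assumes the ETag is acceptable as a checksum, which only matches the MD5 of the content for objects that were not uploaded in multiple parts.

```python
from operator import itemgetter
from typing import Any, Dict, List


def collect_dir_from_listing(
    client: Any, bucket: str, prefix: str
) -> List[Dict[str, str]]:
    """Build directory entries from a paginated listing, with no per-file requests.

    `client` is assumed to behave like a boto3 S3 client.
    The "checksum" and "relpath" keys are illustrative names only.
    """
    # Normalise the prefix so that "data" does not also match "data2/...".
    if prefix and not prefix.endswith("/"):
        prefix += "/"

    paginator = client.get_paginator("list_objects_v2")
    result = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matched.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                # Skip zero-byte "directory" placeholder objects.
                continue
            result.append(
                {
                    # S3 returns the ETag wrapped in double quotes.
                    "checksum": obj["ETag"].strip('"'),
                    "relpath": key[len(prefix):],
                }
            )

    # Sort by path for reproducibility, as _collect_dir does.
    return sorted(result, key=itemgetter("relpath"))
```

With boto3 this would be called as collect_dir_from_listing(boto3.client("s3"), "my-bucket", "data/"), where the bucket and prefix are placeholders. The number of requests then grows with the number of listing pages rather than with the number of files.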
Related: #1654
efiop