GH-125413: pathlib: use `scandir()` to speed up `copy()` #126263
Use the new `PathBase.scandir()` method in `PathBase.copy()`, which greatly reduces the number of `PathBase.stat()` calls needed when copying. This also speeds up `Path.copy()`, which inherits the superclass implementation. Under the hood, we use directory entries to distinguish between files, directories and symlinks, and to retrieve a `stat_result` when reading metadata. This logic is extracted into a new `pathlib._abc.CopierBase` class, which helps reduce the number of underscore-prefixed support methods in the path interface.
When copying a directory of 100 empty files, this is about 10% faster when preserving metadata and about 5% faster without.
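As a rough illustration of why directory entries save work, here is a minimal sketch that uses the standard-library `os.scandir()` rather than the `PathBase.scandir()` method from this PR. The helper names `classify_entries` and `demo` are hypothetical. On most platforms an `os.DirEntry` can answer `is_symlink()` and `is_dir()` from data the directory listing already returned, so no separate `stat()` call is needed per child.

```python
import os
import tempfile


def classify_entries(directory):
    """Map each child name of `directory` to 'symlink', 'dir' or 'file'.

    Hypothetical helper: the type checks below are usually served from
    information cached on the os.DirEntry, not from a fresh stat() call.
    """
    kinds = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_symlink():
                kinds[entry.name] = 'symlink'
            elif entry.is_dir(follow_symlinks=False):
                kinds[entry.name] = 'dir'
            else:
                kinds[entry.name] = 'file'
    return kinds


def demo():
    # Build a tiny tree with one subdirectory and one empty file.
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, 'subdir'))
        with open(os.path.join(tmp, 'a.txt'), 'w'):
            pass
        return classify_entries(tmp)
```

A copy routine built this way can pick the file, directory or symlink branch for each child without first calling `stat()` on it.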
Here's my nocturnal review just before going to bed!
Misc/NEWS.d/next/Library/2024-11-01-04-21-26.gh-issue-125413.Z-jjZq.rst
Co-authored-by: Bénédikt Tran <[email protected]>
Thanks for the reviews, both! I'll wait to see if Adam has feedback before I merge.
Gentle nudge @AA-Turner, thanks in advance :) In case you didn't spot it, I'm planning to make
@@ -931,7 +975,7 @@ def move(self, target):
         """
         Recursively move this file or directory tree to the given destination.
         """
-        self._ensure_different_file(target)
+        target._copier.ensure_different_files(self, target)
This is wrong - `target` might be a string here.
Indeed, or a `PathLike` - could we use `self._copier`?
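The concern raised in this thread can be sketched with stand-in classes. Everything below (`CopyHelper`, `DemoPath`, `check_via_target`, `check_via_self`) is hypothetical and only loosely modelled on the quoted diff; it is not pathlib's implementation. The point is that looking the helper up on `target` fails when `target` is a plain `str`, while looking it up on `self` does not.

```python
import os


class CopyHelper:
    """Hypothetical helper; only loosely modelled on the PR's copier object."""

    def ensure_different_files(self, source, target):
        # Simplified check: compare the string forms of both paths.
        if os.fspath(source) == os.fspath(target):
            raise ValueError('source and target are the same file')


class DemoPath:
    """Hypothetical path class, not pathlib's implementation."""

    _copier = CopyHelper()

    def __init__(self, path):
        self._path = os.fspath(path)

    def __fspath__(self):
        return self._path

    def check_via_target(self, target):
        # Breaks when target is a plain str: str has no _copier attribute.
        target._copier.ensure_different_files(self, target)

    def check_via_self(self, target):
        # Works for str, os.PathLike and DemoPath targets alike.
        self._copier.ensure_different_files(self, target)
```

Another option, under the same assumptions, would be to convert `target` to a path object before touching any of its attributes.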
Thank you for the reminder Barney.
I've no real concerns here, but I admit I'm not entirely sure why `_CopierBase` needs to exist, as it seems to reimplement some functionality of the `Path` hierarchy. It may be that the implementation without it would've been much longer, though.
Other than that, comments throughout.
A
try:
    source_st = dir_entry.stat()
except AttributeError:
    source_st = source.stat()
Are there valid non-None types for `dir_entry` that don't have `stat`? Otherwise an `is None` check is cheaper.
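A minimal sketch of the `is None` variant suggested here, assuming `dir_entry` is either `None` or an object with a `stat()` method such as `os.DirEntry`. The function name `stat_source` is hypothetical, and it uses `os.stat()` on a plain path where the PR's code calls `source.stat()`.

```python
import os
import tempfile


def stat_source(source, dir_entry=None):
    """Return a stat result, preferring the directory entry when one is given."""
    if dir_entry is None:
        # No entry available: fall back to a regular stat() on the path.
        return os.stat(source)
    # os.DirEntry.stat() may reuse data gathered while listing the directory.
    return dir_entry.stat()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'data.bin')
        with open(path, 'wb') as f:
            f.write(b'hello')
        with os.scandir(tmp) as entries:
            entry = next(entries)
            from_entry = stat_source(path, entry).st_size
        from_path = stat_source(path).st_size
        return from_entry, from_path
```

Compared with catching `AttributeError`, the explicit check avoids raising and handling an exception on the common path where no entry is supplied, and it does not hide an unrelated `AttributeError` raised inside `stat()`.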
try:
    source = os.fspath(source)
except TypeError:
    if not isinstance(source, PathBase):
        raise
    CopierBase.copy_file(self, source, target, metadata_keys, dir_entry)
Do we permit types inheriting from `pathlib._abc.PathBase` but not also `PurePath` (or: not implementing `__fspath__`)? This seems a little strange at first look.
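For context on the quoted `try`/`except`, `os.fspath()` raises `TypeError` for any object that is not a `str`, `bytes` or `os.PathLike`. The sketch below shows that behaviour with a hypothetical `VirtualPath` class standing in for a path type that has no `__fspath__`; `local_path_or_none` is likewise a hypothetical name.

```python
import os
from pathlib import PurePosixPath


class VirtualPath:
    """Hypothetical path-like object with no __fspath__ (e.g. inside an archive)."""

    def __init__(self, name):
        self.name = name


def local_path_or_none(obj):
    """Return the filesystem path string for `obj`, or None if it has none."""
    try:
        return os.fspath(obj)
    except TypeError:
        # Not str, bytes or os.PathLike: there is no local path to hand
        # to functions that operate on real files.
        return None
```

Under this reading, the `except TypeError` branch in the quoted code is the route taken by path objects that cannot be passed to OS-level copy functions.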
        raise
    CopierBase.copy_file(self, source, target, metadata_keys, dir_entry)
else:
    copyfile(source, os.fspath(target))
Given the above exception handling for `fspath`, is `target` guaranteed to inherit from `PurePath` here?
@@ -634,6 +672,12 @@ def _filter_trailing_slash(self, paths):
                 path_str = path_str[:-1]
             yield path_str
 
+    def _join_dir_entry(self, dir_entry):
+        path_str = dir_entry.name if str(self) == '.' else dir_entry.path
Could we shortcut here rather than going via `str(self)`? The change to `iterdir` means that this is recalculated for each path in the directory being iterated over, rather than only once.
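One way to read this suggestion is to evaluate the `'.'` comparison once per directory instead of once per entry. The sketch below does that with the standard-library `os.scandir()`; the function name `child_paths` is hypothetical and this is not the PR's `iterdir()` implementation.

```python
import os
import tempfile


def child_paths(directory):
    """List child paths, using bare names when `directory` is '.'."""
    # Computed once per directory, not once per entry.
    use_name = os.fspath(directory) == '.'
    with os.scandir(directory) as entries:
        return sorted(entry.name if use_name else entry.path
                      for entry in entries)


def demo():
    old_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        for name in ('a', 'b'):
            with open(os.path.join(tmp, name), 'w'):
                pass
        absolute = child_paths(tmp)
        os.chdir(tmp)
        try:
            relative = child_paths('.')
        finally:
            # Leave the temporary directory before it is removed.
            os.chdir(old_cwd)
    return [os.path.basename(p) for p in absolute], relative
```

For a directory with many children, the per-entry work then reduces to a single attribute lookup on each entry.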
except IsADirectoryError as e:
    if not target.exists():
        # Raise a less confusing exception.
        raise FileNotFoundError(
            f'Directory does not exist: {target}') from e
    raise
Sanity checking -- does this directory error handling need to exist in `copy_file`? Can ensuring that source and target are files be handled as a precondition? No worries if 'yes', this just surprised me!
Thanks for the feedback @AA-Turner, and sorry for my much-delayed response. I ended up going a different way with making
The spiritual successor to this PR is here: #130238
In that PR, the new
I'll go through your feedback here and re-raise it on the other PR if it still applies.
Closing this PR. Sorry for the faff!
Use the new `PathBase.scandir()` method in `PathBase.copy()`, which greatly reduces the number of `PathBase.stat()` calls needed when copying. This also speeds up `Path.copy()`, which uses the superclass implementation.

Under the hood, we use directory entries to distinguish between files, directories and symlinks, and to retrieve a `stat_result` when reading metadata. This logic is extracted into a new `pathlib._abc.CopierBase` class, which helps reduce the number of underscore-prefixed support methods in the path interface. But it makes the patch a little large - sorry.

`os.DirEntry` objects from pathlib #125413