Skip to content

GH-125413: pathlib: use scandir() to speed up copy() #126263

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented Nov 1, 2024

Use the new PathBase.scandir() method in PathBase.copy(), which greatly reduces the number of PathBase.stat() calls needed when copying. This also speeds up Path.copy(), which uses the superclass implementation.

Under the hood, we use directory entries to distinguish between files, directories and symlinks, and to retrieve a stat_result when reading metadata. This logic is extracted into a new pathlib._abc.CopierBase class, which helps reduce the number of underscore-prefixed support methods in the path interface. But it makes the patch a little large - sorry.

Use the new `PathBase.scandir()` method in `PathBase.copy()`, which greatly
reduces the number of `PathBase.stat()` calls needed when copying. This
also speeds up `Path.copy()`, which inherits the superclass implementation.

Under the hood, we use directory entries to distinguish between files,
directories and symlinks, and to retrieve a `stat_result` when reading
metadata. This logic is extracted into a new `pathlib._abc.CopierBase`
class, which helps reduce the number of underscore-prefixed support
methods in the path interface.
@barneygale barneygale added performance Performance or resource usage topic-pathlib labels Nov 1, 2024
@barneygale barneygale marked this pull request as ready for review November 1, 2024 05:20
@barneygale
Copy link
Contributor Author

Copying a directory of 100 empty files, this is about 10% faster when preserving metadata, and 5% faster without.

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my nocturnal review just before going to bed!

@barneygale
Copy link
Contributor Author

Thanks for the reviews, both! I'll wait to see if Adam has feedback before I merge.

@barneygale
Copy link
Contributor Author

Gentle nudge @AA-Turner, thanks in advance :)

In case you didn't spot it, I'm planning to make scandir() private again after this PR lands. Thread here. The TL;DR is that it's not sufficiently motivated at the moment, but it might be if we make the pathlib ABCs public later on.

@@ -931,7 +975,7 @@ def move(self, target):
"""
Recursively move this file or directory tree to the given destination.
"""
self._ensure_different_file(target)
target._copier.ensure_different_files(self, target)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong - target might be a string here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, or a PathLike - could we use self._copier?

Copy link
Member

@AA-Turner AA-Turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reminder Barney.

I've no real concerns here, but I admit I'm not entirely sure why _CopierBase needs to exist, as it seems to reimplement some functionality of the Path hierarchy. It may be that the implementation without it would've been much longer, though.

Other than that, comments throughout.

A

Comment on lines +113 to +116
try:
source_st = dir_entry.stat()
except AttributeError:
source_st = source.stat()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there valid non-None types for dir_entry that don't have stat? Otherwise a is None check is cheaper.

@@ -931,7 +975,7 @@ def move(self, target):
"""
Recursively move this file or directory tree to the given destination.
"""
self._ensure_different_file(target)
target._copier.ensure_different_files(self, target)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, or a PathLike - could we use self._copier?

Comment on lines +69 to +74
try:
source = os.fspath(source)
except TypeError:
if not isinstance(source, PathBase):
raise
CopierBase.copy_file(self, source, target, metadata_keys, dir_entry)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we permit types inheriting from pathlib._abc.PathBase but not also PurePath (or: not implementing __fspath__)? This seems a little strange at first look.

raise
CopierBase.copy_file(self, source, target, metadata_keys, dir_entry)
else:
copyfile(source, os.fspath(target))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the above exception handling for fspath, is target guaranteed to inherit from PurePath here?

@@ -634,6 +672,12 @@ def _filter_trailing_slash(self, paths):
path_str = path_str[:-1]
yield path_str

def _join_dir_entry(self, dir_entry):
path_str = dir_entry.name if str(self) == '.' else dir_entry.path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we shortcut here rather than going via str(self)? The change to iterdir means that this is recalculated for each path in the directory being iterated over, rather than only once.

Comment on lines +191 to +196
except IsADirectoryError as e:
if not target.exists():
# Raise a less confusing exception.
raise FileNotFoundError(
f'Directory does not exist: {target}') from e
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity checking -- does this directory error handling need to exist in copy_file? Can ensuring that source and target are files be handled as a precondition? No worries if 'yes', this just surprised me!

@barneygale
Copy link
Contributor Author

Thanks for the feedback @AA-Turner, and my much-delayed response. I ended up going a different way with making os.DirEntry info available from pathlib: the scandir() method was replaced with a caching info attribute.

The spiritual successor to this PR here is here: #130238

In that PR, the new copy_file() function consolidates a bunch of copying logic from the _CopyWriter class, and uses source_path.info wherever possible to lean on cached scandir results.

I'll go through your feedback here and re-raise it on the other PR if it still applies.

Closing this PR. Sorry for the faff!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants