Expose os.DirEntry objects from pathlib #125413

Closed
barneygale opened this issue Oct 13, 2024 · 2 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir topic-pathlib type-feature A feature request or enhancement

Comments

@barneygale
Contributor

barneygale commented Oct 13, 2024

Feature or enhancement

I propose we add a new Path.status attribute that stores an os.DirEntry object in paths yielded from Path.iterdir(), or a pathlib-specific type with a similar interface in other paths.

This would:

  • Allow users to access the cached os.DirEntry after calling Path.iterdir(), which is useful for efficiently determining files' types and often doesn't involve a system call.
  • Allow users to switch on the type of any path without repeatedly making system calls, or having to resort to S_ISREG(st.st_mode) and other holy incantations.
  • In the pathlib ABCs, allow us to entirely banish PathBase.stat() and the stat_result interface, which is too low-level and local filesystem-specific.

See discussion: https://discuss.python.org/t/is-there-a-pathlib-equivalent-of-os-scandir/46626
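
For context, here is a minimal sketch of the two styles the proposal contrasts, using only APIs that already exist (`os.listdir()`, `os.stat()`, `os.scandir()` and the `stat` module). The helper names `classify_with_stat` and `classify_with_scandir` are invented for this example and are not part of the proposal.

```python
import os
import stat


def classify_with_stat(directory):
    """Classify children by calling os.stat() on each one (one call per child)."""
    result = {}
    for name in os.listdir(directory):
        st = os.stat(os.path.join(directory, name))
        if stat.S_ISDIR(st.st_mode):
            result[name] = "dir"
        elif stat.S_ISREG(st.st_mode):
            result[name] = "file"
        else:
            result[name] = "other"
    return result


def classify_with_scandir(directory):
    """Classify children using the os.DirEntry objects yielded by os.scandir().

    DirEntry.is_dir() and DirEntry.is_file() can usually be answered from data
    gathered while scanning the directory, so they often need no extra call.
    """
    result = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir():
                result[entry.name] = "dir"
            elif entry.is_file():
                result[entry.name] = "file"
            else:
                result[entry.name] = "other"
    return result
```

Both helpers follow symlinks, so they should agree on ordinary trees; the second simply avoids the explicit `S_ISREG`-style checks.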

Linked PRs

@barneygale barneygale added type-feature A feature request or enhancement performance Performance or resource usage topic-pathlib labels Oct 13, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Oct 13, 2024
Add a `Path.dir_entry` attribute. In any path object generated by
`Path.iterdir()`, it stores an `os.DirEntry` object corresponding to the
path; in other cases it is `None`.

This can be used to retrieve the file type and attributes of directory
children without necessarily incurring further system calls.

Under the hood, we use `dir_entry` in our implementations of
`PathBase.glob()`, `PathBase.walk()` and `PathBase.copy()`, the last of
which also provides the implementation of `Path.copy()`, resulting in a
modest speedup when copying local directory trees.
@ncoghlan
Contributor

I put this feedback on the PR, but it's probably better placed here: while I like the general idea, I don't think this specific API is the right way to do it.

  • dir_entry potentially being None based on how the instance was created is inconvenient
  • the docs having to excuse dir_entry existing on PurePath objects is awkward

I think we can eliminate both of those bits of awkwardness:

  • define a new alternative construction method on os.DirEntry objects that allows one to be created from arbitrary os.PathLike objects
  • make the slot on PurePath private rather than public (presumably as _dir_entry)
  • define PathBase.dir_entry as a read-only property that returns the cached entry if it is already set, otherwise it uses the new constructor API to create a cached DirEntry instance for itself

If it's impractical to add os.DirEntry.from_path, then a pathlib._DirEntry class that just emulates the os.DirEntry API based on the real underlying Path object would also be fine.
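
A rough sketch of the shape suggested above, under stated assumptions: `EmulatedDirEntry` and `SketchPath` are hypothetical names invented here, the emulation covers only part of the `os.DirEntry` interface, and (unlike a real `os.DirEntry`) it does not cache its results. It is meant to show the private slot plus read-only property, not an actual pathlib implementation.

```python
import os


class EmulatedDirEntry:
    """Hypothetical stand-in that mimics part of the os.DirEntry interface."""

    def __init__(self, path):
        self.path = os.fspath(path)
        self.name = os.path.basename(self.path)

    def is_dir(self, *, follow_symlinks=True):
        if follow_symlinks:
            return os.path.isdir(self.path)
        return os.path.isdir(self.path) and not os.path.islink(self.path)

    def is_file(self, *, follow_symlinks=True):
        if follow_symlinks:
            return os.path.isfile(self.path)
        return os.path.isfile(self.path) and not os.path.islink(self.path)

    def is_symlink(self):
        return os.path.islink(self.path)

    def stat(self, *, follow_symlinks=True):
        return os.stat(self.path, follow_symlinks=follow_symlinks)


class SketchPath:
    """Hypothetical path type: private slot, public read-only property."""

    __slots__ = ("_raw", "_dir_entry")

    def __init__(self, raw, dir_entry=None):
        self._raw = os.fspath(raw)
        self._dir_entry = dir_entry  # set only when created by iterdir()

    def __fspath__(self):
        return self._raw

    @property
    def dir_entry(self):
        # Return the cached entry if present; otherwise build one lazily,
        # so callers never have to handle None.
        if self._dir_entry is None:
            self._dir_entry = EmulatedDirEntry(self._raw)
        return self._dir_entry

    def iterdir(self):
        with os.scandir(self._raw) as it:
            entries = list(it)
        for entry in entries:
            yield SketchPath(entry.path, entry)
```

With this shape, `SketchPath(".").dir_entry` and the `dir_entry` of any child from `iterdir()` both expose `is_dir()`, `is_file()` and `is_symlink()`, regardless of how the instance was created.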

barneygale added a commit to barneygale/cpython that referenced this issue Oct 25, 2024
… once

Improve `pathlib._abc.PathBase.copy()` (which provides `Path.copy()`) by
fetching operands' supported metadata keys up-front, rather than once for
each path in the tree.

This prepares the way for using `os.DirEntry` objects in `copy()`.
@barneygale barneygale changed the title Add pathlib.Path.dir_entry Expose os.DirEntry objects from pathlib Oct 28, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Oct 28, 2024
Add `pathlib.Path.scandir()` as a trivial wrapper of `os.scandir()`.

In the private `pathlib._abc.PathBase` class, we can rework the
`iterdir()`, `glob()`, `walk()` and `copy()` methods to call `scandir()`
and make use of cached directory entry information, and thereby improve
performance. Because the `Path.copy()` method is provided by `PathBase`,
this also speeds up traversal when copying local files and directories.
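
As an illustration of the idea in this commit message, the sketch below builds an `iterdir()`-like generator on top of `os.scandir()` and keeps the directory entry next to each path. The function names are invented for this example; the real rework lives in the private `pathlib._abc` module and may look quite different.

```python
import os
from pathlib import Path


def scandir(path):
    """Hypothetical stand-in for the wrapper: simply defers to os.scandir()."""
    return os.scandir(path)


def iterdir_with_entries(path):
    """Yield (Path, os.DirEntry) pairs for the children of *path*."""
    with scandir(path) as it:
        # Read everything first so the directory handle is closed promptly.
        entries = list(it)
    for entry in entries:
        yield Path(entry.path), entry
```

Callers can then ask `entry.is_dir()` or `entry.is_file()` for each child, which often needs no further system call.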
barneygale added a commit that referenced this issue Nov 1, 2024
Add `pathlib.Path.scandir()` as a trivial wrapper of `os.scandir()`. This
will be used to implement several `PathBase` methods more efficiently,
including methods that provide `Path.copy()`.
barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.glob()`, which greatly
reduces the number of `PathBase.stat()` calls needed when globbing.

There are no user-facing changes, because the pathlib ABCs are still
private and `Path.glob()` doesn't use the implementation in its superclass.
@barneygale
Contributor Author

To tie up the above loose ends, we went with a Path.scandir() method in the end.

barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.walk()`, which greatly
reduces the number of `PathBase.stat()` calls needed when walking.

There are no user-facing changes, because the pathlib ABCs are still
private and `Path.walk()` doesn't use the implementation in its superclass.
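
To show why scanning reduces `stat()` calls when walking, here is a simplified top-down walk built on `os.scandir()`. It is a sketch only: `walk_with_scandir` is a made-up name, error handling and bottom-up traversal are omitted, and it is not the `PathBase.walk()` implementation.

```python
import os


def walk_with_scandir(top):
    """Yield (dirpath, dirnames, filenames) top-down, like a simplified os.walk()."""
    stack = [os.fspath(top)]
    while stack:
        dirpath = stack.pop()
        dirnames, filenames = [], []
        with os.scandir(dirpath) as entries:
            for entry in entries:
                # The entry usually knows its own type from the directory scan,
                # so no separate stat() call per child is needed here.
                # follow_symlinks=False lists symlinks to directories as files.
                if entry.is_dir(follow_symlinks=False):
                    dirnames.append(entry.name)
                else:
                    filenames.append(entry.name)
        yield dirpath, dirnames, filenames
        # Push in reverse so subdirectories are visited in listing order.
        for name in reversed(dirnames):
            stack.append(os.path.join(dirpath, name))
```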
barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.copy()`, which greatly
reduces the number of `PathBase.stat()` calls needed when copying. This
also speeds up `Path.copy()`, which inherits the superclass implementation.

Under the hood, we use directory entries to distinguish between files,
directories and symlinks, and to retrieve a `stat_result` when reading
metadata. This logic is extracted into a new `pathlib._abc.CopierBase`
class, which helps reduce the number of underscore-prefixed support
methods in the path interface.
barneygale added a commit that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.glob()`, which greatly
reduces the number of `PathBase.stat()` calls needed when globbing.

There are no user-facing changes, because the pathlib ABCs are still
private and `Path.glob()` doesn't use the implementation in its superclass.
barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
barneygale added a commit that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.walk()`, which greatly
reduces the number of `PathBase.stat()` calls needed when walking.

There are no user-facing changes, because the pathlib ABCs are still
private and `Path.walk()` doesn't use the implementation in its superclass.
barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Nov 4, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Nov 29, 2024
Remove documentation for `pathlib.Path.scandir()`, and rename the method to
`_scandir()`. In the private pathlib ABCs, make `iterdir()` abstract and
call it from `_scandir()`.

It's not worthwhile to add this method at the moment - see discussion:
https://discuss.python.org/t/ergonomics-of-new-pathlib-path-scandir/71721
barneygale added a commit to barneygale/cpython that referenced this issue Dec 5, 2024
barneygale added a commit that referenced this issue Dec 5, 2024
Remove documentation for `pathlib.Path.scandir()`, and rename the method to
`_scandir()`. In the private pathlib ABCs, make `iterdir()` abstract and
call it from `_scandir()`.

It's not worthwhile to add this method at the moment - see discussion:
https://discuss.python.org/t/ergonomics-of-new-pathlib-path-scandir/71721

Co-authored-by: Steve Dower <[email protected]>
barneygale added a commit to barneygale/cpython that referenced this issue Jan 19, 2025
…blePath`

In the private pathlib ABCs, support write-only virtual filesystems by
making `WritablePath` inherit directly from `JoinablePath`, rather than
subclassing `ReadablePath`.

There are two complications:

- `ReadablePath.open()` applies to both reading and writing
- `ReadablePath.copy` is secretly an object that supports the *read* side
  of copying, whereas `WritablePath.copy` is a different kind of object
  supporting the *write* side

We untangle these as follows:

- A new `pathlib._abc.magic_open()` function replaces the `open()` method,
  which is dropped from the ABCs but remains in `pathlib.Path`. The
  function works like `io.open()`, but additionally accepts objects with
  `__open_rb__()` or `__open_wb__()` methods as appropriate for the mode.
  These new dunders are made abstract methods of `ReadablePath` and
  `WritablePath` respectively.  If the pathlib ABCs are made public, we
  could consider blessing an "openable" protocol and supporting it in
  `io.open()`, removing the need for `pathlib._abc.magic_open()`.
- `ReadablePath.copy` becomes a true method, whereas `WritablePath.copy` is
  deleted. A new `ReadablePath._copy_reader` property provides a
  `CopyReader` object, and similarly `WritablePath._copy_writer` is a
  `CopyWriter` object. Once pythonGH-125413 is resolved, we'll be able to move
  the `CopyReader` functionality into `ReadablePath.info` and eliminate
  `ReadablePath._copy_reader`.
barneygale added a commit that referenced this issue Jan 21, 2025
…h` (#129014)

In the private pathlib ABCs, support write-only virtual filesystems by
making `WritablePath` inherit directly from `JoinablePath`, rather than
subclassing `ReadablePath`.

There are two complications:

- `ReadablePath.open()` applies to both reading and writing
- `ReadablePath.copy` is secretly an object that supports the *read* side
  of copying, whereas `WritablePath.copy` is a different kind of object
  supporting the *write* side

We untangle these as follows:

- A new `pathlib._abc.magic_open()` function replaces the `open()` method,
  which is dropped from the ABCs but remains in `pathlib.Path`. The
  function works like `io.open()`, but additionally accepts objects with
  `__open_rb__()` or `__open_wb__()` methods as appropriate for the mode.
  These new dunders are made abstract methods of `ReadablePath` and
  `WritablePath` respectively.  If the pathlib ABCs are made public, we
  could consider blessing an "openable" protocol and supporting it in
  `io.open()`, removing the need for `pathlib._abc.magic_open()`.
- `ReadablePath.copy` becomes a true method, whereas `WritablePath.copy` is
  deleted. A new `ReadablePath._copy_reader` property provides a
  `CopyReader` object, and similarly `WritablePath._copy_writer` is a
  `CopyWriter` object. Once GH-125413 is resolved, we'll be able to move
  the `CopyReader` functionality into `ReadablePath.info` and eliminate
  `ReadablePath._copy_reader`.
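
The dispatch described in the first bullet can be sketched roughly as follows. This is an assumption-laden illustration, not the code in `pathlib._abc`: `magic_open_sketch`, `InMemoryFile` and `InMemorySink` are invented names, and the dunder methods are assumed to take no arguments, which may not match the real signatures.

```python
import io


def magic_open_sketch(path, mode="r", encoding=None, errors=None, newline=None):
    """Open *path* via __open_rb__()/__open_wb__() if present, else io.open()."""
    text = "b" not in mode
    core = mode.replace("b", "").replace("t", "")
    if core == "r":
        opener = getattr(type(path), "__open_rb__", None)
    elif core == "w":
        opener = getattr(type(path), "__open_wb__", None)
    else:
        opener = None
    if opener is None:
        # Ordinary paths (and other modes) fall back to the built-in behaviour.
        return io.open(path, mode, encoding=encoding, errors=errors, newline=newline)
    stream = opener(path)
    if text:
        # Wrap the binary stream to provide text-mode reading or writing.
        return io.TextIOWrapper(stream, encoding=encoding, errors=errors, newline=newline)
    return stream


class InMemoryFile:
    """Hypothetical read-only virtual file backed by a bytes object."""

    def __init__(self, data):
        self._data = data

    def __open_rb__(self):
        return io.BytesIO(self._data)


class InMemorySink:
    """Hypothetical write-only virtual file that keeps its stream for inspection."""

    def __init__(self):
        self.stream = io.BytesIO()

    def __open_wb__(self):
        return self.stream
```

A write-only object such as `InMemorySink` needs only `__open_wb__()`, which is the point of separating `WritablePath` from `ReadablePath`.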
barneygale added a commit to barneygale/cpython that referenced this issue Jan 21, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Jan 21, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Jan 28, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Feb 4, 2025
barneygale added a commit that referenced this issue Feb 8, 2025
Add `pathlib.Path.info` attribute, which stores an object implementing the `pathlib.types.PathInfo` protocol (also new). The object supports querying the file type and internally caching `os.stat()` results. Path objects generated by `Path.iterdir()` are initialised with status information from `os.DirEntry` objects, which is gleaned from scanning the parent directory.

The `PathInfo` protocol has four methods: `exists()`, `is_dir()`, `is_file()` and `is_symlink()`.
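
A minimal sketch of what such a protocol and a caching implementation might look like. `PathInfoSketch` and `CachedStatInfo` are names invented for this example, the keyword-only `follow_symlinks` parameter is an assumption of the sketch, and the `stat_calls` counter exists only to make the caching visible; none of this is the actual `pathlib.types.PathInfo` code.

```python
import os
import stat
from typing import Protocol, runtime_checkable


@runtime_checkable
class PathInfoSketch(Protocol):
    """Illustrative protocol with the four methods named in the commit message."""

    def exists(self, *, follow_symlinks: bool = True) -> bool: ...
    def is_dir(self, *, follow_symlinks: bool = True) -> bool: ...
    def is_file(self, *, follow_symlinks: bool = True) -> bool: ...
    def is_symlink(self) -> bool: ...


class CachedStatInfo:
    """Answers file-type queries from cached os.stat() results."""

    def __init__(self, path):
        self._path = os.fspath(path)
        self._cache = {}  # follow_symlinks -> stat_result, or None on failure
        self.stat_calls = 0  # instrumentation for this example only

    def _stat(self, follow_symlinks):
        if follow_symlinks not in self._cache:
            self.stat_calls += 1
            try:
                result = os.stat(self._path, follow_symlinks=follow_symlinks)
            except (OSError, ValueError):
                result = None
            self._cache[follow_symlinks] = result
        return self._cache[follow_symlinks]

    def exists(self, *, follow_symlinks=True):
        return self._stat(follow_symlinks) is not None

    def is_dir(self, *, follow_symlinks=True):
        st = self._stat(follow_symlinks)
        return st is not None and stat.S_ISDIR(st.st_mode)

    def is_file(self, *, follow_symlinks=True):
        st = self._stat(follow_symlinks)
        return st is not None and stat.S_ISREG(st.st_mode)

    def is_symlink(self):
        st = self._stat(False)
        return st is not None and stat.S_ISLNK(st.st_mode)
```

Repeated queries with the same `follow_symlinks` value reuse one `os.stat()` result, and a failed `os.stat()` makes every query return `False` instead of raising.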
barneygale added a commit to barneygale/cpython that referenced this issue Feb 8, 2025
…`Path.info`

Move pathlib's private `CopyReader`, `LocalCopyReader`, `CopyWriter` and
`LocalCopyWriter` classes into `pathlib._os`, where they can live alongside
the low-level copying functions (`copyfileobj()` etc) and high-level path
querying interface (`PathInfo`).

This sets the stage for merging `LocalCopyReader` into `PathInfo`.
barneygale added a commit that referenced this issue Feb 9, 2025
…info` (#129856)

Move pathlib's private `CopyReader`, `LocalCopyReader`, `CopyWriter` and
`LocalCopyWriter` classes into `pathlib._os`, where they can live alongside
the low-level copying functions (`copyfileobj()` etc) and high-level path
querying interface (`PathInfo`).

This sets the stage for merging `LocalCopyReader` into `PathInfo`.

No change of behaviour; just moving some code around.
barneygale added a commit to barneygale/cpython that referenced this issue Feb 9, 2025
Add the following private methods to `pathlib.Path.info`:

- `_get_mode()`: returns the POSIX file mode (`st_mode`), or zero if
  `os.stat()` fails.
- `_get_times_ns()`: returns the access and modify times in nanoseconds
  (`st_atime_ns` and `st_mtime_ns`), or zeroes if `os.stat()` fails.
- `_get_flags()`: returns the BSD file flags (`st_flags`), or zero if
  `os.stat()` fails.
- `_get_xattrs()`: returns the file extended attributes as a list of
  key, value pairs, or an empty list if `listxattr()` or `getxattr()` fail.

These methods replace `LocalCopyReader.read_metadata()`, and so we can
delete the `CopyReader` and `LocalCopyReader` classes. Rather than reading
metadata via `source._copy_reader.read_metadata()`, we instead call
`source.info._get_mode()`, `_get_times_ns()`, etc.

Copying metadata is only supported for local-to-local copies at the moment.
To support copying between arbitrary `ReadablePath` and `WritablePath`
objects, we'd need to make the new methods public and documented.
barneygale added a commit to barneygale/cpython that referenced this issue Feb 13, 2025
barneygale added a commit to barneygale/cpython that referenced this issue Feb 16, 2025
encukou added a commit to barneygale/cpython that referenced this issue Feb 17, 2025
barneygale added a commit that referenced this issue Feb 17, 2025
Add the following private methods to `pathlib.Path.info`:

- `_posix_permissions()`: the POSIX file permissions (`S_IMODE(st_mode)`)
- `_file_id()`: the file ID (`(st_dev, st_ino)`)
- `_access_time_ns()`: the access time in nanoseconds (`st_atime_ns`)
- `_mod_time_ns()`: the modify time in nanoseconds (`st_mtime_ns`)
- `_bsd_flags()`: the BSD file flags (`st_flags`)
- `_xattrs()`: the file extended attributes as a list of key, value pairs,
  or an empty list if `listxattr()` or `getxattr()` fail in an ignorable 
  way.

These methods replace `LocalCopyReader.read_metadata()`, and so we can
delete the `CopyReader` and `LocalCopyReader` classes. Rather than reading
metadata via `source._copy_reader.read_metadata()`, we instead call
`source.info._posix_permissions()`, `_access_time_ns()`, etc.

Preserving metadata is only supported for local-to-local copies at the
moment. To support copying metadata between arbitrary `ReadablePath` and
`WritablePath` objects, we'd need to make the new methods public and
documented.
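
The parenthesised `stat_result` fields above map onto standard library calls as in the sketch below. The free functions are hypothetical helpers written for this example (the real methods are private members of the `Path.info` object), and returning `0` when `st_flags` is unavailable is an assumption of the sketch.

```python
import os
import stat


def posix_permissions(st):
    """Permission bits only, i.e. S_IMODE(st_mode)."""
    return stat.S_IMODE(st.st_mode)


def file_id(st):
    """A (device, inode) pair identifying the file."""
    return (st.st_dev, st.st_ino)


def access_time_ns(st):
    return st.st_atime_ns


def mod_time_ns(st):
    return st.st_mtime_ns


def bsd_flags(st):
    # st_flags exists only on some platforms; 0 elsewhere is this sketch's choice.
    return getattr(st, "st_flags", 0)
```

For example, `file_id(os.stat(a)) == file_id(os.stat(b))` is one way to check whether two paths refer to the same file.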

Co-authored-by: Petr Viktorin <[email protected]>
barneygale added a commit to barneygale/cpython that referenced this issue Feb 21, 2025
…globbing

Call `ReadablePath.info.exists()` rather than `ReadablePath.exists()` when
globbing so that we use (or populate) the `info` cache.
barneygale added a commit to barneygale/cpython that referenced this issue Feb 21, 2025
…ove()`

In `pathlib.Path.copy()` and `move()`, return a fresh `Path` object with an
unpopulated `info` attribute, rather than a `Path` object with information
recorded *prior* to the path's creation.
barneygale added a commit that referenced this issue Feb 24, 2025
…ng (#130422)

Call `ReadablePath.info.exists()` rather than `ReadablePath.exists()` when
globbing so that we use (or populate) the `info` cache.
barneygale added a commit that referenced this issue Feb 24, 2025
…#130424)

In `pathlib.Path.copy()` and `move()`, return a fresh `Path` object with an
unpopulated `info` attribute, rather than a `Path` object with information
recorded *prior* to the path's creation.
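
The problem this commit avoids can be illustrated with a toy cache. `SnapshotInfo` and `copy_file_sketch` are hypothetical names for this example and bear no relation to pathlib's internals; the sketch only shows how information cached before a file exists goes stale once the file is created.

```python
import os


class SnapshotInfo:
    """Hypothetical info object that checks existence once, then caches it."""

    def __init__(self, path):
        self._path = os.fspath(path)
        self._exists = None

    def exists(self):
        if self._exists is None:
            self._exists = os.path.exists(self._path)
        return self._exists


def copy_file_sketch(source, target):
    """Copy *source* to *target* and return a fresh, unpopulated info object."""
    with open(source, "rb") as src, open(target, "wb") as dst:
        dst.write(src.read())
    # A new object has no cached answer, so its first query sees the new file.
    return SnapshotInfo(target)
```

An info object queried before the copy keeps answering `False` afterwards, whereas the fresh one returned by `copy_file_sketch()` reports the file as existing.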
barneygale added a commit that referenced this issue Feb 26, 2025
Replace `WritablePath._copy_writer` with a new `_write_info()` method. This
method allows the target of a `copy()` to preserve metadata.

Replace `pathlib._os.CopyWriter` and `LocalCopyWriter` classes with new
`copy_file()` and `copy_info()` functions. The `copy_file()` function uses
`source_path.info` wherever possible to save on `stat()`s.
seehwan pushed a commit to seehwan/cpython that referenced this issue Apr 16, 2025
…globbing (python#130422)

Call `ReadablePath.info.exists()` rather than `ReadablePath.exists()` when
globbing so that we use (or populate) the `info` cache.
seehwan pushed a commit to seehwan/cpython that referenced this issue Apr 16, 2025
…ove()` (python#130424)

In `pathlib.Path.copy()` and `move()`, return a fresh `Path` object with an
unpopulated `info` attribute, rather than a `Path` object with information
recorded *prior* to the path's creation.
seehwan pushed a commit to seehwan/cpython that referenced this issue Apr 16, 2025
…ython#130238)

Replace `WritablePath._copy_writer` with a new `_write_info()` method. This
method allows the target of a `copy()` to preserve metadata.

Replace `pathlib._os.CopyWriter` and `LocalCopyWriter` classes with new
`copy_file()` and `copy_info()` functions. The `copy_file()` function uses
`source_path.info` wherever possible to save on `stat()`s.

3 participants