GH-72904: Add glob.translate() function #106703

Merged
merged 27 commits into from
Nov 13, 2023
Changes from 7 commits
Commits
bbfd404
GH-72904: Add optional *seps* argument to `fnmatch.translate()`
barneygale Jun 18, 2023
da9948d
Simplify `_make_child_relpath()` further
barneygale Jul 13, 2023
a07118b
Fix default value in docs
barneygale Jul 13, 2023
2728dcd
Match style of surrounding `fnmatch` code a little better.
barneygale Jul 13, 2023
fbcf4e3
Merge branch 'main' into gh-72904-fnmatch-seps
barneygale Jul 19, 2023
a0ce9c4
Docs + naming improvements
barneygale Jul 19, 2023
5b620fb
Replace *seps* with *sep*
barneygale Jul 26, 2023
51f2698
Update Doc/library/fnmatch.rst
barneygale Aug 4, 2023
9c8c3f3
Move to `glob.translate()`
barneygale Aug 11, 2023
8518ea2
Whoops
barneygale Aug 12, 2023
75129c8
Deduplicate code to handle character sets
barneygale Aug 13, 2023
2505590
Add support for `include_hidden=False`
barneygale Sep 23, 2023
1754d42
Fix doctest
barneygale Sep 23, 2023
dd2d401
Merge branch 'main' into gh-72904-fnmatch-seps
barneygale Sep 23, 2023
7b1ad63
Improve implementation; minimise fnmatch and pathlib diffs.
barneygale Sep 25, 2023
1485ff3
Fix tests
barneygale Sep 26, 2023
4c6d6f0
Tiny performance tweak
barneygale Sep 26, 2023
5aae7a2
Merge branch 'main' into gh-72904-fnmatch-seps
barneygale Sep 26, 2023
afb2d43
Fix `_make_child_relpath()`
barneygale Sep 26, 2023
d73df1b
Minor code improvements
barneygale Sep 26, 2023
c70afe3
Add another test for `include_hidden=False`
barneygale Sep 26, 2023
9cb2952
Merge branch 'main' into gh-72904-fnmatch-seps
barneygale Sep 30, 2023
f178b14
Add whatsnew entry
barneygale Sep 30, 2023
4a726aa
Collapse adjacent `**` segments.
barneygale Sep 30, 2023
78292eb
Apply suggestions from code review
barneygale Sep 30, 2023
5d4062c
Add comment explaining regex that consumes "empty" paths.
barneygale Sep 30, 2023
1ad624d
Merge branch 'main' into gh-72904-fnmatch-seps
barneygale Oct 28, 2023
18 changes: 17 additions & 1 deletion Doc/library/fnmatch.rst
@@ -82,7 +82,7 @@ cache the compiled regex patterns in the following functions: :func:`fnmatch`,
``[n for n in names if fnmatch(n, pattern)]``, but implemented more efficiently.


.. function:: translate(pattern)
.. function:: translate(pattern, sep=None)

Return the shell-style *pattern* converted to a regular expression for
using with :func:`re.match`.
@@ -98,6 +98,22 @@ cache the compiled regex patterns in the following functions: :func:`fnmatch`,
>>> reobj.match('foobar.txt')
<re.Match object; span=(0, 10), match='foobar.txt'>

A path separator character may be supplied to the *sep* argument. If given,
the separator is used to split the pattern into segments, where:

- A ``*`` pattern segment matches precisely one path segment.
- A ``**`` pattern segment matches any number of path segments.
- If ``**`` appears in any other position within the pattern,
:exc:`ValueError` is raised.
- ``*`` and ``?`` wildcards in other positions don't match path separators.

These rules approximate shell recursive globbing. The :mod:`pathlib` module
calls this function and supplies *sep* to implement
:meth:`~pathlib.PurePath.match` and :meth:`~pathlib.Path.glob`.

.. versionchanged:: 3.13
The *sep* parameter was added.


.. seealso::

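The segment rules documented above can be sketched in a few lines. This is a simplified illustration, not the code in this PR: `translate_with_sep` is a hypothetical name, it splits on the separator instead of scanning character by character, and it handles only literals, `?`, `*` and `**` (no `[...]` character sets).

```python
import re


def translate_with_sep(pat, sep='/'):
    """Hypothetical, simplified translator for the documented *sep* rules."""
    esc_sep = re.escape(sep)
    not_sep = f'[^{esc_sep}]'  # any single character except the separator
    parts = pat.split(sep)
    last = len(parts) - 1
    out = []
    for idx, part in enumerate(parts):
        if part == '**':
            # A whole-segment '**' matches any number of path segments.
            out.append('.*' if idx == last else f'(?:.*{esc_sep})?')
            continue
        if '**' in part:
            raise ValueError(
                "Invalid pattern: '**' can only be an entire path component")
        if part == '*':
            # A whole-segment '*' matches exactly one non-empty segment.
            seg = f'{not_sep}+'
        else:
            pieces = []
            for char in part:
                if char == '*':
                    pieces.append(f'{not_sep}*')
                elif char == '?':
                    pieces.append(not_sep)
                else:
                    pieces.append(re.escape(char))
            seg = ''.join(pieces)
        out.append(seg if idx == last else seg + esc_sep)
    body = ''.join(out)
    return rf'(?s:{body})\Z'


print(re.match(translate_with_sep('*.py'), 'a/b.py') is not None)       # False
print(re.match(translate_with_sep('**/*.py'), 'a/b/c.py') is not None)  # True
```

With `sep` supplied, `*.py` no longer crosses a separator, while `**/*.py` matches at any depth, including zero leading segments.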
38 changes: 34 additions & 4 deletions Lib/fnmatch.py
@@ -71,13 +71,24 @@ def fnmatchcase(name, pat):
return match(name) is not None


def translate(pat):
def translate(pat, sep=None):
"""Translate a shell PATTERN to a regular expression.

A path separator character may be supplied to the *sep* argument. If
given, '*' and '?' wildcards will not match separators; '*' wildcards in
standalone pattern segments match precisely one path segment; and '**'
wildcards in standalone segments match any number of path segments.

There is no way to quote meta-characters.
"""

STAR = object()
if sep:
SEP = re.escape(sep)
DOT = f'[^{SEP}]'
else:
SEP = None
DOT = '.'
res = []
add = res.append
i, n = 0, len(pat)
@@ -86,10 +97,29 @@ def translate(pat):
i = i+1
if c == '*':
# compress consecutive `*` into one
if (not res) or res[-1] is not STAR:
h = i-1
while i < n and pat[i] == '*':
i = i+1
if sep:
star_count = i-h
is_segment = (h == 0 or pat[h-1] == sep) and (i == n or pat[i] == sep)
if star_count == 1:
if is_segment:
add(f'{DOT}+')
else:
add(f'{DOT}*')
elif star_count == 2 and is_segment:
if i == n:
add('.*')
else:
add(f'(.*{SEP})?')
i = i+1
else:
raise ValueError("Invalid pattern: '**' can only be an entire path component")
else:
add(STAR)
elif c == '?':
add('.')
add(DOT)
elif c == '[':
j = i
if j < n and pat[j] == '!':
@@ -136,7 +166,7 @@ def translate(pat):
add('(?!)')
elif stuff == '!':
# Negated empty range: match any character.
add('.')
add(DOT)
else:
if stuff[0] == '!':
stuff = '^' + stuff[1:]
160 changes: 37 additions & 123 deletions Lib/pathlib.py
@@ -64,78 +64,12 @@ def _is_case_sensitive(pathmod):
#


# fnmatch.translate() returns a regular expression that includes a prefix and
# a suffix, which enable matching newlines and ensure the end of the string is
# matched, respectively. These features are undesirable for our implementation
# of PurePath.match(), which represents path separators as newlines and joins
# pattern segments together. As a workaround, we define a slice object that
# can remove the prefix and suffix from any translate() result. See the
# _compile_pattern_lines() function for more details.
_FNMATCH_PREFIX, _FNMATCH_SUFFIX = fnmatch.translate('_').split('_')
_FNMATCH_SLICE = slice(len(_FNMATCH_PREFIX), -len(_FNMATCH_SUFFIX))
_SWAP_SEP_AND_NEWLINE = {
'/': str.maketrans({'/': '\n', '\n': '/'}),
'\\': str.maketrans({'\\': '\n', '\n': '\\'}),
}


@functools.lru_cache(maxsize=256)
def _compile_pattern(pat, case_sensitive):
def _compile_pattern(pat, sep, case_sensitive):
"""Compile given glob pattern to a re.Pattern object (observing case
sensitivity), or None if the pattern should match everything."""
if pat == '*':
return None
sensitivity)."""
flags = re.NOFLAG if case_sensitive else re.IGNORECASE
return re.compile(fnmatch.translate(pat), flags).match


@functools.lru_cache()
def _compile_pattern_lines(pattern_lines, case_sensitive):
"""Compile the given pattern lines to an `re.Pattern` object.

The *pattern_lines* argument is a glob-style pattern (e.g. '**/*.py') with
its path separators and newlines swapped (e.g. '**\n*.py'). By using
newlines to separate path components, and not setting `re.DOTALL`, we
ensure that the `*` wildcard cannot match path separators.

The returned `re.Pattern` object may have its `match()` method called to
match a complete pattern, or `search()` to match from the right. The
argument supplied to these methods must also have its path separators and
newlines swapped.
"""

# Match the start of the path, or just after a path separator
parts = ['^']
for part in pattern_lines.splitlines(keepends=True):
if part == '*\n':
part = r'.+\n'
elif part == '*':
part = r'.+'
elif part == '**\n':
# '**/' component: we use '[\s\S]' rather than '.' so that path
# separators (i.e. newlines) are matched. The trailing '^' ensures
# we terminate after a path separator (i.e. on a new line).
part = r'[\s\S]*^'
elif part == '**':
# '**' component.
part = r'[\s\S]*'
elif '**' in part:
raise ValueError("Invalid pattern: '**' can only be an entire path component")
else:
# Any other component: pass to fnmatch.translate(). We slice off
# the common prefix and suffix added by translate() to ensure that
# re.DOTALL is not set, and the end of the string not matched,
# respectively. With DOTALL not set, '*' wildcards will not match
# path separators, because the '.' characters in the pattern will
# not match newlines.
part = fnmatch.translate(part)[_FNMATCH_SLICE]
parts.append(part)
# Match the end of the path, always.
parts.append(r'\Z')
flags = re.MULTILINE
if not case_sensitive:
flags |= re.IGNORECASE
return re.compile(''.join(parts), flags=flags)
return re.compile(fnmatch.translate(pat, sep), flags).match
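For reviewers unfamiliar with the workaround being removed here, the following is a minimal sketch of the old newline-swapping technique, using only the public `fnmatch.translate()` and `re` APIs. The names `prefix`, `suffix`, `body`, `swap` and `segment_regex` are illustrative and do not appear in pathlib.

```python
import fnmatch
import re

# fnmatch.translate() wraps every result in a prefix and a suffix. Their
# exact text varies between Python versions, so measure them instead of
# hard-coding them.
prefix, suffix = fnmatch.translate('_').split('_')
body = slice(len(prefix), -len(suffix))

# Swap '/' and '\n' so that each path segment sits on its own line.
swap = str.maketrans({'/': '\n', '\n': '/'})


def segment_regex(segment):
    """Translate one segment and strip the wrapper, so DOTALL is not set."""
    return fnmatch.translate(segment)[body]


# '*.py' as a single segment, anchored at both ends of the string.
pattern = re.compile('^' + segment_regex('*.py') + r'\Z')

print(pattern.match('b.py'.translate(swap)) is not None)    # True
print(pattern.match('a/b.py'.translate(swap)) is not None)  # False
```

Because `.` does not match a newline unless `re.DOTALL` is set, the `*` wildcard cannot cross a swapped separator. Passing *sep* to `translate()` gets the same effect without swapping characters or slicing the result.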


def _select_children(parent_paths, dir_only, follow_symlinks, match):
@@ -159,7 +93,7 @@ def _select_children(parent_paths, dir_only, follow_symlinks, match):
except OSError:
continue
name = entry.name
if match is None or match(name):
if match(name):
yield parent_path._make_child_relpath(name)


@@ -196,7 +130,7 @@ def _select_unique(paths):
yielded = set()
try:
for path in paths:
path_str = str(path)
path_str = path._str
if path_str not in yielded:
yield path
yielded.add(path_str)
@@ -268,10 +202,10 @@ class PurePath:
# tail are normalized.
'_drv', '_root', '_tail_cached',

# The `_str` slot stores the string representation of the path,
# The `_str_cached` slot stores the string representation of the path,
# computed from the drive, root and tail when `__str__()` is called
# for the first time. It's used to implement `_str_normcase`
'_str',
'_str_cached',

# The `_str_normcase_cached` slot stores the string path with
# normalized case. It is set when the `_str_normcase` property is
@@ -285,10 +219,6 @@ class PurePath:
# to implement comparison methods like `__lt__()`.
'_parts_normcase_cached',

# The `_lines_cached` slot stores the string path with path separators
# and newlines swapped. This is used to implement `match()`.
'_lines_cached',

# The `_hash` slot stores the hash of the case-normalized string
# path. It's set when `__hash__()` is called for the first time.
'_hash',
@@ -375,7 +305,7 @@ def _load_parts(self):
def _from_parsed_parts(self, drv, root, tail):
path_str = self._format_parsed_parts(drv, root, tail)
path = self.with_segments(path_str)
path._str = path_str or '.'
path._str_cached = path_str
path._drv = drv
path._root = root
path._tail_cached = tail
@@ -392,12 +322,7 @@ def _format_parsed_parts(cls, drv, root, tail):
def __str__(self):
"""Return the string representation of the path, suitable for
passing to system calls."""
try:
return self._str
except AttributeError:
self._str = self._format_parsed_parts(self.drive, self.root,
self._tail) or '.'
return self._str
return self._str or '.'

def __fspath__(self):
return str(self)
@@ -435,16 +360,25 @@ def as_uri(self):
path = str(self)
return prefix + urlquote_from_bytes(os.fsencode(path))

@property
def _str(self):
try:
return self._str_cached
except AttributeError:
self._str_cached = self._format_parsed_parts(
self.drive, self.root, self._tail)
return self._str_cached

@property
def _str_normcase(self):
# String with normalized case, for hashing and equality checks
try:
return self._str_normcase_cached
except AttributeError:
if _is_case_sensitive(self.pathmod):
self._str_normcase_cached = str(self)
self._str_normcase_cached = self._str
else:
self._str_normcase_cached = str(self).lower()
self._str_normcase_cached = self._str.lower()
return self._str_normcase_cached
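The `_str` / `_str_cached` split in this hunk follows a lazily computed slot pattern. The sketch below shows that pattern on a hypothetical `Label` class that is not part of pathlib: the raw string may be empty, and only `__str__()` substitutes `'.'`.

```python
class Label:
    """Hypothetical class showing a lazily cached slot; not pathlib code."""

    __slots__ = ('_parts', '_str_cached')

    def __init__(self, *parts):
        self._parts = parts

    @property
    def _str(self):
        # An unset slot raises AttributeError on first access, which acts
        # as the "not yet computed" signal.
        try:
            return self._str_cached
        except AttributeError:
            self._str_cached = '/'.join(self._parts)
            return self._str_cached

    def __str__(self):
        # The raw string may be empty; only the public form becomes '.'.
        return self._str or '.'


print(repr(Label()._str))    # ''
print(str(Label()))          # .
print(str(Label('a', 'b')))  # a/b
```

Keeping the raw value empty appears to be what lets `_make_child_relpath()` build child paths by plain concatenation, without a special case for `'.'`.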

@property
@@ -456,20 +390,6 @@ def _parts_normcase(self):
self._parts_normcase_cached = self._str_normcase.split(self.pathmod.sep)
return self._parts_normcase_cached

@property
def _lines(self):
# Path with separators and newlines swapped, for pattern matching.
try:
return self._lines_cached
except AttributeError:
path_str = str(self)
if path_str == '.':
self._lines_cached = ''
else:
trans = _SWAP_SEP_AND_NEWLINE[self.pathmod.sep]
self._lines_cached = path_str.translate(trans)
return self._lines_cached

def __eq__(self, other):
if not isinstance(other, PurePath):
return NotImplemented
@@ -737,13 +657,16 @@ def match(self, path_pattern, *, case_sensitive=None):
path_pattern = self.with_segments(path_pattern)
if case_sensitive is None:
case_sensitive = _is_case_sensitive(self.pathmod)
pattern = _compile_pattern_lines(path_pattern._lines, case_sensitive)
sep = path_pattern.pathmod.sep
pattern_str = path_pattern._str
if path_pattern.drive or path_pattern.root:
return pattern.match(self._lines) is not None
pass
elif path_pattern._tail:
return pattern.search(self._lines) is not None
pattern_str = f'**{sep}{pattern_str}'
else:
raise ValueError("empty pattern")
match = _compile_pattern(pattern_str, sep, case_sensitive)
return match(self._str) is not None
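The rewritten `match()` replaces the old `search()` call by prefixing relative patterns with `**` plus the separator and then matching the whole string. A minimal sketch of why that is equivalent to matching from the right, using hand-written regex fragments in the style this diff emits for `sep='/'`; the helper `full_match` is illustrative only.

```python
import re


def full_match(body):
    """Wrap a regex body the way fnmatch.translate() wraps its results."""
    return re.compile(rf'(?s:{body})\Z').match


# Hand-written fragments, assumed to resemble the generated output.
tail = r'b/[^/]*\.py'        # relative pattern 'b/*.py'
floating = r'(.*/)?' + tail  # the same pattern prefixed with '**/'

print(full_match(tail)('a/b/c.py') is not None)      # False
print(full_match(floating)('a/b/c.py') is not None)  # True
print(full_match(floating)('b/c.py') is not None)    # True
print(full_match(floating)('ab/c.py') is not None)   # False
```

The optional `(.*/)?` group must end at a separator, so the relative pattern can only start at a segment boundary: `ab/c.py` is rejected even though it ends with the text `b/c.py`.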


# Subclassing os.PathLike makes isinstance() checks slower,
@@ -1017,26 +940,16 @@ def _scandir(self):
return os.scandir(self)

def _make_child_relpath(self, name):
sep = self.pathmod.sep
lines_name = name.replace('\n', sep)
lines_str = self._lines
path_str = str(self)
tail = self._tail
if tail:
path_str = f'{path_str}{sep}{name}'
lines_str = f'{lines_str}\n{lines_name}'
elif path_str != '.':
path_str = f'{path_str}{name}'
lines_str = f'{lines_str}{lines_name}'
path_str = f'{self._str}{self.pathmod.sep}{name}'
else:
path_str = name
lines_str = lines_name
path_str = f'{self._str}{name}'
path = self.with_segments(path_str)
path._str = path_str
path._str_cached = path_str
path._drv = self.drive
path._root = self.root
path._tail_cached = tail + [name]
path._lines_cached = lines_str
return path

def glob(self, pattern, *, case_sensitive=None, follow_symlinks=None):
@@ -1082,6 +995,7 @@ def _glob(self, pattern, case_sensitive, follow_symlinks):
# do not perform any filesystem access, which can be much faster!
filter_paths = follow_symlinks is not None and '..' not in pattern_parts
deduplicate_paths = False
sep = self.pathmod.sep
paths = iter([self] if self.is_dir() else [])
part_idx = 0
while part_idx < len(pattern_parts):
@@ -1102,9 +1016,9 @@ def _glob(self, pattern, case_sensitive, follow_symlinks):
paths = _select_recursive(paths, dir_only, follow_symlinks)

# Filter out paths that don't match pattern.
prefix_len = len(self._make_child_relpath('_')._lines) - 1
match = _compile_pattern_lines(path_pattern._lines, case_sensitive).match
paths = (path for path in paths if match(path._lines[prefix_len:]))
prefix_len = len(self._make_child_relpath('_')._str) - 1
match = _compile_pattern(path_pattern._str, sep, case_sensitive)
paths = (path for path in paths if match(path._str[prefix_len:]))
return paths
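The `prefix_len` computation above measures how many leading characters belong to the base directory by building a throwaway child named `_`. A rough equivalent using only public `pathlib` API is sketched below; `relative_text` is a hypothetical helper, and the diff itself uses the private `_make_child_relpath()` and `_str` instead.

```python
from pathlib import PurePosixPath


def relative_text(base, path):
    """Strip *base* (and its trailing separator, if any) from str(path)."""
    # Appending a one-character child and subtracting 1 yields the length
    # of the base plus separator, whether or not a separator was needed.
    prefix_len = len(str(base / '_')) - 1
    return str(path)[prefix_len:]


print(relative_text(PurePosixPath('/usr'), PurePosixPath('/usr/lib/x.py')))  # lib/x.py
print(relative_text(PurePosixPath('/'), PurePosixPath('/etc/hosts')))        # etc/hosts
print(relative_text(PurePosixPath('.'), PurePosixPath('a/b.py')))            # a/b.py
```

The throwaway child avoids branching on whether the base is a root, an empty path, or an ordinary directory.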

dir_only = part_idx < len(pattern_parts)
@@ -1117,7 +1031,7 @@ def _glob(self, pattern, case_sensitive, follow_symlinks):
raise ValueError("Invalid pattern: '**' can only be an entire path component")
else:
dir_only = part_idx < len(pattern_parts)
match = _compile_pattern(part, case_sensitive)
match = _compile_pattern(part, sep, case_sensitive)
paths = _select_children(paths, dir_only, follow_symlinks, match)
return paths

@@ -1210,11 +1124,11 @@ def absolute(self):
# Fast path for "empty" paths, e.g. Path("."), Path("") or Path().
# We pass only one argument to with_segments() to avoid the cost
# of joining, and we exploit the fact that getcwd() returns a
# fully-normalized string by storing it in _str. This is used to
# implement Path.cwd().
# fully-normalized string by storing it in _str_cached. This is
# used to implement Path.cwd().
if not self.root and not self._tail:
result = self.with_segments(cwd)
result._str = cwd
result._str_cached = cwd
return result
return self.with_segments(cwd, self)
