Skip to content

Question about adjacent empty matches in regular expressions #122055

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
krave1986 opened this issue Jul 20, 2024 · 8 comments
Open

Question about adjacent empty matches in regular expressions #122055

krave1986 opened this issue Jul 20, 2024 · 8 comments
Labels
docs Documentation in the Doc dir topic-regex

Comments

@krave1986
Copy link

krave1986 commented Jul 20, 2024

Documentation

The Python documentation for re.sub() states:

Empty matches for the pattern are replaced only when not adjacent to a previous empty match.

However, after some testing, I have been unable to construct a regular expression pattern that produces adjacent empty matches. This leads to the following questions:

Is it actually possible to create a regular expression pattern that results in adjacent empty matches in Python's re module?
If not, should we consider updating the documentation to avoid potential confusion among developers?

My Investigation

I've tried various patterns that might theoretically produce adjacent empty matches, such as:

import re

def find_all_matches(pattern, string):
    return [m.start() for m in re.finditer(pattern, string)]

test_string = "abcd"
patterns = [
    r'\b|\b',  # Word boundaries
    r'^|$',    # Start or end of string
    r'(?=.)|(?<=.)',  # Positive lookahead or lookbehind
    r'.*?',    # Non-greedy any character
]

for pattern in patterns:
    matches = find_all_matches(pattern, test_string)
    print(f"Pattern {pattern}: {matches}")

None of these patterns produce adjacent empty matches. The regex engine seems to always move forward after finding a match, even an empty one.

Request for Clarification

This issue sincerely requests clarification on this matter:

  1. If adjacent empty matches are indeed possible as the documentation suggests, could you provide some examples that demonstrate this behavior?

  2. Are there specific scenarios or edge cases where adjacent empty matches can occur?

  3. If possible, could you share a minimal working example that shows how re.sub() handles adjacent empty matches differently from non-adjacent ones?

These examples would greatly help in understanding the documentation and the behavior of the re module in such cases.

Thank you for your time and attention to this matter. Any insights or examples you can provide would be greatly appreciated.

Linked PRs

@krave1986 krave1986 added the docs Documentation in the Doc dir label Jul 20, 2024
@terryjreedy
Copy link
Member

The quoted line is currently 1082-3 in the main branch. #76489 (PR ##4846) changed 'previous match' to 'previous empty match' because the result of sub('x*', '-', 'abxd') was changed from '-a-b-d-' to '-a-b--d-'.

@serhiy-storchaka This was your PR. The implication of the current doc is that there can be adjacent empty matches, but some of MRABarnett's comment " the general consensus has been to match once" implies otherwise.

@vadmium
Copy link
Member

vadmium commented Jul 20, 2024

Assuming the rules for empty matches are now consistent across the various functions (sub, finditer, etc), maybe it would be clearer to say “Empty matches do not occur adjacent to a previous empty match”? Same for the re.split documentation.

@ronaldoussoren
Copy link
Contributor

Aren't these two adjacent empty matches?

>>> pat = re.compile('^(.?)(.?)')
>>> m = pat.match('')
>>> m.groups()
('', '')
>>> pat = re.compile('^(.?)(.?)(a+)')
>>> pat.match('a').groups()
('', '', 'a')

@serhiy-storchaka
Copy link
Member

Thank you @terryjreedy for the context. Indeed, it was a change in the existing text which can now look confusing.

In fact, adjacent empty matches are not possible because it is considered the same match. Like str.count() does not count empty substring multiple times at the same position, re.finditer(), re.findall(), re.split() and re.sub() do not find multiple empty matches at the same position.

@vadmium, yes, it would perhaps be better.

@ronaldoussoren, no, in this context, we are talking about matches of the whole regular expression in functions that search matches repeatedly.

@krave1986
Copy link
Author

Thank you @terryjreedy for providing the historical context and the before-and-after comparison from the PR, which significantly contributes to understanding the evolution of this behavior.

While I appreciate @vadmium's suggestion for clarity, I find that it still doesn't fully elucidate the cause-and-effect relationship in this context. To address this, I have attempted to make the explanation more intuitive and directly address the core behavior; therefore, I propose the following modification:

Adjacent empty matches are not possible, but an empty match can occur immediately after a non-empty match. As a result, sub('x*', '-', 'abxd') returns -a-b--d- instead of -a-b-d-.

This:

  1. Clearly states that adjacent empty matches are not possible.
  2. Introduces the fact that empty matches can follow non-empty matches.
  3. Directly links this behavior to the example, showing both the actual and the (incorrect) expected output.
  4. Provides a clear cause-and-effect relationship, making it easier for developers to understand and predict the behavior.

The inclusion of both the current and previous behavior in the example directly addresses the change introduced by the PR, making the documentation more informative for both new and experienced Python developers.

Could you please review this suggestion and provide your thoughts?

@shtrom
Copy link

shtrom commented Apr 29, 2025

I think I ran into this today on Python 3.13.2:

>>> re.sub('.*', 'bob', 'o')
'bobbob'
>>> re.sub('.*', 'bob', 'eu')
'bobbob'

Here, I'd expect o to be matched by .*, and be replaced by a single bob. But two replacements are made.

This doesn't happen for an empty string, where a single replacements is made.

>>> re.sub('.*', 'bob', '')
'bob'

By using a Callable for repl, I can confirm that there are two matches if the string is not empty:

>>> re.sub('.*', lambda x: print(x), '')
<re.Match object; span=(0, 0), match=''>
''
>>> re.sub('.*', lambda x: print(x), 'o')
<re.Match object; span=(0, 1), match='o'>
<re.Match object; span=(1, 1), match=''>
''
>>> re.sub('.*', lambda x: print(x), 'eu')
<re.Match object; span=(0, 2), match='eu'>
<re.Match object; span=(2, 2), match=''>
''

So, I think this is an instance, and perhaps a good example, of adjacent empty matches.

@serhiy-storchaka
Copy link
Member

serhiy-storchaka commented Apr 29, 2025

I like your suggestion, @krave1986. There is also a similar phrase for re.split(), and the text of the versionchanged directive could be updated too. I created #133169 based on your suggestion.

@serhiy-storchaka
Copy link
Member

@shtrom, adjacent matches are possible, but they cannot both be empty. There should be a progress, otherwise we would have infinite number of matches at the same position.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation in the Doc dir topic-regex
Projects
None yet
Development

No branches or pull requests

7 participants