Skip to content

Surprising Any in result type of re.Match.group() where None expected #12090

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dkfellows opened this issue Jun 3, 2024 · 14 comments
Closed

Comments

@dkfellows
Copy link

I've been trying to typecheck some code that uses re.match but getting some unexpected Anys in it that came from these typeshed definitions (especially the second and third ones):

    # group() returns "AnyStr" or "AnyStr | None", depending on the pattern.
    @overload
    def group(self, group: Literal[0] = 0, /) -> AnyStr: ...
    @overload
    def group(self, group: str | int, /) -> AnyStr | Any: ...
    @overload
    def group(self, group1: str | int, group2: str | int, /, *groups: str | int) -> tuple[AnyStr | Any, ...]: ...

Under what circumstance is producing an Any at this point a good plan? The AnyStr is fine, but surely the Any should be a None? After all, right now it's saying "Oh, we could spontaneously return a lock object from the concurrency module because we're feeling bored; no promises!" which is rather at variance with the documentation which says effectively "returns a substring if there is a match group, otherwise None".

@AlexWaygood
Copy link
Member

This is a duplicate of #10680, #9482, #11203, and many others (see also #9465 (comment), python/mypy#16441 (comment), #10526, etc.)

@srittau srittau closed this as not planned Won't fix, can't repro, duplicate, stale Jun 3, 2024
@dkfellows
Copy link
Author

So... a lot of existing code is doing something dumb, simply not handling the types right, and the decision was taken to make everything type unsafe rather than getting the types correct? That's an absolutely terrible decision.

@AlexWaygood
Copy link
Member

AlexWaygood commented Jun 5, 2024

No, a lot of code is doing something that is clearly type-safe, and could be statically verified to be type-safe. Unfortunately, the rules that govern when this kind of pattern is actually type-safe or are not currently (and possibly will never be) expressible solely using type annotations, and type annotations are the only thing we have control over here in typeshed. In order to avoid a truly vast number of false positives when type checkers analyse user code, we have chosen in this instance to go for more lenient annotations.

@dkfellows
Copy link
Author

At the very least, it ought to be overloaded so that asking for a group by integer index says that it will always produce an AnyStr in the type logic; if there's no such group, you get an error in that case:

>>> m = re.search("([abc])d", "abcde") 
>>> m
<re.Match object; span=(2, 4), match='cd'>
>>> m.groups()
('c',)
>>> m.group(1)
'c'
>>> m.group(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
>>> m.group("foo")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

A group might not match anything at all and become None if the pattern is structured that way, which is fine:

>>> m = re.search("(([cde])f)?b", "ab")
>>> m
<re.Match object; span=(1, 2), match='b'>
>>> m.groups()
(None, None)

but at no point are you going to get, say, an integer or anything like that out of this.

I understand that you'd like to be able to say "this particular case can't be a None" but types can't give you that. What they can do is say that you for sure aren't getting an int or a float or a thread or something else completely crazy.

@AlexWaygood
Copy link
Member

At the very least, it ought to be overloaded so that asking for a group by integer index says that it will always produce an AnyStr in the type logic; if there's no such group, you get an error in that case:

it doesn't sound like you've considered optional groups:

>>> import re
>>> m = re.search(r"([ab])?d", "abcde")
>>> m.group(1) is None
True

@dkfellows
Copy link
Author

it doesn't sound like you've considered optional groups:

Next paragraph. 😉

@AlexWaygood
Copy link
Member

Okay, then I'm not sure exactly what you're proposing. If you'd like to make a PR to gauge the possible impact of a change, feel free, but we're pretty unlikely to merge it if it has anything like the number of false positives that e.g. #11203 had :-)

@Akuli
Copy link
Collaborator

Akuli commented Jun 5, 2024

What they can do is say that you for sure aren't getting an int or a float or a thread or something else completely crazy.

The problem is that if you say this, it forces everyone to check for None.

Ideally there would be a way to tell a type checker that:

  • AnyStr is something you will need to handle
  • no errors if you check whether the thing is None, also no errors if you don't check
  • ints and floats and such are completely crazy and will never be produced.

This combination of 3 different requirements is not possible to achieve. If you pick any two of them, it is possible to achieve. We have decided to pick the first two.

@dkfellows
Copy link
Author

Sorry if I'm a bit irascible. I'm in the middle of trying to typecheck the YAML deserialization mechanism from Heck.

@AlexWaygood
Copy link
Member

It's okay. Types can be infuriating sometimes :-)

@Akuli
Copy link
Collaborator

Akuli commented Jun 5, 2024

I think it might help to show more concretely what we want the annotation to do. Here are our requirements:

def foo() -> ???:
    ...

# Requirement A: this should be error
if foo().starswthi("bar"):
    ...

# Requirement B: this should succeed
if foo().startswith("bar"):
    ...

# Requirement C: this should succeed
if foo() is None:
    ...

If you annotate with foo() -> str, requirements A and B are fine but C is not. You will get error for the None check.

If you annotate with foo() -> str | None, requirements A and C are fine but B is not. It forces a None check.

If you annotate with foo() -> Any, requirements B and C are fine but A is not. You are silencing all errors.

If you annotate with foo() -> str | Any, all requirements are satisfied.

@dkfellows
Copy link
Author

Conceptually however, the property of whether group can be None is a property of the Match; there are definitely instances where this is not so. That would point to (spitballing) making the type of the result of the group method be an extra type parameter, Match[AnyStr, T], where T is either the AnyStr or AnyStr | None (and where it would be possible to cast the thing to the actual type it really has in the cases where the regular expression has no optional groups). I guess that this property could be pushed back into the Pattern somehow — it's a static property of a particular pattern — but I've not tried thinking about that.

Some user intervention will clearly be required; it's not reasonable for the type checker to analyse regular expressions from arbitrary sources.

@Akuli
Copy link
Collaborator

Akuli commented Jun 5, 2024

Match[AnyStr, T] won't help if some groups can be None while others can't. For example, consider (mon|tue|wed|thu|fri|sat|sun)( \d\d:\d\d)?.

@dkfellows
Copy link
Author

True, but then you're outside the reasonable reach of what a type checker should do. Only crazy people like the TypeScript developers try to go further!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants