Skip to content

Correcting type mismatches in the email module #10444

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cushionbadak opened this issue Jul 12, 2023 · 7 comments · Fixed by #10450
Closed

Correcting type mismatches in the email module #10444

cushionbadak opened this issue Jul 12, 2023 · 7 comments · Fixed by #10450

Comments

@cushionbadak
Copy link
Contributor

I hope these three instances of type mismatch can help in enhancing cpython type hints.

1. Type mismatch in email.charset.Charset.header_encode_lines

Overview

Target for Modification: Charset.header_encode_lines in email/charset.pyi
Existing Type Hint: (self, string: str, maxlengths: Iterator[int]) -> list[str]
Suggested Type Hint: (self, string: str, maxlengths: Iterator[int]) -> list[Optional[str]]

Explanation

The method Charset.header_encode_lines may potentially return a list containing None.

class Charset:
    def header_encode_lines(self, string, maxlengths):
        ...
        lines = []
        current_line = []
        ...
                # Does nothing fit on the first line?
                if not lines and not current_line:
                    lines.append(None)
                else:
                    joined_line = EMPTYSTRING.join(current_line)
                    header_bytes = _encode(joined_line, codec)
                    lines.append(encoder(header_bytes))
        ...
        return lines

Although the path through which control flow might reach lines.append(None) is not immediately clear, it's important to note that this condition (where None is included in the returned list) was encountered in CPython's email unit test, specifically in TestEmailAsianCodecs.test_japanese_codecs in test/test_email/test_asian_codecs.py. The minimal example provided below illustrates this type mismatch:

class TestEmailAsianCodecs(TestEmailBase):
    def test_japanese_codecs(self):
        eq = self.ndiffAssertEqual
        jcode = "euc-jp"
        gcode = "iso-8859-1"
        j = Charset(jcode)
        g = Charset(gcode)
        h = Header("Hello World!")
        jhello = str(b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc'
                     b'\xa5\xeb\xa5\xc9\xa1\xaa', jcode)
        ghello = str(b'Gr\xfc\xdf Gott!', gcode)
        h.append(jhello, j)
        h.append(ghello, g)
        eq(h.encode(), """\
Hello World! =?iso-2022-jp?b?GyRCJU8lbSE8JW8hPCVrJUkhKhsoQg==?=
 =?iso-8859-1?q?Gr=FC=DF_Gott!?=""")

In this example, the method header_encode_lines gives the output [None, '=?iso-8859-1?q?Gr=FC=DF_Gott!?=']. This occurs through the follwoing control flow:

email.header.Header.encode
-> email.header._ValueFormatter.feed
-> email.charset.Charset.header_encode_lines

2. Type mismatch in email.charset.Charset.body_encode

Overview

Target for Modification: Charset.body_encode in email/charset.pyi
Existing Type Hint: (overload) (self, string: None) -> None, (self, string: str) -> str
Suggested Type Hint: (overload) (self, string: None) -> None, (self, string: str | bytes) -> str

Explanation

The method Charset.body_encode may be called with bytes as the string parameter, a case not currently covered by the existing type hint:

@overload
def body_encode(self, string: None) -> None: ...
@overload
def body_encode(self, string: str) -> str: ...

As mentioned in #10429 , Message.set_charset method can pass self._payload value to Charset.body_encode as the string parameter. However, there are scenarios where Message.set_charset modifies the str type self._payload to bytes type, before passing it to Charset.body_encode method. The following snippet from the library code demonstrates this:

class Message:
    def set_charset(self, charset):
        ...
                payload = self._payload
                if payload:
                    try:
                        payload = payload.encode('ascii', 'surrogateescape') # "str" payload converted to "bytes" here
                    except UnicodeError:
                        ...
                self._payload = charset.body_encode(payload)
        ...

This mismatch (passing bytes in place of str) was noticed in the CPython email unit test, TestEmailAsianCodecs.test_payload_encoding_utf8 found in test/test_email/test_asian_codecs.py. The compact example provided below highlights this type mismatch:

class TestEmailAsianCodecs(TestEmailBase):
    def test_payload_encoding_utf8(self):
       jhello = str(b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc'
                    b'\xa5\xeb\xa5\xc9\xa1\xaa', 'euc-jp')
       msg = Message()
       msg.set_payload(jhello, 'utf-8')

In this example, the body_encode method receives bytes, invoked via the following control flow:

email.message.Message.set_payload
-> email.message.Message.set_charset
-> email.charset.Charset.body_encode

3. Type Mismatch in email.policy.EmailPolicy.header_store_parse

Overview

Target for Modification: EmailPolicy.header_store_parse in email/policy.pyi
Existing Type Hint: (self, name: str, value: str) -> tuple[str, str]
Suggested Type Hint: (self, name: str, value: Any) -> tuple[str, str]

Explanation

The EmailPolicy.header_store_parse method can be invoked with an object of any type as the value parameter, but the current type hint only allows for a str value.

def header_store_parse(self, name: str, value: str) -> tuple[str, str]: ...

Considering that email.message.Message's special method __setitem__ calls EmailPolicy.header_store_parse and given the signature of __setitem__, the EmailPolicy.header_store_parse parameter type should be (self, name: str, value: Any).

class Message:
    def __setitem__(self, name, val):
        ...
        self._headers.append(self.policy.header_store_parse(name, val))
_HeaderType: TypeAlias = Any
class Message:
    def __setitem__(self, name: str, val: _HeaderType) -> None: ...

This inconsistency was detected in the CPython email unit test, TestBytesGenerator.test_smtp_policy found in test/test_email/test_generator.py. The trimmed example provided below illustrates this type mismatch:

from email.message import EmailMessage
from email.headerregistry import Address
msg = EmailMessage()
msg["From"] = Address(addr_spec="[email protected]", display_name="Páolo")

In this scenario, the msg["From"] = Address(...) statement invokes message.EmailMessage.__setitem__ ( = message.Message.__setitem__) and passes an Address object to the value parameter.


P.S. All links to CPython reference commit hash ae315991... from the current CPython 3.12 branch.

P.S. In my attempts to apply existing static type checking tools such as mypy, pytype, and pyright to the type hints in the cpython standard library, I've encountered various errors and imprecise results. I'm wondering if there are any established best practices for this process. Any guidance or tips would be greatly appreciated.

@srittau
Copy link
Collaborator

srittau commented Jul 12, 2023

Thanks for all this work!

Looking at the source for header_store_path(), I think we also need to change the return type to tuple[str, Any], as the docstring says:

Otherwise the name and value are passed to header_factory method, and the resulting custom header object is returned as the value.

And the annotation for EmailPolicy.header_factory should probably be Callable[[str, Any], Any].

P.S. In my attempts to apply existing static type checking tools such as mypy, pytype, and pyright to the type hints in the cpython standard library, I've encountered various errors and imprecise results. I'm wondering if there are any established best practices for this process. Any guidance or tips would be greatly appreciated.

If I remember correctly, there have been previous checks of the standard library (by @ethanhs?), and we even had discussions about running checks on a regular basis, but I can't find them anymore.

@emmatyping
Copy link
Member

It's been a looong time since I ran my check against the standard library, I recall running it on the plane ride back from PyCon 2018 :)

Here's the issue: #2149

At that point I was mostly just checking "does typeshed have stubs for these modules", which should be easy enough to check, the harder evaluation is coverage of the API surface...

@AlexWaygood
Copy link
Member

At that point I was mostly just checking "does typeshed have stubs for these modules", which should be easy enough to check, the harder evaluation is coverage of the API surface...

Stubtest actually does a very good job these days of checking whether things in modules at runtime exist in the stub. But, although it checks for missing modules in our third-party stubs, it currently doesn't do that for our stdlib stubs. I know we're missing several of the newer importlib submodules in the stdlib, for example

@AlexWaygood
Copy link
Member

Here's the issue: #2149

#5191 was a followup

@cushionbadak
Copy link
Contributor Author

cushionbadak commented Jul 12, 2023

Looking at the source for header_store_path(), I think we also need to change the return type to tuple[str, Any], as the docstring says:

What if we consider the following overload?

@overload
def header_store_parse(self, name: str, value: str) -> tuple[str, str]: ...
@overload
def header_store_parse(self, name: str, value: Any) -> tuple[str, Any]: ...

The implementation appears to divide into two parts, one for (name: str, value: Any) in Line-141~142, and the other for (name: str, value: str) in Line-148. This is where header_factory is defined as a Callable[[str, str], str] according to the type hint.

class EmailPolicy(Policy):
    header_factory: Callable[[str, str], str]

    def header_store_parse(self, name, value):
        # Line 141~142
        if hasattr(value, 'name') and value.name.lower() == name.lower():
            return (name, value)
        ...
        # Line 148
        return (name, self.header_factory(name, value))

EDIT:

And the annotation for EmailPolicy.header_factory should probably be Callable[[str, Any], Any].

Oh I missed this, sorry. I'll think about it some more.

@Akuli
Copy link
Collaborator

Akuli commented Jul 12, 2023

checks of the standard library

One idea I have been thinking of, but haven't tried yet:

  1. Copy the standard library (.py files) to a new folder
  2. Apply the stubs from typeshed to it, so that you end up with .py files that contain typeshed's annotations
  3. Run type checkers

I think pytype comes with scripts that do step 2, but I might remember wrong.

One challenge will be that private methods remain untyped, because they are missing from typeshed. To fix that, you could try running Python's test suite under monkeytype.

If you can get this to work, you might discover lots of things to be improved :)

@cushionbadak
Copy link
Contributor Author

Thank you, I appreciate all the advice you've given me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants