gh-127787: refactor helpers for `PyUnicodeErrorObject` internal interface #127789

picnixz · 2024-12-10T12:08:10Z

Unify get_unicode and get_string into a single function.
Allow retrieving the underlying object attribute, its size, and its start and end indices in a single call.
Use a common implementation for the following functions:
- PyUnicode{Decode,Encode}Error_GetEncoding
- PyUnicode{Decode,Encode,Translate}Error_GetObject
- PyUnicode{Decode,Encode,Translate}Error_{Get,Set}Reason
- PyUnicode{Decode,Encode,Translate}Error_{Get,Set}{Start,End}
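
As a rough illustration of the "everything in a single call" idea, here is a standalone sketch that does not use the CPython API. The `mock_unicode_error` struct and `mock_get_params` function are hypothetical stand-ins invented for this example (the real helper works on `PyUnicodeErrorObject` and `PyObject *`), and the values are returned raw, with no adjustment of the indices.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PyUnicodeErrorObject: 'object' is modelled as a
 * plain C string rather than a str/bytes PyObject. */
typedef struct {
    const char *object;   /* NULL models a missing 'object' attribute */
    ptrdiff_t start;
    ptrdiff_t end;
} mock_unicode_error;

/* Fetch the requested parameters in one call.  Any output pointer may be
 * NULL when the caller does not need that value.  Returns 0 on success and
 * -1 if the 'object' attribute is missing. */
static int
mock_get_params(const mock_unicode_error *exc,
                const char **obj, ptrdiff_t *objlen,
                ptrdiff_t *start, ptrdiff_t *end)
{
    assert(exc != NULL);
    if (exc->object == NULL) {
        return -1;
    }
    if (obj != NULL) {
        *obj = exc->object;
    }
    if (objlen != NULL) {
        *objlen = (ptrdiff_t)strlen(exc->object);
    }
    if (start != NULL) {
        *start = exc->start;
    }
    if (end != NULL) {
        *end = exc->end;
    }
    return 0;
}
```

A caller that only needs the size and the start index passes NULL for the other outputs, for example `mock_get_params(&exc, NULL, &len, &start, NULL)`.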

Note that there are some cosmetic changes here and there (in the naming of parameters), but these are essentially in anticipation of #127694, to reduce the conflicts I'll need to resolve (there will probably still be conflicts, but ideally I want them to be minimal).

I've moved all helpers before the public API. I could move them in between, but I felt it's cleaner this way (it also let me put double blank lines between functions a bit more easily).

Issue: Refactor PyUnicodeError internal C helpers #127787

- Unify `get_unicode` and `get_string` in a single function.
- Allow to retrieve the underlying `object` attribute and its size in one round.
- Use a common implementation for the following functions:
  - `PyUnicode{Decode,Encode}Error_GetEncoding`
  - `PyUnicode{Decode,Encode,Translate}Error_GetObject`
  - `PyUnicode{Decode,Encode,Translate}Error_{Get,Set}Reason`
  - `PyUnicode{Decode,Encode,Translate}Error_{Get,Set}{Start,End}`

picnixz · 2024-12-10T13:14:36Z

@encukou I've designed a _PyUnicodeError_GetParams which retrieves object, size, start and end, and can also check whether start and end are consistent. This could help in the codecs handlers (but I still need to check whether I need < or <=).

NVM: I'm just removing the parameter. It's easier to make the start < end check outside.
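
To make "checking outside" concrete, here is a small standalone sketch. The thread does not spell out how the indices are adjusted, so the clamping rule below is an assumption chosen for illustration, and `clamp_start`, `clamp_end` and `range_is_nonempty` are hypothetical names, not functions from `exceptions.c`.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed rule: clamp 'start' so that it indexes into an object of length
 * 'objlen' (or is 0 when the object is empty). */
static ptrdiff_t
clamp_start(ptrdiff_t start, ptrdiff_t objlen)
{
    assert(objlen >= 0);
    if (start < 0) {
        start = 0;
    }
    if (start >= objlen) {
        start = objlen == 0 ? 0 : objlen - 1;
    }
    return start;
}

/* Assumed rule: clamp 'end' to at least 1 and at most 'objlen'
 * (so it becomes 0 when the object is empty). */
static ptrdiff_t
clamp_end(ptrdiff_t end, ptrdiff_t objlen)
{
    assert(objlen >= 0);
    if (end < 1) {
        end = 1;
    }
    if (end > objlen) {
        end = objlen;
    }
    return end;
}

/* The consistency check lives in the caller, not in the getter:
 * returns 1 if the adjusted range covers at least one item, else 0. */
static int
range_is_nonempty(ptrdiff_t start, ptrdiff_t end, ptrdiff_t objlen)
{
    return clamp_start(start, objlen) < clamp_end(end, objlen);
}
```

Keeping the comparison in the caller lets each call site pick `<` or `<=` as it needs, which is the question raised above.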

picnixz · 2024-12-13T16:46:33Z

@encukou A little implementation question. Do you think it's preferable to have

PyObject *
PyUnicodeEncodeError_GetEncoding(PyObject *self)
{
    int rc = check_unicode_error_type(self, "UnicodeEncodeError");
    return rc < 0 ? NULL : unicode_error_get_encoding_impl(self);
}

with unicode_error_get_encoding_impl working on generic UnicodeError objects (just assertion casts) or do you prefer unicode_error_get_encoding_impl to actually be the one performing the following check with an additional expect_type parameter:

int rc = check_unicode_error_type(self, expect_type);

Unless I use generating macros, I'll end up duplicating either the expect_type strings or the int rc = ... line. Personally, today I feel that it reads better as it is now, but tomorrow I may prefer a "short" implementation of the public API itself.
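
For reference, the "generating macros" alternative mentioned here could look roughly like the standalone sketch below. It is not code from this PR: `mock_error`, `mock_check_type`, `mock_get_encoding_impl` and `MOCK_DEFINE_GET_ENCODING` are hypothetical names, and plain C strings stand in for Python objects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an exception object. */
typedef struct {
    const char *type_name;
    const char *encoding;
} mock_error;

/* Return 0 if 'self' has the expected type name, -1 otherwise. */
static int
mock_check_type(const mock_error *self, const char *expect_type)
{
    if (self == NULL || strcmp(self->type_name, expect_type) != 0) {
        return -1;
    }
    return 0;
}

/* Generic implementation: assumes the caller already checked the type. */
static const char *
mock_get_encoding_impl(const mock_error *self)
{
    assert(self != NULL);
    return self->encoding;
}

/* Generate one public getter per exception kind.  The expected type name
 * is built from KIND by string-literal concatenation, so it is written
 * only once per kind. */
#define MOCK_DEFINE_GET_ENCODING(KIND)                              \
    const char *                                                    \
    Mock##KIND##Error_GetEncoding(const mock_error *self)           \
    {                                                               \
        int rc = mock_check_type(self, "Unicode" #KIND "Error");    \
        return rc < 0 ? NULL : mock_get_encoding_impl(self);        \
    }

MOCK_DEFINE_GET_ENCODING(Encode)
MOCK_DEFINE_GET_ENCODING(Decode)
```

The trade-off is the usual one for generated functions: less repetition, but the public entry points no longer appear literally in the source, which makes them harder to find with a text search.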

encukou · 2024-12-19T15:11:54Z

I think it's fine to silently accept “wrong” subclasses of UnicodeError, especially if a strict check would be difficult to implement.

picnixz · 2024-12-21T11:11:58Z

Maybe I wasn't clear, but I wasn't asking about subclasses. Since I'm using PyObject_TypeCheck, I consider the check to be broad enough. What I wanted to ask is whether you want me to have something like:

static inline PyObject *
unicode_error_get_encoding_impl(PyObject *self)
{
    PyUnicodeErrorObject *exc = PyUnicodeError_CAST(self);
    return as_unicode_error_attribute(exc->encoding, "encoding", false);
}

PyObject *
PyUnicodeEncodeError_GetEncoding(PyObject *self)
{
    int rc = check_unicode_error_type(self, "UnicodeEncodeError");
    return rc < 0 ? NULL : unicode_error_get_encoding_impl(self);
}

or

static inline PyUnicodeErrorObject *
as_unicode_error(PyObject *self, const char *expect_type)
{
    int rc = check_unicode_error_type(self, expect_type);
    return rc < 0 ? NULL : _PyUnicodeError_CAST(self);
}

static inline PyObject *
unicode_error_get_encoding_impl(PyObject *self, const char *expect_type)
{
    PyUnicodeErrorObject *exc = as_unicode_error(self, expect_type);
    if (exc == NULL) {
        /* the type check failed and has already set an exception */
        return NULL;
    }
    return as_unicode_error_attribute(exc->encoding, "encoding", false);
}


PyObject *
PyUnicodeEncodeError_GetEncoding(PyObject *self)
{
    return unicode_error_get_encoding_impl(self, "UnicodeEncodeError");
}

The first solution delegates type-checking and attribute retrieval to two different functions (check_unicode_error_type and unicode_error_get_encoding_impl), while the second version assumes that unicode_error_get_encoding_impl is also responsible for the cast. However, I designed the functions so that the internal ones perform no runtime checks (only assertions), so that they could possibly be used somewhere else.

encukou · 2025-01-02T14:07:31Z

Objects/exceptions.c

+/*
+ * Return the underlying (str) 'encoding' attribute of a Unicode Error object.
+ *
+ * The caller is responsible to ensure that 'self' is a PyUnicodeErrorObject.


I'd prefer an assert over a “The caller is responsible to ensure...” comment. To a human reader, they should be equivalent.

The assert is actually inside the _CAST macro. It's just to document that this function would crash otherwise. The alternative is to remove the assert inside the CAST macro and make it an explicit one though that would add lines.

encukou · 2025-01-02T14:08:20Z

What I wanted to ask is whether you want me to have something like: [...]

Hm, that looks like a style choice I can leave to you; they look similarly complex.

There's one more style you can consider for internal functions, “error pass-through”:

/* if self is NULL, return NULL; an exception must already be set */
static inline PyObject *
unicode_error_get_encoding_impl(PyObject *self) {
    if (!self) {
        return NULL;
    }
    PyUnicodeErrorObject *exc = PyUnicodeError_CAST(self);
    return as_unicode_error_attribute(exc->encoding, "encoding", false);
}

PyObject *
PyUnicodeEncodeError_GetEncoding(PyObject *self)
{
    PyObject *err = check_unicode_error_type(self, "UnicodeEncodeError");
    return unicode_error_get_encoding_impl(err);
}
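
To see the pass-through behaviour in isolation, here is a standalone model that does not use the CPython API. All names are hypothetical, and a static pointer stands in for "an exception is set"; in this model the type check returns its argument on success and NULL on failure, so its result can be fed straight into the implementation function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an exception object. */
typedef struct {
    const char *type_name;
    const char *encoding;
} mock_error;

/* Models the interpreter's "current exception" slot. */
static const char *mock_pending_error = NULL;

/* Return 'self' if it has the expected type name; otherwise record an
 * error and return NULL. */
static const mock_error *
mock_check_type(const mock_error *self, const char *expect_type)
{
    if (self == NULL || strcmp(self->type_name, expect_type) != 0) {
        mock_pending_error = "wrong exception type";
        return NULL;
    }
    return self;
}

/* Pass-through: a NULL input means an earlier step failed, so simply
 * propagate NULL.  The error must already have been recorded. */
static const char *
mock_get_encoding_impl(const mock_error *self)
{
    if (self == NULL) {
        assert(mock_pending_error != NULL);
        return NULL;
    }
    return self->encoding;
}

const char *
mock_encode_error_get_encoding(const mock_error *self)
{
    return mock_get_encoding_impl(
        mock_check_type(self, "UnicodeEncodeError"));
}
```

The public function becomes a single expression, at the cost of the implementation function relying on its caller having recorded the error.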

Are you happy with the current iteration of the PR?
(Apologies, I'm now a bit lost about the state of your PRs; if there's one I should look at first, please let me know!)

picnixz · 2025-01-02T14:21:05Z

(Apologies, I'm now a bit lost about the state of your PRs; if there's one I should look at first, please let me know!)

No worries! I have a lot of PRs that are nearly identical (namely the UBSan ones), which you can just skip for now. The other PRs related to unicode error objects are the codecs ones. I'm not on my dev session right now, and since it's the holiday period, I don't want to overwhelm you with review requests.

Are you happy with the current iteration of the PR?

I'll have a look again tomorrow to decide the final state of the PR. I'm pretty happy with the current implementation (namely no pass-through, and assertion in the CAST) but I can consider the pass-through approach. It looks nice and could reduce the number of overall lines. I can also make sure that an exception is set before returning NULL so it would at least suit what I wanted to do (it would also decouple the logic of checking and performing the actual operation in PyUnicodeEncodeError_{Get,Set}* functions so it's also fine).

I don't know how the exception class will evolve in the future, especially how we will decide to handle relative start/end indices (I think we're unfortunately stuck and won't really be able to change the behaviour since it's part of the stable ABI).

picnixz · 2025-01-02T14:22:10Z

The merge plan I had in mind was:

1. This PR (so that I introduce the GetParams function).
2. The PRs for "Incorrect handling of start and end values in codecs error handlers" #126004, which could then use the GetParams function.

So until this one is merged, there is no need to review the others as the code will change a bit. Though, you can review them if you want to look at the logic only.

picnixz · 2025-01-03T10:08:46Z

I eventually decided to avoid a pass-through. While it would work, I feel that it's not right to expect the callee to eventually rely on the fact that an exception has been set.

However, do you want me to add NULL checks? (without those, the assertions would also crash)
EDIT: I've added them so that it's more robust.

encukou

LGTM! Just a few comment nitpicks left.

encukou

Thank you!

… interface (pythonGH-127789)

- Unify `get_unicode` and `get_string` in a single function.
- Allow to retrieve the underlying `object` attribute, its size, and the adjusted 'start' and 'end', all at once. Add a new `_PyUnicodeError_GetParams` internal function for this. (In `exceptions.c`, it's somewhat common to not need all the attributes, but the compiler has opportunity to inline the function and optimize unneeded work away. Outside that file, we'll usually need all or most of them at once.)
- Use a common implementation for the following functions:
  - `PyUnicode{Decode,Encode}Error_GetEncoding`
  - `PyUnicode{Decode,Encode,Translate}Error_GetObject`
  - `PyUnicode{Decode,Encode,Translate}Error_{Get,Set}Reason`
  - `PyUnicode{Decode,Encode,Translate}Error_{Get,Set}{Start,End}`

picnixz added 3 commits December 10, 2024 12:57

put comment section headers

32a199d

add comments

2583095

picnixz requested a review from encukou December 10, 2024 12:08

picnixz requested a review from iritkatriel as a code owner December 10, 2024 12:08

bedevere-app bot added the awaiting review label Dec 10, 2024

bedevere-app bot mentioned this pull request Dec 10, 2024

Refactor PyUnicodeError internal C helpers #127787

Closed

picnixz added the skip news label Dec 10, 2024

picnixz marked this pull request as draft December 10, 2024 12:14

bedevere-app bot removed the awaiting review label Dec 10, 2024

picnixz added 2 commits December 10, 2024 13:25

simpler checks

f0893b7

fix tests

a4b01f0

picnixz marked this pull request as ready for review December 10, 2024 12:27

bedevere-app bot added the awaiting review label Dec 10, 2024

unify even more the interface using a generic getter

01b5f22

picnixz added 3 commits December 10, 2024 14:16

remove useless consistent parameter

be982a0

Simplify call

d4dc9a6

Merge branch 'main' into feat/exc/unicode-error-refactor-127787

3078f23

picnixz marked this pull request as draft December 13, 2024 16:41

bedevere-app bot removed the awaiting review label Dec 13, 2024

picnixz added 2 commits December 13, 2024 17:47

remove unused function

94800fd

remove un-necessary macros

e5709fa

picnixz marked this pull request as ready for review December 13, 2024 16:48

bedevere-app bot added the awaiting review label Dec 13, 2024

encukou reviewed Jan 2, 2025

encukou reviewed Jan 2, 2025

Update Objects/exceptions.c

83eb24d

Co-authored-by: Petr Viktorin <[email protected]>

picnixz added 5 commits January 3, 2025 10:24

style update

7c9fd99

use macro for repeated names to avoid typos

8219be9

specialize _PyUnicodeError_GetParams for start and end attributes

4a5e4e3

This is typically useful for future refactoring and for keeping lines below 80 characters. It also helps avoid having to remember where to place the NULL arguments.

update comments

f7a2efa

put 2 blank lines before sections

17367ff

picnixz requested a review from encukou January 3, 2025 10:06

add NULL assertions to avoid obscure segmentation faults

c05f2ad

encukou reviewed Jan 3, 2025

picnixz added 3 commits January 3, 2025 12:24

update comments

6c07afc

fixup comment

ebf86c7

fixup comment

3fb81c2

picnixz requested a review from encukou January 3, 2025 11:36

encukou approved these changes Jan 3, 2025

bedevere-app bot added awaiting merge and removed awaiting review labels Jan 3, 2025

encukou merged commit fa985be into python:main Jan 3, 2025
43 checks passed

bedevere-app bot removed the awaiting merge label Jan 3, 2025

picnixz deleted the feat/exc/unicode-error-refactor-127787 branch January 3, 2025 12:48
