Skip to content

gh-127787: refactor helpers for PyUnicodeErrorObject internal interface #127789

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Jan 3, 2025

Conversation

picnixz
Copy link
Member

@picnixz picnixz commented Dec 10, 2024

  • Unify get_unicode and get_string in a single function.

  • Allow to retrieve the underlying object attribute, its size and its start and end indices in one round.

  • Use a common implementation for the following functions:

    • PyUnicode{Decode,Encode}Error_GetEncoding
    • PyUnicode{Decode,Encode,Translate}Error_GetObject
    • PyUnicode{Decode,Encode,Translate}Error_{Get,Set}Reason
    • PyUnicode{Decode,Encode,Translate}Error_{Get,Set}{Start,End}

Note that there are some cosmetic changes here and there (in the naming of parameters) but these are essentially in prevision of #127694 in order to reduce the conflicts I'll need to solve (there will be conflicts probably but ideally, I want them to be minimal).

I've moved all helpers before the public API. I could move them inbetween but I felt that it's cleaner that way (it also allowed me to put double blank lines between functions a bit more easily).

- Unify `get_unicode` and `get_string` in a single function.
- Allow to retrieve the underlying `object` attribute and its
  size in one round.
- Use a common implementation for the following functions:

  - `PyUnicode{Decode,Encode}Error_GetEncoding`
  - `PyUnicode{Decode,Encode,Translate}Error_GetObject`
  - `PyUnicode{Decode,Encode,Translate}Error_{Get,Set}Reason`
  - `PyUnicode{Decode,Encode,Translate}Error_{Get,Set}{Start,End}`
@picnixz picnixz marked this pull request as ready for review December 10, 2024 12:27
@picnixz
Copy link
Member Author

picnixz commented Dec 10, 2024

@encukou I've designed a _PyUnicodeError_GetParams which allows to retrieve object, size, start, end and check whether start and end are consistent or not as well. This could help in the codecs handlers (but I just need to check whether I need < or <=).

NVM: just removing the parameter. It's easier to make the check start < end outside.

@picnixz picnixz marked this pull request as draft December 13, 2024 16:41
@picnixz
Copy link
Member Author

picnixz commented Dec 13, 2024

@encukou A little implementation question. Do you think it's preferrable to have

PyObject *
PyUnicodeEncodeError_GetEncoding(PyObject *self)
{
    int rc = check_unicode_error_type(self, "UnicodeEncodeError");
    return rc < 0 ? NULL : unicode_error_get_encoding_impl(self);
}

with unicode_error_get_encoding_impl working on generic UnicodeError objects (just assertion casts) or do you prefer unicode_error_get_encoding_impl to actually be the one performing the following check with an additional expect_type parameter:

int rc = check_unicode_error_type(self, expect_type);

Unless I use generating maocrs, I'll end up either duplicating the expect_type strings, or by duplicating int rc = .... Personally, today I feel that it reads better as it is now, but tomorrow maybe I may prefer a "short" implementation of the public API itself.

@picnixz picnixz marked this pull request as ready for review December 13, 2024 16:48
@encukou
Copy link
Member

encukou commented Dec 19, 2024

I think it's fine to silently accept “wrong” subclasses of UnicodeError, especially if a strict check would be difficult to implement.

@picnixz
Copy link
Member Author

picnixz commented Dec 21, 2024

Maybe I wasn't clear but I wasn't talking about subclasses or not. Since I'm using PyObject_TypeCheck, I consider the check to be broad enough. What I wanted to ask is whether you want me to have something like:

static inline PyObject *
unicode_error_get_encoding_impl(PyObject *self)
{
    PyUnicodeErrorObject *exc = PyUnicodeError_CAST(self);
    return as_unicode_error_attribute(exc->encoding, "encoding", false);
}

PyObject *
PyUnicodeEncodeError_GetEncoding(PyObject *self)
{
    int rc = check_unicode_error_type(self, "UnicodeEncodeError");
    return rc < 0 ? NULL : unicode_error_get_encoding_impl(self);
}

or

static inline PyUnicodeErrorObject *
as_unicode_error(PyObject *self, const char *expect_type)
{
    int rc = check_unicode_error_type(self, expect_type);
    return rc < 0 ? NULL : _PyUnicodeError_CAST(self);
}

static inline PyObject *
unicode_error_get_encoding_impl(PyObject *self, const char *expect_type)
{
    PyUnicodeErrorObject *exc = as_unicode_error(self, expect_type);
    return as_unicode_error_attribute(exc->encoding, "encoding", false);
}


PyObject *
PyUnicodeEncodeError_GetEncoding(PyObject *self)
{
    return unicode_error_get_encoding_impl(self, "UnicodeEncodeError");
}

The first solution delegates type-checking and attribute retrieval to two different functions ( check_unicode_error_type and unicode_error_get_encoding_impl), while the second version assumes that unicode_error_get_encoding_impl is also responsible for the cast. However, I designed the functions so that internal ones do not need to do runtime (yet assert-only) checks so that they could possibly be used somewhere else.

/*
* Return the underlying (str) 'encoding' attribute of a Unicode Error object.
*
* The caller is responsible to ensure that 'self' is a PyUnicodeErrorObject.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer an assert over a “The caller is responsible to ensure...” comment. To a human reader, they should be equivalent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assert is actually inside the _CAST macro. It's just to document that this function would crash otherwise. The alternative is to remove the assert inside the CAST macro and make it an explicit one though that would add lines.

@encukou
Copy link
Member

encukou commented Jan 2, 2025

That I wanted to ask is whether you want me to have something like: [...]

Hm, that looks like a style choice I can leave to you; they look similarly complex.

There's one more style you can consider for internal functions, “error pass-through”:

/* if self is NULL, return NULL; an exception must already be set */
static inline PyObject *
unicode_error_get_encoding_impl(PyObject *self) {
    if (!self) {
        return NULL,
    }
    PyUnicodeErrorObject *exc = PyUnicodeError_CAST(self);
    return as_unicode_error_attribute(exc->encoding, "encoding", false);
}

PyObject *
PyUnicodeEncodeError_GetEncoding(PyObject *self)
{
    PyObject *err = check_unicode_error_type(self, "UnicodeEncodeError");
    return unicode_error_get_encoding_impl(err);
}

Are you happy with the current iteration of the PR?
(Apologies, I'm now a bit lost about the state of your PRs; if there's one I should look at first, please let me know!)

Co-authored-by: Petr Viktorin <[email protected]>
@picnixz
Copy link
Member Author

picnixz commented Jan 2, 2025

(Apologies, I'm now a bit lost about the state of your PRs; if there's one I should look at first, please let me know!)

No worries! I have a lot of PRs that are identical (namely UBSan ones) which you can just skip for now. Other PRs related to unicode error objects are those with codecs. I'm not on my dev session now and since it's a holidays period, I don't want to overwhelm you with review requests.

Are you happy with the current iteration of the PR?

I'll have a look again tomorrow to decide the final state of the PR. I'm pretty happy with the current implementation (namely no pass-through, and assertion in the CAST) but I can consider the pass-through approach. It looks nice and could reduce the number of overall lines. I can also make sure that an exception is set before returning NULL so it would at least suit what I wanted to do (it would also decouple the logic of checking and performing the actual operation in PyUnicodeEncodeError_{Get,Set}* functions so it's also fine).

I don't know how the exception class will evolve in the future, especially how we will decide to handle relative start/end indices (I think we're unfortunately stuck and won't really be able to change the behaviour since it's part of the stable ABI).

@picnixz
Copy link
Member Author

picnixz commented Jan 2, 2025

The merge plan I had in mind was:

So until this one is merged, there is no need to review the others as the code will change a bit. Though, you can review them if you want to look at the logic only.

This is typically useful for future refactorization and to
be able to write lines below 80 characters. This also helps
avoiding having to remember where to place the NULL arguments.
@picnixz picnixz requested a review from encukou January 3, 2025 10:06
@picnixz
Copy link
Member Author

picnixz commented Jan 3, 2025

I eventually decided to avoid a pass-through. While it would work, I feel that it's not right to expect the callee to eventually rely on the fact that an exception has been set.

However, do you want me to add NULL checks? (without those, the assertions would also crash)
EDIT: I've added them so that it's more robust.

Copy link
Member

@encukou encukou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a few comment nitpicks left.

@picnixz picnixz requested a review from encukou January 3, 2025 11:36
Copy link
Member

@encukou encukou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@encukou encukou merged commit fa985be into python:main Jan 3, 2025
43 checks passed
@picnixz picnixz deleted the feat/exc/unicode-error-refactor-127787 branch January 3, 2025 12:48
WolframAlph pushed a commit to WolframAlph/cpython that referenced this pull request Jan 4, 2025
… interface (pythonGH-127789)

- Unify `get_unicode` and `get_string` in a single function.

- Allow to retrieve the underlying `object` attribute, its
  size, and the adjusted 'start' and 'end', all at once.
  Add a new `_PyUnicodeError_GetParams` internal function for this.
  (In `exceptions.c`, it's somewhat common to not need all the attributes,
  but the compiler has opportunity to inline the function and optimize
  unneeded work away. Outside that file, we'll usually need all or
  most of them at once.)

- Use a common implementation for the following functions:

  - `PyUnicode{Decode,Encode}Error_GetEncoding`
  - `PyUnicode{Decode,Encode,Translate}Error_GetObject`
  - `PyUnicode{Decode,Encode,Translate}Error_{Get,Set}Reason`
  - `PyUnicode{Decode,Encode,Translate}Error_{Get,Set}{Start,End}`
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Jan 8, 2025
… interface (pythonGH-127789)

- Unify `get_unicode` and `get_string` in a single function.

- Allow to retrieve the underlying `object` attribute, its
  size, and the adjusted 'start' and 'end', all at once.
  Add a new `_PyUnicodeError_GetParams` internal function for this.
  (In `exceptions.c`, it's somewhat common to not need all the attributes,
  but the compiler has opportunity to inline the function and optimize
  unneeded work away. Outside that file, we'll usually need all or
  most of them at once.)

- Use a common implementation for the following functions:

  - `PyUnicode{Decode,Encode}Error_GetEncoding`
  - `PyUnicode{Decode,Encode,Translate}Error_GetObject`
  - `PyUnicode{Decode,Encode,Translate}Error_{Get,Set}Reason`
  - `PyUnicode{Decode,Encode,Translate}Error_{Get,Set}{Start,End}`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants