Skip to content

GH-96458: Statically initialize utf8 representation of static strings #96481

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 3, 2022

Conversation

kumaraditya303
Copy link
Contributor

@kumaraditya303 kumaraditya303 commented Sep 1, 2022

@kumaraditya303 kumaraditya303 marked this pull request as ready for review September 1, 2022 17:16
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_PyUnicode_LATIN1_INIT is wrong name. You can statically initialize utf8 representation only in ASCII strings.

Copy link
Member

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like @ericsnowcurrently to review this too.

@markshannon
Copy link
Member

Are you generating utf8 strings separately from the ascii string, or do they share the same data?

@gvanrossum
Copy link
Member

Are you generating utf8 strings separately from the ascii string, or do they share the same data?

The utf8 cache strings are only ever created and used for non-ASCII strings (i.e. at least one character is 128 or larger). The unicode string object is smart enough to know that for pure ASCII strings the utf8 is identical so it doesn't bother creating or using the utf8 cache value.

This PR only generates utf8 data for exactly 6 strings in deepfreeze.c. I'm sure those are the ones with at least non-ASCII character.

In order to see a leak one would have to request the utf8 encoding of at least one of those 6 strings.

Copy link
Member

@ericsnowcurrently ericsnowcurrently left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ericsnowcurrently
Copy link
Member

_PyUnicode_LATIN1_INIT is wrong name. You can statically initialize utf8 representation only in ASCII strings.

What do you mean? Keep in mind that the macro is used exclusively to statically initialize latin-1 strings.

@gvanrossum
Copy link
Member

_PyUnicode_LATIN1_INIT is wrong name. You can statically initialize utf8 representation only in ASCII strings.

What do you mean? Keep in mind that the macro is used exclusively to statically initialize latin-1 strings.

Indeed -- in fact the macro is only used in pycore_runtime_init_generated.h (a generated file :-) to initialize the _latin1 field of unicode objects. So the name seems eminently reasonable.

@serhiy-storchaka
Copy link
Member

Sorry, now I see that _PyUnicode_LATIN1_INIT takes a new argument.

@gvanrossum gvanrossum merged commit 6dab8c9 into python:main Sep 3, 2022
@kumaraditya303 kumaraditya303 deleted the deepfreeze branch September 3, 2022 10:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants