Avoid locale dependent <ctype.h> functions like isascii(), isdigit(), tolower() #108767

Closed
vstinner opened this issue Sep 1, 2023 · 3 comments
Labels
type-feature A feature request or enhancement

@vstinner
Member

vstinner commented Sep 1, 2023

Feature or enhancement

Proposal:

The following C files use <ctype.h> functions which depend on the current LC_CTYPE locale:

$ git grep -l -E '\b(isalnum|isalpha|iscntrl|isdigit|islower|isgraph|isprint|ispunct|isspace|isupper|isxdigit|tolower|toupper)\b'|grep -E '\.c$'
Modules/_decimal/libmpdec/io.c
Modules/_sre/sre.c
Modules/_zoneinfo.c
Modules/getaddrinfo.c
Objects/bytearrayobject.c
Objects/bytes_methods.c
Objects/bytesobject.c
Objects/unicodeobject.c
PC/launcher.c
PC/launcher2.c
Parser/tokenizer.c
Python/formatter_unicode.c
Python/pystrcmp.c

I propose replacing them with Python C API functions that don't depend on the locale, such as Py_ISDIGIT() and Py_TOLOWER().
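
To illustrate what "doesn't depend on the locale" means in practice, here is a minimal sketch of ASCII-only helpers. The names ascii_isdigit(), ascii_tolower() and parse_ascii_digits() are hypothetical stand-ins written for this example; they are not the actual Py_ISDIGIT() and Py_TOLOWER() definitions. The sketch only assumes that such helpers inspect the byte value and never consult LC_CTYPE.

```c
#include <stdbool.h>

/* Hypothetical stand-in for Py_ISDIGIT(): true only for ASCII '0'..'9',
 * whatever the current LC_CTYPE locale is. */
static inline bool ascii_isdigit(int c)
{
    unsigned char uc = (unsigned char)(c & 0xff);
    return uc >= '0' && uc <= '9';
}

/* Hypothetical stand-in for Py_TOLOWER(): maps ASCII 'A'..'Z' to 'a'..'z'
 * and returns every other byte value unchanged. */
static inline int ascii_tolower(int c)
{
    unsigned char uc = (unsigned char)(c & 0xff);
    if (uc >= 'A' && uc <= 'Z') {
        return uc + ('a' - 'A');
    }
    return uc;
}

/* Example call site: parse a run of ASCII decimal digits and stop at the
 * first byte that is not a digit. No overflow check, for brevity. */
static inline long parse_ascii_digits(const char *s)
{
    long value = 0;
    while (ascii_isdigit((unsigned char)*s)) {
        value = value * 10 + (*s - '0');
        s++;
    }
    return value;
}
```

With helpers like these, a byte such as 0xB2 is never reported as a digit and 0xC9 is never case-folded, whereas the <ctype.h> functions may answer differently depending on the active locale.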

@vstinner vstinner added the type-feature A feature request or enhancement label Sep 1, 2023
vstinner added a commit to vstinner/cpython that referenced this issue Sep 1, 2023
Convert the following macros to static inline functions:

* Py_ISLOWER()
* Py_ISUPPER()
* Py_ISALPHA()
* Py_ISDIGIT()
* Py_ISXDIGIT()
* Py_ISALNUM()
* Py_ISSPACE()
* Py_TOLOWER()
* Py_TOUPPER()
* Py_CHARMASK()
@serhiy-storchaka
Member

The use in Modules/_sre/sre.c is intentional.

vstinner added a commit to vstinner/cpython that referenced this issue Sep 1, 2023
Replace <ctype.h> locale dependent isdigit() with Python locale
independent Py_ISDIGIT() function in _PyBytes_FormatEx().
@vstinner
Member Author

vstinner commented Sep 1, 2023

Modules/_decimal/libmpdec/io.c is a copy of libmpdec, so I prefer to leave it unchanged.

I also prefer to leave the Windows launcher program unchanged:

PC/launcher.c
PC/launcher2.c

vstinner added a commit to vstinner/cpython that referenced this issue Sep 1, 2023
Replace <ctype.h> locale dependent functions with Python "pyctype.h"
locale independent functions:

* Replace isalpha() with Py_ISALPHA().
* Replace isdigit() with Py_ISDIGIT().
* Replace isxdigit() with Py_ISXDIGIT().
* Replace tolower() with Py_TOLOWER().

Leave Modules/_sre/sre.c unchanged, it uses locale dependent
functions on purpose.
@vstinner
Member Author

vstinner commented Sep 1, 2023

By the way, pyport.h contains an interesting piece of code:

/* On 4.4BSD-descendants, ctype functions serves the whole range of
 * wchar_t character set rather than single byte code points only.
 * This characteristic can break some operations of string object
 * including str.upper() and str.split() on UTF-8 locales.  This
 * workaround was provided by Tim Robbins of FreeBSD project.
 */

#if defined(__APPLE__)
#  define _PY_PORT_CTYPE_UTF8_ISSUE
#endif

#ifdef _PY_PORT_CTYPE_UTF8_ISSUE
#ifndef __cplusplus
   /* The workaround below is unsafe in C++ because
    * the <locale> defines these symbols as real functions,
    * with a slightly different signature.
    * See issue #10910
    */
#include <ctype.h>
#include <wctype.h>
#undef isalnum
#define isalnum(c) iswalnum(btowc(c))
#undef isalpha
#define isalpha(c) iswalpha(btowc(c))
#undef islower
#define islower(c) iswlower(btowc(c))
#undef isspace
#define isspace(c) iswspace(btowc(c))
#undef isupper
#define isupper(c) iswupper(btowc(c))
#undef tolower
#define tolower(c) towlower(btowc(c))
#undef toupper
#define toupper(c) towupper(btowc(c))
#endif
#endif
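
The effect of one of these redefinitions can be sketched as a plain function. The wrappers isalpha_via_wide() and tolower_via_wide() below are hypothetical names used only for illustration; they spell out what the macro bodies iswalpha(btowc(c)) and towlower(btowc(c)) do: convert the single byte to a wide character first, then classify or convert the wide character.

```c
#include <stdio.h>   /* EOF */
#include <wchar.h>   /* btowc(), wint_t */
#include <wctype.h>  /* iswalpha(), towlower() */

/* Hypothetical wrapper mirroring "#define isalpha(c) iswalpha(btowc(c))".
 * btowc() returns WEOF when the byte is not a valid single-byte character
 * in the current locale (or when c is EOF), and iswalpha(WEOF) is 0. */
static int isalpha_via_wide(int c)
{
    return iswalpha(btowc(c)) != 0;
}

/* Hypothetical wrapper mirroring "#define tolower(c) towlower(btowc(c))".
 * The result is a wide character, not a byte. */
static wint_t tolower_via_wide(int c)
{
    return towlower(btowc(c));
}
```

Under these semantics a lone byte of a multibyte UTF-8 sequence is not a complete character, so btowc() is expected to yield WEOF for it and the classification comes back false, instead of the byte being treated as a wide code point.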

vstinner added a commit that referenced this issue Sep 1, 2023
Replace <ctype.h> locale dependent functions with Python "pyctype.h"
locale independent functions:

* Replace isalpha() with Py_ISALPHA().
* Replace isdigit() with Py_ISDIGIT().
* Replace isxdigit() with Py_ISXDIGIT().
* Replace tolower() with Py_TOLOWER().

Leave Modules/_sre/sre.c unchanged, it uses locale dependent
functions on purpose.

Include explicitly <ctype.h> in _decimal.c to get isascii().
vstinner added a commit to vstinner/cpython that referenced this issue Sep 2, 2023
Convert the following macros to static inline functions:

* Py_ISLOWER()
* Py_ISUPPER()
* Py_ISALPHA()
* Py_ISDIGIT()
* Py_ISXDIGIT()
* Py_ISALNUM()
* Py_ISSPACE()
* Py_TOLOWER()
* Py_TOUPPER()
* Py_CHARMASK()
vstinner added a commit to vstinner/cpython that referenced this issue Sep 3, 2023
Convert the following macros to static inline functions:

* Py_ISLOWER()
* Py_ISUPPER()
* Py_ISALPHA()
* Py_ISDIGIT()
* Py_ISXDIGIT()
* Py_ISALNUM()
* Py_ISSPACE()
* Py_TOLOWER()
* Py_TOUPPER()
* Py_CHARMASK()

Changes:

* sre_lower_ascii() now casts the Py_TOLOWER() argument to "unsigned
  char" and casts the result to "unsigned int".
* bytesobject.c and bytearrayobject.c now pass an "int" argument to
  Py_CHARMASK(), instead of a "Py_ssize_t" argument.
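
The casts mentioned in these changes can be sketched as follows. Both charmask() and lower_ascii() are hypothetical helpers written for this example, in the style of Py_CHARMASK() and sre_lower_ascii(); they are not the CPython definitions. The sketch assumes the point of the conversion is that a static inline function has a fixed parameter type, so call sites have to pass a value of that type rather than whatever integer type they happen to hold.

```c
/* Hypothetical helper in the style of Py_CHARMASK(): takes an int and
 * keeps only the low 8 bits, so a negative "char" value becomes a valid
 * index in the range 0..255. */
static inline unsigned char charmask(int ch)
{
    return (unsigned char)(ch & 0xff);
}

/* Hypothetical helper in the style of sre_lower_ascii(): code points of
 * 128 and above pass through unchanged; ASCII values are narrowed to
 * "unsigned char", lowercased, and widened back to "unsigned int". */
static inline unsigned int lower_ascii(unsigned int ch)
{
    if (ch >= 128) {
        return ch;
    }
    unsigned char uc = (unsigned char)ch;
    if (uc >= 'A' && uc <= 'Z') {
        uc = (unsigned char)(uc + ('a' - 'A'));
    }
    return (unsigned int)uc;
}
```

A macro accepts any argument type silently, while a function with an "int" parameter makes the narrowing from a wider type such as Py_ssize_t visible at the call site, which is presumably why the callers were adjusted.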
@vstinner vstinner closed this as completed Sep 6, 2023