Skip to content

Simpler utf8_decode #414

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 27, 2024
Merged

Simpler utf8_decode #414

merged 1 commit into from
May 27, 2024

Conversation

chqrlie
Copy link
Collaborator

@chqrlie chqrlie commented May 26, 2024

  • no longer pass the array length to utf8_decode
  • add utf8_decode_len for border cases

@saghul
Copy link
Contributor

saghul commented May 26, 2024

Windows seems unhappy with the syntax :-/

@chqrlie
Copy link
Collaborator Author

chqrlie commented May 26, 2024

Windows seems unhappy with the syntax :-/

It is not really a syntax issue: the compiler (gcc.exe 12.2.0 for mingw64) inlines the code of utf8_decode_len into utf8_scan and complains that buf[i] = p[i]; writes beyond the end of the array at offset 4.

The offending code is:

#define UTF8_CHAR_LEN_MAX  4

uint32_t utf8_decode_len(const uint8_t *p, size_t max_len, const uint8_t **pp) {
    if (max_len >= UTF8_CHAR_LEN_MAX) {
        return utf8_decode(p, pp);
    } else
    if (max_len == 0) {
        *pp = p;
        return 0xFFFD;
    } else {
        uint8_t buf[UTF8_CHAR_LEN_MAX];
        const uint8_t *p_next;
        size_t i;
        uint32_t c;
        for (i = 0; i < max_len; i++) {
            buf[i] = p[i];
        }
        buf[i] = '\0';
        c = utf8_decode(buf, &p_next);
        *pp = p + (p_next - buf);
        return c;
    }
}

the code in question is only executed for max_len in range 1,2,3. How can i ever get the value 4?
Looking at how gcc 12.2 compiles this with the same flags for linux ( https://godbolt.org/z/zdEGcrqYb ) one can see that the loop is actually converted to a call to memcpy. The same for mingw64: https://godbolt.org/z/9MW5j8aME ...

Of course inlining this code into utf8_scan may produce different code, and the compiler complains the same as in the CI: https://godbolt.org/z/9MW5j8aME . So we cannot see the assembly code anyway.

If I initialize the array as uint8_t buf[UTF8_CHAR_LEN_MAX] = ""; which generates a single instructions, the problem goes away too.

Something is seriously wrong in this version of gcc: if I replace the code with a call to memcpy(buf, p, max_len); i = max_len; the compiler no longer complains: https://godbolt.org/z/PMf5b53eT

Another way to work around this problem is using -O2 instead of -O3.

gcc 13.1 does not complain either.

I shall post a alternative approach tonight.

- no longer pass the array length to `utf8_decode`
- add `utf8_decode_len` for border cases
- use switch based dispatch to work around gcc 12.2 optimizer bug
@chqrlie chqrlie merged commit 921c1ee into quickjs-ng:master May 27, 2024
47 checks passed
@chqrlie chqrlie deleted the utf8-patch branch May 27, 2024 06:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants