Simpler utf8_decode #414

chqrlie · 2024-05-26T08:18:36Z

no longer pass the array length to utf8_decode
add utf8_decode_len for border cases

saghul · 2024-05-26T08:40:05Z

Windows seems unhappy with the syntax :-/

chqrlie · 2024-05-26T19:24:05Z

Windows seems unhappy with the syntax :-/

It is not really a syntax issue: the compiler (gcc.exe 12.2.0 for mingw64) inlines the code of utf8_decode_len into utf8_scan and complains that buf[i] = p[i]; writes beyond the end of the array at offset 4.

The offending code is:

#define UTF8_CHAR_LEN_MAX  4

uint32_t utf8_decode_len(const uint8_t *p, size_t max_len, const uint8_t **pp) {
    if (max_len >= UTF8_CHAR_LEN_MAX) {
        return utf8_decode(p, pp);
    } else
    if (max_len == 0) {
        *pp = p;
        return 0xFFFD;
    } else {
        uint8_t buf[UTF8_CHAR_LEN_MAX];
        const uint8_t *p_next;
        size_t i;
        uint32_t c;
        for (i = 0; i < max_len; i++) {
            buf[i] = p[i];
        }
        buf[i] = '\0';
        c = utf8_decode(buf, &p_next);
        *pp = p + (p_next - buf);
        return c;
    }
}

the code in question is only executed for max_len in range 1,2,3. How can i ever get the value 4?
Looking at how gcc 12.2 compiles this with the same flags for linux ( https://godbolt.org/z/zdEGcrqYb ) one can see that the loop is actually converted to a call to memcpy. The same for mingw64: https://godbolt.org/z/9MW5j8aME ...

Of course inlining this code into utf8_scan may produce different code, and the compiler complains the same as in the CI: https://godbolt.org/z/9MW5j8aME . So we cannot see the assembly code anyway.

If I initialize the array as uint8_t buf[UTF8_CHAR_LEN_MAX] = ""; which generates a single instructions, the problem goes away too.

Something is seriously wrong in this version of gcc: if I replace the code with a call to memcpy(buf, p, max_len); i = max_len; the compiler no longer complains: https://godbolt.org/z/PMf5b53eT

Another way to work around this problem is using -O2 instead of -O3.

gcc 13.1 does not complain either.

I shall post a alternative approach tonight.

- no longer pass the array length to `utf8_decode` - add `utf8_decode_len` for border cases - use switch based dispatch to work around gcc 12.2 optimizer bug

Simpler utf8_decode

830fa6b

- no longer pass the array length to `utf8_decode` - add `utf8_decode_len` for border cases - use switch based dispatch to work around gcc 12.2 optimizer bug

chqrlie force-pushed the utf8-patch branch from 1458995 to 830fa6b Compare May 26, 2024 21:58

saghul approved these changes May 27, 2024

View reviewed changes

chqrlie merged commit 921c1ee into quickjs-ng:master May 27, 2024
47 checks passed

chqrlie deleted the utf8-patch branch May 27, 2024 06:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Simpler utf8_decode #414

Simpler utf8_decode #414

Uh oh!

chqrlie commented May 26, 2024

Uh oh!

saghul commented May 26, 2024

Uh oh!

chqrlie commented May 26, 2024 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Simpler utf8_decode #414

Simpler utf8_decode #414

Uh oh!

Conversation

chqrlie commented May 26, 2024

Uh oh!

saghul commented May 26, 2024

Uh oh!

chqrlie commented May 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chqrlie commented May 26, 2024 •

edited

Loading