Commit ffea747

khwilliamson authored and nwc10 committed
Avoid some conditionals in is...UTF8_CHAR()
These three functions, which determine whether the next bytes of a string form a UTF-8 character (constrained in three different ways), have essentially the same short loop. One of the conditions in the while() test is always true on the first iteration. Moving that condition to the middle of the loop avoids it in the common case where the loop executes just once, which is when the input is a UTF-8 invariant character (ASCII on ASCII platforms).

If the functions were constrained to require that the first byte pointed to by the input exists, the while() could be a do {} while(), and calling this would cost no extra conditional compared with checking whether the next character is invariant and calling this only if it is not. There would also be fewer conditionals for characters of 2 or more bytes.
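The loop shapes described above can be compared in a small standalone sketch. Everything here is an assumption made for illustration: `toy_next()` is a hypothetical stand-in for the table lookup into `PL_*_utf8_dfa_tab`, it accepts some sequences that real UTF-8 validation rejects (overlongs, surrogates), and the `toy_len_*` names are not Perl functions. Only the state convention (0 = accept, 1 = reject) and the loop structure follow the commit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy transition function, NOT Perl's DFA tables.
 * State 0 = accept, 1 = reject, n >= 2 = (n - 1) continuation bytes
 * still expected.  It does not reject overlongs or surrogates. */
static unsigned toy_next(unsigned state, unsigned char b)
{
    if (state == 0) {                        /* start of a character */
        if (b < 0x80)           return 0;    /* invariant, one byte */
        if ((b & 0xE0) == 0xC0) return 2;    /* 1 continuation to come */
        if ((b & 0xF0) == 0xE0) return 3;    /* 2 continuations to come */
        if ((b & 0xF8) == 0xF0) return 4;    /* 3 continuations to come */
        return 1;                            /* stray or unsupported byte */
    }
    if ((b & 0xC0) != 0x80) return 1;        /* not a continuation byte */
    return state == 2 ? 0 : state - 1;
}

/* Shape of the loop before the commit: 'state != 1' is tested on every
 * iteration, including the first, where it is always true. */
static size_t toy_len_old(const unsigned char *s0, const unsigned char *e)
{
    const unsigned char *s = s0;
    unsigned state = 0;

    while (s < e && state != 1) {
        state = toy_next(state, *s);
        if (state != 0) {
            s++;
            continue;
        }
        return (size_t)(s - s0) + 1;
    }
    return 0;
}

/* Shape of the loop after the commit: the reject test runs only when the
 * character was not completed by the byte just consumed. */
static size_t toy_len_new(const unsigned char *s0, const unsigned char *e)
{
    const unsigned char *s = s0;
    unsigned state = 0;

    while (s < e) {
        state = toy_next(state, *s);
        s++;
        if (state == 0) return (size_t)(s - s0);
        if (state == 1) break;
    }
    return 0;
}

/* The do {} while() form the message speculates about.  It is only valid
 * under the assumed precondition that at least one byte exists. */
static size_t toy_len_nonempty(const unsigned char *s0, const unsigned char *e)
{
    const unsigned char *s = s0;
    unsigned state = 0;

    assert(s0 < e);                          /* assumed precondition */
    do {
        state = toy_next(state, *s);
        s++;
        if (state == 0) return (size_t)(s - s0);
        if (state == 1) break;
    } while (s < e);
    return 0;
}
```

For a one-byte invariant character, and not counting the work inside the transition step, the old shape evaluates three loop-level tests (`s < e`, `state != 1`, `state != 0`), the new shape two, and the do {} while() shape one. All three return the length of the character in bytes, or 0 if the input is empty, truncated, or malformed under the toy rules.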
1 parent 5678ce7 commit ffea747

inline.h

Lines changed: 26 additions & 17 deletions
@@ -1127,16 +1127,19 @@ Perl_isUTF8_CHAR(const U8 * const s0, const U8 * const e)
      * on 32-bit ASCII platforms where it trivially is an error).  Call a
      * helper function for the other platforms. */
 
-    while (s < e && LIKELY(state != 1)) {
-        state = PL_extended_utf8_dfa_tab[256
+    while (s < e) {
+        state = PL_extended_utf8_dfa_tab[ 256
                                           + state
                                           + PL_extended_utf8_dfa_tab[*s]];
-        if (state != 0) {
-            s++;
-            continue;
+        s++;
+
+        if (state == 0) {
+            return s - s0;
         }
 
-        return s - s0 + 1;
+        if (UNLIKELY(state == 1)) {
+            break;
+        }
     }
 
 #if defined(UV_IS_QUAD) || defined(EBCDIC)
@@ -1195,15 +1198,19 @@ Perl_isSTRICT_UTF8_CHAR(const U8 * const s0, const U8 * const e)
 
     PERL_ARGS_ASSERT_ISSTRICT_UTF8_CHAR;
 
-    while (s < e && LIKELY(state != 1)) {
-        state = PL_strict_utf8_dfa_tab[256 + state + PL_strict_utf8_dfa_tab[*s]];
+    while (s < e) {
+        state = PL_strict_utf8_dfa_tab[ 256
+                                      + state
+                                      + PL_strict_utf8_dfa_tab[*s]];
+        s++;
 
-        if (state != 0) {
-            s++;
-            continue;
+        if (state == 0) {
+            return s - s0;
         }
 
-        return s - s0 + 1;
+        if (UNLIKELY(state == 1)) {
+            break;
+        }
     }
 
 #ifndef EBCDIC
@@ -1261,15 +1268,17 @@ Perl_isC9_STRICT_UTF8_CHAR(const U8 * const s0, const U8 * const e)
 
     PERL_ARGS_ASSERT_ISC9_STRICT_UTF8_CHAR;
 
-    while (s < e && LIKELY(state != 1)) {
+    while (s < e) {
         state = PL_c9_utf8_dfa_tab[256 + state + PL_c9_utf8_dfa_tab[*s]];
+        s++;
 
-        if (state != 0) {
-            s++;
-            continue;
+        if (state == 0) {
+            return s - s0;
         }
 
-        return s - s0 + 1;
+        if (UNLIKELY(state == 1)) {
+            break;
+        }
     }
 
     return 0;
