Skip to content

Commit 6696cfa

Browse files
committed
Change mem_collxfrm() algorithm for embedded NULs
One of the problems in implementing Perl is that the C library routines forbid embedded NUL characters, which Perl accepts. This is true for the case of strxfrm() which handles collation under locale. The best solution as far as functionality goes, would be for Perl to write its own strxfrm replacement which would handle the specific needs of Perl. But that is not going to happen because of the huge complexity in handling it across many platforms. We would have to know the location and format of the locale definition files for every such platform. Some might follow POSIX guidelines, some might not. strxfrm creates a transformation of its input into a new string consisting of weight bytes. In the typical but general case, a 3 character NUL-terminated input string 'A B C 00' (spaces added for readability) gets transformed into something like: A¹ B¹ C¹ 01 A² B² C² 01 A³ B³ C³ 00 where the superscripted characters are weights for the corresponding input characters. Superscript 1 represents (essentially) the primary sorting key; 2, the secondary, etc, for as many levels as the locale definition gives. The 01 byte is likely to be the separator between levels, but not necessarily, and there could be some other mechanisms used on various platforms. To handle embedded NULs, the simplest thing would be to just remove them before passing in to strxfrm(). Then they would be entirely ignored, which might not be what you want. You might want them to have some weight at the tertiary level, for example. It also causes problems because strxfrm is very context sensitive. The locale definition can define weights for specific sequences of any length (and the weights can be multi-byte), and by removing a NUL, two characters now become adjacent that weren't in the input, and they could now form one of those special sequences and thus throw things off. Another way to handle NULs, that seemingly ignores them, but actually doesn't, is the mechanism in use prior to this commit. The input string is split at the NULs, and the substrings are independently passed to strxfrm, and the results concatenated together. This doesn't work either. In our example 'A B C 00', suppose B is a NUL, and should have some weight at the tertiary level. What we want is: A¹ C¹ 01 A² C² 01 A³ B³ C³ 00 But that's not at all what you get. Instead it is: A¹ 01 A² 01 A³ C¹ 01 C² 01 C³ 00 The primary weight of C comes immediately after the teriary weight of A, but more importantly, a NUL, instead of being ignored at the primary levels, is significant at all levels, so that "a\0c" would sort before "ab". Still another possibility is to replace the NUL with some other character before passing it to strxfrm. That was my original plan, to replace each NUL with the character that this code determines has the lowest collation order for the current locale. On strings that don't contain that character, the results would be as good as it gets for that locale. That character is likely to be ignored at higher weight levels, but have some small non-ignored weight at the lowest ones. And hopefully the character would rarely be encountered in practice. When it does happen, it and NUL would sort identically; hardly the end of the world. If the entire strings sorted identically, the NUL-containing one would come out before the other one, since the original Perl strings are used as a tie breaker. However, testing showed a problem with this. If that other character is part of a sequence that has special weighting, the results won't be correct. With gcc, U+00B4 ACUTE ACCENT is the lowest collating character in many UTF-8 locales. It combines in Romanian and Vietnamese with some other characters to change weights, and hence changing NULs into U+B4 screws things up. What I finally have come to is to do is a modification of this final approach, where the possible NUL replacements are limited to just characters that are controls in the locale. NULs are replaced by the lowest collating control. It would really be a defective locale if this control combined with some other character to form a special sequence. Often the character will be a 01, START OF HEADING. In the very unlikely case that there are absolutely no controls in the locale, 01 is used, because we have to replace it with something. The code added by this commit is mostly utf8-ready. A few commits from now will make Perl properly work with UTF-8 (if the platform supports it). But until that time, this isn't a full implementation; it only looks for the lowest-sorting control that is invariant, where the the UTF8ness doesn't matter. The added tests are marked as TODO until then.
1 parent 59c018b commit 6696cfa

File tree

9 files changed

+180
-24
lines changed

9 files changed

+180
-24
lines changed

embed.fnc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -909,7 +909,7 @@ Ap |I32 * |markstack_grow
909909
#if defined(USE_LOCALE_COLLATE)
910910
p |int |magic_setcollxfrm|NN SV* sv|NN MAGIC* mg
911911
: Defined in locale.c, used only in sv.c
912-
p |char* |mem_collxfrm |NN const char* s|STRLEN len|NN STRLEN* xlen
912+
p |char* |mem_collxfrm |NN const char* input_string|STRLEN len|NN STRLEN* xlen
913913
#endif
914914
Afpd |SV* |mess |NN const char* pat|...
915915
Apd |SV* |mess_sv |NN SV* basemsg|bool consume

embedvar.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -309,6 +309,7 @@
309309
#define PL_stderrgv (vTHX->Istderrgv)
310310
#define PL_stdingv (vTHX->Istdingv)
311311
#define PL_strtab (vTHX->Istrtab)
312+
#define PL_strxfrm_min_char (vTHX->Istrxfrm_min_char)
312313
#define PL_sub_generation (vTHX->Isub_generation)
313314
#define PL_subline (vTHX->Isubline)
314315
#define PL_subname (vTHX->Isubname)

intrpvar.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -564,6 +564,7 @@ PERLVAR(I, collation_name, char *) /* Name of current collation */
564564
PERLVAR(I, collxfrm_base, Size_t) /* Basic overhead in *xfrm() */
565565
PERLVARI(I, collxfrm_mult,Size_t, 2) /* Expansion factor in *xfrm() */
566566
PERLVARI(I, collation_ix, U32, 0) /* Collation generation index */
567+
PERLVARA(I, strxfrm_min_char, 3, char)
567568
PERLVARI(I, collation_standard, bool, TRUE)
568569
/* Assume simple collation */
569570
#endif /* USE_LOCALE_COLLATE */

lib/locale.t

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1740,7 +1740,7 @@ foreach my $Locale (@Locale) {
17401740

17411741
++$locales_test_number;
17421742
$test_names{$locales_test_number}
1743-
= 'TODO Skip in locales where \001 has primary sorting weight; '
1743+
= 'Skip in locales where \001 has primary sorting weight; '
17441744
. 'otherwise verify that \0 doesn\'t have primary sorting weight';
17451745
if ("a\001c" lt "ab") {
17461746
report_result($Locale, $locales_test_number, 1);
@@ -1749,6 +1749,19 @@ foreach my $Locale (@Locale) {
17491749
my $ok = "ab" lt "a\0c";
17501750
report_result($Locale, $locales_test_number, $ok);
17511751
}
1752+
1753+
++$locales_test_number;
1754+
$test_names{$locales_test_number}
1755+
= 'TODO Verify that strings with embedded NUL collate';
1756+
my $ok = "a\0a\0a" lt "a\001a\001a";
1757+
report_result($Locale, $locales_test_number, $ok);
1758+
1759+
++$locales_test_number;
1760+
$test_names{$locales_test_number}
1761+
= 'TODO Verify that strings with embedded NUL and '
1762+
. 'extra trailing NUL collate';
1763+
$ok = "a\0a\0" lt "a\001a\001";
1764+
report_result($Locale, $locales_test_number, $ok);
17521765
}
17531766

17541767
my $ok1;

locale.c

Lines changed: 138 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -486,6 +486,7 @@ Perl_new_collate(pTHX_ const char *newcoll)
486486
PL_collxfrm_base = 0;
487487
PL_collxfrm_mult = 2;
488488
PL_in_utf8_COLLATE_locale = FALSE;
489+
*PL_strxfrm_min_char = '\0';
489490
return;
490491
}
491492

@@ -500,6 +501,7 @@ Perl_new_collate(pTHX_ const char *newcoll)
500501
}
501502

502503
PL_in_utf8_COLLATE_locale = _is_cur_LC_category_utf8(LC_COLLATE);
504+
*PL_strxfrm_min_char = '\0';
503505

504506
/* A locale collation definition includes primary, secondary, tertiary,
505507
* etc. weights for each character. To sort, the primary weights are
@@ -1295,13 +1297,136 @@ Perl_init_i18nl10n(pTHX_ int printwarn)
12951297
*/
12961298

12971299
char *
1298-
Perl_mem_collxfrm(pTHX_ const char *s, STRLEN len, STRLEN *xlen)
1300+
Perl_mem_collxfrm(pTHX_ const char *input_string,
1301+
STRLEN len,
1302+
STRLEN *xlen
1303+
)
12991304
{
1305+
char * s = (char *) input_string;
1306+
STRLEN s_strlen = strlen(input_string);
13001307
char *xbuf;
1301-
STRLEN xAlloc, xin, xout; /* xalloc is a reserved word in VC */
1308+
STRLEN xAlloc, xout; /* xalloc is a reserved word in VC */
13021309

13031310
PERL_ARGS_ASSERT_MEM_COLLXFRM;
13041311

1312+
/* Replace any embedded NULs with the control that sorts before any others.
1313+
* This will give as good as possible results on strings that don't
1314+
* otherwise contain that character, but otherwise there may be
1315+
* less-than-perfect results with that character and NUL. This is
1316+
* unavoidable unless we replace strxfrm with our own implementation.
1317+
*
1318+
* XXX This code may be overkill. khw wrote it before realizing that if
1319+
* you change a NUL into some other character, that that may change the
1320+
* strxfrm results if that character is part of a sequence with other
1321+
* characters for weight calculations. To minimize the chances of this,
1322+
* now the replacement is restricted to another control (likely to be
1323+
* \001). But the full generality has been retained.
1324+
*
1325+
* This is one of the few places in the perl core, where we can use
1326+
* standard functions like strlen() and strcat(). It's because we're
1327+
* looking for NULs. */
1328+
if (s_strlen < len) {
1329+
char * e = s + len;
1330+
char * sans_nuls;
1331+
STRLEN cur_min_char_len;
1332+
1333+
/* If we don't know what control character sorts lowest for this
1334+
* locale, find it */
1335+
if (*PL_strxfrm_min_char == '\0') {
1336+
int j;
1337+
char * cur_min_x = NULL; /* Cur cp's xfrm, (except it also
1338+
includes the collation index
1339+
prefixed. */
1340+
1341+
/* Look through all legal code points (NUL isn't) */
1342+
for (j = 1; j < 256; j++) {
1343+
char * x; /* j's xfrm plus collation index */
1344+
STRLEN x_len; /* length of 'x' */
1345+
STRLEN trial_len = 1;
1346+
1347+
/* Create a 1 byte string of the current code point, but with
1348+
* room to be 2 bytes */
1349+
char cur_source[] = { (char) j, '\0' , '\0' };
1350+
1351+
if (PL_in_utf8_COLLATE_locale) {
1352+
if (! isCNTRL_L1(j)) {
1353+
continue;
1354+
}
1355+
1356+
/* If needs to be 2 bytes, find them */
1357+
if (! UVCHR_IS_INVARIANT(j)) {
1358+
continue; /* Can't handle variants yet */
1359+
}
1360+
}
1361+
else if (! isCNTRL_LC(j)) {
1362+
continue;
1363+
}
1364+
1365+
/* Then transform it */
1366+
x = mem_collxfrm(cur_source, trial_len, &x_len);
1367+
1368+
/* If something went wrong (which it shouldn't), just
1369+
* ignore this code point */
1370+
if ( x_len == 0
1371+
|| strlen(x + sizeof(PL_collation_ix)) < x_len)
1372+
{
1373+
continue;
1374+
}
1375+
1376+
/* If this character's transformation is lower than
1377+
* the current lowest, this one becomes the lowest */
1378+
if ( cur_min_x == NULL
1379+
|| strLT(x + sizeof(PL_collation_ix),
1380+
cur_min_x + sizeof(PL_collation_ix)))
1381+
{
1382+
strcpy(PL_strxfrm_min_char, cur_source);
1383+
cur_min_x = x;
1384+
}
1385+
else {
1386+
Safefree(x);
1387+
}
1388+
} /* end of loop through all bytes */
1389+
1390+
/* Unlikely, but possible, if there aren't any controls in the
1391+
* locale, arbitrarily use \001 */
1392+
if (cur_min_x == NULL) {
1393+
STRLEN x_len; /* temporary */
1394+
cur_min_x = mem_collxfrm("\001", 1, &x_len);
1395+
/* cur_min_cp was already initialized to 1 */
1396+
}
1397+
1398+
Safefree(cur_min_x);
1399+
}
1400+
1401+
/* The worst case length for the replaced string would be if every
1402+
* character in it is NUL. Multiply that by the length of each
1403+
* replacement, and allow for a trailing NUL */
1404+
cur_min_char_len = strlen(PL_strxfrm_min_char);
1405+
Newx(sans_nuls, (len * cur_min_char_len) + 1, char);
1406+
*sans_nuls = '\0';
1407+
1408+
1409+
/* Replace each NUL with the lowest collating control. Loop until have
1410+
* exhausted all the NULs */
1411+
while (s + s_strlen < e) {
1412+
strcat(sans_nuls, s);
1413+
1414+
/* Do the actual replacement */
1415+
strcat(sans_nuls, PL_strxfrm_min_char);
1416+
1417+
/* Move past the input NUL */
1418+
s += s_strlen + 1;
1419+
s_strlen = strlen(s);
1420+
}
1421+
1422+
/* And add anything that trails the final NUL */
1423+
strcat(sans_nuls, s);
1424+
1425+
/* Switch so below we transform this modified string */
1426+
s = sans_nuls;
1427+
len = strlen(s);
1428+
}
1429+
13051430
/* The first element in the output is the collation id, used by
13061431
* sv_collxfrm(); then comes the space for the transformed string. The
13071432
* equation should give us a good estimate as to how much is needed */
@@ -1316,17 +1441,16 @@ Perl_mem_collxfrm(pTHX_ const char *s, STRLEN len, STRLEN *xlen)
13161441

13171442
/* Then the transformation of the input. We loop until successful, or we
13181443
* give up */
1319-
for (xin = 0; xin < len; ) {
1320-
Size_t xused;
1321-
13221444
for (;;) {
1323-
xused = strxfrm(xbuf + xout, s + xin, xAlloc - xout);
1445+
STRLEN xused = strxfrm(xbuf + xout, s, xAlloc - xout);
13241446

13251447
/* If the transformed string occupies less space than we told
13261448
* strxfrm() was available, it means it successfully transformed
13271449
* the whole string. */
1328-
if ((STRLEN)xused < xAlloc - xout)
1450+
if (xused < xAlloc - xout) {
1451+
xout += xused;
13291452
break;
1453+
}
13301454

13311455
if (UNLIKELY(xused >= PERL_INT_MAX))
13321456
goto bad;
@@ -1340,19 +1464,20 @@ Perl_mem_collxfrm(pTHX_ const char *s, STRLEN len, STRLEN *xlen)
13401464
goto bad;
13411465
}
13421466

1343-
xin += strlen(s + xin) + 1;
1344-
xout += xused;
1467+
*xlen = xout - sizeof(PL_collation_ix);
1468+
13451469

1346-
/* Embedded NULs are understood but silently skipped
1347-
* because they make no sense in locale collation. */
1470+
if (s != input_string) {
1471+
Safefree(s);
13481472
}
13491473

1350-
xbuf[xout] = '\0';
1351-
*xlen = xout - sizeof(PL_collation_ix);
13521474
return xbuf;
13531475

13541476
bad:
13551477
Safefree(xbuf);
1478+
if (s != input_string) {
1479+
Safefree(s);
1480+
}
13561481
*xlen = 0;
13571482
return NULL;
13581483
}

pod/perldelta.pod

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,14 @@ here, but most should go in the L</Performance Enhancements> section.
2727

2828
[ List each enhancement as a =head2 entry ]
2929

30+
=head2 Better locale collation of strings containing embedded C<NUL>
31+
characters
32+
33+
In locales that have multi-level character weights, these are now
34+
ignored at the higher priority ones. There are still some gotchas in
35+
some strings, though. See
36+
L<perllocale/Collation of strings containing embedded C<NUL> characters>.
37+
3038
=head1 Security
3139

3240
XXX Any security-related notices go here. In particular, any security

pod/perllocale.pod

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1567,13 +1567,14 @@ called, and whatever it does is what you get.
15671567

15681568
=head2 Collation of strings containing embedded C<NUL> characters
15691569

1570-
Perl handles C<NUL> characters in the middle of strings. In many
1571-
locales, control characters are ignored unless the strings otherwise
1572-
compare equal. Unlike other control characters, C<NUL> characters are
1573-
never ignored. For example, if given that C<"b"> sorts after
1574-
C<"\001">, and C<"c"> sorts after C<"b">, C<"a\0c"> always sorts before
1575-
C<"ab">. This is true even in locales in which C<"ab"> sorts before
1576-
C<"a\001c">.
1570+
C<NUL> characters will sort the same as the lowest collating control
1571+
character does, or to C<"\001"> in the unlikely event that there are no
1572+
control characters at all in the locale. In cases where the strings
1573+
don't contain this non-C<NUL> control, the results will be correct, and
1574+
in many locales, this control, whatever it might be, will rarely be
1575+
encountered. But there are cases where a C<NUL> should sort before this
1576+
control, but doesn't. If two strings do collate identically, the one
1577+
containing the C<NUL> will sort to earlier.
15771578

15781579
=head2 Broken systems
15791580

proto.h

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -5785,9 +5785,9 @@ STATIC char* S_stdize_locale(pTHX_ char* locs);
57855785
PERL_CALLCONV int Perl_magic_setcollxfrm(pTHX_ SV* sv, MAGIC* mg);
57865786
#define PERL_ARGS_ASSERT_MAGIC_SETCOLLXFRM \
57875787
assert(sv); assert(mg)
5788-
PERL_CALLCONV char* Perl_mem_collxfrm(pTHX_ const char* s, STRLEN len, STRLEN* xlen);
5788+
PERL_CALLCONV char* Perl_mem_collxfrm(pTHX_ const char* input_string, STRLEN len, STRLEN* xlen);
57895789
#define PERL_ARGS_ASSERT_MEM_COLLXFRM \
5790-
assert(s); assert(xlen)
5790+
assert(input_string); assert(xlen)
57915791
/* PERL_CALLCONV char* sv_collxfrm(pTHX_ SV *const sv, STRLEN *const nxp); */
57925792
PERL_CALLCONV char* Perl_sv_collxfrm_flags(pTHX_ SV *const sv, STRLEN *const nxp, I32 const flags);
57935793
#define PERL_ARGS_ASSERT_SV_COLLXFRM_FLAGS \

t/porting/libperl.t

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -527,6 +527,13 @@ for my $symbol (sort keys %unexpected) {
527527
SKIP: {
528528
skip("uses sprintf for Gconvert in sv.o");
529529
}
530+
}
531+
elsif ( $symbol eq 'strcat'
532+
&& @o == 1 && $o[0] eq 'locale.o')
533+
{
534+
SKIP: {
535+
skip("locale.o legitimately uses strcat");
536+
}
530537
} else {
531538
is(@o, 0, "uses no $symbol (@o)");
532539
}

0 commit comments

Comments
 (0)