Make :utf8 strict #19121

tonycoz · 2021-09-13T01:08:24Z

No description provided.

lib/PerlIO.pm

jkeenan · 2021-09-13T01:28:09Z

Strictly speaking, cpan/CPAN-Meta-YAML/t/11_read_string.t should be changed upstream first and synched into blead once a new CPAN version has been released.

How should we adapt to that?

Grinnz · 2021-09-13T01:30:13Z

lib/PerlIO.pm

+
+=item *
+
+C<strict> - disallows all encoding errors, non-characters, surrogates


Non-characters are now allowed in Unicode; maybe strict should allow them? http://www.unicode.org/versions/corrigendum9.html

AFAIK they were always allowed, but the current defaults are one of the few things retained from the code I started from.

tonycoz · 2021-09-13T01:33:39Z

I expect the defaults on strictness and possibly the way errors are handled to change before this is merged. Opinions welcome.

Grinnz · 2021-09-13T01:36:10Z

I am somewhat concerned about fallout from (as the doc and test updates indicate) :utf8 no longer being a flag on all :encoding layers. Not sure what the practical impact would be but this is unfortunately an interface that code dealing with layers had to work with.
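For context, this is roughly how existing code tends to ask whether a handle is in UTF-8 mode: it looks for a `utf8` entry in the list returned by `PerlIO::get_layers`. The sketch below uses in-memory handles and a hypothetical helper, `handle_is_utf8`; it reflects the interface in released perls, not necessarily the behaviour on this branch.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the handle's layer list reports "utf8".
sub handle_is_utf8 {
    my ($fh) = @_;
    return scalar grep { $_ eq 'utf8' } PerlIO::get_layers($fh);
}

my $buf = "";
open my $raw, '<', \$buf or die "open: $!";
open my $enc, '<:encoding(UTF-8)', \$buf or die "open: $!";

printf "raw: %s\n", handle_is_utf8($raw) ? "utf8" : "bytes";
printf "enc: %s\n", handle_is_utf8($enc) ? "utf8" : "bytes";
```

Code written this way would stop recognising :encoding handles if the `utf8` entry disappeared from the list, which is the kind of fallout in question.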

Grinnz · 2021-09-13T02:03:49Z

lib/PerlIO.pm

@@ -366,8 +477,7 @@ You are supposed to use open() and binmode() to manipulate the stack.
 B<Implementation details follow, please close your eyes.>

 The arguments to layers are by default returned in parentheses after
-the name of the layer, and certain layers (like C<:utf8>) are not real


This also is still the case, even if not for :utf8.

I've re-worked this; only the F_UTF8 flag ended up as a layer, and it no longer does.

Grinnz · 2021-09-13T06:03:56Z

lib/PerlIO.pm

@@ -187,6 +284,21 @@ as such a layer assumes to be working with Perl's internal upgraded
 encoding, so you will likely get a mangled result.  Instead use C<:raw> or
 C<:pop> to remove encoding layers.

+   # accept only valid Unicode, replacing everything else with the


These examples seem to be in the wrong section now

khwilliamson · 2021-09-13T14:57:18Z

On 9/12/21 7:30 PM, Dan Book wrote:

non-characters are now allowed in unicode; maybe strict should allow them? http://www.unicode.org/versions/corrigendum9.html

No! That could lead to security problems. Code that relies on using non-characters as sentinels that could never be in outside input would now be exposed to them, creating an attack vector.

This Corrigendum was written because things like text editors and source code control were refusing to work on files that had such sentinels in them, which was never Unicode's intent. You were always supposed to be able to use these for your own purposes, while being assured you would not get inputs containing them. If you couldn't use them internally, there would be no point to them at all! But the writers of the meta-handlers, like source code control, had misread the Standard, leading to this Corrigendum. Such programs have to assume that they could be exposed to inputs containing them, and so can't use these code points as sentinels.

But the default for Perl cannot be to allow them. The Corrigendum does not allow free rein for their use in all circumstances, as if they were any other code point. That would also mean there is no point to them. These code points are to be used in limited circumstances; as such the default must not be to allow them, but it should be possible to specify that they are allowed. At the time the Corrigendum was issued, I checked with Unicode, and they concurred that Perl should not change.
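To make the sentinel argument concrete, here is a minimal sketch of how a strict input filter might recognise non-characters. The helpers `is_noncharacter` and `has_noncharacter` are hypothetical and are not taken from this PR; the ranges are the 66 non-characters Unicode defines (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of perlio.c: true if $cp is one of the
# 66 Unicode non-characters.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # 32 code points in the BMP
    return 1 if ($cp & 0xFFFE) == 0xFFFE;          # U+nFFFE and U+nFFFF
    return 0;
}

# Hypothetical filter: number of non-characters in a list of code points.
sub has_noncharacter {
    return scalar grep { is_noncharacter($_) } @_;
}

# Sanity check of the ranges: 32 + 2 * 17 planes.
my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_noncharacter($cp);
}
print "non-characters: $count\n";    # prints "non-characters: 66"
```

A program that uses, say, U+FFFF internally as an end-of-record marker can only rely on it if a check like this rejects the code point on input by default.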

nwc10 · 2021-09-14T07:10:02Z

There's (at least one) leak:

$ LC_ALL=en_US.UTF-8 PERL_UNICODE="" PERL_DESTRUCT_LEVEL=2 valgrind --leak-check=full ./perl  -e0
==826044== Memcheck, a memory error detector
==826044== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==826044== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==826044== Command: ./perl -e0
==826044==
==826044==
==826044== HEAP SUMMARY:
==826044==     in use at exit: 111 bytes in 3 blocks
==826044==   total heap usage: 1,515 allocs, 1,512 frees, 215,148 bytes allocated
==826044==
==826044== 111 (37 direct, 74 indirect) bytes in 1 blocks are definitely lost in loss record 3 of 3
==826044==    at 0x483877F: malloc (vg_replace_malloc.c:307)
==826044==    by 0x2F8842: Perl_safesysmalloc (util.c:161)
==826044==    by 0x51092D: PerlIOUnicode_pushed (perlio.c:5214)
==826044==    by 0x507D1E: PerlIO_push (perlio.c:1156)
==826044==    by 0x508184: PerlIO_apply_layera (perlio.c:1264)
==826044==    by 0x5082B9: PerlIO_apply_layers (perlio.c:1284)
==826044==    by 0x50840C: PerlIO_binmode (perlio.c:1316)
==826044==    by 0x1A91C3: S_parse_body (perl.c:2504)
==826044==    by 0x1A77A8: perl_parse (perl.c:1853)
==826044==    by 0x154256: main (perlmain.c:109)
==826044==
==826044== LEAK SUMMARY:
==826044==    definitely lost: 37 bytes in 1 blocks
==826044==    indirectly lost: 74 bytes in 2 blocks
==826044==      possibly lost: 0 bytes in 0 blocks
==826044==    still reachable: 0 bytes in 0 blocks
==826044==         suppressed: 0 bytes in 0 blocks
==826044==
==826044== For lists of detected and suppressed errors, rerun with: -s
==826044== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

(Sorry, I didn't dig further to try to address this)

I assume that this leaks on a normal build, but this happens to be a build with -DPURIFY and ithreads.

(Found with a build that does -DPURIFY, ithreads, ASAN and MALLOC_PERTURB_=N MALLOC_CHECK_=2 PERL_DESTRUCT_LEVEL=2 TEST_JOBS=33 LC_ALL=en_US.UTF-8 PERL_UNICODE="" nice -19 make -j32 test_harness. I'd use en_GB but it's not installed on that machine :-()

nwc10 · 2021-09-14T07:11:35Z

We can (and should, I feel) add ASAN and -DPURIFY to the CI workflow once #19115 is resolved.

Partly based on work done by Leon Timmermans, but has had extensive changes.

nwc10 · 2021-09-15T06:46:24Z

Leak is fixed. \o/

(Only the leaks seen on blead remain - see #19115)

s/UTF8_DISALLOW_ABOVE_31_BIT/UTF8_DISALLOW_PERL_EXTENDED/ The former name is misleading for both EBCDIC, where it is 30 bits, not 31, and for overlongs where something that is 63 bits can boil down to something that is less than 31

khwilliamson · 2021-09-20T00:43:30Z

The UTF-8 components of this look good to me; I looked at the PerlIO portions, and they look reasonable, but I'm not qualified to really review those

Leont · 2021-09-20T16:10:57Z

This fails t/io/utf8.t, dist/IO/t/io_sock.t and especially t/op/readline.t when PERLIO=stdio. I know why, and have been working on a branch that refactors readline to no longer have this issue (I've just pushed that as leont/perlio-readline). That was one of the main reasons why my branch on this had stalled.

tonycoz · 2021-09-23T01:16:59Z

This fails t/io/utf8.t, dist/IO/t/io_sock.t and especially t/op/readline.t when PERLIO=stdio. I know why, and have been working on a branch that refactors readline to no longer have this issue (I've just pushed that as leont/perlio-readline). That was one of the main reasons why my branch on this had stalled.

Thanks for pointing this out.

I don't think PerlIO::encoding is completely correct, and I'm not sure it can be with this interface.

The simplest case is decoding EBCDIC with :encoding(posix-bc): the byte to look for in the source character set will be \x15 rather than the \x0A that would be passed in on an ASCII system. This could be made to work with this interface.

The problem with the interface is with multibyte encodings: readdelim() only gets the last byte of the UTF-8 encoding of the delimiter. If the underlying encoding is Shift-JIS or some other non-UTF-8 encoding, there's no way for PerlIO::encoding to find the final byte in the source encoding, since it only has partial information about the requested character.[1]

I'll see if I can fix those issues. Thanks again.

[1] one nightmare that I think we/Unicode have avoided is multiple "code points" in some source encoding mapping to one code point in Unicode, but IIRC Unicode always has distinct code points to match historical encodings.
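The multibyte problem can be seen with the Encode module alone. This is only an illustration: the delimiter U+3042 is an arbitrary example, and `last_byte` is a hypothetical helper, not anything in PerlIO::encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the final byte of $char when encoded as $encoding.
sub last_byte {
    my ($encoding, $char) = @_;
    my $bytes = encode($encoding, $char);
    return ord(substr($bytes, -1));
}

# U+3042 HIRAGANA LETTER A, used here only as an example delimiter.
my $delim = "\x{3042}";

printf "UTF-8:     %02X\n", last_byte("UTF-8",    $delim);   # E3 81 82 -> 82
printf "Shift-JIS: %02X\n", last_byte("shiftjis", $delim);   # 82 A0    -> A0
```

A layer that is handed only the byte 0x82 cannot work out that it should be scanning the Shift-JIS stream for 0xA0.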

Leont · 2021-09-23T19:17:30Z

I don't think PerlIO::encoding is completely correct, and I'm not sure it can be with this interface.

AFAIK it will read from the decoded buffer, but I must admit I haven't actually double-checked if this worked correctly (and other than my rebase I haven't touched this branch in a while)

tonycoz · 2021-09-28T01:02:02Z

It's possible I don't understand what you were trying to solve.

In the case I debugged, perlio was blocking in fread() when trying to read a line of text, rather than (indirectly) calling read() only once as a ":perlio" fill does.

The problem with fread() is it always tries to read the full amount needed, while we only need to read up to the delimiter.

The problem I see is that if there is any other substantial layer, like encoding, the call to PerlIO_fill() (via fill_count()) will cause the same problem, for example:

tony@venus:.../git/perl5$ cat ../19121-line-delay.pl
#!/usr/bin/perl
binmode STDOUT, ":encoding(UTF-8)";
++$|;
print "Hello\nxxxx";
sleep 20;
tony@venus:.../git/perl5$ ./perl -Ilib -MDevel::Peek -le 'my $s = time;open my $fh, "-|:encoding(UTF-8)", "../19121-line-delay.pl" or die; my $x = <$fh>; print time()-$s; print $x'
0
Hello

tony@venus:.../git/perl5$ PERLIO=stdio ./perl -Ilib -MDevel::Peek -le 'my $s = time;open my $fh, "-|:encoding(UTF-8)", "../19121-line-delay.pl" or die; my $x = <$fh>; print time()-$s; print $x'
20
Hello

tony@venus:.../git/perl5$ git status
On branch perlio-readline
Your branch is up to date with 'origin/leont/perlio-readline'.

As with the simpler cases, the stdio case here is blocking in fread().

For this to be a general solution I think the terminator needs to be supplied to the next layer, so in this case PerlIOStdio_readdelim() would be a wrapper around fgets().
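As a Perl-level analogue of "read only up to the terminator", the sketch below pulls one character at a time and stops as soon as the delimiter is seen, instead of asking for a fixed-size block. `read_to_delim` is a hypothetical helper for illustration; it is not the proposed C-level readdelim() and says nothing about how stdio buffers underneath.

```perl
use strict;
use warnings;

# Hypothetical helper: read from $fh up to and including $delim (a single
# character), or to EOF. Returns undef when nothing is left to read.
sub read_to_delim {
    my ($fh, $delim) = @_;
    my $line = '';
    while (defined(my $ch = getc($fh))) {
        $line .= $ch;
        last if $ch eq $delim;    # stop here; do not wait for more input
    }
    return length($line) ? $line : undef;
}

# In-memory handle standing in for the pipe in the transcript above.
my $data = "Hello\nxxxx";
open my $fh, '<', \$data or die "open: $!";
print read_to_delim($fh, "\n");
```

The important property is that the loop never requests more input than it needs to find the terminator, which is what a blocking fread() of the full buffer size fails to do.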

Leont · 2021-10-02T13:10:34Z

It's possible I don't understand what you were trying to solve.

In the case I debugged, perlio was blocking in fread() when trying to read a line of text, rather than (indirectly) calling read() only once as a ":perlio" fill does.

The problem with fread() is it always tries to read the full amount needed, while we only need to read up to the delimiter.

Correct. The read method in PerlIO isn't well-defined, and these differences of semantics cause a lot of headaches.

The problem I see is that if there is any other substantial layer, like encoding, the call to PerlIO_fill() (via fill_count()) will cause the same problem, for example:

Yeah, I don't really know how to fix this for :stdio:encoding, both of them work in ways that are really unfortunate. Though given no one but me seems to have ever noticed that problem despite it existing since 2002 it may not be that much of an issue.

Leont · 2021-10-02T19:33:03Z

Actually, I think setvbuf may allow us to fix up :stdio

ETA: it doesn't, we don't know the position inside of the buffer.

tonycoz · 2021-10-04T04:07:06Z

Yeah, I don't really know how to fix this for :stdio:encoding, both of them work in ways that are really unfortunate. Though given no one but me seems to have ever noticed that problem despite it existing since 2002 it may not be that much of an issue.

One way I could see it working would be for :encoding to encode the delimiter (assuming readdelim() takes a code point rather than a byte for utf8 flagged streams), and pass the final byte of the encoding to readdelim() of the next layer, but it has complex implications.

readdelim() for :stdio would need to loop on f?getc(), and hopefully the underlying stdio/libc equivalent to *_fill() would only read available bytes just as perlio does. There's not much we can do if it doesn't beyond disabling buffering entirely.

The main difficulty I see for :encoding or :utf8 with replacement is multiple source byte sequences mapping to the same Unicode code point (this is where the complex implications come in).

For :utf8 if the delimiter was U+FFFD and if we're doing replacement on error then any invalid sequence could map to it.

For :encoding if the delimiter was \ and CHECK is FB_PERLQQ then we have a similar problem.

For :encoding there's also the possibility that multiple byte sequences in the encoding could map to the same code point, though I couldn't find any examples.

I could see us filling a byte at a time for readdelim() on the complex cases above.
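The many-to-one cases above can be reproduced with the existing Encode module, used here as a stand-in for what a replacing :utf8 or :encoding layer would see. `decode_copy` is a hypothetical wrapper; the CHECK constants are Encode's own.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper: decode a copy of $bytes as UTF-8 with the given CHECK.
sub decode_copy {
    my ($bytes, $check) = @_;
    return decode("UTF-8", $bytes, $check);
}

# Two invalid inputs and one genuine encoding of U+FFFD all decode to the
# same character under the default replacement behaviour.
for my $bytes ("\xFF", "\x80", "\xEF\xBF\xBD") {
    my $str = decode_copy($bytes, Encode::FB_DEFAULT());
    printf "%s -> U+%04X\n", unpack("H*", $bytes), ord($str);
}

# With FB_PERLQQ the invalid byte 0xFF is rendered as the text \xFF, so a
# backslash appears in the output although no 0x5C byte was read.
my $qq = decode_copy("\xFF", Encode::FB_PERLQQ());
print "FB_PERLQQ: $qq\n";
```

In both cases the delimiter cannot be located by scanning the source bytes for one fixed byte value, which is why a byte-at-a-time fill is the fallback suggested for these cases.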

Leont · 2021-11-05T20:42:05Z

The main difficulty I see for :encoding or :utf8 with replacement is multiple source byte sequences mapping to the same unicode code point (this is where the complex implications come in)

Yeah, that would be a real pain. I think ISO-2022 and MIME encodings suffer from this issue for any character, and possibly others as well.

For :utf8 if the delimiter was U+FFFD and if we're doing replacement on error then any invalid sequence could map to it.

For :encoding if the delimiter was \ and CHECK is FB_PERLQQ then we have a similar problem.

That seems like a fairly unlikely thing for anyone to do, but they could do it.

khwilliamson · 2021-11-05T22:58:36Z

On 11/5/21 2:42 PM, Leon Timmermans wrote:

The main difficulty I see for :encoding or :utf8 with replacement is multiple source byte sequences mapping to the same unicode code point (this is where the complex implications come in)

Yeah, that would be a real pain. I think ISO-2022 and MIME encodings suffer from this issue for any character, and possibly others as well.

For :utf8 if the delimiter was U+FFFD and if we're doing replacement on error then any invalid sequence could map to it.

For :encoding if the delimiter was \ and CHECK is FB_PERLQQ then we have a similar problem.

That seems like a fairly unlikely thing for anyone to do, but they could do it.

This is what the non-characters are for, in part. We don't have to support FFFD

…

jkeenan · 2022-07-03T18:05:44Z

Commenters: there's been no further discussion in this Draft p.r. in 8 months. Should we keep the ticket open?

tonycoz · 2022-07-03T23:54:58Z

I'll make a new PR once I've re-worked it.

tonycoz force-pushed the utf8-strict-again branch from bd6ba11 to e6ee70e · September 13, 2021 01:20