Skip to content

Make :utf8 strict #19121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from
Closed

Make :utf8 strict #19121

wants to merge 8 commits into from

Conversation

tonycoz
Copy link
Contributor

@tonycoz tonycoz commented Sep 13, 2021

No description provided.

@jkeenan
Copy link
Contributor

jkeenan commented Sep 13, 2021

Strictly speaking, cpan/CPAN-Meta-YAML/t/11_read_string.t should be changed upstream first and synched into blead once a new CPAN version has been released.

How should we adapt to that?


=item *

C<strict> - disallows all encoding errors, non-characters, surrogates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-characters are now allowed in unicode; maybe strict should allow them? http://www.unicode.org/versions/corrigendum9.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK they were always allowed, but the current defaults are one of the few things retained from the code I started from.

@tonycoz
Copy link
Contributor Author

tonycoz commented Sep 13, 2021

I expect the defaults on strictness and possibly the way errors are handled to change before this is merged. Opinions welcome.

@Grinnz
Copy link
Contributor

Grinnz commented Sep 13, 2021

I am somewhat concerned about fallout from (as the doc and test updates indicate) :utf8 no longer being a flag on all :encoding layers. Not sure what the practical impact would be but this is unfortunately an interface that code dealing with layers had to work with.

@@ -366,8 +477,7 @@ You are supposed to use open() and binmode() to manipulate the stack.
B<Implementation details follow, please close your eyes.>

The arguments to layers are by default returned in parentheses after
the name of the layer, and certain layers (like C<:utf8>) are not real
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also is still the case, even if not for :utf8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've re-worked this, only the F_UTF8 flag ended up as a layer, and no longer does.

@@ -187,6 +284,21 @@ as such a layer assumes to be working with Perl's internal upgraded
encoding, so you will likely get a mangled result. Instead use C<:raw> or
C<:pop> to remove encoding layers.

# accept only valid Unicode, replacing everything else with the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These examples seem to be in the wrong section now

@khwilliamson
Copy link
Contributor

khwilliamson commented Sep 13, 2021 via email

@nwc10
Copy link
Contributor

nwc10 commented Sep 14, 2021

There's (at least one) leak:

$ LC_ALL=en_US.UTF-8 PERL_UNICODE="" PERL_DESTRUCT_LEVEL=2 valgrind --leak-check=full ./perl  -e0
==826044== Memcheck, a memory error detector
==826044== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==826044== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==826044== Command: ./perl -e0
==826044==
==826044==
==826044== HEAP SUMMARY:
==826044==     in use at exit: 111 bytes in 3 blocks
==826044==   total heap usage: 1,515 allocs, 1,512 frees, 215,148 bytes allocated
==826044==
==826044== 111 (37 direct, 74 indirect) bytes in 1 blocks are definitely lost in loss record 3 of 3
==826044==    at 0x483877F: malloc (vg_replace_malloc.c:307)
==826044==    by 0x2F8842: Perl_safesysmalloc (util.c:161)
==826044==    by 0x51092D: PerlIOUnicode_pushed (perlio.c:5214)
==826044==    by 0x507D1E: PerlIO_push (perlio.c:1156)
==826044==    by 0x508184: PerlIO_apply_layera (perlio.c:1264)
==826044==    by 0x5082B9: PerlIO_apply_layers (perlio.c:1284)
==826044==    by 0x50840C: PerlIO_binmode (perlio.c:1316)
==826044==    by 0x1A91C3: S_parse_body (perl.c:2504)
==826044==    by 0x1A77A8: perl_parse (perl.c:1853)
==826044==    by 0x154256: main (perlmain.c:109)
==826044==
==826044== LEAK SUMMARY:
==826044==    definitely lost: 37 bytes in 1 blocks
==826044==    indirectly lost: 74 bytes in 2 blocks
==826044==      possibly lost: 0 bytes in 0 blocks
==826044==    still reachable: 0 bytes in 0 blocks
==826044==         suppressed: 0 bytes in 0 blocks
==826044==
==826044== For lists of detected and suppressed errors, rerun with: -s
==826044== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

(Sorry, I didn't dig further to try to address this)

I assume that this leaks on a normal build, but this happens to be a build with -DPURIFY and ithreads.

(Found with a build that does -DPURIFY, ithreads, ASAN and MALLOC_PERTURB_=N MALLOC_CHECK_=2 PERL_DESTRUCT_LEVEL=2 TEST_JOBS=33 LC_ALL=en_US.UTF-8 PERL_UNICODE="" nice -19 make -j32 test_harness. I'd use en_GB but it's not installed on that machine :-()

@nwc10
Copy link
Contributor

nwc10 commented Sep 14, 2021

We can (and should, I feel) add ASAN and -DPURIFY to the CI workflow once #19115 is resolved.

@nwc10
Copy link
Contributor

nwc10 commented Sep 15, 2021

Leak is fixed. \o/

(Only the leaks seen on blead remain - see #19115)

s/UTF8_DISALLOW_ABOVE_31_BIT/UTF8_DISALLOW_PERL_EXTENDED/

The former name is misleading for both EBCDIC, where it is 30 bits, not
31, and for overlongs where something that is 63 bits can boil down to
something that is less than 31
@khwilliamson
Copy link
Contributor

The UTF-8 components of this look good to me; I looked at the PerlIO portions, and they look reasonable, but I'm not qualified to really review those

@Leont
Copy link
Contributor

Leont commented Sep 20, 2021

This fails t/io/utf8.t, dist/IO/t/io_sock.t and especially t/op/readline.t when PERLIO=stdio. I know why, and have been working on a branch that refactors readline to no longer have this issue (I've just pushed that as leont/perlio-readline). That was one of the main reasons why my branch on this had stalled.

@tonycoz
Copy link
Contributor Author

tonycoz commented Sep 23, 2021

This fails t/io/utf8.t, dist/IO/t/io_sock.t and especially t/op/readline.t when PERLIO=stdio. I know why, and have been working on a branch that refactors readline to no longer have this issue (I've just pushed that as leont/perlio-readline). That was one of the main reasons why my branch on this had stalled.

Thanks for pointing this out.

I don't think PerlIO::encoding is completely correct, and I'm not sure it can be with this interface.

The simplest case is if we're decoding EDCIDIC :encoding(posix-bc), then the byte to look for in the source character set will be \x15 rather than the \x0A that would be passed in on an ASCII system, this could be made to work with this interface.

The problem with the interface is with multibyte encodings - readdelim() only gets the last byte of the UTF-8 encoding of the delimiter, if the underlying encoding is Shift-JIS or some other non-UTF-8 encoding there's no way for PerlIO::encoding to find the final byte of the source encoding, since it only has partial information about the requested character.[1]

I'll see if I can fix those issues. Thanks again.

[1] one nightmare that I think we/Unicode have avoided is multiple "code points" in some source encoding mapping to one code point in Unicode, but IIRC Unicode always has distinct code points to match historical encodings.

@Leont
Copy link
Contributor

Leont commented Sep 23, 2021

I don't think PerlIO::encoding is completely correct, and I'm not sure it can be with this interface.

AFAIK it will read from the decoded buffer, but I must admit I haven't actually double-checked if this worked correctly (and other than my rebase I haven't touched this branch in a while)

@tonycoz tonycoz marked this pull request as draft September 27, 2021 05:55
@tonycoz
Copy link
Contributor Author

tonycoz commented Sep 28, 2021

It's possible I don't understand what you were trying to solve.

In the case I debugged, perlio was blocking in fread() when trying to read a line of text, rather than (indirectly) calling read() only once as a ":perlio" fill does.

The problem with fread() is it always tries to read the full amount needed, while we only need to read up to the delimiter.

The problem I see if there is any other substantial layer, like encoding, the call to PerlIO_fill() (via fill_count()) will cause the same problem, for example:

tony@venus:.../git/perl5$ cat ../19121-line-delay.pl
#!/usr/bin/perl
binmode STDOUT, ":encoding(UTF-8)";
++$|;
print "Hello\nxxxx";
sleep 20;
tony@venus:.../git/perl5$ ./perl -Ilib -MDevel::Peek -le 'my $s = time;open my $fh, "-|:encoding(UTF-8)", "../19121-line-delay.pl" or die; my $x = <$fh>; print time()-$s; print $x'
0
Hello

tony@venus:.../git/perl5$ PERLIO=stdio ./perl -Ilib -MDevel::Peek -le 'my $s = time;open my $fh, "-|:encoding(UTF-8)", "../19121-line-delay.pl" or die; my $x = <$fh>; print time()-$s; print $x'
20
Hello

tony@venus:.../git/perl5$ git status
On branch perlio-readline
Your branch is up to date with 'origin/leont/perlio-readline'.

As with the simpler cases, the stdio case here is blocking in fread().

For this to be a general solution I think the terminator needs to be supplied to the next layer, so in this case PerlIOStdio_readdelim() would be a wrapper around fgets().

@Leont
Copy link
Contributor

Leont commented Oct 2, 2021

It's possible I don't understand what you were trying to solve.

In the case I debugged, perlio was blocking in fread() when trying to read a line of text, rather than (indirectly) calling read() only once as a ":perlio" fill does.

The problem with fread() is it always tries to read the full amount needed, while we only need to read up to the delimiter.

Correct. The read method in PerlIO isn't well-defined, and these differences of semantics cause a lot of headache.

The problem I see if there is any other substantial layer, like encoding, the call to PerlIO_fill() (via fill_count()) will cause the same problem, for example:

Yeah, I don't really know how to fix this for :stdio:encoding, both of them work in ways that are really unfortunate. Though given no one but me seems to have ever noticed that problem despite it existing since 2002 it may not be that much of an issue.

@Leont
Copy link
Contributor

Leont commented Oct 2, 2021

Actually, I think setvbuf may allow us to fix up :stdio

ETA: it doesn't, we don't know the position inside of the buffer.

@tonycoz
Copy link
Contributor Author

tonycoz commented Oct 4, 2021

Yeah, I don't really know how to fix this for :stdio:encoding, both of them work in ways that are really unfortunate. Though given no one but me seems to have ever noticed that problem despite it existing since 2002 it may not be that much of an issue.

One way I could see it working would be for :encoding to encode the delimiter (assuming readdelim() takes a code point rather than a byte for utf8 flagged streams), and pass the final byte of the encoding to readdelim() of the next layer, but it has complex implications.

readdelim() for :stdio would need to loop on f?getc(), and hopefully the underlying stdio/libc equivalent to *_fill() would only read available bytes just as perlio does. There's not much we can do if it doesn't beyond disabling buffering entirely.

The main difficulty I see for :encoding or :utf8 with replacement is multiple source byte sequences mapping to the same unicode code point (this is where the complex implications come in)

For :utf8 if the delimiter was U+FFFD and if we're doing replacement on error then any invalid sequence could map to it.

For :encoding if the delimiter was \ and CHECK is FB_PERLQQ then we have a similar problem.

For :encoding there's also the possibility that multiple byte sequences in the encoding could map to the same code point, though I couldn't find any examples.

I could see us filling a byte at a time for readdelim() on the complex cases above.

@Leont
Copy link
Contributor

Leont commented Nov 5, 2021

The main difficulty I see for :encoding or :utf8 with replacement is multiple source byte sequences mapping to the same unicode code point (this is where the complex implications come in)

Yeah, that would be a real pain. I think ISO-2022 and MIME encodings suffer from this issue for any character, and possibly others as well.

For :utf8 if the delimiter was U+FFFD and if we're doing replacement on error then any invalid sequence could map to it.

For :encoding if the delimiter was \ and CHECK is FB_PERLQQ then we have a similar problem.

That seems like a fairly unlikely thing for anyone to do, but they could do it.

@khwilliamson
Copy link
Contributor

khwilliamson commented Nov 5, 2021 via email

@jkeenan
Copy link
Contributor

jkeenan commented Jul 3, 2022

Commenters: there's been no further discussion in this Draft p.r. in 8 months. Should we keep the ticket open?

@tonycoz
Copy link
Contributor Author

tonycoz commented Jul 3, 2022

I'll make a new PR once I've re-worked it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants