Make :utf8 strict #19121
Force-pushed: bd6ba11 to e6ee70e
Strictly speaking, How should we adapt to that?
=item *

C<strict> - disallows all encoding errors, non-characters, surrogates
non-characters are now allowed in unicode; maybe strict should allow them? http://www.unicode.org/versions/corrigendum9.html
AFAIK they were always allowed, but the current defaults are one of the few things retained from the code I started from.
I expect the defaults on strictness and possibly the way errors are handled to change before this is merged. Opinions welcome.
I am somewhat concerned about fallout from (as the doc and test updates indicate) :utf8 no longer being a flag on all :encoding layers. Not sure what the practical impact would be but this is unfortunately an interface that code dealing with layers had to work with.
Force-pushed: e6ee70e to a54cf9a
@@ -366,8 +477,7 @@ You are supposed to use open() and binmode() to manipulate the stack.
B<Implementation details follow, please close your eyes.>

The arguments to layers are by default returned in parentheses after
the name of the layer, and certain layers (like C<:utf8>) are not real
This also is still the case, even if not for :utf8.
I've re-worked this: only the F_UTF8 flag ended up as a layer, and it no longer does.
Force-pushed: a54cf9a to cccae3c
@@ -187,6 +284,21 @@ as such a layer assumes to be working with Perl's internal upgraded
encoding, so you will likely get a mangled result. Instead use C<:raw> or
C<:pop> to remove encoding layers.

# accept only valid Unicode, replacing everything else with the
These examples seem to be in the wrong section now
On 9/12/21 7:30 PM, Dan Book wrote:
non-characters are now allowed in unicode; maybe strict should allow
them? http://www.unicode.org/versions/corrigendum9.html
No!
That could lead to security problems.
Code that relies on using non-characters as sentinels that could never
be in outside input would now be exposed to them, creating an attack vector.
This Corrigendum was written because things like text editors and source
code control were refusing to work on files that had such sentinels in
them, which was never Unicode's intent. You were always supposed to be
able to use these for your own purposes, while being assured you would
not get inputs containing them. If you couldn't use them internally,
there would be no point to them at all!
But the writers of the meta-handlers, like source code control, had
misread the Standard, leading to this Corrigendum. Such programs have
to assume that they could be exposed to inputs containing them, and so
can't use these code points as sentinels.
But the default for Perl cannot be to allow them. The Corrigendum does
not allow free rein for their use in all circumstances, as if they were
any other code point. That would also mean there is no point to them.
These code points are to be used in limited circumstances; as such the
default must not be to allow them, but it should be possible to specify
that they are allowed.
At the time the Corrigendum was issued, I checked with Unicode, and
they concurred that Perl should not change.
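To make the categories in this exchange concrete, here is a minimal Python sketch (not Perl's implementation, and not code from this PR). It classifies surrogates and non-characters, and shows a "reject by default, allow on request" policy of the kind argued for above. The names `is_surrogate`, `is_noncharacter`, `strict_decode` and `allow_nonchars` are hypothetical.

```python
def is_surrogate(cp: int) -> bool:
    """True for the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacters: U+FDD0..U+FDEF plus the
    last two code points of every plane (U+xxFFFE and U+xxFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def strict_decode(data: bytes, allow_nonchars: bool = False) -> str:
    """Hypothetical helper: decode UTF-8 and reject noncharacters unless
    the caller explicitly opts in. Python's decoder already rejects
    malformed sequences and encoded surrogates on its own."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on bad input
    if not allow_nonchars:
        for ch in text:
            if is_noncharacter(ord(ch)):
                raise ValueError(f"noncharacter U+{ord(ch):04X} in input")
    return text
```

With this shape, code that uses a non-character as an internal sentinel never sees one arrive from outside unless it asks for that behaviour.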
There's (at least one) leak:
(Sorry, I didn't dig further to try to address this) I assume that this leaks on a normal build, but this happens to be a build with (Found with a build that does
We can (and should, I feel) add ASAN and
Partly based on work done by Leon Timmermans, but has had extensive changes.
Force-pushed: cccae3c to cd4a22e
Leak is fixed. \o/ (Only the leaks seen on blead remain - see #19115)
s/UTF8_DISALLOW_ABOVE_31_BIT/UTF8_DISALLOW_PERL_EXTENDED/ The former name is misleading both for EBCDIC, where it is 30 bits, not 31, and for overlongs, where something that is 63 bits can boil down to something that is less than 31.
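The overlong point can be illustrated with a small Python sketch. `naive_decode` is a hypothetical, deliberately lax decoder that just concatenates payload bits; it is not Perl's decoder and does not model Perl's extended UTF-8. It shows that the length of a sequence says little about the size of the value it decodes to.

```python
def naive_decode(seq: bytes) -> int:
    """Decode one UTF-8-style sequence by concatenating payload bits,
    WITHOUT checking for overlong forms (illustration only)."""
    lead = seq[0]
    if lead < 0x80:
        return lead
    # Count the leading 1 bits of the lead byte to get the sequence length.
    length = 0
    while lead & (0x80 >> length):
        length += 1
    cp = lead & (0x7F >> length)        # payload bits of the lead byte
    for byte in seq[1:length]:
        cp = (cp << 6) | (byte & 0x3F)  # six payload bits per continuation
    return cp


# Both of these are overlong spellings of U+002F ("/").
two_byte = naive_decode(b"\xc0\xaf")
four_byte = naive_decode(b"\xf0\x80\x80\xaf")
```

A four-byte sequence has room for 21 payload bits, yet the value here fits in 7, which is the same effect described above at a smaller scale. A strict decoder rejects both spellings.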
The UTF-8 components of this look good to me; I looked at the PerlIO portions, and they look reasonable, but I'm not qualified to really review those.
This fails
Thanks for pointing this out. I don't think PerlIO::encoding is completely correct, and I'm not sure it can be with this interface.

The simplest case is if we're decoding EBCDIC

The problem with the interface is with multibyte encodings - readdelim() only gets the last byte of the UTF-8 encoding of the delimiter. If the underlying encoding is Shift-JIS or some other non-UTF-8 encoding, there's no way for PerlIO::encoding to find the final byte of the source encoding, since it only has partial information about the requested character.[1]

I'll see if I can fix those issues. Thanks again.

[1] one nightmare that I think we/Unicode have avoided is multiple "code points" in some source encoding mapping to one code point in Unicode, but IIRC Unicode always has distinct code points to match historical encodings.
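A short Python sketch of the partial-information problem, using Python's own codecs rather than PerlIO::encoding. The delimiter U+30BD is chosen purely as an example; nothing in the thread names a specific character.

```python
delimiter = "\u30bd"  # KATAKANA LETTER SO, an arbitrary example delimiter

utf8_bytes = delimiter.encode("utf-8")
sjis_bytes = delimiter.encode("shift_jis")

# What a byte-oriented readdelim() would be handed: the last UTF-8 byte.
utf8_final = utf8_bytes[-1]
# What actually ends the character in the underlying Shift-JIS stream.
sjis_final = sjis_bytes[-1]

print(f"UTF-8:     {utf8_bytes.hex(' ')}  final byte 0x{utf8_final:02x}")
print(f"Shift-JIS: {sjis_bytes.hex(' ')}  final byte 0x{sjis_final:02x}")
```

The final byte of the UTF-8 form tells the lower layer nothing about the final byte of the Shift-JIS form. In this particular case the Shift-JIS trail byte is also the ASCII code for a backslash, so even knowing it would not be enough to locate the character with a single-byte search.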
AFAIK it will read from the decoded buffer, but I must admit I haven't actually double-checked if this worked correctly (and other than my rebase I haven't touched this branch in a while).
It's possible I don't understand what you were trying to solve.

In the case I debugged, perlio was blocking in fread() when trying to read a line of text, rather than (indirectly) calling read() only once as a ":perlio" fill does. The problem with fread() is that it always tries to read the full amount needed, while we only need to read up to the delimiter.

The problem I see is that if there is any other substantial layer, like encoding, the call to PerlIO_fill() (via fill_count()) will cause the same problem, for example:
As with the simpler cases, the stdio case here is blocking in fread(). For this to be a general solution I think the terminator needs to be supplied to the next layer, so in this case PerlIOStdio_readdelim() would be a wrapper around fgets().
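The difference between the two read styles can be sketched in Python with a pipe whose writer stays open. This is only an analogy for the behaviour described above, not PerlIO code; `read_until` is a hypothetical helper.

```python
import os


def read_until(fd: int, delim: bytes, bufsize: int = 8192) -> bytes:
    """Read from fd until delim has been seen, using read() calls that
    return as soon as any data is available (like a :perlio fill).
    Bytes after the delimiter are returned too, since a real layer
    would keep them in its buffer."""
    buf = b""
    while delim not in buf:
        chunk = os.read(fd, bufsize)  # returns available bytes, may be short
        if not chunk:                 # EOF
            break
        buf += chunk
    return buf


r, w = os.pipe()
os.write(w, b"hello\npartial")  # the writer stays open, so there is no EOF yet
data = read_until(r, b"\n")
line, _, rest = data.partition(b"\n")
os.close(r)
os.close(w)
```

A loop that insisted on collecting `bufsize` bytes before looking for the delimiter, which is how the fread() behaviour is described above, would block here because the writer has neither sent more data nor closed its end.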
Correct. The read method in PerlIO isn't well-defined, and these differences of semantics cause a lot of headache.
Yeah, I don't really know how to fix this for
Actually, I think ETA: it doesn't, we don't know the position inside of the buffer.
One way I could see it working would be for :encoding to encode the delimiter (assuming readdelim() takes a code point rather than a byte for utf8 flagged streams), and pass the final byte of the encoding to readdelim() of the next layer, but it has complex implications.

readdelim() for :stdio would need to loop on f?getc(), and hopefully the underlying stdio/libc equivalent to *_fill() would only read available bytes just as perlio does. There's not much we can do if it doesn't beyond disabling buffering entirely.

The main difficulty I see for :encoding or :utf8 with replacement is multiple source byte sequences mapping to the same Unicode code point (this is where the complex implications come in).

For :utf8 if the delimiter was U+FFFD and if we're doing replacement on error then any invalid sequence could map to it.

For :encoding if the delimiter was \ and CHECK is FB_PERLQQ then we have a similar problem.

For :encoding there's also the possibility that multiple byte sequences in the encoding could map to the same code point, though I couldn't find any examples.

I could see us filling a byte at a time for readdelim() on the complex cases above.
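The U+FFFD case can be shown with Python's `errors="replace"` decoding, which stands in here for ":utf8 with replacement" (an analogy, not the PR's code). `find_delim_by_final_byte` is a hypothetical byte-level search of the kind a lower layer would have to perform.

```python
REPLACEMENT = "\ufffd"

valid = REPLACEMENT.encode("utf-8")  # the genuine encoding of U+FFFD
invalid = b"\xff"                    # never valid in UTF-8

# After decoding with replacement, the two inputs are indistinguishable.
decoded_valid = valid.decode("utf-8", errors="replace")
decoded_invalid = invalid.decode("utf-8", errors="replace")


def find_delim_by_final_byte(raw: bytes, delim: str) -> int:
    """Hypothetical byte-level search: look for the last byte of the
    delimiter's UTF-8 encoding in the raw, undecoded input."""
    return raw.find(delim.encode("utf-8")[-1:])


raw = b"ab\xffcd"
byte_level_hit = find_delim_by_final_byte(raw, REPLACEMENT)
char_level_hit = raw.decode("utf-8", errors="replace").find(REPLACEMENT)
```

The byte-level search misses the delimiter that the decoded stream contains, because the raw input never held the bytes of U+FFFD at all.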
Yeah, that would be a real pain. I think ISO-2022 and MIME encodings suffer from this issue for any character, and possibly others as well.
That seems like a fairly unlikely thing for anyone to do, but they could do it.
On 11/5/21 2:42 PM, Leon Timmermans wrote:
The main difficulty I see for :encoding or :utf8 with replacement is
multiple source byte sequences mapping to the same unicode code
point (this is where the complex implications come in)
Yeah, that would be a real pain. I think ISO-2022 and MIME encodings
suffer from this issue for any character, and possibly others as well.
For :utf8 if the delimiter was U+FFFD and if we're doing replacement
on error then any invalid sequence could map to it.
For :encoding if the delimiter was \ and CHECK is FB_PERLQQ then we
have a similar problem.
That seems like a fairly unlikely thing for anyone to do, but they could
do it.
This is what the non-characters are for, in part. We don't have to
support FFFD
…
Commenters: there's been no further discussion in this Draft p.r. in 8 months. Should we keep the ticket open?
I'll make a new PR once I've re-worked it.