Skip to content

Referring to floating $x under locale changes its string representation #11872

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
p5pRT opened this issue Jan 16, 2012 · 27 comments
Closed

Referring to floating $x under locale changes its string representation #11872

p5pRT opened this issue Jan 16, 2012 · 27 comments

Comments

@p5pRT
Copy link

p5pRT commented Jan 16, 2012

Migrated from rt.perl.org#108378 (status was 'resolved')

Searchable as RT108378$

@p5pRT
Copy link
Author

p5pRT commented Jan 16, 2012

From @khwilliamson

This is a bug report for perl from khw@​karl.(none),
generated with the help of perlbug 1.39 running under perl 5.15.6.


In lib/locale.t $x is set to 1.23 and then used on the rhs of various
expressions without changing it. If this happens in a locale in which
the radix character is say a comma, "$x" gets changed to 1,23, even
though $x was never explicitly modified. Then, when "$x" is referred to
in a scope without locale, the comma is retained, potentially screwing
things up. Thus for example, '$e = "$x"' gives $e a comma and $e is
considered to be non-numeric. Saying instead '$e = $x' does make $e numeric.



Flags​:
  category=core
  severity=low


Site configuration information for perl 5.15.6​:

Configured by khw at Mon Jan 16 10​:15​:54 MST 2012.

Summary of my perl5 (revision 5 version 15 subversion 6) configuration​:
  Commit id​: 86e2f32
  Platform​:
  osname=linux, osvers=2.6.35-31-generic-pae,
archname=i686-linux-thread-multi-64int-ld
  uname='linux karl 2.6.35-31-generic-pae #63-ubuntu smp mon nov 28
20​:48​:50 utc 2011 i686 gnulinux '
  config_args='-des -Dprefix=/home/khw/blead -Dusedevel
-D'optimize=-ggdb3' -A'optimize=-ggdb3' -A'optimize=-O0' -Dman1dir=none
-Dman3dir=none -DDEBUGGING -Dusemorebits -Dusethreads'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=undef, uselongdouble=define
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING
-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O0 -ggdb3',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING
-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.4.5', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long long', ivsize=8, nvtype='long double', nvsize=12,
Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib/../lib /usr/lib/../lib /lib /usr/lib
/usr/lib/i686-linux-gnu
  libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  libc=/lib/libc-2.12.1.so, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version='2.12.1'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -ggdb3 -ggdb3 -O0
-L/usr/local/lib -fstack-protector'

Locally applied patches​:


@​INC for perl 5.15.6​:

/home/khw/blead/lib/perl5/site_perl/5.15.6/i686-linux-thread-multi-64int-ld
  /home/khw/blead/lib/perl5/site_perl/5.15.6
  /home/khw/blead/lib/perl5/5.15.6/i686-linux-thread-multi-64int-ld
  /home/khw/blead/lib/perl5/5.15.6
  /home/khw/blead/lib/perl5/site_perl
  .


Environment for perl 5.15.6​:
  HOME=/home/khw
  LANG=en_US.UTF-8
  LANGUAGE=en_US​:en
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)

PATH=/home/khw/bin​:/home/khw/print/bin​:/bin​:/usr/local/sbin​:/usr/local/bin​:/usr/sbin​:/usr/bin​:/sbin​:/usr/games​:/home/khw/cxoffice/bin
  PERL5OPT=-w
  PERL_BADLANG (unset)
  SHELL=/bin/ksh

@p5pRT
Copy link
Author

p5pRT commented Feb 3, 2012

From @cpansprout

On Mon Jan 16 11​:04​:04 2012, public@​khwilliamson.com wrote​:

This is a bug report for perl from khw@​karl.(none),
generated with the help of perlbug 1.39 running under perl 5.15.6.

-----------------------------------------------------------------
In lib/locale.t $x is set to 1.23 and then used on the rhs of various
expressions without changing it. If this happens in a locale in which
the radix character is say a comma, "$x" gets changed to 1,23, even
though $x was never explicitly modified. Then, when "$x" is referred
to
in a scope without locale, the comma is retained, potentially screwing
things up. Thus for example, '$e = "$x"' gives $e a comma and $e is
considered to be non-numeric. Saying instead '$e = $x' does make $e
numeric.

I think that is why $# was deprecated and removed. It breaks number <->
string equivalence.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Feb 3, 2012

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented May 13, 2013

From @khwilliamson

On 05/12/2013 01​:44 PM, Nicholas Clark wrote​:

On Thu, Apr 25, 2013 at 06​:07​:11PM -0700, Jan Dubois wrote​:

The code that turns an NV into a PV uses the Gconvert() macro. It will
always use the current locale, regardless of any locale pragma being
in effect or not, AFAICT.

I guess this is the bug then. The problem of course is that saving
the current locale, setting the locale to "C", converting the number,
and then restoring the original locale is quite a bit of overhead for
the regular use case that isn't using locale.

So the simple call to Gconvert() in Perl_sv_2pv_flags() would then become​:

#ifdef USE_LOCALE_NUMERIC
char *loc = savepv(setlocale(LC_NUMERIC, NULL));
setlocale(LC_NUMERIC, "C");
#endif
Gconvert(SvNVX(sv), NV_DIG, 0, s);
#ifdef USE_LOCALE_NUMERIC
setlocale(LC_NUMERIC, loc);
Safefree(loc);
#endif

That looks rather expensive to me.

It turns out that it's not.

On Sat, Apr 27, 2013 at 10​:05​:03AM +0200, Steffen Mueller wrote​:

On 04/26/2013 08​:03 PM, Jan Dubois wrote​:

I think this is undesirable, and implicit conversion between strings
and numbers should *always* use the "C" locale.

Yes, please! Locales are just plain old crazy.

Here's the test code for the first 3 examples​:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv) {
int loop = 0;
char buffer[15 + 20];
if (argv[1] && !setlocale(LC_NUMERIC, argv[1])) {
perror("setlocale");
return 0;
}
do {
double d = loop + 0.5;
char *was = setlocale(LC_NUMERIC, NULL);

if \(\!was\) \{
    perror\("setlocale"\);
    return 0;
\}

if \(was\[0\] == 'C' && was\[1\] == '\\0'\) \{
    gcvt\(d\, 15\, buffer\);
\} else \{
    if \(\!setlocale\(LC\_NUMERIC\, "C"\)\) \{
    perror\("setlocale"\);
    return 0;
    \}

    gcvt\(d\, 15\, buffer\);

    if \(\!setlocale\(LC\_NUMERIC\, was\)\) \{
    perror\("setlocale"\);
    return 0;
    \}
\}
 \} while \(\+\+loop \< 10000000\);
 puts\(buffer\);
 return 0;

}

default (no macros defined) - do what we do now, don't setlocale at all.
This demonstrates that we get different results​:
[nicholas@​dromedary-001 test]$ ./locale-default hr_HR.iso88592
9999999,5
[nicholas@​dromedary-001 test]$ ./locale-default
9999999.5

always (-DALWAYS) - always setlocale to "C", and then back to what it was​:
[nicholas@​dromedary-001 test]$ ./locale-always
9999999.5
[nicholas@​dromedary-001 test]$ ./locale-always hr_HR.iso88592
9999999.5

smarter (-DSMARTER) - always setlocale to "C", but only set it back if it
wasn't C before. Output as before.

smartest (attached code) - setlocale NULL to read the locale, and then
set/restore if it's not C. This one turns out to be most interesting. Output
as before.

dumbbench says

default Rounded run time per iteration​: 6.767e+00 +/- 2.3e-02 (0.3%)
default (hr_HR) Rounded run time per iteration​: 6.755e+00 +/- 1.0e-02 (0.1%)
always Rounded run time per iteration​: 7.030e+00 +/- 1.7e-02 (0.2%)
always (hr_HR) Rounded run time per iteration​: 7.039e+00 +/- 1.7e-02 (0.2%)
smarter Rounded run time per iteration​: 7.006e+00 +/- 1.8e-02 (0.3%)
smarter (hr_HR) Rounded run time per iteration​: 7.020e+00 +/- 1.3e-02 (0.2%)
smartest Rounded run time per iteration​: 6.874e+00 +/- 1.8e-02 (0.3%)
smartest (hr_HR) Rounded run time per iteration​: 1.9302e+01 +/- 1.9e-02 (0.1%)

So about 4% slower. Apart from the one that looks like it should be more
efficient, which can be 186% slower. I have no clue why.

Note that the numeric conversion is cached. This is only going to be a
performance hit for code which needs to format a lot of numbers. If it's
a problem, we can also get a speedup by using integer formatting if the
value is safely an integer.

If it's *still* a problem we might like to investigate the Grisu3 algorithm
from http​://florian.loitsch.com/publications/dtoa-pldi2010.pdf

I believe that that has been coded for V8 and then extracted as
http​://code.google.com/p/double-conversion/
It's something like 4 times as fast for the 99% of values which it can cope
with. And we can force it to ignore locales (if the code even deals with
them. I've not looked that hard)

It's in C++, and I believe a 3 clause BSD licence. We could convert it to C.

Also, Python has considered using this, and then rejected it as not worth
the effort​: http​://bugs.python.org/issue12450

There doesn't seem to be much difference between the two approaches on x86_64
Linux. I'm wondering what would be a useful place to benchmark to see if
there is a difference - Win32? I would have thought that avoiding the
second setlocale() would be a win somewhere. But it is fractionally more code.

Anyway, I think it's worth doing. Just which of the two to use.

Nicholas Clark

It turns out that we already keep track of if we are in the C locale or
not for the numeric radix character. PL_numeric_standard contains that
information. This means that the most common case will only have a
single extra test, of this bit.

The attached draft patch demonstrates this. It also appears to fix
https://rt-archive.perl.org/perl5/Public/Bug/Display.html?id=108378 and
https://rt-archive.perl.org/perl5/Public/Bug/Display.html?id=115800

It does this by not setting SvPOK when in the scope of 'use locale'.
Without this, when upgrading an NV to a PVNV, the stringified version
retains whatever radix character it had at the time it was stringified,
regardless of whatever the current locale calls for. Even implicit
stringification, as in Jan's example​:

  $a = 1.2;
  $b = eval "$a + 1.5";

sets $a to "1,2" if the current locale calls for that radix, where it
remains, potentially causing problems down the road. (#108378 is about
this; the attached patch causes this example to work correctly.)

By not setting SvPOK, the NV remains an NV, and if a stringified version
is required, it must be recalculated each time; thus if the radix
changes by a call to setlocale, we will catch that.

It also has the side effect that the radix in every SVNV will be a dot,
as Steffen suggested, since we won't create an SVNV from an NV unless
outside the scope of locale, and we change to the C locale during each
such creation event, and the C locale uses a dot.

I believe that additionally, within the scope of 'use locale', any PVNV
should be treated as a plain NV, so that the radix won't be output as
what it was set to before the scope was entered.

I'm not familiar enough with this part of Perl to know if there is
something really wrong with this scheme; perhaps you do.

The problem with this patch is that it breaks code that has come to rely
on not needing 'use locale' to get locale behavior. The current
documentation is not totally clear about this. It says that you need
'use locale', but one example given doesn't include it. And the current
behavior is tested for in

lib/version/t/07locale.t

It doesn't do a 'use locale', and even if one is added, one test fails
because it relies on the stringification being retained outside the
scope of the 'use locale'.

Thus, this patch fixes some problems, and I believe makes locales work
more sanely, but breaks code that relies on behavior contrary to what
the documentation says (but in accord with an example in the documentation).

@p5pRT
Copy link
Author

p5pRT commented May 13, 2013

From @khwilliamson

0008-XXX-draft-patch.patch
From 02dc6aa79842af16c4129a89d249de100cbb5ff0 Mon Sep 17 00:00:00 2001
From: Karl Williamson <[email protected]>
Date: Mon, 13 May 2013 07:35:35 -0600
Subject: [PATCH 8/8] XXX draft patch

---
 sv.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/sv.c b/sv.c
index 3736ba8..4a39607 100644
--- a/sv.c
+++ b/sv.c
@@ -2894,6 +2894,7 @@ Perl_sv_2pv_flags(pTHX_ SV *const sv, STRLEN *const lp, const I32 flags)
 	Move(ptr, s, len, char);
 	s += len;
 	*s = '\0';
+        SvPOK_on(sv);
     }
     else if (SvNOK(sv)) {
 	if (SvTYPE(sv) < SVt_PVNV)
@@ -2907,7 +2908,35 @@ Perl_sv_2pv_flags(pTHX_ SV *const sv, STRLEN *const lp, const I32 flags)
 	    /* The +20 is pure guesswork.  Configure test needed. --jhi */
 	    s = SvGROW_mutable(sv, NV_DIG + 20);
 	    /* some Xenix systems wipe out errno here */
-	    Gconvert(SvNVX(sv), NV_DIG, 0, s);
+
+#ifndef USE_LOCALE_NUMERIC
+            Gconvert(SvNVX(sv), NV_DIG, 0, s);
+            SvPOK_on(sv);
+#else
+            /* Gconvert uses the current locale.  That's what we want if we're
+             * supposed to be using locales.  But if not, we want the result
+             * based on the C locale, so we need to change to the C locale
+             * during the Gconvert and then change back.  But if we're already
+             * in the C locale (PL_numeric_standard), no need to do any
+             * changing */
+            if (PL_numeric_standard || IN_LOCALE_RUNTIME) {
+                Gconvert(SvNVX(sv), NV_DIG, 0, s);
+            }
+            else {
+                char *loc = savepv(setlocale(LC_NUMERIC, NULL));
+                setlocale(LC_NUMERIC, "C");
+                Gconvert(SvNVX(sv), NV_DIG, 0, s);
+                setlocale(LC_NUMERIC, loc);
+                Safefree(loc);
+            }
+
+            /* The stringification should not "stick" if we are in locales, as
+             * a run-time change of the locale, or exiting the 'use locale'
+             * scope may require a different stringification */
+            if (! IN_LOCALE_RUNTIME) {
+                SvPOK_on(sv);
+            }
+#endif
 	    RESTORE_ERRNO;
 	    while (*s) s++;
 	}
@@ -2952,7 +2981,6 @@ Perl_sv_2pv_flags(pTHX_ SV *const sv, STRLEN *const lp, const I32 flags)
 	    *lp = len;
 	SvCUR_set(sv, len);
     }
-    SvPOK_on(sv);
     DEBUG_c(PerlIO_printf(Perl_debug_log, "0x%"UVxf" 2pv(%s)\n",
 			  PTR2UV(sv),SvPVX_const(sv)));
     if (flags & SV_CONST_RETURN)
-- 
1.8.1.3

@p5pRT
Copy link
Author

p5pRT commented May 13, 2013

From @nwc10

On Mon, May 13, 2013 at 08​:59​:19AM -0600, Karl Williamson wrote​:

It turns out that we already keep track of if we are in the C locale or
not for the numeric radix character. PL_numeric_standard contains that
information. This means that the most common case will only have a
single extra test, of this bit.

Nice to know.

The attached draft patch demonstrates this. It also appears to fix
https://rt-archive.perl.org/perl5/Public/Bug/Display.html?id=108378 and
https://rt-archive.perl.org/perl5/Public/Bug/Display.html?id=115800

It does this by not setting SvPOK when in the scope of 'use locale'.
Without this, when upgrading an NV to a PVNV, the stringified version
retains whatever radix character it had at the time it was stringified,
regardless of whatever the current locale calls for. Even implicit
stringification, as in Jan's example​:

$a = 1.2;
$b = eval "$a + 1.5";

sets $a to "1,2" if the current locale calls for that radix, where it
remains, potentially causing problems down the road. (#108378 is about
this; the attached patch causes this example to work correctly.)

Yes, this seems like a sane approach. I can't spot any real flaws yet.
(Meaning that I don't trust myself)

I believe that additionally, within the scope of 'use locale', any PVNV
should be treated as a plain NV, so that the radix won't be output as
what it was set to before the scope was entered.

I'm not familiar enough with this part of Perl to know if there is
something really wrong with this scheme; perhaps you do.

I am not familiar with it either. And, as you say, what we have now leaks
the radix from a PVNV. Right now, C<no locale> leaks in​:

$ cat locale.pl
#!/usr/bin/perl -w
use strict;

use POSIX '​:locale_h';

$a = 3.14;

$b = $b = "$a" if shift;

setlocale(LC_ALL, "hr_HR");

use locale;

print $a, "\n";

__END__
$ ./locale.pl 0
3,14
$ ./locale.pl 1
3.14

in that if C<use locale> is supposed to be affecting numeric conversions
within its scope, even the implicit conversions, then it should be "3,14"
in both cases, and not happen to be "3.14" just because of the side effects
of an earlier use in a string context.

But that one feels harder to fix. Seems that one has to rid all the SvPV*
macros to ignore the cached string in the scope of C<use locale>

Which feels like it's going to be ugly. To the point that I'm not sure if
it's workable.

Thus, this patch fixes some problems, and I believe makes locales work
more sanely, but breaks code that relies on behavior contrary to what
the documentation says (but in accord with an example in the documentation).

We're allowed to be wrong, and correct our mistakes.

Although generally it helps when doing this that it's clear that the new
model of things is more self-consistent than the old model :-)

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented May 14, 2013

From @Leont

On Mon, May 13, 2013 at 4​:59 PM, Karl Williamson
<public@​khwilliamson.com> wrote​:

It turns out that we already keep track of if we are in the C locale or not
for the numeric radix character. PL_numeric_standard contains that
information. This means that the most common case will only have a single
extra test, of this bit.

The attached draft patch demonstrates this. It also appears to fix
https://rt-archive.perl.org/perl5/Public/Bug/Display.html?id=108378 and
https://rt-archive.perl.org/perl5/Public/Bug/Display.html?id=115800

It does this by not setting SvPOK when in the scope of 'use locale'. Without
this, when upgrading an NV to a PVNV, the stringified version retains
whatever radix character it had at the time it was stringified, regardless
of whatever the current locale calls for. Even implicit stringification, as
in Jan's example​:

$a = 1.2;
$b = eval "$a + 1.5";

sets $a to "1,2" if the current locale calls for that radix, where it
remains, potentially causing problems down the road. (#108378 is about
this; the attached patch causes this example to work correctly.)

By not setting SvPOK, the NV remains an NV, and if a stringified version is
required, it must be recalculated each time; thus if the radix changes by a
call to setlocale, we will catch that.

It also has the side effect that the radix in every SVNV will be a dot, as
Steffen suggested, since we won't create an SVNV from an NV unless outside
the scope of locale, and we change to the C locale during each such creation
event, and the C locale uses a dot.

This all sounds sensible, however

I believe that additionally, within the scope of 'use locale', any PVNV
should be treated as a plain NV, so that the radix won't be output as what
it was set to before the scope was entered.

I'm not familiar enough with this part of Perl to know if there is something
really wrong with this scheme; perhaps you do.

I think you mean any SvNOK PVNV. PVNV can also contain cached.
non-authoritative NV values. I think that should include SvNOK PVMG's
and PVLV's too. Not sure how workable this is.

Also, you're suggesting that in​:

perl -MDevel​::Peek -E 'my $num = "1.2"; log $num; say Dump $num;'

the string "1.2" should be printed as a "1,2" because the variable
happens to have been used in numeric context. I'm not sure that's
sane. This is one of those places where not having proper types is
painful.

The problem with this patch is that it breaks code that has come to rely on
not needing 'use locale' to get locale behavior. The current documentation
is not totally clear about this. It says that you need 'use locale', but
one example given doesn't include it. And the current behavior is tested
for in

lib/version/t/07locale.t

It doesn't do a 'use locale', and even if one is added, one test fails
because it relies on the stringification being retained outside the scope of
the 'use locale'.

That's just wrong IMO.

Leon

@p5pRT
Copy link
Author

p5pRT commented May 14, 2013

From @ap

* Nicholas Clark <nick@​ccl4.org> [2013-05-13 21​:55]​:

And, as you say, what we have now leaks the radix from a PVNV. Right
now, C<no locale> leaks in​:

$ cat locale.pl
#!/usr/bin/perl -w
use strict;

use POSIX '​:locale_h';

$a = 3.14;

$b = $b = "$a" if shift;

setlocale(LC_ALL, "hr_HR");

use locale;

print $a, "\n";

__END__
$ ./locale.pl 0
3,14
$ ./locale.pl 1
3.14

in that if C<use locale> is supposed to be affecting numeric conversions
within its scope, even the implicit conversions, then it should be "3,14"
in both cases, and not happen to be "3.14" just because of the side effects
of an earlier use in a string context.

But that one feels harder to fix. Seems that one has to rid all the SvPV*
macros to ignore the cached string in the scope of C<use locale>

Which feels like it's going to be ugly. To the point that I'm not sure if
it's workable.

This is a case where if the initial SV type were marked as a canonical,
a fix would be trivial, no?

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>

@p5pRT
Copy link
Author

p5pRT commented May 16, 2013

From @khwilliamson

On 05/13/2013 06​:25 PM, Aristotle Pagaltzis wrote​:

This is a case where if the initial SV type were marked as a canonical,
a fix would be trivial, no?

I believe so

@p5pRT
Copy link
Author

p5pRT commented May 16, 2013

From @nwc10

On Thu, May 16, 2013 at 08​:58​:27AM -0600, Karl Williamson wrote​:

On 05/13/2013 06​:25 PM, Aristotle Pagaltzis wrote​:

This is a case where if the initial SV type were marked as a canonical,
a fix would be trivial, no?

I believe so

No.

Certainly not directly. Knowing that something is canonically a number
doesn't solve the problem - that with locales a *number* can have more than
one "correct" stringification, and we have no way to store two
cached stringifications, let alone tag them such that the correct one is
returned in context.

For example

  $a = 3.14;
  $a =~ /,/;

Should that match?

If locales are affecting stringification, and that is running within the
scope of C<use locale>, then it will depend on the locale's radix character.

But that then means that the call to SvPV() [or equivalent] within the
regular expression engine C code needs to behave differently depending on
circumstances. And that information is not passed down from the OP.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented May 16, 2013

From @khwilliamson

On 05/13/2013 06​:22 PM, Leon Timmermans wrote​:

On Mon, May 13, 2013 at 4​:59 PM, Karl Williamson
<public@​khwilliamson.com> wrote​:

It turns out that we already keep track of if we are in the C locale or not
for the numeric radix character. PL_numeric_standard contains that
information. This means that the most common case will only have a single
extra test, of this bit.

The attached draft patch demonstrates this. It also appears to fix
https://rt-archive.perl.org/perl5/Public/Bug/Display.html?id=108378 and
https://rt-archive.perl.org/perl5/Public/Bug/Display.html?id=115800

It does this by not setting SvPOK when in the scope of 'use locale'. Without
this, when upgrading an NV to a PVNV, the stringified version retains
whatever radix character it had at the time it was stringified, regardless
of whatever the current locale calls for. Even implicit stringification, as
in Jan's example​:

$a = 1.2;
$b = eval "$a + 1.5";

sets $a to "1,2" if the current locale calls for that radix, where it
remains, potentially causing problems down the road. (#108378 is about
this; the attached patch causes this example to work correctly.)

By not setting SvPOK, the NV remains an NV, and if a stringified version is
required, it must be recalculated each time; thus if the radix changes by a
call to setlocale, we will catch that.

It also has the side effect that the radix in every SVNV will be a dot, as
Steffen suggested, since we won't create an SVNV from an NV unless outside
the scope of locale, and we change to the C locale during each such creation
event, and the C locale uses a dot.

This all sounds sensible, however

I believe that additionally, within the scope of 'use locale', any PVNV
should be treated as a plain NV, so that the radix won't be output as what
it was set to before the scope was entered.

I'm not familiar enough with this part of Perl to know if there is something
really wrong with this scheme; perhaps you do.

I think you mean any SvNOK PVNV. PVNV can also contain cached.
non-authoritative NV values. I think that should include SvNOK PVMG's
and PVLV's too. Not sure how workable this is.

Also, you're suggesting that in​:

perl -MDevel​::Peek -E 'my $num = "1.2"; log $num; say Dump $num;'

the string "1.2" should be printed as a "1,2" because the variable
happens to have been used in numeric context. I'm not sure that's
sane.

Actually, I'm not suggesting that. I ran this code with my patch
(adding in locale usage), and it worked. The reason is that forcing it
to be a numeric doesn't affect the stringification that's already in place.

perl -MPOSIX -MDevel​::Peek -E 'POSIX​::setlocale(POSIX​::LC_ALL,
"de_DE.UTF8-8"); my $num = "1.2"; use locale; log $num; say $num; say
Dump $num;'

This is one of those places where not having proper types is

painful.

The problem with this patch is that it breaks code that has come to rely on
not needing 'use locale' to get locale behavior. The current documentation
is not totally clear about this. It says that you need 'use locale', but
one example given doesn't include it. And the current behavior is tested
for in

lib/version/t/07locale.t

It doesn't do a 'use locale', and even if one is added, one test fails
because it relies on the stringification being retained outside the scope of
the 'use locale'.

That's just wrong IMO.

I presume you mean the test is just wrong.

Leon

@p5pRT
Copy link
Author

p5pRT commented May 16, 2013

From @khwilliamson

On 05/13/2013 01​:50 PM, Nicholas Clark wrote​:

On Mon, May 13, 2013 at 08​:59​:19AM -0600, Karl Williamson wrote​:

It turns out that we already keep track of if we are in the C locale or
not for the numeric radix character. PL_numeric_standard contains that
information. This means that the most common case will only have a
single extra test, of this bit.

Nice to know.

The attached draft patch demonstrates this. It also appears to fix
https://rt-archive.perl.org/perl5/Public/Bug/Display.html?id=108378 and
https://rt-archive.perl.org/perl5/Public/Bug/Display.html?id=115800

It does this by not setting SvPOK when in the scope of 'use locale'.
Without this, when upgrading an NV to a PVNV, the stringified version
retains whatever radix character it had at the time it was stringified,
regardless of whatever the current locale calls for. Even implicit
stringification, as in Jan's example​:

$a = 1.2;
$b = eval "$a + 1.5";

sets $a to "1,2" if the current locale calls for that radix, where it
remains, potentially causing problems down the road. (#108378 is about
this; the attached patch causes this example to work correctly.)

Yes, this seems like a sane approach. I can't spot any real flaws yet.
(Meaning that I don't trust myself)

I believe that additionally, within the scope of 'use locale', any PVNV
should be treated as a plain NV, so that the radix won't be output as
what it was set to before the scope was entered.

I'm not familiar enough with this part of Perl to know if there is
something really wrong with this scheme; perhaps you do.

I am not familiar with it either. And, as you say, what we have now leaks
the radix from a PVNV. Right now, C<no locale> leaks in​:

$ cat locale.pl
#!/usr/bin/perl -w
use strict;

use POSIX '​:locale_h';

$a = 3.14;

$b = $b = "$a" if shift;

setlocale(LC_ALL, "hr_HR");

use locale;

print $a, "\n";

__END__
$ ./locale.pl 0
3,14
$ ./locale.pl 1
3.14

My patch fixes this.

in that if C<use locale> is supposed to be affecting numeric conversions
within its scope, even the implicit conversions, then it should be "3,14"
in both cases, and not happen to be "3.14" just because of the side effects
of an earlier use in a string context.

But that one feels harder to fix. Seems that one has to rid all the SvPV*
macros to ignore the cached string in the scope of C<use locale>

Which feels like it's going to be ugly. To the point that I'm not sure if
it's workable.

I believe you are talking about the case
use strict;

$a = 3.14;
$b = $b = "$a" if shift;
POSIX​::setlocale(POSIX​::LC_ALL,"de_DE.UTF-8");
use locale;
print $a, "\n";

In that case my patch doesn't work, because $a is set POK because it is
stringified outside of the 'use locale' scope.

What I've since done to fix that is to change the eight macros that are
variants of this one​:

/* Within locales, an NOK scalar isn't considered POK, thus forcing
  * re-stringification. This is because the locale (and hence the current
  * proper radix character stringification) can change at any time */
#define SvPOK(sv) ((SvFLAGS(sv) & SVf_POK) \
  && (! SvNOK(sv) || LIKELY(! IN_LOCALE_RUNTIME)))

That forces the SVPV macros to call sv_2pv_flags() each time the
stringification is needed of an NOK scalar, within the scope of 'use
locale'. With that, the code above works.

Thus, this patch fixes some problems, and I believe makes locales work
more sanely, but breaks code that relies on behavior contrary to what
the documentation says (but in accord with an example in the documentation).

We're allowed to be wrong, and correct our mistakes.

Although generally it helps when doing this that it's clear that the new
model of things is more self-consistent than the old model :-)

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented May 18, 2013

From @nwc10

On Thu, May 16, 2013 at 10​:26​:42AM -0600, Karl Williamson wrote​:

/* Within locales, an NOK scalar isn't considered POK, thus forcing
* re-stringification. This is because the locale (and hence the current
* proper radix character stringification) can change at any time */
#define SvPOK(sv) ((SvFLAGS(sv) & SVf_POK) \
&& (! SvNOK(sv) || LIKELY(! IN_LOCALE_RUNTIME)))

That approach starts to make me feel very uncomfortable.
Whilst it doesn't contradict the documentation for SvPOK()

=for apidoc Am|U32|SvPOK|SV* sv
Returns a U32 value indicating whether the SV contains a character
string.

it does change behaviour since 5.000, and contradicts this comment​:

#define SVf_POK 0x00000400 /* has valid public pointer value */

That forces the SVPV macros to call sv_2pv_flags() each time the
stringification is needed of an NOK scalar, within the scope of 'use
locale'. With that, the code above works.

Yes, one has to end up with that approach. The alternative, which at the
moment I think I'm liking better, is to *not* set SvPOK() on stringification
of NVs. (Or at least, NVs which aren't integers). Which would retain the
current meaning of SVf_POK, leave SvPOK() unchanged, and cause the logic
about what to return to be in one place. This also reduces the code size.
However, the cost is more function calls.

I can't see a way of doing this without more complexity.

Nicholas Clark

@p5pRT
Copy link
Author

p5pRT commented May 21, 2013

From @khwilliamson

On 05/18/2013 02​:47 AM, Nicholas Clark wrote​:

On Thu, May 16, 2013 at 10​:26​:42AM -0600, Karl Williamson wrote​:

/* Within locales, an NOK scalar isn't considered POK, thus forcing
* re-stringification. This is because the locale (and hence the current
* proper radix character stringification) can change at any time */
#define SvPOK(sv) ((SvFLAGS(sv) & SVf_POK) \
&& (! SvNOK(sv) || LIKELY(! IN_LOCALE_RUNTIME)))

That approach starts to make me feel very uncomfortable.
Whilst it doesn't contradict the documentation for SvPOK()

=for apidoc Am|U32|SvPOK|SV* sv
Returns a U32 value indicating whether the SV contains a character
string.

it does change behaviour since 5.000, and contradicts this comment​:

#define SVf_POK 0x00000400 /* has valid public pointer value */

That forces the SVPV macros to call sv_2pv_flags() each time the
stringification is needed of an NOK scalar, within the scope of 'use
locale'. With that, the code above works.

Yes, one has to end up with that approach. The alternative, which at the
moment I think I'm liking better, is to *not* set SvPOK() on stringification
of NVs. (Or at least, NVs which aren't integers). Which would retain the
current meaning of SVf_POK, leave SvPOK() unchanged, and cause the logic
about what to return to be in one place. This also reduces the code size.
However, the cost is more function calls.

I can't see a way of doing this without more complexity.

Me neither. I'm fine with doing it as you suggest. My scheme was an
attempt to minimize real-time impact on programs that don't use locale.

I do believe, however, that we should issue a statement deprecating any
direct accesses to Perl core data structures. All accesses should go
through some set/get accessor macros or functions; and if any need to be
written, I'll volunteer to do so. We may not be able to enforce this
with warnings, but it's crazy to be constrained to not making changes
because some XS writer is or could be twiddling some internal bits.

@p5pRT
Copy link
Author

p5pRT commented May 21, 2013

From @jandubois

I'm slightly confused by this discussion. AFAICT, there are currently
2 different opinions about the "desired" behavior​:

Steffen and me believe that implicit stringification should always use
the "C" locale and ignore the current locale setting. This seems to be
backed by the documentation, and also seems the only way in which
string eval "" can work with interpolated floating point numbers.

Nicholas and Karl seem to be discussing implementation details on how
implicit stringification could observe the lexical effect of the
locale pragma.

I find the implementation discussion premature while there is no
consensus on the desired semantics (or what part of the current
implementation is considered to be buggy). What are the arguments in
favor of using current locale for implicit stringification?

Cheers,
-Jan

@p5pRT
Copy link
Author

p5pRT commented Jun 9, 2013

From @rjbs

* Jan Dubois <jand@​activestate.com> [2013-05-21T13​:41​:29]

Steffen and me believe that implicit stringification should always use
the "C" locale and ignore the current locale setting. This seems to be
backed by the documentation, and also seems the only way in which
string eval "" can work with interpolated floating point numbers.

Implicit stringification in what sense? With LC_NUMERIC in effect, we promise
that print($x) will stringify numbers according to locale rules. It seems
somewhat perverse to then say that print("$x") will do otherwise.

Did I miss a detail somewhere?

(Apologies for my late arrival here.)

--
rjbs

@p5pRT
Copy link
Author

p5pRT commented Jun 9, 2013

From @rjbs

* Karl Williamson <public@​khwilliamson.com> [2013-05-21T11​:43​:50]

Me neither. I'm fine with doing it as you suggest. My scheme was
an attempt to minimize real-time impact on programs that don't use
locale.

Nicholas was suggesting not caching locale-stringified strings for numbers.
You (Karl) and I discussed this a while ago on #p5p and I said it was also what
I thought we should've been doing. Can we make this happen?

I do believe, however, that we should issue a statement deprecating
any direct accesses to Perl core data structures.

Is this something we can plausibly do? I think it's the sort of thing that
every guts hacker I've known has said we needed to do, but couldn't do without
breaking tons of stuff. Perhaps we break them by degrees?

At any rate, I am in favor of this to the extent that we don't break everything
all at once. ;)

--
rjbs

@p5pRT
Copy link
Author

p5pRT commented Jun 10, 2013

From @jandubois

On Sun, Jun 9, 2013 at 7​:21 AM, Ricardo Signes
<perl.p5p@​rjbs.manxome.org> wrote​:

* Jan Dubois <jand@​activestate.com> [2013-05-21T13​:41​:29]

Steffen and me believe that implicit stringification should always use
the "C" locale and ignore the current locale setting. This seems to be
backed by the documentation, and also seems the only way in which
string eval "" can work with interpolated floating point numbers.

Implicit stringification in what sense? With LC_NUMERIC in effect, we promise
that print($x) will stringify numbers according to locale rules. It seems
somewhat perverse to then say that print("$x") will do otherwise.

Did I miss a detail somewhere?

Sorry, it was my incomplete reading of perllocale.pod that led me to
believe that only printf() would follow the current locale. After
re-reading it, I find the documentation self-contradictory​:

| Output produced by print() is also affected by the current locale​: it
| corresponds to what you'd get from printf() in the "C" locale.

If the output from print() corresponds to what you get from printf()
under the "C" locale, then print() would *not* be affected by the
current locale.

But anyways, the next sentence, and the examples further down make it
clear that the intention has been that the current locale does affect
internal conversations between numbers and strings. As I showed in the
other thread about this topic, this has some unpleasant side-effects
for string eval that includes interpolated floating point numbers
(might silently break with incorrect results).

So I continue to consider this a bad language design decision, but
have no further objections to make the implementation match the
documented behavior (and maybe clarify the documentation a bit
regarding print()).

Cheers,
-Jan

PS​: string eval with interpolated floating point numbers is already
broken on some platforms for Inf and NaN values, and only works
accidentally on the rest. But then most code doesn't deal with Inf and
NaN anyways, so this is a lesser problem.

@p5pRT
Copy link
Author

p5pRT commented Jun 10, 2013

From @rjbs

* Jan Dubois <jand@​activestate.com> [2013-06-10T04​:07​:32]

On Sun, Jun 9, 2013 at 7​:21 AM, Ricardo Signes
<perl.p5p@​rjbs.manxome.org> wrote​:

Implicit stringification in what sense? With LC_NUMERIC in effect, we
promise that print($x) will stringify numbers according to locale rules.
It seems somewhat perverse to then say that print("$x") will do otherwise.

Did I miss a detail somewhere?

Sorry, it was my incomplete reading of perllocale.pod that led me to
believe that only printf() would follow the current locale. After
re-reading it, I find the documentation self-contradictory​:
[...]

...and sorry about my reply, to which you're replying, as it was late to the
game and you'd already covered this ground. I had not fully caught up in the
thread when I wrote it.

So I continue to consider this a bad language design decision, but
have no further objections to make the implementation match the
documented behavior (and maybe clarify the documentation a bit
regarding print()).

I agree that it is a bad design. I really have very little idea how widely
deployed code relying on it is, as I enjoy the privilege of more or less living
in the C locale. My impression is "so much that breaking it is not a good idea
right now."

Perhaps I can wrangle some information on that. I'll also look at
locale-specific libraries in some of our cousin languages — I've also never had
to deal with locales there, before, either. Maybe there are good lessons to
learn.

PS​: string eval with interpolated floating point numbers is already
broken on some platforms for Inf and NaN values, and only works
accidentally on the rest. But then most code doesn't deal with Inf and
NaN anyways, so this is a lesser problem.

I have totally hit this. :-(

--
rjbs

@p5pRT
Copy link
Author

p5pRT commented Jun 16, 2013

From @khwilliamson

On 06/09/2013 08​:27 AM, Ricardo Signes wrote​:

* Karl Williamson <public@​khwilliamson.com> [2013-05-21T11​:43​:50]

Me neither. I'm fine with doing it as you suggest. My scheme was
an attempt to minimize real-time impact on programs that don't use
locale.

Nicholas was suggesting not caching locale-stringified strings for numbers.
You (Karl) and I discussed this a while ago on #p5p and I said it was also what
I thought we should've been doing. Can we make this happen?

I will work on a patch to do this.

I agree with Jan and Dr. Ruud that this is a poorly designed portion of
Perl. I see no other option but to continue to support it as best we
can. There needs to be cautions about the interactions in eval in both
perlfunc and perllocale. And there should be something in perlfunc
about the NaN and Inf anyway.

I do believe, however, that we should issue a statement deprecating
any direct accesses to Perl core data structures.

Is this something we can plausibly do? I think it's the sort of thing that
every guts hacker I've known has said we needed to do, but couldn't do without
breaking tons of stuff. Perhaps we break them by degrees?

At any rate, I am in favor of this to the extent that we don't break everything
all at once. ;)

What I meant was adding something to perldelta and perhack, and perhaps
other pods (suggestions as to which are welcome) something like​:

"Because of limitations of the C language, various undocumented
Perl-core-internal data items are visible to XS modules, for example,
global or per-interpreter variables such as C<PL_curstackinfo>. It is
deprecated to use these directly without going through a documented (in
L<perlapi>) function or macro interface. Starting in Perl v5.24.0, any
such data item can be removed or changed without prior notice. If you
feel you need access to any of these data items, send email to
L<mailto​:perl5-porters@​perl.org>."

I don't see how we can generally make it so that warnings get generated
for illicit uses in the meantime.

@p5pRT
Copy link
Author

p5pRT commented Jun 16, 2013

From @cpansprout

On Sun Jun 16 12​:16​:11 2013, public@​khwilliamson.com wrote​:

What I meant was adding something to perldelta and perhack, and
perhaps
other pods (suggestions as to which are welcome) something like​:

"Because of limitations of the C language, various undocumented
Perl-core-internal data items are visible to XS modules, for example,
global or per-interpreter variables such as C<PL_curstackinfo>. It is
deprecated to use these directly without going through a documented
(in
L<perlapi>) function or macro interface. Starting in Perl v5.24.0,
any
such data item can be removed or changed without prior notice. If you
feel you need access to any of these data items, send email to
L<mailto​:perl5-porters@​perl.org>."

perlbug@​perl.org, please. :-)

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Jun 17, 2013

From @jkeenan

On Mon Jun 10 15​:00​:45 2013, perl.p5p@​rjbs.manxome.org wrote​:

* Jan Dubois <jand@​activestate.com> [2013-06-10T04​:07​:32]

Sorry, it was my incomplete reading of perllocale.pod that led me to
believe that only printf() would follow the current locale. After
re-reading it, I find the documentation self-contradictory​:
[...]

...and sorry about my reply, to which you're replying, as it was late
to the
game and you'd already covered this ground. I had not fully caught up
in the
thread when I wrote it.

Is this the same problem as
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=82418 ?

@p5pRT
Copy link
Author

p5pRT commented Jun 17, 2013

From @cpansprout

On Sun Jun 16 17​:40​:05 2013, jkeenan wrote​:

On Mon Jun 10 15​:00​:45 2013, perl.p5p@​rjbs.manxome.org wrote​:

* Jan Dubois <jand@​activestate.com> [2013-06-10T04​:07​:32]

Sorry, it was my incomplete reading of perllocale.pod that led me to
believe that only printf() would follow the current locale. After
re-reading it, I find the documentation self-contradictory​:
[...]

...and sorry about my reply, to which you're replying, as it was late
to the
game and you'd already covered this ground. I had not fully caught up
in the
thread when I wrote it.

Is this the same problem as
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=82418 ?

No, I don’t think so. That one has to do with constant folding, and
whether the number will be stringified at compile time or run time.

--

Father Chrysostomos

@p5pRT
Copy link
Author

p5pRT commented Jun 17, 2013

From @khwilliamson

On 06/16/2013 08​:18 PM, Father Chrysostomos via RT wrote​:

On Sun Jun 16 17​:40​:05 2013, jkeenan wrote​:

On Mon Jun 10 15​:00​:45 2013, perl.p5p@​rjbs.manxome.org wrote​:

* Jan Dubois <jand@​activestate.com> [2013-06-10T04​:07​:32]

Sorry, it was my incomplete reading of perllocale.pod that led me to
believe that only printf() would follow the current locale. After
re-reading it, I find the documentation self-contradictory​:
[...]

...and sorry about my reply, to which you're replying, as it was late
to the
game and you'd already covered this ground. I had not fully caught up
in the
thread when I wrote it.

Is this the same problem as
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=82418 ?

No, I don’t think so. That one has to do with constant folding, and
whether the number will be stringified at compile time or run time.

Also note that the author of #82418 expects that you don't need a 'use
locale' to have locale effects. It is (still) a bug that any of the
examples in this ticket output a comma radix character.

@p5pRT
Copy link
Author

p5pRT commented Jun 18, 2013

From @khwilliamson

Fixed by b127e37
--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Jun 18, 2013

From [Unknown Contact. See original ticket]

Fixed by b127e37
--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Jun 18, 2013

@khwilliamson - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant