Clarify type coercion rules in statistics module #64680

ncoghlan · 2014-02-02T01:31:52Z

BPO	20481
Nosy	@gpshead, @ncoghlan, @larryhastings, @stevendaprano, @wm75
Files	_sum_coerce.py: provides _sum and _coerce_types to replace the same functions in statistics.py, requires import numbers to work statistics.patch stats.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/ncoghlan'
closed_at = <Date 2014-02-08.10:14:09.419>
created_at = <Date 2014-02-02.01:31:51.702>
labels = ['type-bug', 'library', 'release-blocker']
title = 'Clarify type coercion rules in statistics module'
updated_at = <Date 2014-02-08.22:21:40.863>
user = 'https://github.com/ncoghlan'

bugs.python.org fields:

activity = <Date 2014-02-08.22:21:40.863>
actor = 'ncoghlan'
assignee = 'ncoghlan'
closed = True
closed_date = <Date 2014-02-08.10:14:09.419>
closer = 'python-dev'
components = ['Library (Lib)']
creation = <Date 2014-02-02.01:31:51.702>
creator = 'ncoghlan'
dependencies = []
files = ['33862', '33910', '33978']
hgrepos = []
issue_num = 20481
keywords = ['patch']
message_count = 21.0
messages = ['209934', '209956', '209976', '210030', '210113', '210126', '210127', '210137', '210143', '210202', '210236', '210239', '210243', '210247', '210251', '210557', '210604', '210607', '210611', '210663', '210694']
nosy_count = 7.0
nosy_names = ['gregory.p.smith', 'ncoghlan', 'larry', 'steven.daprano', 'python-dev', 'oscarbenjamin', 'wolma']
pr_nums = []
priority = 'release blocker'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue20481'
versions = ['Python 3.4']

ncoghlan · 2014-02-02T01:31:51Z

I haven't completely following the type coercion discussion on python-ideas. but the statistics module at least needs a docs clarification (to explain that the current behaviour when mixing input types is not fully defined, especially when Decimal is involved), and potentially a behavioural change to disallow certain type combinations where the behaviour may change in the future (see
https://mail.python.org/pipermail/python-ideas/2014-February/025214.html for example)

Either option seems reasonable to me (with a slight preference for the latter), but it's at least clear that we need to avoid locking ourselves into the exact coercion behaviour of the current implementation indefinitely.

wm75 · 2014-02-02T07:45:08Z

Thanks Nick for filing this!
I've been working on modifications to statistics._sum and statistics._coerce_types that together make the module's behaviour independent of the order of input types (by making the decision based on the set of input types) and, specifically, disallow certain combinations involving Decimal reliably.
Specifically, my modified version, like the original, returns the input type if there is only one; for mixed input types, differently than the original, it tries to return the most basic representative of the most narrow number type from the numbers tower (i.e., int if all input types are Integral, Fraction if all are Rational, float with all Real); Decimal is treated in the following way: if Decimal is the only input type or if the input types consist of only Decimal and Integral, Decimal is returned; my two versions of _coerce_types differ in that the first raises TypeError with any other combination involving Decimal, the second allows combinations of Decimal with float (returning float) and raises TypeError with all others.
I sent the first version to Steven D'Aprano and Oscar Benjamin for review.

oscarbenjamin · 2014-02-02T12:33:38Z

Wolfgang have you tested this with any third party numeric types from
sympy, gmpy2, mpmath etc.?

Last I checked no third party types implement the numbers ABCs e.g.:

>>> import sympy, numbers
>>> r = sympy.Rational(1, 2)
>>> r
1/2
>>> isinstance(r, numbers.Rational)
False

AFAICT testing against the numbers ABCs is just a slow way of testing
against the stdlib types:

$ python -m timeit -s 'from numbers import Integral' 'isinstance(1, Integral)'
100000 loops, best of 3: 2.59 usec per loop
$ python -m timeit -s 'from numbers import Integral' 'isinstance(1, int)'
1000000 loops, best of 3: 0.31 usec per loop

You can at least make it faster using a tuple:

$ python -m timeit -s 'from numbers import Integral' 'isinstance(1,
(int, Integral))'
1000000 loops, best of 3: 0.423 usec per loop

I'm not saying that this is necessarily a worthwhile optimisation but
rather that the numbers ABCs are in practice not really very useful
(AFAICT).

I don't know how well the statistics module currently handles third
party numeric types but if the type coercion is to be changed then
this something to look at. The current implementation tries to
duck-type to some extent and yours uses ABCs but does either approach
actually have any practical gain for interoperability with non-stdlib
numeric types? If not then it would be simpler just to explicitly
hard-code exactly how it works for the powerset of stdlib types.

OTOH if it could be made to do sensible things with non-stdlib types
then that would be great. Falling back on float isn't a bad choice but
if it can be made to do exact things for exact types (e.g.
sympy.Rational) then that would be great. Similarly mpmath.mpf
provides multi-precision floats. It would be great to be able to take
advantage of that higher precision rather than downgrade everything to
float.

This is in general a hard problem though so I don't think it's
unreasonable to make restrictions about what types it can work with -
achieving optimal accuracy for all types without restrictions is
basically impossible.

wm75 · 2014-02-02T22:04:17Z

Hi Oscar,
well, I haven't used sympy much, and I have no experience with the others, but in light of your comment I quickly checked sympy and gmpy2.
You are right about them still not using the numbers ABCs, however, on your advise I also checked how the current statistics module implementation handles their numeric types and the answer is: it doesn't, and this is totally independent of the _coerce_types issue.
For sympy: the problem lies with statistics._exact_ratio, which cannot convert sympy numeric types to a numerator/denominator tuple (a prerequisite for _sum)
For gmpy2: the problem occurs just one step further down the road. gmpy2.Rationals have numerator and denominator properties, so _exact_ratio knows how to handle them, but the elements of the returned tuple are of type gmpy2.mpz (gmpy2's integer equivalent) and when _sum tries to convert the tuple into a Fraction you get:

TypeError: both arguments should be Rational instances

which is precisely because the mpz type is not integrated into the numbers tower.

This last example is very illustrative I think because it shows that already now the standard library (the fractions module in this case) requires numeric types to comply with the numeric tower, so statistics would not be without precedent, and I think this is totally justified:
after all this is the standard library (can't believe I'm saying this since I really got into this sort of by accident) and third party libraries should seek compatibility, but the standard library just needs to be self-consistent.

I guess using ABCs over a duck-typing approach when coercing types, in fact, offers a huge advantage for third party libraries since they only need to register their types with the numbers ABC to achieve compatibility, while they need to consider complicated subclassing schemes with the current approach (of course, I am only talking about compatibility with _coerce_types here, which is the focus of this issue. Other parts of statistics may impose further restrictions as we've just seen for _sum).

Finally, regarding speed. The fundamental difference between the current implementation and my proposed change is that the current version calls _coerce_types for every number in the input sequence, so performance is critical here, but in my version _coerce_types gets called only once and then operates on a really small set of input types, so it is absolutely not the time-critical step in the overall performance of _sum.
For this very reason I made no effort at all to optimize the code, but just tried to keep it as simple and clear as possible.

This, in fact, is IMHO the second major benefit of my proposal for _coerce_types (besides making its result order-independent). Read the current code for _coerce_types, then the proposed one. Try to consider all their ramifications and side-effects and decide which one's easier to understand and maintain.

Best,
Wolfgang

oscarbenjamin · 2014-02-03T11:11:46Z

It's not as simple as registering with an ABC. You also need to provide the
interface that the ABC represents:

>>> import sympy
>>> r = sympy.Rational(1, 2)
>>> r
1/2
>>> r.numerator
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Half' object has no attribute 'numerator'

AFAIK there are no plans by any third part libraries to increase their
inter-operability with the numeric tower.

My point is that in choosing what types to accept and how to coerce them you
should focus on actual practical benefits rather than theoretical ones. If it
can be done in a way that means it works for more numeric types then that's
great. But when I say "works" I mean that it should ideally achieve the best
possible accuracy for each type.

If that's not possible then it might be simplest to just document how it works
for combinations of the std lib types (and perhaps subclasses thereof) and
then say that it will fall back on coercing to float for anything else. This
approach is simpler to document and for end-users to understand. It also has
the benefit that it will work for all non std lib types (that I'm aware of)
without pretending to achieve more accuracy than it can.

>>> import sympy, fractions, gmpy
>>> fractions.Fraction(sympy.Rational(1, 2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/fractions.py", line 148, in __new__
    raise TypeError("argument should be a string "
TypeError: argument should be a string or a Rational instance
>>> float(sympy.Rational(1, 2))
0.5
>>> fractions.Fraction(gmpy.mpq(1, 2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/fractions.py", line 148, in __new__
    raise TypeError("argument should be a string "
TypeError: argument should be a string or a Rational instance
>>> float(gmpy.mpq(1, 2))
0.5

Coercion to float via __float__ is well supported in the Python ecosystem.
Consistent support for getting exact integer ratios is (unfortunately) not.

wm75 · 2014-02-03T14:14:17Z

Just to make sure that this discussion is not getting on the wrong track,
there are currently two strict requirements for any numeric type to be usable with statistics._sum:

(1) the type has to provide either
- numerator/denominator properties or
- an as_integer_ratio method or
- an as_tuple method that mimicks the Decimal method of that name

this requirement comes from statistics._exact_ratio.

This is where, for example, the sympy numeric types fail because they do not provide any of these interfaces.

For completeness, this is the code of this function:

def _exact_ratio(x):
    """Convert Real number x exactly to (numerator, denominator) pair.

    >>> _exact_ratio(0.25)
    (1, 4)

x is expected to be an int, Fraction, Decimal or float.
"""
try:
    try:
        # int, Fraction
        return (x.numerator, x.denominator)
    except AttributeError:
        # float
        try:
            return x.as_integer_ratio()
        except AttributeError:
            # Decimal
            try:
                return _decimal_to_ratio(x)
            except AttributeError:
                msg = "can't convert type '{}' to numerator/denominator"
                raise TypeError(msg.format(type(x).__name__)) from None
except (OverflowError, ValueError):
    # INF or NAN
    if __debug__:
        # Decimal signalling NANs cannot be converted to float :-(
        if isinstance(x, Decimal):
            assert not x.is_finite()
        else:
            assert not math.isfinite(x)
    return (x, None)

(2) Essentially, the numerator and the denominator returned by _exact_ratio have to be valid arguments for the Fraction constructor.
This is a consequence of this block of code in _sum:

    for d, n in sorted(partials.items()):
        total += Fraction(n, d)

Of note, Fraction(n, d) requires both arguments to be members of numbers.Rational and this is where, for example, the gmpy.mpq type fails.

(3) The type's constructor has to work with a Fraction argument.
This is because _sum tries to

return T(total) # where T is the coerced type and total the calculated sum as a Fraction

The gmpy.mpq type, for example, fails at this step again.

ACTUALLY: THIS SHOULD BE RAISED AS ITS OWN ISSUE HERE as soon as the coercion part is settled because it means that _sum may succeed with certain mixed input types even though some of the same types may fail, when they are the only type.

IMPORTANTLY, neither requirement has anything to do with the module's type coercion, which is the topic of this discussion.

IN ADDITION, the proposed patch involving _coerce_types adds the following (soft) requirement:
in order to be able to coerce a sequence of numbers of mixed numeric types without loss of precision all involved types need to be integrated into the hierarchy of numeric abstract base classes as defined in the numbers module (otherwise all types are coerced to float).

From this it should be clear that the compatibility bottleneck is not in this patch, but in other parts of the module.

What is more, if a custom numeric type implements the numerator/denominator properties, then it is simple enough to register it as a virtual subclass of numbers.Rational or .Integral to enable lossless coercion in the presence of mixed types.

Example with gmpy2 numeric type:

>> from gmpy2 import mpq # mpq is gmpy's Fraction equivalent
>> import numbers

>>> numbers.Rational.register(type(mpq()))
>>> r = mpq(7,5)
>>> type(r)
<class 'mpq'>

>>> issubclass(type(r), numbers.Rational)
True

=> the mpq type could now be coerced correctly even in mixed input types situations, but remains incompatible with _sum for reasons (2) and (3).

SUMMARY: Making _sum work with custom types is very complicated, BUT:
this is not due to type coercion as it is discussed here.

wm75 · 2014-02-03T14:18:33Z

there are currently two strict requirements for any numeric type to be usable with statistics._sum:

I meant *three* of course (remembered one only during writing).

oscarbenjamin · 2014-02-03T15:13:49Z

I agree that supporting non-stdlib types is in some ways a separate issue from
how to manage coercion with mixed stdlib types. Can you provide a complete
patch (e.g. hg diff > coerce_types.patch).
http://docs.python.org/devguide/

There should probably also be tests added for situations where the current
implementation behaves undesirably. Something like these ones:
http://hg.python.org/cpython/file/a97ce3ecc96a/Lib/test/test_statistics.py#l1445

Note that when I said non-stdlib types can be handled by coercing to float I
didn't mean that the output should be coerced to float but rather the input
should be coerced to float because __float__ is the most consistent interface
available on third party numeric types.

Once the input numbers are converted to float statistics._sum can handle them
perfectly well. In this case I think the output should also be a float so that
it's clear that precision may have been lost. If the precision of float is not
what the user wants then the documentation can point them toward
Fraction/Decimal.

wm75 · 2014-02-03T15:48:00Z

Once the input numbers are converted to float statistics._sum can handle
them perfectly well. In this case I think the output should also be a float so
that it's clear that precision may have been lost. If the precision of float is not
what the user wants then the documentation can point them toward
Fraction/Decimal.

Ah, I'm getting it now. That is actually a very interesting thought. Still I don't think this should be part of this issue discussion, but I'll think about it and file a new enhancement issue if I have an idea.

As for providing the complete patch, I can do that I guess, but formulating the tests may take a while. I'll try to do it though, if Steven thinks it's worth it. After all he's the one who'd have to approve it and the patch does alter his design quite a bit.

ncoghlan · 2014-02-04T11:15:36Z

I think it's also acceptable at this point for the module docs to just say that handling of mixed type input is undefined and implementation dependent, and recommend doing "map(int, input_data)", "map(float, input_data)", "map(Decimal, input_data)" or "map(Fraction, input_data)" to ensure getting a consistent answer for mixed type input.

I believe it would also be acceptable for the module to just fail immediately as soon as it detects an input type that differs from the type of the first value rather than attempting to guess the most appropriate behaviour.

This close to 3.4rc1 (Sunday 9th February), I don't think we want to be committing to *any* particular implementation of type coercion.

oscarbenjamin · 2014-02-04T14:46:13Z

I was working on the basis that we were talking about Python 3.5.

But now I see that it's a 3.4 release blocker. Is it really that urgent?

I think the current behaviour is very good at handling a wide range of types.
It would be nice to consistently report errors for incompatible types but it
can also just be documented as a thing that users shouldn't do.

If there were a situation where it silently returned a highly inaccurate
value I would consider that urgent but I don't think there is.

ncoghlan · 2014-02-04T15:11:07Z

Changing the behaviour is not urgent - documenting that it may change
in the future is essential.

wm75 · 2014-02-04T15:43:14Z

Hi Nick and Oscar,
my patch including tests is ready. What else should I say than that I think it is ok, but of course with only days remaining and still no feedback from Steven, I'm not sure there is any chance for it going into 3.4 still.
Anyway, before uploading I wanted to ask you what the best procedure is:
upload the module and its tests as one diff or separately so that it is easier to see where the current module version fails?
Best,
Wolfgang

stevendaprano · 2014-02-04T16:32:06Z

Wolfgang,

Thanks for the patch, I have some concerns about it, but the basic idea
does look reasonable. However, I've been convinced that supporting mixed
types at all needs more careful thought. Under the circumstances, I'm
more concerned about making sure we don't lock in any behaviour we'll
regret, so for 3.4 I'm going to follow Nick's advice and reject
mixed-type calculations.

wm75 · 2014-02-04T16:45:06Z

Hi Steven,
sounds reasonable, still here's the patch in diff version.
Its rules for type coercion are detailed in _coerce_types's docstring.
Questions and comments are welcome.
Best,
Wolfgang

stevendaprano · 2014-02-07T21:04:39Z

Attached is a patch which:

documents that mixed types are not currently supported;
changes the behaviour of _sum to raise TypeError on mixed input
types (mixing int and <other> is allowed, but nothing else);
updates the tests;
adds some documentation changes for bpo-20389.

ncoghlan · 2014-02-08T09:49:15Z

Claiming to commit before 3.4rc1 (as I believe Steven's SSH key still needs to be added)

python-dev · 2014-02-08T10:14:09Z

New changeset 5db74cd953ab by Nick Coghlan in branch 'default':
Close bpo-20481: Disallow mixed type input in statistics
http://hg.python.org/cpython/rev/5db74cd953ab

ncoghlan · 2014-02-08T10:19:08Z

OK, I committed a slight variant of Steven's patch:

I moved the bpo-20389 doc changes to a separate patch that I uploaded over there
Steven had marked one of the tests as potentially obsolete, I just removed it entirely. If we decide we eventually want it back, it's still there in the VCS history :)

oscarbenjamin · 2014-02-08T15:56:05Z

Close bpo-20481: Disallow mixed type input in statistics

If I understand correctly the reason for hastily pushing this patch
through is that it's the safe option: disallow mixing types as a quick
fix for soon to be released 3.4. If we want to allow mixing types then
that can be a new feature for 3.5.

Is that correct?

If so should the discussion about what to do in 3.5 take place in this
issue (i.e. reopen for 3.5) or should it be a new issue? bpo-20499
would benefit from clarity about the statistics module policy for
mixing types.

ncoghlan · 2014-02-08T22:21:41Z

Yes, a new RFE to sensibly handle mixed type input in 3.5 would make sense (I did something similar for the issue where we removed the special casing of Counter for 3.4).

ncoghlan added release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Feb 2, 2014

stevendaprano self-assigned this Feb 2, 2014

ncoghlan assigned ncoghlan and unassigned stevendaprano Feb 8, 2014

python-dev mannequin closed this as completed Feb 8, 2014

ezio-melotti transferred this issue from another repository Apr 10, 2022

Uh oh!

Clarify type coercion rules in statistics module #64680

Clarify type coercion rules in statistics module #64680

Comments

ncoghlan commented Feb 2, 2014

ncoghlan commented Feb 2, 2014

Uh oh!

wm75 mannequin commented Feb 2, 2014

Uh oh!

oscarbenjamin mannequin commented Feb 2, 2014

Uh oh!

wm75 mannequin commented Feb 2, 2014

Uh oh!

oscarbenjamin mannequin commented Feb 3, 2014

Uh oh!

wm75 mannequin commented Feb 3, 2014

Uh oh!

wm75 mannequin commented Feb 3, 2014

Uh oh!

oscarbenjamin mannequin commented Feb 3, 2014

Uh oh!

wm75 mannequin commented Feb 3, 2014

Uh oh!

ncoghlan commented Feb 4, 2014

Uh oh!

oscarbenjamin mannequin commented Feb 4, 2014

Uh oh!

ncoghlan commented Feb 4, 2014

Uh oh!

wm75 mannequin commented Feb 4, 2014

Uh oh!

stevendaprano commented Feb 4, 2014

Uh oh!

wm75 mannequin commented Feb 4, 2014

Uh oh!

stevendaprano commented Feb 7, 2014

Uh oh!

ncoghlan commented Feb 8, 2014

Uh oh!

python-dev mannequin commented Feb 8, 2014

Uh oh!

ncoghlan commented Feb 8, 2014

Uh oh!

oscarbenjamin mannequin commented Feb 8, 2014

Uh oh!

ncoghlan commented Feb 8, 2014

Uh oh!