Clarify type coercion rules in statistics module #64680
Comments
I haven't been completely following the type coercion discussion on python-ideas, but the statistics module at least needs a docs clarification (to explain that the current behaviour when mixing input types is not fully defined, especially when Decimal is involved), and potentially a behavioural change to disallow certain type combinations where the behaviour may change in the future (see Either option seems reasonable to me (with a slight preference for the latter), but it's at least clear that we need to avoid locking ourselves into the exact coercion behaviour of the current implementation indefinitely. |
Thanks Nick for filing this! |
Wolfgang, have you tested this with any third party numeric types from

Last I checked no third party types implement the numbers ABCs e.g.:

>>> import sympy, numbers
>>> r = sympy.Rational(1, 2)
>>> r
1/2
>>> isinstance(r, numbers.Rational)
False

AFAICT testing against the numbers ABCs is just a slow way of testing

$ python -m timeit -s 'from numbers import Integral' 'isinstance(1, Integral)'
100000 loops, best of 3: 2.59 usec per loop
$ python -m timeit -s 'from numbers import Integral' 'isinstance(1, int)'
1000000 loops, best of 3: 0.31 usec per loop

You can at least make it faster using a tuple:

$ python -m timeit -s 'from numbers import Integral' 'isinstance(1, (int, Integral))'
1000000 loops, best of 3: 0.423 usec per loop

I'm not saying that this is necessarily a worthwhile optimisation but I don't know how well the statistics module currently handles third

OTOH if it could be made to do sensible things with non-stdlib types

This is in general a hard problem though so I don't think it's |
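The command-line timings above can also be reproduced from within Python using the standard `timeit` module. This is only a sketch: the helper name `compare_isinstance_checks` is made up for illustration, and the absolute numbers depend entirely on the interpreter version and the machine, so they should not be expected to match the figures quoted in the comment.

```python
import timeit
from numbers import Integral


def compare_isinstance_checks(number=100_000):
    """Time three ways of asking "is 1 an integer?" (hypothetical helper)."""
    checks = {
        "abc": "isinstance(1, Integral)",
        "concrete": "isinstance(1, int)",
        "tuple": "isinstance(1, (int, Integral))",
    }
    # timeit evaluates each statement in this namespace.
    namespace = {"Integral": Integral}
    return {
        label: timeit.timeit(stmt, globals=namespace, number=number)
        for label, stmt in checks.items()
    }


timings = compare_isinstance_checks()
for label, seconds in timings.items():
    print(f"{label:>8}: {seconds:.4f} s")
```

Putting the concrete type first in the tuple lets the common case short-circuit before the ABC machinery is consulted, which is the effect the third measurement is meant to show.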
Hi Oscar,

TypeError: both arguments should be Rational instances

which is precisely because the mpz type is not integrated into the numbers tower.

This last example is very illustrative I think because it shows that already now the standard library (the fractions module in this case) requires numeric types to comply with the numeric tower, so statistics would not be without precedent, and I think this is totally justified:

I guess using ABCs over a duck-typing approach when coercing types, in fact, offers a huge advantage for third party libraries since they only need to register their types with the numbers ABC to achieve compatibility, while they need to consider complicated subclassing schemes with the current approach (of course, I am only talking about compatibility with _coerce_types here, which is the focus of this issue. Other parts of statistics may impose further restrictions as we've just seen for _sum).

Finally, regarding speed. The fundamental difference between the current implementation and my proposed change is that the current version calls _coerce_types for every number in the input sequence, so performance is critical here, but in my version _coerce_types gets called only once and then operates on a really small set of input types, so it is absolutely not the time-critical step in the overall performance of _sum. This, in fact, is IMHO the second major benefit of my proposal for _coerce_types (besides making its result order-independent).

Read the current code for _coerce_types, then the proposed one. Try to consider all their ramifications and side-effects and decide which one's easier to understand and maintain.

Best, |
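To make the "collect the types once, then coerce" idea concrete, here is a minimal sketch. It is not the patch discussed in this issue and not the code in the statistics module: the names `_rank` and `common_type` and the ranking rule are assumptions chosen purely for illustration. The point is only that working on the set of input types gives a result that cannot depend on the order of the data.

```python
import numbers
from fractions import Fraction


def _rank(T):
    # Hypothetical ranking by numeric-tower ABC, most specific ABC first.
    if issubclass(T, numbers.Integral):
        return 0
    if issubclass(T, numbers.Rational):
        return 1
    if issubclass(T, numbers.Real):
        return 2
    # Decimal lands here because it is not registered as numbers.Real.
    raise TypeError("unsupported type: %s" % T.__name__)


def common_type(values):
    """Pick one result type from the set of input types (illustrative only)."""
    types = {type(v) for v in values}
    if not types:
        raise ValueError("no data")
    # Keying on (rank, name) makes the choice independent of input order.
    return max(types, key=lambda T: (_rank(T), T.__name__))


data = [1, Fraction(1, 2), 0.5]
assert common_type(data) is common_type(reversed(data)) is float
```

The ABC checks run once per distinct type rather than once per element, which is why their cost stops mattering for long inputs. This sketch deliberately rejects Decimal instead of guessing how it should mix with the other types.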
It's not as simple as registering with an ABC. You also need to provide the

>>> import sympy
>>> r = sympy.Rational(1, 2)
>>> r
1/2
>>> r.numerator
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Half' object has no attribute 'numerator'

AFAIK there are no plans by any third party libraries to increase their

My point is that in choosing what types to accept and how to coerce them you

If that's not possible then it might be simplest to just document how it works

>>> import sympy, fractions, gmpy
>>> fractions.Fraction(sympy.Rational(1, 2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/fractions.py", line 148, in __new__
    raise TypeError("argument should be a string "
TypeError: argument should be a string or a Rational instance
>>> float(sympy.Rational(1, 2))
0.5
>>> fractions.Fraction(gmpy.mpq(1, 2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/fractions.py", line 148, in __new__
    raise TypeError("argument should be a string "
TypeError: argument should be a string or a Rational instance
>>> float(gmpy.mpq(1, 2))
0.5

Coercion to float via __float__ is well supported in the Python ecosystem. |
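The same contrast can be shown without installing sympy or gmpy. The class `Half` below is a hypothetical stand-in for a third-party number that implements `__float__` but is neither registered with the numbers ABCs nor provides `numerator`/`denominator`; it is not sympy's class of the same name.

```python
from fractions import Fraction


class Half:
    """Hypothetical third-party number: supports float() and nothing else."""

    def __float__(self):
        return 0.5


h = Half()
print(float(h))  # conversion through __float__ works

try:
    Fraction(h)
except TypeError as exc:
    # Fraction only accepts strings, Rational instances, floats and Decimals.
    print("Fraction rejected it:", exc)
```

Going through `float()` is lossy for values that a float cannot represent exactly, so this buys broad compatibility at the cost of exactness.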
Just to make sure that this discussion is not getting on the wrong track,

(1) the type has to provide either

This is where, for example, the sympy numeric types fail because they do not provide any of these interfaces.

def _exact_ratio(x):
    """Convert Real number x exactly to (numerator, denominator) pair.

    >>> _exact_ratio(0.25)
    (1, 4)

(2) Essentially, the numerator and the denominator returned by _exact_ratio have to be valid arguments for the Fraction constructor.

for d, n in sorted(partials.items()):
    total += Fraction(n, d)

Of note, Fraction(n, d) requires both arguments to be members of numbers.Rational and this is where, for example, the gmpy.mpq type fails.

(3) The type's constructor has to work with a Fraction argument.
The gmpy.mpq type, for example, fails at this step again.

ACTUALLY: THIS SHOULD BE RAISED AS ITS OWN ISSUE HERE as soon as the coercion part is settled because it means that _sum may succeed with certain mixed input types even though some of the same types may fail, when they are the only type.

IMPORTANTLY, neither requirement has anything to do with the module's type coercion, which is the topic of this discussion.

IN ADDITION, the proposed patch involving _coerce_types adds the following (soft) requirement:

From this it should be clear that the compatibility bottleneck is not in this patch, but in other parts of the module. What is more, if a custom numeric type implements the numerator/denominator properties, then it is simple enough to register it as a virtual subclass of numbers.Rational or .Integral to enable lossless coercion in the presence of mixed types.

Example with gmpy2 numeric type:

>>> numbers.Rational.register(type(mpq()))
>>> r = mpq(7,5)
>>> type(r)
<class 'mpq'>
>>> issubclass(type(r), numbers.Rational)
True

=> the mpq type could now be coerced correctly even in mixed input types situations, but remains incompatible with _sum for reasons (2) and (3).

SUMMARY: Making _sum work with custom types is very complicated, BUT: |
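The registration step can be tried without gmpy2. In the sketch below, `Ratio` is a hypothetical stand-in for a third-party rational type that exposes integer `numerator` and `denominator` attributes; it is an assumption made for illustration and says nothing about how gmpy2's mpq behaves in requirements (2) and (3).

```python
import numbers
from fractions import Fraction


class Ratio:
    """Hypothetical stand-in for a third-party rational type."""

    def __init__(self, numerator, denominator):
        self.numerator = numerator
        self.denominator = denominator


# Register as a virtual subclass: no inheritance, no methods added.
numbers.Rational.register(Ratio)

r = Ratio(7, 5)
print(issubclass(type(r), numbers.Rational))
print(Fraction(r) == Fraction(7, 5))
```

Registration only changes what `isinstance` and `issubclass` report. The type still has to supply `numerator` and `denominator` itself, and it gains no arithmetic from the ABC.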
I meant *three* of course (remembered one only during writing). |
I agree that supporting non-stdlib types is in some ways a separate issue from

There should probably also be tests added for situations where the current

Note that when I said non-stdlib types can be handled by coercing to float I

Once the input numbers are converted to float statistics._sum can handle them |
Ah, I'm getting it now. That is actually a very interesting thought. Still I don't think this should be part of this issue discussion, but I'll think about it and file a new enhancement issue if I have an idea. As for providing the complete patch, I can do that I guess, but formulating the tests may take a while. I'll try to do it though, if Steven thinks it's worth it. After all he's the one who'd have to approve it and the patch does alter his design quite a bit. |
I think it's also acceptable at this point for the module docs to just say that handling of mixed type input is undefined and implementation dependent, and recommend doing "map(int, input_data)", "map(float, input_data)", "map(Decimal, input_data)" or "map(Fraction, input_data)" to ensure getting a consistent answer for mixed type input. I believe it would also be acceptable for the module to just fail immediately as soon as it detects an input type that differs from the type of the first value rather than attempting to guess the most appropriate behaviour. This close to 3.4rc1 (Sunday 9th February), I don't think we want to be committing to *any* particular implementation of type coercion. |
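As a small illustration of that recommendation, the sketch below converts mixed input to a single type before calling `statistics.mean`. The sample data is made up, and the Decimal case uses only ints and floats because the `Decimal` constructor does not accept a `Fraction`.

```python
import statistics
from decimal import Decimal
from fractions import Fraction

mixed = [1, 2.5, Fraction(1, 2)]

# Convert everything to one type first, so the result type is predictable.
mean_float = statistics.mean(list(map(float, mixed)))
mean_fraction = statistics.mean(list(map(Fraction, mixed)))

# Decimal() rejects Fraction arguments, so this input holds ints and floats only.
mean_decimal = statistics.mean(list(map(Decimal, [1, 2.5, 0.5])))

print(type(mean_float).__name__)
print(type(mean_fraction).__name__)
print(type(mean_decimal).__name__)
```

Which conversion is appropriate depends on the data: `Fraction` keeps the values exact, while `float` accepts the widest range of input types.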
I was working on the basis that we were talking about Python 3.5. But now I see that it's a 3.4 release blocker. Is it really that urgent? I think the current behaviour is very good at handling a wide range of types. If there were a situation where it silently returned a highly inaccurate |
Changing the behaviour is not urgent - documenting that it may change |
Hi Nick and Oscar, |
Wolfgang, Thanks for the patch, I have some concerns about it, but the basic idea |
Hi Steven, |
Attached is a patch which:
|
Claiming to commit before 3.4rc1 (as I believe Steven's SSH key still needs to be added) |
New changeset 5db74cd953ab by Nick Coghlan in branch 'default': |
OK, I committed a slight variant of Steven's patch:
|
If I understand correctly the reason for hastily pushing this patch Is that correct? If so should the discussion about what to do in 3.5 take place in this |
Yes, a new RFE to sensibly handle mixed type input in 3.5 would make sense (I did something similar for the issue where we removed the special casing of Counter for 3.4). |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.