Skip to content

Clarify type coercion rules in statistics module #64680

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ncoghlan opened this issue Feb 2, 2014 · 21 comments
Closed

Clarify type coercion rules in statistics module #64680

ncoghlan opened this issue Feb 2, 2014 · 21 comments
Assignees
Labels
release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@ncoghlan
Copy link
Contributor

ncoghlan commented Feb 2, 2014

BPO 20481
Nosy @gpshead, @ncoghlan, @larryhastings, @stevendaprano, @wm75
Files
  • _sum_coerce.py: provides _sum and _coerce_types to replace the same functions in statistics.py, requires import numbers to work
  • statistics.patch
  • stats.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ncoghlan'
    closed_at = <Date 2014-02-08.10:14:09.419>
    created_at = <Date 2014-02-02.01:31:51.702>
    labels = ['type-bug', 'library', 'release-blocker']
    title = 'Clarify type coercion rules in statistics module'
    updated_at = <Date 2014-02-08.22:21:40.863>
    user = 'https://github.com/ncoghlan'

    bugs.python.org fields:

    activity = <Date 2014-02-08.22:21:40.863>
    actor = 'ncoghlan'
    assignee = 'ncoghlan'
    closed = True
    closed_date = <Date 2014-02-08.10:14:09.419>
    closer = 'python-dev'
    components = ['Library (Lib)']
    creation = <Date 2014-02-02.01:31:51.702>
    creator = 'ncoghlan'
    dependencies = []
    files = ['33862', '33910', '33978']
    hgrepos = []
    issue_num = 20481
    keywords = ['patch']
    message_count = 21.0
    messages = ['209934', '209956', '209976', '210030', '210113', '210126', '210127', '210137', '210143', '210202', '210236', '210239', '210243', '210247', '210251', '210557', '210604', '210607', '210611', '210663', '210694']
    nosy_count = 7.0
    nosy_names = ['gregory.p.smith', 'ncoghlan', 'larry', 'steven.daprano', 'python-dev', 'oscarbenjamin', 'wolma']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue20481'
    versions = ['Python 3.4']

    @ncoghlan
    Copy link
    Contributor Author

    ncoghlan commented Feb 2, 2014

    I haven't completely following the type coercion discussion on python-ideas. but the statistics module at least needs a docs clarification (to explain that the current behaviour when mixing input types is not fully defined, especially when Decimal is involved), and potentially a behavioural change to disallow certain type combinations where the behaviour may change in the future (see
    https://mail.python.org/pipermail/python-ideas/2014-February/025214.html for example)

    Either option seems reasonable to me (with a slight preference for the latter), but it's at least clear that we need to avoid locking ourselves into the exact coercion behaviour of the current implementation indefinitely.

    @ncoghlan ncoghlan added release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Feb 2, 2014
    @wm75
    Copy link
    Mannequin

    wm75 mannequin commented Feb 2, 2014

    Thanks Nick for filing this!
    I've been working on modifications to statistics._sum and statistics._coerce_types that together make the module's behaviour independent of the order of input types (by making the decision based on the set of input types) and, specifically, disallow certain combinations involving Decimal reliably.
    Specifically, my modified version, like the original, returns the input type if there is only one; for mixed input types, differently than the original, it tries to return the most basic representative of the most narrow number type from the numbers tower (i.e., int if all input types are Integral, Fraction if all are Rational, float with all Real); Decimal is treated in the following way: if Decimal is the only input type or if the input types consist of only Decimal and Integral, Decimal is returned; my two versions of _coerce_types differ in that the first raises TypeError with any other combination involving Decimal, the second allows combinations of Decimal with float (returning float) and raises TypeError with all others.
    I sent the first version to Steven D'Aprano and Oscar Benjamin for review.

    @stevendaprano stevendaprano self-assigned this Feb 2, 2014
    @oscarbenjamin
    Copy link
    Mannequin

    oscarbenjamin mannequin commented Feb 2, 2014

    Wolfgang have you tested this with any third party numeric types from
    sympy, gmpy2, mpmath etc.?

    Last I checked no third party types implement the numbers ABCs e.g.:

    >>> import sympy, numbers
    >>> r = sympy.Rational(1, 2)
    >>> r
    1/2
    >>> isinstance(r, numbers.Rational)
    False

    AFAICT testing against the numbers ABCs is just a slow way of testing
    against the stdlib types:

    $ python -m timeit -s 'from numbers import Integral' 'isinstance(1, Integral)'
    100000 loops, best of 3: 2.59 usec per loop
    $ python -m timeit -s 'from numbers import Integral' 'isinstance(1, int)'
    1000000 loops, best of 3: 0.31 usec per loop

    You can at least make it faster using a tuple:

    $ python -m timeit -s 'from numbers import Integral' 'isinstance(1,
    (int, Integral))'
    1000000 loops, best of 3: 0.423 usec per loop

    I'm not saying that this is necessarily a worthwhile optimisation but
    rather that the numbers ABCs are in practice not really very useful
    (AFAICT).

    I don't know how well the statistics module currently handles third
    party numeric types but if the type coercion is to be changed then
    this something to look at. The current implementation tries to
    duck-type to some extent and yours uses ABCs but does either approach
    actually have any practical gain for interoperability with non-stdlib
    numeric types? If not then it would be simpler just to explicitly
    hard-code exactly how it works for the powerset of stdlib types.

    OTOH if it could be made to do sensible things with non-stdlib types
    then that would be great. Falling back on float isn't a bad choice but
    if it can be made to do exact things for exact types (e.g.
    sympy.Rational) then that would be great. Similarly mpmath.mpf
    provides multi-precision floats. It would be great to be able to take
    advantage of that higher precision rather than downgrade everything to
    float.

    This is in general a hard problem though so I don't think it's
    unreasonable to make restrictions about what types it can work with -
    achieving optimal accuracy for all types without restrictions is
    basically impossible.

    @wm75
    Copy link
    Mannequin

    wm75 mannequin commented Feb 2, 2014

    Hi Oscar,
    well, I haven't used sympy much, and I have no experience with the others, but in light of your comment I quickly checked sympy and gmpy2.
    You are right about them still not using the numbers ABCs, however, on your advise I also checked how the current statistics module implementation handles their numeric types and the answer is: it doesn't, and this is totally independent of the _coerce_types issue.
    For sympy: the problem lies with statistics._exact_ratio, which cannot convert sympy numeric types to a numerator/denominator tuple (a prerequisite for _sum)
    For gmpy2: the problem occurs just one step further down the road. gmpy2.Rationals have numerator and denominator properties, so _exact_ratio knows how to handle them, but the elements of the returned tuple are of type gmpy2.mpz (gmpy2's integer equivalent) and when _sum tries to convert the tuple into a Fraction you get:

    TypeError: both arguments should be Rational instances

    which is precisely because the mpz type is not integrated into the numbers tower.

    This last example is very illustrative I think because it shows that already now the standard library (the fractions module in this case) requires numeric types to comply with the numeric tower, so statistics would not be without precedent, and I think this is totally justified:
    after all this is the standard library (can't believe I'm saying this since I really got into this sort of by accident) and third party libraries should seek compatibility, but the standard library just needs to be self-consistent.

    I guess using ABCs over a duck-typing approach when coercing types, in fact, offers a huge advantage for third party libraries since they only need to register their types with the numbers ABC to achieve compatibility, while they need to consider complicated subclassing schemes with the current approach (of course, I am only talking about compatibility with _coerce_types here, which is the focus of this issue. Other parts of statistics may impose further restrictions as we've just seen for _sum).

    Finally, regarding speed. The fundamental difference between the current implementation and my proposed change is that the current version calls _coerce_types for every number in the input sequence, so performance is critical here, but in my version _coerce_types gets called only once and then operates on a really small set of input types, so it is absolutely not the time-critical step in the overall performance of _sum.
    For this very reason I made no effort at all to optimize the code, but just tried to keep it as simple and clear as possible.

    This, in fact, is IMHO the second major benefit of my proposal for _coerce_types (besides making its result order-independent). Read the current code for _coerce_types, then the proposed one. Try to consider all their ramifications and side-effects and decide which one's easier to understand and maintain.

    Best,
    Wolfgang

    @oscarbenjamin
    Copy link
    Mannequin

    oscarbenjamin mannequin commented Feb 3, 2014

    It's not as simple as registering with an ABC. You also need to provide the
    interface that the ABC represents:

    >>> import sympy
    >>> r = sympy.Rational(1, 2)
    >>> r
    1/2
    >>> r.numerator
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'Half' object has no attribute 'numerator'

    AFAIK there are no plans by any third part libraries to increase their
    inter-operability with the numeric tower.

    My point is that in choosing what types to accept and how to coerce them you
    should focus on actual practical benefits rather than theoretical ones. If it
    can be done in a way that means it works for more numeric types then that's
    great. But when I say "works" I mean that it should ideally achieve the best
    possible accuracy for each type.

    If that's not possible then it might be simplest to just document how it works
    for combinations of the std lib types (and perhaps subclasses thereof) and
    then say that it will fall back on coercing to float for anything else. This
    approach is simpler to document and for end-users to understand. It also has
    the benefit that it will work for all non std lib types (that I'm aware of)
    without pretending to achieve more accuracy than it can.

    >>> import sympy, fractions, gmpy
    >>> fractions.Fraction(sympy.Rational(1, 2))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/fractions.py", line 148, in __new__
        raise TypeError("argument should be a string "
    TypeError: argument should be a string or a Rational instance
    >>> float(sympy.Rational(1, 2))
    0.5
    >>> fractions.Fraction(gmpy.mpq(1, 2))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/fractions.py", line 148, in __new__
        raise TypeError("argument should be a string "
    TypeError: argument should be a string or a Rational instance
    >>> float(gmpy.mpq(1, 2))
    0.5

    Coercion to float via __float__ is well supported in the Python ecosystem.
    Consistent support for getting exact integer ratios is (unfortunately) not.

    @wm75
    Copy link
    Mannequin

    wm75 mannequin commented Feb 3, 2014

    Just to make sure that this discussion is not getting on the wrong track,
    there are currently two strict requirements for any numeric type to be usable with statistics._sum:

    (1) the type has to provide either
    - numerator/denominator properties or
    - an as_integer_ratio method or
    - an as_tuple method that mimicks the Decimal method of that name

    this requirement comes from statistics._exact_ratio.
    

    This is where, for example, the sympy numeric types fail because they do not provide any of these interfaces.


    For completeness, this is the code of this function:
    
    def _exact_ratio(x):
        """Convert Real number x exactly to (numerator, denominator) pair.
        >>> _exact_ratio(0.25)
        (1, 4)
    x is expected to be an int, Fraction, Decimal or float.
    """
    try:
        try:
            # int, Fraction
            return (x.numerator, x.denominator)
        except AttributeError:
            # float
            try:
                return x.as_integer_ratio()
            except AttributeError:
                # Decimal
                try:
                    return _decimal_to_ratio(x)
                except AttributeError:
                    msg = "can't convert type '{}' to numerator/denominator"
                    raise TypeError(msg.format(type(x).__name__)) from None
    except (OverflowError, ValueError):
        # INF or NAN
        if __debug__:
            # Decimal signalling NANs cannot be converted to float :-(
            if isinstance(x, Decimal):
                assert not x.is_finite()
            else:
                assert not math.isfinite(x)
        return (x, None)
    

    (2) Essentially, the numerator and the denominator returned by _exact_ratio have to be valid arguments for the Fraction constructor.
    This is a consequence of this block of code in _sum:

        for d, n in sorted(partials.items()):
            total += Fraction(n, d)

    Of note, Fraction(n, d) requires both arguments to be members of numbers.Rational and this is where, for example, the gmpy.mpq type fails.

    (3) The type's constructor has to work with a Fraction argument.
    This is because _sum tries to

    return T(total) # where T is the coerced type and total the calculated sum as a Fraction
    

    The gmpy.mpq type, for example, fails at this step again.

    ACTUALLY: THIS SHOULD BE RAISED AS ITS OWN ISSUE HERE as soon as the coercion part is settled because it means that _sum may succeed with certain mixed input types even though some of the same types may fail, when they are the only type.

    IMPORTANTLY, neither requirement has anything to do with the module's type coercion, which is the topic of this discussion.

    IN ADDITION, the proposed patch involving _coerce_types adds the following (soft) requirement:
    in order to be able to coerce a sequence of numbers of mixed numeric types without loss of precision all involved types need to be integrated into the hierarchy of numeric abstract base classes as defined in the numbers module (otherwise all types are coerced to float).

    From this it should be clear that the compatibility bottleneck is not in this patch, but in other parts of the module.

    What is more, if a custom numeric type implements the numerator/denominator properties, then it is simple enough to register it as a virtual subclass of numbers.Rational or .Integral to enable lossless coercion in the presence of mixed types.

    Example with gmpy2 numeric type:

    >> from gmpy2 import mpq # mpq is gmpy's Fraction equivalent
    >> import numbers

    >>> numbers.Rational.register(type(mpq()))
    >>> r = mpq(7,5)
    >>> type(r)
    <class 'mpq'>
    
    >>> issubclass(type(r), numbers.Rational)
    True

    => the mpq type could now be coerced correctly even in mixed input types situations, but remains incompatible with _sum for reasons (2) and (3).

    SUMMARY: Making _sum work with custom types is very complicated, BUT:
    this is not due to type coercion as it is discussed here.

    @wm75
    Copy link
    Mannequin

    wm75 mannequin commented Feb 3, 2014

    there are currently two strict requirements for any numeric type to be usable with statistics._sum:

    I meant *three* of course (remembered one only during writing).

    @oscarbenjamin
    Copy link
    Mannequin

    oscarbenjamin mannequin commented Feb 3, 2014

    I agree that supporting non-stdlib types is in some ways a separate issue from
    how to manage coercion with mixed stdlib types. Can you provide a complete
    patch (e.g. hg diff > coerce_types.patch).
    http://docs.python.org/devguide/

    There should probably also be tests added for situations where the current
    implementation behaves undesirably. Something like these ones:
    http://hg.python.org/cpython/file/a97ce3ecc96a/Lib/test/test_statistics.py#l1445

    Note that when I said non-stdlib types can be handled by coercing to float I
    didn't mean that the output should be coerced to float but rather the input
    should be coerced to float because __float__ is the most consistent interface
    available on third party numeric types.

    Once the input numbers are converted to float statistics._sum can handle them
    perfectly well. In this case I think the output should also be a float so that
    it's clear that precision may have been lost. If the precision of float is not
    what the user wants then the documentation can point them toward
    Fraction/Decimal.

    @wm75
    Copy link
    Mannequin

    wm75 mannequin commented Feb 3, 2014

    Once the input numbers are converted to float statistics._sum can handle
    them perfectly well. In this case I think the output should also be a float so
    that it's clear that precision may have been lost. If the precision of float is not
    what the user wants then the documentation can point them toward
    Fraction/Decimal.

    Ah, I'm getting it now. That is actually a very interesting thought. Still I don't think this should be part of this issue discussion, but I'll think about it and file a new enhancement issue if I have an idea.

    As for providing the complete patch, I can do that I guess, but formulating the tests may take a while. I'll try to do it though, if Steven thinks it's worth it. After all he's the one who'd have to approve it and the patch does alter his design quite a bit.

    @ncoghlan
    Copy link
    Contributor Author

    ncoghlan commented Feb 4, 2014

    I think it's also acceptable at this point for the module docs to just say that handling of mixed type input is undefined and implementation dependent, and recommend doing "map(int, input_data)", "map(float, input_data)", "map(Decimal, input_data)" or "map(Fraction, input_data)" to ensure getting a consistent answer for mixed type input.

    I believe it would also be acceptable for the module to just fail immediately as soon as it detects an input type that differs from the type of the first value rather than attempting to guess the most appropriate behaviour.

    This close to 3.4rc1 (Sunday 9th February), I don't think we want to be committing to *any* particular implementation of type coercion.

    @oscarbenjamin
    Copy link
    Mannequin

    oscarbenjamin mannequin commented Feb 4, 2014

    I was working on the basis that we were talking about Python 3.5.

    But now I see that it's a 3.4 release blocker. Is it really that urgent?

    I think the current behaviour is very good at handling a wide range of types.
    It would be nice to consistently report errors for incompatible types but it
    can also just be documented as a thing that users shouldn't do.

    If there were a situation where it silently returned a highly inaccurate
    value I would consider that urgent but I don't think there is.

    @ncoghlan
    Copy link
    Contributor Author

    ncoghlan commented Feb 4, 2014

    Changing the behaviour is not urgent - documenting that it may change
    in the future is essential.

    @wm75
    Copy link
    Mannequin

    wm75 mannequin commented Feb 4, 2014

    Hi Nick and Oscar,
    my patch including tests is ready. What else should I say than that I think it is ok, but of course with only days remaining and still no feedback from Steven, I'm not sure there is any chance for it going into 3.4 still.
    Anyway, before uploading I wanted to ask you what the best procedure is:
    upload the module and its tests as one diff or separately so that it is easier to see where the current module version fails?
    Best,
    Wolfgang

    @stevendaprano
    Copy link
    Member

    Wolfgang,

    Thanks for the patch, I have some concerns about it, but the basic idea
    does look reasonable. However, I've been convinced that supporting mixed
    types at all needs more careful thought. Under the circumstances, I'm
    more concerned about making sure we don't lock in any behaviour we'll
    regret, so for 3.4 I'm going to follow Nick's advice and reject
    mixed-type calculations.

    @wm75
    Copy link
    Mannequin

    wm75 mannequin commented Feb 4, 2014

    Hi Steven,
    sounds reasonable, still here's the patch in diff version.
    Its rules for type coercion are detailed in _coerce_types's docstring.
    Questions and comments are welcome.
    Best,
    Wolfgang

    @stevendaprano
    Copy link
    Member

    Attached is a patch which:

    • documents that mixed types are not currently supported;

    • changes the behaviour of _sum to raise TypeError on mixed input
      types (mixing int and <other> is allowed, but nothing else);

    • updates the tests;

    • adds some documentation changes for bpo-20389.

    @ncoghlan
    Copy link
    Contributor Author

    ncoghlan commented Feb 8, 2014

    Claiming to commit before 3.4rc1 (as I believe Steven's SSH key still needs to be added)

    @ncoghlan ncoghlan assigned ncoghlan and unassigned stevendaprano Feb 8, 2014
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 8, 2014

    New changeset 5db74cd953ab by Nick Coghlan in branch 'default':
    Close bpo-20481: Disallow mixed type input in statistics
    http://hg.python.org/cpython/rev/5db74cd953ab

    @python-dev python-dev mannequin closed this as completed Feb 8, 2014
    @ncoghlan
    Copy link
    Contributor Author

    ncoghlan commented Feb 8, 2014

    OK, I committed a slight variant of Steven's patch:

    • I moved the bpo-20389 doc changes to a separate patch that I uploaded over there

    • Steven had marked one of the tests as potentially obsolete, I just removed it entirely. If we decide we eventually want it back, it's still there in the VCS history :)

    @oscarbenjamin
    Copy link
    Mannequin

    oscarbenjamin mannequin commented Feb 8, 2014

    Close bpo-20481: Disallow mixed type input in statistics

    If I understand correctly the reason for hastily pushing this patch
    through is that it's the safe option: disallow mixing types as a quick
    fix for soon to be released 3.4. If we want to allow mixing types then
    that can be a new feature for 3.5.

    Is that correct?

    If so should the discussion about what to do in 3.5 take place in this
    issue (i.e. reopen for 3.5) or should it be a new issue? bpo-20499
    would benefit from clarity about the statistics module policy for
    mixing types.

    @ncoghlan
    Copy link
    Contributor Author

    ncoghlan commented Feb 8, 2014

    Yes, a new RFE to sensibly handle mixed type input in 3.5 would make sense (I did something similar for the issue where we removed the special casing of Counter for 3.4).

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants