Skip to content

random.choice fails on numpy arrays #100805

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
NicoNeureiter opened this issue Jan 6, 2023 · 9 comments
Closed

random.choice fails on numpy arrays #100805

NicoNeureiter opened this issue Jan 6, 2023 · 9 comments
Assignees
Labels
type-bug An unexpected behavior, bug, or error

Comments

@NicoNeureiter
Copy link

NicoNeureiter commented Jan 6, 2023

Bug report

In a project of mine I use random.choice on a numpy array (at multiple point in the code) which worked fine until recently. After upgrading to Python 3.11, my code crashes with:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The value error is caused by a if not seq check in random.choice (line 369) that was introduced in commit 3fee777. The check is clearly intended to test for emptiness of the sequence and would raise an error if the sequence is empty. With numpy arrays (even non-empty ones) it now causes the ValueError when trying to cast the array to a boolean.

The issue is easily reproducible with the following code:

import random
import numpy as np
a = np.array([1, 2, 3])
random.choice(a)

The code worked on earlier versions, but fails in Python 3.11.

The problem could be avoided by checking for emptiness using if len(seq) == 0. I understand that if not seq checks are relatively common and many people prefer the conciseness, but numpy compatibility and backwards compatibility seem important to me in this context.

Tested on:

  • CPython versions tested on: 3.11.0
  • Operating system and architecture: Ubuntu 22.04 64-bit

Linked PRs

@NicoNeureiter NicoNeureiter added the type-bug An unexpected behavior, bug, or error label Jan 6, 2023
@AlexWaygood
Copy link
Member

This issue was also reported by @llimeht here: #30008 (comment)

@rhettinger
Copy link
Contributor

rhettinger commented Jan 6, 2023

Thanks for the report. I'll look at it again shortly. I suspect we would have to make a speed trade-off which would be a bummer in the case of a fine-grained function like this one. Is there a reason for not using numpy.random.choice which is specifically designed to support numpy arrays and do so more efficiently than CPython can?

@rhettinger
Copy link
Contributor

Mark, do you have thoughts on whether the random sequence functions should support NumPy arrays? And what about other functions outside the random module?

The random.shuffle() and random.choices() functions happen to work, but we don't test this or promise this. The random.sample() function rejects NumPy arrays because isinstance(population, collections.abc.Sequence) returns False. This has been the case for over a decade. The random.choice() function fails because it tests if not seq and NumPy arrays intentionally block the __bool__ method. This has only been the case since 3.11 was released in October 2022, so we could easily switch to if not len(seq) if we want to support NumPy arrays. On my machine, the performance cost is 4%.

@mdickinson
Copy link
Member

whether the random sequence functions should support NumPy arrays

No strong opinions either way on the "should". It's convenient when they do, but NumPy has its own rich selection of RNG functionality, and NumPy arrays differ enough from the Sequence interface that it's not reasonable to expect them to be usable everywhere that a Python sequence is.

That said, this particular case seems like a clear regression. If we want to enforce that the argument to random.choice is a Sequence, that seems like something that should go through a deprecation period.

@mdickinson mdickinson removed their assignment Jan 7, 2023
@AlexWaygood
Copy link
Member

My 2 cents: in general, I don't think there's any obligation for functions in the random module to support numpy arrays. if not seq is a Pythonic idiom that numpy has chosen to disallow, which is pretty unfortunate. However, considering that this has been reported independently by two different people, it seems clear that this is a somewhat painful, breaking change for users of numpy. The performance cost is regrettable, but I'd argue in favour of changing it to if not len(seq).

@mdickinson
Copy link
Member

If we want to enforce that the argument to random.choice is a Sequence [...]

Hmm. That wouldn't actually help, anyway, since as far as I can tell being an instance of Sequence doesn't guarantee the existence of a __bool__ method that tests for emptiness.

@NicoNeureiter
Copy link
Author

Thank you all for reacting to this so quickly. I actually did not consider that this check is performance relevant. But this is my first issue on CPython and it makes sense that here every percent of performance counts.

Is there a reason for not using numpy.random.choice which is specifically designed to support numpy arrays and do so more efficiently than CPython can?

In my case it was mainly consistency, since I already used random in the same module and decided not to mix it with numpy.random when there is no reason to. And for single samples CPython implementation seems to be faster on my laptop. Even random.choice(list(a)) is still an order of magnitude faster than np.random.choice(a). But neither of those reasons is really relevant in my case and if random.choice() hadn't supported numpy arrays in previous versions, I would have used np.random.choice() without any further considerations. The reason for this report was backwards compatibility, which I assume might affect other people too.

@rhettinger
Copy link
Contributor

rhettinger commented Jan 7, 2023

I'll go ahead and make a PR and backport it. It is not really a bug fix; it is more of an accommodation for reliance on an implementation detail.

The function contract requires that choice() raise an IndexError for an empty input. PEP 8 recommends in against if len(seq) == 0 and in favor of if not seq. We don't require that the input be an instance of collections.abc.Sequence. Instead the docs specify "sequence" in lowercase. That has always meant that __getitem__ and __len__ are defined and done so in a way that has well behaved sequence-like behavior. The expectation is that you can call: s[0], s[1], s[2], ..., s[len(s)-1], that s[len(s)] raises an IndexError, and that len(s) is supported. It is further expected that the downstream implications of that API have not been interfered with. A call to iter(s) should work and not be disabled. Likewise bool(s) should work and not be disabled.

Numpy has intentionally (and legitimately) decided not to follow this path. Knowingly it chose (because the benefits outweighed the costs) to not work with a huge swath of pure Python functions that expect either a "sequence" lowercase or "Sequence" uppercase.

Interestingly, MyPy has no ability to detect this issue. The choice() stub requires lowercase "sequence" semantics. It has no ability to detect that __bool__ or __iter__ have been defined have different (and incompatible) downstream semantics.

Perhaps a shout-out to Hyrum's law is warranted:

With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody.

@AlexWaygood
Copy link
Member

Thanks @NicoNeureiter and @llimeht for both reporting this issue! Support for numpy arrays has been restored on main, and backported to 3.11. It will be included in the next 3.11 patch release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

4 participants