The definition and discussion around "default array index data type" is missing #319

Closed
leofang opened this issue Nov 8, 2021 · 2 comments · Fixed by #322
Labels: topic: Indexing, topic: Type Promotion
@leofang
Contributor

leofang commented Nov 8, 2021

The context: when returning indices and/or counts of array elements, the unique_*() APIs may have to promote the data type of the returned array (currently the default integer type) depending on the situation, so they should have been using the "default array index data type" instead.

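IIUC the term "default array index data type" is still referred to by some functions like argsort(), but its definition, and the discussion around its behavior (specifically, when to promote), are missing.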

Copying @kgryte from #317 (comment):

Originally, when writing the unique specification, the output dtype was the "default index data type". I wonder if we need to revive that distinction. Namely, that an array library should have three default data types:

  1. floating-point data type
  2. integer data type
  3. index data type

Here, having a default index data type makes sense, as counts should align accordingly (i.e., a count should never exceed the maximum array index).

Furthermore, while it may often be the case that indices will have the same dtype as the default integer dtype, this need not be the case. For example, indices may be int64, while the default integer dtype could be int32 due to better target hardware support (e.g., GPUs).
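The constraint in the quoted text (a count should never exceed the maximum array index, and the index dtype may be wider than the default integer dtype) can be sketched in plain Python. This is only an illustration: `smallest_index_dtype` is a hypothetical helper, not part of any array library or of the specification, and it assumes signed 32-bit and 64-bit integers are the only candidates.

```python
# Hypothetical helper for illustration only; not an API of any array library.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def smallest_index_dtype(num_elements: int) -> str:
    """Name the narrowest signed dtype that can hold every index and count.

    For an array with `num_elements` elements, the largest index is
    `num_elements - 1` and the largest possible count is `num_elements`,
    so the dtype must be able to represent `num_elements` itself.
    """
    if num_elements < 0:
        raise ValueError("num_elements must be non-negative")
    if num_elements <= INT32_MAX:
        return "int32"
    if num_elements <= INT64_MAX:
        return "int64"
    raise OverflowError("array too large for a 64-bit index")


# A small array fits in int32; one element past the int32 range needs int64,
# even if the library's default integer dtype were int32.
small = smallest_index_dtype(1_000)
large = smallest_index_dtype(2**31)
```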

@rgommers
Member

rgommers commented Nov 8, 2021

Relevant part of NumPy docs here: "The native NumPy indexing type is intp and may differ from the default integer array type. intp is the smallest data type sufficient to safely index any array; for advanced indexing it may be faster than other types."

Overview of integer types I always refer to: https://github.com/scikit-learn/scikit-learn/wiki/C-integer-types:-the-missing-manual

This has always been quite confusing. NumPy doesn't really explain it in its docs, and if it isn't made a first-class concept and the spec just says "returns integer array", it will be unclear, for example, in what cases a function returns 32-bit integers rather than 64-bit.

Quoting @leofang: "I could be wrong but the discussion around its behavior (specifically, when to promote) is still missing"

I would recommend that it does not become a separate type, but rather that it's either int32 or int64, with the only difference being that the "indexing default" may be something like "int32 on 32-bit platforms, int64 otherwise".

If it's done like that, no separate casting rules are needed.
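A minimal sketch of what that recommendation could look like, assuming the platform's pointer width decides the default. Both `default_index_dtype` and `promote_signed` are hypothetical names used only for illustration, and the promotion shown is the ordinary "wider signed integer wins" rule rather than anything index-specific.

```python
import struct

# Hypothetical names for illustration; not defined by any library or the spec.
_SIGNED_WIDTHS = {"int8": 8, "int16": 16, "int32": 32, "int64": 64}


def default_index_dtype() -> str:
    """Pick int32 on 32-bit platforms and int64 otherwise."""
    pointer_bits = struct.calcsize("P") * 8  # size of a C pointer in bits
    return "int32" if pointer_bits == 32 else "int64"


def promote_signed(a: str, b: str) -> str:
    """Ordinary signed-integer promotion: the wider of the two dtypes."""
    return a if _SIGNED_WIDTHS[a] >= _SIGNED_WIDTHS[b] else b


# Because the index default is just int32 or int64, combining it with any
# other signed integer dtype reuses the existing promotion rule unchanged.
index_dtype = default_index_dtype()
combined = promote_signed(index_dtype, "int32")
```

Since the index default here is an existing integer dtype rather than a new one, `combined` is always equal to `index_dtype`, which is the sense in which no separate casting rules are required.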

@kgryte
Contributor

kgryte commented Nov 8, 2021

@rgommers Agreed. I'll submit a PR with the proposed specification guidance.

3 participants