Type signatures for standardized functions #13
Great suggestions and questions, thanks @shoyer. I agree with adding type annotations to signatures as much as we can.
I'd say yes, it must be of the same type. Even if all libraries defined only exactly what's in the spec, returning another type of array would be quite odd. And in practice no two libraries are going to be exactly 100% compatible, which would make it worse.
That's a tricky question. I can see the rationale for not having automatic rank promotion: if every array implementation required explicitly prepending size-one dimensions, that would catch a class of bugs (and probably has some other advantages too). On the other hand, NumPy et al. already do rank promotion, and it can't easily be turned off. It's not like keywords we don't spec, where one can easily write a wrapper that leaves them out. So semantics would then vary between libraries.
Good question, I was about to write up a proposal on required dtypes to support and casting rules. I'd lean towards no NumPy-like dtype promotion. JAX and PyTorch can't implement those casting rules, and I think library code would actually benefit from not having NumPy's weird rules. Perhaps there's a sensible middle ground?
Agreed -- anything in the intersection of what NumPy, JAX and PyTorch support, for example, is probably safe to include. That means type promotion within but not between families. I trust we don't want to adopt NumPy's value-dependent dtype promotion rules for scalars.
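To make the "within but not between families" idea concrete, here is a minimal, hypothetical sketch (not anything the spec or any library defines): promotion picks the wider dtype inside a family and raises on cross-family mixing, avoiding value-dependent casting entirely. The `FAMILIES` table and `promote` function are illustrative names only.

```python
# Hypothetical promotion sketch: each dtype belongs to a family; implicit
# promotion is only allowed within a family, never across families.
FAMILIES = {
    "int8": ("int", 8), "int16": ("int", 16),
    "int32": ("int", 32), "int64": ("int", 64),
    "float32": ("float", 32), "float64": ("float", 64),
}

def promote(a: str, b: str) -> str:
    fam_a, bits_a = FAMILIES[a]
    fam_b, bits_b = FAMILIES[b]
    if fam_a != fam_b:
        # No implicit cross-family casting: the user must cast explicitly.
        raise TypeError(f"no implicit promotion between {a} and {b}")
    return a if bits_a >= bits_b else b

print(promote("int16", "int32"))      # int32
print(promote("float32", "float64"))  # float64
```

Cross-family calls like `promote("int32", "float32")` would raise `TypeError` in this scheme, which is the "explicit casting" end of the trade-off discussed above.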
Just so I understand correctly: the reason for not necessarily wanting automatic rank promotion is to guard against scenarios in which I provide arrays having a different number of axes and I actually didn't mean to do so (user error), but my code continues to work because of automatically prepended dimensions, and thus I'm not able to easily track down undesired mismatched arrays. Am I understanding correctly?
This is exactly right. A canonical example would be adding a vector to a square matrix.
It is not clear to me that the desire to prevent automatic rank promotion is different from wanting to prevent broadcasting altogether. I can make a similar argument that, were I to provide two three-dimensional arrays, with one of those arrays having a singleton/degenerate dimension, I could have made a mistake which would be difficult to track down due to the lack of a runtime error when broadcasting is automatically supported. Accordingly, the question seems to me whether we specify broadcasting semantics at all.

IMO, we should include broadcasting semantics, and we should follow NumPy's semantics (as also implemented outside of Python in array programming environments such as MATLAB [1, 2] (a.k.a. implicit expansion) and Julia [1]), including automatic rank promotion. Should runtimes want to allow users to opt out of automatic rank promotion or broadcasting altogether, they can follow JAX's lead and support either global configuration or, e.g., a non-conforming abstraction layer which does not permit rank promotion/broadcasting. In both cases, the default behavior should be an array implementation conforming to the standard, with explicit opt-in (e.g., to aid debugging, etc.) for non-conforming behavior.

It does not seem to me that, because automatic rank promotion (and, by extension, broadcasting) can obscure unintended bugs, we should omit its specification, given its widespread prevalence and convenience, not just in Python via NumPy, but elsewhere. In the absence of global configuration/non-conforming abstractions, if unintended rank promotion/broadcasting bugs are a concern, nothing stops the array library consumer from including their own checks prior to method invocation which ensure either the same rank or the exact same shape. Meaning, sanity checking is punted to userland.

By not including automatic rank promotion in the specification, despite its widespread practice, we contribute to the problem we are trying to solve, which is normalizing user expectations across array libraries.
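For reference, the NumPy broadcasting semantics being discussed here (including automatic rank promotion) can be sketched in a few lines of plain Python. `broadcast_shapes` below is an illustrative re-implementation of the rules, not the spec's wording:

```python
def broadcast_shapes(s1, s2):
    """NumPy-style broadcasting: right-align shapes, prepend size-1 dims
    (rank promotion), then require each dim pair to be equal or contain 1."""
    ndim = max(len(s1), len(s2))
    # Rank promotion: pad the shorter shape with leading 1s.
    s1 = (1,) * (ndim - len(s1)) + tuple(s1)
    s2 = (1,) * (ndim - len(s2)) + tuple(s2)
    out = []
    for d1, d2 in zip(s1, s2):
        if d1 == d2 or d1 == 1 or d2 == 1:
            out.append(max(d1, d2))
        else:
            raise ValueError(f"shapes {s1} and {s2} are not broadcastable")
    return tuple(out)

print(broadcast_shapes((3, 3), (3,)))       # (3, 3): the vector + matrix case
print(broadcast_shapes((8, 1, 5), (2, 5))) # (8, 2, 5)
```

The first call is exactly the "adding a vector to a square matrix" example above: the vector is rank-promoted to shape `(1, 3)` and then stretched along the first axis.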
I think it's totally reasonable to say that we'll copy NumPy for broadcasting. There are other ways to catch broadcasting errors (e.g., type checking, at least in principle).
Thinking about this some more: there may be cases where it's not feasible to return the same type, so rather than saying "same type" (too strict) or "anything that adheres to the spec" (too loose), how about specifying the expected behavior like so:
Could you elaborate on what the harm is with the "too loose" approach? The approach you outline seems a bit complicated, and I wonder if a simpler approach of just saying it must be an object that follows the array protocol is viable.
To be a bit more explicit, I am proposing these types of signatures:

```python
from typing import Union
from typing_extensions import Protocol

class Array(Protocol):
    def __add__(self, other: Union["Array", int]) -> "Array":
        ...

def arange(x: Array) -> Array:
    ...
```

Or, if we want to allow one more level of indirection:

```python
from typing import Union
from typing_extensions import Protocol

class Arrayable(Protocol):
    def __array__(self) -> "Array":
        ...

class Array(Protocol):
    def __add__(self, other: Union[Arrayable, int]) -> Arrayable:
        ...

def arange(x: Arrayable) -> Arrayable:
    ...
```
And to play devil's advocate, a form of "you must always return an array of the same type as your self" could look like this?

```python
import typing
from typing import Union
from typing_extensions import Protocol

ARRAY = typing.TypeVar("ARRAY", bound="Array")

class Array(Protocol):
    def __add__(self: ARRAY, other: Union["Array", int]) -> ARRAY:
        ...

def arange(x: ARRAY) -> ARRAY:
    ...
```

Ralf, your proposal seems to be a mix between these two? I am curious if it is possible to define a typing for it.
I don't love forcing libraries to return an array of the exact same type. Many libraries use multiple array types internally. Ideally there would be a base class (equivalent of …).
First: this is not about static typing. The annotation in either case may, and likely will, look the same (your first version @saulshanabrook). The question I was getting at was not only what array type is returned, but also:
The answer is unlikely to be "everything that adheres to the array standard" for any array library. A very basic example that doesn't work today (it's arguable whether it's desirable to make that work):
A more artificial example where the spec you favor says that should be fine:
Maybe it's just about phrasing: to me, saying "anything that adheres to the protocol" is correct but not very useful; it seems to allow/encourage things that will definitely break user expectations. So I want the same static typing (whatever is returned is an array type that adheres to the spec), but also to say something more useful, without ruling out valid things people would like to do (e.g. multiple array types implemented within the same library).

My expectation is that, for efficiency and correctness reasons, even my first example is fine as is: I suspect PyTorch would not want to start accepting arrays that aren't `torch.Tensor`.
Actually no, I'm not sure that's 100% correct. I think the static typing in #13 (comment) is useful for array libraries that are happy to do coercion to their own internal array type (numpy would do that), and the right thing to do for consumer libraries that want to support multiple array types. But array libraries should also be free to type their functions as accepting only their own array type, rather than typing everything as an `array` that any conforming implementation may be passed in for.
@rgommers Thank you for the example, I think I see what you mean now. You are saying that for PyTorch, say, to be conformant to the API, it should only guarantee that it can take PyTorch arrays into its functions, not any array that follows the protocol? Is that right? If so, my question becomes "Is it possible to type this statically?" Obviously, a "no" there shouldn't preclude us from pursuing this approach, but it would, I think, then be interesting to see how that could be added at some point to the type annotation standards.
Yes exactly. I see array/tensor libraries as fundamental building blocks. They can conform to the standard without knowing about each other. So for me, this is a goal:
And this is a non-goal (where the connecting line means "array object into function", not data interchange):
I think I can make that into a clearer visual that communicates goal/non-goal, but I hope it's clear what I mean.
Yes, I see; that's very helpful. I think this conversation really moves into the broader space of "What is the contract provided to downstream libraries and how is that exposed?", so I might open another issue to have that discussion and try to outline some options there?
@saulshanabrook thanks, makes sense to summarize and continue in a more appropriately titled issue, because this ballooned a bit from "type signatures for standardized functions". |
I don't see the new issue yet, so for now I'll comment here so I don't forget it :)
I think type checking the sort of structure @rgommers suggests might be possible by defining the protocol as either "contravariant" or "covariant". (I don't find these names very intuitive, so it takes me some effort to remember which is which and how exactly they work, but I do remember these concepts exist for clarifying when subclasses are substitutable and when they aren't.)
The Function and Method signatures section currently says:
Each parameter does have a type annotation. The array inputs/outputs are simply typed as `array`.

I think we've done all we can/must here for now. Given that there are zero (AFAIK) libraries today with reasonably complete type annotations, I'm not sure it makes sense to say anything more in the document. EDIT: except summarize this status in the Static Typing section.
Okay there was something to say after all, opened a PR. I think not providing a |
I think standardized functions should include signatures that are at least in principle compatible with static type checking tools such as mypy, ideally treating dtype and shape as an implicit part of array types. This would eventually allow for static checking of code using these array APIs, which would make using these APIs much safer.
This would entail standardizing a number of implementation details beyond those in the current draft specs, e.g.,

- `axis` must use a Python `int` or `tuple[int, ...]`.
- Does `add(float64, float64) -> float64`? One argument against codifying dtype promotion is that the details of dtype promotion are rather tricky, and NumPy's rules are unsuitable for accelerators (JAX and PyTorch implement incompatible semantics). On the other hand, requiring explicit dtype casting (like in TensorFlow) is rather annoying for users (maybe less of a problem for library code).