Type signatures for standardized functions #13


Closed
shoyer opened this issue Jul 30, 2020 · 20 comments · Fixed by #62

@shoyer
Contributor

shoyer commented Jul 30, 2020

I think standardized functions should include signatures that are at least in principle compatible with static type checking tools such as mypy, ideally treating dtype and shape as an implicit part of array types. This would eventually allow for static checking of code using these array APIs, which would make using these APIs much safer.

This would entail standardizing a number of implementation details beyond those in the current draft specs, e.g.,

  • Python types: this is very basic, but we should specify exactly which Python types are valid for various arguments, e.g., axis must use a Python int or tuple[int, ...] (see the sketch after this list).
  • array types: are operations on an array of some type required to return an array of the same type? Or merely another array (of any type) that also satisfies the standard array API (whatever we eventually define that to be)?
  • array shapes: do we require implementing the full version of NumPy broadcasting? Or do we define behavior in simpler cases, e.g., where all shapes match? It might also make sense to only require a more explicit subset of broadcasting behavior, without automatic rank promotion.
  • array dtypes: do we include dtype promotion like what NumPy uses? Or do we only define operations on a predefined set of "safe" dtypes, e.g., add(float64, float64) -> float64? One argument against codifying dtype promotion is that the details of dtype promotion are rather tricky, and NumPy's rules are unsuitable for accelerators (JAX and PyTorch implement incompatible semantics). On the other hand, requiring explicit dtype casting (like in TensorFlow) is rather annoying for users (maybe less of a problem for library code).
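As a concrete illustration of the first point, here is a minimal sketch of what such an annotated signature could look like. The Array protocol and the sum signature below are hypothetical placeholders, not part of the current draft spec:

from typing import Optional, Tuple, Union

from typing_extensions import Protocol


class Array(Protocol):
    """Placeholder for whatever array protocol the standard defines."""


def sum(x: Array, *, axis: Optional[Union[int, Tuple[int, ...]]] = None) -> Array:
    ...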
@rgommers
Member

Great suggestions and questions, thanks @shoyer. I agree with adding type annotations to signatures as much as we can.

array types: are operations on an array of some type required to return an array of the same type?

I'd say yes, must be of the same type. Even if all libraries defined exactly what's in the spec, returning another type of array would be quite odd. And in practice no two libraries are going to be exactly 100% compatible, which would make it worse.

array shapes: do we require implementing the full version of NumPy broadcasting? Or do we define behavior in simpler cases, e.g., where all shapes match? It might also make sense to only require a more explicit subset of broadcasting behavior, without automatic rank promotion.

That's a tricky question. I can see the rationale for not having automatic rank promotion: if every array implementation required explicitly prepending size-one dimensions, that would catch a class of bugs (and probably have some other advantages too). On the other hand, NumPy et al. already do rank promotion, and it can't easily be turned off. This isn't like keywords we don't spec, where one can easily write a wrapper that leaves them out. So semantics are then going to vary between libraries.

array dtypes: do we include dtype promotion like what NumPy uses? Or do we only define operations on a predefined set of "safe" dtypes, e.g., add(float64, float64) -> float64? One argument against codifying dtype promotion is that the details of dtype promotion are rather tricky, and NumPy's rules are unsuitable for accelerators (JAX and PyTorch implement incompatible semantics). On the other hand, requiring explicit dtype casting (like in TensorFlow) is rather annoying for users (maybe less of a problem for library code).

Good question, I was about to write up a proposal on required dtypes to support and casting rules. I'd lean towards no NumPy-like dtype promotion. JAX and PyTorch can't implement those casting rules, and I think library code would actually benefit from not having NumPy's weird rules (there's also the casting keyword with its ‘no’, ‘equiv’, ‘safe’, ‘same_kind’, and ‘unsafe’ modes, and the unintuitive behavior when you mix floats and ints of different precisions).

Perhaps there's a sensible middle ground? add(float32, float64) -> float64 seems safe and useful to specify, but this is a can of worms:

In [1]: x1 = np.arange(3, dtype=np.uint32)                                                                            

In [2]: x2 = np.ones(3, dtype=np.float16)                                                                             

In [3]: (x1 + x2).dtype                                                                                               
Out[3]: dtype('float64')

@shoyer
Contributor Author

shoyer commented Jul 30, 2020

Perhaps there's a sensible middle ground? add(float32, float64) -> float64 seems safe and useful to specify

Agreed -- anything in the intersection of what NumPy, JAX and PyTorch support, for example, is probably safe to include. That means type promotion within but not between families (e.g., float32 + float64 but not float32 + int32).

I trust we don't want to adopt NumPy's value-dependent dtype promotion rules for scalars.
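As a rough sketch of what "promotion within but not between families" could mean in practice, here is a toy result_type helper (the name and the two families are illustrative, not a proposal):

# Toy promotion table: promote within a family, raise across families.
FLOAT_ORDER = ("float32", "float64")
INT_ORDER = ("int8", "int16", "int32", "int64")

def result_type(a: str, b: str) -> str:
    for order in (FLOAT_ORDER, INT_ORDER):
        if a in order and b in order:
            # promote to the wider dtype within the family
            return order[max(order.index(a), order.index(b))]
    raise TypeError(f"no implicit promotion between {a} and {b}")

assert result_type("float32", "float64") == "float64"
# result_type("float32", "int32") raises TypeError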

@kgryte
Contributor

kgryte commented Aug 3, 2020

Just so I understand correctly: the reason for not necessarily wanting automatic rank promotion is to guard against scenarios in which I provide arrays with different numbers of axes when I didn't mean to (user error), but my code continues to work because of automatically prepended dimensions, and thus I can't easily track down the unintended mismatch of arrays. Am I understanding correctly?

@shoyer
Contributor Author

shoyer commented Aug 3, 2020

Just so I understand correctly: the reason for not necessarily wanting automatic rank promotion is to guard against scenarios in which I provide arrays with different numbers of axes when I didn't mean to (user error), but my code continues to work because of automatically prepended dimensions, and thus I can't easily track down the unintended mismatch of arrays. Am I understanding correctly?

This is exactly right. A canonical example would be adding a vector to a square matrix.
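For instance, with NumPy's current broadcasting rules:

import numpy as np

m = np.zeros((3, 3))   # square matrix
v = np.arange(3)       # vector

# Rank promotion treats v as shape (1, 3) and broadcasts it along rows.
# If the user actually meant a column vector, this silently does the
# wrong thing instead of raising.
print((m + v).shape)   # (3, 3)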

@kgryte
Contributor

kgryte commented Aug 3, 2020

It is not clear to me that the desire to prevent automatic rank promotion is different from wanting to prevent broadcasting altogether. I can make a similar argument that, were I to provide two three-dimensional arrays, with one of those arrays having a singleton/degenerate dimension, I could have made a mistake which would be difficult to track down due to the lack of a runtime error when broadcasting is automatically supported.

Accordingly, the question seems to me whether we specify broadcasting semantics at all.

IMO, we should include broadcasting semantics and follow NumPy's semantics (which are also implemented outside of Python in array programming environments such as MATLAB [1, 2], where this is called implicit expansion, and Julia [1]), including automatic rank promotion.

Should runtimes want to allow users to opt out of automatic rank promotion, or of broadcasting altogether, they can follow JAX's lead and support either global configuration or, e.g., a non-conforming abstraction layer which does not permit rank promotion/broadcasting. In both cases, the default behavior should be an array implementation conforming to the standard, with explicit opt-in (e.g., to aid debugging) for non-conforming behavior.

The fact that automatic rank promotion (and, by extension, broadcasting) can obscure unintended bugs does not seem a sufficient reason to omit it from the specification, given its widespread prevalence and convenience, not just in Python via NumPy but elsewhere.

In the absence of global configuration/non-conforming abstractions, if unintended rank promotion/broadcasting bugs are a concern, nothing stops the array library consumer from including their own checks prior to method invocation which ensure either the same rank or the exact same shape. Meaning, sanity checking is punted to userland.
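For example, a minimal userland check along those lines (the helper name is illustrative):

def require_same_rank(*arrays):
    """Raise instead of relying on automatic rank promotion."""
    ranks = {a.ndim for a in arrays}
    if len(ranks) > 1:
        raise ValueError(f"arrays have mixed ranks: {sorted(ranks)}")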

By not including automatic rank promotion in the specification, despite its widespread practice, we contribute to the problem we are trying to solve, which is normalizing user expectations across array libraries.

@shoyer
Contributor Author

shoyer commented Aug 4, 2020

I think it's totally reasonable to say that we'll copy NumPy for broadcasting. There are other ways to catch broadcasting errors (e.g., type checking, at least in principle).

@rgommers
Member

rgommers commented Aug 4, 2020

array types: are operations on an array of some type required to return an array of the same type?

I'd say yes, must be of the same type. Even if all libraries defined exactly what's in the spec, returning another type of array would be quite odd. And in practice no two libraries are going to be exactly 100% compatible, which would make it worse.

Thinking about this some more: there may be cases where it's not feasible to return the same type, so rather than saying "same type" (too strict) or "anything that adheres to the spec" (too loose), how about specifying the expected behavior like so:

  • If the array type is implemented by the library that's being used (e.g. numpy.ndarray passed to a numpy function): must return an array of the same type.
  • If the array type is defined elsewhere, it can be a subclass of an array type defined in the library, or something else.
    • If it's a subclass: return an instance of that subclass if possible, otherwise of a superclass
    • If it's not a subclass: return an array of the same type as the input array if possible, otherwise a/the array type implemented by the library
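A hedged sketch of that rule as a single predicate (the function name is illustrative; the "if possible" caveats are noted in comments):

def expected_return_type(lib_array_type: type, input_type: type) -> type:
    """Illustrative only: which array type a library function returns."""
    if issubclass(input_type, lib_array_type):
        # the library's own type, or a subclass of it: return that type
        # if possible, otherwise a superclass
        return input_type
    # an array type defined elsewhere: return the same type as the input
    # if possible, otherwise the library's own array type
    return lib_array_type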

@saulshanabrook
Contributor

Could you elaborate on what the harm is with the "too loose" approach?

The approach you outline seems a bit complicated, and I wonder if a simpler approach of just saying it must be an object that follows the array protocol is viable.

@saulshanabrook
Contributor

saulshanabrook commented Aug 5, 2020

To be a bit more explicit, I am proposing these types of signatures:

from typing import Union

from typing_extensions import Protocol

def arange(x: "Array") -> "Array":
    ...


class Array(Protocol):
    def __add__(self, other: Union["Array", int]) -> "Array":
        ...

Or, if we want to allow one more level of indirection, like is done with __iter__, we could have:

from typing import Union

from typing_extensions import Protocol

class Arrayable(Protocol):
    def __array__(self) -> "Array":
        ...

def arange(x: Arrayable) -> Arrayable:
    ...


class Array(Protocol):
    def __add__(self, other: Union[Arrayable, int]) -> Arrayable:
        ...

@saulshanabrook
Contributor

saulshanabrook commented Aug 5, 2020

And to play devil's advocate, a form of "you must always return an array of the same type as yourself" could look like this:

from typing import TypeVar, Union

from typing_extensions import Protocol


ARRAY = TypeVar("ARRAY", bound="Array")

def arange(x: ARRAY) -> ARRAY:
    ...

class Array(Protocol):
    def __add__(self: ARRAY, other: Union["Array", int]) -> ARRAY:
        ...

Ralf, your proposal seems to be a mix between these two? I am curious whether it is possible to define a typing for it.

@shoyer
Contributor Author

shoyer commented Aug 5, 2020

I don't love forcing libraries to return an array of the exact same type. Many libraries use multiple array types internally. Ideally there would be a base class (equivalent of np.ndarray), but whether they expose a single class to users is really an implementation detail, e.g., pydata/sparse has COO and DOK, which are both subclasses of SparseArray. They could be both hidden behind the same wrapper class, but I don't think that indirection would serve any useful purpose.

@rgommers
Member

rgommers commented Aug 5, 2020

First: this is not about static typing. The annotation in either case may, and likely will, look the same (your first version @saulshanabrook).

I like the COO/DOK example, that’s a concrete one that should be made to work.

The question I was getting at was not only what array type is returned, but also:

  • which array types are accepted as input?
  • is it possible to rely on array methods and attributes beyond what’s in the standard?

The answer is unlikely to be “everything that adheres to the array standard” for any array library.

A very basic example that doesn't work today (it's arguable whether it's desirable to make it work):

In [1]: import torch                                                          

In [2]: import numpy as np                                                    

In [3]: x = np.array([1.5, 2.5])                                              

In [4]: y = torch.tensor([3.5, 4.5])                                          

In [5]: np.add(x, y)                                                          
Out[5]: tensor([5., 7.], dtype=torch.float64)

In [6]: torch.add(x, y)                                                       
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-456cd8716a08> in <module>
----> 1 torch.add(x, y)

TypeError: add(): argument 'input' (position 1) must be Tensor, not numpy.ndarray

A more artificial example that the spec you favor says should be fine:

torch.add(some_jax_array, some_tensorflow_array)   # returns a numpy array

Maybe it's just about phrasing: to me saying "anything that adheres to the protocol" is correct but not very useful, it seems to allow/encourage things that will definitely break user expectations.

So I want the same static typing (whatever is returned is an array type that adheres to the spec), but also to say something that's more useful, not by ruling out valid things people would like to do (e.g. multiple array types implemented within the same library, like DOK/COO), but by (a) providing guidance on how things should behave, and (b) ruling out completely unrealistic things.

My expectation is that, for efficiency and correctness reasons, even my first example is fine as is: I suspect PyTorch would not want to start accepting arrays that aren't torch.Tensor. And I think this is true for other libraries as well.

@rgommers
Member

rgommers commented Aug 5, 2020

So I want the same static typing (whatever is returned is an array type that adheres to the spec)

actually no, I'm not sure that's 100% correct. I think the static typing in #13 (comment) is useful for array libraries that are happy to do coercion to their own internal array type (numpy would do that), and the right thing to do for consumer libraries that want to support multiple array types. But array libraries should also be free to do:

def arange(x: myTensor) -> myTensor:
    ...

rather than typing everything as an Array Protocol.

@saulshanabrook
Contributor

@rgommers Thank you for the example, I think I see what you mean now.

You are saying that for PyTorch, say, to conform to the API, it only needs to guarantee that its functions accept PyTorch arrays, not any array that follows the protocol?

Is that right?


If so, my question becomes "Is it possible to type this statically?" Obviously, a "no" there shouldn't preclude us from pursuing this approach, but then I think it would be interesting to see how that could be added to the type annotation standards at some point.

@rgommers
Member

rgommers commented Aug 5, 2020

You are saying that for PyTorch, say, to conform to the API, it only needs to guarantee that its functions accept PyTorch arrays, not any array that follows the protocol?

Yes exactly. I see array/tensor libraries as fundamental building blocks. They can conform to the standard without knowing about each other. So for me, this is a goal (<mental picture: DAG>):

       scikit-learn
     /             \
   /                \
numpy              pytorch

And this is a non-goal (where the connecting line means "array object into function", not data interchange):

       scikit-learn
     /             \
   /                \
numpy  -------   pytorch

I think I can make that into a clearer visual that communicates goal/non-goal, but I hope it's clear what I mean.

@saulshanabrook
Contributor

Yes I see, that's very helpful. I think this conversation really moves into the broader space of "What is the contract provided to downstream libraries and how is that exposed?" so I might open another issue to have that discussion and try to outline some options there?

@rgommers
Member

rgommers commented Aug 5, 2020

@saulshanabrook thanks, makes sense to summarize and continue in a more appropriately titled issue, because this ballooned a bit from "type signatures for standardized functions".

@shoyer
Contributor Author

shoyer commented Aug 5, 2020

I don't see the new issue yet, so for now I'll comment here so I don't forget it :)

If so, my question becomes "Is it possible to type this statically?" Obviously, a "no" there shouldn't preclude us from pursuing this approach, but then I think it would be interesting to see how that could be added to the type annotation standards at some point.

I think type checking the sort of structure @rgommers suggests might be possible by defining either "contravariant" or "covariant" TypeVar instances:
https://www.python.org/dev/peps/pep-0484/#covariance-and-contravariance

(I don't find these names very intuitive, so it takes me some effort to remember which is which and how exactly they work, but I do remember these concepts exist for clarifying when subclasses are substitutable and when they aren't.)
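As a rough illustration of covariance in this context (all names here are invented for the example):

from typing import TypeVar

from typing_extensions import Protocol

A_co = TypeVar("A_co", covariant=True)


class SparseArray: ...
class COO(SparseArray): ...


class Producer(Protocol[A_co]):
    def make(self) -> A_co: ...


def consume(p: Producer[SparseArray]) -> SparseArray:
    # Because A_co is covariant, a Producer[COO] is also accepted here:
    # a producer of the subclass substitutes for a producer of the base.
    return p.make()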

@rgommers
Member

rgommers commented Oct 28, 2020

The Function and Method signatures section currently says:

Type annotations are left out of the signatures themselves for readability; they are added to the
descriptions of individual parameters however. In code which aims to adhere to the standard,
adding type annotations is strongly recommended.

Each parameter does have a type annotation. The array inputs/outputs are simply typed as array, without specifying what that is or offering some implementation that can be used / derived from. And the bit of ASCII art above (#13 (comment)) is now a proper diagram in the Assumptions section.

I think we've done all we can/must here for now. Given that there are (AFAIK) zero libraries today with reasonably complete type annotations, I'm not sure it makes sense to say anything more in the document. EDIT: except to summarize this status in the Static Typing section.

rgommers added a commit that referenced this issue Oct 28, 2020
@rgommers
Member

Okay, there was something to say after all; I opened a PR. I think not providing a Protocol now is justified, given that there's no place to put it. Such a thing would only be needed once multiple array libraries adopt the standard and a downstream library then starts supporting multiple array libraries. Inventing something at that point may be better.

rgommers added a commit that referenced this issue Oct 30, 2020