`cmp`ing fails when attrib is a numpy array #435
Since this is the second time numpy equality has been brought up (ref #409), I think it's best to keep this open. :) As a workaround, you can totally write your own `__eq__`. We'll have to come up with a nicer way!
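For context, the failure mode behind this issue is numpy's elementwise `__eq__`: comparing two arrays returns an array, not a bool, so any generated `__eq__` that compares attribute tuples blows up when it tries to coerce that array to a single truth value. A minimal illustration (plain numpy, no attrs involved):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 3])

# numpy's __eq__ returns an array of elementwise results, not a bool:
print(a == b)  # [ True  True  True]

# A tuple comparison -- which is how generated __eq__ methods typically
# compare attributes -- must coerce that array to a single bool, and fails:
try:
    (a,) == (b,)
except ValueError as exc:
    print(exc)  # "The truth value of an array with more than one element is ambiguous..."
```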
I've found this wildly frustrating myself, but I think this is a problem with numpy's `__eq__` design:

```python
# object
def __eq__(self, other: object) -> bool: ...

# numpy
def __eq__(self, other: np.ndarray) -> np.ndarray: ...
```

This is so you can get the elementwise equality product of two arrays easily. And every data scientist I talk to seems to think this is the most natural thing in the world. I deeply disagree with this design decision, but I think the reasoning behind it is clear: in numpy, everything operates on arrays, normal python language semantics be damned! So in order to solve this issue in attrs, you'd need a flexible way of comparing equality on an item-per-item basis... except we already have a good way to do this, and it's the `__eq__` protocol itself. There are realistically a few ways of handling this:
```python
import attr
import numpy as np

# Option 1: wrap the array in a small class that defines sane equality
@attr.s(cmp=False, slots=True)
class Arr:
    arr = attr.ib(type=np.ndarray)

    def __eq__(self, other: object) -> bool:
        if type(self) is not type(other):
            return NotImplemented
        return (self.arr == other.arr).all()

@attr.s
class Bitmap:
    raw = attr.ib(type=Arr)
    colorspace = attr.ib(type=str)

Bitmap(Arr(np.array([1])), 'srgb') == Bitmap(Arr(np.array([1])), 'srgb')
```
```python
import operator

import attr
import numpy as np

# Option 2: per-attribute comparison functions declared via metadata
@attr.s(cmp=False)
class CustomEq:
    label = attr.ib(type=str)
    tags = attr.ib(type=set)
    data = attr.ib(type=np.ndarray, metadata={'eq': np.array_equal})

    def __eq__(self, other):
        if type(self) is not type(other):
            return NotImplemented
        eq_spec = ((f.name, f.metadata.get('eq', operator.eq))
                   for f in attr.fields(type(self)))
        return all(eq(getattr(self, name), getattr(other, name))
                   for name, eq in eq_spec)
```

Example 2 is more like what I think attrs itself could generate. That said, calling numpy an edge case is a bit silly at this point, since the science wing of python is an enormous and important part of the community. But that's why I think option 1 makes the most sense. There have to be meaningful ways to drag their types into the normal python system, and light wrappers make a lot more sense than anything else I've come up with. In fact, you get the added benefit of a place to provide semantic information about things like numpy arrays, which are normally paraded around as naked data structures that reveal nothing of their intent. I ran into many of these issues at my job and in writing+using zerial (sorry, no docs yet), where I'm trying an approach to serialization somewhere between related and cattrs. And I've found that numpy really often doesn't play well with the rest of python. But for an entire class of users, it is the very reason they use python and not something else. So we'll have to figure out a way to deal with it, and I think wrapping numpy arrays might be the way.
I’m painfully aware of that; I meant it’s an edge case in the sense of not being Pythonic. :| You’ve kinda nailed how I’d hope people approach this problem: composition and metadata. The resulting code is far too inefficient for core, though, and metadata is mostly for end users, so how should one approach that? Some thoughts:
Question: what is so much more efficient about the
I could come up with a PR doing that this weekend; it shouldn't be too hard. This use case affects my work a lot, so I could get some real-world feedback with numpy + attrs users pretty quickly.
Item 4 complicates 3. If we never had to worry about doing this for ordering, then I would say make
I know in the numpy case that although I want normal equality semantics, I don't really want orderability. The fact that numpy arrays can be stored row- or column-major means that we can't even use index-based priority for comparisons like we can with regular python lists, and even if we could, it wouldn't make sense semantically. So in this case, we'd want to opt in to eq/ne but opt out of lt/gt/le/ge. I'll look and get back to you with a PR 😉
Hmmm, after looking at this, I think the best place to do it would be to overload the type of
To keep roughly the same efficiency, much of this could be generated the same way. Overloading might look something like:

```python
arr_compare = attr.CmpSpec(eq=np.array_equal, ord=False)

@attr.s
class DataStuff:
    name: str = attr.ib()
    arr: np.ndarray = attr.ib(cmp=arr_compare)
```

I'm working on it.
I think you're on to something with CmpSpec! I had this bug on my mind to switch to enums, but thinking about this particular problem, I basically came to the same conclusion: it would be nice to pass even more info. This would allow us to finally allow eq-only comparisons, which people have been asking about for a while. I wonder how to make this fit into the bigger picture (hence my delay in response, sorry). We've got ongoing work about customization of reprs, but there we're aiming at Union[bool, Callable]. It would be good to take a step back and ask ourselves:

Those are our current feature switches, so we should strive for consistency.
Yeah, I'm not sure exactly how it would work, but it would make it more flexible. And I don't know if
On names, I'd talk it over with other people. But I do think it makes sense to use specification objects rather than small unions, because it gets harder and harder to manage the values as you have more variants and use them in more places. For example, with the repr, it would make sense to accept a spec like this:

```python
from typing import Callable

import attr

def default_repr_func(_inst, _attrib, value):
    return repr(value)

@attr.s
class ReprSpec:
    include_this = attr.ib(type=bool)
    repr_func = attr.ib(type=Callable[[object, object, object], str],
                        default=default_repr_func)

def convert_to_repr_spec(val):
    if isinstance(val, ReprSpec):
        return val
    elif isinstance(val, bool):
        return ReprSpec(include_this=val)
    elif callable(val):
        return ReprSpec(include_this=True, repr_func=val)
```

But that would only be for making things uniform and allowing for easier future expansion. It almost feels like overkill in the case of reprs. The reason we need this for cmp is that "cmp" combines 2 or 6 concepts, depending on whether we only separate eq/ord or go into the entire rich comparison API. In any case, much of this wouldn't need to be exposed to the average user, and could be wrapped for common use cases in some kind of advanced API.
At any rate, I would like to help out with this comparison implementation, but it may be a week or two before I'm able to follow up with an actual PR. Do you have a place where you're discussing these other expansions to repr-ability and so forth?
I would like a solution to this. I created an evil and naive monkeypatch to try it out: https://gist.github.com/jamescasbon/b0e1f2113a28e523ff3326d7b93eda19 It does clarify one issue in my mind: array equality is probably going to need policies to be chosen. Sometimes np.array_equal would do, but other times we might want allclose with a tolerance. @jriddy did you get anywhere with a PR?
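The two policies mentioned here differ in a way that matters for round-tripped floats: `np.array_equal` demands exact elementwise equality, while `np.allclose` tolerates small drift. A quick sketch (plain numpy, nothing attrs-specific):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a + 1e-9  # tiny floating-point drift, e.g. from a serialization round-trip

# Strict policy: exact elementwise equality
print(np.array_equal(a, b))  # False

# Tolerant policy: equal within relative/absolute tolerances (rtol, atol)
print(np.allclose(a, b))     # True
```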
No, I've been busy. And in most cases this ends up being a less important use case for us. I've been getting by mostly with attribute- or class-level cmp=False for these things. The place where it really matters to me is in testing, because we build serialization models around attr.s classes, so I have a whole suite of tests that assert that the serialized and de-serialized object matches the original object. This is important to me, but I can build test helpers and fixtures to deal with this use case. Otherwise... it just hasn't come up the way I thought it would. My research guys... well, it never occurs to them that it's even weird that numpy arrays behave this way. Even the ones that are deep into python development (I've got a former matplotlib core dev on my team) think this is totally normal. To me, the type of all the cmp functions is like If you can help me clarify the use cases, I think I can work out a solution that could fit into the current implementation. So what are the policies we want for an attribute's
As a small update, I think the final idea is brewing in my head, after also flying it by @glyph. As much as I dislike subclassing, I think it would make sense to refactor the method generation into protocols. Something along the lines of:

```python
from abc import ABCMeta, abstractmethod

class CmpMaker(metaclass=ABCMeta):
    @abstractmethod
    def make_eq(self, cls, attributes):
        pass

    @abstractmethod
    def make_ne(self, cls, attributes):
        pass

    # ...one abstract maker per rich comparison method
```

...and each returns a method for cls based on attributes. This would allow you to re-use your implementations, wrap existing ones, etc. I suspect/hope that it's a more general solution to the problem. You can also easily disable comparison and just go for eq/ne. Am I missing something? I'm traveling rn so I haven't gotten around to implementing a PoC yet.
@hynek Use
@glyph It seems more that the pattern of re-use this would encourage would be inheritance:

```python
class EqOnlyMaker:
    @staticmethod
    def _notimpl(this, other):
        return NotImplemented

    def make_lt(self, cls, attributes):
        return self._notimpl

    make_le = make_gt = make_ge = make_lt

class MyEqMakerImpl(EqOnlyMaker):
    def make_eq(self, cls, attributes):
        ...
```

It seems whenever you start introducing method "hooks" like this, inheritance becomes the dominant pattern. How have you avoided this with Twisted in recent years? The last I used it, you had to use inheritance everywhere.
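For what it's worth, the maker idea can be exercised without attrs at all. This toy version (the names `SimpleEqMaker` and `with_eq` are made up for illustration, not attrs API) shows a maker object handing a generated `__eq__` to a class:

```python
class SimpleEqMaker:
    """Builds an __eq__ for cls that compares the named attributes."""
    def make_eq(self, cls, attributes):
        def __eq__(self, other):
            if type(self) is not type(other):
                return NotImplemented
            return all(getattr(self, a) == getattr(other, a) for a in attributes)
        return __eq__

def with_eq(attributes, maker=SimpleEqMaker()):
    def decorate(cls):
        cls.__eq__ = maker.make_eq(cls, attributes)
        cls.__hash__ = None  # defining __eq__ makes the class unhashable by default
        return cls
    return decorate

@with_eq(["x", "y"])
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

print(Point(1, 2) == Point(1, 2))  # True
print(Point(1, 2) == Point(1, 3))  # False
```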
@jriddy I think that this might be a degenerate case of inheritance, where you have a protocol that has obvious default behavior for all of its methods and then you have a superclass that provides null implementations of all that behavior for convenience just to avoid repetition. The problem that Twisted has, which I agree is hard to avoid, is that people mistake the degenerate case for a more general one, and then start using inheritance multiple layers deep to implement different interfaces. I'm not sure how to defend against this, since Twisted still has a decade-sized hole to dig itself out of and may never fully emerge.
So the end result of this is a class that is created at definition time, and whose methods get called to return the 6 rich comparison functions, which get added to the class? I'm starting to see some sense in this. I think the only thing lacking is a way to selectively re-use the default implementation (which is clever and fast) for selected attributes, while still remaining picklable.
I quite like @jriddy's suggestion of attr.CmpSpec, and it makes sense to me that the onus of providing a suitable eq (and lt, gt, etc.) is on the users, because they know what data the attribute holds. Nothing against providing an out-of-the-box CmpSpec that can be used with numpy or pandas, of course, but without trying to guess for the user.

We could implement a protocol in the following way: CmpSpec is a simple object potentially containing the usual 6 suspect attributes, and they must be Callables with input (self, other) and output (bool); if CmpSpec has only one function, we can add its negative for free; if it has 2 or more, we can fill in the blanks and implement what is missing. The user shouldn't even have to provide a CmpSpec class; anything that looks like one (i.e. has any of the 6 attributes) should be good enough.

This seems easy enough to incorporate into _make_eq: instead of comparing one big tuple that contains all attributes, we can compare one tuple per attribute, or use the CmpSpec-provided eq Callable. I don't know if that would be slower because we'd build many single tuples instead of one big tuple, but I can't imagine users being super sensitive to eq performance.

Disclaimer: I'd like very much to have a go at this, but I haven't worked on a project like this before, and I might underestimate the complexities of modifying a library with so many users.
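Filling in a missing comparator "for free", as described above, is mechanical. A sketch of deriving ne from a user-supplied eq (the helper name `derive_ne` is hypothetical, not attrs API):

```python
def derive_ne(eq):
    """Build a ne callable from an eq callable, mirroring Python's default negation."""
    def ne(self, other):
        result = eq(self, other)
        # Propagate NotImplemented so Python can try the reflected operation.
        return result if result is NotImplemented else not result
    return ne

# Usage with a toy eq that compares lowercased strings:
eq = lambda a, b: a.lower() == b.lower()
ne = derive_ne(eq)
print(ne("Foo", "foo"))  # False
print(ne("Foo", "bar"))  # True
```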
@botant I'd be willing to collaborate with you. I'm finally going to start to have time due to some changes in my work situation, so we can find a way to make this work. If you can @-me in a WIP PR, I can help with integrating it more seamlessly into the existing code.
That is awesome @jriddy, thanks! I'll get started on it.
@botant and I are working on this, but we really need some help with the name for this concept. See botant/attrs#2 if you have any ideas or suggestions, please.
* Updated implementation of comparison behaviour customization.
* Fixed version of next release, updated newsfragment and documentation.
* Fixed documentation.
* Fixed comments and changelog.
* Fixed doctest error.
* Updated src/attr/_make.py.
* Pass eq_key and order_key explicitly in _CountingAttr.
* Merged with master and resolved conflicts after introduction of _make_method.

Co-authored-by: Antonio Botelho <[email protected]>
Co-authored-by: Hynek Schlawack <[email protected]>
Fixed by #627 – tell your Numpy friends. :)
I'm trying to follow what the solution was. Docs for flexible attribute comparison are in this PR: #768. But there is still a question: is it not simply this?

```python
from functools import partial

import numpy
from attr import attrs, attrib

@attrs(auto_attribs=True)
class Foo:
    position: numpy.ndarray = attrib(eq=numpy.array_equal, factory=partial(numpy.zeros, 3))
    flag: bool = False
```
The documentation can be found here: https://www.attrs.org/en/stable/comparison.html#customization |
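Per those docs, the released API exposes this through `attr.cmp_using`, which builds a comparison-key wrapper from the callables you supply. A sketch, assuming attrs >= 21.1 and numpy are installed (the class name `Coordinates` is illustrative):

```python
import attr
import numpy as np

@attr.s(auto_attribs=True)
class Coordinates:
    # cmp_using wraps each value so np.array_equal decides equality:
    position: np.ndarray = attr.ib(eq=attr.cmp_using(eq=np.array_equal))
    flag: bool = False

print(Coordinates(np.zeros(3)) == Coordinates(np.zeros(3)))  # True
print(Coordinates(np.zeros(3)) == Coordinates(np.ones(3)))   # False
```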
The generated `cmp` methods fail if the attrib is a numpy array (of size >1); this is because the elementwise result of comparing arrays cannot be coerced to a single truth value.

I realize that this can be switched off with `cmp=False`, but often comparing these attribs is useful!

Reproducible example:

The simplest fix might be to trust user annotations about what is/isn't a numpy array and check those separately. Or, maybe better, it would be awesome to supply a custom `cmp` function, where `f` is the standard cmp function (`eq`, `lt`, etc.) and `x, y` are the items being compared.
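The original reproducible example was lost in extraction; a minimal reconstruction that triggers the same failure (assuming only attrs and numpy) might look like:

```python
import attr
import numpy as np

@attr.s
class C:
    a = attr.ib()

x = C(np.array([1, 2]))
y = C(np.array([1, 2]))

# The generated __eq__ compares attribute tuples, so the elementwise
# array result cannot be coerced to a single bool:
try:
    x == y
except ValueError as exc:
    print(exc)
```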