
Deal with np.array(sparsearr) densification #72


Closed

Conversation

@nils-werner (Contributor) commented Jan 10, 2018:

Picking up from #68:

Implementing __len__() means that NumPy can create np.ndarrays from COO arrays using

numpy.array(x)

But it's very very slow.

If I understand it correctly, numpy.array(x) looks for the buffer protocol and __array_interface__ when creating an array from an object. If it doesn't find either, it simply iterates over the object (hence the slowness).

What if we implement __array_interface__ that calls self.todense()?

@property
def __array_interface__(self):
    # Describe ourselves to NumPy as a dense buffer by materializing
    # the full array via todense().
    return {
        'shape': self.shape,
        'data': self.todense(),
        'typestr': self.dtype.str,
    }

Before

x = sparse.random((20, 30, 20))

%timeit numpy.array(x)
# 1.48 s ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit x.todense()
# 11.6 µs ± 40.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
numpy.allclose(x, x.todense())
# True

After

%timeit numpy.array(x)
# 23 µs ± 392 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
numpy.allclose(x, x.todense())
# True

I can't explain where the 2× slowdown (11.6 µs vs 23 µs) comes from, though. The factor is also there for x = sparse.random((20, 30, 40000)): 78 ms vs 184 ms.

But please be aware that I am phrasing this suggestion as a question! I don't exactly know what NumPy does with the value returned from __array_interface__. Does it copy it? Does it share it? Are we leaking memory? Are we double-freeing it?


@hameerabbasi (Collaborator):

From the docs for __array_interface__, it seems data sharing is done automatically, so we should be good as long as NumPy is free of bugs.

You might also want to do 'data': (self.todense(), True), where the second element is the read-only flag, since we're not writing to the original object. Since we're passing a complete NumPy array, the data sharing will be done for us; I suspect (but am not sure) that the slowness is due to the double allocation and copy. If we used False, the data from the array produced by todense() would be shared and (I suspect) be faster.

But since False would mark the buffer as writable, that might cause other side effects I'm not aware of. Common sense says this should not be the case, though, since a new copy is produced every time __array_interface__ is called.
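
For concreteness, a minimal sketch of the read-only variant (the cache attribute _dense_cache is hypothetical; some reference must keep the dense buffer alive while NumPy reads from it):

@property
def __array_interface__(self):
    dense = self.todense()
    # Keep a reference so the dense buffer stays alive while NumPy reads it.
    self._dense_cache = dense
    return {
        'shape': self.shape,
        'typestr': dense.dtype.str,
        # (pointer, read-only flag): True marks the buffer read-only.
        'data': (dense.ctypes.data, True),
    }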

@nils-werner changed the title from "Implemented __array_interface__" to "Speed up np.array(sparsearr) conversion" (Jan 11, 2018)
@nils-werner (Contributor):

One answer (now deleted) on Stack Overflow suggested using __array__.
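
A minimal sketch of that approach (it matches the __array__ implementation visible in the traceback further down in this thread):

def __array__(self, dtype=None, **kwargs):
    # Called by NumPy when coercing this object via np.array()/np.asarray().
    x = self.todense()
    if dtype and x.dtype != dtype:
        x = x.astype(dtype)
    return x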

@hameerabbasi (Collaborator):

Interesting. He was probably right to delete the answer, as it didn't answer the original question. That said, it does solve our problem.

I'm happy to merge (as this is clearly an improvement), but it'd be nice to see benchmarks either way.

@nils-werner (Contributor) commented Jan 11, 2018:

Where would you like to see them? Here as a comment, or as part of the codebase and docs?

Also, one thing to keep in mind is that

x = sparse.random((200, 200, 200, 200, 200, 200), density=0.00000001)
np.allclose(x, 0)

now "just works" by silently converting the sparse array to a dense one. This could be a point of frustration for users, if they accidentally consume all their RAM in one innocent looking single line.

I am also working on a slightly smarter implementation of a sparse.allclose() that tries to do as much of the comparsion as possible in the sparse domain.

@hameerabbasi (Collaborator):

Here, as a comment. Benchmarks should only be included in docs if there are changes across releases.

Also, sparse.allclose would be simple.

  • Check shape.
  • Call sum_duplicates
  • Call np.array_equal(coords1, coords2)
  • Call np.allclose(data1, data2)

Although this would end up comparing sparsity structure exactly. In any case, I think it's close to what we want. If you don't want to compare sparsity structure:

  • Match coords1 and coords2
  • Call np.allclose(data_matched1, data_matched2)
  • np.isclose(data_unmatched1, 0).all()
  • np.isclose(data_unmatched2, 0).all()

You might want to look at the logic we use in _elemwise_binary if you take this approach.
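
A minimal sketch of the first (exact-structure) variant, assuming sum_duplicates() also leaves the coordinates in a canonical order (the helper name sparse_allclose is hypothetical):

import numpy as np

def sparse_allclose(a, b, rtol=1e-05, atol=1e-08):
    # Step 1: shapes must match.
    if a.shape != b.shape:
        return False
    # Step 2: canonicalize both operands.
    a.sum_duplicates()
    b.sum_duplicates()
    # Step 3: require identical sparsity structure.
    if not np.array_equal(a.coords, b.coords):
        return False
    # Step 4: compare the stored values approximately.
    return np.allclose(a.data, b.data, rtol=rtol, atol=atol)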

@hameerabbasi (Collaborator) commented Jan 11, 2018:

On second thought, this PR in general might be a bad idea until we implement a suitable framework for auto-densification (#10), as it would densify most of the time, and densification would be implicit instead of explicit.

Edit: On third thought, since np.array already does this worse than we do, it is a good idea, but we should give priority to #10.

@nils-werner (Contributor):

On second thought, this PR in general might be a bad idea

Not implementing this is even worse, as NumPy tries to densify the array anyway...

@nils-werner (Contributor):

Would using maybe_densify be a solution?

@hameerabbasi (Collaborator):

It's okay for now; however, we will need to find a resolution to #10 in the future. I'm looking into it now. Since Dask (and other parallel libraries) may use this, we might need thread-local storage for the configuration options.
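
A minimal sketch of the thread-local configuration idea (all names here are hypothetical, not part of the library):

import threading

_config = threading.local()

def set_auto_densify(flag):
    # Each thread (e.g. a Dask worker thread) gets its own setting.
    _config.auto_densify = flag

def auto_densify_enabled():
    return getattr(_config, "auto_densify", False)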

@hameerabbasi (Collaborator) commented Jan 13, 2018:

There's another problem with this... If we do x + y where x is a scipy.sparse.spmatrix, and y is COO, the addition "indirectly" calls np.asanyarray and that tries to densify the COO... In a purely sparse situation!

I fear that with this solution we're moving more and more towards implicit densification. At this point, I think it's best if we just raise a NotImplementedError in __array__, or return NotImplemented, so the operation above falls back to COO.__radd__ and can be handled properly.

Edit: I opened a new issue for this, #81.

@hameerabbasi (Collaborator):

If possible, could you add tests for the cases scipy.sparse.spmatrix op COO?

@nils-werner (Contributor):

What do you mean?

@hameerabbasi (Collaborator):

Tests of the form x op y where x is scipy.sparse.spmatrix and y is COO. We would like to check that the result is COO as well.
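
A minimal sketch of such a test (pytest-style; shapes, densities and the test name are arbitrary):

import numpy as np
import scipy.sparse
import sparse
from sparse import COO

def test_spmatrix_op_coo():
    x = scipy.sparse.random(10, 10, density=0.1)
    y = sparse.random((10, 10), density=0.1)
    result = x + y
    # The operation should stay sparse rather than densifying.
    assert isinstance(result, COO)
    assert np.allclose(result.todense(), x.toarray() + y.todense())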

@nils-werner changed the title from "Speed up np.array(sparsearr) conversion" to "Deal with np.array(sparsearr) densification" (Jan 15, 2018)
@@ -68,6 +68,7 @@ COO
COO.to_scipy_sparse
COO.tocsc
COO.tocsr
COO.__array__
@hameerabbasi (Collaborator) commented on the diff, Jan 15, 2018:

Do we actually want to document double-underscore methods in the API docs? I'm not sure what the Python community's convention is. Documenting them in code is good for potential contributors, but I'm not sure they belong in the API docs.

@nils-werner (Contributor) replied:

NumPy does document it. And I think in the case of __array__ we should, because it is likely to show unexpected behaviour that users should be able to look up.

@hameerabbasi added this to the 0.2 milestone and added the bug and discussion labels (Jan 15, 2018)
@hameerabbasi (Collaborator) commented Jan 15, 2018:

I just tested... Prior to merging #68, np.array(COO) returned an object array wrapping the COO. After that, we ran into weird bugs such as #81 and #78, all resulting from the changed np.array(COO) behavior. I wonder if there's a way to go back to the old behavior while keeping __len__...

I tried returning self, but that raises an error: ValueError: object __array__ method not producing an array

@nils-werner (Contributor) commented Jan 15, 2018:

One (slightly hackish) way might be

def __array__(self, dtype=object):
    if dtype != object:
        raise NotImplementedError(
            "Casting sparse COO array to dense array is not supported. "
            "Use .todense() to force densification."
        )
    # Create an empty 0-d object array...
    arr = np.empty((), dtype=object)
    # ...and assign `self` into it, preventing a recursive call of __array__()
    arr[()] = self
    return arr
  • np.array(a) produces array(<COO ...>, dtype=object)
  • np.array(a, dtype=float) etc. raise a NotImplementedError

@nils-werner (Contributor) commented Jan 16, 2018:

Another alternative would be

def __array__(self, dtype=None):
    if dtype is None:
        dtype = self.dtype
    arr = np.array(1, dtype=dtype)
    # Assign `self` into the 0-d array to prevent a recursive call of
    # __array__(); for non-object dtypes this raises the same
    # "setting an array element with a sequence" ValueError as NumPy does.
    arr[()] = self
    return arr

which would be in line with what happens when you np.array() a non-rectangular list of values:

a = sparse.random((20, 30, 40))
np.array(a)               # ValueError: setting an array element with a sequence.
np.array(a, dtype=object) # array(<COO: ...>, dtype=object)
np.array(a, dtype=int)    # ValueError: setting an array element with a sequence.
np.save("test.npy", a)    # ValueError: setting an array element with a sequence.

b = [1, [1, 1]]
np.array(b)               # ValueError: setting an array element with a sequence.
np.array(b, dtype=object) # array([1, list([1, 1])], dtype=object)
np.array(b, dtype=int)    # ValueError: setting an array element with a sequence.
np.save("test.npy", b)    # ValueError: setting an array element with a sequence.

Using this, the tests in #78 still fail.

@nils-werner (Contributor) commented Jan 16, 2018:

Or

def __array__(self, dtype=object):
    arr = np.array(1, dtype=dtype)
    # assign `self` to array to prevent recursive call of __array__()
    arr[()] = self
    return arr

which would be a little more in line with what happens when you np.array(x) an arbitrary object, and allows straight-up np.array(a) casting (the exception type isn't right yet):

a = sparse.random((20, 30, 40))
np.array(a)               # array(<COO: ...>, dtype=object)
np.array(a, dtype=object) # array(<COO: ...>, dtype=object)
np.array(a, dtype=int)    # ValueError: setting an array element with a sequence.
np.save("test.npy", a)

b = dict()
np.array(b)               # array({}, dtype=object)
np.array(b, dtype=object) # array({}, dtype=object)
np.array(b, dtype=int)    # TypeError: int() argument must be a string, a bytes-like object or a number, not 'dict'
np.save("test.npy", b)

Using this, the tests in #78 succeed (but I don't know if what is happening internally is really sane).

@hameerabbasi (Collaborator) commented Jan 16, 2018:

The way I see it, we have two ways we can really go:

  1. Raise a TypeError (I think a TypeError is more appropriate here) and drop Numpy 1.12 support.
  2. Implement the hack in https://github.com/mrocklin/sparse/pull/72#issuecomment-357890921 or https://github.com/mrocklin/sparse/pull/72#issuecomment-357771582 (we would need to see what NumPy 1.12 does when the dtype isn't object and mimic that) and support Numpy 1.12.

I'm open to both but slightly in favour of the first. #81 is fixed by both 1 and 2, and even if we return NotImplemented or raise NotImplementedError or ValueError.

Personally, I think if we implement 2, and run into problems again, we drop Numpy 1.12 support completely.

@hameerabbasi (Collaborator):

I wish np.ndarray implemented some sort of abstract metaclass that we could inherit from to deal with all these problems. Inheriting from np.ndarray isn't really an option at this point.

@hameerabbasi (Collaborator):

cc @mrocklin Your input here would be valuable.

@hameerabbasi (Collaborator):

@nils-werner What do you think? I'm not sure how widespread Numpy 1.13 adoption is and whether we should put in the effort to support 1.12.

@nils-werner (Contributor):

I don't know. I don't understand all this new ufunc magic well enough to give a useful answer...

@mrocklin (Contributor):

cc @mrocklin Your input here would be valuable.

I'm not sure I know enough about NumPy internals to quickly know the right thing to do here. cc'ing @njsmith in case he has time to comment. (Also, Nathaniel, meet @hameerabbasi and @nils-werner, both of whom have been doing a lot of great work on this project over the last month.)

@hameerabbasi (Collaborator) commented Jan 18, 2018:

@njsmith, just a quick run-down. We want np.exp(COO), etc. to not call np.array(COO) but to default to our own COO.exp(). The same goes for scipy.sparse.spmatrix + COO, which calls np.array inside scipy.sparse.spmatrix.__add__ and then densifies our matrix instead of calling COO.__radd__.

We're basically looking for a way to tell NumPy 1.12 and earlier NOT to use the np.array result.

@hameerabbasi (Collaborator):

I've opened #84 to deal with this for now.

@njsmith (Member) commented Jan 19, 2018:

Definitely don't inherit from np.ndarray, that way lies nothing but suffering.

For np.exp(COO), I think __array_ufunc__ is pretty much the best/only solution you'll find. If that means people need to upgrade numpy then oh well. If they wanted to use old software they'd be using scipy.sparse, right?
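
For reference, a minimal sketch of the __array_ufunc__ hook (available since NumPy 1.13); the dispatch target _apply_sparse_ufunc is hypothetical:

class COO:
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # NumPy hands us the ufunc instead of coercing us with np.array().
        if method != '__call__':
            return NotImplemented
        # _apply_sparse_ufunc stands in for a sparse-aware dispatcher.
        return self._apply_sparse_ufunc(ufunc, *inputs, **kwargs)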

In general, the plan is for numpy to continue adding stuff like __array_ufunc__, e.g. there have been rumblings about __array_concatenate__, and I think another priority is some kind of asduckarray. The ideal is that eventually you'll be able to implement all of these __array_whatever__ functions and get something that basically acts as an array without being automatically densified. I would probably make __array__ raise an error for now though – eventually it might make sense, but if you make it an error now then you can always implement it later; if you make it work now, then you're stuck with it. And you definitely don't want people to get in the habit of calling it and getting back an ndarray([COO], dtype=object), that's just a mess.
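
And a sketch of the raise-for-now approach to __array__ (the error type and message are illustrative):

def __array__(self, *args, **kwargs):
    raise TypeError(
        "Implicit conversion of a COO array to a dense ndarray "
        "is not supported. Use .todense() explicitly."
    )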

For interaction with scipy.sparse, I dunno, it depends on the details of how that code works, which I'm not very familiar with. You'd have to talk to them. I would seriously consider simply not supporting it at all and telling people to convert all their matrices to sparse objects.

@hameerabbasi (Collaborator):

Thanks a lot @njsmith!

@hameerabbasi (Collaborator):

xref numpy/numpy#4164

@hameerabbasi (Collaborator):

@nils-werner, as discussed in #84, would you mind adding the fix in https://github.com/mrocklin/sparse/pull/72#issuecomment-357771582 (with the slight exception that we should mimic numpy behaviour for dtype != object)?

It might be worthwhile converting the dtype with np.result_type before erroring.
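
A sketch of that check, combined with the object-array fix linked above (assumption: non-object dtypes should mimic NumPy's "setting an array element with a sequence" error):

import numpy as np

def __array__(self, dtype=object):
    # Normalize the requested dtype so that e.g. 'O' and object compare equal.
    if dtype is not None and np.result_type(dtype) != np.dtype(object):
        raise ValueError("setting an array element with a sequence.")
    arr = np.empty((), dtype=object)
    arr[()] = self  # avoid a recursive __array__() call
    return arr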

@mrocklin (Contributor):

OK, I've taken a bit of time and read through things here. In general I apologize for being absent recently. Some thoughts.

  1. I'm in favor of np.asarray(my_coo) densifying. This is dumb in some cases but I'd rather not get in people's way, even as they're being dumb.
  2. It would be nice to maintain compatibility with scipy.sparse.

It's also worth noting that NumPy saves us a bit here with MemoryErrors. So if we don't do anything, we're just relying on their policies, which I like.

In [1]: from sparse import COO, random

In [2]: x = random((100000, 100000), density=0.000001)

In [3]: x
Out[3]: <COO: shape=(100000, 100000), dtype=float64, nnz=10000, sorted=False, duplicates=True>

In [4]: import numpy as np

In [5]: np.array(x)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-5-152940d1f14f> in <module>()
----> 1 np.array(x)

/home/mrocklin/workspace/sparse/sparse/coo.py in __array__(self, dtype, **kwargs)
   1230 
   1231     def __array__(self, dtype=None, **kwargs):
-> 1232         x = self.todense()
   1233         if dtype and x.dtype != dtype:
   1234             x = x.astype(dtype)

/home/mrocklin/workspace/sparse/sparse/coo.py in todense(self)
    353         """
    354         self.sum_duplicates()
--> 355         x = np.zeros(shape=self.shape, dtype=self.dtype)
    356 
    357         coords = tuple([self.coords[i, :] for i in range(self.ndim)])

MemoryError: 

@mrocklin (Contributor):

Here is a strawman proposal: https://github.com/mrocklin/sparse/pull/87
