
Deal with np.array(sparsearr) densification #72


Closed

Conversation

@nils-werner (Contributor) commented Jan 10, 2018:

Picking up from #68:

Implementing __len__() means that NumPy can create np.ndarrays from COO arrays using

numpy.array(x)

But it's very very slow.

If I understand it correctly, numpy.array(x) looks for the buffer protocol and __array_interface__ when creating an array from an object. If it doesn't find either, it simply iterates over the object (hence the slowness).

What if we implement __array_interface__ that calls self.todense()?

@property
def __array_interface__(self):
    # Describe ourselves to NumPy as a dense buffer by materializing
    # the full array via todense().
    return {
        'shape': self.shape,
        'data': self.todense(),
        'typestr': self.dtype.str,
    }

Before

x = sparse.random((20, 30, 20))

%timeit numpy.array(x)
# 1.48 s ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit x.todense()
# 11.6 µs ± 40.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
numpy.allclose(x, x.todense())
# True

After

%timeit numpy.array(x)
# 23 µs ± 392 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
numpy.allclose(x, x.todense())
# True

I can't explain where the 2× slowdown (11.6 µs vs 23 µs) comes from, though. The factor is also there for x = sparse.random((20, 30, 40000)): 78 ms vs 184 ms.

But please be aware that I am phrasing this suggestion as a question! I don't exactly know what NumPy does with the value returned from __array_interface__. Does it copy it? Does it share it? Are we leaking memory? Are we double-freeing it?


@hameerabbasi (Collaborator):

From the docs for __array_interface__, it seems data sharing is done automatically, so we should be good as long as NumPy is free of bugs.

You might also want to do 'data': (self.todense(), True), where the second element is the read-only flag, since we're not writing to the original object. Since we're passing a complete NumPy array, the data sharing will be done for us; I suspect (but am not sure) that the slowness is due to the double allocation and copy. If we used False, the data from the array produced by todense() would be shared and (I suspect) be faster.

But since False would mark the buffer as writable, that might cause other side effects I'm not aware of. Common sense says this should not be the case, though, since a new copy is produced every time __array_interface__ is called.
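
For concreteness, a minimal sketch of the read-only variant (the cache attribute _dense_cache is hypothetical; some reference must keep the dense buffer alive while NumPy reads from it):

@property
def __array_interface__(self):
    dense = self.todense()
    # Keep a reference so the dense buffer stays alive while NumPy reads it.
    self._dense_cache = dense
    return {
        'shape': self.shape,
        'typestr': dense.dtype.str,
        # (pointer, read-only flag): True marks the buffer read-only.
        'data': (dense.ctypes.data, True),
    }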

@nils-werner changed the title from "Implemented __array_interface__" to "Speed up np.array(sparsearr) conversion" (Jan 11, 2018)
@nils-werner (Contributor):

One answer (now deleted) on Stack Overflow suggested using __array__.
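
A minimal sketch of that approach (it matches the __array__ implementation visible in the traceback further down in this thread):

def __array__(self, dtype=None, **kwargs):
    # Called by NumPy when coercing this object via np.array()/np.asarray().
    x = self.todense()
    if dtype and x.dtype != dtype:
        x = x.astype(dtype)
    return x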

@hameerabbasi (Collaborator):

Interesting. He was probably right to delete the answer, as it didn't answer the original question. That said, it does solve our problem.

I'm happy to merge (as this is clearly an improvement), but it'd be nice to see benchmarks either way.

@nils-werner (Contributor) commented Jan 11, 2018:

Where would you like to see them? Here as a comment, or as part of the codebase and docs?

Also, one thing to keep in mind is that

x = sparse.random((200, 200, 200, 200, 200, 200), density=0.00000001)
np.allclose(x, 0)

now "just works" by silently converting the sparse array to a dense one. This could be a point of frustration for users, if they accidentally consume all their RAM in one innocent looking single line.

I am also working on a slightly smarter implementation of a sparse.allclose() that tries to do as much of the comparsion as possible in the sparse domain.

@hameerabbasi (Collaborator):

Here, as a comment. Benchmarks should only be included in docs if there are changes across releases.

Also, sparse.allclose would be simple.

  • Check shape.
  • Call sum_duplicates
  • Call np.array_equal(coords1, coords2)
  • Call np.allclose(data1, data2)

Although this would end up comparing sparsity structure exactly. In any case, I think it's close to what we want. If you don't want to compare sparsity structure:

  • Match coords1 and coords2
  • Call np.allclose(data_matched1, data_matched2)
  • np.isclose(data_unmatched1, 0).all()
  • np.isclose(data_unmatched2, 0).all()

You might want to look at the logic we use in _elemwise_binary if you take this approach.
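
A minimal sketch of the first (exact-structure) variant, assuming sum_duplicates() also leaves the coordinates in a canonical order (the helper name sparse_allclose is hypothetical):

import numpy as np

def sparse_allclose(a, b, rtol=1e-05, atol=1e-08):
    # Step 1: shapes must match.
    if a.shape != b.shape:
        return False
    # Step 2: canonicalize both operands.
    a.sum_duplicates()
    b.sum_duplicates()
    # Step 3: require identical sparsity structure.
    if not np.array_equal(a.coords, b.coords):
        return False
    # Step 4: compare the stored values approximately.
    return np.allclose(a.data, b.data, rtol=rtol, atol=atol)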

@hameerabbasi (Collaborator) commented Jan 11, 2018:

On second thought, this PR in general might be a bad idea until we implement a suitable framework for auto-densification (#10), as it would densify most of the time, and densification would be implicit instead of explicit.

Edit: On third thought, since np.array already does this worse than we do, it is a good idea, but we should give priority to #10.

@nils-werner (Contributor):

On second thought, this PR in general might be a bad idea

Not implementing this is even worse, as NumPy tries to densify the array anyway...

@nils-werner (Contributor):

Would using maybe_densify be a solution?

@hameerabbasi (Collaborator):

It's okay for now; however, we will need to find a resolution to #10 in the future. I'm looking into it now. Since Dask (and other parallel libraries) may use this, we might need thread-local storage for the configuration options.
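
A minimal sketch of the thread-local configuration idea (all names here are hypothetical, not part of the library):

import threading

_config = threading.local()

def set_auto_densify(flag):
    # Each thread (e.g. a Dask worker thread) gets its own setting.
    _config.auto_densify = flag

def auto_densify_enabled():
    return getattr(_config, "auto_densify", False)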

@hameerabbasi (Collaborator) commented Jan 13, 2018:

There's another problem with this... If we do x + y where x is a scipy.sparse.spmatrix, and y is COO, the addition "indirectly" calls np.asanyarray and that tries to densify the COO... In a purely sparse situation!

I fear that with this solution we're moving more and more towards implicit densification. At this point, I think it's best if we just raise a NotImplementedError in __array__, or return NotImplemented, so the operation above falls back to COO.__radd__ and can be handled properly.

Edit: I opened a new issue for this, #81.

@hameerabbasi (Collaborator):

If possible, could you add tests for the cases scipy.sparse.spmatrix op COO?

@nils-werner (Contributor):

What do you mean?

@hameerabbasi (Collaborator):

Tests of the form x op y where x is scipy.sparse.spmatrix and y is COO. We would like to check that the result is COO as well.
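
A minimal sketch of such a test (pytest-style; shapes, densities and the test name are arbitrary):

import numpy as np
import scipy.sparse
import sparse
from sparse import COO

def test_spmatrix_op_coo():
    x = scipy.sparse.random(10, 10, density=0.1)
    y = sparse.random((10, 10), density=0.1)
    result = x + y
    # The operation should stay sparse rather than densifying.
    assert isinstance(result, COO)
    assert np.allclose(result.todense(), x.toarray() + y.todense())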

@nils-werner changed the title from "Speed up np.array(sparsearr) conversion" to "Deal with np.array(sparsearr) densification" (Jan 15, 2018)
@@ -68,6 +68,7 @@ COO
COO.to_scipy_sparse
COO.tocsc
COO.tocsr
COO.__array__
@hameerabbasi (Collaborator) commented on the diff, Jan 15, 2018:

Do we actually want to document double-underscore methods in the API docs? I'm not sure what the Python community's convention is. Documenting them in code is good for potential contributors, but I'm not sure they belong in the API docs.

@nils-werner (Contributor) replied:

NumPy does document it. And I think in the case of __array__ we should, because it is likely to show unexpected behaviour that users should be able to look up.

@hameerabbasi added this to the 0.2 milestone and added the bug and discussion labels (Jan 15, 2018)
@hameerabbasi (Collaborator) commented Jan 15, 2018:

I just tested... Prior to merging #68, np.array(COO) returned an object array wrapping the COO. After that, we ran into weird bugs such as #81 and #78, all resulting from the changed np.array(COO) behavior. I wonder if there's a way to go back to the old behavior while keeping __len__...

I tried returning self, but that raises an error: ValueError: object __array__ method not producing an array

@nils-werner (Contributor) commented Jan 15, 2018:

One (slightly hackish) way might be

def __array__(self, dtype=object):
    if dtype != object:
        raise NotImplementedError(
            "Casting sparse COO array to dense array is not supported. "
            "Use .todense() to force densification."
        )
    # Create an empty 0-d object array...
    arr = np.empty((), dtype=object)
    # ...and assign `self` into it, preventing a recursive call of __array__()
    arr[()] = self
    return arr
  • np.array(a) produces array(<COO ...>, dtype=object)
  • np.array(a, dtype=float) etc. raise a NotImplementedError

@nils-werner (Contributor) commented Jan 16, 2018:

Another alternative would be

def __array__(self, dtype=None):
    if dtype is None:
        dtype = self.dtype
    arr = np.array(1, dtype=dtype)
    # Assign `self` into the 0-d array to prevent a recursive call of
    # __array__(); for non-object dtypes this raises the same
    # "setting an array element with a sequence" ValueError as NumPy does.
    arr[()] = self
    return arr

which would be in line with what happens when you np.array() a non-rectangular list of values:

a = sparse.random((20, 30, 40))
np.array(a)               # ValueError: setting an array element with a sequence.
np.array(a, dtype=object) # array(<COO: ...>, dtype=object)
np.array(a, dtype=int)    # ValueError: setting an array element with a sequence.
np.save("test.npy", a)    # ValueError: setting an array element with a sequence.

b = [1, [1, 1]]
np.array(b)               # ValueError: setting an array element with a sequence.
np.array(b, dtype=object) # array([1, list([1, 1])], dtype=object)
np.array(b, dtype=int)    # ValueError: setting an array element with a sequence.
np.save("test.npy", b)    # ValueError: setting an array element with a sequence.

Using this, the tests in #78 still fail.

@nils-werner (Contributor) commented Jan 16, 2018:

Or

def __array__(self, dtype=object):
    arr = np.array(1, dtype=dtype)
    # assign `self` to array to prevent recursive call of __array__()
    arr[()] = self
    return arr

which would be a little more in line with what happens when you np.array(x) an arbitrary object, and allows straight-up np.array(a) casting (the exception type isn't right yet):

a = sparse.random((20, 30, 40))
np.array(a)               # array(<COO: ...>, dtype=object)
np.array(a, dtype=object) # array(<COO: ...>, dtype=object)
np.array(a, dtype=int)    # ValueError: setting an array element with a sequence.
np.save("test.npy", a)

b = dict()
np.array(b)               # array({}, dtype=object)
np.array(b, dtype=object) # array({}, dtype=object)
np.array(b, dtype=int)    # TypeError: int() argument must be a string, a bytes-like object or a number, not 'dict'
np.save("test.npy", b)

Using this, the tests in #78 succeed (but I don't know if what is happening internally is really sane).

@hameerabbasi (Collaborator) commented Jan 16, 2018:

The way I see it, we have two ways we can really go:

  1. Raise a TypeError (I think a TypeError is more appropriate here) and drop Numpy 1.12 support.
  2. Implement the hack in https://github.com/mrocklin/sparse/pull/72#issuecomment-357890921 or https://github.com/mrocklin/sparse/pull/72#issuecomment-357771582 (we would need to see what NumPy 1.12 does when the dtype isn't object and mimic that) and support Numpy 1.12.

I'm open to both but slightly in favour of the first. #81 is fixed by both 1 and 2, and even if we return NotImplemented or raise NotImplementedError or ValueError.

Personally, I think if we implement 2, and run into problems again, we drop Numpy 1.12 support completely.

@hameerabbasi (Collaborator):

I wish np.ndarray implemented some sort of abstract metaclass that we could inherit from to deal with all these problems. Inheriting from np.ndarray isn't really an option at this point.

@hameerabbasi (Collaborator):

cc @mrocklin Your input here would be valuable.

@hameerabbasi (Collaborator):

@nils-werner What do you think? I'm not sure how widespread Numpy 1.13 adoption is and whether we should put in the effort to support 1.12.

@nils-werner (Contributor):

I don't know. I don't understand all this new ufunc magic well enough to give a useful answer...

@mrocklin (Contributor):

cc @mrocklin Your input here would be valuable.

I'm not sure I know enough about NumPy internals to quickly know the right thing to do here. cc'ing @njsmith in case he has time to comment. (Also, Nathaniel, meet @hameerabbasi and @nils-werner, both of whom have been doing a lot of great work on this project over the last month.)

@hameerabbasi (Collaborator) commented Jan 18, 2018:

@njsmith, just a quick run-down. We want np.exp(COO), etc. to not call np.array(COO) but to default to our own COO.exp(). The same goes for scipy.sparse.spmatrix + COO, which calls np.array inside scipy.sparse.spmatrix.__add__ and then densifies our matrix instead of calling COO.__radd__.

We're basically looking for a way to tell NumPy 1.12 and earlier NOT to use the np.array result.

@hameerabbasi (Collaborator):

I've opened #84 to deal with this for now.

@njsmith (Member) commented Jan 19, 2018:

Definitely don't inherit from np.ndarray, that way lies nothing but suffering.

For np.exp(COO), I think __array_ufunc__ is pretty much the best/only solution you'll find. If that means people need to upgrade numpy then oh well. If they wanted to use old software they'd be using scipy.sparse, right?
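
For reference, a minimal sketch of the __array_ufunc__ hook (available since NumPy 1.13); the dispatch target _apply_sparse_ufunc is hypothetical:

class COO:
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # NumPy hands us the ufunc instead of coercing us with np.array().
        if method != '__call__':
            return NotImplemented
        # _apply_sparse_ufunc stands in for a sparse-aware dispatcher.
        return self._apply_sparse_ufunc(ufunc, *inputs, **kwargs)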

In general, the plan is for numpy to continue adding stuff like __array_ufunc__, e.g. there have been rumblings about __array_concatenate__, and I think another priority is some kind of asduckarray. The ideal is that eventually you'll be able to implement all of these __array_whatever__ functions and get something that basically acts as an array without being automatically densified. I would probably make __array__ raise an error for now though – eventually it might make sense, but if you make it an error now then you can always implement it later; if you make it work now, then you're stuck with it. And you definitely don't want people to get in the habit of calling it and getting back an ndarray([COO], dtype=object), that's just a mess.
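
And a sketch of the raise-for-now approach to __array__ (the error type and message are illustrative):

def __array__(self, *args, **kwargs):
    raise TypeError(
        "Implicit conversion of a COO array to a dense ndarray "
        "is not supported. Use .todense() explicitly."
    )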

For interaction with scipy.sparse, I dunno, it depends on the details of how that code works, which I'm not very familiar with. You'd have to talk to them. I would seriously consider simply not supporting it at all and telling people to convert all their matrices to sparse objects.

@hameerabbasi (Collaborator):

Thanks a lot @njsmith!

@hameerabbasi (Collaborator):

xref numpy/numpy#4164

@hameerabbasi (Collaborator):

@nils-werner, as discussed in #84, would you mind adding the fix in https://github.com/mrocklin/sparse/pull/72#issuecomment-357771582 (with the slight exception that we should mimic numpy behaviour for dtype != object)?

It might be worthwhile converting the dtype with np.result_type before erroring.
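
A sketch of that check, combined with the object-array fix linked above (assumption: non-object dtypes should mimic NumPy's "setting an array element with a sequence" error):

import numpy as np

def __array__(self, dtype=object):
    # Normalize the requested dtype so that e.g. 'O' and object compare equal.
    if dtype is not None and np.result_type(dtype) != np.dtype(object):
        raise ValueError("setting an array element with a sequence.")
    arr = np.empty((), dtype=object)
    arr[()] = self  # avoid a recursive __array__() call
    return arr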

@mrocklin (Contributor):

OK, I've taken a bit of time and read through things here. In general I apologize for being absent recently. Some thoughts.

  1. I'm in favor of np.asarray(my_coo) densifying. This is dumb in some cases but I'd rather not get in people's way, even as they're being dumb.
  2. It would be nice to maintain compatibility with scipy.sparse.

It's also worth noting that NumPy saves us a bit here with MemoryErrors. So if we don't do anything, we're just relying on their policies, which I like.

In [1]: from sparse import COO, random

In [2]: x = random((100000, 100000), density=0.000001)

In [3]: x
Out[3]: <COO: shape=(100000, 100000), dtype=float64, nnz=10000, sorted=False, duplicates=True>

In [4]: import numpy as np

In [5]: np.array(x)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-5-152940d1f14f> in <module>()
----> 1 np.array(x)

/home/mrocklin/workspace/sparse/sparse/coo.py in __array__(self, dtype, **kwargs)
   1230 
   1231     def __array__(self, dtype=None, **kwargs):
-> 1232         x = self.todense()
   1233         if dtype and x.dtype != dtype:
   1234             x = x.astype(dtype)

/home/mrocklin/workspace/sparse/sparse/coo.py in todense(self)
    353         """
    354         self.sum_duplicates()
--> 355         x = np.zeros(shape=self.shape, dtype=self.dtype)
    356 
    357         coords = tuple([self.coords[i, :] for i in range(self.ndim)])

MemoryError: 

@mrocklin (Contributor):

Here is a strawman proposal: https://github.com/mrocklin/sparse/pull/87
