Deal with np.array(sparsearr) densification #72
Conversation
From the docs for […]: […] You might also want to do […]. Since we're passing a complete Numpy array, the data sharing will be done for us. However, when using […]. But since we're marking it as "writable", that might cause other side effects I'm not aware of. Common sense says this should not be the case, though, since a new copy is produced every time.
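For illustration, a minimal sketch of exposing `__array_interface__` by densifying on demand (hypothetical class and attribute names, not this project's implementation):

```python
import numpy as np

class DenseViaInterface:
    """Toy stand-in for a sparse array that densifies on demand."""

    def __init__(self, shape, dtype="float64"):
        self.shape = shape
        self.dtype = np.dtype(dtype)

    def todense(self):
        # Stand-in densification; the real COO scatters its data here.
        return np.zeros(self.shape, dtype=self.dtype)

    @property
    def __array_interface__(self):
        # Densify and keep a reference, so the buffer behind the raw
        # pointer in the interface dict stays alive. The dict's "data"
        # entry is a (pointer, read-only flag) pair, which is where the
        # "writable" question above comes in.
        self._cached = self.todense()
        return self._cached.__array_interface__

x = np.array(DenseViaInterface((3, 4)))  # densifies via the interface
```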
One answer (now deleted) on SO suggested to use […].
Interesting. He was probably right to delete the answer, as it didn't answer the original question. That said, it does solve our problem. I'm happy to merge regardless (as this is clearly an improvement), but it'd be nice to see benchmarks either way.
Where would you like to see them? Here as a comment, or as part of the codebase and docs? Also, one thing to keep in mind is that […] now "just works" by silently converting the sparse array to a dense one. This could be a point of frustration for users if they accidentally consume all their RAM in one innocent-looking line. I am also working on a slightly smarter implementation of a […].
Here, as a comment. Benchmarks should only be included in the docs if there are changes across releases. Also, […], although this would end up comparing sparsity structure exactly. In any case, I think it's close to what we want. If you don't want to compare sparsity structure: […]
You might want to look at the logic we use in […].
On second thought, this PR in general might be a bad idea until we implement a suitable framework for auto-densification (#10), as it would densify most of the time, and densification would be implicit instead of explicit. Edit: On third thought, since np.array already does this worse than we do, it is a good idea, but we should give priority to #10.
Not implementing this is even worse, as NumPy tries to densify the array anyway...
Would using […]?
It's okay for now; however, we will need to find a resolution to #10 in the future. I'm looking into it now. Since Dask (and other parallel libraries) may use this, we might need thread-local storage for the configuration options.
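Something like the following could serve as that thread-local configuration (a sketch with hypothetical names; the real options for #10 are still undecided):

```python
import threading

# Each thread (e.g. a Dask worker thread) gets its own switch, so
# parallel callers don't race on a shared global flag.
_config = threading.local()

def densification_allowed():
    return getattr(_config, "densification_allowed", False)

def set_densification_allowed(value):
    _config.densification_allowed = bool(value)
```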
There's another problem with this... If we do […], I fear that with this solution we're moving more and more towards implicit densification. At this point, I think it's just best if we raise a […]. Edit: I opened a new issue for this, #81.
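For concreteness, a minimal sketch of the "just raise" option (the exception class is a guess, since the original comment's code span was lost):

```python
import numpy as np

class COOSketch:
    """Toy stand-in for COO, illustrating refusal to densify implicitly."""

    def __array__(self, dtype=None, **kwargs):
        # Refuse implicit densification; callers must use .todense()
        # if they genuinely want a dense copy.
        raise RuntimeError(
            "Cannot convert a sparse array to dense implicitly; "
            "call .todense() explicitly."
        )

# np.array(COOSketch()) now raises RuntimeError instead of silently
# allocating a potentially huge dense array.
```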
If possible, could you add tests for the cases […]?
What do you mean?
Tests of the form […].
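Something along these lines, perhaps (hypothetical shape and density; the concrete test forms were elided above):

```python
import numpy as np
import sparse

def test_array_matches_todense():
    x = sparse.random((10, 20), density=0.1)
    # np.array(x) should agree with an explicit densification.
    assert np.array_equal(np.array(x), x.todense())
```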
```diff
@@ -68,6 +68,7 @@ COO
    COO.to_scipy_sparse
    COO.tocsc
    COO.tocsr
+   COO.__array__
```
Do we actually want to document double-underscore functions in the API docs? I'm not sure what the convention is in the Python community. I know documenting them in code is good for potential contributors, but I'm not sure if we should put them in the API docs.
And I think in the case of `__array__` we should, because it is likely to show unexpected behaviour that users should be able to look up.
I just tested... Prior to merging #68, I tried returning […].
One (slightly hackish) way may be […].
Another alternative would be […], which would be in line with what happens when you […]. Using this, the tests in #78 still fail.
Or […], which would be a little more in line with what happens when you […]. Using this, the tests in #78 succeed (but I don't know if what is happening internally is really sane).
The way I see it, we have two ways we can really go: […] I'm open to both, but slightly in favour of the first. #81 is fixed with both 1 and 2, and even if we return […]. Personally, I think if we implement 2 and run into problems again, we should drop Numpy 1.12 support completely.
I wish […].
cc @mrocklin. Your input here would be valuable.

@nils-werner What do you think? I'm not sure how widespread Numpy 1.13 adoption is, or whether we should put in the effort to support 1.12.

I don't know. I don't understand all this new ufunc magic well enough to give a useful answer...

I'm not sure I know enough about NumPy internals to quickly know the right thing to do here. cc'ing @njsmith in case he has time to comment. (Also, Nathaniel, meet @hameerabbasi and @nils-werner, both of whom have been doing a lot of great work on this project over the last month.)
@njsmith, just a quick run-down: we want […]. We're basically looking for a way to tell Numpy 1.12 and earlier NOT to use […].
I've opened #84 to deal with this for now.
Definitely don't inherit from […]. For […]. In general, the plan is for numpy to continue adding stuff like […]. For interaction with […].
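The "ufunc magic" mentioned earlier is presumably the NumPy 1.13+ `__array_ufunc__` protocol; here is a minimal illustrative sketch (a toy wrapper, not this project's code):

```python
import numpy as np

class Wrapped:
    """Minimal example of intercepting ufuncs via __array_ufunc__."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__":
            return NotImplemented
        # Unwrap our own instances, apply the ufunc, and re-wrap.
        arrays = [i.data if isinstance(i, Wrapped) else i for i in inputs]
        return Wrapped(ufunc(*arrays, **kwargs))

w = Wrapped([1, 2, 3])
print(np.add(w, 1).data)  # [2 3 4], dispatched through __array_ufunc__
```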
Thanks a lot @njsmith!

xref numpy/numpy#4164
@nils-werner, as discussed in #84, would you mind adding the fix in https://github.com/mrocklin/sparse/pull/72#issuecomment-357771582 (with the slight exception that we should mimic numpy behaviour for […])? It might be worthwhile converting the dtype with […].
OK, I've taken a bit of time and read through things here. In general, I apologize for being absent recently. Some thoughts: […]
It's also worth noting that Numpy saves us a bit here with the following:

```python
In [1]: from sparse import COO, random

In [2]: x = random((100000, 100000), density=0.000001)

In [3]: x
Out[3]: <COO: shape=(100000, 100000), dtype=float64, nnz=10000, sorted=False, duplicates=True>

In [4]: import numpy as np

In [5]: np.array(x)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-5-152940d1f14f> in <module>()
----> 1 np.array(x)

/home/mrocklin/workspace/sparse/sparse/coo.py in __array__(self, dtype, **kwargs)
   1230
   1231     def __array__(self, dtype=None, **kwargs):
-> 1232         x = self.todense()
   1233         if dtype and x.dtype != dtype:
   1234             x = x.astype(dtype)

/home/mrocklin/workspace/sparse/sparse/coo.py in todense(self)
    353         """
    354         self.sum_duplicates()
--> 355         x = np.zeros(shape=self.shape, dtype=self.dtype)
    356
    357         coords = tuple([self.coords[i, :] for i in range(self.ndim)])

MemoryError:
```
Here is a strawman proposal: https://github.com/mrocklin/sparse/pull/87
Picking up from #68:

Implementing `len()` means that NumPy can create `np.ndarray`s from `COO` arrays using `np.array(x)`. But it's very, very slow.

If I understand it correctly, `numpy.array(x)` looks for `buffer` and `__array_interface__` when creating an array from an object. If it doesn't find either, it simply iterates over the object (hence the slowness).

What if we implement `__array_interface__` so that it calls `self.todense()`?

Before: […]

After: […]

I can't explain where the 2x slowdown (11 vs 23 µs) comes from, though. The factor is also there for `x = sparse.random((20, 30, 40000))`: `78ms` vs `184ms`.

But please be aware that I am phrasing this suggestion as a question! I don't exactly know what NumPy does with the value returned from `__array_interface__`. Does it copy it? Does it share it? Are we leaking memory? Are we double-freeing it?
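To make the fallback behaviour concrete, here is a small timing sketch (hypothetical sizes; whether `np.array` copies or shares the buffer is exactly the open question above):

```python
import timeit

import numpy as np
import sparse

x = sparse.random((20, 30, 400), density=0.01)

# Without __array_interface__ (or __array__), np.array(x) falls back to
# iterating over the object element by element, which is very slow; with
# it, NumPy can read the dense buffer produced by todense() in one step.
print(timeit.timeit(lambda: np.array(x), number=10))
```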