Automatic conversion to dense arrays #10

Closed
mrocklin opened this issue Apr 27, 2017 · 21 comments
Labels: discussion, enhancement

Comments

@mrocklin
Contributor

Operations on sparse arrays sometimes produce dense arrays. This type instability can cause some frustration downstream but may be optimal performance-wise.

On many occasions we actually inherit this behavior from scipy.sparse, which returns numpy.matrix objects in some cases. Currently we also return dense numpy.ndarray objects when this happens and when the number of non-zeros is high. I'm running into cases where I want to do this more and more, especially in parallel computing, where I tensordot and add together many sparse arrays. Switching to dense starts to make a lot of sense there.

However, this will likely cause some frustration downstream, as users will sometimes receive sparse and sometimes dense arrays depending on their data at the time. Is the performance gain worth it?
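
For concreteness, here is a rough sketch of the kind of data-dependent densification being discussed, and of the type instability it creates for callers. The function name and the density cutoff are invented for illustration; this is not library code.

import numpy as np
import sparse

def add_all(arrays, density_cutoff=0.5):
    """Hypothetical reduction: returns a COO while the result stays sparse,
    but a plain ndarray once it fills in past the cutoff."""
    total = arrays[0]
    for a in arrays[1:]:
        if isinstance(total, np.ndarray) and isinstance(a, sparse.COO):
            a = a.todense()          # stay dense once we have switched over
        total = total + a
        if isinstance(total, sparse.COO) and total.nnz / np.prod(total.shape) > density_cutoff:
            total = total.todense()  # downstream code now sees np.ndarray
    return total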

@jakevdp
Contributor

jakevdp commented Apr 27, 2017

That's a tough one... I'd probably avoid returning sparse or dense conditionally from the same routine, as that's a recipe for hard-to-catch bugs in corner cases.

@mrocklin
Contributor Author

Yeah, last week I would have come out strongly in favor of type stability. As I've been running computations in practice my confidence has decreased somewhat here. I'm still not entirely in favor of optional conversions, but if I'm unable to optimize down some sparse operations it may become an interesting option.

The saving grace is that losing type stability may become less of an issue once the __array_ufunc__ protocol emerges in NumPy 1.13. The type of container may start to matter much less in NumPy code in the future.

@hameerabbasi
Collaborator

One issue that I see in the code is that operator.add for COO is treated differently from other operators like operator.mul: we apply maybe_densify for add but not for the other operators, and a unit test failed when I tried to correct this. I ran into it when implementing operator.gt and friends, where I wanted to unify behavior across operators and consistently use the same code. Since there is a unit test, I'm assuming someone has a use for the current behavior, and I didn't want to change it without asking. We should decide on a concrete strategy to follow:

  1. We use maybe_densify consistently.
  2. We always raise an error.
  3. We always densify when needed.
  4. We provide a global flag or option that can be used with a with statement, similar to other libraries.
  5. We use maybe_densify consistently with global flags/options for maybe_densify.

Whatever we choose, we should bake it into elemwise and elemwise_binary (and possibly provide kwargs). Since __array_ufunc__ is now here, it should be okay to lose type stability. I'm in favor of option 5, as it is the most general.
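
As a point of reference, here is a rough sketch of how options 2 and 3 (exposed as a kwarg, as suggested above) might look. The helper name, the densify keyword, and the min_density threshold are hypothetical, not the library's actual maybe_densify:

import numpy as np

def _postprocess(result, *, densify="never", min_density=0.25):
    # Illustrative only: decide what to do with an elemwise result
    # that is no longer very sparse.
    density = result.nnz / np.prod(result.shape)
    if density < min_density:
        return result                  # still worth keeping sparse
    if densify == "always":            # option 3: silently densify
        return result.todense()
    if densify == "never":             # option 2: refuse loudly
        raise ValueError(
            "Result is {:.0%} dense; call .todense() explicitly.".format(density)
        )
    return result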

@hameerabbasi
Collaborator

hameerabbasi commented Dec 25, 2017

Two other things that would be helpful to think about are:

  • Do we auto-densify above a certain sparsity level?
  • Do we treat operators like operator.mul, where the result is sparse if any one of the operands is sparse, differently from operator.add?

@hameerabbasi
Collaborator

Paging @mrocklin @nils-werner @jakevdp. Your input would be very welcome.

@mrocklin
Contributor Author

At the moment I'm not in favor of auto-densification by default. However, I do agree that we should make it easy for users to do this on their own. One approach would be to expose something like maybe_densify to the user. Config options with a context manager also seem sensible to me. I don't have strong confidence in any direction here.

@mrocklin
Contributor Author

Personally I'm inclined to ignore this issue until it comes up in a few more use cases.

@hameerabbasi
Collaborator

I wanted to implement __array_ufunc__, so would it be okay (for now) to remove the auto-densification in __add__ and the test for it?

@mrocklin
Contributor Author

If we currently do auto-densification then yes, I think it makes sense to remove it.

An __array_ufunc__ implementation would be quite welcome.

@hameerabbasi
Collaborator

While researching how to do configurations with context managers in the proper, thread-safe way, I came across PEP 567. I'm surprised it doesn't already exist.

When making our own implementation (which I'm somewhat skeptical about), the following problems come to mind:

  • Inheriting from the parent thread.
  • Making it so race conditions don't make threads override each others' data.
  • I considered thread-local storage but that (probably?) won't be inherited by a child thread.

All of this is handled for us by PEP 567. However, if we're not worried about thread safety we can just use a stack.
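
For illustration, a PEP 567 based flag plus context manager might look roughly like the sketch below (contextvars, Python 3.7+). The names _auto_densify and allow_densify are made up for this example:

import contextvars
from contextlib import contextmanager

# Hypothetical module-level flag holding the densification policy.
_auto_densify = contextvars.ContextVar("auto_densify", default=False)

@contextmanager
def allow_densify():
    token = _auto_densify.set(True)
    try:
        yield
    finally:
        _auto_densify.reset(token)  # restore the previous value, even on error

# Inside an operation, the library would consult the flag:
#     if result_is_too_dense and not _auto_densify.get():
#         raise ValueError("automatic densification is disabled")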

@nils-werner
Contributor

nils-werner commented Jan 11, 2018

Am I correct in assuming that, by talking about thread-safe context managers, you mean you want to do something like

a = sparse.random((20, 20))

with sparse.allow_densify():
    a = a + 1

In that case I would seriously warn against it, for exactly the issues you are fearing. Plus, you would have to couple singleton/module state to the behaviour of your COO instances, which can be a huge pain: thread safety, unexpected behaviours, all unit tests becoming much less meaningful, etc.

This system of the module having an internal state is what gave us the horrible Matlab/Matplotlib principle where you need to track the state of plt in order to manipulate your plots:

plt.subplot(2, 1, 1)  # All commands after this implicitly target the first subplot
plt.plot(...)
plt.subplot(2, 1, 2)  # All commands after this implicitly target the second subplot
plt.plot(...)
plt.show()

Only recently have they started introducing (and encouraging) the right way of doing it: instead of the module having state, you get back instances that carry their own state:

f, (a, b) = plt.subplots(2, 1)  # a is the first, b the second subplot
a.plot(...)
b.plot(...)
plt.show()

So instead, if possible, I would recommend a solution that keeps the allow_densify state in the COO array itself:

a = sparse.random((20, 20))

with a.allow_densify():
    a = a + 1

@hameerabbasi
Collaborator

hameerabbasi commented Jan 11, 2018

While this is a very good solution, I was thinking more of the use case where np.ndarray and COO can be used interchangeably once we've implemented a significant amount of the API. Then, in other libraries, we might want to do something like this:

# Coming from somewhere else, we don't know where.
a = sparse.random((20, 20))
b = sparse.random((20, 20))

def get_neg_add_plus_one(a, b):
    """In another library not related to us."""
    return -(a + b) + 1

Then, even if we allow things like:

with a.allow_densify(), b.allow_densify():
    c = get_neg_add_plus_one(a, b)

We would have to make sure that (a + b) has allow_densify properly set. For arbitrary operations, we would need to maintain a DAG where each allow_densify node is only set if all parents are set.

I'm not entirely against this approach (I think it's more elegant), but we would need a bit of processing at the densifying stage.

Edit: Not to mention... Memory leaks if we actually store the parents in the child object...

@hameerabbasi
Collaborator

Another thing I was thinking about is how to manage mixed sparse-dense operations. For example, if a is dense and b is sparse:

  • a + b is dense, BUT:
  • a * b is sparse, and should be handled that way. This becomes even more important if a is being broadcast to a higher dimension.

In this case,

  • Auto-densification rules should kick in for a + b
  • a * b should be handled the sparse way.
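
The arithmetic behind this distinction, shown with plain NumPy for concreteness (the arrays are arbitrary examples):

import numpy as np

a = np.arange(1, 7).reshape(2, 3)          # dense operand, no zeros
b = np.array([[0, 2, 0], [0, 0, 5]])       # sparse pattern: only 2 nonzeros

print(a * b)   # nonzeros only where b is nonzero, so at most b's 2 entries
print(a + b)   # equals a wherever b is zero, so the result is generally dense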

@hameerabbasi
Collaborator

hameerabbasi commented Jan 26, 2018

I got around this by storing the parent configs (just the configs, not the full objects) in each child object using a set. This stores the parent configs, but only those of the "root" parents.

I also found a way to handle the issues in my previous comment in #75. My implementation is available in #88. Feedback welcome.
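
A rough illustration of the idea (not the actual implementation in #88; the class and helper names here are invented for the example):

class _Config:
    """Hypothetical per-array configuration (e.g. whether densification is allowed)."""
    def __init__(self, allow_densify=False):
        self.allow_densify = allow_densify

def _merge_root_configs(*operands):
    # Union of the root-parent configs of all operands; only configs are
    # stored, never references to the parent arrays themselves.
    configs = set()
    for op in operands:
        configs |= getattr(op, "_parent_configs", set())
    return configs

def _may_densify(*operands):
    # Densification is permitted only if every contributing root config allows it.
    configs = _merge_root_configs(*operands)
    return bool(configs) and all(c.allow_densify for c in configs)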

@Hoeze

Hoeze commented Mar 25, 2019

np.asarray(x) should definitely return a dense numpy array, no matter what type of (possibly sparse) array x is.
This is also the reason I currently cannot use it with Dask.

@hameerabbasi
Collaborator

Hello,

A design decision was made to disallow it (read the discussion over in #218). If you must have dense arrays, the best thing to do is to map blocks in Dask using .todense().
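
A minimal sketch of that pattern, assuming x is a Dask array whose chunks are sparse.COO objects (the shapes and chunking below are arbitrary):

import dask.array as da
import sparse

# Build a Dask array with sparse chunks (the usual Dask + sparse pattern).
x = da.random.random((1000, 1000), chunks=(500, 500)).map_blocks(sparse.COO)

# Convert every chunk back to a dense ndarray where dense output is required.
dense = x.map_blocks(lambda block: block.todense(), dtype=x.dtype)
result = dense.compute()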

@hameerabbasi
Collaborator

@Hoeze Also, you can control this behavior using the environment variable SPARSE_AUTO_DENSIFY. Please note that this variable is read only once, at import time.

@Hoeze

Hoeze commented Mar 25, 2019

@hameerabbasi Thank you very much, you just saved my day!

@TomMaullin

Apologies if this is a dumb question, but I am having the same issue described in this thread and was wondering: would anyone be able to tell me how to change the SPARSE_AUTO_DENSIFY environment variable for sparse?

@nils-werner
Contributor

By starting Python using

SPARSE_AUTO_DENSIFY=1 python

or by running

import os
os.environ["SPARSE_AUTO_DENSIFY"] = "1"  # environment values must be strings

before importing sparse (the variable is read once at import time).

@TomMaullin

Ah! Sorry yes, I see! Thank you!
