Automatic conversion to dense arrays #10

Closed
mrocklin opened this issue Apr 27, 2017 · 21 comments
Labels: discussion, enhancement

Comments

@mrocklin
Contributor

Operations on sparse arrays sometimes produce dense arrays. This type instability can cause some frustration downstream but may be optimal performance-wise.

On many occasions we actually inherit this behavior from scipy.sparse, which returns numpy.matrix objects in some cases. Currently we also return dense numpy.ndarray objects when this happens and when the number of non-zeros is high. I'm running into cases where I want to do this more and more, especially in parallel computing, where I tensordot and add together many sparse arrays. Switching to dense starts to make a lot of sense there.

However, this will likely cause some frustration downstream, as users will sometimes receive sparse and sometimes dense arrays depending on their data at the time. Is the performance gain worth it?
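
For concreteness, here is a rough sketch of the kind of data-dependent densification being discussed, and of the type instability it creates for callers. The function name and the density cutoff are invented for illustration; this is not library code.

import numpy as np
import sparse

def add_all(arrays, density_cutoff=0.5):
    """Hypothetical reduction: returns a COO while the result stays sparse,
    but a plain ndarray once it fills in past the cutoff."""
    total = arrays[0]
    for a in arrays[1:]:
        if isinstance(total, np.ndarray) and isinstance(a, sparse.COO):
            a = a.todense()          # stay dense once we have switched over
        total = total + a
        if isinstance(total, sparse.COO) and total.nnz / np.prod(total.shape) > density_cutoff:
            total = total.todense()  # downstream code now sees np.ndarray
    return total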

@jakevdp
Contributor

jakevdp commented Apr 27, 2017

That's a tough one... I'd probably avoid returning sparse or dense conditionally from the same routine, as that's a recipe for hard-to-catch bugs in corner cases.

@mrocklin
Contributor Author

Yeah, last week I would have come out strongly in favor of type stability. As I've been running computations in practice my confidence has decreased somewhat here. I'm still not entirely in favor of optional conversions, but if I'm unable to optimize down some sparse operations it may become an interesting option.

The saving grace is that losing type stability may become less of an issue once the __array_ufunc__ protocol emerges in NumPy 1.13. The type of container may start to matter much less in NumPy code in the future.

@hameerabbasi
Collaborator

One issue that I see in the code is that operator.add for COO is treated differently from other operators like operator.mul: we apply maybe_densify for add but not for the other operators, and a unit test failed when I tried to correct this. I ran into it when implementing operator.gt and friends, where I wanted to unify behavior across operators and consistently use the same code. Since there is a unit test, I'm assuming someone has a use for the current behavior, and I didn't want to change it without asking. We should decide on a concrete strategy to follow:

  1. We use maybe_densify consistently.
  2. We always raise an error.
  3. We always densify when needed.
  4. We provide a global flag or option that can be used with a with statement, similar to other libraries.
  5. We use maybe_densify consistently with global flags/options for maybe_densify.

Whatever we choose, we should bake it into elemwise and elemwise_binary (and possibly provide kwargs). Since __array_ufunc__ is now here, it should be okay to lose type stability. I'm in favor of option 5, as it is the most general.
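
As a point of reference, here is a rough sketch of how options 2 and 3 (exposed as a kwarg, as suggested above) might look. The helper name, the densify keyword, and the min_density threshold are hypothetical, not the library's actual maybe_densify:

import numpy as np

def _postprocess(result, *, densify="never", min_density=0.25):
    # Illustrative only: decide what to do with an elemwise result
    # that is no longer very sparse.
    density = result.nnz / np.prod(result.shape)
    if density < min_density:
        return result                  # still worth keeping sparse
    if densify == "always":            # option 3: silently densify
        return result.todense()
    if densify == "never":             # option 2: refuse loudly
        raise ValueError(
            "Result is {:.0%} dense; call .todense() explicitly.".format(density)
        )
    return result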

@hameerabbasi
Collaborator

hameerabbasi commented Dec 25, 2017

Two other things that would be helpful to think about are:

  • Do we auto-densify above a certain sparsity level?
  • Do we treat operators like operator.mul, where the result is sparse if any one of the operands is sparse, differently from operator.add?

@hameerabbasi
Collaborator

Paging @mrocklin @nils-werner @jakevdp. Your input would be very welcome.

@mrocklin
Contributor Author

At the moment I'm not in favor of auto-densification by default. However, I do agree that we should make it easy for users to do this on their own. One approach would be to expose something like maybe_densify to the user. Config options with a context manager also seem sensible to me. I don't have strong confidence in any direction here.

@mrocklin
Contributor Author

Personally I'm inclined to ignore this issue until it comes up in a few more use cases.

@hameerabbasi
Collaborator

I wanted to implement __array_ufunc__, so would it be okay (for now) to remove the auto-densification in __add__ and the test for it?

@mrocklin
Contributor Author

If we currently do auto-densification then yes, I think it makes sense to remove it.

An __array_ufunc__ implementation would be quite welcome.

@hameerabbasi
Collaborator

While researching how to do configurations with context managers in the proper, thread-safe way, I came across PEP 567. I'm surprised it doesn't already exist.

When making our own implementation (which I'm somewhat skeptical about), the following problems come to mind:

  • Inheriting from the parent thread.
  • Making it so race conditions don't make threads override each others' data.
  • I considered thread-local storage but that (probably?) won't be inherited by a child thread.

All of this is handled for us by PEP 567. However, if we're not worried about thread safety we can just use a stack.
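
For illustration, a PEP 567 based flag plus context manager might look roughly like the sketch below (contextvars, Python 3.7+). The names _auto_densify and allow_densify are made up for this example:

import contextvars
from contextlib import contextmanager

# Hypothetical module-level flag holding the densification policy.
_auto_densify = contextvars.ContextVar("auto_densify", default=False)

@contextmanager
def allow_densify():
    token = _auto_densify.set(True)
    try:
        yield
    finally:
        _auto_densify.reset(token)  # restore the previous value, even on error

# Inside an operation, the library would consult the flag:
#     if result_is_too_dense and not _auto_densify.get():
#         raise ValueError("automatic densification is disabled")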

@nils-werner
Contributor

nils-werner commented Jan 11, 2018

Am I correct in assuming that, by talking about thread-safe context managers, you mean you want to do something like

a = sparse.random((20, 20))

with sparse.allow_densify():
    a = a + 1

In that case I would seriously warn against it, for exactly the issues you are fearing. Plus, you would have to couple singleton/module state to the behaviour of your COO instances, which can be a huge pain: thread safety, unexpected behaviours, all unit tests becoming much less meaningful, etc.

This system of the module having an internal state is what gave us the horrible Matlab/Matplotlib principle where you need to track the state of plt in order to manipulate your plots:

plt.subplot(2, 1, 1)  # All commands after this implicitly target the first subplot
plt.plot(...)
plt.subplot(2, 1, 2)  # All commands after this implicitly target the second subplot
plt.plot(...)
plt.show()

Only recently have they started introducing (and encouraging) the right way of doing it: instead of the module having state, you get back instances that carry their own state:

f, (a, b) = plt.subplots(2, 1)  # a is the first, b the second subplot
a.plot(...)
b.plot(...)
plt.show()

So instead, if possible, I would recommend a solution that keeps the allow_densify state in the COO array itself:

a = sparse.random((20, 20))

with a.allow_densify():
    a = a + 1

@hameerabbasi
Collaborator

hameerabbasi commented Jan 11, 2018

While this is a very good solution, I was thinking more of the use case where np.ndarray and COO can be used interchangeably once we've implemented a significant amount of the API. Then, in other libraries, we might want to do something like this:

# Coming from somewhere else, we don't know where.
a = sparse.random((20, 20))
b = sparse.random((20, 20))

def get_neg_add_plus_one(a, b):
    """In another library not related to us."""
    return -(a + b) + 1

Then, even if we allow things like:

with a.allow_densify(), b.allow_densify():
    c = get_neg_add_plus_one(a, b)

We would have to make sure that (a + b) has allow_densify properly set. For arbitrary operations, we would need to maintain a DAG where each allow_densify node is only set if all parents are set.

I'm not entirely against this approach (I think it's more elegant), but we would need a bit of processing at the densifying stage.

Edit: Not to mention... Memory leaks if we actually store the parents in the child object...

@hameerabbasi
Collaborator

Another thing I was thinking about is how to manage mixed sparse-dense operations. For example, if a is dense and b is sparse:

  • a + b is dense, BUT:
  • a * b is sparse, and should be handled that way. This becomes even more important if a is being broadcast to a higher dimension.

In this case,

  • Auto-densification rules should kick in for a + b
  • a * b should be handled the sparse way.
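
The arithmetic behind this distinction, shown with plain NumPy for concreteness (the arrays are arbitrary examples):

import numpy as np

a = np.arange(1, 7).reshape(2, 3)          # dense operand, no zeros
b = np.array([[0, 2, 0], [0, 0, 5]])       # sparse pattern: only 2 nonzeros

print(a * b)   # nonzeros only where b is nonzero, so at most b's 2 entries
print(a + b)   # equals a wherever b is zero, so the result is generally dense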

@hameerabbasi
Collaborator

hameerabbasi commented Jan 26, 2018

I got around this by storing the parent configs (just the configs, not the full objects) in each child object using a set. This stores the parent configs, but only those of the "root" parents.

I also found a way to handle the issues in my previous comment in #75. My implementation is available in #88. Feedback welcome.
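
A rough illustration of the idea (not the actual implementation in #88; the class and helper names here are invented for the example):

class _Config:
    """Hypothetical per-array configuration (e.g. whether densification is allowed)."""
    def __init__(self, allow_densify=False):
        self.allow_densify = allow_densify

def _merge_root_configs(*operands):
    # Union of the root-parent configs of all operands; only configs are
    # stored, never references to the parent arrays themselves.
    configs = set()
    for op in operands:
        configs |= getattr(op, "_parent_configs", set())
    return configs

def _may_densify(*operands):
    # Densification is permitted only if every contributing root config allows it.
    configs = _merge_root_configs(*operands)
    return bool(configs) and all(c.allow_densify for c in configs)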

@Hoeze

Hoeze commented Mar 25, 2019

np.asarray(x) should definitely return a dense numpy array, no matter what type of (possibly sparse) array x is.
This is also the reason I currently cannot use it with Dask.

@hameerabbasi
Collaborator

Hello,

A design decision was made to disallow it (read the discussion over in #218). If you must have dense arrays, the best thing to do is to map blocks in Dask using .todense().
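
A minimal sketch of that pattern, assuming x is a Dask array whose chunks are sparse.COO objects (the shapes and chunking below are arbitrary):

import dask.array as da
import sparse

# Build a Dask array with sparse chunks (the usual Dask + sparse pattern).
x = da.random.random((1000, 1000), chunks=(500, 500)).map_blocks(sparse.COO)

# Convert every chunk back to a dense ndarray where dense output is required.
dense = x.map_blocks(lambda block: block.todense(), dtype=x.dtype)
result = dense.compute()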

@hameerabbasi
Collaborator

@Hoeze Also, you can control this behavior using the environment variable SPARSE_AUTO_DENSIFY. Please note that this variable is read only once, at import time.

@Hoeze

Hoeze commented Mar 25, 2019

@hameerabbasi Thank you very much, you just saved my day!

@TomMaullin

Apologies if this is a dumb question, but I am having the same issue described in this thread and was wondering: would anyone be able to tell me how to change the SPARSE_AUTO_DENSIFY environment variable for sparse?

@nils-werner
Contributor

By starting Python using

SPARSE_AUTO_DENSIFY=1 python

or by running

import os
os.environ["SPARSE_AUTO_DENSIFY"] = "1"  # environment values must be strings

before importing sparse (the variable is read once at import time).

@TomMaullin

Ah! Sorry yes, I see! Thank you!
