Inconsistent elemwise auto-densify rules #460
Comments
The reasoning behind this one was that doing […]. One option is to allow "broadcasting" fill-values.
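For context, `sparse` already supports custom fill values, and elemwise operations combine the operands' fill values with the same operation. A minimal sketch of that existing behaviour (shapes and values chosen arbitrarily; this is my reading of the baseline the "broadcasting" idea would generalize, not a statement of the proposal itself):

```python
import numpy as np
import sparse

x = sparse.random((4, 4), density=0.25)  # COO, fill value 0.0
y = sparse.full((4, 4), 3.0)             # COO, fill value 3.0, nothing stored

z = x + y
print(z.fill_value)    # 3.0, i.e. np.add(x.fill_value, y.fill_value)
print(z.nnz == x.nnz)  # True: only x's stored entries were recomputed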
But wouldn't this still blow up memory if both these arrays are dense? I'm not sure I would agree that it's this library's job to error here, and I'm not aware of a sparse matrix library that refuses this kind of operation (examples being […]).
Personally, I'm not completely convinced allowing custom fill values is worth the complexity. It definitely has some cool applications, but to me the main goal is having working sparse arrays with an array interface. I think this solution makes a lot of sense for a lazy interface, but is just a lot of complexity for an eager one.
I'll have to rephrase this one: […]
This is also in line with the decision made over at #218.
This isn't strictly true if the output array is dense, which would follow the current behaviour of […]. But also, isn't the same general principle of large memory growth at play if both those arrays were sparse?
As a counterexample here: […]
To my reading, the short discussion there is more focussed on the behavior of […]. That said, it would be really nice if there were a universal way to convert a numpy-array-like to a numpy array, which I thought was the intent of […].
Currently we do not accept dense outputs.
This is true, I hadn't considered that.
I was assuming a fill-value of 0 in this case, and this case is, indeed, handled more generally such that there aren't blowups. I do agree it could be tested better.
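A quick illustration of the no-blowup case being described here (shapes are arbitrary, and the sparse return type matches the table proposed further down the thread):

```python
import numpy as np
import sparse

s = sparse.random((1000, 1000), density=0.001)  # fill value 0, ~1000 stored values
d = np.random.rand(1000, 1000)                  # fully dense operand

out = s * d                                 # 0 * anything == 0, fill value stays 0
print(isinstance(out, sparse.SparseArray))  # True: no densification
print(out.nnz <= s.nnz)                     # True: stored values cannot grow
```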
Not quite, there's a difference between "this is already something like an array, please convert it to one" and "this may cause a performance slowdown in some cases" (like CuPy, which disallows […]).
This doesn't seem to be the case to me. The example at the top of the issue (adding a sparse array and a dense array resulting in a dense array) works on current master. (See line 459 in 0d50c9d.)
Lines 555 to 561 in 0d50c9d
I think this is a different point. My point here was it would be nice if my library only worked on […]. Your NEP-37 would help solve a number of these problems.
What I meant by this is that we do not support […].
While I agree it's relevant, in light of the point above: by my definition, memory blowup could also happen on […].
Just to be sure we're on the same page here, does something like this fit what you mean?

```python
import numpy as np
import sparse

M, N = 100, 100  # placeholder dimensions; any values work

sparse_mtx = sparse.random((M, N))
dense_mtx = np.random.rand(M, N)

sparse_col = sparse_mtx[:, [0]]
sparse_row = sparse_mtx[0, :]
dense_col = dense_mtx[:, [0]]
dense_row = dense_mtx[0, :]

np.add(sparse_mtx, dense_mtx)       # -> np.ndarray
np.multiply(sparse_mtx, dense_mtx)  # -> SparseArray
np.add(dense_mtx, sparse_col)       # -> np.ndarray
np.add(sparse_mtx, dense_col)       # -> np.ndarray
np.add(sparse_mtx, sparse_col)      # -> SparseArray
np.multiply(sparse_mtx, dense_col)  # -> SparseArray
np.multiply(dense_mtx, sparse_col)  # -> ???
```

I think this might take a significant refactor of the […]. I would also like to investigate why most other libraries seem to have gone for the consistency of just having these operations always return sparse arrays.
For a fill-value of […]
I think this may be too inefficient a way to find out that […] (see lines 531 to 543 in e036c25).
I think it would make more sense to just say that "if the operation is multiplication, we only need to look at the intersection of the sparse indices". That is, I think it makes sense to have this logic written as code as opposed to determined dynamically (a sketch of what I mean follows).
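A toy sketch of that rule (hypothetical helper, nothing like sparse's actual machinery): with zero fill values, multiplication can only produce stored values at coordinates present in both operands, so walking the intersection of the two coordinate sets suffices.

```python
import numpy as np

def coo_multiply(coords_a, data_a, coords_b, data_b):
    """Multiply two COO-style arrays with fill value 0.

    coords_* are (ndim, nnz) integer arrays; data_* are (nnz,) values.
    The output is nonzero only where BOTH inputs store a value, so we
    only visit the intersection of the stored coordinates.
    """
    lookup = {tuple(c): v for c, v in zip(coords_a.T, data_a)}
    out_coords, out_data = [], []
    for c, v in zip(coords_b.T, data_b):
        key = tuple(c)
        if key in lookup:  # coordinate stored in both inputs
            out_coords.append(key)
            out_data.append(lookup[key] * v)
    return np.array(out_coords).T, np.array(out_data)

# 3x3 example: `a` stores the diagonal, `b` stores (1, 1) and (2, 0);
# only (1, 1) appears in both, so the output has a single stored value.
a_coords = np.array([[0, 1, 2], [0, 1, 2]])
a_data = np.array([1.0, 2.0, 3.0])
b_coords = np.array([[1, 2], [1, 0]])
b_data = np.array([10.0, 10.0])
print(coo_multiply(a_coords, a_data, b_coords, b_data))
# (array([[1], [1]]), array([20.]))
```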
While I agree, that would be a big refactor indeed. What I would prefer instead is that we defer that to the rewrite that's taking place in collaboration with the TACO team.
I hadn't realized this was something that was being worked on. Is there somewhere it can be tracked? This definitely seems like the ideal, to just have all operations here generated by taco. I recall reading issues about this, but hadn't seen much movement on implementation. If this is something that might be a reality soon, then I would probably want to re-evaluate implementation priorities for […].
Mostly in the TACO repository. They're adding an array API instead of just Einsum-type operations. Advanced indexing might have to be done on top. The work is expected to complete in June, after which we'll need to wrap it into something a bit more Pythonic.
Great! I've been wanting to be able to use taco in a package for a few years now, so I'm very happy to see progress being made. Could you share any info on the plan for distribution? Is this going to require a C++ compiler at runtime, or is the idea of emitting numba code still being looked at? It would be nice to be able to use custom semirings or broadcast operations.
Yeah, so where does that leave this package, if almost everything will just be implemented in […]?
Well, the plan is to have either […].
Also I should say -- The Array API work on TACO's side is expected in June, the bindings come after. Sorry for the ambiguity there.
The tentative plan is to statically link against LLVM so that neither LLVM nor a C++ compiler are required on the user's system. We haven't figured out a plan for custom operations from within Python, but I believe @guilhermeleobas had some ideas.
I had understood that issue as being a tentative renaming. I might be misunderstanding what you're saying here: is it the plan to just deprecate this package? My main concern here is: is there value in adding features / fixing bugs in this package? It seems like much of it (at least for the COO, GCXS, and CSR/CSC classes) would be replaced by pytaco bindings.
Is there any sort of ETA for this? This seems like the largest hurdle for being able to have TACO as a dependency. Also, are releases documented somewhere? I was looking to install it but was unsure how to choose a stable commit to build at.
I'm open to both a rename, as well as simply replacing this package. The issue I see is that there will be (albeit slight) backcompat breaks, including NaN-accuracy. We could (instead of a rename) push out a version 2.0.
Unfortunately it's dependent on too many things to calculate a reasonable ETA, some of which are behind the scenes and may become clearer within a few months.
TACO doesn't have a stable release yet, they're aiming for one soon without the Array API.
When possible, we would like to statically link with LLVM. And about Numba, the initial idea was to reimplement TACO on top of Numba, but this is a lot of work, and we will not follow this direction anymore. In the future, we might consider Numba to generate user defined operations and link it with the LLVM generated by TACO.
Describe the bug
Sometimes elementwise operations return a dense result, sometimes they error. It seems to depend on the broadcasting.
To Reproduce
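The original reproduction snippet did not survive formatting; based on the discussion above, the inconsistency is along these lines (the shapes and which exact call errors are my assumptions, not a quote from the report):

```python
import numpy as np
import sparse

s = sparse.random((10, 10), density=0.1)
d = np.random.rand(10, 10)

s + d     # same-shape sparse + dense: auto-densifies and returns np.ndarray
s + d[0]  # broadcast sparse + dense: errored at the time of this report
```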
Expected behavior
Consistency. I'd like a dense output for this operation (following #10 (comment)), but would be happy with the consistency of: […]
System
- `sparse.__version__`: 0.12.0+8.g75125cd
- `np.__version__`: 1.20.2
- `numba.__version__`: 0.53.1