Skip to content

utils.svd_flip breaks on rectangular matrix #732

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
eric-czech opened this issue Aug 27, 2020 · 3 comments
Closed

utils.svd_flip breaks on rectangular matrix #732

eric-czech opened this issue Aug 27, 2020 · 3 comments

Comments

@eric-czech
Copy link
Contributor

Is there a fundamental reason that this shouldn't work?

import dask.array as da
from dask_ml.utils import svd_flip

# Run svd on a short-fat array with a single chunk
x = da.random.random(size=(100, 1000)).rechunk((-1, -1))
u, s, v = da.linalg.svd(x)
u, v = svd_flip(u, v)
u.compute()
...
/opt/conda/lib/python3.7/site-packages/dask/core.py in _execute_task(arg, cache, dsk)
    119         # temporaries by their reference count and can execute certain
    120         # operations in-place.
--> 121         return func(*(_execute_task(a, cache) for a in args))
    122     elif not ishashable(arg):
    123         return arg

/opt/conda/lib/python3.7/site-packages/dask_ml/utils.py in _svd_flip_copy(x, y, u_based_decision)
     31         return skm.svd_flip(x, y, u_based_decision=u_based_decision)
     32     except ValueError:
---> 33         return skm.svd_flip(x.copy(), y.copy(), u_based_decision=u_based_decision)
     34 
     35 

/opt/conda/lib/python3.7/site-packages/sklearn/utils/extmath.py in svd_flip(u, v, u_based_decision)
    533         signs = np.sign(u[max_abs_cols, range(u.shape[1])])
    534         u *= signs
--> 535         v *= signs[:, np.newaxis]
    536     else:
    537         # rows of v, columns of u

ValueError: operands could not be broadcast together with shapes (1000,1000) (100,1) (1000,1000)

Environment:

  • Dask version: 2.21.0
  • Python version: 3.7
  • Operating System: linux
  • Install method (conda, pip, source): conda
@TomAugspurger
Copy link
Member

I'm not sure, but it seems that scikit-learn has similar behavior

In [27]: u, s, v = np.linalg.svd(x.compute())

In [28]: u, v = sklearn.utils.extmath.svd_flip(u, v)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-28-ffeacd5d685b> in <module>
----> 1 u, v = sklearn.utils.extmath.svd_flip(u, v)

~/Envs/dask-dev/lib/python3.8/site-packages/sklearn/utils/extmath.py in svd_flip(u, v, u_based_decision)
    533         signs = np.sign(u[max_abs_cols, range(u.shape[1])])
    534         u *= signs
--> 535         v *= signs[:, np.newaxis]
    536     else:
    537         # rows of v, columns of u

ValueError: operands could not be broadcast together with shapes (1000,1000) (100,1) (1000,1000)

@eric-czech
Copy link
Contributor Author

eric-czech commented Sep 2, 2020

Interesting, thanks @TomAugspurger.

It looks to me like the issue is that da.linalg.svd simply doesn't support full_matrices. That broadcast would be fine if the results passed to https://github.com/dask/dask-ml/blob/master/dask_ml/utils.py#L26 were already clipped to shapes (M, K) and (K, N) where K = min(M, N). I see a good dozen or so uses of scipy svd in scikit-learn with that parameter and they never follow an svd without removing the unnecessary rows/cols (https://github.com/scikit-learn/scikit-learn/search?p=2&q=full_matrices&unscoped_q=full_matrices) -- i.e. full_matrices is always false and that's the opposite of what dask does.

I'll make or find an issue in dask since this should be a precursor to supporting #731 well.

EDIT

Looks like that's been in limbo for a while: dask/dask#3576

@eric-czech
Copy link
Contributor Author

eric-czech commented Oct 2, 2020

This isn't technically fixed since I left the existing sklearn-like svd_flip function alone for compatibility with older dask-ml/dask versions, but the real issue underlying this was addressed in dask/dask#6616. tsqr was sometimes behaving like np.linalg.svd(..., full_matrices=False) and other times like np.linalg.svd(..., full_matrices=True). The latter was the problem here (related: dask/dask#3576).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants