Convert numpy matrix stored in Zarr directorystore to a CSR matrix #723
Comments
Hi @joshmoore and @MSanKeys963, I want to ask if this issue is still relevant. If it is, I would like to work on it immediately.

There's definitely still a lot of interest in this (cc: @ivirshup @martindurant et al), @Mubarraqqq, but it's a pretty sizable project.

No problem. I'm still willing to work on the issue regardless.

Perhaps this one is the other way around from what we've been talking about: it is of interest to use zarr as storage for the various sparse layouts, but here you have a dense zarr and want to insert sparse rows into a DB. I don't see why you couldn't do this block-wise with dask, or even serially.
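For the dask route mentioned above, a minimal sketch (the store path `matrix.zarr` is hypothetical, and the rechunk assumes each row-block fits in memory once densified):

```python
import dask
import dask.array as da
from scipy import sparse

# Hypothetical store path; keep the native chunking along axis 0 but make
# each block span all columns, so one block maps to a set of complete rows.
darr = da.from_zarr("matrix.zarr")
darr = darr.rechunk((darr.chunks[0], darr.shape[1]))

# One delayed dense-to-CSR conversion per row-block, executed in parallel.
delayed_csr = [
    dask.delayed(sparse.csr_matrix)(block)
    for block in darr.to_delayed().ravel()
]
csr = sparse.vstack(dask.compute(*delayed_csr), format="csr")
```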
Here's a slightly simplified version of the code for doing this.

Example implementation:

```python
from scipy import sparse
import zarr


def idx_chunks_along_axis(shape: tuple, axis: int, chunk_size: int):
    """\
    Gives indexer tuples chunked along an axis.

    Params
    ------
    shape
        Shape of array to be chunked
    axis
        Axis to chunk along
    chunk_size
        Size of chunk along axis

    Returns
    -------
    An iterator of tuples for indexing into an array of the passed shape.
    """
    total = shape[axis]
    cur = 0
    mutable_idx = [slice(None) for i in range(len(shape))]
    # Yield full-size chunks along `axis`...
    while cur + chunk_size < total:
        mutable_idx[axis] = slice(cur, cur + chunk_size)
        yield tuple(mutable_idx)
        cur += chunk_size
    # ...then whatever remains as the final (possibly smaller) chunk.
    mutable_idx[axis] = slice(cur, None)
    yield tuple(mutable_idx)


def read_dense_as_csr(array: zarr.Array) -> sparse.csr_matrix:
    # Use the on-disk chunking along axis 0 so each read pulls whole chunks.
    axis_chunk = array.chunks[0]
    sub_matrices = []
    for idx in idx_chunks_along_axis(array.shape, 0, axis_chunk):
        dense_chunk = array[idx]  # only this chunk of rows is dense in memory
        sub_matrix = sparse.csr_matrix(dense_chunk)
        sub_matrices.append(sub_matrix)
    return sparse.vstack(sub_matrices, format="csr")
```

Could probably be improved with parallelization / tailoring to your memory needs. Original code here.
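For context, a hedged sketch of how the helper above might be driven end to end (the store path `matrix.zarr` and the output filename are hypothetical):

```python
import zarr
from scipy import sparse

# Open the on-disk array lazily; nothing is read until it is indexed.
store = zarr.DirectoryStore("matrix.zarr")  # hypothetical path
arr = zarr.open_array(store, mode="r")

csr = read_dense_as_csr(arr)

# Persist the CSR form so later row lookups don't touch the dense store.
sparse.save_npz("matrix_csr.npz", csr)
```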
Looks like this was solved by the above comment - I'll close as such. Thanks for opening this, and if you have any other questions please don't hesitate to open new issues!
Hi,
I have a huge NumPy 2D matrix stored chunk-wise in a Zarr DirectoryStore. It is a boolean matrix with a shape of, for example, 600M rows by 100K columns. My goal is to access random rows, and since the data is stored chunk-wise, directly accessing a random row via Zarr is not ideal, so I am trying to index it in MySQL or some other database for fast querying. Since this matrix will be sparse, I would like to know whether there is any way to convert the on-disk matrix into a CSR matrix for indexing in a database where I can run fast queries.
Any suggestions or pointers would be really helpful.
Thank you.
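As an aside on why CSR suits the random-row-access pattern described here: CSR stores per-row offsets (`indptr`), so slicing out one row touches only that row's nonzeros rather than scanning chunks. A toy sketch (shapes scaled down from the 600M x 100K case):

```python
import numpy as np
from scipy import sparse

# Toy boolean matrix standing in for the large on-disk one.
rng = np.random.default_rng(0)
dense = rng.random((1_000, 50)) < 0.01
csr = sparse.csr_matrix(dense)

row = csr.getrow(42)     # cheap: uses indptr offsets, not a full scan
true_cols = row.indices  # column indices of the True entries in row 42
```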