Default efficient row iteration #398

Closed
jeromekelleher opened this issue Jan 28, 2019 · 5 comments

@jeromekelleher
Member

It seems like iterating over a chunked array is inefficient at the moment, presumably because we're repeatedly decompressing the chunks: each row access decompresses the whole chunk containing that row, so a chunk holding N rows gets decompressed N times. For example, if I do

    for pos, row in enumerate(data_root.variants):
        print(row)
        if pos == 1000:
            break

it takes several minutes (data_root.variants is a large 2D chunked matrix) but if I do

    for pos, row in enumerate(chunk_iterator(data_root.variants)):
        print(row)
        if pos == 1000:
            break

it takes less than a second, where

    def chunk_iterator(array):
        """
        Utility to iterate over the rows in the specified array efficiently
        by accessing one chunk at a time.
        """
        chunk_size = array.chunks[0]
        chunk = None
        for j in range(array.shape[0]):
            # Decompress each chunk only once, then serve its rows from memory.
            if j % chunk_size == 0:
                chunk = array[j: j + chunk_size]
            yield chunk[j % chunk_size]

To me, it's quite a surprising gotcha that zarr isn't doing this chunkwise decompression, and I think it would be good to do it by default. There is a small extra memory overhead (one decoded chunk is held at a time), but I think that's probably OK, given the performance benefits.

Any thoughts?

@jakirkham
Member

It sounds like what you want is a cache for decoded chunks, which is proposed in PR ( #306 ). If that sounds right, would you be OK with closing this issue so that discussion stays over there? If not, what else are you looking for?
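
(For context, zarr already ships a store-level LRUStoreCache, which caches encoded chunks; PR #306 would extend caching to decoded chunks. A minimal usage sketch of the existing store-level cache, assuming the data lives in a directory store at an illustrative path:)

    import zarr

    # Wrap an existing store in an LRU cache; note this caches *encoded*
    # chunks, so each access still pays the decompression cost.
    store = zarr.DirectoryStore('data/example.zarr')  # path is illustrative
    cache = zarr.LRUStoreCache(store, max_size=2**28)  # up to ~256 MiB cached
    data_root = zarr.open_group(cache, mode='r')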

@jeromekelleher
Member Author

Ah, that's a good feature, thanks @jakirkham! It's related, but what I'm suggesting here addresses a more limited case and is simpler: the __iter__ methods could use something like the chunk_iterator above internally, so that iteration performance is automatically good even when no LRU cache has been configured. Running a simple for loop should have good performance by default, I think.
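
Concretely, a rough sketch of what this could look like on the Array class (illustrative only, not a worked-out implementation):

    def __iter__(self):
        # Sketch: serve rows from one decoded chunk at a time, so each
        # chunk is decompressed only once per pass over the array.
        if len(self.shape) == 0:
            raise TypeError('iteration over a 0-d array')
        chunk_size = self.chunks[0]
        chunk = None
        for j in range(self.shape[0]):
            if j % chunk_size == 0:
                chunk = self[j: j + chunk_size]
            yield chunk[j % chunk_size]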

Any thoughts @alimanfoo?

@alimanfoo
Member

Hi @jeromekelleher, I like your suggestion. As @jakirkham says, the cache for decoded chunks would partly solve this, but, as you say, iteration is actually a simpler requirement: you only need to cache one chunk at a time while you iterate through it. PR welcome.

@jeromekelleher
Member Author

It looks like there aren't any tests for iteration currently. In principle this belongs in test_core, but really we just want to exercise iteration over a selection of arrays with a range of different shapes and chunk sizes. Any pointers on where this should go, @alimanfoo, and on where I can find some pre-defined arrays that would be suitable?

@alimanfoo
Member

I'd suggest adding a test_iter() method on the TestArray class in the test_core module. E.g., something like:

    # Module-level imports assumed:
    # import numpy as np
    # from itertools import zip_longest
    # from numpy.testing import assert_array_equal

    def test_iter(self):
        params = (
            ((1000,), (100,)),
            ((100, 100), (10, 10)),
            # any other combination of shape and chunks you'd like to test
        )
        for shape, chunks in params:
            z = self.create_array(shape=shape, chunks=chunks, dtype=int)
            a = np.arange(np.prod(shape)).reshape(shape)
            z[:] = a
            # zip_longest also catches a mismatch in the number of rows
            for expect, actual in zip_longest(a, z):
                assert_array_equal(expect, actual)

jeromekelleher added a commit to jeromekelleher/zarr that referenced this issue Feb 5, 2019
@alimanfoo alimanfoo added this to the v2.3 milestone Feb 5, 2019
jeromekelleher added a commit to jeromekelleher/zarr that referenced this issue Feb 6, 2019
alimanfoo pushed a commit that referenced this issue Feb 6, 2019
* Chunkwise iteration over arrays.

Closes #398.

* Fixed lint error from new flake8 version.