Default efficient row iteration #398

Closed
jeromekelleher opened this issue Jan 28, 2019 · 5 comments

@jeromekelleher
Member

It seems like iterating over a chunked array is inefficient at the moment, presumably because we're repeatedly decompressing the chunks: each row access decompresses the whole chunk containing that row, so a chunk holding N rows gets decompressed N times. For example, if I do

    for pos, row in enumerate(data_root.variants):
        print(row)
        if pos == 1000:
            break

it takes several minutes (data_root.variants is a large 2D chunked matrix) but if I do

    for pos, row in enumerate(chunk_iterator(data_root.variants)):
        print(row)
        if pos == 1000:
            break

it takes less than a second, where

    def chunk_iterator(array):
        """
        Utility to iterate over the rows in the specified array efficiently
        by accessing one chunk at a time.
        """
        chunk_size = array.chunks[0]
        chunk = None
        for j in range(array.shape[0]):
            # Decompress each chunk only once, then serve its rows from memory.
            if j % chunk_size == 0:
                chunk = array[j: j + chunk_size]
            yield chunk[j % chunk_size]

To me, it's quite a surprising gotcha that zarr isn't doing this chunkwise decompression, and I think it would be good to do it by default. There is a small extra memory overhead (one decoded chunk is held at a time), but I think that's probably OK, given the performance benefits.

Any thoughts?

@jakirkham
Member

It sounds like what you want is a cache for decoded chunks, which is proposed in PR ( #306 ). If that sounds right, would you be OK with closing this issue so that discussion stays over there? If not, what else are you looking for?
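
(For context, zarr already ships a store-level LRUStoreCache, which caches encoded chunks; PR #306 would extend caching to decoded chunks. A minimal usage sketch of the existing store-level cache, assuming the data lives in a directory store at an illustrative path:)

    import zarr

    # Wrap an existing store in an LRU cache; note this caches *encoded*
    # chunks, so each access still pays the decompression cost.
    store = zarr.DirectoryStore('data/example.zarr')  # path is illustrative
    cache = zarr.LRUStoreCache(store, max_size=2**28)  # up to ~256 MiB cached
    data_root = zarr.open_group(cache, mode='r')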

@jeromekelleher
Member Author

Ah, that's a good feature, thanks @jakirkham! It's related, but what I'm suggesting here addresses a more limited case and is simpler: the __iter__ methods could use something like the chunk_iterator above internally, so that iteration performance is automatically good even when no LRU cache has been configured. Running a simple for loop should have good performance by default, I think.
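
Concretely, a rough sketch of what this could look like on the Array class (illustrative only, not a worked-out implementation):

    def __iter__(self):
        # Sketch: serve rows from one decoded chunk at a time, so each
        # chunk is decompressed only once per pass over the array.
        if len(self.shape) == 0:
            raise TypeError('iteration over a 0-d array')
        chunk_size = self.chunks[0]
        chunk = None
        for j in range(self.shape[0]):
            if j % chunk_size == 0:
                chunk = self[j: j + chunk_size]
            yield chunk[j % chunk_size]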

Any thoughts @alimanfoo?

@alimanfoo
Member

Hi @jeromekelleher, I like your suggestion. As @jakirkham says, the cache for decoded chunks would partly solve this, but, as you say, iteration is actually a simpler requirement: you only need to cache one chunk at a time while you iterate through it. PR welcome.

@jeromekelleher
Member Author

It looks like there aren't any tests for iteration currently. In principle this belongs in test_core, but really we just want to exercise iteration over a selection of arrays with a range of different shapes and chunk sizes. Any pointers on where this should go, @alimanfoo, and on where I can find some pre-defined arrays that would be suitable?

@alimanfoo
Member

I'd suggest adding a test_iter() method on the TestArray class in the test_core module. E.g., something like:

    # Module-level imports assumed:
    # import numpy as np
    # from itertools import zip_longest
    # from numpy.testing import assert_array_equal

    def test_iter(self):
        params = (
            ((1000,), (100,)),
            ((100, 100), (10, 10)),
            # any other combination of shape and chunks you'd like to test
        )
        for shape, chunks in params:
            z = self.create_array(shape=shape, chunks=chunks, dtype=int)
            a = np.arange(np.prod(shape)).reshape(shape)
            z[:] = a
            # zip_longest also catches a mismatch in the number of rows
            for expect, actual in zip_longest(a, z):
                assert_array_equal(expect, actual)

jeromekelleher added a commit to jeromekelleher/zarr that referenced this issue Feb 5, 2019
@alimanfoo alimanfoo added this to the v2.3 milestone Feb 5, 2019
jeromekelleher added a commit to jeromekelleher/zarr that referenced this issue Feb 6, 2019
alimanfoo pushed a commit that referenced this issue Feb 6, 2019
* Chunkwise iteration over arrays.

Closes #398.

* Fixed lint error from new flake8 version.