Default efficient row iteration #398
It sounds like what you want is a cache for decoded chunks, which is proposed in PR #306. If that sounds right, would you be OK closing this to keep discussion over there? If not, what else are you looking for?
Ah, that's a good feature, thanks @jakirkham! This is related, but I guess it addresses a more limited application and is simpler. What I'm suggesting here is that iteration should decompress one chunk at a time by default. Any thoughts @alimanfoo?
Hi @jeromekelleher, I like your suggestion. As @jakirkham says, the cache for decoded chunks would partly solve this, but, as you say, iteration is actually a simpler requirement: you only need to cache one chunk at a time while iterating through it. PR welcome.
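A minimal sketch of how such an `__iter__` could work, caching one decoded chunk at a time while iterating over the first axis (the `TinyChunkedArray` class below is a hypothetical stand-in for illustration, not zarr's actual implementation):

```python
import numpy as np

class TinyChunkedArray:
    """Hypothetical minimal stand-in for a chunked array, for illustration only."""

    def __init__(self, data, chunks):
        self.data = np.asarray(data)
        self.chunks = chunks
        self.shape = self.data.shape

    def __getitem__(self, key):
        # In a real chunked array this would decompress the chunks covering `key`.
        return self.data[key]

    def __iter__(self):
        # Cache one decoded chunk at a time while iterating over the first axis.
        chunk_rows = self.chunks[0]
        for start in range(0, self.shape[0], chunk_rows):
            stop = min(start + chunk_rows, self.shape[0])
            chunk = self[start:stop]  # one decode covers chunk_rows rows
            for row in chunk:
                yield row

z = TinyChunkedArray(np.arange(12).reshape(6, 2), chunks=(4, 2))
rows = [r.tolist() for r in z]
```

With this approach, each chunk along the first axis is sliced (and hence decoded) exactly once, rather than once per row.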
It looks like there aren't any tests for iteration currently. It belongs in
I'd suggest adding a `test_iter` method along these lines:

```python
import numpy as np
from itertools import zip_longest  # izip_longest on Python 2
from numpy.testing import assert_array_equal

def test_iter(self):
    params = (
        ((1000,), (100,)),
        ((100, 100), (10, 10)),
        # any other combination of shape and chunks you'd like to test
    )
    for shape, chunks in params:
        z = self.create_array(shape=shape, chunks=chunks, dtype=int)
        a = np.arange(np.prod(shape)).reshape(shape)
        z[:] = a
        for expect, actual in zip_longest(a, z):
            assert_array_equal(expect, actual)
```
* Chunkwise iteration over arrays. Closes #398.
* Fixed lint error from new flake8 version.
It seems like iterating over a chunked array is inefficient at the moment, presumably because we're repeatedly decompressing the chunks. For example, naive row-by-row iteration over data_root.variants (a large 2D chunked matrix) takes several minutes, whereas iterating chunk by chunk with a small helper function takes less than a second.
To me, it's quite a surprising gotcha that zarr isn't doing this chunkwise decompression, and I think it would be good to do it by default. There is a small extra memory overhead, but I think that's probably OK, given the performance benefits.
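To make the cost difference concrete, here is a toy sketch that counts chunk decodes (the `ChunkedArray` class is hypothetical and only simulates per-chunk decompression; it is not zarr code):

```python
class ChunkedArray:
    """Hypothetical toy stand-in for a compressed, chunked 2D array.

    Every chunk access counts as one "decompression", so we can compare
    the cost of naive row iteration against chunkwise iteration.
    """

    def __init__(self, rows, chunk_rows):
        self.rows = rows              # list of row lists, stands in for chunk storage
        self.chunk_rows = chunk_rows  # rows per chunk
        self.decode_count = 0

    def get_chunk(self, start):
        self.decode_count += 1        # simulate decompressing one whole chunk
        return self.rows[start:start + self.chunk_rows]

    def get_row(self, i):
        # Naive access: decodes the whole chunk just to read a single row.
        start = (i // self.chunk_rows) * self.chunk_rows
        return self.get_chunk(start)[i - start]

def naive_iter(z):
    for i in range(len(z.rows)):
        yield z.get_row(i)

def chunkwise_iter(z):
    # Decode each chunk once, then yield its rows from memory.
    for start in range(0, len(z.rows), z.chunk_rows):
        for row in z.get_chunk(start):
            yield row

z = ChunkedArray([[i, i + 1] for i in range(100)], chunk_rows=10)
assert list(naive_iter(z)) == z.rows
naive_decodes = z.decode_count       # 100: one decode per row

z.decode_count = 0
assert list(chunkwise_iter(z)) == z.rows
chunkwise_decodes = z.decode_count   # 10: one decode per chunk
```

Both iterators yield identical rows, but the chunkwise version pays the decode cost once per chunk instead of once per row, at the memory cost of holding one decoded chunk at a time.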
Any thoughts?