Zarr use case: problems in multi threaded access to in memory cache #469

Open
shikharsg opened this issue Aug 13, 2019 · 8 comments

Labels
performance Potential issues with Zarr performance (I/O, memory, etc.)

@shikharsg
Contributor

I have a read-only zarr array stored on Azure Blob. It is a 4-dimensional climate data set, chunked along all 4 dimensions. It has about 9000 time steps in total, with 12 time steps per chunk.

I have a set of jobs, each of which I know will need access to at most 4 time steps, so even if it crosses a chunk boundary it will need at most 2 chunks (as far as the time dimension is concerned). There are hundreds of these jobs, many of which need access to the same chunks, so I use zarr.storage.LRUStoreCache to cache the chunks.

The problem is that I'm not sure how to run these jobs in parallel. What I currently do is sort the jobs in time order and run one job at a time in a for loop. The only parallelization here is fetching the chunks from blob for, say, the first job; the rest of the jobs then hit the chunks in the cache and don't have to get them from blob. But because the jobs run sequentially, 500 jobs take about 500 seconds to complete (plus some time to fetch chunks from blob in the first job, which is much less than 500 seconds).

I considered using threads, but it was really slow, presumably because of the GIL? I'm using a 64-core machine on Azure, where I can see the CPU spike during the first job, when it's fetching chunks from blob, but not for the rest of the jobs, where CPU usage is minimal. Behavior is the same when using threads. Is there a way I can access chunks from the cache in parallel?
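For reference, the setup described above looks roughly like this (a minimal sketch; the container, account, and cache-size values are placeholders, and the store parameters assume the zarr 2.x ABSStore signature):

```python
import zarr
from zarr.storage import ABSStore, LRUStoreCache

# Read-only 4-D climate array on Azure Blob (names/credentials are placeholders)
store = ABSStore(container="climate-data", prefix="dataset.zarr",
                 account_name="myaccount", account_key="...")

# In-process LRU cache of encoded chunks in front of the remote store
cached_store = LRUStoreCache(store, max_size=2 * 2**30)  # ~2 GiB

z = zarr.open(cached_store, mode="r")

def run_job(t0):
    # Each job touches at most 4 consecutive time steps, i.e. at most 2 chunks
    # along the time dimension when the window crosses a chunk boundary.
    return z[t0:t0 + 4].mean()
```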

PR #2814 from xarray might be relevant here. Any thoughts @rabernat?

Does dask have any method of extracting array chunks from cache in parallel? @mrocklin @jhamman

@jakirkham @alimanfoo

@alimanfoo
Member

alimanfoo commented Aug 13, 2019 via email

@dazzag24

Has anyone had any experience with the Plasma Object store?

Could this help, as it might allow you to build a cache that doesn't suffer from GIL effects?

@jakirkham
Member

I've looked at it before, but not in the context of Zarr. It could be interesting to explore a storage backend for Zarr that uses a Plasma Object store.
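A very rough sketch of what such a backend could look like, using the pyarrow.plasma API that was available at the time (it has since been deprecated and removed from pyarrow); the class name and the key-to-ObjectID hashing scheme are invented for illustration:

```python
import hashlib
from collections.abc import MutableMapping

import pyarrow.plasma as plasma  # present in older pyarrow releases only


class PlasmaStore(MutableMapping):
    """Hypothetical zarr store keeping chunk bytes in a Plasma object store,
    so processes on one machine can share a chunk cache outside the GIL."""

    def __init__(self, socket="/tmp/plasma"):
        self.client = plasma.connect(socket)

    def _oid(self, key):
        # Plasma ObjectIDs are 20 bytes; derive one deterministically from the key
        return plasma.ObjectID(hashlib.sha1(key.encode()).digest())

    def __getitem__(self, key):
        oid = self._oid(key)
        if not self.client.contains(oid):
            raise KeyError(key)
        return self.client.get(oid)

    def __setitem__(self, key, value):
        oid = self._oid(key)
        if not self.client.contains(oid):  # Plasma objects are immutable
            self.client.put(bytes(value), object_id=oid)

    def __delitem__(self, key):
        self.client.delete([self._oid(key)])

    def __iter__(self):
        raise NotImplementedError("Plasma does not retain the original keys")

    def __len__(self):
        return len(self.client.list())
```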

@dazzag24

dazzag24 commented Aug 14, 2019 via email

@jakirkham
Member

I think at some point we are going to want to detach the concept of a storage backend from where it sits in the loading pipeline. For example, when retrieving data from a cloud store, I might want to have an intermediate storage layer, like a local database, that provides quicker access to some data. Additionally, I may want something after that which holds data in-memory, though maybe that writes some data to disk. We will want that functionality regardless of whether it uses a Plasma Object store, LMDB, or something else.
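To illustrate the layering, a hypothetical composition where each layer is just a MutableMapping wrapping the next (the ReadThroughCache class is invented here; the store parameters are placeholders):

```python
from collections.abc import MutableMapping

import zarr
from zarr.storage import ABSStore, DirectoryStore, LRUStoreCache


class ReadThroughCache(MutableMapping):
    """Hypothetical read-through layer: consult `cache` first, fall back to
    `source` on a miss and populate `cache` with the result."""

    def __init__(self, source, cache):
        self.source = source
        self.cache = cache

    def __getitem__(self, key):
        try:
            return self.cache[key]
        except KeyError:
            value = self.source[key]
            self.cache[key] = value
            return value

    def __setitem__(self, key, value):
        self.source[key] = value
        self.cache[key] = value

    def __delitem__(self, key):
        del self.source[key]
        self.cache.pop(key, None)

    def __iter__(self):
        return iter(self.source)

    def __len__(self):
        return len(self.source)


# cloud -> local disk -> in-memory LRU, composed without the array caring which is which
remote = ABSStore(container="climate-data", prefix="dataset.zarr",
                  account_name="myaccount", account_key="...")
on_disk = ReadThroughCache(remote, DirectoryStore("/tmp/zarr-cache"))
store = LRUStoreCache(on_disk, max_size=2 * 2**30)
z = zarr.open(store, mode="r")
```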

@dazzag24

So some kind of automatic local caching of data IF you are using a cloud storage backend? Makes sense. Would this be a cache of the compressed or uncompressed chunks? In @shikharsg's case I believe he is caching the uncompressed chunks in memory.

@mrocklin
Contributor

mrocklin commented Aug 14, 2019 via email

@shikharsg
Contributor Author

I think at some point we are going to want to detach the concept of a storage backend from where it sits in the loading pipeline. For example, when retrieving data from a cloud store, I might want to have an intermediate storage layer, like a local database, that provides quicker access to some data. Additionally, I may want something after that which holds data in-memory, though maybe that writes some data to disk. We will want that functionality regardless of whether it uses a Plasma Object store, LMDB, or something else.

@jakirkham, I managed to implement a cache for decoded results using memcached. Now I can easily have multiple processes access chunks from (and store chunks to) the "same cache", which lives outside of these processes in memcached. The performance vastly surpasses the sequential method I described above. I can also see full CPU utilization on my D64 Azure VM.
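One possible shape for such a layer, sketched with pymemcache (this is not necessarily how it was implemented here, and it caches encoded chunk bytes at the store level rather than decoded results):

```python
from collections.abc import MutableMapping

from pymemcache.client.base import Client  # assumes a memcached server is running


class MemcachedStoreCache(MutableMapping):
    """Hypothetical read-through cache shared by many processes: chunk bytes
    live in memcached, outside the worker processes, so every worker benefits
    from chunks any other worker has already fetched."""

    def __init__(self, source, server=("localhost", 11211)):
        self.source = source
        self.mc = Client(server)

    def __getitem__(self, key):
        value = self.mc.get(key)
        if value is None:
            value = self.source[key]        # miss: fall back to the real store
            self.mc.set(key, bytes(value))  # ...and share it via memcached
        return value

    def __setitem__(self, key, value):
        self.source[key] = value
        self.mc.set(key, bytes(value))

    def __delitem__(self, key):
        del self.source[key]
        self.mc.delete(key)

    def __iter__(self):
        return iter(self.source)

    def __len__(self):
        return len(self.source)
```

Note that memcached's default 1 MiB item-size limit may need raising (memcached's -I option) if chunks are larger.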

Perhaps instead of using a cache you can just give all of the computations to Dask at once and have it keep track of reusing the data many times itself.


@mrocklin that's what I thought at first. But the application I am building works something like this: I get requests from the user to process some jobs. Say I give the first set of jobs to Dask together (so all computations are submitted at once), but before that set of jobs finishes I get a few more requests from the user, which I must start before the first set finishes (because I want the rate of processing to be high) and which also might use the same zarr chunks as the first set of jobs.
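For reference, a rough sketch of the all-at-once Dask approach being discussed (illustrative only; `z` is the zarr array opened earlier and the list of job start times is a placeholder):

```python
import dask
import dask.array as da

x = da.from_zarr(z)  # dask array mirroring the zarr chunking

job_start_times = range(0, 2000, 4)  # placeholder for the requested jobs

# Build every job lazily, then compute them together so the scheduler can
# reuse chunks that several jobs touch instead of re-reading them.
jobs = [x[t0:t0 + 4].mean() for t0 in job_start_times]
results = dask.compute(*jobs)
```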

@dstansby dstansby added the performance Potential issues with Zarr performance (I/O, memory, etc.) label Dec 12, 2024