Zarr use case: problems in multi threaded access to in memory cache #469

Open
shikharsg opened this issue Aug 13, 2019 · 8 comments

Labels
performance Potential issues with Zarr performance (I/O, memory, etc.)

@shikharsg
Contributor

I have a read-only zarr array stored on Azure Blob. It is a 4-dimensional climate data set, chunked along all 4 dimensions. It has about 9000 time steps in total, with 12 time steps per chunk.

I have a set of jobs, each of which I know will need access to at most 4 time steps, so even if it crosses a chunk boundary it will need at most 2 chunks (as far as the time dimension is concerned). There are hundreds of these jobs, many of which need access to the same chunks, so I use zarr.storage.LRUStoreCache to cache the chunks.

The problem is that I'm not sure how to run these jobs in parallel. What I currently do is sort the jobs in time order and run one job at a time in a for loop. The only parallelization here is fetching the chunks from blob for, say, the first job; the rest of the jobs then hit the chunks in the cache and don't have to get them from blob. But because the jobs run sequentially, 500 jobs take about 500 seconds to complete (plus some time to fetch chunks from blob in the first job, which is much less than 500 seconds).

I considered using threads, but it was really slow, presumably because of the GIL? I'm using a 64-core machine on Azure, where I can see the CPU spike during the first job, when it's fetching chunks from blob, but not for the rest of the jobs, where CPU usage is minimal. Behavior is the same when using threads. Is there a way I can access chunks from the cache in parallel?
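For reference, the setup described above looks roughly like this (a minimal sketch; the container, account, and cache-size values are placeholders, and the store parameters assume the zarr 2.x ABSStore signature):

```python
import zarr
from zarr.storage import ABSStore, LRUStoreCache

# Read-only 4-D climate array on Azure Blob (names/credentials are placeholders)
store = ABSStore(container="climate-data", prefix="dataset.zarr",
                 account_name="myaccount", account_key="...")

# In-process LRU cache of encoded chunks in front of the remote store
cached_store = LRUStoreCache(store, max_size=2 * 2**30)  # ~2 GiB

z = zarr.open(cached_store, mode="r")

def run_job(t0):
    # Each job touches at most 4 consecutive time steps, i.e. at most 2 chunks
    # along the time dimension when the window crosses a chunk boundary.
    return z[t0:t0 + 4].mean()
```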

PR #2814 from xarray might be relevant here. Any thoughts @rabernat?

Does dask have any method of extracting array chunks from cache in parallel? @mrocklin @jhamman

@jakirkham @alimanfoo

@alimanfoo
Member

alimanfoo commented Aug 13, 2019 via email

@dazzag24

Has anyone had any experience with the Plasma Object store?

Could this help, as it might allow you to build a cache that doesn't suffer from GIL effects?

@jakirkham
Member

I've looked at it before, but not in the context of Zarr. It could be interesting to explore a storage backend for Zarr that uses a Plasma Object store.
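A very rough sketch of what such a backend could look like, using the pyarrow.plasma API that was available at the time (it has since been deprecated and removed from pyarrow); the class name and the key-to-ObjectID hashing scheme are invented for illustration:

```python
import hashlib
from collections.abc import MutableMapping

import pyarrow.plasma as plasma  # present in older pyarrow releases only


class PlasmaStore(MutableMapping):
    """Hypothetical zarr store keeping chunk bytes in a Plasma object store,
    so processes on one machine can share a chunk cache outside the GIL."""

    def __init__(self, socket="/tmp/plasma"):
        self.client = plasma.connect(socket)

    def _oid(self, key):
        # Plasma ObjectIDs are 20 bytes; derive one deterministically from the key
        return plasma.ObjectID(hashlib.sha1(key.encode()).digest())

    def __getitem__(self, key):
        oid = self._oid(key)
        if not self.client.contains(oid):
            raise KeyError(key)
        return self.client.get(oid)

    def __setitem__(self, key, value):
        oid = self._oid(key)
        if not self.client.contains(oid):  # Plasma objects are immutable
            self.client.put(bytes(value), object_id=oid)

    def __delitem__(self, key):
        self.client.delete([self._oid(key)])

    def __iter__(self):
        raise NotImplementedError("Plasma does not retain the original keys")

    def __len__(self):
        return len(self.client.list())
```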

@dazzag24

dazzag24 commented Aug 14, 2019 via email

@jakirkham
Member

I think at some point we are going to want to detach the concept of a storage backend from where it sits in the loading pipeline. For example, when retrieving data from a cloud store, I might want to have an intermediate storage layer, like a local database, that provides quicker access to some data. Additionally, I may want something after that which holds data in-memory, though maybe that writes some data to disk. We will want that functionality regardless of whether it uses a Plasma Object store, LMDB, or something else.
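To illustrate the layering, a hypothetical composition where each layer is just a MutableMapping wrapping the next (the ReadThroughCache class is invented here; the store parameters are placeholders):

```python
from collections.abc import MutableMapping

import zarr
from zarr.storage import ABSStore, DirectoryStore, LRUStoreCache


class ReadThroughCache(MutableMapping):
    """Hypothetical read-through layer: consult `cache` first, fall back to
    `source` on a miss and populate `cache` with the result."""

    def __init__(self, source, cache):
        self.source = source
        self.cache = cache

    def __getitem__(self, key):
        try:
            return self.cache[key]
        except KeyError:
            value = self.source[key]
            self.cache[key] = value
            return value

    def __setitem__(self, key, value):
        self.source[key] = value
        self.cache[key] = value

    def __delitem__(self, key):
        del self.source[key]
        self.cache.pop(key, None)

    def __iter__(self):
        return iter(self.source)

    def __len__(self):
        return len(self.source)


# cloud -> local disk -> in-memory LRU, composed without the array caring which is which
remote = ABSStore(container="climate-data", prefix="dataset.zarr",
                  account_name="myaccount", account_key="...")
on_disk = ReadThroughCache(remote, DirectoryStore("/tmp/zarr-cache"))
store = LRUStoreCache(on_disk, max_size=2 * 2**30)
z = zarr.open(store, mode="r")
```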

@dazzag24

So some kind of automatic local caching of data IF you are using a cloud storage backend? Makes sense. Would this be a cache of the compressed or uncompressed chunks? In @shikharsg's case I believe he is caching the uncompressed chunks in memory.

@mrocklin
Contributor

mrocklin commented Aug 14, 2019 via email

@shikharsg
Contributor Author

I think at some point we are going to want to detach the concept of a storage backend from where it sits in the loading pipeline. For example, when retrieving data from a cloud store, I might want to have an intermediate storage layer, like a local database, that provides quicker access to some data. Additionally, I may want something after that which holds data in-memory, though maybe that writes some data to disk. We will want that functionality regardless of whether it uses a Plasma Object store, LMDB, or something else.

@jakirkham, I managed to implement a cache for decoded results using memcached. Now I can easily have multiple processes access chunks from (and store chunks to) the "same cache", which lives outside of these processes in memcached. The performance vastly surpasses the sequential method I described above. I can also see full CPU utilization on my D64 Azure VM.
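One possible shape for such a layer, sketched with pymemcache (this is not necessarily how it was implemented here, and it caches encoded chunk bytes at the store level rather than decoded results):

```python
from collections.abc import MutableMapping

from pymemcache.client.base import Client  # assumes a memcached server is running


class MemcachedStoreCache(MutableMapping):
    """Hypothetical read-through cache shared by many processes: chunk bytes
    live in memcached, outside the worker processes, so every worker benefits
    from chunks any other worker has already fetched."""

    def __init__(self, source, server=("localhost", 11211)):
        self.source = source
        self.mc = Client(server)

    def __getitem__(self, key):
        value = self.mc.get(key)
        if value is None:
            value = self.source[key]        # miss: fall back to the real store
            self.mc.set(key, bytes(value))  # ...and share it via memcached
        return value

    def __setitem__(self, key, value):
        self.source[key] = value
        self.mc.set(key, bytes(value))

    def __delitem__(self, key):
        del self.source[key]
        self.mc.delete(key)

    def __iter__(self):
        return iter(self.source)

    def __len__(self):
        return len(self.source)
```

Note that memcached's default 1 MiB item-size limit may need raising (memcached's -I option) if chunks are larger.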

Perhaps instead of using a cache you can just give all of the computations to Dask at once and have it keep track of reusing the data many times itself.


@mrocklin that's what I thought at first. But the application I am building works something like this: I get requests from the user to process some jobs. Say I give the first set of jobs to Dask together (so all computations are submitted at once), but before that set of jobs finishes I get a few more requests from the user, which I must start before the first set finishes (because I want the rate of processing to be high) and which also might use the same zarr chunks as the first set of jobs.
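For reference, a rough sketch of the all-at-once Dask approach being discussed (illustrative only; `z` is the zarr array opened earlier and the list of job start times is a placeholder):

```python
import dask
import dask.array as da

x = da.from_zarr(z)  # dask array mirroring the zarr chunking

job_start_times = range(0, 2000, 4)  # placeholder for the requested jobs

# Build every job lazily, then compute them together so the scheduler can
# reuse chunks that several jobs touch instead of re-reading them.
jobs = [x[t0:t0 + 4].mean() for t0 in job_start_times]
results = dask.compute(*jobs)
```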

@dstansby dstansby added the performance Potential issues with Zarr performance (I/O, memory, etc.) label Dec 12, 2024