When I repeatedly append data to a zarr array in s3, appending takes longer and longer and longer #938
Comments
I ran the above script last night. It appears to have stopped before doing 500 iterations, but you can at least see the trend.
My hunch would be that this is related to s3fs directory listing and caching. @martindurant any thoughts?
I recommend turning on debug for the logger "s3fs" to see which S3 operations, if any, are taking up the time. I'm not sure whether zarr would be calling any listing operations, but that would be an obvious thing that gets longer as more files accrue.
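For reference, a minimal sketch of turning on that logging with the standard logging module (nothing s3fs-specific beyond the logger name):

```python
# Route s3fs's debug messages to the console with timestamps,
# so the gaps between S3 calls are visible in the output.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logging.getLogger("s3fs").setLevel(logging.DEBUG)
```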
On it. Let me see what s3fs has to say. Thank you.
First, I don't think s3fs is doing directory listings on the variable path. It only does directory listings at the higher level with the .zgroup. So the problem does not seem to be that s3fs is having to deal with more and more objects under the variable path. Instead, the issue seems to be around writing the .zarray file. Every time my code appends some new data, there are 14 new log messages from s3fs. Here are the last 14 messages from the s3fs log of the run I did last weekend. If you look carefully at the time stamps, you can see where the jump in time is:
When I plot the time between log messages in the last few chunks of 14 messages vs. the first few chunks of 14 messages, the problem is clearly between the following two messages:
Here are the plots I did, showing the timing between messages at the beginning and end of the log:
I think that the next thing I'd be curious to try is to use s3fs directly to repeatedly overwrite an object in s3 and see what happens. I think I'll also try using the boto3 library directly to do the same thing and see what the timing is.
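A rough sketch of what that experiment could look like (not the actual script used here; the bucket name is a placeholder):

```python
# Time repeated overwrites of a single small S3 object, once through s3fs
# and once through boto3, to see whether either call slows down over time.
import time

import boto3
import s3fs

BUCKET = "my-test-bucket"   # placeholder
KEY = "timing-test/.zarray"
PAYLOAD = b"x" * 400        # roughly the size of a .zarray document

fs = s3fs.S3FileSystem()
s3 = boto3.client("s3")

for i in range(500):
    t0 = time.perf_counter()
    with fs.open(f"{BUCKET}/{KEY}", "wb") as f:   # overwrite via s3fs
        f.write(PAYLOAD)
    t1 = time.perf_counter()
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=PAYLOAD)  # overwrite via boto3
    t2 = time.perf_counter()
    print(f"iter {i}: s3fs={t1 - t0:.3f}s boto3={t2 - t1:.3f}s")
```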
An alternative to appending, if you know the final size of the array, would be to pre-allocate a large empty array and just write to the desired region. You could also consider storing your metadata in a separate, faster store (e.g. redis, a database, a local file, etc.), if this really turns out to be a bottleneck.
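For illustration, a minimal sketch of that pre-allocation approach, assuming the final size were known (bucket name and sizes are placeholders):

```python
# Allocate the full array once; zarr only writes chunk objects when data
# actually arrives, so the empty regions cost nothing in S3.
import numpy as np
import s3fs
import zarr

fs = s3fs.S3FileSystem()
store = s3fs.S3Map(root="my-test-bucket/prealloc.zarr", s3=fs)  # placeholder

z = zarr.open(store, mode="w", shape=(1_000_000,), chunks=(10_000,), dtype="f8")

# Later, write each incoming batch into its region instead of appending.
batch = np.random.rand(10_000)
z[0:10_000] = batch
```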
Perhaps it would be useful for you to add more logging to the _call_s3 method, to see at what point things get held up. There are ways to profile what's happening inside the async event loop / the state of fsspec's thread, but it would involve some work. I wonder, what is your total request rate here? AWS imposes quotas or rate limits for some operations.
If this is in one process, it can even be just a dict, and you can upload it when done. Zarr allows you to decouple the data and metadata stores.
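A small sketch of that decoupling using zarr's chunk_store argument, with an in-memory dict holding the metadata (store names are placeholders):

```python
# Metadata documents (.zgroup/.zarray/.zattrs) live in a plain dict, while
# chunk data goes to S3. Appends then only rewrite dict entries locally.
import numpy as np
import s3fs
import zarr

fs = s3fs.S3FileSystem()
chunk_store = s3fs.S3Map(root="my-test-bucket/data.zarr", s3=fs)  # placeholder
meta_store = {}

root = zarr.open_group(store=meta_store, chunk_store=chunk_store, mode="w")
var = root.create_dataset("var", shape=(0,), chunks=(10_000,), dtype="f8")
var.append(np.random.rand(10_000))

# When ingest is finished, copy the metadata documents into S3 as well.
for key, value in meta_store.items():
    chunk_store[key] = value
```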
@rabernat - We unfortunately do not have a final size for the array. Data keeps coming in until the mission is over. We did try pre-allocating an enormous array that was much bigger than our real datasets could ever grow to. This did sort out the timing issue, but made us worry about where to store the real size of the array.

@martindurant / @rabernat - storing the metadata somewhere else where zarr can find it may be the solution. I did not know that was possible and I will look into it. Thank you.

@martindurant - I think that aws write limits are possible, but unlikely. I just checked the throttling limits and my script is nowhere near them. I created a bucket just for this testing, so it's not as though anything else is writing there. The other possibility is network throttling since I'm using a kind of dinky EC2 for this test. But then I'd expect all the writes to be throttled. And they aren't. It's just the .zarray update.
Grasping at straws a bit, is versioning turned on? From https://docs.aws.amazon.com/AmazonS3/latest/userguide/versioning-workflows.html:
@martindurant: would the 503s be (silently) retried?
@joshmoore - that is an interesting idea, but my bucket did not have versioning. This is good to know, however, if we ever decide to turn versioning on.
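For anyone else checking the same thing, one way to confirm a bucket's versioning state (bucket name is a placeholder):

```python
# Query the bucket's versioning configuration; "Status" is only present
# if versioning has ever been enabled on the bucket.
import boto3

s3 = boto3.client("s3")
resp = s3.get_bucket_versioning(Bucket="my-test-bucket")
print(resp.get("Status", "Disabled"))  # "Enabled", "Suspended", or "Disabled"
```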
Can you please check: if you end your python session, and then restart appending in a new process, do you still see the slowness?
@martindurant - Will do. Not sure if I have time to run it today, but I will get back to you.
@martindurant - I appended to an existing array using a new process and it was just as slow as appending had been at the end of the first process. So I decided to wait a day and try appending again to see if there was some kind of temporary throttling happening in s3. Appending was still slow after it had been more than 12 hours since I'd last touched the zarr store. I even tried creating a brand new EC2 instance to append to the existing zarr store and that was also slow. (I have double and triple checked - we are definitely not using versioning in this bucket. So that is not the issue.)

It feels as though this must be something outside of zarr, outside of s3fs, and outside of python. I'm going to try that test where I just repeatedly write to the same object using s3fs and then boto3. If those also show this slowdown, I'm going to try contacting AWS support. My team has access to paid technical support.

Do you mind keeping the ticket open even though it seems as though this is most probably not zarr? If it is an AWS thing, others may run into the same issue and may find this thread helpful.
Certainly, please let us know here anything you find out.
So I don't know how to replicate this timing issue outside of zarr.
I'm not sure if I already suggested it, but it would be worthwhile looking at s3fs's logs to see whether zarr is calling list at any point - that might get slower as more chunks arrive.
Yeah. You actually did mention that before. As I said in #938 (comment), the logs do show that zarr lists the root directory repeatedly, which makes total sense since my script repeatedly opens that directory. But that directory has only one object - .zgroup. Zarr is not listing the /var/ directory, which has all the variable's objects.
If I'm reading the zarr code correctly (big if), I think this is where zarr asks s3fs (in my case) to write out the .zarray file. So, in theory, this is where I think the slowdown is happening: lines 253 to 269 in 594c441
In my case
I'm going to be honest. I do not understand what S3Map is. When I look at the documentation (https://s3fs.readthedocs.io/en/latest/api.html#s3fs.mapping.S3Map), I see a one-line description that does not mean anything to me because I don't know why they'd mirror a different class and I don't know what previous class they are mirroring. So in the script in #938 (comment), I just use the S3FileSystem directly to read and write stuff:
If you can help me understand what S3Map is, I could try rewriting my script to use that part of s3fs to read and write files.
S3Map is an FSMap, providing a dict-like view over storage. I suppose the doc is bad because
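In other words, a mapper just translates dict operations into object reads, writes, and listings, roughly like this (bucket name is a placeholder):

```python
# S3Map/FSMap exposes "bucket/prefix" as a mutable mapping of key -> bytes,
# which is exactly the interface zarr stores expect.
import s3fs

fs = s3fs.S3FileSystem()
mapper = s3fs.S3Map(root="my-test-bucket/test.zarr", s3=fs)  # placeholder

mapper[".zgroup"] = b'{"zarr_format": 2}'   # writes an object under the prefix
print(mapper[".zgroup"])                     # reads it back as bytes
print(list(mapper))                          # lists the keys under the prefix
```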
Hmmmmm. I'm going cross-eyed comparing my just-call-s3fs logs with the zarr-calls-s3fs logs. It appears that I'm seeing the same pattern of 6 log messages when I read and then write the .zarray in the just-call-s3fs logs. So yeah, using S3Map produces the same log messages as using S3FileSystem directly. I don't think that running a test with S3Map is going to make a difference.
I think I'm going to try a script that just asks zarr to reshape the array rather than appending. This will write the .zarray object presumably, but there won't be any chunks of data written, which is closer to what my s3fs-only script was doing. The extra time in the zarr logs is between the following log messages:
This is on the boundary in the log between the last message about writing the .zarray and the first message about writing the data chunk.
Thank you for your help and patience.
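A sketch of what that reshape-only test might look like (bucket name, chunk size, and iteration count are placeholders; each resize rewrites .zarray but writes no chunk data):

```python
# Grow the array's shape repeatedly without writing any chunks, so only the
# .zarray metadata object gets rewritten on each iteration.
import time

import s3fs
import zarr

fs = s3fs.S3FileSystem()
store = s3fs.S3Map(root="my-test-bucket/resize-test.zarr", s3=fs)  # placeholder
z = zarr.open(store, mode="w", shape=(0,), chunks=(10_000,), dtype="f8")

for i in range(1, 501):
    t0 = time.perf_counter()
    z.resize(i * 10_000)
    print(f"resize {i}: {time.perf_counter() - t0:.3f}s")
```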
I've run a few more tests and I'm unfortunately still not sure where the slowdown is happening.
So, even though the log messages from s3fs don't say anything about listing directories, they still slow down. Here are the messages from the last append:
Once again, the time jump happens between writing the .zarray and writing the data object:
You can see the timing quite clearly here at the beginning vs. at the end. (Note that the repeat happens every 12 messages rather than 14 because I'm not reopening the zarr store, so nothing in s3 gets listed.)
Hello! My colleague Hailiang is braver than I am and actually dove into zarr's code base. We think he found where the issue we're seeing lies. If you look at this section of code: Lines 2430 to 2441 in e381bf7
In that method, suppose that the old shape is 10x20 chunks and the new shape is 9x30 chunks. If I'm understanding what the current code does, it will loop over those 200 old chunks and discard the 20 chunks that are out of bounds of the new shape. But it feels as though this isn't the most efficient way of deleting chunks.

What my colleague's code suggestion does is loop over each dimension and figure out which chunks are clearly out of bounds. In the 10x20 --> 9x30 case, it looks at the first dimension and notices 10 > 9, so it flags the 20 chunks in the 10th row for deletion. It then looks at the second dimension and sees 20 < 30, so it doesn't need to delete any additional chunks. With the proposed algorithm, if all you're doing is appending, the code will very quickly figure out that it doesn't have to delete anything. And, indeed, when I ran my scripts against this new version of the code, appending was very quick even with a very large array.

Not everyone will, of course, be appending when they try to reshape an array. But we think that actually looping over dimensions is generally still going to be faster than looping over all chunks. The only time we think looping over dimensions will be a performance issue is if someone creates an array with a huge number of dimensions and not very many chunks. So, if the array has a total of one chunk, but thousands of dimensions, this code change will be slower. Practically speaking, though, is this likely?

Hailiang and I both work at the GES DISC, one of NASA's earth science distributed active archive centers (DAACs). The data we deal with has three or four dimensions and is more or less constantly growing. The proposed change is a clear win for that type of data. You would have more of an understanding of what kind of data is used more broadly with zarr. Maybe our data is the unusual case and extremely high dimensional data is the norm.
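To make the idea concrete, here is an illustrative sketch (not the actual patch that went into zarr) of selecting chunks to delete dimension by dimension instead of visiting every old chunk:

```python
# Instead of iterating over every old chunk, find, per dimension, the chunk
# indices that fall past the new edge; only those chunks need deleting.
import itertools
import math


def chunks_to_delete(old_shape, new_shape, chunks):
    """Yield indices of chunks that fall entirely outside the new shape."""
    ndim = len(chunks)
    old_cdata = [math.ceil(s / c) for s, c in zip(old_shape, chunks)]
    new_cdata = [math.ceil(s / c) for s, c in zip(new_shape, chunks)]
    seen = set()
    for dim in range(ndim):
        if new_cdata[dim] >= old_cdata[dim]:
            continue  # this dimension grew or stayed the same: nothing to trim
        # every chunk whose index along `dim` is past the new edge is obsolete
        ranges = [
            range(new_cdata[d], old_cdata[d]) if d == dim else range(old_cdata[d])
            for d in range(ndim)
        ]
        for idx in itertools.product(*ranges):
            if idx not in seen:
                seen.add(idx)
                yield idx


# 10x20 chunks shrinking to 9x30 chunks: only the 20 chunks in the dropped
# row are flagged; a pure append (every dimension grows) flags nothing.
print(len(list(chunks_to_delete((10, 20), (9, 30), (1, 1)))))   # 20
print(len(list(chunks_to_delete((10, 20), (11, 30), (1, 1)))))  # 0
```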
Thanks a lot for your very detailed update @christine-e-smit!
Yeah, I think assuming a small number of dimensions is very reasonable. Relatively few datasets will have more than 4 or 5 dimensions. Usually 2-3 spatial with possible extra dimensions for channels, time, batch, etc.
Just a note on this - zarr doesn't write any data it doesn't need to. If you pre-allocated an enormous array, it won't write chunks of zeros at every possible index, it just won't create the chunk files (or directories, for the NestedDirectoryStore). Then if it goes looking for a file later and can't find it, it'll return a chunk full of a particular fill value. If you later need to "realise" that sparse array, you can shrink it to the bounds of your actual data.
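A tiny local-disk demonstration of that behaviour (using a directory store rather than S3, just to show the principle):

```python
# Allocating a huge array writes no chunk objects; unwritten regions read
# back as the fill value, and the array can be shrunk to its real extent later.
import os

import zarr

z = zarr.open("example.zarr", mode="w", shape=(1_000_000_000,),
              chunks=(1_000_000,), dtype="f8", fill_value=0)
print(sorted(os.listdir("example.zarr")))  # only ['.zarray'], no chunk files yet

z[0:10] = 1.0                              # touching one region creates one chunk
print(sorted(os.listdir("example.zarr")))  # ['.zarray', '0']
print(z[999_999_999])                      # unwritten region returns 0.0

z.resize(10)                               # later, shrink to the real data bounds
```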
@MSanKeys963 and I assume this is solved with 2.12.0 (#1014). Can anyone confirm?
Yes.
Thanks, @hailiangzhang.
Minimal, reproducible code sample, a copy-pastable example if possible
Problem description
When we repeatedly append to an array in s3, the time zarr takes to append each new chunk grows longer and longer and longer. While we would normally expect this to happen if zarr had to rearrange what it had already written, we don't think that's what zarr does. As far as we can tell, zarr only needs to update the .zarray object before writing the new chunk of data into a new object. That .zarray object is very small and does not seem to increase in size as the array grows. So we expected that appending a new chunk to an existing array would be quick and that appending the 3rd chunk would take approximately as long as appending the 300th chunk. This is why we think we're seeing a bug.
The obvious solution is to just initialize the array to its full size. We've tried this and it does indeed work. So we do have a work-around. The issue is that our datasets do naturally grow because we get more data as time passes. Pre-allocating the array means we have to initialize the array to be extremely large and then keep track of its actual size somewhere else, which I think you will agree is a bit of a kluge, especially if we'd like to make these zarr stores available to third parties.
You may wonder why I repeatedly re-open the zarr store. The actual code we ran when we discovered this issue appended data periodically as data came in, so we were re-opening the zarr store repeatedly to write new data. I also am not sure what kind of caching/buffering zarr and/or s3fs may be doing. By allowing the zarr store to go out of scope, everything is presumably getting cleaned up and any cache/buffer is getting flushed.
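Since the copy-pastable example did not survive the page capture, here is a hedged reconstruction of the pattern described above (bucket name, chunk size, and iteration count are placeholders):

```python
# Repeatedly re-open the zarr store, append one batch, and let everything go
# out of scope again, timing each append.
import time

import numpy as np
import s3fs
import zarr

ROOT = "my-test-bucket/append-test.zarr"  # placeholder bucket/prefix


def open_store():
    # build a fresh filesystem + mapper each time, as the real ingest code
    # effectively does when everything goes out of scope between batches
    fs = s3fs.S3FileSystem()
    return s3fs.S3Map(root=ROOT, s3=fs)


root = zarr.open_group(open_store(), mode="w")
root.create_dataset("var", shape=(0,), chunks=(10_000,), dtype="f8")

for i in range(500):
    t0 = time.perf_counter()
    grp = zarr.open_group(open_store(), mode="a")
    grp["var"].append(np.random.rand(10_000))
    print(f"append {i}: {time.perf_counter() - t0:.3f}s")
```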
Version and installation information
Please provide the following:
zarr.__version__: 2.10.3
numcodecs.__version__: 0.9.1
Also, if you think it might be relevant, please provide the output from pip freeze or conda env export, depending on which was used to install Zarr.