Support for failed chunk requests #449
Comments
A simple check for near-zero values in received array chunks would suffice: an error could be raised that makes the user aware; otherwise data could be misrepresented without anyone noticing.
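A minimal sketch of the kind of check being proposed (illustrative only; the threshold, dtype handling, and function name are assumptions, not existing fsspec code):

```python
import numpy as np

def check_chunk_filled(chunk: np.ndarray, tiny: float = 1e-250) -> None:
    """Raise if a decoded chunk looks like uninitialised memory.

    Values that are non-zero but far below any plausible magnitude
    (denormal-scale floats such as 1e-300) suggest the buffer was never filled.
    """
    suspicious = (chunk != 0) & (np.abs(chunk) < tiny)
    if suspicious.any():
        raise ValueError(
            f"{int(suspicious.sum())} values smaller than {tiny}; "
            "chunk may not have been filled after a failed request"
        )

check_chunk_filled(np.array([1.0, 2.5, 0.0]))   # passes
# check_chunk_filled(np.full(4, 1e-310))        # would raise ValueError
```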
Another person in my group has pointed out these lines in the kerchunk source, which I'd assume are also present in fsspec, where the numpy array being served is initialised as empty (slightly faster, but it leaves the array with some VERY small values until properly filled): https://github.com/search?q=repo%3Afsspec%2Fkerchunk%20.empty(&type=code I think for failed chunk requests the empty array is left unfilled, but nothing is logged to state this. I'm pulling together an example now, which I will post later, and I'll see if it correlates with the number of chunks being requested (highly likely, since more chunks = more chance of a failure).
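For reference, a standalone illustration (not kerchunk/fsspec code) of why an unfilled np.empty buffer shows up as very small values rather than zeros:

```python
import numpy as np

empty = np.empty(4, dtype="float64")  # allocates without initialising: contents are
                                      # whatever bytes were already in memory, often
                                      # denormal-scale values like 1e-310
zeros = np.zeros(4, dtype="float64")  # explicitly zero-filled

print(empty)  # arbitrary values until the array is overwritten with real data
print(zeros)  # [0. 0. 0. 0.]
```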
referenceFS is supposed to return ReferenceNotReachable for the case where we know a reference exists (it's in the reference set) but loading it failed. Here is one occurrence: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/reference.py#L790 This exception is a subclass of RuntimeError, specifically so that it shouldn't be caught by any conversion to KeyError. In zarr, a KeyError means "does not exist, so fill with the replacement value", and for most stores an IOError would get transformed to KeyError for this reason. It would be good to add logging or otherwise debug what actually happens during the missing chunks. Turning on the "fsspec.reference" logger and whichever backend hosts the data (gcsfs? s3fs?) would be a good start. You didn't say: is this happening on a single node? How many threads/processes are in play?
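An illustrative sketch of that contract (not zarr's actual code; the function and fill-value handling are simplified):

```python
from collections.abc import Mapping
from fsspec.implementations.reference import ReferenceNotReachable

def load_chunk(store: Mapping, key: str, fill_value: bytes = b"") -> bytes:
    try:
        return store[key]
    except KeyError:
        # Chunk genuinely absent: zarr fills with the replacement/fill value.
        return fill_value
    except ReferenceNotReachable:
        # Reference exists but loading failed: propagate, never mask as "missing".
        raise
```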
Thanks for the advice. How do I turn on logging for fsspec.reference? Is there a keyword to add to fsspec.get_mapper? This is across multiple nodes in a cluster; I've tried it on my local machine, a few VMs, and a remote cluster.
I see your example ReferenceNotReachable call. I can see my requests are awaited in implementations.http _cat_file (ln 231), and I tried looking at the call in implementations.reference _cat_file (ln 708), but the request doesn't seem to be coming from there; it must be from somewhere else where this is not caught in the same way.
You can turn the loggers on with ordinary Python logging configuration, but you will need to run this on all processes (e.g., with dask client.run()).
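A minimal sketch of one way to do that (assuming a dask.distributed Client is already running and an s3fs backend; swap in "gcsfs" or "fsspec.http" as appropriate):

```python
import logging

def enable_debug_logging():
    for name in ("fsspec.reference", "s3fs"):
        logger = logging.getLogger(name)
        logger.setLevel(logging.DEBUG)
        if not logger.handlers:
            handler = logging.StreamHandler()
            handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
            logger.addHandler(handler)

enable_debug_logging()               # on the local process
# client.run(enable_debug_logging)   # and on every dask worker in the cluster
```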
I've set up a script locally with a scenario that does one of two things. If I'm connected to WiFi, the dataset loads after ~30 seconds with no problem. If I disconnect and rerun, I trigger a decode error in pandas which carries through into xarray as an OverflowError (tiny values being decoded). Logging doesn't seem to reveal anything useful: there's a lot of repeated output,
and I can see where await _cat_file is triggered because I added a logger message at that point, but nothing else happens after that before the overflow is triggered. Edit: The very small time chunk is part of this test; all the other files have the default inline_threshold, so these small chunks are normally just decoded from b64, but for this test it's obvious when the request has failed when trying to decode the times.
You should also enable "fsspec.http" in that case - it may have something useful to tell us.
Narrowed the issue down to a parameter on_error, which is set to "return" rather than "raise" at a given point, and that leads to the overflow issue. I'm digging through the traceback to see where this comes from, as some calls do have "raise" as the option, while the very first one with "return" causes the issue. Edit: Looks like this traces all the way back to zarr.storage (ln 1420), where on_error is set to "omit" when calling self.map.getitems, which gets handled as "return" in fsspec.mapping.FSMap.getitems.
"return" is supposed to be the right thing, so that further up the stack, code can decide whether the specific exception counts as a KeyError (i.e., missing values) or something to raise. If it were "raise", the only option would be to catch the exception, but then you have no data. Another thing to try to find out: the HTTP request itself should be retriable, but there are many failure channels, so it's possible that something like a 500 (server busy) actually should never fail, just go slowly. |
zarr-developers/zarr-python#1604 was recently merged, does that help? |
Looks like that does seem to work with my local test - thanks for that! I'll add here if I encounter more issues of this type; I should know soon if this works for the other files too. I've just noticed a detail with the raised error that may cause issues further on. Edit: I realise this may also be a version issue; I'll update to the latest xarray and zarr and try again.
I'm running Kerchunk at scale across TBs of NetCDF data and have noticed a consistent issue in my validation process, where certain comparisons between the NetCDF and Kerchunk data fail.
I've found that when doing a max, min, and mean comparison between arrays of arbitrary sizes, occasionally two of the three tests fail, typically the mean comparison and one of the other two. I believe this is explained by the Kerchunk array having more NaN/near-zero values, so a max or min comparison may still pass while the mean comparison fails. I think this is because chunk requests are being refused or failing, which results in that part of the array being left empty (near-zero).
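A rough sketch of the kind of comparison described (the file paths, variable name, and reference protocol below are placeholders, not the actual validation code):

```python
import fsspec
import numpy as np
import xarray as xr

native = xr.open_dataset("data/file_0001.nc")   # placeholder path
mapper = fsspec.get_mapper(
    "reference://", fo="refs/dataset.json", remote_protocol="https"  # placeholder refs
)
kerchunked = xr.open_zarr(mapper, consolidated=False)

for stat in ("max", "min", "mean"):
    a = float(getattr(native["tas"], stat)())      # "tas" is a placeholder variable
    b = float(getattr(kerchunked["tas"], stat)())
    if not np.isclose(a, b, equal_nan=True):
        print(f"{stat} mismatch: native={a}, kerchunk={b}")
```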
I know the array chunks come out as near-zero when the chunk request goes wrong because these values can cause an overflow error in certain cases, such as when decoding time values, since the values are so small (~10^-300). Over a set of 1000 datasets (100+ NetCDF files in each), I'm finding this problem affects ~10% of the datasets, with no dependency on the type of dataset. The issue comes up at random and is only detectable by direct comparison to the native data, or when it causes an error in decoding.
This could be a major issue with using Kerchunk at such a wide scale, and there really needs to be some kind of check within fsspec at the point of making a request to ensure that the correct data was actually received.