Support for failed chunk requests #449
Comments
A simple check for near-zero values in received array chunks would suffice: an error could be raised that makes the user aware; otherwise data could be misrepresented without anyone noticing.
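A minimal sketch of the kind of check being proposed (illustrative only; the threshold, dtype handling, and function name are assumptions, not existing fsspec code):

```python
import numpy as np

def check_chunk_filled(chunk: np.ndarray, tiny: float = 1e-250) -> None:
    """Raise if a decoded chunk looks like uninitialised memory.

    Values that are non-zero but far below any plausible magnitude
    (denormal-scale floats such as 1e-300) suggest the buffer was never filled.
    """
    suspicious = (chunk != 0) & (np.abs(chunk) < tiny)
    if suspicious.any():
        raise ValueError(
            f"{int(suspicious.sum())} values smaller than {tiny}; "
            "chunk may not have been filled after a failed request"
        )

check_chunk_filled(np.array([1.0, 2.5, 0.0]))   # passes
# check_chunk_filled(np.full(4, 1e-310))        # would raise ValueError
```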
Another person in my group has pointed out these lines in the kerchunk source, which I'd assume are also present in fsspec, where the numpy array being served is initialised as empty (slightly faster, but it leaves the array with some VERY small values until properly filled): https://github.com/search?q=repo%3Afsspec%2Fkerchunk%20.empty(&type=code I think for failed chunk requests the empty array is left unfilled, but nothing is logged to state this. I'm pulling together an example now, which I will post later, and I'll see if it correlates with the number of chunks being requested (highly likely, since more chunks = more chance of a failure).
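For reference, a standalone illustration (not kerchunk/fsspec code) of why an unfilled np.empty buffer shows up as very small values rather than zeros:

```python
import numpy as np

empty = np.empty(4, dtype="float64")  # allocates without initialising: contents are
                                      # whatever bytes were already in memory, often
                                      # denormal-scale values like 1e-310
zeros = np.zeros(4, dtype="float64")  # explicitly zero-filled

print(empty)  # arbitrary values until the array is overwritten with real data
print(zeros)  # [0. 0. 0. 0.]
```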
referenceFS is supposed to return ReferenceNotReachable for the case where we know a reference exists (it's in the reference set) but loading it failed. Here is one occurrence: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/reference.py#L790 This exception is a subclass of RuntimeError, specifically so that it shouldn't be caught by any conversion to KeyError. In zarr, a KeyError means "does not exist, so fill with the replacement value", and for most stores an IOError would get transformed to KeyError for this reason. It would be good to add logging or otherwise debug what actually happens during the missing chunks. Turning on the "fsspec.reference" logger and whichever backend hosts the data (gcsfs? s3fs?) would be a good start. You didn't say: is this happening on a single node? How many threads/processes are in play?
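An illustrative sketch of that contract (not zarr's actual code; the function and fill-value handling are simplified):

```python
from collections.abc import Mapping
from fsspec.implementations.reference import ReferenceNotReachable

def load_chunk(store: Mapping, key: str, fill_value: bytes = b"") -> bytes:
    try:
        return store[key]
    except KeyError:
        # Chunk genuinely absent: zarr fills with the replacement/fill value.
        return fill_value
    except ReferenceNotReachable:
        # Reference exists but loading failed: propagate, never mask as "missing".
        raise
```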
Thanks for the advice. How do I turn on logging for fsspec.reference? Is there a keyword to add to fsspec.get_mapper? This is across multiple nodes in a cluster; I've tried it on my local machine, a few VMs, and a remote cluster.
I see your example ReferenceNotReachable call. I can see my requests are awaited in implementations.http _cat_file (ln 231), and I tried looking at the call in implementations.reference _cat_file (ln 708), but the request doesn't seem to be coming from there; it must be from somewhere else where this is not caught in the same way.
You can turn the loggers on with ordinary Python logging configuration, but you will need to run this on all processes (e.g., with dask client.run()).
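A minimal sketch of one way to do that (assuming a dask.distributed Client is already running and an s3fs backend; swap in "gcsfs" or "fsspec.http" as appropriate):

```python
import logging

def enable_debug_logging():
    for name in ("fsspec.reference", "s3fs"):
        logger = logging.getLogger(name)
        logger.setLevel(logging.DEBUG)
        if not logger.handlers:
            handler = logging.StreamHandler()
            handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
            logger.addHandler(handler)

enable_debug_logging()               # on the local process
# client.run(enable_debug_logging)   # and on every dask worker in the cluster
```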
I've set up a script locally with a scenario that does one of two things. If I'm connected to WiFi, the dataset loads after ~30 seconds with no problem. If I disconnect and rerun, I trigger a decode error in pandas which carries through into xarray as an OverflowError (tiny values being decoded). Logging doesn't seem to reveal anything useful: there's a lot of repeated output,
and I can see where await _cat_file is triggered because I added a logger message at that point, but nothing else happens after that before the overflow is triggered. Edit: The very small time chunk is part of this test; all the other files have the default inline_threshold, so these small chunks are normally just decoded from b64, but for this test it's obvious when the request has failed when trying to decode the times.
You should also enable "fsspec.http" in that case - it may have something useful to tell us.
Narrowed the issue down to a parameter on_error, which is set to "return" rather than "raise" at a given point, and that leads to the overflow issue. I'm digging through the traceback to see where this comes from, as some calls do have "raise" as the option, while the very first one with "return" causes the issue. Edit: Looks like this traces all the way back to zarr.storage (ln 1420), where on_error is set to "omit" when calling self.map.getitems, which gets handled as "return" in fsspec.mapping.FSMap.getitems.
"return" is supposed to be the right thing, so that further up the stack, code can decide whether the specific exception counts as a KeyError (i.e., missing values) or something to raise. If it were "raise", the only option would be to catch the exception, but then you have no data. Another thing to try to find out: the HTTP request itself should be retriable, but there are many failure channels, so it's possible that something like a 500 (server busy) actually should never fail, just go slowly. |
zarr-developers/zarr-python#1604 was recently merged, does that help? |
Looks like that does seem to work with my local test - thanks for that! I'll add here if I encounter more issues of this type; I should know soon if this works for the other files too. I've just noticed a detail with the raised error that may cause issues further on. Edit: I realise this may also be a version issue; I'll update to the latest xarray and zarr and try again.
I'm running Kerchunk at scale across TBs of NetCDF data and have noticed a consistent issue in my validation process, where certain comparisons between the NetCDF and Kerchunk data fail.
I've found that when doing a max, min, and mean comparison between arrays of arbitrary sizes, occasionally two of the three tests fail, typically the mean comparison and one of the other two. I believe this is explained by the Kerchunk array having more NaN/near-zero values, so a max or min comparison may still pass while the mean comparison fails. I think this is because chunk requests are being refused or failing, which results in that part of the array being left empty (near-zero).
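A rough sketch of the kind of comparison described (the file paths, variable name, and reference protocol below are placeholders, not the actual validation code):

```python
import fsspec
import numpy as np
import xarray as xr

native = xr.open_dataset("data/file_0001.nc")   # placeholder path
mapper = fsspec.get_mapper(
    "reference://", fo="refs/dataset.json", remote_protocol="https"  # placeholder refs
)
kerchunked = xr.open_zarr(mapper, consolidated=False)

for stat in ("max", "min", "mean"):
    a = float(getattr(native["tas"], stat)())      # "tas" is a placeholder variable
    b = float(getattr(kerchunked["tas"], stat)())
    if not np.isclose(a, b, equal_nan=True):
        print(f"{stat} mismatch: native={a}, kerchunk={b}")
```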
I know the array chunks come out as near-zero when the chunk request goes wrong because these values can cause an overflow error in certain cases, such as when decoding time values, since the values are so small (~10^-300). Over a set of 1000 datasets (100+ NetCDF files in each), I'm finding this problem affects ~10% of the datasets, with no dependency on the type of dataset. The issue comes up at random and is only detectable by direct comparison to the native data, or when it causes an error in decoding.
This could be a major issue with using Kerchunk at such a wide scale, and there really needs to be some kind of check within fsspec at the point of making a request to ensure that the correct data was actually received.