Data for variable is not inlined despite size below threshold #394

Closed

itcarroll opened this issue Nov 16, 2023 · 8 comments

@itcarroll

I am trying to build a MultiZarrToZarr from the GLDAS series distributed by GES DISC, and concatenation on time is not coming out right. I think I have traced the problem to something about the chunking of the "time" variable in the individual NetCDF4 files. A symptom of something being wrong is that the data for this variable does not get inlined and the ref is suspicious.

The "time" variable in each file has shape (1,) and dtype float64, but invoking SingleHdf5ToZarr using the default value for inline_threshold does not embed the data:

from kerchunk.hdf import SingleHdf5ToZarr

reference = SingleHdf5ToZarr("GLDAS_NOAH025_3H.A20000101.0300.021.nc4").translate()
print(reference["refs"]["time/0"])
['GLDAS_NOAH025_3H.A20000101.0300.021.nc4', 22998, 4096]

Rather than a reference covering far too many bytes, I'm expecting a base64 string representing the single value. The only clue I can offer is that xarray reports the encoding on "time" as having a chunk size of (512,) ... why, I have no idea.
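
For reference, here's roughly the form I expected and how I would decode it (a sketch only; the "base64:" prefix is kerchunk's usual convention for inlined chunks, and the value 3600.0 is just an illustration, not taken from the file):

import base64
import numpy as np

# Hypothetical inlined reference for a single float64 time value:
inlined = "base64:" + base64.b64encode(np.array([3600.0]).tobytes()).decode()

# Decoding it back into the single value:
raw = base64.b64decode(inlined[len("base64:"):])
print(np.frombuffer(raw, dtype="<f8"))  # [3600.]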

GLDAS_NOAH025_3H.A20000101.0300.021.nc4.zip

@martindurant
Member

It does seem that the buffer really is that size. Perhaps the chunksize was chosen so that the HDF5 file can be appended to later?

In any case, inlining looks at the buffer size and includes the whole buffer if it is small enough. In this case, we know the array is much smaller than the buffer, so we should instead consider the whole size of the array and inline that (taking care to change the chunksize to match). So: doable, but needs a little work.
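
Rough arithmetic for this case (a sketch; I haven't checked the exact default inline_threshold here, but the point is that the 4096-byte buffer exceeds it while the 8-byte array would not):

import numpy as np

itemsize = np.dtype("<f8").itemsize  # 8 bytes per float64
buffer_size = 512 * itemsize         # 4096 bytes: the stored HDF5 chunk, above the threshold
actual_size = 1 * itemsize           # 8 bytes: the logical array, well below the threshold
print(buffer_size, actual_size)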

@itcarroll
Author

Thanks for looking! To confirm, I think you are saying that ...

$ h5dump -p -d time GLDAS_NOAH025_3H.A20000101.0600.021.nc4
HDF5 "GLDAS_NOAH025_3H.A20000101.0600.021.nc4" {
DATASET "time" {
   DATATYPE  H5T_IEEE_F64LE
   DATASPACE  SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) }
   STORAGE_LAYOUT {
      CHUNKED ( 512 )
      SIZE 4096
   }
...

means it really is using 4096 bytes on disk.

So yes, it seems like inlining could get smarter, but it is getting the right value.
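
For completeness, the same check with h5py (quick sketch):

import h5py

with h5py.File("GLDAS_NOAH025_3H.A20000101.0600.021.nc4", "r") as f:
    ds = f["time"]
    print(ds.shape, ds.dtype)        # (1,) float64
    print(ds.chunks)                 # (512,)
    print(ds.id.get_storage_size())  # 4096 bytes allocated on disk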

@martindurant
Member

it really is using 4096 bytes on disk

Yes.

Note that if you combine over these values later, you will get only the actual values, not the whole buffers.

@itcarroll
Author

I've looked further into my problem, @martindurant, and would very much appreciate a suggestion on whether I should open an issue at zarr about the behavior of zarr.storage.FSStore, or try to help with a resolution here. As for my original report about data not getting inlined, I'm inclined to close it, because a) it's not my actual problem, and b) inlining is a chunk-level decision: you can't go inlining one small chunk at the edge of an array just because it happens to hold only a few values.

My actual problem is that MultiZarrToZarr is never informed of the error that occurs when s3 credentials are missing/invalid/expired for a reference that must be de-referenced to get data for a coordinate in concat_dims. In the example below, the multi object ends up with an incorrect time array, whose ultimate source is the empty array created in zarr.core.Array._get_selection. That empty array passes through untouched because zarr.storage.FSStore.getitems omits the "time" key when it cannot reach the data, and no fill value is ever applied. I'm unsure whether this is a bug in zarr's logic, and if it is not, whether MultiZarrToZarr could even check for missing/invalid s3 credentials up front.

from kerchunk.combine import MultiZarrToZarr

single = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format":2}',
        "time/.zarray": '{"chunks":[512],"compressor":null,"dtype":"<f8","fill_value":null,"filters":null,"order":"C","shape":[1],"zarr_format":2}',
        "time/0": [
            "s3://gesdisc-cumulus-prod-protected/GLDAS/GLDAS_NOAH025_3H.2.1/2018/152/GLDAS_NOAH025_3H.A20180601.0000.021.nc4",
            22998,
            4096,
        ],
    },
}
multi = MultiZarrToZarr([single], concat_dims="time").translate()
multi["refs"]["time/0"]                 <-- the value is arbitrary/wrong

You get the right value if you pass good credentials with remote_options, but something ought to break here rather than be silently incorrect.
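
For now I can fail fast by de-referencing the coordinate chunk myself before combining (a sketch; the remote_options below are a placeholder for real credentials):

import fsspec

fs = fsspec.filesystem(
    "reference",
    fo=single,
    remote_protocol="s3",
    remote_options={"anon": False},  # placeholder; real credentials go here
)
fs.cat("time/0")  # raises if the s3 reference cannot actually be fetched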

@martindurant
Member

I am not at all sure how that might be happening...
When accessing data via xarray/zarr/referenceFS, there should be a different exception for "there is no reference at that path" (FileNotFoundError) versus "that reference exists but didn't load" (ReferenceNotReachable); that was supposed to prevent random chunks turning up as NaN when there are intermittent IO problems. I would have expected an exception, but that happens at final read time, not during combine, so something must be different. I recommend activating the fsspec.reference and s3fs loggers and maybe setting breakpoints to see where this happens.
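
For example, with the standard logging module:

import logging

logging.basicConfig(level=logging.WARNING)
for name in ("fsspec.reference", "s3fs"):
    logging.getLogger(name).setLevel(logging.DEBUG)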

@itcarroll
Author

You do get ReferenceNotReachable from fsspec.FSMap.getitems#L101, but because the caller (zarr.FSStore.getitems#1420) passes on_error='omit', the error is never raised.

@martindurant
Member

Well... it really should be "return", and then the caller should filter what comes back: drop anything that is in self.exceptions and raise everything else (i.e., getting a batch should produce the same result as serially getting each key with __getitem__). So I'd say it's their fault.
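
Concretely, something along these lines (a sketch of the intended behaviour; getitems_strict is just an illustrative name, not an existing zarr or fsspec API):

def getitems_strict(mapper, keys, missing_exceptions=(KeyError, FileNotFoundError)):
    # Behave like serial __getitem__ calls: silently drop genuinely missing keys,
    # but re-raise anything else (ReferenceNotReachable, credential errors, ...).
    results = mapper.getitems(keys, on_error="return")
    out = {}
    for key, value in results.items():
        if isinstance(value, missing_exceptions):
            continue
        if isinstance(value, BaseException):
            raise value
        out[key] = value
    return out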

@itcarroll
Author

Closing original issue, which was not a bug.

Moving the secondary issue discovered to zarr as zarr-developers/zarr-python#1578.
