Data for variable is not inlined despite size below threshold #394

Closed

itcarroll opened this issue Nov 16, 2023 · 8 comments

@itcarroll

I am trying to build a MultiZarrToZarr from the GLDAS series distributed by GES DISC, and concatenation on time is not coming out right. I think I have traced the problem to something about the chunking of the "time" variable in the individual NetCDF4 files. A symptom of something being wrong is that the data for this variable does not get inlined and the ref is suspicious.

The "time" variable in each file has shape (1,) and dtype float64, but invoking SingleHdf5ToZarr using the default value for inline_threshold does not embed the data:

from kerchunk.hdf import SingleHdf5ToZarr

reference = SingleHdf5ToZarr("GLDAS_NOAH025_3H.A20000101.0300.021.nc4").translate()
print(reference["refs"]["time/0"])
['GLDAS_NOAH025_3H.A20000101.0300.021.nc4', 22998, 4096]

Rather than a reference covering far too many bytes, I'm expecting a base64 string representing the single value. The only clue I can offer is that xarray reports the encoding on "time" as having a chunk size of (512,) ... why, I have no idea.
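
For reference, here's roughly the form I expected and how I would decode it (a sketch only; the "base64:" prefix is kerchunk's usual convention for inlined chunks, and the value 3600.0 is just an illustration, not taken from the file):

import base64
import numpy as np

# Hypothetical inlined reference for a single float64 time value:
inlined = "base64:" + base64.b64encode(np.array([3600.0]).tobytes()).decode()

# Decoding it back into the single value:
raw = base64.b64decode(inlined[len("base64:"):])
print(np.frombuffer(raw, dtype="<f8"))  # [3600.]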

GLDAS_NOAH025_3H.A20000101.0300.021.nc4.zip

@martindurant
Member

It does seem that the buffer really is that size. Perhaps the chunksize was chosen so that the HDF5 file can be appended to later?

In any case, inlining looks at the buffer size and includes the whole buffer if it is small enough. In this case, we know the array is much smaller than the buffer, so we should instead consider the whole size of the array and inline that (taking care to change the chunksize to match). So: doable, but needs a little work.
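
Rough arithmetic for this case (a sketch; I haven't checked the exact default inline_threshold here, but the point is that the 4096-byte buffer exceeds it while the 8-byte array would not):

import numpy as np

itemsize = np.dtype("<f8").itemsize  # 8 bytes per float64
buffer_size = 512 * itemsize         # 4096 bytes: the stored HDF5 chunk, above the threshold
actual_size = 1 * itemsize           # 8 bytes: the logical array, well below the threshold
print(buffer_size, actual_size)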

@itcarroll
Author

Thanks for looking! To confirm, I think you are saying that ...

$ h5dump -p -d time GLDAS_NOAH025_3H.A20000101.0600.021.nc4
HDF5 "GLDAS_NOAH025_3H.A20000101.0600.021.nc4" {
DATASET "time" {
   DATATYPE  H5T_IEEE_F64LE
   DATASPACE  SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) }
   STORAGE_LAYOUT {
      CHUNKED ( 512 )
      SIZE 4096
   }
...

means it really is using 4096 bytes on disk.

So yes, it seems like inlining could get smarter, but it is getting the right value.
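
For completeness, the same check with h5py (quick sketch):

import h5py

with h5py.File("GLDAS_NOAH025_3H.A20000101.0600.021.nc4", "r") as f:
    ds = f["time"]
    print(ds.shape, ds.dtype)        # (1,) float64
    print(ds.chunks)                 # (512,)
    print(ds.id.get_storage_size())  # 4096 bytes allocated on disk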

@martindurant
Member

it really is using 4096 bytes on disk

Yes.

Note that if you combine over these values later, you will get only the actual values, not the whole buffers.

@itcarroll
Author

I've looked further into my problem, @martindurant, and would very much appreciate a suggestion on whether I should open an issue at zarr about the behavior of zarr.storage.FSStore, or try to help with a resolution here. As for my original report about data not getting inlined, I'm inclined to close it, because a) it's not my actual problem, and b) inlining is a chunk-level decision: you can't go inlining one small chunk at the edge of an array just because it happens to hold only a few values.

My actual problem is that MultiZarrToZarr is never informed of the error that occurs when s3 credentials are missing/invalid/expired for a reference that must be de-referenced to get data for a coordinate in concat_dims. In the example below, the multi object ends up with an incorrect time array, whose ultimate source is the empty array created in zarr.core.Array._get_selection. That empty array passes through untouched because zarr.storage.FSStore.getitems omits the "time" key when it cannot reach the data, and no fill value is ever applied. I'm unsure whether this is a bug in zarr's logic, and if it is not, whether MultiZarrToZarr could even check for missing/invalid s3 credentials up front.

from kerchunk.combine import MultiZarrToZarr

single = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format":2}',
        "time/.zarray": '{"chunks":[512],"compressor":null,"dtype":"<f8","fill_value":null,"filters":null,"order":"C","shape":[1],"zarr_format":2}',
        "time/0": [
            "s3://gesdisc-cumulus-prod-protected/GLDAS/GLDAS_NOAH025_3H.2.1/2018/152/GLDAS_NOAH025_3H.A20180601.0000.021.nc4",
            22998,
            4096,
        ],
    },
}
multi = MultiZarrToZarr([single], concat_dims="time").translate()
multi["refs"]["time/0"]                 <-- the value is arbitrary/wrong

You get the right value if you pass good credentials with remote_options, but something ought to break here rather than be silently incorrect.
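
For now I can fail fast by de-referencing the coordinate chunk myself before combining (a sketch; the remote_options below are a placeholder for real credentials):

import fsspec

fs = fsspec.filesystem(
    "reference",
    fo=single,
    remote_protocol="s3",
    remote_options={"anon": False},  # placeholder; real credentials go here
)
fs.cat("time/0")  # raises if the s3 reference cannot actually be fetched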

@martindurant
Member

I am not at all sure how that might be happening...
When accessing data via xarray/zarr/referenceFS, there should be a different exception for "there is no reference at that path" (FileNotFoundError) versus "that reference exists but didn't load" (ReferenceNotReachable); that was supposed to prevent random chunks turning up as NaN when there are intermittent IO problems. I would have expected an exception, but that happens at final read time, not during combine, so something must be different. I recommend activating the fsspec.reference and s3fs loggers and maybe setting breakpoints to see where this happens.
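
For example, with the standard logging module:

import logging

logging.basicConfig(level=logging.WARNING)
for name in ("fsspec.reference", "s3fs"):
    logging.getLogger(name).setLevel(logging.DEBUG)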

@itcarroll
Author

You do get ReferenceNotReachable from fsspec.FSMap.getitems#L101, but because the caller (zarr.FSStore.getitems#1420) passes on_error='omit', the error is never raised.

@martindurant
Member

Well... it really should be "return", and then the caller should filter what comes back: drop anything that is in self.exceptions and raise everything else (i.e., getting a batch should produce the same result as serially getting each key with __getitem__). So I'd say it's their fault.
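
Concretely, something along these lines (a sketch of the intended behaviour; getitems_strict is just an illustrative name, not an existing zarr or fsspec API):

def getitems_strict(mapper, keys, missing_exceptions=(KeyError, FileNotFoundError)):
    # Behave like serial __getitem__ calls: silently drop genuinely missing keys,
    # but re-raise anything else (ReferenceNotReachable, credential errors, ...).
    results = mapper.getitems(keys, on_error="return")
    out = {}
    for key, value in results.items():
        if isinstance(value, missing_exceptions):
            continue
        if isinstance(value, BaseException):
            raise value
        out[key] = value
    return out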

@itcarroll
Author

Closing original issue, which was not a bug.

Moving the secondary issue discovered to zarr as zarr-developers/zarr-python#1578.
