
tiff_to_zarr for geotiff with compression: zarr reads strange values #317

Closed
croth1 opened this issue Mar 12, 2023 · 14 comments

Comments


croth1 commented Mar 12, 2023

imagecodecs               2023.1.23          py38h5408aff_0    conda-forge
zarr                      2.13.6               pyhd8ed1ab_0    conda-forge
kerchunk                  0.1.0+32.g39658f8          pypi_0    pypi

When I translate GeoTIFFs created by rasterio with tiff_to_zarr, it only seems to work for uncompressed files. As soon as I choose a compression, the values read back through zarr are off:

import warnings
from pathlib import Path

import numpy as np
import rasterio as rio
import zarr
from kerchunk.tiff import tiff_to_zarr

warnings.simplefilter("ignore")

tiff_file = Path("data.tif").absolute()

x, y, z = 512, 512, 5


opts = dict(tiled=True, blockxsize=256, blockysize=256, dtype=np.uint32)

compressions = ["NONE", "ZSTD", "LZW", "DEFLATE", "LERC"]
for compression in compressions:
    with rio.open(
        tiff_file,
        "w",
        driver="GTiff",
        width=x,
        height=y,
        count=z,
        compress=compression,
        **opts,
    ) as out:
        datablock = np.arange(x * y * z).reshape(z, x, y)
        out.write(datablock)

    # sanity check - reading the full block back into memory
    with rio.open(tiff_file) as tiff_in:
        block2 = tiff_in.read()
        np.testing.assert_array_equal(block2, datablock)

    try:
        zarr_file = Path("data_zarr.json").absolute()
        tiff_to_zarr(
            f"file://{tiff_file}",
            target=f"file://{zarr_file}",
            target_options=dict(mode="w"),
        )

        arr = zarr.open("reference://", storage_options=dict(fo="data_zarr.json"))
        np.testing.assert_equal(arr[42, 42, 4], datablock[4, 42, 42])
        print(f"{compression}: success.")

    except Exception as e:
        print(f"{compression}: failed ({arr.compressor})- ({e}).")
NONE: success.
ZSTD: failed (None)- (
Items are not equal:
 ACTUAL: 466472691
 DESIRED: 1070122).
LZW: failed (None)- (
Items are not equal:
 ACTUAL: 282600147
 DESIRED: 1070122).
DEFLATE: failed (None)- (
Items are not equal:
 ACTUAL: 1692767126
 DESIRED: 1070122).
LERC: failed (None)- (
Items are not equal:
 ACTUAL: 1075052576
 DESIRED: 1070122).

The values it reads are much larger than 512*512*5 = 1310720, which is an upper bound on any value in the array. Also, for some reason the array does not seem to have a compressor set, although the kerchunk-generated reference files do mention a compressor. Any ideas what I am doing wrong?

".zarray": "{\n \"chunks\": [\n  256,\n  256,\n  5\n ],\n \"compressor\": {\n  \"id\": \"imagecodecs_lerc\"\n },\n \"dtype\": \"<u4\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  512,\n  512,\n  5\n ],\n \"zarr_format\": 2\n}",

cgohlke commented Mar 13, 2023

Using a different JSON file in each case, e.g. f"{compression}.json" instead of "data_zarr.json", works for me, except for ZSTD (a sketch of the change follows the output below):

NONE: success.
ZSTD: failed (Zstd(level=1))- (Zstd decompression error: invalid input data).
LZW: success.
DEFLATE: success.
LERC: success.
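The change is minimal; keeping the rest of the reproducer above intact, the relevant lines inside the loop become (a sketch, reusing the variable names from the original code):

    zarr_file = Path(f"{compression}.json").absolute()
    tiff_to_zarr(
        f"file://{tiff_file}",
        target=f"file://{zarr_file}",
        target_options=dict(mode="w"),
    )
    arr = zarr.open("reference://", storage_options=dict(fo=str(zarr_file)))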

Switching to imagecodecs_zstd instead of numcodecs zstd works, which indicates that the numcodecs zstd codec is broken.
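A sketch of what that switch amounts to on a single chunk (here tile stands for the raw bytes of one compressed chunk from the GeoTIFF; that name is just an assumption for illustration):

import numcodecs
from imagecodecs.numcodecs import register_codecs

register_codecs()  # make the "imagecodecs_*" codec ids known to numcodecs/zarr

codec = numcodecs.get_codec({"id": "imagecodecs_zstd"})
decoded = codec.decode(tile)  # works where numcodecs.Zstd().decode(tile) raises "invalid input data"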

croth1 (Author) commented Mar 13, 2023

Thanks! That already helped a lot! It seems that, for me, this is a combination of several issues. I now get:

NONE: success. (None)
ZSTD: failed (Zstd(level=1))- (Zstd decompression error: invalid input data).
LZW: failed (Zstd(level=1))- (codec not available: 'imagecodecs_lzw').
DEFLATE: success. (Zlib(level=1))
LERC: failed (Zlib(level=1))- (codec not available: 'imagecodecs_lerc').

Somehow the imagecodecs are not registered automatically in the version I got from conda-forge, or I am missing something - I will debug outside of this issue.

edit: this is the trick:

from imagecodecs.numcodecs import register_codecs
register_codecs()

ZSTD: failed (Zstd(level=1))- (Zstd decompression error: invalid input data).

What caught my eye here is the compression level; I think this should be 9, or am I wrong?

ZSTD_LEVEL=[1-22]: Set the level of compression when using ZSTD compression (or LERC_ZSTD). A value of 22 is best (very slow), and 1 is least compression. The default is 9.

source: https://gdal.org/drivers/raster/gtiff.html

It seems the compression level is not recorded in the json? 🤔

Then again, for DEFLATE it should have been 6, but Zlib(level=1) seems to have worked?

ZLEVEL=[1-9] or [1-12]: Set the level of compression when using DEFLATE compression (or LERC_DEFLATE). A value of 9 (resp. 12) is best/slowest when using zlib (resp. libdeflate), and 1 is least/fastest compression. The default is 6.

martindurant (Member) commented:

It seems the compression level is not recorded in the json? 🤔

The compression level is an option at compression time, but any stream should be decompressible without specifying it again (zstd must record whatever it needs in the compressed block header, or maybe it's something more subtle, like the size of the dictionary, that gets saved).
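A quick sketch of that point, using numcodecs' own zstd codec on data it produced itself (the level values are arbitrary examples):

import numpy as np
import numcodecs

data = np.arange(100, dtype="<u4").tobytes()
encoded = numcodecs.Zstd(level=9).encode(data)     # the level only matters when encoding
decoded = numcodecs.Zstd(level=1).decode(encoded)  # any level decodes the same stream
assert bytes(decoded) == data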

which indicates that the numcodecs zstd codec is broken.

@cgohlke , thanks for commenting here, and I didn't realise that imagecodecs included such non-image codecs :). Do you think you can build a reproducer of the numcodecs zstd issue?

@croth1 , it may be that your fsspec instances are being cached when you call it multiple times with identical arguments (i.e., the same reference file name, but with different content). You can always pass skip_instance_cache=True in storage_options when remaking instances in a single session to be sure.
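For example, applied to the reproducer above (a sketch; the only change is the extra storage option):

arr = zarr.open(
    "reference://",
    storage_options=dict(fo=str(zarr_file), skip_instance_cache=True),
)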


cgohlke commented Mar 13, 2023

Do you think you can build a reproducer of the numcodecs zstd issue?

I'll try.

Looks like ZSTD_getDecompressedSize fails for the input created by rasterio/gdal/libtiff. The function is deprecated and replaced by ZSTD_getFrameContentSize. Imagecodecs uses an arbitrary size as fallback, which might work in this case.

The open questions are: why does the function fail on this input? And since zarr knows the size of the decompressed chunks, why is that not passed on to the numcodecs codec, e.g. as an output buffer of the correct size?

martindurant (Member) commented:

I don't suppose it necessarily knows the output buffer size, since there may be another codec in the chain before forming the final array. I wonder, does cramjam's zstd decompressor work here? Or standard zstandard? I don't know why numcodecs needs to build directly against the zstd source when those options exist (a similar discussion is going on around blosc).


cgohlke commented Mar 13, 2023

zstd.decompress(tile) fails with zstd.Error: Input data invalid or missing content size in frame header.

cramjam.zstd.decompress(tile) succeeds. Cramjam just wraps the zstd crate's Decoder class, which uses streaming decompression.

GDAL/libtiff use the zstd streaming API and apparently do not write a header with content size.
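A small sketch of that distinction, using only the packages already mentioned in this thread (the one-byte payload mirrors the reproducer further down):

import numcodecs
import cramjam

data = bytes(1)

# a frame produced by numcodecs' one-shot encoder records the content size
# and round-trips through numcodecs' own decoder:
framed = numcodecs.Zstd().encode(data)
assert bytes(numcodecs.Zstd().decode(framed)) == data

# cramjam's streaming encoder does not record the content size; its own
# streaming decoder still handles the result, but numcodecs' decoder raises
# "invalid input data" on it:
streamed = bytes(cramjam.zstd.compress(data))
assert bytes(cramjam.zstd.decompress(streamed)) == data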

martindurant (Member) commented:

Ah, so this is a "framed" versus "unframed" thing. In parquet usage (which prompted cramjam), the buffer size is kept in other metadata.


cgohlke commented Mar 13, 2023

According to zstd.h, note 2 "decompressed size is an optional field, it may not be present, typically in streaming mode."


cgohlke commented Mar 15, 2023

Do you think you can build a reproducer of the numcodecs zstd issue?

>>> import numcodecs, cramjam
>>> numcodecs.Zstd().decode(cramjam.zstd.compress(bytes(1)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "numcodecs/zstd.pyx", line 219, in numcodecs.zstd.Zstd.decode
  File "numcodecs/zstd.pyx", line 153, in numcodecs.zstd.decompress
RuntimeError: Zstd decompression error: invalid input data

martindurant (Member) commented:

I have opened zarr-developers/numcodecs#424. Let's see if that resolves it; else we can replace zstd with imagecodecs_zstd for now (requiring an extra install).


cgohlke commented Mar 17, 2023

else we can replace zstd with imagecodecs_zstd for now (requiring an extra install).

Tifffile 2023.3.15 now writes fsspec reference files with imagecodecs_zstd instead of zstd.
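So the compressor entry in a newly generated .zarray should now look like this (illustrative, following the format of the LERC excerpt earlier in this thread):

"compressor": {
 "id": "imagecodecs_zstd"
}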

martindurant (Member) commented:

@croth1 , can you reinstall and give it a go?

croth1 (Author) commented Mar 18, 2023

Thanks a lot to both of you. With the latest tifffile and skip_instance_cache, all the compressions tested here work :)

martindurant (Member) commented:

Great! Closing this for now. If numcodecs fixes itself, we may revisit.
