
tiff_to_zarr for geotiff with compression: zarr reads strange values #317

Closed
croth1 opened this issue Mar 12, 2023 · 14 comments

Comments


croth1 commented Mar 12, 2023

imagecodecs               2023.1.23          py38h5408aff_0    conda-forge
zarr                      2.13.6               pyhd8ed1ab_0    conda-forge
kerchunk                  0.1.0+32.g39658f8          pypi_0    pypi

When I translate GeoTIFFs created by rasterio with tiff_to_zarr, it only seems to work for uncompressed files. As soon as I choose a compression, the values read back through zarr are off:

import warnings
from pathlib import Path

import numpy as np
import rasterio as rio
import zarr
from kerchunk.tiff import tiff_to_zarr

warnings.simplefilter("ignore")

tiff_file = Path("data.tif").absolute()

x, y, z = 512, 512, 5


opts = dict(tiled=True, blockxsize=256, blockysize=256, dtype=np.uint32)

compressions = ["NONE", "ZSTD", "LZW", "DEFLATE", "LERC"]
for compression in compressions:
    with rio.open(
        tiff_file,
        "w",
        driver="GTiff",
        width=x,
        height=y,
        count=z,
        compress=compression,
        **opts,
    ) as out:
        datablock = np.arange(x * y * z).reshape(z, x, y)
        out.write(datablock)

    # sanity check - reading the full block back into memory
    with rio.open(tiff_file) as tiff_in:
        block2 = tiff_in.read()
        np.testing.assert_array_equal(block2, datablock)

    try:
        zarr_file = Path("data_zarr.json").absolute()
        tiff_to_zarr(
            f"file://{tiff_file}",
            target=f"file://{zarr_file}",
            target_options=dict(mode="w"),
        )

        arr = zarr.open("reference://", storage_options=dict(fo="data_zarr.json"))
        np.testing.assert_equal(arr[42, 42, 4], datablock[4, 42, 42])
        print(f"{compression}: success.")

    except Exception as e:
        print(f"{compression}: failed ({arr.compressor})- ({e}).")
NONE: success.
ZSTD: failed (None)- (
Items are not equal:
 ACTUAL: 466472691
 DESIRED: 1070122).
LZW: failed (None)- (
Items are not equal:
 ACTUAL: 282600147
 DESIRED: 1070122).
DEFLATE: failed (None)- (
Items are not equal:
 ACTUAL: 1692767126
 DESIRED: 1070122).
LERC: failed (None)- (
Items are not equal:
 ACTUAL: 1075052576
 DESIRED: 1070122).

The values it reads are much larger than 512*512*5 = 1310720, which is an upper bound on any value in the array. Also, for some reason the array does not seem to have a compressor set, although the kerchunk-generated reference files do mention a compressor. Any ideas what I am doing wrong?

".zarray": "{\n \"chunks\": [\n  256,\n  256,\n  5\n ],\n \"compressor\": {\n  \"id\": \"imagecodecs_lerc\"\n },\n \"dtype\": \"<u4\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  512,\n  512,\n  5\n ],\n \"zarr_format\": 2\n}",

cgohlke commented Mar 13, 2023

Using a different JSON file in each case, e.g. f"{compression}.json" instead of "data_zarr.json", works for me, except for ZSTD (a sketch of the change follows the output below):

NONE: success.
ZSTD: failed (Zstd(level=1))- (Zstd decompression error: invalid input data).
LZW: success.
DEFLATE: success.
LERC: success.
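The change is minimal; keeping the rest of the reproducer above intact, the relevant lines inside the loop become (a sketch, reusing the variable names from the original code):

    zarr_file = Path(f"{compression}.json").absolute()
    tiff_to_zarr(
        f"file://{tiff_file}",
        target=f"file://{zarr_file}",
        target_options=dict(mode="w"),
    )
    arr = zarr.open("reference://", storage_options=dict(fo=str(zarr_file)))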

Switching to imagecodecs_zstd instead of numcodecs zstd works, which indicates that the numcodecs zstd codec is broken.
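A sketch of what that switch amounts to on a single chunk (here tile stands for the raw bytes of one compressed chunk from the GeoTIFF; that name is just an assumption for illustration):

import numcodecs
from imagecodecs.numcodecs import register_codecs

register_codecs()  # make the "imagecodecs_*" codec ids known to numcodecs/zarr

codec = numcodecs.get_codec({"id": "imagecodecs_zstd"})
decoded = codec.decode(tile)  # works where numcodecs.Zstd().decode(tile) raises "invalid input data"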

croth1 (Author) commented Mar 13, 2023

Thanks! That already helped a lot! It seems that, for me, this is a combination of several issues. I now get:

NONE: success. (None)
ZSTD: failed (Zstd(level=1))- (Zstd decompression error: invalid input data).
LZW: failed (Zstd(level=1))- (codec not available: 'imagecodecs_lzw').
DEFLATE: success. (Zlib(level=1))
LERC: failed (Zlib(level=1))- (codec not available: 'imagecodecs_lerc').

Somehow the imagecodecs are not registered automatically in the version I got from conda-forge, or I am missing something - I will debug outside of this issue.

edit: this is the trick:

from imagecodecs.numcodecs import register_codecs
register_codecs()

ZSTD: failed (Zstd(level=1))- (Zstd decompression error: invalid input data).

What caught my eye here is the compression level; I think this should be 9, or am I wrong?

ZSTD_LEVEL=[1-22]: Set the level of compression when using ZSTD compression (or LERC_ZSTD). A value of 22 is best (very slow), and 1 is least compression. The default is 9.

source: https://gdal.org/drivers/raster/gtiff.html

It seems the compression level is not recorded in the json? 🤔

Then again, for DEFLATE it should have been 6, but Zlib(level=1) seems to have worked?

ZLEVEL=[1-9] or [1-12]: Set the level of compression when using DEFLATE compression (or LERC_DEFLATE). A value of 9 (resp. 12) is best/slowest when using zlib (resp. libdeflate), and 1 is least/fastest compression. The default is 6.

martindurant (Member) commented:

It seems the compression level is not recorded in the json? 🤔

The compression level is an option at compression time, but any stream should be decompressible without specifying it again (zstd must record whatever it needs in the compressed block header, or maybe it's something more subtle, like the size of the dictionary, that gets saved).
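A quick sketch of that point, using numcodecs' own zstd codec on data it produced itself (the level values are arbitrary examples):

import numpy as np
import numcodecs

data = np.arange(100, dtype="<u4").tobytes()
encoded = numcodecs.Zstd(level=9).encode(data)     # the level only matters when encoding
decoded = numcodecs.Zstd(level=1).decode(encoded)  # any level decodes the same stream
assert bytes(decoded) == data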

which indicates that the numcodecs zstd codec is broken.

@cgohlke , thanks for commenting here, and I didn't realise that imagecodecs included such non-image codecs :). Do you think you can build a reproducer of the numcodecs zstd issue?

@croth1 , it may be that your fsspec instances are being cached when you call it multiple times with identical arguments (i.e., the same reference file name, but with different content). You can always pass skip_instance_cache=True in storage_options when remaking instances in a single session to be sure.
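For example, applied to the reproducer above (a sketch; the only change is the extra storage option):

arr = zarr.open(
    "reference://",
    storage_options=dict(fo=str(zarr_file), skip_instance_cache=True),
)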


cgohlke commented Mar 13, 2023

Do you think you can build a reproducer of the numcodecs zstd issue?

I'll try.

Looks like ZSTD_getDecompressedSize fails for the input created by rasterio/gdal/libtiff. The function is deprecated and replaced by ZSTD_getFrameContentSize. Imagecodecs uses an arbitrary size as fallback, which might work in this case.

The open questions are: why does the function fail on this input? And since zarr knows the size of the decompressed chunks, why is that not passed on to the numcodecs codec, e.g. as an output buffer of the correct size?

martindurant (Member) commented:

I don't suppose it necessarily knows the output buffer size, since there may be another codec in the chain before forming the final array. I wonder, does cramjam's zstd decompressor work here? Or standard zstandard? I don't know why numcodecs needs to build directly against the zstd source when those options exist (a similar discussion is going on around blosc).


cgohlke commented Mar 13, 2023

zstd.decompress(tile) fails with zstd.Error: Input data invalid or missing content size in frame header.

cramjam.zstd.decompress(tile) succeeds. Cramjam just wraps the zstd crate's Decoder class, which uses streaming decompression.

GDAL/libtiff use the zstd streaming API and apparently do not write a header with content size.
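A small sketch of that distinction, using only the packages already mentioned in this thread (the one-byte payload mirrors the reproducer further down):

import numcodecs
import cramjam

data = bytes(1)

# a frame produced by numcodecs' one-shot encoder records the content size
# and round-trips through numcodecs' own decoder:
framed = numcodecs.Zstd().encode(data)
assert bytes(numcodecs.Zstd().decode(framed)) == data

# cramjam's streaming encoder does not record the content size; its own
# streaming decoder still handles the result, but numcodecs' decoder raises
# "invalid input data" on it:
streamed = bytes(cramjam.zstd.compress(data))
assert bytes(cramjam.zstd.decompress(streamed)) == data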

martindurant (Member) commented:

Ah, so this is a "framed" versus "unframed" thing. In parquet usage (which prompted cramjam), the buffer size is kept in other metadata.


cgohlke commented Mar 13, 2023

According to zstd.h, note 2 "decompressed size is an optional field, it may not be present, typically in streaming mode."


cgohlke commented Mar 15, 2023

Do you think you can build a reproducer of the numcodecs zstd issue?

>>> import numcodecs, cramjam
>>> numcodecs.Zstd().decode(cramjam.zstd.compress(bytes(1)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "numcodecs/zstd.pyx", line 219, in numcodecs.zstd.Zstd.decode
  File "numcodecs/zstd.pyx", line 153, in numcodecs.zstd.decompress
RuntimeError: Zstd decompression error: invalid input data

martindurant (Member) commented:

I have opened zarr-developers/numcodecs#424. Let's see if that resolves it; else we can replace zstd with imagecodecs_zstd for now (requiring an extra install).


cgohlke commented Mar 17, 2023

else we can replace zstd with imagecodecs_zstd for now (requiring an extra install).

Tifffile 2023.3.15 now writes fsspec reference files with imagecodecs_zstd instead of zstd.
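So the compressor entry in a newly generated .zarray should now look like this (illustrative, following the format of the LERC excerpt earlier in this thread):

"compressor": {
 "id": "imagecodecs_zstd"
}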

martindurant (Member) commented:

@croth1 , can you reinstall and give it a go?

croth1 (Author) commented Mar 18, 2023

Thanks a lot to both of you. With the latest tifffile and skip_instance_cache, all the compressions tested here work :)

martindurant (Member) commented:

Great! Closing this for now. If numcodecs fixes itself, we may revisit.
