
vcf_to_zarr creates zero-sized first chunk which results in incorrect dtype. #1090

@benjeffery

@tnguyengel has hit the following error while running vcf_to_zarr with the default arguments:

  File "/home/tnguyen/conda/sgkit_main/lib/python3.10/site-packages/zarr/core.py", line 2168, in _process_for_setitem
    chunk = value.astype(self._dtype, order=self._order, copy=False)
ValueError: could not convert string to float: 'A'

This is because concat_zarrs_optimized uses dtype=float64 when it concatenates and converts the variant_allele array.
That in turn is because the first temp zarr chunk has a variant_allele dtype of float64,
which is because the first temp zarr chunk is zero-sized.
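
The chain above can be reproduced in isolation with NumPy. This is only a minimal sketch, not sgkit's code: it assumes the zero-sized chunk was created without an explicit dtype (so NumPy fell back to float64), and the names `empty_part` and `alleles` are made up for illustration.

```python
import numpy as np

# Assumption: the zero-sized part was created without an explicit dtype.
# NumPy then falls back to its default, float64.
empty_part = np.array([])
print(empty_part.dtype)

# A later, non-empty part holding allele strings (hypothetical data).
alleles = np.array([["A", "T"], ["C", "G"]], dtype="O")

# Casting the strings with the dtype taken from the empty part fails,
# in the same way as the astype call in the traceback.
try:
    alleles.astype(empty_part.dtype)
    cast_error = None
except ValueError as exc:
    cast_error = exc

print(type(cast_error).__name__)
```

If the dtype had been taken from `alleles` instead, the cast would be a no-op and no error would be raised.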

I assume this is because the target_chunk_size default of 20M is smaller than the VCF header, leading to no sites being in the first chunk. I have asked her to try a larger target_chunk_size as a workaround, and will work on a proper fix.
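
One possible direction for a fix, sketched under assumptions, is to take the dtype from the first part that actually contains data rather than from the first part unconditionally. The helper `first_nonempty_dtype` below is hypothetical and is not part of sgkit or zarr; it works on plain NumPy arrays to keep the example self-contained.

```python
import numpy as np


def first_nonempty_dtype(parts):
    """Return the dtype of the first part with at least one element.

    Hypothetical helper: falls back to the first part's dtype when every
    part is empty, and raises if there are no parts at all.
    """
    if not parts:
        raise ValueError("no parts to inspect")
    for part in parts:
        if part.size > 0:
            return part.dtype
    return parts[0].dtype


# First part is zero-sized (float64 by default), later parts hold strings.
parts = [
    np.array([]),
    np.array([["A", "T"]], dtype="O"),
    np.array([["C", "G"]], dtype="O"),
]

dtype = first_nonempty_dtype(parts)

# Skip empty parts when concatenating so their shape and dtype cannot
# influence the result.
nonempty = [p for p in parts if p.size > 0]
combined = np.concatenate(nonempty).astype(dtype)
print(dtype, combined.tolist())
```

Whether skipping empty parts is appropriate for the real temp zarr stores depends on how they are written, so treat this as an illustration of the idea only.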

Labels: bug, upstream
