
Common writer for plink and ICF #339


Merged · 1 commit · Apr 8, 2025

Conversation

@benjeffery (Contributor)

This WIP PR attempts to use a single schema and writer class for both ICF and plink encoding. The tests are almost completely unchanged.

This needs some polish - for example, I've only adapted the Writer.encode_X methods needed for plink. They should be made consistent, and ICF-specific logic moved out of the common writer.

Plink conversion is also missing the indexing.

@coveralls (Collaborator)

coveralls commented Mar 31, 2025

Coverage Status

coverage: 98.765% (-0.1%) from 98.867% when pulling f29b5a0 on benjeffery:common-writer into 3dca18e on sgkit-dev:main.

@benjeffery force-pushed the common-writer branch 2 times, most recently from 69710a9 to 9655012, on April 3, 2025 12:10
@benjeffery marked this pull request as ready for review on April 3, 2025 12:11
@benjeffery (Contributor Author)

I think this is a good point to draw the line. Would appreciate a review here @jeromekelleher

@benjeffery mentioned this pull request on Apr 3, 2025
@jeromekelleher (Contributor) left a comment

Looks great! I'm not sure I follow the division of labour, though.

How about the following structure:

  • bio2zarr/vcz.py. This contains the definition of a VcfZarrSchema and the VcfZarrWriter class. It does not know about icf, plink or any other format.

ICF and PlinkFormat then each have a generate_schema method. So, icf.py and plink.py will both need to import vcz.py, but that makes sense, as that is the format they are targeting.

Does that make sense?
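The proposed layout could be sketched roughly as follows. This is only an illustration: VcfZarrSchema, VcfZarrWriter, PlinkFormat, and generate_schema are the names from the comment above, while the chunk-size defaults and field names are made up for the example.

```python
from dataclasses import dataclass, field

# vcz.py: format-agnostic schema and writer, knowing nothing about
# ICF, plink, or any other source format.
@dataclass
class VcfZarrSchema:
    samples_chunk_size: int = 10_000   # illustrative default
    variants_chunk_size: int = 1_000   # illustrative default
    fields: list = field(default_factory=list)

class VcfZarrWriter:
    def __init__(self, source, schema):
        self.source = source   # any object the writer can pull data from
        self.schema = schema

# plink.py (and similarly icf.py): each format imports vcz and knows
# how to describe itself as a VCZ schema.
class PlinkFormat:
    def generate_schema(self, variants_chunk_size=None, samples_chunk_size=None):
        return VcfZarrSchema(fields=["variant_position", "call_genotype"])

schema = PlinkFormat().generate_schema()
writer = VcfZarrWriter(PlinkFormat(), schema)
```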

self.root_attrs = {}

def iter_alleles(self, start, stop, num_alleles):
ref_field = self.bed.allele_1
Contributor:

Are you sure this is correct? I thought allele1 was the minor allele/ALT (usually) in BED? I'm constantly confused by this!

Contributor Author:

It seems there is no fixed convention, so we'll have to allow this to be user-configurable?
From reading https://www.cog-genomics.org/plink/1.9/data, by default PLINK often assigns the major allele to A2 and the minor allele to A1, but this can be overridden with flags like --keep-allele-order, --real-ref-alleles, --a1-allele, or --a2-allele.
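As a rough illustration of why the A1/A2 convention matters for the genotype encoding (this is not the PR's actual code; counts_to_gt is a hypothetical helper, and -127 follows the int8 missing-value sentinel seen in the diff):

```python
import numpy as np

def counts_to_gt(values, alt_count=True):
    """Turn PLINK-style allele counts into diploid genotype pairs.

    values: int array of allele counts per sample, -127 meaning missing.
    alt_count: whether the counts refer to the ALT allele; if the file
    was read with the opposite convention, hom-REF and hom-ALT swap.
    """
    gt = np.zeros((len(values), 2), dtype=np.int8)
    gt[values == 1] = [1, 0]                          # heterozygous
    gt[values == 2] = [1, 1] if alt_count else [0, 0]
    gt[values == 0] = [0, 0] if alt_count else [1, 1]
    gt[values == -127] = -1                           # missing call
    return gt

vals = np.array([0, 1, 2, -127], dtype=np.int8)
counts_to_gt(vals)                    # counts taken as ALT-allele counts
counts_to_gt(vals, alt_count=False)   # same data, counts taken as REF-allele counts
```

The two calls disagree exactly on the homozygous calls, which is the confusion being discussed in this thread.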

gt_mask.flush()
logger.debug(f"GT slice {start}:{stop} done")
gt[values == -127] = -1 # Missing values
gt[values == 0] = [1, 1] # Homozygous ALT (2 in PLINK)
Contributor:

Yep, not sure about this! See above about the alleles.


if local_alleles:
array_specs = convert_local_allele_field_types(array_specs)
def generate_schema(
Contributor:

Shouldn't this just be a method of the ICF/Plink formats? The vcz module shouldn't be creating schemas I think.

j = gt_phased.next_buffer_row()
icf.sanitise_value_int_1d(
gt_phased.buff, j, value[:, -1] if value is not None else None
array_specs = [
Contributor:

This definitely seems like it should be a method of the ICF

@@ -1248,7 +251,7 @@ def mkschema(


def encode(
Contributor:

These high-level methods are breaking the clarity of the division of labour here; it is probably too much to move them out in this PR, but we should look at moving them somewhere else. The current modules should just be concerned with writing VCZ efficiently, and not know anything about where that data comes from.

@@ -0,0 +1,813 @@
import dataclasses
Contributor:

I'm not sure I see the value of this module - why not keep the VCZ writing logic in vcz.py (which we could move to root of the package hierarchy)?

@@ -109,6 +109,8 @@ def test_float_info_fields(self, ds):
dtype=np.float32,
)
values = ds["variant_AF"].values
print(values)
Contributor:

stray print

@benjeffery (Contributor Author)

Yes, I meant to do a re-org and forgot to.
I've made the schema generators member functions of their sources, and removed vcz.py, with the encode_* methods going into icf.py.
I'd like to keep schema.py and writer.py separate - they hold most of what was in vcz.py, but now at the top level.

Will double check the plink details next.

@jeromekelleher (Contributor) left a comment

Generally looks great, I think we're nearly there.

Regarding the files, I think you're missing the point that bio2zarr may contain writers for multiple formats, like, say, BED files (#281). So, really, what you would want to have is bio2zarr/vcz_writer.py and bio2zarr/vcz_schema.py. That has much less clarity to me than bio2zarr/vcz.py as the single source of what you need to write a VCZ file. That feels to me like a long-term stable API: bio2zarr.vcz.Schema to define the structure and bio2zarr.vcz.Writer to do the writing.

You could define a vcz package, but since the schema is only 200 lines, what's the point? I feel like we could drop the vcf2zarr package also now, as it's not doing very much.

yield alleles

def iter_field(self, field_name, shape, start, stop):
data = {
Contributor:

Is this a roundabout way of saying

assert field_name == "position"
yield from self.bed.bp_position[start: stop]

Contributor Author:

Yes, when I wrote this I thought there might be other fields that could be added.

Contributor:

Let's switch to the obvious version, then - it took me a good 30 seconds to figure out what was happening here.

m = self.bed.sid_count
logging.info(f"Scanned plink with {n} samples and {m} variants")

# FIXME
Contributor:

Can leave these unset and pass them to the VcfZarrSchema constructor, so defaults are set in one place.

Contributor Author:

Fixed in 66a86fc

from .. import constants, core, provenance, vcf_utils
from bio2zarr import schema

from .. import constants, core, provenance, vcf_utils, writer
Contributor:

might as well import schema from ".." as well, right?

Contributor Author:

Fixed in 66a86fc

def generate_schema(
self, variants_chunk_size=None, samples_chunk_size=None, local_alleles=None
):
# Import schema here to avoid circular import
Contributor:

Imported at the top

Contributor Author:

Fixed in 66a86fc


m = self.num_records
n = self.num_samples
if samples_chunk_size is None:
Contributor:

Same point about default chunk sizes here - let it be set in the Schema constructor

Contributor Author:

Fixed in 66a86fc

@@ -615,13 +160,14 @@ class VcfZarrWriteSummary(core.JsonDataclass):


class VcfZarrWriter:
def __init__(self, path):
def __init__(self, source_type, path):
Contributor:

Why does the writer need to know the class of the source ahead of init?

Contributor Author:

load_metadata needs it, and load_metadata can be called (e.g. by encode_partition) without __init__ having been called.
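A minimal sketch of the constraint being described. Everything here is hypothetical (PlinkSource, the method bodies, the lazy rebuild); only the idea that load_metadata needs the source class before a source instance exists comes from the discussion.

```python
class PlinkSource:
    def __init__(self, path):
        self.path = path

class Writer:
    def __init__(self, source_type, path):
        # The class of the source is needed up front, even though no
        # source instance exists yet.
        self.source_type = source_type
        self.path = path
        self.source = None

    def load_metadata(self):
        # The real code would read saved metadata from disk here; the key
        # point is that rebuilding the source requires knowing its class.
        self.source = self.source_type(self.path)

    def encode_partition(self, partition_index):
        # May run in a fresh worker process: build state lazily.
        if self.source is None:
            self.load_metadata()
        return type(self.source).__name__, partition_index

writer = Writer(PlinkSource, "example.plink")
writer.encode_partition(0)  # source is constructed lazily via load_metadata
```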

@benjeffery force-pushed the common-writer branch 2 times, most recently from 0207d21 to 669fdf3, on April 5, 2025 07:13
@benjeffery (Contributor Author)

You're right, I was missing that point!
I've combined the writer and schema into vcz.py and removed the vcf2zarr module.

@jeromekelleher (Contributor)

This looks great, a really big step forward in terms of clarity and extensibility.

I think it's ready to merge, but I'm still uneasy about flipping the A1/A2 alleles in plink. Unless there's a good reason (beyond it being less confusing) I think we should keep it the way it is, as the rules are quite complicated and there could easily be untested corner cases that the new behaviour differs on. I took the original code from sgkit, and I think it was pretty well used there, so it should be reasonably correct.

@jeromekelleher (Contributor) left a comment

Looks good to me! I'm still a bit iffy about changing the plink code, but I guess if the tests are passing it's probably OK. Ready for a squash and merge I think.

yield alleles

def iter_field(self, field_name, shape, start, stop):
data = {
Contributor:

Let's switch to the obvious version, then - it took me a good 30 seconds to figure out what was happening here.

@benjeffery (Contributor Author)

benjeffery commented Apr 7, 2025

I've changed the plink code to follow the exact original logic in b9494a5 - the difference was that the new code was not specifying count_A1=False when opening the bed file. I've added that and reverted the genotype-encoding logic to the original. Note that the original logic had allele_1 as the REF allele, so that is still the case here.
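To illustrate what flipping count_A1 changes (a hypothetical numpy sketch, not the PR's code): for non-missing calls, A1 counts and A2 counts are complements of each other, so reading with the wrong setting silently swaps homozygous-REF and homozygous-ALT while leaving heterozygotes untouched.

```python
import numpy as np

# Hypothetical illustration of the count_A1 flip, with -127 as the
# missing-call sentinel (as in the genotype diff above).
a1_counts = np.array([0, 1, 2, -127], dtype=int)
a2_counts = np.where(a1_counts == -127, -127, 2 - a1_counts)
# a1_counts -> [0, 1, 2, -127]
# a2_counts -> [2, 1, 0, -127]
```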

@jeromekelleher (Contributor)

Hmm, looks like something funny is happening with numcodecs exact sizes again. Did we just get a release or something?

@benjeffery (Contributor Author)

> Hmm, looks like something funny is happening with numcodecs exact sizes again. Did we just get a release or something?

Yep about an hour ago. I've filed #347

@benjeffery force-pushed the common-writer branch 2 times, most recently from f1790a1 to 6714853, on April 8, 2025 09:08
@benjeffery (Contributor Author)

Looks like PyPI is having issues: https://x.com/TadejKrevh/status/1909532147846688906

Will merge later.

@benjeffery merged commit eed60f0 into sgkit-dev:main on Apr 8, 2025 (20 of 40 checks passed)
@benjeffery deleted the common-writer branch on April 8, 2025 23:43