
Common writer for plink and ICF #339


Merged · 1 commit · Apr 8, 2025

Conversation

@benjeffery (Contributor)

This WIP PR attempts to use a single schema and writer class for both ICF and plink encoding. The tests are almost completely unchanged.

This needs some polish - for example, I've only adapted the Writer.encode_X methods needed for plink. They should be made consistent, and ICF-specific logic moved out of the common writer.

Plink conversion is also missing the indexing.

@coveralls (Collaborator)

coveralls commented Mar 31, 2025

Coverage Status

coverage: 98.765% (-0.1%) from 98.867% when pulling f29b5a0 on benjeffery:common-writer into 3dca18e on sgkit-dev:main.

@benjeffery force-pushed the common-writer branch 2 times, most recently from 69710a9 to 9655012, on April 3, 2025 12:10
@benjeffery marked this pull request as ready for review on April 3, 2025 12:11
@benjeffery (Contributor Author)

I think this is a good point to draw the line. Would appreciate a review here @jeromekelleher

@benjeffery mentioned this pull request on Apr 3, 2025
@jeromekelleher (Contributor) left a comment

Looks great! I'm not sure I follow the division of labour, though.

How about the following structure:

  • bio2zarr/vcz.py. This contains the definition of a VcfZarrSchema and the VcfZarrWriter class. It does not know about icf, plink or any other format.

ICF and PlinkFormat then each have a generate_schema method. So, icf.py and plink.py will both need to import vcz.py, but that makes sense, as that is the format they are targeting.

Does that make sense?
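The proposed layout could be sketched roughly as follows. This is only an illustration: VcfZarrSchema, VcfZarrWriter, PlinkFormat, and generate_schema are the names from the comment above, while the chunk-size defaults and field names are made up for the example.

```python
from dataclasses import dataclass, field

# vcz.py: format-agnostic schema and writer, knowing nothing about
# ICF, plink, or any other source format.
@dataclass
class VcfZarrSchema:
    samples_chunk_size: int = 10_000   # illustrative default
    variants_chunk_size: int = 1_000   # illustrative default
    fields: list = field(default_factory=list)

class VcfZarrWriter:
    def __init__(self, source, schema):
        self.source = source   # any object the writer can pull data from
        self.schema = schema

# plink.py (and similarly icf.py): each format imports vcz and knows
# how to describe itself as a VCZ schema.
class PlinkFormat:
    def generate_schema(self, variants_chunk_size=None, samples_chunk_size=None):
        return VcfZarrSchema(fields=["variant_position", "call_genotype"])

schema = PlinkFormat().generate_schema()
writer = VcfZarrWriter(PlinkFormat(), schema)
```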

self.root_attrs = {}

def iter_alleles(self, start, stop, num_alleles):
ref_field = self.bed.allele_1
Contributor:

Are you sure this is correct? I thought allele1 was the minor allele/ALT (usually) in BED? I'm constantly confused by this!

Contributor Author:

It seems there is no fixed convention, so we'll have to allow this to be user-configurable?
From reading https://www.cog-genomics.org/plink/1.9/data, by default PLINK often assigns the major allele to A2 and the minor allele to A1, but this can be overridden with flags like --keep-allele-order, --real-ref-alleles, --a1-allele, or --a2-allele.
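As a rough illustration of why the A1/A2 convention matters for the genotype encoding (this is not the PR's actual code; counts_to_gt is a hypothetical helper, and -127 follows the int8 missing-value sentinel seen in the diff):

```python
import numpy as np

def counts_to_gt(values, alt_count=True):
    """Turn PLINK-style allele counts into diploid genotype pairs.

    values: int array of allele counts per sample, -127 meaning missing.
    alt_count: whether the counts refer to the ALT allele; if the file
    was read with the opposite convention, hom-REF and hom-ALT swap.
    """
    gt = np.zeros((len(values), 2), dtype=np.int8)
    gt[values == 1] = [1, 0]                          # heterozygous
    gt[values == 2] = [1, 1] if alt_count else [0, 0]
    gt[values == 0] = [0, 0] if alt_count else [1, 1]
    gt[values == -127] = -1                           # missing call
    return gt

vals = np.array([0, 1, 2, -127], dtype=np.int8)
counts_to_gt(vals)                    # counts taken as ALT-allele counts
counts_to_gt(vals, alt_count=False)   # same data, counts taken as REF-allele counts
```

The two calls disagree exactly on the homozygous calls, which is the confusion being discussed in this thread.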

gt_mask.flush()
logger.debug(f"GT slice {start}:{stop} done")
gt[values == -127] = -1 # Missing values
gt[values == 0] = [1, 1] # Homozygous ALT (2 in PLINK)
Contributor:

Yep, not sure about this! See above about the alleles.


if local_alleles:
array_specs = convert_local_allele_field_types(array_specs)
def generate_schema(
Contributor:

Shouldn't this just be a method of the ICF/Plink formats? The vcz module shouldn't be creating schemas I think.

j = gt_phased.next_buffer_row()
icf.sanitise_value_int_1d(
gt_phased.buff, j, value[:, -1] if value is not None else None
array_specs = [
Contributor:

This definitely seems like it should be a method of the ICF

@@ -1248,7 +251,7 @@ def mkschema(


def encode(
Contributor:

These high-level methods are breaking the clarity of the division of labour here; it is probably too much to move them out in this PR, but we should look at moving them somewhere else. The current modules should just be concerned with writing VCZ efficiently, and not know anything about where that data comes from.

@@ -0,0 +1,813 @@
import dataclasses
Contributor:

I'm not sure I see the value of this module - why not keep the VCZ writing logic in vcz.py (which we could move to root of the package hierarchy)?

@@ -109,6 +109,8 @@ def test_float_info_fields(self, ds):
dtype=np.float32,
)
values = ds["variant_AF"].values
print(values)
Contributor:

stray print

@benjeffery (Contributor Author)

Yes, I meant to do a re-org and forgot to.
I've made the schema generators member functions of their sources, and removed vcz.py, with the encode_* methods going into icf.py.
I'd like to keep schema.py and writer.py separate - they hold most of what was in vcz.py, but now at the top level.

Will double check the plink details next.

@jeromekelleher (Contributor) left a comment

Generally looks great, I think we're nearly there.

Regarding the files, I think you're missing the point that bio2zarr may contain writers for multiple formats, like, say, BED files (#281). So, really, what you would want to have is bio2zarr/vcz_writer.py and bio2zarr/vcz_schema.py. That has much less clarity to me than bio2zarr/vcz.py as the single source of what you need to write a VCZ file. That feels to me like a long-term stable API: bio2zarr.vcz.Schema to define the structure and bio2zarr.vcz.Writer to do the writing.

You could define a vcz package, but since the schema is only 200 lines, what's the point? I feel like we could drop the vcf2zarr package also now, as it's not doing very much.

yield alleles

def iter_field(self, field_name, shape, start, stop):
data = {
Contributor:

Is this a roundabout way of saying

assert field_name == "position"
yield from self.bed.bp_position[start: stop]

Contributor Author:

Yes, when I wrote this I thought there might be other fields that could be added.

Contributor:

Let's switch to the obvious version, then - it took me a good 30 seconds to figure out what was happening here.

m = self.bed.sid_count
logging.info(f"Scanned plink with {n} samples and {m} variants")

# FIXME
Contributor:

Can leave these unset and pass them to the VcfZarrSchema constructor, so defaults are set in one place.

Contributor Author:

Fixed in 66a86fc

from .. import constants, core, provenance, vcf_utils
from bio2zarr import schema

from .. import constants, core, provenance, vcf_utils, writer
Contributor:

might as well import schema from ".." as well, right?

Contributor Author:

Fixed in 66a86fc

def generate_schema(
self, variants_chunk_size=None, samples_chunk_size=None, local_alleles=None
):
# Import schema here to avoid circular import
Contributor:

Imported at the top

Contributor Author:

Fixed in 66a86fc


m = self.num_records
n = self.num_samples
if samples_chunk_size is None:
Contributor:

Same point about default chunk sizes here - let it be set in the Schema constructor

Contributor Author:

Fixed in 66a86fc

@@ -615,13 +160,14 @@ class VcfZarrWriteSummary(core.JsonDataclass):


class VcfZarrWriter:
def __init__(self, path):
def __init__(self, source_type, path):
Contributor:

Why does the writer need to know the class of the source ahead of init?

Contributor Author:

load_metadata needs it, and load_metadata can be called (e.g. by encode_partition) without __init__ having been called.
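A minimal sketch of the constraint being described. Everything here is hypothetical (PlinkSource, the method bodies, the lazy rebuild); only the idea that load_metadata needs the source class before a source instance exists comes from the discussion.

```python
class PlinkSource:
    def __init__(self, path):
        self.path = path

class Writer:
    def __init__(self, source_type, path):
        # The class of the source is needed up front, even though no
        # source instance exists yet.
        self.source_type = source_type
        self.path = path
        self.source = None

    def load_metadata(self):
        # The real code would read saved metadata from disk here; the key
        # point is that rebuilding the source requires knowing its class.
        self.source = self.source_type(self.path)

    def encode_partition(self, partition_index):
        # May run in a fresh worker process: build state lazily.
        if self.source is None:
            self.load_metadata()
        return type(self.source).__name__, partition_index

writer = Writer(PlinkSource, "example.plink")
writer.encode_partition(0)  # source is constructed lazily via load_metadata
```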

@benjeffery force-pushed the common-writer branch 2 times, most recently from 0207d21 to 669fdf3, on April 5, 2025 07:13
@benjeffery (Contributor Author)

You're right, I was missing that point!
I've combined the writer and schema into vcz.py and removed the vcf2zarr module.

@jeromekelleher (Contributor)

This looks great, a really big step forward in terms of clarity and extensibility.

I think it's ready to merge, but I'm still uneasy about flipping the A1/A2 alleles in plink. Unless there's a good reason (beyond it being less confusing) I think we should keep it the way it is, as the rules are quite complicated and there could easily be untested corner cases that the new behaviour differs on. I took the original code from sgkit, and I think it was pretty well used there, so it should be reasonably correct.

@jeromekelleher (Contributor) left a comment

Looks good to me! I'm still a bit iffy about changing the plink code, but I guess if the tests are passing it's probably OK. Ready for a squash and merge I think.

yield alleles

def iter_field(self, field_name, shape, start, stop):
data = {
Contributor:

Let's switch to the obvious version, then - it took me a good 30 seconds to figure out what was happening here.

@benjeffery (Contributor Author)

benjeffery commented Apr 7, 2025

I've changed the plink code to follow the exact original logic in b9494a5 - the difference was that the new code was not specifying count_A1=False when opening the bed file. I've added that and reverted the genotype-encoding logic to the original. Note that the original logic had allele_1 as the REF allele, so that is still the case here.
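To illustrate what flipping count_A1 changes (a hypothetical numpy sketch, not the PR's code): for non-missing calls, A1 counts and A2 counts are complements of each other, so reading with the wrong setting silently swaps homozygous-REF and homozygous-ALT while leaving heterozygotes untouched.

```python
import numpy as np

# Hypothetical illustration of the count_A1 flip, with -127 as the
# missing-call sentinel (as in the genotype diff above).
a1_counts = np.array([0, 1, 2, -127], dtype=int)
a2_counts = np.where(a1_counts == -127, -127, 2 - a1_counts)
# a1_counts -> [0, 1, 2, -127]
# a2_counts -> [2, 1, 0, -127]
```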

@jeromekelleher (Contributor)

Hmm, looks like something funny is happening with numcodecs exact sizes again. Did we just get a release or something?

@benjeffery (Contributor Author)

> Hmm, looks like something funny is happening with numcodecs exact sizes again. Did we just get a release or something?

Yep about an hour ago. I've filed #347

@benjeffery force-pushed the common-writer branch 2 times, most recently from f1790a1 to 6714853, on April 8, 2025 09:08
@benjeffery (Contributor Author)

Looks like PyPI is having issues: https://x.com/TadejKrevh/status/1909532147846688906

Will merge later.

@benjeffery merged commit eed60f0 into sgkit-dev:main on Apr 8, 2025 (20 of 40 checks passed)
@benjeffery deleted the common-writer branch on April 8, 2025 23:43