-
Notifications
You must be signed in to change notification settings - Fork 9
Common writer for plink and ICF #339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
69710a9
to
9655012
Compare
I think this is a good point to draw the line. Would appreciate a review here @jeromekelleher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I'm not sure I follow the divison of labour, though.
How about the following structure:
- bio2zarr/vcz.py. This contains the definition of a VcfZarrSchema and the VcfZarrWriter class. It does not know about icf, plink or any other format.
ICF anda PlinkFormat then have a generate_schema
method. So, icf.py and plink.py will both need to import vcz.py, but that makes sense as that is the format that they are targeting.
Does that makes sense?
self.root_attrs = {} | ||
|
||
def iter_alleles(self, start, stop, num_alleles): | ||
ref_field = self.bed.allele_1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this is correct? I thought allele1 was the minor allele/ALT (usually) in BED? I'm constantly confused by this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems there is no fixed convention and we'll have to allow this to be user configurable?
From reading https://www.cog-genomics.org/plink/1.9/data it seems by default, PLINK often assigns the major allele as A2 and the minor allele as A1, but this can be overridden with flags like --keep-allele-order, --real-ref-alleles, --a1-allele, or --a2-allele.
bio2zarr/plink.py
Outdated
gt_mask.flush() | ||
logger.debug(f"GT slice {start}:{stop} done") | ||
gt[values == -127] = -1 # Missing values | ||
gt[values == 0] = [1, 1] # Homozygous ALT (2 in PLINK) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, not sure about this! See above about the alleles.
bio2zarr/vcf2zarr/vcz.py
Outdated
|
||
if local_alleles: | ||
array_specs = convert_local_allele_field_types(array_specs) | ||
def generate_schema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this just be a method of the ICF/Plink formats? The vcz module shouldn't be creating schemas I think.
bio2zarr/vcf2zarr/vcz.py
Outdated
j = gt_phased.next_buffer_row() | ||
icf.sanitise_value_int_1d( | ||
gt_phased.buff, j, value[:, -1] if value is not None else None | ||
array_specs = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definitely seems like it should be a method of the ICF
bio2zarr/vcf2zarr/vcz.py
Outdated
@@ -1248,7 +251,7 @@ def mkschema( | |||
|
|||
|
|||
def encode( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These high-level methods are breaking the clarity of the division of labour here; it is probably too much to move them out in this PR, but we should look at moving them somewhere else. The current modules should just be concerned with writing VCZ efficiently, and not know any thing about where that data comes from.
bio2zarr/writer.py
Outdated
@@ -0,0 +1,813 @@ | |||
import dataclasses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I see the value of this module - why not keep the VCZ writing logic in vcz.py (which we could move to root of the package hierarchy)?
tests/test_vcf_examples.py
Outdated
@@ -109,6 +109,8 @@ def test_float_info_fields(self, ds): | |||
dtype=np.float32, | |||
) | |||
values = ds["variant_AF"].values | |||
print(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stray print
Yes, I meant to have a re-org and forgot to. Will double check the plink details next. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks great, I think we're nearly there.
Regarding the files, I think you're missing the point that bio2zarr may contain writers for multiple formats, like say, BED files (#281). So, really, what you would want to have is bio2zarr/vcz_writer.py
and bio2zarr/vcz_schema.py
. This has much less clarity to me that bio2zarr/vcz.py
as the source of what you need to write a VCZ file. That feels to me like a long-term stable API: bio2zarr.vcz.Schema
to define the structure and bio2zarr.vcz.Writer
to do the writing.
You could define a vcz package, but since the schema is only 200 lines, what's the point? I feel like we could drop the vcf2zarr
package also now, as it's not doing very much.
bio2zarr/plink.py
Outdated
yield alleles | ||
|
||
def iter_field(self, field_name, shape, start, stop): | ||
data = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a roundabout way of saying
assert field_name == "position"
yield from self.bed.bp_position[start: stop]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, when I wrote this I thought there might be other fields that could be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's switch to the obvious version so - it took me a good 30 seconds to figure out what was happening here.
bio2zarr/plink.py
Outdated
m = self.bed.sid_count | ||
logging.info(f"Scanned plink with {n} samples and {m} variants") | ||
|
||
# FIXME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can leave these unset and pass to VcfZarrSchema contructor to set in one place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 66a86fc
bio2zarr/vcf2zarr/icf.py
Outdated
from .. import constants, core, provenance, vcf_utils | ||
from bio2zarr import schema | ||
|
||
from .. import constants, core, provenance, vcf_utils, writer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might as well import schema from ".." as well, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 66a86fc
bio2zarr/vcf2zarr/icf.py
Outdated
def generate_schema( | ||
self, variants_chunk_size=None, samples_chunk_size=None, local_alleles=None | ||
): | ||
# Import schema here to avoid circular import |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imported at the top
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 66a86fc
bio2zarr/vcf2zarr/icf.py
Outdated
|
||
m = self.num_records | ||
n = self.num_samples | ||
if samples_chunk_size is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same point about default chunk sizes here - let it be set in the Schema constructor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 66a86fc
bio2zarr/writer.py
Outdated
@@ -615,13 +160,14 @@ class VcfZarrWriteSummary(core.JsonDataclass): | |||
|
|||
|
|||
class VcfZarrWriter: | |||
def __init__(self, path): | |||
def __init__(self, source_type, path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the writer need to know the class of the source ahead of init?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
load_metadata
needs it, which can be called by e.g. encode_parition
without init
being called.
0207d21
to
669fdf3
Compare
You're right, I was missing that point! |
This looks great, a really big step forward in terms of clarity and extensibility. I think it's ready to merge, but I'm still uneasy about flipping the A1/A2 alleles in plink. Unless there's a good reason (beyond it being less confusing) I think we should keep it the way it is, as the rules are quite complicated and there could easily be untested corner cases that the new behaviour differs on. I took the original code from sgkit, and I think it was pretty well used there, so it should be reasonably correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! I'm still a bit iffy about changing the plink code, but I guess if the tests are passing it's probably OK. Ready for a squash and merge I think.
bio2zarr/plink.py
Outdated
yield alleles | ||
|
||
def iter_field(self, field_name, shape, start, stop): | ||
data = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's switch to the obvious version so - it took me a good 30 seconds to figure out what was happening here.
I've change the plink code to follow the exact original logic in b9494a5 - the difference was that the new code was not specifying |
Hmm, looks like something funny is happening with numcodecs exact sizes again. Did we just get a release or something? |
Yep about an hour ago. I've filed #347 |
f1790a1
to
6714853
Compare
6714853
to
f29b5a0
Compare
Looks like PyPI is having issues: https://x.com/TadejKrevh/status/1909532147846688906 Will merge later. |
This WIP PR attempts to use a single schema and writer class for both ICF and plink encoding. The tests are almost completely unchanged.
This needs some polish - for example I've only adapted the
Writer.encode_X
methods needed for plink. They should be consistent and ICF specific stuff moved out of the common writer.Plink conversion is also missing the indexing.