Should we always set a metadata "name" for populations in tsinfer? #609
Comments
In the transition to SGkit, we would presumably use the "name" string for a cohort (AKA population) that @jeromekelleher argued for at https://github.com/pystatgen/sgkit/issues/224#issuecomment-694350208, so this would fit reasonably nicely into that structure.
Without looking at the details, I think we should always set a name. I'd lean towards sticking it into metadata, but it depends on how things work out. I guess we can just keep a set of population names during the
I also think we should stick them in metadata. I wonder if during the creation of a SampleData file, we should have a parameter "population_metadata_schema", which is a schema that allows "name" and "description" to be set. If Then during the Alternatively, we could just set the population metadata schema to the
Sounds good, but we should be a little wary about any performance implications (we've hit some walls recently with metadata encoding). I see the So, setting the schema at the end may be a short-term pragmatic option.
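For illustration, a schema that allows "name" and "description" could look roughly like the sketch below. This is only an assumption about the shape: the dict follows the JSON Schema layout with a "codec" key, the constant `POPULATION_METADATA_SCHEMA` and the helper `encode_population_metadata` are hypothetical names, and making "name" required reflects the "always set a name" suggestion rather than any decided design. The check is done by hand with the standard library so that the example stays self-contained.

```python
import json

# Hypothetical schema for population metadata. The layout (JSON Schema plus a
# "codec" key) and the choice to make "name" required are assumptions.
POPULATION_METADATA_SCHEMA = {
    "codec": "json",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["name"],
}


def encode_population_metadata(metadata):
    """Check a metadata dict against the sketch schema, then encode it as JSON bytes."""
    for key in POPULATION_METADATA_SCHEMA["required"]:
        if key not in metadata:
            raise ValueError(f"population metadata is missing required key {key!r}")
    for key in POPULATION_METADATA_SCHEMA["properties"]:
        if key in metadata and not isinstance(metadata[key], str):
            raise TypeError(f"population metadata {key!r} must be a string")
    # Other keys are passed through unchecked in this sketch.
    return json.dumps(metadata, sort_keys=True).encode("utf-8")
```

Because populations are few, running a check like this once per population (or once at the end) should be cheap compared with per-individual or per-site metadata.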
I assume (with no evidence!) that metadata in populations is unlikely to cause performance issues, because we rarely have large numbers of them in a loop. I guess we might want to avoid doing lots of comparisons of population names between samples though?
Yes, of course - my bad
I've just been trying to get this to work with the test suite. I actually wonder if we want to make the "default" population names more-or-less unique to the sample data file. Otherwise, when we merge sample data files (there is a function to do that), and assuming we merge on the name, we'll simply end up merging the first pop in one file with the first in the other, etc. So I wonder if the default name, if no name is given, should also include the UUID of the sample data file?
That's pretty messy from a user perspective - you'd expect the names to be something you could predict. We'll have to figure out a different way of fixing the tests.
Yeah, fair point. It's not so much fixing tests though. It's a question of whether, if you merge together 2 separate sample data files (both with populations defined, say 2 in the first file and 3 in the second), you would expect the populations in each file to be treated as separate (i.e. 5 populations in the final file), or whether you would expect the 2 populations in the first file to be equated to the first 2 populations in the second. I think you would expect the former behaviour, creating 5 populations. If we allocate the same set of names by default, and match on names only, we will get the second behaviour. A few other ways we could get the behaviour I would expect
I'm leaning towards 3.
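To make the merging concern above concrete, here is a minimal sketch of the two behaviours for a file with 2 populations and a file with 3, both using the same default names. It does not use the tsinfer merge function; `merge_by_name` and `merge_keep_separate` are hypothetical helpers, and the default names simply follow the "sample_data_pop1" pattern suggested in the issue.

```python
def merge_by_name(names_a, names_b):
    """Merge two lists of population names, equating populations with equal names."""
    merged = list(names_a)
    for name in names_b:
        if name not in merged:
            merged.append(name)
    return merged


def merge_keep_separate(names_a, names_b):
    """Merge two lists of population names, treating every population as distinct."""
    return [("file_a", n) for n in names_a] + [("file_b", n) for n in names_b]


# Hypothetical default names, as if neither file had been given explicit names.
file_a = ["sample_data_pop1", "sample_data_pop2"]
file_b = ["sample_data_pop1", "sample_data_pop2", "sample_data_pop3"]

by_name = merge_by_name(file_a, file_b)
separate = merge_keep_separate(file_a, file_b)
print(len(by_name), len(separate))  # prints "3 5"
```

Matching on identical default names collapses the result to 3 populations, whereas keeping the files' populations separate gives the 5 that a user would probably expect.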
Here's the (rather complicated) sketch of the SampleData.add_population method that I propose. What do you think, @jeromekelleher?
Similar to MesserLab/SLiM#169, should `tsinfer` always create a "name" (and maybe "description") for populations in the resulting tree sequence? I think we should.

If a name is not given in the metadata field (when calling `add_population`) then we could simply create a name like "sample_data_pop1". Alternatively, we could simply add a required parameter `name` to `add_population`, either storing it in the sample data metadata field (easiest), or creating a new zarr array for it, and requiring that each population name is unique. Do you have any preferences/thoughts @jeromekelleher?

This would help with #604 as then we could match populations by name.
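As a rough illustration of the first option (a default name when none is given, plus a uniqueness check), something along these lines might work. This is a sketch under stated assumptions, not the tsinfer implementation: `SampleDataSketch` is a hypothetical stand-in class, names are kept in a plain list of metadata dicts, and the default name is built from the population index.

```python
class SampleDataSketch:
    """Hypothetical stand-in for a sample data container; not the tsinfer class."""

    def __init__(self):
        # One metadata dict per population, in order of addition.
        self.population_metadata = []

    def add_population(self, metadata=None):
        """Add a population, filling in a default name and rejecting duplicates."""
        metadata = dict(metadata or {})
        pop_id = len(self.population_metadata)
        if "name" not in metadata:
            # Assumed default pattern, following "sample_data_pop1" in the issue.
            metadata["name"] = f"sample_data_pop{pop_id}"
        existing = {m["name"] for m in self.population_metadata}
        if metadata["name"] in existing:
            raise ValueError(f"duplicate population name {metadata['name']!r}")
        self.population_metadata.append(metadata)
        return pop_id
```

One consequence of this design is that a user-supplied name that happens to match a later default (for example, naming the first population "sample_data_pop1") makes the later call fail instead of silently producing two populations with the same name.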