Skip to content

Should we always set a metadata "name" for populations in tsinfer? #609

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
hyanwong opened this issue Nov 17, 2021 · 10 comments
Open

Should we always set a metadata "name" for populations in tsinfer? #609

hyanwong opened this issue Nov 17, 2021 · 10 comments

Comments

@hyanwong
Copy link
Member

hyanwong commented Nov 17, 2021

Similar to MesserLab/SLiM#169, should tsinfer always create a "name" (and maybe "description") for populations in the resulting tree sequence? I think we should.

If a name is not given in the metadata field (when calling add_population ) then we could simply create a name like "sample_data_pop1". Alternatively, we could simply add a required parameter name to add_population, either storing it in the sample data metadata field (easiest), or creating a new zarr array for it, and requiring that each population name is unique. Do you have any preferences/thoughts @jeromekelleher ?

This would help with #604 as then we could match populations by name.

@hyanwong
Copy link
Member Author

In the transition to SGkit, we would presumably use the "name" string for a cohort (AKA population) that @jeromekelleher argued for at https://github.com/pystatgen/sgkit/issues/224#issuecomment-694350208, so this would fit reasonably nicely into that structure.

@hyanwong hyanwong changed the title Should we always set a metadata "name" for populations in tsinfer Should we always set a metadata "name" for populations in tsinfer? Nov 17, 2021
@jeromekelleher
Copy link
Member

Without looking at the details, I think we should always set a name. I'd lean towards sticking it into metadata, but it depends on how things work out. I guess we can just keep a set of population names during the add phase to make sure they are unique?

@hyanwong
Copy link
Member Author

hyanwong commented Nov 22, 2021

I also think we should stick them in metadata. I wonder if during the creation of a SampleData file, we should have a parameter "population_metadata_schema", which is a schema that allows "name" and "description" to be set. If None, we use the msprime population schema.

Then during the add phase, we simply write to "name" and "description", assuming they exist (and we would automatically error out here if the user passed in an incompatible schema when creating the file).

Alternatively, we could just set the population metadata schema to the permissive_json_schema().

@jeromekelleher
Copy link
Member

Sounds good, but we should be a little wary about any performance implications (we've hit some walls recently with metadata encoding) I see the required flag in the JSON schema as just telling users that the key is guaranteed to exist BTW.

So, setting the schema at the end may be short term pragmatic option.

@hyanwong
Copy link
Member Author

I assume (with no evidence!) that metadata in populations is unlikely to cause performance issues, because we rarely have large numbers of them in a loop. I guess we might want to avoid doing lots of comparisons of population names between samples though?

@jeromekelleher
Copy link
Member

Yes, of course - my bad

@hyanwong
Copy link
Member Author

hyanwong commented Nov 22, 2021

I've just been trying to get this to work with the test suite. I actually wonder if we want to make the "default" population names more-or-less unique to the sample data file. Otherwise, when we merge sample data files (there is a function to do that), and assuming we merge on the name, we we'll simply end up merging the first pop in one file with the first in the other, etc. So I wonder if the default name, if no name is given, should also include the UUID of the sample data file?

@jeromekelleher
Copy link
Member

That's pretty messy from a user perspective - you'd expect the names to be something you could predict. We'll have to figure out a different way of fixing the tests.

@hyanwong
Copy link
Member Author

hyanwong commented Nov 23, 2021

That's pretty messy from a user perspective - you'd expect the names to be something you could predict. We'll have to figure out a different way of fixing the tests.

Yeah, fair point. It's not so much fixing tests though. It's a question of whether, if you merge together 2 separate sample data files (both with populations defined, say 2 in the first file and 3 in the second), you would expect the populations in each file to be treated as separate (i.e. 5 populations in the final file), or whether you would expect the 2 populations in the first file to be equated to the first 2 populations in the second. I think you would expect the former behaviour, creating 5 populations. If we allocate the same set of names by default, and match on names only we will get the second behaviour.

A few other ways we could get the behaviour I would expect

  1. by default, put the UUID in the description field and merge only if both name and description are the same. But then how do we make names unique in the final file?
  2. Treat the default, auto-generated notation (e.g. p1, p2 etc) as special: if we detect names of that format when merging, we don't merge together p1 from self and p1 from other, even thought they have the same name. That seems a bit of a hack.
  3. Somehow flag up the population names as autogenerated. E.g. put "autogenerated name" in the description field, and don't merge populations that have that string as a description.

I'm leaning towards 3.

@hyanwong
Copy link
Member Author

Here's the (rather complicated) sketch of the SampleData.add_population method that I propose. What do you think @jeromekelleher ?

    def add_population(self, metadata=None, name=None, description=None):
        """
        Adds a new :ref:`sec_inference_data_model_population` to this
        :class:`.SampleData` and returns its ID. Metadata including (at a minimum)
        a name and a description, will be associated with each population.

        All calls to this method must be made **before** individuals or sites
        are defined.

        :param dict metadata: A JSON encodable dict-like object containing
            metadata that is to be associated with this population. Keys called
            "name" or "description" may have their values overwritten using the
            `name` and `description` parameters below.
        :param str name: A unique name for this population (of length > 0). If given,
            this will override any name defined in `metadata`. If not given, and
            no "name" key exists in `metadata`, a suitable name will be autogenerated:
            population 0 will be given the name `p0`, population 1 `p1`, etc.
        :param str description: A description for this population (default: None). If not
            None, this will override any description provided in `metadata`. Otherwise
            (i.e. no "description" key exists in `metadata` and `description` is None)
            if the population name was name was autogenerated the description will be
            set to the string "autogenerated name", whereas if the name was not
            autogenerated the description will default to the empty string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants