Skip to content

Standardize how intermediate variables are handled #286

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
eric-czech opened this issue Sep 29, 2020 · 6 comments
Closed

Standardize how intermediate variables are handled #286

eric-czech opened this issue Sep 29, 2020 · 6 comments
Milestone

Comments

@eric-czech
Copy link
Collaborator

We currently have several functions with logic like this (from here):

def Tajimas_D(
    ds: Dataset,
    allele_counts: Hashable = "variant_allele_count",
) -> DataArray:
    if allele_counts not in ds:
        ds_new = count_variant_alleles(ds)
    else:
        ds_new = ds
    ac = ds_new[allele_counts]

What should we do though if a non-default variable name is provided and that variable doesn't exist in the original dataset? In this case above the code would fail on ds_new[allele_counts]. There is also the question of whether or not intermediate variables like this should be in the result or not, which I believe @tomwhite has mentioned before. We have it both ways in the code. Lastly, if count_variant_alleles had any optional behavior (e.g. flags for handling partial calls differently), it wouldn't necessarily make sense for us to use the default behavior transparently like this.

At this point, I'm inclined to say we should remove default calculations altogether and require that the variables are present in the first place. What do you think of that @tomwhite / @jeromekelleher / @ravwojdyla?

@eric-czech eric-czech mentioned this issue Sep 29, 2020
18 tasks
@jeromekelleher
Copy link
Collaborator

In terms of library semantics, I agree it would be considerably simpler if left out the default calculations like this. I'm starting to view the current API as the "expert's" low-level interface, which hopefully we can build something more high-level on top of (at some point), so I think it would be better if we left out user-convenience stuff like this and focused on keeping things simple and efficient.

Anyway, yes, I agree, let's ditch the automatic intermediate value calculation.s

@tomwhite
Copy link
Collaborator

tomwhite commented Oct 1, 2020

+1 to removing default calculations.

@tomwhite
Copy link
Collaborator

Taken to its logical conclusion, this would mean that to calculate Fst you'd have to call divergence first. And for pbs, you'd have to call divergence and Fst first.

@jeromekelleher
Copy link
Collaborator

That does seem annoying, and would lead to ugly user code. What's the alternative? -Should we start thinking about the "sgkit-lite" easy interface now?

@eric-czech
Copy link
Collaborator Author

eric-czech commented Oct 22, 2020

From the call, there was agreement on trying this instead:

  1. Standardize on having all intermediate variables added to the dataset (some are used temporarily now).
  2. Make it clear in the user guide that non-default names for variables will not result in automatic definition of those variables when they aren't present.
    • Arguably, we should try to detect and throw errors specifically for this case
  3. Add some examples to show users how to override the default definition of a variable prior to calling another method that depends on it, if he/she wishes to define it in a non-default way.

@eric-czech eric-czech changed the title Eliminate automatically computed intermediate variables? Standardize how intermediate variables are handled Oct 22, 2020
@tomwhite
Copy link
Collaborator

tomwhite commented Jun 7, 2021

Fixed in #360 (and previous issues for the implementation)

@tomwhite tomwhite closed this as completed Jun 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants