Remove samples and keep_size from posterior_predictive #5775


Closed
ricardoV94 opened this issue May 17, 2022 · 16 comments

Comments

@ricardoV94
Member

ricardoV94 commented May 17, 2022

A bit more work than what was needed for #5772, mostly because we use samples in several tests.

Quoting @OriolAbril

imo kill everything, size, keep_size and samples (to get less samples one can always provide a thinned posterior as shown in the example)

@covertg

covertg commented Jun 8, 2022

Hi, I'd be interested in working on this. I'm new to contributing to pymc so would appreciate any pointers on what to target for this issue.

Seems like we'll want to update code and docstrings in pymc/sampling.py, as well as any tests in pymc/tests/ that call on sample_posterior_predictive with these args. Anything else?

@OriolAbril
Member

That sounds right. I think the examples in the docs (for example, but probably not restricted to, https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/posterior_predictive.html) already use neither argument, but it'd be best to check too.

@covertg

covertg commented Jun 10, 2022

Thanks!
Should we also remove the samples argument from sample_posterior_predictive_w()? Edit: never mind, I see it's a work in progress for v4 (#5800)

@covertg

covertg commented Jun 10, 2022

Another question: if the user passes return_inferencedata=False, we can either reshape the array elements of the return value (which is type Dict[str, ndarray]) to be shape (chain, draw, ...), or just leave them as-is (flattened arrays). Which should we do?

The default behavior currently is to reshape (since keep_size defaults to True), but I just wanted to make sure, given that we're nixing keep_size. The tests currently use both, but they lean heavily towards reshaping.
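
For reference, a minimal numpy sketch of the reshaping in question (the variable name `"y"` and the shapes are made up for illustration, not taken from pymc):

```python
import numpy as np

# Hypothetical flattened posterior-predictive output: one array per
# variable, with chain and draw collapsed into a single leading axis.
n_chains, n_draws = 4, 250
flat = {"y": np.zeros((n_chains * n_draws, 3))}

# Reshape each array to (chain, draw, ...), mirroring what keep_size=True did.
reshaped = {
    name: arr.reshape((n_chains, n_draws) + arr.shape[1:])
    for name, arr in flat.items()
}
print(reshaped["y"].shape)  # (4, 250, 3)
```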

@OriolAbril
Member

I think we can also remove this and always return InferenceData. Unlike with pm.sample, where some info is still present only in the old trace, that is not the case here. What do you think? cc @ricardoV94 @michaelosthege

@ricardoV94
Member Author

I would prefer to keep having the trace option

@OriolAbril
Member

Here the options are dict and InferenceData, trace is only on pm.sample (which I agree can't be removed yet)

@ricardoV94
Member Author

The default behavior currently is to reshape (since keep_size defaults to True), but I just wanted to make sure, given that we're nixing keep_size. The tests currently use both, but they lean heavily towards reshaping.

Reshaping sounds alright.

@ricardoV94
Member Author

Here the options are dict and InferenceData, trace is only on pm.sample (which I agree can't be removed yet)

Is a dict also what we get from prior predictive with return_inferencedata = False?

@OriolAbril
Member

yep, we could get rid of the dict output in prior sampling I think

@ricardoV94
Member Author

ricardoV94 commented Jun 10, 2022

Unless there is a large maintenance burden I'd rather keep it; I was just asking to understand whether there was something special about posterior predictive.

I understand a lot of people prefer InferenceData, but I find it a PITA to be honest.

@michaelosthege
Member

I see the benefits of a trace object with a simpler API, but the current implementation of BaseTrace/MultiTrace has some terrible inconsistencies.

We'll always have a non-InferenceData data structure to manage draws at MCMC-time.
If that thing has a coherent API, I think we can offer to return it to the user.

@ricardoV94
Member Author

ricardoV94 commented Jun 11, 2022

@michaelosthege this is about the dict alternative from prior/posterior_predictive. I also thought it was a MultiTrace object, but it's even simpler than that.

I don't see the problem with a simple dict :) Also, I can come up with cases where the idata would fail, such as dynamically sized variables, although those are overall not supported in PyMC.

@covertg

covertg commented Jun 12, 2022

Happy to incorporate the decision either way in the PR. I can get back to it early this coming week.

A bit of feedback as a user: the fact that the return_inferencedata argument behaves differently in different methods (i.e., False to the MCMC sampler returns a MultiTrace, while False to the predictive samplers returns a dict) was a bit confusing at first. It seems like they're really different things, or at least the behavior of "don't return inferencedata", if kept, could be made clearer at the argument level.

@covertg

covertg commented Jun 12, 2022

Also, not sure how important this is, but it seems that nearly all of the tests that do posterior predictive sampling currently use return_inferencedata=False. Is this fine as-is, or should any be updated to use idata?

@michaelosthege
Member

Since return_inferencedata=True is now the default, I agree that the naming could be improved.
Something like:

  • pm.sample(return_mtrace=False)
  • pm.sample_p*_predictive(return_dict=False)

W.r.t. the original description of the issue, I'm also in favor of removing the samples and keep_size arguments from sample_posterior_predictive.
Then thinned sampling will only be supported with xarray inputs, as shown in the docstring:

pymc/pymc/sampling.py, lines 1812 to 1819 in 3f2afb2:

```
Thin a sampled inferencedata by keeping 1 out of every 5 draws
before passing it to sample_posterior_predictive

.. code:: python

    thinned_idata = idata.sel(draw=slice(None, None, 5))
    with model:
        idata.extend(pymc.sample_posterior_predictive(thinned_idata))
```
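
As a standalone illustration of just the thinning step (plain numpy here, since the same slice semantics apply to selecting along an xarray draw coordinate):

```python
import numpy as np

# 1000 draws; keep 1 out of every 5, analogous to
# idata.sel(draw=slice(None, None, 5)) on the posterior group.
draws = np.arange(1000)
thinned = draws[slice(None, None, 5)]
print(len(thinned))  # 200
```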

Most tests use return_inferencedata=False simply because they were not migrated.
But skipping the InferenceData conversion is also slightly faster, so I think we should stick to return_inferencedata=False/return_dict=True in the test suite.

Likewise many tests use the pm.sample(return_inferencedata=False) option because working with the MultiTrace representation is less verbose and skips the conversion step.


5 participants