Compute correct log_likelihood for models with partially observed variables #5255
Open
ricardoV94 opened this issue Dec 12, 2021 · 5 comments
Labels: needs info (Additional information required), trace-backend (Traces and ArviZ stuff), v4

Comments

@ricardoV94
Member

ricardoV94 commented Dec 12, 2021

Masked observations are no longer associated with the observed RV, but with a separate free RV.

import numpy as np
import pymc as pm

with pm.Model() as m:
  x = pm.Normal('x', observed=[np.nan, 1, 2, 3])

print(m['x_observed'].tag.observations)  # TensorConstant{[1.0 2.0 3.0]}
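
The imputed entry instead shows up as a separate free RV (the names below assume the v4 auto-imputation naming of x_missing / x_observed):

print(m.free_RVs)      # [x_missing]
print(m.observed_RVs)  # [x_observed]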

As such, I am not sure whether we are computing the correct model log_likelihood for partially observed models, assuming the imputed observations should appear in the final log_likelihood:

Running all tests in test_idata_conversion.py with coverage confirmed that these lines in InferenceDataConverter.log_likelihood_vals_point are not being triggered:

pymc/pymc/backends/arviz.py

Lines 248 to 257 in 0c90e82

if isinstance(var.owner.op, (AdvancedIncSubtensor, AdvancedIncSubtensor1)):
    try:
        obs_data = extract_obs_data(var.tag.observations)
    except TypeError:
        warnings.warn(f"Could not extract data from symbolic observation {var}")
    mask = obs_data.mask
    if np.ndim(mask) > np.ndim(log_like_val):
        mask = np.any(mask, axis=-1)
    log_like_val = np.where(mask, np.nan, log_like_val)
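
For reference, what that branch is meant to do can be sketched with plain numpy (the masked array below is a stand-in for what extract_obs_data returns for a partially observed RV; the log-likelihood values are made up):

import numpy as np

obs_data = np.ma.masked_invalid([np.nan, 1.0, 2.0, 3.0])  # first entry missing
log_like_val = np.array([-0.9, -1.4, -2.9, -5.4])          # hypothetical pointwise log-likelihoods

mask = obs_data.mask                                       # [True, False, False, False]
log_like_val = np.where(mask, np.nan, log_like_val)
print(log_like_val)                                        # [ nan -1.4 -2.9 -5.4]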

This came up in #5245

ricardoV94 added the v4 label Dec 12, 2021
ricardoV94 changed the title from "Compute correct log_likelihood for partially observed variables" to "Compute correct log_likelihood for models with partially observed variables" Dec 12, 2021
ricardoV94 added the needs info (Additional information required) and trace-backend (Traces and ArviZ stuff) labels Dec 12, 2021
@ricardoV94
Member Author

Does anyone have a clear source (or knowledge) of whether imputed variables should be considered during model comparison (and as such be part of the InferenceData log_likelihood field)?

In V3 they were, now they are not.

@twiecki
Member

twiecki commented Jan 17, 2022

@OriolAbril
Member

My hunch is that neither option is correct, but with loo it's relatively easy to check it out by comparing psis-loo results (via az.loo) to "brute force" cross validation (which can be partly automated with az.reloo).

I lean towards neither, because both waic and loo are designed for model comparison based on predictive accuracy, but when performing imputation, part of the prediction task is "included" in the model and affects the logp. IMO, using waic/loo for model comparison with missing data should mean fitting the model on the observed data; loo/waic are then an estimate of how well the model should be able to predict the missing data, and you can use that to compare models. The first and most important question when using loo/waic is defining what the "predictive task" actually is.

Disclaimer: I might be a bit biased against imputation. From what I said above, one could also argue that the model with data imputation is taking into account some extra information (as if one used better priors) and therefore both could be correct depending on the goal.

What would make more sense to me would be to exclude the missing/imputed data from the pointwise log likelihood data (but not from the fit) and then compare the model that excludes the missing data from the fit with the model with imputed data. Now both would be using the same data to estimate their predictive accuracy and can be compared.

Similarly, I guess you could include imputed data there to compare data imputation approaches. But note that here you are not estimating the predictive accuracy on missing data via the CV approach/assumption (i.e. how well we fit observed data is a valid measure of how well we'll fit unobserved data) but by evaluating the likelihood on the imputed values and assuming they are observed. So if the goal is actually to estimate this data, using it to calculate loo/waic might not be the best measure, and it might be preferable to simply use the log predictive density on the imputed data only to compare different imputation approaches?
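
A minimal sketch of what "exclude the missing/imputed entries from the pointwise log likelihood before computing loo" could look like (the variable name x, the dim name x_dim_0, and the mask are illustrative assumptions, not actual PyMC output):

import numpy as np
import arviz as az

# assume `idata` is an InferenceData with posterior and log_likelihood groups,
# and `mask` marks which entries of "x" were actually observed (not imputed)
keep = np.flatnonzero(mask)
idata_obs = az.InferenceData(
    posterior=idata.posterior,
    log_likelihood=idata.log_likelihood["x"].isel(x_dim_0=keep).to_dataset(name="x"),
)
az.loo(idata_obs)  # now comparable across models fitted with/without imputation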

@fonnesbeck
Member

fonnesbeck commented Jan 24, 2022

There is nothing special about missing data as far as Bayesian inference is concerned. They are just latent variables, and should be treated as such during model comparison. Most of the time, the missing data will be at terminal nodes of the model graph, with nothing downstream of them, meaning they should not have much influence on model comparison. However, occasionally they are deeper in the model (e.g. imputing missing inputs in a regression). These have to be included in model comparison, because you'd want to favor the model that imputes values that result in better estimates downstream. I can't think of any principled reason for excluding imputed data nodes, if they are part of the model.

@ricardoV94
Member Author

@fonnesbeck if I understand you correctly, that's (accidentally) the current behavior of V4. Should we then add a test to avoid regressions?
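
A rough sketch of what such a regression test could look like (the x_observed key and the expectation that only the three observed entries appear in the log_likelihood group follow the behavior described above; sampler settings and the idata_kwargs are arbitrary assumptions):

import numpy as np
import pymc as pm

def test_log_likelihood_with_partially_observed_rv():
    with pm.Model():
        pm.Normal("x", observed=[np.nan, 1, 2, 3])
        idata = pm.sample(
            draws=50, tune=50, chains=1, random_seed=42,
            idata_kwargs={"log_likelihood": True},
        )
    # only the actually observed entries should show up in the pointwise log likelihood
    assert idata.log_likelihood["x_observed"].shape[-1] == 3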
