[Feature Request] Implement most likely heteroskedastic GP and write a tutorial notebook #180
The notebook looks great! A few comments:

```python
import torch

# test on the training points
# call it X_test just for ease of use in the future
X_test = torch.linspace(0, 1, 100)
mll.eval()
with torch.no_grad():
    posterior_f = mll.model(X_test)  # call the GPyTorch model directly to get p(f*|D, x*)
    test_pred = mll.likelihood(mll.model(X_test), X_test)  # call the likelihood on the result to get p(y*|D, x*)
```

The above works -- we just needed to pass `X_test` to the likelihood as well, because the noise depends on the input. Then you should be able to call `test_pred.confidence_region()`.
Whoops! It seems like I'm lost :-). I've been responding to so many issues opened this weekend that when I got another email about an example notebook for another GP model, I just automatically assumed we were on the GPyTorch repo. Apologies! Obviously an example notebook on the botorch repo should definitely be free to use BoTorch features :-). My comments about getting the predictive posterior are still useful, hopefully.
@mc-robinson, thanks for the comprehensive issue. We have designed the `posterior` method to cover this use case: passing `observation_noise=True` returns the predictive posterior p(y*|D, x*) rather than the posterior over the function values. However, since the noise model of the `HeteroskedasticSingleTaskGP` currently returns the noise on the log scale, this does not yet work correctly for that model.
Indeed, the actual noise returned by the likelihood should not be on the log scale. The reason the model fits the inner GP on the log noise is that this way we don't have to worry about zero (or negative) noise predictions. The correct way to handle this is to have the likelihood map the noise model's log-scale predictions back to the original scale (e.g. by exponentiating) before they are used.
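A tiny sketch of the log-scale trick with toy values (this is not the actual BoTorch internals, just the idea):

```python
import torch

# fit the inner noise GP on log-variances...
observed_var = torch.tensor([0.5, 1.2, 0.3])
noise_targets = observed_var.log()  # targets for the inner noise GP

# ...then exponentiate its predictions, which guarantees strictly positive
# noise no matter what real value the GP predicts
log_var_pred = torch.tensor([-2.0, 0.1, -0.5])  # stand-in for noise-GP output
noise_var = log_var_pred.exp()
assert (noise_var > 0).all()
```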
Generally, a heteroskedastic GP for the case where you don't have noise observations would be a reasonable thing to implement. I haven't thought too much about this so far, though. Edit: Just realized this is all in the Colab notebook - awesome job! I'll give it a closer look tomorrow. @jacobrgardner, haha, feel free to keep helping out with issues over here, I don't mind...
Thanks for the detailed responses! And @jacobrgardner, I know it was a mistake haha, but I am also happy to (eventually) write a detailed notebook on heteroskedastic GPs in GPyTorch. It might be helpful to show how one might build one from the ground up. I already sort of have a draft ready -- are there any examples currently showing how to embed one model in another, such as what is done in the heteroskedastic model in BoTorch? @Balandat Sorry I missed the observation noise hyperparameter! I will definitely add that to the notebook when I clean it up. Also, I'll reiterate that my way of creating a heteroskedastic GP is basically the simplest/dumbest way I know of, but there are lots of schemes that seem to get quite complicated. If anyone here has an idea for something better that isn't too nasty to implement, I would love to hear it. The iterative scheme found in http://people.csail.mit.edu/kersting/papers/kersting07icml_mlHetGP.pdf is relatively close to what we are already doing, and might not be too hard to implement? I know that there are some issues noted in other papers with convergence and such, but it might be worth trying.
The "Most Likely Heteroscedastic Gaussian Process Regression:" seems to be pretty much exactly what you're doing, just adding an EM-style fitting process where you iteratively re-estimate the mean and the observation variances from a heteroskedastic model, starting from a homoskedastic mean estimate. This should be just a few lines of code to implement. The only question is what exactly it means for this process to "converge" (the paper doesn't say). It's probably reasonable to just look at the change in mean predictions and noise variance estimates between iterations.
We could add a
The issue with the
Hi guys, thanks for the work you've already put into this. I'll try to update the example notebook soon, and I'm happy to try and help build out the heteroskedastic models. Just let me know if there is a best place to start. As for your question, @Balandat: I too am a bit confused as to what they used. I assumed they were just looking at the value of the likelihood, but I think your method would also work well. I have also found a relevant critique of this approach in http://www.comonsens.org/documents/conferences/205_vhgpr_icml.pdf
I think this would work for small models, but we will quickly run into scalability issues if we have a latent noise variable for every datapoint. You could think of many other ways to address that issue, though, like parameterizing the noise function for these latent variables using some kind of spline interpolation... For now I think the EM-style algorithm would be a good thing to try first. Essentially, the model you end up getting is exactly a heteroskedastic GP with the specific inferred noise values, so I don't think it's necessary to build a full new model. Maybe for now it's enough to have a helper function that takes in a SingleTaskGP, internally performs the fitting, including the construction of the heteroskedastic GP, and then returns that fitted model. Happy to accept a PR :)
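Something like the following purely illustrative signature, wrapping the loop sketched earlier (the name and defaults are assumptions, not an existing BoTorch function):

```python
from botorch.models import SingleTaskGP, HeteroskedasticSingleTaskGP

def most_likely_heteroskedastic_gp(
    model: SingleTaskGP,  # a fitted homoskedastic model to start from
    max_iter: int = 10,
) -> HeteroskedasticSingleTaskGP:
    """Hypothetical helper: iteratively re-estimate per-point noise levels
    starting from `model` and return the fitted heteroskedastic model."""
    ...
```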
Hi all, sorry for the delay on this one -- but I am finally attaching a slightly updated notebook. In it is a first attempt at a helper function along the lines suggested above. @Balandat, I also switched over to using the `observation_noise` argument when computing the predictive posterior.
Thanks @mc-robinson, I will take a look later today.
Looks like you're doing the right thing there. Looking through the rest of the nb now.
Thanks a lot for expanding on the nb.
So if I get that right, your method of estimating the noise is just taking the squared residuals against the posterior mean, rather than sampling as the paper does.
Another thing to note is that the paper uses an RBF kernel for the noise GP, which will smooth things more than the Matern kernel that's used by default in the SingleTaskGP. This could also help with convergence. Let's see how things go if you switch out the noise estimation procedure - if the results look reasonable and things converge, then we can bake this into a PR.
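For concreteness, a sketch of the paper's sampling-based noise estimate, assuming a fitted BoTorch `model` and training data `train_X`, `train_Y` (the sample count is arbitrary):

```python
import torch

with torch.no_grad():
    posterior = model.posterior(train_X)
    samples = posterior.rsample(torch.Size([1000]))   # 1000 x n x 1 draws of f
    # per-point noise estimate: average squared deviation of y from the samples
    noise_est = (train_Y - samples).pow(2).mean(dim=0)
```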
Ok great, thanks for all of the helpful comments. I'll try to update the notebook with these suggestions later today.
Alright, sorry for the delay on this one, but here is the updated notebook. I have to say I went down a bit of a rabbit hole trying to figure out the variance estimate and why my naive estimate does not work. Part of my confusion is that I still don't quite understand the statement quoted earlier. From my reading of the paper, the samples they draw are from the GP's posterior at the training points.
I actually tried to analytically show that their estimate and my estimate are both consistent. The math is quite rough at the moment, but I think it may show just that. I also plotted the noise estimates for both methods, and the shapes look quite similar (not that that is a rigorous check). Furthermore, I updated the function to include both estimates. My naive way is a bit quicker since no sampling is needed, but it may be a bit worse at capturing the true noise levels. I am working with a colleague on a rigorous way to show this, such as comparing the negative log predictive density. @Balandat, I was not able to get useful noise estimates from the code you suggested in your previous post; let me know if you can see why. Please let me know if you see any glaring errors (especially with the math, if you have the time to dive into the differing variance estimates; I'll keep looking at it). Otherwise, I think we are getting closer to a usable function.
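For what it's worth, the relationship between the two estimates can be made explicit. Assuming the paper's samples $f^{(j)}$ are drawn from the GP posterior at $x_i$ with posterior mean $\mu_i$ and variance $\sigma_i^2$ (notation assumed here, not from the thread):

$$\mathbb{E}\left[\frac{1}{s}\sum_{j=1}^{s}\left(y_i - f^{(j)}\right)^2\right] = (y_i - \mu_i)^2 + \sigma_i^2$$

So in expectation the sampling-based estimate is the naive squared-residual estimate plus the posterior variance at $x_i$, and the two coincide as the posterior concentrates.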
Thanks for the update, this looks great!
I must have misread that part, sorry about that.
Sampling from the posterior (predictive or latent), once computed, is quite cheap on the standard BoTorch models -- is there actually a big difference in wall time (per iteration; I see that this doesn't converge as often as your method)? Regarding the results: it's not overly surprising that you don't fully capture the very high-variance areas. The internal GP for estimating the noise level will smooth things out, so if you have relatively abrupt changes in the noise levels, like the peak you have, it will smooth that out. If you have reason to expect these kinds of fast changes in the noise levels, you could modify the prior you're putting on the lengthscale of the noise GP (putting more probability mass on lower values). By default, the kernel used is a 5/2 Matern with a Gamma prior on the lengthscale.
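A sketch of such a kernel using GPyTorch modules, assuming `train_X` from earlier; the concrete prior parameters are illustrative assumptions, not BoTorch defaults:

```python
from gpytorch.kernels import MaternKernel, ScaleKernel
from gpytorch.priors import GammaPrior

# hypothetical prior choice: more mass on short lengthscales so the noise GP
# can track fast changes in the noise level
noise_covar_module = ScaleKernel(
    MaternKernel(
        nu=2.5,
        ard_num_dims=train_X.shape[-1],
        lengthscale_prior=GammaPrior(2.0, 10.0),
    ),
)
```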
Thanks! And no, I don't think the difference is that big in these small toy cases, but I am looking into the difference between the variance estimators for some more non-trivial datasets. And that's a great point about changing the prior -- thanks. But just wondering, is there a reason I cannot pass a custom covar_module to the model constructor?
No particular reason other than simplicity - we didn't want to overload the out-of-the-box models with too many options. We may revisit this, though, and add some optional kwargs to allow specifying the various modules. If you want more flexibility, for now you could just subclass SingleTaskGP and swap in your own modules.
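A minimal sketch of that subclassing approach (the class name is hypothetical), swapping in a custom kernel like the one above:

```python
from botorch.models import SingleTaskGP
from gpytorch.kernels import Kernel

class CustomKernelSingleTaskGP(SingleTaskGP):
    """Hypothetical subclass that swaps in a user-supplied kernel."""

    def __init__(self, train_X, train_Y, covar_module: Kernel):
        super().__init__(train_X, train_Y)
        # replace the default kernel with the custom one
        self.covar_module = covar_module
```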
One thing to be aware of is that the models derived from
Cool, thanks for the advice. I'll try to implement these things and look into the effects of the kernel soon.
Hi, sorry this took so long -- but here is an updated notebook with the implemented "most likely heteroskedastic GP" from the paper: https://colab.research.google.com/drive/1dvFA3LYdLcH0ObQhRvnG1yXCUNgeYIXz . The GP is benchmarked on all of the datasets mentioned in the paper and seems to achieve results comparable to, or better than, those reported in the paper. I have not done a lot of work to change the default kernels yet, but I figure it would be good to get this basic, working model implemented first. Note that the most annoying part of the implementation is the normalization and standardization of the independent and dependent variables, respectively. A lot of extra code in the notebook is devoted to this process specifically. If it all looks alright to you, I can bake the basic heteroskedastic GP function into a pull request.
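For reference, a sketch of that preprocessing using BoTorch's transform utilities (variable names are illustrative):

```python
import torch
from botorch.utils.transforms import normalize, standardize

# map inputs to the unit cube and standardize outcomes before fitting
bounds = torch.stack([train_X.min(dim=0).values, train_X.max(dim=0).values])
train_X_normalized = normalize(train_X, bounds=bounds)
train_Y_standardized = standardize(train_Y)
```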
@mc-robinson This is awesome, many thanks for setting this up and doing the benchmarking. I'd love to see this in a PR. Since this really is a training procedure that returns a model of an existing type, I would suggest we put this function in with the fitting utilities rather than adding a new model class. What do you think of cleaning up the notebook after the PR goes in and adding it as a tutorial notebook under Advanced Usage here?
Hi @mc-robinson, +1 to what @Balandat said -- this is a very nice writeup and it would be awesome to have it in our main BoTorch tutorials. Happy to provide feedback and contribute to some of the verbiage in the tutorial. Let us know if you need any pointers on how to add a tutorial to the site.
@mc-robinson, have you made any progress on this PR? I think this is pretty nifty and I'd like to get this in soon. We can also package this up ourselves if that's ok with you.
Hey @Balandat, sorry, been bogged down with a master's thesis due Friday... Happy to finally work on it this weekend though. That being said, totally happy for you guys to package it up yourselves -- I just want it to be useful! And if you want to do the PR but still want me to write up the tutorial, happy to do that as well. Just let me know what works best!
If you can put up a PR early next week that would be great. It doesn't have to be complete, we can take care of cleaning it up. Good luck with the thesis!
Thanks! And great, will try to get it up by end of this weekend.
Seems like the PR never got merged
Yeah. I have an internal set of changes with an updated version but it has some (surmountable) issues and needs to be rebased.
Gotcha!
@Balandat I think this model would get a good deal of usage if we landed it. Many folks have heteroskedastic errors. We can always refine the model later.
Well “refine” would mean “fix”, at least for the updated internal implementation. So I’d prefer to avoid landing something that we know is broken. What I can do is rebase this and put out a PR for visibility and then plan for some time (and someone) to work on it - while in the meantime hoping for someone lurking here to fix it :)
Updates on this? :D
I don't think anyone at Meta is currently working on this. But thanks for letting us know you're interested! And of course a PR is always welcome :)
🚀 Feature Request
I would like to submit a potential example notebook, based on my work here:
https://colab.research.google.com/drive/1dOUHQzl3aQ8hz6QUtwRrXlQBGqZadQgG
(hopefully it is easy for others to run in Colab!)
The notebook details the use of a heteroskedastic GP in BoTorch. However, before finalizing the notebook and cleaning it up a bit, there are a few things that I believe could be changed to help improve the entire training process (also described in the notebook comments). I apologize if this reads as many feature requests/issues in one -- but I am happy to contribute pull requests, and please let me know what you all think / if I am totally wrong about something. I can also open separate feature requests for each statement if you agree on the need.
Listed in roughly decreasing order of perceived importance.
1. A `mll.model.predictive_posterior` function. Since the `mll.model.posterior(X_test).mvn.confidence_region()` call currently only gives the posterior for the function mean, this can be confusing to those who think they are getting the predictive posterior (as is shown when using confidence regions in the GPyTorch tutorials, for example). Currently, for simple models the predictive posterior is easy to get through `mll.likelihood(mll.model(X_test)).confidence_region()`. However, this does not work for the `HeteroskedasticSingleTaskGP` model. For the `HeteroskedasticSingleTaskGP` model, I could only obtain the prediction intervals using rather burdensome code (a reconstruction is sketched at the end of this issue). (Note that the noise model of `HeteroskedasticSingleTaskGP` currently returns the `log` of the variance, thus the exponentiation. I believe this may be an error in the noise model.) In summary, I believe a `predictive_posterior` function is often what people are after and would be a nice complement to `posterior`. Furthermore, it would be nice to standardize the methods of obtaining prediction intervals across tasks, since the same method does not currently work for all models.
2. A version of `HeteroskedasticSingleTaskGP` that takes only `train_x` and `train_y`, and does not require the input of `train_yvar`. This can get rather complicated, but a simple solution for now would be to initially create a `SingleTaskGP` that is used to estimate `train_Yvar` once trained. The observed variances could then be fed into the current implementation of `HeteroskedasticSingleTaskGP` along with `train_x` and `train_y`, as is done in the notebook. This is perhaps the more controversial of the feature requests.

Please let me know if you have any other comments on the notebook, and thanks for the library!
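For reference, a hedged reconstruction of the burdensome workaround referenced in item 1 above; the exact code was lost, so the attribute path to the inner noise GP (`likelihood.noise_covar.noise_model`) and the use of `model.posterior` are assumptions:

```python
import torch

# hypothetical reconstruction -- obtaining ~95% prediction intervals from a
# fitted HeteroskedasticSingleTaskGP, exponentiating the noise model's output
# since it currently returns log-variances
with torch.no_grad():
    posterior_f = model.posterior(X_test)      # p(f* | D, x*)
    mean = posterior_f.mean.squeeze(-1)
    var_f = posterior_f.variance.squeeze(-1)
    # assumed attribute path to the inner noise GP
    noise_model = model.likelihood.noise_covar.noise_model
    log_var_y = noise_model.posterior(X_test).mean.squeeze(-1)
    var_y = var_f + log_var_y.exp()            # add the exponentiated noise
    lower = mean - 2.0 * var_y.sqrt()
    upper = mean + 2.0 * var_y.sqrt()
```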