Improve NUTS initialization #1512

Closed · twiecki opened this issue Nov 8, 2016 · 25 comments

@twiecki (Member) commented Nov 8, 2016

I have seen many complaints about NUTS being slow. In 100% of these cases the root cause was bad initialization / scaling of the NUTS sampler. Using ADVI to estimate a diagonal covariance matrix for scaling NUTS is a robust solution.

However, I wonder if there isn't something better we can do. Specifically, I bet the @stan-dev guys use some clever tricks that they might be able to help us with. CC @betanalpha @bob-carpenter
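
As a point of reference, this is roughly the pattern I mean, assuming the current pm.variational.advi API (which returns per-variable means and stds) and a placeholder model:

import numpy as np
import pymc3 as pm

with model:  # `model` stands in for any pm.Model
    # Fit mean-field ADVI to get approximate posterior means and
    # standard deviations for every free variable.
    v_params = pm.variational.advi(n=50000)

    # Use the squared ADVI standard deviations as a diagonal covariance
    # (mass matrix) estimate for NUTS, and start from the ADVI means.
    scaling = np.power(model.dict_to_array(v_params.stds), 2)
    step = pm.NUTS(scaling=scaling, is_cov=True)
    trace = pm.sample(2000, step=step, start=v_params.means)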

@springcoil (Contributor) commented Nov 8, 2016

I agree, Thomas; that was my suspicion about the complaints as well.

I've run into this myself. In some sense it's bad for the user experience, though: we shouldn't impose a leaky abstraction on the user, like having to understand the intricacies of initialization and scaling.


@ferrine (Member) commented Nov 8, 2016

What about adding a NUTS.from_advi(..., tune_interval=None) constructor, where ... stands for the common ADVI and NUTS arguments, and tune_interval is an optional int giving the number of samples to use for estimating a covariance matrix instead of the diagonal ADVI solution?
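
Something like this rough sketch (nuts_from_advi and its arguments are hypothetical, not existing pymc3 API; here tune_interval is read as a short diagonal-scaled warm-up run whose samples feed a dense covariance estimate):

import numpy as np
import pymc3 as pm

def nuts_from_advi(model, n_advi=20000, tune_interval=None, **nuts_kwargs):
    """Fit ADVI, then build a NUTS step scaled by its posterior estimate."""
    with model:
        fit = pm.variational.advi(n=n_advi)
        # Diagonal covariance straight from the ADVI standard deviations.
        diag = np.power(model.dict_to_array(fit.stds), 2)
        if tune_interval is None:
            return pm.NUTS(scaling=diag, is_cov=True, **nuts_kwargs)
        # Otherwise run a short warm-up with the diagonal scaling and
        # estimate a dense covariance matrix from those samples.
        warm = pm.sample(tune_interval,
                         step=pm.NUTS(scaling=diag, is_cov=True),
                         start=fit.means, progressbar=False)
        samples = np.stack([model.dict_to_array(warm.point(i))
                            for i in range(len(warm))])
        return pm.NUTS(scaling=np.cov(samples, rowvar=False), is_cov=True,
                       **nuts_kwargs)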

@fonnesbeck (Member)

I'm not convinced that initialization is the issue. The MAP estimate should be a fine place to start an MCMC run, yet this almost never works. I think we need a top-to-bottom review of the NUTS sampler, as a start. For example, I notice that the scaling speed-of-adaptation parameter (k) never even gets used by the algorithm. That's a little concerning.

@twiecki (Member, Author) commented Nov 9, 2016

I agree; most likely there's a bug in estimating the scaling using the Hessian.

@bob-carpenter

Bad adaptation is almost never our only problem. When initialization and adaptation are slow, mixing usually is, too. The adaptation algorithm's in the code and described in the manual. Often we wind up having to reparameterize models to reduce extreme ill-conditioning of the Hessian matrix.

I think there's still a lot of room for improvement. The goal is to get to the typical set as quickly as possible in order to spend time there adapting the mass matrix and step size. We were hoping black box mean-field variational inference would get us there quickly (get to posterior mean and take approximate draw [don't use posterior mean, which is often not in the typical set itself]), but it's not nearly robust enough yet for that purpose. Stan's HMC is intentionally conservative in its adaptation because if adaptation fails, so will sampling.

You may also want to check out the new version of NUTS that Michael put into Stan recently---it replaces the slice sampler with a discrete draw (multinomial with count 1) as described here:

https://arxiv.org/abs/1601.00225
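
In PyMC3 terms, the "approximate draw instead of the mean" idea would look roughly like this (a sketch against the pm.variational API of the time; model is a placeholder):

import pymc3 as pm

with model:  # placeholder pm.Model
    fit = pm.variational.advi(n=50000)
    # Take one approximate draw from the ADVI posterior and start there,
    # rather than at the posterior mean, which often lies outside the
    # typical set in high dimensions. hide_transformed=False keeps the
    # transformed variables NUTS needs in the start point.
    vp_trace = pm.variational.sample_vp(fit, draws=1, model=model,
                                        hide_transformed=False)
    trace = pm.sample(2000, start=vp_trace.point(0))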

twiecki mentioned this issue Nov 12, 2016
@springcoil (Contributor)

Is this resolved now, @twiecki, or is there still work to be done?

@twiecki (Member, Author) commented Nov 12, 2016

@bob-carpenter Thanks for your help. I have found that using ADVI's diagonal covariance estimate as the mass matrix works quite well in practice. Which regularizers do you use for the covariance estimation?

I read the paper on XHMC but thought it was more of a new version rather than a replacement for NUTS. Or are there insights (like the multinomial sample) that made it into the NUTS implementation?

@springcoil Let's leave it open for now.

@bob-carpenter

Our current NUTS (Stan 2.12) uses multinomial sampling instead of slice sampling. I'm pretty sure exhaustive HMC is implemented, but not exposed in the interfaces. Riemannian HMC is also implemented but not exposed yet.

The diagonal is regularized toward the unit diagonal, and if there's a dense mass matrix, then that's regularized toward the diagonal. I'd have to dive into the code to see exactly how it's implemented now. The regularization is obviously scale sensitive, and I don't think we expose any way to set the scales at the moment or provide inits. That's coming soon, after we refactor the basic back-end interface calls.
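
If memory serves, the diagonal shrinkage amounts to something like the following (constants as I recall them from the Stan source; illustrative only):

import numpy as np

def regularize_diag(sample_var, n):
    """Shrink per-dimension sample variances toward a small multiple of
    the unit diagonal; n is the number of adaptation samples."""
    w = n / (n + 5.0)  # weight on the data grows with sample size
    return w * sample_var + 1e-3 * (1.0 - w) * np.ones_like(sample_var)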

@twiecki (Member, Author) commented Nov 16, 2016

Is there any more detail on the multinomial sampling?

Also, in your view, will XHMC become a drop-in replacement for NUTS in Stan? From the paper it seems that the theoretical improvements do not always carry through empirically.

@bob-carpenter

It's a drop-in replacement at the API level, but we haven't replaced it because we haven't tested it.

The basic idea of the multinomial sampler is easy, but it has the same subtlety as basic NUTS in wanting to bias the selection toward the second half of the simulation. The details are in Michael Betancourt's paper and in the Stan code itself.
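
As a toy illustration of the selection step only (not Stan's implementation, which also does the biased progressive sampling across subtrees): each trajectory point is drawn with probability proportional to exp(-H), replacing the slice variable of original NUTS.

import numpy as np

def multinomial_select(neg_energies, rng=None):
    """Pick an index along a trajectory with probability proportional to
    exp(-H); neg_energies holds -H(theta, p) for each stored point."""
    rng = rng or np.random.default_rng()
    w = np.exp(neg_energies - neg_energies.max())  # stabilized weights
    return rng.choice(len(neg_energies), p=w / w.sum())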

@ferrine (Member) commented Nov 19, 2016

I'm writing a panel UserModel and found that NUTS is incredibly slow. Is that because I have very high dimensionality? I used the new ADVI initialization.

@twiecki (Member, Author) commented Nov 21, 2016

@ferrine Can you upload the model with example data?

@ferrine (Member) commented Nov 21, 2016

Model

from pymc3.models.linear import Glm    # from PR #1525
import pandas as pd


class FEPanel(Glm):
    def __init__(self, x, y, index, dummy_na=True, intercept=False, labels=None,
                 priors=None, vars=None, family='normal', name='', model=None):
        if not isinstance(x, pd.DataFrame):
            raise TypeError('Need Pandas DataFrame for x')
        if not isinstance(y, pd.Series):
            raise TypeError('Need Pandas Series for y')
        if not isinstance(index, (tuple, list)):
            index = [index]
        x = pd.get_dummies(
            x, columns=index, drop_first=not intercept, dummy_na=dummy_na
        )   # type: pd.DataFrame
        is_dummy = lambda s: any(s.startswith('%s_' % l) for l in index)
        self.dummies = list(filter(is_dummy, x.columns))
        self.not_dummies = list(set(x.columns) - set(self.dummies))
        new_priors = dict.fromkeys(
            self.dummies, self.default_intercept_prior
        )
        if priors is None:
            priors = dict()
        new_priors.update(priors)
        super(FEPanel, self).__init__(
            x, y, intercept=intercept, labels=labels,
            priors=new_priors, vars=vars, family=family, name=name, model=model
        )
    
    @classmethod
    def from_formula(cls, *args, **kwargs):
        raise NotImplementedError('Sorry')
    
    @property
    def dummies_vars(self):
        return [self[v] for v in self.dummies]

    @property
    def not_dummies_vars(self):
        return [self[v] for v in self.not_dummies]

Data

data = pd.read_csv('tests/data/testdata.csv')

Usage

import pymc3 as pm

with pm.Model() as model:
    g = FEPanel(data.iloc[:, 1:], data.iloc[:, 0], index=['Country'])
    trace = pm.sample(1000, n_init=12000)  # extremely slow once NUTS starts sampling

@twiecki (Member, Author) commented Nov 22, 2016

tests/data/testdata.csv does not seem to exist on that branch.

@ferrine (Member) commented Nov 23, 2016

It was in my repo; here is the raw link.

@twiecki (Member, Author) commented Nov 23, 2016

[screenshot of an RV from the model]

I'm not sure this RV should be in the model; in any case, it's what seems to make NUTS get stuck. Should it be in the model?

@ferrine (Member) commented Nov 23, 2016

Same issue without NaNs (2.73 it/s). Maybe it's just a bad test set.

@twiecki (Member, Author) commented Nov 23, 2016

I tried dropping NaNs too, but the RV above was still created, which made me think it was a model-creation issue.

@ferrine (Member) commented Nov 23, 2016

That's because dummy_na is True.

@twiecki (Member, Author) commented Nov 23, 2016

If I set that to False, I get 75 it/s and a clean trace.

@ferrine (Member) commented Nov 23, 2016

15 it/s was the best I got on my laptop :( I have a 13-inch MacBook Air.

@twiecki (Member, Author) commented Nov 23, 2016

But it converges fine otherwise?

@ferrine (Member) commented Nov 23, 2016

Yep.

ferrine mentioned this issue Nov 23, 2016
@twiecki (Member, Author) commented Nov 28, 2016

Closing as the ADVI initialization seems to have fixed most of the issues.

@citynorman

This problem went away after I did:

import os
# Must be set before theano/pymc3 (and thus MKL) is first imported;
# switches MKL to the GNU OpenMP threading layer, which avoids a known
# clash between Intel's and GNU's OpenMP runtimes.
os.environ["MKL_THREADING_LAYER"] = "GNU"
