
[WIP] Multinomial sampling for nuts #1769


Merged
merged 4 commits from multinomial-nuts into pymc-devs:master on Feb 24, 2017

Conversation

@aseyboldt (Member) commented Feb 10, 2017

The change is surprisingly small (just the last commit; the others are from #1757), but it is easy to make a subtle mistake here. This could use some careful review ;-)

@twiecki (Member) commented Feb 10, 2017

Very cool. Have you compared multinomial to slice in any way? The Stan guys say it's more efficient, and I trust them; just curious.

@@ -12,6 +12,7 @@


 def bern(p):
+    assert np.isfinite(p)
Member

What are you checking for here? Generally we should avoid assertions in favor of explicitly raising an exception.

Member Author

That slipped in from debugging... But it might make sense to check for nan; infs should be fine.
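For illustration, a minimal sketch of the suggested change, assuming bern draws a uniform sample under the hood (the exception type is a guess):

    import numpy as np

    def bern(p):
        # Raise explicitly instead of asserting; a nan probability indicates
        # a bug upstream, while inf/-inf just saturate the comparison.
        if np.isnan(p):
            raise FloatingPointError("Expected a probability, got nan.")
        return np.random.uniform() < p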

@@ -229,8 +226,8 @@ def _build_subtree(self, left, depth, epsilon):
         self.max_energy_change = energy_change
         p_accept = min(1, np.exp(-energy_change))

-        size = int(self.log_u + energy_change <= 0)
-        diverging = not (self.log_u + energy_change < self.Emax)
+        size = - energy_change
Member

nit: space between operator and energy_change

         self.step_size = step_size
         self.Emax = Emax
         self.start_energy = np.array(start.energy)

         self.left = self.right = start
         self.proposal = Proposal(start.q, start.energy, 1.0)
         self.depth = 0
-        self.size = 1
+        self.size = 0
Member

rename to log_size or similar?

@@ -242,15 +239,15 @@ def _build_subtree(self, left, depth, epsilon):

         tree2, diverging, turning = self._build_subtree(tree1.right, depth - 1, epsilon)

-        size = tree1.size + tree2.size
+        size = np.logaddexp(tree1.size, tree2.size)
Member

cool!
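A quick illustration of why the log-space sum matters (toy numbers):

    import numpy as np

    a, b = -800.0, -801.0          # log-sizes of two subtrees
    print(np.exp(a) + np.exp(b))   # 0.0 -- both terms underflow
    print(np.logaddexp(a, b))      # about -799.69 -- exact in log space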

         accept_sum = tree1.accept_sum + tree2.accept_sum
         n_proposals = tree1.n_proposals + tree2.n_proposals

         left, right = tree1.left, tree2.right
         span = np.sign(epsilon) * (right.q - left.q)
         turning = turning or (span.dot(left.p) < 0) or (span.dot(right.p) < 0)

-        if bern(tree2.size * 1. / max(size, 1)):
+        if bern(np.exp(tree2.size - size)):
Member

nothing needed here/yet, but it looks like we might just refactor bern to work in log space, since that's how we're always using it now.
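A sketch of what that refactor could look like (hypothetical name, not part of this PR):

    import numpy as np

    def bern_log(log_p):
        # Accept with probability exp(log_p); log_p >= 0 always accepts,
        # and no exp() underflow can occur for very negative log_p.
        return np.log(np.random.uniform()) < log_p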

@aseyboldt (Member Author)

@ColCarroll sounds all fine and should be fixed now.

@twiecki (Member) commented Feb 10, 2017

======================================================================
FAIL: Check effective sample size is equal to number of samples when initializing with MAP
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/pymc-devs/pymc3/pymc3/tests/test_diagnostics.py", line 160, in test_effective_n
    assert_allclose(n_effective, n_jobs * n_samples, 2)
  File "/home/travis/miniconda2/envs/testenv/lib/python2.7/site-packages/numpy/testing/utils.py", line 1392, in assert_allclose
    verbose=verbose, header=header)
  File "/home/travis/miniconda2/envs/testenv/lib/python2.7/site-packages/numpy/testing/utils.py", line 739, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=2, atol=0
(mismatch 20.0%)
 x: array([  300., -1577.,   300.,   300.,   300.])
 y: array(300)
-------------------- >> begin captured stdout << ---------------------
Optimization terminated successfully.
         Current function value: 4.594693
         Iterations: 0
         Function evaluations: 1
         Gradient evaluations: 1
--------------------- >> end captured stdout << ----------------------

@aseyboldt (Member Author) commented Feb 10, 2017 via email

@twiecki (Member) commented Feb 11, 2017

@aseyboldt Fix is merged, need to rebase.

@aseyboldt force-pushed the multinomial-nuts branch 2 times, most recently from 5eea3c3 to b90e3dc on February 11, 2017 23:14
@aseyboldt (Member Author)

The previous version did not handle cases where the energy is -inf correctly. Initially I thought the "diverging" check would account for this, but even if we are diverging, we still might check bern(tree2.log_size - log_size). But this is nan if log_size is inf.
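The failure mode in two lines (values chosen for illustration):

    import numpy as np

    log_size = np.inf                  # e.g. after an energy of -inf
    tree2_log_size = np.inf
    print(tree2_log_size - log_size)   # nan -- inf minus inf is undefined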

@aseyboldt force-pushed the multinomial-nuts branch 2 times, most recently from b42666a to e1623a4 on February 13, 2017 17:47
@aseyboldt changed the title from "Multinomial sampling for nuts" to "[WIP] Multinomial sampling for nuts" on Feb 14, 2017
@twiecki (Member) commented Feb 16, 2017

@aseyboldt Anything other than rebase you think is missing here?

@aseyboldt (Member Author)

Yes, two things:

  • The current version handles divergence a bit differently than Stan. Stan only aborts the trajectory if the Hamiltonian decreases a lot (in that case the acceptance probability is pretty much 0 anyway); this code aborts if it decreases or increases a lot. Neither should happen if the step size is small enough. I guess we should do the same thing as Stan here (a toy contrast is sketched below)? But I still need to go through the code and make sure nothing strange happens if something is inf or nan.
  • Trouble with the normal termination criterion. I think it should somehow use the mass matrix, but I haven't figured it out yet. I guess this wouldn't need to go in this PR.
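For concreteness, a toy contrast of the two abort rules from the first point. The sign convention is assumed from the p_accept = min(1, np.exp(-energy_change)) line in the diff above (so a large positive energy_change means a near-zero acceptance probability), and the threshold value is made up:

    Emax = 1000.0  # divergence threshold; the value here is made up

    def diverging_both_directions(energy_change):
        # behaviour described for this PR: abort on a large energy error
        # in either direction
        return abs(energy_change) > Emax

    def diverging_stan_style(energy_change):
        # behaviour described for Stan: abort only in the direction where
        # the acceptance probability min(1, exp(-energy_change)) collapses
        return energy_change > Emax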

@twiecki (Member) commented Feb 20, 2017

FWIW it seems sensible to me to abort if the Hamiltonian increases by a lot too, but when in doubt we should copy Stan.

@aseyboldt (Member Author)

This is ready for review again. I also included the termination criterion that takes the metric into account (see [1]). As far as I can see, this should now do pretty much the same as Stan.

[1] Betancourt, Michael. “Identifying the Optimal Integration Time in Hamiltonian Monte Carlo.” arXiv:1601.00225 [Stat], January 2, 2016. http://arxiv.org/abs/1601.00225.
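For intuition, a minimal numpy sketch of the generalized criterion from [1], mirroring the p_sum check that appears in a diff further down (shapes assumed; v stands for the velocity M^-1 p):

    import numpy as np

    def is_turning(p_sum, v_left, v_right):
        # Generalized U-turn check: p_sum is the sum of momenta along the
        # subtree, v_* = M^-1 p at its leftmost/rightmost states; terminate
        # once p_sum no longer points along the velocity at either end.
        return p_sum.dot(v_left) <= 0 or p_sum.dot(v_right) <= 0

    # Example: the two ends move in opposite directions -> turning.
    print(is_turning(np.array([0.1, 0.0]),
                     np.array([-1.0, 0.0]), np.array([1.0, 0.0])))  # True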

     f.trust_input = True
     return f


 INTEGRATORS_SINGLE = {
     'leapfrog': _theano_single_leapfrog,
     'two-stage': _theano_single_twostage,
-    'three-stage': _theano_single_threestage
+    'three-stage': _theano_single_threestage,
+    'leapfrog3': _theano_single_leapfrog3,
Member

What is this new integrator?

Member Author

oh, that one slipped in by accident. It does three leapfrog steps in a row. I'm using that to compare three-stage and leapfrog (more on that later). I'll take it out of this pull request.

@@ -132,6 +137,7 @@ def check_trace(self, step_method):
         trace = sample(n_steps, step=step_method(), random_seed=1)

         if not benchmarking:
+            print(trace.get_values('x'))
Member

stray debug statement.

@twiecki (Member) commented Feb 21, 2017

@ColCarroll can you also take a look at this?

@ColCarroll (Member) left a comment

looks really good! will try to look up the answers to my own questions tonight...

@@ -178,17 +182,17 @@ def competence(var):


 # A node in the NUTS tree that is at the far right or left of the tree
-Edge = namedtuple("Edge", 'q, p, q_grad, energy')
+Edge = namedtuple("Edge", 'q, p, v, q_grad, energy')

 # A proposal for the next position
 Proposal = namedtuple("Proposal", "q, energy, p_accept")

 # A subtree of the binary tree build by nuts.
Member

typo: 'built'


size = tree1.size + tree2.size
accept_sum = tree1.accept_sum + tree2.accept_sum
Member

move accept_sum and n_proposal closer to where they're used

             return tree, diverging, False

         tree1, diverging, turning = self._build_subtree(left, depth - 1, epsilon)
         if diverging or turning:
             return tree1, diverging, turning

         tree2, diverging, turning = self._build_subtree(tree1.right, depth - 1, epsilon)
+        ok = not (diverging or turning)
Member

this is super descriptive, but it might make reading the code easier to just move it into the if statement (if not (diverging or turning):)

@@ -123,7 +124,7 @@ def __init__(self, A):
         self.L = scipy.linalg.cholesky(A, lower=True)

     def velocity(self, x):
-        return x.T.dot(self.A.T)
+        return tt.dot(self.A, x)
Member

👍

@@ -57,6 +57,13 @@ def _theano_energy_function(H, q, **theano_kwargs):
     return energy_function, p


+def _theano_velocity_function(H, p, **theano_kwargs):
+    v = H.pot.velocity(p)
Member

it looks like H.pot.velocity is already a theano function, but it also looks like you're using the velocity calculation separately -- is it necessary to have it attached to the named tuple H as well as calculated separately?

Member Author

Not sure what you mean. We need the velocity at the start of the trajectory. Later values are returned by leapfrog. This is pretty much the same as with the gradient.

Member

What I am getting at is that wherever you call self.compute_velocity, you can call H.pot.velocity. I dug some more: H.pot.velocity calls one of the functions in quadpotential.py and multiplies the momentum by a scaling constant or matrix. This should be pretty performant, but I haven't tested whether compiling to theano makes it faster. Just considering the surface area of the code, I'd rather not add another function though.

Separately! We pass the gradient through since that's a fairly expensive operation. Shouldn't velocity be resampled at the start of each leapfrog step, so returning it doesn't help things? I flagged another comment on the one place where it might be useful, but I don't know why it is useful there!
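For concreteness, a toy numpy illustration of what the quadpotential velocity amounts to, as described above (values made up; A plays the role of the inverse mass matrix):

    import numpy as np

    A = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # assumed inverse mass matrix
    p = np.array([0.3, -1.2])    # momentum
    v = A.dot(p)                 # velocity v = M^-1 p, a plain mat-vec product
    print(v)                     # [ 0.   -1.05]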

     return f


+def _theano_single_leapfrog3(H, q, p, q_grad, **theano_kwargs):
Member

_theano_triple_leapfrog? I don't love integers in function names.

Member Author

That was in there by mistake. I just used this to compare leapfrog with three-stage. I took it out.

@aseyboldt (Member Author)

@ColCarroll done

@ColCarroll (Member) left a comment

Sorry to belabor this -- the only issue left I think is more about my lack of understanding than code issues.

-        span = right.q - left.q
-        turning = turning or (span.dot(left.p) < 0) or (span.dot(right.p) < 0)
+        p_sum = self.p_sum
+        turning = (p_sum.dot(left.v) <= 0) or (p_sum.dot(right.v) <= 0)
Member

Is this equivalent? It looks different from the version in the paper.

Member Author

This is different. The old implementation is actually somewhat wrong: it does not take the metric into account, but assumes a standard Euclidean space. This was an assumption in the original paper, but in general we need to use the scalar product defined by the inverse mass matrix (see the old Stan version). Stan switched to a new version recently (I linked a paper describing it somewhere in the comments – Ctrl-F Betancourt). As far as I understand, that was mostly to make Riemannian HMC easier.

Member Author

Oh, and thanks for double checking :-)


@ColCarroll (Member)

Beautiful, thank you. I'll merge when tests pass again -- quadpotential.py could be turned into all theano functions, and then we could refactor this to get rid of the separate compute_velocity function, but I'm not too worried about that.

@aseyboldt (Member Author)

About the velocity: I thought it would be better to put this in the theano graph of the leapfrog, because I'm hoping that some models might be faster on the GPU, and especially a matrix-vector product would be much faster there. For the initial velocity that is not an option, so I made an extra function for that. We could just call it directly without going through theano, but then the quadpotential would need to work both as a theano graph and as a normal function. That's not impossible, but I thought it would be confusing.
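A minimal sketch of the design being described (names hypothetical, nothing beyond standard theano assumed): the same matrix-vector product, but compiled as a theano graph so it can run on the GPU alongside the leapfrog:

    import numpy as np
    import theano
    import theano.tensor as tt

    A = tt.matrix('A')   # stands in for the (inverse) mass matrix
    p = tt.vector('p')   # momentum
    velocity = theano.function([A, p], tt.dot(A, p))

    print(velocity(np.eye(2), np.array([1.0, -0.5])))  # [ 1.  -0.5]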

@ColCarroll (Member) commented Feb 24, 2017 via email

@ColCarroll merged commit f73d2ff into pymc-devs:master on Feb 24, 2017
@ColCarroll (Member)

thanks again!
