Text and SQLite backends for PyMC3 #449

Closed
kyleam wants to merge 18 commits from the pymc3-backends branch

Conversation

@kyleam (Contributor) commented Jan 7, 2014

This is not ready to merge, but I'd appreciate any feedback.

All tests are passing locally. See backends/__init__.py and the 2e80cef commit message for documentation.

kyleam added 11 commits January 6, 2014 11:08
This is only needed for Python 2 because mock is in the stdlib for Python 3
(unittest.mock).
To avoid any confusion with the function pymc.sample, which is imported
from the pymc.sample module in __init__.py
This commit contains a new backend for sampling and selecting values.
Non-backend files have been changed to work with the new backend.
Everything seems to be working with the exception of two issues (marked
with FIXME):

1. pymc.plots.forestplot has not been updated yet for the new backend.
2. The previous behavior of passing a trace object to sample is not the
   same. I updated stochastic_volatility to do this with the same trace
   object.

This commit also introduces a change to `sample`/`psample`. Instead of
having separate functions, `sample` now takes a keyword argument
`threads`, and if this is over one, the multiprocessing version is used.

The method for selecting values has also been changed. Traces can still
be indexed to return values, a new slice, or a point (depending on the
index), but the handling of chains is different. The trace object now
manages multiple chains itself instead of having a separate class to
manage the single trace object. `get_values` is the main method for
selecting values. By default, it returns separate results for all the
chains. The chains can be combined with the `combine` flag, and
particular chains can be selected with the `chains` argument.

The motivation for both sample and selection changes above was to have a
unified interface for dealing with multiple chains, as most people are
likely going to take advantage of the parallel sampling.
Wraps `enumerate` in a progress bar update. This allows checking the
progress bar flag once and choosing `enumerate` or `enumerate_progress`
as the iteration function, rather than checking the flag on each iteration.
This test compares the selection methods for NDArray and SQLite traces.
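A rough usage sketch of the interface described in the commit message above (the model, step method, and numbers are purely illustrative, and the distribution arguments may not match this revision exactly; only `threads`, `get_values`, `combine`, and `chains` come from the commit message):

```python
import pymc as pm

with pm.Model():
    mu = pm.Normal('mu', mu=0., tau=1.)
    step = pm.Metropolis()
    # threads > 1 switches to the multiprocessing sampler (formerly psample)
    trace = pm.sample(1000, step, threads=4)

per_chain = trace.get_values('mu')               # one result per chain
combined = trace.get_values('mu', combine=True)  # chains concatenated
subset = trace.get_values('mu', chains=[0, 1])   # only selected chains
```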
@twiecki (Member) commented Jan 7, 2014

This looks really interesting but will certainly require a proper code review.

@kyleam (Contributor, author) commented Jan 7, 2014

Absolutely. It touches a lot (which I think is inevitable) and makes
decisions that should be discussed (although I have tried to keep the
front-end trace operations mostly the same).

@fonnesbeck (Member):
Thanks for getting this started. A sqlite backend is definitely something we want. I used it extensively with PyMC 2.

@@ -1,202 +0,0 @@
from .point import *
Member:

Would you please keep the filename the same for the PR? It makes it hard to see the changes.

kyleam (author):

Yes, good point. I put this into a separate commit before making any
changes to the code, which means that it diffs cleanly within the
branch, but the diff between branches won't be very useful. I'll push
changes reverting this commit (as well as the commit moving the summary
function to the stats module).

However, bringing back the name conflict between the sample module and
the function results in not being able to import the sample module
directly (meaning that only the functions from sample.py imported into
__init__.py can be tested), so I think it is worth renaming sample.py
back to sampling.py at some point.

@jsalvatier (Member):
I'm excited to take a look! Having more backends would be great for users! Thanks for taking the time to put this together.

from pymc.model import modelcontext


class Backend(object):
Member:

In general, I think this class provides too much structure for the different backends, and I wasn't expecting that there would be a shared base class for all backends at all (though maybe there's enough shared code that you really want one).

I would prefer to have a very simple interface for the backends; mainly, they implement a `record` method and perhaps a `close` method.

Member:

On second thought, there is at least a significant subset of backends that will have significant shared code: backends with a distinct container for each variable (let's call those 'container backends'), like an NdArray backend or SQL backend. However, I still think this class can be made significantly simpler.

kyleam (author):

Thanks for your comments. I will try to incorporate your suggestions.

Member:

What is the difference that we're trying to draw between a Trace and a Backend? It looks to me like Trace and Backend should be the same class. The object that is responsible for storing the points should also be responsible for retrieving them.

Member:

It also seems to me that a trace with multiple chains should just be a container for multiple traces, and the logic for individual chains should be in separate classes. Composition is better than just adding features to a class.

Is the reason why even the basic traces deal with 'chains' because the SQL db has a shared resource between its chains?

kyleam (author):

> What is the difference that we're trying to draw between a Trace and a
> Backend? It looks to me like Trace and Backend should be the same
> class. The object that is responsible for storing the points should
> also be responsible for retrieving them.

In an earlier version, I experimented with using one class, and we could
move that way again, but there are a few reasons I think two classes are
a good idea.

  • It makes the trace interface more consistent and less cluttered for
    users. By returning a separate trace object for retrieval, users have
    an object with methods that only concern what they'll be using it for
    (selecting and viewing values). They don't deal with the storage
    object directly (sample does).
  • I think having separate classes for storage versus retrieval makes it
    clearer when defining custom backends which methods should be
    overridden for each class. For a working backend, only the record
    method of the storage backend must be defined. The trace class just
    provides a way to make the custom backend have a consistent interface
    for accessing values.
  • If put in one class, it's pretty large, so dividing the
    responsibilities between storing and retrieving values seemed like a
    natural division.

kyleam (author):

> It also seems to me that a trace with multiple chains should just be a
> container for multiple traces, and the logic for individual chains
> should be in separate classes.

I was thinking the other way around: a single-chain trace should just be
a "multiple" trace object with one chain. This way, all traces are
handled the same, and functions like traceplot don't need to check
whether the trace is an instance of the single or multiple trace class.

Also, when using non-memory backends, the distinction between single and
multiple trace isn't as clear to me because this will be made at the
level of the call to the database.

Is there a specific advantage to using a container object that you have
in mind?

Member:

> Is there a specific advantage to using a container object that you have
> in mind?

I think I'm mostly thinking of code clarity, extendability and composability. I've found this code somewhat challenging to read and understand and that concerns me.

> It makes the trace interface more consistent and less cluttered for
> users.

That's fair. That is a benefit.

> I think having separate classes for storage versus retrieval makes it
> clearer when defining custom backends which methods should be
> overridden for each class.

> This way, all traces are handled the same, and functions like traceplot
> don't need to check whether the trace is an instance of the single or
> multiple trace class.

I definitely agree a common interface to traces is a good idea. There are other ways to achieve this, for example, by always returning the container trace class.

> Also, when using non-memory backends, the distinction between single and
> multiple trace isn't as clear to me because this will be made at the
> level of the call to the database.

True, but in psample, each process still needs an individual object to record its trace.

kyleam added 4 commits January 9, 2014 11:59
Summary was not updated to reflect the variable name change (which is
caught by 'test_summary_2dim_value_model').
This reverts commit 6dda7e1.

This is being reverted to make it easier to compare changes with the
master branch.

Conflicts:
        pymc/__init__.py
        pymc/stats.py
        pymc/tests/test_trace.py
        pymc/trace.py
This reverts commit 683507b.

This is being reverted to make it easier to compare changes with the
master branch.

However, this reintroduces the naming conflict that the commit fixed.
'pymc.sample' now refers to the function, but the module of the same
name cannot be accessed. This makes it difficult to test anything other
than the top-level functions that are imported into pymc/__init__.py
(sample and iter_sample).

Conflicts:
        pymc/__init__.py
Changes in this commit are intended to simplify the backend storage
class.

- Reduce the number of storage class methods that need to be overridden.
  Now only `record` must be defined. During sampling, `setup` and
  `close` are also called, so the object should have these methods, but
  they do not need to do anything.

- Sampling returns the storage object's `trace` attribute. In all the
  backends provided, this is a base.Trace subclass that defines value
  access methods specific to that backend.

- As long as the methods above are defined, the storage object will
  work. This gives more flexibility to implement the backend storage
  class, so long as the `record` method properly stores the values.
  However, the setup in backends.base.Backend.__init__ should be useful
  to most backends (because of access to model information).

- The load functions have been modified so that they only work if a
  model is supplied or if within model context, which removes the option
  to load the values but not connect to the model.

- The base Trace object still provides a lot of structure. This is meant
  to help create a child Trace object for a backend that behaves the
  same as Traces from other backends. This means that the user can
  select values in the same way regardless of the backend. However, it
  is still possible for the user to create a very different Trace
  backend (assigned to the sampling object's trace attribute), as long as
  it has a `merge_chains` method to combine the results from parallel
  sampling.
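To make the contract above concrete, a minimal hypothetical backend might look roughly like this (the DictBackend/DictTrace names and the dict-of-lists storage are invented for the example, and the base-class constructor signatures are assumptions rather than the actual pymc.backends.base API):

```python
import numpy as np
from pymc.backends import base


class DictTrace(base.Trace):
    """Value-access side: a real backend would override the selection
    methods here so values can be selected the same way as with other
    backends."""


class DictBackend(base.Backend):
    """Storage side: only `record` strictly has to do anything."""

    def __init__(self, var_names=None, model=None):
        super(DictBackend, self).__init__(var_names, model)  # assumed signature
        self.samples = {name: [] for name in self.var_names}
        self.trace = DictTrace(self.var_names, backend=self)

    def setup(self, draws, chain):
        # Called before sampling; nothing to preallocate for this backend.
        self.chain = chain

    def record(self, point):
        # The one required method: store the values for the current draw.
        for name in self.var_names:
            self.samples[name].append(np.asarray(point[name]))

    def close(self):
        # Called after sampling; no cleanup needed for in-memory storage.
        pass
```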
@kyleam (Contributor, author) commented Jan 10, 2014

I've pushed changes to make the storage class simpler. See the commit
message and the main pymc.backends docstring for more details.

Most of the suggestions have been incorporated. The main one that hasn't
is making a container class for each variable. I don't understand the
benefit of having a class for each variable. It seems cleaner to me if
the storage class encapsulates all the variables. Otherwise, you'd need
another class to manage all the variable container objects. In the
master branch, NpTrace is acting similarly to the backend storage classes
here. A separate container object could be made to hold a variable (like
ListArray), but instead the NDArray backend is creating an array of
zeros. This avoids having to extend the array each iteration, but it
does require passing in the intended number of draws to set up the array.

In addition to a record and close method, there is a setup method that
receives the expected number of draws and the chain number. I think it
is likely to be useful to most custom backends for setting up storage
structure (for example, the SQLite backend has to know the chain number ahead of
time). Also, having a setup function to create the storage structure
outside the class init means that a single instance can be passed to
sample, and, since setup is called separately, the storage class can
receive different chain numbers from each thread.

Please let me know your thoughts.

    active_chains : list of ints
        Values from chains to be used in operations
    """
    def __init__(self, var_names, backend=None):
Member:

minor: why not just 'vars'?

kyleam (author):

I should definitely go through this and change it to be consistent.
Should it be varnames instead of vars because they're strings, or are
you using vars for both? Also, any concern that vars is a built-in?

@jsalvatier (Member):
My notion was that Backend should look basically like NpTrace except that instead of calling ListArray, it calls a function that is passed to it which creates a single variable backend. That might be ListArray or something like list array for SQLite or something like it for Text.
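A rough sketch of that suggestion, with the per-variable container produced by a factory passed in (all names here are illustrative, not actual PyMC code):

```python
class Backend(object):
    """NpTrace-like trace that delegates per-variable storage to whatever
    single-variable container the factory creates, e.g. ListArray for the
    in-memory backend, or an SQLite- or text-file-backed equivalent."""

    def __init__(self, var_names, container_factory):
        self.containers = {name: container_factory(name) for name in var_names}

    def record(self, point):
        for name, container in self.containers.items():
            container.append(point[name])
```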

## Make array of zeros for each variable
var_arrays = {}
for var_name, shape in self.var_shapes.items():
    var_arrays[var_name] = np.zeros((draws, ) + shape)
Member:

The reason we had ArrayList was to support doing more sampling even after you've filled the original number of samples. Does that still work here?

kyleam (author):

Good point. This is something I had marked with FIXME. With the changes
I just pushed, you should be able to extend previous chains for all the
backends.

When the interrupt cleanup was merged into NDArray's close, a call to
NDArray's close should have been added here so that the interrupt is
cleaned up for Text too.
This was off by one, which resulted in unnecessary slicing (although the
result is the same).
This commit enables sampling to extend a chain from a previous call.
This is done for the NumPy array by concatenating more zeros and setting
the draw index to the right position. For SQLite, this involves setting
the draw index to the right position for the given chain.
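A bare-bones illustration of the NumPy side of that change (the shapes and names are made up for the example):

```python
import numpy as np

samples = np.zeros((500, 2))   # array allocated for the first call to sample
draw_idx = 500                 # all 500 rows are filled after the first run

# To extend the chain, concatenate more zeros and keep writing at draw_idx.
more_draws = 200
samples = np.concatenate([samples,
                          np.zeros((more_draws,) + samples.shape[1:])])
assert samples.shape == (700, 2)
```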
@kyleam (Contributor, author) commented Jan 14, 2014

It should be easy to move the structure underlying the numpy backend to ListArray objects. My reasoning for creating an array of zeros instead was that appending to a list would be slower for storage and concatenating during each selection would also be slower (a few tests here).
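A rough timing comparison along those lines (the numbers depend on the machine; this only illustrates the trade-off and does not reproduce the linked tests):

```python
import timeit

setup = "import numpy as np; draws = 10000; val = np.ones(10)"

preallocated = timeit.timeit(
    "a = np.zeros((draws, 10))\n"
    "for i in range(draws):\n"
    "    a[i] = val",
    setup=setup, number=100)

appended = timeit.timeit(
    "a = []\n"
    "for i in range(draws):\n"
    "    a.append(val)\n"
    "np.array(a)",
    setup=setup, number=100)

print("preallocated: %.3fs  list-append: %.3fs" % (preallocated, appended))
```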

@jsalvatier (Member):
Hey, I just wanted to let you know that I like what you've done a lot. You've clearly put a lot of thought and effort into this, and I'm excited to have this functionality. I haven't responded as much as I want to because I haven't had much time at all, and I haven't figured out my thoughts on what the design should look like. Sorry about that. I hope to be able to spend more time on this sometime next week.

@kyleam (Contributor, author) commented Jan 15, 2014

No problem. Thanks for the feedback you've given so far.

"""
self.chain = chain
## Concatenate new array if chain is already present.
if chain in self.trace.samples:
Member:

the basic NDArray backend shouldn't know anything about what number chain it is.

kyleam (author):

> the basic NDArray backend shouldn't know anything about what number
> chain it is.

The NDArray could be rewritten to do without this, but I don't think
out-of-memory backends could do parallel sampling without knowing this
(and, even for NDArray, this is used to restart the same chain,
especially with parallel sampling).

Please let me know your thoughts. I'm happy to try different ideas and
rework this.

@kyleam (Contributor, author) commented Feb 9, 2014

Thanks for your feedback. (Moving this out of the inline comments
because it's quite general.)

To briefly summarize the current system:

  1. Sampling uses the backend storage object (NDArray, SQLite...).
    Each chain has its own instance of the storage backend,
    distinguished by its chain number.
  2. Following sampling, a trace object is returned. If multiple chains
    were sampled, they are merged into a single multi-chain object
    before being returned.
  3. This returned trace serves as the selection interface for the user.

The two main issues currently raised are that

  1. The multi-chain trace object should consist of single-chain trace
    objects.
  2. Storage and selection should be in the same class (instead of the
    storage class returning the trace class after sampling).

> Is there a specific advantage to using a container object that you
> have in mind?

> I think I'm mostly thinking of code clarity, extendability and
> composability.

I certainly agree these are important, but it's not obvious to me that
having single-chain objects and a multi-chain container improves these.

For me it comes back to the idea that it is more consistent and cleaner
to treat a trace with a single chain simply as a multi-chain object with
one chain than to have a single-chain class and then another class to
manage those objects. Assuming we want to consistently present the
multi-chain interface for selection (as discussed below), it seems
unnecessarily complicated to give the individual chains a class with
their own selection methods when standard objects will do (for NDArray,
a dictionary of values keyed by the chain, or for SQLite, a list of
chain numbers).

I'm not strongly opposed to reworking this to use single-chain objects
(as I propose below), but I wanted to clarify why I currently favor just
using multi-chain objects.

> This way, all traces are handled the same, and functions like
> traceplot don't need to check whether the trace is an instance of the
> single or multiple trace class.

> I definitely agree a common interface to traces is a good idea. There
> are other ways to achieve this, for example, by always returning the
> container trace class.

I'd argue that this is already what is happening: the trace object
(which is a multi-chain object) is always returned.

> I've found this code somewhat challenging to read and understand and
> that concerns me.

Yes, that concerns me too, especially since it should be relatively easy
for a user to implement a new backend.

I propose the following changes to move towards what you are suggesting.

  1. Add the single-chain trace object functionality as part of the
    current storage backend class. The value selection for the backend
    will be defined here.
  2. Make the multi-chain object take a group of single-chain objects. It
    will use the single-chain value selection methods to select values.
    This will be the object that is always returned from sampling.

With this setup, adding a new backend would mean subclassing object 1
and defining the storage and single-chain selection methods. The
multi-chain object shouldn't need to be subclassed, because it uses the
shared single-chain methods to get the values.

This should address both issue 1 and 2 above. There would now be
single-chain objects managed by a container object (1), and storage and
single-chain objects are no longer separate (2).

What do you think? If these are heading in the direction you were
suggesting, I'll put this together on another branch.
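For concreteness, a very rough outline of the proposed structure (class and method names are placeholders, not the eventual implementation):

```python
import numpy as np


class NDArrayChain(object):
    """Proposal item 1: one object per chain that both stores draws
    (`record`) and selects values (`get_values`)."""

    def __init__(self, var_names):
        self.samples = {name: [] for name in var_names}

    def record(self, point):
        for name, values in self.samples.items():
            values.append(point[name])

    def get_values(self, var_name):
        return np.asarray(self.samples[var_name])


class MultiTrace(object):
    """Proposal item 2: container over single-chain objects; this is what
    sampling would always return."""

    def __init__(self, chains):
        self.chains = list(chains)

    def get_values(self, var_name, combine=False):
        values = [chain.get_values(var_name) for chain in self.chains]
        return np.concatenate(values) if combine else values
```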

@jsalvatier (Member):
Your proposal sounds great to me. Thank you for your persistence and your patience with me, Kyle.

> The multi-chain object shouldn't need to be subclassed, because it uses
> the shared single-chain methods to get the values.

I think you will need to subclass the multi-chain class in order to override the __init__ in cases where the subchains have a shared resource (for example, in the SQL trace).

@kyleam closed this Mar 1, 2014
@kyleam deleted the pymc3-backends branch May 25, 2014