Text and SQLite backends for PyMC3 #449

Closed
kyleam wants to merge 18 commits from the pymc3-backends branch

Conversation

@kyleam (Contributor) commented Jan 7, 2014

This is not ready to merge, but I'd appreciate any feedback.

All tests are passing locally. See backends/__init__.py and the 2e80cef commit message for documentation.

kyleam added 11 commits January 6, 2014 11:08
This is only needed for Python 2 because mock is in the stdlib for Python 3
(unittest.mock).
To avoid any confusion with the function pymc.sample, which is imported
from the pymc.sample module in __init__.py
This commit contains a new backend for sampling and selecting values.
Non-backend files have been changed to work with the new backend.
Everything seems to be working with the exception of two issues (marked
with FIXME):

1. pymc.plots.forestplot has not been updated yet for the new backend.
2. The previous behavior of passing a trace object to sample is not the
   same. I updated stochastic_volatility to do this with the same trace
   object.

This commit also introduces a change to `sample`/`psample`. Instead of
having separate functions, `sample` now takes a keyword argument
`threads`, and if this is over one, the multiprocessing version is used.

The method for selecting values has also been changed. Traces can still
be indexed to return values, a new slice, or a point (depending on the
index), but the handling of chains is different. The trace object now
manages multiple chains itself instead of having a separate class to
manage the single trace object. `get_values` is the main method for
selecting values. By default, it returns separate results for all the
chains. The chains can be combined with the `combine` flag, and
particular chains can be selected with the `chains` argument.

The motivation for both sample and selection changes above was to have a
unified interface for dealing with multiple chains, as most people are
likely going to take advantage of the parallel sampling.
Wraps `enumerate` in a progress bar update. This allows checking the
progress bar flag once and choosing `enumerate` or `enumerate_progress`
as the iteration function, rather than checking the flag on each iteration.
This test compares the selection methods for NDArray and SQLite traces.
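A rough usage sketch of the interface described in the commit message above (the model, step method, and numbers are purely illustrative, and the distribution arguments may not match this revision exactly; only `threads`, `get_values`, `combine`, and `chains` come from the commit message):

```python
import pymc as pm

with pm.Model():
    mu = pm.Normal('mu', mu=0., tau=1.)
    step = pm.Metropolis()
    # threads > 1 switches to the multiprocessing sampler (formerly psample)
    trace = pm.sample(1000, step, threads=4)

per_chain = trace.get_values('mu')               # one result per chain
combined = trace.get_values('mu', combine=True)  # chains concatenated
subset = trace.get_values('mu', chains=[0, 1])   # only selected chains
```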
@twiecki (Member) commented Jan 7, 2014

This looks really interesting but will certainly require a proper code review.

@kyleam (Contributor, author) commented Jan 7, 2014

Absolutely. It touches a lot (which I think is inevitable) and makes
decisions that should be discussed (although I have tried to keep the
front-end trace operations mostly the same).

@fonnesbeck (Member):
Thanks for getting this started. A sqlite backend is definitely something we want. I used it extensively with PyMC 2.

@@ -1,202 +0,0 @@
from .point import *
Member:

Would you please keep the filename the same for the PR? It makes it hard to see the changes.

kyleam (author):

Yes, good point. I put this into a separate commit before making any
changes to the code, which means that it diffs cleanly within the
branch, but the diff between branches won't be very useful. I'll push
changes reverting this commit (as well as the commit moving the summary
function to the stats module).

However, bringing back the name conflict between the sample module and
the function results in not being able to import the sample module
directly (meaning that only the functions from sample.py imported into
__init__.py can be tested), so I think it is worth renaming sample.py
back to sampling.py at some point.

@jsalvatier (Member):
I'm excited to take a look! Having more backends would be great for users! Thanks for taking the time to put this together.

from pymc.model import modelcontext


class Backend(object):
Member:

In general, I think this class provides too much structure for the different backends, and I wasn't expecting that there would be a shared base class for all backends at all (though maybe there's enough shared code that you really want one).

I would prefer to have a very simple interface for the backends; mainly, they implement a `record` method and perhaps a `close` method.

Member:

On second thought, there is at least a significant subset of backends that will have significant shared code: backends with a distinct container for each variable (let's call those 'container backends'), like an NdArray backend or SQL backend. However, I still think this class can be made significantly simpler.

kyleam (author):

Thanks for your comments. I will try to incorporate your suggestions.

Member:

What is the difference that we're trying to draw between a Trace and a Backend? It looks to me like Trace and Backend should be the same class. The object that is responsible for storing the points should also be responsible for retrieving them.

Member:

It also seems to me that a trace with multiple chains should just be a container for multiple traces, and the logic for individual chains should be in separate classes. Composition is better than just adding features to a class.

Is the reason why even the basic traces deal with 'chains' because the SQL db has a shared resource between its chains?

kyleam (author):

> What is the difference that we're trying to draw between a Trace and a
> Backend? It looks to me like Trace and Backend should be the same
> class. The object that is responsible for storing the points should
> also be responsible for retrieving them.

In an earlier version, I experimented with using one class, and we could
move that way again, but there are a few reasons I think two classes are
a good idea.

  • It makes the trace interface more consistent and less cluttered for
    users. By returning a separate trace object for retrieval, users have
    an object with methods that only concern what they'll be using it for
    (selecting and viewing values). They don't deal with the storage
    object directly (sample does).
  • I think having separate classes for storage versus retrieval makes it
    clearer when defining custom backends which methods should be
    overridden for each class. For a working backend, only the record
    method of the storage backend must be defined. The trace class just
    provides a way to make the custom backend have a consistent interface
    for accessing values.
  • If put in one class, it's pretty large, so dividing the
    responsibilities between storing and retrieving values seemed like a
    natural division.

kyleam (author):

> It also seems to me that a trace with multiple chains should just be a
> container for multiple traces, and the logic for individual chains
> should be in separate classes.

I was thinking the other way around: a single-chain trace should just be
a "multiple" trace object with one chain. This way, all traces are
handled the same, and functions like traceplot don't need to check
whether the trace is an instance of the single or multiple trace class.

Also, when using non-memory backends, the distinction between single and
multiple trace isn't as clear to me because this will be made at the
level of the call to the database.

Is there a specific advantage to using a container object that you have
in mind?

Member:

> Is there a specific advantage to using a container object that you have
> in mind?

I think I'm mostly thinking of code clarity, extendability and composability. I've found this code somewhat challenging to read and understand and that concerns me.

> It makes the trace interface more consistent and less cluttered for
> users.

That's fair. That is a benefit.

> I think having separate classes for storage versus retrieval makes it
> clearer when defining custom backends which methods should be
> overridden for each class.

> This way, all traces are handled the same, and functions like traceplot
> don't need to check whether the trace is an instance of the single or
> multiple trace class.

I definitely agree a common interface to traces is a good idea. There are other ways to achieve this, for example, by always returning the container trace class.

> Also, when using non-memory backends, the distinction between single and
> multiple trace isn't as clear to me because this will be made at the
> level of the call to the database.

True, but in psample, each process still needs an individual object to record its trace.

kyleam added 4 commits January 9, 2014 11:59
Summary was not updated to reflect the variable name change (which is
caught by 'test_summary_2dim_value_model').
This reverts commit 6dda7e1.

This is being reverted to make it easier to compare changes with the
master branch.

Conflicts:
        pymc/__init__.py
        pymc/stats.py
        pymc/tests/test_trace.py
        pymc/trace.py
This reverts commit 683507b.

This is being reverted to make it easier to compare changes with the
master branch.

However, this reintroduces the naming conflict that the commit fixed.
'pymc.sample' now refers to the function, but the module of the same
name cannot be accessed. This makes it difficult to test anything other
than the top-level functions that are imported into pymc/__init__.py
(sample and iter_sample).

Conflicts:
        pymc/__init__.py
Changes in this commit are intended to simplify the backend storage
class.

- Reduce the number of storage class methods that need to be overridden.
  Now only `record` must be defined. During sampling, `setup` and
  `close` are also called, so the object should have these methods, but
  they do not need to do anything.

- Sampling returns the storage object's `trace` attribute. In all the
  backends provided, this is a base.Trace subclass that defines value
  access methods specific to that backend.

- As long as the methods above are defined, the storage object will
  work. This gives more flexibility to implement the backend storage
  class, so long as the `record` method properly stores the values.
  However, the setup in backends.base.Backend.__init__ should be useful
  to most backends (because of access to model information).

- The load functions have been modified so that they only work if a
  model is supplied or if within model context, which removes the option
  to load the values but not connect to the model.

- The base Trace object still provides a lot of structure. This is meant
  to help create a child Trace object for a backend that behaves the
  same as Traces from other backends. This means that the user can
  select values in the same way regardless of the backend. However, it
  is still possible for the user to create a very different Trace
  backend (assigned to the sampling object's trace attribute), as long as
  it has a `merge_chains` method to combine the results from parallel
  sampling.
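To make the contract above concrete, a minimal hypothetical backend might look roughly like this (the DictBackend/DictTrace names and the dict-of-lists storage are invented for the example, and the base-class constructor signatures are assumptions rather than the actual pymc.backends.base API):

```python
import numpy as np
from pymc.backends import base


class DictTrace(base.Trace):
    """Value-access side: a real backend would override the selection
    methods here so values can be selected the same way as with other
    backends."""


class DictBackend(base.Backend):
    """Storage side: only `record` strictly has to do anything."""

    def __init__(self, var_names=None, model=None):
        super(DictBackend, self).__init__(var_names, model)  # assumed signature
        self.samples = {name: [] for name in self.var_names}
        self.trace = DictTrace(self.var_names, backend=self)

    def setup(self, draws, chain):
        # Called before sampling; nothing to preallocate for this backend.
        self.chain = chain

    def record(self, point):
        # The one required method: store the values for the current draw.
        for name in self.var_names:
            self.samples[name].append(np.asarray(point[name]))

    def close(self):
        # Called after sampling; no cleanup needed for in-memory storage.
        pass
```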
@kyleam (Contributor, author) commented Jan 10, 2014

I've pushed changes to make the storage class simpler. See the commit
message and the main pymc.backends docstring for more details.

Most of the suggestions have been incorporated. The main one that hasn't
is making a container class for each variable. I don't understand the
benefit of having a class for each variable. It seems cleaner to me if
the storage class encapsulates all the variables. Otherwise, you'd need
another class to manage all the variable container objects. In the
master branch, NpTrace is acting similarly to the backend storage classes
here. A separate container object could be made to hold a variable (like
ListArray), but instead the NDArray backend is creating an array of
zeros. This avoids having to extend the array each iteration, but it
does require passing in the intended number of draws to set up the array.

In addition to a record and close method, there is a setup method that
receives the expected number of draws and the chain number. I think it
is likely to be useful to most custom backends for setting up storage
structure (for example, the SQLite backend has to know the chain number ahead of
time). Also, having a setup function to create the storage structure
outside the class init means that a single instance can be passed to
sample, and, since setup is called separately, the storage class can
receive different chain numbers from each thread.

Please let me know your thoughts.

    active_chains : list of ints
        Values from chains to be used in operations
    """
    def __init__(self, var_names, backend=None):
Member:

minor: why not just 'vars'?

kyleam (author):

I should definitely go through this and change it to be consistent.
Should it be varnames instead of vars because they're strings, or are
you using vars for both? Also, any concern that vars is a built-in?

@jsalvatier (Member):
My notion was that Backend should look basically like NpTrace except that instead of calling ListArray, it calls a function that is passed to it which creates a single variable backend. That might be ListArray or something like list array for SQLite or something like it for Text.
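A rough sketch of that suggestion, with the per-variable container produced by a factory passed in (all names here are illustrative, not actual PyMC code):

```python
class Backend(object):
    """NpTrace-like trace that delegates per-variable storage to whatever
    single-variable container the factory creates, e.g. ListArray for the
    in-memory backend, or an SQLite- or text-file-backed equivalent."""

    def __init__(self, var_names, container_factory):
        self.containers = {name: container_factory(name) for name in var_names}

    def record(self, point):
        for name, container in self.containers.items():
            container.append(point[name])
```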

## Make array of zeros for each variable
var_arrays = {}
for var_name, shape in self.var_shapes.items():
    var_arrays[var_name] = np.zeros((draws, ) + shape)
Member:

The reason we had ArrayList was to support doing more sampling even after you've filled the original number of samples. Does that still work here?

kyleam (author):

Good point. This is something I had marked with FIXME. With the changes
I just pushed, you should be able to extend previous chains for all the
backends.

When the interrupt cleanup was merged into NDArray's close, a call to
NDArray's close should have been added here so that the interrupt is
cleaned up for Text too.
This was off by one, which resulted in unnecessary slicing (although the
result is the same).
This commit enables sampling to extend a chain from a previous call.
This is done for the NumPy array by concatenating more zeros and setting
the draw index to the right position. For SQLite, this involves setting
the draw index to the right position for the given chain.
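A bare-bones illustration of the NumPy side of that change (the shapes and names are made up for the example):

```python
import numpy as np

samples = np.zeros((500, 2))   # array allocated for the first call to sample
draw_idx = 500                 # all 500 rows are filled after the first run

# To extend the chain, concatenate more zeros and keep writing at draw_idx.
more_draws = 200
samples = np.concatenate([samples,
                          np.zeros((more_draws,) + samples.shape[1:])])
assert samples.shape == (700, 2)
```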
@kyleam (Contributor, author) commented Jan 14, 2014

It should be easy to move the structure underlying the numpy backend to ListArray objects. My reasoning for creating an array of zeros instead was that appending to a list would be slower for storage and concatenating during each selection would also be slower (a few tests here).
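A rough timing comparison along those lines (the numbers depend on the machine; this only illustrates the trade-off and does not reproduce the linked tests):

```python
import timeit

setup = "import numpy as np; draws = 10000; val = np.ones(10)"

preallocated = timeit.timeit(
    "a = np.zeros((draws, 10))\n"
    "for i in range(draws):\n"
    "    a[i] = val",
    setup=setup, number=100)

appended = timeit.timeit(
    "a = []\n"
    "for i in range(draws):\n"
    "    a.append(val)\n"
    "np.array(a)",
    setup=setup, number=100)

print("preallocated: %.3fs  list-append: %.3fs" % (preallocated, appended))
```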

@jsalvatier (Member):
Hey, I just wanted to let you know that I like what you've done a lot. You've clearly put a lot of thought and effort into this, and I'm excited to have this functionality. I haven't responded as much as I want to because I haven't had much time at all, and I haven't figured out my thoughts on what the design should look like. Sorry about that. I hope to be able to spend more time on this sometime next week.

@kyleam (Contributor, author) commented Jan 15, 2014

No problem. Thanks for the feedback you've given so far.

"""
self.chain = chain
## Concatenate new array if chain is already present.
if chain in self.trace.samples:
Member:

the basic NDArray backend shouldn't know anything about what number chain it is.

kyleam (author):

> the basic NDArray backend shouldn't know anything about what number
> chain it is.

The NDArray could be rewritten to do without this, but I don't think
out-of-memory backends could do parallel sampling without knowing this
(and, even for NDArray, this is used to restart the same chain,
especially with parallel sampling).

Please let me know your thoughts. I'm happy to try different ideas and
rework this.

@kyleam (Contributor, author) commented Feb 9, 2014

Thanks for your feedback. (Moving this out of the inline comments
because it's quite general.)

To briefly summarize the current system:

  1. Sampling uses the backend storage object (NDArray, SQLite...).
    Each chain has its own instance of the storage backend,
    distinguished by its chain number.
  2. Following sampling, a trace object is returned. If multiple chains
    were sampled, they are merged into a single multi-chain object
    before being returned.
  3. This returned trace serves as the selection interface for the user.

The two main issues currently raised are that

  1. The multi-chain trace object should consist of single-chain trace
    objects.
  2. Storage and selection should be in the same class (instead of the
    storage class returning the trace class after sampling).

> Is there a specific advantage to using a container object that you
> have in mind?

> I think I'm mostly thinking of code clarity, extendability and
> composability.

I certainly agree these are important, but it's not obvious to me that
having single-chain objects and a multi-chain container improves these.

For me it comes back to the idea that it is more consistent and cleaner
to treat a trace with a single chain simply as a multi-chain object with
one chain than to have a single-chain class and then another class to
manage those objects. Assuming we want to consistently present the
multi-chain interface for selection (as discussed below), it seems
unnecessarily complicated to give the individual chains a class with
their own selection methods when standard objects will do (for NDArray,
a dictionary of values keyed by the chain, or for SQLite, a list of
chain numbers).

I'm not strongly opposed to reworking this to use single-chain objects
(as I propose below), but I wanted to clarify why I currently favor just
using multi-chain objects.

> This way, all traces are handled the same, and functions like
> traceplot don't need to check whether the trace is an instance of the
> single or multiple trace class.

> I definitely agree a common interface to traces is a good idea. There
> are other ways to achieve this, for example, by always returning the
> container trace class.

I'd argue that this is already what is happening: the trace object
(which is a multi-chain object) is always returned.

> I've found this code somewhat challenging to read and understand and
> that concerns me.

Yes, that concerns me too, especially since it should be relatively easy
for a user to implement a new backend.

I propose the following changes to move towards what you are suggesting.

  1. Add the single-chain trace object functionality as part of the
    current storage backend class. The value selection for the backend
    will be defined here.
  2. Make the multi-chain object take a group of single-chain objects. It
    will use the single-chain value selection methods to select values.
    This will be the object that is always returned from sampling.

With this setup, adding a new backend would mean subclassing object 1
and defining the storage and single-chain selection methods. The
multi-chain object shouldn't need to be subclassed, because it uses the
shared single-chain methods to get the values.

This should address both issue 1 and 2 above. There would now be
single-chain objects managed by a container object (1), and storage and
single-chain objects are no longer separate (2).

What do you think? If these are heading in the direction you were
suggesting, I'll put this together on another branch.
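For concreteness, a very rough outline of the proposed structure (class and method names are placeholders, not the eventual implementation):

```python
import numpy as np


class NDArrayChain(object):
    """Proposal item 1: one object per chain that both stores draws
    (`record`) and selects values (`get_values`)."""

    def __init__(self, var_names):
        self.samples = {name: [] for name in var_names}

    def record(self, point):
        for name, values in self.samples.items():
            values.append(point[name])

    def get_values(self, var_name):
        return np.asarray(self.samples[var_name])


class MultiTrace(object):
    """Proposal item 2: container over single-chain objects; this is what
    sampling would always return."""

    def __init__(self, chains):
        self.chains = list(chains)

    def get_values(self, var_name, combine=False):
        values = [chain.get_values(var_name) for chain in self.chains]
        return np.concatenate(values) if combine else values
```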

@jsalvatier (Member):
Your proposal sounds great to me. Thank you for your persistence and your patience with me, Kyle.

> The multi-chain object shouldn't need to be subclassed, because it uses
> the shared single-chain methods to get the values.

I think you will need to subclass the multi-chain class in order to override the __init__ in cases where the subchains have a shared resource (for example, in the SQL trace).

@kyleam closed this Mar 1, 2014
@kyleam deleted the pymc3-backends branch May 25, 2014