-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Text and SQLite backends for PyMC3 #449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is only needed for python 2 because mock is in stdlib for python 3 (unittest.mock).
To avoid any confusing with function pymc.sample, which is the imported from model pymc.sample in __init__.py
This commit contains a new backend for sampling and selecting values. Non-backend files have been changed to work with the new backend. Everything seems to be working with the exception of two issues (marked with FIXME): 1. pymc.plots.forestplot has not been updated yet for the new backend. 2. The previous behavior of passing a trace object to sample is not the same. I updated stochastic_volatility to do this with the same trace object. This commit also introduces a change to `sample`/`psample`. Instead of having separate function, `sample` now takes a keyword argument `threads`, and if this is over one, the multiprocessing version is used. The method for selecting values has also been changed. Traces can still be indexed to return values, a new slice, or a point (depending on the index), but the handling of chains is different. The trace object is now manages multiple chains itself instead of having a separate class to manage the single trace object. `get_values` is the main method for selecting values. By default, it returns separate results for all the chains. The chains can be combine with the `combine` flags, and particular chains can be select with the `chains` argument. The motivation for both sample and selection changes above was to have a unified interface for dealing with multiple chains, as most people are likely going to take advantage of the parallel sampling.
Wraps `enumerate` in progress bar update. This allows for checking the progress bar flag once and choosing `enumerate` or `enumerate_progress` as function versus checking progress bar flag each iteration.
This test compares of selection methods for NDArray to SQLite traces.
This looks really interesting but will certainly require a proper code review. |
Absolutely. It touches a lot (which I think is inevitable) and makes |
Thanks for getting this started. A sqlite backend is definitely something we want. I used it extensively with PyMC 2. |
@@ -1,202 +0,0 @@ | |||
from .point import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you please keep the filename the same for PR? It makes it hard to see the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point. I put this into a separate commit before making any
changes to the code, which means that it diffs cleanly within the
branch, but the diff between branches won't be very useful. I'll push
changes reverting this commit (as well as the commit moving the summary
function to the stats module).
However, bringing back the name conflict between the sample module and
the function results in not being able to import the sample module
directly (meaning that only the functions from sample.py imported into
init.py can be tested), so I think this it is worth renaming
sample.py back to sampling.py at some point.
I'm excited to take a look! Having more backends would be great for users! Thanks for taking the time to put this together. |
from pymc.model import modelcontext | ||
|
||
|
||
class Backend(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I think this class provides too much structure for the different backends, and I wasn't expecting that there would be a shared base class for all backends at all (though maybe there's enough shared code that you really want one).
I would prefer to have a very simple interface for the backends, mainly they implement a record
and perhaps close
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, there is at least a significant subset of backends that will have significant shared code: backends with a distinct container for each variable (lets call those 'container backends'), like a NdArray backend or SQL backend. However, I still think this class can be made significantly simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comments. I will try to incorporate your suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference that we're trying to draw between a Trace and a Backend? It looks to me like Trace and Backend should be the same class. The object that is responsible for storing the points should also be responsible for retrieving them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also seems to me that a trace with multiple chains should just be a container for multiple traces, and the logic for individual chains should be in separate classes. Composition is better than just adding features to a class.
Is the reason why even the basic traces deal with 'chains' because the SQL db has a shared resource between its chains?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference that we're trying to draw between a Trace and a
Backend? It looks to me like Trace and Backend should be the same
class. The object that is responsible for storing the points should
also be responsible for retrieving them.
In an earlier version, I experimented with using one class, and we could
move that way again, but there are a few reasons I think two classes are
a good idea.
- It makes the trace interface more consistent and less cluttered for
users. By returning a separate trace object for retrieval, users have
an object with methods that only concern what they'll be using it for
(selecting and viewing values). They don't deal with the storage
object directly (sample
does). - I think having separate classes for storage versus retrieval makes it
clearer when defining custom backends which methods should be
overridden for each class. For a working backend, only therecord
method of the storage backend must be defined. The trace class just
provides a way to make the custom backend have a consistent interface
for accessing values. - If put in one class, it's pretty large, so dividing the
responsibilities between storing and retrieving values seemed like a
natural division.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also seems to me that a trace with multiple chains should just be a
container for multiple traces, and the logic for individual chains
should be in separate classes.
I was thinking the other way around: a single-chain trace should just be
a "multiple" trace object with one chain. This way, all traces are
handled the same, and functions like traceplot don't need to check
whether the trace is an instance of the single or multiple trace class.
Also, when using non-memory backends, the distinction between single and
multiple trace isn't as clear to me because this will be made at the
level of the call to the database.
Is there a specific advantage to using a container object that you have
in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific advantage to using a container object that you have
in mind?
I think I'm mostly thinking of code clarity, extendability and composability. I've found this code somewhat challenging to read and understand and that concerns me.
It makes the trace interface more consistent and less cluttered for
users.
That's fair. That is a benefit.
I think having separate classes for storage versus retrieval makes it
clearer when defining custom backends which methods should be
overridden for each class.This way, all traces are
handled the same, and functions like traceplot don't need to check
whether the trace is an instance of the single or multiple trace class.
I definitely agree a common interface to traces is a good idea. There are other ways to achieve this, for example, by always returning the container trace class.
Also, when using non-memory backends, the distinction between single and
multiple trace isn't as clear to me because this will be made at the
level of the call to the database.
True, but in psample, each process still needs an individual object to record their traces.
Summary was not updated to reflect variables name change (which is caught by 'test_summary_2dim_value_model').
This reverts commit 6dda7e1. The is being reverted to make it easier to compare changes with the master branch. Conflicts: pymc/__init__.py pymc/stats.py pymc/tests/test_trace.py pymc/trace.py
This reverts commit 683507b. The is being reverted to make it easier to compare changes with the master branch. However, this reintroduces the naming conflict that the commit fixed. 'pymc.sample' now refers to the function, but the module of the same name cannot be accessed. This makes it difficult to test anything other than the top-level functions that are imported into pymc/__init__.py (sample and iter_sample). Conflicts: pymc/__init__.py
Changes in this commit are intended to simplify the backend storage class. - Reduce the number of storage class methods that need to be overridden. Now only `record` must be defined. During sampling, `setup` and `close` are also called, so the object should have these methods, but they do not need to do anything. - Sampling returns the storage object's `trace` attribute. In all the backends provided, this is a base.Trace object inherited to define value access methods specific to that backend. - As long as the methods above are defined, the storage object will work. This gives more flexibility to implement the backend storage class, so long as the `record` method properly stores the values. However, the setup in backends.base.Backend.__init__ should be useful to most backends (because of access to model information). - The load functions have been modified so that they only work if a model is supplied or if within model context, which removes the option to load the values but not connect to the model. - The base Trace object still provides a lot of structure. This is meant to help create a child Trace object for a backend that behaves the same as Traces from other backends. This means that the user can select values in the same way regardless of the backend. However, it is still possible for the user to create a very different Trace backend (assigned to the sampling objects trace attribute), as long as it has a `merge_chains` method to combine the results from parallel sampling.
I've pushed changes to make the storage class simpler. See the commit Most of the suggestions have been incorporated. The main one that hasn't In addition to a record and close method, there is a setup method that Please let me know you thoughts. |
active_chains : list of ints | ||
Values from chains to be used operations | ||
""" | ||
def __init__(self, var_names, backend=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: why not just 'vars'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should definitely go through this and change it to be consistent.
Should it be varnames
instead of vars
because they're strings or are
you using vars
for both? Also, any concern that vars
is a built-in?
My notion was that |
## Make array of zeros for each variable | ||
var_arrays = {} | ||
for var_name, shape in self.var_shapes.items(): | ||
var_arrays[var_name] = np.zeros((draws, ) + shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason we had ArrayList was to support doing more sampling even after you've filled the original number of samples. Does that still work here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This is something I had marked with FIXME. With the changes
I just pushed, you should be able to extend previous chains for all the
backends.
When cleaning interrupt was merged with NDArray close, NDArray close call should have been added here so that interrupt is cleaned for Text too.
This was off by one, which resulted in unnecessary slicing (although the result is the same).
This commit enables sampling to extend the a chain from a previous call. This is done for the NumPy array by concatenating more zeros and setting the draw index to the right position. For SQLite, this involves setting the draw index to the right position for the given chain.
It should be easy to move the structure underlying the numpy backend to |
Hey, I just wanted to let you know that I like what you've done a lot. You've clearly put a lot of though and effort into this, and I'm excited to have this functionality. I haven't responded as much as I want to because I haven't had much time at all, and I haven't figured out my thoughts on what the design should look like. Sorry about that. I hope to be able to spend more time on this sometime next week. |
No problem. Thanks for the feedback you've given so far. |
""" | ||
self.chain = chain | ||
## Concatenate new array if chain is already present. | ||
if chain in self.trace.samples: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the basic NDArray backend shouldn't know anything about what number chain it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the basic NDArray backend shouldn't know anything about what number
chain it is.
The NDArray could be rewritten to do without this, but I don't think
out-of-memory backends could do parallel sampling without knowing this
(and, even for NDArray, this is used to restart the same chain,
especially with parallel sampling).
Please let me know your thoughts. I'm happy to try different ideas and
rework this.
Thanks for your feedback. (Moving this out of the inline comments To briefly summarize the current system:
The two main issues currently raised are that
I certainly agree these are important, but it's not obvious to me that For me it comes back to the idea that it is more consistent and cleaner I'm not strongly opposed to reworking this to use single-chain objects
I'd argue that this is already what is happening: the trace object
Yes, that concerns me too, especially since it should be relatively easy I propose the following changes to move towards what you are suggesting.
With this setup, adding a new backend would mean subclassing object 1 This should address both issue 1 and 2 above. There would now be What do you think? If these are heading in the direction you were |
Your proposal sounds great good to me. Thank you for your persistence and your patience with me, Kyle.
I think you will need to subclass the multi-chain in order to override the |
This is not ready to merge, but I'm appreciate any feedback.
All tests are passing locally. See backends/init.py and 2e80cef commit message for documentation.