Process definition API #49

Closed
mancellin opened this issue Sep 9, 2018 · 13 comments

@mancellin

Hi,
I've discovered xarray-simlab recently and it has caught my attention.
The project looks very promising: I'm using xarray more and more in my models, and a framework that helps with assembling and running models could improve my workflow a lot.

These days, I mostly work with frequency-domain models, but I guess it would be easy to extend xarray-simlab to this kind of model. They are actually simpler than time-dependent ones.

After reading the doc and playing with some toy cases, I'd like to share some feedback about the API.
Hopefully, the issue tracker is not such a bad place for that.

I'm very impressed by the implementation, in particular the automatic assembling of the processes in a model. The integration with xarray is quite promising. The efforts on the documentation and the availability of the package are also great!
However, I am not totally convinced by the way the processes are defined.

Let's consider the example from the tutorial.
The ProfileU and u_vars abstractions were not very clear to me, so I rewrote them in the framework of conservation laws, with which I'm more familiar:

import numpy as np
import xsimlab as xs

@xs.process
class EulerForwardTimeIntegration:
    fluxes = xs.group('fluxes')
    u = xs.variable(dims='x', intent='inout', description='conserved quantity')

    def run_step(self, dt):
        self.sum_of_fluxes = dt * sum(self.fluxes)
                                    # ^ Note that you don't need a generator here

    def finalize_step(self):
        self.u += self.sum_of_fluxes

@xs.process
class UpwindAdvectionFlux:
    v = xs.variable(description='velocity', intent='in')
    dx = xs.foreign(UniformGrid1D, 'dx')  # UniformGrid1D is defined in the full code linked below

    u = xs.foreign(EulerForwardTimeIntegration, 'u', intent='in')
    flux = xs.variable(dims='x', intent='out', group='fluxes')

    def run_step(self, dt):
        if self.v > 0.0:
            self.flux = self.v * (np.roll(self.u, 1) - self.u)/self.dx
        else:
            self.flux = self.v * (self.u - np.roll(self.u, -1))/self.dx

The full code is here.

I love the modularity of the design, allowing me to easily replace the advection term by another one or to add more of them.

However, I was not able to easily modify the time integration process to have a second order scheme such as:

@xs.process
class RK2TimeIntegration:
    fluxes = xs.group('fluxes')
    u = xs.variable(dims='x', intent='inout', description='conserved quantity')

    def run_step(self, dt):
        predicted_u = self.u + dt * sum(self.fluxes)
        # Missing something like: predicted_fluxes = Flux(predicted_u)
        self.next_u = (self.u + predicted_u)/2 + dt/2 * sum(predicted_fluxes)

    def finalize_step(self):
        self.u = self.next_u

To do that, I would have to create new advection flux classes:

  • The explicit reference to EulerForwardTimeIntegration in the foreign variable of UpwindAdvectionFlux is not a big problem. It could be possible to define this kind of linking in another way, for instance at model creation.
  • What bothers me more is that I would have to add to all Flux classes an expression mapping predicted_u to predicted_flux, which would be the same as the one mapping u to flux. In other words, the process UpwindAdvectionFlux should be a function that can accept arguments other than u.

And this last point illustrates my general feeling about the current API: most of the process classes could be rewritten as functions, and maybe (functional programming aficionado speaking) they should.
The wrapping around attrs is quite nice, but the xs.process classes are not data holders, they are functions; and the xs.variable declarations are function inputs and outputs with annotations.

So, maybe the best way to define the processes is to define functions.
The metadata on the variables could be passed via the docstring or the type annotations, e.g.:

def upwind_advection_flux(
        u: Variable(dims=("x",), name='u'),
        v: float,
        dx: float,
) -> Variable(dims=("x",), name='flux'):
    if v > 0.0:
        return v * (np.roll(u, 1) - u)/dx
    else:
        return v * (u - np.roll(u, -1))/dx

I made a mock-up of the whole model here.

The code would be (in my humble opinion) both more flexible and more robust.
Indeed, the processes could be easily unit-tested, unlike the current process classes (see the sketch below).
I don't know exactly how the model builder works, but it should still have enough clues to link the processes together.
Note finally that this framework would easily extend to other kinds of models than time-dependent ones.
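
As an example of the unit-testing point, a test would be a plain function call (a sketch only, reusing the mock-up's upwind_advection_flux):

import numpy as np

def test_upwind_advection_flux():
    # a single bump advected rightward by a positive velocity
    u = np.array([0.0, 1.0, 0.0, 0.0])
    flux = upwind_advection_flux(u, v=1.0, dx=0.5)
    np.testing.assert_allclose(flux, (np.roll(u, 1) - u) / 0.5)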

These were my two cents. I'd be happy to discuss it further if you want.
Cheers,

@benbovy
Member

benbovy commented Sep 10, 2018

Hi @mancellin,

Many thanks for your detailed feedback! I try below to give some thoughts on each topic you've mentioned:

Doc enhancement / process naming

Let's consider the example from the tutorial. The ProfileU and u_vars abstractions were not very clear to me, so I rewrote them in the framework of conservation laws, with which I'm more familiar

I agree that this example from the tutorial could be improved. The code you have rewritten is clearer at several places, so I'd be very pleased if you could put this together in a PR!

On a related note, I've often found it difficult to find a good name for a process. Although it's natural to choose a name that describes what the process does, sometimes it doesn't work very well. In your example, EulerForwardTimeIntegration / time_scheme is explicit, but then when you use the xarray extension you end up with a variable named time_scheme__u in your input/output datasets, which is not very meaningful. A name like profile__u is slightly easier to understand IMO, especially when the dataset is saved and loaded later in another context.

Support advanced time stepping

I was not able to easily modify the time integration process to have a second order scheme

I admit that I haven't really considered time schemes like RK2 when working on the design of this modelling framework.

One limitation is that by design no cycle is allowed in the graph of process dependencies. As a workaround, you could duplicate process classes (both flux and time integration) to implement each step of the RK2 scheme. You could also use class inheritance so that you don't need to duplicate all the code but just re-define u and flux. I agree that this workaround is not ideal, though, and it's still quite laborious, e.g., if you want to include a lot of different "flux" process classes.
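
For illustration, the flux-duplication part could look like the sketch below, reusing UpwindAdvectionFlux from your first comment (the PredictorStage class, its u_pred variable and the 'predicted_fluxes' group are hypothetical names, and the corrector stage that actually updates u is left out):

import xsimlab as xs

@xs.process
class PredictorStage:
    # hypothetical first RK2 stage: holds u and computes an Euler-forward guess
    fluxes = xs.group('fluxes')
    u = xs.variable(dims='x', intent='inout', description='conserved quantity')
    u_pred = xs.variable(dims='x', intent='out', description='predicted u')

    def run_step(self, dt):
        self.u_pred = self.u + dt * sum(self.fluxes)

@xs.process
class PredictedUpwindAdvectionFlux(UpwindAdvectionFlux):
    # duplicated by inheritance: only u and flux are re-declared; run_step is
    # inherited and computes the same upwind expression on the predicted value
    u = xs.foreign(PredictorStage, 'u_pred', intent='in')
    flux = xs.variable(dims='x', intent='out', group='predicted_fluxes')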

it could be possible to define this kind of linking in another way, for instance at model creation

I'm very open to any suggestion to improve the situation! That said, I think that the linking between processes should remain as automatic as possible (that's a core feature of the framework)!

Define processes as functions

I haven't really thought about that but I like the idea!

Process classes are still useful, though. I have a few use cases where a process holds some data, e.g., some working arrays that are allocated at the beginning of a simulation and updated later, some state variables recorded for the last n time steps or complex objects from wrapped code dealing with more than one variable... Classes are also useful to implement some variables computed "on-demand".

In cases where a process doesn't hold any data, I agree that using functions would be simpler and much easier to test!

I think that both classes and functions might co-exist.

I don't see any major implementation issue. We could decorate those functions like we do for classes to extract/handle/stick all the information needed to build models. Function arguments consist of variables with intent=in and the returned value(s) are related to variable(s) with intent=out (intent=inout is not supported with functions).
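
For example, a very rough sketch of such a decorator (hypothetical API, not part of xarray-simlab):

import inspect

def process_function(func):
    # register each annotated argument as an intent='in' variable and the
    # return annotation as an intent='out' variable
    sig = inspect.signature(func)
    in_vars = {name: param.annotation for name, param in sig.parameters.items()}
    out_var = sig.return_annotation
    # ...here we would build and attach the same variable metadata that the
    # @xs.process class decorator extracts from attr.ib declarations...
    func.__xsimlab_vars__ = (in_vars, out_var)  # hypothetical attribute
    return func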

My main concern is that we'll probably have to maintain two different systems for declaring variables in processes, e.g., attr.ib wrappers for class variables and typing-derived classes for annotated functions.

Time-independent models

this framework would be easily extendable to other kind of models than the time-dependent ones.

Yes that's definitely the aim of this framework: to be as generic as possible.

If we support both classes and functions for processes, maybe a run method in process classes could improve clarity when used in time-independent models? The general order of simulation steps would be initialize -> n_steps*[run_step -> finalize_step] -> run -> finalize. Implementing the related methods is entirely optional in a process class, so it's easy to just ignore all methods but the .run() one. For process functions, an argument to the decorator would help define the simulation stage (run being the default?).
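
Schematically, the driver loop would then look like this (a sketch only, with every stage method optional):

def maybe_call(process, stage, *args):
    # every simulation stage method is optional in a process class
    method = getattr(process, stage, None)
    if method is not None:
        method(*args)

def run_simulation(processes, n_steps, dt):
    for p in processes:
        maybe_call(p, 'initialize')
    for _ in range(n_steps):
        for p in processes:
            maybe_call(p, 'run_step', dt)
        for p in processes:
            maybe_call(p, 'finalize_step')
    for p in processes:
        maybe_call(p, 'run')  # proposed stage for time-independent models
    for p in processes:
        maybe_call(p, 'finalize')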

@mancellin
Author

Hi @benbovy,

Thanks for your answer. Some comments below:

Doc enhancement / process naming

On a related note, I've often found it difficult to find a good name for a process. Although it's natural to choose a name that describes what the process does, sometimes it doesn't work very well. In your example, EulerForwardTimeIntegration / time_scheme is explicit, but then when you use the xarray extension you end up with a variable named time_scheme__u in your input/output datasets, which is not very meaningful. A name like profile__u is slightly easier to understand IMO, especially when the dataset is saved and loaded later in another context.

Ok, I understand the ProfileU process better now. It is meant to be a container for the variable u, and its role as time integrator is kind of a coincidence. Maybe the problem here is that u is shared by several processes and does not really need to belong to any process in particular.

Process linking

I think that the linking between processes should remain as automatic as possible (that's a core feature of the framework)!

I totally agree. The builder should be able to guess which output to plug on each input, based on the shape, physical unit or name of the variables.

My criticism of the current implementation is that writing explicitly that the input of a process should come from another specific process (in xs.foreign) hinders modularity. It could be necessary in some cases, for instance when there are too many variables and not enough metadata to distinguish them. But then, it would be better to make the link an option to the model builder rather than hard-code it in the process.

Define processes as functions

Process classes are still useful, though. I have a few use cases where a process holds some data, e.g., some working arrays that are allocated at the beginning of a simulation and updated later, some state variables recorded for the last n time steps or complex objects from wrapped code dealing with more than one variable...

Could you make those classes callable? For instance:

import numpy as np

@process
class InitialGaussianProfile:
    loc = Variable(kind='position', description='location of initial profile')
    width = Variable(kind='length', description='width of initial profile')

    def __call__(self, x: Variable(kind='position')) -> Variable(dims=('x',), kind='u'):
        return np.exp(-1 / self.width**2 * (x - self.loc)**2)

So you have basically a function with attributes.
Note that it solves the naming problem by separating the private parameters that define a process from the variables that just pass through it.
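
Usage would then look like this (still assuming the hypothetical Variable and @process from my mock-up):

import numpy as np

profile = InitialGaussianProfile(loc=0.5, width=0.1)
x = np.linspace(0.0, 1.0, 101)
u0 = profile(x)  # a plain function call, easy to unit-test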

Other time stepping or time-independent models

One limitation is that by design no cycle is allowed in the graph of process dependencies.

My suggestion is to have processes taking subprocesses as arguments. It complicates the implementation of the model builder and the graph representation, but I don't see any other way to assemble complex models in a generic framework without dependency cycles.

The general order of simulation steps would be initialize -> n_steps*[run_step -> finalize_step] -> run -> finalize. Implementing the related methods is entirely optional in a process class, so it’s easy to just ignore all methods but the .run() one.

The time variable would then be a normal variable used in a "metaprocess" taking processes as arguments. That's what I tried to do in my function-based mockup.
The "standard library" of xsimlab could then come with a generic process of the form:

@process
class TimeDependentSimulation:
    step_runners: List[Process]

    def __call__(self, clock: Variable(dims=('t',)), ...) -> ...:
        for t in clock:
            for run_step in self.step_runners:
                ... = run_step(...)
        return ...


model = Model(
    {
        some_initialization_process,
        TimeDependentSimulation(step_runners=[some_function, ...]),
        some_time_independent_process,
        ...
    }
)

And the model would then have just a single run step.
The problem is figuring out how to build the dependency graph when you have processes containing other processes. Maybe this is a bit too complicated, and your solution is better in the short term.

@benbovy
Member

benbovy commented Sep 12, 2018

Maybe the problem here is that u is shared by several processes and does not really need to belong to any process in particular.

Actually, nothing prevents the user from creating a "process" class that only declares one or more variables without doing any computation. Such classes could be abstract (to be subclassed) or could serve as glue (i.e., declare variable aliases) between two sets of process classes using different naming conventions for the same physical fields. So maybe the problem is the name "process", which doesn't accurately represent the concept here.

The framework encourages the creation of small components, so it might make sense to have a process class that only declares u. However, in practice this could result in a lot of classes, which is a bit tough to deal with in Model objects. That's why in the example in the docs I've created a single class to declare u and to implement the time integrator (the main driver).
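
For instance, a minimal variable-only class would be:

import xsimlab as xs

@xs.process
class ProfileU:
    # no computation here: this class only declares (and "owns") u
    u = xs.variable(dims='x', intent='inout', description='conserved quantity')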

The builder should be able to guess which output to plug on each input, based on the shape, physical unit or name of the variables.

This is another approach that I've considered and which is used by other (more domain-specific) modelling frameworks such as landlab. Both approaches have pros and cons. Here I've chosen not to rely on standards (names, units, etc.) because I wanted the framework to be as generic as possible and flexible enough that it's easy to connect model components, e.g., written by people working in different domains / using different standards. Standards are good but sometimes they are not very flexible. Unfortunately, I haven't found any better alternative than xs.foreign.

But then, it would be better to make the link an option to the model builder rather than hard-code it in the process.

I've also considered this, but in early mockups it quickly became ugly to (re)link at model creation all the variables declared in each process. I really want to keep the creation of a new Model very simple, i.e., ideally requiring no more than just a collection of processes. Fortunately, things like subclassing processes or xs.group already address some issues with hard-coding linked processes (only to a limited extent, I admit).

Could you make those classes callable?

There might indeed be some advantages to splitting the variable declarations as you show in your code, but I prefer a clearer separation between the declarative and computational parts, i.e., one declarative block at the top of the class and then methods for computation below is cleaner IMO.

Also, I'm not a big fan of encouraging the use of dunder methods like __call__ in the public API (process classes are end-user code).

My suggestion is to have processes taking subprocesses as arguments.

A nested collection of processes indeed looks too complicated to me. I can see how difficult it would be to deal with when implementing features like a callback system for model diagnostics/debugging/logging, or even basic things like the repr.

There exist other, possibly better approaches to solve the problem of advanced time stepping schemes, time-independent models and cycles. For example, we could design a more generic/flexible way to define the passes through the graph of processes during a simulation, with possible filtering. Currently, the framework defines 4 (optional) simulation stages, but we could imagine these stages somehow defined by the user (maybe with repeating/nested patterns, e.g., time steps coupled to one or more other iterative schemes). Regarding the API, decorating methods in process classes might help here (see, e.g., #40).

@mancellin
Author

Actually, nothing prevents the user from creating a "process" class that only declares one or more variables without doing any computation. Such classes could be abstract (to be subclassed) or could serve as glue (i.e., declare variable aliases) between two sets of process classes using different naming conventions for the same physical fields. So maybe the problem is the name "process", which doesn't accurately represent the concept here.

It's interesting to notice that we are thinking in different programming paradigms.
You've designed xsimlab with an imperative programming paradigm (a process is a sequence of instructions that affect preexisting mutable variables), while I'm imagining it with a functional programming mindset (a process is a mathematical function and the model is a fancy composition of functions).
I guess they both have pros and cons.

For instance, I see xs.foreign as a side effect of the imperative design of the code, and I think it is possible to get rid of it.
I'll make a small prototype to try to make my case for a more functional design.

@benbovy
Member

benbovy commented Sep 13, 2018

I think that it would be great if we could add to xarray-simlab the capability to create models using Python functions as processes. However, I don't think that a purely functional design is possible or wanted here.

A purely functional, stateless approach might not always be the best or the most natural. I have use cases where using exclusively immutable variables is simply not an option (e.g., allocating/destroying big arrays in memory at each time step). After all, a computational model is often viewed as a set of instructions that update one or more state variables.

That being said, some sort of functional style is already supported! By appropriately setting intent for variables in each process class, it is actually totally possible to produce a model with only immutable variables:

  • For intent=in variables, it is not possible to set their value within their process class:
    • the xs.process class decorator turns these variables into read-only properties
    • a value may still be set either in another process class (using xs.foreign) or by the user as model input
  • For intent=out variables, it is not possible to set their values outside of their process class:
    • they'll never be model inputs
    • an error is raised at model creation if intent=out is also set for any of its references (i.e., an xs.foreign) declared in another process class
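
To illustrate with a minimal sketch:

import xsimlab as xs

@xs.process
class SomeProcess:
    a = xs.variable(intent='in')   # read-only within this class
    b = xs.variable(intent='out')  # settable only within this class

    def run_step(self, dt):
        self.b = 2 * self.a
        # self.a = 0.0  # would fail: 'a' is turned into a read-only property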

So, even though xarray-simlab uses Python classes to define processes, your definition of a process (a mathematical function...) may still be perfectly valid here!

What you suggest is not much different: using Python functions instead of classes where intent is implicit (intent=in for function arguments and intent=out for returned values). As I said before, I'm not opposed to also eventually support Python functions as processes, even though it will certainly raise some design questions. I'm one of those who think that functions are generally easier to deal with than classes, and in some cases like yours it might be nicer.

@benbovy
Member

benbovy commented Sep 13, 2018

Rather than following an imperative paradigm, I've designed xarray-simlab as an object-oriented layer on top of a common data store for the simulation. Process classes are just a way to split the overall computation of a model into building blocks and to provide some "syntactic sugar" for accessing the simulation data store in a consistent manner (i.e., with safeguards).

Off the top of my head, I can imagine decorated (and annotated) Python functions that could be syntactic sugar for creating simple process classes that are also callable. It should be reasonably straightforward to make that work with everything else. The big remaining question is how to define variables with type annotations.
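
For instance (a sketch only, none of this is an existing API), the metadata could be carried by plain annotation objects and collected through __annotations__:

class Variable:
    # hypothetical annotation object holding the variable metadata
    def __init__(self, dims=(), name=None, description=''):
        self.dims = dims
        self.name = name
        self.description = description

def flux(u: Variable(dims=('x',), name='u'), v: float) -> Variable(dims=('x',), name='flux'):
    ...

# what the decorator could collect:
annotations = dict(flux.__annotations__)
out_meta = annotations.pop('return')  # metadata of the intent='out' variable
in_meta = annotations                 # metadata of the intent='in' variables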

@mancellin
Author

I guess you're right, you're more familiar with the design constraints than I am.
I'll take a closer look at the workings of the code and the implementation details.

@mancellin
Author

As you suggested above, I've started drafting a PR to improve the example in the documentation.

I've just noticed that we don't really need to pass the grid spacing from grid to advect: if we use a DataArray, all the information should already be in the coordinates of the array.

But, assuming we keep it as it is, shouldn't there be an arrow linking directly grid to advect in the dependency graph, since the latter requires a variable from the former?

@benbovy
Member

benbovy commented Oct 11, 2018

I've just noticed that we don't really need to pass the grid spacing from grid to advect: if we use a DataArray, all the information should already be in the coordinates of the array.

Actually, DataArray objects are not intended to be passed or shared between the model processes during a simulation. xarray Datasets are just used for simulation I/O, whereas the simulation runtime data shared between the model processes consists of simple structures such as primitives or numpy arrays. Simulation runtime data is only data. Metadata like dimension labels or attributes are already defined in the variables (xs.variable()) declared in each process.

But, assuming we keep it as it is, shouldn't there be an arrow linking directly grid to advect in the dependency graph, since the latter requires a variable from the former?

Process dependencies are defined as follows: process 1 depends on process 2 if the latter declares a variable (resp. a foreign variable) with intent='out' that itself (resp. its target variable) is needed in process 1.

In the doc examples (e.g., model2), grid.spacing has intent='in'. In fact, it is a model input, that is, a value for grid spacing must be set by the user before even starting the simulation. So, in this specific case, the advect process does not need the grid process to be run before getting the value of grid.spacing, hence no direct dependency grid -> advect.
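
In code, the situation is (a minimal sketch with simplified class names):

import xsimlab as xs

@xs.process
class Grid:
    spacing = xs.variable(description='grid spacing', intent='in')  # model input

@xs.process
class Advect:
    dx = xs.foreign(Grid, 'spacing')  # intent='in' by default
    # no arrow Grid -> Advect in the graph: spacing is provided by the user,
    # not computed by Grid during the simulation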

@mancellin
Author

Simulation runtime data is only data.

Ok, it makes sense.

Process dependencies are defined as follows: process 1 depends on process 2 if the latter declares a variable (resp. a foreign variable) with intent='out' that itself (resp. its target variable) is needed in process 1.

Ok, I understand. In the graph, the arrows represent dependency in the sense of the order of execution. I was thinking of dependency in the sense of requiring the other process to also be part of the model.

@benbovy
Member

benbovy commented Oct 12, 2018

In the graph, the arrows represent dependency in the sense of the order of execution. I was thinking of dependency in the sense of requiring the other process to also be part of the model.

You're right. "dependency" is confusing here. Maybe "workflow dependency" or "runtime dependency" would be better to clearly differentiate from "variable dependency".

@benbovy
Member

benbovy commented Jan 16, 2020

I'm closing this issue, as one of the main concerns here (process unit-testing) has been addressed in #63.

@mancellin feel free to open new issues or comment on the relevant open issues if you want to further discuss other points raised here.

@benbovy benbovy closed this as completed Jan 16, 2020
@mancellin
Author

Sure, you can close the issue.

Sorry for never finishing the documentation PR I mentioned above.

Since then, I've been scratching my itch by doing some experimentation on my side: https://github.com/mancellin/labelled_functions
The idea is to add labels on the inputs and outputs of Python functions, similarly to xarray adding labels to the dimensions of numpy arrays.
It makes it easier to interact with labelled data and to compose functions.
The scope and implementation are different from xarray-simlab, but your work was a major source of inspiration.
