# Process definition API #49
Hi @mancellin,

Many thanks for your detailed feedback! I try below to give some thoughts on each topic you've mentioned.

**Doc enhancement / process naming**
I agree that this example from the tutorial could be improved. The code you have rewritten is clearer in several places, so I'd be very pleased if you could put this together in a PR! On the same note, I've often found it difficult to find a good name for a process. Although it's natural to choose a name that describes what the process does, sometimes it doesn't work very well, as in your example.

**Support advanced time stepping**
I admit that I hadn't really considered time schemes like RK2 when working on the design of this modelling framework. One limitation is that, by design, no cycle is allowed in the graph of process dependencies. As a workaround, you could duplicate process classes (both flux and time integration) to implement each step of the RK2 scheme. You could also use class inheritance, so that you don't need to duplicate all the code but just re-define the variables that differ between the steps.
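A rough sketch of what I mean (all class and variable names here are made up, and I'm assuming process inheritance behaves like plain attrs inheritance):

```python
import numpy as np
import xsimlab as xs


@xs.process
class TimeIntegration:
    """Hypothetical Euler-forward integration, holding the state `u`."""
    u = xs.variable(dims='x', intent='inout')


@xs.process
class Predictor:
    """Hypothetical RK2 predictor step, exposing an intermediate state."""
    predicted_u = xs.variable(dims='x', intent='out')


@xs.process
class UpwindAdvectionFlux:
    u = xs.foreign(TimeIntegration, 'u')
    flux = xs.variable(dims='x', intent='out')

    def run_step(self):
        # placeholder flux computation, just for the sketch
        self.flux = np.where(self.u > 0, self.u, 0.0)


@xs.process
class PredictedUpwindAdvectionFlux(UpwindAdvectionFlux):
    # the computation is inherited; only the source of `u` is re-defined
    u = xs.foreign(Predictor, 'predicted_u')
```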
I'm very open to any suggestion to improve the situation! That said, I think that the linking between processes should remain as automatic as possible (that's a core feature of the framework)!

**Define processes as functions**

I haven't really thought about that, but I like the idea! Process classes are still useful, though. I have a few use cases where a process holds some data, e.g., working arrays that are allocated at the beginning of a simulation and updated later, state variables recorded for the last n time steps, or complex objects from wrapped code dealing with more than one variable... Classes are also useful to implement some variables computed "on-demand". In cases where a process doesn't hold any data, I agree that using functions would be simpler and much easier to test! I think that both classes and functions might co-exist; I don't see any major implementation issue. We could decorate those functions like we do for classes to extract/handle/stick all the information needed to build models (a rough sketch below), with function arguments consisting of variables carrying the relevant metadata. My main concern is that we'll probably have to maintain two different systems for declaring variables in processes, e.g., class attributes vs. function annotations.
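For instance, a minimal, entirely hypothetical sketch of such a decorator (none of these helpers exist in xarray-simlab):

```python
import inspect


def process_function(func):
    """Turn an annotated function into a minimal process-like class.

    Parameters become input variables; the return annotation becomes the
    single output variable (named 'out' here, an arbitrary choice).
    """
    sig = inspect.signature(func)
    input_names = list(sig.parameters)

    class FunctionProcess:
        def run_step(self):
            kwargs = {name: getattr(self, name) for name in input_names}
            self.out = func(**kwargs)

    FunctionProcess.__name__ = func.__name__
    FunctionProcess.__doc__ = func.__doc__
    # metadata that the model builder could use for linking
    FunctionProcess.input_metadata = {
        name: func.__annotations__.get(name) for name in input_names
    }
    FunctionProcess.output_metadata = func.__annotations__.get('return')
    return FunctionProcess
```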
**Time-independent models**

Yes, that's definitely the aim of this framework: to be as generic as possible. If we support both classes and functions for processes, maybe a ...
---

Hi @benbovy,

Thanks for your answer. Below are some comments.

**Doc enhancement / process naming**
Ok, I understand the naming problem better now.

**Process linking**
I totally agree. The builder should be able to guess which output to plug into each input, based on the shape, physical unit, or name of the variables. My critique of the current implementation is the need to write explicitly that the input of a process should come from another process (in ...).

**Define processes as functions**
Could you make those classes callable? For instance:

```python
@process
class InitialGaussianProfile:
    loc = Variable(kind='position', description='location of initial profile')
    width = Variable(kind='length', description='width of initial profile')

    def __call__(self, x: Variable[kind='position']) -> Variable[dims=('x',), kind='u']:
        return np.exp(-1 / self.width**2 * (x - self.loc)**2)
```

So you have basically a function with attributes.
**Other time stepping or time-independent models**

My suggestion is to have processes take subprocesses as arguments. It complicates the implementation of the model builder and the graph representation, but I don't see any other way to assemble complex models in a generic framework without dependency cycles.
The time variable would then be a normal variable used in a "metaprocess" taking processes as arguments. That's what I tried to do in my function-based mockup:

```python
@process
class TimeDependentSimulation:
    step_runners: List[Process]

    def __call__(self, clock: Variable[dims=('t',)], ...: ...) -> ...:
        for t in clock:
            for run_step in step_runners:
                ... = run_step(...)
        return ...


model = Model(
    {
        some_initialization_process,
        TimeDependentSimulation(step_runners=[some_function, ...]),
        some_time_independent_process,
        ...
    }
)
```

And the model would now have just a single ...
---

Actually, nothing prevents the user from creating a "process" class that only declares one or more variables without doing any computation. Such classes could be abstract (to be subclassed) or could serve as glue (i.e., declare variable aliases) between two sets of process classes using different naming conventions for the same physical fields. So maybe the problem is the name "process" itself.

The framework encourages the creation of small components, so it might make sense to have a process class that only declares variables, as sketched below.
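For example, a glue process could look like this (a sketch with made-up names, using the current class API):

```python
import xsimlab as xs


@xs.process
class OceanModel:
    """Hypothetical process using one naming convention."""
    sst = xs.variable(dims=('y', 'x'), intent='out')


@xs.process
class SSTAlias:
    """Glue process: no computation, just re-exposes `sst` under another name."""
    sst = xs.foreign(OceanModel, 'sst')
    sea_surface_temperature = xs.variable(dims=('y', 'x'), intent='out')

    def run_step(self):
        self.sea_surface_temperature = self.sst
```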
This is another approach that I've considered, and which is used by other (more domain-specific) modelling frameworks such as landlab. Both approaches have pros and cons. Here I've chosen not to rely on standards (names, units, etc.) because I wanted the framework to be as generic as possible and flexible enough that it's easy to connect model components, e.g., written by people working in different domains / using different standards. Standards are good, but sometimes they are not very flexible. Unfortunately, I haven't found any better alternative than ...
I've also considered this, but in early mockups it quickly became ugly to (re)link all the variables declared in each process at model creation. I really want to keep the creation of a new model as simple as possible.
There might indeed be some advantages to splitting variable declaration as you show in your code, but I think I prefer a clearer separation between the declarative and computational parts, i.e., one declarative block at the top of the class and then methods for computation below is cleaner IMO. Also, I'm not a big fan of encouraging the use of dunder methods like `__call__`.
A nested collection of processes does indeed look too complicated to me. I can see how difficult it would be to deal with when implementing features like a callback system for model diagnostics/debugging/logging, or even basic things like the ...

There exist other, possibly better approaches to solving the problem of advanced time stepping schemes, time-independent models and cycles. For example, we could design a more generic/flexible way to define the passes through the graph of processes during a simulation, with possible filtering. Currently, the framework defines 4 (optional) simulation stages, but we could imagine these stages somehow being defined by the user, maybe with repeating/nested patterns, e.g., time steps coupled to one or more other iterative schemes (see the sketch below). Regarding the API, decorating methods in process classes might help here (see, e.g., #40).
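Purely as an illustration of that idea (nothing like this exists in the framework today):

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Repeat:
    """A repeated (possibly nested) pass through a subset of the stages."""
    over: str                          # e.g., the name of a clock coordinate
    stages: List[Union[str, 'Repeat']]


# user-defined passes through the process graph, instead of 4 fixed stages
simulation_plan = [
    'initialize',
    Repeat(over='clock', stages=[
        'predictor',        # extra pass needed by an RK2-like scheme
        'corrector',
        'store_diagnostics',
    ]),
    'finalize',
]
```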
---

It's interesting to notice that we are thinking in different programming paradigms. For instance, in the case of ...
---

I think that it would be great if we could add to xarray-simlab the capability to create models using Python functions as processes. However, I don't think that a purely functional design is possible or desirable here. A pure, stateless functional approach might not always be the best nor the most natural. I have use cases where using exclusively immutable variables is simply not an option (e.g., allocating/destroying big arrays in memory at each time step). After all, a computational model is often viewed as a set of instructions that update one or more state variables. That being said, some sort of functional style is already supported! By appropriately setting the intent of the variables declared in a process, you can write processes that just read their inputs and recompute their outputs from scratch at each step, as sketched below.
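For example (a sketch; the class and variable names are made up):

```python
import xsimlab as xs


@xs.process
class Profile:
    """Hypothetical process holding the state variable."""
    u = xs.variable(dims='x', intent='inout')


@xs.process
class Diffusivity:
    """Function-like process: reads inputs, recomputes its output each step."""
    u = xs.foreign(Profile, 'u')               # read-only input
    k0 = xs.variable(description='base diffusivity')
    k = xs.variable(dims='x', intent='out')    # recomputed from scratch

    def run_step(self):
        self.k = self.k0 * abs(self.u)
```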
So, even though xarray-simlab uses Python classes to define processes, your definition of a process (a mathematical function...) may still be perfectly valid here! What you suggest is not much different: using Python functions instead of classes where ...
---

Rather than following an imperative paradigm, I've designed xarray-simlab as an object-oriented layer on top of a common data store for the simulation. Process classes are just a way to split the overall computation of a model into building blocks, and to provide some "syntactic sugar" for accessing the simulation data store in a consistent manner (i.e., with safeguards). Off the top of my head, I can imagine decorated (and annotated) Python functions that would be syntactic sugar for creating simple process classes that are also callable. It should be reasonably straightforward to make that work with everything else. The big remaining question is how to define variables with type annotations.
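For the "also callable" part, something as small as this might do (purely hypothetical, not in xarray-simlab):

```python
class CallableProcessMixin:
    """Make a process instance usable as a plain function in unit tests."""

    def __call__(self, **inputs):
        for name, value in inputs.items():
            setattr(self, name, value)
        self.run_step()
        return self
```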
---

I guess you're right; you're more familiar than I am with the design constraints.
---

As you suggested above, I've started drafting a PR to improve the example in the documentation. I've just noticed that we don't really need to pass the grid spacing from ... But, assuming we keep it as it is, shouldn't there be an arrow directly linking ...?
---

Actually, ...
Process dependencies are defined as follows: process 1 depends on process 2 if the latter declares a variable (resp. a foreign variable) with `intent='out'` that the former needs as an input. In the doc examples (e.g., ...).
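To illustrate with a minimal pair of processes (names made up):

```python
import xsimlab as xs


@xs.process
class Grid:
    spacing = xs.variable(description='grid spacing', intent='out')

    def initialize(self):
        self.spacing = 0.01


@xs.process
class Advection:
    # declaring this foreign variable (intent='in' by default) is what
    # makes Advection depend on Grid in the process graph
    spacing = xs.foreign(Grid, 'spacing')
```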
---

Ok, it makes sense.
Ok, I understand. In the graph, the arrows represent dependency in the sense of the order of execution. I was thinking of dependency in the sense of requiring the other process to also be part of the model.
---

You're right, "dependency" is confusing here. Maybe "workflow dependency" or "runtime dependency" would be better, to clearly differentiate it from "variable dependency".
---

I'm closing this issue, as one of the main concerns here (process unit-testing) has been addressed in #63. @mancellin, feel free to open new issues or comment on the relevant open issues if you want to further discuss other points raised here.
---

Sure, you can close the issue. Sorry for never finishing the documentation PR I mentioned above. Since then, I've been scratching my itch by doing some experimentation on my side: https://github.com/mancellin/labelled_functions
---

Hi,
I've discovered xarray-simlab recently and it has caught my attention.
The project looks very promising: I'm using xarray more and more in my models, and a framework helping with assembling and running models could help my workflow a lot.
These days, I mostly work with frequency-domain models, but I guess it would be easy to extend xarray-simlab to this kind of model; they are actually simpler than time-dependent ones.
After reading the doc and playing with some toy cases, I'd like to share some feedback about the API.
Hopefully, the issue tracker is not such a bad place for that.
I'm very impressed by the implementation, in particular the automatic assembling of the processes in a model. The integration with xarray is quite promising. The efforts on the documentation and the availability of the package are also great!
However, I am not totally convinced by the way the processes are defined.
Let's consider the example from the tutorial. The `ProfileU` and `u_vars` abstractions were not very clear to me, so I rewrote them in the framework of conservation laws, with which I'm more familiar. The full code is here.
I love the modularity of the design, allowing me to easily replace the advection term by another one or to add more of them.
However, I was not able to easily modify the time integration process to use a second-order scheme such as:
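One concrete form, just for illustration, is a Heun-type predictor-corrector:

$$
u^{*} = u^{n} + \Delta t \, f(u^{n}), \qquad
u^{n+1} = u^{n} + \frac{\Delta t}{2} \left( f(u^{n}) + f(u^{*}) \right)
$$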
To do that, I would have to create new advection flux classes. Having `EulerForwardTimeIntegration` in the foreign variable of `UpwindAdvectionFlux` is not a big problem: it could be possible to define this kind of linking in another way, for instance at model creation. But I would have to add to the `Flux` classes an expression mapping `predicted_u` to `predicted_flux`, which would be the same as the one mapping `u` to `flux`. Or, in other words, the process `UpwindAdvectionFlux` should be a function accepting other arguments than `u`.
And this last point illustrates my general feeling about the current API: most of the process classes could be rewritten as functions, and maybe (functional programming aficionado speaking) they should.
The wrapping around `attrs` is quite nice, but the `xs.process` classes are not data holders but functions, and the `xs.variable` attributes are function inputs and outputs with annotations. So, maybe the best way to define the processes is to define functions.
The metadata on the variables could be passed via the docstring or the type annotations, e.g.:
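For instance, something along these lines (a sketch; the `Variable` helper below is a stand-in so that the annotations are valid Python):

```python
import numpy as np


class Variable:
    """Stand-in metadata holder (not part of xarray-simlab), so that the
    annotations below are ordinary, valid Python expressions."""

    def __init__(self, dims=(), kind=None, description=''):
        self.dims, self.kind, self.description = dims, kind, description


def initial_gaussian_profile(
    x: Variable(kind='position'),
    loc: Variable(kind='position', description='location of initial profile'),
    width: Variable(kind='length', description='width of initial profile'),
) -> Variable(dims=('x',), kind='u'):
    """Initial condition: a Gaussian bump centered on `loc`."""
    return np.exp(-1 / width**2 * (x - loc)**2)
```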
I made a mock-up of the whole model here.
The code would be (in my humble opinion) both more flexible and more secure.
Indeed, the processes could be easily unit-tested, unlike the current process classes.
I don't know exactly how the model builder works, but it should still have enough clues to link the processes together.
Note finally that this framework would be easily extensible to other kinds of models than the time-dependent ones.
These were my two cents. I'd be happy to discuss it further if you want.
Cheers,