
Testing and Comparing Your Model


Overview

SDEverywhere includes extensive QA (quality assurance) packages and tools that are collectively known as "model-check". The model-check tool can run as you develop your model, either locally on your machine or in the cloud in a continuous integration environment (or both).

With model-check, there are two kinds of tests:

  • Checks are objective tests of a model's behavior.

    • These are "objective" in the sense that they always provide a yes/no or right/wrong answer.
    • Check tests are good for verifying that a model conforms to some expectations or ground truths.
    • They can help catch bugs and unintentional changes that might otherwise go undetected.
    • Here are a few examples of useful checks defined for En-ROADS (there are countless other examples that will vary from model to model); a YAML sketch of the first one appears just after this list:
      • Stocks should never be negative
      • The population variable values should be within +/- 5% of the historical population data for the years 1900-2025
      • The population variable values should be between 8 billion and 12 billion for all defined input scenarios
      • The temperature variable values should always be lower with input scenario X (e.g., with a carbon tax) than with input scenario Y (e.g., a baseline scenario with no carbon tax)
  • Comparisons are subjective tests of the behavior of two versions of the same model.

    • These are "subjective" in the sense that they don't usually provide a right/wrong answer and are subject to interpretation by the modelers.
    • Comparison tests are good for making sense of how a change to the model impacts the output values of that model under a wide variety of input scenarios.
    • Comparison tests allow for exercising a model under many different scenarios in a short amount of time.
    • The model-check report orders the results so that the most significant changes appear at the top, and the results are color coded to help you see at a glance which outputs have changed the most relative to the baseline/reference version of the model.
    • Here are a few examples of useful comparisons defined for En-ROADS (and as with check tests, there are countless other examples depending on your model):
      • Baseline scenario (all inputs at default)
      • All inputs at their minimum or maximum values (all at once)
      • All main sliders at their minimum or maximum values (all at once)
      • Each individual input at its minimum or maximum value (while others are at default)
      • Low, medium, and high carbon price (for testing values between "min" and "max")
      • Fossil fuel phase out (multiple "reduce new infrastructure sliders" set together)
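
To make the connection to the file format concrete, here is a sketch of how the first check above ("Stocks should never be negative") might be written using the YAML syntax described under "Defining Checks" below. This is only a sketch: it assumes a dataset group named "All Stocks" has been defined for your model, which may not match your configuration.

- describe: Stock Variables
  tests:
    - it: should never be negative for all input scenarios
      scenarios:
        - preset: matrix
      datasets:
        - group: All Stocks
      predicates:
        - gte: 0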

Defining Checks and Comparisons

Both checks and comparisons are typically defined in text files in YAML format, though it is possible to define them in JSON format or in TypeScript/JavaScript code if needed.

YAML files are designed to be read and edited by humans, but note that indentation is significant, so edit them with care. We recommend editing these files in VS Code with the YAML extension installed. The YAML files provided in the Quick Start templates include a reference to the schema at the top (which the YAML extension uses), so you get syntax highlighting and red squiggles that indicate when the syntax is incorrect.
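
For reference, a schema reference like that typically takes the form of a "yaml-language-server" modeline comment at the top of the file, a convention the YAML extension understands. The sketch below shows only the general shape; the schema path here is a placeholder, so keep whatever reference your generated files already contain:

# yaml-language-server: $schema=../path/to/check-schema.json
#
# (check or comparison definitions follow, as shown in the sections below)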

If you follow the Quick Start instructions, the generated template will include sample checks.yaml and comparisons.yaml files to get you started. Refer to the Creating a Web Application page for an overview of where these files reside in the recommended project structure.

Read the following two subsections for more details on how to define checks and comparisons.


Defining Checks

The following is an example of a group of check tests, taken from the SIR example project.

- describe: Population Variables
  tests:
    - it: should be between 0 and 10000 for all input scenarios
      scenarios:
        - preset: matrix
      datasets:
        - name: Infectious Population I
        - name: Recovered Population R
        - name: Susceptible Population S
      predicates:
        - gte: 0
          lte: 10000

Notes

  • The "describe" and "it" naming convention comes from unit testing frameworks in the software development world. This convention encourages naming tests in natural language that describes how the model should behave. For example, the test above is basically saying "population variables should be within a certain range across all input scenarios".
  • A group of tests starts with a describe field. This is used to group related tests together.
  • A describe group should contain one or more items in the tests field.
  • Each test starts with an it field that describes the expected behavior in plain language. The text usually begins with "should" (for example, this variable "should always be positive" or "should be close to historical values").
  • Each test includes three essential parts: scenarios, datasets, and predicates.
  • You are not limited to a single describe group or a single YAML file. You can put multiple describe groups in one file, or spread your tests across many YAML files under your checks folder (for example, population.yaml, temperature.yaml, and so on).

Scenarios

The scenarios field should contain one or more input scenarios for which the expectations hold true.

Examples:
  • A single scenario that includes a single input at a specific value:

    scenarios:
      - with: Input A
        at: 50
  • A single scenario that includes a single input at its defined extreme (minimum or maximum) value:

    scenarios:
      - with: Input A
        at: max
  • A single scenario that includes multiple input values set at the same time:

    scenarios:
      - with:
          - input: Input A
            at: 50
          - input: Input B
            at: 20
  • Multiple (distinct) scenarios that have the same expected behavior:

    scenarios:
      - with: Input A
        at: max
      - with: Input B
        at: max
  • A special "matrix" preset that will execute the test once for each input variable at its minimum, and again at its maximum:

    scenarios:
      - preset: matrix

Datasets

The datasets field should contain one or more datasets (output variables or external datasets) for which the expectations hold true.

Examples:
  • A single dataset referenced by name:

    datasets:
      - name: Output X
  • Multiple datasets referenced by name (one model output and one external dataset):

    datasets:
      - name: Output X
      - name: Historical Y
        source: HistoricalData
  • Multiple datasets in a predefined group:

    datasets:
      - group: Key Outputs

Predicates

The predicates field should contain one or more predicates, i.e., the behavior you expect to be true for the given scenario/dataset combinations.

Examples:
  • A predicate that says "greater than 0":

    predicates:
      - gt: 0
  • A predicate that says "greater than 10 and less than 20 in the year 1900":

    predicates:
      - gt: 10
        lt: 20
        time: 1900
  • A predicate that says "approximately 5 in the years between 1900 and 2000":

    predicates:
      - approx: 5
        tolerance: .01
        time: [1900, 2000]
  • A predicate that says "approximately 5 for the year 2000 and beyond":

    predicates:
      - approx: 5
        tolerance: .01
        time:
          after_incl: 2000
  • A predicate that says "within the historical data bounds for all years up to and including the year 2000":

    predicates:
      - gte:
          dataset:
            name: Historical X confidence lower bound
        lte:
          dataset:
            name: Historical X confidence upper bound
        time:
          before_incl: 2000

More Examples

For more examples of different kinds of check tests (including various predicates, combinations of inputs, time ranges, etc), refer to the checks.yaml file in the sample-check-tests example.

Screenshots

The following is a screenshot of the "Checks" tab in a sample model-check report, which shows two expanded test results, one that is failing (note the red X's) and one that is passing (note the green checkmarks).

[Screenshot: "Checks" tab with one failing and one passing test result]

Defining Comparisons

The following is an example of a comparison scenario definition, taken from the SIR example project.

- scenario:
    title: Custom scenario
    subtitle: with avg duration=4 and contact rate=2
    with:
      - input: Average Duration of Illness d
        at: 4
      - input: Initial contact rate
        at: 2

Notes

  • A comparisons.yaml file typically contains at minimum one or more scenario definitions, but it can also contain scenario_group, graph_group, and view_group definitions.
  • You are not limited to a single YAML file for your comparisons. You can put multiple definitions in one file, or spread them across many YAML files under your comparisons folder (for example, renewables.yaml, economy.yaml, and so on).

Scenarios

A scenario definition represents an input scenario under which each output variable of the two model versions will be compared. The format of a scenario is similar to that of a check test (see above), except that it can contain:

  • a title and subtitle (for keeping similar scenarios grouped together in the model-check report)
  • an optional id (which allows the scenario to be referenced in a scenario_group or view_group definition; see the last example below)
Examples:
  • A scenario that includes a single input at a specific value:

    - scenario:
        title: Input A
        subtitle: at medium growth
        with: Input A
        at: 50
  • A scenario that includes a single input at its defined extreme (minimum or maximum) value:

    - scenario:
        title: Input A
        subtitle: at maximum
        with: Input A
        at: max
  • A scenario that includes multiple input values set at the same time:

    - scenario:
        title: Inputs A+B
        subtitle: at medium growth
        with:
          - input: Input A
            at: 50
          - input: Input B
            at: 20
  • A "baseline" scenario that sets all inputs to their default values:

    - scenario:
        title: All inputs
        subtitle: at default
        with_inputs: all
        at: default
  • A special "matrix" preset that will generate comparisons for each input variable at its minimum, and again at its maximum:

    - scenario:
        preset: matrix
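  • A scenario that also declares the optional id mentioned above (the id value here is an arbitrary example), so that it can later be referenced from a scenario_group or view_group definition:

    - scenario:
        id: input_a_at_max
        title: Input A
        subtitle: at maximum
        with: Input A
        at: max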

Scenario Groups

TODO: This section is under construction. See "More Examples" below for a link to an example of scenario groups.

Views and View Groups

TODO: This section is under construction. See "More Examples" below for a link to an example of view groups.

More Examples

For more examples of different kinds of comparison definitions (including different ways to define scenarios, scenario groups, views, etc), refer to the comparisons.yaml file in the sample-check-tests example.

Screenshots

The model-check report includes two separate tabs for viewing comparisons.

The "Comparisons by scenario" tab summary view lists all the input scenarios that were compared:
[Screenshot: "Comparisons by scenario" summary view]

Clicking on a scenario will take you to a detail view that shows graphs of all output variables under that input scenario:
[Screenshot: scenario detail view with graphs of all output variables]

The "Comparisons by dataset" tab summary view lists all the datasets (output variables and external datasets) that were compared:
[Screenshot: "Comparisons by dataset" summary view]

Clicking on a dataset will take you to a detail view that shows graphs of that dataset for each tested input scenario:
[Screenshot: dataset detail view with graphs for each tested scenario]


Measuring Performance

Every model-check report includes a table summarizing the size and run time (speed) of the two versions of your generated model being compared.

For example:
[Screenshot: model size and run time summary table]

If you click on the blue and red "heat map" on the right side of that table, it will open a performance testing page:
[Screenshot: performance testing page]

Click on the "Run" button a few times to get a sense of how the run times compare for the two versions of the model. (Note that it's currently a somewhat hidden feature, so the UI is not fully polished.)

The heat map display is useful for seeing the average time and distribution of outlying samples. To ensure consistent results, it is recommended to run performance tests when your computer is "quiet" (idle).
