-
Notifications
You must be signed in to change notification settings - Fork 274
Add documentation for the central data structures in CBMC IR #7470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add documentation for the central data structures in CBMC IR #7470
Conversation
Codecov ReportBase: 78.50% // Head: 78.50% // No change to project coverage 👍
Additional details and impacted files@@ Coverage Diff @@
## develop #7470 +/- ##
========================================
Coverage 78.50% 78.50%
========================================
Files 1663 1663
Lines 191297 191297
========================================
Hits 150174 150174
Misses 41123 41123 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
at the IR level, and effectively it's the following ADT (in pseudocode): | ||
|
||
```js | ||
type goto_functiont { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to link this to the actual code or generated doxygen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can link to the file using raw markdown, but I don't think it can link to an actual code structure.
I need to look more into doxygen and its linking capabilities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can link it, it might help increase the longevity. See #7470 (comment) for my experiences with the lifespan of documentation. The last thing I want is for you to put in the effort documenting things and then, in 2 years, have to do it all again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to link to either the actual code or the generated doxygen output for these things. Duplicating the documentation of the same thing in two places is not sustainable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the final documentation, doxygen
is automatically linking class names to the class documentation it produces, so when the class names are written in full they get automatically picked up by it and linked.
However, given that these are central data structures and calcified by now (i.e. not experiencing significant change with regards to the actual components of the class), I'd rather the pseudocode representation stays as a fast-track way for someone to understand the essence of these data structures without the informational load and visual clutter that the class documentation pages in doxygen
have.
This is an `irept`. `irep`s are the central data structure that model most entities inside | ||
CBMC and the assorted tools - effectively a node/map like data structure that forms a hierachical | ||
tree that ends up modeling graphs like ASTs, CFGs, etc. (This will be further discussed in | ||
a dedicated page). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure we should commit TODO notes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was not so much as a TODO, rather, it's a small digression to guide user expectation (that they should expect a dedicated topic on this that they need to look for into the documentation).
Do you think this should leave, or is there a better way to phrase this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe put in a link when it is available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on not adding a TODO/promise. Maybe give a really brief overview and link to the source code/doxygen?
487c1a0
to
5b9ce2c
Compare
# Central Data Structures | ||
|
||
The following is some light technical documentation of the major data structures | ||
represented in the input transformation to Intermediate Representation (IR) inside |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"represented" -> "used"
Editing note, try not to use the same word for two different things in one sentence.
The `goto_modelt` is the main data structure that CBMC (and the other tools) use to | ||
represent GOTO-IR (the `GOTO` Intermediate Representation). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"use to ..." -> "use for the GOTO-IR (GOTO
Intermediate Representation)."
## goto_functiont | ||
|
||
A `goto_functiont` is also a product type. It's designed to represent a function | ||
at the IR level, and effectively it's the following ADT (in pseudocode): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"IR" not defined. Above the definition was only for "GOTO-IR".
"and effectively it's the following ..." -> "like the following ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intermediate representation (IR) is defined in line 4 of the same file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite, you defined "GOTO-IR", not "IR". I propose change line 4:
"GOTO-IR (GOTO Intermediate Representation)" to something like "GOTO intermediate representation (IR) denoted GOTO-IR"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why say IR when we could just say GOTO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why say IR when we could just say GOTO?
Because GOTO is a CBMC concept, and IR is a lot more likely to be a known concept to a computer scientist, so by virtue of connecting the two concepts, we are leveraging people's pre-existing understanding of a general concept to relate it to the internal one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already explained that in the first paragraph. Repeating it again is redundant.
Of the two types listed above, the `parameters` one should be self-documenting, | ||
(the values of these are later looked up in the symbol table), so we're going to | ||
focus next on the type `goto_programt` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stylistically you should describe the parameters in order and consistently (either the type or the name), maybe something like:
"The rest of this section details the body
and goto_programt
, the parameters
are a self-documenting list
of identifiers
that are later looked up in the symbol table."
* A `goto_instruction_codet` is a `codet` (a data structure broadly representing a statement | ||
inside CBMC) that contains the actual GOTO-IR instruction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So a GOTO-IR modelt
contains GOTO-IR functiont
s which contains a list of GOTO-IR body
s which are GOTO-IR programt
s that contain GOTO-IR instructiont
s that contain an instruction
that contains GOTO-IR instruction_codet
that contain GOTO-IR codet
that contain GOTO-IR instruction_codet
that is a GOTO-IR codet
that contains GOTO-IR instructiont
... and this is not at all confusing or circular?
Or have I misread the documentation to this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't choose the class names 😅
Yes, your understanding is more or less correct. I'll see if I can simplify it further, but I can't promise massive improvements - this is just the description of the status quo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the type names involved seem a little obtuse, which is just why it is so important to document them. Some of the relationships as expressed by @TGWDB are not quite right however. Given that this has been misunderstood by one person already, it suggests that we haven't managed to explain the relationships sufficiently clearly in this documentation.
- A model contains a collection of functions.
- Each function contains a single "body" which is of type
goto_programt
. - A "body" contains an ordered collection of
instructiont
. - An
instructiont
may optionally contain a "code" expression which is of typecodet
.codet
andgoto_instruction_codet
are currently defined to be the same type, but this is not guaranteed to be the case in future.
Note that a function has a single "body" not a collection of them. The fact that a function body is of type goto_programt
is the especially confusing point, as it is not an entire program! The set of data structures are not circular. There are 4 container ship relations between 5 levels of data.
The instructiont
class type works a bit like a tagged union. For some values of the instruction type, the code
member is constrained to be a particular sub-class of codet
. For others such as the SKIP
instruction type, the code
member is expected to be empty (is ID_nil
). Given that instructions also have other data members intructiont
is both a sum and product type.
This is an `irept`. `irep`s are the central data structure that model most entities inside | ||
CBMC and the assorted tools - effectively a node/map like data structure that forms a hierachical | ||
tree that ends up modeling graphs like ASTs, CFGs, etc. (This will be further discussed in | ||
a dedicated page). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on not adding a TODO/promise. Maybe give a really brief overview and link to the source code/doxygen?
5b9ce2c
to
16027ae
Compare
|
||
A `goto_modelt` is effectively a product type, consisting of: | ||
|
||
* A list of GOTO functions (pseudocode: `type goto_functionst = list<goto_functiont>`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest describing this as "A collection of GOTO functions" rather than a list, in order to avoid readers potentially conflating this with the linked list data structure. The exact data structure used within the class is a map, but this is not relevant to the point here in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation in this PR provides a reasonable summary of each of the individual classes involved. However I think this is still missing an overview of the way they relate to each other. This should be separate from the sections documenting the individual classes. It would be nice to have these relationships shown as a diagram.
## goto_functiont | ||
|
||
A `goto_functiont` is also a product type. It's designed to represent a function | ||
at the IR level, and effectively it's the following ADT (in pseudocode): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why say IR when we could just say GOTO?
at the IR level, and effectively it's the following ADT (in pseudocode): | ||
|
||
```js | ||
type goto_functiont { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to link to either the actual code or the generated doxygen output for these things. Duplicating the documentation of the same thing in two places is not sustainable.
} | ||
``` | ||
|
||
Of the two types listed above, the `parameters` one should be self-documenting, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with your assertion that "parameters" is self documenting. We should explain that a parameter identifier can be looked up in the symbol table (they do need to exist in the symbol table right?!). This may be obvious for an experienced CBMC developer but we should spell it out for those who are new to cbmc. This is important as an inexperienced developer could do something like constructing a goto function which has parameters which don't exist in the symbol table. We should also explain what information the parameter symbol should contain. Does the parameter symbol have the value set to be an instance of code_typet::parametert
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
type goto_instructiont { | ||
instruction_type instr_type | ||
instruction statement | ||
guard boolean_expr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is source_location
missing from this list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Among others? There are indeed various other data members, not all of which we should keep in there long-term, but if you are seeking to document the current state then all of them should be in here?!
* A `goto_instruction_codet` is a `codet` (a data structure broadly representing a statement | ||
inside CBMC) that contains the actual GOTO-IR instruction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the type names involved seem a little obtuse, which is just why it is so important to document them. Some of the relationships as expressed by @TGWDB are not quite right however. Given that this has been misunderstood by one person already, it suggests that we haven't managed to explain the relationships sufficiently clearly in this documentation.
- A model contains a collection of functions.
- Each function contains a single "body" which is of type
goto_programt
. - A "body" contains an ordered collection of
instructiont
. - An
instructiont
may optionally contain a "code" expression which is of typecodet
.codet
andgoto_instruction_codet
are currently defined to be the same type, but this is not guaranteed to be the case in future.
Note that a function has a single "body" not a collection of them. The fact that a function body is of type goto_programt
is the especially confusing point, as it is not an entire program! The set of data structures are not circular. There are 4 container ship relations between 5 levels of data.
The instructiont
class type works a bit like a tagged union. For some values of the instruction type, the code
member is constrained to be a particular sub-class of codet
. For others such as the SKIP
instruction type, the code
member is expected to be empty (is ID_nil
). Given that instructions also have other data members intructiont
is both a sum and product type.
Another important data structure that needs to be discussed at this point is | ||
`source_locationt`. | ||
|
||
This is an `irept`. `irep`s are the central data structure that model most entities inside |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the fact that source_locationt
inherits from irept
relevant to this part of the documentation? I would consider this to be a low-level implementation detail and omit it from the high level overview.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but it might be important to cover irept
by itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have separated the explanation of irept
itself and added further details about it.
|
||
It might be possible that a specific source location might point to a CBMC instrumentation | ||
primitive (which might be reported as a `built-in-addition`) or there might even be no-source | ||
location (because it might be part of harnesses generated as an example, that have no presence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⛏️ "harness" is not well a well defined at this point in the documentation. We use it to mean the cbmc generated code surrounding the entry point specified to cbmc. This generated harness initialises global values, and the arguments for the entry points parameters. However my understanding is that there are users who write their own "harness" functions in C around the function they are verifying. So in this context there would be 2 harnesses when cbmc runs. The cbmc harness which calls the user written harness which calls the function the user is verifying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of object to the notion that we allow code with no location. We don't just generate code out of thin air for no reason. I would think that every function that can generate code should be required to take a location parameter, and I'd like to see us add an assertion that it is non-empty.
We just went through this with the function contracts and had to go all the way back to the front end to make sure that the necessary location information information was being piped down to the point where code was being generated. But, the result is code that is user debuggable.
16027ae
to
b6bd9b2
Compare
FYI - For the benefit of those who may be watching this PR, the work to address the comments on this has been handed over to myself from Fotis. |
|
||
A `goto_modelt` is effectively a pair, consisting of: | ||
|
||
* A list of GOTO functions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps say "A collection" (the precise data type is immaterial here, but people might wrongly be led to think it actually is a list).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its actually a mapping from the function identifier to the GOTO function.
A `goto_modelt` is effectively a pair, consisting of: | ||
|
||
* A list of GOTO functions, | ||
* A symbol table containing symbol references for the symbols contained in the GOTO functions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a "symbol reference?" Also, shouldn't the abstraction level be kept consistent and this just be "A collection of the symbols used in GOTO functions."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now reworked the explanation around symbols / the symbol_table.
In pseudocode, the type looks this: | ||
|
||
```js | ||
type goto_modelt { | ||
type goto_functionst = list<goto_functiont> | ||
type symbol_tablet = map<identifier, symbolt> | ||
} | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this add value?
a more elaborate explanation in the next section. | ||
|
||
The `parameters` subcomponent is a list of identifiers that are to be looked-up | ||
in the symbol-table for their values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "values" the most appropriate term here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"value" is a word which is a little too general / overloaded in this context. I have gone for "definition" instead for the moment.
A `goto_programt` is a list of GOTO instructions. In pseudocode, it would | ||
look like `type goto_programt = list<goto_instructiont>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the deal with pseudocode? Is that sentence adding any value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed this particular example of pseudocode.
type goto_instructiont { | ||
instruction_type instr_type | ||
instruction statement | ||
guard boolean_expr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Among others? There are indeed various other data members, not all of which we should keep in there long-term, but if you are seeking to document the current state then all of them should be in here?!
Another important data structure that needs to be discussed at this point is | ||
`source_locationt`. | ||
|
||
This is an `irept`. `irep`s are the central data structure that model most entities inside |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but it might be important to cover irept
by itself?
8855b93
to
234a124
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only occurred to me (thanks to a reminder by @zhassan-aws) that we have https://github.com/diffblue/cbmc/blob/develop/src/goto-programs/README.md. Can I please ask that we do a decent editor's job and avoid having similar information in multiple places?
The documentation being added in this PR is intended to be an architectural overview which explains how the different data structures fit together and where the detailed documentation for each of them can be found. Of a necessity, a less detailed overview will include information which is explained in more detail in the documentation which is specific to each data structure. It is also expected that this architectural overview will include points which are not directly relevant to goto-programs, such as the section we have now added on I can update the documentation added here to link to |
Yes, please make sure there is cross-referencing and also no contradicting information (redundant information likely is unavoidable, but that's ok). |
|
||
It might be possible that a specific source location might point to a CBMC instrumentation | ||
primitive (which might be reported as a `built-in-addition`) or there might even be no-source | ||
location (because it might be part of harnesses generated as an example, that have no presence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of object to the notion that we allow code with no location. We don't just generate code out of thin air for no reason. I would think that every function that can generate code should be required to take a location parameter, and I'd like to see us add an assertion that it is non-empty.
We just went through this with the function contracts and had to go all the way back to the front end to make sure that the necessary location information information was being piped down to the point where code was being generated. But, the result is code that is user debuggable.
@@ -111,7 +111,7 @@ A goto program is a sequence of GOTO instructions (`goto_instructiont`). For | |||
details of the `goto_programt` class itself see the documentation in | |||
[`goto_program.h`](../../src/goto-programs/goto_program.h). For further details | |||
of the life cycle of goto programs see | |||
[`src/goto-programs/README.md`](../../src/goto-programs/README.md). | |||
[`src/goto-programs/README.md`](https://github.com/diffblue/cbmc/blob/develop/src/goto-programs/README.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will this be rendered as on diffblue.github.io?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This link follows the same format as is found here -
[COMPILING.md](https://github.com/diffblue/cbmc/blob/develop/COMPILING.md#what-architecture) |
this example is rendered to https://diffblue.github.io/cbmc/compilation-and-development.html
The original link I used was causing the check-doxygen job to fail with the error - /home/runner/work/cbmc/cbmc/doc/architectural/central-data-structures.md:120: warning: unexpected token TK_EOF as the argument of ref
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, but COMPILING.md
isn't actually rendered on diffblue.github.io, while goto-programs/README.md
is (the latter as https://diffblue.github.io/cbmc/group__goto-programs.html). It would be great to find a way so that people aren't sent back&forth between Doxygen-rendered content and GitHub rendered pages.
To avoid introducing the new term `GOTO-IR` which is not widely used within the codebase.
Is was not previously clear that `goto_modelt` was used as a storage class.
Main suggests that there are other entities which hold goto, rather than expressing that all the other data structures are contained within the model.
So that we can add a relationship diagram to the central data structures documentation page.
In order to prevent it being included at the root level of the documentation site.
Because the data structure used is not a list and because we should avoid being specific here in order to maintain a high level of abstraction.
Because the symbol table contains the symbols; not symbol expressions.
To draw attention to the link to the `goto_programt` classes' documentation. Note that class names are automatically linked in generating the website.
This existence of an abstract interface was previously mentioned but not properly explained or justified.
In order to avoid the overloaded word "value".
In order to make it easier to find the documentation concerning these classes.
Because `irept` itself should be considered to live at a separate level of abstraction from the other classes.
In order to fix doxygen error.
2c747c9
to
e55d40e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - Approved.
(Cannot give green tick because Github UI doesn't allow the person who raised a PR to approve or request changes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving with the agreement that #7497 will take care of cleanup.
This PR describes the central GOTO-IR data structures
at a light level of detail (containing some pseudocode).
Please let us know if the level of detail contained in this PR
is adequate, or if there's more that's needed.