GOTO language documentation #7471

New issue

Closed

GOTO language documentation#7471

Assignees

Labels

awsdocumentation

remi-delmas-3000

This thread is to collect needs for the GOTO language documentation effort we're starting in 2023.

The goal is to use this thread to scope the documentation and collect questions the documentation is supposed to answer.

which fragment of GOTO shall we document ?
what aspects of the language shall we document ?
- symbols and symbol tables,
- goto models,
- goto functions,
- goto programs,
- goto expressions abstract syntax,
- typchecking rules for instructions and expressions
- block structure and loops for goto programs,
- three level indexing (as needed to express trace semantics),
- stack vs heap model,
- built-in functions and predicates,
- supporting functions such as __CPROVER_start, __CPROVER_initialize,
- supporting state variables describing platform arch and memory safety check instrumentation,
- binary file format
- trace language syntax
- trace semantics (or rather, how to map a trace back to)

added

assigned

and

Contributor

We already have a documentation PR we just raised as part of our internal efforts to document more of CBMC.

It's at #7470 and covers the GOTO IR data structures with some light documentation.

celinval

Collaborator

Awesome! Is it possible to also document the goto file format?

martin-cs

Collaborator

On Tue, 2023-01-10 at 10:04 -0800, Celina G. Val wrote: Awesome! Is it possible to also document the goto file format?

I don't want to be negative but I would not recommend it. GOTO files are a direct serialisation of a bunch of C++ data structures. This is intentional. They are intended to be "internal" to the tool, even more so that `.o` files are for compilers. They are not intended to be an interface point for external tools. If you want to read / write / manipulate them the intent is that you link the relevant bits of CPROVER. See also the discussions over cross language APIs and why JSON Symtab exists. Why? If we declare that an interface then we fix: 1. The syntax / fields / representation of the data structures. 2. How they are used to represent things. 1. is annoying. They do change from time to time and I don't think anyone is claiming they are so good that we should freeze them. There is a good-faith effort to change the version number of GOTO files when this happens but I don't think anyone tests it so... 2. is more insideous and troubling and why this documentation is so needed but also hard. How constructs are expressed changes as and when things need to be fixed. They are still structurally the same but they use the ireps differently. This is why JSON Symtab based tools tend to break with an inexplicable error deep inside an entirely unrelated part of the code. I would recommend getting the documentation and, more importantly, the testing and checking of the key data structures really good before you attempt this.

martin-cs

Collaborator

@remi-delmas-3000 I really want to support this effort because it is so very needed. So please read this as "lessons learnt for the language documentation projects 2014, 2015, etc.":

Decide if you are writing documentation for developers / system integrators, beginning users or advanced users. See earlier discussions on the purpose of different bits of documentation.
There is already quite a bit of documentation. It is scattered. Bringing it together and rationalising is a worthwhile thing to do. Maintenance of documentation must be included in your plan to avoid this happening again.
Lifespan of up-to-date documentation, from best to worse:

Checks run during normal program operation
Checks run in debug mode
Unit / regression tests run in CI
Comments inline.
Everything in doc/
Anything out of the repository

For developer documentation, I would recommend starting with goto-programt and working down (exprt) and up ('goto_modeltandsymbol_tablet`) as they are core to pretty much everything. See also RFC: Documenting and checking "normal form" properties of goto_models #6495
For 'must always' and 'must not' requirements, best would be a literate programming implementation in the various .validate() methods. That is exactly why they were introduced.
For 'may' requirements, best would be regression or unit tests, JSON symtab might be a route.
Be prepared inconsistency between different parts of the code about what are well structured programs / semantics ( for example What is a safe overapproximation of an OTHER instruction #4323 and Killing #return_value behaves inconsistently #4336 ), dead code and untested areas of the code base.

added

CollaboratorAuthor

@martin-cs thanks for jumping in !

There will be two parallel documentation efforts :

(led by Diffblue) document GOTO as seen from inside CBMC for developers
(led by AWS) document GOTO as abstract language for users who need to :

produce GOTO from their own source language (they need to understand the abstract syntax/semantics of GOTO),
would like to design/specify instrumentation rules for their programs,
would like to write a GOTO interpreter that would be capable of validating a counter example trace produced by a GOTO verification tool.

For 2. we envision a document in natural language.

I would recommend getting the documentation and, more importantly, the
testing and checking of the key data structures really good before you
attempt this.

Agreed. We need to understand what is there before attempting to describe it and I'm expecting this to be a bottom-up effort, starting from the implementation.

is annoying. They do change from time to time and I don't think
anyone is claiming they are so good that we should freeze them. There
is a good-faith effort to change the version number of GOTO files when
this happens but I don't think anyone tests it so...

We mostly want to describe the abstract syntax of GOTO models (what's a goto model, a goto function, a goto program, a goto instruction, what operators are available in the expression language, and their semantics, i.e. small-step transition relation over concrete states). This could be similar to GOTO semantics by translation to FOL for instance.

Selecting a concrete syntax could come in a second step.

is more insideous and troubling and why this documentation is so needed but also hard. How constructs are expressed changes as and when things need to be fixed. They are still structurally the same but they use the ireps differently. This is why JSON Symtab based tools tend to break with an inexplicable error deep inside an entirely unrelated part of the code.

I thought that defining how things are modelled is mostly the role source-to-GOTO conversion. Do you know how often the meaning basic GOTO operators or instruction types are changed by changing their interpretation in Symex or when translating the SSA format to SAT or SMT ?

martin-cs

Collaborator

On Wed, 2023-01-11 at 09:08 -0800, Rémi Delmas wrote: There will be two parallel documentation efforts : 1. (led by Diffblue) document GOTO as seen from inside CBMC for developers 2. (led by AWS) document GOTO as abstract language for users who need to : * produce GOTO from their own source language (they need to understand the abstract syntax/semantics of GOTO),

How do you envisage them creating these? Which interface are you going to use?

* would like to write a GOTO interpreter that would be capable of validating a counter example trace produced by a GOTO verification tool.

I trust you know about `goto-instrument --interpreter` right? Also, if you made use of `memory-analyzer/gdb_api.h` and @tautschnig's work on GOTO programs in ELF format, you could also probably run the actual binary back-to-back to check the modelling.

For 2. we envision a document in natural language.

I would encourage you to make examples into unit tests and the like. In the past natural language documentation has not aged well or stayed tidy.

> 1. is more insideous and troubling and why this documentation is > so needed but also hard. How constructs are expressed changes as > and when things need to be fixed. They are still structurally the > same but they use the ireps differently. This is why JSON Symtab > based tools tend to break with an inexplicable error deep inside an > entirely unrelated part of the code. I thought that defining how things are modelled is mostly the role source-to-GOTO conversion.

Yes... but... it's not just there. For example, the semantics of exceptions is really in a transformation pass. The semantics of the heap is ... kind of scattered, some in the front-end, some in `ansi- c/library/` some in the symex / interpreter, some in the instrumentation.

Do you know how often the meaning basic GOTO operators or instruction types are changed by changing their interpretation in Symex or when translating the SSA format to SAT or SMT ?

Ummm... that's a bit difficult to answer. It's more how they are interpreted. For example, in the parse tree an expression can include arbitrary side-effects, function calls, etc. By the time they are in a GOTO program, they are side-effect free, have some types removed and are used in a semantics where variables are mutable. Once they become VCs they are interpreted with an immutable, logic-based semantics. Is that what you mean? Cheers, - Martin

…

-- UCU staff are currently in dispute with employers over pay, pensions and working conditions. Part of the industrial action is 'action short of a strike'. This may affect the speed at which you get responses to e-mails and requests from some staff. Details here: https://www.ucu.org.uk/article/12469/FAQs

celinval

Collaborator

Thanks @martin-cs for jumping in. If not goto binaries, which language do you suggest that we use to interface with CBMC?

martin-cs

Collaborator

@celinval I think goto binaries are a really good way of passing programs between tools. The goto-cc / goto-instrument / analysis tool (cbmc, goto-analyzer, etc.) split has shown to be a good architecture.

What I had meant to communicate in my previous response was that I strongly suggest using the CPROVER code for creating / reading / writing these because it insulates you from a number of drift and updating issues.

In cases where this is not possible (for example, you are not writing a tool in C++) then it's a bit more awkward. symtab2gb and the JSON symtab was our best attempt at this. It isn't ideal but it has worked for a number of years. See #7042 (comment) and the following discussion for some more of the history / context for this.

HTH

celinval

Collaborator

Hi @martin-cs, our tool is written in Rust. We do currently use the JSON format and symtab2gb but it scales poorly. The JSON files are often above 1GB and symtab2gb takes a considerable amount to convert them to goto.

7 remaining items

to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

Labels

awsdocumentation

Type

No type

Projects

GOTO-program standardisation

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

GOTO language documentation #7471

7 remaining items

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

GOTO language documentation #7471

Description

Activity

NlightNFotis commented on Jan 10, 2023

celinval commented on Jan 10, 2023

martin-cs commented on Jan 11, 2023

martin-cs commented on Jan 11, 2023

remi-delmas-3000 commented on Jan 11, 2023

martin-cs commented on Jan 11, 2023

celinval commented on Jan 12, 2023

martin-cs commented on Jan 12, 2023

celinval commented on Jan 12, 2023

7 remaining items

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions