rework ZIR memory layout; overhaul source locations #8266

andrewrk · 2021-03-16T07:06:57Z

This is a follow-up from #7920, applying the same principles and strategies to ZIR.

The memory layout for ZIR instructions is completely reworked. See
zir.zig for those changes. Some new types:

* `zir.Code`: a "finished" set of ZIR instructions. Instead of allocating
  each instruction independently, there is now a Tag and 8 bytes of
  data available for all ZIR instructions. Small instructions fit
  within these 8 bytes; larger ones use 4 bytes for an index into
  `extra`. There is also `string_bytes` so that we can have 4 byte
  references to strings. `zir.Inst.Tag` describes how to interpret
  those 8 bytes of data.
  - This is shared by all `Block` scopes.
* `Module.WipZirCode`: represents an in-progress `zir.Code`. In this
  structure, the arrays are mutable, and get resized as we add/delete
  things. There is extra state to keep track of things. This struct is
  stored on the stack. Once it is finished, it produces an immutable
  `zir.Code`, which will remain on the heap for the duration of a
  function's existence.
  - This is shared by all `GenZir` scopes.
* `Sema`: represents in-progress semantic analysis of a `zir.Code`.
  This data is stored on the stack and is shared among all `Block`
  scopes. It is now the main "self" argument to everything in the file
  that was previously named `zir_sema.zig`.
  Additionally, I moved some logic that was in `Module` into here.
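
As a rough illustration of the layout described above, here is a minimal sketch of a tag plus 8 bytes of data per instruction, with `extra` and `string_bytes` as side arrays. The tag names, the `Data` fields, the null-terminated string encoding, and the "length first" payload encoding are all assumptions made up for this example; see zir.zig for the real definitions.

```zig
const std = @import("std");

// Hypothetical, simplified stand-ins for the types in zir.zig.
const Tag = enum(u8) {
    int, // operand is stored directly in the 8 bytes of data
    str, // data holds a 4-byte offset into `string_bytes`
    call, // data holds a 4-byte index into `extra`, where the payload lives
};

// `extern` gives the union a fixed layout with no hidden safety tag,
// so it is exactly 8 bytes in every build mode.
const Data = extern union {
    int: u64,
    str_start: u32,
    payload_index: u32,
};

comptime {
    std.debug.assert(@sizeOf(Data) == 8);
}

const Code = struct {
    tags: []const Tag,
    data: []const Data,
    extra: []const u32,
    string_bytes: []const u8,

    // Assumed encoding: strings are null-terminated inside `string_bytes`.
    fn string(code: Code, inst: usize) []const u8 {
        const start = code.data[inst].str_start;
        return std.mem.sliceTo(code.string_bytes[start..], 0);
    }

    // Assumed encoding: the payload is an argument count followed by the arguments.
    fn callArgs(code: Code, inst: usize) []const u32 {
        const start = code.data[inst].payload_index;
        const len = code.extra[start];
        return code.extra[start + 1 ..][0..len];
    }
};

// Three instructions: an integer, a string, and a call with two arguments.
const example = Code{
    .tags = &[_]Tag{ .int, .str, .call },
    .data = &[_]Data{
        .{ .int = 42 },
        .{ .str_start = 0 },
        .{ .payload_index = 0 },
    },
    .extra = &[_]u32{ 2, 0, 1 },
    .string_bytes = "hello\x00",
};

pub fn main() void {
    std.debug.print("{s}\n", .{example.string(1)});
    std.debug.print("{d} call args\n", .{example.callArgs(2).len});
}
```

The point of the sketch is that every instruction costs the same fixed amount in the main arrays, and anything larger is reached through a 4 byte index rather than a separately allocated object.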

Module.Fn now stores its parameter names inside the zir.Code,
instead of inside ZIR instructions. When the TZIR memory layout
reworking time comes, codegen will be able to reference this data
directly instead of duplicating it.

astgen.zig is (so far) almost entirely untouched, but nearly all of it
will need to be reworked to adhere to this new memory layout structure.

I have no benchmarks to report yet, as I am still working through
compile errors and fixing various things that I broke in this branch.

Overhaul of Source Locations:

Previously we used `usize` everywhere to mean byte offset, but sometimes
it also meant other things. This was error prone, and it also made us do
unnecessary work and store unnecessary bytes in memory.

Now there are more types involved in source locations, and more ways
to describe a source location.

* AllErrors.Message: embrace the assumption that files always have fewer
  than 1 << 32 bytes (4 GiB).
* SrcLoc gets more complicated, to model more complicated source
  locations.
* Introduce LazySrcLoc, which can model interesting source locations
  with very little stored state. Useful for avoiding doing unnecessary
  work when no compile errors occur.
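
To make the "very little stored state" idea concrete, here is a minimal sketch of a lazily resolved source location. The variant names, the `toByteOffset` helper, and the `token_starts` table are assumptions for illustration only and are not the actual LazySrcLoc definition. The stored value is a small offset relative to the containing Decl, and the byte offset is only computed when an error message needs it.

```zig
const std = @import("std");

// Hypothetical, simplified stand-in for a lazily resolved source location.
const LazySrcLoc = union(enum) {
    // Already an absolute byte offset into the file.
    byte_abs: u32,
    // Index of a token, relative to the first token of the containing Decl.
    token_offset: u32,

    // `token_starts` maps a token index to the byte offset where it starts.
    // This lookup is only needed once a compile error actually occurs.
    fn toByteOffset(loc: LazySrcLoc, decl_first_token: u32, token_starts: []const u32) u32 {
        return switch (loc) {
            .byte_abs => |off| off,
            .token_offset => |delta| token_starts[decl_first_token + delta],
        };
    }
};

pub fn main() void {
    // Made-up token start offsets for a tiny file.
    const token_starts = [_]u32{ 0, 7, 10, 16, 18, 27 };
    const loc = LazySrcLoc{ .token_offset = 3 };
    std.debug.print("byte offset: {d}\n", .{loc.toByteOffset(2, &token_starts)});
}
```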

Also, previously, we had src: usize on every ZIR instruction. This is
no longer the case. Each instruction now determines whether it even cares
about source location, and if so, how that source location is stored.
This requires more careful work inside Sema, but it results in fewer
bytes stored on the heap, without compromising accuracy and power of
compile error messages.

Miscellaneous:

* std.zig: string literals have more helpful result values for
  reporting errors. There is now a lower level API and a higher level
  API.
  - side note: I noticed that the string literal logic needs some love.
    There is some unnecessarily hacky code there.
* Cut & pasted some TZIR logic that was in zir.zig to ir.zig. This
  probably broke stuff and needs to get fixed.
* Removed type/Enum.zig, type/Union.zig, and type/Struct.zig. I don't
  think this is quite how this code will be organized. Need some more
  careful planning about how to implement structs, unions, enums. They
  need to be independent Decls, just like a top level function.

We are now passing this test:

```zig
export fn _start() noreturn {}
```

```
test.zig:1:30: error: expected noreturn, found void
```

I ran into an issue where we get an integer overflow trying to compute node index offsets from the containing Decl. The problem is that the parser adds the Decl node after adding the child nodes. For some things, it is easy to reserve the node index and then set it later. For this case, however, it is not a trivial code change, because the tokens that follow the decl determine whether we want to add a new node or not.

Possible strategies here:

1. Rework the parser code to make sure that Decl nodes are before children nodes in the AST node array.
2. Use signed integers for Decl node offsets.
3. Just flip the order of subtraction and addition. Expect Decl Node index to be greater than children Node indexes.

I opted for (3) because it seems like the simplest thing to do. We'll want to unify the logic for computing the offsets though, because if the logic gets repeated, it will probably get repeated wrong.
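
One way to keep the offset logic in a single place, as suggested above, is a pair of helpers that encode strategy (3). This is only a sketch: the names `nodeToOffset` and `offsetToNode` are hypothetical and are not taken from the compiler.

```zig
const std = @import("std");

// Hypothetical helpers for strategy (3): the Decl node index is expected to be
// greater than or equal to the index of any of its child nodes, so the
// unsigned subtraction below cannot overflow.
fn nodeToOffset(decl_node: u32, node: u32) u32 {
    std.debug.assert(decl_node >= node);
    return decl_node - node;
}

// Inverse of `nodeToOffset`: recover the absolute node index from the offset.
fn offsetToNode(decl_node: u32, offset: u32) u32 {
    std.debug.assert(decl_node >= offset);
    return decl_node - offset;
}

pub fn main() void {
    const decl_node: u32 = 10;
    const child: u32 = 7;
    const offset = nodeToOffset(decl_node, child);
    std.debug.print("offset {d} -> node {d}\n", .{ offset, offsetToNode(decl_node, offset) });
}
```

Because both directions go through the same two functions, a future change of strategy (for example to signed offsets) would only need to touch this one spot.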

The LLVM backend is now structured into 3 different structs, namely Object, DeclGen and FuncGen. Object represents an object that is generated by the LLVM backend. DeclGen is responsible for generating a decl and FuncGen is responsible for generating llvm instructions from tzir in a function.

* free Module.Fn ZIR code when destroying the owner Decl
* unreachable_safe and unreachable_unsafe are collapsed into one ZIR
  instruction with a safety flag.
* astgen: emit an unreachable instruction for unreachable literals
* don't forget to call deinit on ZIR code
* astgen: implement some builtin functions

Here's what I think the ZIR should be. AstGen is not yet implemented to match this, and the main implementation of analyzeSwitch in Sema is not yet implemented to match it either.

Here are some example byte size reductions from master branch, with the ZIR memory layout from this commit:

```
switch (foo) {
    a => 1,
    b => 2,
    c => 3,
    d => 4,
}
```

184 bytes (master) => 40 bytes (this branch)

```
switch (foo) {
    a, b => 1,
    c..d, e, f => 2,
    g => 3,
    else => 4,
}
```

240 bytes (master) => 80 bytes (this branch)

The logic for putting ranges into the else prong is moved from AstGen to Sema. However, logic to emit multi-items the same as single-items cannot be done until TZIR supports mapping multiple items to the same block of code. This will be simple to represent when we do the upcoming TZIR memory layout changes.

Not yet implemented in this commit is the validation of duplicate values. The trick is going to be emitting error messages with accurate source locations, without adding extra source nodes to the ZIR switch instruction. This will be done by computing the respective AST node based on the switch node (which we do have available), only when a compile error occurs and we need to know the source location to attach the message to.

The switch_br ZIR instructions are now switch_block instructions. This avoids a pointless block always surrounding a switchbr in emitted ZIR code.

Introduce typeof_elem ZIR instruction for getting the type of the element of a pointer value in 1 instruction.

Change typeof to be un_node, not un_tok.

Introduce switch_capture ZIR instructions for obtaining the capture value of switch prongs.

Introduce Sema.resolveBody for when you want to extract a *Inst out of a block and you know that there is only going to be 1 break from it.

What's not working yet: AstGen does not correctly elide store instructions when it turns out that the result location does not need to be used as a pointer. Also Sema validation code for duplicate switch items is not yet implemented.

* use the proper result location strategy even when there are noreturn
  prongs in the switch expression
* when using break_operand strategy, actually omit the
  store_to_block_ptr instructions rather than eliding them.
* for both strategies, properly handle noreturn prongs.

GenZir struct now has rl_ty_inst field which tracks the result location type (if any) a block expects all of its results to be coerced to. Remove a redundant coercion on const local initialization with a specified type. Switch expressions, during elision of store_to_block_ptr instructions, now re-purpose them to be type coercion when the block has a type in the result location.

andrewrk · 2021-04-01T06:10:46Z

Follow-up issues extracted:

andrewrk added 2 commits March 16, 2021 00:03

stage2: rename zir_sema.zig to Sema.zig

099af0e

andrewrk added the enhancement and frontend labels Mar 16, 2021

g-w1 and others added 26 commits March 16, 2021 14:47

zir-memory-layout: fix @setEvalBranchQuota

e430f3f

also make emitBackwardBranch fail correctly

stage2: work through some compile errors in Module and Sema

38b3d4b

stage2: Module and Sema are compiling again

66245ac

Next up is reworking the seam between the LazySrcLoc emitted by Sema and the byte offsets currently expected by codegen. And then the big one: updating astgen.zig to use the new memory layout.

Merge remote-tracking branch 'origin/master' into zir-memory-layout

f5aca4a

I need the enum arrays that were just merged into master.

stage2: get Module and Sema compiling again

b268223

There are some `@panic("TODO")` in there but I'm trying to get the branch to the point where collaborators can jump in. Next is to repair the seam between LazySrcLoc and codegen's expected absolute file offsets.

stage2: the code is compiling again

bd2154d

(with a lot of things commented out)

stage2: codegen: update asm IR to new names

abdbc11

zir-memory-layout: astgen: fill in identifier

e9810d9

llvm backend: use new srcloc

c50397c

this allows to compile with ninja

zir-memory-layout: astgen: literals and *, &

a4f0ccd

this was pretty low hanging fruit

stage2: fix export source locations not being relative to Decl

132df14

stage2: fix some math oopsies and typos

81a935a

Sema: allocate inst_map with arena where appropriate

0357cd8

stage2: first pass at repairing ZIR printing

937c43d

zir-memory-layout: astgen: more instructions

12eeb18

zir-memory-layout: remove all absolute src consts

907142a

this is to prevent future bugs

astgen: implement function calls

5001044

ZIR: move some un_tok tags to un_node instead

260c610

Idea here is to prefer un_node to un_tok in order to avoid unnecessary calls to `tree.firstToken`.

astgen: implement inline assembly

8bad5df

astgen: implement string literals

d8692b8

astgen: fix not detecting volatile asm

7800ae0

if only we could have compile errors for unused locals

zir: add negate/negate_wrap, implement astgen

310a44d

These were previously implemented as a sub/sub_wrap instruction with a lhs of 0. Making this separate instructions however allows us to save some memory as there is no need to store a lhs.

astgen: implement simple binary operators

4cfd5f6

andrewrk and others added 20 commits March 26, 2021 23:46

stage2: implement bitwise expr and error literals

1f5617a

stage2: enable passing tests

95cc457

stage2: fully implement Type.eql for pointers

68f4eb0

Also fixed abiAlignment - for pointers it was returning the abi alignment inside the type, rather than of the pointer itself. There is now `ptrAlignment` for getting the alignment inside the type of pointers.

stage2: rename WipZirCode => AstGen, astgen.zig => AstGen.zig

402f87a

AstGen: scope result location related functions

d123a5e

translate-c: fix typo made in prior commit

d5d8808

AstGen: pass *GenZir as the first arg, not *Module

f80f8a7

This avoids the unnecessary scope.getGenZir() virtual call for both convenience and performance.

stage2: implement sema for @errorToInt and @intToError

0005b34

stage2: fix error sets

8f469c1

Merge remote-tracking branch 'origin/master' into zir-memory-layout

281a7ba

Wanted to make sure those new test cases still pass. Also grab that CI fix so we can get those green check marks.

AstGen: switch expressions properly handle result locations

549af58

Sema: implement switch validation for ranges

e272c29

stage2: clean up RangeSet and fix swapped Sema switch logic for lhs/rhs

abd06d8

Sema: fix else case code generation for switch

08eedc9

stage2: cbe: restore all previously passing tests!

fedc9eb

andrewrk marked this pull request as ready for review April 1, 2021 01:16

andrewrk added 4 commits March 31, 2021 18:30

stage2: compile error for duplicate switch value on boolean

cec766f

stage2: compile error for duplicate switch value on sparse

e8143f6

stage2: finish source location reworkings in the branch

b27d052

* remove the LazySrcLoc.todo tag * finish updating Sema and AstGen, remove the last of the `@panic("TODO")`.

stage2: finish implementation of LazySrcLoc

c9e31fe

andrewrk merged commit 070a28e into master Apr 1, 2021

andrewrk deleted the zir-memory-layout branch April 1, 2021 06:11

andrewrk mentioned this pull request Apr 6, 2021

Implement compile error notes and colored compile errors #6033

Closed

andrewrk mentioned this pull request Jul 10, 2021

stage2: rework AIR memory layout #9353

Merged

13 tasks
