-
-
Notifications
You must be signed in to change notification settings - Fork 2.8k
rework ZIR memory layout; overhaul source locations #8266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The memory layout for ZIR instructions is completely reworked. See zir.zig for those changes. Some new types: * `zir.Code`: a "finished" set of ZIR instructions. Instead of allocating each instruction independently, there is now a Tag and 8 bytes of data available for all ZIR instructions. Small instructions fit within these 8 bytes; larger ones use 4 bytes for an index into `extra`. There is also `string_bytes` so that we can have 4 byte references to strings. `zir.Inst.Tag` describes how to interpret those 8 bytes of data. - This is shared by all `Block` scopes. * `Module.WipZirCode`: represents an in-progress `zir.Code`. In this structure, the arrays are mutable, and get resized as we add/delete things. There is extra state to keep track of things. This struct is stored on the stack. Once it is finished, it produces an immutable `zir.Code`, which will remain on the heap for the duration of a function's existence. - This is shared by all `GenZir` scopes. * `Sema`: represents in-progress semantic analysis of a `zir.Code`. This data is stored on the stack and is shared among all `Block` scopes. It is now the main "self" argument to everything in the file that was previously named `zir_sema.zig`. Additionally, I moved some logic that was in `Module` into here. `Module.Fn` now stores its parameter names inside the `zir.Code`, instead of inside ZIR instructions. When the TZIR memory layout reworking time comes, codegen will be able to reference this data directly instead of duplicating it. astgen.zig is (so far) almost entirely untouched, but nearly all of it will need to be reworked to adhere to this new memory layout structure. I have no benchmarks to report yet, as I am still working through compile errors and fixing various things that I broke in this branch. Overhaul of Source Locations: Previously we used `usize` everywhere to mean byte offset, but sometimes also mean other stuff. This was error prone and also made us do unnecessary work, and store unnecessary bytes in memory. Now there are more types involved into source locations, and more ways to describe a source location. * AllErrors.Message: embrace the assumption that files always have less than 2 << 32 bytes. * SrcLoc gets more complicated, to model more complicated source locations. * Introduce LazySrcLoc, which can model interesting source locations with very little stored state. Useful for avoiding doing unnecessary work when no compile errors occur. Also, previously, we had `src: usize` on every ZIR instruction. This is no longer the case. Each instruction now determines whether it even cares about source location, and if so, how that source location is stored. This requires more careful work inside `Sema`, but it results in fewer bytes stored on the heap, without compromising accuracy and power of compile error messages. Miscellaneous: * std.zig: string literals have more helpful result values for reporting errors. There is now a lower level API and a higher level API. - side note: I noticed that the string literal logic needs some love. There is some unnecessarily hacky code there. * cut & pasted some TZIR logic that was in zir.zig to ir.zig. This probably broke stuff and needs to get fixed. * Removed type/Enum.zig, type/Union.zig, and type/Struct.zig. I don't think this quite how this code will be organized. Need some more careful planning about how to implement structs, unions, enums. They need to be independent Decls, just like a top level function.
also make emitBackwardBranch fail correctly
Next up is reworking the seam between the LazySrcLoc emitted by Sema and the byte offsets currently expected by codegen. And then the big one: updating astgen.zig to use the new memory layout.
I need the enum arrays that were just merged into master.
There are some `@panic("TODO")` in there but I'm trying to get the branch to the point where collaborators can jump in. Next is to repair the seam between LazySrcLoc and codegen's expected absolute file offsets.
(with a lot of things commented out)
this allows to compile with ninja
this was pretty low hanging fruit
We are now passing this test: ```zig export fn _start() noreturn {} ``` ``` test.zig:1:30: error: expected noreturn, found void ``` I ran into an issue where we get an integer overflow trying to compute node index offsets from the containing Decl. The problem is that the parser adds the Decl node after adding the child nodes. For some things, it is easy to reserve the node index and then set it later, however, for this case, it is not a trivial code change, because depending on tokens after parsing the decl determines whether we want to add a new node or not. Possible strategies here: 1. Rework the parser code to make sure that Decl nodes are before children nodes in the AST node array. 2. Use signed integers for Decl node offsets. 3. Just flip the order of subtraction and addition. Expect Decl Node index to be greater than children Node indexes. I opted for (3) because it seems like the simplest thing to do. We'll want to unify the logic for computing the offsets though because if the logic gets repeated, it will probably get repeated wrong.
The LLVM backend is now structured into 3 different structs, namely Object, DeclGen and FuncGen. Object represents an object that is generated by the LLVM backend. DeclGen is responsible for generating a decl and FuncGen is responsible for generating llvm instructions from tzir in a function.
this is to prevent future bugs
Idea here is to prefer un_node to un_tok in order to avoid unnecessary calls to `tree.firstToken`.
* free Module.Fn ZIR code when destroying the owner Decl * unreachable_safe and unreachable_unsafe are collapsed into one ZIR instruction with a safety flag. * astgen: emit an unreachable instruction for unreachable literals * don't forget to call deinit on ZIR code * astgen: implement some builtin functions
if only we could have compile errors for unused locals
These were previously implemented as a sub/sub_wrap instruction with a lhs of 0. Making this separate instructions however allows us to save some memory as there is no need to store a lhs.
Also fixed abiAlignment - for pointers it was returning the abi alignment inside the type, rather than of the pointer itself. There is now `ptrAlignment` for getting the alignment inside the type of pointers.
This avoids the unnecessary scope.getGenZir() virtual call for both convenience and performance.
Wanted to make sure those new test cases still pass. Also grab that CI fix so we can get those green check marks.
Here's what I think the ZIR should be. AstGen is not yet implemented to match this, and the main implementation of analyzeSwitch in Sema is not yet implemented to match it either. Here are some example byte size reductions from master branch, with the ZIR memory layout from this commit: ``` switch (foo) { a => 1, b => 2, c => 3, d => 4, } ``` 184 bytes (master) => 40 bytes (this branch) ``` switch (foo) { a, b => 1, c..d, e, f => 2, g => 3, else => 4, } ``` 240 bytes (master) => 80 bytes (this branch)
The logic for putting ranges into the else prong is moved from AstGen to Sema. However, logic to emit multi-items the same as single-items cannot be done until TZIR supports mapping multiple items to the same block of code. This will be simple to represent when we do the upcoming TZIR memory layout changes. Not yet implemented in this commit is the validation of duplicate values. The trick is going to be emitting error messages with accurate source locations, without adding extra source nodes to the ZIR switch instruction. This will be done by computing the respective AST node based on the switch node (which we do have available), only when a compile error occurs and we need to know the source location to attach the message to.
The switch_br ZIR instructions are now switch_block instructions. This avoids a pointless block always surrounding a switchbr in emitted ZIR code. Introduce typeof_elem ZIR instruction for getting the type of the element of a pointer value in 1 instruction. Change typeof to be un_node, not un_tok. Introduce switch_capture ZIR instructions for obtaining the capture value of switch prongs. Introduce Sema.resolveBody for when you want to extract a *Inst out of a block and you know that there is only going to be 1 break from it. What's not working yet: AstGen does not correctly elide store instructions when it turns out that the result location does not need to be used as a pointer. Also Sema validation code for duplicate switch items is not yet implemented.
* use the proper result location strategy even when there are noreturn prongs in the switch expression * when using break_operand strategy, actually omit the store_to_block_ptr instructions rather than eliding them. * for both strategies, properly handle noreturn prongs.
GenZir struct now has rl_ty_inst field which tracks the result location type (if any) a block expects all of its results to be coerced to. Remove a redundant coercion on const local initialization with a specified type. Switch expressions, during elision of store_to_block_ptr instructions, now re-purpose them to be type coercion when the block has a type in the result location.
* remove the LazySrcLoc.todo tag * finish updating Sema and AstGen, remove the last of the `@panic("TODO")`.
13 tasks
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
enhancement
Solving this issue will likely involve adding new logic or components to the codebase.
frontend
Tokenization, parsing, AstGen, Sema, and Liveness.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is a follow-up from #7920, applying the same principles and strategies to ZIR.
The memory layout for ZIR instructions is completely reworked. See
zir.zig for those changes. Some new types:
zir.Code
: a "finished" set of ZIR instructions. Instead of allocatingeach instruction independently, there is now a Tag and 8 bytes of
data available for all ZIR instructions. Small instructions fit
within these 8 bytes; larger ones use 4 bytes for an index into
extra
. There is alsostring_bytes
so that we can have 4 bytereferences to strings.
zir.Inst.Tag
describes how to interpretthose 8 bytes of data.
Block
scopes.Module.WipZirCode
: represents an in-progresszir.Code
. In thisstructure, the arrays are mutable, and get resized as we add/delete
things. There is extra state to keep track of things. This struct is
stored on the stack. Once it is finished, it produces an immutable
zir.Code
, which will remain on the heap for the duration of afunction's existence.
GenZir
scopes.Sema
: represents in-progress semantic analysis of azir.Code
.This data is stored on the stack and is shared among all
Block
scopes. It is now the main "self" argument to everything in the file
that was previously named
zir_sema.zig
.Additionally, I moved some logic that was in
Module
into here.Module.Fn
now stores its parameter names inside thezir.Code
,instead of inside ZIR instructions. When the TZIR memory layout
reworking time comes, codegen will be able to reference this data
directly instead of duplicating it.
astgen.zig is (so far) almost entirely untouched, but nearly all of it
will need to be reworked to adhere to this new memory layout structure.
I have no benchmarks to report yet, as I am still working through
compile errors and fixing various things that I broke in this branch.
Overhaul of Source Locations:
Previously we used
usize
everywhere to mean byte offset, but sometimesalso mean other stuff. This was error prone and also made us do
unnecessary work, and store unnecessary bytes in memory.
Now there are more types involved into source locations, and more ways
to describe a source location.
than 2 << 32 bytes.
locations.
with very little stored state. Useful for avoiding doing unnecessary
work when no compile errors occur.
Also, previously, we had
src: usize
on every ZIR instruction. This isno longer the case. Each instruction now determines whether it even cares
about source location, and if so, how that source location is stored.
This requires more careful work inside
Sema
, but it results in fewerbytes stored on the heap, without compromising accuracy and power of
compile error messages.
Miscellaneous:
reporting errors. There is now a lower level API and a higher level
API.
There is some unnecessarily hacky code there.
probably broke stuff and needs to get fixed.
think this quite how this code will be organized. Need some more
careful planning about how to implement structs, unions, enums. They
need to be independent Decls, just like a top level function.