use case: ability to recover from illegal behavior in safe build modes #3516

mogud opened this issue Oct 23, 2019 · 34 comments
Labels
use case Describes a real use case that is difficult or impossible, but does not propose a solution.

@mogud
Contributor

mogud commented Oct 23, 2019

In my situation, most game servers I have designed so far use a service as the abstraction for everything. And millions of services could be in only a single process. For robustness, the service manager catches errors/exceptions from all running services and chooses the proper action for each (kill the service or just ignore the error). It seems zig will panic at runtime when something like division by zero occurs, and it's not recoverable.
So is it possible to add an option for this situation? As far as I know, nim has many compiler check switches that turn these edge errors into runtime exceptions. rust can do catch_unwind after a panic. go has a recover() builtin function.

@Rocknest
Contributor

There are no runtime exceptions in zig. Also it is undefined behavior if runtime safety is turned off (release-fast etc.)

@JesseRMeyer

JesseRMeyer commented Oct 23, 2019

> And millions of services could be in only a single process.

This is probably not what you want. If a single fatal error occurs, then the entire process is destroyed. Instead, you want each service to run in its own process that communicates to other processes using some standard format. That way, if, say, the login service fails, players who are already logged in and playing are not booted from their session. Also, it makes it trivial to distribute across machines in a network. While that is a notable increase in complexity, the alternative of working out who catches which exception, thrown by what and when, is probably just a rat's nest waiting to happen.

@mogud
Contributor Author

mogud commented Oct 24, 2019

> Instead, you want each service to run in its own process that communicates to other processes using some standard format.

It's not possible to have millions of processes.

> Also, it makes it trivial to distribute across machines in a network.

In fact, user-space code always uses RPC for communication, and does not need to care whether a call crosses machines or not.
A gateway may keep players' connections, but typically the game logic of more than a thousand players must be handled within a single process. It's not acceptable that all players are kicked out only because of a division by zero error. A proper way, I think, is to record the log and report it to the maintainers. They will then decide if it is necessary to shut down the game server and fix it.
So, if we can be sure that defers/errdefers work well, and have a way (by a compiler option) to stop the unwind when a division by zero happens, we have more choices.

@DaseinPhaos

> > Instead, you want each service to run in its own process that communicates to other processes using some standard format.
>
> It's not possible to have millions of processes.

Besides that, the question remains on who gets to decide how fatal an error is.

@DaseinPhaos

Probably relevant: #395, @thejoshwolfe's comment on error handling

@JesseRMeyer

JesseRMeyer commented Oct 24, 2019

> It's not possible to have millions of processes.

Yes, it is possible, especially on a distributed network of servers. But its feasibility depends on your definition of a service, kernel and related architectural choices. Whether we process a single user or tens of thousands of them on a single process is an important decision, and error propagation does seem to have a say here, regardless of my architecture comments.

Zig maintains the mantra of no hidden control flow, and software exceptions violate that principle outright. But I agree that users should wield full control over error handling. If the runtime already catches these errors, it should first propagate them to the user process to see if it cares and wants to handle it directly, and if not, return it back for the default behavior.

@emekoi
Contributor

emekoi commented Oct 24, 2019

what's wrong with

fn safe_div(a: var, b: @typeOf(a)) !@typeOf(a) {
    @setRuntimeSafety(false);
    if (b == 0) return error.DivisionByZero;
    return a / b;
}

and with #489, some of the performance hit from the check can be optimized away.

@JesseRMeyer

JesseRMeyer commented Oct 24, 2019

@emekoi Are you suggesting that as a user or Standard Library function?

Here's why: I do not want to pollute every div() call site I make with error handling, especially when I know that the divisor is not 0, as inputs are often sanitized long before computations on them are performed. This is the crux of the problem: if we address this at too fine a granularity, then the whole structure around it pays the cost in support. I suppose in those cases the binary / would suffice, so maybe renaming this to safe_div() would indicate its purpose.

@mogud
Contributor Author

mogud commented Oct 24, 2019

@emekoi

  1. It is verbose to have to use a div function call everywhere instead of a simple binary operator.
  2. It is awful to review others' code to make sure they follow the right way, or else I have to create a static analyzer.
  3. It is hard to reuse third-party libraries, because obviously they use /.
  4. Overflow is also an unrecoverable error. Does this mean that none of the builtin arithmetic operators can be used? I think that's really, really inconvenient.

@Rocknest
Contributor

@mogud It is a bad idea to run 'untrusted' code in a single monolithic process, regardless of whether it is zig/c or assembly; if you want to run it safely, use some kind of sandbox. For example, you can compile your 'services' to wasm.

@emekoi
Contributor

emekoi commented Oct 24, 2019

@JesseRMeyer if you know that the divisor is not zero, then just use /. there shouldn't be an issue if your input is already sanitized. as for overflow, we have compiler intrinsics that handle overflow. you can also just catch the exceptions from the OS, and go from there.

furthermore, if you're running a game server i think it is in your and your users' best interests that you carefully review all the libraries that you use...

@JesseRMeyer

> you can also just catch the exceptions from the OS, and go from there

How do we accomplish this in Zig?

@emekoi
Contributor

emekoi commented Oct 24, 2019

it depends on the OS, but for windows you can use Structured Exception Handling like you would in C and for unix systems you can use signal handlers. we already use these to catch segmentation faults in debug mode on supported systems. the relevant code is from this line down.

@JesseRMeyer

JesseRMeyer commented Oct 24, 2019

Thanks.

If user code can explicitly override Zig's safety features with their own, then that makes me glad.

@rohlem
Contributor

rohlem commented Oct 24, 2019

Other related issues that haven't been mentioned yet: #1740 , #426 (note: rejected), #1356 (note: only tangentially related, discussion seemed to disfavour recover-like mechanisms).

In debug builds, a Zig panic calls the root source file's panic handler (doesn't seem documented yet; it is mentioned in the documentation of @panic). You are free to provide an implementation with e.g. a longjmp: anything that upholds the noreturn return type, i.e. that doesn't expect to return directly to the panicked stack.

The main issue is that in completely-optimized builds (ReleaseFast, ReleaseSmall), the LLVM IR that is emitted results in undefined behaviour. If you want uncompromised speed, you need to compromise recoverability (as far as I understand it). How recoverable that currently resulting undefined behaviour ends up being is left for the backend, currently LLVM, to decide.

If your main concern is correctness/stability, then allocating separate stack memory for each service invocation and having a longjmp-or-equivalent return plan from the panic handler might be an acceptable solution.

I also thought I remembered (but now can't find) another more in-depth discussion about turning each instance of detectable illegal behaviour into returning a standard error code - again, this prevents full-fledged optimizations.
Note that whatever judgement mainline Zig ends up passing, with Zig's parser being part of the standard library, it might be feasible for you to add a compilation step that replaces certain unsafe expressions (like panicking operators) with safer function calls (like the error-returning alternatives from std.math, or a non-error fallback return value).
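
As a rough illustration of the error-returning alternatives from std.math mentioned above, here is a minimal sketch. It assumes a recent Zig release; the `Request` type and the `handle` function are hypothetical names invented for the example and do not come from this thread. It only shows that `std.math.divTrunc` and `std.math.add` report division by zero and overflow as ordinary errors that a caller can catch, instead of triggering a safety panic.

```zig
const std = @import("std");

// Hypothetical request type, invented for this sketch.
const Request = struct { total: i64, count: i64, bonus: i64 };

// Hypothetical service logic using error-returning arithmetic.
fn handle(req: Request) !i64 {
    // divTrunc returns error.DivisionByZero instead of panicking when count == 0.
    const average = try std.math.divTrunc(i64, req.total, req.count);
    // add returns error.Overflow instead of panicking on signed overflow.
    return std.math.add(i64, average, req.bonus);
}

test "faulty requests surface as errors the caller can handle" {
    try std.testing.expectError(
        error.DivisionByZero,
        handle(.{ .total = 10, .count = 0, .bonus = 1 }),
    );
    try std.testing.expectError(
        error.Overflow,
        handle(.{ .total = std.math.maxInt(i64), .count = 1, .bonus = 1 }),
    );
    try std.testing.expectEqual(
        @as(i64, 6),
        try handle(.{ .total = 10, .count = 2, .bonus = 1 }),
    );
}
```

Running `zig test` on the file exercises all three cases. The trade-off raised earlier in the thread still applies: every call site has to opt in, and third-party code that uses the plain operators is not covered.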

@mogud
Contributor Author

mogud commented Oct 24, 2019

> @mogud It is a bad idea to run 'untrusted' code in a single monolithic process, regardless of whether it is zig/c or assembly; if you want to run it safely, use some kind of sandbox. For example, you can compile your 'services' to wasm.

But if it is well structured and won't crash, it's a really useful design for extreme performance (no IPC, no serialization). Think about go with so many goroutines. A service may consist of two or three goroutines. And I can make it never crash.
As a matter of fact, our current game server has used this pattern for 2 years, and it has crashed only once, because of a deep concurrency issue.
So this is a real use case: we need such errors or panics to be catchable, or handled more safely.

@mogud
Contributor Author

mogud commented Oct 24, 2019

@emekoi You are right. But as I mentioned above, at the least we need the language to guarantee that defers/errdefers are processed as expected. Otherwise it's not safely recoverable.

@mogud
Contributor Author

mogud commented Oct 24, 2019

@rohlem Thanks.
By the way, catch_unwind is indeed what I personally want. Otherwise I have to embed another scripting language like lua for convenience and robustness.

@mogud
Contributor Author

mogud commented Oct 24, 2019

rust's recover(catch panic) rfcs

@Rocknest
Contributor

@mogud panic is a debug tool, do not abuse it. If you want crash resilient program you have to pay for it in some way or another.

@mogud
Contributor Author

mogud commented Oct 25, 2019

> @mogud panic is a debug tool, do not abuse it.

I never said I use panic in zig as it's indeed a debug tool right now. I use it in go.

> If you want crash resilient program you have to pay for it in some way or another.

Which way? I do know that multiple processes can improve a server's robustness, but that completely breaks the original design, which works perfectly in go, rust, nim and all the VM languages.

@JesseRMeyer

@mogud

Well there's the way offered by @emekoi a few replies up on how to override Zig's panic handler. Maybe this facility can be expanded on.

@andrewrk
Member

Hi @mogud. Thank you for opening this issue. I want to start by affirming that this is a valid and important use case, and the Zig project needs to have an answer for how this use case is recommended to be solved, even if the language does not address it, and such a recommendation is "use processes" or "a different language would be a better fit for this use case".

I think the Rust RFC you linked does a great job of explaining the situation, especially with regards to broken invariants of data structures.

@rohlem is correct about ReleaseFast mode vs ReleaseSafe mode. In ReleaseFast modes, the optimizer will assume illegal behavior, such as division by zero, does not occur. For a game server where it is important to not crash, ReleaseSafe will be a better choice for the global build mode, and this issue is suggesting that detected illegal behavior can be recovered from. @mogud I hope you don't mind that I rename this issue in light of #2402.

Given that a panic can happen in defer expressions, recovering from a panic is generally unsound, unless one very specific thing is done: arena-based resource management. When one creates an arena for resources, this creates a "recovery" point. If you think about it, this is why process-based recovery works so well - the OS creates an "arena" for you which cleans up all resources if the process crashes. Importantly, it also creates a thread of execution where control flow being abruptly terminated does not affect other threads of execution.

One thing to consider here is that in this use case in zig, it's extremely likely that the software would be written with event-based I/O. So a proposal to make detected illegal behavior recoverable would have to solve the problem that jumping straight to the panic function from an async function would leave the awaiter hanging. If an async function does not make it to the return statement, its awaiter will hang forever, likely leaking resources, or worse, breaking invariants of data structures.
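
The arena-as-recovery-point idea described above can be sketched roughly as follows. This is only an illustration, assuming a recent Zig release; `processRequest` and `runRequest` are hypothetical names invented here. Note that the failure is signalled through an ordinary error return, not a panic: a real panic in a safe build mode would still terminate the process, which is exactly the gap this issue is about.

```zig
const std = @import("std");

// Hypothetical per-request work. Every allocation goes through `scratch`,
// so the caller can release all of it in one step whether or not this fails.
fn processRequest(scratch: std.mem.Allocator, divisor: u32) !u32 {
    const buf = try scratch.alloc(u32, 16);
    buf[0] = 100;
    // Returns error.DivisionByZero instead of panicking when divisor == 0.
    return std.math.divTrunc(u32, buf[0], divisor);
}

// Hypothetical service-manager entry point: one arena per request.
fn runRequest(backing: std.mem.Allocator, divisor: u32) ?u32 {
    var arena = std.heap.ArenaAllocator.init(backing);
    // The recovery point: frees everything the request allocated,
    // on success and on failure alike.
    defer arena.deinit();
    return processRequest(arena.allocator(), divisor) catch null;
}

test "a failed request leaks nothing and does not stop the next one" {
    // std.testing.allocator reports any allocation that is never freed.
    try std.testing.expectEqual(@as(?u32, null), runRequest(std.testing.allocator, 0));
    try std.testing.expectEqual(@as(?u32, 25), runRequest(std.testing.allocator, 4));
}
```

Because `processRequest` never frees anything itself, there is no per-allocation cleanup that could be skipped when control leaves early; the single `arena.deinit()` is the only cleanup step.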

@andrewrk changed the title from "Should we have an option to make division by zero recoverable?" to "use case: ability to recover from illegal behavior in safe build modes" Oct 25, 2019
@andrewrk added this to the 0.6.0 milestone Oct 25, 2019

@Rocknest
Contributor

@andrewrk panic recovery sounds like runtime exceptions reborn. I don't think it's a good idea to support such a use case in a language without a runtime and with direct access to the system's resources. Illegal behavior means there is a bug in the software, doesn't it?

@andrewrk
Member

> Illegal behavior means there is a bug in the software, doesn't it?

Yes. This use case is to have the ability to handle bugs in a large codebase without crashing.

It is currently considered to be out of scope of the language, and there are no open proposals to change this.

@mogud
Contributor Author

mogud commented Oct 25, 2019

@andrewrk Thanks for your patience.

My English is not very good, so maybe I cannot tell the full story accurately. I will just point out what I think is most important.

  1. I think single threaded multiprocessing is great for reliable servers.
  2. We use a single process (not strictly accurate) because we built a very reusable RPC-based framework for different categories of games, like FPS, SLG, MMOARPG. We do not care whether we have a server named mail or bill; they're all services and can be placed on any node by different launch configs. So we can also use processes, even though that has communication costs. To make the whole system reliable, the base framework must be very fast and robust; that's why I cannot accept that it crashes so easily.

> Given that a panic can happen in defer expressions, recovering from a panic is generally unsound, unless one very specific thing is done: arena-based resource management.

In most game server development, logic programmers cannot directly manage resources. For example, they can load data from the db service, but cannot hold a handle to the db connection. Resource management code is often written by advanced programmers and does not change for a long while, so it can be fully tested.

> One thing to consider here is that in this use case in zig, it's extremely likely that the software would be written with event-based I/O. So a proposal to make detected illegal behavior recoverable would have to solve the problem that jumping straight to the panic function from an async function would leave the awaiter hanging. If an async function does not make it to the return statement, its awaiter will hang forever, likely leaking resources, or worse, breaking invariants of data structures.

The framework must guarantee its safety here and should make this transparent to its users.

Lastly, I'm sorry that I cannot open a proposal because of my poor English.

@JesseRMeyer

@andrewrk Would you please explain why panicking in a defer context is problematic? Can't any defer scenario be composed without defer in the first place?

@andrewrk
Member

panicking in a defer is not problematic. It just means that when you're in the panic handler, you've already potentially leaked resources and potentially have data structures with broken invariants.

Yes to your second question.

@rohlem
Contributor

rohlem commented Oct 26, 2019

> So a proposal to make detected illegal behavior recoverable would have to solve the problem that jumping straight to the panic function from an async function would leave the awaiter hanging. If an async function does not make it to the return statement, its awaiter will hang forever, likely leaking resources, or worse, breaking invariants of data structures.

Currently @panic only receives a message, and the implementation of std.debug.panic retrieves the stack frame information via other means.
Assuming we can (note: limited to safe build modes) query whether the current stack is async, we could expose builtins @currentAwaiter() ?*anyawaiter and @returnToAwaiter(*anyawaiter) noreturn. Then the panic implementation could do:

fn panic(...) noreturn {
    any_panic_impl(...); //print stack trace etc.
    if(@currentAwaiter()) |awaiter| {
        //Potentially fill/initialize the return value the awaiter is awaiting; trickier, see below.
        @returnToAwaiter(awaiter); //note: of type noreturn
    }
    os.abort(); //or whatever else you do if you panic on the main stack (or on a stack currently without an awaiter)
}

This way the recoverability is a completely optional feature (maybe even opt-in compile-time toggle-able, akin to --single-threaded). Since we already have safety features for resuming non-suspended functions, I'm 90% sure that this would already be implementable.

Filling the awaiter's return value seems a little tricky: We could have a builtin to provide *@OpaqueType() that can be cast if the type is consistent across all async functions in the codebase.
Maybe error unions could be generalized in their layout to the point where the builtin can provide a *anyerror for any anyerror!T ; that would make it quite elegant to use, actually.

Otherwise switching on the type would require some runtime representation of the type (maybe via an auto-collected builtin enum similar to how anyerror is populated), but these ideas sound overcomplicating to me; for this particular use case a userland protocol would be sufficient:

//scheduler
var succeeded: bool = false;
const success_result = async failable_afunc(&succeeded, ...);
if(succeeded){
    //use success_result ...
}else{
    //handle failure... | success_result is undefined, do not access!
}

fn failable_afunc(succeeded: *bool) T {
    defer succeeded.* = true; //we need to somehow prohibit the optimizer from executing the assignment any earlier, which might not appear observable locally.
        //application logic implementation
}

(As an alternative to @currentAwaiter() we could introduce a separate panicAsync(?*awaiter) T, and @panic decides which one to use depending on if it's called on an async stack. Then the call of @returnToAwaiter and maybe also setting the awaiter's awaited return value could be hidden after the return of panicAsync (maybe of return type anyerror ?). This would reduce both complexity and flexibility/control of the feature in my eyes.)

@suirad
Contributor

suirad commented Nov 2, 2019

Since @panic is, out of necessity, somewhat of an exception to the zig rule of no hidden control flow, perhaps it could be a tool in the modes in which it is available (debug/release-safe). It seems to me that something to the effect of a temporary panic handler for a single scope could be feasible. Perhaps purity of the scope could determine the eligibility of code/functions used within it, since side effects affect the recoverability of state.

@shawnl
Contributor

shawnl commented Nov 5, 2019

> There are no runtime exceptions in zig. Also it is undefined behavior if runtime safety is turned off (release-fast etc.)

LLVM does not make all of these undefined behavior, but downgrades what it can to only produce undefined values. Zig should know the difference, and be able to recover from undefined values. The big exception to this is divide by zero, which raises SIGFPE.

The general fix for this is to add setjmp()/longjmp() support to zig, which is #1656.

@andrewrk modified the milestones: 0.6.0, 0.7.0 Feb 21, 2020

@ityonemo
Contributor

Just wanted to add that in my use case (FFI with the erlang VM) I'd like to turn on a panic trapping feature when I drive external unit test suites, so that a zig panic can report unreachable/undefined behavior inside zig to the calling VM in release-safe/release-debug without disrupting test counts/test tracking/CI. An opt-in ability to somehow trap a panic would be very useful. Conceptually this could easily take the form of a setjmp/longjmp that I can drop in at the zig/erlang boundary and recover from in the event of a panic. If zig doesn't want to support this, that's fine; a panic during unit tests is also a valid way of alerting that there's something wrong with the code.

@andrewrk modified the milestones: 0.7.0, 0.8.0 Oct 17, 2020
@andrewrk modified the milestones: 0.8.0, 0.9.0 Nov 6, 2020
@SpexGuy added the "use case" label Mar 21, 2021
@andrewrk modified the milestones: 0.9.0, 0.10.0 May 19, 2021

@iacore
Contributor

iacore commented Nov 23, 2022

@panic seems to raise SIGABRT. You can catch this inside the process, or in its parent process.

Use shared memory & exit code to send error message back to parent.

@matu3ba
Contributor

matu3ba commented Apr 20, 2023

> recover from illegal behavior

Recovering from errors (so as not to run into failures) requires specifying what the safe and well-defined states are. How should Zig know this? And once you can specify them, why can you not code them?

More specifically: what are the recoverable state classes and execution context classes that Zig should support?

> It seems zig will panic at runtime when something like division by zero occurs, and it's not recoverable.

The purpose of optimizing compilers, and of the languages defined for them, is to provide optimal machine code for the supported performance use cases, and this requires the possible code semantics to be written out explicitly.
Zig has performance defaults for math stuff, which includes a / b trapping the CPU and crashing your program in safe modes.

As I understand you, you are asking to let the caller change source code semantics, the way macros or operator overloading are used in C/C++ etc. (typically to work around bugs in called code, i.e. code not intended for the use case). Is that correct?
