jobs that run for 15min + sometimes disappear #24

Closed
jessfraz opened this issue Feb 14, 2022 · 26 comments

Comments

@jessfraz

Hard to reproduce, but I've seen it happen in the cio repo repeatedly.

@davepacheco
Collaborator

Uh oh. What do you mean by disappearing? Is it that:

  • you're using an SEC
  • you've created a Saga through the SEC
  • the Saga is still running but the SEC doesn't report it in its list of jobs?

Or: you've created a Saga and it's not running any more?

Do you have any log files about what's going on?

@jessfraz
Author

jessfraz commented Feb 16, 2022 via email

@jessfraz
Author

jessfraz commented Feb 16, 2022 via email

@jessfraz
Author

The weird thing I'm doing, which might not be totally supported, is that instead of just calling a function inside the saga, I'm calling a function that execs a binary and captures its stdout. The binary is the same binary that's already running as the server, but I'm invoking a different subcommand. The reason I'm doing this is that I wanted the stdout logs from the function, and this seemed like the easiest way of capturing that output. So it might have something to do with that; maybe the thread doing the exec isn't happy. But I don't see why that would be an issue, since I should be able to do whatever I want in the saga function, right?

The other thing I was thinking of is doing a double fork-exec, so that the process I exec can keep running after the main process dies. But I don't actually want that, because I want to be able to kill and cancel those jobs when the server gets redeployed, and they'll die anyway since this runs in a container and pid 1 is the server. I also don't think it's the exec itself, because I don't get any errors from it; everything just disappears. Maybe that means it is the exec, but I figured I should disclose it in case this isn't supported.
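For illustration, here's a rough sketch of the pattern described above (not the actual cio code; the subcommand name is hypothetical): re-exec the server's own binary with a different subcommand and capture its stdout.

    use std::process::Command;

    // Re-run this same executable with a (hypothetical) subcommand and collect
    // everything it writes to stdout.
    fn run_subcommand_and_capture() -> std::io::Result<String> {
        let exe = std::env::current_exe()?;
        let output = Command::new(exe)
            .arg("some-subcommand")
            .output()?; // blocks this thread until the child exits

        Ok(String::from_utf8_lossy(&output.stdout).into_owned())
    }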

@davepacheco
Collaborator

Thanks for that. What would happen if the parent process died, say on a panic? Also, do you have stderr from any of the processes?

@jessfraz
Author

jessfraz commented Feb 16, 2022 via email

@davepacheco
Collaborator

tl;dr: try starting a job that you know will take a few seconds, then hit webhooky with SIGKILL. Does that reproduce the problem?


Cool. I took a look at the cio stuff in webhooky that uses steno. I started from the top and was looking particularly for cases where things could go wrong without showing up in log output, stdout, or stderr. A few observations:

We're using an async logger:
https://github.com/oxidecomputer/cio/blob/85c21339e15a6fea7ff52bdea75531bbedb2811e/webhooky/src/main.rs#L187

Depending on how the process dies, this means you might not get the log messages leading up to the process exit. The docs call out std::process::exit explicitly as a case where you can lose messages.
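As a minimal sketch of why that matters (assuming the usual slog-term + slog-async setup; this isn't the cio code): records are buffered and written on a background thread, so anything still queued is lost if the process exits before the async drain is dropped and flushed.

    use slog::{info, o, Drain};

    fn main() {
        let decorator = slog_term::TermDecorator::new().build();
        let drain = slog_term::FullFormat::new(decorator).build().fuse();
        // slog_async buffers records and writes them on a background thread.
        let drain = slog_async::Async::new(drain).build().fuse();
        let log = slog::Logger::root(drain, o!());

        info!(log, "shutting down");
        // Dropping the logger (the only handle on the async drain) lets it flush.
        drop(log);
        // Calling std::process::exit before that drop could lose "shutting down".
        std::process::exit(0);
    }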

I'm also not sure about this code:
https://github.com/oxidecomputer/cio/blob/master/webhooky/src/server.rs#L257-L276

I see two ways this could go badly.

First, suppose the process takes SIGINT or SIGTERM. We start the "cleanup" job at L269. Let's say that completes quickly. Then we exit the process at L274 with std::process::exit. We might miss some of those info! messages in the logs for the reason mentioned above. Still, there's a good chance you'd catch the other messages, and anyway, I don't see how this would leave jobs in a bad state.

But suppose again that the process takes SIGINT or SIGTERM and we start the "cleanup" job. Let's dig into that. The start_job function makes a request to the server on localhost (running in the same process) and waits for it to finish:
https://github.com/oxidecomputer/cio/blob/master/webhooky/src/server.rs#L2205-L2216

Note that start_job normally applies a timeout to the client request, but in the special case of the "cleanup" job, it uses no client-side request timeout.

Anyway, once we make the request to start the "cleanup" job, we land in the request handler for that endpoint:
https://github.com/oxidecomputer/cio/blob/master/webhooky/src/server.rs#L2165-L2187

Seeing in "start_job" that this is expected to return "202 Accepted", I thought maybe we weren't waiting for cleanup to finish, but it looks to me like this function is really synchronous. By the time it returns, the cleanup process has finished. So that's good.

But what if it takes a long time? Back where we handled the signal, we had this comment:
https://github.com/oxidecomputer/cio/blob/master/webhooky/src/server.rs#L257-L261

    // For Cloud run & ctrl+c, shutdown gracefully.
    // "The main process inside the container will receive SIGTERM, and after a grace period,
    // SIGKILL."
    // Registering SIGKILL here will panic at runtime, so let's avoid that.
    let mut signals = Signals::new(&[SIGINT, SIGTERM])?;

This is a quote from the docker stop docs, where the default grace period is 10 seconds. So suppose we take SIGTERM, we kick off the cleanup job, and it takes more than 10 seconds? We'd then expect to take SIGKILL, which immediately terminates the process with no opportunity to clean up.

Does that seem likely? Taking a look at the cleanup job, it looks like it iterates all the running sagas and for each one it makes a bunch of requests to Airtable. It seems quite plausible to me that this would take more than 10 seconds.
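To make that concrete, here's a hypothetical sketch (not the cio code) of bounding the shutdown cleanup so it either finishes or gives up inside the grace period; cleanup_all_jobs is an illustrative stand-in:

    use std::time::Duration;
    use tokio::time::timeout;

    // Illustrative stand-in for the real cleanup work (cancelling in-flight
    // sagas, recording their status, etc.).
    async fn cleanup_all_jobs() {
        // ...
    }

    async fn shutdown(log: &slog::Logger) {
        // Leave headroom under the ~10-second SIGTERM-to-SIGKILL grace period.
        if timeout(Duration::from_secs(8), cleanup_all_jobs()).await.is_err() {
            slog::warn!(log, "cleanup did not finish before the grace period");
        } else {
            slog::info!(log, "all clean, exiting");
        }
    }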

Is this consistent with the data we have so far? I think so. If you ran into this situation, then:

  • Most importantly: any jobs that the cleanup job didn't get to before it got SIGKILL would wind up not cancelled. I don't see any other code that would mark them cancelled. I gather from what you wrote above that on restart, webhooky won't resume any of those jobs. So I think that would explain the key data point that you have all these jobs that aren't completed, failed, or cancelled.
  • You wouldn't see a panic or almost any other error from stdout or stderr.
  • As you wrote: "basically the logs for the job that’s running cut out at some point and then nothing ever happens" -- that sounds like what would happen here too.
  • As you wrote: "everything just disappears" -- yup! The parent process (which is pid 1, you said) gets SIGKILL and all the child processes disappear when the container (immediately) shuts down.
  • It wouldn't show up all the time, and it might be hard to reproduce when there weren't many jobs running.
  • Assuming the distribution of jobs and signals is fairly random, it makes sense that this problem would affect long-running jobs most often because short-running jobs are unlikely to be interrupted by whatever signal is triggering this.

That's a compelling circumstantial case for me but I'd feel better with some direct evidence. What am I looking for? Thinking out loud:

  • Do you have any log output that shows "received signal:" and "triggering cleanup", but not "all clean, exiting"? If so, that doesn't mean you definitely hit this (because that message can get lost), but if you don't have any such case, this hypothesis seems less likely. (It seems unlikely to me that we'd fail to flush those first two messages within ten seconds.)
  • Do you know what the grace period is in this environment? (I think I answered this. If this is Cloud Run, I think it's 10 seconds.)
  • Can you tell from some other source whether the grace period elapsed and the runtime sent SIGKILL? Like if this is happening inside Cloud Run, does its API tell you whether it finally terminated the container with SIGTERM or SIGKILL, or how long termination took? Or does it have timestamps or a log where it (1) decided to terminate the container, and (2) when it actually stopped?

You could try to reproduce this by starting a job that you know will take a few seconds, then hitting webhooky with SIGKILL. My guess is you'll see the job disappear in the way you mentioned (not completed, not failed, not cancelled). As always, reproducing the issue may prove that this is an issue, but it can't prove that this was the issue. Still, if you can reproduce it, then we know it could be, and it's worth fixing and then seeing if you still see the problem.

Relatedly, how often do you see this under normal conditions? If you deployed a version with a possible fix, how soon would you be able to tell if it was fixed?

@jessfraz
Author

jessfraz commented Feb 16, 2022 via email

@jessfraz
Author

jessfraz commented Feb 16, 2022 via email

@jessfraz
Author

jessfraz commented Feb 16, 2022 via email

@jessfraz
Author

Here are the successes for the past month:
Screen Shot 2022-02-15 at 9 34 12 PM

Keep in mind that from Jan 17th to Feb 4th there were no new redeploys of the server: https://github.com/oxidecomputer/cio/commits/master?after=1a25363e12265db207ba811c20573c175b426ff2+34&branch=master and I don't see any separate deploys in Cloud Run.

@jclulow

jclulow commented Feb 16, 2022

I don’t push updates to the server for three week intervals

Does that mean in general that the host where this stuff is running is up for weeks at a time? Or does the platform start and stop the host on its own asynchronously?

Are you able to use SSH or some other shell to get into the host and poke around while it is misbehaving?

@jessfraz
Author

It can stop and start it if something goes down, but I'd be able to see that in the Cloud Run logs.

@jessfraz
Author

I don't think I can SSH in unless I throw sshd in the container, which we could do... we'd need to expose a port for it, which might be super hacky. I don't know if Cloud Run can expose more than one port.

@jessfraz
Author

Actually, if you look in the logs above too, the sync functions took an hour and 49 minutes; that's pretty good! But there should have been like 19 other jobs that day.

@jessfraz
Author

Here are all the others from the 31st that should have succeeded but all timed out:

Screen Shot 2022-02-15 at 10 10 19 PM

@jessfraz
Author

"Timed out" is what I call it when a job hasn't returned a status in over a day.

@jessfraz
Author

Maybe I'm just completely using it wrong?

@davepacheco
Collaborator

Yeah, the data point that we don't expect many shutdown events isn't consistent with that hypothesis. I gather that Cloud Run can trigger this when it decides to autoscale down, but I'm guessing we don't expect that either?

@davepacheco
Collaborator

Digging in more, I see that we're using BufReader in reexec():
https://github.com/oxidecomputer/cio/blob/1a25363e12265db207ba811c20573c175b426ff2/webhooky/src/sagas.rs#L180

and we're also using Diesel all over the place, I guess to connect to Airtable? Both of these are synchronous -- I'd expect that the current thread will block while we either read from the child process or do a database query. This can be disastrous for tokio.

I'd love to know:

  • Once this has happened, what does the process tree look like? For the child processes of webhooky: what are they doing? How long have they been running? What are they blocked on? On illumos I'd use ptree and pargs $PID and pargs -e $PID and ps -opid,stime,etime,args -p $pid and pstack for each one. Also, what's webhooky blocked on? I'd look at what pstack says for webhooky. This is where it'd be nice to be able to ssh and poke around, though we'd also need similar tools in that environment for it to be useful.
  • It would also help to have logs from webhooky that record when a pid is created and what saga id it's for, plus when a pid exits (see the sketch after this list). Maybe we already have this in the sentry data? If so, I'd take one of the stuck jobs, find the pid that we ran, and look for the log entry saying that pid exited. If we don't have it, that suggests these child processes might be hanging.
  • I'd still love to confirm that we're not panicking or restarting in some other way that we're not accounting for. I'd look in the logs for a message that gets emitted on startup, like "starting cron job scheduler". Are we going three weeks without seeing that message?
  • Are there any jobs with status "completed" but no conclusion? Or a conclusion but status != "completed"? There are a bunch of different code paths that seem to set the final state but they don't all do the same thing. (For non-background jobs, we wait right away and use "on_saga_complete" to set "conclusion" when the job finishes. When Steno tells us the saga is done in "from_saga_cached_state", we set "completed_at" and "status" but not "conclusion". When any saga action completes, we set "conclusion" and "completed_at" but not "status". While waiting for the child process in reexec, we set "conclusion" but not "completed_at" or "status". I'd be interested to know what combination of properties are set both for normal jobs and the ones that get stuck.)
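Here's a hypothetical sketch of the pid logging suggested in the second bullet (not the cio code; the subcommand name is made up):

    use std::process::Command;

    // Record the child's pid (and which saga it belongs to) at spawn time and
    // again when it exits, so stuck pids can be correlated with stuck jobs.
    fn run_child_logged(log: &slog::Logger, saga_id: &str) -> std::io::Result<()> {
        let mut child = Command::new(std::env::current_exe()?)
            .arg("some-subcommand") // hypothetical subcommand
            .spawn()?;
        slog::info!(log, "spawned child"; "pid" => child.id(), "saga_id" => %saga_id);

        let status = child.wait()?;
        slog::info!(log, "child exited"; "pid" => child.id(), "status" => %status, "saga_id" => %saga_id);
        Ok(())
    }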

@jessfraz
Author

jessfraz commented Feb 16, 2022

good news:

overnight things have gotten a lot better, like in a way I've never seen before:
Screen Shot 2022-02-16 at 7 19 36 AM

So I think I might have fixed it yesterday when I was cleaning up some old code. TBH I wrote this code like 8 or so months ago, so for a lot of the decisions I couldn't remember the specific reason I did it that way, and some seem obvious in retrospect. To summarize: oxidecomputer/cio@ee1f084...1a25363

  • I changed everything to run in the background instead of waiting on the sagas. That seems like the most obvious d'oh, and I have no idea why I was waiting on them in the first place.
  • I upped the CPU and memory of the instance; it seems unlikely that was it.
  • I changed how the scoped logger worked for the reexec; it seems unlikely it was this as well.
  • The cron jobs used to call out to the server via a REST HTTP call to do the job in the background, because I originally thought doing something like this: oxidecomputer/cio@ee1f084...1a25363#diff-679930ba180d5e7a881a077e1ae3d76466e7dbc9dc4dd432778f750fbde4be0bR177 would be gross af. But in turn I created a gross monster, because I was basically just calling myself through indirection. In combination with number 1 in this list, I think it might have been those two.

To your points above: the BufReader always made me a bit nervous. Is there a different way to make that async, or do you think I'm good now?

The way the Diesel stuff works is we have a database (Postgres) and I use Airtable as the frontend, so through the macro in that repo I basically update Airtable every time I update the database. Kinda hacky, and I'm sure there are actual database frontends we could use instead, but it works and keeps everything in sync.

Are there any jobs with status "completed" but no conclusion? Or a conclusion but status != "completed"?

In the beginning, like the first week I set this up, there were some, 11 to be exact, and I scripted it to mark those as neutral: https://github.com/oxidecomputer/cio/blob/master/cio/src/functions.rs#L408 but it's super rare; see below.

Screen Shot 2022-02-16 at 7 30 11 AM

I think it will work now though, so sorry for the noise. I would be curious how to make it better given the things you pointed out above, and I'll add you to all the things.

@jessfraz
Author

Oh, also: we do have autoscaling for Cloud Run, so some of the cancels could be caused by that, but I run the crons often enough that any cancelled jobs shouldn't ruin everything. I account for it being a bit lossy, but not as lossy as it was, if that makes sense: https://github.com/oxidecomputer/cio/blob/master/.github/workflows/cloud-run.yml#L77

@jessfraz
Author

For example, in just the period of last night, we had 7 cancelled jobs, 0 timed out, and 21 successes. That's the kind of ratio I would expect when it's tearing down an extra instance that was up for the day (while always keeping at least 1 up), and that's perfectly fine for what we're doing. I call the crons often enough that some can fail, and we don't really need every replica to be calling the crons as well, but they are, so it's okay that they cancel.

@jessfraz
Author

I also feel like, since 0 timed out / were left lingering last night and I hadn't modified the cleanup sequence yet, Cloud Run likely configures Docker with a much longer timeout before SIGKILL, so that doesn't seem to be an issue. I'm also okay with some timing out, just not that many; maybe the same amount that randomly fall into the neutral category, haha.

@davepacheco
Collaborator

I'm glad to hear things are working a lot better! So actually all of my comments were based on commit 85c21339e15a6fea7ff52bdea75531bbedb2811e, which I now realize postdates a bunch of your fixes yesterday. I probably should have asked what was running in prod and looked at that. It looks like a bunch of the code paths I was looking at did change (e.g., I see that from_saga_node_event wasn't setting the conclusion before yesterday), so yeah, maybe those fixed it.

the bufreader always made me a bit nervous is there a different way to make that async or do you think I'm good now?

It's not good as-is. The problem is that tokio will have some fixed number of worker threads, probably close to NCPUS, and it assumes that things won't block. Examples of the problems I'd expect you to see:

  • all the time: high latency, especially tail latency. I'm guessing this doesn't matter too much here.
  • if a child process or database call hangs: you'll basically eat up a thread. If you eat up all the threads this way, the whole server will block
  • a special case of this: if a child process running in the last thread calls back into the parent process, you might get a deadlock

I can't say whether any of these is likely to bite you. To fix it: tokio has async analogs for things like BufReader, but I don't know that you can just create one from a sync reader. I think you'd want duct to give you an async reader. There's also tokio's block_in_place or spawn_blocking, which I think are supposed to be escape hatches for this -- i.e., you can put the blocking stuff in a closure that you pass to those functions and they'll be run in some other thread rather than blocking the executor. I haven't used them myself much.
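For what it's worth, here's a rough sketch of one non-blocking approach using tokio's own process and I/O types (this isn't the cio code, and the subcommand name is hypothetical); duct or spawn_blocking would be alternative routes:

    use tokio::io::{AsyncBufReadExt, BufReader};
    use tokio::process::Command;

    // Spawn the (hypothetical) subcommand and stream its stdout without tying up
    // a tokio worker thread: both reading and waiting are async here.
    async fn run_subcommand() -> std::io::Result<std::process::ExitStatus> {
        let mut child = Command::new(std::env::current_exe()?)
            .arg("some-subcommand")
            .stdout(std::process::Stdio::piped())
            .spawn()?;

        let stdout = child.stdout.take().expect("stdout was piped");
        let mut lines = BufReader::new(stdout).lines();
        while let Some(line) = lines.next_line().await? {
            println!("child: {}", line);
        }

        child.wait().await
    }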

The way the deisel stuff works is we have a database (postgres) and I use airtable as the frontend so thru the macro in that repo I basically update airtable everytime I update the database, kinda hacky and I'm sure there's actual database front ends we could use instead but it works and keeps everything in sync.

Using Diesel seems totally reasonable here. It's just that it has the same problem: it blocks the current thread, which tokio assumes won't block. It can lead to all the same behavior as above. For omicron we use bb8 (instead of r2d2) + async-bb8-diesel. That gives you async versions of the conn functions (like first_async instead of first).
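If switching the connection pool over is too big a change, the spawn_blocking escape hatch mentioned above also works for Diesel. A hypothetical sketch (written against Diesel 2.x, with a made-up query):

    use diesel::prelude::*;
    use diesel::PgConnection;
    use tokio::task;

    type DynError = Box<dyn std::error::Error + Send + Sync>;

    // Run a synchronous Diesel call on tokio's blocking pool so it only occupies
    // a dedicated blocking thread, not an async worker.
    async fn blocking_db_work(database_url: String) -> Result<usize, DynError> {
        task::spawn_blocking(move || -> Result<usize, DynError> {
            let mut conn = PgConnection::establish(&database_url)?;
            // Any blocking Diesel query is fine inside this closure.
            let rows = diesel::sql_query("SELECT 1").execute(&mut conn)?;
            Ok(rows)
        })
        .await?
    }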

oh also we do have autoscaling for cloud run so some of the cancels could be caused by that but I run the crons enough that any cancelled jobs should not ruin everything, like I account for it to be a bit lossy, but not as lossy as it was if that makes sense

I just meant that if cloud run ever autoscales you down, then we can wind up hitting the SIGKILL problem even if you never did a deploy (because Cloud Run does the same SIGTERM, wait 10 seconds, SIGKILL dance during autoscale-down). I don't know that it's happening often enough to account for all the disappearing jobs, but when it happens, it might result in some jobs being cancelled and others disappearing in the way you mentioned above.

I also feel like since 0 timed out / were left lingering last night and I hadn’t modified the cleanup sequence yet, cloud run likely configures docker with a much longer timeout before sigkill so doesn’t seem to be an issue

It's possible, but their docs (linked above) do say 10 seconds so I think it's more likely the cleanup just didn't take 10 seconds.


I'm going to close this for now because there doesn't appear to be a steno problem here. If the problem recurs, let me know.

@jessfraz
Author

Thanks for the help; sorry for blaming steno.
