jobs that run for 15min + sometimes disappear #24
Hard to reproduce, but I've seen it happen in the cio repo repeatedly.
Uh oh. What do you mean by disappearing? Is it that:

- you're using an SEC,
- you've created a Saga through the SEC,
- the Saga is still running but the SEC doesn't report it in its list of jobs?

Or: you've created a Saga and it's not running any more? Do you have any log files about what's going on?
I keep trying to reproduce it but it's super hard to reproduce. Basically, in the cio repo, at the start of every cron job I mark it as started in the database. At the end I mark it as success or failure. If the server gets redeployed, it gets marked as cancelled. So far we have 5000 that basically died and never returned any status or got cancelled, 5000 that were cancelled due to the server redeploying, 400 that are a success, and like 12 that failed. So basically this is why the cron jobs have been such shit. I might just be using it entirely wrong, but that seems bad. I don't have logs: basically the logs for the job that's running cut out at some point and then nothing ever happens (I store those logs to the database as well), so there's not much data. It's super hard to reproduce and I haven't really tried to debug it.
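For concreteness, here is a minimal sketch of the lifecycle being described; the enum and helper are hypothetical and don't match the actual cio schema:

```rust
// Hypothetical sketch of the job-status lifecycle described above; the real
// cio schema and names differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobStatus {
    Started,   // written at the start of every cron job
    Success,   // written when the job finishes cleanly
    Failure,   // written when the job returns an error
    Cancelled, // written by the cleanup path on redeploy
}

// A job stuck at `Started` is the "disappeared" case: no success, no failure,
// and no cancellation was ever recorded for it.
fn is_disappeared(status: JobStatus) -> bool {
    status == JobStatus::Started
}

fn main() {
    assert!(is_disappeared(JobStatus::Started));
    assert!(!is_disappeared(JobStatus::Success));
}
```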
The ones that are succeeding are the ones that take less than 5 minutes, usually more like a minute, to run.
The weird thing I'm doing, which might not be totally supported, is that instead of just calling a function inside the saga, I'm calling a function that execs a binary and captures the stdout from it. The binary is the same binary that is already running as the server, but I'm calling a different subcommand. The reason is that I wanted the stdout logs from the function, and this seemed like the easiest way of capturing the log output. So it might have something to do with that, like maybe the thread doing the exec-ing is not happy, but I don't see why that would be an issue when I should be able to do whatever I want in the saga function, right? The other thing I was thinking of is doing a double fork-exec, so that the exec I'm running can stay running after the main process dies. But I don't actually want that, because I want to be able to kill those jobs and cancel them when the server gets redeployed, and they'll die anyway since it's running in a container and pid 1 is the server. I also don't think it's the fact that I'm exec-ing, because I don't get any errors from the exec; everything just disappears. Maybe that means it is the exec-ing, but I figured I should disclose it in case this is not supported.
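For reference, a rough sketch of the exec-and-capture pattern being described, under the assumption that it looks roughly like this; `sync-repos` is a made-up subcommand name, not one that webhooky necessarily has:

```rust
use std::process::{Command, Stdio};

// Rough sketch of re-exec-ing the currently running binary with a different
// subcommand and capturing its output. `sync-repos` is a hypothetical
// subcommand name.
fn run_subcommand_capture() -> std::io::Result<(String, String)> {
    let self_exe = std::env::current_exe()?;
    let output = Command::new(self_exe)
        .arg("sync-repos")
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .output()?; // blocks the calling thread until the child exits
    let stdout = String::from_utf8_lossy(&output.stdout).into_owned();
    let stderr = String::from_utf8_lossy(&output.stderr).into_owned();
    Ok((stdout, stderr))
}
```

Note that `output()` blocks the calling thread until the child exits, which becomes relevant later in the thread.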
Thanks for that. What would happen if the parent process were to die, like on a panic? Also, do you have stderr from any of the processes?
Yeah, I pipe the stderr to the database as well, and for the handful of processes that failed, the stderr is just the anyhow error from the function failing. Otherwise there's nothing.

If the server died on a panic, all those processes would die, but I'd also get the panic logged in Sentry, and that hasn't been happening. Good question though, I hadn't thought of that.
tl;dr: try starting a job that you know will take a few seconds, then hit webhooky with SIGKILL. Does that reproduce the problem?

Cool. I took a look at the cio stuff in webhooky that uses steno. I started from the top and was looking particularly for cases where things could go wrong without showing up in log output, stdout, or stderr. A few observations:

We're using an async logger:
https://github.com/oxidecomputer/cio/blob/85c21339e15a6fea7ff52bdea75531bbedb2811e/webhooky/src/main.rs#L187

Depending on how the process dies, this means you might not get the log messages leading up to the process exit. The docs call out std::process::exit explicitly as a case where you can lose messages.

I'm also not sure about this code:
https://github.com/oxidecomputer/cio/blob/master/webhooky/src/server.rs#L257-L276

I see two ways this could go badly.

First, suppose the process takes SIGINT or SIGTERM. We start the "cleanup" job at L269. Let's say that completes quickly. Then we exit the process at L274 with std::process::exit. We might miss some of those info! messages in the logs for the reason mentioned above. Still, there's a good chance you'd catch the other messages, and anyway, I don't see how this would leave jobs in a bad state.

But suppose again that the process takes SIGINT or SIGTERM and we start the "cleanup" job. Let's dig into that. The start_job function makes a request to the server on localhost (running in the same process) and waits for it to finish:
https://github.com/oxidecomputer/cio/blob/master/webhooky/src/server.rs#L2205-L2216

Note that start_job normally applies a timeout to the client request, but in the special case of the "cleanup" job, it uses no client-side request timeout.

Anyway, once we make the request to start the "cleanup" job, we land in the request handler for that endpoint:
https://github.com/oxidecomputer/cio/blob/master/webhooky/src/server.rs#L2165-L2187

Seeing in "start_job" that this is expected to return "202 Accepted", I thought maybe we weren't waiting for cleanup to finish, but it looks to me like this function is really synchronous. By the time it returns, the cleanup process has finished. So that's good.

But what if it takes a long time? Back where we handled the signal, we had this comment:
https://github.com/oxidecomputer/cio/blob/master/webhooky/src/server.rs#L257-L261

```rust
// For Cloud run & ctrl+c, shutdown gracefully.
// "The main process inside the container will receive SIGTERM, and after a grace period,
// SIGKILL."
// Regsitering SIGKILL here will panic at runtime, so let's avoid that.
let mut signals = Signals::new(&[SIGINT, SIGTERM])?;
```

This is a quote from the docker stop docs, where the default grace period is 10 seconds. So suppose we take SIGTERM, we kick off the cleanup job, and it takes longer than 10 seconds? We'd then expect to take SIGKILL, which immediately terminates the process with no opportunity to clean up.

Does that seem likely? Taking a look at the cleanup job, it looks like it iterates all the running sagas and for each one makes a bunch of requests to Airtable. It seems quite plausible to me that this would take more than 10 seconds.

Is this consistent with the data we have so far? I think so. If you ran into this situation, then:

- Most importantly: any jobs that the cleanup job didn't get to before it got SIGKILL would wind up not cancelled. I don't see any other code that would mark them cancelled. I gather from what you wrote above that on restart, webhooky won't resume any of those jobs. So I think that would explain the key data point that you have all these jobs that aren't completed, failed, or cancelled.
- You wouldn't see a panic or almost any other error from stdout or stderr.
- As you wrote: "basically the logs for the job that's running cut out at some point and then nothing ever happens" -- that sounds like what would happen here too.
- As you wrote: "everything just disappears" -- yup! The parent process (which is pid 1, you said) gets SIGKILL and all the child processes disappear when the container (immediately) shuts down.
- It wouldn't show up all the time, and it might be hard to reproduce when there weren't many jobs running.
- Assuming the distribution of jobs and signals is fairly random, it makes sense that this problem would affect long-running jobs most often, because short-running jobs are unlikely to be interrupted by whatever signal is triggering this.

That's a compelling circumstantial case for me, but I'd feel better with some direct evidence. What am I looking for? Thinking out loud:

- Do you have any log output that shows "received signal:" and "triggering cleanup", but not "all clean, exiting"? If so, that doesn't mean you definitely hit this (because that message can get lost), but if you don't have any such case, this hypothesis seems less likely. (It seems unlikely to me that we'd not manage to flush those first two messages in ten seconds.)
- Do you know what the grace period is in this environment? (I think I answered this. If this is Cloud Run, I think it's 10 seconds.)
- Can you tell from some other source whether the grace period elapsed and the runtime sent SIGKILL? Like, if this is happening inside Cloud Run, does its API tell you whether it finally terminated the container with SIGTERM or SIGKILL, or how long termination took? Or does it have timestamps or a log for when it (1) decided to terminate the container and (2) actually stopped it?

You could try to reproduce this by starting a job that you know will take a few seconds, then hitting webhooky with SIGKILL. My guess is you'll see the job disappear in the way you mentioned (not completed, not failed, not cancelled). As always, reproducing the issue may prove that this is an issue, but it can't prove that this was the issue. Still, if you can reproduce it, then we know it could be, and it's worth fixing and then seeing whether you still see the problem.

Relatedly, how often do you see this under normal conditions? If you deployed a version with a possible fix, how soon would you be able to tell whether it was fixed?
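One way to sidestep that race, sketched below, would be to bound the cleanup work so it finishes inside the grace period; `cleanup_all_jobs` is a stand-in, not webhooky's actual cleanup endpoint:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical illustration of bounding cleanup so it fits inside the
// SIGTERM -> SIGKILL grace period. `cleanup_all_jobs` stands in for whatever
// the real cleanup endpoint does; it is not webhooky's actual function.
async fn shutdown_after_signal() {
    // Leave headroom under the ~10 second docker stop / Cloud Run grace period.
    match timeout(Duration::from_secs(8), cleanup_all_jobs()).await {
        Ok(Ok(())) => println!("all clean, exiting"),
        Ok(Err(e)) => eprintln!("cleanup failed: {}", e),
        Err(_) => eprintln!("cleanup timed out; exiting before SIGKILL arrives"),
    }
    std::process::exit(0);
}

async fn cleanup_all_jobs() -> Result<(), Box<dyn std::error::Error>> {
    // Mark any in-flight sagas as cancelled, flush logs, etc.
    Ok(())
}
```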
I think this is totally viable and I'll try it out, but it doesn't explain why this still happens when we have a live server with an uptime of weeks (in between, when I'm not making changes) and it still has the same problems.
That data came from over the course of 8 months or so, some of which were three-week intervals where I didn't push updates to the server, and we still only have 400 successful runs of functions. I'd expect that to be way higher. On shutdown it's likely only killing like 2 jobs, which is fine and why I'm okay with it cancelling them. So the ratio just seems way off.
There are whole days where not a single function succeeds, not even the short ones, but I know the server is up, since otherwise we'd get no applications etc., like all hell would break loose.
Here's the successes for the past month. Keep in mind that from Jan 17th to Feb 4th there were no new redeploys of the server: https://github.com/oxidecomputer/cio/commits/master?after=1a25363e12265db207ba811c20573c175b426ff2+34&branch=master and I don't see any separate deploys in Cloud Run.
Does that mean that, in general, the host where this stuff is running is up for weeks at a time? Or does the platform start and stop the host on its own, asynchronously? Are you able to use SSH or some other shell to get into the host and poke around while it is misbehaving?
It can stop and start it if something goes down, but I'd be able to see that in the Cloud Run logs.
I don't think I can SSH in unless I throw sshd in the container, which we could do... well, we'd need to expose a port for it. Might be super hacky; I don't know if Cloud Run can have more than one port exposed.
Actually, take a look at the logs above too.
"Timed out" is what I call it when a job hasn't returned a status in over a day.
Maybe I'm just completely using it wrong?
Yeah, the data point that we don't expect many shutdown events isn't consistent with that hypothesis. I gather that Cloud Run can trigger this when it decides to autoscale down, but I'm guessing we don't expect that either?
Digging in more, I see that we're using a synchronous BufReader to read the child process's output, and we're also using Diesel all over the place, I guess to connect to Airtable? Both of these are synchronous -- I'd expect that the current thread will block while we either read from the child process or do a database query. This can be disastrous for tokio.
Good news: overnight things have gotten a lot better, in a way I've never seen before. So I think I might have fixed it yesterday when I was cleaning up some old code. Tbh, I wrote this code like 8 or so months ago, so for a lot of the decisions I couldn't remember a specific reason why I was doing it that way, and some seem obvious in retrospect. To summarize: oxidecomputer/cio@ee1f084...1a25363

To your points above: the BufReader always made me a bit nervous. Is there a different way to make that async, or do you think I'm good now? The way the Diesel stuff works is we have a database (Postgres) and I use Airtable as the frontend, so through the macro in that repo I basically update Airtable every time I update the database. Kinda hacky, and I'm sure there are actual database frontends we could use instead, but it works and keeps everything in sync.

In the beginning, like the first week I set this up, there were some (11 to be exact), and I scripted it to mark those as neutral: https://github.com/oxidecomputer/cio/blob/master/cio/src/functions.rs#L408 But it's super rare; see below. I think it will work now, though, so sorry for the noise, although I would be curious how to make it better given the things you pointed out above, and I'll add you to all the things.
Oh, also, we do have autoscaling for Cloud Run, so some of the cancels could be caused by that, but I run the crons often enough that cancelled jobs shouldn't ruin everything. I account for it being a bit lossy, just not as lossy as it was, if that makes sense: https://github.com/oxidecomputer/cio/blob/master/.github/workflows/cloud-run.yml#L77
For example, just last night we had 7 cancelled jobs, 0 timed out, and 21 successes. That's the kind of ratio I would expect when it's tearing down an extra instance that was up for the day (while always keeping at least 1 up), and that's perfectly fine for what we are doing, because I call the crons often enough that some can fail. We don't really need every replica to be calling the crons as well, but they are, so it's okay that they cancel.
I also feel like, since 0 timed out / were left lingering last night and I hadn't modified the cleanup sequence yet, Cloud Run likely configures docker with a much longer timeout before SIGKILL, so that doesn't seem to be an issue. I'm also okay with some timing out, just not that many, maybe about the same amount that randomly fall into the neutral category, haha.
I'm glad to hear things are working a lot better! So actually, all of my comments were based on commit 85c21339e15a6fea7ff52bdea75531bbedb2811e, which I now realize postdates a bunch of your fixes yesterday. I probably should have asked what was running in prod and looked at that. It looks like a bunch of the code paths I was looking at did change.
It's not good as-is. The problem is that tokio will have some fixed number of worker threads, probably close to NCPUS, and it assumes that things won't block. There are a few kinds of problems I'd expect you to see as a result, but I can't say whether any of them is likely to bite you. To fix it: tokio has async analogs for things like BufReader, but I don't know that you can just create one from a sync reader.
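One way to do that, assuming the intent is to spawn the child through tokio so its stdout is already async (an assumption, not something confirmed above); `sync-repos` is again a made-up subcommand:

```rust
use std::process::Stdio;
use tokio::io::{AsyncBufReadExt, BufReader};
use tokio::process::Command;

// Sketch: spawn the child through tokio so reading its stdout never blocks an
// async worker thread. `sync-repos` is a hypothetical subcommand.
async fn run_subcommand_async() -> std::io::Result<(Vec<String>, std::process::ExitStatus)> {
    let mut child = Command::new(std::env::current_exe()?)
        .arg("sync-repos")
        .stdout(Stdio::piped())
        .spawn()?;

    // tokio's ChildStdout implements AsyncRead, so tokio::io::BufReader works on it.
    let stdout = child.stdout.take().expect("stdout was piped");
    let mut lines = BufReader::new(stdout).lines();

    let mut captured = Vec::new();
    while let Some(line) = lines.next_line().await? {
        captured.push(line); // these could be streamed into the database log instead
    }

    let status = child.wait().await?;
    Ok((captured, status))
}
```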
Using Diesel seems totally reasonable here. It's just that it has the same problem: it blocks the current thread, which tokio assumes won't block. It can lead to all the same behavior as above. For omicron we use bb8 (instead of r2d2) + async-bb8-diesel, which gives you async versions of the conn functions.
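If swapping connection pools is too big a change, another common workaround, sketched here, is to push each blocking Diesel call onto tokio's dedicated blocking-thread pool; `fetch_jobs_blocking` is a placeholder, not real cio code:

```rust
use tokio::task::spawn_blocking;

// Sketch: run a blocking (e.g., Diesel + r2d2) query on tokio's dedicated
// blocking-thread pool instead of on an async worker. `fetch_jobs_blocking`
// stands in for whatever synchronous query the code actually runs.
async fn fetch_jobs() -> Result<Vec<String>, Box<dyn std::error::Error + Send + Sync>> {
    let rows = spawn_blocking(fetch_jobs_blocking).await??;
    Ok(rows)
}

fn fetch_jobs_blocking() -> Result<Vec<String>, Box<dyn std::error::Error + Send + Sync>> {
    // Imagine a diesel query here, e.g. loading rows through an r2d2 connection.
    Ok(vec!["example-job".to_string()])
}
```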
I just meant that if Cloud Run ever autoscales you down, then we can wind up hitting the SIGKILL problem even if you never did a deploy (because Cloud Run does the same SIGTERM, wait 10 seconds, SIGKILL dance during autoscale-down). I don't know that it's happening often enough to account for all the disappearing jobs, but when it happens, it might result in some jobs being cancelled and others disappearing in the way you mentioned above.
It's possible, but their docs (linked above) do say 10 seconds, so I think it's more likely that the cleanup just didn't take 10 seconds. I'm going to close this for now because there doesn't appear to be a steno problem here. If the problem recurs, let me know.
Thanks for the help, and sorry for blaming steno.