fix: always flush data to apm before shutting down and rework agent done signal #258
Conversation
Add a defer statement to make sure that we always flush data to the APM server before shutting down.

Remove the agent done signal channel and avoid leaking implementation details. The channel was being recreated and closed on each event, racing with the intake handler that was sending to the channel. The channel is now used internally by the APM client, and external packages can call `Done()` to check whether the agent has sent the final intake request.

See https://github.com/elastic/apm/blob/main/specs/agents/tracing-instrumentation-aws-lambda.md#data-flushing
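A minimal sketch of that shutdown path, assuming hypothetical names (`run`, `FlushAPMData`, the `done` field); the real client in the extension may differ:

```go
package extension

import "context"

// Client stands in for the extension's APM client; the real type and
// method names may differ.
type Client struct {
	done chan struct{}
}

// Done exposes a channel that the client closes internally once the
// agent has sent the final intake request.
func (c *Client) Done() <-chan struct{} { return c.done }

// FlushAPMData would forward any buffered agent data to the APM server.
func (c *Client) FlushAPMData(ctx context.Context) { /* ... */ }

func run(ctx context.Context, c *Client) {
	// Always flush before shutting down, no matter how we leave the select.
	defer c.FlushAPMData(ctx)
	select {
	case <-ctx.Done():
	case <-c.Done():
	}
}
```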
💔 Tests Failed
During multiple invocations the Lambda can reuse the environment when a warm start takes place, so we cannot assume that a request with `flushed=true` will be the last one for the lifetime of the application. Replace the channel with a counter that is incremented when we receive a request with `flushed=true` and decremented when we encounter such a request in the buffered data while sending to the APM server.
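A sketch of that counter; `intakeState` and the method names are invented for illustration, and this approach was later simplified (see below):

```go
package extension

import "sync/atomic"

// intakeState tracks ?flushed=true requests that have been received
// but not yet observed in the buffered data being forwarded.
type intakeState struct {
	pendingFlush int64
}

// onIntakeRequest is called by the HTTP handler for every agent request.
func (s *intakeState) onIntakeRequest(flushed bool) {
	if flushed {
		atomic.AddInt64(&s.pendingFlush, 1)
	}
}

// onFlushedSent is called when a buffered request carrying the flushed
// marker is forwarded to the APM server.
func (s *intakeState) onFlushedSent() {
	atomic.AddInt64(&s.pendingFlush, -1)
}

// hasPendingFlush reports whether a flushed=true request has been
// received but not yet forwarded.
func (s *intakeState) hasPendingFlush() bool {
	return atomic.LoadInt64(&s.pendingFlush) > 0
}
```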
The flush signal is received on a separate goroutine (the HTTP handler), so we cannot assume anything about its ordering relative to the events processed by other goroutines. If we only check once, we might miss the signal and hang until the runtimeDone or timeout event is received. To prevent this, create a channel and periodically check the flush counter to minimize latency.
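Building on the counter sketch above, a hedged illustration of the periodic check; `waitForFlush` and the 50ms interval are arbitrary choices for the example:

```go
package extension

import (
	"context"
	"time"
)

// waitForFlush polls the flush counter at a short interval so the
// extension notices the agent's signal quickly without blocking
// forever on a channel that may never receive.
func waitForFlush(ctx context.Context, s *intakeState) {
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // the runtimeDone or timeout event won the race
		case <-ticker.C:
			if s.hasPendingFlush() {
				return // the agent signalled flushed=true
			}
		}
	}
}
```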
```go
// ShouldFlush returns true if the client should flush APM data after processing the event.
func (c *Client) ShouldFlush() bool {
	return c.sendStrategy == SyncFlush || c.hasPendingFlush()
}
```
Looking at this again, I don't know that this is desirable.

I think the `?flushed=true` wording is a bit confusing: it doesn't mean that the extension should flush immediately, it just means that the agent (client) has flushed, which in turn means that the Lambda invocation has completed.

I think we should revert to only synchronously flushing when `sendStrategy == SyncFlush`.
I might be misinterpreting the specification, but I think the goal of `flushed=true` was to reduce latency, so that the Lambda knows that it can flush its data.

The way I interpreted it was:

- `SyncFlush`: flush on every intake request. We forward every request from the agent as soon as we receive it.
- `flushed=true`: sent with the final intake request. We buffer the previous requests, and this is a signal that we can flush the data.
> I think we should revert to only synchronously flushing when `sendStrategy == SyncFlush`.
Question: wouldn't that mean that buffered data has a chance of being flushed only on shutdown or while processing an event? Is that intended?

I think that would lead to a potential delay, since if the buffer is not emptied we would have to wait for shutdown, which could take a while.
This is why I think the naming is confusing :)

There are two distinct "flush" events:

- the agent flushes data to the extension
- the extension flushes data to the server

The `?flushed=true` request indicates to the extension that the first event has happened. This is a prerequisite for the extension flushing data to the server, but it does not mean the extension must flush immediately. This behaviour is intended to be controlled by the send strategy.
> Question: wouldn't that mean that buffered data has a chance of being flushed only on shutdown or while processing an event? Is that intended?
> I think that would lead to a potential delay, since if the buffer is not emptied we would have to wait for shutdown, which could take a while.
Yes, that is intended. There's a trade-off, as explained in the send strategy docs linked above (and sketched after this list):

- In `syncflush` mode, data gets flushed immediately after an invocation. This means the extension cannot service another Lambda invocation until the events are flushed, reducing Lambda invocation throughput.
- In `background` mode, data gets flushed in the background while subsequent Lambda invocations are being processed, or on shutdown if there are no subsequent invocations. This means that data might be significantly delayed in case there are only sporadic Lambda invocations, but Lambda invocation throughput will not be significantly reduced.
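A rough sketch of how the two strategies might diverge once an invocation completes; `sendStrategy` and `SyncFlush` mirror the snippet earlier in this thread, while `afterInvocation`, the `Background` identifier, and the surrounding plumbing are invented for illustration:

```go
package extension

import "context"

// SendStrategy selects when buffered APM data is forwarded.
type SendStrategy string

const (
	SyncFlush  SendStrategy = "syncflush"
	Background SendStrategy = "background"
)

type Client struct {
	sendStrategy SendStrategy
}

func (c *Client) FlushAPMData(ctx context.Context) { /* forward buffered data */ }

// afterInvocation sketches the trade-off between the two strategies.
func (c *Client) afterInvocation(ctx context.Context) {
	switch c.sendStrategy {
	case SyncFlush:
		// Block until buffered data reaches the APM server, trading
		// invocation throughput for low delivery latency.
		c.FlushAPMData(ctx)
	case Background:
		// Leave data buffered; it is forwarded while later invocations
		// run, or by the deferred flush on shutdown.
	}
}
```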
Ah I see, thank you for the explanation! 🙇
I've updated the code to the correct behaviour:

- Only synchronously flush when `sendStrategy == SyncFlush`. Do not flush just because there are unhandled `flushed=true` requests.
- Go back to a less disruptive change: remove the flush count and don't keep track of multiple flushed requests, but reset the channel before processing the event.
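With that change, the `ShouldFlush` check from earlier in the thread reduces to a single comparison (assuming the same `Client` as above):

```go
// ShouldFlush returns true if the client should flush APM data after
// processing the event; only the send strategy decides this now.
func (c *Client) ShouldFlush() bool {
	return c.sendStrategy == SyncFlush
}
```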
Almost there :)
Basically I think we should revert the changes to `ForwardApmData` and `EnqueueAPMData`, and not require enqueuing anything to close `c.flushCh`.
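A sketch of that suggestion: the intake handler closes `c.flushCh` itself when it sees `?flushed=true`, so `ForwardApmData` and `EnqueueAPMData` stay untouched. `handleIntake`, `flushOnce`, and `buffer` are invented names; the `sync.Once` guards against a double close if several flushed requests arrive before the channel is reset:

```go
package extension

import "sync"

type Client struct {
	flushCh   chan struct{}
	flushOnce sync.Once
	buffer    chan []byte
}

// handleIntake is a hypothetical intake handler: buffering is unchanged,
// and the flush signal is raised directly by closing the channel.
func (c *Client) handleIntake(flushed bool, body []byte) {
	c.buffer <- body // unchanged buffering path
	if flushed {
		c.flushOnce.Do(func() { close(c.flushCh) })
	}
}
```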
Co-authored-by: Andrew Wilkins <[email protected]>
Thanks for your persistence 😄
LGTM!
I've run the Lambda in a loop for a while, and it doesn't appear to panic at all. I did see some unhandled timeout errors, but I get those without the agent or extension enabled too, so they don't appear to be related to the extension.
Closes #245