fix: always flush data to apm before shutting down and rework agent done signal #258

kruskall · 2022-08-01T03:01:28Z

Add a defer statement to make sure that we always flush data to the
apm server before shutting down.

Remove agent done signal channel and avoid leaking implementation details.
The channel was being recreated and closed on each event, racing with the
intake handler that was sending to the channel.
The channel is now used internally by the apm client and external packages
can call 'Done()' to check whether the agent has sent the final intake
request.
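One way to keep such a signal internal, sketched under assumptions: the real signature of `Done()` is not shown in this description, so the version below assumes it returns a receive-only channel, and `markDone` is a hypothetical helper. The channel is created once and closed at most once, so there is nothing for the intake handler to race with.

```go
package main

import (
	"fmt"
	"sync"
)

// Client is a hypothetical, trimmed-down APM client.
type Client struct {
	done     chan struct{}
	doneOnce sync.Once
}

// NewClient creates the done channel exactly once, for the client's lifetime.
func NewClient() *Client {
	return &Client{done: make(chan struct{})}
}

// Done returns a receive-only channel that is closed once the agent has sent
// its final intake request. Callers can wait on it but cannot close it.
func (c *Client) Done() <-chan struct{} {
	return c.done
}

// markDone is what the intake handler would call (assumed name). sync.Once
// makes repeated or concurrent calls safe.
func (c *Client) markDone() {
	c.doneOnce.Do(func() { close(c.done) })
}

func main() {
	c := NewClient()

	// Several handler goroutines may report completion concurrently.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.markDone()
		}()
	}
	wg.Wait()

	select {
	case <-c.Done():
		fmt.Println("agent done")
	default:
		fmt.Println("agent still running")
	}
}
```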

See https://github.com/elastic/apm/blob/main/specs/agents/tracing-instrumentation-aws-lambda.md#data-flushing

Closes #245

apmmachine · 2022-08-01T03:09:46Z

💔 Tests Failed

Build stats

Start Time: 2022-08-22T01:48:54.027+0000
Duration: 7 min 16 sec

Test stats 🧪

| Test    | Results |
|---------|---------|
| Failed  | 1       |
| Passed  | 129     |
| Skipped | 32      |
| Total   | 162     |

Test errors

`Test / Matrix - PLATFORM = 'ubuntu-18 && immutable' / Test / TestContinuedAPMServerFailure – elastic/apm-lambda-extension/apmproxy`

 Failed

 === RUN   TestContinuedAPMServerFailure
    logger.go:130: 2022-08-22T01:56:00.348Z	DEBUG	APM server Transport status set to Healthy
    logger.go:130: 2022-08-22T01:56:00.348Z	DEBUG	APM server Transport status set to Failing
    logger.go:130: 2022-08-22T01:56:00.348Z	DEBUG	Grace period entered, reconnection count : 0
    logger.go:130: 2022-08-22T01:56:05.327Z	DEBUG	Grace period over - timer timed out
    logger.go:130: 2022-08-22T01:56:05.327Z	DEBUG	APM server Transport status set to Pending
    apmserver_test.go:474: 
        	Error Trace:	apmserver_test.go:474
        	Error:      	Condition never satisfied
        	Test:       	TestContinuedAPMServerFailure
--- FAIL: TestContinuedAPMServerFailure (5.00s)

Steps errors

`Running Go tests`

Took 0 min 24 sec.
Description: gotestsum --format testname --junitfile junit-report.xml -- -v ./...

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

apm-lambda-extension/apmproxy/receiver.go

Across multiple invocations, Lambda can reuse the environment when a warm start takes place, so we cannot assume that a request with 'flushed=true' will be the last one for the lifetime of the application. Replace the channel with a counter that is incremented when we receive a request with 'flushed=true' and decremented when we encounter such a request in the buffered data while sending to the APM server.
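A rough sketch of the counter described in this commit message (a later commit in this PR removes the counter again). `hasPendingFlush` is the name used in the reviewed diff; `flushTracker`, `onIntake`, and `onForward` are hypothetical names for illustration only.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// flushTracker counts flushed=true requests that were received from the
// agent but not yet forwarded to the APM server.
type flushTracker struct {
	pending int32
}

// onIntake is called by the receiver for every agent request.
func (t *flushTracker) onIntake(flushed bool) {
	if flushed {
		atomic.AddInt32(&t.pending, 1)
	}
}

// onForward is called when a buffered request is sent to the APM server.
func (t *flushTracker) onForward(flushed bool) {
	if flushed {
		atomic.AddInt32(&t.pending, -1)
	}
}

// hasPendingFlush reports whether any flushed=true request is still buffered.
func (t *flushTracker) hasPendingFlush() bool {
	return atomic.LoadInt32(&t.pending) > 0
}

func main() {
	tr := &flushTracker{}

	// Two warm-start invocations, each ending with a flushed=true request.
	for i := 1; i <= 2; i++ {
		tr.onIntake(false)
		tr.onIntake(true)
		fmt.Println("invocation", i, "pending flush:", tr.hasPendingFlush())

		tr.onForward(false)
		tr.onForward(true)
		fmt.Println("invocation", i, "pending flush after forwarding:", tr.hasPendingFlush())
	}
}
```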

apm-lambda-extension/app/run.go

The flush signal is received on a separate goroutine (the HTTP handler), so we cannot assume anything about its ordering relative to the events processed by other goroutines. If we check only once, we might miss the signal and hang until the runtimeDone or timeout event is received. To prevent this, create a channel and periodically check the flush counter to minimize latency.

apm-lambda-extension/apmproxy/apmserver.go

axw · 2022-08-17T01:10:08Z

apm-lambda-extension/apmproxy/apmserver.go

+
+// ShouldFlush returns true if the client should flush APM data after processing the event.
+func (c *Client) ShouldFlush() bool {
+	return c.sendStrategy == SyncFlush || c.hasPendingFlush()


Looking at this again, I don't know that this is desirable.

I think the ?flushed=true wording is a bit confusing: it doesn't mean that the extension should flush immediately, it just means that the agent (client) has flushed, which in turn means that the Lambda invocation has completed.

I think we should revert to only synchronously flushing when sendStrategy == SyncFlush.

I might be misinterpreting the specification, but I think the goal of flushed=true was to reduce latency so that the lambda knows that it can flush its data.

The way I interpreted it was:

SyncFlush: flush on every intake request. We forward every request from the agent as soon as we receive it.

flushed=true: sent with the final intake request. We buffered the previous requests and this is a signal that we can flush the data.

> I think we should revert to only synchronously flushing when sendStrategy == SyncFlush.

Question: Wouldn't that mean that buffered data has a chance of being flushed only on shutdown or while processing an event? Is that intended?
I think that would lead to a potential delay: if the buffer is not emptied, we would have to wait for shutdown, which could take a while.

This is why I think the naming is confusing :)

There are two distinct "flush" events:

1. agent flushes data to the extension
2. extension flushes data to the server

The ?flushed=true request indicates to the extension that the first event has happened. This is a prerequisite for the extension flushing data to the server, but it does not mean the extension must flush immediately. This behaviour is intended to be controlled by the send strategy.

> Question: Wouldn't that mean that buffered data have a chance of being flushed only on shutdown or while processing an event ? Is that intended ?
> I think that would lead to a potential delay since if the buffer is not emptied we would have to wait for shutdown which could take a while.

Yes, that is intended. There's a trade-off, as explained in the send strategy docs linked above:

In syncflush mode, data gets flushed immediately after an invocation. This means the extension cannot service another Lambda invocation until the events are flushed, reducing Lambda invocation throughput.

In background mode, data gets flushed in the background while subsequent Lambda invocations are being processed, or on shutdown if there are no subsequent invocations. This means that data might be significantly delayed in case there are only sporadic Lambda invocations, but Lambda invocation throughput will not be significantly reduced.
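To make the trade-off concrete, here is a small sketch under stated assumptions. `sendStrategy`, `SyncFlush`, and `ShouldFlush` appear in the reviewed diff; the `Background` constant, the string values, `flush`, and `afterInvocation` are hypothetical names for illustration. In syncflush mode the caller is blocked until the data is sent, while in background mode the flush overlaps with whatever the caller does next.

```go
package main

import (
	"fmt"
	"sync"
)

// SendStrategy selects when buffered data is flushed to the APM server.
type SendStrategy string

const (
	SyncFlush  SendStrategy = "syncflush"
	Background SendStrategy = "background" // assumed name for background mode
)

// Client is a hypothetical, trimmed-down APM client.
type Client struct {
	sendStrategy SendStrategy

	mu       sync.Mutex
	buffered []string
	sent     []string
}

// ShouldFlush reports whether the extension must flush before it is ready
// for the next invocation. Only the send strategy decides this.
func (c *Client) ShouldFlush() bool {
	return c.sendStrategy == SyncFlush
}

// flush moves all buffered data to the "server".
func (c *Client) flush() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.sent = append(c.sent, c.buffered...)
	c.buffered = nil
}

// afterInvocation runs once the agent has signalled flushed=true. The
// returned channel is closed when the flush has completed.
func (c *Client) afterInvocation() <-chan struct{} {
	done := make(chan struct{})
	if c.ShouldFlush() {
		c.flush() // blocks the caller: the next invocation has to wait
		close(done)
		return done
	}
	go func() { // background: the caller can service the next invocation
		defer close(done)
		c.flush()
	}()
	return done
}

func main() {
	for _, s := range []SendStrategy{SyncFlush, Background} {
		c := &Client{sendStrategy: s, buffered: []string{"txn-1"}}
		<-c.afterInvocation()
		fmt.Println(s, "flushed", len(c.sent), "item(s)")
	}
}
```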

Ah I see, thank you for the explanation! 🙇

I've updated the code to the correct behaviour

apm-lambda-extension/apmproxy/apmserver.go

only synchronously flush on sendstrategy == syncflush. Do not flush just because there are unhandled flushed=true requests.

apm-lambda-extension/apmproxy/apmserver.go

apm-lambda-extension/apmproxy/client.go

apm-lambda-extension/apmproxy/receiver.go

Go back to a less disruptive change. Remove flush count, don't keep track of multiple flushed requests but reset the channel before processing the event.
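A sketch of the "reset the channel before processing the event" idea, assuming a mutex guards the swap so the handler never closes a channel that is being replaced. `flushCh` is the field name mentioned in review; `ResetFlush`, `WaitForFlush`, and `signalFlush` are hypothetical method names, not necessarily those in the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// Client is a hypothetical, trimmed-down APM client.
type Client struct {
	mu      sync.Mutex
	flushCh chan struct{}
}

func NewClient() *Client {
	return &Client{flushCh: make(chan struct{})}
}

// ResetFlush installs a fresh channel before the next event is processed.
func (c *Client) ResetFlush() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.flushCh = make(chan struct{})
}

// WaitForFlush returns the channel for the current invocation.
func (c *Client) WaitForFlush() <-chan struct{} {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.flushCh
}

// signalFlush is called by the intake handler on flushed=true. It closes
// the current channel and tolerates repeated signals within one invocation.
func (c *Client) signalFlush() {
	c.mu.Lock()
	defer c.mu.Unlock()
	select {
	case <-c.flushCh: // already closed for this invocation
	default:
		close(c.flushCh)
	}
}

func main() {
	c := NewClient()
	for i := 1; i <= 2; i++ { // two invocations in a warm environment
		c.ResetFlush()
		go c.signalFlush() // the agent reports flushed=true
		<-c.WaitForFlush()
		fmt.Println("invocation", i, "observed flush signal")
	}
}
```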

axw

Almost there :)

Basically, I think we should revert the changes to ForwardApmData and EnqueueAPMData, and not require enqueuing anything in order to close c.flushCh.

apm-lambda-extension/apmproxy/apmserver.go

apm-lambda-extension/apmproxy/receiver.go

apm-lambda-extension/app/run.go

Co-authored-by: Andrew Wilkins <[email protected]>

axw

Thanks for your persistence 😄
LGTM!

axw · 2022-09-29T03:42:24Z

I've run the Lambda in a loop for a while, and it doesn't appear to panic at all. I did see some unhandled timeout errors, but I get those without the agent or extension enabled too - doesn't appear to be related to the extension.

github-actions bot added the aws-λ-extension AWS Lambda Extension label Aug 1, 2022

kruskall requested a review from axw August 3, 2022 01:55

kruskall added 4 commits August 4, 2022 04:39

Merge branch 'main' into fix/apm-done-flush

500bb36

Merge branch 'main' into fix/apm-done-flush

f2e4705

sigh

0c35835

Merge branch 'main' into fix/apm-done-flush

ce7e05c

axw reviewed Aug 10, 2022

apm-lambda-extension/apmproxy/receiver.go (outdated, resolved)

kruskall requested a review from axw August 12, 2022 23:56

axw reviewed Aug 15, 2022

apm-lambda-extension/app/run.go (outdated, resolved)

kruskall requested a review from axw August 17, 2022 00:13

axw reviewed Aug 17, 2022

apm-lambda-extension/apmproxy/apmserver.go (outdated, resolved)

kruskall added 2 commits August 17, 2022 14:49

refactor: remove busy loop and rely on channels to signal flush requests

5e16d0f

fix: update behaviour based on flush strategy

5e1f58b

only synchronously flush on sendstrategy == syncflush. Do not flush just because there are unhandled flushed=true requests.

kruskall requested a review from axw August 18, 2022 07:15

axw reviewed Aug 18, 2022

apm-lambda-extension/apmproxy/apmserver.go (outdated, resolved)

apm-lambda-extension/apmproxy/client.go (outdated, resolved)

apm-lambda-extension/apmproxy/receiver.go (outdated, resolved)

fix: update flush logic and remove flush count

acbb700

Go back to a less disruptive change. Remove flush count, don't keep track of multiple flushed requests but reset the channel before processing the event.

kruskall requested a review from axw August 21, 2022 19:21

refactor: move flush reset inside process event

987d08a

axw requested changes Aug 22, 2022

kruskall and others added 2 commits August 22, 2022 03:32

refactor: revert queue changes

e096ec7

Co-authored-by: Andrew Wilkins <[email protected]>

fix: move reset flush to a defer call

7697493

kruskall requested a review from axw August 22, 2022 01:40

axw approved these changes Aug 22, 2022

Merge branch 'main' into fix/apm-done-flush

589deaf

kruskall merged commit 19bea8c into elastic:main Aug 22, 2022

kruskall deleted the fix/apm-done-flush branch August 22, 2022 01:59
