Skip to content

fix: always flush data to apm before shutting down and rework agent done signal #258

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Aug 22, 2022

Conversation

kruskall
Copy link
Member

@kruskall kruskall commented Aug 1, 2022

Add a defer statement to make sure that we always flush data to the
apm server before shutting down.

Remove agent done signal channel and avoid leaking implementation details.
The channel was being recreated and closed on each event, racing with the
intake handler that was sending to the channel.
The channel is now used internally by the apm client and external packages
can call 'Done()' to check whether the agent has sent the final intake
request.

See https://github.com/elastic/apm/blob/main/specs/agents/tracing-instrumentation-aws-lambda.md#data-flushing

Closes #245

…one signal

Add a defer statement to make sure that we always flush data to the
apm server before shutting down.

Remove agent done signal channel and avoid leaking implementation details.
The channel was being recreated and closed on each event, racing with the
intake handler that was sending to the channel.
The channel is now used internally by the apm client and external packages
can call 'Done()' to check whether the agent has sent the final intake
request.

See https://github.com/elastic/apm/blob/main/specs/agents/tracing-instrumentation-aws-lambda.md#data-flushing
@github-actions github-actions bot added the aws-λ-extension AWS Lambda Extension label Aug 1, 2022
@apmmachine
Copy link

apmmachine commented Aug 1, 2022

💔 Tests Failed

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-08-22T01:48:54.027+0000

  • Duration: 7 min 16 sec

Test stats 🧪

Test Results
Failed 1
Passed 129
Skipped 32
Total 162

Test errors 1

Expand to view the tests failures

Test / Matrix - PLATFORM = 'ubuntu-18 && immutable' / Test / TestContinuedAPMServerFailure – elastic/apm-lambda-extension/apmproxy
    Expand to view the error details

     Failed 
    

    Expand to view the stacktrace

     === RUN   TestContinuedAPMServerFailure
        logger.go:130: 2022-08-22T01:56:00.348Z	DEBUG	APM server Transport status set to Healthy
        logger.go:130: 2022-08-22T01:56:00.348Z	DEBUG	APM server Transport status set to Failing
        logger.go:130: 2022-08-22T01:56:00.348Z	DEBUG	Grace period entered, reconnection count : 0
        logger.go:130: 2022-08-22T01:56:05.327Z	DEBUG	Grace period over - timer timed out
        logger.go:130: 2022-08-22T01:56:05.327Z	DEBUG	APM server Transport status set to Pending
        apmserver_test.go:474: 
            	Error Trace:	apmserver_test.go:474
            	Error:      	Condition never satisfied
            	Test:       	TestContinuedAPMServerFailure
    --- FAIL: TestContinuedAPMServerFailure (5.00s)
     
    

Steps errors 1

Expand to view the steps failures

Running Go tests
  • Took 0 min 24 sec . View more details here
  • Description: gotestsum --format testname --junitfile junit-report.xml -- -v ./...

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@kruskall kruskall requested a review from axw August 3, 2022 01:55
During multiple invocations the lambda can reuse the environment if
a warm start is taking place. We cannot assume a request with 'flushed=true'
will be the last one for the lifetime of the application.
Replace the channel with a counter that is increased when we
receive a request with 'flushed=true' and it is decreased if
we meet such request in the buffered data while sending to the APM
server.
@kruskall kruskall requested a review from axw August 12, 2022 23:56
The flush signal is received on a separate goroutine (http handler) so we
cannot assume anything on its relationship with the event processed by other
goroutines.
If we just check once we might miss the signal and hang until the runtimeDone
or timeout event is received.
To prevent this, create a channel and periodically check the flush counter to
minimize latency.
@kruskall kruskall requested a review from axw August 17, 2022 00:13

// ShouldFlush returns true if the client should flush APM data after processing the event.
func (c *Client) ShouldFlush() bool {
return c.sendStrategy == SyncFlush || c.hasPendingFlush()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again, I don't know that this is desirable.

I think the ?flushed=true wording is a bit confusing: it doesn't mean that the extension should flush immediately, it just means that the agent (client) has flushed, which in turn means that the Lambda invocation has completed.

I think we should revert to only synchronously flushing when sendStrategy == SyncFlush.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be misinterpreting the specification, but I think the goal of flushing=true was to reduce latency so that the lambda knows that it can flush its data.

The way I interpreted it was:

  • SyncFlush: flush on every intake request. We forwards every request from the agent as soon as we receive it.
  • flushed=true: sent with the final intake request. We buffered the previous requests and this is a signal that we can flush the data.

I think we should revert to only synchronously flushing when sendStrategy == SyncFlush.

Question: Wouldn't that mean that buffered data have a chance of being flushed only on shutdown or while processing an event ? Is that intended ?
I think that would lead to a potential delay since if the buffer is not emptied we would have to wait for shutdown which could take a while.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why I think the naming is confusing :)

There's two distinct "flush" events:

  1. agent flushes data to the extension
  2. extension flushes data to the server

The ?flushed=true request indicates to the extension that the first event has happened. This is a prerequisite for the extension flushing data to the server, but it does not mean the server must flush immediately. This behaviour is intended to be controlled by the send strategy.

Question: Wouldn't that mean that buffered data have a chance of being flushed only on shutdown or while processing an event ? Is that intended ?
I think that would lead to a potential delay since if the buffer is not emptied we would have to wait for shutdown which could take a while.

Yes, that is intended. There's a trade-off, as explained in the send strategy docs linked above:

  • In syncflush mode, data gets flushed immediately after an invocation. This means the extension cannot service another Lambda invocation until the events are flushed, reducing Lambda invocation throughput.
  • In background mode, data gets flushed in the background while subsequent Lambda invocations are being processed, or on shutdown if there are no subsequent invocations. This means that data might be significantly delayed in case there are only sporadic Lambda invocations, but Lambda invocation throughput will not be significantly reduced.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, thank you for the explanation! 🙇

I've updated the code to the correct behaviour

only synchronously flush on sendstrategy == syncflush. Do not
flush just because there are unhandled flushed=true requests.
@kruskall kruskall requested a review from axw August 18, 2022 07:15
Go back to a less disruptive change. Remove flush count, don't
keep track of multiple flushed requests but reset the channel
before processing the event.
@kruskall kruskall requested a review from axw August 21, 2022 19:21
Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there :)

Basically I think we should revert the changes to ForwardApmData and EnqueueAPMData, and don't require enqueuing anything to close c.flushCh.

@kruskall kruskall requested a review from axw August 22, 2022 01:40
Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your persistence 😄
LGTM!

@kruskall kruskall merged commit 19bea8c into elastic:main Aug 22, 2022
@kruskall kruskall deleted the fix/apm-done-flush branch August 22, 2022 01:59
@axw
Copy link
Member

axw commented Sep 29, 2022

I've run the Lambda in a loop for a while, and it doesn't appear to panic at all. I did see some unhandled timeout errors, but I get those without the agent or extension enabled too - doesn't appear to be related to the extension.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aws-λ-extension AWS Lambda Extension
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Server panic when sending events to APM server
3 participants