-
Notifications
You must be signed in to change notification settings - Fork 475
fix(ess_billing): Improve reliability and prevent API errors #14744
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix(ess_billing): Improve reliability and prevent API errors #14744
Conversation
Format the CEL programs with celfmt. [git-generate] go run github.com/elastic/[email protected] -agent -i packages/ess_billing/data_stream/billing/agent/stream/cel.yml.hbs -o packages/ess_billing/data_stream/billing/agent/stream/cel.yml.hbs go run github.com/elastic/[email protected] -agent -i packages/ess_billing/data_stream/credits/agent/stream/cel.yml.hbs -o packages/ess_billing/data_stream/credits/agent/stream/cel.yml.hbs
Set want_more to false when receiving non-200 HTTP status codes to prevent the CEL input from making repeated requests that will inevitably fail. The input will retry at the next periodic interval instead of exhausting the CEL execution budget. Fixes elastic#14743
The `now()` function is evaluated at the time of execution, which can lead to inconsistent timestamps if the program's runtime spans across a time boundary that the logic is sensitive to. In contrast, `now` is a variable representing the timestamp when the program execution began. This provides a stable time reference throughout a single execution of the program. This change replaces all instances of now() with now to ensure consistent and predictable time-based calculations. See: https://pkg.go.dev/github.com/elastic/mito/lib#Time
1e80df8
to
9e53bb8
Compare
Add validation to ensure the 'from' parameter is not earlier than 2021-01-01 by using the max() function to clamp the calculated timestamp to the API minimum. This prevents repeated Bad Request errors that occur when the cursor or lookbehind configuration creates earlier dates. The commit also simplifies the date calculation logic using optMap for better maintainability. Fixes elastic#14755
This all seems sane, though it looks like the system test needs attention since it's failing. I would be nice to understand where the invalid timestamp in the bugged case came from, but this will fix it. |
Looks like the CI problem is my usage of
|
The CEL max() function requires elastic/mito 1.16.0 or greater. This was first added in Elastic Beats v8.18.0 (elastic/beats@8d5620d). So, I have raised the Kibana requirement to 8.18.0 (as a proxy for a true Elastic Agent version constraint) and documented in the changelog that Elastic Agent >=8.18 is necessary.
|
💚 Build Succeeded
History
|
@3kt would you be able to review my changes (since you have contributed and are a CODEOWNER)? Thanks! |
packages/ess_billing/data_stream/billing/_dev/deploy/docker/files/stream-config.yml
Show resolved
Hide resolved
@@ -83,22 +85,22 @@ program: |- | |||
"last_to": req.to, | |||
}, | |||
// Are we more than 1 day behind? | |||
"want_more": req.to < now() - duration("24h"), | |||
"want_more": req.to < now - duration("24h"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, what's the difference between now()
and now
(apart from the obvious that one is a function and the other a variable)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://pkg.go.dev/github.com/elastic/mito/lib#Time
now
is global var set at execution time. now()
is function that returns the current time. By invoking now()
you could have a different view of the current time in all of the places that it is used.
@@ -119,7 +121,7 @@ program: |- | |||
"cursor": { | |||
"last_to": req.to, | |||
}, | |||
"want_more": req.to < now() - duration("24h"), | |||
"want_more": false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we stop there?
Let's say we have a 429 or 503 - shouldn't we retry if we're not caught up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the crux of the problem. For instance, by returning want_more: true
on HTTP 500, it will trigger 999 more CEL program executions in just a few seconds (assuming the server keeps responding with non-200 status codes). There is already a default built-in retry request mechanism that attempts the request up to 5 times.
My goal is to stop the program and have the next iteration retry the same request. However, it has been some time since I made this change, so I need to revisit it to ensure the cursor isn't advanced. I might need to remove the cursor
from this failure case to prevent advancing it. I'll test this and provide an update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other issue is that we'll wait 24 hours before retrying I guess? 24h being the execution schedule.
processors: | ||
- drop_event.when.equals.fake: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The drop
processor then needs to be removed from the ingest pipeline as well:
- drop:
if: "ctx.ess?.billing == null && ctx.error?.message == null"
Its only purpose was to drop this fake event, but I got heavy-handed and stated "anything that doesn't have an error or the billing
namespace should be dropped".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be removed, but to be safe it should probably be removed at a later time to give Agents a chance to upgrade and get the local beat drop_event processor. The package upgrade and the agent policy upgrade are not an atomic operation. If the package upgrades, but the related Agent policies are not immediately upgraded then this could offer a window where fake: true
events sneak through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very good point actually, I forgot the local and remote parts aren't necessarily atomically tied.
Maybe worth commenting in the pipeline that this should be removed at a later date then?
NOTE: A commit-by-commit review is recommended because of the celfmt changes.
Proposed commit message
Checklist
changelog.yml
file.How to test this PR locally
elastic-package -C packages/ess_billing test system -v
Related issues