fix(ess_billing): Improve reliability and prevent API errors #14744

Open
andrewkroh wants to merge 7 commits into main from ess_billing/fix/no-retry-on-403

andrewkroh
Member

@andrewkroh andrewkroh commented Jul 30, 2025

NOTE: A commit-by-commit review is recommended because of the celfmt changes.

Proposed commit message

fix(ess_billing): Improve reliability and prevent API errors

This commit introduces several improvements to the ess_billing integration to
enhance its reliability and prevent common API errors.

- To prevent repeated failing requests on non-200 HTTP status codes, the CEL
  program now sets want_more to false. This allows the input to retry at the
  next periodic interval instead of exhausting the execution budget.
- A validation rule has been added to enforce a minimum 'from' date of
  2021-01-01. This clamps the calculated timestamp to the API's minimum allowed
  value, preventing 'Bad Request' errors caused by lookbehind configurations
  creating excessively early dates. This uses the max() function which requires
  Elastic Agent 8.18 or greater so the Kibana constraint was raised as a proxy.
- All instances of the now() function in the CEL program have been replaced
  with the now variable.  This ensures a stable time reference throughout a
  single execution, leading to more consistent and predictable time-based
  calculations.

  A system test has been added to cover these scenarios.

  Fixes #14743
  Fixes #14755

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices

How to test this PR locally

elastic-package -C packages/ess_billing test system -v

Related issues

@andrewkroh andrewkroh marked this pull request as ready for review July 30, 2025 15:49
@andrewkroh andrewkroh requested a review from a team as a code owner July 30, 2025 15:49
Format the CEL programs with celfmt.

[git-generate]
go run github.com/elastic/[email protected] -agent -i packages/ess_billing/data_stream/billing/agent/stream/cel.yml.hbs -o packages/ess_billing/data_stream/billing/agent/stream/cel.yml.hbs
go run github.com/elastic/[email protected] -agent -i packages/ess_billing/data_stream/credits/agent/stream/cel.yml.hbs -o packages/ess_billing/data_stream/credits/agent/stream/cel.yml.hbs
Set want_more to false when receiving non-200 HTTP status codes to
prevent the CEL input from making repeated requests that will inevitably
fail. The input will retry at the next periodic interval instead of
exhausting the CEL execution budget.

Fixes elastic#14743
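
To illustrate the budget problem described above, here is a rough Python simulation. It is not the CEL input itself: `MAX_EXECUTIONS`, `fetch`, `evaluate`, and `run_interval` are hypothetical names, the budget of 1000 executions per interval is an assumption, and the API is modeled as always returning HTTP 403.

```python
MAX_EXECUTIONS = 1000  # assumed per-interval execution budget


def fetch():
    """Hypothetical API call that keeps failing with HTTP 403."""
    return 403


def evaluate(want_more_on_error):
    """One program evaluation; returns the want_more flag it would emit."""
    status = fetch()
    if status != 200:
        return want_more_on_error
    return False  # nothing more to fetch in this toy example


def run_interval(want_more_on_error):
    """Re-evaluate while want_more is true, up to the budget; return the request count."""
    executions = 0
    while executions < MAX_EXECUTIONS:
        executions += 1
        if not evaluate(want_more_on_error):
            break
    return executions


print(run_interval(want_more_on_error=True))   # 1000
print(run_interval(want_more_on_error=False))  # 1
```

With `want_more` left true on errors, the simulated interval spends its whole budget on requests that fail; with it set to false, a single failed request ends the interval.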
The `now()` function is evaluated at the time of execution, which can lead
to inconsistent timestamps if the program's runtime spans across a time
boundary that the logic is sensitive to.

In contrast, `now` is a variable representing the timestamp when the
program execution began. This provides a stable time reference
throughout a single execution of the program.

This change replaces all instances of now() with now to ensure
consistent and predictable time-based calculations.

See: https://pkg.go.dev/github.com/elastic/mito/lib#Time
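
The difference can be sketched in Python with a deterministic clock. This is only an analogy for the CEL behavior: `FakeClock`, `window_with_function`, and `window_with_variable` are hypothetical names, and the one-second drift per read is exaggerated so the effect is visible.

```python
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Deterministic clock that advances one second on every read."""

    def __init__(self, start):
        self.current = start

    def __call__(self):
        value = self.current
        self.current += timedelta(seconds=1)
        return value


def window_with_function(clock):
    """Reads the clock twice, like calling now() in two places."""
    end = clock()
    start = clock() - timedelta(hours=24)
    return start, end


def window_with_variable(clock):
    """Reads the clock once, like using the now variable throughout."""
    now = clock()
    return now - timedelta(hours=24), now


t0 = datetime(2025, 7, 30, 23, 59, 59, tzinfo=timezone.utc)
start, end = window_with_function(FakeClock(t0))
print(end - start)  # 23:59:59
start, end = window_with_variable(FakeClock(t0))
print(end - start)  # 1 day, 0:00:00
```

Reading the clock once gives a window of exactly 24 hours; reading it twice lets the two ends drift apart by however long the program took between reads.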
@andrewkroh andrewkroh force-pushed the ess_billing/fix/no-retry-on-403 branch from 1e80df8 to 9e53bb8 Compare July 30, 2025 15:50
@andrewkroh andrewkroh added the Integration:ess_billing Elasticsearch Service Billing (Community supported) label Jul 30, 2025
Add validation to ensure the 'from' parameter is not earlier than
2021-01-01 by using the max() function to clamp the calculated timestamp
to the API minimum. This prevents repeated Bad Request errors that occur
when the cursor or lookbehind configuration creates earlier dates. The
commit also simplifies the date calculation logic using optMap for
better maintainability.

Fixes elastic#14755
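
A minimal Python sketch of the clamping idea follows. The real logic lives in the CEL program and is not reproduced on this page, so `compute_from` and its parameters are hypothetical; only the 2021-01-01 minimum comes from the commit message.

```python
from datetime import datetime, timedelta, timezone

# Earliest 'from' value the API accepts, per the commit message.
MIN_FROM = datetime(2021, 1, 1, tzinfo=timezone.utc)


def compute_from(now, lookbehind, cursor_last_to=None):
    """Resume from the cursor if present, else look back from now; then clamp."""
    candidate = cursor_last_to if cursor_last_to is not None else now - lookbehind
    return max(candidate, MIN_FROM)


now = datetime(2025, 7, 30, tzinfo=timezone.utc)
print(compute_from(now, timedelta(days=3650)))  # 2021-01-01 00:00:00+00:00
print(compute_from(now, timedelta(days=30)))    # 2025-06-30 00:00:00+00:00
```

Taking the maximum of the candidate and the floor leaves valid dates untouched and replaces only those that would fall before the API minimum.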
@efd6
Contributor

efd6 commented Jul 31, 2025

This all seems sane, though it looks like the system test needs attention since it's failing. It would be nice to understand where the invalid timestamp in the bugged case came from, but this will fix it.

@andrewkroh andrewkroh added the bugfix Pull request that fixes a bug issue label Jul 31, 2025
@andrewkroh andrewkroh changed the title from "fix(ess_billing.billing): set want_more to false on error" to "fix(ess_billing): Improve reliability and prevent API errors" Jul 31, 2025
@andrewkroh
Member Author

andrewkroh commented Jul 31, 2025

Looks like the CI problem is my usage of max with 8.15.0.

fleet-server-1 | {"log.level":"info","@timestamp":"2025-07-31T03:37:26.000Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","req.Components":[{"id":"cel-default","message":"Healthy: communicating with pid '316'","status":"HEALTHY","type":"cel","units":[{"id":"cel-default-cel-ESS Billing-4f94630c-28ef-4096-a8ba-2306872c53e2","message":"failed to check program: failed compilation: ERROR: <input>:11:8: found no matching overload for 'max' applied to '(timestamp, timestamp)'\n | max(from, timestamp(\"2021-01-01T00:00:00Z\"))\n | .......^ accessing config","payload":{"streams":{"cel-ess_billing.billing-4f94630c-28ef-4096-a8ba-2306872c53e2":{"error":"","status":"STARTING"}}},"status":"FAILED","type":"input"},{"id":"cel-default","message":"Healthy","status":"HEALTHY","type":"output"}]}],"service.type":"fleet-server","http.request.id":"01K1F8WQZJQW54PFJT18VCHVNX","server.address":"","fleet.agent.id":"64fa78d6-ba93-421e-b6bf-f44e31225444","fleet.access.apikey.id":"wx2JXpgBcJYJrjvG18FE","@timestamp":"2025-07-31T03:37:26Z","ecs.version":"1.6.0","ecs.version":"1.6.0"}

The CEL max() function requires elastic/mito 1.16.0 or greater. This was
first added in Elastic Beats v8.18.0 (elastic/beats@8d5620d). So, I have
raised the Kibana requirement to 8.18.0 (as a proxy for a true Elastic
Agent version constraint) and documented in the changelog that Elastic
Agent >=8.18 is necessary.

@elasticmachine

💚 Build Succeeded

@andrewkroh
Member Author

@3kt would you be able to review my changes (since you have contributed and are a CODEOWNER)? Thanks!

@andrewkroh andrewkroh requested a review from 3kt August 5, 2025 21:16
@@ -83,22 +85,22 @@ program: |-
 "last_to": req.to,
 },
 // Are we more than 1 day behind?
-"want_more": req.to < now() - duration("24h"),
+"want_more": req.to < now - duration("24h"),
Contributor

Out of curiosity, what's the difference between now() and now (apart from the obvious that one is a function and the other a variable)?

Member Author

https://pkg.go.dev/github.com/elastic/mito/lib#Time

now is a global variable set when program execution begins. now() is a function that returns the current time. By invoking now() you could get a different view of the current time in each place it is used.

@@ -119,7 +121,7 @@ program: |-
 "cursor": {
 "last_to": req.to,
 },
-"want_more": req.to < now() - duration("24h"),
+"want_more": false,
Contributor

Why would we stop there?
Let's say we have a 429 or 503 - shouldn't we retry if we're not caught up?

Member Author

This is the crux of the problem. For instance, by returning want_more: true on HTTP 500, it will trigger 999 more CEL program executions in just a few seconds (assuming the server keeps responding with non-200 status codes). There is already a default built-in retry request mechanism that attempts the request up to 5 times.

My goal is to stop the program and have the next iteration retry the same request. However, it has been some time since I made this change, so I need to revisit it to ensure the cursor isn't advanced. I might need to remove the cursor from this failure case to prevent advancing it. I'll test this and provide an update.
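
One possible shape for the behavior described here, sketched in Python rather than CEL. This is an assumption about how the failure case could be handled, not the change that was made in this PR; `next_state` and its arguments are hypothetical names.

```python
def next_state(state, status, req_to, caught_up):
    """Return the state for the next evaluation (hypothetical shape)."""
    if status != 200:
        # Failure: keep the existing cursor and stop until the next interval.
        return {**state, "want_more": False}
    # Success: advance the cursor and continue only if still behind.
    return {**state, "cursor": {"last_to": req_to}, "want_more": not caught_up}


state = {"cursor": {"last_to": "2025-07-01T00:00:00Z"}}
failed = next_state(state, 503, "2025-07-02T00:00:00Z", caught_up=False)
print(failed["cursor"]["last_to"])  # 2025-07-01T00:00:00Z
print(failed["want_more"])          # False
```

Because the cursor is left untouched on failure, the next periodic run would repeat the same request window instead of skipping it.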

Contributor

The other issue is that we'll wait 24 hours before retrying I guess? 24h being the execution schedule.

processors:
- drop_event.when.equals.fake: true
Contributor

The drop processor then needs to be removed from the ingest pipeline as well:

  - drop:
      if: "ctx.ess?.billing == null && ctx.error?.message == null"

Its only purpose was to drop this fake event, but I got heavy-handed and stated "anything that doesn't have an error or the billing namespace should be dropped".

Member Author

It can be removed, but to be safe it should probably be removed at a later time to give Agents a chance to upgrade and get the local beat drop_event processor. The package upgrade and the agent policy upgrade are not an atomic operation. If the package upgrades, but the related Agent policies are not immediately upgraded then this could offer a window where fake: true events sneak through.

Contributor

@3kt 3kt Aug 11, 2025

This is a very good point actually, I forgot the local and remote parts aren't necessarily atomically tied.
Maybe worth commenting in the pipeline that this should be removed at a later date then?

Development

Successfully merging this pull request may close these issues.

[ESS Billing] Invalid 'from' query param
[ESS Billing]: CEL Configuration Issue
4 participants