Skip to content

DRA kubelet: add dra_resource_claims_in_use gauge vector #131641

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

pohly
Copy link
Contributor

@pohly pohly commented May 7, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

The new metric informs admins whether DRA in general (empty "driver_name" label) and/or specific DRA drivers (label non-empty) are in use on nodes. This is useful to know because removing a driver is only safe if it is not in use. If a driver gets removed while it has prepared a ResourceClaim, unpreparing that ResourceClaim and stopping pods is blocked.

Which issue(s) this PR fixes:

Slack discussion: https://kubernetes.slack.com/archives/C0409NGC1TK/p1746168044475379?thread_ts=1746001550.655339&cid=C0409NGC1TK
Related-to: kubernetes/enhancements#4381 (GA?)

Special notes for your reviewer:

The unit tests check that metrics get calculated. The e2e_node test checks that kubelet really exports the metrics data.

While at it, some bugs in the claiminfo_test.go get fixed: the way how the cache got populated in the test did not match the code anymore.

Let's review this proposal, then document it as part of the 1.34 KEP update before merging the implementation.

/hold
/assign @bart0sh

Does this PR introduce a user-facing change?

The new `dra_resource_claims_in_use` kubelet metrics informs about active ResourceClaims, overall and by driver.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet labels May 7, 2025
@k8s-ci-robot k8s-ci-robot requested review from dchen1107 and sjenning May 7, 2025 11:52
@k8s-ci-robot k8s-ci-robot added area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 7, 2025
@k8s-ci-robot k8s-ci-robot added wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 7, 2025
// cdiDevicesAsList returns a list of CDIDevices from the provided claim info.
// When the request name is non-empty, only devices relevant for that request
// are returned.
func (info *ClaimInfo) cdiDevicesAsList(requestName string) []kubecontainer.CDIDevice {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this method unchanged because it was odd that most of the ClaimInfo methods were above, except for this one which came after the claimInfoCache methods.

changed := false
for _, inUse := range delta {
switch {
case inUse.Count == 0 && inUse.DriverName != "":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal in this PR is to use the label driver_name="" to represent "claims in use regardless of the driver". It is debatable whether the empty string or some other special string is better.

We could also use <any> or <all> because those are not valid driver names and thus wouldn't clash with the per-driver vectors.

Thoughts?

cc @dgrisonnet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vote for <any>. it looks much more explicit than "" to me.

@@ -382,6 +384,43 @@ var _ = framework.SIGDescribe("node")("DRA", feature.DynamicResourceAllocation,
gomega.Eventually(kubeletPlugin2.GetGRPCCalls).WithTimeout(retryTestTimeout).Should(testdriver.NodeUnprepareResourcesSucceeded)
})

ginkgo.It("must provide metrics", func(ctx context.Context) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an e2e_node test because it is easier to get the kubelet metrics there.

If that could be done also in an E2E test, then putting the test there would be more appropriate. The test doesn't really depend on the kubelet configuration and E2E tests are easier to run.

@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation May 7, 2025
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Archive-it in SIG Node CI/Test Board May 7, 2025
Copy link
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non blocking comments

@pohly
Copy link
Contributor Author

pohly commented May 13, 2025

e2e-kind failed because of #131748.

gomega.Expect(kubeletPlugin1.GetGRPCCalls()).Should(testdriver.NodePrepareResourcesSucceeded, "Plugin 1 should have prepared resources.")
gomega.Expect(kubeletPlugin2.GetGRPCCalls()).Should(testdriver.NodePrepareResourcesSucceeded, "Plugin 2 should have prepared resources.")
driverName := func(element any) string {
el := element.(*model.Sample)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This triggers

ERROR: Some files are importing packages under github.com/prometheus/* but are not allow-listed to do so.

See: https://github.com/kubernetes/kubernetes/issues/89267

Failing files:
  ./test/e2e_node/dra_test.go

It is how some other code is checking metrics.

@dgrisonnet @richabanker: is this one of those cases where it's okay to extend the allow list? Or is there a different way of checking for the expected outcome?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we have allowed to extend the allow list in the past, ref so I guess we can just do that now? Regarding the usage, I see similar usage of the model package in the codebase to verify the metrics data so should be fine? cc @serathius - the author of the linked issue if he has any ideas on how to avoid importing the package here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively maybe you could try to create a map representation (map[string]float64) of the vector where the key is the driver_name and the value is the metric value. And then use something like this for verifying the values?

claimsInUse := convertVectorToMap(metrics, "dra_resource_claims_in_use")
gomega.Expect(claimsInUse).Should(gstruct.MatchKeys(gstruct.IgnoreExtras, gstruct.Keys{
	"":                   gomega.BeEquivalentTo(1),
	kubeletPlugin1Name: gomega.BeEquivalentTo(1),
	kubeletPlugin2Name: gomega.BeEquivalentTo(1),
}), "metrics while pod is running")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That feels like a workaround. I prefer adding a type alias in k8s.io/component-base/metrics/testutil: that package is geared towards use in tests, and already defines a type which directly exposes model.Sample, so letting consumers of that package also use that type directly seems fair. The code already depends on it anyway.

In other words, this was possible before:

var metrics testutils.Metrics
metrics = ...
samples := metrics["dra_resource_claims_in_use"]
sample := samples[0]

It should also be possible to write:

var sample *testutil.Sample
sample = samples[0]

I suppose some of the importers under

./test/e2e/apimachinery/flowcontrol.go
./test/e2e_node/mirror_pod_grace_period_test.go
./test/e2e/node/pods.go
./test/e2e_node/resource_metrics_test.go
./test/instrumentation/main_test.go
./test/integration/apiserver/flowcontrol/concurrency_test.go
./test/integration/apiserver/flowcontrol/concurrency_util_test.go
./test/integration/metrics/metrics_test.go
could use the same approach, but I haven't checked.

For now I have added the type aliases to this PR and use them in dra_test.go.

func (cache *claimInfoCache) withLock(f func() error) error {
cache.Lock()
defer cache.Unlock()

if loggerV := cache.logger.V(5); loggerV.Enabled() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logging doesn't have unit tests at the moment.

@bart0sh: do you find this useful? Shall I keep and test it or remove it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks useful for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So let's keep it. I added tests for it.

// actively deleted.
//
// The empty driver name provides the overall count of all active
// ResourceClaims regardless of the driver.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is up for debate, see #131641 (comment)

@pohly pohly force-pushed the dra-kubelet-in-use-metric branch from 13012f0 to 935ff9f Compare May 14, 2025 11:36
@richabanker
Copy link
Contributor

/lgtm

from metrics POV

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 6d2c680c8f98146a7955c1944695f54b7cee1fba

@pohly
Copy link
Contributor Author

pohly commented May 20, 2025

@richabanker: can you also approve? It's needed for component-base/metrics.

@richabanker
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly, richabanker
Once this PR has been reviewed and has the lgtm label, please assign klueska for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pohly
Copy link
Contributor Author

pohly commented May 20, 2025

@bart0sh: LGTM to you for SIG Node and DRA?

I'll put it into a KEP update before merging, so we will get additional eyes on the proposal later.

type claimInfoCache struct {
logger klog.Logger
metrics.BaseStableCollector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered making a separate structure for collector? Why do you think embedding is better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed simpler. I can move it to a separate struct and then we can compare.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed. I want to amend commits before we merge, so:
/hold

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you. I just thought that it might look a bit clearer if it's separated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it? I don't have a strong opinion, both seem fine to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it better due to a bit better separation of concerns. That's why I lgtm-ed the PR.


// CollectWithStability implements the metrics.StableCollector interface.
func (cache *claimInfoCache) CollectWithStability(ch chan<- metrics.Metric) {
cache.RLock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to lock in the ClaimsInUse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean in the cache.claimsInUse() call?

That needs to support being called with the lock already held for the usage in claimInfoCache.withLock. I could provide claimsInUse and claimsInUseWithLock, but I think letting the caller handle the locking is simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks for the explanations, let's keep it as is.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2025
@bart0sh
Copy link
Contributor

bart0sh commented May 21, 2025

pull-kubernetes-node-e2e-crio-cgrpv1-dra failure seems to be related to this PR.

@bart0sh
Copy link
Contributor

bart0sh commented May 21, 2025

/triage accepted
/lgtm

@pohly Feel free to unhold after CI failures fixed.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 21, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 2dec57a3e52b9747f6b07cca8930e236fc904ff3

@pohly pohly force-pushed the dra-kubelet-in-use-metric branch from 6e5cbae to cb99711 Compare May 21, 2025 13:03
pohly added 4 commits May 22, 2025 08:01
The new metric informs admins whether DRA in general (empty "driver_name"
label) and/or specific DRA drivers (label non-empty) are in use on nodes.
This is useful to know because removing a driver is only safe if it is not in
use. If a driver gets removed while it has prepared a ResourceClaim,
unpreparing that ResourceClaim and stopping pods is blocked.

The implementation of the metric hooks into read/write locking of the claim
info cache. At that point it is possible to compare the before/after state and
update the metric. In addition, changes can be logged.

The unit tests check that metrics get calculated. The e2e_node test checks that
kubelet really exports the metrics data.

While at it, some bugs in the claiminfo_test.go get fixed: the way how the
cache got populated in the test did not match the code anymore.
Maintaining the GaugeVec was a bit odd. Collecting the information
only when requested is more natural.
Test code using the testutil.Metrics type already depended on the Prometheus
types, but couldn't reference them by name. This is necessary for example when
using Gomega (to cast from `any` in a matcher) or when defining a `var
sample *Sample` which gets set later.
The special string is more obvious than the empty one.
@pohly pohly force-pushed the dra-kubelet-in-use-metric branch from cb99711 to 985f626 Compare May 22, 2025 06:01
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 22, 2025
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 31, 2025
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 👀 In review
Status: Archive-it
Development

Successfully merging this pull request may close these issues.

5 participants