xds: XdsNR should be subscribing to clusters with XdsDepManager #12154

Merged 1 commit on Jun 17, 2025

ejona86 (Member) commented Jun 13, 2025

This adds behavior defined in gRFC A74 that was missing:

> As per gRFC A31, the ConfigSelector gives each RPC a ref to the
> cluster that was selected for it to ensure that the cluster is not
> removed from the xds_cluster_manager LB policy config before the RPC
> is done with its LB picks. These cluster refs will also hold a
> subscription for the cluster from the XdsDependencyManager, so that
> the XdsDependencyManager will not stop watching the cluster resource
> until the cluster is removed from the xds_cluster_manager LB policy
> config.

Without this logic, RPCs can race with a route configuration update and see the error:

> INTERNAL: CdsLb for cluster0: Unable to find non-dynamic root cluster

Fixes #12152. This fixes the regression introduced in 297ab05.

@ejona86 ejona86 requested a review from kannanjgithub June 13, 2025 22:38
@@ -793,9 +793,13 @@ private void updateRoutes(
          clusterRefs.get(cluster).refCount.incrementAndGet();
        } else {
          if (clusterNameMap.containsKey(cluster)) {
            assert cluster.startsWith("cluster:");
            XdsConfig.Subscription subscription =
                xdsDependencyManager.subscribeToCluster(cluster.substring("cluster:".length()));
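
For readers following the hunk: the map key carries a `cluster:` prefix, and the subscription is made with the bare cluster name after that prefix is stripped. A minimal sketch of that key handling in isolation follows. The class and method names (`ClusterKeySketch`, `toKey`, `toClusterName`) are hypothetical and are not taken from the PR. The sketch also uses an explicit check instead of `assert`, since Java assertions are disabled at runtime unless the JVM is started with `-ea`.

```java
/** Hypothetical helper illustrating the "cluster:" key prefix seen in the hunk. */
final class ClusterKeySketch {
  // Prefix used on the map keys in the hunk above.
  static final String CLUSTER_PREFIX = "cluster:";

  /** Builds a prefixed key from a bare cluster name. */
  static String toKey(String clusterName) {
    return CLUSTER_PREFIX + clusterName;
  }

  /** Strips the prefix, rejecting keys that do not carry it. */
  static String toClusterName(String key) {
    if (!key.startsWith(CLUSTER_PREFIX)) {
      throw new IllegalArgumentException("not a plain cluster key: " + key);
    }
    return key.substring(CLUSTER_PREFIX.length());
  }

  public static void main(String[] args) {
    System.out.println(toClusterName(toKey("cluster0"))); // prints "cluster0"
  }
}
```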
Contributor

We are not retrieving the subscription in CdsLoadBalancer2. The clusterSubscription field is only ever assigned for a dynamic cluster, and not otherwise. So for a non-dynamic cluster it will still be null and cause the "Unable to find non-dynamic root cluster" error?
What is the race condition?

Member Author

RPCs use the route configuration that was current when they were created. But it takes time for them to progress through the filters, so the route configuration could be different by the time they reach the terminating filter (the router) and do a pick. XdsNameResolver therefore already has reference counting to keep alive clusters that are referenced only by old route configurations still in use by RPCs.

When a new route configuration points to different clusters, the old clusters are removed from the XdsConfig, but XdsNR keeps the old CdsLB2 instances alive as long as RPCs still need them. Before A74, CdsLB2 would still have an xdsClient watch for that cluster. Since A74, but before this change, it instead receives the XdsConfig and sees that the cluster is missing. So as long as XdsNR is keeping the CdsLB2 instance alive, it also needs to keep the subscription to that cluster for XdsConfig.

This case was tested in FakeControlPlaneXdsIntegrationTest.changeClusterForRoute, which is the test that was flaky.
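
To make the lifetime rule concrete, here is a minimal, single-threaded sketch of a ref-counted cluster entry that also holds a subscription. It is an illustration under stated assumptions, not the grpc-java implementation: `FakeDependencyManager`, `Resolver`, `ClusterRef`, `retain`, and `release` are hypothetical names, and the real code uses atomic counters (the hunk shows `refCount.incrementAndGet()`) rather than a plain `int`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ClusterRefSketch {
  /** Hypothetical handle; closing it gives up the watch on one cluster. */
  interface Subscription {
    void close();
  }

  /** Hypothetical stand-in for the dependency manager: tracks watched clusters. */
  static final class FakeDependencyManager {
    private final Map<String, Integer> watchCounts = new HashMap<>();

    Subscription subscribeToCluster(String cluster) {
      watchCounts.merge(cluster, 1, Integer::sum);
      boolean[] closed = {false};
      return () -> {
        if (closed[0]) {
          return; // closing twice is a no-op
        }
        closed[0] = true;
        // Drop the entry entirely once the last subscriber goes away.
        watchCounts.computeIfPresent(cluster, (k, v) -> v == 1 ? null : v - 1);
      };
    }

    Set<String> watched() {
      return new TreeSet<>(watchCounts.keySet());
    }
  }

  /** One entry per cluster: a ref count plus the subscription it keeps open. */
  static final class ClusterRef {
    int refCount;
    final Subscription subscription;

    ClusterRef(Subscription subscription) {
      this.subscription = subscription;
    }
  }

  /** Hypothetical resolver side: refs are held by route configs and in-flight RPCs. */
  static final class Resolver {
    private final FakeDependencyManager depManager;
    private final Map<String, ClusterRef> clusterRefs = new HashMap<>();

    Resolver(FakeDependencyManager depManager) {
      this.depManager = depManager;
    }

    void retain(String cluster) {
      ClusterRef ref = clusterRefs.get(cluster);
      if (ref == null) {
        // First reference: subscribe so the cluster stays watched.
        ref = new ClusterRef(depManager.subscribeToCluster(cluster));
        clusterRefs.put(cluster, ref);
      }
      ref.refCount++;
    }

    void release(String cluster) {
      ClusterRef ref = clusterRefs.get(cluster);
      if (--ref.refCount == 0) {
        // Last reference gone: only now stop watching the cluster.
        clusterRefs.remove(cluster);
        ref.subscription.close();
      }
    }
  }

  public static void main(String[] args) {
    FakeDependencyManager depManager = new FakeDependencyManager();
    Resolver resolver = new Resolver(depManager);

    resolver.retain("cluster0");  // route config v1 points at cluster0
    resolver.retain("cluster0");  // an RPC created under v1 selects cluster0
    resolver.retain("cluster1");  // route config v2 points at cluster1
    resolver.release("cluster0"); // v1 is replaced and drops its ref
    System.out.println(depManager.watched()); // prints [cluster0, cluster1]

    resolver.release("cluster0"); // the old RPC finishes its picks
    System.out.println(depManager.watched()); // prints [cluster1]
  }
}
```

In this sketch the old cluster stays watched for exactly as long as some RPC still holds a ref to it, which is the property the missing subscription broke.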

Contributor

Understood, thanks.

@ejona86 ejona86 merged commit 2604ce8 into grpc:master Jun 17, 2025
15 of 16 checks passed
@ejona86 ejona86 deleted the xdsnr-needs-subscription branch June 17, 2025 13:36
Linked issue: XdsDepManager generated invalid configuration