fix: grpc connections #2386


Closed
wants to merge 8 commits from the fix/grpc-connections branch

Conversation

exdx
Member

@exdx exdx commented Oct 1, 2021

Description of the change:
Builds off of #2333

Motivation for the change:
Improve gRPC connection performance, particularly in e2e test scenarios

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

@openshift-ci

openshift-ci bot commented Oct 1, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: exdx
To complete the pull request process, please assign ecordell after the PR has been reviewed.
You can assign the PR to them by writing /assign @ecordell in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@exdx exdx force-pushed the fix/grpc-connections branch from 71b87ec to 30ed3a5 Compare October 4, 2021 14:21
@exdx
Copy link
Member Author

exdx commented Oct 4, 2021

Aside from the unit test failures, this seems to hit the IDLE problem where the catalog source pod sits in an IDLE state instead of going to READY

waiting for catalog pod mock-ocs-main-hdz7m to be available (after catalog update) - IDLE

https://github.com/operator-framework/operator-lifecycle-manager/pull/2386/checks?check_run_id=3792136502

@timflannagan timflannagan mentioned this pull request Oct 6, 2021
Contributor

@tylerslaton tylerslaton left a comment


Just a few questions and potential suggestions!

if c.currentServiceAccount(source) == nil ||
c.currentRole(source) == nil ||
c.currentRoleBinding(source) == nil ||
c.currentService(source) == nil ||
len(c.currentPods(source, c.Image)) < 1 {
len(pods) < 1 ||
len(pods[0].Status.ContainerStatuses) < 1 ||
Contributor


Will this panic if there are not any pods?

Member


If I'm reading this right, in that scenario `len(pods) < 1` should evaluate to `true` and the expression should short-circuit before `pods[0]` is evaluated

Member


Would it short circuit though if the expression is OR-ed rather than AND-ed?

Member Author


Yes, I think this is prone to a panic

Contributor


Thinking on this again, I believe this may not actually be a problem as Nick mentioned. Thoughts on this?

https://play.golang.org/p/yPXWTIrT2Y6
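A minimal standalone sketch of the short-circuit behaviour under discussion (illustrative only, not code from this PR):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	var pods []*corev1.Pod // empty slice, as when no catalog pods exist yet

	// "||" evaluates left to right and stops at the first true operand,
	// so pods[0] is never indexed while len(pods) < 1 is true.
	if len(pods) < 1 ||
		len(pods[0].Status.ContainerStatuses) < 1 ||
		!pods[0].Status.ContainerStatuses[0].Ready {
		fmt.Println("catalog pod not healthy yet (no panic)")
	}
}
```

Both `||` and `&&` short-circuit in Go, so the OR-ed expression above never indexes the empty slice.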

Member


I think I misunderstood the check when I had initially reviewed - if we're essentially just checking for whether the registry-server container is reporting a "healthy" state, then I don't think this kind of check is problematic (albeit difficult to read) due to the short circuiting mentioned earlier.

Member


Something else I just noticed - if we're firing off Services with spec.ClusterIP: None, then presumably we cannot also check whether the status.ClusterIP has been populated to determine whether the service DNS has been established yet. Do we need to also query for the Endpoint object and ensure that value has been populated as another condition check for healthiness?

Member Author


That sounds reasonable -- based on the fact the endpoint has the same name as the service it should be straightforward to query and check that the endpoints.subsets field is non-nil.
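A minimal sketch of what such an Endpoints check could look like with client-go (the package name, helper name, and arguments are illustrative assumptions, not code from this PR):

```go
package catalogsource

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// endpointsReady reports whether the Endpoints object that shares the catalog
// Service's name has at least one populated subset. Hypothetical helper, not
// the PR's actual code.
func endpointsReady(ctx context.Context, client kubernetes.Interface, namespace, name string) (bool, error) {
	ep, err := client.CoreV1().Endpoints(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		if apierrors.IsNotFound(err) {
			// The Endpoints object hasn't been created yet; not ready, but not an error.
			return false, nil
		}
		return false, err
	}
	// A non-empty Subsets list means the endpoints controller has published
	// at least one pod address for the (headless) Service.
	return len(ep.Subsets) > 0, nil
}
```

Since the Endpoints object is created with the same name as the Service, no extra bookkeeping is needed to find it.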

@njhale njhale linked an issue Oct 7, 2021 that may be closed by this pull request
Comment on lines +701 to +704
// if the pod isn't healthy, don't check the connection
// checking the connection before the dns is ready may lead dns to cache the miss
// (pod readiness is used as a hint that dns should be ready to avoid coupling this to dns)
continueSync = healthy
Member


This makes sense to me - I feel like this is what I've been seeing locally, and delaying this check until the pod is reporting a healthy state seems like the most logical fix 👍
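As a rough sketch of the pattern being described, assuming `address` resolves to the catalog Service and the caller has already confirmed the pod is Ready (the package and function names are illustrative, not the PR's actual wiring):

```go
package grpccheck

import (
	"context"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// checkCatalogServing dials the catalog Service address and asks the
// registry's gRPC health service whether it is SERVING. The intent is that
// this runs only after the registry pod reports Ready, so we never dial
// (and risk caching a DNS miss) before the Service has endpoints.
func checkCatalogServing(ctx context.Context, address string) (bool, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	// WithBlock makes DialContext wait until the connection is ready
	// (or the context expires) instead of returning immediately.
	conn, err := grpc.DialContext(ctx, address, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		return false, err
	}
	defer conn.Close()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		return false, err
	}
	return resp.Status == healthpb.HealthCheckResponse_SERVING, nil
}
```

This is the same gRPC health service that the registry pod's grpc_health_probe readiness probe talks to, so a Ready pod should normally answer SERVING here.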

pods := c.currentPodsWithCorrectImageAndSpec(source, source.ServiceAccount().GetName())
if len(pods) < 1 ||
len(pods[0].Status.ContainerStatuses) < 1 ||
!pods[0].Status.ContainerStatuses[0].Ready ||
Member Author


We would update the service check here as well. In reality, configmap-based catalogs are deprecated (and, from what I can tell, only used in our e2e tests anymore), so maybe we should add more to this health check.

@exdx exdx force-pushed the fix/grpc-connections branch from 7c4ea7b to afac9d7 Compare October 12, 2021 19:04
ecordell and others added 7 commits October 13, 2021 10:58
this should result in the same iptables or ipvs rules being generated
because there is only one backend, but in practice setting this to None
results in kube-proxy generating the rules much faster and speeding up
connection times (a sketch of a headless Service follows the commit list below)
the pods use a grpc_health_probe as their readiness probe, so this
helps prevent initial connect issues with the grpc client
we only dial once we know the pod is up and responding to grpc via the
grpc health probe, so any issues with connections are likely to be
very transient and related to overlay network config propagation
without this, it seems to fight the catalog operator for a connection

there's probably a deeper reason with a better fix (grpc server should
be fine with multiple simultaneous connects on start) but this sidesteps
the issue for now. it's largely only an issue for the e2e tests
Signed-off-by: Daniel Sover <[email protected]>
Signed-off-by: Daniel Sover <[email protected]>
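For context on the first commit above, a minimal sketch of a headless catalog Service built with the Kubernetes API types (the names, selector, and port are illustrative assumptions, not copied from the PR):

```go
package catalogsource

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// headlessCatalogService builds a Service with ClusterIP set to "None".
// Per the commit message above, this avoids waiting for kube-proxy to program
// service rules and speeds up how quickly connections become usable.
func headlessCatalogService(name, namespace string, selector map[string]string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless Service
			Selector:  selector,
			Ports: []corev1.ServicePort{{
				Name:       "grpc",
				Port:       50051,
				TargetPort: intstr.FromInt(50051),
			}},
		},
	}
}
```

Because the Service is headless, cluster DNS resolves its name directly to the backing pod's IP, which is consistent with the commit's observation that connections are established faster.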
@exdx exdx force-pushed the fix/grpc-connections branch from afac9d7 to 29d3f9d Compare October 13, 2021 14:58
@exdx
Member Author

exdx commented Oct 14, 2021

Related to #1186

@openshift-ci

openshift-ci bot commented Dec 12, 2021

@exdx: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 12, 2021
@timflannagan
Member

Closing as this is fairly stale at this point and needs to be rebased. It looks like there are some issues we'll need to weed out with the newer gRPC version bump anyway, but at least that's tracked in the existing issue.

Successfully merging this pull request may close these issues.

Make catalog gRPC connections more consistent
5 participants