fix: grpc connections #2386


Closed
wants to merge 8 commits from the fix/grpc-connections branch

Conversation

exdx
Member

@exdx exdx commented Oct 1, 2021

Description of the change:
Builds off of #2333

Motivation for the change:
Improve gRPC connection performance, particularly in e2e test scenarios

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

@openshift-ci

openshift-ci bot commented Oct 1, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: exdx
To complete the pull request process, please assign ecordell after the PR has been reviewed.
You can assign the PR to them by writing /assign @ecordell in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@exdx exdx force-pushed the fix/grpc-connections branch from 71b87ec to 30ed3a5 Compare October 4, 2021 14:21
@exdx
Copy link
Member Author

exdx commented Oct 4, 2021

Aside from the unit test failures, this seems to hit the IDLE problem where the catalog source pod sits in an IDLE state instead of going to READY

waiting for catalog pod mock-ocs-main-hdz7m to be available (after catalog update) - IDLE

https://github.com/operator-framework/operator-lifecycle-manager/pull/2386/checks?check_run_id=3792136502

@timflannagan timflannagan mentioned this pull request Oct 6, 2021
Contributor

@tylerslaton tylerslaton left a comment


Just a few questions and potential suggestions!

if c.currentServiceAccount(source) == nil ||
c.currentRole(source) == nil ||
c.currentRoleBinding(source) == nil ||
c.currentService(source) == nil ||
len(c.currentPods(source, c.Image)) < 1 {
len(pods) < 1 ||
len(pods[0].Status.ContainerStatuses) < 1 ||
Contributor


Will this panic if there are not any pods?

Member


If I'm reading this right, in that scenario `len(pods) < 1` should evaluate to `true` and the expression should short-circuit before `pods[0]` is evaluated

Member


Would it short circuit though if the expression is OR-ed rather than AND-ed?

Member Author


Yes, I think this is prone to a panic

Contributor


Thinking on this again, I believe this may not actually be a problem as Nick mentioned. Thoughts on this?

https://play.golang.org/p/yPXWTIrT2Y6
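A minimal standalone sketch of the short-circuit behaviour under discussion (illustrative only, not code from this PR):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	var pods []*corev1.Pod // empty slice, as when no catalog pods exist yet

	// "||" evaluates left to right and stops at the first true operand,
	// so pods[0] is never indexed while len(pods) < 1 is true.
	if len(pods) < 1 ||
		len(pods[0].Status.ContainerStatuses) < 1 ||
		!pods[0].Status.ContainerStatuses[0].Ready {
		fmt.Println("catalog pod not healthy yet (no panic)")
	}
}
```

Both `||` and `&&` short-circuit in Go, so the OR-ed expression above never indexes the empty slice.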

Member


I think I misunderstood the check when I had initially reviewed - if we're essentially just checking for whether the registry-server container is reporting a "healthy" state, then I don't think this kind of check is problematic (albeit difficult to read) due to the short circuiting mentioned earlier.

Member


Something else I just noticed - if we're firing off Services with spec.ClusterIP: None, then presumably we cannot also check whether the status.ClusterIP has been populated to determine whether the service DNS has been established yet. Do we need to also query for the Endpoint object and ensure that value has been populated as another condition check for healthiness?

Member Author


That sounds reasonable -- based on the fact the endpoint has the same name as the service it should be straightforward to query and check that the endpoints.subsets field is non-nil.
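A minimal sketch of what such an Endpoints check could look like with client-go (the package name, helper name, and arguments are illustrative assumptions, not code from this PR):

```go
package catalogsource

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// endpointsReady reports whether the Endpoints object that shares the catalog
// Service's name has at least one populated subset. Hypothetical helper, not
// the PR's actual code.
func endpointsReady(ctx context.Context, client kubernetes.Interface, namespace, name string) (bool, error) {
	ep, err := client.CoreV1().Endpoints(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		if apierrors.IsNotFound(err) {
			// The Endpoints object hasn't been created yet; not ready, but not an error.
			return false, nil
		}
		return false, err
	}
	// A non-empty Subsets list means the endpoints controller has published
	// at least one pod address for the (headless) Service.
	return len(ep.Subsets) > 0, nil
}
```

Since the Endpoints object is created with the same name as the Service, no extra bookkeeping is needed to find it.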

@njhale njhale linked an issue Oct 7, 2021 that may be closed by this pull request
Comment on lines +701 to +704
// if the pod isn't healthy, don't check the connection
// checking the connection before the dns is ready may lead dns to cache the miss
// (pod readiness is used as a hint that dns should be ready to avoid coupling this to dns)
continueSync = healthy
Member


This makes sense to me - I feel like this is what I've been seeing locally, and delaying this check until the pod is reporting a healthy state seems like the most logical fix 👍
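As a rough sketch of the pattern being described, assuming `address` resolves to the catalog Service and the caller has already confirmed the pod is Ready (the package and function names are illustrative, not the PR's actual wiring):

```go
package grpccheck

import (
	"context"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// checkCatalogServing dials the catalog Service address and asks the
// registry's gRPC health service whether it is SERVING. The intent is that
// this runs only after the registry pod reports Ready, so we never dial
// (and risk caching a DNS miss) before the Service has endpoints.
func checkCatalogServing(ctx context.Context, address string) (bool, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	// WithBlock makes DialContext wait until the connection is ready
	// (or the context expires) instead of returning immediately.
	conn, err := grpc.DialContext(ctx, address, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		return false, err
	}
	defer conn.Close()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		return false, err
	}
	return resp.Status == healthpb.HealthCheckResponse_SERVING, nil
}
```

This is the same gRPC health service that the registry pod's grpc_health_probe readiness probe talks to, so a Ready pod should normally answer SERVING here.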

pods := c.currentPodsWithCorrectImageAndSpec(source, source.ServiceAccount().GetName())
if len(pods) < 1 ||
len(pods[0].Status.ContainerStatuses) < 1 ||
!pods[0].Status.ContainerStatuses[0].Ready ||
Member Author


We would update the service check here as well. In reality, configmap-based catalogs are deprecated (and, from what I can tell, only used in our e2e tests anymore), so maybe we should add more to this health check.

@exdx exdx force-pushed the fix/grpc-connections branch from 7c4ea7b to afac9d7 Compare October 12, 2021 19:04
ecordell and others added 7 commits October 13, 2021 10:58
this should result in the same iptables or ipvs rules being generated
because there is only one backend, but in practice setting this to None
results in kube-proxy generating the rules much faster and speeding up
connection times (a sketch of a headless Service follows the commit list below)
the pods use a grpc_health_probe as their readiness probe, so this
helps prevent initial connect issues with the grpc client
we only dial once we know the pod is up and responding to grpc via the
grpc health probe, so any issues with connections are likely to be
very transient and related to overlay network config propagation
without this, it seems to fight the catalog operator for a connection

there's probably a deeper reason with a better fix (grpc server should
be fine with multiple simultaneous connects on start) but this sidesteps
the issue for now. it's largely only an issue for the e2e tests
Signed-off-by: Daniel Sover <[email protected]>
Signed-off-by: Daniel Sover <[email protected]>
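For context on the first commit above, a minimal sketch of a headless catalog Service built with the Kubernetes API types (the names, selector, and port are illustrative assumptions, not copied from the PR):

```go
package catalogsource

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// headlessCatalogService builds a Service with ClusterIP set to "None".
// Per the commit message above, this avoids waiting for kube-proxy to program
// service rules and speeds up how quickly connections become usable.
func headlessCatalogService(name, namespace string, selector map[string]string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless Service
			Selector:  selector,
			Ports: []corev1.ServicePort{{
				Name:       "grpc",
				Port:       50051,
				TargetPort: intstr.FromInt(50051),
			}},
		},
	}
}
```

Because the Service is headless, cluster DNS resolves its name directly to the backing pod's IP, which is consistent with the commit's observation that connections are established faster.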
@exdx exdx force-pushed the fix/grpc-connections branch from afac9d7 to 29d3f9d Compare October 13, 2021 14:58
@exdx
Member Author

exdx commented Oct 14, 2021

Related to #1186

@openshift-ci

openshift-ci bot commented Dec 12, 2021

@exdx: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 12, 2021
@timflannagan
Member

Closing as this is fairly stale at this point and needs to be rebased. It looks like there are some issues we'll need to weed out with the newer gRPC version bump anyway, but at least that's tracked in the existing issue.

Successfully merging this pull request may close these issues.

Make catalog gRPC connections more consistent
5 participants