fix: grpc connections #2386
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: exdx. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 71b87ec to 30ed3a5.
Aside from the unit test failures, this seems to hit the IDLE problem where the catalog source pod sits in an IDLE state instead of going to READY.
Just a few questions and potential suggestions!
if c.currentServiceAccount(source) == nil ||
	c.currentRole(source) == nil ||
	c.currentRoleBinding(source) == nil ||
	c.currentService(source) == nil ||
	len(c.currentPods(source, c.Image)) < 1 {
len(pods) < 1 ||
	len(pods[0].Status.ContainerStatuses) < 1 ||
Will this panic if there are not any pods?
If I'm reading this right, in that scenario `len(pods) < 1` should evaluate to `true` and the expression should short-circuit before `pods[0]` is evaluated.
Would it short circuit though if the expression is OR-ed rather than AND-ed?
Yes, I think this is prone to a panic
Thinking on this again, I believe this may not actually be a problem as Nick mentioned. Thoughts on this?
I think I misunderstood the check when I initially reviewed - if we're essentially just checking whether the registry-server container is reporting a "healthy" state, then I don't think this kind of check is problematic (albeit difficult to read), due to the short-circuiting mentioned earlier.
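For reference, a minimal standalone sketch (not code from this PR) of why the OR-ed check is safe: Go's `||` evaluates operands left to right and stops at the first `true`, so `pods[0]` is never indexed when the slice is empty. The `pod` and `containerStatus` types below are simplified stand-ins for the corev1 types.

```go
package main

import "fmt"

type containerStatus struct{ Ready bool }
type pod struct {
	ContainerStatuses []containerStatus
}

// unhealthy mirrors the shape of the check under review: operands are
// evaluated left to right and evaluation stops at the first true one, so
// pods[0] is only indexed when len(pods) >= 1, and ContainerStatuses[0]
// only when that slice is non-empty.
func unhealthy(pods []pod) bool {
	return len(pods) < 1 ||
		len(pods[0].ContainerStatuses) < 1 ||
		!pods[0].ContainerStatuses[0].Ready
}

func main() {
	fmt.Println(unhealthy(nil))       // true, no panic: short-circuits at len(pods) < 1
	fmt.Println(unhealthy([]pod{{}})) // true, no panic: short-circuits at the second operand
	fmt.Println(unhealthy([]pod{{ContainerStatuses: []containerStatus{{Ready: true}}}})) // false
}
```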
Something else I just noticed - if we're firing off Services with `spec.ClusterIP: None`, then presumably we cannot also check whether `status.ClusterIP` has been populated to determine whether the service DNS has been established yet. Do we need to also query for the Endpoints object and ensure that value has been populated as another condition check for healthiness?
That sounds reasonable -- based on the fact that the endpoint has the same name as the service, it should be straightforward to query and check that the `endpoints.subsets` field is non-nil.
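As a rough illustration only (not part of this PR), a client-go sketch of what such a check could look like; the function name and the client/namespace/service parameters are all assumptions:

```go
package registry

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// endpointsPopulated reports whether the Endpoints object backing the
// (headless) Service has at least one subset with a ready address, which is
// roughly the point at which the Service's DNS record becomes resolvable.
func endpointsPopulated(ctx context.Context, client kubernetes.Interface, namespace, svcName string) (bool, error) {
	// The Endpoints object shares its name with the Service.
	ep, err := client.CoreV1().Endpoints(namespace).Get(ctx, svcName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, subset := range ep.Subsets {
		if len(subset.Addresses) > 0 {
			return true, nil
		}
	}
	return false, nil
}
```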
// if the pod isn't healthy, don't check the connection
// checking the connection before the dns is ready may lead dns to cache the miss
// (pod readiness is used as a hint that dns should be ready to avoid coupling this to dns)
continueSync = healthy
This makes sense to me - I feel like this is what I've been seeing locally, and delaying this check until the pod is reporting a healthy state seems like the most logical fix 👍
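To make the gating concrete, here is a hedged sketch of the idea rather than the PR's actual code (the helper names, address handling, and timeout are all made up for illustration): the gRPC dial is only attempted once the registry container reports ready, using readiness as a hint that the service DNS record should resolve by now.

```go
package registry

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	corev1 "k8s.io/api/core/v1"
)

// podReady reports whether the first container of the registry pod is ready.
func podReady(pods []corev1.Pod) bool {
	return len(pods) >= 1 &&
		len(pods[0].Status.ContainerStatuses) >= 1 &&
		pods[0].Status.ContainerStatuses[0].Ready
}

// dialIfReady only opens the gRPC connection once the pod is ready, so the
// connection check never races the DNS record for the headless Service.
func dialIfReady(ctx context.Context, pods []corev1.Pod, address string) (*grpc.ClientConn, error) {
	if !podReady(pods) {
		return nil, fmt.Errorf("registry pod not ready yet; skipping connection check")
	}
	dialCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	// WithBlock makes DialContext wait until the connection is READY (or the
	// timeout expires) instead of returning a connection that is still IDLE.
	return grpc.DialContext(dialCtx, address, grpc.WithInsecure(), grpc.WithBlock())
}
```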
pods := c.currentPodsWithCorrectImageAndSpec(source, source.ServiceAccount().GetName())
if len(pods) < 1 ||
	len(pods[0].Status.ContainerStatuses) < 1 ||
	!pods[0].Status.ContainerStatuses[0].Ready ||
We would update the service check here as well. In reality, configmap-based catalogs are deprecated (and, from what I can tell, only used in our e2e tests anymore), so maybe we should add more to this health check.
Force-pushed from 7c4ea7b to afac9d7.
this should result in the same iptables or ipvs rules being generated because there is only one backend, but in practice setting this to None results in kube-proxy generating the rules much faster and speeding up connection times
the pods use a grpc_health_probe as their readiness probe, so this helps prevent initial connect issues with the grpc client
we only dial once we know the pod is up and responding to grpc via the grpc health probe, so any issues with connections are likely to be very transient and related to overlay network config propagation
without this, it seems to fight the catalog operator for a connection. there's probably a deeper reason with a better fix (the grpc server should be fine with multiple simultaneous connects on start), but this sidesteps the issue for now. it's largely only an issue for the e2e tests
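For readers less familiar with the two mechanisms these notes describe, here is a hedged sketch (not the repository's actual manifests; the label key, port, probe command, and timings are illustrative, and it assumes a k8s.io/api version from around the time of this PR, where Probe still embeds Handler rather than ProbeHandler) of a headless Service plus a grpc_health_probe readiness probe:

```go
package registry

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// headlessService selects the single registry pod; ClusterIP "None" means no
// virtual IP is allocated, which in practice lets kube-proxy publish the
// single backend faster even though the resulting rules are equivalent.
func headlessService(name, namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone,                          // "None" -> headless Service
			Selector:  map[string]string{"olm.catalogSource": name},  // illustrative label key
			Ports: []corev1.ServicePort{{
				Name:       "grpc",
				Port:       50051,
				TargetPort: intstr.FromInt(50051),
			}},
		},
	}
}

// grpcReadinessProbe marks the registry container Ready only once its gRPC
// server answers health checks, so "pod is ready" can be treated as "safe to dial".
func grpcReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		Handler: corev1.Handler{
			Exec: &corev1.ExecAction{
				Command: []string{"grpc_health_probe", "-addr=:50051"},
			},
		},
		InitialDelaySeconds: 5,
		TimeoutSeconds:      5,
	}
}
```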
Signed-off-by: Daniel Sover <[email protected]>
Force-pushed from afac9d7 to 29d3f9d.
Related to #1186
@exdx: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Closing as this is fairly stale at this point and needs to be rebased. It looks like there are some issues we'll need to weed out with the newer grpc version bump anyway, but at least that's tracked in the existing issue.
Description of the change:
Builds off of #2333
Motivation for the change:
Improve gRPC connection performance, particularly in e2e test scenarios