-
Notifications
You must be signed in to change notification settings - Fork 573
deprecate consecutive_errors and add consecutive_gateway_errors & consecutive_5xx_errors #1189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @yingzhuivy. Thanks for your PR. I'm waiting for a istio member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@frankbu ptal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
This CL adds consecutive_5xx_errors field in OutlierDetection. This field describes the number of 5xx errors before a host is ejected from the connection pool. I made changes to destination_rule.proto, the other files are auto-generated. Change-Id: Ib5097b3c6bf3ea2b8b2f857491537acb674ae1ff Reviewed-on: https://gerrit.musta.ch/c/public/istio-api/+/195 Reviewed-by: Brian Wolfe <[email protected]> Reviewed-by: Jungho Ahn <[email protected]> Reviewed-by: Weibo He <[email protected]>
This reverts commit 064b737. Reason for revert: decided to deprecate consecutive_errors Change-Id: I95e3191db30711b1ce7abdebe7639de4899f2ab1 Reviewed-on: https://gerrit.musta.ch/c/public/istio-api/+/163 Reviewed-by: Jungho Ahn <[email protected]>
With the original consecutive_errors design, there is no way to turn off consecutive gateway errors. This CL deprecate the field and add two new fields: consecutive_gateway_errors and consecutive_5xx_errors. See discussions here: istio#909 Change-Id: I0e98990d194216cef842fb792a76a5f59b6e674e Reviewed-on: https://gerrit.musta.ch/c/public/istio-api/+/199 Reviewed-by: Jungho Ahn <[email protected]> Reviewed-by: Weibo He <[email protected]>
545f4a8
to
0d56230
Compare
@frankbu @howardjohn what's next step? should i submit a pr for the actual implementation now or wait for this to merge |
/ok-to-test |
@yingzhuivy you are free to work on the implementation if you prefer, with the caveat that there may be some wasted time if others object to this. You can use a @rshriram can you take a look? |
|
||
// Number of gateway errors before a host is ejected from the connection pool. | ||
// When the upstream host is accessed over HTTP, a 502, 503, or 504 return | ||
// code qualifies as an error. When the upstream host is accessed over an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a gateway error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
// Number of 5xx errors before a host is ejected from the connection pool. | ||
// When the upstream host is accessed over TCP, errors are internally mapped | ||
// to HTTP 5xx codes and treated as such. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For TCP, its essentially just connect timeout and connection reset/error, unless there is more nuancing on the types of errors. So, I suggest repeating the above sentence "When the upstream host is accessed over an opaque TCP connection, connect timeouts, ... events qualify as an error".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking at envoy code, for TCP
connection timeout -> 504
connection failure -> 503
request failure -> 500
updated the comment
// Number of 5xx errors before a host is ejected from the connection pool. | ||
// When the upstream host is accessed over TCP, errors are internally mapped | ||
// to HTTP 5xx codes and treated as such. | ||
// If set to 0, this feature is disabled. Defaults to 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This defaults to 5 as per Envoy docs. https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/cluster/outlier_detection.proto#cluster-outlierdetection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default in envoy is same as consecutive_5xx_errors=5 and consecutive_gateway_errors=0 (disabled)
My questions are:
- do we want to change the default behavior? e.g., a user without any setting will see a different behavior if we set gateway default to 5.
- if we want to set the both defaults to 5, we don’t need to set the default for gateway since it has no meaning (since 5xx includes gateway).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- user without any setting but with just outlier detection probably falls under the legacy user using the older consecutive_errors format.
That said, Istio behavior for last two releases has been
if outlier.ConsecutiveErrors > 0 {
// Only listen to gateway errors, see https://github.com/istio/api/pull/617
out.EnforcingConsecutiveGatewayFailure = &wrappers.UInt32Value{Value: uint32(100)} // defaults to 0
out.EnforcingConsecutive_5Xx = &wrappers.UInt32Value{Value: uint32(0)} // defaults to 100
out.ConsecutiveGatewayFailure = &wrappers.UInt32Value{Value: uint32(outlier.ConsecutiveErrors)}
}
i.e. if user did not set the consecutive error, then we were still doing consecutive_5xx by default. So, I think what you have is good.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so
- there is discrepancy between the legacy consecutive_errors documentation and actual implementation for the last two releases: i.e. if user did not set the consecutive error, we were still doing 5xx while we should actually be doing gateway errors. Do we want to fix the implementation?
- for the new consecutive_5xx_errors and consecutive_gateway_errors, what do we want for default. do we want to use 5xx as default or gateway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
It's too late to change the implementation of the old
consecutive_errors
field. Right or wrong, it was what it was. Just keep it working the same, only deprecated now. -
I think the new default also needs to stay the same: 5 5xx errors. So to get the same behavior as setting the old field to some value, e.g.,
consecutive_errors = 10
, the user will now need to set both fields:
consecutive_gateway_errors = 10
consecutive_5xx_errors = 0 // disable
@rshriram WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have misunderstood your point, but I am not sure why would a user want to set the gateway greater than 5xx. In my mind, there are three cases:
- user want to set 5xx only
- user want to set gateway only
- user want to set both gateway and 5xx (gateway should be smaller than 5xx otherwise it is same as case 1)
I think the only caveat is that in case 2, the user needs to disable 5xx explictly by setting it as 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understand that by distinguishing between 0 and unset we can cover the 3 cases, but I think there is a valid 4th case that neither Envoy nor Istio support:
- user wants to set gateway count to a higher number (e.g., 10) because they are likely to correct themselves but wants other 5xx errors (e.g., 500) to time out much earlier, for example after 1 or 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, just to be clear, I think the Istio implementation should just do what Envoy does, whether they change to support case 4 or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also almost seems like a bug that if the user sets both 5xx to 5 and gateway to 10, that the gateway field is just ignored. If I set gateway to 10, I expect that to do something, instead of be ignored because it overlaps with another config parameter. That's why I think it might be worth opening an Envoy issue to see if they want to change their implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets go with what frank suggests for now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc nits. otherwise LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//
@yingzhuivy It looks like we are now all in agreement on the API. Are you planning to update this PR accordingly, and then provide the implementation? |
I think that all that needs to change in this PR is to correct the documented default values of the two new fields and to add a sentence in the |
@frankbu Yes, @yingzhuivy will update this PR as soon as she resolves an issue in connecting from China. |
This CL changes the description of consecutive_5xx_errors and consecutive_gateway errors. 5xx defaults to 5 and gateway defaults to 0. Change-Id: I6c3b29cf92df1c972a7850a726eb159b5e23bf90 Reviewed-on: https://gerrit.musta.ch/c/public/istio-api/+/211 Reviewed-by: Jungho Ahn <[email protected]>
// code qualifies as a gateway error. When the upstream host is accessed over | ||
// an opaque TCP connection, connect timeouts and connection error/failure | ||
// events qualify as a gateway error. | ||
// If set to 0, this feature is disabled. Defaults to 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If set to 0, this feature is disabled. Defaults to 0. | |
// This feature is disabled by default or when set to the value 0. | |
// | |
// Note that |
@yingzhuivy a couple of minor suggestions, but otherwise LGTM. |
This CL rewords the descriptions of consecutive 5xx & gateway as suggected in https://github.com/istio/api/pull/1189/files. Change-Id: Ia95c03da78a2c9f12c8762b9d8fb95e7add08516 Reviewed-on: https://gerrit.musta.ch/c/public/istio-api/+/214 Reviewed-by: Jungho Ahn <[email protected]>
Related to #909
This PR deprecates consecutive_errors and adds consecutive_gateway_errors and consecutive_5xx_errors field in OutlierDetection.
@frankbu @howardjohn