Skip to content

deprecate consecutive_errors and add consecutive_gateway_errors & consecutive_5xx_errors #1189

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Dec 20, 2019

Conversation

yingzhuivy
Copy link
Member

@yingzhuivy yingzhuivy commented Dec 2, 2019

Related to #909
This PR deprecates consecutive_errors and adds consecutive_gateway_errors and consecutive_5xx_errors field in OutlierDetection.
@frankbu @howardjohn

@yingzhuivy yingzhuivy requested a review from a team as a code owner December 2, 2019 19:15
@googlebot googlebot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Dec 2, 2019
@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 2, 2019
@istio-testing
Copy link
Collaborator

Hi @yingzhuivy. Thanks for your PR.

I'm waiting for a istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@istio-testing istio-testing added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. needs-rebase Indicates a PR needs to be rebased before being merged size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 5, 2019
@yingzhuivy yingzhuivy changed the title add consecutive_5xx_errors for outlier detection deprecate consecutive_errors and add consecutive_gateway_errors & consecutive_5xx_errors Dec 6, 2019
@yingzhuivy
Copy link
Member Author

@frankbu ptal

Copy link
Contributor

@frankbu frankbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

yingzhuivy added 3 commits December 9, 2019 16:25
This CL adds consecutive_5xx_errors field in OutlierDetection.
This field describes the number of 5xx errors before a host is
ejected from the connection pool.

I made changes to destination_rule.proto, the other files are auto-generated.

Change-Id: Ib5097b3c6bf3ea2b8b2f857491537acb674ae1ff
Reviewed-on: https://gerrit.musta.ch/c/public/istio-api/+/195
Reviewed-by: Brian Wolfe <[email protected]>
Reviewed-by: Jungho Ahn <[email protected]>
Reviewed-by: Weibo He <[email protected]>
This reverts commit 064b737.

Reason for revert: decided to deprecate consecutive_errors

Change-Id: I95e3191db30711b1ce7abdebe7639de4899f2ab1
Reviewed-on: https://gerrit.musta.ch/c/public/istio-api/+/163
Reviewed-by: Jungho Ahn <[email protected]>
With the original consecutive_errors design, there is no way to turn
off consecutive gateway errors. This CL deprecate the field and add two
new fields: consecutive_gateway_errors and consecutive_5xx_errors.

See discussions here: istio#909

Change-Id: I0e98990d194216cef842fb792a76a5f59b6e674e
Reviewed-on: https://gerrit.musta.ch/c/public/istio-api/+/199
Reviewed-by: Jungho Ahn <[email protected]>
Reviewed-by: Weibo He <[email protected]>
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Dec 10, 2019
@yingzhuivy
Copy link
Member Author

@frankbu @howardjohn what's next step? should i submit a pr for the actual implementation now or wait for this to merge

@howardjohn
Copy link
Member

/ok-to-test

@istio-testing istio-testing added ok-to-test Set this label allow normal testing to take place for a PR not submitted by an Istio org member. and removed needs-ok-to-test labels Dec 10, 2019
@howardjohn
Copy link
Member

@yingzhuivy you are free to work on the implementation if you prefer, with the caveat that there may be some wasted time if others object to this. You can use a replace directive in go.mod until this merges for development purposes if you were not aware.

@rshriram can you take a look?

@yingzhuivy
Copy link
Member Author

cc @jhahn21 @brianwolfe


// Number of gateway errors before a host is ejected from the connection pool.
// When the upstream host is accessed over HTTP, a 502, 503, or 504 return
// code qualifies as an error. When the upstream host is accessed over an
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a gateway error

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


// Number of 5xx errors before a host is ejected from the connection pool.
// When the upstream host is accessed over TCP, errors are internally mapped
// to HTTP 5xx codes and treated as such.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For TCP, its essentially just connect timeout and connection reset/error, unless there is more nuancing on the types of errors. So, I suggest repeating the above sentence "When the upstream host is accessed over an opaque TCP connection, connect timeouts, ... events qualify as an error".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at envoy code, for TCP
connection timeout -> 504
connection failure -> 503
request failure -> 500

https://github.com/envoyproxy/envoy/blob/1a7b32de759939f82af2cf7e83471aff5d57c93d/source/common/upstream/outlier_detection_impl.cc#L96

updated the comment

// Number of 5xx errors before a host is ejected from the connection pool.
// When the upstream host is accessed over TCP, errors are internally mapped
// to HTTP 5xx codes and treated as such.
// If set to 0, this feature is disabled. Defaults to 0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default in envoy is same as consecutive_5xx_errors=5 and consecutive_gateway_errors=0 (disabled)
My questions are:

  1. do we want to change the default behavior? e.g., a user without any setting will see a different behavior if we set gateway default to 5.
  2. if we want to set the both defaults to 5, we don’t need to set the default for gateway since it has no meaning (since 5xx includes gateway).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. user without any setting but with just outlier detection probably falls under the legacy user using the older consecutive_errors format.
    That said, Istio behavior for last two releases has been
if outlier.ConsecutiveErrors > 0 {
		// Only listen to gateway errors, see https://github.com/istio/api/pull/617
		out.EnforcingConsecutiveGatewayFailure = &wrappers.UInt32Value{Value: uint32(100)} // defaults to 0
		out.EnforcingConsecutive_5Xx = &wrappers.UInt32Value{Value: uint32(0)}             // defaults to 100
		out.ConsecutiveGatewayFailure = &wrappers.UInt32Value{Value: uint32(outlier.ConsecutiveErrors)}
	}

i.e. if user did not set the consecutive error, then we were still doing consecutive_5xx by default. So, I think what you have is good.?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so

  1. there is discrepancy between the legacy consecutive_errors documentation and actual implementation for the last two releases: i.e. if user did not set the consecutive error, we were still doing 5xx while we should actually be doing gateway errors. Do we want to fix the implementation?
  2. for the new consecutive_5xx_errors and consecutive_gateway_errors, what do we want for default. do we want to use 5xx as default or gateway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It's too late to change the implementation of the old consecutive_errors field. Right or wrong, it was what it was. Just keep it working the same, only deprecated now.

  2. I think the new default also needs to stay the same: 5 5xx errors. So to get the same behavior as setting the old field to some value, e.g., consecutive_errors = 10, the user will now need to set both fields:

consecutive_gateway_errors = 10
consecutive_5xx_errors = 0 // disable

@rshriram WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have misunderstood your point, but I am not sure why would a user want to set the gateway greater than 5xx. In my mind, there are three cases:

  1. user want to set 5xx only
  2. user want to set gateway only
  3. user want to set both gateway and 5xx (gateway should be smaller than 5xx otherwise it is same as case 1)

I think the only caveat is that in case 2, the user needs to disable 5xx explictly by setting it as 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand that by distinguishing between 0 and unset we can cover the 3 cases, but I think there is a valid 4th case that neither Envoy nor Istio support:

  1. user wants to set gateway count to a higher number (e.g., 10) because they are likely to correct themselves but wants other 5xx errors (e.g., 500) to time out much earlier, for example after 1 or 2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, just to be clear, I think the Istio implementation should just do what Envoy does, whether they change to support case 4 or not.

Copy link
Contributor

@frankbu frankbu Dec 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also almost seems like a bug that if the user sets both 5xx to 5 and gateway to 10, that the gateway field is just ignored. If I set gateway to 10, I expect that to do something, instead of be ignored because it overlaps with another config parameter. That's why I think it might be worth opening an Envoy issue to see if they want to change their implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets go with what frank suggests for now!

Copy link
Member

@rshriram rshriram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc nits. otherwise LGTM

Copy link
Member

@rshriram rshriram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//

@frankbu
Copy link
Contributor

frankbu commented Dec 17, 2019

@yingzhuivy It looks like we are now all in agreement on the API. Are you planning to update this PR accordingly, and then provide the implementation?

@frankbu
Copy link
Contributor

frankbu commented Dec 17, 2019

I think that all that needs to change in this PR is to correct the documented default values of the two new fields and to add a sentence in the consecutive_gateway_errors doc to say that its setting will have no effect if the value is greater than the value of consecutive_5xx_errors.

@jhahn21
Copy link

jhahn21 commented Dec 17, 2019

@frankbu Yes, @yingzhuivy will update this PR as soon as she resolves an issue in connecting from China.

This CL changes the description of consecutive_5xx_errors and
consecutive_gateway errors. 5xx defaults to 5 and gateway defaults to 0.

Change-Id: I6c3b29cf92df1c972a7850a726eb159b5e23bf90
Reviewed-on: https://gerrit.musta.ch/c/public/istio-api/+/211
Reviewed-by: Jungho Ahn <[email protected]>
@yingzhuivy
Copy link
Member Author

@frankbu @rshriram just updated the PR, please take a look thanks!

// code qualifies as a gateway error. When the upstream host is accessed over
// an opaque TCP connection, connect timeouts and connection error/failure
// events qualify as a gateway error.
// If set to 0, this feature is disabled. Defaults to 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If set to 0, this feature is disabled. Defaults to 0.
// This feature is disabled by default or when set to the value 0.
//
// Note that

@frankbu
Copy link
Contributor

frankbu commented Dec 20, 2019

@yingzhuivy a couple of minor suggestions, but otherwise LGTM.

This CL rewords the descriptions of consecutive 5xx & gateway as
suggected in https://github.com/istio/api/pull/1189/files.

Change-Id: Ia95c03da78a2c9f12c8762b9d8fb95e7add08516
Reviewed-on: https://gerrit.musta.ch/c/public/istio-api/+/214
Reviewed-by: Jungho Ahn <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. ok-to-test Set this label allow normal testing to take place for a PR not submitted by an Istio org member. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants