ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings". #5897
Comments
Ideally you should not skip versions during upgrades. Can you provide the full error log message? |
Apr 25 04:20:03 cortex01 cortex[147661]: 2024/04/25 04:20:03 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings". There are no details about which line it comes from, and the PIDs refer to both the distributor and the querier. |
With debugging enabled, no more details regarding the line number were visible. This is a snippet from the querier log: Apr 25 15:00:04 cortex01 cortex[506581]: 2024/04/25 15:00:04 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings". I will try again with a 1.15 release. |
I downgraded the cluster to 1.15, and after more than an hour of testing there were no errors. |
Hi @KrisBuytaert,
Is it also happening in other components? Do you use memberlist? I am curious whether the same error shows up in the query frontend log. |
@yeya24 these errors occurred in both the querier and the distributor and, from what I noticed, not in the query-frontend. I'm not using memberlist. |
FWIW I've upgraded to 1.17 and there is no change: the errors are still there. |
Can you share your configuration? |
We're using etcd as a ring ..
|
Hi @KrisBuytaert, thanks for sharing your configuration. I think the issue is etcd related, and I found this in the etcd repo: etcd-io/etcd#15719. Cortex enables This error matches the behavior described in the gRPC doc here. Overall, this error happens when the Cortex client sends keepalive pings to the etcd server. Ideally it should not affect normal ring functionality or cause data loss; only the keepalive ping request is dropped. As mentioned in the code comment,
There are several options here:
|
@yeya24 thank you for this update! We just moved away from consul to etcd because of their license change, and there are not many options left. But given that, as you mention, this is about the connection to etcd, that also means there is no data loss happening. We're on etcd-3.5.12-1. Do I understand correctly that even if Cortex were to implement the client config, that wouldn't help until etcd also exposes the feature? So for now the best approach is probably to ignore the error log? |
I guess we can disable this option on the client side to mitigate the error log. It shouldn't have much negative impact since it only concerns keepalive, and keepalive pings from the client are still sent when there is an active connection. |
After a bit more investigation I think the cause is grpc/grpc-go#5935, which was added in the grpc-go v1.54.0 release. We are now at v1.63.2, so downgrading is unlikely to be an option. |
@KrisBuytaert Would it be possible for you to change this value to false and rebuild a new image and test? I am unable to reproduce this error log locally in a test case and I am unsure if this change could mitigate the issue 100%. If you could test it out it would be great. |
@yeya24 I built the binary and launched it about 24 hours ago. My logs show no more CALM errors. |
Great to see it helps. Feel free to open a PR for it. Or anyone from the community can pick it up. AC:
|
Describe the bug
Since the upgrade of our (dev) cluster from 1.14 to 1.16 the
ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
started appearing frequently (multiple times per hour).
The cluster still seems to work.
After reading the comments and suggestions in #5803 I have added
the snippet to change the ingester_client config to gzip.
ingester_client:
  grpc_client_config:
    # Configure the client to allow messages up to 100MB.
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
    grpc_compression: gzip
My metrics indeed show a significant change in bandwidth usage, however the errors have persisted after this change.
[[email protected] ~]# yum info cortex
Last metadata expiration check: 0:15:12 ago on Thu 25 Apr 2024 01:58:47 PM CEST.
Installed Packages
Name : cortex
Version : 1.16.0
Release : 1
Architecture : x86_64
Size : 67 M
Source : cortex-1.16.0-1.src.rpm
Repository : @System
From repo : upstream
Summary : no description given
URL : https://github.com/cortexproject/cortex
License : Apache 2.0
Description : no description given