ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings". #5897

Closed
KrisBuytaert opened this issue Apr 25, 2024 · 16 comments · Fixed by #5933

@KrisBuytaert
Contributor

Describe the bug
Since upgrading our (dev) cluster from 1.14 to 1.16, the error
ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
has started appearing frequently (multiple times per hour).
The cluster still seems to work.

After reading the comments and suggestions in #5803 I added
the snippet below to switch the ingester_client config to gzip compression.

ingester_client:
  grpc_client_config:
    # Configure the client to allow messages up to 100MB.
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
    grpc_compression: gzip

My metrics indeed show a significant change in bandwidth usage; however, the errors have persisted after this change.

[[email protected] ~]# yum info cortex
Last metadata expiration check: 0:15:12 ago on Thu 25 Apr 2024 01:58:47 PM CEST.
Installed Packages
Name : cortex
Version : 1.16.0
Release : 1
Architecture : x86_64
Size : 67 M
Source : cortex-1.16.0-1.src.rpm
Repository : @System
From repo : upstream
Summary : no description given
URL : https://github.com/cortexproject/cortex
License : Apache 2.0
Description : no description given

@friedrichg
Member

Ideally you should not skip versions during upgrades. Can you provide the full error log message?

@KrisBuytaert
Contributor Author

Apr 25 04:20:03 cortex01 cortex[147661]: 2024/04/25 04:20:03 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 04:20:04 cortex01 cortex[147707]: 2024/04/25 04:20:04 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 04:20:04 cortex01 cortex[147707]: 2024/04/25 04:20:04 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".

There are no further details about which line produced the error, and the PIDs refer to both the distributor and the querier.

@KrisBuytaert
Contributor Author

With debugging enabled, no more details regarding the line number were visible. This is a snippet from the querier log:

Apr 25 15:00:04 cortex01 cortex[506581]: 2024/04/25 15:00:04 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 15:00:04 cortex01 cortex[506581]: 2024/04/25 15:00:04 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 15:00:14 cortex01 cortex[506581]: 2024/04/25 15:00:14 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 15:00:44 cortex01 cortex[506581]: ts=2024-04-25T13:00:44.485592333Z caller=logging.go:76 level=debug traceID=657180a8c5de32fe msg="GET /metrics (200) 2.186138ms"
Apr 25 15:01:24 cortex01 cortex[506581]: 2024/04/25 15:01:24 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 15:01:24 cortex01 cortex[506581]: 2024/04/25 15:01:24 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 15:01:34 cortex01 cortex[506581]: 2024/04/25 15:01:34 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 15:01:44 cortex01 cortex[506581]: ts=2024-04-25T13:01:44.485962801Z caller=logging.go:76 level=debug traceID=61c01609c9adfa91 msg="GET /metrics (200) 2.552095ms"
Apr 25 15:02:44 cortex01 cortex[506581]: ts=2024-04-25T13:02:44.486211514Z caller=logging.go:76 level=debug traceID=45814b9fb916bd0b msg="GET /metrics (200) 2.282626ms"
Apr 25 15:03:44 cortex01 cortex[506581]: ts=2024-04-25T13:03:44.485531798Z caller=logging.go:76 level=debug traceID=6971b2be199cfbaf msg="GET /metrics (200) 1.845831ms"
Apr 25 15:04:04 cortex01 cortex[506581]: 2024/04/25 15:04:04 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 15:04:04 cortex01 cortex[506581]: 2024/04/25 15:04:04 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 15:04:14 cortex01 cortex[506581]: 2024/04/25 15:04:14 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 15:04:44 cortex01 cortex[506581]: ts=2024-04-25T13:04:44.484666408Z caller=logging.go:76 level=debug traceID=789feca8d5bf8bf1 msg="GET /metrics (200) 1.755314ms"
Apr 25 15:05:44 cortex01 cortex[506581]: ts=2024-04-25T13:05:44.485404378Z caller=logging.go:76 level=debug traceID=1ae734cefa34984c msg="GET /metrics (200) 2.415125ms"
Apr 25 15:06:44 cortex01 cortex[506581]: ts=2024-04-25T13:06:44.48592392Z caller=logging.go:76 level=debug traceID=3b80a2f1cb6cf232 msg="GET /metrics (200) 2.120953ms"
Apr 25 15:07:44 cortex01 cortex[506581]: ts=2024-04-25T13:07:44.485398696Z caller=logging.go:76 level=debug traceID=5c12932efb0faa71 msg="GET /metrics (200) 1.827166ms"
Apr 25 15:08:44 cortex01 cortex[506581]: ts=2024-04-25T13:08:44.488854493Z caller=logging.go:76 level=debug traceID=28d6894c1929e005 msg="GET /metrics (200) 2.529201ms"
Apr 25 15:09:24 cortex01 cortex[506581]: 2024/04/25 15:09:24 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 15:09:24 cortex01 cortex[506581]: 2024/04/25 15:09:24 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 15:09:34 cortex01 cortex[506581]: 2024/04/25 15:09:34 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Apr 25 15:09:44 cortex01 cortex[506581]: ts=2024-04-25T13:09:44.485970517Z caller=logging.go:76 level=debug traceID=065e342d3be43d29 msg="GET /metrics (200) 2.911692ms"

I will try again with a 1.15 release ..

@KrisBuytaert
Contributor Author

I downgraded the cluster to 1.15, and after an hour+ of testing there were no errors.
In a 1.16 cluster the errors already start happening within the first 10 minutes.

@yeya24
Contributor

yeya24 commented Apr 26, 2024

Hi @KrisBuytaert,

With debugging enabled, no more details regarding the line number were visible. This is a snippet from the querier log:

Is it happening to other components as well? Do you use memberlist? I am curious whether we see the same error log in the query-frontend log.

@KrisBuytaert
Contributor Author

@yeya24 these errors occurred in both the querier and the distributor, and from what I noticed not in the query-frontend. I'm not using memberlist.

@KrisBuytaert
Contributor Author

FWIW I've upgraded to 1.17 and there is no change; the errors are still there.

@yeya24
Contributor

yeya24 commented May 2, 2024

Can you share your configuration?
What Ring implementation are you using?

@KrisBuytaert
Contributor Author

We're using etcd as the ring KV store:

auth_enabled:  true

server:
  http_listen_port:  9010
  http_listen_address:  10.4.0.87
  grpc_listen_port: 9910
  grpc_listen_address: 10.4.0.87
  grpc_server_max_concurrent_streams: 20000
  log_level: debug


limits:
  max_metadata_per_user: 0
  max_metadata_per_metric: 0
  accept_ha_samples: true
  max_label_names_per_series: 50
  max_series_per_metric: 100000
  max_series_per_user: 10000000
  compactor_blocks_retention_period: 712d
  ingestion_rate: 26000


# storage configuration
storage:
  engine: blocks

blocks_storage:
  backend: s3
  s3:
    bucket_name: o11ydev-cortex
    insecure: false
    endpoint: some_s3_service.io
    secret_access_key: "nope"
    access_key_id: "nope"
  bucket_store:
    ignore_deletion_mark_delay: 1h
    bucket_index:
      enabled: true
  tsdb:
    retention_period: 405m

compactor:
  data_dir: /var/lib/cortex/compactor
  compaction_concurrency: 1
  sharding_enabled: false
  deletion_delay: 12h0m0s

distributor:
  shard_by_all_labels: true
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
       store: etcd
       etcd:
          endpoints: ["http://10.4.0.87:2379", "http://10.4.0.95:2379", "http://10.4.0.127:23:2379"]


frontend:
  instance_interface_names:
    - enp1s0

frontend_worker:
  frontend_address: "10.4.0.87:9914"

ingester:
  # We want our ingesters to flush chunks at the same time to optimise
  # deduplication opportunities.
  lifecycler:
    ring:
      replication_factor: 2
      kvstore:
        store: etcd
        etcd:
          endpoints: ["http://10.4.0.87:2379", "http://10.4.0.95:2379", "http://10.4.0.127:23:2379"]
    num_tokens: 512
    # The ingester needs some time before joining to observe the ring state
    join_after: 5s
    observe_period: 5s
    min_ready_duration: 0s
    final_sleep: 0s
    tokens_file_path: /var/lib/cortex/wal/tokens
    # The address to advertise for this ingester.  Will be autodiscovered by
    # looking up address on eth0 or en0; can be specified if this fails.
    address: 10.4.0.87

querier:
  # This needs to be parametrized and made dynamic
  # store_gateway_addresses: 127.0.0.1:9910
  query_store_for_labels_enabled: true
  #  query_store_after: 400m
  #  query_ingesters_within: 405m

ruler:
  enable_api: true
  external_labels:
    cortex: o11ydev
    component: ruler
  external_url: 
  alertmanager_url: 
  rule_path: /var/lib/cortex/ruler
  ring:
    kvstore:
       store: etcd
       etcd:
          endpoints: ["http://10.4.0.87:2379", "http://10.4.0.95:2379", "http://10.4.0.127:23:2379"]

ruler_storage:
  backend: s3
  s3:
    bucket_name: default
    insecure: false
    endpoint: s3.gra.io.cloud.ovh.net
    secret_access_key: "secret"
    access_key_id: "none"

store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
       store: etcd
       etcd:
          endpoints: ["http://10.4.0.87:2379", "http://10.4.0.95:2379", "http://10.4.0.127:23:2379"]
    instance_interface_names:
      - enp1s0



ingester_client:
  grpc_client_config:
    # Configure the client to allow messages up to 100MB.
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
    grpc_compression: gzip


@yeya24
Contributor

yeya24 commented May 3, 2024

Hi @KrisBuytaert, thanks for sharing your configuration. I think the issue is etcd related; I found this in the etcd repo: etcd-io/etcd#15719.

Cortex enables the PermitWithoutStream option on the client side here. However, the etcd server doesn't have a way to turn it on at the server side, so I do think this is an etcd issue. What etcd version are you using?
I am not sure why this problem only occurs after the 1.15.0 release. It might be an etcd client library version upgrade.

This error matches the behavior mentioned in the gRPC doc here.
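
For context, this is roughly how the server-side enforcement works in grpc-go: a client that pings more often than the server's MinTime allows, or that pings while it has no active streams while the server keeps PermitWithoutStream false, receives the GOAWAY with "too_many_pings". The sketch below uses illustrative values, not etcd's actual settings.

package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Illustrative enforcement policy; etcd's real values may differ.
	kaep := keepalive.EnforcementPolicy{
		// Reject keepalive pings arriving more often than once per 5 minutes.
		MinTime: 5 * time.Minute,
		// Reject keepalive pings sent on connections with no active streams.
		// If the client enables PermitWithoutStream but the server keeps this
		// false, the server answers with GOAWAY ENHANCE_YOUR_CALM "too_many_pings".
		PermitWithoutStream: false,
	}

	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		panic(err)
	}

	srv := grpc.NewServer(grpc.KeepaliveEnforcementPolicy(kaep))
	_ = srv.Serve(lis)
}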

But overall, this error happens when the Cortex client sends keepalive pings to the etcd server. Ideally it shouldn't impact normal Ring functionality or cause data loss; only the keepalive ping request is dropped. As mentioned in the code comment:

		// Configure the keepalive to make sure that the client reconnects
		// to the etcd service endpoint(s) in case the current connection is
		// dead (ie. the node where etcd is running is dead or a network
		// partition occurs).

There are several options here:

  1. Ask etcd to expose the PermitWithoutStream flag and set it to true
  2. Add a new config option to the Cortex etcd client to disable PermitWithoutStream (see the sketch after this list)
  3. Try a different Ring implementation other than etcd
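
As a rough illustration of option 2: the keepalive behavior is set in the etcd clientv3.Config that Cortex builds for the ring KV store. This is not the actual Cortex code; the function name, the new permitWithoutStream parameter, and the keepalive durations are assumptions for illustration only.

package kv

import (
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// newRingEtcdClient sketches how the ring KV client could expose a knob for
// PermitWithoutStream instead of hard-coding it to true. The
// permitWithoutStream argument would come from a hypothetical new Cortex
// config option.
func newRingEtcdClient(endpoints []string, permitWithoutStream bool) (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints: endpoints,
		// Keepalive so the client reconnects when the current etcd endpoint
		// dies or a network partition occurs (illustrative durations).
		DialKeepAliveTime:    10 * time.Second,
		DialKeepAliveTimeout: 2 * time.Second,
		// false stops keepalive pings on idle connections, which is what
		// currently triggers the "too_many_pings" GOAWAY from etcd.
		PermitWithoutStream: permitWithoutStream,
	})
}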

@KrisBuytaert
Contributor Author

@yeya24 thank you for this update! We just moved away from Consul to etcd because of their license change, so there are not many options left. But given that, as you mention, this is about the connection to etcd, that also means there is no data loss happening.

We're on etcd-3.5.12-1

Do I understand correctly that even if Cortex were to implement the client config, that wouldn't help until etcd also exposes the feature?

So for now the best approach is probably to ignore the error log?

@yeya24
Contributor

yeya24 commented May 3, 2024

I guess we can disable this option on the client side to mitigate the error log.

It shouldn't cause much negative impact since it is just keepalive, and keepalive pings from the client are still sent when there is an active connection.
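
In plain gRPC terms, PermitWithoutStream only controls whether the client pings while it has no active RPCs; with it disabled, keepalive pings still flow whenever a stream is open. A minimal client-side sketch, with a made-up endpoint and durations rather than Cortex's actual dial options:

package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	kacp := keepalive.ClientParameters{
		// Ping the server when the connection has been idle this long.
		Time: 10 * time.Second,
		// Close the connection if a ping is not acknowledged within this time.
		Timeout: 2 * time.Second,
		// false: only ping while there are active streams, so an idle client
		// never trips the server's keepalive enforcement policy.
		PermitWithoutStream: false,
	}

	// "etcd.example.internal:2379" is a placeholder endpoint.
	conn, err := grpc.Dial("etcd.example.internal:2379",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(kacp),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}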

@yeya24
Contributor

yeya24 commented May 5, 2024

After a bit more investigation I think the cause is grpc/grpc-go#5935, which was added in the grpc-go v1.54.0 release. We are now at v1.63.2, so downgrading is unlikely.

@yeya24
Contributor

yeya24 commented May 5, 2024

@KrisBuytaert Would it be possible for you to change this value to false, rebuild a new image, and test?
https://github.com/cortexproject/cortex/blob/master/pkg/ring/kv/etcd/etcd.go#L105

I am unable to reproduce this error log locally in a test case, and I am unsure whether this change mitigates the issue 100%. If you could test it out, that would be great.

@KrisBuytaert
Contributor Author

@yeya24 so I built the binary and launched it about 24 hours ago, and my logs show no more CALM errors.
So I guess that change most probably fixes the problem.

@yeya24
Contributor

yeya24 commented May 7, 2024

Great to see it helps. Feel free to open a PR for it. Or anyone from the community can pick it up.
