kubernetes watch API is behaving oddly #1370

Closed

sameer2800

I am running a Kubernetes watch on the ConfigMap list. The watch runs in a background thread, continuously looking for events and updating my cache whenever a ConfigMap is added or modified. What I am observing is that sometimes the watch events stop arriving altogether.

Watch<V1ConfigMap> watch = Watch.createWatch(
        apiClient,
        api.listNamespacedConfigMapCall(serviceDiscoveryConfig.getNamespace(), null, null, null,
                null, null, null, null, null, true, null),
        new TypeToken<Watch.Response<V1ConfigMap>>() {
        }.getType());

while (true) {
    TimeUnit.SECONDS.sleep(2);
    for (Watch.Response<V1ConfigMap> configMap : watch) {
        if (configMap.object != null) {
            log.info("Received configmap watch event. updating the configmap: {} , metadata: {}", configMap.object.getMetadata().getName(), configMap.object.getMetadata().toString());

            namespaceConfigsCache.insertOrUpdateValue(configMap.object.getMetadata().getName(), configMap.object.getData());
        }
    }
}

This entire piece of code runs in a background thread. Once the code misses an event, I don't see any further watch events at all. Is there any chance that the watch is being stopped? If so, how do I check its status?

brendandburns

Contributor

You should not expect a watch to run forever; you need to list/watch in a loop.

The code is actually semi-thorny to get right. You are probably better off using the Informer class in this library, which handles much of this logic for you.
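
To illustrate what "list/watch in a loop" means, here is a minimal sketch using only the JDK. It does not use the Kubernetes client: `ConfigSource`, `ListWatchLoop`, and `demo()` are hypothetical names invented for this example, and the fake source simply delivers one event per watch and then ends. The point is the control flow: when a watch ends, re-list and open a new watch rather than iterating the exhausted one again. Real code would also have to handle resource versions, errors, and backoff, which is part of what the Informer does.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the real list and watch API calls. */
interface ConfigSource {
    /** Returns a full snapshot of name -> value. */
    Map<String, String> list();

    /** Returns a finite stream of changes; like a real watch, it may end at any time. */
    Iterator<Map.Entry<String, String>> watch();
}

final class ListWatchLoop {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Runs a fixed number of list+watch cycles; a real loop would run until shutdown. */
    void run(ConfigSource source, int rounds) {
        for (int i = 0; i < rounds; i++) {
            // Re-list first so that changes missed while disconnected are picked up.
            Map<String, String> snapshot = source.list();
            cache.keySet().retainAll(snapshot.keySet());
            cache.putAll(snapshot);

            // Drain the watch. When it ends, go around again and open a new one
            // instead of iterating the exhausted watch a second time.
            Iterator<Map.Entry<String, String>> events = source.watch();
            while (events.hasNext()) {
                Map.Entry<String, String> event = events.next();
                cache.put(event.getKey(), event.getValue());
            }
        }
    }

    Map<String, String> cache() {
        return cache;
    }

    /** Demo with an in-memory fake whose watch delivers one event and then ends. */
    static Map<String, String> demo() {
        final int[] listCalls = {0};
        ConfigSource fake = new ConfigSource() {
            @Override
            public Map<String, String> list() {
                listCalls[0]++;
                return Collections.singletonMap("a", "listed-" + listCalls[0]);
            }

            @Override
            public Iterator<Map.Entry<String, String>> watch() {
                Map.Entry<String, String> event =
                        new SimpleEntry<>("b", "watched-" + listCalls[0]);
                return Collections.singletonList(event).iterator();
            }
        };
        ListWatchLoop loop = new ListWatchLoop();
        loop.run(fake, 2);
        return loop.cache();
    }
}
```

In the demo, the second round re-lists and reopens the watch, so the cache reflects the second snapshot and the second watch event.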

sameer2800

Author

Thanks @brendandburns. I will try out the Informer and let you know.

sameer2800

Author

@brendandburns I am seeing similar behavior even with the Informer class. Let me know if I have configured something wrong here. It received events for the first few minutes and then suddenly stopped receiving change events.

// configmaps informer
SharedIndexInformer<V1ConfigMap> configInformer =
        factory.sharedIndexInformerFor(
                (CallGeneratorParams params) -> {
                    return api.listNamespacedConfigMapCall(
                            serviceDiscoveryConfig.getNamespace(),
                            null,
                            null,
                            null,
                            null,
                            null,
                            null,
                            params.resourceVersion,
                            params.timeoutSeconds,
                            params.watch,
                            null);
                },
                V1ConfigMap.class,
                V1ConfigMapList.class);

configInformer.addEventHandler(
        new ResourceEventHandler<V1ConfigMap>() {
            @Override
            public void onAdd(V1ConfigMap configMap) {
                log.info("Received configmap watch add event. updating the configmap: {} , metadata: {}", configMap.getMetadata().getName(), configMap.getMetadata().toString());
            }

            @Override
            public void onUpdate(V1ConfigMap oldConfigMap, V1ConfigMap newConfigMap) {
                log.info("Received configmap watch update event. updating the configmap: {} , metadata: {}", newConfigMap.getMetadata().getName(), newConfigMap.getMetadata().toString());
            }

            @Override
            public void onDelete(V1ConfigMap configMap, boolean deletedFinalStateUnknown) {
            }
        });

factory.startAllRegisteredInformers();

brendandburns

Contributor

Is it possible that your thread is throwing an exception? If you throw an uncaught exception inside the thread, the thread will terminate.

I would try:

public void run() {
  try {
    // your code here
  } catch (Throwable e) {
     e.printStackTrace();
  }
}

And see if any exceptions occur. Your code for using the informer looks correct.
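
As a rough illustration of the point about uncaught exceptions, the sketch below uses only the JDK to show a worker thread dying from an exception and one way to observe it, via `Thread.setUncaughtExceptionHandler`. The class name `BackgroundFailureDemo` and the method `runAndCapture` are made up for this example. It says nothing about how the informer's own threads are managed.

```java
import java.util.concurrent.atomic.AtomicReference;

final class BackgroundFailureDemo {
    /**
     * Runs the body on a new thread and returns the exception that killed it,
     * or null if the body returned normally.
     */
    static Throwable runAndCapture(Runnable body) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(body, "watch-worker");
        // Without a handler, the default behavior is a stack trace on stderr,
        // which is easy to miss when logs are routed elsewhere.
        worker.setUncaughtExceptionHandler((thread, error) -> failure.set(error));
        worker.start();
        try {
            // The handler runs on the dying thread, so it has finished once join() returns.
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failure.get();
    }

    public static void main(String[] args) {
        Throwable failure = runAndCapture(() -> {
            throw new IllegalStateException("handler threw while processing an event");
        });
        System.out.println("worker died with: " + failure);
    }
}
```

The same idea applies to the try/catch suggested above: either approach makes a silent thread death visible, which helps rule it out as the cause.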

yue9944882

Member

> Is it possible that your thread is throwing an exception? If you throw an uncaught exception inside the thread, the thread will terminate.

I think so.

sameer2800

Author

@brendandburns @yue9944882 I am not running this in a separate thread, because factory.startAllRegisteredInformers(); starts the informers in a background thread. I am running this in the main method itself. Which part do you want me to put in a try/catch block? Initializing the informer and adding an event handler are one-time tasks, and I don't see errors there. And startAllRegisteredInformers runs in the background.

sameer2800

Author

@brendandburns I have not changed any timeout variables. I see the default for listNamespacedConfigMapCall is set to 5 minutes.

listerWatcher.watch(
        new CallGeneratorParams(
                Boolean.TRUE,
                lastSyncResourceVersion,
                Long.valueOf(Duration.ofMinutes(5).toMillis()).intValue()));

Actually, in my case, the kube API server becomes unavailable once in a while. Do you think increasing the timeout will help?

yue9944882

Member

Your code looks good. The informer will retry connecting to the kube-apiserver every 1 second if the server becomes unavailable, and the watch connection will be re-established once the server is back up.

java/examples/src/main/java/io/kubernetes/client/examples/InformerExample.java

Lines 38 to 39 in a43fa93

    
OkHttpClient httpClient =
    apiClient.getHttpClient().newBuilder().readTimeout(0, TimeUnit.SECONDS).build();

Did you set the read timeout to infinite, as the example above shows?

sameer2800

Author

@yue9944882 yes.

apiClient = ClientBuilder.cluster().build();
// infinite timeout
OkHttpClient httpClient =
        apiClient.getHttpClient().newBuilder().readTimeout(0, TimeUnit.SECONDS).build();
apiClient.setHttpClient(httpClient);
this.api = new CoreV1Api();

I tried changing the timeouts too; it didn't help. In my last run, I could see it working for hours, then it stopped receiving events. I started it with debug mode on. I see neither exceptions nor errors.

brendandburns

Contributor

I'm actually not sure that you want an infinite timeout. In a flaky network, is it possible that something is not sending a TCP reset when the network connection is severed? I've seen situations where a TCP reset isn't sent and the system holds a TCP connection open, but there's no traffic flowing.

I would actually set a non-infinite timeout (5 minutes?) and see if that fixes things.
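
To make the failure mode concrete, here is a small JDK-only sketch of a connection that stays open while no traffic flows. It uses plain `java.net` sockets on the loopback interface, not OkHttp or the Kubernetes client, and the names `SilentPeerDemo` and `readTimesOut` are invented for the example. With a finite read timeout, the blocked read fails with `SocketTimeoutException`, which gives the caller a chance to reconnect. With a timeout of 0 (infinite), the same read would block indefinitely.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class SilentPeerDemo {
    /**
     * Connects to a local peer that accepts the connection but never sends data,
     * then reports whether a read gave up after the given timeout.
     */
    static boolean readTimesOut(int timeoutMillis) {
        if (timeoutMillis <= 0) {
            // 0 means "wait forever" for setSoTimeout, which would hang this demo.
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket silentPeer = server.accept()) {
            client.setSoTimeout(timeoutMillis);
            try {
                // The peer holds the connection open and stays silent,
                // so this read can only end through the timeout.
                client.getInputStream().read();
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```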

sameer2800

Author

@brendandburns Thanks for the suggestion. I will try it and let you know.

tony-clarke-amdocs

Contributor

@sameer2800 were you able to resolve this?

We are starting to see the same symptoms on AKS (Azure Kubernetes Service, K8S 1.18.8) for CR instances. After about 5 minutes the informer stops seeing any updates (new/update/delete). We are running the 9.0.1 release. We updated to 10.0.1, but it made no difference.

@brendandburns you suggested running with a read timeout that is not zero, but the 10.0.0 release was updated to disallow any read timeout other than zero. See this commit. Any other suggestions to try?

brendandburns

Contributor

cc @yue9944882

See some related discussion here:
kubernetes/kubernetes#65012

@tony-clarke-amdocs for AKS specifically see the discussion here:
Azure/AKS#1755

I think we should:
a) re-enable non-zero timeouts
b) make sure we're sending TCP Keep-Alive

eventually:

c) switch from Web Sockets to HTTP/2 and add health checks.
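
For reference on item b), this is what enabling TCP keep-alive looks like on a plain JDK socket. It is only a sketch of the socket option itself: how the Kubernetes client or OkHttp configures its underlying sockets is not shown in this thread, and `KeepAliveDemo` and `enableKeepAlive` are hypothetical names. Note that probe timing is controlled by the operating system; on Linux the default idle time before the first probe is commonly two hours, so keep-alive alone may not detect a dead connection quickly.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class KeepAliveDemo {
    /** Opens a loopback connection, enables SO_KEEPALIVE, and reports the resulting setting. */
    static boolean enableKeepAlive() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            // Ask the OS to probe the peer when the connection has been idle.
            client.setKeepAlive(true);
            return client.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("keep-alive enabled: " + enableKeepAlive());
    }
}
```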

tony-clarke-amdocs

Contributor

@brendandburns @yue9944882 I noticed that the watch call sets the timeout to 5 minutes. See here. Given that we no longer see watch events after 5 minutes, I tend to think this is not a coincidence.
Any idea how we can make sure TCP keep-alive is sent? Looking at the code, I don't think we are using WebSockets today.

tony-clarke-amdocs

Contributor

@brendandburns @yue9944882 I think I have figured this out. The standard client doesn't include the HTTP/2 protocol.

ApiClient apiClient = ClientBuilder.standard().build();

We need to add the following to enable HTTP/2 and set a ping interval.

apiClient.setHttpClient(apiClient
        .getHttpClient()
        .newBuilder()
        .protocols(Arrays.asList(Protocol.HTTP_2, Protocol.HTTP_1_1))
        .readTimeout(Duration.ZERO)
        .pingInterval(1, TimeUnit.MINUTES)
        .build());

With the above change, the watch doesn't hang and all is good.

Does it make sense for the standard builder to include something like this by default? I think it should at least include the HTTP_2 protocol.


Development

Update client builder to be more robust by default (kubernetes-client/java)

Activity

brendandburns commented on Nov 10, 2020

sameer2800 commented on Nov 10, 2020

sameer2800 commented on Nov 10, 2020

brendandburns commented on Nov 10, 2020

yue9944882 commented on Nov 10, 2020

sameer2800 commented on Nov 11, 2020

sameer2800 commented on Nov 12, 2020

yue9944882 commented on Nov 12, 2020

sameer2800 commented on Nov 12, 2020

brendandburns commented on Nov 12, 2020

sameer2800 commented on Nov 13, 2020

tony-clarke-amdocs commented on Nov 24, 2020

brendandburns commented on Nov 25, 2020

tony-clarke-amdocs commented on Nov 25, 2020

tony-clarke-amdocs commented on Dec 4, 2020
