Closed
Description
I am running a Kubernetes watch on a list of ConfigMaps. The watch runs in a background thread, continuously looking for watch events and updating my cache whenever an added/modified event arrives. What I am observing is that sometimes the watch events stop coming altogether.
    Watch<V1ConfigMap> watch = Watch.createWatch(
        apiClient,
        api.listNamespacedConfigMapCall(serviceDiscoveryConfig.getNamespace(), null, null, null,
            null, null, null, null, null, true, null),
        new TypeToken<Watch.Response<V1ConfigMap>>() {
        }.getType());
    while(true) {
        TimeUnit.SECONDS.sleep(2);
        for (Watch.Response<V1ConfigMap> configMap : watch) {
            if (configMap.object != null) {
                log.info("Received configmap watch event. updating the configmap: {} , metadata: {}", configMap.object.getMetadata().getName(), configMap.object.getMetadata().toString());
                namespaceConfigsCache.insertOrUpdateValue(configMap.object.getMetadata().getName(), configMap.object.getData());
            }
        }
    }
This entire piece of code runs in a background thread. The moment the code misses an event, I don't see any more watch events from that point on. Is there any chance that the watch is being stopped? If so, how do I check its status?
Activity
brendandburns commented on Nov 10, 2020
You should not expect a watch to run forever; you need to list/watch in a loop.
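A list/watch loop has roughly the following shape. This is only a sketch using plain JDK types: the `Source` interface is a hypothetical stand-in for the API server and is not part of the Kubernetes client, the `rounds` parameter exists only so the demo terminates, and resourceVersion handling is left out. The point it illustrates is that a watch stream ending is normal, and that each round must re-list and then open a new watch instead of iterating the old one again.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

public class ListWatchLoop {

    /** Hypothetical stand-in for the API server; not a real client-library type. */
    interface Source {
        /** Returns a full snapshot of the current state (name -> data). */
        Map<String, String> list();

        /** Returns a finite stream of {name, data} changes; it ends when the server closes the watch. */
        Iterator<String[]> watch();
    }

    /** Re-lists, then applies watch events until the stream ends, and repeats. */
    static void run(Source source, Map<String, String> cache, int rounds) {
        for (int i = 0; i < rounds; i++) { // a real loop would run until shutdown
            // Resync from a fresh list so anything missed while disconnected is picked up.
            Map<String, String> snapshot = source.list();
            cache.clear();
            cache.putAll(snapshot);

            // Open a NEW watch every round; an exhausted watch never yields events again.
            Iterator<String[]> events = source.watch();
            while (events.hasNext()) {
                String[] event = events.next();
                cache.put(event[0], event[1]);
            }
            // Reaching this point is expected: the watch timed out or was closed.
        }
    }

    public static void main(String[] args) {
        // Fake source: every watch delivers one update and then ends.
        Source fake = new Source() {
            private int watches = 0;

            public Map<String, String> list() {
                Map<String, String> snapshot = new HashMap<>();
                snapshot.put("app-config", "v0");
                return snapshot;
            }

            public Iterator<String[]> watch() {
                watches++;
                return Collections.singletonList(new String[] {"app-config", "v" + watches}).iterator();
            }
        };

        Map<String, String> cache = new TreeMap<>();
        run(fake, cache, 3);
        System.out.println(cache); // prints {app-config=v3}
    }
}
```

In the snippet from the question, the outer `while(true)` keeps iterating the same `watch` object, so once that stream ends no further events can arrive.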
The code is actually semi-thorny to get right; you are probably better off using the Informer class in this library, which handles much of this logic for you.
sameer2800 commented on Nov 10, 2020
Thanks @brendandburns. I will try out Informer and will let you know.
sameer2800 commented on Nov 10, 2020
@brendandburns I am seeing similar behavior even with the Informer class. Let me know if I have configured something wrong here. It received events for the first few minutes and then suddenly stopped receiving changelog events.
brendandburns commented on Nov 10, 2020
Is it possible that your thread is throwing an exception? If you throw an uncaught exception inside the thread, the thread will terminate.
I would try:
And see if any exceptions occur. Your code for using the informer looks correct.
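One way to make such a failure visible, sketched with plain JDK classes only: wrap the body of the background thread in a catch-all and log whatever escapes. The class name `ThreadFailureLogging`, the helper `startGuarded`, and the `lastFailure` field are illustrative names, not part of the Kubernetes client.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ThreadFailureLogging {

    /** Last exception that escaped a guarded task, kept so it can be inspected later. */
    static final AtomicReference<Throwable> lastFailure = new AtomicReference<>();

    /** Starts the task on a new thread and logs anything it throws instead of losing it. */
    static Thread startGuarded(String name, Runnable task) {
        Thread thread = new Thread(() -> {
            try {
                task.run();
            } catch (Throwable e) {
                // Without this catch the thread would still terminate here.
                lastFailure.set(e);
                System.err.println("Thread " + Thread.currentThread().getName() + " terminated: " + e);
            }
        }, name);
        thread.setDaemon(true);
        thread.start();
        return thread;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = startGuarded("watch-loop", () -> {
            throw new IllegalStateException("simulated failure in event handler");
        });
        t.join();
        System.out.println("recorded failure: " + lastFailure.get());
    }
}
```

The same guard can be placed inside an event handler callback, since an exception thrown there runs on a thread the caller did not create.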
yue9944882 commented on Nov 10, 2020
I think so.
sameer2800 commented on Nov 11, 2020
@brendandburns @yue9944882 I am not running this in a separate thread because
factory.startAllRegisteredInformers();
starts the informers in a background thread. I am running this in the main method itself. Which part do you want me to put in a try/catch block? Initializing the informer and adding an event handler is a one-time task, and I don't see errors there, and startAllRegisteredInformers runs in the background.
sameer2800 commented on Nov 12, 2020
@brendandburns I have not changed any timeout variables. I see the default for listNamespacedConfigMapCall is set to 5 minutes.
Actually, in my case, the kube API server goes into an unavailable state once in a while. Do you think increasing the timeout will work?
yue9944882 commented on Nov 12, 2020
Your code looks good. The informer will retry connecting to the kube-apiserver every 1 second if the server becomes unavailable, and the watch connection will be re-established once the server is up.
java/examples/src/main/java/io/kubernetes/client/examples/InformerExample.java
Lines 38 to 39 in a43fa93
Did you set the read timeout to infinite as the example above shows?
sameer2800 commented on Nov 12, 2020
@yue9944882 Yes.
I tried changing the timeouts too; it didn't help. In my last run, I could see it working for hours, then it stopped receiving events. I started with debug mode on, and I see neither exceptions nor errors.
brendandburns commented on Nov 12, 2020
I'm actually not sure you want an infinite timeout. In a flaky network, is it possible that something is not sending a TCP reset when the network connection is severed? I've seen situations where a TCP reset isn't sent and the system holds a TCP connection open, but there's no traffic flowing.
I would actually set a non-infinite timeout (5 minutes?) and see if that fixes things.
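The effect of a finite read timeout on a connection that stays open but carries no traffic can be reproduced with plain sockets. The sketch below uses only the JDK and a loopback connection; it does not touch the Kubernetes client, and `ReadTimeoutDemo` and `readTimesOut` are illustrative names. With a timeout of 0 (infinite) the same `read()` would block forever, which matches the symptom of a watch that silently stops delivering events.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {

    /** Returns true if reading from a connected but silent peer gives up after timeoutMillis. */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // The peer is connected and never writes or closes: no data, no reset.
            client.setSoTimeout(timeoutMillis); // 0 would mean "wait forever"
            InputStream in = client.getInputStream();
            try {
                in.read();
                return false; // data or end-of-stream arrived, which is not expected here
            } catch (SocketTimeoutException e) {
                return true; // the caller now knows the connection is stalled and can reconnect
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```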
sameer2800 commented on Nov 13, 2020
@brendandburns Thanks for the suggestion. I will try it and let you know.
tony-clarke-amdocs commented on Nov 24, 2020
@sameer2800 were you able to resolve this?
We are starting to see the same symptoms on AKS (Azure Kubernetes Service, K8S 1.18.8) for a CR instance. After about 5 minutes the informer stops seeing any updates (new/update/delete). We are running with the 9.0.1 release. We updated to 10.0.1 but saw no difference.
@brendandburns you suggested running with a read timeout that is not zero, but the 10.0.0 release was updated to disallow any read timeout other than zero. See this commit. Any other suggestions to try?
brendandburns commented on Nov 25, 2020
cc @yue9944882
See some related discussion here:
kubernetes/kubernetes#65012
@tony-clarke-amdocs for AKS specifically see the discussion here:
Azure/AKS#1755
I think we should:
a) re-enable non-zero timeouts
b) make sure we're sending TCP Keep-Alive
eventually:
c) switch from Web Sockets to HTTP/2 and add health checks.
tony-clarke-amdocs commented on Nov 25, 2020
@brendandburns @yue9944882 I noticed that the watch call sets the timeout to 5 minutes. See here. Given that we no longer see watch events after 5 minutes...I tend to think this is not a coincidence?
Any idea how we make sure to send TCP keep-alive? Looking at the code, I don't think we are doing web sockets today.
tony-clarke-amdocs commented on Dec 4, 2020
@brendandburns @yue9944882 I think I have figured this out. The standard client doesn't include the HTTP/2 protocol. We need to add the following to enable HTTP/2 and a ping interval.
With the above change the watch doesn't hang and it all is good.
Would it make sense for the standard build to include something like this by default? I think it should at least include the HTTP_2 protocol.
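For readers looking for the shape of such a change: the following is a configuration sketch, not the snippet from the comment above. It assumes the client is built on OkHttp 3.x or later and uses `ApiClient.getHttpClient()` / `setHttpClient()` together with OkHttp's `protocols` and `pingInterval` builder options. The one-minute interval and the class name `Http2PingClient` are assumed values chosen for illustration, and the snippet needs the Kubernetes Java client and OkHttp on the classpath.

```java
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.util.Config;
import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;
import okhttp3.Protocol;

public class Http2PingClient {

    /** Builds an ApiClient whose HTTP client offers HTTP/2 and sends periodic pings. */
    public static ApiClient build() throws IOException {
        ApiClient apiClient = Config.defaultClient();
        OkHttpClient httpClient = apiClient.getHttpClient().newBuilder()
            // Offer HTTP/2 first; HTTP/1.1 must remain in the list as a fallback.
            .protocols(Arrays.asList(Protocol.HTTP_2, Protocol.HTTP_1_1))
            // Assumed interval: pings let the client notice a connection that has gone silent.
            .pingInterval(1, TimeUnit.MINUTES)
            .build();
        apiClient.setHttpClient(httpClient);
        return apiClient;
    }
}
```

The resulting `ApiClient` would then be passed to the informer factory or watch in place of the default client.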