After updating Operator operator doesn't seem to reconnect properly on kubernetes clusters #657

wtrocki · 2021-11-05T10:50:46Z

Bug Report

We have been using the Operator SDK in production-like scenarios, for example with more than 200,000 secrets on a single Kubernetes cluster.
In that scenario we have noticed that the operator's connection can be unstable and drops very often.

@secondsun added a fix that restarts the operator when the connection is dropped, but in recent versions that code path does not seem to be triggered, because the connection is kept by the underlying watcher. The problem is that the watchers sit idle and stop responding: the Java Operator SDK operators keep running but do not respond properly to any requests.

This is quite challenging with the Java Operator SDK: we have seen "data loss". Golang-based operators work better on such clusters, mainly because their architecture checks for the CRs in an event loop rather than relying on the watch alone.
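
To make the "check in a loop instead of trusting the watch" idea concrete, here is a minimal sketch of a staleness check that could sit next to a watcher. It is not part of the Java Operator SDK or the fabric8 client: `WatchLivenessTracker`, `recordEvent`, and `resyncIfStale` are hypothetical names, and the resync action (for example, listing the CRs again and reconciling them) is left to the caller.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not an SDK class): remembers when the watch last
 * delivered an event and runs a caller-supplied resync when it has been
 * silent for longer than maxIdle.
 */
final class WatchLivenessTracker {
    private final Duration maxIdle;
    private final Supplier<Instant> clock; // injectable so the logic can be tested
    private volatile Instant lastEvent;

    WatchLivenessTracker(Duration maxIdle, Supplier<Instant> clock) {
        this.maxIdle = maxIdle;
        this.clock = clock;
        this.lastEvent = clock.get();
    }

    /** Call this from the watcher callback whenever an event arrives. */
    void recordEvent() {
        lastEvent = clock.get();
    }

    /** True when no event has been seen for strictly longer than maxIdle. */
    boolean isStale() {
        return Duration.between(lastEvent, clock.get()).compareTo(maxIdle) > 0;
    }

    /**
     * Runs the resync action only if the watch looks stale, then resets
     * the idle timer. Returns whether the resync ran.
     */
    boolean resyncIfStale(Runnable resync) {
        if (!isStale()) {
            return false;
        }
        resync.run();
        recordEvent();
        return true;
    }
}
```

Something like a `ScheduledExecutorService` could call `resyncIfStale` periodically. A silent watch on a quiet cluster is not necessarily broken, so this only works if the resync action is idempotent, and on a cluster of this size `maxIdle` would need to be chosen so that the extra list calls do not overload the API server.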

According to @secondsun:

https://github.com/java-operator-sdk/java-operator-sdk/blob/0cc051237f1639b9a419f9b0beaf3d1c8cb0e31d/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/internal/CustomResourceEventSource.java#L68 is a candidate bug. It might be that namespaces aren't getting watched properly

Logs:
https://gist.github.com/secondsun/7abd69a12e5a393841c0edd8156dcc1d

You can see the difference in versions in the PR that downgrades them:
https://github.com/redhat-developer/app-services-operator/pull/288/files

wtrocki · 2021-11-05T10:51:11Z

CC @metacosm

secondsun · 2021-11-05T14:01:45Z

It is possible that this got fixed in 1.9.11 / 2.0.0 (java/quarkus sdks).

We were running with quarkus 2.0.0.CR2, which was based on an older version of the java SDK that had a bug where watches were not recreated after a timeout. I think this issue might come from that bug, in which case we can test, verify, and close.

metacosm · 2021-11-05T14:02:38Z

Which version is causing the issue? We've fixed an issue with watchers not being able to reconnect to the server in 1.9.11.

secondsun · 2021-11-05T14:04:23Z

We were using quarkus-sdk 2.0.0.CR2 which was before 1.9.11 was released. 2.0.0 looks like it is based on 1.9.11, correct?

metacosm · 2021-11-05T15:19:36Z

Yes, 2.0.0 is using 1.9.11.

secondsun · 2021-11-05T15:23:05Z

@wtrocki, @metacosm I think we can close this issue. I think that it is a duplicate of the bug that 1.9.11 fixes. If we see this with the 2.0.0 quarkus operator sdk we can reopen.

metacosm · 2021-11-08T13:51:37Z

Closing for now. Please re-open if you find that the issue is still present with the latest version.

wtrocki mentioned this issue Nov 5, 2021

fix: downgrade operator redhat-developer/app-services-operator#288

Closed

metacosm self-assigned this Nov 5, 2021

snowdrop-bot mentioned this issue Nov 5, 2021

After updating Operator operator doesn't seem to reconnect properly on kubernetes clusters snowdrop-zen/java-operator-sdk#87

Closed

metacosm closed this as completed Nov 8, 2021
