Commit b6425b4

docs: improvements on reschedule / retry behavior (#1809)
1 parent df38d38 commit b6425b4

File tree: docs/documentation/features.md — 1 file changed (36 additions, 19 deletions)
@@ -369,13 +369,29 @@ these features:
 1. A successful execution resets a retry and the rescheduled executions which were present before
    the reconciliation. However, a new rescheduling can be instructed from the reconciliation
    outcome (`UpdateControl` or `DeleteControl`).
+
+   For example, if a reconciliation had previously been re-scheduled to run after some delay, but an event
+   triggered the reconciliation (or cleanup) in the meantime, the scheduled execution is automatically
+   cancelled. In other words, re-scheduling a reconciliation does not guarantee that one will occur exactly
+   at that time; it guarantees that one will occur at that time at the latest, triggering one only if no
+   event from the cluster already triggered one. Of course, it is always possible to re-schedule a new
+   reconciliation at the end of that "automatic" reconciliation.
+
+   Similarly, if a retry was scheduled, any event from the cluster that triggers a successful execution in
+   the meantime cancels the scheduled retry (because there is no point in retrying something that already
+   succeeded).
+
 2. In case an exception happened, a retry is initiated. However, if an event is received
    meanwhile, it will be reconciled instantly, and this execution won't count as a retry attempt.
 3. If the retry limit is reached (so no more automatic retry would happen), but a new event is
    received, the reconciliation will still happen, but won't reset the retry, and will still be
    marked as the last attempt in the retry info. Point (1) still holds, but in case of an
    error, no retry will happen.
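JOSDK's retry support (for example its `GenericRetry` implementation) applies an exponentially growing delay between the attempts described above. As an illustration of how such a delay is typically computed, here is a small self-contained sketch; the initial interval and multiplier values are illustrative assumptions, not necessarily the library's defaults:

```java
// Sketch of exponential-backoff delay computation, similar in spirit to
// JOSDK's GenericRetry. The values 2000 ms and 1.5 are illustrative only.
public class BackoffSketch {

    // Delay before retry attempt N (1-based): initial * multiplier^(N - 1).
    static long delayMillis(int attempt, long initialMillis, double multiplier) {
        return (long) (initialMillis * Math.pow(multiplier, attempt - 1));
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.println("attempt " + attempt + " -> "
                    + delayMillis(attempt, 2000, 1.5) + " ms");
        }
    }
}
```

A successful reconciliation resets this sequence, so the next failure starts again from the first attempt's delay.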

+The thing to keep in mind when it comes to retrying or rescheduling is that JOSDK tries to avoid unnecessary
+work. When you reschedule an operation, you instruct JOSDK to perform that operation at the latest by the end
+of the rescheduling delay. If something occurs on the cluster that triggers that particular operation
+(reconciliation or cleanup) before the delay expires, JOSDK considers that there is no point in attempting
+that operation again at the end of the specified delay. The same idea also applies to retries.
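In code, instructing such a rescheduling from the reconciliation outcome looks roughly like the following sketch; `WebPage` stands in for any custom resource type, and the 10-second delay is an arbitrary illustrative value:

```java
import java.time.Duration;

import io.javaoperatorsdk.operator.api.reconciler.Context;
import io.javaoperatorsdk.operator.api.reconciler.Reconciler;
import io.javaoperatorsdk.operator.api.reconciler.UpdateControl;

public class WebPageReconciler implements Reconciler<WebPage> {

  @Override
  public UpdateControl<WebPage> reconcile(WebPage resource, Context<WebPage> context) {
    // ... reconcile the desired state ...

    // Ask JOSDK to run another reconciliation at most 10 seconds from now;
    // a cluster event arriving earlier triggers a reconciliation and cancels
    // this schedule, as described above.
    return UpdateControl.<WebPage>noUpdate().rescheduleAfter(Duration.ofSeconds(10));
  }
}
```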
 ## Rate Limiting

 It is possible to rate limit reconciliation on a per-resource basis. The rate limit also takes
@@ -611,15 +627,15 @@ Logging is enhanced with additional contextual information using
 [MDC](http://www.slf4j.org/manual.html#mdc). The following attributes are available in most
 parts of reconciliation logic and during the execution of the controller:

-| MDC Key | Value added from primary resource |
-| :--- |:----------------------------------|
-| `resource.apiVersion` | `.apiVersion` |
-| `resource.kind` | `.kind` |
-| `resource.name` | `.metadata.name` |
-| `resource.namespace` | `.metadata.namespace` |
-| `resource.resourceVersion` | `.metadata.resourceVersion` |
-| `resource.generation` | `.metadata.generation` |
-| `resource.uid` | `.metadata.uid` |
+| MDC Key                    | Value added from primary resource |
+|:---------------------------|:----------------------------------|
+| `resource.apiVersion`      | `.apiVersion`                     |
+| `resource.kind`            | `.kind`                           |
+| `resource.name`            | `.metadata.name`                  |
+| `resource.namespace`       | `.metadata.namespace`             |
+| `resource.resourceVersion` | `.metadata.resourceVersion`       |
+| `resource.generation`      | `.metadata.generation`            |
+| `resource.uid`             | `.metadata.uid`                   |

 For more information about MDC see this [link](https://www.baeldung.com/mdc-in-log4j-2-logback).
@@ -688,27 +704,28 @@ for this feature.

 ## Leader Election

 Operators are generally deployed with a single running or active instance. However, it is
 possible to deploy multiple instances in such a way that only one, called the "leader", processes
 the events. This is achieved via a mechanism called "leader election". While all the instances are
 running, and even start their event sources to populate the caches, only the leader will process
 the events. This means that should the leader change for any reason, for example because it
 crashed, the other instances are already warmed up and ready to pick up where the previous
 leader left off should one of them become elected leader.

 See sample configuration in
 the [E2E test](https://github.com/java-operator-sdk/java-operator-sdk/blob/8865302ac0346ee31f2d7b348997ec2913d5922b/sample-operators/leader-election/src/main/java/io/javaoperatorsdk/operator/sample/LeaderElectionTestOperator.java#L21-L23).
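Based on the linked E2E test, enabling leader election amounts to passing a `LeaderElectionConfiguration` when building the operator. A minimal sketch, where the lease name `"leader-election-test"` and namespace `"default"` are illustrative values:

```java
import io.javaoperatorsdk.operator.Operator;
import io.javaoperatorsdk.operator.api.config.LeaderElectionConfiguration;

// Sketch: only the instance holding the lease will process events;
// the others keep their caches warm and stand by.
Operator operator = new Operator(overrider -> overrider
    .withLeaderElectionConfiguration(
        new LeaderElectionConfiguration("leader-election-test", "default")));
// ... register reconcilers, then:
operator.start();
```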

 ## Runtime Info

 [RuntimeInfo](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/RuntimeInfo.java#L16-L16)
 is used mainly to check the actual health of event sources. Based on this information, it is easy to implement custom
 liveness probes.

 The [stopOnInformerErrorDuringStartup](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/config/ConfigurationService.java#L168-L168)
 setting usually needs to be set to `false` in order to control the exact liveness properties.

 See also an example implementation in the
 [WebPage sample](https://github.com/java-operator-sdk/java-operator-sdk/blob/3e2e7c4c834ef1c409d636156b988125744ca911/sample-operators/webpage/src/main/java/io/javaoperatorsdk/operator/sample/WebPageOperator.java#L38-L43).
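A liveness check built on this information could look like the following sketch; the HTTP endpoint wiring is omitted, and the exact `RuntimeInfo` accessors should be checked against the version linked above:

```java
import io.javaoperatorsdk.operator.Operator;

// Sketch: report liveness based on operator startup and event-source health.
Operator operator = new Operator();
// ... register reconcilers, then operator.start() ...

// Expose this boolean from a liveness probe endpoint:
boolean live = operator.getRuntimeInfo().isStarted()
    && operator.getRuntimeInfo().allEventSourcesAreHealthy();
```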
## Automatic Generation of CRDs
