diff --git a/README.md b/README.md index 48568d19..c8094de5 100644 --- a/README.md +++ b/README.md @@ -41,13 +41,12 @@ A Framework represents an application with a set of Tasks: 4. With consistent identity {FrameworkName}-{TaskRoleName}-{TaskIndex} as PodName 5. With fine grained [RetryPolicy](doc/user-manual.md#RetryPolicy) for each Task and the whole Framework 6. With fine grained [FrameworkAttemptCompletionPolicy](doc/user-manual.md#FrameworkAttemptCompletionPolicy) for each TaskRole -7. Guarantees at most one instance of a specific Task is running at any point in time -8. Guarantees at most one instance of a specific Framework is running at any point in time +7. With PodGracefulDeletionTimeoutSec for each Task to [tune Consistency vs Availability](doc/user-manual.md#FrameworkConsistencyAvailability) ### Controller Feature 1. Highly generalized as it is built for all kinds of applications 2. Light-weight as it is only responsible for Pod orchestration -3. Well-defined Framework consistency, state machine and failure model +3. Well-defined Framework [Consistency vs Availability](doc/user-manual.md#FrameworkConsistencyAvailability), [State Machine](doc/user-manual.md#FrameworkTaskStateMachine) and [Failure Model](doc/user-manual.md#CompletionStatus) 4. Tolerate Pod/ConfigMap unexpected deletion, Node/Network/FrameworkController/Kubernetes failure 5. Support to specify how to [classify and summarize Pod failures](doc/user-manual.md#PodFailureClassification) 6. Support to expose [Framework and Pod history snapshots](doc/user-manual.md#FrameworkPodHistory) to external systems diff --git a/doc/user-manual.md b/doc/user-manual.md index 005aaefd..fa7f348e 100644 --- a/doc/user-manual.md +++ b/doc/user-manual.md @@ -9,6 +9,8 @@ - [RetryPolicy](#RetryPolicy) - [FrameworkAttemptCompletionPolicy](#FrameworkAttemptCompletionPolicy) - [Framework and Pod History](#FrameworkPodHistory) + - [Framework and Task State Machine](#FrameworkTaskStateMachine) + - [Framework Consistency vs Availability](#FrameworkConsistencyAvailability) - [Controller Extension](#ControllerExtension) - [FrameworkBarrier](#FrameworkBarrier) - [HivedScheduler](#HivedScheduler) @@ -116,7 +118,8 @@ Type: application/json or application/yaml Delete the specified Framework. Notes: -* If you need to ensure at most one instance of a specific Framework (identified by the FrameworkName) is running at any point in time, you should always use and only use the [Foreground Deletion](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion) in the provided body, see [Framework Notes](../pkg/apis/frameworkcontroller/v1/types.go). However, `kubectl delete` does not support to specify the Foreground Deletion at least for [Kubernetes v1.14.2](https://github.com/kubernetes/kubernetes/issues/66110#issuecomment-413761559), so you may have to use other [Supported Client](#SupportedClient). +* If you need to achieve all the [Framework ConsistencyGuarantees](#ConsistencyGuarantees) or achieve higher [Framework Availability](#FrameworkAvailability) by leveraging the [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go), you should always use and only use the [Foreground Deletion](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion) in the provided body. 
+* However, `kubectl delete` does not support specifying the Foreground Deletion, at least as of [Kubernetes v1.14.2](https://github.com/kubernetes/kubernetes/issues/66110#issuecomment-413761559), so you may have to use other [Supported Client](#SupportedClient).

**Response**

@@ -370,6 +373,87 @@ Notes:
## Framework and Pod History
By leveraging [LogObjectSnapshot](../pkg/apis/frameworkcontroller/v1/config.go), external systems, such as [Fluentd](https://www.fluentd.org) and [ElasticSearch](https://www.elastic.co/products/elasticsearch), can collect and process Framework and Pod history snapshots even if it was retried or deleted, such as persistence, metrics conversion, visualization, alerting, acting, analysis, etc.

+## Framework and Task State Machine
+### Framework State Machine
+[FrameworkState](../pkg/apis/frameworkcontroller/v1/types.go)
+
+### Task State Machine
+[TaskState](../pkg/apis/frameworkcontroller/v1/types.go)
+
+## Framework Consistency vs Availability
+### Framework Consistency
+#### ConsistencyGuarantees
+For a specific Task identified by {FrameworkName}-{TaskRoleName}-{TaskIndex}:
+
+- **ConsistencyGuarantee1**:
+
+  At most one instance of the Task is running at any point in time.
+
+- **ConsistencyGuarantee2**:
+
+  No instance of the Task is running if it is TaskAttemptCompleted, TaskCompleted or the whole Framework is deleted.
+
+For a specific Framework identified by {FrameworkName}:
+
+- **ConsistencyGuarantee3**:
+
+  At most one instance of the Framework is running at any point in time.
+
+- **ConsistencyGuarantee4**:
+
+  No instance of the Framework is running if it is FrameworkAttemptCompleted, FrameworkCompleted or the whole Framework is deleted.
+
+#### How to achieve ConsistencyGuarantees
+
+The default behavior is to achieve all the [ConsistencyGuarantees](#ConsistencyGuarantees), as long as you do not explicitly violate the guidelines below:
+
+1. Achieve **ConsistencyGuarantee1**:
+
+   Do not [force delete the managed Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/#force-deletion-of-pods):
+
+   1. Do not set [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go) to a non-nil value.
+
+      For example, the default PodGracefulDeletionTimeoutSec is acceptable.
+
+   2. Do not delete the managed Pod with [0 GracePeriodSeconds](https://kubernetes.io/docs/concepts/workloads/pods/pod/#force-deletion-of-pods).
+
+      For example, the default Pod deletion is acceptable.
+
+   3. Do not delete the Node which runs the managed Pod.
+
+      For example, [draining the Node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node) before deleting it is acceptable.
+
+   *The Task instance can be universally located by its [TaskAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [PodUID](../pkg/apis/frameworkcontroller/v1/types.go).*
+
+   *To avoid the Pod being stuck in deletion forever, such as when its Node is down forever, leverage the same approach as [Delete StatefulSet Pod only after the Pod termination has been confirmed](https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#delete-pods), either manually or by your [Cloud Controller Manager](https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager).*
+
+2. Achieve **ConsistencyGuarantee2**, **ConsistencyGuarantee3** and **ConsistencyGuarantee4**:
+   1. Achieve **ConsistencyGuarantee1**.
+
+   2. 
Must delete the managed ConfigMap with [Foreground PropagationPolicy](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion). + + For example, the default ConfigMap deletion is acceptable. + + 3. Must delete the Framework with [Foreground PropagationPolicy](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion). + + For example, the default Framework deletion may not be acceptable, since the default PropagationPolicy for Framework object may be Background. + + 4. Do not change the [OwnerReferences](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents) of the managed ConfigMap and Pods. + + *The Framework instance can be universally located by its [FrameworkAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [ConfigMapUID](../pkg/apis/frameworkcontroller/v1/types.go).* + +### Framework Availability +According to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), in the presence of a network partition, you cannot achieve both consistency and availability at the same time in any distributed system. So you have to make a trade-off between the [Framework Consistency](#FrameworkConsistency) and the [Framework Availability](#FrameworkAvailability). + +You can tune the trade-off, such as to achieve higher [Framework Availability](#FrameworkAvailability) by sacrificing the [Framework Consistency](#FrameworkConsistency): +1. Set a small [Pod TolerationSeconds for TaintBasedEvictions](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions) +2. Set a small [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go) +3. Violate other guidelines mentioned in [How to achieve ConsistencyGuarantees](#ConsistencyGuaranteesHowTo), such as manually force delete a problematic Pod. + +See more in: +1. [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go) +2. [Pod Safety and Consistency Guarantees](https://github.com/kubernetes/community/blob/ee8998b156031f6b363daade51ca2d12521f4ac0/contributors/design-proposals/storage/pod-safety.md) + ## Controller Extension ### FrameworkBarrier 1. [Usage](../pkg/barrier/barrier.go) diff --git a/pkg/apis/frameworkcontroller/v1/types.go b/pkg/apis/frameworkcontroller/v1/types.go index 8c8a3e40..bb4cdd4e 100644 --- a/pkg/apis/frameworkcontroller/v1/types.go +++ b/pkg/apis/frameworkcontroller/v1/types.go @@ -47,8 +47,7 @@ type FrameworkList struct { // 4. With consistent identity {FrameworkName}-{TaskRoleName}-{TaskIndex} as PodName // 5. With fine grained RetryPolicy for each Task and the whole Framework // 6. With fine grained FrameworkAttemptCompletionPolicy for each TaskRole -// 7. Guarantees at most one instance of a specific Task is running at any point in time -// 8. Guarantees at most one instance of a specific Framework is running at any point in time +// 7. With PodGracefulDeletionTimeoutSec for each Task to tune Consistency vs Availability // // Notes: // 1. Status field should only be modified by FrameworkController, and @@ -57,26 +56,6 @@ type FrameworkList struct { // Leverage CRD status subresource to isolate Status field modification with other fields. // This can help to avoid unintended modification, such as users may unintendedly modify // the status when updating the spec. -// 2. To ensure at most one instance of a specific Task is running at any point in time: -// 1. 
Do not delete the managed Pod with 0 gracePeriodSeconds.
-//       For example, the default Pod deletion is acceptable.
-//    2. Do not delete the Node which runs the managed Pod.
-//       For example, drain before delete the Node is acceptable.
-//    The instance can be universally located by its TaskAttemptInstanceUID or PodUID.
-//    See RetryPolicySpec and TaskAttemptStatus.
-// 3. To ensure at most one instance of a specific Framework is running at any point in time:
-//    1. Ensure ensure at most one instance of a specific Task is running at any point in time.
-//    2. Do not delete the managed ConfigMap with Background propagationPolicy.
-//       For example, the default ConfigMap deletion is acceptable.
-//    3. Must delete the Framework with Foreground propagationPolicy.
-//       For example, the default Framework deletion may not be acceptable, since the default
-//       propagationPolicy for Framework object may be Background.
-//    The instance can be universally located by its FrameworkAttemptInstanceUID or ConfigMapUID.
-//    See RetryPolicySpec and FrameworkAttemptStatus.
-// 4. To ensure there is no orphan object previously managed by FrameworkController:
-//    1. Do not delete the Framework or the managed ConfigMap with Orphan propagationPolicy.
-//       For example, the default Framework and ConfigMap deletion is acceptable.
-//    2. Do not change the OwnerReferences of the managed ConfigMap and Pods.
//////////////////////////////////////////////////////////////////////////////////////////////////
type Framework struct {
	meta.TypeMeta `json:",inline"`
@@ -107,8 +86,31 @@ type TaskRoleSpec struct {
}

type TaskSpec struct {
-	RetryPolicy RetryPolicySpec      `json:"retryPolicy"`
-	Pod         core.PodTemplateSpec `json:"pod"`
+	RetryPolicy RetryPolicySpec `json:"retryPolicy"`
+
+	// If the Task's currently associated Pod object is being deleted, i.e. graceful
+	// deletion, but the graceful deletion cannot finish within this timeout, then
+	// the Pod will be deleted forcefully by FrameworkController.
+	// Defaults to nil.
+	//
+	// If this timeout is not nil, the Pod may be deleted forcefully by FrameworkController.
+	// The force deletion does not wait for confirmation that the Pod has fully terminated,
+	// and the Task will be immediately transitioned to TaskAttemptCompleted.
+	// As a consequence, the Task will be immediately completed or retried with another
+	// new Pod, while the old Pod may still be running.
+	// So, in this setting, the Task behaves like a ReplicaSet; choose it if the Task
+	// favors availability over consistency, such as a stateless Task.
+	// However, to still execute graceful deletion on a best-effort basis and tolerate
+	// transient deletion failures, this timeout should be longer than the Pod
+	// TerminationGracePeriodSeconds plus the minimal TolerationSeconds for TaintBasedEvictions.
+	//
+	// If this timeout is nil, the Pod will always be deleted gracefully, i.e. never
+	// be deleted forcefully by FrameworkController. This helps to guarantee that at most
+	// one instance of a specific Task is running at any point in time.
+	// So, in this setting, the Task behaves like a StatefulSet; choose it if the Task
+	// favors consistency over availability, such as a stateful Task.
+	PodGracefulDeletionTimeoutSec *int64 `json:"podGracefulDeletionTimeoutSec"`
+	Pod core.PodTemplateSpec `json:"pod"`
}

type ExecutionType string
@@ -163,10 +165,9 @@ const (
// So, an attempt identified by its attempt id may be associated with multiple
// attempt instances over time, i.e. 
multiple instances may be run for the // attempt over time, however, at most one instance is exposed into ApiServer -// over time and at most one instance is running at any point in time. -// So, the actual retried attempt instances maybe exceed the RetryPolicySpec -// in rare cases, however, the RetryPolicyStatus will never exceed the -// RetryPolicySpec. +// over time. +// So, the actual retried attempt instances may exceed the RetryPolicySpec in +// rare cases, however, the RetryPolicyStatus will never exceed the RetryPolicySpec. // 2. Resort to other spec to control other kind of RetryPolicy: // 1. Container RetryPolicy is the RestartPolicy in Pod Spec. // See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy diff --git a/pkg/apis/frameworkcontroller/v1/zz_generated.deepcopy.go b/pkg/apis/frameworkcontroller/v1/zz_generated.deepcopy.go index 7cdc6bbd..6a743885 100644 --- a/pkg/apis/frameworkcontroller/v1/zz_generated.deepcopy.go +++ b/pkg/apis/frameworkcontroller/v1/zz_generated.deepcopy.go @@ -868,6 +868,11 @@ func (in *TaskRoleStatus) DeepCopy() *TaskRoleStatus { func (in *TaskSpec) DeepCopyInto(out *TaskSpec) { *out = *in out.RetryPolicy = in.RetryPolicy + if in.PodGracefulDeletionTimeoutSec != nil { + in, out := &in.PodGracefulDeletionTimeoutSec, &out.PodGracefulDeletionTimeoutSec + *out = new(int64) + **out = **in + } in.Pod.DeepCopyInto(&out.Pod) return } diff --git a/pkg/barrier/barrier.go b/pkg/barrier/barrier.go index 438696e8..01aa5e02 100644 --- a/pkg/barrier/barrier.go +++ b/pkg/barrier/barrier.go @@ -271,6 +271,7 @@ func (b *FrameworkBarrier) Run() { if isPermanentErr { exit(ci.CompletionCodeContainerPermanentFailed) } else { + // May also timeout, but still treat as Unknown Error exit(ci.CompletionCode(1)) } } diff --git a/pkg/common/utils.go b/pkg/common/utils.go index cbe266ae..750b0ac4 100644 --- a/pkg/common/utils.go +++ b/pkg/common/utils.go @@ -87,6 +87,14 @@ func PtrUIDStr(s string) *types.UID { return PtrUID(types.UID(s)) } +func PtrDeletionPropagation(o meta.DeletionPropagation) *meta.DeletionPropagation { + return &o +} + +func PtrTime(o meta.Time) *meta.Time { + return &o +} + func PtrNow() *meta.Time { now := meta.Now() return &now diff --git a/pkg/controller/controller.go b/pkg/controller/controller.go index c808ae31..11e6351e 100644 --- a/pkg/controller/controller.go +++ b/pkg/controller/controller.go @@ -514,75 +514,66 @@ func (c *FrameworkController) syncFramework(key string) (returnedErr error) { "Failed: Framework cannot be got from local cache: %v", err) } } else { - if localF.DeletionTimestamp != nil { - // Skip syncFramework to avoid fighting with GarbageCollectionController, - // because GarbageCollectionController may be deleting the dependent object. - klog.Infof(logPfx+ - "Skipped: Framework is deleting: Will be deleted at %v", - localF.DeletionTimestamp) - return nil + f := localF.DeepCopy() + // From now on, f is a writable copy of the original local cached one, and + // it may be different from the original one. + + expected := c.getExpectedFrameworkStatusInfo(f.Key()) + if expected == nil { + if f.Status != nil { + // Recover f related things, since it is the first time we see it and + // its Status is not nil. + // No need to recover previous enqueued items, because the Informer has + // already delivered the Add events for all recovered Frameworks which + // caused all Frameworks will be enqueued to sync. 
+ // No need to recover previous scheduled to enqueue items, because the + // schedule will be recovered during sync. + } + + // f.Status must be the same as the remote one, since it is the first + // time we see it. + c.updateExpectedFrameworkStatusInfo(f.Key(), f.Status, true) } else { - f := localF.DeepCopy() - // From now on, f is a writable copy of the original local cached one, and - // it may be different from the original one. - - expected := c.getExpectedFrameworkStatusInfo(f.Key()) - if expected == nil { - if f.Status != nil { - // Recover f related things, since it is the first time we see it and - // its Status is not nil. - // No need to recover previous enqueued items, because the Informer has - // already delivered the Add events for all recovered Frameworks which - // caused all Frameworks will be enqueued to sync. - // No need to recover previous scheduled to enqueue items, because the - // schedule will be recovered during sync. - } + // f.Status may be outdated, so override it with the expected one, to + // ensure the Framework.Status is Monotonically Exposed. + f.Status = expected.status - // f.Status must be the same as the remote one, since it is the first - // time we see it. - c.updateExpectedFrameworkStatusInfo(f.Key(), f.Status, true) - } else { - // f.Status may be outdated, so override it with the expected one, to - // ensure the Framework.Status is Monotonically Exposed. - f.Status = expected.status - - // Ensure the expected Framework.Status is the same as the remote one - // before sync. - if !expected.remoteSynced { - updateErr := c.updateRemoteFrameworkStatus(f) - if updateErr != nil { - return updateErr - } - c.updateExpectedFrameworkStatusInfo(f.Key(), f.Status, true) + // Ensure the expected Framework.Status is the same as the remote one + // before sync. + if !expected.remoteSynced { + updateErr := c.updateRemoteFrameworkStatus(f) + if updateErr != nil { + return updateErr } + c.updateExpectedFrameworkStatusInfo(f.Key(), f.Status, true) } + } - // At this point, f.Status is the same as the expected and remote - // Framework.Status, so it is ready to sync against f.Spec and other - // related objects. - errs := []error{} - remoteF := f.DeepCopy() + // At this point, f.Status is the same as the expected and remote + // Framework.Status, so it is ready to sync against f.Spec and other + // related objects. + errs := []error{} + remoteF := f.DeepCopy() - syncErr := c.syncFrameworkStatus(f) - errs = append(errs, syncErr) + syncErr := c.syncFrameworkStatus(f) + errs = append(errs, syncErr) - if !reflect.DeepEqual(remoteF.Status, f.Status) { - // Always update the expected and remote Framework.Status even if sync - // error, since f.Status should never be corrupted due to any Platform - // Transient Error, so no need to rollback to the one before sync, and - // no need to DeepCopy between f.Status and the expected one. - updateErr := c.updateRemoteFrameworkStatus(f) - errs = append(errs, updateErr) + if !reflect.DeepEqual(remoteF.Status, f.Status) { + // Always update the expected and remote Framework.Status even if sync + // error, since f.Status should never be corrupted due to any Platform + // Transient Error, so no need to rollback to the one before sync, and + // no need to DeepCopy between f.Status and the expected one. 
+ updateErr := c.updateRemoteFrameworkStatus(f) + errs = append(errs, updateErr) - c.updateExpectedFrameworkStatusInfo(f.Key(), f.Status, updateErr == nil) - } else { - klog.Infof(logPfx + - "Skip to update the expected and remote Framework.Status since " + - "they are unchanged") - } - - return errorAgg.NewAggregate(errs) + c.updateExpectedFrameworkStatusInfo(f.Key(), f.Status, updateErr == nil) + } else { + klog.Infof(logPfx + + "Skip to update the expected and remote Framework.Status since " + + "they are unchanged") } + + return errorAgg.NewAggregate(errs) } } @@ -645,6 +636,19 @@ func (c *FrameworkController) enqueueTaskRetryDelayTimeoutCheck( failIfTimeout, "TaskRetryDelayTimeoutCheck") } +func (c *FrameworkController) enqueuePodGracefulDeletionTimeoutCheck( + f *ci.Framework, taskRoleName string, + failIfTimeout bool, pod *core.Pod) bool { + taskSpec := f.TaskRoleSpec(taskRoleName).Task + if pod.DeletionTimestamp == nil { + return false + } + + return c.enqueueFrameworkTimeoutCheck( + f, *internal.GetPodDeletionStartTime(pod), taskSpec.PodGracefulDeletionTimeoutSec, + failIfTimeout, "PodGracefulDeletionTimeoutCheck") +} + func (c *FrameworkController) enqueueFrameworkTimeoutCheck( f *ci.Framework, startTime meta.Time, timeoutSec *int64, failIfTimeout bool, logSfx string) bool { @@ -693,7 +697,8 @@ func (c *FrameworkController) syncFrameworkState(f *ci.Framework) (err error) { if f.Status.State == ci.FrameworkCompleted { if c.enqueueFrameworkCompletedRetainTimeoutCheck(f, true) { - klog.Infof(logPfx + "Skipped: Framework is already completed") + klog.Infof(logPfx + "Skipped: Framework is already completed, " + + "and waiting to be deleted after FrameworkCompletedRetainSec") return nil } @@ -737,7 +742,7 @@ func (c *FrameworkController) syncFrameworkState(f *ci.Framework) (err error) { diag = fmt.Sprintf( "ConfigMap does not appear in the local cache within timeout %v, "+ - "so consider it was deleted and force delete it", + "so consider it was deleted and explicitly delete it", common.SecToDuration(c.cConfig.ObjectLocalCacheCreationTimeoutSec)) code = ci.CompletionCodeConfigMapCreationTimeout klog.Warningf(logPfx + diag) @@ -887,6 +892,12 @@ func (c *FrameworkController) syncFrameworkState(f *ci.Framework) (err error) { // FrameworkAttemptDeleting} if f.Status.State == ci.FrameworkAttemptCreationPending { + if f.DeletionTimestamp != nil { + klog.Infof(logPfx + "Skip to createFrameworkAttempt: " + + "Framework is deleting") + return nil + } + if f.Spec.ExecutionType == ci.ExecutionStop { diag := fmt.Sprintf("User has requested to stop the Framework") klog.Infof(logPfx + diag) @@ -965,22 +976,22 @@ func (c *FrameworkController) syncFrameworkState(f *ci.Framework) (err error) { } func (c *FrameworkController) deleteFramework( - f *ci.Framework, force bool) error { + f *ci.Framework, confirm bool) error { errPfx := fmt.Sprintf( - "[%v]: Failed to delete Framework %v: force: %v: ", - f.Key(), f.UID, force) + "[%v]: Failed to delete Framework %v: confirm: %v: ", + f.Key(), f.UID, confirm) - // Do not set zero GracePeriodSeconds to do force deletion in any case, since - // it will also immediately delete Pod in PodUnknown state, while the Pod may - // be still running. 
deleteErr := c.fClient.FrameworkcontrollerV1().Frameworks(f.Namespace).Delete( - f.Name, &meta.DeleteOptions{Preconditions: &meta.Preconditions{UID: &f.UID}}) + f.Name, &meta.DeleteOptions{ + Preconditions: &meta.Preconditions{UID: &f.UID}, + PropagationPolicy: common.PtrDeletionPropagation(meta.DeletePropagationForeground), + }) if deleteErr != nil { if !apiErrors.IsNotFound(deleteErr) { return fmt.Errorf(errPfx+"%v", deleteErr) } } else { - if force { + if confirm { // Confirm it is deleted instead of still deleting. remoteF, getErr := c.fClient.FrameworkcontrollerV1().Frameworks(f.Namespace).Get( f.Name, meta.GetOptions{}) @@ -1000,8 +1011,8 @@ func (c *FrameworkController) deleteFramework( } klog.Infof( - "[%v]: Succeeded to delete Framework %v: force: %v", - f.Key(), f.UID, force) + "[%v]: Succeeded to delete Framework %v: confirm: %v", + f.Key(), f.UID, confirm) return nil } @@ -1012,10 +1023,11 @@ func (c *FrameworkController) deleteFramework( // Clean up instead of recovery is because the ConfigMapUID is always the ground // truth. func (c *FrameworkController) getOrCleanupConfigMap( - f *ci.Framework, force bool) (cm *core.ConfigMap, err error) { + f *ci.Framework, confirm bool) (cm *core.ConfigMap, err error) { + logPfx := fmt.Sprintf("[%v]: getOrCleanupConfigMap: ", f.Key()) cmName := f.ConfigMapName() - if force { + if confirm { cm, err = c.kClient.CoreV1().ConfigMaps(f.Namespace).Get(cmName, meta.GetOptions{}) } else { @@ -1026,9 +1038,9 @@ func (c *FrameworkController) getOrCleanupConfigMap( if apiErrors.IsNotFound(err) { return nil, nil } else { - return nil, fmt.Errorf( - "[%v]: Failed to get ConfigMap %v: force: %v: %v", - f.Key(), cmName, force, err) + return nil, fmt.Errorf(logPfx+ + "Failed to get ConfigMap %v: confirm: %v: %v", + cmName, confirm, err) } } @@ -1039,11 +1051,18 @@ func (c *FrameworkController) getOrCleanupConfigMap( // is failed to persist due to FrameworkController restart or create fails // but succeeds on remote, so clean up the ConfigMap to avoid unmanaged cm // leak. - return nil, c.deleteConfigMap(f, cm.UID, force) + klog.Warningf(logPfx+ + "Found unmanaged but controlled ConfigMap, so explicitly delete it: %v, %v", + cm.Name, cm.UID) + return nil, c.deleteConfigMap(f, cm.UID, confirm) } else { // Do not own and manage the life cycle of not controlled object, so still // consider the get and controlled object clean up is success, and postpone // the potential naming conflict when creating the controlled object. + klog.Warningf(logPfx+ + "Found unmanaged and uncontrolled ConfigMap, and it may be naming conflict "+ + "with the controlled ConfigMap to be created: %v, %v", + cm.Name, cm.UID) return nil, nil } } else { @@ -1055,15 +1074,12 @@ func (c *FrameworkController) getOrCleanupConfigMap( // Using UID to ensure we delete the right object. // The cmUID should be controlled by f. func (c *FrameworkController) deleteConfigMap( - f *ci.Framework, cmUID types.UID, force bool) error { + f *ci.Framework, cmUID types.UID, confirm bool) error { cmName := f.ConfigMapName() errPfx := fmt.Sprintf( - "[%v]: Failed to delete ConfigMap %v, %v: force: %v: ", - f.Key(), cmName, cmUID, force) + "[%v]: Failed to delete ConfigMap %v, %v: confirm: %v: ", + f.Key(), cmName, cmUID, confirm) - // Do not set zero GracePeriodSeconds to do force deletion in any case, since - // it will also immediately delete Pod in PodUnknown state, while the Pod may - // be still running. 
deleteErr := c.kClient.CoreV1().ConfigMaps(f.Namespace).Delete(cmName, &meta.DeleteOptions{Preconditions: &meta.Preconditions{UID: &cmUID}}) if deleteErr != nil { @@ -1071,7 +1087,7 @@ func (c *FrameworkController) deleteConfigMap( return fmt.Errorf(errPfx+"%v", deleteErr) } } else { - if force { + if confirm { // Confirm it is deleted instead of still deleting. cm, getErr := c.kClient.CoreV1().ConfigMaps(f.Namespace).Get(cmName, meta.GetOptions{}) @@ -1091,8 +1107,8 @@ func (c *FrameworkController) deleteConfigMap( } klog.Infof( - "[%v]: Succeeded to delete ConfigMap %v, %v: force: %v", - f.Key(), cmName, cmUID, force) + "[%v]: Succeeded to delete ConfigMap %v, %v: confirm: %v", + f.Key(), cmName, cmUID, confirm) return nil } @@ -1195,13 +1211,13 @@ func (c *FrameworkController) syncTaskState( diag := fmt.Sprintf( "Pod does not appear in the local cache within timeout %v, "+ - "so consider it was deleted and force delete it", + "so consider it was deleted and explicitly delete it", common.SecToDuration(c.cConfig.ObjectLocalCacheCreationTimeoutSec)) klog.Warningf(logPfx + diag) // Ensure pod is deleted in remote to avoid managed pod leak after // TaskAttemptCompleted. - err := c.deletePod(f, taskRoleName, taskIndex, *taskStatus.PodUID(), true) + err := c.deletePod(f, taskRoleName, taskIndex, *taskStatus.PodUID(), true, false) if err != nil { return err } @@ -1230,7 +1246,7 @@ func (c *FrameworkController) syncTaskState( if taskStatus.State == ci.TaskAttemptDeletionPending { // The CompletionStatus has been persisted, so it is safe to delete the // pod now. - err := c.deletePod(f, taskRoleName, taskIndex, *taskStatus.PodUID(), false) + err := c.deletePod(f, taskRoleName, taskIndex, *taskStatus.PodUID(), false, false) if err != nil { return err } @@ -1315,8 +1331,7 @@ func (c *FrameworkController) syncTaskState( } f.TransitionTaskState(taskRoleName, taskIndex, ci.TaskAttemptDeleting) - klog.Infof(logPfx + "Waiting Pod to be deleted") - return nil + return c.handlePodGracefulDeletion(f, taskRoleName, taskIndex, pod) } } } @@ -1517,6 +1532,33 @@ func (c *FrameworkController) syncTaskState( taskStatus.State)) } +// The pod should be controlled by f's cm. +func (c *FrameworkController) handlePodGracefulDeletion( + f *ci.Framework, taskRoleName string, taskIndex int32, pod *core.Pod) error { + logPfx := fmt.Sprintf("[%v][%v][%v]: handlePodGracefulDeletion: ", + f.Key(), taskRoleName, taskIndex) + taskSpec := f.TaskRoleSpec(taskRoleName).Task + + if pod.DeletionTimestamp == nil { + return nil + } + if taskSpec.PodGracefulDeletionTimeoutSec == nil { + klog.Infof(logPfx + "Waiting Pod to be deleted") + return nil + } + if c.enqueuePodGracefulDeletionTimeoutCheck(f, taskRoleName, true, pod) { + klog.Infof(logPfx + "Waiting Pod to be deleted or timeout") + return nil + } + + klog.Warningf(logPfx+ + "Pod cannot be deleted within timeout %v, so force delete it", + common.SecToDuration(taskSpec.PodGracefulDeletionTimeoutSec)) + // Always confirm the force deletion to expose the failure that even force + // deletion cannot delete the Pod, such as the Pod Finalizers is not empty. + return c.deletePod(f, taskRoleName, taskIndex, pod.UID, true, true) +} + // Get Task's current Pod object, if not found, then clean up existing // controlled Pod if any. // Returned pod is either managed or nil, if it is the managed pod, it is not @@ -1524,11 +1566,13 @@ func (c *FrameworkController) syncTaskState( // Clean up instead of recovery is because the PodUID is always the ground truth. 
func (c *FrameworkController) getOrCleanupPod( f *ci.Framework, cm *core.ConfigMap, - taskRoleName string, taskIndex int32, force bool) (pod *core.Pod, err error) { + taskRoleName string, taskIndex int32, confirm bool) (pod *core.Pod, err error) { + logPfx := fmt.Sprintf("[%v][%v][%v]: getOrCleanupPod: ", + f.Key(), taskRoleName, taskIndex) taskStatus := f.TaskStatus(taskRoleName, taskIndex) podName := taskStatus.PodName() - if force { + if confirm { pod, err = c.kClient.CoreV1().Pods(f.Namespace).Get(podName, meta.GetOptions{}) } else { @@ -1539,9 +1583,9 @@ func (c *FrameworkController) getOrCleanupPod( if apiErrors.IsNotFound(err) { return nil, nil } else { - return nil, fmt.Errorf( - "[%v][%v][%v]: Failed to get Pod %v: force: %v: %v", - f.Key(), taskRoleName, taskIndex, podName, force, err) + return nil, fmt.Errorf(logPfx+ + "Failed to get Pod %v: confirm: %v: %v", + podName, confirm, err) } } @@ -1551,11 +1595,24 @@ func (c *FrameworkController) getOrCleanupPod( // The managed Pod becomes unmanaged if and only if Framework.Status // is failed to persist due to FrameworkController restart or create fails // but succeeds on remote, so clean up the Pod to avoid unmanaged pod leak. - return nil, c.deletePod(f, taskRoleName, taskIndex, pod.UID, force) + klog.Warningf(logPfx+ + "Found unmanaged but controlled Pod, so explicitly delete it: %v, %v", + pod.Name, pod.UID) + if pod.DeletionTimestamp != nil { + err = c.handlePodGracefulDeletion(f, taskRoleName, taskIndex, pod) + if err != nil { + return nil, err + } + } + return nil, c.deletePod(f, taskRoleName, taskIndex, pod.UID, confirm, false) } else { // Do not own and manage the life cycle of not controlled object, so still // consider the get and controlled object clean up is success, and postpone // the potential naming conflict when creating the controlled object. + klog.Warningf(logPfx+ + "Found unmanaged and uncontrolled Pod, and it may be naming conflict "+ + "with the controlled Pod to be created: %v, %v", + pod.Name, pod.UID) return nil, nil } } else { @@ -1565,27 +1622,26 @@ func (c *FrameworkController) getOrCleanupPod( } // Using UID to ensure we delete the right object. -// The podUID should be controlled by cm. +// The podUID should be controlled by f's cm. func (c *FrameworkController) deletePod( f *ci.Framework, taskRoleName string, taskIndex int32, - podUID types.UID, force bool) error { - taskStatus := f.TaskStatus(taskRoleName, taskIndex) - podName := taskStatus.PodName() + podUID types.UID, confirm bool, force bool) error { + podName := f.TaskStatus(taskRoleName, taskIndex).PodName() errPfx := fmt.Sprintf( - "[%v][%v][%v]: Failed to delete Pod %v, %v: force: %v: ", - f.Key(), taskRoleName, taskIndex, podName, podUID, force) - - // Do not set zero GracePeriodSeconds to do force deletion in any case, since - // it will also immediately delete Pod in PodUnknown state, while the Pod may - // be still running. 
- deleteErr := c.kClient.CoreV1().Pods(f.Namespace).Delete(podName, - &meta.DeleteOptions{Preconditions: &meta.Preconditions{UID: &podUID}}) + "[%v][%v][%v]: Failed to delete Pod %v, %v: confirm: %v, force: %v: ", + f.Key(), taskRoleName, taskIndex, podName, podUID, confirm, force) + + deleteOptions := &meta.DeleteOptions{Preconditions: &meta.Preconditions{UID: &podUID}} + if force { + deleteOptions.GracePeriodSeconds = common.PtrInt64(0) + } + deleteErr := c.kClient.CoreV1().Pods(f.Namespace).Delete(podName, deleteOptions) if deleteErr != nil { if !apiErrors.IsNotFound(deleteErr) { return fmt.Errorf(errPfx+"%v", deleteErr) } } else { - if force { + if confirm { // Confirm it is deleted instead of still deleting. pod, getErr := c.kClient.CoreV1().Pods(f.Namespace).Get(podName, meta.GetOptions{}) @@ -1605,8 +1661,8 @@ func (c *FrameworkController) deletePod( } klog.Infof( - "[%v][%v][%v]: Succeeded to delete Pod %v, %v: force: %v", - f.Key(), taskRoleName, taskIndex, podName, podUID, force) + "[%v][%v][%v]: Succeeded to delete Pod %v, %v: confirm: %v, force: %v", + f.Key(), taskRoleName, taskIndex, podName, podUID, confirm, force) return nil } diff --git a/pkg/internal/utils.go b/pkg/internal/utils.go index 0b2942dd..c5ce04fe 100644 --- a/pkg/internal/utils.go +++ b/pkg/internal/utils.go @@ -38,6 +38,7 @@ import ( "k8s.io/client-go/tools/cache" "k8s.io/klog" "reflect" + "time" ) func CreateClients(kConfig *rest.Config) ( @@ -220,3 +221,17 @@ func ToPod(obj interface{}) *core.Pod { return pod } + +func GetPodDeletionStartTime(pod *core.Pod) *meta.Time { + if pod.DeletionTimestamp == nil { + return nil + } + + var gracePeriod time.Duration + if pod.DeletionGracePeriodSeconds == nil { + gracePeriod = time.Duration(0) + } else { + gracePeriod = common.SecToDuration(pod.DeletionGracePeriodSeconds) + } + return common.PtrTime(meta.NewTime(pod.DeletionTimestamp.Add(-gracePeriod))) +}
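To illustrate how the new `PodGracefulDeletionTimeoutSec` field and the `GetPodDeletionStartTime` helper fit together, here is a minimal, self-contained sketch of the timeout decision only. It is not the controller's actual wiring (the real code also enqueues a `PodGracefulDeletionTimeoutCheck` and confirms the force deletion afterwards), and the helper names `podDeletionStartTime`/`shouldForceDelete`, the Pod values, and the 600s timeout are hypothetical:

```go
package main

import (
	"fmt"
	"time"

	core "k8s.io/api/core/v1"
	meta "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podDeletionStartTime mirrors internal.GetPodDeletionStartTime: the Pod's
// DeletionTimestamp is the graceful deletion deadline, so the deletion start
// time is DeletionTimestamp minus DeletionGracePeriodSeconds.
func podDeletionStartTime(pod *core.Pod) *time.Time {
	if pod.DeletionTimestamp == nil {
		return nil
	}
	grace := time.Duration(0)
	if pod.DeletionGracePeriodSeconds != nil {
		grace = time.Duration(*pod.DeletionGracePeriodSeconds) * time.Second
	}
	start := pod.DeletionTimestamp.Add(-grace)
	return &start
}

// shouldForceDelete reports whether a graceful deletion is in progress, a
// PodGracefulDeletionTimeoutSec is configured, and that timeout has elapsed
// since the deletion started. A nil timeout means "wait forever", i.e. favor
// consistency over availability.
func shouldForceDelete(pod *core.Pod, podGracefulDeletionTimeoutSec *int64, now time.Time) bool {
	start := podDeletionStartTime(pod)
	if start == nil || podGracefulDeletionTimeoutSec == nil {
		return false
	}
	timeout := time.Duration(*podGracefulDeletionTimeoutSec) * time.Second
	return now.Sub(*start) >= timeout
}

func main() {
	// A Pod whose graceful deletion started about 10 minutes ago and has not finished.
	grace := int64(30)
	deadline := meta.NewTime(time.Now().Add(-10 * time.Minute))
	pod := &core.Pod{
		ObjectMeta: meta.ObjectMeta{
			Name:                       "demo-worker-0",
			DeletionTimestamp:          &deadline,
			DeletionGracePeriodSeconds: &grace,
		},
	}

	timeout := int64(600) // hypothetical PodGracefulDeletionTimeoutSec
	fmt.Println("force delete after timeout:", shouldForceDelete(pod, &timeout, time.Now()))
	fmt.Println("wait forever when nil:", shouldForceDelete(pod, nil, time.Now()))
}
```

With a nil timeout the sketch never reports a force deletion, which matches the consistency-favoring, StatefulSet-like default described in the TaskSpec comment above.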