This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Support PodGracefulDeletionTimeoutSec to tune Framework Consistency vs Availability #43

Merged 8 commits on Sep 19, 2019
5 changes: 2 additions & 3 deletions README.md
@@ -41,13 +41,12 @@ A Framework represents an application with a set of Tasks:
4. With consistent identity {FrameworkName}-{TaskRoleName}-{TaskIndex} as PodName
5. With fine grained [RetryPolicy](doc/user-manual.md#RetryPolicy) for each Task and the whole Framework
6. With fine grained [FrameworkAttemptCompletionPolicy](doc/user-manual.md#FrameworkAttemptCompletionPolicy) for each TaskRole
7. Guarantees at most one instance of a specific Task is running at any point in time
8. Guarantees at most one instance of a specific Framework is running at any point in time
7. With PodGracefulDeletionTimeoutSec for each Task to [tune Consistency vs Availability](doc/user-manual.md#FrameworkConsistencyAvailability)

### Controller Feature
1. Highly generalized as it is built for all kinds of applications
2. Light-weight as it is only responsible for Pod orchestration
3. Well-defined Framework consistency, state machine and failure model
3. Well-defined Framework [Consistency vs Availability](doc/user-manual.md#FrameworkConsistencyAvailability), [State Machine](doc/user-manual.md#FrameworkTaskStateMachine) and [Failure Model](doc/user-manual.md#CompletionStatus)
4. Tolerate Pod/ConfigMap unexpected deletion, Node/Network/FrameworkController/Kubernetes failure
5. Support specifying how to [classify and summarize Pod failures](doc/user-manual.md#PodFailureClassification)
6. Support exposing [Framework and Pod history snapshots](doc/user-manual.md#FrameworkPodHistory) to external systems
86 changes: 85 additions & 1 deletion doc/user-manual.md
@@ -9,6 +9,8 @@
- [RetryPolicy](#RetryPolicy)
- [FrameworkAttemptCompletionPolicy](#FrameworkAttemptCompletionPolicy)
- [Framework and Pod History](#FrameworkPodHistory)
- [Framework and Task State Machine](#FrameworkTaskStateMachine)
- [Framework Consistency vs Availability](#FrameworkConsistencyAvailability)
- [Controller Extension](#ControllerExtension)
- [FrameworkBarrier](#FrameworkBarrier)
- [HivedScheduler](#HivedScheduler)
@@ -116,7 +118,8 @@ Type: application/json or application/yaml
Delete the specified Framework.

Notes:
* If you need to ensure at most one instance of a specific Framework (identified by the FrameworkName) is running at any point in time, you should always use and only use the [Foreground Deletion](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion) in the provided body, see [Framework Notes](../pkg/apis/frameworkcontroller/v1/types.go). However, `kubectl delete` does not support to specify the Foreground Deletion at least for [Kubernetes v1.14.2](https://github.com/kubernetes/kubernetes/issues/66110#issuecomment-413761559), so you may have to use other [Supported Client](#SupportedClient).
* If you need to achieve all the [Framework ConsistencyGuarantees](#ConsistencyGuarantees), or to achieve higher [Framework Availability](#FrameworkAvailability) by leveraging [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go), you should always and only use [Foreground Deletion](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion) in the provided body.
* However, `kubectl delete` does not support specifying Foreground Deletion, at least as of [Kubernetes v1.14.2](https://github.com/kubernetes/kubernetes/issues/66110#issuecomment-413761559), so you may have to use another [Supported Client](#SupportedClient).
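
As an illustration of the note above, here is a minimal sketch of a delete request that carries Foreground Deletion in its body, built with the Go standard library only. The `DeleteOptions` fields (`kind`, `apiVersion`, `propagationPolicy`) follow the Kubernetes API; the API server address, the Framework resource path, and the helper names are assumptions made for this example, so check them against your cluster before use. The request is only constructed here, not sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// deleteOptions mirrors only the subset of the Kubernetes DeleteOptions
// object that is needed to request Foreground Deletion.
type deleteOptions struct {
	Kind              string `json:"kind"`
	APIVersion        string `json:"apiVersion"`
	PropagationPolicy string `json:"propagationPolicy"`
}

// foregroundDeleteBody returns the JSON body asking for Foreground Deletion.
func foregroundDeleteBody() ([]byte, error) {
	return json.Marshal(deleteOptions{
		Kind:              "DeleteOptions",
		APIVersion:        "v1",
		PropagationPolicy: "Foreground",
	})
}

// newForegroundDeleteRequest builds (but does not send) the DELETE request.
// The resource path below is an assumption for this sketch.
func newForegroundDeleteRequest(apiServer, namespace, name string) (*http.Request, error) {
	body, err := foregroundDeleteBody()
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf(
		"%s/apis/frameworkcontroller.microsoft.com/v1/namespaces/%s/frameworks/%s",
		apiServer, namespace, name)
	req, err := http.NewRequest(http.MethodDelete, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Hypothetical address, for example a local `kubectl proxy`.
	req, err := newForegroundDeleteRequest("http://localhost:8001", "default", "myframework")
	if err != nil {
		panic(err)
	}
	body, err := foregroundDeleteBody()
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(string(body))
}
```

A real client would also need authentication and would send the request with an `http.Client`; those parts are left out to keep the sketch focused on the deletion body.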

**Response**

@@ -370,6 +373,87 @@ Notes:
## <a name="FrameworkPodHistory">Framework and Pod History</a>
By leveraging [LogObjectSnapshot](../pkg/apis/frameworkcontroller/v1/config.go), external systems, such as [Fluentd](https://www.fluentd.org) and [ElasticSearch](https://www.elastic.co/products/elasticsearch), can collect and process Framework and Pod history snapshots even if the Framework or Pod was retried or deleted, for purposes such as persistence, metrics conversion, visualization, alerting, acting, analysis, etc.

## <a name="FrameworkTaskStateMachine">Framework and Task State Machine</a>
### <a name="FrameworkStateMachine">Framework State Machine</a>
[FrameworkState](../pkg/apis/frameworkcontroller/v1/types.go)

### <a name="TaskStateMachine">Task State Machine</a>
[TaskState](../pkg/apis/frameworkcontroller/v1/types.go)

## <a name="FrameworkConsistencyAvailability">Framework Consistency vs Availability</a>
### <a name="FrameworkConsistency">Framework Consistency</a>
#### <a name="ConsistencyGuarantees">ConsistencyGuarantees</a>
For a specific Task identified by {FrameworkName}-{TaskRoleName}-{TaskIndex}:

- **ConsistencyGuarantee1**:

At most one instance of the Task is running at any point in time.

- **ConsistencyGuarantee2**:

No instance of the Task is running if it is TaskAttemptCompleted, TaskCompleted or the whole Framework is deleted.

For a specific Framework identified by {FrameworkName}:

- **ConsistencyGuarantee3**:

At most one instance of the Framework is running at any point in time.

- **ConsistencyGuarantee4**:

No instance of the Framework is running if it is FrameworkAttemptCompleted, FrameworkCompleted or the whole Framework is deleted.

#### <a name="ConsistencyGuaranteesHowTo">How to achieve ConsistencyGuarantees</a>

By default, all the [ConsistencyGuarantees](#ConsistencyGuarantees) are achieved, as long as you do not explicitly violate the guidelines below:

1. Achieve **ConsistencyGuarantee1**:

Do not [force delete the managed Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/#force-deletion-of-pods):

1. Do not set [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go) to a non-nil value.

For example, the default PodGracefulDeletionTimeoutSec is acceptable.

2. Do not delete the managed Pod with [0 GracePeriodSeconds](https://kubernetes.io/docs/concepts/workloads/pods/pod/#force-deletion-of-pods).

For example, the default Pod deletion is acceptable.

3. Do not delete the Node which runs the managed Pod.

For example, [draining the Node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node) before deleting it is acceptable.

*The Task instance can be universally located by its [TaskAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [PodUID](../pkg/apis/frameworkcontroller/v1/types.go).*

*To avoid a Pod being stuck in deletion forever, for example if its Node is down forever, leverage the same approach as [Delete StatefulSet Pod only after the Pod termination has been confirmed](https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#delete-pods), either manually or through your [Cloud Controller Manager](https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager).*

2. Achieve **ConsistencyGuarantee2**, **ConsistencyGuarantee3** and **ConsistencyGuarantee4**:
1. Achieve **ConsistencyGuarantee1**.

2. Must delete the managed ConfigMap with [Foreground PropagationPolicy](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion).

For example, the default ConfigMap deletion is acceptable.

3. Must delete the Framework with [Foreground PropagationPolicy](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion).

For example, the default Framework deletion may not be acceptable, since the default PropagationPolicy for Framework object may be Background.

4. Do not change the [OwnerReferences](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents) of the managed ConfigMap and Pods.

*The Framework instance can be universally located by its [FrameworkAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [ConfigMapUID](../pkg/apis/frameworkcontroller/v1/types.go).*
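
The guidelines above can be read as a small decision procedure. The following is a minimal sketch of that reading, not part of FrameworkController: the `operations` struct and the `guaranteesAtRisk` helper are hypothetical names introduced for illustration, and the sketch assumes that the Framework is eventually deleted and that an empty ConfigMap policy means the default ConfigMap deletion.

```go
package main

import "fmt"

// operations is a hypothetical description of how an operator configures
// and deletes the objects of one Framework.
type operations struct {
	PodGracefulDeletionTimeoutSec *int64 // nil means never force delete
	PodDeletedWithZeroGracePeriod bool   // Pod deleted with 0 GracePeriodSeconds
	NodeDeletedWithoutDrain       bool   // Node deleted while still running the Pod
	ConfigMapPropagationPolicy    string // "" means the default ConfigMap deletion
	FrameworkPropagationPolicy    string // policy used when deleting the Framework
	OwnerReferencesChanged        bool   // OwnerReferences of ConfigMap or Pods changed
}

// guaranteesAtRisk returns the numbers of the ConsistencyGuarantees that the
// guidelines no longer cover for the given operations.
func guaranteesAtRisk(op operations) []int {
	// Guideline 1: anything that may force delete the managed Pod.
	taskAtRisk := op.PodGracefulDeletionTimeoutSec != nil ||
		op.PodDeletedWithZeroGracePeriod ||
		op.NodeDeletedWithoutDrain

	// Guideline 2: requires guideline 1 plus Foreground deletion and
	// untouched OwnerReferences.
	configMapOK := op.ConfigMapPropagationPolicy == "" ||
		op.ConfigMapPropagationPolicy == "Foreground"
	frameworkAtRisk := taskAtRisk ||
		!configMapOK ||
		op.FrameworkPropagationPolicy != "Foreground" ||
		op.OwnerReferencesChanged

	var atRisk []int
	if taskAtRisk {
		atRisk = append(atRisk, 1)
	}
	if frameworkAtRisk {
		atRisk = append(atRisk, 2, 3, 4)
	}
	return atRisk
}

func main() {
	fmt.Println(guaranteesAtRisk(operations{FrameworkPropagationPolicy: "Foreground"}))
	fmt.Println(guaranteesAtRisk(operations{FrameworkPropagationPolicy: "Background"}))
}
```

The dependency encoded here mirrors the list structure: ConsistencyGuarantee2 to ConsistencyGuarantee4 are only covered when ConsistencyGuarantee1 is also covered.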

### <a name="FrameworkAvailability">Framework Availability</a>
According to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), in the presence of a network partition, you cannot achieve both consistency and availability at the same time in any distributed system. So you have to make a trade-off between the [Framework Consistency](#FrameworkConsistency) and the [Framework Availability](#FrameworkAvailability).

You can tune the trade-off, for example to achieve higher [Framework Availability](#FrameworkAvailability) by sacrificing [Framework Consistency](#FrameworkConsistency):
1. Set a small [Pod TolerationSeconds for TaintBasedEvictions](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions)
2. Set a small [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go)
3. Violate other guidelines mentioned in [How to achieve ConsistencyGuarantees](#ConsistencyGuaranteesHowTo), such as manually force deleting a problematic Pod.

See more in:
1. [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go)
2. [Pod Safety and Consistency Guarantees](https://github.com/kubernetes/community/blob/ee8998b156031f6b363daade51ca2d12521f4ac0/contributors/design-proposals/storage/pod-safety.md)
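
When choosing a small PodGracefulDeletionTimeoutSec, the field's documentation in types.go suggests keeping it longer than the Pod TerminationGracePeriodSeconds plus the minimal TolerationSeconds for TaintBasedEvictions, so that graceful deletion is still attempted on a best-effort basis. A minimal sketch of that check follows; the function name and its parameters are hypothetical and only illustrate the arithmetic.

```go
package main

import "fmt"

// timeoutAllowsGracefulDeletion reports whether the given
// PodGracefulDeletionTimeoutSec still leaves room for a best-effort graceful
// deletion. A nil timeout means the Pod is never force deleted by
// FrameworkController, so it always passes.
func timeoutAllowsGracefulDeletion(timeoutSec *int64, terminationGracePeriodSec int64, tolerationSecs []int64) bool {
	if timeoutSec == nil {
		return true
	}
	// Find the minimal TolerationSeconds; no tolerations counts as 0.
	minToleration := int64(0)
	for i, s := range tolerationSecs {
		if i == 0 || s < minToleration {
			minToleration = s
		}
	}
	// The timeout should be strictly longer than the sum.
	return *timeoutSec > terminationGracePeriodSec+minToleration
}

func main() {
	long, short := int64(600), int64(60)
	fmt.Println(timeoutAllowsGracefulDeletion(&long, 30, []int64{300, 300}))
	fmt.Println(timeoutAllowsGracefulDeletion(&short, 30, []int64{300, 300}))
	fmt.Println(timeoutAllowsGracefulDeletion(nil, 30, []int64{300, 300}))
}
```

A timeout that fails this check is still a valid setting; it simply moves the Task further toward availability, since the Pod may be force deleted before graceful deletion has had a chance to finish.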

## <a name="ControllerExtension">Controller Extension</a>
### <a name="FrameworkBarrier">FrameworkBarrier</a>
1. [Usage](../pkg/barrier/barrier.go)
57 changes: 29 additions & 28 deletions pkg/apis/frameworkcontroller/v1/types.go
@@ -47,8 +47,7 @@ type FrameworkList struct {
// 4. With consistent identity {FrameworkName}-{TaskRoleName}-{TaskIndex} as PodName
// 5. With fine grained RetryPolicy for each Task and the whole Framework
// 6. With fine grained FrameworkAttemptCompletionPolicy for each TaskRole
// 7. Guarantees at most one instance of a specific Task is running at any point in time
// 8. Guarantees at most one instance of a specific Framework is running at any point in time
// 7. With PodGracefulDeletionTimeoutSec for each Task to tune Consistency vs Availability
//
// Notes:
// 1. Status field should only be modified by FrameworkController, and
@@ -57,26 +56,6 @@ type FrameworkList struct {
// Leverage CRD status subresource to isolate Status field modification with other fields.
// This can help to avoid unintended modification, such as users may unintendedly modify
// the status when updating the spec.
// 2. To ensure at most one instance of a specific Task is running at any point in time:
// 1. Do not delete the managed Pod with 0 gracePeriodSeconds.
// For example, the default Pod deletion is acceptable.
// 2. Do not delete the Node which runs the managed Pod.
// For example, drain before delete the Node is acceptable.
// The instance can be universally located by its TaskAttemptInstanceUID or PodUID.
// See RetryPolicySpec and TaskAttemptStatus.
// 3. To ensure at most one instance of a specific Framework is running at any point in time:
// 1. Ensure ensure at most one instance of a specific Task is running at any point in time.
// 2. Do not delete the managed ConfigMap with Background propagationPolicy.
// For example, the default ConfigMap deletion is acceptable.
// 3. Must delete the Framework with Foreground propagationPolicy.
// For example, the default Framework deletion may not be acceptable, since the default
// propagationPolicy for Framework object may be Background.
// The instance can be universally located by its FrameworkAttemptInstanceUID or ConfigMapUID.
// See RetryPolicySpec and FrameworkAttemptStatus.
// 4. To ensure there is no orphan object previously managed by FrameworkController:
// 1. Do not delete the Framework or the managed ConfigMap with Orphan propagationPolicy.
// For example, the default Framework and ConfigMap deletion is acceptable.
// 2. Do not change the OwnerReferences of the managed ConfigMap and Pods.
//////////////////////////////////////////////////////////////////////////////////////////////////
type Framework struct {
meta.TypeMeta `json:",inline"`
@@ -107,8 +86,31 @@ type TaskRoleSpec struct {
}

type TaskSpec struct {
RetryPolicy RetryPolicySpec `json:"retryPolicy"`
Pod core.PodTemplateSpec `json:"pod"`
RetryPolicy RetryPolicySpec `json:"retryPolicy"`

// If the Task's current associated Pod object is being deleted, i.e. graceful
// deletion, but the graceful deletion cannot finish within this timeout, then
// the Pod will be deleted forcefully by FrameworkController.
// Default to nil.
//
// If this timeout is not nil, the Pod may be deleted forcefully by FrameworkController.
// The force deletion does not wait for confirmation that the Pod has been terminated
// totally, and then the Task will be immediately transitioned to TaskAttemptCompleted.
// As a consequence, the Task will be immediately completed or retried with another
// new Pod, however the old Pod may be still running.
// So, in this setting, the Task behaves like a ReplicaSet; choose it if the Task
// favors availability over consistency, such as a stateless Task.
// However, to still make a best-effort attempt at graceful deletion while tolerating
// transient deletion failures, this timeout should be longer than the Pod
// TerminationGracePeriodSeconds + minimal TolerationSeconds for TaintBasedEvictions.
//
// If this timeout is nil, the Pod will always be deleted gracefully, i.e. never
// be deleted forcefully by FrameworkController. This helps to guarantee at most
// one instance of a specific Task is running at any point in time.
// So, in this setting, the Task behaves like a StatefulSet; choose it if the Task
// favors consistency over availability, such as a stateful Task.
PodGracefulDeletionTimeoutSec *int64 `json:"podGracefulDeletionTimeoutSec"`
Pod core.PodTemplateSpec `json:"pod"`
}

type ExecutionType string
@@ -163,10 +165,9 @@ const (
// So, an attempt identified by its attempt id may be associated with multiple
// attempt instances over time, i.e. multiple instances may be run for the
// attempt over time, however, at most one instance is exposed into ApiServer
// over time and at most one instance is running at any point in time.
// So, the actual retried attempt instances maybe exceed the RetryPolicySpec
// in rare cases, however, the RetryPolicyStatus will never exceed the
// RetryPolicySpec.
// over time.
// So, the actual retried attempt instances may exceed the RetryPolicySpec in
// rare cases, however, the RetryPolicyStatus will never exceed the RetryPolicySpec.
// 2. Resort to other spec to control other kind of RetryPolicy:
// 1. Container RetryPolicy is the RestartPolicy in Pod Spec.
// See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
5 changes: 5 additions & 0 deletions pkg/apis/frameworkcontroller/v1/zz_generated.deepcopy.go

1 change: 1 addition & 0 deletions pkg/barrier/barrier.go
@@ -271,6 +271,7 @@ func (b *FrameworkBarrier) Run() {
if isPermanentErr {
exit(ci.CompletionCodeContainerPermanentFailed)
} else {
// May also be a timeout, but still treat it as an Unknown Error
exit(ci.CompletionCode(1))
}
}
8 changes: 8 additions & 0 deletions pkg/common/utils.go
@@ -87,6 +87,14 @@ func PtrUIDStr(s string) *types.UID {
return PtrUID(types.UID(s))
}

func PtrDeletionPropagation(o meta.DeletionPropagation) *meta.DeletionPropagation {
return &o
}

func PtrTime(o meta.Time) *meta.Time {
return &o
}

func PtrNow() *meta.Time {
now := meta.Now()
return &now