microsoft · yqwang-ms · Sep 19, 2019 · Sep 18, 2019 · Sep 19, 2019 · Sep 19, 2019
diff --git a/README.md b/README.md
@@ -41,13 +41,12 @@ A Framework represents an application with a set of Tasks:
 4. With consistent identity {FrameworkName}-{TaskRoleName}-{TaskIndex} as PodName
 5. With fine grained [RetryPolicy](doc/user-manual.md#RetryPolicy) for each Task and the whole Framework
 6. With fine grained [FrameworkAttemptCompletionPolicy](doc/user-manual.md#FrameworkAttemptCompletionPolicy) for each TaskRole
-7. Guarantees at most one instance of a specific Task is running at any point in time
-8. Guarantees at most one instance of a specific Framework is running at any point in time
+7. With PodGracefulDeletionTimeoutSec for each Task to [tune Consistency vs Availability](doc/user-manual.md#FrameworkConsistencyAvailability)
 
 ### Controller Feature
 1. Highly generalized as it is built for all kinds of applications
 2. Light-weight as it is only responsible for Pod orchestration
-3. Well-defined Framework consistency, state machine and failure model
+3. Well-defined Framework [Consistency vs Availability](doc/user-manual.md#FrameworkConsistencyAvailability), [State Machine](doc/user-manual.md#FrameworkTaskStateMachine) and [Failure Model](doc/user-manual.md#CompletionStatus)
 4. Tolerate Pod/ConfigMap unexpected deletion, Node/Network/FrameworkController/Kubernetes failure
 5. Support to specify how to [classify and summarize Pod failures](doc/user-manual.md#PodFailureClassification)
 6. Support to expose [Framework and Pod history snapshots](doc/user-manual.md#FrameworkPodHistory) to external systems

diff --git a/doc/user-manual.md b/doc/user-manual.md
@@ -9,6 +9,8 @@
    - [RetryPolicy](#RetryPolicy)
    - [FrameworkAttemptCompletionPolicy](#FrameworkAttemptCompletionPolicy)
    - [Framework and Pod History](#FrameworkPodHistory)
+   - [Framework and Task State Machine](#FrameworkTaskStateMachine)
+   - [Framework Consistency vs Availability](#FrameworkConsistencyAvailability)
    - [Controller Extension](#ControllerExtension)
      - [FrameworkBarrier](#FrameworkBarrier)
      - [HivedScheduler](#HivedScheduler)
@@ -116,7 +118,8 @@ Type: application/json or application/yaml
 Delete the specified Framework.
 
 Notes:
-* If you need to ensure at most one instance of a specific Framework (identified by the FrameworkName) is running at any point in time, you should always use and only use the [Foreground Deletion](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion) in the provided body, see [Framework Notes](../pkg/apis/frameworkcontroller/v1/types.go). However, `kubectl delete` does not support to specify the Foreground Deletion at least for [Kubernetes v1.14.2](https://github.com/kubernetes/kubernetes/issues/66110#issuecomment-413761559), so you may have to use other [Supported Client](#SupportedClient).
+* If you need to achieve all the [Framework ConsistencyGuarantees](#ConsistencyGuarantees), or to achieve higher [Framework Availability](#FrameworkAvailability) by leveraging [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go), you should always and only use [Foreground Deletion](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion) in the provided body.
+* However, `kubectl delete` does not support specifying Foreground Deletion, at least as of [Kubernetes v1.14.2](https://github.com/kubernetes/kubernetes/issues/66110#issuecomment-413761559), so you may have to use another [Supported Client](#SupportedClient).
 
 **Response**
 
@@ -370,6 +373,87 @@ Notes:
 ## <a name="FrameworkPodHistory">Framework and Pod History</a>
 By leveraging [LogObjectSnapshot](../pkg/apis/frameworkcontroller/v1/config.go), external systems, such as [Fluentd](https://www.fluentd.org) and [ElasticSearch](https://www.elastic.co/products/elasticsearch), can collect and process Framework and Pod history snapshots even if it was retried or deleted, such as persistence, metrics conversion, visualization, alerting, acting, analysis, etc.
 
+## <a name="FrameworkTaskStateMachine">Framework and Task State Machine</a>
+### <a name="FrameworkStateMachine">Framework State Machine</a>
+[FrameworkState](../pkg/apis/frameworkcontroller/v1/types.go)
+
+### <a name="TaskStateMachine">Task State Machine</a>
+[TaskState](../pkg/apis/frameworkcontroller/v1/types.go)
+
+## <a name="FrameworkConsistencyAvailability">Framework Consistency vs Availability</a>
+### <a name="FrameworkConsistency">Framework Consistency</a>
+#### <a name="ConsistencyGuarantees">ConsistencyGuarantees</a>
+For a specific Task identified by {FrameworkName}-{TaskRoleName}-{TaskIndex}:
+
+- **ConsistencyGuarantee1**:
+
+  At most one instance of the Task is running at any point in time.
+
+- **ConsistencyGuarantee2**:
+
+  No instance of the Task is running if it is TaskAttemptCompleted, TaskCompleted or the whole Framework is deleted.
+
+For a specific Framework identified by {FrameworkName}:
+
+- **ConsistencyGuarantee3**:
+
+  At most one instance of the Framework is running at any point in time.
+
+- **ConsistencyGuarantee4**:
+
+  No instance of the Framework is running if it is FrameworkAttemptCompleted, FrameworkCompleted or the whole Framework is deleted.
+
+#### <a name="ConsistencyGuaranteesHowTo">How to achieve ConsistencyGuarantees</a>
+
+The default behavior achieves all the [ConsistencyGuarantees](#ConsistencyGuarantees), as long as you do not explicitly violate the guidelines below:
+
+1. Achieve **ConsistencyGuarantee1**:
+
+    Do not [force delete the managed Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/#force-deletion-of-pods):
+
+   1. Do not set [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go) to a non-nil value.
+
+      For example, the default PodGracefulDeletionTimeoutSec (nil) is acceptable.
+
+   2. Do not delete the managed Pod with [0 GracePeriodSeconds](https://kubernetes.io/docs/concepts/workloads/pods/pod/#force-deletion-of-pods).
+
+      For example, the default Pod deletion is acceptable.
+
+   3. Do not delete the Node which runs the managed Pod.
+
+      For example, [draining the Node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node) before deleting it is acceptable.
+
+   *The Task instance can be universally located by its [TaskAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [PodUID](../pkg/apis/frameworkcontroller/v1/types.go).*
+
+   *To prevent the Pod from being stuck in deletion forever, for example if its Node is down forever, use the same approach as [Delete StatefulSet Pod only after the Pod termination has been confirmed](https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#delete-pods), either manually or through your [Cloud Controller Manager](https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager).*
+
+2. Achieve **ConsistencyGuarantee2**, **ConsistencyGuarantee3** and **ConsistencyGuarantee4**:
+   1. Achieve **ConsistencyGuarantee1**.
+
+   2. Must delete the managed ConfigMap with [Foreground PropagationPolicy](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion).
+
+      For example, the default ConfigMap deletion is acceptable.
+
+   3. Must delete the Framework with [Foreground PropagationPolicy](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion).
+
+      For example, the default Framework deletion may not be acceptable, since the default PropagationPolicy for Framework object may be Background.
+
+   4. Do not change the [OwnerReferences](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents) of the managed ConfigMap and Pods.
+
+   *The Framework instance can be universally located by its [FrameworkAttemptInstanceUID](../pkg/apis/frameworkcontroller/v1/types.go) or [ConfigMapUID](../pkg/apis/frameworkcontroller/v1/types.go).*
+
+### <a name="FrameworkAvailability">Framework Availability</a>
+According to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), in the presence of a network partition, you cannot achieve both consistency and availability at the same time in any distributed system. So you have to make a trade-off between the [Framework Consistency](#FrameworkConsistency) and the [Framework Availability](#FrameworkAvailability).
+
+You can tune the trade-off, for example to achieve higher [Framework Availability](#FrameworkAvailability) by sacrificing [Framework Consistency](#FrameworkConsistency):
+1. Set a small [Pod TolerationSeconds for TaintBasedEvictions](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions)
+2. Set a small [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go)
+3. Violate other guidelines mentioned in [How to achieve ConsistencyGuarantees](#ConsistencyGuaranteesHowTo), such as manually force deleting a problematic Pod.
+
+See more in:
+1. [PodGracefulDeletionTimeoutSec](../pkg/apis/frameworkcontroller/v1/types.go)
+2. [Pod Safety and Consistency Guarantees](https://github.com/kubernetes/community/blob/ee8998b156031f6b363daade51ca2d12521f4ac0/contributors/design-proposals/storage/pod-safety.md)
+
 ## <a name="ControllerExtension">Controller Extension</a>
 ### <a name="FrameworkBarrier">FrameworkBarrier</a>
 1. [Usage](../pkg/barrier/barrier.go)
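
Reviewer sketch for the Foreground Deletion requirement in the user manual hunk above: the manual asks clients to pass Foreground Deletion "in the provided body" when deleting a Framework. The standalone Go snippet below builds such a body with the standard library only. The `deleteOptions` struct and `foregroundDeleteBody` function are hypothetical names introduced for illustration; the struct mirrors only the `kind`, `apiVersion` and `propagationPolicy` fields of the Kubernetes `DeleteOptions` object, and the snippet does not send any request.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deleteOptions is a hypothetical, trimmed-down mirror of the Kubernetes
// DeleteOptions object. Only the fields needed for this sketch are included.
type deleteOptions struct {
	Kind              string `json:"kind"`
	APIVersion        string `json:"apiVersion"`
	PropagationPolicy string `json:"propagationPolicy"`
}

// foregroundDeleteBody returns a JSON body that requests Foreground
// cascading deletion. It is meant to be sent as the body of the DELETE call.
func foregroundDeleteBody() (string, error) {
	b, err := json.Marshal(deleteOptions{
		Kind:              "DeleteOptions",
		APIVersion:        "v1",
		PropagationPolicy: "Foreground",
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	body, err := foregroundDeleteBody()
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

With a typed Kubernetes client, the same intent would normally be expressed by setting the propagation policy on the delete options instead of hand-building JSON; the raw body is shown here only because it has no dependencies.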

diff --git a/pkg/apis/frameworkcontroller/v1/types.go b/pkg/apis/frameworkcontroller/v1/types.go
@@ -47,8 +47,7 @@ type FrameworkList struct {
 // 4. With consistent identity {FrameworkName}-{TaskRoleName}-{TaskIndex} as PodName
 // 5. With fine grained RetryPolicy for each Task and the whole Framework
 // 6. With fine grained FrameworkAttemptCompletionPolicy for each TaskRole
-// 7. Guarantees at most one instance of a specific Task is running at any point in time
-// 8. Guarantees at most one instance of a specific Framework is running at any point in time
+// 7. With PodGracefulDeletionTimeoutSec for each Task to tune Consistency vs Availability
 //
 // Notes:
 // 1. Status field should only be modified by FrameworkController, and
@@ -57,26 +56,6 @@ type FrameworkList struct {
 //    Leverage CRD status subresource to isolate Status field modification with other fields.
 //    This can help to avoid unintended modification, such as users may unintendedly modify
 //    the status when updating the spec.
-// 2. To ensure at most one instance of a specific Task is running at any point in time:
-//    1. Do not delete the managed Pod with 0 gracePeriodSeconds.
-//       For example, the default Pod deletion is acceptable.
-//    2. Do not delete the Node which runs the managed Pod.
-//       For example, drain before delete the Node is acceptable.
-//    The instance can be universally located by its TaskAttemptInstanceUID or PodUID.
-//    See RetryPolicySpec and TaskAttemptStatus.
-// 3. To ensure at most one instance of a specific Framework is running at any point in time:
-//    1. Ensure ensure at most one instance of a specific Task is running at any point in time.
-//    2. Do not delete the managed ConfigMap with Background propagationPolicy.
-//       For example, the default ConfigMap deletion is acceptable.
-//    3. Must delete the Framework with Foreground propagationPolicy.
-//       For example, the default Framework deletion may not be acceptable, since the default
-//       propagationPolicy for Framework object may be Background.
-//    The instance can be universally located by its FrameworkAttemptInstanceUID or ConfigMapUID.
-//    See RetryPolicySpec and FrameworkAttemptStatus.
-// 4. To ensure there is no orphan object previously managed by FrameworkController:
-//    1. Do not delete the Framework or the managed ConfigMap with Orphan propagationPolicy.
-//       For example, the default Framework and ConfigMap deletion is acceptable.
-//    2. Do not change the OwnerReferences of the managed ConfigMap and Pods.
 //////////////////////////////////////////////////////////////////////////////////////////////////
 type Framework struct {
 	meta.TypeMeta   `json:",inline"`
@@ -107,8 +86,31 @@ type TaskRoleSpec struct {
 }
 
 type TaskSpec struct {
-	RetryPolicy RetryPolicySpec      `json:"retryPolicy"`
-	Pod         core.PodTemplateSpec `json:"pod"`
+	RetryPolicy RetryPolicySpec `json:"retryPolicy"`
+
+	// If the Task's current associated Pod object is being deleted, i.e. graceful
+	// deletion, but the graceful deletion cannot finish within this timeout, then
+	// the Pod will be deleted forcefully by FrameworkController.
+	// Default to nil.
+	//
+	// If this timeout is not nil, the Pod may be deleted forcefully by FrameworkController.
+	// The force deletion does not wait for confirmation that the Pod has been fully
+	// terminated, and the Task is then immediately transitioned to TaskAttemptCompleted.
+	// As a consequence, the Task will be immediately completed or retried with a
+	// new Pod, while the old Pod may still be running.
+	// So, in this setting, the Task behaves like a ReplicaSet. Choose it if the Task
+	// favors availability over consistency, such as a stateless Task.
+	// However, to still make a best-effort attempt at graceful deletion and tolerate
+	// transient deletion failures, this timeout should be longer than the Pod
+	// TerminationGracePeriodSeconds + the minimal TolerationSeconds for TaintBasedEvictions.
+	//
+	// If this timeout is nil, the Pod will always be deleted gracefully, i.e. never
+	// be deleted forcefully by FrameworkController. This helps to guarantee at most
+	// one instance of a specific Task is running at any point in time.
+	// So, in this setting, the Task behaves like a StatefulSet. Choose it if the Task
+	// favors consistency over availability, such as a stateful Task.
+	PodGracefulDeletionTimeoutSec *int64               `json:"podGracefulDeletionTimeoutSec"`
+	Pod                           core.PodTemplateSpec `json:"pod"`
 }
 
 type ExecutionType string
@@ -163,10 +165,9 @@ const (
 //    So, an attempt identified by its attempt id may be associated with multiple
 //    attempt instances over time, i.e. multiple instances may be run for the
 //    attempt over time, however, at most one instance is exposed into ApiServer
-//    over time and at most one instance is running at any point in time.
-//    So, the actual retried attempt instances maybe exceed the RetryPolicySpec
-//    in rare cases, however, the RetryPolicyStatus will never exceed the
-//    RetryPolicySpec.
+//    over time.
+//    So, the actual retried attempt instances may exceed the RetryPolicySpec in
+//    rare cases; however, the RetryPolicyStatus will never exceed the RetryPolicySpec.
 // 2. Resort to other spec to control other kind of RetryPolicy:
 //    1. Container RetryPolicy is the RestartPolicy in Pod Spec.
 //       See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy

diff --git a/pkg/apis/frameworkcontroller/v1/zz_generated.deepcopy.go b/pkg/apis/frameworkcontroller/v1/zz_generated.deepcopy.go
diff --git a/pkg/barrier/barrier.go b/pkg/barrier/barrier.go
@@ -271,6 +271,7 @@ func (b *FrameworkBarrier) Run() {
 			if isPermanentErr {
 				exit(ci.CompletionCodeContainerPermanentFailed)
 			} else {
+				// May also be a timeout, but still treat it as an Unknown Error
 				exit(ci.CompletionCode(1))
 			}
 		}

diff --git a/pkg/common/utils.go b/pkg/common/utils.go
@@ -87,6 +87,14 @@ func PtrUIDStr(s string) *types.UID {
 	return PtrUID(types.UID(s))
 }
 
+func PtrDeletionPropagation(o meta.DeletionPropagation) *meta.DeletionPropagation {
+	return &o
+}
+
+func PtrTime(o meta.Time) *meta.Time {
+	return &o
+}
+
 func PtrNow() *meta.Time {
 	now := meta.Now()
 	return &now
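
Reviewer sketch for the pointer helpers added in `pkg/common/utils.go`: Go does not allow taking the address of a constant, so a helper that copies the value into a parameter and returns its address is a common way to fill optional pointer fields. The snippet below shows the pattern in isolation. `DeletionPropagation`, `DeletePropagationForeground` and `DeleteOptions` here are local stand-ins for the Kubernetes `meta` types so that the example compiles without dependencies.

```go
package main

import "fmt"

// DeletionPropagation is a local stand-in for meta.DeletionPropagation.
type DeletionPropagation string

// DeletePropagationForeground is a local stand-in for the meta constant.
const DeletePropagationForeground DeletionPropagation = "Foreground"

// DeleteOptions is a local stand-in with a single optional pointer field.
type DeleteOptions struct {
	PropagationPolicy *DeletionPropagation
}

// PtrDeletionPropagation has the same shape as the helper in the diff:
// the argument is copied into o, and the address of that copy is returned.
func PtrDeletionPropagation(o DeletionPropagation) *DeletionPropagation {
	return &o
}

func main() {
	// &DeletePropagationForeground would not compile, because constants
	// are not addressable. The helper sidesteps that.
	opts := DeleteOptions{
		PropagationPolicy: PtrDeletionPropagation(DeletePropagationForeground),
	}
	fmt.Println(*opts.PropagationPolicy)
}
```

Each call returns a pointer to a fresh copy, so callers can mutate the pointed-to value without affecting other users of the constant.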