diff --git a/README.md b/README.md index c4b19974..1c4bfb59 100644 --- a/README.md +++ b/README.md @@ -57,9 +57,8 @@ A Framework represents an application with a set of Tasks: 1. A Kubernetes cluster, v1.10 or above, on-cloud or on-premise. ## Quick Start -1. [Build](build/frameworkcontroller) -2. [Run Example](example/run/frameworkcontroller.md) -3. [Framework Example](example/framework) +1. [Run Controller](example/run) +2. [Submit Framework](example/framework) ## Doc 1. [User Manual](doc/user-manual.md) diff --git a/doc/user-manual.md b/doc/user-manual.md index 84862e96..92996d0a 100644 --- a/doc/user-manual.md +++ b/doc/user-manual.md @@ -10,26 +10,37 @@ - [Best Practice](#BestPractice) ## Framework Interop -**Supported interoperations with a Framework** - +### Supported Client +As Framework is actually a [Kubernetes CRD](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#customresourcedefinitions), all [CRD Clients](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#accessing-a-custom-resource) can be used to interoperate with it, such as: +1. [kubectl](https://kubernetes.io/docs/reference/kubectl) + ```shell + kubectl create -f {Framework File Path} + # Note this is not Foreground Deletion, see [DELETE Framework] section + kubectl delete framework {FrameworkName} + kubectl get framework {FrameworkName} + kubectl describe framework {FrameworkName} + kubectl get frameworks + kubectl describe frameworks + ... + ``` +2. [Kubernetes Client Library](https://kubernetes.io/docs/reference/using-api/client-libraries) +3. 
Any HTTP Client + +### Supported Interoperation | API Kind | Operations | |:---- |:---- | | Framework | [CREATE](#CREATE_Framework) [DELETE](#DELETE_Framework) [GET](#GET_Framework) [LIST](#LIST_Frameworks) [WATCH](#WATCH_Framework) [WATCH_LIST](#WATCH_LIST_Frameworks) | | [ConfigMap](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#configmap-v1-core) | All operations except for [CREATE](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#create-193) [PUT](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#replace-195) [PATCH](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#patch-194) | | [Pod](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#pod-v1-core) | All operations except for [CREATE](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#create-55) [PUT](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#replace-57) [PATCH](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#patch-56) | -**Supported clients to execute the interoperations with a Framework** - -As Framework is actually a Kubernetes [CRD](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#customresourcedefinitions), all CRD clients can be used to execute the interoperations with a Framework, see them in [Accessing a custom resource](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#accessing-a-custom-resource). - -### CREATE Framework +#### CREATE Framework **Request** POST /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks Body: [Framework](../pkg/apis/frameworkcontroller/v1/types.go) -Type: application/json +Type: application/json or application/yaml **Description** @@ -44,26 +55,32 @@ Create the specified Framework. 
| Accepted(202) | [Framework](../pkg/apis/frameworkcontroller/v1/types.go) | Return current Framework. | | Conflict(409) | [Status](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#status-v1-meta) | The specified Framework already exists. | -### DELETE Framework +#### DELETE Framework **Request** DELETE /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName} Body: + +application/json ```json { "propagationPolicy": "Foreground" } ``` +application/yaml +```yaml +propagationPolicy: Foreground +``` -Type: application/json +Type: application/json or application/yaml **Description** Delete the specified Framework. Notes: -* Should always use and only use the provided body, see [Framework Notes](../pkg/apis/frameworkcontroller/v1/types.go). +* If you need to ensure at most one instance of a specific Framework (identified by the FrameworkName) is running at any point in time, you should always, and only, use the [Foreground Deletion](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion) given in the provided body; see [Framework Notes](../pkg/apis/frameworkcontroller/v1/types.go). However, `kubectl delete` does not support specifying Foreground Deletion, at least in [Kubernetes v1.10](https://github.com/kubernetes/kubernetes/issues/66110#issuecomment-413761559), so you may have to use another [Supported Client](#SupportedClient). **Response** @@ -73,7 +90,7 @@ Notes: | OK(200) | [Status](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#status-v1-meta) | The specified Framework is deleted. | | NotFound(200) | [Status](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#status-v1-meta) | The specified Framework is not found. 
| -### GET Framework +#### GET Framework **Request** GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName} @@ -89,7 +106,7 @@ Get the specified Framework. | OK(200) | [Framework](../pkg/apis/frameworkcontroller/v1/types.go) | Return current Framework. | | NotFound(200) | [Status](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#status-v1-meta) | The specified Framework is not found. | -### LIST Frameworks +#### LIST Frameworks **Request** GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks @@ -107,7 +124,7 @@ Get all Frameworks (in the specified FrameworkNamespace). |:---- |:---- |:---- | | OK(200) | [FrameworkList](../pkg/apis/frameworkcontroller/v1/types.go) | Return all Frameworks (in the specified FrameworkNamespace). | -### WATCH Framework +#### WATCH Framework **Request** GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName} @@ -125,7 +142,7 @@ Watch the change events of the specified Framework. | OK(200) | [WatchEvent](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#watchevent-v1-meta) | Streaming the change events of the specified Framework. | | NotFound(200) | [Status](https://v1-10.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#status-v1-meta) | The specified Framework is not found. | -### WATCH_LIST Frameworks +#### WATCH_LIST Frameworks **Request** GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks @@ -305,8 +322,7 @@ Notes: ## Controller Extension ### FrameworkBarrier 1. [Usage](../pkg/barrier/barrier.go) -2. [Build](../build/frameworkbarrier) -3. Example: [FrameworkBarrier Example](../example/framework/extension/frameworkbarrier.yaml), [Tensorflow Example](../example/framework/scenario/tensorflow), [etc](../example/framework/scenario). +2. 
Example: [FrameworkBarrier Example](../example/framework/extension/frameworkbarrier.yaml), [TensorFlow Example](../example/framework/scenario/tensorflow), [etc](../example/framework/scenario). ## Best Practice [Best Practice](../pkg/apis/frameworkcontroller/v1/types.go) diff --git a/example/config/default/frameworkcontroller.yaml b/example/config/default/frameworkcontroller.yaml index b973cc8e..9ed47f06 100644 --- a/example/config/default/frameworkcontroller.yaml +++ b/example/config/default/frameworkcontroller.yaml @@ -3,16 +3,6 @@ # This is the default config for frameworkcontroller, so all settings are commented out. -# Setup k8s config: -# kubeApiServerAddress is default to ${KUBE_APISERVER_ADDRESS} and kubeConfigFilePath -# is default to ${KUBECONFIG} then falls back to ${HOME}/.kube/config. -# If both kubeApiServerAddress and kubeConfigFilePath after defaulting are still empty, -# falls back to k8s inClusterConfig. -# -# Address should be in format http[s]://host:port #kubeApiServerAddress: http://10.10.10.10:8080 -# -# See https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#config #kubeConfigFilePath: "" - #workerNumber: 20 diff --git a/example/framework/README.md b/example/framework/README.md new file mode 100644 index 00000000..5d62c660 --- /dev/null +++ b/example/framework/README.md @@ -0,0 +1,11 @@ +# Submit Framework +We provide various Framework examples that can be submitted by various clients: +1. [Framework Supported Client](../../doc/user-manual.md#SupportedClient) +2. Framework Example + 1. [Basic Example](basic) + 2. [FrameworkController Extension Example](extension) + 3. [Real Scenario Example](scenario) + +## Next +1. [Framework Interop](../../doc/user-manual.md#FrameworkInterop) +2. 
[Framework Usage](../../pkg/apis/frameworkcontroller/v1/types.go) diff --git a/example/framework/basic/batchfailedpermanent.yaml b/example/framework/basic/batchfailedpermanent.yaml index 5d24e25a..c89900a1 100644 --- a/example/framework/basic/batchfailedpermanent.yaml +++ b/example/framework/basic/batchfailedpermanent.yaml @@ -1,4 +1,3 @@ -# Post to {kubeApiServerAddress}/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks # For the full spec setting and usage, see ./pkg/apis/frameworkcontroller/v1/types.go apiVersion: frameworkcontroller.microsoft.com/v1 kind: Framework diff --git a/example/framework/basic/batchfailedtransient.yaml b/example/framework/basic/batchfailedtransient.yaml index 62c43125..216db204 100644 --- a/example/framework/basic/batchfailedtransient.yaml +++ b/example/framework/basic/batchfailedtransient.yaml @@ -1,4 +1,3 @@ -# Post to {kubeApiServerAddress}/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks # For the full spec setting and usage, see ./pkg/apis/frameworkcontroller/v1/types.go apiVersion: frameworkcontroller.microsoft.com/v1 kind: Framework diff --git a/example/framework/basic/batchfailedtransientconflict.yaml b/example/framework/basic/batchfailedtransientconflict.yaml index 22e81ce2..0e9a1763 100644 --- a/example/framework/basic/batchfailedtransientconflict.yaml +++ b/example/framework/basic/batchfailedtransientconflict.yaml @@ -1,4 +1,3 @@ -# Post to {kubeApiServerAddress}/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks # For the full spec setting and usage, see ./pkg/apis/frameworkcontroller/v1/types.go apiVersion: frameworkcontroller.microsoft.com/v1 kind: Framework diff --git a/example/framework/basic/batchfailedunknown.yaml b/example/framework/basic/batchfailedunknown.yaml index f40e9a28..4afe6715 100644 --- a/example/framework/basic/batchfailedunknown.yaml +++ b/example/framework/basic/batchfailedunknown.yaml @@ -1,4 +1,3 @@ -# Post to 
{kubeApiServerAddress}/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks # For the full spec setting and usage, see ./pkg/apis/frameworkcontroller/v1/types.go apiVersion: frameworkcontroller.microsoft.com/v1 kind: Framework diff --git a/example/framework/basic/batchstatefulfailed.yaml b/example/framework/basic/batchstatefulfailed.yaml index c7c1bab7..58a75720 100644 --- a/example/framework/basic/batchstatefulfailed.yaml +++ b/example/framework/basic/batchstatefulfailed.yaml @@ -1,4 +1,3 @@ -# Post to {kubeApiServerAddress}/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks # For the full spec setting and usage, see ./pkg/apis/frameworkcontroller/v1/types.go apiVersion: frameworkcontroller.microsoft.com/v1 kind: Framework diff --git a/example/framework/basic/batchsucceeded.yaml b/example/framework/basic/batchsucceeded.yaml index c30db1a5..523d31c4 100644 --- a/example/framework/basic/batchsucceeded.yaml +++ b/example/framework/basic/batchsucceeded.yaml @@ -1,4 +1,3 @@ -# Post to {kubeApiServerAddress}/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks # For the full spec setting and usage, see ./pkg/apis/frameworkcontroller/v1/types.go apiVersion: frameworkcontroller.microsoft.com/v1 kind: Framework diff --git a/example/framework/basic/batchwithservicesucceeded.yaml b/example/framework/basic/batchwithservicesucceeded.yaml index e4a07321..3fc0aedb 100644 --- a/example/framework/basic/batchwithservicesucceeded.yaml +++ b/example/framework/basic/batchwithservicesucceeded.yaml @@ -1,4 +1,3 @@ -# Post to {kubeApiServerAddress}/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks # For the full spec setting and usage, see ./pkg/apis/frameworkcontroller/v1/types.go apiVersion: frameworkcontroller.microsoft.com/v1 kind: Framework diff --git a/example/framework/basic/service.yaml b/example/framework/basic/service.yaml index 8b6fd85d..ac66660d 100644 --- a/example/framework/basic/service.yaml 
+++ b/example/framework/basic/service.yaml @@ -1,4 +1,3 @@ -# Post to {kubeApiServerAddress}/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks # For the full spec setting and usage, see ./pkg/apis/frameworkcontroller/v1/types.go apiVersion: frameworkcontroller.microsoft.com/v1 kind: Framework diff --git a/example/framework/basic/servicestateful.yaml b/example/framework/basic/servicestateful.yaml index 5d6b1b24..b883d297 100644 --- a/example/framework/basic/servicestateful.yaml +++ b/example/framework/basic/servicestateful.yaml @@ -1,4 +1,3 @@ -# Post to {kubeApiServerAddress}/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks # For the full spec setting and usage, see ./pkg/apis/frameworkcontroller/v1/types.go apiVersion: frameworkcontroller.microsoft.com/v1 kind: Framework diff --git a/example/framework/extension/frameworkbarrier.yaml b/example/framework/extension/frameworkbarrier.yaml index aad0f6cf..a4980e88 100644 --- a/example/framework/extension/frameworkbarrier.yaml +++ b/example/framework/extension/frameworkbarrier.yaml @@ -1,6 +1,9 @@ -# Post to {kubeApiServerAddress}/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks # For the full spec setting and usage, see ./pkg/apis/frameworkcontroller/v1/types.go # For the full frameworkbarrier usage, see ./pkg/barrier/barrier.go + +############################### Prerequisite ################################### +# See "[PREREQUISITE]" in this file. +################################################################################ apiVersion: frameworkcontroller.microsoft.com/v1 kind: Framework metadata: @@ -54,6 +57,19 @@ spec: volumeMounts: - name: frameworkbarrier-volume mountPath: /mnt/frameworkbarrier + # [PREREQUISITE] + # User needs to create a service account in the same namespace of this + # Framework with granted permission for frameworkbarrier, if the k8s + # cluster enforces authorization. 
+ # For example, if the cluster enforces RBAC: + # kubectl create serviceaccount frameworkbarrier --namespace default + # kubectl create clusterrole frameworkbarrier \ + # --verb=get,list,watch \ + # --resource=frameworks + # kubectl create clusterrolebinding frameworkbarrier \ + # --clusterrole=frameworkbarrier \ + # --user=system:serviceaccount:default:frameworkbarrier + serviceAccountName: frameworkbarrier initContainers: - name: frameworkbarrier # Using official image to demonstrate this example. @@ -97,6 +113,9 @@ spec: volumeMounts: - name: frameworkbarrier-volume mountPath: /mnt/frameworkbarrier + # [PREREQUISITE] + # Same as server TaskRole. + serviceAccountName: frameworkbarrier initContainers: - name: frameworkbarrier image: frameworkcontroller/frameworkbarrier diff --git a/example/framework/scenario/tensorflow/README.md b/example/framework/scenario/tensorflow/README.md new file mode 100644 index 00000000..8f0a6858 --- /dev/null +++ b/example/framework/scenario/tensorflow/README.md @@ -0,0 +1,17 @@ +# TensorFlow On FrameworkController + +## Feature +1. Support both GPU and CPU Distributed Training +2. Automatically clean up PS when the whole FrameworkAttempt is completed +3. No need to adjust existing TensorFlow image +4. No need to setup [Kubernetes DNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service) and [Kubernetes Service](https://kubernetes.io/docs/concepts/services-networking/service) +5. [Common Feature](../../../../README.md#Feature) + +## Prerequisite +1. See `[PREREQUISITE]` in each specific Framework yaml file. +2. Need to setup [Kubernetes Cluster-Level Logging](https://kubernetes.io/docs/concepts/cluster-administration/logging), if you need to persist and expose the log for deleted Pod. + +## Quick Start +1. [Common Quick Start](../../../../README.md#Quick-Start) +2. [CPU Example](cpu) +3. 
[GPU Example](gpu) diff --git a/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml b/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml index 7485f13c..9ec33811 100644 --- a/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml +++ b/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml @@ -1,6 +1,9 @@ -# Post to {kubeApiServerAddress}/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks # For the full spec setting and usage, see ./pkg/apis/frameworkcontroller/v1/types.go # For the full frameworkbarrier usage, see ./pkg/barrier/barrier.go + +############################### Prerequisite ################################### +# See "[PREREQUISITE]" in this file. +################################################################################ apiVersion: frameworkcontroller.microsoft.com/v1 kind: Framework metadata: @@ -23,8 +26,15 @@ spec: pod: spec: restartPolicy: Never - # Using hostNetwork to avoid network overhead. - hostNetwork: true + # [PREREQUISITE] + # User needs to setup the k8s cluster networking model and aware the + # potential network overhead, if he want to disable the hostNetwork to + # avoid the coordination of the containerPort usage. + # And for this example, if the hostNetwork is disabled, it only needs + # at least 1 node, otherwise, it needs at least 3 nodes since all the + # 3 workers are specified with the same containerPort. + # See https://kubernetes.io/docs/concepts/cluster-administration/networking + hostNetwork: false containers: - name: tensorflow # Using official image to demonstrate this example. @@ -56,6 +66,11 @@ spec: mountPath: /mnt/frameworkbarrier - name: data-volume mountPath: /mnt/data + # [PREREQUISITE] + # User needs to create a service account for frameworkbarrier, if the + # k8s cluster enforces authorization. 
+ # See more in ./example/framework/extension/frameworkbarrier.yaml + serviceAccountName: frameworkbarrier initContainers: - name: frameworkbarrier # Using official image to demonstrate this example. @@ -74,10 +89,12 @@ spec: - name: frameworkbarrier-volume emptyDir: {} - name: data-volume + # [PREREQUISITE] # User needs to specify his own data-volume for input data and - # output model and the data-volume must be a distributed shared - # file system, so that data can be "handed off" between Pods, - # such as nfs, cephfs or glusterfs, etc. + # output model. + # The data-volume must be a distributed shared file system, so that + # data can be "handed off" between Pods, such as nfs, cephfs or + # glusterfs, etc. # See https://kubernetes.io/docs/concepts/storage/volumes. # # And then he needs to download and extract the example input data @@ -103,7 +120,9 @@ spec: pod: spec: restartPolicy: Never - hostNetwork: true + # [PREREQUISITE] + # Same as ps TaskRole. + hostNetwork: false containers: - name: tensorflow image: frameworkcontroller/tensorflow-examples:cpu @@ -125,6 +144,9 @@ spec: mountPath: /mnt/frameworkbarrier - name: data-volume mountPath: /mnt/data + # [PREREQUISITE] + # Same as ps TaskRole. + serviceAccountName: frameworkbarrier initContainers: - name: frameworkbarrier image: frameworkcontroller/frameworkbarrier @@ -140,6 +162,8 @@ spec: - name: frameworkbarrier-volume emptyDir: {} - name: data-volume + # [PREREQUISITE] + # Same as ps TaskRole. 
#nfs: # server: {NFS Server Host} # path: {NFS Shared Directory} diff --git a/example/framework/scenario/tensorflow/gpu/tensorflowdistributedtrainingwithgpu.yaml b/example/framework/scenario/tensorflow/gpu/tensorflowdistributedtrainingwithgpu.yaml index 1a1e39a4..1d6bdde2 100644 --- a/example/framework/scenario/tensorflow/gpu/tensorflowdistributedtrainingwithgpu.yaml +++ b/example/framework/scenario/tensorflow/gpu/tensorflowdistributedtrainingwithgpu.yaml @@ -1,6 +1,9 @@ -# Post to {kubeApiServerAddress}/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks # For the full spec setting and usage, see ./pkg/apis/frameworkcontroller/v1/types.go # For the full frameworkbarrier usage, see ./pkg/barrier/barrier.go + +############################### Prerequisite ################################### +# See "[PREREQUISITE]" in this file. +################################################################################ apiVersion: frameworkcontroller.microsoft.com/v1 kind: Framework metadata: @@ -23,8 +26,15 @@ spec: pod: spec: restartPolicy: Never - # Using hostNetwork to avoid network overhead. - hostNetwork: true + # [PREREQUISITE] + # User needs to setup the k8s cluster networking model and aware the + # potential network overhead, if he want to disable the hostNetwork to + # avoid the coordination of the containerPort usage. + # And for this example, if the hostNetwork is disabled, it only needs + # at least 1 node, otherwise, it needs at least 3 nodes since all the + # 3 workers are specified with the same containerPort. + # See https://kubernetes.io/docs/concepts/cluster-administration/networking + hostNetwork: false containers: - name: tensorflow # Using official image to demonstrate this example. @@ -53,6 +63,7 @@ spec: - containerPort: 4001 resources: limits: + # [PREREQUISITE] # User needs to setup GPU for the k8s cluster. 
# See https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus nvidia.com/gpu: 1 @@ -61,6 +72,11 @@ spec: mountPath: /mnt/frameworkbarrier - name: data-volume mountPath: /mnt/data + # [PREREQUISITE] + # User needs to create a service account for frameworkbarrier, if the + # k8s cluster enforces authorization. + # See more in ./example/framework/extension/frameworkbarrier.yaml + serviceAccountName: frameworkbarrier initContainers: - name: frameworkbarrier # Using official image to demonstrate this example. @@ -79,10 +95,12 @@ spec: - name: frameworkbarrier-volume emptyDir: {} - name: data-volume + # [PREREQUISITE] # User needs to specify his own data-volume for input data and - # output model and the data-volume must be a distributed shared - # file system, so that data can be "handed off" between Pods, - # such as nfs, cephfs or glusterfs, etc. + # output model. + # The data-volume must be a distributed shared file system, so that + # data can be "handed off" between Pods, such as nfs, cephfs or + # glusterfs, etc. # See https://kubernetes.io/docs/concepts/storage/volumes. # # And then he needs to download and extract the example input data @@ -108,7 +126,9 @@ spec: pod: spec: restartPolicy: Never - hostNetwork: true + # [PREREQUISITE] + # Same as ps TaskRole. + hostNetwork: false containers: - name: tensorflow image: frameworkcontroller/tensorflow-examples:gpu @@ -127,12 +147,17 @@ spec: - containerPort: 5001 resources: limits: + # [PREREQUISITE] + # Same as ps TaskRole. nvidia.com/gpu: 1 volumeMounts: - name: frameworkbarrier-volume mountPath: /mnt/frameworkbarrier - name: data-volume mountPath: /mnt/data + # [PREREQUISITE] + # Same as ps TaskRole. + serviceAccountName: frameworkbarrier initContainers: - name: frameworkbarrier image: frameworkcontroller/frameworkbarrier @@ -148,6 +173,8 @@ spec: - name: frameworkbarrier-volume emptyDir: {} - name: data-volume + # [PREREQUISITE] + # Same as ps TaskRole. 
#nfs: # server: {NFS Server Host} # path: {NFS Shared Directory} diff --git a/example/run/README.md b/example/run/README.md new file mode 100644 index 00000000..9b31dfe2 --- /dev/null +++ b/example/run/README.md @@ -0,0 +1,136 @@ +# Run FrameworkController +We provide various approaches to run FrameworkController: + - [Run By Kubernetes StatefulSet](#RunByKubernetesStatefulSet) + - [Run By Docker Container](#RunByDockerContainer) + - [Run By OS Process](#RunByOSProcess) + +Notes: + - For a single k8s cluster, one instance of FrameworkController orchestrates all Frameworks in all namespaces. + - For a single k8s cluster, ensure at most one instance of FrameworkController is running at any point in time. + - For the full FrameworkController configuration, see + [Config Usage](../../pkg/apis/frameworkcontroller/v1/config.go) and [Config Example](../../example/config/default/frameworkcontroller.yaml). + +## Run By Kubernetes StatefulSet +- This approach is better for production, since StatefulSet by itself provides [self-healing](https://kubernetes.io/docs/concepts/workloads/pods/pod/#durability-of-pods-or-lack-thereof) and can ensure [at most one instance](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md) of FrameworkController is running at any point in time. +- Using official image to demonstrate this example. + +**Prerequisite** + +If the k8s cluster enforces [Authorization](https://kubernetes.io/docs/reference/access-authn-authz/authorization/#authorization-modules), you need to first create a [Service Account](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account) with granted permission for FrameworkController. 
For example, if the cluster enforces [RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#kubectl-create-clusterrolebinding): +```shell +kubectl create serviceaccount frameworkcontroller --namespace default +kubectl create clusterrolebinding frameworkcontroller \ + --clusterrole=cluster-admin \ + --user=system:serviceaccount:default:frameworkcontroller +``` + +**Run** + +Run FrameworkController with above Service Account and the [k8s inClusterConfig](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#accessing-the-api-from-a-pod): +```shell +kubectl create -f frameworkcontroller.yaml +``` + +frameworkcontroller.yaml: +```yaml +apiVersion: apps/v1 +kind: StatefulSet +metadata: + name: frameworkcontroller + namespace: default +spec: + serviceName: frameworkcontroller + selector: + matchLabels: + app: frameworkcontroller + replicas: 1 + template: + metadata: + labels: + app: frameworkcontroller + spec: + # Using the service account with granted permission + # if the k8s cluster enforces authorization. + serviceAccountName: frameworkcontroller + containers: + - name: frameworkcontroller + image: frameworkcontroller/frameworkcontroller + # Using k8s inClusterConfig, so usually, no need to specify + # KUBE_APISERVER_ADDRESS or KUBECONFIG + #env: + #- name: KUBE_APISERVER_ADDRESS + # value: {http[s]://host:port} + #- name: KUBECONFIG + # value: {Pod Local KubeConfig File Path} +``` + +## Run By Docker Container +- This approach may be better for development sometimes. +- Using official image to demonstrate this example. 
+ +**Run** + +If you have an insecure ApiServer address (can be obtained from [Insecure ApiServer](https://kubernetes.io/docs/reference/access-authn-authz/controlling-access/#api-server-ports-and-ips) or [kubectl proxy](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#using-kubectl-proxy)) which does not enforce authentication, you only need to provide the address: +```shell +docker run -e KUBE_APISERVER_ADDRESS={http[s]://host:port} \ + frameworkcontroller/frameworkcontroller +``` + +Otherwise, you need to provide your [KubeConfig File](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/#explore-the-home-kube-directory) which inlines or refers to the [ApiServer Credential Files](https://kubernetes.io/docs/reference/access-authn-authz/controlling-access/#transport-security) with [granted permission](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/#define-clusters-users-and-contexts): +```shell +docker run -e KUBECONFIG=/mnt/.kube/config \ + -v {Host Local KubeConfig File Path}:/mnt/.kube/config \ + -v {Host Local ApiServer Credential File Path}:{Container Local ApiServer Credential File Path} \ + frameworkcontroller/frameworkcontroller +``` +For example, if the k8s cluster is created by [Minikube](https://kubernetes.io/docs/setup/minikube): +```shell +docker run -e KUBECONFIG=/mnt/.kube/config \ + -v ${HOME}/.kube/config:/mnt/.kube/config \ + -v ${HOME}/.minikube:${HOME}/.minikube \ + frameworkcontroller/frameworkcontroller +``` + +## Run By OS Process +- This approach may be better for development sometimes. +- Using a locally built binary distribution to demonstrate this example. + +**Prerequisite** + +Ensure you have installed [Golang 1.10 or above](https://golang.org/doc/install#install) and that [${GOPATH}](https://golang.org/doc/code.html#GOPATH) is valid. 
+ +Then build the FrameworkController binary distribution: +```shell +export PROJECT_DIR=${GOPATH}/src/github.com/microsoft/frameworkcontroller +rm -rf ${PROJECT_DIR} +mkdir -p ${PROJECT_DIR} +git clone https://github.com/Microsoft/frameworkcontroller.git ${PROJECT_DIR} +cd ${PROJECT_DIR} +./build/frameworkcontroller/go-build.sh +``` + +**Run** + +If you have an insecure ApiServer address (can be obtained from [Insecure ApiServer](https://kubernetes.io/docs/reference/access-authn-authz/controlling-access/#api-server-ports-and-ips) or [kubectl proxy](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#using-kubectl-proxy)) which does not enforce authentication, you only need to provide the address: +```shell +KUBE_APISERVER_ADDRESS={http[s]://host:port} \ + ./dist/frameworkcontroller/start.sh +``` + +Otherwise, you need to provide your [KubeConfig File](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/#explore-the-home-kube-directory) which inlines or refers to the [ApiServer Credential Files](https://kubernetes.io/docs/reference/access-authn-authz/controlling-access/#transport-security) with [granted permission](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/#define-clusters-users-and-contexts): +```shell +KUBECONFIG={Process Local KubeConfig File Path} \ + ./dist/frameworkcontroller/start.sh +``` +For example: +```shell +KUBECONFIG=${HOME}/.kube/config \ + ./dist/frameworkcontroller/start.sh +``` +In the above example, `${HOME}/.kube/config` is the default value of `KUBECONFIG`, so you can omit it: +```shell +./dist/frameworkcontroller/start.sh +``` + +## Next +1. [Submit Framework](../framework) diff --git a/example/run/frameworkcontroller.md b/example/run/frameworkcontroller.md deleted file mode 100644 index fd52e414..00000000 --- a/example/run/frameworkcontroller.md +++ /dev/null @@ -1,63 +0,0 @@ -# Run FrameworkController - -1. 
Ensure at most one instance of FrameworkController is run for a single k8s cluster. -2. For the full FrameworkController configuration, see - [Config Usage](../../pkg/apis/frameworkcontroller/v1/config.go) and [Config Example](../../example/config). - -## Run by a OS Process - -```shell -KUBE_APISERVER_ADDRESS={http[s]://host:port} ./dist/frameworkcontroller/start.sh -``` -Or -```shell -KUBECONFIG={Process Local KubeConfig File Path} ./dist/frameworkcontroller/start.sh -``` - -## Run by a Docker Container - -```shell -docker run -e KUBE_APISERVER_ADDRESS={http[s]://host:port} frameworkcontroller -``` -Or -```shell -docker run -e KUBECONFIG={Container Local KubeConfig File Path} frameworkcontroller -``` - -## Run by a Kubernetes StatefulSet - -```shell -kubectl create -f frameworkcontroller.yaml -``` - -frameworkcontroller.yaml: -```yaml -apiVersion: apps/v1 -kind: StatefulSet -metadata: - name: frameworkcontroller -spec: - serviceName: frameworkcontroller - selector: - matchLabels: - app: frameworkcontroller - replicas: 1 - template: - metadata: - labels: - app: frameworkcontroller - spec: - containers: - - name: frameworkcontroller - # Using official image to demonstrate this example. - image: frameworkcontroller/frameworkcontroller - env: - # May not need to specify KUBE_APISERVER_ADDRESS or KUBECONFIG - # if the target cluster to control is the cluster running the - # StatefulSet. - # See k8s inClusterConfig. - - name: KUBE_APISERVER_ADDRESS - value: {http[s]://host:port} - - name: KUBECONFIG - value: {Pod Local KubeConfig File Path} -``` diff --git a/pkg/apis/frameworkcontroller/v1/config.go b/pkg/apis/frameworkcontroller/v1/config.go index e0c83008..6415ebcf 100644 --- a/pkg/apis/frameworkcontroller/v1/config.go +++ b/pkg/apis/frameworkcontroller/v1/config.go @@ -32,11 +32,27 @@ import ( ) type Config struct { - // If both kubeApiServerAddress and kubeConfigFilePath after defaulting are still - // empty, falls back to k8s inClusterConfig. 
+ // KubeApiServerAddress defaults to ${KUBE_APISERVER_ADDRESS}. + // KubeConfigFilePath defaults to ${KUBECONFIG}, then falls back to ${HOME}/.kube/config. + // + // If both KubeApiServerAddress and KubeConfigFilePath after defaulting are still empty, falls back to the + // [k8s inClusterConfig](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#accessing-the-api-from-a-pod). + // + // If both KubeApiServerAddress and KubeConfigFilePath after defaulting are not empty, + // KubeApiServerAddress overrides the server address specified in the file referred to by KubeConfigFilePath. + // + // If only KubeApiServerAddress after defaulting is not empty, it should be an insecure ApiServer address (can be obtained from + // [Insecure ApiServer](https://kubernetes.io/docs/reference/access-authn-authz/controlling-access/#api-server-ports-and-ips) or + // [kubectl proxy](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#using-kubectl-proxy)) + // which does not enforce authentication. + // + // If only KubeConfigFilePath after defaulting is not empty, it should be a valid + // [KubeConfig File](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/#explore-the-home-kube-directory) + // which inlines or refers to the valid + // [ApiServer Credential Files](https://kubernetes.io/docs/reference/access-authn-authz/controlling-access/#transport-security). 
+ // // Address should be in format http[s]://host:port KubeApiServerAddress *string `yaml:"kubeApiServerAddress"` - // See https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#config KubeConfigFilePath *string `yaml:"kubeConfigFilePath"` // Number of concurrent workers to process each different Frameworks diff --git a/pkg/barrier/barrier.go b/pkg/barrier/barrier.go index b81e214d..b9591c55 100644 --- a/pkg/barrier/barrier.go +++ b/pkg/barrier/barrier.go @@ -101,11 +101,11 @@ const ( // Config /////////////////////////////////////////////////////////////////////////////////////// type Config struct { - // The Framework for which the barrier waits. - // Address should be in format http[s]://host:port + // See the same fields in pkg/apis/frameworkcontroller/v1/config.go KubeApiServerAddress string `yaml:"kubeApiServerAddress"` - // See https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#config KubeConfigFilePath string `yaml:"kubeConfigFilePath"` + + // The Framework for which the barrier waits. FrameworkNamespace string `yaml:"frameworkNamespace"` FrameworkName string `yaml:"frameworkName"`
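
The defaulting rules documented in the `Config` comment of `pkg/apis/frameworkcontroller/v1/config.go` can be read as a small decision table. The following is a minimal sketch of that table only, not the controller's actual code: `kubeAccess`, `resolveKubeAccess`, and the mode strings are hypothetical names introduced for illustration, and the sketch assumes that `${HOME}/.kube/config` only counts as a default when the file exists, which the comment does not state.

```go
package main

import (
	"fmt"
	"os"
)

// kubeAccess describes how a client would reach the ApiServer.
// Hypothetical type, used only for this illustration.
type kubeAccess struct {
	// One of "inCluster", "addressOnly", "kubeConfigOnly",
	// "kubeConfigWithAddressOverride" (hypothetical labels).
	Mode           string
	Address        string
	KubeConfigPath string
}

// resolveKubeAccess applies the documented defaulting rules to the two
// configured values. getenv and fileExists are injected so the rules can be
// exercised without touching the real environment or file system.
func resolveKubeAccess(
	address, kubeConfigPath string,
	getenv func(string) string,
	fileExists func(string) bool,
) kubeAccess {
	// KubeApiServerAddress defaults to ${KUBE_APISERVER_ADDRESS}.
	if address == "" {
		address = getenv("KUBE_APISERVER_ADDRESS")
	}
	// KubeConfigFilePath defaults to ${KUBECONFIG} ...
	if kubeConfigPath == "" {
		kubeConfigPath = getenv("KUBECONFIG")
	}
	// ... then falls back to ${HOME}/.kube/config.
	// Assumption: the fallback is only taken when that file exists.
	if kubeConfigPath == "" {
		candidate := getenv("HOME") + "/.kube/config"
		if fileExists(candidate) {
			kubeConfigPath = candidate
		}
	}

	switch {
	case address == "" && kubeConfigPath == "":
		// Both still empty: use the k8s inClusterConfig.
		return kubeAccess{Mode: "inCluster"}
	case address != "" && kubeConfigPath != "":
		// Both set: the address overrides the server in the KubeConfig file.
		return kubeAccess{
			Mode:           "kubeConfigWithAddressOverride",
			Address:        address,
			KubeConfigPath: kubeConfigPath,
		}
	case address != "":
		// Only the address: expected to be an insecure ApiServer address.
		return kubeAccess{Mode: "addressOnly", Address: address}
	default:
		// Only the KubeConfig file: expected to carry valid credentials.
		return kubeAccess{Mode: "kubeConfigOnly", KubeConfigPath: kubeConfigPath}
	}
}

func main() {
	fileExists := func(path string) bool {
		_, err := os.Stat(path)
		return err == nil
	}
	// Resolve using the real environment and file system.
	fmt.Printf("%+v\n", resolveKubeAccess("", "", os.Getenv, fileExists))
}
```

Injecting `getenv` and `fileExists` is a design choice for this sketch only; it keeps the decision table testable in isolation and says nothing about how the controller itself loads its configuration.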