Skip to content

v2.0.0

Latest
Compare
Choose a tag to compare
@andreyvelich andreyvelich released this 21 Jul 15:59
· 32 commits to master since this release

This is the major release of the Kubeflow Trainer 2.0 project.

For more information, please see the

Quickstart

Install the Kubeflow Trainer control plane:

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"

$ kubectl get pods -n kubeflow-system

NAME                                                  READY   STATUS    RESTARTS   AGE
jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.0.0"

Install Kubeflow Python SDK:

pip install git+https://github.com/kubeflow/sdk.git@main#subdirectory=python

Run your first TrainJob by following the getting started guide.

Breaking Changes

New Features

LLM Trainer V2

Runtime Framework

  • feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
  • feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
  • feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
  • Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
  • Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
  • KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

JobSet

New Examples

SDK Updates

Bug Fixes

Misc