SkyPilot v0.9.1

@romilbhardwaj released this 24 Apr 18:04 (commit 1ffe585)

SkyPilot v0.9.1: API Server Architecture, Web Dashboard, Faster Storage, Improved Configuration and more!

We're excited to announce the release of SkyPilot v0.9.1! This update brings major improvements to SkyPilot, making it faster, more powerful and flexible for production-ready deployment.

Highlights

Client-Server Architecture

The new client-server model transforms SkyPilot from a single-user system into a scalable, multi-user platform, making it easier for individuals and teams to run and manage their workloads.

  • Unified view and management: Get a single view of all running clusters and jobs across your organization and all of your infrastructure.
  • Integrate with workflow orchestrators: SkyPilot state is centralized on the API server and no longer needs to be maintained in orchestrators such as Airflow.
  • Multi-tenancy: Share clusters, jobs, and services securely among teammates.

More: Docs, Blog

Web dashboard

SkyPilot has a new dashboard! Easily view and manage your clusters, jobs and logs.

Access it with sky dashboard.

New configuration system

SkyPilot now supports specifying configuration at various levels: CLI, SkyPilot YAML, project-level config, client-level global config and server-side config.

You can now have a project configuration storing default values for all jobs in a project, a user configuration to apply globally to all projects and Task YAML overrides for specific jobs.
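To make the layering concrete, here is a minimal sketch of how stacked configuration levels could resolve to one effective config. It is not SkyPilot's implementation: `merge_configs` is a hypothetical helper, the precedence order (user config, then project config, then task overrides) is an assumption based on the description above, and the keys are borrowed from the autostop example elsewhere in these notes purely for illustration.

```python
from typing import Any, Dict, List


def merge_configs(layers: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge config layers key by key; later layers override earlier ones."""
    merged: Dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Both sides are mappings: merge them recursively.
                merged[key] = merge_configs([merged[key], value])
            else:
                # Scalars, lists, or new keys: the later layer wins.
                merged[key] = value
    return merged


# Hypothetical layers, lowest precedence first (the ordering is an assumption).
user_config = {"jobs": {"controller": {"autostop": {"idle_minutes": 10, "down": False}}}}
project_config = {"jobs": {"controller": {"autostop": {"idle_minutes": 5}}}}
task_overrides = {"jobs": {"controller": {"autostop": {"down": True}}}}

effective = merge_configs([user_config, project_config, task_overrides])
print(effective)
```

In this sketch the project value for `idle_minutes` replaces the user-level one, and the task-level `down` replaces both, while the input dictionaries are left unmodified.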

New mount_cached storage - 9.6x faster checkpointing

The new storage mode mount_cached uses the local disk as a cache for cloud storage buckets, boosting GPU utilization by making cloud I/O asynchronous.

file_mounts:
  /checkpoints:
    source: gs://my-checkpoints-bucket
    mode: MOUNT_CACHED  # Will asynchronously upload all writes to the bucket

More: Docs, Blog
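The idea behind this mode can be illustrated with a toy write-back cache: writes land on local disk and return immediately, while a background thread copies them to the slower destination. This is only a conceptual sketch, not how SkyPilot implements mount_cached. The `WriteBackCache` class is hypothetical, and a second temporary directory stands in for the cloud bucket.

```python
import queue
import shutil
import tempfile
import threading
from pathlib import Path


class WriteBackCache:
    """Toy write-back cache: local writes return fast, uploads run in the background."""

    def __init__(self, cache_dir: Path, remote_dir: Path) -> None:
        self.cache_dir = cache_dir
        self.remote_dir = remote_dir  # stand-in for a cloud bucket
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._upload_loop, daemon=True)
        self._worker.start()

    def write(self, name: str, data: bytes) -> None:
        """Write to the local cache and queue the file for upload."""
        path = self.cache_dir / name
        path.write_bytes(data)
        self._pending.put(path)

    def _upload_loop(self) -> None:
        while True:
            path = self._pending.get()
            if path is None:  # sentinel: no more uploads
                break
            shutil.copy2(path, self.remote_dir / path.name)

    def flush(self) -> None:
        """Block until every queued file has been copied."""
        self._pending.put(None)
        self._worker.join()


with tempfile.TemporaryDirectory() as cache, tempfile.TemporaryDirectory() as remote:
    store = WriteBackCache(Path(cache), Path(remote))
    store.write("step-100.ckpt", b"weights")  # returns without waiting for the copy
    store.flush()
    uploaded = sorted(p.name for p in Path(remote).iterdir())
    print(uploaded)
```

The training loop only pays for the local write; the copy to the slow destination overlaps with subsequent compute, which is the effect the asynchronous upload is meant to achieve.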

New cloud: Nebius

SkyPilot now supports Nebius cloud! Getting started is easy:

$ sky check nebius
$ sky launch --gpus H200:8 --cloud nebius

ARM instance support - run SkyPilot on GH200s, GB200s, and more!

New native images for ARM instances allow you to run SkyPilot on GH200s and GB200s on Lambda cloud, GCP, or your own Kubernetes clusters! (#4835)

What's new

CLI & Core interfaces

  • The sky CLI now returns a non-zero exit code on launch/exec/logs/jobs launch/jobs logs failures (#4846)
    • This improves scriptability with sky CLI in automated workflows.
  • sky check now separately checks storage and compute capabilities (#4996, #4977)
  • New --all option for sky jobs queue to show all jobs (#4923)
  • resources.gpus can now be used as an alias for resources.accelerators in the SkyPilot YAML (#5207)
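The non-zero exit codes noted in the list above are what make a wrapper like the following possible. This is a minimal sketch: `run_step` is a hypothetical helper, and the commands are stand-ins that invoke the Python interpreter so the example runs without SkyPilot installed. In a real pipeline the command list would be a sky invocation instead.

```python
import subprocess
import sys
from typing import List


def run_step(cmd: List[str]) -> bool:
    """Run a command and report success based on its exit code."""
    result = subprocess.run(cmd, check=False)
    return result.returncode == 0


# Stand-in commands: one exits with 0, the other with 1.
ok = run_step([sys.executable, "-c", "raise SystemExit(0)"])
failed = run_step([sys.executable, "-c", "raise SystemExit(1)"])
print(ok, failed)
```

A script can then stop or retry when `run_step` returns False rather than parsing CLI output.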

Managed Jobs

  • Multiple users can now share the same jobs controller (#4733)
  • Autostop and autodown settings for the jobs controller can now be customized (#5182)
    # ~/.sky/config.yaml
    jobs:
      controller:
        # autostop: false  # to disable completely
        autostop:
          idle_minutes: 5
          down: true
    
  • See other users' jobs with sky jobs queue -u when using a shared controller (#4787)
  • Access to cloud object storage is no longer necessary for using file mounts or workdir in managed jobs. (#4708)
    • Running managed jobs on Kubernetes no longer requires cloud access.

Storage

  • New mode: mount_cached (#4369)
    • This mode is optimized for checkpointing large models
    • It asynchronously uploads the cached directory to the cloud storage bucket, increasing GPU utilization.
  • Fixed an issue with openrsync on macOS 15 causing upload failures (#5196)
  • .gitignore handling is now more robust (#4988)
  • Fix exclusion for AWS bucket upload (#5128)

Kubernetes

  • Revamped /dev/fuse access mechanism on k8s (#5028)
    • SkyPilot no longer needs to request the smarter-devices-fuse resource, which makes FUSE mounting compatible with autoscaling clusters.
  • B200 GPUs are now supported on GKE (#5102)
  • Scale-to-zero autoscaling is now supported on GKE (#4935)
    • SkyPilot can now inspect the node pools available on scale-to-zero clusters before provisioning.
    • This allows SkyPilot to intelligently filter out clusters that cannot provision the requested GPU type.
  • sky check now detects and hints for unlabeled GPU nodes on GKE (#5065)
  • GPU names are now case-insensitive; numbers-only name formats are now supported (#4756, #4925)
  • Fixed fractional CPU support when using <1 CPU core (#4707)
  • Fix node filtering when provisioning multiple GPUs (#4930)
  • initContainers can now be overridden through pod_config (#5247)
  • Instructions on mounting NFS volumes (#4951)
  • GPU labelling script can now use custom context names (#5072)
  • Fixed a bug where clusters from stale contexts could not be cleaned up (#4980)

Backend

  • New Client-Server Architecture (#4660)
    • This allows SkyPilot to be deployed as a remote service shared by multiple users.
  • Fixed conda support when using Python 3.12 (#4035)
  • sky exec now waits for the cluster to be started (#4867)
  • sky local up --ips now supports specifying sudo password (#5030)
  • Clouds with expired credentials are now automatically excluded from failover (#5015)

SkyServe

  • New Spot/On-demand Policy: dynamic_fallback (#4628)
    • New spot_placer field can be set to dynamic_fallback to let SkyPilot automatically switch from spot to on-demand instances if spot instances are not available.
    • More details in the paper
  • Fixed: any_of field order issue causing version bump to not work (#4978)
  • Fixed: LiveError on controller (#4995)

Cloud Support

  • New cloud: Nebius (#4573, #4838)
  • GCP
    • TPU v6e is now supported on GKE clusters (#4986)
    • VPCs from different projects can be used (#5143)
    • Newer instance types (e.g., a3-highgpu-8g) can now be directly selected from the CLI with -t flag (#5120)
  • RunPod
    • Custom docker images with non-root user are now supported (#4683)
  • Lambda
    • New regions: us-east-3 and australia-east-1 (#4703, #4738)
    • Ports can now be opened on Lambda VMs (#5124)
  • Fluidstack: NVLINK GPUs are now supported (#3954)
  • IBM: new fetcher for IBM catalog (#5003)
  • Cloudflare R2: fixed upload issues when using new awscli versions (#5282)

New Examples and Tutorials

⚠️ Deprecations and removals

Removed

  • Env vars starting with SKY_ are no longer supported. Use SKYPILOT_ env vars instead.
  • Old services from 0.7.0 (before #4439) may need to be stopped and restarted.
  • kubernetes is no longer a valid region name. Use the Kubernetes context name to specify a Kubernetes cluster if required.
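Since SKY_-prefixed environment variables are no longer supported, it may help to scan an environment for leftovers before upgrading. The helper below is a hypothetical sketch, and the variable names in the sample are made up for illustration. It only detects the old prefix and does not guess the replacement names.

```python
import os
from typing import List, Mapping


def find_legacy_sky_vars(env: Mapping[str, str]) -> List[str]:
    """Return names that use the removed SKY_ prefix, sorted alphabetically."""
    return sorted(name for name in env if name.startswith("SKY_"))


# Made-up variable names, used only to exercise the helper.
sample_env = {"SKY_EXAMPLE_FLAG": "1", "SKYPILOT_EXAMPLE_FLAG": "1", "PATH": "/usr/bin"}
legacy = find_legacy_sky_vars(sample_env)
print(legacy)

# To check the real environment instead of the sample:
print(find_legacy_sky_vars(os.environ))
```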

Deprecated

  • experimental.config_overrides has been deprecated. Use the config field instead.

Migration guide

SkyPilot 0.9.1 introduces an asynchronous execution model, which may cause compatibility issues with user programs written against SkyPilot SDK versions <=0.8.1.

Refer to the migration guide to upgrade your code.

TL;DR: Wrap all SkyPilot SDK function calls (except tail_logs) with sky.stream_and_get() to make your program behave mostly the same as before:

# <= 0.8.1
job_id, handle = sky.launch(task)
# 0.9.1
job_id, handle = sky.stream_and_get(sky.launch(task))
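For readers new to this pattern, the toy model below shows the general shape of an asynchronous call: the call returns a handle immediately, and a second step blocks until the result is ready. It does not use or reproduce the SkyPilot SDK. `fake_launch` and `wait_for` are hypothetical names, and a thread pool stands in for the API server.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Tuple

# A single worker thread stands in for the remote server doing the work.
_executor = ThreadPoolExecutor(max_workers=1)


def fake_launch(task_name: str) -> "Future[Tuple[int, str]]":
    """Submit work and return immediately with a handle to the pending result."""
    return _executor.submit(lambda: (1, f"handle-for-{task_name}"))


def wait_for(request: "Future[Tuple[int, str]]") -> Tuple[int, str]:
    """Block until the request finishes and return its result."""
    return request.result()


request = fake_launch("train")      # returns right away
job_id, handle = wait_for(request)  # blocks until the result is ready
print(job_id, handle)
_executor.shutdown()
```

Code written for the older synchronous behavior expects the result directly from the first call, which is why the wrapping step is needed after upgrading.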

Thanks to all contributors!

New contributors: @kyuds, @BorenTsai, @funkypenguin, @JiangJiaWei1103, @SalikovAlex, @flaviomartins, @ajay, @bradhilton, @SeungjinYang, @eltociear, @vvidovic, @KennBro, @DanielZhangQD

Many thanks to everyone who contributed to this release!

Contributors: @aylei, @zpoint, @SeungjinYang, @cg505, @Michaelvll, @romilbhardwaj, @KeplerC, @concretevitamin, @SalikovAlex, @DanielZhangQD, @kyuds, @cblmemo, @andylizf, @clayrosenthal, @JiangJiaWei1103, @bradhilton, @funkypenguin, @vvidovic, @cbrownstein, @flaviomartins, @KennBro, @mjibril, @kristopolous, @ajay, @landscapepainter, @eltociear, @BorenTsai