
Commit f727408

Authored by SalikovAlex, romilbhardwaj, concretevitamin, Michaelvll, and zpoint
[Nebius] Nebius Object Storage support. (#4838)
* Add Nebius storage integration and associated utilities. Introduces support for Nebius object storage, enabling users to integrate Nebius storage with functionalities such as file mounting, syncing, and cloud transfers. Includes the necessary adaptors, utility methods, and updates to the storage framework to handle Nebius-specific configurations and endpoints.
* Add support for Nebius storage mounting and testing. Implemented new Nebius storage mounting functionality, including COPY and MOUNT modes, with tests. Updated cloud_stores.py for AWS CLI compatibility and added a YAML template for Nebius storage configurations. Removed the outdated Nebius test case in favor of the new approach.
* format
* Add Nebius object storage support across tests and utilities. Introduces comprehensive Nebius support for S3-compatible operations, including bucket creation, deletion, and mounting. Removes reliance on the AWS_SHARED_CREDENTIALS_FILE environment variable, streamlines Nebius-specific configurations, and adds unit test parameters to validate functionality across storage operations.
* fix
* typo
* Refactor Nebius adaptor and improve clarity. Remove redundant code, streamline imports, and improve error messaging. Adjust documentation for accuracy and update function annotations to improve the maintainability and readability of the Nebius adaptor module.
* Refactor Nebius storage setup and clean up debug print. Simplified Nebius AWS CLI installation by reusing the S3CloudStorage configuration for consistency. Removed an unnecessary debug print in `run_upload_cli` to reduce console noise. Minor formatting adjustment in the test YAML file.
* Refactor Nebius storage handling and add a timeout for deletions. Clean up and improve code readability, including string formatting and conditionals. Introduce `_TIMEOUT_TO_PROPAGATES` to bound the time spent verifying Nebius bucket deletions. Update comments to reflect the corrected usage on Nebius servers.
* Refactor subprocess call and improve timeout error messaging. Removed an unused variable from the subprocess call. Updated the timeout error to include the bucket name for more detailed and helpful error reporting.
* Set a default region for Nebius Object Storage if none is provided. Updated the helper method to assign a default region when no region is specified, ensuring compatibility with Nebius Object Storage and avoiding errors caused by missing region values.
* Support Nebius URLs in file sync commands. Replace 'nebius://' with 's3://' in source paths to ensure compatibility with AWS CLI commands, allowing seamless use of Nebius storage endpoints.
* [Docs] Add quick start to k8s getting started docs (#4799): k8s quick start; title
* [Docs] New "Examples" section (#4858): WIP: Examples dropdown; update new; WIP; local render is fine, need to add files; test pip; fix; add missing (several); try .mdd missing; updates; Instructions; cleanup; lint; add RAG; RAG new; refactor; fix redirection and warnings; generate before build; remove unnecessary source; minor; remove generated examples; fix header; prioritize readme file; avoid remove; format; update README; update links; try fix stem/name; add paper; update task -> skypilot yaml; source/generate_examples.py: revert to .stem. Co-authored-by: Zhanghao Wu <[email protected]>
* [API Server] Fix admin policy enforcement on `validate` and `optimize` (#4820): Add admin policy to validate; Add admin policy to optimize; docs; imports; Move dag validation to core; Fixes; lint; Add comments; Fixed executor based validate implementation; Revert executor based validate implementation; Add validation during optimize; Remove validate from core; Remove admin policy apply when validating dag for exec; comments; Bump API version
* [Core] Exit with non-zero code on launch/exec/logs/jobs launch/jobs logs (#4846): Support return code based on job success/failure; Return exit code for tailing managed jobs; Fixes; lint; Create JobExitCode enum; Get JobExitCode from ManagedJobStatus; cleanup; Add tests; Managed jobs back compat; Skylet backward compatibility; Update logs --status returncodes; fix retcode; Fix tests; Fix --no-follow; Fix cli docs rendering; minor; rename ret_code to returncode; rename SUCCESS to SUCCEEDED; Refactor JobExitCode to exceptions
* [Storage] Fix storage deletion for all (#4872)
* [Docs] Avoid back links in FAQ (#4866)
* Serve log before termination for smoke tests (#4691): serve log before termination; restore change; replace command; fix; add sky serve status
* [Dashboard] Fix Log Download (#4844): download preview; refactor log content column; fix column issue
* [jobs] catch NotSupportedError for `sky down --purge` (#4811). Fixes #4626.
* [Test] fixed managed job return code with --no-follow for compatibility test (#4887): fixed backward compatibility test; lint; temp test; revert temp change. Signed-off-by: Aylei <[email protected]>
* show managed jobs user column in `sky status -u` (#4889)
* [Examples] Rename airflow DAG (#4898): Rename to sky_train_dag; rename
* [API server] honor SKYPILOT_DEBUG env in server log (#4883). Signed-off-by: Aylei <[email protected]>
* [jobs] resolve jobs queue user on API server side (#4897): lint; note user_name is optional
* Updates the vast catalog to write directly to the vms.csv (#4891). Previously this file emitted to sys.stdout, which prevented the catalog-fetcher from actually updating the catalog. This has now been updated to match the patterns employed by other vendors in this directory.
* [Docs] Minor updates to installation.rst (#4888)
* [Docs] K8s docs updates (#4902): fixes to k8s docs
* [jobs] fix dashboard for remote API server (#4895): fix for k8s
* [docs] add jobs controller resource tuning reference in config page (#4909)
* [Core] Handle mid-sequence chunking in log streaming (#4908): Handle mid-sequence chunking; format; Handle actual UnicodeDecodeError; lint
* Exclude `.pyc` and `__pycache__` files from config hash calculation to fix `test_launch_fast --kubernetes` failures (#4880): filter out pyc and pycache; handle edge case; None for the case where the file might have been deleted after listing; add comment
* [Docs] Add a few more examples for k8s (#4911): Add some new Example links; Finetune landing/README; Updates; No fork button
* [Docs] Add team deployment in existing machine and `detach_run` in docs (#4913): Indicate remote API server for jobs; Add api deployment and detach_run in docs; avoid console for better copy paste; fix; rename; update doc; format; revert; Update docs/source/reservations/existing-machines.rst. Co-authored-by: Zongheng Yang <[email protected]>
* update PR template to use CI tests (#4917): update template; update; no bold
* [UX] Auto-exclude unavailable kubernetes contexts (#4692): Exclude stale kubernetes context (improve Kubernetes context and node retrieval error handling; add context-aware retry mechanism for Kubernetes API calls); catch broad error; track unavailable contexts; typo; remove irrelevant change; address review comments; Update sky/clouds/kubernetes.py; cover unreachable context in smoke test; fix post cleanup in multi-k8s; more comments. Signed-off-by: Aylei <[email protected]>; Co-authored-by: Zhanghao Wu <[email protected]>
* [API server] accelerate start by slowly starting workers (#4885): Address review comments; always close. Signed-off-by: Aylei <[email protected]>
* more permissive match for k8s accelerators (#4925): case insensitive match for k8s accelerators; fix typo in canonicalization func; format
* [core] if not all nodes are in ray status, double check after 5s (#4916): add a comment explaining the situation more
* [Docs] Remove `networking: nodeport` from config docs (#4928)
* [Core] Fix failover handler for clouds moved to new provisioner (#4919): Fix failover handler; remove unused handler
* [Test] Cost down smoke tests (#4813): change cpu to 2+ and memory to 4+; remove some resource heavy tests; update yaml; intermediate bucket yaml; cloud aws for test_managed_jobs_pipeline_recovery_aws; pipeline yaml update; larger the size of kube; test skyserve_update; fix kubernetes test failure; skyserve_streaming; more kubernetes high resource tests; restore azure changes for test_skyserve_rolling_update; bug fix; fix yaml; v100 does not require low resource; no special resource for kubernetes tests; add more for master test; test_multi_tenant_managed_jobs low resource; managed_job_storage; longer timeout for kube; resolve PR comment; rename function
* Add linting for sentence case in Markdown and reST headings (#4805): linting; subtitle; draft linting; update linting script; title lowercase; pass build; simplified logic; resolve review comments; restore change
* [Core] sky exec now waits for the cluster to be started (#4867): add smoke test case; refine smoke test; apply suggestions from code review; address review comments. Signed-off-by: Aylei <[email protected]>; Co-authored-by: Zhanghao Wu <[email protected]>
* [Docs] Minor: pull up a page (#4929)
* Fix Nebius integration issues and update storage error message. Updated the `create_endpoint` function to ensure the `region` parameter is strictly typed as `str`. Modified `create_nebius_client` to accept `None` as the default region. Corrected the error message in `storage.py` to specify 'nebius' instead of 's3'.
* typo
* Refactor storage handling and update R2 credentials usage. Updated the R2 command to explicitly set AWS_SHARED_CREDENTIALS_FILE for better credential management. Simplified region assignment logic in storage initialization to improve readability and maintainability.
* Refactor Nebius-related code for clarity and correctness. Ensure Nebius paths are properly validated and transformed, replacing `if` checks with assertions. Fixed default region handling in `create_endpoint` and corrected variable naming in `split_nebius_path` for consistency.
* Refactor SDK initialization to use a cached global instance. Introduce a global `_sdk` variable to cache the SDK instance, preventing redundant initialization and avoiding repeated calls to `nebius.sdk.SDK()` in the `sdk()` function. `_sdk` is initialized only once, either with IAM credentials or a credentials file.
* Update bucket URI format in mount and storage test. Replaced the `bucket_uri` returned in the test with a prefixed `nebius://` format for consistency with the updated storage access conventions.
* format
* [Jobs][UX] add -all option to jobs queue printing (#4923): add all option; formatting; fix comments; refactor jobs queue display logic and improve job listing
* [deps] pin ibm-platform-services to >=0.48.0 to work around issue (#4939)
* [api server] avoid deleting requests.db but not -wal/-shm (#4941)
* [Test] Fix kubernetes failure tests (#4874): resource_heavy for test_multi_tenant_managed_jobs; longer initial delay and resource_heavy; test launch fast; more log; restore log; remove resource heavy; restore change; longer initial delay; wait for NOT_READY for the test_skyserve_rolling_update test; remove unused import; increase the sleep to 120; fix test_managed_jobs_storage and test_kubernetes_storage_mounts; restore deleted tests; fix azure check; revert test-only commits (029a3a7, fa70b8f, 3480116, c695b56); add comment; no spot for kubernetes tests; bigger initial delay; check if it is an eks cluster; fix bool arg
* [k8s] filter out nodes with less accelerators than requested (#4930): filter out nodes in gke with less accelerators than requested; address comments; gpu check executes on non-tpu nodes
* [Jobs] Error out for intermediate bucket on cloud not enabled (#4942): better logging for reauth error; add reauth exception; format
* [Docs] Add docs on implementing priorities in k8s (#4803): Add priorities page; address comments, add to k8s setup docs; fixes
* [Docs] Minor wording changes (#4940): wip; updates; reword; add
* [Examples] LLM/Gemma3 Example (#4937): gemma3; update gemma3.yaml to specify exact versions for transformers and vllm and add a readiness probe in the service section; update README.md to correct the command option from 'deepseek' to 'gemma-3'; remove outdated command option from README.md; update readme for serving; correct the number of nodes in the service launch command and update the link to the appropriate SkyPilot YAML file
* [Doc] Gemma3 doc update (#4948). Update README and documentation to include the Gemma 3 model and example: added Gemma 3 to the news section in README.md and updated the models index in the documentation.
* [Docs] Update k8s volume mounting docs + refactor optional steps (#4934): update volume mounting docs; add nested tabset, restructure optional steps; move volume mounting docs; reorder; casing; comments; fix; reduce links
* [Test] Simplified buildkite agent queue (#4932): remove serve and backcompat; ignore buildkite yaml file
* [Docs] Update benefits for client-server (#4945): update. Co-authored-by: Zongheng Yang <[email protected]>
* Fix flaky test_cancel_launch_and_exec_async (#4966): comma; use generic_cloud; new line format
* [Docs] fix typo in gemma3 example (#4971). Signed-off-by: Aylei <[email protected]>
* [k8s] better support for GKE scale-to-zero autoscaling node pools (#4935): working codepath; remove prints and an assert; make into classes; update codepath comment; lint; slight reformat; review feedback; autoscale_detecror -> autoscaler; unnest regions_with_offering logic; short circuit on unsupported autoscaler; formalize context name validation, add exception handling for cluster info request; account for TPUs; code hardening; remove AUTOSCALER_TO_LABEL_FORMATTER in favor of expanding AUTOSCALER_TYPE_TO_AUTOSCALER; more debug logs; final review comments addressed
* fix incorrect vcpu/mem checks for GKE autoscaler (#4972)
* fix annotation "kubernetes.io/ingress.class" is deprecated (#4974): ingress spec based on version. Signed-off-by: Ajay-Satish-01 <[email protected]>
* [UX] Fix dense cli for resources not enough (#4962)
* [API server] attach setup of controllers (#4931): lint; address review comments. Signed-off-by: Aylei <[email protected]>
* [Test] Add support for missing bashrc file in zsh shells (#4963): add support for zsh; fix for bashrc after testing
* [Docs] Fix NFS mounting docs for k8s (#4951): add kubernetes key
* [k8s] GKE support for TPU V6 (#4986): gke t6 support; remove wrong check
* Fix test_managed_jobs_storage failure on azure in master branch (#4965): fix; longer timeout
* [Test] Separate different params into different steps on buildkite and fix flaky test_job_queue_with_docker (#4955): different param to different steps; longer time to sleep
* [API server] cleanup executor processes on shutdown (#4912): refine; just raise impossible exceptions; Update sky/utils/subprocess_utils.py. Signed-off-by: Aylei <[email protected]>; Co-authored-by: Zhanghao Wu <[email protected]>
* [k8s] LRU cache for GKE can_create_new_instance_of_type (#4973): LRU cache for can_create_new_instance_of_type; request scope
* [Test] Refactor backward compatibility test (#4906): backward compat; backcompat update; generate pipeline; bug fix; remove deactivate; robust backcompat test; more log; subprocess run with bash; update template; fix flaky; limit concurrency; pip install uv; low resource; bump python version to 3.10; recreate env; import order
* [Core] Independent storage check (#4977): independent storage check; formatting; granular perms; _is_storage_cloud_enabled uses storage check; UX improvement; remove debug logs; fix local test; no sky check regression, managed jobs work; api backwards compatibility; define globally minimal perms for gcp; review feedback; continue from except
* [GCP] Don't require TPU support for serve:gcp if TPU support is not required (#4991)
* followup to #4935 (#4989): review comments; use .get() where it makes sense
* [Serve] BugFix: `any_of` field order issue caused version bump to not work (#4978): upd
* [Example] Batch Inference (#4994): initial code for batched inference; refactor batch inference scripts and configuration files for consistency and clarity (removed unused bucket name generation and the monitoring service launch from `batch_compute_vectors.py`; updated `compute_text_vectors.yaml` and `monitor_progress.yaml` to use a unified bucket name; revised README to reflect the focus on embedding generation and performance highlights); formatting; renamed `compute_text_vectors` and `monitor_progress` to include a `batch-inference` prefix; README updates (replaced local image links with external URLs, corrected the endpoint variable in the monitoring instructions, added images and a further-learning section, documented the Amazon reviews dataset and the `Alibaba-NLP/gte-Qwen2-7B-instruct` embedding model); update banner

Signed-off-by: Aylei <[email protected]>
Signed-off-by: Ajay-Satish-01 <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: zpoint <[email protected]>
Co-authored-by: Kaiyuan Eric Chen <[email protected]>
Co-authored-by: Christopher Cooper <[email protected]>
Co-authored-by: Aylei <[email protected]>
Co-authored-by: chris mckenzie <[email protected]>
Co-authored-by: Seung Jin <[email protected]>
Co-authored-by: Ajay Satish <[email protected]>
Co-authored-by: Daniel Shin <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
1 parent a480f34 commit f727408
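Among the changes above, the sync commands rewrite `nebius://` URLs to `s3://` so the stock AWS CLI can address Nebius's S3-compatible endpoint. A minimal sketch of that rewrite (the helper name `to_s3_url` is ours; the committed code performs the replacement inline):

```python
def to_s3_url(source: str) -> str:
    """Rewrite a nebius:// URL into s3:// form for AWS CLI compatibility."""
    # The committed code asserts the scheme before rewriting, as here.
    assert 'nebius://' in source, 'nebius:// is not in source'
    return source.replace('nebius://', 's3://', 1)


print(to_s3_url('nebius://my-bucket/data/train.csv'))  # s3://my-bucket/data/train.csv
```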

File tree

12 files changed: +1026, -23 lines

docs/source/getting-started/installation.rst (+18 lines)
@@ -584,6 +584,24 @@ To use *Service Account* authentication, follow these steps:
 * The `NEBIUS_IAM_TOKEN` file, if present, will take priority for authentication.
 * Service Accounts are restricted to a single region. Ensure you configure the Service Account for the appropriate region during creation.
 
+Nebius offers `Object Storage <https://nebius.com/services/storage>`_, an S3-compatible object storage without any egress charges.
+SkyPilot can download/upload data to Nebius buckets and mount them as a local filesystem on clusters launched by SkyPilot. To set up Nebius support, run:
+
+.. code-block:: shell
+
+   # Install boto
+   pip install boto3
+   # Configure your Nebius Object Storage credentials
+   aws configure --profile nebius
+
+In the prompt, enter your Nebius Access Key ID and Secret Access Key (see `instructions to generate Nebius credentials <https://docs.nebius.com/object-storage/quickstart#env-configure>`_). Select :code:`auto` for the default region and :code:`json` for the default output format.
+
+.. code-block:: bash
+
+   aws configure set aws_access_key_id $NB_ACCESS_KEY_AWS_ID --profile nebius
+   aws configure set aws_secret_access_key $NB_SECRET_ACCESS_KEY --profile nebius
+   aws configure set region eu-west1 --profile nebius
+   aws configure set endpoint_url https://storage.eu-west1.nebius.cloud:443 --profile nebius
 
 Request quotas for first time users
 --------------------------------------
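The profile above works because Nebius Object Storage exposes a regional, S3-compatible endpoint of the form `https://storage.<region>.nebius.cloud:443`. A small sketch that derives the endpoint from a region, mirroring the value configured in the docs (the helper name `nebius_endpoint` is ours):

```python
def nebius_endpoint(region: str = 'eu-west1') -> str:
    """Build the Nebius Object Storage endpoint URL for a region."""
    return f'https://storage.{region}.nebius.cloud:443'


# The endpoint configured for the 'nebius' profile above:
print(nebius_endpoint('eu-west1'))  # https://storage.eu-west1.nebius.cloud:443
```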

sky/adaptors/nebius.py (+128, -6 lines)
@@ -1,7 +1,11 @@
 """Nebius cloud adaptor."""
 import os
+import threading
+from typing import Optional
 
 from sky.adaptors import common
+from sky.utils import annotations
+from sky.utils import ux_utils
 
 NEBIUS_TENANT_ID_FILENAME = 'NEBIUS_TENANT_ID.txt'
 NEBIUS_IAM_TOKEN_FILENAME = 'NEBIUS_IAM_TOKEN.txt'
@@ -12,6 +16,10 @@
 NEBIUS_PROJECT_ID_PATH = '~/.nebius/' + NEBIUS_PROJECT_ID_FILENAME
 NEBIUS_CREDENTIALS_PATH = '~/.nebius/' + NEBIUS_CREDENTIALS_FILENAME
 
+DEFAULT_REGION = 'eu-north1'
+
+NEBIUS_PROFILE_NAME = 'nebius'
+
 MAX_RETRIES_TO_DISK_CREATE = 120
 MAX_RETRIES_TO_INSTANCE_STOP = 120
 MAX_RETRIES_TO_INSTANCE_START = 120
@@ -23,15 +31,27 @@
 POLL_INTERVAL = 5
 
 _iam_token = None
+_sdk = None
 _tenant_id = None
 _project_id = None
 
+_IMPORT_ERROR_MESSAGE = ('Failed to import dependencies for Nebius AI Cloud. '
+                         'Try pip install "skypilot[nebius]"')
+
 nebius = common.LazyImport(
     'nebius',
-    import_error_message='Failed to import dependencies for Nebius AI Cloud. '
-    'Try running: pip install "skypilot[nebius]"',
+    import_error_message=_IMPORT_ERROR_MESSAGE,
     # https://github.com/grpc/grpc/issues/37642 to avoid spam in console
     set_loggers=lambda: os.environ.update({'GRPC_VERBOSITY': 'NONE'}))
+boto3 = common.LazyImport('boto3', import_error_message=_IMPORT_ERROR_MESSAGE)
+botocore = common.LazyImport('botocore',
+                             import_error_message=_IMPORT_ERROR_MESSAGE)
+
+_LAZY_MODULES = (boto3, botocore, nebius)
+_session_creation_lock = threading.RLock()
+_INDENT_PREFIX = ' '
+NAME = 'Nebius'
+SKY_CHECK_NAME = 'Nebius (for Nebius Object Storage)'
 
 
 def request_error():
@@ -104,7 +124,109 @@ def get_tenant_id():
 
 
 def sdk():
-    if get_iam_token() is not None:
-        return nebius.sdk.SDK(credentials=get_iam_token())
-    return nebius.sdk.SDK(
-        credentials_file_name=os.path.expanduser(NEBIUS_CREDENTIALS_PATH))
+    global _sdk
+    if _sdk is None:
+        if get_iam_token() is not None:
+            _sdk = nebius.sdk.SDK(credentials=get_iam_token())
+            return _sdk
+        _sdk = nebius.sdk.SDK(
+            credentials_file_name=os.path.expanduser(NEBIUS_CREDENTIALS_PATH))
+    return _sdk
+
+
+def get_nebius_credentials(boto3_session):
+    """Gets the Nebius credentials from the boto3 session object.
+
+    Args:
+        boto3_session: The boto3 session object.
+    Returns:
+        botocore.credentials.ReadOnlyCredentials object with the Nebius
+        credentials.
+    """
+    nebius_credentials = boto3_session.get_credentials()
+    if nebius_credentials is None:
+        with ux_utils.print_exception_no_traceback():
+            raise ValueError('Nebius credentials not found. Run '
+                             '`sky check` to verify credentials are '
+                             'correctly set up.')
+    return nebius_credentials.get_frozen_credentials()
+
+
+# lru_cache() is thread-safe and it will return the same session object
+# for different threads.
+# Reference: https://docs.python.org/3/library/functools.html#functools.lru_cache # pylint: disable=line-too-long
+@annotations.lru_cache(scope='global')
+def session():
+    """Create an AWS session."""
+    # Creating the session object is not thread-safe for boto3,
+    # so we add a reentrant lock to synchronize the session creation.
+    # Reference: https://github.com/boto/boto3/issues/1592
+    # However, the session object itself is thread-safe, so we are
+    # able to use lru_cache() to cache the session object.
+    with _session_creation_lock:
+        session_ = boto3.session.Session(profile_name=NEBIUS_PROFILE_NAME)
+    return session_
+
+
+@annotations.lru_cache(scope='global')
+def resource(resource_name: str, region: str = DEFAULT_REGION, **kwargs):
+    """Create a Nebius resource.
+
+    Args:
+        resource_name: Nebius resource name (e.g., 's3').
+        kwargs: Other options.
+    """
+    # Need to use the resource retrieved from the per-thread session
+    # to avoid thread-safety issues (Directly creating the client
+    # with boto3.resource() is not thread-safe).
+    # Reference: https://stackoverflow.com/a/59635814
+
+    session_ = session()
+    nebius_credentials = get_nebius_credentials(session_)
+    endpoint = create_endpoint(region)
+
+    return session_.resource(
+        resource_name,
+        endpoint_url=endpoint,
+        aws_access_key_id=nebius_credentials.access_key,
+        aws_secret_access_key=nebius_credentials.secret_key,
+        region_name=region,
+        **kwargs)
+
+
+@annotations.lru_cache(scope='global')
+def client(service_name: str, region):
+    """Create a Nebius client of a certain service.
+
+    Args:
+        service_name: Nebius service name (e.g., 's3').
+        region: Region of the client.
+    """
+    # Need to use the client retrieved from the per-thread session
+    # to avoid thread-safety issues (Directly creating the client
+    # with boto3.client() is not thread-safe).
+    # Reference: https://stackoverflow.com/a/59635814
+
+    session_ = session()
+    nebius_credentials = get_nebius_credentials(session_)
+    endpoint = create_endpoint(region)
+
+    return session_.client(service_name,
+                           endpoint_url=endpoint,
+                           aws_access_key_id=nebius_credentials.access_key,
+                           aws_secret_access_key=nebius_credentials.secret_key,
+                           region_name=region)
+
+
+@common.load_lazy_modules(_LAZY_MODULES)
+def botocore_exceptions():
+    """AWS botocore exceptions."""
+    # pylint: disable=import-outside-toplevel
+    from botocore import exceptions
+    return exceptions
+
+
+def create_endpoint(region: Optional[str] = DEFAULT_REGION) -> str:
+    """Builds the endpoint URL for interacting with Nebius Object Storage."""
+    if region is None:
+        region = DEFAULT_REGION
    return f'https://storage.{region}.nebius.cloud:443'
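The `sdk()` change above caches a single SDK instance in a module-level global so that repeated calls do not re-initialize the Nebius SDK. The same lazy-singleton pattern, reduced to a runnable sketch: a factory callable stands in for `nebius.sdk.SDK(...)`, and a lock is added for thread safety (the committed `sdk()` itself is unlocked, while the adaptor guards boto3 session creation with an RLock):

```python
import threading

_sdk = None
_sdk_lock = threading.RLock()


def sdk(create=object):
    """Return a process-wide instance, creating it only on the first call."""
    global _sdk
    with _sdk_lock:
        if _sdk is None:
            _sdk = create()  # stands in for nebius.sdk.SDK(...)
    return _sdk


first = sdk()
assert sdk() is first  # later calls reuse the cached instance
```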

sky/cloud_stores.py (+66 lines)
@@ -19,6 +19,7 @@
 from sky.adaptors import azure
 from sky.adaptors import cloudflare
 from sky.adaptors import ibm
+from sky.adaptors import nebius
 from sky.adaptors import oci
 from sky.clouds import gcp
 from sky.data import data_utils
@@ -543,6 +544,70 @@ def make_sync_file_command(self, source: str, destination: str) -> str:
         return download_via_ocicli
 
 
+class NebiusCloudStorage(CloudStorage):
+    """Nebius Cloud Storage."""
+
+    # List of commands to install AWS CLI
+    _GET_AWSCLI = [
+        'aws --version >/dev/null 2>&1 || '
+        f'{constants.SKY_UV_PIP_CMD} install awscli',
+    ]
+
+    def is_directory(self, url: str) -> bool:
+        """Returns whether nebius 'url' is a directory.
+
+        In cloud object stores, a "directory" refers to a regular object whose
+        name is a prefix of other objects.
+        """
+        nebius_s3 = nebius.resource('s3')
+        bucket_name, path = data_utils.split_nebius_path(url)
+        bucket = nebius_s3.Bucket(bucket_name)
+
+        num_objects = 0
+        for obj in bucket.objects.filter(Prefix=path):
+            num_objects += 1
+            if obj.key == path:
+                return False
+            # If several objects (here, 3) share the prefix, it is a directory
+            if num_objects == 3:
+                return True
+
+        # A directory with few or no items
+        return True
+
+    def make_sync_dir_command(self, source: str, destination: str) -> str:
+        """Downloads using AWS CLI."""
+        # AWS Sync by default uses 10 threads to upload files to the bucket.
+        # To increase parallelism, modify max_concurrent_requests in your
+        # aws config file (Default path: ~/.aws/config).
+        endpoint_url = nebius.create_endpoint()
+        assert 'nebius://' in source, 'nebius:// is not in source'
+        source = source.replace('nebius://', 's3://')
+        download_via_awscli = (f'{constants.SKY_REMOTE_PYTHON_ENV}/bin/aws s3 '
+                               'sync --no-follow-symlinks '
+                               f'{source} {destination} '
+                               f'--endpoint {endpoint_url} '
+                               f'--profile={nebius.NEBIUS_PROFILE_NAME}')
+
+        all_commands = list(self._GET_AWSCLI)
+        all_commands.append(download_via_awscli)
+        return ' && '.join(all_commands)
+
+    def make_sync_file_command(self, source: str, destination: str) -> str:
+        """Downloads a file using AWS CLI."""
+        endpoint_url = nebius.create_endpoint()
+        assert 'nebius://' in source, 'nebius:// is not in source'
+        source = source.replace('nebius://', 's3://')
+        download_via_awscli = (f'{constants.SKY_REMOTE_PYTHON_ENV}/bin/aws s3 '
+                               f'cp {source} {destination} '
+                               f'--endpoint {endpoint_url} '
+                               f'--profile={nebius.NEBIUS_PROFILE_NAME}')
+
+        all_commands = list(self._GET_AWSCLI)
+        all_commands.append(download_via_awscli)
+        return ' && '.join(all_commands)
+
+
 def get_storage_from_path(url: str) -> CloudStorage:
     """Returns a CloudStorage by identifying the scheme:// in a URL."""
     result = urllib.parse.urlsplit(url)
@@ -559,6 +624,7 @@ def get_storage_from_path(url: str) -> CloudStorage:
         'r2': R2CloudStorage(),
         'cos': IBMCosCloudStorage(),
         'oci': OciCloudStorage(),
+        'nebius': NebiusCloudStorage(),
         # TODO: This is a hack, as Azure URL starts with https://, we should
         # refactor the registry to be able to take regex, so that Azure blob can
         # be identified with `https://(.*?)\.blob\.core\.windows\.net`
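`is_directory` above decides file-vs-directory by listing at most a few keys under the prefix: an object whose key equals the path exactly is a file; otherwise the prefix is treated as a directory. The same heuristic over a plain list of keys (a stand-in for the boto3 `bucket.objects.filter(Prefix=path)` iterator, so no credentials are needed):

```python
def is_directory(keys, path):
    """Prefix heuristic from NebiusCloudStorage.is_directory, over plain keys."""
    num_objects = 0
    for key in keys:  # keys assumed pre-filtered to those starting with `path`
        num_objects += 1
        if key == path:
            return False  # exact key match: a regular object (a file)
        if num_objects == 3:
            return True   # several objects share the prefix: a directory
    return True           # empty or small listing: treat as a directory


print(is_directory(['data/a.csv', 'data/b.csv'], 'data/'))  # True
print(is_directory(['model.pt'], 'model.pt'))               # False
```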

sky/clouds/nebius.py (+4, -1 lines)
@@ -279,10 +279,13 @@ def check_credentials(cls) -> Tuple[bool, Optional[str]]:
         return True, None
 
     def get_credential_file_mounts(self) -> Dict[str, str]:
-        return {
+        credential_file_mounts = {
             f'~/.nebius/{filename}': f'~/.nebius/{filename}'
             for filename in _CREDENTIAL_FILES
         }
+        credential_file_mounts['~/.aws/credentials'] = '~/.aws/credentials'
+
+        return credential_file_mounts
 
     @classmethod
     def get_current_user_identity(cls) -> Optional[List[str]]:
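The diff above extends `get_credential_file_mounts` so that, besides the `~/.nebius/` credential files, the shared AWS credentials file (which holds the `nebius` profile) is mounted onto launched clusters. A stand-alone sketch of the resulting mapping (written as a free function for illustration; in the source it is a method on the Nebius cloud class):

```python
def get_credential_file_mounts(credential_files):
    """Map local credential paths to identical paths on the cluster."""
    mounts = {f'~/.nebius/{name}': f'~/.nebius/{name}'
              for name in credential_files}
    # Also mount ~/.aws/credentials, which stores the 'nebius' AWS profile.
    mounts['~/.aws/credentials'] = '~/.aws/credentials'
    return mounts


print(get_credential_file_mounts(['NEBIUS_IAM_TOKEN.txt']))
```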
