New 32 #64
Conversation
The Job class represents a running job, and JobDefinition is the configuration needed for a job, which can then be submitted to a specific cluster.
Signed-off-by: Kevin <[email protected]>
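For readers skimming the diff, here is a rough sketch of that split; the class shapes and attribute names below are illustrative only, not the SDK's actual API:

# Hypothetical sketch of the JobDefinition / Job split described above.
# Names and attributes are illustrative, not the SDK's actual API.
class JobDefinition:
    """Static configuration for a job; nothing runs until it is submitted."""

    def __init__(self, script: str, requirements: str = None):
        self.script = script
        self.requirements = requirements

    def submit(self, cluster) -> "Job":
        # Submitting a definition against a concrete cluster yields a running Job.
        return Job(definition=self, cluster=cluster)


class Job:
    """Handle to a job that has been submitted to a specific cluster."""

    def __init__(self, definition: JobDefinition, cluster):
        self.definition = definition
        self.cluster = cluster

    def status(self):
        ...  # query the scheduler backing the cluster

    def logs(self):
        ...  # fetch logs from the scheduler backing the cluster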
    # max_retries=0, # default
    # mounts=None, # default
),
scheduler="ray", # can be determined by type of cluster if more are introduced
Should we add this as a configurable parameter, "scheduler", in the Config, with "ray" as the default, and then rename this class TorchXJob()? I know the status() and logs() functions assume Ray at the moment, but the class name and hard-coding feel a bit awkward here. WDYT?
Yeah, the scheduler is really dependent on the cluster type, so I don't think it needs to be exposed to the user. It shouldn't be hard-coded, though.
OK, can we update this to something like scheduler = self.config.cluster_type then?
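A minimal sketch of what that could look like, assuming the config object carries a cluster_type attribute (the attribute and helper names here are assumptions, not the PR's code):

# Hypothetical helper: derive the TorchX scheduler name from the cluster type
# instead of hard-coding "ray". The cluster_type attribute is assumed.
SUPPORTED_SCHEDULERS = {"ray"}  # extend when more cluster types are supported


def scheduler_for(config) -> str:
    scheduler = config.cluster_type
    if scheduler not in SUPPORTED_SCHEDULERS:
        raise ValueError(f"Unsupported cluster type: {scheduler!r}")
    return scheduler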
Signed-off-by: Kevin <[email protected]>
Create job definition, but do not submit.

The primary purpose of this function is to facilitate unit testing.
"""
""" | |
""" | |
pass |
A job definition to be submitted to a generic backend cluster.
"""

def _dry_run(self, cluster) -> str:
Suggested change:
def _dry_run(self, cluster: "Cluster") -> str:
def logs(self):
    """
    Method for retrieving the job's logs.
    """
""" | |
""" | |
pass |
""" | ||
Submit the job definition to a specific cluster, resulting in a Job object. | ||
""" | ||
return TorchXRayJob(self, cluster) |
Should this be something like:

if cluster.cluster_type == "ray":
    return TorchXRayJob(self, cluster)
else:
    print("Unsupported Scheduler")

since we will be looking at adding the MCAD scheduler here soon as well?
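A self-contained sketch of that dispatch, raising instead of printing so an unsupported scheduler fails loudly once more backends exist; the cluster_type attribute and the stub class are assumptions:

class TorchXRayJob:  # stub standing in for the class under review
    def __init__(self, definition, cluster):
        self.definition, self.cluster = definition, cluster


def submit(definition, cluster):
    # Dispatch on the cluster's type; an MCAD branch could slot in here later.
    if cluster.cluster_type == "ray":
        return TorchXRayJob(definition, cluster)
    raise NotImplementedError(f"Unsupported scheduler: {cluster.cluster_type!r}")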
all_jobs.append(self)

@property
def job_id(self):
Suggested change:
def job_id(self):
    """
    Returns the job id if present. If not, returns the job id from the Ray dashboard.
    """
if hasattr(self, "_job_id"):
    return self._job_id
dashboard_address = f"{self.cluster.cluster_dashboard_uri(self.cluster.config.namespace).lstrip('http://')}:8265"
_, _, job_id = parse_app_handle(self._app_handle)
self._job_id = job_id.lstrip(f"{dashboard_address}-")
Suggested change:
if hasattr(self, "_job_id"):
    return self._job_id
dashboard_address = f"{self.cluster.cluster_dashboard_uri(self.cluster.config.namespace).lstrip('http://')}:8265"
_, _, job_id = parse_app_handle(self._app_handle)
self._job_id = job_id.lstrip(f"{dashboard_address}-")
return self._job_id
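As an aside, the hasattr-based caching (including the missing return flagged above) could also be written with functools.cached_property. This is only a sketch against the code in the diff; the import path for parse_app_handle is an assumption:

from functools import cached_property

from torchx.specs.api import parse_app_handle  # import path assumed


class TorchXRayJob:
    # __init__ is assumed to set self._app_handle and self.cluster as in the diff.

    @cached_property
    def job_id(self) -> str:
        """Job id parsed from the TorchX app handle, computed once and then cached."""
        dashboard_address = f"{self.cluster.cluster_dashboard_uri(self.cluster.config.namespace).lstrip('http://')}:8265"
        _, _, job_id = parse_app_handle(self._app_handle)
        return job_id.lstrip(f"{dashboard_address}-")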
from typing import List
from pathlib import Path

from ray.job_submission import JobSubmissionClient
I don't think this is needed.
cpu=cluster.config.max_cpus,
gpu=cluster.config.gpu,
Can we have these be the defaults, but expose them as options to the user? This setup assumes you want to use 100% of the cluster's resources for every job submitted, and I'm not sure that would always be the case (a rough sketch of the idea follows after this thread).
I'm having trouble balancing keeping the interface simple against making the code sufficiently generic. As I make it more generic, I just fall back to the TorchX design, and I feel the abstraction isn't offering much in terms of simplification. I think I might need a rethink.
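For what it's worth, a sketch of the reviewer's suggestion in this thread: per-job overrides that fall back to the cluster's configured maximums. The parameter and attribute names are assumptions based on the diff above:

# Hypothetical helper: per-job resource requests with cluster-level defaults.
def resources_for(cluster, cpu=None, gpu=None) -> dict:
    return {
        "cpu": cpu if cpu is not None else cluster.config.max_cpus,
        "gpu": gpu if gpu is not None else cluster.config.gpu,
    }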
Superseded by #70
Please feel free to comment and offer suggestions.
Closes: #32