New 32 #64

Closed
wants to merge 5 commits into from

Conversation

KPostOffice
Collaborator

@KPostOffice KPostOffice commented Feb 14, 2023

Please feel free to comment and offer suggestions.

Closes: #32

MichaelClifford and others added 2 commits February 8, 2023 16:06
Job class represents a running job and JobDefinition is the necessary
configuration for a job which can then be submitted to a specific
cluster

Signed-off-by: Kevin <[email protected]>
# max_retries=0, # default
# mounts=None, # default
),
scheduler="ray", # can be determined by type of cluster if more are introduced
Collaborator

Should we add this as a configurable parameter, "scheduler", in the Config, with 'ray' as the default, and then rename this class TorchXJob()? I know the status() and logs() functions assume Ray at the moment, but the class name and the hard-coding feel a bit awkward here. WDYT?

Collaborator Author

@KPostOffice KPostOffice Feb 16, 2023

Yeah, the scheduler really depends on the cluster type, so I don't think it needs to be exposed to the user. It shouldn't be hard-coded, though.

Collaborator

OK, can we update this to something like scheduler = self.config.cluster_type then?
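
A minimal sketch of the idea discussed in this thread: derive the scheduler from the cluster type instead of hard-coding "ray". The `ClusterConfig` stand-in, its `cluster_type` field, and the `SCHEDULER_BY_CLUSTER_TYPE` mapping are assumptions for illustration, not the SDK's actual API.

```python
from dataclasses import dataclass

# Hypothetical mapping from cluster type to scheduler name.
# Only "ray" is mentioned in this PR; other entries would be added later.
SCHEDULER_BY_CLUSTER_TYPE = {"ray": "ray"}


@dataclass
class ClusterConfig:
    """Stand-in for the SDK's cluster configuration (assumed shape)."""

    name: str
    cluster_type: str = "ray"  # assumed field, defaulting to Ray


def scheduler_for(config: ClusterConfig) -> str:
    """Return the scheduler implied by the cluster type, or fail loudly."""
    try:
        return SCHEDULER_BY_CLUSTER_TYPE[config.cluster_type]
    except KeyError:
        raise ValueError(
            f"Unsupported cluster type: {config.cluster_type!r}"
        ) from None
```

With this shape the user never passes a scheduler directly, which matches the point above that the scheduler follows from the cluster type.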

@KPostOffice KPostOffice marked this pull request as ready for review February 21, 2023 20:26
Create job definition, but do not submit.

The primary purpose of this function is to facilitate unit testing.
"""
Collaborator

Suggested change
- """
+ """
+ pass

A job definition to be submitted to a generic backend cluster.
"""

def _dry_run(self, cluster) -> str:
Collaborator

Suggested change
- def _dry_run(self, cluster) -> str:
+ def _dry_run(self, cluster: "Cluster") -> str:

def logs(self):
"""
Method for retrieving the job's logs.
"""
Collaborator

Suggested change
- """
+ """
+ pass
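
For context, here is one way the generic interfaces touched by these suggestions could be declared, sketched with Python's `abc` module. The method names `status`, `logs`, and `_dry_run` come from this PR; the `submit` name and the exact signatures are assumptions.

```python
from abc import ABC, abstractmethod


class Job(ABC):
    """A job that has been submitted to a cluster."""

    @abstractmethod
    def status(self) -> str:
        """Method for retrieving the job's current status."""

    @abstractmethod
    def logs(self) -> str:
        """Method for retrieving the job's logs."""


class JobDefinition(ABC):
    """A job definition to be submitted to a generic backend cluster."""

    @abstractmethod
    def _dry_run(self, cluster) -> str:
        """Create the job definition, but do not submit it."""

    @abstractmethod
    def submit(self, cluster) -> Job:  # hypothetical method name
        """Submit the definition to a cluster, resulting in a Job."""
```

A body consisting only of a docstring is already valid Python, so the trailing `pass` in the suggestions is a readability choice. Marking the methods with `@abstractmethod` additionally prevents the base classes from being instantiated by accident.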

"""
Submit the job definition to a specific cluster, resulting in a Job object.
"""
return TorchXRayJob(self, cluster)
Collaborator

Should this be something like:

if cluster.cluster_type == "ray":
    return TorchXRayJob(self, cluster)
else:
    print("Unsupported Scheduler")

I ask because we will be looking at adding the MCAD scheduler here soon, too.
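
A possible variant of the dispatch suggested above, sketched under assumptions: it looks up the job class in a table and raises instead of printing, so an unsupported cluster type cannot silently return `None`. `TorchXRayJob` and `FakeCluster` here are stand-ins, and `JOB_CLASSES` is a hypothetical name.

```python
class TorchXRayJob:
    """Stand-in for the PR's TorchXRayJob; it only stores its inputs."""

    def __init__(self, job_def, cluster):
        self.job_def = job_def
        self.cluster = cluster


class FakeCluster:
    """Minimal cluster stand-in exposing an assumed cluster_type attribute."""

    def __init__(self, cluster_type: str):
        self.cluster_type = cluster_type


# Hypothetical registry; an MCAD entry could be added here later.
JOB_CLASSES = {"ray": TorchXRayJob}


def submit(job_def, cluster):
    """Create the job object that matches the cluster's type."""
    try:
        job_cls = JOB_CLASSES[cluster.cluster_type]
    except KeyError:
        raise NotImplementedError(
            f"Unsupported scheduler for cluster type {cluster.cluster_type!r}"
        ) from None
    return job_cls(job_def, cluster)
```

Raising keeps the caller from continuing with a missing job object, which the `print` branch would allow.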

all_jobs.append(self)

@property
def job_id(self):
Collaborator

Suggested change
- def job_id(self):
+ def job_id(self):
+     """
+     Returns the job id if present. If not returns job id from ray dashboard.
+     """

Comment on lines +146 to +150
if hasattr(self, "_job_id"):
return self._job_id
dashboard_address = f"{self.cluster.cluster_dashboard_uri(self.cluster.config.namespace).lstrip('http://')}:8265"
_, _, job_id = parse_app_handle(self._app_handle)
self._job_id = job_id.lstrip(f"{dashboard_address}-")
Collaborator

Suggested change
- if hasattr(self, "_job_id"):
-     return self._job_id
- dashboard_address = f"{self.cluster.cluster_dashboard_uri(self.cluster.config.namespace).lstrip('http://')}:8265"
- _, _, job_id = parse_app_handle(self._app_handle)
- self._job_id = job_id.lstrip(f"{dashboard_address}-")
+ if hasattr(self, "_job_id"):
+     return self._job_id
+ dashboard_address = f"{self.cluster.cluster_dashboard_uri(self.cluster.config.namespace).lstrip('http://')}:8265"
+ _, _, job_id = parse_app_handle(self._app_handle)
+ self._job_id = job_id.lstrip(f"{dashboard_address}-")
+ return self._job_id
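
One thing worth noting about the snippet above: `str.lstrip` removes any leading characters that appear in its argument, not a literal prefix, so `lstrip('http://')` can also eat the first letters of a host name. A minimal sketch of a prefix-safe version follows. The `strip_prefix` helper, the `JobIdExample` class, and the assumed `<address>-<id>` shape of the raw job ID are illustrative only.

```python
def strip_prefix(text: str, prefix: str) -> str:
    """Remove prefix from the start of text only if it is really there."""
    return text[len(prefix):] if text.startswith(prefix) else text


class JobIdExample:
    """Illustrative stand-in that caches a derived job id."""

    def __init__(self, dashboard_uri: str, raw_job_id: str):
        self._dashboard_uri = dashboard_uri
        self._raw_job_id = raw_job_id

    @property
    def job_id(self) -> str:
        # Return the cached value if it was computed before.
        if hasattr(self, "_job_id"):
            return self._job_id
        dashboard_address = strip_prefix(self._dashboard_uri, "http://") + ":8265"
        self._job_id = strip_prefix(self._raw_job_id, f"{dashboard_address}-")
        return self._job_id


# lstrip treats its argument as a set of characters, so the leading "h"
# of the host name is stripped along with the scheme:
assert "http://host.example".lstrip("http://") == "ost.example"

example = JobIdExample("http://host.example", "host.example:8265-job123")
assert example.job_id == "job123"
```

On Python 3.9 and later, `str.removeprefix` does the same job as the helper.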

from typing import List
from pathlib import Path

from ray.job_submission import JobSubmissionClient
Collaborator

I don't think this is needed.

Comment on lines +101 to +102
cpu=cluster.config.max_cpus,
gpu=cluster.config.gpu,
Collaborator

Can we have these be the defaults but expose them as options to the user? This setup assumes you want to use 100% of the cluster's resources for every job submitted. I'm not sure that would always be the case.
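
A rough sketch of what that could look like: the cluster's values are used as defaults and the caller may override them per job. `max_cpus` and `gpu` are the config fields shown in the diff; `JobResources` and `resolve_resources` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterConfig:
    """Stand-in exposing only the two fields used in the diff above."""

    max_cpus: int = 1
    gpu: int = 0


@dataclass
class JobResources:
    cpu: int
    gpu: int


def resolve_resources(
    config: ClusterConfig,
    cpu: Optional[int] = None,
    gpu: Optional[int] = None,
) -> JobResources:
    """Use the caller's values when given, else fall back to the cluster's."""
    return JobResources(
        cpu=config.max_cpus if cpu is None else cpu,
        gpu=config.gpu if gpu is None else gpu,
    )
```

Comparing against `None` rather than relying on truthiness lets a caller explicitly request `gpu=0` on a GPU cluster.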

@KPostOffice
Collaborator Author

I'm having trouble balancing a simple interface against sufficiently generic code. As I make it more generic, I'm just falling back to the TorchX design, and I feel like the abstraction isn't offering much in terms of simplification. I think I might need a rethink.

@KPostOffice
Collaborator Author

Superseded by #70

@KPostOffice KPostOffice closed this Mar 3, 2023

Successfully merging this pull request may close these issues.

Update SDK to manage jobs
2 participants