Scale up/down GPU instances based on n_jobs #90

Closed
4 tasks
lefnire opened this issue Oct 24, 2020 · 0 comments
Labels
🤖AI All the ML issues (NLP, XGB, etc) help wanted Extra attention is needed 🛠Stability Anything stability-related, usually around server/GPU setup

Comments

@lefnire
Collaborator

lefnire commented Oct 24, 2020

Currently the server spins up only one AWS Batch instance when users are active, and spins it down after inactivity. This needs to scale with the number of jobs.

  • Submit additional Batch jobs based on n_new_jobs vs. n_machines, where n_machines counts instances in the "pending" or "on" state. Each instance can handle 2 jobs at a time (this should improve after "Reconsider current BERT models", #28), so something like: if n_new_jobs/n_machines/2 > 1: cloud_up().
  • Add the inverse check to spin instances down. Instead of each instance handling its own shutdown, submit a kill job from server_jobs: only one instance will take that job, and server_jobs will reconsider submitting another kill job after one instance goes down.
  • (small) If the queue is empty, the ETA reads "eta 0 seconds"; it should be total+30.
  • Add a dict (map) of how many concurrent jobs each job type can handle. E.g., summarization/sentiment can handle 2-3 jobs at once; question-answering only one (it can't run concurrently with anything else). Also add a config.yml option for altering this per machine, in case some machines have more or less GPU compute.
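
A rough sketch of how the scale-up/scale-down check and the concurrency map from the list above might fit together. Everything here is an assumption for illustration: the names `DEFAULT_CONCURRENCY`, `machines_needed`, and `scaling_action` are hypothetical, the slot counts are guesses taken from the examples above, and `overrides` stands in for whatever a per-machine config.yml option would provide. It treats each job as using a fraction of one machine, which ignores the finer point that question-answering cannot share a machine with other job types.

```python
import math
from fractions import Fraction

# Hypothetical map: how many jobs of each type one machine can run at once.
# Values are assumptions based on the examples in this issue.
DEFAULT_CONCURRENCY = {
    "summarization": 2,
    "sentiment": 2,
    "question-answering": 1,  # needs a machine to itself
}


def machines_needed(queued_jobs, overrides=None, default_slots=2):
    """Estimate how many machines the queued job types require.

    Each job contributes 1/slots of a machine, where slots is the
    concurrency for its type. `overrides` (hypothetical) mimics a
    per-machine config.yml setting and replaces the default values.
    """
    concurrency = {**DEFAULT_CONCURRENCY, **(overrides or {})}
    load = sum(
        Fraction(1, concurrency.get(job_type, default_slots))
        for job_type in queued_jobs
    )
    return math.ceil(load)


def scaling_action(queued_jobs, n_machines, overrides=None):
    """Return "up", "down", or "hold".

    n_machines is assumed to count instances that are "pending" or "on".
    The caller would map "up" to something like cloud_up() and "down"
    to submitting a kill job; neither is implemented here.
    """
    needed = machines_needed(queued_jobs, overrides)
    if needed > n_machines:
        return "up"
    if needed < n_machines:
        return "down"
    return "hold"
```

Comparing against a required machine count, rather than dividing by n_machines, also sidesteps a division by zero when no instance is running yet.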
@lefnire lefnire added help wanted Extra attention is needed 🛠Stability Anything stability-related, usually around server/GPU setup 🤖AI All the ML issues (NLP, XGB, etc) labels Oct 24, 2020
@lefnire lefnire moved this to Beta in Gnothi Nov 6, 2022
@lefnire lefnire added this to Gnothi Nov 6, 2022
@lefnire lefnire closed this as completed May 29, 2023
@github-project-automation github-project-automation bot moved this from V1.5 to Done in Gnothi May 29, 2023
Projects
Archived in project