GPU concurrency / scaling / stability #10


Closed
9 of 13 tasks
lefnire opened this issue Oct 1, 2020 · 3 comments
Labels
🤖AI (All the ML issues: NLP, XGB, etc) · bug (Something isn't working) · help wanted (Extra attention is needed) · 🛠Stability (Anything stability-related, usually around server/GPU setup)

Comments


lefnire commented Oct 1, 2020

The GPU server is riddled with issues. It's meant to be runnable on multiple machines, and to spin up a Paperspace instance (or several) if there are no listeners. Each machine should grab a job and immediately set it to "working" so no other machine can grab it, and each machine should process at most 2 jobs at a time (figure out something smarter?).

  • QA is Longformer (4096 tokens) and extremely expensive. Either ensure only one runs at a time (rather than the usual cap of 2), or find a smaller model (smaller-model work tracked in Reconsider current BERT models #28).
  • Jobs hit a race condition: one job gets nabbed by multiple gpu-servers. I thought my update .. set state='working' returning .. would be sufficient, but maybe I need to move to a real job queue like RabbitMQ (Switch from Postgres manual job-queue to real JQ, like Celery? #52). Also, multiple instances get requested immediately, which is another race condition. Actually, definitely move to a job queue. (A possible Postgres-side stopgap is sketched just after this list.)
  • (easy) If there are 0 jobs in the queue, it says "eta 0 seconds"; it should be total+30.
  • Queueing up question-answering right after other jobs crashes the GPU.
  • QA often gets stuck with CPU at 100% and, actually, the file system at 100% too. It's reading something really hard, but I don't think it's the model (4GB though it is), since it lasts ~10 minutes and then crashes.
  • GPU instances get killed often & easily: they literally just dump "Killed" with no error and restart. Investigate.
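A possible stopgap for the job-grab race, before a real queue exists: claim the row with SELECT ... FOR UPDATE SKIP LOCKED in the same transaction as the state update, so two gpu-servers can't return the same job. A minimal sketch; the jobs table, its columns, and the connection URL are illustrative placeholders, not the actual schema:

```python
# Hypothetical sketch: claim one queued job atomically so two GPU servers
# can't grab the same row. Table/column names are illustrative, not Gnothi's schema.
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@localhost/gnothi")  # placeholder URL

CLAIM_SQL = sa.text("""
    UPDATE jobs SET state = 'working'
    WHERE id = (
        SELECT id FROM jobs
        WHERE state = 'queued'
        ORDER BY created_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED  -- other workers skip rows we've already locked
    )
    RETURNING id, payload
""")

def claim_job():
    # Runs in a single transaction: either we get a job nobody else has, or None.
    with engine.begin() as conn:
        row = conn.execute(CLAIM_SQL).fetchone()
    return row  # None if the queue is empty

```

SKIP LOCKED makes losers of the race skip the locked row instead of blocking on it, which is the usual Postgres-as-job-queue pattern.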

Paperspace stuff (moved to AWS Batch)

  • cloud_up_maybe on prod spins up 2 instances. It seems the machines table isn't populated fast enough relative to the gpu_status() check; a race condition. Are there two FastAPI threads on ECS? How do I prevent this race condition? (One advisory-lock option is sketched just after this list.)
  • Maybe switch from the server calling cron to a dedicated cron service, so CPU jobs are guaranteed to be singletons.
  • Decide, based on num_listeners and num_jobs_working, whether to scale Paperspace/Batch up to multiple jobs. I'll need to upgrade Paperspace ($8/mo for 2 jobs; $24/mo for 5 jobs).
  • Batch jobs succeed immediately, then close. Probably because last_job() returns a long-ago timestamp, so the instance comes up, decides nobody's around, and goes back down.
  • Request an AWS Batch increase in the number of concurrent jobs (currently only 1 is allowed, I think). [Update: evidently not; maybe just a fluke of the spot-instance resources available when I tested.]
  • Email me when Paperspace/Batch comes online.
  • Ensure exit(0) deletes the Paperspace job; otherwise I need to call the delete code explicitly.
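For the cloud_up_maybe double-spin-up in the first item above, one option (a sketch, not the implemented fix) is a Postgres transaction-scoped advisory lock, so only one worker per tick actually runs the gpu_status() check and requests an instance. The lock key and URL are arbitrary placeholders:

```python
# Hypothetical guard: only the caller that wins pg_try_advisory_xact_lock proceeds.
# cloud_up_maybe / gpu_status are stand-ins for the functions mentioned in this issue.
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@localhost/gnothi")  # placeholder URL
CLOUD_UP_LOCK = 0xC10D  # arbitrary app-wide lock key

def cloud_up_maybe_guarded():
    with engine.begin() as conn:
        got_lock = conn.execute(
            sa.text("SELECT pg_try_advisory_xact_lock(:key)"),
            {"key": CLOUD_UP_LOCK},
        ).scalar()
        if not got_lock:
            return  # another worker/instance is already handling this tick
        # ... check gpu_status() and spin up at most one instance here,
        # still inside the transaction so the lock is held until commit ...
```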
lefnire added the bug (Something isn't working), help wanted (Extra attention is needed), 🛠Stability (Anything stability-related, usually around server/GPU setup), and 🤖AI (All the ML issues: NLP, XGB, etc) labels on Oct 1, 2020

lefnire commented Oct 9, 2020

Fixed GPU crash, due to RAM overload.
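
For reference, the usual mitigation for RAM/VRAM buildup in a long-running transformer worker (not necessarily the exact fix applied here) is to drop the model and flush caches between jobs:

```python
# Hedged sketch: release model memory between jobs. Not necessarily the fix
# that was applied; just the common pattern for long-running GPU workers.
import gc
import torch

def run_job(load_model, job):
    model = load_model()          # e.g. a Hugging Face pipeline
    try:
        return model(job)
    finally:
        del model                 # drop Python references
        gc.collect()              # reclaim host RAM
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached VRAM to the driver
```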


lefnire commented Oct 23, 2020

Ah, looks like gunicorn spawns (8?) workers, and apscheduler runs its cron jobs in each worker. That means habitica-sync runs 8x per user all at once, and cloud_up_maybe is called 8x concurrently, so there's a race condition on notify_online (and 2-3 jobs get enqueued). So it's indeed time to switch to a proper job-queuing system.

I tried apscheduler with jobstores=vars.DB_FULL to ensure all workers reference the same jobs, and job_defaults=dict(coalesce=True, max_instances=1) to prevent overlap; but it didn't do the trick (I think with the 8 workers, apscheduler still hits the race condition). Docs here. I also tried fastapi-utils#repeat_every, hoping FastAPI's default threadpool would prevent multiple runs, but that didn't do it either. I'm not sure why; I'd assumed that's what that utility is for? Finally, I don't want to limit the number of workers, and I'm a bit sketched out by --preload (this example) because I want to keep FastAPI on Gunicorn running as intended, with high concurrency. Another option is a background-script singleton via /app/prestart.sh; but that script would still run multiple times if the server scales up.
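For reference, the apscheduler attempt above looks roughly like this (a sketch: the DB URL, interval, and job body are placeholders; only the jobstores/job_defaults settings mirror what's described):

```python
# Sketch of the apscheduler attempt described above: a shared SQLAlchemy jobstore
# plus coalesce/max_instances. Each gunicorn worker still builds its own scheduler,
# which is why the race remained.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

DB_FULL = "postgresql://user:pass@localhost/gnothi"  # stand-in for vars.DB_FULL

scheduler = BackgroundScheduler(
    jobstores={"default": SQLAlchemyJobStore(url=DB_FULL)},
    job_defaults={"coalesce": True, "max_instances": 1},
)

def cloud_up_maybe():
    ...  # the real function lives in the app; placeholder here

# replace_existing lets every worker re-register the same job id without duplicating
# the definition, but each worker's scheduler still wakes up and tries to run it.
scheduler.add_job(cloud_up_maybe, "interval", minutes=1,
                  id="cloud_up_maybe", replace_existing=True)
scheduler.start()
```

The shared jobstore de-duplicates the job definition, but APScheduler doesn't coordinate execution across processes, so each of the 8 workers' schedulers can still fire the same job.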

So, time for a proper job queue. Looking into Celery.

[Update] Temporary solution: a dedicated singleton server-jobs service (same Dockerfile, ENTRYPOINT=jobs.py). Will move everything to Celery together later (server jobs, GPU jobs, etc.).
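When the Celery move happens, the cron-style pieces would presumably land on Celery beat, which runs as a single process and sidesteps the per-worker cron problem. A minimal sketch with placeholder broker URL, schedules, and task bodies:

```python
# Minimal Celery sketch for the planned move: beat schedules the cron-ish jobs,
# and since only one beat process runs, the "8 workers each run cron" problem goes away.
# Broker URL, intervals, and task bodies are placeholders, not the project's actual config.
from celery import Celery

app = Celery("gnothi_jobs", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    "cloud-up-maybe": {"task": "jobs.cloud_up_maybe", "schedule": 60.0},
    "habitica-sync": {"task": "jobs.habitica_sync", "schedule": 60.0 * 10},
}

@app.task(name="jobs.cloud_up_maybe")
def cloud_up_maybe():
    ...  # spin GPU instances up/down based on queue depth

@app.task(name="jobs.habitica_sync")
def habitica_sync():
    ...  # per-user sync, now run exactly once per schedule tick

# Run with something like:
#   celery -A this_module worker --beat --loglevel=info
```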


lefnire commented Oct 24, 2020

Will track #90 for GPU instances, #52 for job-queue, #28 for model performance.

lefnire closed this as completed on Oct 24, 2020
lefnire mentioned this issue on Nov 2, 2020 (4 tasks)
lefnire moved this to Done in Gnothi on Nov 6, 2022
lefnire added this to Gnothi on Nov 6, 2022