GPU concurrency / scaling / stability #10


Closed
9 of 13 tasks
lefnire opened this issue Oct 1, 2020 · 3 comments
Labels
🤖AI (All the ML issues: NLP, XGB, etc) · bug (Something isn't working) · help wanted (Extra attention is needed) · 🛠Stability (Anything stability-related, usually around server/GPU setup)

Comments


lefnire commented Oct 1, 2020

The GPU server is riddled with issues. It's meant to be runnable on multiple machines, and to spin up a Paperspace instance (or several) if there are no listeners. Each machine should grab a job and immediately set it to "working" so no other machine can grab it, and each machine should process at most 2 jobs at a time (figure out something smarter?).

  • QA is Longformer (4096 tokens) and extremely expensive. Either ensure only one runs at a time (rather than the usual cap of 2), or find a smaller model (smaller-model work tracked in Reconsider current BERT models #28).
  • Jobs hit a race condition: one job gets nabbed by multiple gpu-servers. I thought my update .. set state='working' returning .. would be sufficient, but maybe I need to move to a real job queue like RabbitMQ (Switch from Postgres manual job-queue to real JQ, like Celery? #52). Also, multiple instances get requested immediately, which is another race condition. Actually, definitely move to a job queue. (A possible Postgres-side stopgap is sketched just after this list.)
  • (easy) If there are 0 jobs in the queue, it says "eta 0 seconds"; it should be total+30.
  • Queueing up question-answering right after other jobs crashes the GPU.
  • QA often gets stuck with CPU at 100% and, actually, the file system at 100% too. It's reading something really hard, but I don't think it's the model (4GB though it is), since it lasts ~10 minutes and then crashes.
  • GPU instances get killed often & easily: they literally just dump "Killed" with no error and restart. Investigate.
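A possible stopgap for the job-grab race, before a real queue exists: claim the row with SELECT ... FOR UPDATE SKIP LOCKED in the same transaction as the state update, so two gpu-servers can't return the same job. A minimal sketch; the jobs table, its columns, and the connection URL are illustrative placeholders, not the actual schema:

```python
# Hypothetical sketch: claim one queued job atomically so two GPU servers
# can't grab the same row. Table/column names are illustrative, not Gnothi's schema.
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@localhost/gnothi")  # placeholder URL

CLAIM_SQL = sa.text("""
    UPDATE jobs SET state = 'working'
    WHERE id = (
        SELECT id FROM jobs
        WHERE state = 'queued'
        ORDER BY created_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED  -- other workers skip rows we've already locked
    )
    RETURNING id, payload
""")

def claim_job():
    # Runs in a single transaction: either we get a job nobody else has, or None.
    with engine.begin() as conn:
        row = conn.execute(CLAIM_SQL).fetchone()
    return row  # None if the queue is empty

```

SKIP LOCKED makes losers of the race skip the locked row instead of blocking on it, which is the usual Postgres-as-job-queue pattern.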

Paperspace stuff (moved to AWS Batch)

  • cloud_up_maybe on prod spins up 2 instances. It seems the machines table isn't populated fast enough relative to the gpu_status() check; a race condition. Are there two FastAPI threads on ECS? How do I prevent this race condition? (One advisory-lock option is sketched just after this list.)
  • Maybe switch from the server calling cron to a dedicated cron service, so CPU jobs are guaranteed to be singletons.
  • Decide, based on num_listeners and num_jobs_working, whether to scale Paperspace/Batch up to multiple jobs. I'll need to upgrade Paperspace ($8/mo for 2 jobs; $24/mo for 5 jobs).
  • Batch jobs succeed immediately, then close. Probably because last_job() returns a long-ago timestamp, so the instance comes up, decides nobody's around, and goes back down.
  • Request an AWS Batch increase in the number of concurrent jobs (currently only 1 is allowed, I think). [Update: evidently not; maybe just a fluke of the spot-instance resources available when I tested.]
  • Email me when Paperspace/Batch comes online.
  • Ensure exit(0) deletes the Paperspace job; otherwise I need to call the delete code explicitly.
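For the cloud_up_maybe double-spin-up in the first item above, one option (a sketch, not the implemented fix) is a Postgres transaction-scoped advisory lock, so only one worker per tick actually runs the gpu_status() check and requests an instance. The lock key and URL are arbitrary placeholders:

```python
# Hypothetical guard: only the caller that wins pg_try_advisory_xact_lock proceeds.
# cloud_up_maybe / gpu_status are stand-ins for the functions mentioned in this issue.
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@localhost/gnothi")  # placeholder URL
CLOUD_UP_LOCK = 0xC10D  # arbitrary app-wide lock key

def cloud_up_maybe_guarded():
    with engine.begin() as conn:
        got_lock = conn.execute(
            sa.text("SELECT pg_try_advisory_xact_lock(:key)"),
            {"key": CLOUD_UP_LOCK},
        ).scalar()
        if not got_lock:
            return  # another worker/instance is already handling this tick
        # ... check gpu_status() and spin up at most one instance here,
        # still inside the transaction so the lock is held until commit ...
```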
lefnire added the bug (Something isn't working), help wanted (Extra attention is needed), 🛠Stability (Anything stability-related, usually around server/GPU setup), and 🤖AI (All the ML issues: NLP, XGB, etc) labels on Oct 1, 2020

lefnire commented Oct 9, 2020

Fixed GPU crash, due to RAM overload.
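
For reference, the usual mitigation for RAM/VRAM buildup in a long-running transformer worker (not necessarily the exact fix applied here) is to drop the model and flush caches between jobs:

```python
# Hedged sketch: release model memory between jobs. Not necessarily the fix
# that was applied; just the common pattern for long-running GPU workers.
import gc
import torch

def run_job(load_model, job):
    model = load_model()          # e.g. a Hugging Face pipeline
    try:
        return model(job)
    finally:
        del model                 # drop Python references
        gc.collect()              # reclaim host RAM
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached VRAM to the driver
```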


lefnire commented Oct 23, 2020

Ah, looks like gunicorn spawns (8?) workers, and apscheduler runs its cron jobs in each worker. That means habitica-sync runs 8x per user all at once, and cloud_up_maybe is called 8x concurrently, so there's a race condition on notify_online (and 2-3 jobs get enqueued). So it's indeed time to switch to a proper job-queuing system.

I tried apscheduler with jobstores=vars.DB_FULL to ensure all workers reference the same jobs, and job_defaults=dict(coalesce=True, max_instances=1) to prevent overlap; but it didn't do the trick (I think with the 8 workers, apscheduler still hits the race condition). Docs here. I also tried fastapi-utils#repeat_every, hoping FastAPI's default threadpool would prevent multiple runs, but that didn't do it either. I'm not sure why; I'd assumed that's what that utility is for? Finally, I don't want to limit the number of workers, and I'm a bit sketched out by --preload (this example) because I want to keep FastAPI on Gunicorn running as intended, with high concurrency. Another option is a background-script singleton via /app/prestart.sh; but that script would still run multiple times if the server scales up.
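For reference, the apscheduler attempt above looks roughly like this (a sketch: the DB URL, interval, and job body are placeholders; only the jobstores/job_defaults settings mirror what's described):

```python
# Sketch of the apscheduler attempt described above: a shared SQLAlchemy jobstore
# plus coalesce/max_instances. Each gunicorn worker still builds its own scheduler,
# which is why the race remained.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

DB_FULL = "postgresql://user:pass@localhost/gnothi"  # stand-in for vars.DB_FULL

scheduler = BackgroundScheduler(
    jobstores={"default": SQLAlchemyJobStore(url=DB_FULL)},
    job_defaults={"coalesce": True, "max_instances": 1},
)

def cloud_up_maybe():
    ...  # the real function lives in the app; placeholder here

# replace_existing lets every worker re-register the same job id without duplicating
# the definition, but each worker's scheduler still wakes up and tries to run it.
scheduler.add_job(cloud_up_maybe, "interval", minutes=1,
                  id="cloud_up_maybe", replace_existing=True)
scheduler.start()
```

The shared jobstore de-duplicates the job definition, but APScheduler doesn't coordinate execution across processes, so each of the 8 workers' schedulers can still fire the same job.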

So, time for a proper job queue. Looking into Celery.

[Update] Temporary solution: a dedicated singleton server-jobs service (same Dockerfile, ENTRYPOINT=jobs.py). Will move everything to Celery together later (server jobs, GPU jobs, etc.).
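When the Celery move happens, the cron-style pieces would presumably land on Celery beat, which runs as a single process and sidesteps the per-worker cron problem. A minimal sketch with placeholder broker URL, schedules, and task bodies:

```python
# Minimal Celery sketch for the planned move: beat schedules the cron-ish jobs,
# and since only one beat process runs, the "8 workers each run cron" problem goes away.
# Broker URL, intervals, and task bodies are placeholders, not the project's actual config.
from celery import Celery

app = Celery("gnothi_jobs", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    "cloud-up-maybe": {"task": "jobs.cloud_up_maybe", "schedule": 60.0},
    "habitica-sync": {"task": "jobs.habitica_sync", "schedule": 60.0 * 10},
}

@app.task(name="jobs.cloud_up_maybe")
def cloud_up_maybe():
    ...  # spin GPU instances up/down based on queue depth

@app.task(name="jobs.habitica_sync")
def habitica_sync():
    ...  # per-user sync, now run exactly once per schedule tick

# Run with something like:
#   celery -A this_module worker --beat --loglevel=info
```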


lefnire commented Oct 24, 2020

Will track #90 for GPU instances, #52 for job-queue, #28 for model performance.

lefnire closed this as completed on Oct 24, 2020
lefnire mentioned this issue on Nov 2, 2020 (4 tasks)
lefnire moved this to Done in Gnothi on Nov 6, 2022
lefnire added this to Gnothi on Nov 6, 2022