GPU concurrency / scaling / stability #10
Labels:
- 🤖AI: All the ML issues (NLP, XGB, etc)
- bug: Something isn't working
- help wanted: Extra attention is needed
- 🛠Stability: Anything stability-related, usually around server/GPU setup
The GPU server is riddled with issues. It's meant to be runnable on multiple machines, and to spin up a Paperspace instance (or multiple) if there are no listeners. Each machine should grab a job and immediately set it to "working" so no other machine can grab it. Each machine should process 2 jobs max at a time (figure out something smarter? Or find a smaller model; track smaller-model @ Reconsider current BERT models #28).
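For the two-job cap, a minimal sketch of the shape I mean, assuming an asyncio worker loop; `fetch_job` and `run_job` are placeholder stubs, not the real code:

```python
import asyncio
import random

MAX_CONCURRENT_JOBS = 2  # per-machine cap from above; make this smarter later

async def fetch_job():
    # Placeholder stub: the real version would claim a row from Postgres.
    await asyncio.sleep(0.5)
    return {"id": random.randint(1, 1000)}

async def run_job(job):
    # Placeholder stub for the actual GPU work.
    await asyncio.sleep(2)
    print(f"finished job {job['id']}")

async def worker_loop():
    sem = asyncio.Semaphore(MAX_CONCURRENT_JOBS)
    while True:
        await sem.acquire()  # don't even claim a job we can't start yet
        job = await fetch_job()
        if job is None:
            sem.release()
            await asyncio.sleep(1)  # queue empty; poll again shortly
            continue
        task = asyncio.create_task(run_job(job))
        task.add_done_callback(lambda _: sem.release())

# asyncio.run(worker_loop())  # runs forever; commented out on purpose
```

Acquiring the semaphore before claiming means a busy machine never grabs a third job it can't start, leaving it for another machine (or triggering a spin-up).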
I thought `update .. set state='working' returning ..` would be sufficient, but maybe I need to move to a real job-queue like RabbitMQ (Switch from Postgres manual job-queue to real JQ, like Celery? #52). Also, multiple instances get requested immediately, which is another race-condition. Actually, definitely move to a job-queue.
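For the record, the single-statement claim I had in mind looks like this; a sketch assuming a `jobs` table with `id`, `state`, and `created_at` columns (schema and state names are guesses, only 'working' is from above). `FOR UPDATE SKIP LOCKED` is the part that stops two machines from grabbing the same row:

```python
import psycopg2  # assuming psycopg2; the real code might use SQLAlchemy

CLAIM_SQL = """
UPDATE jobs SET state = 'working'
WHERE id = (
    SELECT id FROM jobs
    WHERE state = 'new'      -- 'new' is a guessed queue state
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED   -- concurrent claimers skip already-locked rows
)
RETURNING id;
"""

def claim_job(conn):
    """Atomically claim one queued job; returns its id, or None if empty."""
    with conn, conn.cursor() as cur:  # `with conn` wraps this in a transaction
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        return row[0] if row else None
```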
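And if this does move to a real queue, the Celery shape is roughly the following (broker URL and task body are placeholders). Nice side effect: the two-jobs-per-machine cap becomes a worker flag instead of hand-rolled locking.

```python
from celery import Celery

app = Celery("gpu_jobs", broker="amqp://localhost")  # RabbitMQ; URL is a placeholder

@app.task(acks_late=True)  # ack after completion, so a dead worker's job gets re-delivered
def run_gpu_job(job_id):
    # Placeholder for the actual NLP/XGB work.
    print(f"processing job {job_id}")

# Producer side (e.g. the FastAPI endpoint):
#   run_gpu_job.delay(123)
#
# Worker side, capped at 2 concurrent jobs per machine:
#   celery -A tasks worker --concurrency=2
```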
Paperspace stuff (moved to AWS Batch):

- `machines` table not populated fast enough against the `gpu_status()` check; race-condition. Are there two FastAPI threads on ECS? How do I prevent this race-condition?
- ~~Paperspace~~ Batch to multiple jobs. I'll need to upgrade Paperspace ($8/mo for 2 jobs; $24/mo for 5 jobs).
- `last_job()` returning long-ago, so the server goes up, decides nobody's around, and goes down.
- ~~Paperspace~~ Batch is online: `exit(0)` deletes the Paperspace job. Otherwise I need to use explicit delete code.
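On the `last_job()` problem, the shutdown decision is essentially this check; a sketch with made-up names (`IDLE_SHUTDOWN_MINUTES` is not a real setting). My guess at the fix: record job *claims* as activity too, not just completions, so a fresh server doesn't see an ancient timestamp and immediately kill itself.

```python
from datetime import datetime, timedelta, timezone

IDLE_SHUTDOWN_MINUTES = 10  # made-up threshold, not the real config

def last_job():
    """Placeholder for the real last_job(); should return the most recent
    job activity (claimed or finished) as a timezone-aware datetime.
    The bug: if this only reflects long-finished jobs, a fresh server sees
    an ancient timestamp, decides nobody's around, and goes down."""
    return None

def should_shut_down():
    ts = last_job()
    if ts is None:
        return False  # no history yet: don't kill a brand-new server
    idle = datetime.now(timezone.utc) - ts
    return idle > timedelta(minutes=IDLE_SHUTDOWN_MINUTES)
```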