make DVC recover from network failures #2884
Comments
So should we poll indefinitely or just retry?
@isidentical Definitely we shouldn't poll indefinitely. Some reasonable number of retries would be great. At what level do you think we should put the retries, btw?
It requires a bit of research, though my initial guess would be …
Maybe something like 10 retries, with a 3 second interval by default (at worst, it will poll for 30 seconds and then exit)? I think it would also make sense to expose these as config options, so users can set them depending on their expectations and their applications.
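For illustration only, a minimal sketch of what such a fixed-count retry helper could look like (names like `with_retries` and `remote.download` are made up here, not DVC APIs):

```python
import time

# Hypothetical helper (not DVC's actual code): retry an operation a fixed
# number of times with a fixed interval, matching the "10 retries, 3 second
# interval" default suggested above.
def with_retries(operation, retries=10, interval=3,
                 retryable=(ConnectionError, TimeoutError)):
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except retryable:
            if attempt == retries:
                raise  # exhausted all attempts, propagate the error
            time.sleep(interval)
```

Usage would be something like `with_retries(lambda: remote.download(path))`, where `remote.download` is a placeholder for whatever operation we want to guard.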
What would happen if it's downloading a 100GB file and the connection breaks in the middle? I know that good FTP downloaders can resume from the middle. Can we do the same for the majority of the remotes? That's at least what I had in mind with this ticket. It probably means that just retrying the whole download/upload is not enough, though it can be a good first step. The only thing I worry about is that we'll have to move it down / rewrite it later.
I know that s3 supports it partially (for download), and probably other remotes do too. Though since we are changing the cache format, I wonder whether we should wait for that or not. We might automatically get the feature of resuming the upload from the chunk where we had the error, so the problem basically goes away.
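To illustrate the S3 download case: object stores that support ranged reads let you continue from the byte offset already on disk. A minimal boto3 sketch (assuming the remote is S3; `resume_download` is a made-up name, not DVC code, and uploads would need multipart handling instead):

```python
import os

import boto3


def resume_download(bucket, key, local_path):
    """Continue a partially downloaded object from the byte where it stopped."""
    s3 = boto3.client("s3")
    # Whatever is already on disk counts as downloaded; start from that offset.
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={offset}-")
    with open(local_path, "ab") as fobj:
        for chunk in resp["Body"].iter_chunks(chunk_size=1024 * 1024):
            fobj.write(chunk)
```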
@isidentical this is a very good point! My 2 cents: I would indeed wait for that first and for now do a basic retry like you initially suggested. We can add a checkbox in this ticket to revisit it later when chunking is implemented.
I guess the real issue for retrying is determining whether an operation failed because of a connection error or something else. We currently don't have unified exceptions for this. Also, apparently, some clients (or the underlying HTTP libraries) already do this (e.g. S3 with boto), and have for some time (polling for a minute or so until they give up). Just something to keep in mind: if we add a retry on top of that, the total time might become unreasonable.
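Roughly, "unified exceptions" could mean mapping backend-specific errors to one DVC-level error that a retry layer treats as retryable. A sketch under that assumption (the class `NetworkErrorDVC` and the helper `translate_errors` are hypothetical):

```python
import socket

import botocore.exceptions
import requests.exceptions

# A (non-exhaustive) set of exceptions that indicate network trouble rather
# than a genuine application error.
CONNECTION_ERRORS = (
    ConnectionError,            # built-in, covers ConnectionResetError etc.
    socket.timeout,
    requests.exceptions.ConnectionError,
    requests.exceptions.Timeout,
    botocore.exceptions.EndpointConnectionError,
    botocore.exceptions.ConnectTimeoutError,
)


class NetworkErrorDVC(Exception):
    """Hypothetical unified error: the operation failed due to the network."""


def translate_errors(operation):
    try:
        return operation()
    except CONNECTION_ERRORS as exc:
        raise NetworkErrorDVC(str(exc)) from exc
```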
Just to note: s3fs and gcsfs already have some sort of recovery system for connection timeout errors (it is configurable via a setting).
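For example, to the best of my knowledge of the s3fs API, botocore's retry configuration can be passed through when constructing the filesystem (the exact values below are just placeholders):

```python
import s3fs

# Tune the underlying botocore retry behaviour via s3fs's config_kwargs.
fs = s3fs.S3FileSystem(
    config_kwargs={"retries": {"max_attempts": 10, "mode": "standard"}}
)
```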
Closing, since this is now the duty of the particular filesystem implementations, and all of the important ones handle it gracefully.
DVC has core commands that either run a large number of operations against remotes or perform a single long-running network operation, such as pushing/pulling a large file.
It would be great for DVC to recover from network failures (even substantial ones). Instead of giving up at the very first read/connect error, we should print a message and keep retrying with some exponential backoff. This should greatly improve the user experience for those who deal with long-running operations every day.
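A rough illustration of the proposed behaviour, not an actual implementation: log a message and retry with exponential backoff instead of failing on the first error (the function name and defaults are made up):

```python
import logging
import time

logger = logging.getLogger(__name__)


def retry_with_backoff(operation, max_retries=5, base_delay=1, max_delay=60):
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == max_retries:
                raise  # ran out of retries, give up
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay.
            delay = min(base_delay * 2 ** attempt, max_delay)
            logger.warning(
                "Network error (%s), retrying in %d seconds (attempt %d/%d)",
                exc, delay, attempt + 1, max_retries,
            )
            time.sleep(delay)
```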