
make DVC recover from network failures #2884


Closed

shcheklein opened this issue Dec 3, 2019 · 10 comments
Labels
feature request (Requesting a new feature), p2-medium (Medium priority, should be done, but less important), performance (improvement over resource / time consuming tasks), ui (user interface / interaction)

Comments

@shcheklein
Member

DVC has core commands that run a large number of operations against remotes:

- dvc status
- dvc push/pull
- collecting checksums for a remote directory

or long-running network operations, such as pushing/pulling a large file.

It would be great for DVC to recover from network failures (even substantial ones). Instead of giving up at the very first read/connect error, we should write a message and keep retrying with some exponential backoff. This should greatly improve the user experience for those who deal with long-running operations every day.
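
For illustration, a minimal sketch of the kind of retry loop proposed here, with exponential backoff and jitter. The function and parameter names (retry_with_backoff, retries, base_delay) are hypothetical placeholders, not DVC API, and the exception types are only a rough stand-in for "network failure":

```python
import random
import time


def retry_with_backoff(operation, retries=5, base_delay=1.0, max_delay=60.0):
    """Run `operation`, retrying on network errors with exponential backoff.

    `operation` is any zero-argument callable, e.g. a single file transfer.
    """
    for attempt in range(retries):
        try:
            return operation()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == retries - 1:
                raise  # out of attempts, surface the error to the user
            delay = min(max_delay, base_delay * 2 ** attempt)
            delay += random.uniform(0, delay / 10)  # jitter to spread out retries
            print(f"network error ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```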

@ghost added the feature request and p2-medium labels and removed the triage label Dec 3, 2019
@efiop added the ui label Dec 4, 2019
@dmpetrov added the performance label Feb 3, 2020
@isidentical
Contributor

> It would be great for DVC to recover from network failures (even substantial ones). Instead of giving up at the very first read/connect error, we should write a message and keep retrying with some exponential backoff.

So should we poll indefinitely or just retry N times and then give up?

@efiop
Contributor

efiop commented Feb 2, 2021

@isidentical Definitely we shouldn't poll indefinitely. Some reasonable number of retries would be great. At what level do you think we should put the retries, btw?

@isidentical
Contributor

> At what level do you think we should put the retries?

It requires a bit of research, though my initial guess would be BaseTree.upload/download
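
For illustration only, a sketch of what a retry wrapper at that level could look like. BaseTree here is a hypothetical stand-in, not the actual DVC class, and the decorator name is made up:

```python
import functools
import time


def retriable(retries=5, delay=3.0):
    """Decorator sketch: retry a method on connection/timeout errors."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except (ConnectionError, TimeoutError):
                    if attempt == retries - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


class BaseTree:  # hypothetical stand-in for the real DVC tree class
    @retriable(retries=5, delay=3.0)
    def upload(self, from_info, to_info):
        ...  # remote-specific transfer logic would go here
```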

@isidentical
Contributor

isidentical commented Feb 2, 2021

> Some reasonable number of retries would be great.

Maybe something like 10 retries with a 3-second delay by default (at worst, it will poll for 30 seconds and then exit)? I think it would also make sense to expose these as config options for users to set depending on their expectations and their applications.

@shcheklein
Member Author

What would happen if it's downloading a 100GB file and the connection breaks in the middle? I know that good FTP downloaders can resume from the middle. Can we do the same for the majority of the remotes? That's at least what I had in mind with this ticket. It probably means that a plain retry of the whole download/upload is not enough, though it can be a good first step. The only thing I worry about is that we'll have to move it down / rewrite it later.
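
For context, resuming a transfer from the middle generally relies on ranged reads. A rough sketch with the requests library, assuming an HTTP(S) remote that honors Range headers (object stores expose similar byte-range APIs); the URL and paths are placeholders:

```python
import os

import requests


def resume_download(url, path, chunk_size=1024 * 1024):
    """Continue a partial download from the bytes already on disk."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    with requests.get(url, headers=headers, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        # 206 Partial Content means the server honored the Range request;
        # a plain 200 means it did not, so start over from scratch.
        mode = "ab" if resp.status_code == 206 else "wb"
        with open(path, mode) as fobj:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fobj.write(chunk)
```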

@isidentical
Contributor

> Can we do the same for the majority of the remotes?

I know that s3 supports it partially, and probably others do too (for download). Though since we are changing the cache format, I wonder whether we should wait for that or not. We might automatically get the ability to resume the upload from the chunk where the error happened, so the problem basically goes away.

@shcheklein
Member Author

> We might automatically get the ability to resume the upload from the chunk where the error happened, so the problem basically goes away.

@isidentical this is a very good point! My 2 cents - I would indeed wait for that, and for now do a basic retry like you initially suggested. We can add a checkbox to this ticket to revisit it later when chunking is implemented.

@isidentical
Contributor

I guess the real issue for retrying is determining whether an operation failed because of a connection error or something else; we currently don't have unified exceptions for this. Also, apparently, some clients (or the underlying HTTP libraries) have already been doing this for some time (e.g. s3 with boto polls for a minute or so until it gives up). Something to keep in mind, since if we add a retry on top of that, the total time might become unreasonable.
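
As a side note on the built-in client retries mentioned above: botocore exposes its retry behavior through the client config, so the underlying S3 client may already be retrying before DVC ever sees the error. A small sketch (bucket and key names are placeholders):

```python
import boto3
from botocore.config import Config

# botocore already retries throttling/connection errors on its own;
# stacking a DVC-level retry on top multiplies the total wait time.
s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "standard"}),
)
s3.download_file("my-bucket", "data/file.bin", "file.bin")  # placeholder names
```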

@isidentical
Contributor

Just to note: s3fs and gcsfs already have some sort of recovery mechanism for connection timeout errors (configurable by setting retries=..., which defaults to 5):
https://github.com/dask/gcsfs/blob/60be1b9446c70277714c2ee24b9fe2be76e9c979/gcsfs/core.py#L547-L549
https://github.com/dask/s3fs/blob/35b703ed82b40b1ac449cbc4c354aefe527fa8bd/s3fs/core.py#L235-L238

@efiop
Contributor

efiop commented Oct 8, 2021

Closing, since this is now the duty of the particular filesystem implementation, and all of the important ones handle this gracefully.

@efiop closed this as completed Oct 8, 2021