
make DVC recover from network failures #2884


Closed

shcheklein opened this issue Dec 3, 2019 · 10 comments
Labels
feature request (Requesting a new feature), p2-medium (Medium priority, should be done, but less important), performance (improvement over resource / time consuming tasks), ui (user interface / interaction)

Comments

@shcheklein
Member

DVC has core commands that run a large number of operations against remotes:

- dvc status
- dvc push/pull
- collecting checksums for a remote directory

or long-running network operations, such as pushing/pulling a large file.

It would be great for DVC to recover from network failures (even substantial ones). Instead of giving up at the very first read/connect error, we should write a message and keep retrying with some exponential backoff. This should greatly improve the user experience for those who deal with long-running operations every day.
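
For illustration, a minimal sketch of the kind of retry loop proposed here, with exponential backoff and jitter. The function and parameter names (retry_with_backoff, retries, base_delay) are hypothetical placeholders, not DVC API, and the exception types are only a rough stand-in for "network failure":

```python
import random
import time


def retry_with_backoff(operation, retries=5, base_delay=1.0, max_delay=60.0):
    """Run `operation`, retrying on network errors with exponential backoff.

    `operation` is any zero-argument callable, e.g. a single file transfer.
    """
    for attempt in range(retries):
        try:
            return operation()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == retries - 1:
                raise  # out of attempts, surface the error to the user
            delay = min(max_delay, base_delay * 2 ** attempt)
            delay += random.uniform(0, delay / 10)  # jitter to spread out retries
            print(f"network error ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```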

@ghost added the feature request and p2-medium labels and removed the triage label Dec 3, 2019
@efiop added the ui label Dec 4, 2019
@dmpetrov added the performance label Feb 3, 2020
@isidentical
Contributor

> It would be great for DVC to recover from network failures (even substantial ones). Instead of giving up at the very first read/connect error, we should write a message and keep retrying with some exponential backoff.

So should we poll indefinitely or just retry N times and then give up?

@efiop
Contributor

efiop commented Feb 2, 2021

@isidentical Definitely we shouldn't poll indefinitely. Some reasonable number of retries would be great. At what level do you think we should put the retries, btw?

@isidentical
Contributor

> At what level do you think we should put the retries?

It requires a bit of research, though my initial guess would be BaseTree.upload/download
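
For illustration only, a sketch of what a retry wrapper at that level could look like. BaseTree here is a hypothetical stand-in, not the actual DVC class, and the decorator name is made up:

```python
import functools
import time


def retriable(retries=5, delay=3.0):
    """Decorator sketch: retry a method on connection/timeout errors."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except (ConnectionError, TimeoutError):
                    if attempt == retries - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


class BaseTree:  # hypothetical stand-in for the real DVC tree class
    @retriable(retries=5, delay=3.0)
    def upload(self, from_info, to_info):
        ...  # remote-specific transfer logic would go here
```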

@isidentical
Contributor

isidentical commented Feb 2, 2021

> Some reasonable number of retries would be great.

Maybe something like 10 retries with a 3-second delay by default (at worst, it will poll for 30 seconds and then exit)? I think it would also make sense to expose these as config options for users to set depending on their expectations and their applications.

@shcheklein
Member Author

What would happen if it's downloading a 100GB file and the connection breaks in the middle? I know that good FTP downloaders can resume from the middle. Can we do the same for the majority of the remotes? That's at least what I had in mind with this ticket. It probably means that a plain retry of the whole download/upload is not enough, though it can be a good first step. The only thing I worry about is that we'll have to move it down / rewrite it later.
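
For context, resuming a transfer from the middle generally relies on ranged reads. A rough sketch with the requests library, assuming an HTTP(S) remote that honors Range headers (object stores expose similar byte-range APIs); the URL and paths are placeholders:

```python
import os

import requests


def resume_download(url, path, chunk_size=1024 * 1024):
    """Continue a partial download from the bytes already on disk."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    with requests.get(url, headers=headers, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        # 206 Partial Content means the server honored the Range request;
        # a plain 200 means it did not, so start over from scratch.
        mode = "ab" if resp.status_code == 206 else "wb"
        with open(path, mode) as fobj:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fobj.write(chunk)
```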

@isidentical
Contributor

> Can we do the same for the majority of the remotes?

I know that s3 supports it partially, and probably others do too (for download). Though since we are changing the cache format, I wonder whether we should wait for that or not. We might automatically get the ability to resume the upload from the chunk where the error happened, so the problem basically goes away.

@shcheklein
Member Author

> We might automatically get the ability to resume the upload from the chunk where the error happened, so the problem basically goes away.

@isidentical this is a very good point! My 2 cents - I would indeed wait for that, and for now do a basic retry like you initially suggested. We can add a checkbox to this ticket to revisit it later when chunking is implemented.

@isidentical
Contributor

I guess the real issue for retrying is determining whether an operation failed because of a connection error or something else; we currently don't have unified exceptions for this. Also, apparently, some clients (or the underlying HTTP libraries) have already been doing this for some time (e.g. s3 with boto polls for a minute or so until it gives up). Something to keep in mind, since if we add a retry on top of that, the total time might become unreasonable.
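
As a side note on the built-in client retries mentioned above: botocore exposes its retry behavior through the client config, so the underlying S3 client may already be retrying before DVC ever sees the error. A small sketch (bucket and key names are placeholders):

```python
import boto3
from botocore.config import Config

# botocore already retries throttling/connection errors on its own;
# stacking a DVC-level retry on top multiplies the total wait time.
s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "standard"}),
)
s3.download_file("my-bucket", "data/file.bin", "file.bin")  # placeholder names
```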

@isidentical
Contributor

Just to note: s3fs and gcsfs already have some sort of recovery mechanism for connection timeout errors (configurable by setting retries=..., which defaults to 5):
https://github.com/dask/gcsfs/blob/60be1b9446c70277714c2ee24b9fe2be76e9c979/gcsfs/core.py#L547-L549
https://github.com/dask/s3fs/blob/35b703ed82b40b1ac449cbc4c354aefe527fa8bd/s3fs/core.py#L235-L238

@efiop
Contributor

efiop commented Oct 8, 2021

Closing, since this is now the duty of the particular filesystem implementation, and all of the important ones handle this gracefully.

@efiop closed this as completed Oct 8, 2021