dvc push after dvc push is slow #2867

Closed
JohanMollevik opened this issue Nov 29, 2019 · 5 comments
Labels
performance improvement over resource / time consuming tasks ui user interface / interaction

Comments

@JohanMollevik

Please provide information about your setup
DVC version (i.e. dvc --version), platform and method of installation (pip, homebrew, pkg (Mac), exe (Windows), DEB (Linux), RPM (Linux))

  • debian stretch
  • dvc installed from pip
    $ dvc --version
    0.71.0

I have a large dataset (#2512) and have been trying to debug performance to evaluate whether dvc will work for this type of data.

I did one dvc push against Azure, which took 4 hours for 132 GB of data in 2.5M files tracked by 1 .dvc file. That is OK, assuming there were changes. I then immediately ran dvc push again, and it is taking 4 hours again.

Why does dvc not complete much faster on the second run? There should be no changes between the remote and the local cache, so I was expecting it to finish after a few minutes.
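
For a rough sense of scale, here is a back-of-the-envelope calculation using only the figures above (4 hours, 2.5M files, 132 GB). It assumes the time is spread evenly across files, which is a simplification, and the variable names are only for illustration.

```python
# Figures taken from the report above; everything else is derived arithmetic.
FILES = 2_500_000          # files in the dataset
SECONDS = 4 * 60 * 60      # one 4-hour push
DATA_GB = 132              # total size, decimal gigabytes

files_per_second = FILES / SECONDS
ms_per_file = 1000 * SECONDS / FILES
avg_file_kb = DATA_GB * 1e9 / FILES / 1e3  # average file size in kB

print(f"{files_per_second:.0f} files/s")
print(f"{ms_per_file:.2f} ms per file")
print(f"{avg_file_kb:.1f} kB average file size")
```

A few milliseconds per file would be in the range of a single network round-trip, so a second run costing the same as the first could be consistent with a per-file check dominating over the actual data transfer.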

@ghost ghost added the triage Needs to be triaged label Nov 29, 2019
@efiop
Contributor

efiop commented Nov 29, 2019

Hi @JohanMollevik !

I suppose it was the "Collecting information from remote cache" stage that was taking a lot of time, right? If so, it is caused by the fact that dvc needs to check each local file against the remote to make sure that all files from that dataset are present; this is needed to provide per-file granularity for directories. It is a known optimization issue and we are solving it in #2147. If your dataset is fairly static, I would consider adding it to dvc as an archive and unpacking it with something like dvc run -d data.tar.gz --outs-no-cache data tar -xvf data.tar.gz.
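
To illustrate why a per-file check scales with the number of files even when nothing has changed, here is a minimal sketch. It is not dvc's implementation: `FakeRemote`, `missing_by_per_file_check` and `missing_by_listing` are hypothetical names for a toy in-memory model that only counts remote requests. Whether #2147 takes the listing approach is not stated here.

```python
from typing import Iterable, Set


class FakeRemote:
    """Toy in-memory stand-in for a remote cache; counts requests made to it."""

    def __init__(self, names: Iterable[str]) -> None:
        self._names = set(names)
        self.requests = 0

    def exists(self, name: str) -> bool:
        self.requests += 1  # one round-trip per file
        return name in self._names

    def list_all(self) -> Set[str]:
        self.requests += 1  # one bulk request (a real remote would paginate)
        return set(self._names)


def missing_by_per_file_check(local: Iterable[str], remote: FakeRemote) -> Set[str]:
    """Ask the remote about every local file individually."""
    return {name for name in local if not remote.exists(name)}


def missing_by_listing(local: Iterable[str], remote: FakeRemote) -> Set[str]:
    """Fetch the remote listing once, then compare locally."""
    return set(local) - remote.list_all()


# Scenario from this issue: remote and local cache are identical.
local_files = [f"file{i}" for i in range(1000)]
per_file_remote = FakeRemote(local_files)
listing_remote = FakeRemote(local_files)

to_push_a = missing_by_per_file_check(local_files, per_file_remote)
to_push_b = missing_by_listing(local_files, listing_remote)

print(len(to_push_a), per_file_remote.requests)
print(len(to_push_b), listing_remote.requests)
```

Both strategies find nothing to push, but in this toy model the per-file check issues one request per file while the listing issues a single request. On a real object store a listing is paginated, so the saving would be smaller than this model suggests.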

@efiop efiop added the ui user interface / interaction label Nov 29, 2019
@ghost ghost removed the triage Needs to be triaged label Nov 29, 2019
@efiop efiop added the performance improvement over resource / time consuming tasks label Nov 29, 2019
@JohanMollevik
Author

I do not think it said; the progress bar was only 80 chars wide, so some text might have been hidden.
But there was first a progress bar taking 4 minutes or so, followed by one taking 4 hours.

@efiop
Contributor

efiop commented Nov 29, 2019

@JohanMollevik Hm, we are in the middle of revisiting the UI for pull/push/etc., so I might be missing something, but from the description it does seem to be the case. The first progress bar was collecting local status and the second one was collecting remote status. Please see my updated comment above.

@JohanMollevik
Author

@efiop Yes that looks likely. The work on #2147 sounds promising, I'm looking forward to seeing that resolved.

@efiop
Contributor

efiop commented Nov 29, 2019

@JohanMollevik I'll close this ticket in favor of #2147. Please ping us if you have any follow-up questions :)
