
remote: reduce traverse weight multiplier #3704


Closed

pmrowla opened this issue Apr 30, 2020 · 0 comments · Fixed by #3705
Assignees
pmrowla
Labels
c1-quick-fix, performance (improvement over resource / time consuming tasks)

Comments


pmrowla commented Apr 30, 2020

In remote.cache_exists(), the traverse (full remote listing) method is weighted to account for performance overhead that makes it inherently slower than querying individual objects:

dvc/dvc/remote/base.py, lines 923 to 928 in 18e8f07

# For sufficiently large remotes, traverse must be weighted to account
# for performance overhead from large lists/sets.
# From testing with S3, for remotes with 1M+ files, object_exists is
# faster until len(checksums) is at least 10k~100k
if remote_size > self.TRAVERSE_THRESHOLD_SIZE:
    traverse_weight = traverse_pages * self.TRAVERSE_WEIGHT_MULTIPLIER

The initial weight multiplier value (20) can now be lowered, thanks to the remote performance improvements made since the original traverse/no-traverse optimization PR.
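For context, here is a minimal sketch of how the weighted comparison can drive the traverse/no-traverse decision. The constant values, the LIST_OBJECT_PAGE_SIZE name, and the should_traverse helper are illustrative assumptions, not the actual dvc implementation; the point is only that lowering TRAVERSE_WEIGHT_MULTIPLIER moves the break-even point so traversal is chosen for smaller query sets.

# Minimal sketch, not the actual dvc implementation: the constants and
# the should_traverse helper are assumptions for illustration.

TRAVERSE_THRESHOLD_SIZE = 500_000   # assumed size cutoff for "large" remotes
TRAVERSE_WEIGHT_MULTIPLIER = 20     # the value this issue proposes lowering
LIST_OBJECT_PAGE_SIZE = 1000        # assumed listing page size (e.g. S3)


def should_traverse(checksums, remote_size):
    """Return True when a full remote listing is expected to be cheaper
    than issuing one object_exists query per checksum."""
    traverse_pages = remote_size / LIST_OBJECT_PAGE_SIZE
    if remote_size > TRAVERSE_THRESHOLD_SIZE:
        # Penalize traversal on large remotes for list/set overhead.
        traverse_weight = traverse_pages * TRAVERSE_WEIGHT_MULTIPLIER
    else:
        traverse_weight = traverse_pages
    return len(checksums) >= traverse_weight


# With the values above, a 1M-object remote needs len(checksums) >= 20,000
# before traversal is chosen; halving the multiplier halves that break-even.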

pmrowla added the c1-quick-fix and performance (improvement over resource / time consuming tasks) labels Apr 30, 2020
@pmrowla pmrowla self-assigned this Apr 30, 2020
pmrowla added a commit to pmrowla/dvc that referenced this issue Apr 30, 2020