Old bug (dvc pull from S3 and md5 starts with 00) reappears with latest release #6089
Comments
Would you mind sending the full traceback?

I just tried this as a test case and it seemed to work:

```python
@pytest.mark.parametrize("remote", [pytest.lazy_fixture("s3")], indirect=True)
def test_pull_00_prefix(tmp_dir, dvc, remote):
    tmp_dir.dvc_gen({"foo": "363"})
    dvc.push()
    clean(["foo"], dvc)
    stats = dvc.pull("foo")
    assert stats["fetched"] == 1
    assert stats["added"] == ["foo"]
    [cache_dir] = dvc.cloud.get_remote('upstream').fs.ls(remote)
    assert cache_dir.endswith('00')
```
@isidentical I think we'd need to test with pulling more than one file at a time; if we're only trying to push/pull a single file, we do the direct exists() check. It looks like we might not be handling the DVC-specific case. See #4141 (comment) for an explanation of the original bug, and #4149 (comment) for discussion on why we added it; essentially, there are times that we do the prefix-based query.
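The trade-off behind the direct-versus-traverse decision described above can be sketched roughly as follows. Note this is an illustrative assumption about the heuristic, with made-up names and default values, not DVC's actual implementation:

```python
def plan_status_query(num_wanted: int, estimated_remote_size: int,
                      traverse_threshold: int = 500_000,
                      traverse_weight_multiplier: int = 20) -> str:
    """Choose between checking each wanted hash directly (one
    existence query per object) and listing the remote by hash
    prefix ("traverse"), then intersecting with the wanted set."""
    if num_wanted == 1:
        # A single object: a direct existence check is always cheapest.
        return "direct"
    if estimated_remote_size < traverse_threshold:
        # Small remote: listing everything is cheap enough.
        return "traverse"
    # Large remote: direct checks win only when the wanted set is much
    # smaller than the remote, since listing returns many objects per
    # request while direct checks cost one request each.
    if num_wanted * traverse_weight_multiplier < estimated_remote_size:
        return "direct"
    return "traverse"
```

A test for this bug must therefore avoid the single-file fast path, otherwise the prefix-based listing code is never exercised.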
Probably.
Yeah, this should work in the meantime, but ideally it would be best if we could use arbitrary prefix lengths longer than 2 for clouds that support it.
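To see why longer prefixes help, note that each extra hex character in the listing prefix cuts the expected number of objects returned by a factor of 16. A back-of-the-envelope sketch (not DVC code):

```python
def expected_objects_listed(total_objects: int, prefix_len: int) -> float:
    """Expected number of objects returned when listing one hex
    prefix of the given length, assuming md5 hashes are uniformly
    distributed across prefixes (16**prefix_len buckets)."""
    return total_objects / (16 ** prefix_len)
```

For example, with 1,000,000 remote objects, a 2-character prefix such as `00` still matches about 3,906 objects per listing, while a 4-character prefix matches about 15.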
Also, it looks like we added a test in #4149 which is still present right now, but it was only testing the exists() path 🙁

@isidentical Yes, we actually use just that.
It would be great if we could add a more generic test:

```python
@pytest.mark.parametrize("remote", [pytest.lazy_fixture("s3")], indirect=True)
def test_pull_00_prefix(tmp_dir, dvc, remote):
    random_files = {f'random_{n}': str(n) for n in range(1024)}
    tmp_dir.dvc_gen({"foo": "363", "random": random_files})
    dvc.push()
    clean(["foo", "random"], dvc)
    stats = dvc.pull()
    assert 'foo' in stats['added']
```
@isidentical The behavior depends on TRAVERSE_PREFIX_LEN (@pmrowla could correct me if I'm wrong), so we'll need to exceed that or, probably better, just patch it to be lower for the tests. EDIT: I meant TRAVERSE_THRESHOLD_SIZE 🤦
If you set those in your test, it should force the prefix-based query for all push/pull/status queries with more than one file, on all remote types.
Still can't reproduce :/ (also tried setting the following):

```python
@pytest.mark.parametrize("remote", [pytest.lazy_fixture("s3")], indirect=True)
def test_pull_00_prefix(tmp_dir, dvc, remote, monkeypatch):
    from dvc.fs.s3 import BaseFileSystem

    BaseFileSystem.TRAVERSE_THRESHOLD_SIZE = 0
    BaseFileSystem.TRAVERSE_WEIGHT_MULTIPLIER = 1
    random_files = {f'random_{n}': str(n) for n in range(32)}
    tmp_dir.dvc_gen({"foo": "363", "random": random_files})
    dvc.push()
    clean(["foo", "random"], dvc)
    stats = dvc.pull()
    assert 'foo' in stats['added']
```
OK, it is actually reproducible if I also hard-code it.

Setting TRAVERSE_PREFIX_LEN to 2 fixes it. For testing this, I think a dataset with >4096 files and these values should hit the issue as well:

We should probably also introduce an internal flag for forcing the different traverse behaviors, for simpler testing purposes.
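For context, DVC stores each object under its md5, with the leading hex characters split off as a directory, which is why content whose md5 starts with 00 exercises this edge case. This sketch shows that layout plus a quick way to brute-force such content for a test; the helper names here are hypothetical, not DVC API:

```python
import hashlib

def cache_path(content: bytes, prefix_len: int = 2) -> str:
    """Cache layout: the first `prefix_len` hex chars of the md5
    become the directory name, the remaining chars the file name."""
    md5 = hashlib.md5(content).hexdigest()
    return f"{md5[:prefix_len]}/{md5[prefix_len:]}"

def find_00_content(limit: int = 10_000) -> bytes:
    """Brute-force a small payload whose md5 starts with '00';
    on average 1 in 256 candidates qualifies."""
    for i in range(limit):
        data = str(i).encode()
        if hashlib.md5(data).hexdigest().startswith("00"):
            return data
    raise RuntimeError("no match found within limit")
```

This is presumably how the `"foo": "363"` fixture in the tests above was chosen: its cache directory on the remote ends in `00`.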
@isidentical Thanks for the quick fix! 🙏

@jnd77 We are releasing 2.3.0 right now with a temporary fix. Thanks for the feedback!
Thanks a lot for the quick fix. Works fine now. :)
Keeping this issue open for now, since the current hotfix is just a workaround until we have a proper solution in the fsspec backends.
All cloud providers now support it.
@karajan1001 FYI ^^^ Please check that ossfs supports it as well. |
Bug Report
dvc pull does not seem to work for files whose md5 starts with 00 when the remote is S3.
Description
Bug #4141 has reappeared with release 2.2.0 (it works fine with 2.1.0). It might be due to this commit: #5683
Let me know if you need more information ... :)