dvc import: imports the same file twice from 2 different repos #9904
Comments
|
@asiron can you run md5sum on those two cache files and verify whether or not the hashes match as expected
|
Yes,
(.venv) mzurad@workstation:~/code/pe-test$ md5sum /mnt/ssd2/.shared-dvc-cache/files/md5/91/3affb56fb36f9bc9416cd569790a90 /mnt/ssd2/.shared-dvc-cache/files/md5/ab/32d18a00bd85ffbcb2d78d927dbbb3
913affb56fb36f9bc9416cd569790a90 /mnt/ssd2/.shared-dvc-cache/files/md5/91/3affb56fb36f9bc9416cd569790a90
913affb56fb36f9bc9416cd569790a90 /mnt/ssd2/.shared-dvc-cache/files/md5/ab/32d18a00bd85ffbcb2d78d927dbbb3 |
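The md5sum check above relies on the cache layout visible in these paths: a file stored under files/md5/<first two hex digits>/<remaining digits> is expected to hash to the concatenation of those two parts. A minimal Python sketch of the same check, assuming that layout holds for every cache entry (the function names are hypothetical helpers, not part of DVC):

```python
import hashlib
from pathlib import Path


def expected_md5(cache_file: Path) -> str:
    """Rebuild the hash implied by a cache path such as .../files/md5/91/3aff..."""
    return cache_file.parent.name + cache_file.name


def actual_md5(cache_file: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file contents in chunks so large files are not read at once."""
    digest = hashlib.md5()
    with cache_file.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_corrupted(cache_file: Path) -> bool:
    """True when the contents do not match the hash encoded in the path."""
    return expected_md5(cache_file) != actual_md5(cache_file)
```

Applied to the two paths in this comment, the second entry would be flagged: its path encodes ab32d18a00bd85ffbcb2d78d927dbbb3 while its contents hash to 913affb56fb36f9bc9416cd569790a90.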
I tried importing with a slightly different output path, but without success:
dvc import -v -o params/my-model-1.yaml --rev mvp1 [email protected]:organization/path/my-model-1.git params.yaml
dvc import -v -o params/my-model-2.yaml --rev mvp1 [email protected]:organization/path/my-model-2.git params.yaml |
I'm unable to reproduce this, I tried importing from local repos and from both Just to clarify, did you remove the shared cache files before you retried importing with the different paths? This cache file is invalid:
But as long as Since your imported .dvc files look correct, can you try explicitly removing the shared cache files and the existing links, and then just run
|
I did that and also disabled shared cache and wiped
$ md5sum .dvc/cache/files/md5/91/3affb56fb36f9bc9416cd569790a90
ab32d18a00bd85ffbcb2d78d927dbbb3 .dvc/cache/files/md5/91/3affb56fb36f9bc9416cd569790a90 |
I made another repo that imports the files from your repos and the problem seems to arise when we add
$ git clone [email protected]:asiron/testing-broken-fetch.git test-broken-fetch-cloned
$ cd test-broken-fetch-cloned
$ dvc fetch params/a/params.yaml.dvc
1 file fetched
$ dvc fetch params/b/params.yaml.dvc
1 file fetched
$ dvc checkout
WARNING: No file hash info found for '/home/mzurad/code/test-broken-fetch-cloned/out.txt'. It won't be created.
A params/b/params.yaml
A params/a/params.yaml
ERROR: Checkout failed for following targets:
out.txt
Is your cache up to date?
<https://error.dvc.org/missing-files>
$ md5sum params/b/params.yaml
6b371a942c39fd86583569072acb274d params/b/params.yaml
$ md5sum params/a/params.yaml
6b371a942c39fd86583569072acb274d params/a/params.yaml |
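The symptom shown by the two md5sum calls above (two imports from different repos ending up with identical contents) can also be detected in bulk. A small Python sketch, assuming the imported files are small enough to read into memory; `find_identical_files` is a hypothetical helper, not a DVC command:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_identical_files(paths):
    """Group paths by the md5 of their contents and keep only groups of 2+."""
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
        groups[digest].append(str(path))
    # Only hashes shared by more than one path indicate duplicated contents.
    return {digest: files for digest, files in groups.items() if len(files) > 1}
```

Calling it with params/a/params.yaml and params/b/params.yaml from this checkout would return a single group containing both paths, since both hash to 6b371a942c39fd86583569072acb274d.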
Thanks for that, I am able to reproduce the issue now (when I add a
So on the initial cc @efiop |
repro script:
#!/bin/bash
set -ex
A=https://github.com/pmrowla/test-a.git
B=https://github.com/pmrowla/test-b.git
REPO=repo
rm -rf $REPO
mkdir $REPO
pushd $REPO
git init
dvc init
cat >dvc.yaml <<EOL
stages:
  test:
    cmd: cat params/a/params.yaml params/b/params.yaml > foo.txt
    deps:
      - params/a/params.yaml
      - params/b/params.yaml
    outs:
      - foo.txt
EOL
mkdir -p params/a params/b
dvc import -o params/a/params.yaml --rev main $A params.yaml
dvc import -o params/b/params.yaml --rev main $B params.yaml
rm -rf .dvc/cache params/a/params.yaml params/b/params.yaml
cat params/a/params.yaml.dvc
cat params/b/params.yaml.dvc
dvc fetch -v params/a/params.yaml.dvc
dvc fetch -v params/b/params.yaml.dvc
dvc checkout params/a/params.yaml params/b/params.yaml
md5sum params/a/params.yaml params/b/params.yaml
popd |
Marking this as p0 since it looks like it's causing cache corruption |
@pmrowla Thanks for the research! 🙏 I think this is related to how we identify filesystems:
https://github.com/iterative/dvc-data/blob/8952521a7bfda6c6ece79293e3189aad7797b056/src/dvc_data/index/collect.py#L95
We use protocol + path, and they happen to match here for two dvcfs instances, hence the overlap 🤦‍♂️ Taking a look, likely need to introduce fsid. |
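To illustrate the overlap described in this comment, here is a minimal Python sketch. It is not the dvc-data code: `FakeRepoFS`, the `"dvc"` protocol string, the dictionary registries, and the way `fsid` is derived are all assumptions made for illustration. It only shows why a key built from protocol + path collides for two repos that both contain params.yaml, while a key that includes a per-filesystem id does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeRepoFS:
    """Stand-in for a per-repo filesystem (hypothetical, not DVCFileSystem)."""

    protocol: str
    repo_url: str

    @property
    def fsid(self) -> str:
        # Assumed scheme: an id that differs whenever the underlying repo differs.
        return f"{self.protocol}_{self.repo_url}"


def key_by_protocol(fs: FakeRepoFS, path: str) -> tuple:
    """Identify a file by protocol + path only."""
    return (fs.protocol, path)


def key_by_fsid(fs: FakeRepoFS, path: str) -> tuple:
    """Identify a file by filesystem id + path."""
    return (fs.fsid, path)


fs_a = FakeRepoFS("dvc", "https://github.com/pmrowla/test-a.git")
fs_b = FakeRepoFS("dvc", "https://github.com/pmrowla/test-b.git")

# Keyed by protocol + path: the second entry silently replaces the first.
by_protocol = {}
by_protocol[key_by_protocol(fs_a, "params.yaml")] = "contents of A"
by_protocol[key_by_protocol(fs_b, "params.yaml")] = "contents of B"

# Keyed by fsid + path: both entries survive.
by_fsid = {}
by_fsid[key_by_fsid(fs_a, "params.yaml")] = "contents of A"
by_fsid[key_by_fsid(fs_b, "params.yaml")] = "contents of B"

print(len(by_protocol), len(by_fsid))
```

With the protocol-based key, a lookup for repo A's params.yaml returns repo B's entry, which mirrors the symptom reported in this issue.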
Implements fsspec's fsid fsspec/filesystem_spec#1122 Required for iterative#9904
Also a similar problem could happen with other filesystems, not only dvcfs, but it is just less likely due to most other filesystems having some unique ids in their paths (e.g. bucket name for s3/azure/gs). Big thanks to @asiron for the quality feedback 🙏 |
Just tested this on |
@asiron 3.19.0 is on its way out (pypi will be ready in a few minutes and the rest of the packages will be ready in the evening/tomorrow). Thank you! 🙏 |
Bug Report
dvc import: pulls the same file from 2 different repos
Description
dvc import is importing the same file twice, even though the repos are different.
Basically I have 2 repos:
and each is a DVC repo that contains params.yaml.
My goal is to combine the 2 models into an application, therefore I need their respective params.yaml files. However, dvc import somehow gets confused (probably because the paths are the same in each repo?) and pulls the same file. Moreover, what's really weird is that the hashes on these files are different and link to different locations in the cache. Essentially, everything looks fine, but when reading or diff'ing, the contents are the same.
(.venv) mzurad@workstation:~/code/pe-test/params$ dvc import -v -o my-model-1/params.yaml --rev mvp1 [email protected]:organization/path/my-model-1.git params.yaml
(.venv) mzurad@workstation:~/code/pe-test/params$ dvc import -v -o my-model-2/params.yaml --rev mvp1 [email protected]:organization/path/my-model-2.git params.yaml
(.venv) mzurad@workstation:~/code/pe-test/params$ tree
(.venv) mzurad@workstation:~/code/pe-test/params$ cat my-model-1/params.yaml.dvc
(.venv) mzurad@workstation:~/code/pe-test/params$ cat my-model-2/params.yaml.dvc
Reproduce
I can't seem to reproduce this.
Expected
Files should be correctly imported.
Environment information
Output of dvc doctor:
Additional Information (if any):