dvc import: imports the same file twice from 2 different repos #9904

Closed
asiron opened this issue Sep 1, 2023 · 15 comments · Fixed by iterative/dvc-data#436 or #9923
Labels: A: data-sync (Related to dvc get/fetch/import/pull/push), bug (Did we break something?)

Comments


asiron commented Sep 1, 2023

Bug Report

dvc import: pulls the same file from 2 different repos

Description

dvc import imports the same file twice, even though the source repos are different.
I have 2 repos:

  • [email protected]:organization/path/my-model-1.git
  • [email protected]:organization/path/my-model-2.git

Each is a DVC repo that contains params.yaml. My goal is to combine the 2 models into an application, so I need their respective params.yaml files. However, dvc import somehow gets confused (probably because the paths are the same in each repo?) and pulls the same file. What's really weird is that the hashes recorded for these files are different and the links point to different locations in the cache. Everything looks fine, but when reading or diffing the files, the contents are the same.

(.venv) mzurad@workstation:~/code/pe-test/params$ dvc import -v -o my-model-1/params.yaml --rev mvp1 [email protected]:organization/path/my-model-1.git params.yaml

2023-09-01 17:13:23,015 DEBUG: v3.16.0 (pip), CPython 3.8.10 on Linux-5.15.0-76-generic-x86_64-with-glibc2.29
2023-09-01 17:13:23,015 DEBUG: command: /home/mzurad/code/pe-test/.venv/bin/dvc import -v -o my-model-1/params.yaml --rev mvp1 [email protected]:organization/path/my-model-1.git params.yaml
2023-09-01 17:13:23,192 DEBUG: Lockfile for '../dvc.yaml' not found   
2023-09-01 17:13:23,235 DEBUG: Removing output 'my-model-1/params.yaml' of stage: 'my-model-1/params.yaml.dvc'.
2023-09-01 17:13:23,235 DEBUG: Removing '/home/mzurad/code/pe-test/params/my-model-1/params.yaml'
Importing 'params.yaml ([email protected]:organization/path/my-model-1.git)' -> 'my-model-1/params.yaml'
2023-09-01 17:13:23,236 DEBUG: Computed stage: 'my-model-1/params.yaml.dvc' md5: '15fa312651a71503cb5d70f966ed6b93'
2023-09-01 17:13:23,236 DEBUG: 'md5' of stage: 'my-model-1/params.yaml.dvc' changed.
2023-09-01 17:13:23,237 DEBUG: Creating external repo [email protected]:organization/path/my-model-1.git@mvp1
2023-09-01 17:13:23,237 DEBUG: erepo: git clone '[email protected]:organization/path/my-model-1.git' to a temporary dir
2023-09-01 17:13:54,198 DEBUG: Added '/home/mzurad/code/pe-test/params/my-model-1/params.yaml' to gitignore file.                                                                                                                                                            
2023-09-01 17:13:54,203 DEBUG: Computed stage: 'my-model-1/params.yaml.dvc' md5: '9a69e96782c14aaa4e52487c8b37ce3f'                                                                                                                                                          
2023-09-01 17:13:54,205 DEBUG: Preparing to transfer data from 'memory://dvc-staging-md5/b6b933eae94b5eb503e0ed377c47ba354b401e2e12939b51eee68cbec9214da7' to '/mnt/ssd2/.shared-dvc-cache/files/md5'                                                                              
2023-09-01 17:13:54,205 DEBUG: Preparing to collect status from '/mnt/ssd2/.shared-dvc-cache/files/md5'
2023-09-01 17:13:54,205 DEBUG: Collecting status from '/mnt/ssd2/.shared-dvc-cache/files/md5'
2023-09-01 17:13:54,206 DEBUG: Removing '/home/mzurad/code/pe-test/params/my-model-1/.3UKsg3Bp6ZyCeLdTARmCcj.tmp'                                                                                                                                                            
2023-09-01 17:13:54,206 DEBUG: Removing '/mnt/ssd2/.shared-dvc-cache/files/md5/.j9wLPKuidEjA5cFkHqbcEf.tmp'
2023-09-01 17:13:54,207 DEBUG: Removing '/home/mzurad/code/pe-test/params/my-model-1/params.yaml'                                                                                                                                                                            
2023-09-01 17:13:54,216 DEBUG: Saving information to 'my-model-1/params.yaml.dvc'.                                                                                                                                                                                           

To track the changes with git, run:

        git add my-model-1/params.yaml.dvc my-model-1/.gitignore

To enable auto staging, run:

        dvc config core.autostage true
2023-09-01 17:13:54,242 DEBUG: Analytics is disabled.

(.venv) mzurad@workstation:~/code/pe-test/params$ dvc import -v -o my-model-2/params.yaml --rev mvp1 [email protected]:organization/path/my-model-2.git params.yaml

2023-09-01 17:14:38,752 DEBUG: v3.16.0 (pip), CPython 3.8.10 on Linux-5.15.0-76-generic-x86_64-with-glibc2.29
2023-09-01 17:14:38,752 DEBUG: command: /home/mzurad/code/pe-test/.venv/bin/dvc import -v -o my-model-2/params.yaml --rev mvp1 [email protected]:organization/path/my-model-2.git params.yaml
2023-09-01 17:14:38,932 DEBUG: Lockfile for '../dvc.yaml' not found   
2023-09-01 17:14:38,976 DEBUG: Removing output 'my-model-2/params.yaml' of stage: 'my-model-2/params.yaml.dvc'.
2023-09-01 17:14:38,976 DEBUG: Removing '/home/mzurad/code/pe-test/params/my-model-2/params.yaml'
Importing 'params.yaml ([email protected]:organization/path/my-model-2.git)' -> 'my-model-2/params.yaml'
2023-09-01 17:14:38,977 DEBUG: Computed stage: 'my-model-2/params.yaml.dvc' md5: '9d0288dd444947e3619c29c13df172b3'
2023-09-01 17:14:38,977 DEBUG: 'md5' of stage: 'my-model-2/params.yaml.dvc' changed.
2023-09-01 17:14:38,978 DEBUG: Creating external repo [email protected]:organization/path/my-model-2.git@mvp1
2023-09-01 17:14:38,978 DEBUG: erepo: git clone '[email protected]:organization/path/my-model-2.git' to a temporary dir
2023-09-01 17:14:40,915 DEBUG: Computed stage: 'my-model-2/params.yaml.dvc' md5: 'd7fec615a6cbc356c22decc92e3196a7'                                                                                                                                                       
2023-09-01 17:14:40,916 DEBUG: Preparing to transfer data from 'memory://dvc-staging-md5/b6b933eae94b5eb503e0ed377c47ba354b401e2e12939b51eee68cbec9214da7' to '/mnt/ssd2/.shared-dvc-cache/files/md5'                                                                              
2023-09-01 17:14:40,916 DEBUG: Preparing to collect status from '/mnt/ssd2/.shared-dvc-cache/files/md5'
2023-09-01 17:14:40,916 DEBUG: Collecting status from '/mnt/ssd2/.shared-dvc-cache/files/md5'
2023-09-01 17:14:40,917 DEBUG: Removing '/home/mzurad/code/pe-test/params/my-model-2/.L5oNdGLoWLgLqSjjuZn2uv.tmp'                                                                                                                                                         
2023-09-01 17:14:40,917 DEBUG: Removing '/mnt/ssd2/.shared-dvc-cache/files/md5/.2Xb5Zy6mCsftwqz96Qr838.tmp'
2023-09-01 17:14:40,918 DEBUG: Removing '/home/mzurad/code/pe-test/params/my-model-2/params.yaml'                                                                                                                                                                         
2023-09-01 17:14:40,927 DEBUG: Saving information to 'my-model-2/params.yaml.dvc'.                                                                                                                                                                                        

To track the changes with git, run:

        git add my-model-2/params.yaml.dvc

To enable auto staging, run:

        dvc config core.autostage true
2023-09-01 17:14:40,934 DEBUG: Analytics is disabled.

(.venv) mzurad@workstation:~/code/pe-test/params$ tree

.
├── my-model-2
│   ├── params.yaml -> /mnt/ssd2/.shared-dvc-cache/files/md5/91/3affb56fb36f9bc9416cd569790a90
│   └── params.yaml.dvc
└── my-model-1
    ├── params.yaml -> /mnt/ssd2/.shared-dvc-cache/files/md5/ab/32d18a00bd85ffbcb2d78d927dbbb3
    └── params.yaml.dvc

2 directories, 4 files

(.venv) mzurad@workstation:~/code/pe-test/params$ cat my-model-1/params.yaml.dvc

md5: 9a69e96782c14aaa4e52487c8b37ce3f
frozen: true
deps:
- path: params.yaml
  repo:
    url: [email protected]:organization/path/my-model-1.git
    rev: mvp1
    rev_lock: 21e10a81912609f8419522688fc51a92db1ed394
outs:
- md5: ab32d18a00bd85ffbcb2d78d927dbbb3
  size: 1560
  hash: md5
  path: params.yaml

(.venv) mzurad@workstation:~/code/pe-test/params$ cat my-model-2/params.yaml.dvc

md5: d7fec615a6cbc356c22decc92e3196a7
frozen: true
deps:
- path: params.yaml
  repo:
    url: [email protected]:organization/path/my-model-2.git
    rev: mvp1
    rev_lock: 9840ae9b84c4f300d36da06b052e297a3a2e579e
outs:
- md5: 913affb56fb36f9bc9416cd569790a90
  size: 1335
  hash: md5
  path: params.yaml
(.venv) mzurad@workstation:~/code/pe-test/params$ diff my-model-2/params.yaml my-model-1/params.yaml
(.venv) mzurad@workstation:~/code/pe-test/params$ 
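
The empty diff above is the symptom: the two .dvc files record different `outs` hashes, yet the checked-out files are identical. A minimal Python sketch of that check follows. It assumes the single-output .dvc layout shown in this issue and that the data file sits next to its .dvc file; the helper names (`file_md5`, `recorded_out_md5`, `import_matches`) are hypothetical and not part of DVC.

```python
import hashlib
import re
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 of a file's bytes, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fobj:  # follows symlinks into the cache
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recorded_out_md5(dvc_file: Path) -> str:
    """Return the first md5 listed under `outs:` in a .dvc file.

    Line-based parsing keeps the sketch stdlib-only. It assumes the
    simple layout shown in this issue, not arbitrary YAML.
    """
    in_outs = False
    for line in dvc_file.read_text().splitlines():
        if line.startswith("outs:"):
            in_outs = True
        elif in_outs:
            match = re.match(r"\s*-?\s*md5:\s*([0-9a-f]{32})\s*$", line)
            if match:
                return match.group(1)
    raise ValueError(f"no outs md5 found in {dvc_file}")


def import_matches(dvc_file: Path) -> bool:
    """True if the data file next to `dvc_file` has the recorded hash."""
    data_file = dvc_file.with_suffix("")  # params.yaml.dvc -> params.yaml
    return file_md5(data_file) == recorded_out_md5(dvc_file)
```

Calling `import_matches(Path("my-model-1/params.yaml.dvc"))` and the same for `my-model-2` would be expected to flag whichever import points at the wrong content.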

Reproduce

I can't seem to reproduce this.

Expected

Files should be correctly imported.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.16.0 (pip)
-------------------------
Platform: Python 3.8.10 on Linux-5.15.0-76-generic-x86_64-with-glibc2.29
Subprojects:
        dvc_data = 2.15.4
        dvc_objects = 1.0.1
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.3.1
Supports:
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2023.7.0)
Config:
        Global: /home/mzurad/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1
Caches: local
Remotes: ssh, ssh
Workspace directory: ext4 on /dev/nvme1n1p2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/a15a3498ef6bebff6bb2e7fb9e33e35e

Additional Information (if any):

Author

asiron commented Sep 1, 2023

rm -rf /var/tmp/dvc/repo/* did not help

Contributor

pmrowla commented Sep 4, 2023

@asiron can you run md5sum on those two cache files and verify whether or not the hashes match as expected?

md5sum /mnt/ssd2/.shared-dvc-cache/files/md5/91/3affb56fb36f9bc9416cd569790a90 /mnt/ssd2/.shared-dvc-cache/files/md5/ab/32d18a00bd85ffbcb2d78d927dbbb3

@pmrowla pmrowla added the awaiting response we are waiting for your reply, please respond! :) label Sep 4, 2023
Author

asiron commented Sep 4, 2023

Yes, md5sum shows the same hash for both:

(.venv) mzurad@workstation:~/code/pe-test$ md5sum /mnt/ssd2/.shared-dvc-cache/files/md5/91/3affb56fb36f9bc9416cd569790a90 /mnt/ssd2/.shared-dvc-cache/files/md5/ab/32d18a00bd85ffbcb2d78d927dbbb3
913affb56fb36f9bc9416cd569790a90  /mnt/ssd2/.shared-dvc-cache/files/md5/91/3affb56fb36f9bc9416cd569790a90
913affb56fb36f9bc9416cd569790a90  /mnt/ssd2/.shared-dvc-cache/files/md5/ab/32d18a00bd85ffbcb2d78d927dbbb3

Author

asiron commented Sep 4, 2023

I tried importing with a slightly different output path, but without success:

dvc import -v -o params/my-model-1.yaml --rev mvp1 [email protected]:organization/path/my-model-1.git params.yaml
dvc import -v -o params/my-model-2.yaml --rev mvp1 [email protected]:organization/path/my-model-2.git params.yaml

Contributor

pmrowla commented Sep 5, 2023

I'm unable to reproduce this. I tried importing params.yaml from local repos and from both git@ and https URLs for https://github.com/pmrowla/test-a and https://github.com/pmrowla/test-b, and I get the expected result in all cases.


Just to clarify, did you remove the shared cache files before you retried importing with the different paths?

This cache file is invalid:

913affb56fb36f9bc9416cd569790a90  /mnt/ssd2/.shared-dvc-cache/files/md5/ab/32d18a00bd85ffbcb2d78d927dbbb3

But as long as /mnt/ssd2/.shared-dvc-cache/files/md5/ab/32d18a00bd85ffbcb2d78d927dbbb3 exists, DVC won't redownload anything with md5: ab32d18a00bd85ffbcb2d78d927dbbb3 (so any subsequent re-import won't overwrite the bad cache file).
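
Because a bad cache entry is never overwritten while it exists, it helps to find every entry whose content no longer matches its name. The following is a minimal Python sketch of such a scan, assuming the `files/md5/<2 hex chars>/<30 hex chars>` layout visible in the paths above; `find_corrupt_entries` is a hypothetical helper, not a DVC API, and it only reports entries without deleting anything.

```python
import hashlib
import re
from pathlib import Path
from typing import Iterator, Tuple

# A plain file object is stored as <2 hex chars>/<30 hex chars>.
_REST_OF_HASH = re.compile(r"^[0-9a-f]{30}$")


def content_md5(path: Path) -> str:
    """Return the hex MD5 of a file's bytes, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_entries(md5_root: Path) -> Iterator[Tuple[Path, str, str]]:
    """Yield (path, expected, actual) for entries whose content hash
    differs from the hash encoded in their path."""
    for prefix_dir in sorted(md5_root.iterdir()):
        if not prefix_dir.is_dir() or len(prefix_dir.name) != 2:
            continue
        for entry in sorted(prefix_dir.iterdir()):
            # Skip temp files and anything else that is not a plain object.
            if not entry.is_file() or not _REST_OF_HASH.match(entry.name):
                continue
            expected = prefix_dir.name + entry.name
            actual = content_md5(entry)
            if actual != expected:
                yield entry, expected, actual
```

Run against `/mnt/ssd2/.shared-dvc-cache/files/md5`, this would be expected to report the `ab/32d18a00bd85ffbcb2d78d927dbbb3` entry discussed above, which can then be removed by hand before re-fetching.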

Since your imported .dvc files look correct, can you try explicitly removing the shared cache files and the existing links, and then just run dvc pull?

rm params/my-model-1.yaml params/my-model-2.yaml /mnt/ssd2/.shared-dvc-cache/files/md5/91/3affb56fb36f9bc9416cd569790a90 /mnt/ssd2/.shared-dvc-cache/files/md5/ab/32d18a00bd85ffbcb2d78d927dbbb3

Author

asiron commented Sep 5, 2023

I did that, and I also disabled the shared cache and wiped /var/tmp/dvc, .dvc/cache and .dvc/tmp, so the cache is now set to the default project/.dvc/cache.
dvc fetch only fetches one of them, 913affb56fb36f9bc9416cd569790a90, and it's not valid:

$ md5sum .dvc/cache/files/md5/91/3affb56fb36f9bc9416cd569790a90 
ab32d18a00bd85ffbcb2d78d927dbbb3  .dvc/cache/files/md5/91/3affb56fb36f9bc9416cd569790a90

Author

asiron commented Sep 5, 2023

I made another repo that imports the files from your repos, and the problem seems to arise when we add a dvc.yaml file with a stage. Can you check whether you can replicate it on your side?

$ git clone [email protected]:asiron/testing-broken-fetch.git test-broken-fetch-cloned
$ cd test-broken-fetch-cloned 
$ dvc fetch params/a/params.yaml.dvc 
1 file fetched                                                                                                                                                                                              
$ dvc fetch params/b/params.yaml.dvc 
1 file fetched                                                                                                                                                                                              
$ dvc checkout
WARNING: No file hash info found for '/home/mzurad/code/test-broken-fetch-cloned/out.txt'. It won't be created.                                                                                             
A       params/b/params.yaml                                                                                                                                                                                
A       params/a/params.yaml                                                                                                                                                                                
ERROR: Checkout failed for following targets:
out.txt
Is your cache up to date?
<https://error.dvc.org/missing-files>
$ md5sum params/b/params.yaml
6b371a942c39fd86583569072acb274d  params/b/params.yaml
$ md5sum params/a/params.yaml
6b371a942c39fd86583569072acb274d  params/a/params.yaml

Contributor

pmrowla commented Sep 6, 2023

Thanks for that, I am able to reproduce the issue now (when I add a dvc.yaml pipeline stage). One thing I noticed is that dvc fetch clones the wrong repo when fetching the second params file (it's using the git URL for A instead of the git URL for B):

$ cat params/a/params.yaml.dvc
md5: 3688cbeef52af2e44b131443eed265ab
frozen: true
deps:
- path: params.yaml
  repo:
    url: https://github.com/pmrowla/test-a.git
    rev: main
    rev_lock: 4df0c2c3d4b55091af0a73182fe839814ce5e3b2
outs:
- md5: 6b371a942c39fd86583569072acb274d
  size: 7
  hash: md5
  path: params.yaml

$ cat params/b/params.yaml.dvc
md5: 12634c4b8be3471eb7bec9d7ecd558e7
frozen: true
deps:
- path: params.yaml
  repo:
    url: https://github.com/pmrowla/test-b.git
    rev: main
    rev_lock: 9d6560a113279014d4d51f981d4dc7c990c71fe4
outs:
- md5: 37b91651347defb0103472be09f14b0b
  size: 7
  hash: md5
  path: params.yaml

$ dvc fetch params/b/params.yaml.dvc -v
2023-09-06 15:50:52,004 DEBUG: v3.18.1.dev2+gd27bf0a68, CPython 3.11.4 on macOS-13.5.1-arm64-arm-64bit
2023-09-06 15:50:52,004 DEBUG: command: /Users/pmrowla/.virtualenvs/dvc/bin/dvc fetch params/b/params.yaml.dvc -v
2023-09-06 15:50:52,130 DEBUG: Lockfile for 'dvc.yaml' not found
2023-09-06 15:50:52,139 DEBUG: Creating external repo https://github.com/pmrowla/test-a.git@4df0c2c3d4b55091af0a73182fe839814ce5e3b2
2023-09-06 15:50:52,139 DEBUG: erepo: git clone 'https://github.com/pmrowla/test-a.git' to a temporary dir
1 file fetched
2023-09-06 15:50:53,361 DEBUG: Analytics is disabled.

So the initial dvc import works correctly: it gets the separate files from each source repo and generates the resulting .dvc files as expected. The bug occurs on any subsequent fetch for those imports.

cc @efiop

@pmrowla pmrowla added bug Did we break something? A: data-sync Related to dvc get/fetch/import/pull/push and removed awaiting response we are waiting for your reply, please respond! :) labels Sep 6, 2023
Contributor

pmrowla commented Sep 6, 2023

repro script:

#!/bin/bash
set -ex

A=https://github.com/pmrowla/test-a.git
B=https://github.com/pmrowla/test-b.git
REPO=repo

rm -rf $REPO

mkdir $REPO
pushd $REPO
git init
dvc init
cat >dvc.yaml <<EOL
stages:
  test:
    cmd: cat params/a/params.yaml params/b/params.yaml > foo.txt
    deps:
    - params/a/params.yaml
    - params/b/params.yaml
    outs:
    - foo.txt
EOL
mkdir -p params/a params/b
dvc import -o params/a/params.yaml --rev main $A params.yaml
dvc import -o params/b/params.yaml --rev main $B params.yaml

rm -rf .dvc/cache params/a/params.yaml params/b/params.yaml

cat params/a/params.yaml.dvc
cat params/b/params.yaml.dvc
dvc fetch -v params/a/params.yaml.dvc
dvc fetch -v params/b/params.yaml.dvc
dvc checkout params/a/params.yaml params/b/params.yaml
md5sum params/a/params.yaml params/b/params.yaml
popd

@dberenbaum
Contributor

Marking this as p0 since it looks like it's causing cache corruption

Contributor

efiop commented Sep 6, 2023

@pmrowla Thanks for the research! 🙏

I think this is related to how we identify filesystems: https://github.com/iterative/dvc-data/blob/8952521a7bfda6c6ece79293e3189aad7797b056/src/dvc_data/index/collect.py#L95 We use protocol + path as the identifier, and they happen to match here for the two dvcfs instances, hence the overlap 🤦‍♂️ Taking a look; we likely need to introduce fsid.
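
The collision described above can be illustrated with a small Python sketch. This is not the dvc-data code: `FakeFS`, the `"dvc"` protocol string, and both key functions are stand-ins made up for illustration, and using the repo URL as the fsid is an assumption.

```python
from typing import Callable, Dict, Iterable, NamedTuple, Tuple


class FakeFS(NamedTuple):
    """Stand-in for a filesystem object; not a DVC or fsspec class."""

    protocol: str
    fsid: str  # hypothetical unique id; here simply the repo URL


Key = Tuple[str, ...]


def key_without_fsid(fs: FakeFS, path: str) -> Key:
    """Identify a file by protocol + path only."""
    return (fs.protocol, path)


def key_with_fsid(fs: FakeFS, path: str) -> Key:
    """Identify a file by protocol + fsid + path."""
    return (fs.protocol, fs.fsid, path)


def build_registry(
    key_func: Callable[[FakeFS, str], Key],
    entries: Iterable[Tuple[FakeFS, str]],
) -> Dict[Key, FakeFS]:
    """Map each key to its filesystem; equal keys overwrite each other."""
    registry: Dict[Key, FakeFS] = {}
    for fs, path in entries:
        registry[key_func(fs, path)] = fs
    return registry


# Two different source repos that both contain params.yaml.
repo_a = FakeFS("dvc", "https://github.com/pmrowla/test-a.git")
repo_b = FakeFS("dvc", "https://github.com/pmrowla/test-b.git")
imports = [(repo_a, "params.yaml"), (repo_b, "params.yaml")]

collided = build_registry(key_without_fsid, imports)  # one entry survives
separate = build_registry(key_with_fsid, imports)  # both entries kept
```

With protocol + path alone, both imports map to the same key, so only one filesystem can be associated with `params.yaml`. Adding a per-filesystem id to the key keeps the two imports apart.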

efiop added a commit to efiop/dvc that referenced this issue Sep 7, 2023
efiop added a commit to efiop/dvc that referenced this issue Sep 7, 2023
Contributor

efiop commented Sep 7, 2023

A similar problem could also happen with other filesystems, not only dvcfs. It is just less likely because most other filesystems have some unique id in their paths (e.g. the bucket name for s3/azure/gs).

Big thanks to @asiron for the quality feedback 🙏

@efiop efiop reopened this Sep 7, 2023
efiop added a commit that referenced this issue Sep 7, 2023
Implements fsspec's fsid fsspec/filesystem_spec#1122

Required for #9904
Contributor

efiop commented Sep 7, 2023

@asiron Could you give upstream dvc a try, please?

For the record: works for me locally and @pmrowla 's script is also working correctly now.

Author

asiron commented Sep 7, 2023

Just tested this on 3.18.1.dev6+g7fb8a9a4, and both the repro script and my private code work. Thanks a lot for such a fast fix!

Contributor

efiop commented Sep 7, 2023

@asiron 3.19.0 is on its way out (pypi will be ready in a few minutes and the rest of the packages will be ready in the evening/tomorrow). Thank you! 🙏

@efiop efiop added this to DVC Sep 7, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Sep 7, 2023
@efiop efiop moved this from Backlog to Done in DVC Sep 7, 2023