For a large dataset dvc add creates symlinks to wrong files in the cache  #4303

Closed
@maxme1

Bug Report

I have a large dataset (~60 GB) consisting mostly of large files. Each file has additional JSON files associated with it.
After I run dvc add FOLDER, all files are replaced with symlinks, and some of them point to the wrong files in the cache (though still files from the same dataset). Moreover, most of the links point to the same file, so the links are not merely shuffled.
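
The observation that many links share one target can be quantified with a short script. The following is only a sketch, not part of the original report: the function names `tally_symlink_targets` and `shared_targets` are hypothetical, and the script simply walks the folder and counts how many symlinks resolve to each path.

```python
import os
from collections import Counter


def tally_symlink_targets(folder):
    """Count how many symlinks under `folder` resolve to each target path."""
    counts = Counter()
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            # Only symlinks are of interest; regular files are skipped.
            if os.path.islink(path):
                counts[os.path.realpath(path)] += 1
    return counts


def shared_targets(folder):
    """Return only the targets that more than one symlink points to."""
    return {t: n for t, n in tally_symlink_targets(folder).items() if n > 1}
```

Note that a shared target is not necessarily an error: files with identical content are expected to map to the same cache entry, so the result is only a hint about where to look.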

I tried to reproduce this issue with the JSON files only, but didn't succeed.
However, the issue persists when I include the large files, although it is somewhat random: each time I run dvc add, the links point to different files.
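
Because the set of wrong links changes between runs, it may help to list the mismatches after each `dvc add`. The sketch below rests on assumptions that are not stated in the report: that the directory manifest (the `.dir` file in the cache) is a JSON list of objects with `md5` and `relpath` keys, and that cache entries are stored as `<first two hex characters>/<remaining characters>`. The function names and the `manifest_path` parameter are hypothetical.

```python
import json
import os


def expected_cache_suffix(md5):
    # Assumed cache layout: <cache>/<first 2 hex chars>/<remaining chars>
    return os.path.join(md5[:2], md5[2:])


def find_mismatched_links(folder, manifest_path):
    """Return (relpath, actual_target) pairs whose link disagrees with the manifest.

    actual_target is None when the path is not a symlink at all.
    """
    with open(manifest_path) as fh:
        # Assumed format: [{"md5": "...", "relpath": "..."}, ...]
        entries = json.load(fh)

    mismatches = []
    for entry in entries:
        link = os.path.join(folder, entry["relpath"])
        if not os.path.islink(link):
            mismatches.append((entry["relpath"], None))
            continue
        target = os.path.realpath(link)
        # Compare path names only; file contents are never hashed here.
        if not target.endswith(os.sep + expected_cache_suffix(entry["md5"])):
            mismatches.append((entry["relpath"], target))
    return mismatches
```

The check compares only the link target's path with the hash recorded in the manifest, so it does not read the large files and stays cheap on a 60 GB dataset. It cannot detect a cache entry whose content is itself wrong.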

Information about my setup

Output of dvc version:

$ dvc version

DVC version: 1.2.2
Python version: 3.6.8
Platform: Linux-4.15.0-107-generic-x86_64-with-debian-stretch-sid
Binary: False
Package: pip
Supported remotes: http, https, s3, ssh
Filesystem type: ('nfs4', EDITED_OUT_THE_PATH)

Additional Information:

The cache is stored outside of the repository on an NFS4-mounted disk.

Labels: awaiting response
