WSL corrupt cache files on dvc add #6979

Closed
bobertlo opened this issue Nov 13, 2021 · 3 comments
@bobertlo
Contributor

Bug Report

@efiop I was able to reproduce this very reliably. Sorry to keep finding this stuff ;)

Description

DVC cache files are (rarely but reproducibly) being read or written incorrectly: the contents of some cache files do not match their hash after being inserted into .dvc/cache.

Reproduce

I can reproduce this reliably, mostly with larger datasets. I caught it while testing an HTTP remote that verifies hashes while accepting uploads.

In this example I used the Go 1.17.2 installation from my home directory. It contains 11880 files, of which 17 are corrupted. The same files are corrupted on every run. I was initially using a similar but different dataset.

python3.8 -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install dvc
git init .
dvc init
cp -rv ~/opt/go .
dvc add go
$ cat .dvc/cache/39/b14cfe23e7b57f28c7f8a421631cbe | md5sum
6f03bd42972eb0a732c763432cd6361d
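
The single-file check above can be repeated over the whole cache. The following is a minimal sketch, not part of DVC: it assumes the cache layout seen in the path above (a two-character directory followed by the rest of the hash, optionally with a suffix such as `.dir`), and `find_mismatches` is a hypothetical helper name. A mismatch only shows that the plain MD5 of the bytes differs from the cache key; it does not by itself prove that the data is damaged.

```python
import hashlib
from pathlib import Path


def find_mismatches(cache_dir):
    """Return (expected, actual) pairs for cache files whose plain MD5
    differs from the hash encoded in their path.

    Assumption: the layout is <cache_dir>/<2 hex chars>/<remaining hex chars>.
    """
    mismatches = []
    for path in sorted(Path(cache_dir).glob("*/*")):
        if not path.is_file():
            continue
        # Rebuild the key from the directory name plus the file name,
        # dropping any suffix such as ".dir".
        expected = path.parent.name + path.name.split(".")[0]
        # Reads the whole file into memory; fine for a sketch.
        actual = hashlib.md5(path.read_bytes()).hexdigest()
        if actual != expected:
            mismatches.append((expected, actual))
    return mismatches


if __name__ == "__main__":
    for expected, actual in find_mismatches(".dvc/cache"):
        print(expected, actual)
```

Run from the repository root, this prints one line per cache file whose plain MD5 disagrees with its name.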

Expected

Files in .dvc/cache should be inserted under the correct hash.

Environment information

Output of dvc doctor:

I was first running 2.8.2, then reproduced with a pip upgrade in place after deleting the cache and tmp directories.

$ dvc doctor
DVC version: 2.8.3 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
Supports:
        webhdfs (fsspec = 2021.11.0),
        http (aiohttp = 3.8.0, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.0, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2021.10.1, boto3 = 1.19.8),
        ssh (sshfs = 2021.11.0)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb
Caches: local
Remotes: http
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git

Additional Information (if any):

@efiop
Contributor

efiop commented Nov 13, 2021

@bobertlo The md5 that we use is not, strictly speaking, a real md5. It is something like md5(dos2unix(data)), so I'm wondering if you are mistaking that difference for actual corruption. Does dvc status after dvc add say that anything is corrupted? And if you remove .dvc/tmp, what does dvc status say?
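
To illustrate the difference, here is a rough sketch in Python. It assumes that the normalisation amounts to replacing CRLF with LF before hashing; the exact rules DVC applies (for example, which files it treats as text) are not shown in this thread, and `plain_md5` and `dos2unix_md5` are hypothetical helper names, not DVC functions.

```python
import hashlib


def plain_md5(data: bytes) -> str:
    """MD5 of the raw bytes, as printed by `md5sum`."""
    return hashlib.md5(data).hexdigest()


def dos2unix_md5(data: bytes) -> str:
    """MD5 after line-ending normalisation.

    Assumption: normalisation is a plain CRLF -> LF replacement.
    """
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()


crlf_data = b"line one\r\nline two\r\n"
lf_data = b"line one\nline two\n"

# A file with CRLF line endings hashes differently under the two schemes.
print(plain_md5(crlf_data) == dos2unix_md5(crlf_data))  # False

# Its normalised hash equals the plain MD5 of the LF-only version.
print(dos2unix_md5(crlf_data) == plain_md5(lf_data))  # True

# A file without CRLF is unaffected.
print(plain_md5(lf_data) == dos2unix_md5(lf_data))  # True
```

Under this assumption, only files that contain CRLF sequences would show a mismatch between their cache key and the output of md5sum, which would fit a small, stable subset of files being affected.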

@bobertlo
Contributor Author

Oh wow, I had even heard that before but it did not register.

I spent a long time debugging this because I was taking for granted the input would “match” its content address!

Oops! Thanks.

@efiop
Contributor

efiop commented Nov 13, 2021

@bobertlo Good thing there is nothing actually corrupted 😅 FYI, see #4658: we will be switching away from this hash in the future, along with chunking.
