Closed as not planned
Description
In #8849, we stopped serializing the directory info in the resulting .dvc/dvc.lock files for cloud versioned remotes. Can we do the same everywhere?
This would help with a bunch of existing issues:
- diff gets confused when the object is not in cache #7661 and `dvc diff`: unexpected output in Github Actions #8875: `dvc diff` fails to compare refs unless the data associated with those refs has been pulled locally. `dvc data status` also reports an `unknown` status when data hasn't been pulled. By having all the files listed in the .dvc/dvc.lock file, it would always be possible to get the granular file info of any commit.
- cloud versioning: `dvc pull` fails after version aware import-url for non version aware s3 remote #8872: `dvc pull` on an `import-url` target is now supposed to be able to pull the data directly from the source without having to push a copy to the remote, but it doesn't work for directories because only the high-level directory info is saved to the .dvc file.
- Mechanism to update a dataset w/o downloading it first #4657: To modify an existing directory, the whole directory needs to be pulled. Having the granular file info in the .dvc file means that users could delete a file by searching for it in the .dvc file and deleting that entry. This still isn't great UX, but it should be easy to make `dvc add/remove` work at a granular level by only modifying part of the .dvc file.
- merge-driver: failed merges don't return conflict markers #8638: Users have to install a special merge driver because of .dir entries. Even then, a merge conflict becomes hard to troubleshoot because the conflict will not show both .dir entries, and even if it did, there's no easy way to combine them. With granular file info instead of the .dir entries, no merge driver is needed, and merge conflicts could be resolved by editing the file info in the .dvc files.
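To make the diff and granular-edit points in the list above concrete, here is a rough Python sketch. It is not DVC code: it works on a plain dict standing in for a parsed .dvc file, and the field names (`outs`, `path`, `files`, `relpath`, `md5`) and both helper functions are assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical in-memory form of a .dvc file that lists every file of a
# tracked directory. All field names here are assumptions, not DVC's schema.
FileEntry = Dict[str, str]


def diff_file_lists(old: List[FileEntry], new: List[FileEntry]) -> Dict[str, List[str]]:
    """Compare two per-file listings without touching the cache or a remote."""
    old_map = {e["relpath"]: e["md5"] for e in old}
    new_map = {e["relpath"]: e["md5"] for e in new}
    common = old_map.keys() & new_map.keys()
    return {
        "added": sorted(new_map.keys() - old_map.keys()),
        "deleted": sorted(old_map.keys() - new_map.keys()),
        "modified": sorted(p for p in common if old_map[p] != new_map[p]),
    }


def remove_file_entry(dvc_data: dict, out_path: str, relpath: str) -> bool:
    """Drop one file from a tracked directory by editing only its entry.

    Returns True if an entry was removed, False if nothing matched.
    """
    for out in dvc_data.get("outs", []):
        if out.get("path") != out_path:
            continue
        files = out.get("files", [])
        kept = [e for e in files if e["relpath"] != relpath]
        out["files"] = kept
        return len(kept) != len(files)
    return False


# Made-up example data: the same directory at two commits.
old_commit = {
    "outs": [
        {
            "path": "mydataset",
            "files": [
                {"relpath": "a.csv", "md5": "aaa"},
                {"relpath": "b.csv", "md5": "bbb"},
            ],
        }
    ]
}
new_commit = {
    "outs": [
        {
            "path": "mydataset",
            "files": [
                {"relpath": "a.csv", "md5": "aaa"},
                {"relpath": "b.csv", "md5": "bb2"},
                {"relpath": "c.csv", "md5": "ccc"},
            ],
        }
    ]
}

changes = diff_file_lists(
    old_commit["outs"][0]["files"], new_commit["outs"][0]["files"]
)
removed = remove_file_entry(new_commit, "mydataset", "c.csv")
```

With listings like these in git, the diff needs only the two committed files, and deleting a file from the dataset is an edit to one entry rather than a pull of the whole directory.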
Automatically pushing and pulling the .dir files from the remote could also solve a lot of these problems, but it seems like a worse UX. It's less transparent, harder for users to manage, and fails when users don't have access to the remote or forgot to push something.
How much do we really need the reference to the .dir file? If necessary, could we serialize that reference somewhere that's not git-tracked, like in a shadow `.dvc/tmp/mydataset.dvc` file?