Bug Report
Description
We have a use case where we need to capture the "metadata" of a remote S3 storage, but we cannot use dvc push because that would essentially duplicate our storage. Old files do not change; the dataset only changes as new files come in.
Thus, I was hoping to use DVC to capture that metadata, i.e. to manage a file in git that holds a list of the files in our remote S3 bucket. Later on, it should be possible to do a clean git clone and run dvc pull to fetch the data from the original (not dvc pushed) S3 bucket. (See the Discord thread for more info.)
This almost works with dvc import-url --version-aware s3://uri... The generated .dvc file of the root folder contains the S3 paths of all files inside the root folder.
However, after cleanly checking out the git repo that holds the DVC metadata and running dvc pull, there is an assertion error. (Note that the original S3 bucket is not version-aware! I am only using the --version-aware option so that DVC saves the metadata of all files inside the .dvc file instead of the .dvc/cache folder, which is not committed to git.)
Reproduce
- git init
- dvc init
- dvc import-url --version-aware s3://s3-bucket-which-is-not-version-aware
- git add .
- git commit -m 'initial commit'
- git clean -dfX
- dvc pull -v
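The steps above can also be scripted so the failure is easy to re-run in a scratch directory. This is a minimal sketch, assuming git and dvc are on PATH and that the S3 credentials are already configured; the helper names repro_commands and run_repro are hypothetical and only wrap the commands listed above.

```python
import subprocess
from pathlib import Path


def repro_commands(s3_url: str) -> list[list[str]]:
    """Return the reproduction steps from this report as argv lists, in order."""
    return [
        ["git", "init"],
        ["dvc", "init"],
        ["dvc", "import-url", "--version-aware", s3_url],
        ["git", "add", "."],
        ["git", "commit", "-m", "initial commit"],
        # Drops ignored files (imported data, local cache) to mimic a clean clone.
        ["git", "clean", "-dfX"],
        ["dvc", "pull", "-v"],
    ]


def run_repro(s3_url: str, workdir: str) -> int:
    """Run the steps inside workdir and return the exit code of the final dvc pull."""
    path = Path(workdir)
    path.mkdir(parents=True, exist_ok=True)
    *setup, pull = repro_commands(s3_url)
    for argv in setup:
        # Setup steps must succeed; otherwise the pull result is meaningless.
        subprocess.run(argv, cwd=path, check=True)
    # The pull is the step expected to fail, so do not raise on a non-zero code.
    return subprocess.run(pull, cwd=path).returncode


# Example (not executed here):
# run_repro("s3://s3-bucket-which-is-not-version-aware", "repro-dir")
```

A non-zero return value from run_repro corresponds to the failing dvc pull described in this report.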
Expected
dvc pull downloads all the files from the original S3 bucket.
Workaround
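Comment out the assert entry.fs line at https://github.com/iterative/dvc-data/blob/main/src/dvc_data/index/save.py#L24 (line 24 of save.py in the installed dvc_data 0.35.1, per the traceback below).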
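Since the link points at the main branch, where line numbers may shift, it can help to find the copy of save.py that is actually installed in the environment. This is a small sketch using only the standard library; locate_module_file is a hypothetical helper name, and the module path dvc_data.index.save is taken from the traceback in this report.

```python
import importlib.util
from typing import Optional


def locate_module_file(module_name: str) -> Optional[str]:
    """Return the source file path of an installed module, or None if it is absent."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent package of a dotted name is not installed.
        return None
    return spec.origin if spec is not None else None


# Example (not executed here):
# print(locate_module_file("dvc_data.index.save"))
```

The printed path should match the save.py location shown in the traceback for the affected environment.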
Environment information/Command output
2023-01-23 16:17:41,940 DEBUG: failed to pull cache for 'XYZ'
2023-01-23 16:17:44,009 DEBUG: built tree 'object b7b631ef79755d98f49ffc96bb740dea.dir'
dvc : 2023-01-23 16:17:45,804 ERROR: unexpected error
At line:1 char:1
+ dvc pull -v 2>&1 > out.txt
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (2023-01-23...nexpected error:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
------------------------------------------------------------
Traceback (most recent call last):
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\cli\__init__.py", line 184, in main
    ret = cmd.do_run()
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\cli\command.py", line 26, in do_run
    return self.run()
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\commands\data_sync.py", line 31, in run
    stats = self.repo.pull(
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\repo\__init__.py", line 66, in wrapper
    return f(repo, *args, **kwargs)
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\repo\pull.py", line 34, in pull
    processed_files_count = self.fetch(
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\repo\__init__.py", line 66, in wrapper
    return f(repo, *args, **kwargs)
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\repo\fetch.py", line 111, in fetch
    imported = save_imports(
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\repo\imports.py", line 120, in save_imports
    md5(data_view)
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc_data\index\save.py", line 24, in md5
    assert entry.fs
AssertionError
------------------------------------------------------------
2023-01-23 16:17:45,906 DEBUG: link type reflink is not available ([Errno 129] no more link types left to try out)
2023-01-23 16:17:45,906 DEBUG: Removing 'C:\Users\johnl\Arbeit2\.FAGs8ZeCNpMeFrGRKfP7S5.tmp'
2023-01-23 16:17:45,906 DEBUG: Removing 'C:\Users\johnl\Arbeit2\.FAGs8ZeCNpMeFrGRKfP7S5.tmp'
2023-01-23 16:17:45,908 DEBUG: Removing 'C:\Users\johnl\Arbeit2\.FAGs8ZeCNpMeFrGRKfP7S5.tmp'
2023-01-23 16:17:45,908 DEBUG: Removing 'C:\Users\johnl\Arbeit2\DVC3\.dvc\cache\.VsKTYWSQigtDM6pyQ4F5cw.tmp'
2023-01-23 16:17:45,911 DEBUG: Version info for developers:
DVC version: 2.43.1
---------------------------------
Platform: Python 3.9.16 on Windows-10-10.0.22621-SP0
Subprojects:
dvc_data = 0.35.1
dvc_objects = 0.19.0
dvc_render = 0.0.17
dvc_task = 0.1.11
dvclive = 1.3.2
scmrepo = 0.1.6
Supports:
http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
s3 (s3fs = 2023.1.0, boto3 = 1.24.59)
Cache types: hardlink, symlink
Cache directory: NTFS on C:\
Caches: local
Remotes: None
Workspace directory: NTFS on C:\
Repo: dvc, git
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2023-01-23 16:17:45,913 DEBUG: Analytics is enabled.
2023-01-23 16:17:46,070 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', 'C:\\Users\\johnl\\AppData\\Local\\Temp\\tmpu8q07j2a']'
2023-01-23 16:17:46,073 DEBUG: Spawned '['daemon', '-q', 'analytics', 'C:\\Users\\johnl\\AppData\\Local\\Temp\\tmpu8q07j2a']'