cloud versioning: dvc pull fails after version aware import-url for non version aware s3 remote #8872

Closed
@jolyons123

Bug Report

Description

We have a use case where we need to capture the "metadata" of remote S3 storage, but we cannot run dvc push because that would duplicate our storage. Old files do not change; the dataset only changes as new files come in.
I was therefore hoping to use DVC to capture that metadata, i.e. to manage a file in Git that lists the files in our remote S3 bucket. Later, it should be possible to do a clean git clone and run dvc pull to fetch the data from the original (not dvc-pushed) S3 bucket. (See the Discord thread for more info.)
This almost works with dvc import-url --version-aware s3://uri.... The generated .dvc file for the root folder contains the S3 paths of all the files inside that folder.
However, after a clean checkout of the Git repo that holds the DVC metadata, dvc pull fails with an assertion error. (Note that the original S3 bucket is not version-aware! I am only using the --version-aware option so that DVC saves the metadata of all files in the .dvc file instead of in the .dvc/cache folder, which is not committed to Git.)

Reproduce

  1. git init
  2. dvc init
  3. dvc import-url --version-aware s3://s3-bucket-which-is-not-version-aware
  4. git add .
  5. git commit -m 'initial commit'
  6. git clean -dfX
  7. dvc pull -v

Expected

dvc pull should download all the files from the original S3 bucket.

Workaround

Comment out the assertion at https://github.com/iterative/dvc-data/blob/main/src/dvc_data/index/save.py#L24 (the assert entry.fs line shown in the traceback).
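
To illustrate why that single assertion aborts the whole pull, here is a minimal sketch. It assumes, based only on the traceback below, that the md5 step walks the index entries and asserts that each one has a filesystem attached. The Entry class and both functions are hypothetical stand-ins for illustration, not dvc-data's actual code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Entry:
    """Hypothetical stand-in for an index entry (not the dvc-data class)."""

    key: str
    fs: Optional[object] = None  # filesystem handle; None when none was attached


def check_entries(entries: Iterable[Entry]) -> int:
    """Count entries, asserting each has a filesystem (mirrors `assert entry.fs`)."""
    count = 0
    for entry in entries:
        # Raises AssertionError on the first entry whose fs is missing,
        # so no later entry is processed.
        assert entry.fs
        count += 1
    return count


def first_entry_without_fs(entries: Iterable[Entry]) -> Optional[str]:
    """Return the key of the first entry lacking a filesystem, or None."""
    for entry in entries:
        if not entry.fs:
            return entry.key
    return None


if __name__ == "__main__":
    entries = [Entry("a.txt", fs=object()), Entry("b.txt")]
    print(first_entry_without_fs(entries))
    try:
        check_entries(entries)
    except AssertionError:
        print("AssertionError, as in the traceback")
```

A helper like first_entry_without_fs could be used to find out which entry triggers the failure before resorting to commenting out the assertion.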

Environment information/Command output

2023-01-23 16:17:41,940 DEBUG: failed to pull cache for 'XYZ'
2023-01-23 16:17:44,009 DEBUG: built tree 'object b7b631ef79755d98f49ffc96bb740dea.dir'
dvc : 2023-01-23 16:17:45,804 ERROR: unexpected error
At line:1 char:1
+ dvc pull -v 2>&1 > out.txt
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (2023-01-23...nexpected error:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

------------------------------------------------------------
Traceback (most recent call last):
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\cli\__init__.py", line 184, in main
    ret = cmd.do_run()
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\cli\command.py", line 26, in do_run
    return self.run()
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\commands\data_sync.py", line 31, in run
    stats = self.repo.pull(
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\repo\__init__.py", line 66, in wrapper
    return f(repo, *args, **kwargs)
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\repo\pull.py", line 34, in pull
    processed_files_count = self.fetch(
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\repo\__init__.py", line 66, in wrapper
    return f(repo, *args, **kwargs)
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\repo\fetch.py", line 111, in fetch
    imported = save_imports(
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc\repo\imports.py", line 120, in save_imports
    md5(data_view)
  File "C:\ProgramData\Miniconda3\envs\dvc\lib\site-packages\dvc_data\index\save.py", line 24, in md5
    assert entry.fs
AssertionError
------------------------------------------------------------
2023-01-23 16:17:45,906 DEBUG: link type reflink is not available ([Errno 129] no more link types left to try out)
2023-01-23 16:17:45,906 DEBUG: Removing 'C:\Users\johnl\Arbeit2\.FAGs8ZeCNpMeFrGRKfP7S5.tmp'
2023-01-23 16:17:45,906 DEBUG: Removing 'C:\Users\johnl\Arbeit2\.FAGs8ZeCNpMeFrGRKfP7S5.tmp'
2023-01-23 16:17:45,908 DEBUG: Removing 'C:\Users\johnl\Arbeit2\.FAGs8ZeCNpMeFrGRKfP7S5.tmp'
2023-01-23 16:17:45,908 DEBUG: Removing 'C:\Users\johnl\Arbeit2\DVC3\.dvc\cache\.VsKTYWSQigtDM6pyQ4F5cw.tmp'
2023-01-23 16:17:45,911 DEBUG: Version info for developers:
DVC version: 2.43.1 
---------------------------------
Platform: Python 3.9.16 on Windows-10-10.0.22621-SP0
Subprojects:
	dvc_data = 0.35.1
	dvc_objects = 0.19.0
	dvc_render = 0.0.17
	dvc_task = 0.1.11
	dvclive = 1.3.2
	scmrepo = 0.1.6
Supports:
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2023.1.0, boto3 = 1.24.59)
Cache types: hardlink, symlink
Cache directory: NTFS on C:\
Caches: local
Remotes: None
Workspace directory: NTFS on C:\
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2023-01-23 16:17:45,913 DEBUG: Analytics is enabled.
2023-01-23 16:17:46,070 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', 'C:\\Users\\johnl\\AppData\\Local\\Temp\\tmpu8q07j2a']'
2023-01-23 16:17:46,073 DEBUG: Spawned '['daemon', '-q', 'analytics', 'C:\\Users\\johnl\\AppData\\Local\\Temp\\tmpu8q07j2a']'

Labels: A: data-sync (Related to dvc get/fetch/import/pull/push); awaiting response (we are waiting for your reply, please respond! :))