get-url and import-url doesn't seem to work with S3 buckets anymore. #4144

Closed
anotherbugmaster opened this issue Jul 1, 2020 · 3 comments · Fixed by #4528
Labels: bug (Did we break something?), p2-medium (Medium priority, should be done, but less important), research

Comments

@anotherbugmaster
Contributor

anotherbugmaster commented Jul 1, 2020

Bug Report

  1. Create an empty directory
  2. dvc init --no-scm
  3. dvc import-url s3://some_bucket/some_target -v
2020-07-01 17:14:02,947 DEBUG: fetched: [(3,)]                                  
2020-07-01 17:14:03,123 DEBUG: Removing output 'some_target' of stage: 'some_target.dvc'.
Importing 's3://some_bucket/some_target' -> 'some_target'
2020-07-01 17:14:03,123 DEBUG: Computed stage: 'some_target.dvc' md5: '2f8b87d3b22efd1638f414c3b3f65614'
2020-07-01 17:14:03,123 DEBUG: 'md5' of stage: 'some_target.dvc' changed.
2020-07-01 17:14:04,088 DEBUG: fetched: [(0,)]
2020-07-01 17:14:04,146 ERROR: failed to import s3://some_bucket/some_target. You could also try downloading it manually, and adding it with `dvc add`. - Current operation was unsuccessful because 's3://some_bucket/some_target' requires existing cache on 's3' remote. See <https://man.dvc.org/config#cache> for information on how to set up remote cache.
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/dvc/command/imp_url.py", line 14, in run
    self.repo.imp_url(
  File "/usr/lib/python3.8/site-packages/dvc/repo/__init__.py", line 36, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/usr/lib/python3.8/site-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/usr/lib/python3.8/site-packages/dvc/repo/imp_url.py", line 54, in imp_url
    stage.run()
  File "/home/anotherbugmaster/.local/lib/python3.8/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/usr/lib/python3.8/site-packages/dvc/stage/decorators.py", line 35, in rwlocked
    return call()
  File "/home/anotherbugmaster/.local/lib/python3.8/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/site-packages/dvc/stage/__init__.py", line 424, in run
    sync_import(self, dry, force)
  File "/usr/lib/python3.8/site-packages/dvc/stage/imports.py", line 29, in sync_import
    stage.save_deps()
  File "/usr/lib/python3.8/site-packages/dvc/stage/__init__.py", line 387, in save_deps
    dep.save()
  File "/usr/lib/python3.8/site-packages/dvc/output/base.py", line 268, in save
    self.info = self.save_info()
  File "/usr/lib/python3.8/site-packages/dvc/output/base.py", line 192, in save_info
    return self.remote.save_info(self.path_info)
  File "/usr/lib/python3.8/site-packages/dvc/remote/base.py", line 762, in save_info
    return self.tree.save_info(path_info, **kwargs)
  File "/usr/lib/python3.8/site-packages/dvc/remote/base.py", line 329, in save_info
    self.PARAM_CHECKSUM: self.get_hash(path_info, tree=tree, **kwargs)
  File "/usr/lib/python3.8/site-packages/dvc/remote/base.py", line 297, in get_hash
    hash_ = self.get_dir_hash(path_info, tree, **kwargs)
  File "/usr/lib/python3.8/site-packages/dvc/remote/base.py", line 311, in get_dir_hash
    raise RemoteCacheRequiredError(path_info)
dvc.exceptions.RemoteCacheRequiredError: Current operation was unsuccessful because 's3://some_bucket/some_target' requires existing cache on 's3' remote. See <https://man.dvc.org/config#cache> for information on how to set up remote cache.
------------------------------------------------------------

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
  4. dvc get-url s3://some_bucket/some_target -v
2020-07-01 17:15:39,910 ERROR: unexpected error - 'NoneType' object has no attribute 'cache'
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/dvc/main.py", line 53, in main
    ret = cmd.run()
  File "/usr/lib/python3.8/site-packages/dvc/command/get_url.py", line 17, in run
    Repo.get_url(self.args.url, out=self.args.out)
  File "/usr/lib/python3.8/site-packages/dvc/repo/get_url.py", line 19, in get_url
    dep.save()
  File "/usr/lib/python3.8/site-packages/dvc/output/base.py", line 268, in save
    self.info = self.save_info()
  File "/usr/lib/python3.8/site-packages/dvc/output/base.py", line 192, in save_info
    return self.remote.save_info(self.path_info)
  File "/usr/lib/python3.8/site-packages/dvc/remote/base.py", line 762, in save_info
    return self.tree.save_info(path_info, **kwargs)
  File "/usr/lib/python3.8/site-packages/dvc/remote/base.py", line 329, in save_info
    self.PARAM_CHECKSUM: self.get_hash(path_info, tree=tree, **kwargs)
  File "/usr/lib/python3.8/site-packages/dvc/remote/base.py", line 297, in get_hash
    hash_ = self.get_dir_hash(path_info, tree, **kwargs)
  File "/usr/lib/python3.8/site-packages/dvc/remote/base.py", line 310, in get_dir_hash
    if not self.cache:
  File "/usr/lib/python3.8/site-packages/dvc/remote/base.py", line 184, in cache
    return getattr(self.repo.cache, self.scheme)
AttributeError: 'NoneType' object has no attribute 'cache'
------------------------------------------------------------

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

The same commands work with https://some_domain/some_target URLs, and I don't think an external cache was ever necessary to download files from S3.
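
For readers following the second traceback: the failing line is `getattr(self.repo.cache, self.scheme)`, and the error message says the object whose `cache` attribute was looked up is `None`. That suggests `get-url` ran without a repo object attached. The sketch below reproduces that failure shape with hypothetical stand-in classes (`Cache`, `Repo`, `Tree`, `cache_or_none`, and `describe` are illustrative names, not DVC's code) and shows one possible guard. It is an assumption about the pattern, not the fix that was eventually merged.

```python
# Hypothetical stand-ins, NOT DVC's real classes: they only mimic the
# attribute access visible in the traceback above.
class Cache:
    def __init__(self, **by_scheme):
        # Store one cache object per scheme, e.g. Cache(s3=...).
        for scheme, value in by_scheme.items():
            setattr(self, scheme, value)


class Repo:
    def __init__(self, cache):
        self.cache = cache


class Tree:
    def __init__(self, repo, scheme):
        self.repo = repo  # None when no repo is attached
        self.scheme = scheme

    @property
    def cache(self):
        # Same shape as the failing line: raises AttributeError if repo is None.
        return getattr(self.repo.cache, self.scheme)

    @property
    def cache_or_none(self):
        # Guarded variant (an assumption for illustration only).
        if self.repo is None:
            return None
        return getattr(self.repo.cache, self.scheme, None)


def describe(tree):
    """Return the error text raised by `tree.cache`, or "ok" on success."""
    try:
        tree.cache
    except AttributeError as exc:
        return str(exc)
    return "ok"
```

Calling `describe(Tree(None, "s3"))` returns the same message as the log, while `Tree(None, "s3").cache_or_none` simply yields `None`.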

Please provide information about your setup

Output of dvc version:

$ dvc version
1.1.2


@ghost ghost added the triage Needs to be triaged label Jul 1, 2020
@efiop efiop added research p2-medium Medium priority, should be done, but less important labels Jul 1, 2020
@ghost ghost removed the triage Needs to be triaged label Jul 1, 2020
@efiop efiop added the bug Did we break something? label Jul 1, 2020
@efiop
Contributor

efiop commented Jul 1, 2020

@anotherbugmaster
Contributor Author

anotherbugmaster commented Jul 1, 2020

I found out a couple of things:

  • get-url works in 0.93.0
  • To make import-url work, one needs to set up an S3 cache in some bucket, not necessarily the bucket that contains the file being imported

That kind of solves the issue, but I don't get the logic behind it. Why would I need a cache in a separate bucket just to download a file from a completely different bucket? It seems weird, because I need to download the file to my local machine anyway in order to compute hashes.
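
The point about local hashing can be illustrated with a minimal sketch: once a file has been downloaded, its MD5 can be computed in chunks with only the standard library, with no remote cache involved. `file_md5` and `chunk_size` are hypothetical names chosen for illustration. This is not DVC's implementation, and DVC's own hashing may differ in details.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a local file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use bounded even for large files.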

@efiop
Contributor

efiop commented Jul 1, 2020

@anotherbugmaster This is a well-known bug that became more intrusive once we adjusted the way we process inputs in get-url and import-url. It will be improved in the near future.

@efiop efiop assigned efiop and unassigned efiop Jul 31, 2020
@efiop efiop mentioned this issue Aug 11, 2020
3 tasks
efiop added a commit to efiop/dvc that referenced this issue Aug 15, 2020
Currently we kinda assume that whatever is returned by `get_file_hash`
is of type self.PARAM_CHECKSUM, which is not actually true. E.g. for
http it might return `etag` or `md5`, but we don't distinguish between
those and call both `etag`. This is becoming more relevant for dir
hashes that are computed a few different ways (e.g. in-memory md5 or
upload to remote and get etag for the dir file).

Prerequisite for iterative#4144 and iterative#3069
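
The problem described in the commit message above, a hash value that travels without its type, can be sketched as follows. `TypedHash` and `same_content` are hypothetical names for illustration and are not the structure DVC adopted. The idea is only that a value tagged with its kind (`md5` vs. `etag`) cannot be silently compared against a value of another kind.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypedHash:
    """A hash value that carries its own type (illustrative only)."""
    name: str   # e.g. "md5" or "etag"
    value: str


def same_content(a: TypedHash, b: TypedHash) -> Optional[bool]:
    """Compare two hashes; return None when their types differ,
    since comparing an etag against an md5 is meaningless."""
    if a.name != b.name:
        return None
    return a.value == b.value
```

Returning `None` for mismatched types makes the "cannot tell" case explicit instead of reporting a misleading `False`.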
efiop added a commit that referenced this issue Aug 15, 2020
efiop added a commit to efiop/dvc that referenced this issue Aug 25, 2020
efiop added a commit that referenced this issue Aug 25, 2020
efiop added a commit to efiop/dvc that referenced this issue Aug 30, 2020
@efiop efiop mentioned this issue Aug 30, 2020
2 tasks
efiop added a commit that referenced this issue Aug 30, 2020
* dvc: use HashInfo

Related to #4144 , #3069 , #1676

* Update dvc/tree/s3.py

Co-authored-by: Saugat Pachhai <[email protected]>

efiop added a commit to efiop/dvc that referenced this issue Aug 31, 2020
By itself `self.info` is quite confusing, as it is not clear what it is
about. Using `hash_info` makes much more sense and is required to
support alternative hash types.

Related to iterative#4144, iterative#3069, iterative#1676
efiop added a commit that referenced this issue Aug 31, 2020