Endpoint URL is not taken into account when adding an external file from Minio #4151

Closed
lucasmaheo opened this issue Jul 2, 2020 · 10 comments
Labels
awaiting response we are waiting for your reply, please respond! :)

Comments

@lucasmaheo

Bug Report

Please provide information about your setup

Output of dvc version:

$ dvc version
WARNING: Unable to detect supported link types, as cache directory '.dvc/cache' doesn't exist. It is usually auto-created by commands such as `dvc add/fetch/pull/run/import`, but you could create it manually to enable this check.
DVC version: 1.1.2
Python version: 3.7.7
Platform: Linux-4.20.17-042017-generic-x86_64-with-debian-stretch-sid
Binary: False
Package: pip
Supported remotes: http, https, s3
Repo: dvc, git
Filesystem type (workspace): ('ext4', '/dev/sda2')

Additional Information (if any):

I was trying out DVC and could not make it work with a local deployment of Minio. Minio is hosted at 127.0.0.1:9000 and works as expected; I have tested it.

Contents of .dvc/config:

[cache]
    s3 = s3cache
['remote "s3cache"']
    url = s3://mybucket
    endpointurl = http://127.0.0.1:9000

Logs:

$ dvc add s3://mybucket/textfile --external --verbose
2020-07-02 09:05:55,227 DEBUG: fetched: [(3,)]
2020-07-02 09:05:55,583 DEBUG: fetched: [(0,)]
2020-07-02 09:05:55,587 ERROR: unexpected error - An error occurred (403) when calling the HeadObject operation: Forbidden
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/main.py", line 53, in main
    ret = cmd.run()
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/command/add.py", line 22, in run
    external=self.args.external,
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/__init__.py", line 36, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/add.py", line 91, in add
    stage.save()
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/stage/__init__.py", line 380, in save
    self.save_outs()
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/stage/__init__.py", line 391, in save_outs
    out.save()
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/output/base.py", line 253, in save
    if not self.exists:
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/output/base.py", line 189, in exists
    return self.remote.tree.exists(self.path_info)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/remote/s3.py", line 133, in exists
    return self.isfile(path_info) or self.isdir(path_info)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/remote/s3.py", line 166, in isfile
    self.s3.head_object(Bucket=path_info.bucket, Key=path_info.path)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/botocore/client.py", line 637, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

After some investigation, DVC does seem to take the configuration and the endpointurl into account in general; however, for this specific boto3 request it does not. I did not dig much further into the code to find out why the two S3 clients end up being built from different configurations.

Configuration for the failing request:

{'url_path': '/mybucket/textfile', 'query_string': {}, 'method': 'HEAD', 'headers': {'User-Agent': 'Boto3/1.14.14 Python/3.7.7 Linux/4.20.17-042017-generic Botocore/1.17.14'}, 'body': b'', 'url': 'https://s3.amazonaws.com/mybucket/textfile', 'context': {'client_region': 'us-east-1', 'client_config': <botocore.config.Config object at 0x7f4fd6aaf310>, 'has_streaming_input': False, 'auth_type': None, 'signing': {'bucket': 'mybucket'}, 'timestamp': '20200702T130555Z'}}

Configuration loaded by DVC at some point during the call:

{'url': 's3://mybucket', 'endpointurl': 'http://127.0.0.1:9000', 'use_ssl': True, 'listobjects': False}
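
For illustration, here is a minimal sketch (not DVC's actual code) of the difference between a boto3 client built with the custom endpoint and one built without it; the bucket and key are just the ones from my setup:

import boto3

# Without endpoint_url, boto3 falls back to the default
# https://s3.amazonaws.com endpoint, which is what the failing
# HeadObject request above is doing.
default_client = boto3.client("s3")

# With the endpoint from the remote config applied, the same call
# goes to the local Minio instance instead.
minio_client = boto3.client("s3", endpoint_url="http://127.0.0.1:9000")
minio_client.head_object(Bucket="mybucket", Key="textfile")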

Any idea as to why this behaviour is happening?

Thanks,
Lucas

@efiop
Contributor

efiop commented Jul 2, 2020

@lucasmaheo That happens because you use a direct s3 URL in your dvc add command. What you should use instead is remote:// addressing. E.g. right now you already have the s3cache remote, but you could define a similar separate one and use it.
E.g. with s3cache you could:

dvc add remote://s3cache/textfile

but I would suggest something like:

[cache]
    s3 = s3cache
['remote "mys3"']
    url = s3://mybucket
    endpointurl = http://127.0.0.1:9000
['remote "s3ache"']
    url = remote://mys3/cache

and then just

dvc add remote://mys3/path/to/file

🙂

The external workspaces scenario is admittedly not very polished right now and has some flaws, so we've created #3920 to discuss how it could be improved.

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jul 2, 2020
@skshetry
Collaborator

skshetry commented Jul 2, 2020

Duplicate of #1280

@skshetry skshetry marked this as a duplicate of #1280 Jul 2, 2020
@skshetry
Collaborator

skshetry commented Jul 2, 2020

Duplicate of #3441

Also, we have an outstanding docs issue: iterative/dvc.org#108

@skshetry skshetry marked this as a duplicate of #3441 Jul 2, 2020
@lucasmaheo
Author

Thank you for the explanation @efiop. Indeed, this was a misunderstanding on my part. Why don't these instructions end up on the dvc add page? At any rate, this fixed it.

If anyone else ends up on this issue, the following command worked:

dvc add remote://s3cache/somefile --external

@skshetry skshetry closed this as completed Jul 2, 2020
@efiop
Contributor

efiop commented Jul 2, 2020

@lucasmaheo The reason is that this functionality is considered very advanced and not polished enough, so it is hard to describe it nicely in the docs :( But we do have a few tickets about explaining the remote:// notation in the docs. Btw, could you elaborate on your scenario? Maybe you don't actually need this functionality.

@lucasmaheo
Author

My scenario is hypothetical at this point. The typical use case would be to version data files as well as ML models in scalable storage. Usually, our projects use cloud storage (or on-premise, cloud-like storage) to have a single reference for the data. We are looking for a solution to version those voluminous datasets efficiently, and DVC seems to fit the bill.

@efiop
Contributor

efiop commented Jul 2, 2020

@lucasmaheo Sounds like that is indeed not the best approach. The first problem here is isolation: if any user of your dvc repo runs dvc checkout, the data on s3 will change for everyone, which is really bad practice unless you really know what you are doing. I would suggest storing your data in a regular dvc repo and accessing it through dvc. We have things like dvc get/import and even a Python API https://dvc.org/doc/api-reference that allow you to access your data using a human-readable name for the artifact that you need. See https://dvc.org/doc/use-cases/data-registries .
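
For example, a minimal sketch of reading a file from a data registry through the Python API (the registry URL, file path, and tag here are hypothetical):

import dvc.api

# Stream an artifact straight from a (hypothetical) registry repo,
# pinned to a tag, without keeping a permanent local copy.
with dvc.api.open(
    "datasets/textfile",
    repo="https://github.com/example/dataset-registry",
    rev="v1.0",
) as f:
    data = f.read()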

@lucasmaheo
Author

Oh thanks for the clarification. That is indeed not the behaviour I was looking for.

So DVC registries are meant to be used with local copies, as I understand it, in much the same way as Git repositories, with the added possibility of selecting only the required files. To avoid the local copies, we need to use the API. It all makes sense now.

At least I now know how to create the data registry. I expect that calling dvc.api.open() with rev left blank will read the version of the data committed at the current revision of the local Git repository.
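
Something like this minimal sketch (the file path is hypothetical) is what I have in mind; with repo and rev omitted, I expect it to resolve the path against the current repository and its checked-out revision:

import dvc.api

# No repo/rev given: read the version tracked at the current revision
# of the local repository.
with dvc.api.open("data/textfile") as f:
    content = f.read()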

Now, is there a way to stream outputs to the remote registry? Suppose I was reading data from S3 and iteratively producing a transformed version of that data. If the outputs do not fit on disk, I would prefer to write them to another location in S3 and, once the whole process is done, push that version of the data to the registry. Is that a feature you are looking into (pushing from a remote location)?

@efiop
Contributor

efiop commented Jul 2, 2020

Now is there a way to stream outputs to the remote registry?

@lucasmaheo You mean kinda like publishing them there? Currently there is no way to push it back like that 🙁 But we've heard requests for that. In my mind it should be some special operation that does a "straight to remote" action. So... something like

dvc add data --from s3://bucket/path --push --no-download (sorry for the lame options, just thinking out loud right now)

that would create data.dvc as if you had downloaded it by hand and then run dvc add data, but it wouldn't actually download anything to your disk; instead it would stream the data from s3://bucket/path, compute the needed hash on the fly and upload it to your remote on the fly. Clearly, in this approach, we would still use network traffic to stream the file, but at least we wouldn't use your local storage. That could also be avoided if the cloud could provide us with a real md5, but that is another topic for discussion.

@efiop
Contributor

efiop commented Jul 2, 2020

I feel like that ^ covers most of the misuses we've seen. Maybe it is even worth doing that by default when someone tries to feed a url to dvc add. E.g.

dvc add s3://bucket/path/data

would just create data.dvc and stream s3://bucket/path/data to compute the hash and push to the default remote. Not sure...
