Endpoint URL is not taken into account when adding an external file from Minio #4151

Closed
lucasmaheo opened this issue Jul 2, 2020 · 10 comments
Labels
awaiting response we are waiting for your reply, please respond! :)

Comments

@lucasmaheo

Bug Report

Please provide information about your setup

Output of dvc version:

$ dvc version
WARNING: Unable to detect supported link types, as cache directory '.dvc/cache' doesn't exist. It is usually auto-created by commands such as `dvc add/fetch/pull/run/import`, but you could create it manually to enable this check.
DVC version: 1.1.2
Python version: 3.7.7
Platform: Linux-4.20.17-042017-generic-x86_64-with-debian-stretch-sid
Binary: False
Package: pip
Supported remotes: http, https, s3
Repo: dvc, git
Filesystem type (workspace): ('ext4', '/dev/sda2')

Additional Information (if any):

I was trying out DVC and could not make it work with a local deployment of Minio. Minio is hosted at 127.0.0.1:9000 and works as expected; I have tested it.

Contents of .dvc/config:

[cache]
    s3 = s3cache
['remote "s3cache"']
    url = s3://mybucket
    endpointurl = http://127.0.0.1:9000

Logs:

$ dvc add s3://mybucket/textfile --external --verbose
2020-07-02 09:05:55,227 DEBUG: fetched: [(3,)]
2020-07-02 09:05:55,583 DEBUG: fetched: [(0,)]
2020-07-02 09:05:55,587 ERROR: unexpected error - An error occurred (403) when calling the HeadObject operation: Forbidden
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/main.py", line 53, in main
    ret = cmd.run()
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/command/add.py", line 22, in run
    external=self.args.external,
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/__init__.py", line 36, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/repo/add.py", line 91, in add
    stage.save()
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/stage/__init__.py", line 380, in save
    self.save_outs()
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/stage/__init__.py", line 391, in save_outs
    out.save()
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/output/base.py", line 253, in save
    if not self.exists:
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/output/base.py", line 189, in exists
    return self.remote.tree.exists(self.path_info)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/remote/s3.py", line 133, in exists
    return self.isfile(path_info) or self.isdir(path_info)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/dvc/remote/s3.py", line 166, in isfile
    self.s3.head_object(Bucket=path_info.bucket, Key=path_info.path)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/lmaheo/miniconda3/envs/dvc/lib/python3.7/site-packages/botocore/client.py", line 637, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

After some investigation, DVC does seem to take the configuration and the endpointurl into account in general; however, for this specific boto3 request it does not. I did not dig much further into the code to find out why the two S3 clients end up being built from different configurations.

Configuration for the failing request:

{'url_path': '/mybucket/textfile', 'query_string': {}, 'method': 'HEAD', 'headers': {'User-Agent': 'Boto3/1.14.14 Python/3.7.7 Linux/4.20.17-042017-generic Botocore/1.17.14'}, 'body': b'', 'url': 'https://s3.amazonaws.com/mybucket/textfile', 'context': {'client_region': 'us-east-1', 'client_config': <botocore.config.Config object at 0x7f4fd6aaf310>, 'has_streaming_input': False, 'auth_type': None, 'signing': {'bucket': 'mybucket'}, 'timestamp': '20200702T130555Z'}}

Configuration loaded by DVC at some point during the call:

{'url': 's3://mybucket', 'endpointurl': 'http://127.0.0.1:9000', 'use_ssl': True, 'listobjects': False}
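
For illustration, here is a minimal sketch (not DVC's actual code) of the difference between a boto3 client built with the custom endpoint and one built without it; the bucket and key are just the ones from my setup:

import boto3

# Without endpoint_url, boto3 falls back to the default
# https://s3.amazonaws.com endpoint, which is what the failing
# HeadObject request above is doing.
default_client = boto3.client("s3")

# With the endpoint from the remote config applied, the same call
# goes to the local Minio instance instead.
minio_client = boto3.client("s3", endpoint_url="http://127.0.0.1:9000")
minio_client.head_object(Bucket="mybucket", Key="textfile")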

Any idea as to why this behaviour is happening?

Thanks,
Lucas

@efiop
Contributor

efiop commented Jul 2, 2020

@lucasmaheo That happens because you use a direct s3 URL in your dvc add command. What you should use instead is remote:// addressing. E.g. right now you already have the s3cache remote, but you could define a similar separate one and use it.
E.g. with s3cache you could:

dvc add remote://s3cache/textfile

but I would suggest something like:

[cache]
    s3 = s3cache
['remote "mys3"']
    url = s3://mybucket
    endpointurl = http://127.0.0.1:9000
['remote "s3ache"']
    url = remote://mys3/cache

and then just

dvc add remote://mys3/path/to/file

🙂

The external workspaces scenario is admittedly not very polished right now and has some flaws, so we've created #3920 to discuss how it could be improved.

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jul 2, 2020
@skshetry
Collaborator

skshetry commented Jul 2, 2020

Duplicate of #1280

@skshetry skshetry marked this as a duplicate of #1280 Jul 2, 2020
@skshetry
Collaborator

skshetry commented Jul 2, 2020

Duplicate of #3441

Also, we have an outstanding docs issue: iterative/dvc.org#108

@skshetry skshetry marked this as a duplicate of #3441 Jul 2, 2020
@lucasmaheo
Author

Thank you for the explanation @efiop. Indeed, this was a misunderstanding on my part. Why don't these instructions end up on the dvc add page? At any rate, this fixed it.

If anyone else ends up on this issue, the following command worked:

dvc add remote://s3cache/somefile --external

@skshetry skshetry closed this as completed Jul 2, 2020
@efiop
Contributor

efiop commented Jul 2, 2020

@lucasmaheo The reason is that this functionality is considered very advanced and not polished enough, so it is hard to describe it nicely in the docs :( But we do have a few tickets about explaining the remote:// notation in the docs. Btw, could you elaborate on your scenario? Maybe you don't actually need this functionality.

@lucasmaheo
Author

My scenario is hypothetical at this point. The typical use case would be to version data files as well as ML models in scalable storage. Usually, our projects use cloud storage (or on-premise, cloud-like storage) to have a single reference for the data. We are looking for a solution to version those voluminous datasets efficiently, and DVC seems to fit the bill.

@efiop
Contributor

efiop commented Jul 2, 2020

@lucasmaheo Sounds like that is indeed not the best approach. The first problem here is isolation: if any user of your dvc repo runs dvc checkout, the data on s3 will change for everyone, which is really bad practice unless you really know what you are doing. I would suggest storing your data in a regular dvc repo and accessing it through dvc. We have things like dvc get/import and even a Python API https://dvc.org/doc/api-reference that allow you to access your data using a human-readable name for the artifact that you need. See https://dvc.org/doc/use-cases/data-registries .
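
For example, a minimal sketch of reading a file from a data registry through the Python API (the registry URL, file path, and tag here are hypothetical):

import dvc.api

# Stream an artifact straight from a (hypothetical) registry repo,
# pinned to a tag, without keeping a permanent local copy.
with dvc.api.open(
    "datasets/textfile",
    repo="https://github.com/example/dataset-registry",
    rev="v1.0",
) as f:
    data = f.read()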

@lucasmaheo
Author

Oh thanks for the clarification. That is indeed not the behaviour I was looking for.

So DVC registries are meant to be used with local copies, as I understand it, in much the same way as Git repositories, with the added possibility of selecting only the required files. To avoid the local copies, we need to use the API. It all makes sense now.

At least I now know how to create the data registry. I expect that calling dvc.api.open() with rev left blank will read the version of the data committed at the current revision of the local Git repository.
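
Something like this minimal sketch (the file path is hypothetical) is what I have in mind; with repo and rev omitted, I expect it to resolve the path against the current repository and its checked-out revision:

import dvc.api

# No repo/rev given: read the version tracked at the current revision
# of the local repository.
with dvc.api.open("data/textfile") as f:
    content = f.read()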

Now, is there a way to stream outputs to the remote registry? Suppose I was reading data from S3 and iteratively producing a transformed version of that data. If the outputs do not fit on disk, I would prefer to write them to another location in S3 and, once the whole process is done, push that version of the data to the registry. Is that a feature you are looking into (pushing from a remote location)?

@efiop
Contributor

efiop commented Jul 2, 2020

Now is there a way to stream outputs to the remote registry?

@lucasmaheo You mean kinda like publishing them there? Currently there is no way to push it back like that 🙁 But we've heard requests for that. In my mind it should be some special operation that does a "straight to remote" action. So... something like

dvc add data --from s3://bucket/path --push --no-download (sorry for the lame options, just thinking out loud right now)

that would create data.dvc as if you had downloaded it by hand and then run dvc add data, but it wouldn't actually download anything to your disk; instead it would stream the data from s3://bucket/path, compute the needed hash on the fly and upload it to your remote on the fly. Clearly, in this approach, we would still use network traffic to stream the file, but at least we wouldn't use your local storage. That could also be avoided if the cloud could provide us with a real md5, but that is another topic for discussion.

@efiop
Contributor

efiop commented Jul 2, 2020

I feel like that ^ covers most of the misuses we've seen. Maybe it is even worth doing that by default when someone tries to feed a url to dvc add. E.g.

dvc add s3://bucket/path/data

would just create data.dvc and stream s3://bucket/path/data to compute the hash and push to the default remote. Not sure...
