Cannot add file having name with substring of a folder as prefix in s3 #2871

Closed

skshetry opened this issue Nov 30, 2019 · 4 comments · Fixed by #2873

Labels: bug

Comments

skshetry commented Nov 30, 2019

Steps to reproduce

  1. Upload two files to S3: folder/data/data.csv and folder/datasets.md.
  2. Set up the remotes and cache:
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache
  3. Run: dvc run -d remote://s3/data 'echo hello world'

Outcome

Running command:
        echo hello world
hello world
ERROR: unexpected error - '/folder/datasets.md' does not start with '/folder/data'

Version

$ dvc version
Python version: 3.6.6
Platform: Linux-5.3.12-arch1-1-x86_64-with-arch
Binary: False
Package: None
Filesystem type (cache directory): ('ext4', '/dev/sda9')
Filesystem type (workspace): ('ext4', '/dev/sda9')

Script to reproduce

#! /usr/bin/env bash

export AWS_ACCESS_KEY_ID='testing'
export AWS_SECRET_ACCESS_KEY='testing'
export AWS_SECURITY_TOKEN='testing'
export AWS_SESSION_TOKEN='testing'

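# Start a local S3 mock (moto) in the background.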
moto_server s3 &> /dev/null &

python -c '
import boto3

session = boto3.session.Session()
s3 = session.client("s3", endpoint_url="http://localhost:5000")
s3.create_bucket(Bucket="dvc-temp")

s3.put_object(Bucket="dvc-temp", Key="folder/data/data.csv")
s3.put_object(Bucket="dvc-temp", Key="folder/datasets.md", Body="### Datasets")
'

temp=$(mktemp -d)
cd "$temp"

dvc init --no-scm
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote modify s3 endpointurl http://localhost:5000
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache

dvc run -d remote://s3/data 'echo hello world'

Analysis:

  1. This is due to the walk_files implementation in RemoteS3 listing by <prefix> instead of <prefix>/ when walking files. Either walk_files should be given the directory path, or it should append the trailing slash itself.

for fname in self._list_paths(path_info, max_items):

Or, I'd prefer it to be handled when collecting the directory.

for fname in self.walk_files(path_info):

  2. Again, the logic of exists looks flawed. Say you have the files data/subdir-file.txt and data/subdir/1. When adding data/subdir, the first result could be subdir-file.txt, which matches the startswith check; therefore exists() will return True, but in reality subdir does not exist.
    So, the function should check if it's a directory, and should loop through all results of _list_paths() until it finds the exact match (not sure how expensive this will be). Both points are demonstrated in the sketch after the snippet below.

dvc/dvc/remote/s3.py, lines 208 to 211 in caa67c7:

    def exists(self, path_info):
        dir_path = path_info / ""
        fname = next(self._list_paths(path_info, max_items=1), "")
        return path_info.path == fname or fname.startswith(dir_path.path)
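
Both failure modes are easy to demonstrate outside of dvc. A minimal sketch, assuming the moto endpoint and the keys created by the reproduction scripts in this thread (this is illustrative, not dvc code):

import boto3

s3 = boto3.client("s3", endpoint_url="http://localhost:5000")

# Point 1: listing by "<prefix>" also matches sibling keys ...
resp = s3.list_objects_v2(Bucket="dvc-temp", Prefix="folder/data")
print([obj["Key"] for obj in resp["Contents"]])
# ['folder/data/data.csv', 'folder/datasets.md']

# ... while "<prefix>/" restricts the listing to the directory contents.
resp = s3.list_objects_v2(Bucket="dvc-temp", Prefix="folder/data/")
print([obj["Key"] for obj in resp["Contents"]])
# ['folder/data/data.csv']

# Point 2: S3 lists keys in lexicographic order, and "-" (0x2d) sorts before
# "/" (0x2f), so a sibling file comes back before anything under subdir/.
keys = sorted(["data/subdir-file.txt", "data/subdir/1"])
print(keys[0])                            # data/subdir-file.txt
print(keys[0].startswith("data/subdir"))  # True, despite not being under subdir/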

@ghost added the triage label Nov 30, 2019
skshetry commented Nov 30, 2019

The WIP implementation for Google Cloud Storage (#2853, more specifically #2853 (comment)) also suffers from the same issue.
For now, I have reverted the changes related to this in that PR, as it has already ballooned in size. I'd opt for fixing this in a follow-up PR.

skshetry commented Nov 30, 2019

The following patch works for me (it might need further improvement):

diff --git a/dvc/remote/s3.py b/dvc/remote/s3.py
index 8013f9bb..7494c96c 100644
--- a/dvc/remote/s3.py
+++ b/dvc/remote/s3.py
@@ -209,8 +209,13 @@ class RemoteS3(RemoteBASE):
 
     def exists(self, path_info):
         dir_path = path_info / ""
-        fname = next(self._list_paths(path_info, max_items=1), "")
-        return path_info.path == fname or fname.startswith(dir_path.path)
+        if self.isdir(dir_path):
+            return True
+
+        for fname in self._list_paths(path_info):
+            if path_info.path == fname:
+                return True
+        return False
 
     def makedirs(self, path_info):
         # We need to support creating empty directories, which means
@@ -279,7 +284,7 @@
         )
 
     def walk_files(self, path_info, max_items=None):
-        for fname in self._list_paths(path_info, max_items):
+        for fname in self._list_paths(path_info / "", max_items):
             if fname.endswith("/"):
                 continue
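
Worth noting: the original code needed only a single max_items=1 listing, while this patch does a directory check plus, in the worst case, a full scan of the prefix for an exact match; that is where the cost concern from the analysis above would show up, presumably only for prefixes with many sibling keys.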

efiop commented Nov 30, 2019

@skshetry Great catch! Mind creating a PR with that? 🙂 We could then do the same in gs. CC @MrOutis

@efiop added the bug label Nov 30, 2019
@ghost removed the triage label Nov 30, 2019
@skshetry

Script to reproduce the 2nd issue:

#! /usr/bin/env bash

export AWS_ACCESS_KEY_ID='testing'
export AWS_SECRET_ACCESS_KEY='testing'
export AWS_SECURITY_TOKEN='testing'
export AWS_SESSION_TOKEN='testing'

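# Start a local S3 mock (moto) in the background.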
moto_server s3 &> /dev/null &

python -c '
import boto3

session = boto3.session.Session()
s3 = session.client("s3", endpoint_url="http://localhost:5000")
s3.create_bucket(Bucket="dvc-temp")

s3.put_object(Bucket="dvc-temp", Key="folder/data/subdir-file.txt", Body="### Subdir")
s3.put_object(Bucket="dvc-temp", Key="folder/data/subdir/1", Body="")
'

temp=$(mktemp -d)
cd "$temp"

dvc init --no-scm
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote modify s3 endpointurl http://localhost:5000
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache

dvc add remote://s3/data/subdir
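
For reference, a quick way to confirm the listing order that trips the startswith check; again a minimal sketch assuming the moto endpoint and keys from the script above, not dvc code:

import boto3

s3 = boto3.client("s3", endpoint_url="http://localhost:5000")

# Mirrors _list_paths(path_info, max_items=1): the first key under the bare
# prefix is the sibling file, not anything under folder/data/subdir/.
resp = s3.list_objects_v2(Bucket="dvc-temp", Prefix="folder/data/subdir", MaxKeys=1)
print([obj["Key"] for obj in resp["Contents"]])
# ['folder/data/subdir-file.txt']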
