Cannot add file having name with substring of a folder as prefix in s3 #2871
Labels: bug
The WIP implementation for Google Cloud Storage (#2853, more specifically #2853 (comment)) also suffers from the same issue.
The following patch is working for me (might need additional improvement):

```diff
diff --git a/dvc/remote/s3.py b/dvc/remote/s3.py
index 8013f9bb..7494c96c 100644
--- a/dvc/remote/s3.py
+++ b/dvc/remote/s3.py
@@ -209,8 +209,13 @@ class RemoteS3(RemoteBASE):
     def exists(self, path_info):
         dir_path = path_info / ""
-        fname = next(self._list_paths(path_info, max_items=1), "")
-        return path_info.path == fname or fname.startswith(dir_path.path)
+
+        if self.isdir(dir_path):
+            return True
+
+        for fname in self._list_paths(path_info):
+            if path_info.path == fname:
+                return True
+
+        return False
+
     def makedirs(self, path_info):
         # We need to support creating empty directories, which means
@@ -279,7 +284,7 @@ class RemoteS3(RemoteBASE):
     )
     def walk_files(self, path_info, max_items=None):
-        for fname in self._list_paths(path_info, max_items):
+        for fname in self._list_paths(path_info / "", max_items):
             if fname.endswith("/"):
                 continue
```
Script to reproduce the 2nd issue:

```bash
#! /usr/bin/env bash

export AWS_ACCESS_KEY_ID='testing'
export AWS_SECRET_ACCESS_KEY='testing'
export AWS_SECURITY_TOKEN='testing'
export AWS_SESSION_TOKEN='testing'

moto_server s3 &> /dev/null &

python -c '
import boto3

session = boto3.session.Session()
s3 = session.client("s3", endpoint_url="http://localhost:5000")
s3.create_bucket(Bucket="dvc-temp")
s3.put_object(Bucket="dvc-temp", Key="folder/data/subdir-file.txt", Body="### Subdir")
s3.put_object(Bucket="dvc-temp", Key="folder/data/subdir/1", Body="")
'

temp=$(mktemp -d)
cd $temp

dvc init --no-scm
dvc remote add -f s3 s3://dvc-temp/folder
dvc remote modify s3 endpointurl http://localhost:5000
dvc remote add -f cache remote://s3/cache
dvc config cache.s3 cache

dvc add remote://s3/data/subdir
```
Steps to reproduce
1. Have `folder/data/data.csv` and `folder/datasets.md` on the S3 remote.
2. Run `dvc run -d remote://s3/data 'echo hello world'`.
Outcome
Version
Script to reproduce
Analysis:
The `walk_files` implementation in `RemoteS3` lists files via the bare `prefix` instead of `<prefix>/`. Either `walk_files` should receive a directory path, or it should append the trailing slash itself.

dvc/dvc/remote/s3.py, line 282 in 0404a23

Or, I'd prefer it to be handled when collecting the directory.

dvc/dvc/remote/base.py, line 196 in caa67c7
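To make the prefix point concrete, here is an illustrative sketch (not DVC code) that talks to the moto_server started in the reproduction script above; the bucket name `prefix-demo` and its two keys are invented for this example. Listing by the bare prefix also picks up the sibling `folder/datasets.md`, while the trailing-slash prefix does not:

```python
import boto3

# Assumes the moto_server from the reproduction script above is still
# running on port 5000; bucket and keys below are hypothetical.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:5000",
    region_name="us-east-1",
    aws_access_key_id="testing",
    aws_secret_access_key="testing",
)
s3.create_bucket(Bucket="prefix-demo")
s3.put_object(Bucket="prefix-demo", Key="folder/data/data.csv", Body="1,2,3")
s3.put_object(Bucket="prefix-demo", Key="folder/datasets.md", Body="# docs")

for prefix in ("folder/data", "folder/data/"):
    resp = s3.list_objects_v2(Bucket="prefix-demo", Prefix=prefix)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    print(repr(prefix), keys)

# 'folder/data'  ['folder/data/data.csv', 'folder/datasets.md']
# 'folder/data/' ['folder/data/data.csv']
```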
`exists` also looks flawed. Say you have `data/subdir-file.txt` and `data/subdir/1` files. When adding `data/subdir`, the first listed result could be `subdir-file.txt`, which matches the `startswith` check, so `exists()` will return True even though `subdir` itself does not exist as a key. So the function should check whether the path is a directory, and should loop through all results of `_list_paths()` until it finds an exact match (not sure how expensive this will be).

dvc/dvc/remote/s3.py, lines 208 to 211 in caa67c7
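A minimal toy sketch of the listing order involved (plain Python stand-ins, not the real `_list_paths()` or DVC code): the first key returned under the prefix is `data/subdir-file.txt`, so a check that only inspects the first listed result cannot reliably tell whether `data/subdir` exists, whereas checking the directory form and then scanning for an exact match, as in the patch earlier in this thread, can.

```python
# "keys" stands in for what S3 returns when listing with Prefix="data/subdir"
# (lexicographic order, so the '-' in "subdir-file.txt" sorts before the '/'
# in "subdir/1").
keys = sorted(["data/subdir-file.txt", "data/subdir/1"])
path = "data/subdir"

# A single-item lookup, as in the current exists(), only ever sees this key:
first = next(iter(keys))
print(first)  # data/subdir-file.txt -- says nothing definitive about "data/subdir"

# Patch-style check: directory form first, then scan for an exact match.
is_dir = any(k.startswith(path + "/") for k in keys)  # True ("data/subdir/1")
is_exact = any(k == path for k in keys)               # False (no such key)
print(is_dir or is_exact)                             # True -- it exists as a directory
```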