S3 with writing #174

martindurant · 2016-03-09T18:29:56Z

With retries on downloads.

More of the tests easily converted

mrocklin · 2016-03-09T18:33:59Z

boto3 splits individual get_object calls into 8MB chunks. Is this preferable for some reason?

Also cc @hussainsultan, who seems to have historically been interested in S3 work.

mrocklin · 2016-03-09T18:34:24Z

distributed/s3fs.py

+            buff = io.BytesIO()
+            buffer_size = 1024 * 16
+            for chunk in iter(lambda: resp['Body'].read(buffer_size),
+                              b''):


I had no idea that this was valid Python. I wonder when I'll stop learning things about this language.
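For anyone else meeting this idiom for the first time: the two-argument form of `iter` calls the given callable repeatedly until it returns the sentinel value. Below is a minimal sketch of the same loop as in the diff, assuming an in-memory `io.BytesIO` as a stand-in for `resp['Body']` (the real object is a streaming response body). The helper name `read_in_chunks` is hypothetical.

```python
import io


def read_in_chunks(stream, buffer_size=1024 * 16):
    """Copy a file-like object into memory, buffer_size bytes at a time."""
    buff = io.BytesIO()
    # iter(callable, sentinel) keeps calling stream.read(buffer_size)
    # and stops as soon as the call returns b'' (end of stream).
    for chunk in iter(lambda: stream.read(buffer_size), b''):
        buff.write(chunk)
    return buff.getvalue()


# Stand-in data: 50000 bytes, so the loop takes several 16kB reads.
data = b'x' * 50000
result = read_in_chunks(io.BytesIO(data))
```

The sentinel comparison uses equality, so the loop ends on the first empty read rather than on an exception.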

Getting this error:

/home/ubuntu/anaconda3/lib/python3.5/site-packages/distributed/s3fs.py in <lambda>()
    529             buff = io.BytesIO()
    530             buffer_size = 1024 * 16
--> 531             for chunk in iter(lambda: resp['Body'].read(buffer_size),
    532                               b''):
    533                 buff.write(chunk)

NameError: free variable 'resp' referenced before assignment in enclosing scope

This would happen if we raised a ClientError but didn't fall into the listed condition.

Aha, then this might be exactly the interesting case we wanted to find
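A minimal sketch of how that failure mode can arise, and one possible way to avoid it. Everything here is an assumption made for illustration: `ClientError` is a local stand-in for the botocore exception, the `'NoSuchKey'` check is a hypothetical "listed condition", and `fetch_buggy` / `fetch_fixed` are not the functions in this PR.

```python
class ClientError(Exception):
    """Local stand-in for botocore.exceptions.ClientError (assumption)."""


def fetch_buggy(get_object):
    try:
        resp = get_object()
    except ClientError as e:
        # Hypothetical "listed condition": only one error kind is handled.
        if 'NoSuchKey' in str(e):
            raise FileNotFoundError(str(e))
        # Any other ClientError falls through silently ...
    # ... so resp was never bound, and the closure below raises NameError.
    return (lambda: resp)()


def fetch_fixed(get_object):
    try:
        resp = get_object()
    except ClientError as e:
        if 'NoSuchKey' in str(e):
            raise FileNotFoundError(str(e))
        raise  # re-raise anything that is not explicitly handled
    return resp
```

With the bare `raise`, the original `ClientError` surfaces instead of a confusing `NameError` further down.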

martindurant · 2016-03-09T18:48:26Z

On 8MB chunks, I know that multi-part upload only accepts chunks > 5MB, but I haven't seen a reference to preferable download chunks (this is not typical). The stream downloads in 16kB pieces.

mrocklin · 2016-03-09T18:57:09Z

The 8MB figure comes from the threshold that boto3 uses to determine when to start doing multipart downloads. I may have misinterpreted that as being a chunksize.

martindurant · 2016-03-09T19:05:10Z

Happy to change the default chunksize, so long as it's something large compared to typical records and doesn't add too much overhead per chunk.

Note that distributed/cli/tests/test_dscheduler.py::test_defaults is failing on some builds again.

mrocklin · 2016-03-09T19:10:10Z

Whoa, those errors are very odd. cc @jreback

distributed/protocol.py:33: in <module>
    import pandas.msgpack as msgpack
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/__init__.py:42: in <module>
    import pandas.core.config_init
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/core/config_init.py:17: in <module>
    from pandas.core.format import detect_console_encoding
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/core/format.py:10: in <module>
    from pandas.core.index import Index, MultiIndex, _ensure_index
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/core/index.py:31: in <module>
    from pandas.io.common import PerformanceWarning
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/io/common.py:68: in <module>
    from boto.s3 import key
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/boto/__init__.py:1216: in <module>
    boto.plugin.load_plugins(config)
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/boto/plugin.py:92: in load_plugins
    for file in glob.glob(os.path.join(directory, '*.py')):
../../../miniconda/envs/test-environment/lib/python3.5/posixpath.py:89: in join
    genericpath._check_arg_types('join', a, *p)
../../../miniconda/envs/test-environment/lib/python3.5/genericpath.py:143: in _check_arg_types
    (funcname, s.__class__.__name__)) from None
E   TypeError: join() argument must be str or bytes, not 'NoneType'
__________ ERROR collecting distributed/cli/tests/test_dscheduler.py ___________
distributed/cli/tests/test_dscheduler.py:6: in <module>
    from distributed import Scheduler, Executor
distributed/__init__.py:3: in <module>
    from .center import Center
distributed/center.py:13: in <module>
    from .core import (Server, read, write, rpc, pingpong, send_recv,
distributed/core.py:25: in <module>
    from . import protocol
distributed/protocol.py:33: in <module>
    import pandas.msgpack as msgpack
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/__init__.py:42: in <module>
    import pandas.core.config_init
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/core/config_init.py:13: in <module>
    import pandas.core.config as cf
E   AttributeError: module 'pandas' has no attribute 'core'

mrocklin · 2016-03-09T19:16:19Z

@martindurant are these relevant?

In [12]: boto3.s3.transfer.S3_RETRYABLE_ERRORS
Out[12]: 
(socket.timeout,
 ConnectionError,
 botocore.vendored.requests.packages.urllib3.exceptions.ReadTimeoutError,
 botocore.exceptions.IncompleteReadError)

martindurant · 2016-03-09T19:29:53Z

I currently retry for all exceptions - could just be set to this list.
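A rough sketch of what restricting the retry to a fixed tuple could look like. It is not the code in this PR: `fetch_with_retries` and its parameters are hypothetical, and the tuple here is limited to the two builtin entries from the list above so the example runs without boto3 installed.

```python
import socket
import time

# Assumption: only the builtin members of the tuple quoted above.
RETRYABLE = (socket.timeout, ConnectionError)


def fetch_with_retries(fetch, retries=3, delay=0.0):
    """Call fetch(), retrying only on errors listed in RETRYABLE."""
    for attempt in range(retries):
        try:
            return fetch()
        except RETRYABLE:
            # Out of attempts: let the last error propagate.
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

Any exception outside the tuple propagates on the first attempt, so programming errors are not hidden behind repeated retries.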

jreback · 2016-03-09T19:42:37Z

prob related to this: pandas-dev/pandas#11915

we are using boto, which doesn't play nicely with 3.5. soln is just to use boto3 on py3. which will prob be done in 0.18.1.

jreback · 2016-03-09T19:43:38Z

note that v0.18.0.rc2, installed via conda install pandas=v0.18.0rc2 -c pandas should allow pandas to import correctly on 3.5. If you have a setup where it does not pls lmk asap. (this was put in 0.18.0rc2 and is not in the 0.18.0rc1)

mrocklin · 2016-03-16T18:34:50Z

Closing. This has been partially merged and is otherwise continuing at https://github.com/dask/s3fs

Martin Durant added 3 commits March 8, 2016 22:17

Split filesystem from s3 into s3fs

ddab36d

Add in some write functions and copy a few tests from hdfs3

54b0a2d

More of the tests easily converted

Implement retriable S3 fetch

bfa30e9

mrocklin reviewed Mar 9, 2016

mrocklin mentioned this pull request Mar 9, 2016

S3fs #176

Merged

Merge branch 'master' into s3fs

c16b0e7

martindurant changed the title ~~WIP: S3fs~~ S3 with writing Mar 10, 2016

mrocklin closed this Mar 16, 2016
