S3 with writing #174

Closed
wants to merge 4 commits into from
Conversation

martindurant
Member

with retries on downloads

@mrocklin
Member

mrocklin commented Mar 9, 2016

boto3 splits individual get_object calls into 8MB chunks. Is this preferable for some reason?

Also cc @hussainsultan, who seems to have historically been interested in S3 work.

buff = io.BytesIO()
buffer_size = 1024 * 16
for chunk in iter(lambda: resp['Body'].read(buffer_size),
                  b''):
Member

I had no idea that this was valid Python. I wonder when I'll stop learning things about this language.
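
For readers who have not met it either: the two-argument form `iter(callable, sentinel)` calls the callable repeatedly and stops as soon as it returns the sentinel. Below is a minimal sketch of the loop under review, assuming an in-memory `io.BytesIO` as a stand-in for `resp['Body']` (any object with a `read(n)` method behaves the same way). The names `body` and `chunk_sizes` are illustrative and not part of the PR.

```python
import io

# Stand-in for resp['Body'] (assumption): 40000 bytes of dummy data.
body = io.BytesIO(b"x" * 40000)
buffer_size = 1024 * 16

buff = io.BytesIO()
chunk_sizes = []  # illustrative only: records how much each read returned

# iter(callable, sentinel): call body.read(buffer_size) until it returns b''.
for chunk in iter(lambda: body.read(buffer_size), b''):
    buff.write(chunk)
    chunk_sizes.append(len(chunk))
```

The loop ends on the first empty read, so the sentinel must match the type that `read` returns at end of stream (`b''` for a binary stream).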

Member

Getting this error:

/home/ubuntu/anaconda3/lib/python3.5/site-packages/distributed/s3fs.py in <lambda>()
    529             buff = io.BytesIO()
    530             buffer_size = 1024 * 16
--> 531             for chunk in iter(lambda: resp['Body'].read(buffer_size),
    532                               b''):
    533                 buff.write(chunk)

NameError: free variable 'resp' referenced before assignment in enclosing scope

Member

This would happen if we raised a ClientError but didn't fall into the listed condition.
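
A hypothetical reconstruction of that failure mode, not the actual `s3fs.py` code: `ClientError` here is a stand-in class for `botocore.exceptions.ClientError`, and `fetch_buggy`, `fetch_fixed`, and the `'NoSuchKey'` condition are assumed names for illustration. If the `except` block only handles one condition and otherwise falls through, `resp` is never bound, and the later lambda fails with a `NameError`.

```python
class ClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""
    def __init__(self, code):
        super().__init__(code)
        self.code = code


def fetch_buggy(get_object):
    try:
        resp = get_object()
    except ClientError as e:
        if e.code == 'NoSuchKey':  # the one "listed condition" (assumed)
            raise IOError('missing key')
        # Any other ClientError falls through here with resp unbound.
    read = lambda: resp['Body']
    return read()  # NameError when resp was never assigned


def fetch_fixed(get_object):
    try:
        resp = get_object()
    except ClientError as e:
        if e.code == 'NoSuchKey':
            raise IOError('missing key')
        raise  # re-raise everything else instead of falling through
    read = lambda: resp['Body']
    return read()
```

With the bare `raise`, an unexpected `ClientError` surfaces as itself rather than being masked by a `NameError` a few lines later.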

Member Author

Aha, then this might be exactly the interesting case we wanted to find.

@martindurant
Member Author

On 8MB chunks: I know that multipart upload only accepts chunks > 5MB, but I haven't seen a reference to a preferable chunk size for downloads (this is not typical). The stream downloads in 16kB pieces.

@mrocklin
Member

mrocklin commented Mar 9, 2016

The 8MB figure comes from the threshold that boto3 uses to determine when to start doing multipart downloads. I may have misinterpreted that as being a chunksize.

@martindurant
Member Author

Happy to change to default chunksize, so long as it's something large compared to typical records and doesn't add too much overhead per chunk.

Note that distributed/cli/tests/test_dscheduler.py::test_defaults is failing on some builds again.

@mrocklin
Member

mrocklin commented Mar 9, 2016

Whoa, those errors are very odd. cc @jreback

distributed/protocol.py:33: in <module>
    import pandas.msgpack as msgpack
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/__init__.py:42: in <module>
    import pandas.core.config_init
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/core/config_init.py:17: in <module>
    from pandas.core.format import detect_console_encoding
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/core/format.py:10: in <module>
    from pandas.core.index import Index, MultiIndex, _ensure_index
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/core/index.py:31: in <module>
    from pandas.io.common import PerformanceWarning
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/io/common.py:68: in <module>
    from boto.s3 import key
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/boto/__init__.py:1216: in <module>
    boto.plugin.load_plugins(config)
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/boto/plugin.py:92: in load_plugins
    for file in glob.glob(os.path.join(directory, '*.py')):
../../../miniconda/envs/test-environment/lib/python3.5/posixpath.py:89: in join
    genericpath._check_arg_types('join', a, *p)
../../../miniconda/envs/test-environment/lib/python3.5/genericpath.py:143: in _check_arg_types
    (funcname, s.__class__.__name__)) from None
E   TypeError: join() argument must be str or bytes, not 'NoneType'
__________ ERROR collecting distributed/cli/tests/test_dscheduler.py ___________
distributed/cli/tests/test_dscheduler.py:6: in <module>
    from distributed import Scheduler, Executor
distributed/__init__.py:3: in <module>
    from .center import Center
distributed/center.py:13: in <module>
    from .core import (Server, read, write, rpc, pingpong, send_recv,
distributed/core.py:25: in <module>
    from . import protocol
distributed/protocol.py:33: in <module>
    import pandas.msgpack as msgpack
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/__init__.py:42: in <module>
    import pandas.core.config_init
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/pandas/core/config_init.py:13: in <module>
    import pandas.core.config as cf
E   AttributeError: module 'pandas' has no attribute 'core'
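
The first traceback above bottoms out in `os.path.join(directory, '*.py')` being called with `directory` set to `None`. A minimal sketch of just that last step follows. Why boto's plugin loader ends up with `None` is not shown in the log, so the assignment below is an assumption made for illustration, and `error` is an illustrative name.

```python
import os

directory = None  # assumption: the value boto's load_plugins received

try:
    os.path.join(directory, '*.py')
except TypeError as exc:
    # Python 3 rejects None as a path component.
    error = exc
else:
    error = None
```

The exact wording of the message differs between Python versions, but the exception type is `TypeError` in each case.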

@mrocklin
Member

mrocklin commented Mar 9, 2016

@martindurant are these relevant?

In [12]: boto3.s3.transfer.S3_RETRYABLE_ERRORS
Out[12]: 
(socket.timeout,
 ConnectionError,
 botocore.vendored.requests.packages.urllib3.exceptions.ReadTimeoutError,
 botocore.exceptions.IncompleteReadError)

@martindurant
Member Author

I currently retry for all exceptions; that could just be restricted to this list.
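
A minimal sketch of what restricting retries to such a list could look like. It is not the PR's code: `with_retries`, `retries`, and `delay` are hypothetical names, and `RETRYABLE_ERRORS` below keeps only the two standard-library entries from the tuple above so the sketch runs without boto3 installed.

```python
import socket
import time

# Assumption: a stdlib-only subset of boto3.s3.transfer.S3_RETRYABLE_ERRORS.
RETRYABLE_ERRORS = (socket.timeout, ConnectionError)


def with_retries(func, retries=3, delay=0.0):
    """Call func(), retrying only on errors listed in RETRYABLE_ERRORS."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return func()
        except RETRYABLE_ERRORS:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
```

Anything outside the tuple (for example a missing key or a permissions error) propagates on the first attempt instead of being retried.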

@jreback
Contributor

jreback commented Mar 9, 2016

prob related to this: pandas-dev/pandas#11915

we are using boto, which doesn't play nicely with 3.5. soln is just to use boto3 on py3, which will prob be done in 0.18.1.

@jreback
Contributor

jreback commented Mar 9, 2016

note that v0.18.0.rc2, installed via conda install pandas=v0.18.0rc2 -c pandas, should allow pandas to import correctly on 3.5. If you have a setup where it does not, pls lmk asap. (this fix was put in 0.18.0rc2 and is not in 0.18.0rc1)

@mrocklin mrocklin mentioned this pull request Mar 9, 2016
@martindurant martindurant changed the title from "WIP: S3fs" to "S3 with writing" Mar 10, 2016
@mrocklin
Member

Closing. This has been partially merged and is otherwise continuing at https://github.com/dask/s3fs

@mrocklin mrocklin closed this Mar 16, 2016