Executor factory on transfer manager vs max connection on s3 client? #1696

Closed
abbccdda opened this issue Jul 26, 2018 · 5 comments
Labels
guidance Question that needs advice or information.

Comments

abbccdda commented Jul 26, 2018

Hey there,

I'm trying to understand two similar configs when tuning S3 download performance. My current code looks like this:

// HTTP connection pool size and socket timeout (ms) for the underlying client
ClientConfiguration conf = new ClientConfiguration()
    .withMaxConnections(THREAD_POOL_SIZE)
    .withSocketTimeout(3600_000);

AmazonS3 clientS3 = AmazonS3ClientBuilder
    .standard()
    .withClientConfiguration(conf)
    .build();

// Transfer manager with its own fixed-size executor and multipart copy settings
TransferManager transferManager = TransferManagerBuilder
    .standard()
    .withS3Client(clientS3)
    .withMultipartCopyPartSize(500_000_000L)
    .withExecutorFactory(() -> Executors.newFixedThreadPool(THREAD_POOL_SIZE))
    .withMultipartCopyThreshold(1_000_000_000L)
    .build();

My question: what is the relation between the transfer manager's executor factory and the S3 client's max connections? If I set them to different values, will each thread open more TCP connections overall?

Thanks for your time and let me know if my question makes sense.

millems commented Jul 27, 2018

The max connections is how many TCP/HTTP connections should be allowed by the client. The executor thread pool size is the number of threads used to perform file uploads/downloads/etc.

Because transfer manager currently uses one thread per upload/download, it usually makes sense for the number of threads and connections to be the same, assuming you're not using the S3 client for anything else.

If you're not using the S3 client instance directly or in another transfer manager and the number of threads is higher than the number of connections, a few may end up waiting for a connection to free up. If the number of threads is lower than the number of connections, you'll never utilize every connection available to the client.

The threads may sometimes be used for some extra work that doesn't use a connection (e.g. combining the parts of a downloaded file), so if you look really closely you may notice a thread running without using a connection.

TL;DR: If you're not using the client for anything else, you can set the max connections = the thread pool size to be sure you can have optimum throughput/latency. If you're using the client for something else, you'll need to do some more thinking about the different use-cases and how contention over the shared pool of connections could affect your throughput/latency.
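
For illustration, a minimal sketch of that advice, assuming the client is dedicated to the transfer manager (POOL_SIZE and the class name are placeholders, not from this thread):

import java.util.concurrent.Executors;

import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

public class AlignedPoolSketch {
    // Placeholder; pick the level of parallelism you actually need.
    private static final int POOL_SIZE = 20;

    public static TransferManager build() {
        // Connection pool sized to match the executor, since this client
        // is used only by the transfer manager below.
        ClientConfiguration conf = new ClientConfiguration()
                .withMaxConnections(POOL_SIZE);

        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withClientConfiguration(conf)
                .build();

        // One thread per upload/download, so threads == connections.
        return TransferManagerBuilder.standard()
                .withS3Client(s3)
                .withExecutorFactory(() -> Executors.newFixedThreadPool(POOL_SIZE))
                .build();
    }
}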

abbccdda (Author) commented

Thanks Matthew for the explanation! As you mentioned, combining the parts of a downloaded file is something the transfer manager takes care of. How much workload could this incur? Do I need to set my thread count slightly higher than the number of connections?

millems commented Jul 30, 2018

I wouldn't worry about it. The risk of blocking while waiting for a connection is worse for throughput than the risk of not always utilizing every available connection.

dagnir commented Jul 31, 2018

Hi @abbccdda, it looks like @millems has answered the original and follow-up questions, so I'll go ahead and close this. Please feel free to reopen if you have any follow-up questions.

dagnir closed this as completed Jul 31, 2018
srchase added the guidance label and removed the Question label Jan 4, 2019
joshrosen-stripe commented

Hi @millems and @dagnir,

I'm a bit confused about the advice to use a bounded thread pool here, because it seems to contradict certain warnings in the docs. Upthread, @millems wrote:

TL;DR: If you're not using the client for anything else, you can set the max connections = the thread pool size to be sure you can have optimum throughput/latency.

However, in the docs for a deprecated TransferManager constructor:

executorService - The ExecutorService to use for the TransferManager. It is not recommended to use a single threaded executor or a thread pool with a bounded work queue as control tasks may submit subtasks that can't complete until all sub tasks complete. Using an incorrectly configured thread pool may cause a deadlock (I.E. the work queue is filled with control tasks that can't finish until subtasks complete but subtasks can't execute because the queue is filled).

Yet this warning is missing from the newer TransferManagerBuilder.withExecutorFactory Javadoc.

#939 has additional discussion of the deadlock issue and implies that we "Have to choose between deadlocks, unbounded executors, or internal APIs".

There's additional discussion of this issue on HADOOP-13826.

Given this, I wanted to ask here to clarify @millems's remarks above and figure out whether there's a more precise set of circumstances where it is guaranteed to be safe to use a bounded thread pool.
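
For reference, the distinction the quoted Javadoc draws is between a bounded number of threads and a bounded work queue. A sketch of the two configurations (names are illustrative, not from this thread):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueBoundsSketch {
    // Bounded thread count, unbounded work queue: newFixedThreadPool is
    // backed by an unbounded LinkedBlockingQueue, so control tasks can
    // always enqueue their subtasks even while every thread is busy.
    static ExecutorService fixedThreadsUnboundedQueue(int threads) {
        return Executors.newFixedThreadPool(threads);
    }

    // Bounded thread count and bounded work queue: the kind of setup the
    // deprecated-constructor Javadoc warns about. Once the queue is full,
    // subtasks submitted by queued control tasks are rejected or stuck,
    // and the control tasks cannot finish without them.
    static ExecutorService fixedThreadsBoundedQueue(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
    }
}

Under that reading, the executor used in the snippets above (Executors.newFixedThreadPool) bounds only the thread count, not the work queue, which may be why the builder Javadoc omits the warning; whether a bounded thread count alone is guaranteed safe is exactly the open question here.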
