Skip to content

Parallel downloads with TransferManager easily starve the executor's thread pool (1.11.44) #896

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ldcasillas-progreso opened this issue Oct 19, 2016 · 5 comments
Assignees
Labels
bug This issue is a bug.

Comments

@ldcasillas-progreso
Copy link

With the parallel downloads feature in 1.11.x, it's very easy to deadlock the TransferManager's executor's thread pool by simply submitting a lot of download requests. All of the executor's threads end up in this sort of state:

"s3-transfer-manager-worker-10" #27 prio=5 os_prio=31 tid=0x00007fa0b3bdd000 nid=0x7903 waiting on condition [0x0000000125530000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000006c07a08f0> (a java.util.concurrent.FutureTask)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
    at java.util.concurrent.FutureTask.get(FutureTask.java:191)
    at com.amazonaws.services.s3.transfer.DownloadCallable.combineFiles(DownloadCallable.java:203)
    at com.amazonaws.services.s3.transfer.DownloadCallable.downloadInParallel(DownloadCallable.java:193)
    at com.amazonaws.services.s3.transfer.DownloadCallable.call(DownloadCallable.java:131)
    at com.amazonaws.services.s3.transfer.DownloadCallable.call(DownloadCallable.java:51)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:744)

   Locked ownable synchronizers:
    - <0x00000006c06ae580> (a java.util.concurrent.ThreadPoolExecutor$Worker)

...and none of the executor's threads is actually downloading anything. Presumably the combineFiles task should be a non-blocking future of some sort.

@varunnvs92
Copy link
Contributor

varunnvs92 commented Oct 19, 2016

Are you using default thread pool provided by SDK or custom? If custom thread pool, what is the number of threads in it?

Are you seeing this behavior frequently? Can you explain you download scenario like what is size of the file to download, downloading single/multiple files? I am trying to figure out the case where dead lock can happen.

@ldcasillas-progreso
Copy link
Author

ldcasillas-progreso commented Oct 19, 2016

I tried it both with a custom thread pool (8 threads) and the default one that the SDK provides.

The code I have is trying to download 69 files, totaling to about 16 GiB, with the same TransferManager. My code works just fine under version 1.10.16, but when I try upgrading to 1.11.44 (specifically for the parallel downloads feature!) the program just hangs after downloading a few files.

My code, as it happens, has some metrics around the transfers. First I get these warnings, which I don't know if they're relevant:

18:14:52.551 [Coordinator 2] INFO  c.p.d.automation.s3.BulkS3Download - Bulk S3 download from s3://my-bucket/common/prefix/for/my/objects to /home/myusername/path/to/download/directory (regexp = ^.*\.csv(/part-\d+)$, metrics-uuid=966c0cf1-dfe7-48ac-adac-4f82ee82e5e8)
Oct 18, 2016 6:15:00 PM com.amazonaws.services.s3.transfer.DownloadCallable combineFiles
WARNING: The file /home/myusername/path/to/download/directory/subdir01/file01.csv/part-000025315321510609134081.part.1 could not be deleted.
Oct 18, 2016 6:15:01 PM com.amazonaws.services.s3.transfer.DownloadCallable combineFiles
WARNING: The file /home/myusername/path/to/download/directory/subdir01/file01.csv/part-000003552744014039759453.part.1 could not be deleted.
Oct 18, 2016 6:15:01 PM com.amazonaws.services.s3.transfer.DownloadCallable combineFiles
WARNING: The file /home/myusername/path/to/download/directory/subdir01/file01.csv/part-00000285658510910860522.part.2 could not be deleted.
18:15:03.216 [Coordinator 2] INFO  c.p.d.automation.s3.BulkS3Download - Downloading 69 files from s3://my-bucket/common/prefix/for/my/objects
Oct 18, 2016 6:15:03 PM com.amazonaws.services.s3.transfer.DownloadCallable combineFiles
WARNING: The file /home/myusername/path/to/download/directory/subdir01/file01.csv/part-000013847210060239851018.part.1 could not be deleted.
Oct 18, 2016 6:15:04 PM com.amazonaws.services.s3.transfer.DownloadCallable combineFiles
WARNING: The file /home/myusername/path/to/download/directory/subdir01/file01.csv/part-000012084396428992029528.part.2 could not be deleted.
Oct 18, 2016 6:15:05 PM com.amazonaws.services.s3.transfer.DownloadCallable combineFiles
WARNING: The file /home/myusername/path/to/download/directory/subdir01/file01.csv/part-000027710304967505190617.part.2 could not be deleted.
Oct 18, 2016 6:15:06 PM com.amazonaws.services.s3.transfer.DownloadCallable combineFiles
WARNING: The file /home/myusername/path/to/download/directory/subdir01/file02.csv/part-000004575248648125704409.part.1 could not be deleted.
Oct 18, 2016 6:15:06 PM com.amazonaws.services.s3.transfer.DownloadCallable combineFiles
WARNING: The file /home/myusername/path/to/download/directory/subdir01/file02.csv/part-000007141841352462430105.part.2 could not be deleted.

Then the download proper starts. It downloads for a few seconds (as seen from the OS X system monitor's network graph), but it hangs very soon thereafer. So my metrics reporter just repeatedly logs this out, showing the TransferManager stuck after 8 transfers have started (same as the number of threads in the custom executor) and only 4 have completed. (I compute these metrics by attaching ProgressListeners to the SDK's Transfer objects, and observing the ProgressEvents and ProgressEventTypes.)

18:15:37.569 [metrics-logger-reporter-2-thread-1] INFO  c.p.d.a.metrics.S3TransferMetrics - type=GAUGE, name=S3TransferMetrics.4b4df776-f11c-443e-82f3-fca12d913a47.fraction-bytes-transferred, value=0.05701729361113557
18:15:37.571 [metrics-logger-reporter-2-thread-1] INFO  c.p.d.a.metrics.S3TransferMetrics - type=GAUGE, name=S3TransferMetrics.4b4df776-f11c-443e-82f3-fca12d913a47.progress, value=5.70%
18:15:37.571 [metrics-logger-reporter-2-thread-1] INFO  c.p.d.a.metrics.S3TransferMetrics - type=COUNTER, name=S3TransferMetrics.4b4df776-f11c-443e-82f3-fca12d913a47.total-bytes-to-transfer, count=16580360328
18:15:37.571 [metrics-logger-reporter-2-thread-1] INFO  c.p.d.a.metrics.S3TransferMetrics - type=COUNTER, name=S3TransferMetrics.4b4df776-f11c-443e-82f3-fca12d913a47.transfers-completed, count=4
18:15:37.571 [metrics-logger-reporter-2-thread-1] INFO  c.p.d.a.metrics.S3TransferMetrics - type=COUNTER, name=S3TransferMetrics.4b4df776-f11c-443e-82f3-fca12d913a47.transfers-failed, count=0
18:15:37.571 [metrics-logger-reporter-2-thread-1] INFO  c.p.d.a.metrics.S3TransferMetrics - type=COUNTER, name=S3TransferMetrics.4b4df776-f11c-443e-82f3-fca12d913a47.transfers-started, count=8
18:15:37.571 [metrics-logger-reporter-2-thread-1] INFO  c.p.d.a.metrics.S3TransferMetrics - type=COUNTER, name=S3TransferMetrics.4b4df776-f11c-443e-82f3-fca12d913a47.transfers-total, count=69
18:15:37.571 [metrics-logger-reporter-2-thread-1] INFO  c.p.d.a.metrics.S3TransferMetrics - type=METER, name=S3TransferMetrics.4b4df776-f11c-443e-82f3-fca12d913a47.bytes-transferred, count=945367273, mean_rate=2.1003752701155223E7, m1=3.752772926092549E7, m5=5.581370025628049E7, m15=5.9619597766520604E7, rate_unit=events/second

To me it looks like the following is happening, if I understand things right:

  1. Each time I initiate a download with the TransferManager, it submits a DownloadCallable task to its executor.
  2. This task will itself submit DownloadPartCallable tasks to the same executor.
  3. The DownloadCallable task then blocks waiting for its submitted DownloadPartCallables to complete.
  4. When the parts download tasks complete, the DownloadCallable combines the files.

So with 69 downloads and an executor with only 8 (or the default 10) threads, it's just very easy for the program to end up in a state where all of the executor's 8 threads are running DownloadCallables who've all submitted their DownloadPartCallables and are in a blocking wait for them to complete, but there are no more threads available to actually run the latter.

Solutions I might suggest (which may well be all wrong):

  • Submit the DownloadCallables and DownloadPartCallables to two different executors, so that the former never starve the threads available for the latter. Downside: if the separate executor for the DownloadCallable tasks has a fixed number of threads, only that many downloads can be active at the same time. If it's an unlimited number of threads, then my code would end up launching 69 threads.
  • Rework the DownloadCallable so that it doesn't block waiting for the parts to download before it combines them. Rather, the task to combine the files is triggered asynchronously upon completion of all of the part download Futures. This requires some mechanism like Guava's ListenableFuture, Java 8's CompletionStage or the AWS Java SDK's async callbacks feature. This is perhaps more complex than the two-executors solution, but it's "righter" in some sense (that may or may not matter).

@varunnvs92 varunnvs92 added the bug This issue is a bug. label Oct 19, 2016
@varunnvs92
Copy link
Contributor

@ldcasillas-progreso
Thanks for the explanation and finding this bug. I am not yet decided on the potential fix. I will discuss with the team and will get back to you.

In the meantime, you can disable parallel downloads by setting this option to true.

@varunnvs92 varunnvs92 self-assigned this Oct 19, 2016
@varunnvs92
Copy link
Contributor

This bug is fixed in v1.11.47. See release notes here.

Thanks for reporting this issue.

@ldcasillas-progreso
Copy link
Author

Thanks. I'll test it out tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug.
Projects
None yet
Development

No branches or pull requests

2 participants