Describe the bug
When uploading a single very large file (for example, 50GB), the java process uses up all its memory and then crashes.
We are running a java client within a docker container in EKS, uploading a file from nfs-attached EFS storage to an s3 bucket. After about 5% of the transfer, we see memory usage jump to about 1GB; after 10% it jumps to 2GB, and so on, until it reaches the limit configured for the pod in Kubernetes, at which point it locks up and dies (I'm not sure whether the pod itself is dying or being killed by Kubernetes due to failed health checks - nothing is logged).
I've gone as high as setting an 8GB memory limit, but this wasn't enough to allow the upload to complete.
I've tried various values for targetThroughput (e.g. 0.1 or 20) and maxConcurrency (e.g. null, 2, or 50) without seeing any difference in the memory usage behaviour.
Expected Behavior
Expect the transfer to be able to complete using a sensible amount of memory.
Current Behavior
Java process stops logging, then either dies or is killed due to running out of memory.
In dmesg, we see something like:
Memory cgroup out of memory: Killed process 28619 (java) total-vm:7513612kB, anon-rss:2080236kB, file-rss:22760kB, shmem-rss:0kB, UID:0 pgtables:4764kB oom_score_adj:992
Reproduction Steps
Sorry, our code is too tightly integrated with other stuff to paste. But in essence, we create an S3TransferManager like this:
S3CrtAsyncClientBuilder s3AsyncClientBuilder = S3AsyncClient.crtBuilder().region(Region.EU_WEST_1)
        .targetThroughputInGbps(10.0).minimumPartSizeInBytes(5242880).maxConcurrency(10);
S3TransferManager transferManager = S3TransferManager.builder()
        .s3Client(s3AsyncClientBuilder.build()).build();
And then doing an uploadObject, roughly as sketched below.
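For reference, a minimal sketch of what the upload call looks like (not our actual code - the bucket name, key, and file path are placeholders), using S3TransferManager.uploadFile and blocking on the returned future:

import java.nio.file.Paths;
import software.amazon.awssdk.transfer.s3.model.CompletedFileUpload;
import software.amazon.awssdk.transfer.s3.model.UploadFileRequest;

UploadFileRequest uploadFileRequest = UploadFileRequest.builder()
        // placeholder bucket/key/path - the real values come from our own config
        .putObjectRequest(req -> req.bucket("example-bucket").key("large-file.bin"))
        .source(Paths.get("/mnt/efs/large-file.bin"))
        .build();
// completionFuture().join() blocks until the multipart upload finishes (or throws on failure)
CompletedFileUpload result = transferManager.uploadFile(uploadFileRequest).completionFuture().join();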
It may be key that we are running this in a docker container in an EKS cluster.
Possible Solution
This is pure speculation, but I wonder if something somewhere (in the CRT?) is trying to detect how much memory is available, and isn't doing so in a docker-friendly way - it may be reading the available memory of the host rather than the memory available to the docker container it's running in. A rough way to check what the JVM side sees is sketched below.
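As an illustrative diagnostic (not from our codebase, and it only shows what the JVM reports - the CRT's native memory detection may take a different path entirely), something like this run inside the pod would show whether the reported total memory matches the cgroup limit or the host:

import java.lang.management.ManagementFactory;

public class MemoryCheck {
    public static void main(String[] args) {
        // Heap ceiling the JVM has picked (container-aware by default on JDK 17)
        System.out.println("Runtime.maxMemory(): " + Runtime.getRuntime().maxMemory());
        // Total memory as seen by the OS MXBean; if container detection were bypassed,
        // this could report the host's memory instead of the container's limit
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        System.out.println("OS total memory: " + os.getTotalMemorySize());
    }
}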
Additional Information/Context
We are using this base docker image to run our java service:
eclipse-temurin:17-jre
AWS Java SDK version used
2.20.97 (and CRT version 0.22.2)
JDK version used
openjdk version "17.0.7" 2023-04-18
OpenJDK Runtime Environment Temurin-17.0.7+7 (build 17.0.7+7)
OpenJDK 64-Bit Server VM Temurin-17.0.7+7 (build 17.0.7+7, mixed mode, sharing)
Operating System and version
eclipse-temurin is based on Ubuntu