Skip to content

Conversation

steveloughran
Copy link
Contributor

@steveloughran steveloughran commented Jan 23, 2022

See HADOOP-18091. S3A auditing leaks memory through ThreadLocal references

  • Adds a new option fs.s3a.audit.enabled to controls whether or not auditing
    is enabled. This is false by default.

  • When false, the S3A auditing manager is NoopAuditManagerS3A,
    which was formerly only used for unit tests and
    during filsystem initialization.

  • When true, ActiveAuditManagerS3A is used for managing auditing,
    allowing auditing events to be reported.

  • updates documentation and tests.

This patch does not fix the underlying leak. When auditing is enabled,
long-lived threads will retain references to the audit managers
of S3A filesystem instances which have already been closed.

Contributed by Steve Loughran.

Description of PR

How was this patch tested?

For code changes:

  • Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

This patch does not fix the problem, but it allows for auditing
to be disabled such that no ThreadLocal fields are created.
By disabling auditing by default, memory leaks will not surface.

Auditing to S3 Logs can be enabled if required; short-lived
applications and/or applications which use the same limited set of
S3A instances should be safe to use with auditing.

Change-Id: I9090d65c896a12feb6826e581935078ffb737bab
@steveloughran
Copy link
Contributor Author

Turning off logging by default is the fastest way to address this issue, while I do something better involving a thread id map with weak references to spans. and with a test to verify memory problems are avoided.

running the full test suite with the default settings (i.e. auditing is disabled)

@steveloughran
Copy link
Contributor Author

two ARN failures from SDK update (noted on https://issues.apache.org/jira/browse/HADOOP-18085 ); other test failures are due to audit events not being collected. will address those by enabling auditing for those tests

[ERROR] Failures: 
[ERROR]   ITestS3AContractUnbuffer>AbstractContractUnbufferTest.testUnbufferMultipleReads:116->AbstractContractUnbufferTest.validateFileContents:139->Assert.assertEquals:647->Assert.failNotEquals:835->Assert.fail:89 failed to read expected number of bytes from stream. This may be transient expected:<512> but was:<391>
[ERROR]   ITestS3AMiscOperationCost.testGetContentMissingPath:129->AbstractS3ACostTest.verifyMetricsIntercepting:317->Assert.assertEquals:647->Assert.failNotEquals:835->Assert.fail:89 operation returning java.io.FileNotFoundException: No such file or directory: s3a://stevel-london/fork-0005/test/testGetContentMissingPath[delete-markers]: audit_span_creation expected:<1> but was:<0>
[ERROR]   ITestS3AMiscOperationCost.testGetContentMissingPath:129->AbstractS3ACostTest.verifyMetricsIntercepting:317->Assert.assertEquals:647->Assert.failNotEquals:835->Assert.fail:89 operation returning java.io.FileNotFoundException: No such file or directory: s3a://stevel-london/fork-0005/test/testGetContentMissingPath[keep-markers]: audit_span_creation expected:<1> but was:<0>
[ERROR]   ITestS3AMiscOperationCost.testGetContentSummaryDir:110->AbstractS3ACostTest.verifyMetrics:294->Assert.assertEquals:647->Assert.failNotEquals:835->Assert.fail:89 operation returning         none             inf            none             inf            2            1                  0 : audit_span_creation expected:<1> but was:<0>
[ERROR]   ITestS3AMiscOperationCost.testGetContentSummaryDir:110->AbstractS3ACostTest.verifyMetrics:294->Assert.assertEquals:647->Assert.failNotEquals:835->Assert.fail:89 operation returning         none             inf            none             inf            2            1                  0 : audit_span_creation expected:<1> but was:<0>
[ERROR]   ITestS3AMiscOperationCost.testMkdirOverDir:83->AbstractS3ACostTest.verifyMetrics:294->Assert.assertEquals:647->Assert.failNotEquals:835->Assert.fail:89 operation returning true: audit_span_creation expected:<1> but was:<0>
[ERROR]   ITestS3AMiscOperationCost.testMkdirOverDir:83->AbstractS3ACostTest.verifyMetrics:294->Assert.assertEquals:647->Assert.failNotEquals:835->Assert.fail:89 operation returning true: audit_span_creation expected:<1> but was:<0>
[ERROR] Errors: 
[ERROR]   ITestS3ABucketExistence.testAccessPointProbingV2:171->expectUnknownStore:103->lambda$testAccessPointProbingV2$12:172 » IllegalArgument
[ERROR]   ITestS3ABucketExistence.testAccessPointRequired:188->expectUnknownStore:103->lambda$testAccessPointRequired$14:189 » IllegalArgument
[INFO] 

Change-Id: I93e2211d4070f4222004a58b46f841360f74e554
@steveloughran
Copy link
Contributor Author

btw, that was against s3 london with -Dparallel-tests -DtestsThreadCount=6 -Dmarkers=keep -Dscale and only took 17 minutes to run!

@steveloughran
Copy link
Contributor Author

next run, same settings. only the arn tests failed

[ERROR] Errors: 
[ERROR]   ITestS3ABucketExistence.testAccessPointProbingV2:171->expectUnknownStore:103->lambda$testAccessPointProbingV2$12:172 » IllegalArgument
[ERROR]   ITestS3ABucketExistence.testAccessPointRequired:188->expectUnknownStore:103->lambda$testAccessPointRequired$14:189 » IllegalArgument
[INFO] 
[ERROR] Tests run: 1063, Failures: 0, Errors: 2, Skipped: 54

@aajisaka
Copy link
Member

aajisaka commented Jan 24, 2022

Would you remove the Change-Id from the commit message? (the removal can be done when squash-and-merge the commits)

Copy link
Contributor

@mehakmeet mehakmeet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, seems a good interim fix/ability to disable the auditing. But, shouldn't the title for this PR be to add a config to disable auditing rather than Memory leakage since that would be fixed in the next PR?
Building the tests, seeing some config issue I'll resolve in some time(Seems like a new aws-java-sdk was added, for some reason it's not resolving for me).

It provides two forms of logging

1. Logging of operations in the client via Log4J.
1. Logging of operations in the client via the active SLF4J imolementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo: "implementation"



### Disabling Auditing with the No-op Auditor
### Disabling Auditing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a mention that servicename=NoopAuditor also disables auditing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't though. it stil leaks memory as the service manager is still instantiated.

which is why i've cut it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, you're right. Just realized, ActiveAuditManager with NoopAuditor would still leak, but NoopAuditManager is the one without the ThreadLocal variable now. Thanks for clearing it 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding to the architecture as an FYI point

…ture doc

Add an explanation there of why the memory is leaked.

Change-Id: I65f853fb0b588a0b27e11b6b1291954cf52fb362
@mukund-thakur mukund-thakur self-requested a review January 24, 2022 11:10
Copy link
Contributor

@mukund-thakur mukund-thakur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good changes wise. Just one clarification.

@hadoop-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 38s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 4 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 32m 11s trunk passed
+1 💚 compile 0m 44s trunk passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 compile 0m 39s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 0m 29s trunk passed
+1 💚 mvnsite 0m 45s trunk passed
+1 💚 javadoc 0m 27s trunk passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 0m 35s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 1m 6s trunk passed
+1 💚 shadedclient 20m 10s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 35s the patch passed
+1 💚 compile 0m 36s the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javac 0m 37s the patch passed
+1 💚 compile 0m 31s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 0m 31s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 18s the patch passed
+1 💚 mvnsite 0m 34s the patch passed
+1 💚 javadoc 0m 17s the patch passed with JDK Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 0m 25s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 1m 7s the patch passed
+1 💚 shadedclient 19m 51s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 2m 8s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 35s The patch does not generate ASF License warnings.
85m 33s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3916/3/artifact/out/Dockerfile
GITHUB PR #3916
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell markdownlint
uname Linux 660e82c8ce06 4.15.0-161-generic #169-Ubuntu SMP Fri Oct 15 13:41:54 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 99a7102
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.13+8-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3916/3/testReport/
Max. process+thread count 544 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3916/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

Copy link
Contributor

@mukund-thakur mukund-thakur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1

@steveloughran
Copy link
Contributor Author

thx.

I'm going to creata a new JIRA for this "add option to disable s3a auditing" and commit this under that, wiith the original one open.

@apache apache deleted a comment from hadoop-yetus Jan 24, 2022
@apache apache deleted a comment from hadoop-yetus Jan 24, 2022
@steveloughran steveloughran merged commit b795f6f into apache:trunk Jan 24, 2022
@mukund-thakur mukund-thakur changed the title HADOOP-18091. S3A Auditor Memory Leakage HADOOP-18094. Disable S3A auditing by default. Jan 24, 2022
@steveloughran steveloughran deleted the s3/HADOOP-18091-audit-leak branch January 24, 2022 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants