MAPREDUCE-7474. Improve Manifest committer resilience #6716
Conversation
testing: azure cardiff
Branch force-pushed from 4dc5a60 to c4faf5d.
Reviews invited from @mukund-thakur @anmolanmol1234 @anujmodi2021 @HarshitGupta11
* retries of save()
* split delete into deleteFile and rmdir
* needs tests

Change-Id: Idb6cf0e85c62c973881fdc96a3ded97b1cfc43ff
* Retries of save(): 5 attempts, with 500 millis sleep between them. No configuration. Issue: should we make this configurable?
* Split delete(path, recursive) into deleteFile and rmdir for separate statistics.

Test simulation expands to:
* Support recovery through a countdown of calls to fail.
* Simulate timeout before *and after* rename calls.

Change-Id: I3f86c5a238515955e9b82ed37727d40d2d8d3f96
Change-Id: I039ec6e4dc12f68690ffa977ebb81056ab0d1711
New option mapreduce.manifest.committer.cleanup.parallel.delete.base.first.

This makes an initial attempt at deleting the base dir, only falling back to parallel deletes if there's a timeout. Best for abfs; for gcs it works but is suboptimal. Enabled by default.

Also: changed the default abfs IO rate from 10_000 to 1_000.

Docs and tests updated.

Change-Id: Idd10aecc3cb6747a6367573ef9547675641afe8c
Change-Id: I24778ab4d817a77afbbf1d5b132be270698382a4
Change-Id: Ic67f41449d1e46d9fb81c47012bb41d5fade84a9
Branch force-pushed from a3117cf to 16e1be4.
🎊 +1 overall
The number of attempts to commit a manifest is now configurable with the option:
mapreduce.manifest.committer.manifest.save.attempts

* The default is still 5.
* The minimum is 1; asking for less is ignored.
* A retry policy adds 500ms of sleep per attempt (using the retryUpToMaximumCountWithProportionalSleep policy).

Documented. Making it configurable avoids having to guess what the ideal value should be; instead the default is something which should cope with brief transient failures.

Change-Id: I276aaf39bff73544a633126425cc7ec1e9848ec1
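For illustration, a minimal sketch of how such a policy could be built from the new option, using the Hadoop RetryPolicies helper named above; the class and constant choices are illustrative, not the committer's actual code.

```java
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

/** Sketch only: build the manifest-save retry policy from the new option. */
public class SaveRetryPolicySketch {

  public static RetryPolicy buildSavePolicy(Configuration conf) {
    // minimum of 1 attempt: asking for less is ignored
    int attempts = Math.max(1,
        conf.getInt("mapreduce.manifest.committer.manifest.save.attempts", 5));
    // proportional sleep: 500ms of extra sleep per attempt
    return RetryPolicies.retryUpToMaximumCountWithProportionalSleep(
        attempts, 500, TimeUnit.MILLISECONDS);
  }
}
```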
return trackDuration(getIOStatistics(), statistic, () -> {
  return operations.delete(path, recursive);
});
if (recursive) {
Unable to understand this. How does a recursive flag determine whether it is a dir or a file?
OK, deleteDir will also delete a file; let me highlight that.
I'd done this delete dir/file split to support different capacity requests; without that it is a bit over-complex. It does let us collect different statistics though, which may be useful.
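For illustration, a minimal sketch of what that split might look like, reusing the trackDuration()/operations helpers visible in the hunk above; the method and statistic names here are my own, not necessarily those in the patch.

```java
/** Sketch only: delete a single file, tracked under its own statistic. */
protected boolean deleteFile(Path path, String statistic) throws IOException {
  return trackDuration(getIOStatistics(), statistic, () ->
      operations.delete(path, false));
}

/**
 * Sketch only: recursive directory delete, tracked separately.
 * Like FileSystem.delete(), this will also remove a plain file at the path.
 */
protected boolean deleteRecursiveDir(Path path, String statistic) throws IOException {
  return trackDuration(getIOStatistics(), statistic, () ->
      operations.delete(path, true));
}
```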
/**
 * Default value of option {@link #OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST}: {@value}.
 */
public static final boolean OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST_DEFAULT = true;
As it is bad for GCS, shouldn't the default be false?
really don't know here. In the docs I try to cover this
🎊 +1 overall
@@ -143,6 +145,20 @@ public final class ManifestCommitterConstants {
 */
public static final boolean OPT_CLEANUP_PARALLEL_DELETE_DIRS_DEFAULT = true;

/**
 * Should parallel cleanup try to delete teh base first?
typo: the
    getName(), tempPath, finalPath, retryCount);

trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
    operations.save(manifestData, tempPath, true));
Also, if the rename failed on the first attempt but succeeded in the backend, will the save operation on tempPath fail with an error, and if yes, how do we recover from that?
So renameFile() has always deleted the destination, because we need to do that to cope with failures of a previous/concurrent task attempt. Whoever commits last wins.
To make this clearer I'm pulling more of the code up into this method and adding comments.
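As a rough sketch of the single-attempt sequence being discussed, stitched together from the hunks quoted in this review (not the exact method in the patch):

```java
/** Sketch only: one attempt at saving and committing the task manifest. */
private void saveManifestOnce(TaskManifest manifestData,
    Path tempPath, Path finalPath) throws IOException {
  // write the manifest to the temp path, overwriting any previous attempt
  trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
      operations.save(manifestData, tempPath, true));
  // delete whatever is at the final path: whoever commits last wins
  delete(finalPath, true, OP_DELETE);
  // rename temp to final; renameFile() raises PathIOException if rename() returns false
  renameFile(tempPath, finalPath);
}
```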
try (DurationInfo info = new DurationInfo(LOG, true,
    "Initial delete of %s", baseDir)) {
  exception = deleteOneDir(baseDir);
  if (exception == null) {
As added by you in this logic, when the directory tree is very large and the account uses OAuth authentication, Azure cloud could fail the baseDir delete due to exhaustive ACL permission checks. But this delete will enter the retry loop, as it was a request timeout, and in this scenario all the retries might also fail; it can take a while to report failure, with backoff and retry attempts as per AZURE_MAX_IO_RETRIES (default value 30).
The default max retry count is 30 today just to ensure any 5-10 min network/service transient failures do not lead to failures of long-running workloads.
If this logic of attempting the basedir delete before falling back to parallel deletes is optimal only for Azure cloud, we could look for ways to fail fast for recursive delete.
Would this work: add a new config MAX_RETRIES_RECURSIVE_DELETE, which by default will be the same as AZURE_MAX_IO_RETRIES in the ABFS driver. AzureManifestCommitterFactory could probably set this config to 0 before the FileSystem.get() call happens.
If this sounds OK, we can look into the changes needed in AbfsClient, AbfsRestOperation and ExponentialRetry to make the MAX_RETRIES_RECURSIVE_DELETE config effective.
Ooh, so it's going to be quite a long time to fall back.
I'm going to make the option default to false for now.

> AzureManifestCommitterFactory could probably set this config to 0 before FileSystem.get() call happens.

It'll come from the cache, so we don't want to set it for everything else; but a low MAX_RETRIES_RECURSIVE_DELETE might make sense everywhere. Something to consider later.
Simulating more failure conditions. Still more to explore there, in particular "what if delete of rename target fails?"

Change-Id: Idb84f9c17a195702e6a2345b095f41e72865dd5b
🎊 +1 overall
@snvijaya we actually know the total number of subdirs for the deletion! It is propagated via the manifests: each TA manifest includes the number of dirs as an IOStatistic, and the aggregate summary adds these all up. The number of paths under the job dir is that number (counter committer_task_directory_count) plus any from failed task attempts.
This means we could actually have a threshold of how many subdirectories will trigger an automatic switch to parallel delete. I'm just going to pass this down and log it immediately before the cleanup kicks off, so if there are problems we will get the diagnostics adjacent to the error.
Note that your details on retry timings imply that on a MapReduce job (rather than a Spark one) the progress() callback will not take place, so there's a risk that the job will actually time out. I don't think that's an issue in MR job actions the way it is in task-side actions, where a heartbeat back to the MapReduce AM is required.
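A hedged sketch of that threshold idea, reading the counter from an IOStatistics instance; the method, parameter and threshold names are illustrative, not fields of the cleanup stage.

```java
/** Sketch only: decide whether to skip the initial serial delete. */
static boolean parallelFirst(IOStatistics iostats, long dirCountThreshold) {
  // counter name taken from the comment above; the threshold is illustrative
  long dirCount = iostats.counters()
      .getOrDefault("committer_task_directory_count", 0L);
  return dirCount > dirCountThreshold;
}
```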
Statistics Collection and Printing
----------------------------------

* New statistic task_stage_save_summary_file to distinguish from other saving operations (job success/report file).
* After a failure to save a task attempt, the iostats of the manifest are rebuilt so the stats on failures are updated. This will get into the final job _SUCCESS statistics so we can see if anything happened.
* Make the manifest print command something which can be invoked from the command line: mapred successfile

This is covered in the docs.

The failure stats regeneration is nice; it works by passing down a lambda-expression of the logic to (re)generate the manifest, and invoking this on every attempt. As this is where the stats are aggregated, it includes details on the previous failing attempts.

Directory size for deletion
---------------------------

* Optionally pass down the directory count under the job dir to the cleanup stage.
* This is determined in job commit from aggregate statistics; unknown elsewhere (app abort etc.).
* It is currently only logged; it may be possible to support an option of when to skip the initial serial delete, though it will depend on the abfs login mechanism.

Testing
-------

* More fault injection scenarios.
* Ability to assert that iostats do not contain specific non-zero stats. This is used in ITestAbfsTerasort to assert no task save or rename failures. The stats before this change imply this did happen in a job commit; no other details, hence the new probe.
* Log manifest committer at debug in mapred-core.

Note: if there's a retry process which means the operation can take minutes, the initial operation will block progress() callbacks so MapReduce jobs will fail. Spark is unaffected.

Change-Id: Id423267de89c7f31e4b1283f9c433b729ff0d87b
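A minimal sketch of the lambda-driven regeneration described above, assuming a hypothetical supplier interface and the saveManifestOnce() helper sketched earlier; the real patch uses the committer's own types and statistics plumbing.

```java
/** Sketch only: a supplier that (re)builds the manifest with fresh IOStatistics. */
@FunctionalInterface
interface ManifestSupplier {
  TaskManifest create() throws IOException;
}

/** Sketch only: regenerate the manifest before every save attempt. */
void saveManifestWithRetries(ManifestSupplier supplier, int attempts,
    Path tempPath, Path finalPath) throws IOException {
  int tries = Math.max(1, attempts);   // minimum of one attempt
  IOException last = null;
  for (int attempt = 1; attempt <= tries; attempt++) {
    // rebuilding here means earlier failures show up in the manifest's stats,
    // and so eventually in the job's _SUCCESS statistics
    TaskManifest manifest = supplier.create();
    try {
      saveManifestOnce(manifest, tempPath, finalPath);
      return;
    } catch (IOException e) {
      last = e;
    }
  }
  throw last;
}
```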
💔 -1 overall
Change-Id: I3048e959efdc1fc7707061e137eb9524d795ff90
🎊 +1 overall
Great PR! Some comments.
Thanks!
delete(finalPath, true, OP_DELETE);

// rename temp to final
renameFile(tempPath, finalPath);
There can be a parallel process which might create a directory between lines 680 and 683; should we check, after line 683, whether finalPath is a file or not?
Also, should we check whether the filesystem rename returned true or false?
These checks would help us know that there was no object on the destination and that the rename was completely successful.
From the renameFile() javadocs:

> throws PathIOException – if the rename() call returned false.

So there is no need to check the result here.

Directory deletion, maybe: but what is going to create a directory here? Nothing in the committer will, and if some other process is doing stuff in the job attempt dir you are doomed.
Got your point for the directory case.
For the first point, I now understand that executeRenamingOperation would call escalateRenameFailure on fs.rename() failure, which would raise PathIOException. I was thinking that instead of calling renameFile we could call operations.renameFile() directly and raise the exception from there. The reason being: escalateRenameFailure does a getFileStatus on both src and dst for logging, so we can save two filesystem calls if we know the renameFile for the saveManifest has failed. Would like to know your view. But I am good with this comment.
You mean use commitFile() after creating a file entry, so pushing more of the recovery down? We could do that. We won't have the etag of the created file though.
That would be better: if the recovery also fails, we would not do the additional HEAD calls in escalateRenameFailure. This looks good!
One thing I'm considering here: make that "initial attempt at base dir delete" a numeric threshold. Good: agile for now, leaving a simple switch.
* Use delay from retry class for sleeping.

Change-Id: I4f5ea48f6c22412d55ecb1bfd00c82b6cc7e4be5
🎊 +1 overall
🎊 +1 overall
Move from classic rename() to commitFile() to rename the file, after calling getFileStatus() to get its length and possibly etag. This becomes a rename() on gcs/hdfs anyway, but on abfs it does reach the ResilientCommitByRename callbacks in abfs, which report on the outcome to the caller...which is then logged at WARN.

Test changes to match the codepath changes, including improvements in fault injection.

Change-Id: I757a77c8d2b7a7f1cf2ce32d109ce1baa6a90ec2
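Roughly, the new codepath looks like the following sketch; the FileEntry constructor, CommitFileResult accessor and etagOf() helper are written from the description above and may not match the committer's exact signatures.

```java
/** Sketch only: commitFile()-based rename of the saved task manifest. */
private void commitManifest(Path tempPath, Path finalPath) throws IOException {
  FileStatus st = getFileStatus(tempPath);          // length, plus etag where available
  FileEntry entry = new FileEntry(tempPath, finalPath,
      st.getLen(), etagOf(st));                     // etagOf() is a placeholder helper
  CommitFileResult result = operations.commitFile(entry);
  if (result.recovered()) {
    // the resilient commit had to recover from a failure: surface that at WARN
    LOG.warn("Rename of manifest {} to {} required recovery", tempPath, finalPath);
  }
}
```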
I've now moved to commitFile() to rename the task manifest, after doing a getFileStatus() call first...which means its IO cost is the same as a rename with recovery enabled. It does let us see what happened, which we log at WARN.
Thanks for taking the comment!
@saxenapranav what do you think of the patch now?
From my perspective, the changes look good to me. Thanks for taking all the thoughts!
However, since I am new to the component, it would be great if we can get a +1 from other reviewers as well.
Thanks!
LGTM +1
🎊 +1 overall
Improve task commit resilience everywhere and add an option to reduce delete IO requests on job cleanup (relevant for ABFS and HDFS).

Task Commit Resilience
----------------------

Task manifest saving is re-attempted on failure; the number of attempts made is configurable with the option:

    mapreduce.manifest.committer.manifest.save.attempts

* The default is 5.
* The minimum is 1; asking for less is ignored.
* A retry policy adds 500ms of sleep per attempt.
* Move from classic rename() to commitFile() to rename the file, after calling getFileStatus() to get its length and possibly etag. This becomes a rename() on gcs/hdfs anyway, but on abfs it does reach the ResilientCommitByRename callbacks in abfs, which report on the outcome to the caller...which is then logged at WARN.
* New statistic task_stage_save_summary_file to distinguish from other saving operations (job success/report file). This is only saved to the manifest on task commit retries, and provides statistics on all previous unsuccessful attempts to save the manifests.
* Test changes to match the codepath changes, including improvements in fault injection.

Directory size for deletion
---------------------------

New option mapreduce.manifest.committer.cleanup.parallel.delete.base.first

This makes an initial attempt at deleting the base dir, only falling back to parallel deletes if there's a timeout. This option is disabled by default; consider enabling it for abfs to reduce IO load. Consult the documentation for more details.

Success file printing
---------------------

The command to print a JSON _SUCCESS file from this committer and any S3A committer is now something which can be invoked from the mapred command:

    mapred successfile <path to file>

Contributed by Steve Loughran
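For reference, a hedged example of setting both options from code; the same keys can equally go into mapred-site.xml or the job configuration, and the values shown just repeat the defaults and recommendation above.

```java
Configuration conf = new Configuration();
// number of attempts to save a task manifest (default 5, minimum 1)
conf.setInt("mapreduce.manifest.committer.manifest.save.attempts", 5);
// try a single base-dir delete before parallel cleanup; off by default,
// worth considering on abfs to reduce IO load
conf.setBoolean(
    "mapreduce.manifest.committer.cleanup.parallel.delete.base.first", true);
```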
MAPREDUCE-7474. Improve resilience of task commit save and rename operation with retries.
5 attempts, with 500 millis sleep between them. No configuration.
Issue: should we make this configurable?
Split delete(path, recursive) into deleteFile and rmdir for separate statistics.
New option: mapreduce.manifest.committer.cleanup.parallel.delete.base.first

This attempts to delete the base dir, and only on failure (timeout, permissions) does it attempt the parallel delete and (re)attempt at deleting the base dir. This is to cut back on Azure load while still handling timeouts on deep tree deletion.
Test simulation expands to: support recovery through a countdown of calls to fail, and simulate timeouts before *and after* rename calls.
This is based on #6596 but skips the rate limiting logic spanning common and azure; instead it only contains changes in the manifest committer, which makes it easier to backport.