Contributor

@steveloughran steveloughran commented Apr 9, 2024

MAPREDUCE-7474. Improve resilience of task commit save and rename operation with retries.

  • Retries of save()
    5 attempts, with 500 millis sleep between them. No configuration.
    Issue: should we make this configurable?
  • Split delete(path, recursive) into deleteFile and rmdir for separate
    statistics.
  • Add new option mapreduce.manifest.committer.cleanup.parallel.delete.base.first
    This attempts to delete the base dir, and only on failure (timeout, permissions)
    does it attempt the parallel delete and (re) attempt at deleting base dir.
    This is to cut back on azure load while still handling timeouts on deep tree
    deletion
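
The base-first cleanup described in the last bullet can be sketched roughly as below. This is only an illustration under stated assumptions, not the committer's code: `BaseFirstCleanup`, `DirDeleter`, `cleanup` and the fixed pool of four threads are hypothetical names and choices made for the example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of "delete base dir first, fall back to parallel delete". */
final class BaseFirstCleanup {

  /** Hypothetical stand-in for a store's recursive directory delete. */
  interface DirDeleter {
    void deleteRecursive(String dir) throws IOException;
  }

  /**
   * Try one recursive delete of the base dir; only if that fails, delete the
   * child dirs in parallel and then (re)attempt the base dir.
   * @return true if the first delete sufficed, false if the fallback ran.
   */
  static boolean cleanup(DirDeleter store, String baseDir, List<String> childDirs,
      boolean baseFirst) throws IOException, InterruptedException {
    if (baseFirst) {
      try {
        store.deleteRecursive(baseDir);
        return true;
      } catch (IOException e) {
        // e.g. timeout or permissions failure: fall through to parallel delete
      }
    }
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (String child : childDirs) {
        // a Callable (not a Runnable) so the checked IOException can propagate
        pending.add(pool.submit(() -> {
          store.deleteRecursive(child);
          return null;
        }));
      }
      for (Future<?> f : pending) {
        try {
          f.get();
        } catch (ExecutionException e) {
          throw new IOException("parallel delete failed", e.getCause());
        }
      }
    } finally {
      pool.shutdown();
    }
    // the tree below the base dir is now smaller, so try the base dir again
    store.deleteRecursive(baseDir);
    return false;
  }
}
```

In the common case this costs a single delete request; the parallel fan-out is only paid for when the first request fails.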

Test simulation expands to:

  • Support recovery through a countdown of calls to fail.
  • Simulate timeout before and after rename calls.
  • Simulating timeouts of delete operations
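
A countdown-based fault injector of the kind listed above might look like the following sketch. The class and method names (`CountdownFailure`, `maybeFail`) are hypothetical and are not taken from the patch's test code.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical fault injector: fail the next N calls, then recover. */
final class CountdownFailure {
  private final AtomicInteger failuresLeft;

  CountdownFailure(int failures) {
    this.failuresLeft = new AtomicInteger(failures);
  }

  /** Throws while the countdown is positive, decrementing it each time. */
  void maybeFail(String operation) throws IOException {
    // decrement only while positive so the counter never goes negative
    int before = failuresLeft.getAndUpdate(n -> n > 0 ? n - 1 : 0);
    if (before > 0) {
      throw new IOException("simulated timeout in " + operation);
    }
  }

  /** Number of simulated failures still to be raised. */
  int remaining() {
    return failuresLeft.get();
  }
}
```

Calling `maybeFail` before the real operation simulates a timeout where nothing happened; calling it after the real operation simulates a timeout where the store actually completed the request.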

This is based on #6596 but skips the rate limiting logic spanning common and azure;
instead it only contains changes in the manifest committer, which makes it easier to backport.

How was this patch tested?

  • IDE test of all new tests against azure
  • full test suite left to yetus

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@steveloughran steveloughran changed the title MAPREDUCE-7474. Manifest committer resilience MAPREDUCE-7474. Improve Manifest committer resilience Apr 9, 2024
@steveloughran
Contributor Author

testing: azure cardiff
-Dparallel-tests=abfs -DtestsThreadCount=8
lots of tests failed for me, but I've now got an account with a very low IO threshold. We need to look at those failures in general

@steveloughran steveloughran force-pushed the abfs/MAPREDUCE-7474-manifest-committer-resilience branch from 4dc5a60 to c4faf5d Compare April 11, 2024 13:26
@steveloughran
Contributor Author

Reviews invited from @mukund-thakur @anmolanmol1234 @anujmodi2021 @HarshitGupta11

* retries of save()
* split delete into deleteFile and rmdir
* needs tests

Change-Id: Idb6cf0e85c62c973881fdc96a3ded97b1cfc43ff
* Retries of save()
  5 attempts, with 500 millis sleep between them. No configuration.
  Issue: should we make this configurable?
* Split delete(path, recursive) into deleteFile and rmdir for separate
  statistics.

Test simulation expands to:
* Support recovery through a countdown of calls to fail.
* Simulate timeout before *and after* rename calls.

Change-Id: I3f86c5a238515955e9b82ed37727d40d2d8d3f96
Change-Id: I039ec6e4dc12f68690ffa977ebb81056ab0d1711
New option mapreduce.manifest.committer.cleanup.parallel.delete.base.first
This makes an initial attempt at deleting the base dir, only falling
back to parallel deletes if there's a timeout.

Best for abfs; for gcs it works but is suboptimal.
Enabled by default.

Also: changed default abfs io rate to 1_000 from 10_000.

+docs and tests updated

Change-Id: Idd10aecc3cb6747a6367573ef9547675641afe8c
Change-Id: I24778ab4d817a77afbbf1d5b132be270698382a4
Change-Id: Ic67f41449d1e46d9fb81c47012bb41d5fade84a9
@steveloughran steveloughran force-pushed the abfs/MAPREDUCE-7474-manifest-committer-resilience branch from a3117cf to 16e1be4 Compare April 12, 2024 16:53
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 31s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 7 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 15m 2s Maven dependency ordering for branch
+1 💚 mvninstall 32m 11s trunk passed
+1 💚 compile 17m 40s trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 compile 16m 21s trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 checkstyle 4m 21s trunk passed
+1 💚 mvnsite 1m 56s trunk passed
+1 💚 javadoc 1m 38s trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 1m 35s trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 2m 55s trunk passed
+1 💚 shadedclient 34m 34s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 33s Maven dependency ordering for patch
+1 💚 mvninstall 1m 3s the patch passed
+1 💚 compile 16m 50s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javac 16m 50s the patch passed
+1 💚 compile 16m 28s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 javac 16m 28s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 4m 14s the patch passed
+1 💚 mvnsite 1m 48s the patch passed
+1 💚 javadoc 1m 35s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 1m 33s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 3m 18s the patch passed
+1 💚 shadedclient 35m 30s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 7m 52s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 2m 33s hadoop-azure in the patch passed.
+1 💚 asflicense 1m 4s The patch does not generate ASF License warnings.
230m 36s
Subsystem Report/Notes
Docker ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/5/artifact/out/Dockerfile
GITHUB PR #6716
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint
uname Linux 412988d12e6f 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / a3117cf
Default Java Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/5/testReport/
Max. process+thread count 1592 (vs. ulimit of 5500)
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/5/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@apache apache deleted a comment from hadoop-yetus Apr 15, 2024
@apache apache deleted a comment from hadoop-yetus Apr 15, 2024
@apache apache deleted a comment from hadoop-yetus Apr 15, 2024
@apache apache deleted a comment from hadoop-yetus Apr 15, 2024
The number of attempts to commit a manifest is now
configurable with the option:
  mapreduce.manifest.committer.manifest.save.attempts

* The default is still 5.
* The minimum is 1; asking for less is ignored.
* A retry policy adds 500ms of sleep per attempt.
  (using retryUpToMaximumCountWithProportionalSleep policy)

Documented.

Making it configurable avoids having to guess what the ideal value should
be; instead the default is something which can cope with brief transient
failures.
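
As a rough sketch of the retry behaviour described in this commit message: the code below is plain Java rather than the Hadoop retry classes, and it assumes that values below 1 are clamped to 1 and that "proportional" means the sleep grows linearly with the attempt number. `SaveWithRetries`, `Sleeper` and `effectiveAttempts` are hypothetical names.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of a save with a bounded number of attempts. */
final class SaveWithRetries {

  /** Injectable sleep so tests need not wait; Thread::sleep works in production. */
  interface Sleeper {
    void sleep(long millis) throws InterruptedException;
  }

  /** Assumption: a configured value below 1 is treated as 1. */
  static int effectiveAttempts(int configured) {
    return Math.max(1, configured);
  }

  /**
   * Invoke the save until it succeeds or the attempts are used up.
   * Sleeps baseSleepMillis * attemptNumber between attempts (assumed policy).
   */
  static <T> T call(Callable<T> save, int configuredAttempts, long baseSleepMillis,
      Sleeper sleeper) throws Exception {
    int attempts = effectiveAttempts(configuredAttempts);
    Exception last = null;
    for (int attempt = 1; attempt <= attempts; attempt++) {
      try {
        return save.call();
      } catch (Exception e) {
        last = e;
        if (attempt < attempts) {
          sleeper.sleep(baseSleepMillis * attempt);
        }
      }
    }
    // attempts >= 1, so at least one failure was recorded before reaching here
    throw last;
  }
}
```

With the defaults quoted above (5 attempts, 500 ms), a save that keeps failing would sleep 500, 1000, 1500 and 2000 ms under this assumed policy before the last failure is rethrown.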

Change-Id: I276aaf39bff73544a633126425cc7ec1e9848ec1
@github-actions github-actions bot added the build label Apr 15, 2024
return trackDuration(getIOStatistics(), statistic, () -> {
return operations.delete(path, recursive);
});
if (recursive) {
Contributor

unable to understand this. How is a recursive flag determining that it is a dir or a file?

Contributor Author

ok, deleteDir will also delete a file. let me highlight that.

I'd done this delete dir/file split to support different capacity requests; without that it is a bit over-complex. It does let us collect different statistics though, which may be useful.

/**
* Default value of option {@link #OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST}: {@value}.
*/
public static final boolean OPT_CLEANUP_PARALLEL_DELETE_BASE_FIRST_DEFAULT = true;
Contributor

As it is bad for GCS, shouldn't the default be false?

Contributor Author

really don't know here. In the docs I try to cover this

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 54s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+0 🆗 xmllint 0m 1s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 7 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 58s Maven dependency ordering for branch
+1 💚 mvninstall 32m 39s trunk passed
+1 💚 compile 17m 26s trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 compile 16m 8s trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 checkstyle 4m 23s trunk passed
+1 💚 mvnsite 1m 57s trunk passed
+1 💚 javadoc 1m 39s trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 1m 32s trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 2m 54s trunk passed
+1 💚 shadedclient 34m 10s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 34s Maven dependency ordering for patch
+1 💚 mvninstall 1m 1s the patch passed
+1 💚 compile 16m 43s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javac 16m 43s the patch passed
+1 💚 compile 16m 16s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 javac 16m 16s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 17s /results-checkstyle-root.txt root: The patch generated 1 new + 22 unchanged - 0 fixed = 23 total (was 22)
+1 💚 mvnsite 1m 52s the patch passed
+1 💚 javadoc 1m 35s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 1m 29s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 3m 19s the patch passed
+1 💚 shadedclient 34m 20s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 7m 55s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 2m 40s hadoop-azure in the patch passed.
+1 💚 asflicense 1m 4s The patch does not generate ASF License warnings.
229m 15s
Subsystem Report/Notes
Docker ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/7/artifact/out/Dockerfile
GITHUB PR #6716
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint xmllint
uname Linux 34da63c6214c 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 9193085
Default Java Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/7/testReport/
Max. process+thread count 1585 (vs. ulimit of 5500)
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/7/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@@ -143,6 +145,20 @@ public final class ManifestCommitterConstants {
*/
public static final boolean OPT_CLEANUP_PARALLEL_DELETE_DIRS_DEFAULT = true;

/**
* Should parallel cleanup try to delete teh base first?
Contributor

typo: the

getName(), tempPath, finalPath, retryCount);

trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
operations.save(manifestData, tempPath, true));
Contributor

Also, if the rename failed in the first attempt but succeeded in the backend, will the save operation on tmpPath fail with an error, and if so, how do we recover from that?

Contributor Author

so renameFile() has always deleted the destination because we need to do that to cope with failures of a previous/concurrent task attempt. Whoever commits last wins.

To make this clearer I'm pulling up more of the code into this method and adding comments.
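
The sequence discussed here (save to a temporary path, delete the destination, rename) can be illustrated against the local filesystem. This is a sketch using `java.nio.file`, not the committer's store operations; `SaveThenRename` and `commit` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical local-filesystem sketch of save, delete destination, rename. */
final class SaveThenRename {

  static void commit(byte[] manifest, Path tempPath, Path finalPath) throws IOException {
    // 1. write the temporary file, overwriting any leftover from an earlier attempt
    Files.write(tempPath, manifest);
    // 2. delete whatever a previous or concurrent attempt left at the destination
    Files.deleteIfExists(finalPath);
    // 3. rename temp to final; without REPLACE_EXISTING this throws if
    //    something recreated the destination in between
    Files.move(tempPath, finalPath);
  }
}
```

Because step 1 overwrites and step 2 always clears the destination, rerunning the whole sequence after an ambiguous failure is safe in this sketch, and whichever attempt runs last determines the final content.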

try (DurationInfo info = new DurationInfo(LOG, true,
"Initial delete of %s", baseDir)) {
exception = deleteOneDir(baseDir);
if (exception == null) {
Contributor

As you added in this logic, when the directory tree is very large and authentication is over OAuth, Azure cloud could fail the baseDir delete due to exhaustive ACL permission checks. But this delete will enter the retry loop because it was a request timeout, and in this scenario all the retries might fail too, so it can take a while to report failure given the backoff and retry attempts set by AZURE_MAX_IO_RETRIES (default value 30).

The default max retry count is 30 today just to ensure that 5-10 minute transient network/service failures do not lead to failures of long-running workloads.

If this logic of attempting the basedir delete before falling back to parallel deletes is optimal only for Azure cloud, we could look for ways to fail fast for recursive deletes.

Would this work - Add a new config MAX_RETRIES_RECURSIVE_DELETE which by default will be the same as AZURE_MAX_IO_RETRIES in ABFS driver. AzureManifestCommitterFactory could probably set this config to 0 before FileSystem.get() call happens.

If this sounds ok, we can look into changes needed in AbfsClient, AbfsRestOperation and ExponentialRetry to make MAX_RETRIES_RECURSIVE_DELETE config effective.

Contributor Author

ooh, so it's going to be quite a long time to fall back.
I'm going to make the option default to false for now.

> AzureManifestCommitterFactory could probably set this config to 0 before FileSystem.get() call happens.

it'll come from the cache, we don't want to set it for everything else, but a low MAX_RETRIES_RECURSIVE_DELETE might make sense everywhere. something to consider later.

Simulating more failure conditions.
Still more to explore there, in particular "what if delete of rename target fails"

Change-Id: Idb84f9c17a195702e6a2345b095f41e72865dd5b
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 32s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+0 🆗 xmllint 0m 1s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 7 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 43s Maven dependency ordering for branch
+1 💚 mvninstall 32m 23s trunk passed
+1 💚 compile 17m 35s trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 compile 16m 7s trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 checkstyle 4m 19s trunk passed
+1 💚 mvnsite 1m 55s trunk passed
+1 💚 javadoc 1m 34s trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 1m 25s trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 2m 53s trunk passed
+1 💚 shadedclient 34m 12s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 33s Maven dependency ordering for patch
+1 💚 mvninstall 1m 0s the patch passed
+1 💚 compile 16m 47s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javac 16m 47s the patch passed
+1 💚 compile 16m 25s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 javac 16m 25s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 22s /results-checkstyle-root.txt root: The patch generated 2 new + 22 unchanged - 0 fixed = 24 total (was 22)
+1 💚 mvnsite 1m 54s the patch passed
+1 💚 javadoc 1m 34s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 1m 34s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 3m 15s the patch passed
+1 💚 shadedclient 34m 26s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 7m 50s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 2m 30s hadoop-azure in the patch passed.
+1 💚 asflicense 1m 3s The patch does not generate ASF License warnings.
228m 23s
Subsystem Report/Notes
Docker ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/8/artifact/out/Dockerfile
GITHUB PR #6716
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint xmllint
uname Linux 2eeac626505b 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 3e5e1e6
Default Java Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/8/testReport/
Max. process+thread count 1368 (vs. ulimit of 5500)
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/8/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran
Contributor Author

@snvijaya we actually know the total number of subdirs for the deletion!

it is propagated via the manifests: each TA manifest includes the #of dirs as an IOStatistic, the aggregate summary adds these all up.

the number of paths under the job dir is that number (counter committer_task_directory_count) plus any left by failed task attempts.

which means we could actually have a threshold of how many subdirectories will trigger an automatic switch to parallel delete.

I'm just going to pass this down and log immediately before the cleanup kicks off, so if there are problems we will get the diagnostics adjacent to the error.

Note that your details on retry timings imply that on a mapreduce job (rather than a spark one) the progress() callback will not take place, so there's a risk that the job will actually time out. I don't think that's an issue in MR job actions, the way it is in task-side actions where a heartbeat back to the MapRed AM is required.
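
The threshold idea floated in this comment could be sketched as below. It is purely illustrative: the PR only logs the count, and `CleanupStrategy`, `skipInitialDelete` and the `threshold` parameter are hypothetical.

```java
/** Hypothetical sketch: choose a cleanup strategy from the known directory count. */
final class CleanupStrategy {

  /** Marker for "directory count unknown", e.g. when the cleanup is not part of job commit. */
  static final long UNKNOWN = -1;

  /**
   * @param taskDirCount aggregate of the per-task directory counters, or UNKNOWN
   * @param failedAttemptDirs extra directories left by failed task attempts
   * @param threshold hypothetical limit above which the serial delete is skipped
   * @return true if cleanup should go straight to parallel delete
   */
  static boolean skipInitialDelete(long taskDirCount, long failedAttemptDirs, long threshold) {
    if (taskDirCount < 0) {
      return false; // count unknown: keep the configured behaviour
    }
    return taskDirCount + failedAttemptDirs > threshold;
  }
}
```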

Statistics Collection and Printing
* New statistic task_stage_save_summary_file to distinguish from
  other saving operations (job success/report file)
* After a failure to save a task attempt, the iostats of the manifest
  are rebuilt so the stats on failures are updated.
  This will get into the final job _SUCCESS statistics so we can see
  if anything happened
* Make the manifest print command something which can be invoked from
  the commandline: mapred successfile
  This is covered in the docs.

The failure stats regeneration is nice; works by passing down a
lambda-expression of the logic to (re)generate the manifest, and invoking
this on every attempt. As this is where the stats are aggregated,
it includes details on the previous failing attempts.
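
The "pass down a lambda and invoke it on every attempt" approach might look like this in outline. The names (`RegeneratingSave`, `Saver`, the `save_failures` counter) are hypothetical, and a plain map stands in for the manifest and its IOStatistics.

```java
import java.io.IOException;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch: rebuild the manifest on each attempt so it carries failure stats. */
final class RegeneratingSave {

  /** Hypothetical stand-in for the operation that writes the manifest. */
  interface Saver {
    void save(Map<String, Long> manifest) throws IOException;
  }

  /**
   * @param generator builds a fresh manifest, reading the current stats
   * @param stats shared counters; a failed save increments "save_failures"
   * @return the manifest that was finally saved
   */
  static Map<String, Long> saveWithStats(Supplier<Map<String, Long>> generator,
      Saver saver, int attempts, Map<String, Long> stats) throws IOException {
    IOException last = null;
    int limit = Math.max(1, attempts);
    for (int i = 0; i < limit; i++) {
      // regenerated every time, so earlier failures are included
      Map<String, Long> manifest = generator.get();
      try {
        saver.save(manifest);
        return manifest;
      } catch (IOException e) {
        last = e;
        stats.merge("save_failures", 1L, Long::sum);
      }
    }
    throw last;
  }
}
```

A generator such as `() -> new HashMap<>(stats)` snapshots the counters at the moment of each attempt, which is why the manifest that finally lands records the attempts that failed before it.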

Directory size for deletion
* Optionally pass down directory count under job dir to cleanup stage
* This is determined in job commit from aggregate statistics;
  unknown elsewhere (app abort etc.).
* It is currently only logged; it may be possible to support an option
  of when to skip the initial serial delete, though it will depend on
  abfs login mechanism.

Testing
* More fault injection scenarios.
* Ability to assert that iostats do not contain specific non-zero stats.
  This is used in ITestAbfsTerasort to assert no task save or rename failures.
  The stats before this change imply this did happen in a job commit; no
  other details, hence the new probe.
* Log manifest committer at debug in mapred-core
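
A probe of the "no specific non-zero stats" kind could be as simple as the sketch below. It works on a plain map of counters rather than the real IOStatistics types, and `StatAssertions`, `assertNoNonZeroStats` and the counter names in the usage note are hypothetical.

```java
import java.util.Map;

/** Hypothetical sketch of an assertion that named counters are absent or zero. */
final class StatAssertions {

  /** Fails if any of the named counters is present with a non-zero value. */
  static void assertNoNonZeroStats(Map<String, Long> counters, String... names) {
    for (String name : names) {
      Long value = counters.get(name);
      if (value != null && value != 0L) {
        throw new AssertionError("statistic " + name + " expected 0 but was " + value);
      }
    }
  }
}
```

A terasort-style test would call this on the counters loaded from the job's success file, for example with hypothetical names such as `"save_failures"` and `"rename_failures"`.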

Note: if there's a retry process which means the operation can take minutes,
the initial operation will block progress() callbacks so mapreduce jobs
will fail. Spark is unaffected

Change-Id: Id423267de89c7f31e4b1283f9c433b729ff0d87b
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 34s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 shelldocs 0m 0s Shelldocs was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+0 🆗 xmllint 0m 1s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 10 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 57s Maven dependency ordering for branch
+1 💚 mvninstall 32m 39s trunk passed
+1 💚 compile 17m 27s trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 compile 16m 20s trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 checkstyle 4m 19s trunk passed
+1 💚 mvnsite 3m 33s trunk passed
+1 💚 javadoc 3m 15s trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 3m 1s trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 6m 39s trunk passed
+1 💚 shadedclient 34m 11s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 1m 58s Maven dependency ordering for patch
+1 💚 mvninstall 2m 16s the patch passed
+1 💚 compile 16m 49s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javac 16m 49s the patch passed
+1 💚 compile 16m 15s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 javac 16m 15s the patch passed
-1 ❌ blanks 0m 0s /blanks-eol.txt The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
-0 ⚠️ checkstyle 4m 17s /results-checkstyle-root.txt root: The patch generated 10 new + 22 unchanged - 0 fixed = 32 total (was 22)
+1 💚 mvnsite 3m 24s the patch passed
+1 💚 shellcheck 0m 23s No new issues.
+1 💚 javadoc 3m 9s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 3m 2s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 7m 19s the patch passed
+1 💚 shadedclient 34m 28s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 7m 59s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 160m 38s hadoop-mapreduce-project in the patch passed.
+1 💚 unit 2m 43s hadoop-azure in the patch passed.
+1 💚 asflicense 1m 14s The patch does not generate ASF License warnings.
410m 46s
Subsystem Report/Notes
Docker ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/9/artifact/out/Dockerfile
GITHUB PR #6716
Optional Tests dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint
uname Linux 2d173658385e 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / abed2fe
Default Java Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/9/testReport/
Max. process+thread count 1671 (vs. ulimit of 5500)
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/9/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2 shellcheck=0.7.0
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Change-Id: I3048e959efdc1fc7707061e137eb9524d795ff90
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 34s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 shelldocs 0m 0s Shelldocs was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+0 🆗 xmllint 0m 1s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 10 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 15m 7s Maven dependency ordering for branch
+1 💚 mvninstall 32m 9s trunk passed
+1 💚 compile 17m 23s trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 compile 16m 11s trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 checkstyle 4m 22s trunk passed
+1 💚 mvnsite 3m 31s trunk passed
+1 💚 javadoc 3m 15s trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 2m 58s trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 6m 39s trunk passed
+1 💚 shadedclient 34m 6s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 33s Maven dependency ordering for patch
+1 💚 mvninstall 2m 7s the patch passed
+1 💚 compile 16m 48s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javac 16m 48s the patch passed
+1 💚 compile 16m 14s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 javac 16m 14s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 17s /results-checkstyle-root.txt root: The patch generated 2 new + 22 unchanged - 0 fixed = 24 total (was 22)
+1 💚 mvnsite 3m 23s the patch passed
+1 💚 shellcheck 0m 23s No new issues.
+1 💚 javadoc 3m 7s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 3m 1s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 7m 18s the patch passed
+1 💚 shadedclient 34m 19s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 7m 54s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 161m 19s hadoop-mapreduce-project in the patch passed.
+1 💚 unit 2m 44s hadoop-azure in the patch passed.
+1 💚 asflicense 1m 17s The patch does not generate ASF License warnings.
408m 51s
Subsystem Report/Notes
Docker ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/10/artifact/out/Dockerfile
GITHUB PR #6716
Optional Tests dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint
uname Linux 1cd1cb0cbe32 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / cd40e7f
Default Java Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/10/testReport/
Max. process+thread count 1621 (vs. ulimit of 5500)
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/10/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2 shellcheck=0.7.0
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Contributor

@saxenapranav saxenapranav left a comment

Great pr! Some comments.

Thanks!

Comment on lines 680 to 683
delete(finalPath, true, OP_DELETE);

// rename temp to final
renameFile(tempPath, finalPath);
Contributor

There can be a parallel process which might create a directory in between line 680 and 683, should we check post line 683, if finalPath is a file or not?

Contributor

Also, should we check if the filesystem rename returned true or false.

These checks would help us know whether there was no object at the destination and whether the rename was completely successful.

Contributor Author

  1. The renameFile javadocs say it throws PathIOException if the rename() call returned false, so there is no need to check the result here.
  2. directory deletion, maybe: but what is going to create a directory here? Nothing in the committer will, and if some other process is doing stuff in the job attempt dir you are doomed.

Contributor

Got your point for the directory case.

For the first point, I now understand that executeRenamingOperation would call escalateRenameFailure on fs.rename() failure, which would raise PathIOException. I was wondering whether, instead of calling renameFile, we could call operation.renameFile() directly and raise the exception from there. The reason is that escalateRenameFailure does a getFileStatus on both src and dst for logging, so we could save two filesystem calls when we know the renameFile for saveManifest has failed. I would like to know your view, but I am good with this comment.
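
To make the trade-off in this comment concrete, here is a minimal sketch of the pattern being discussed: a boolean rename result is escalated to an exception, and the two extra status probes are only paid on the failure path. This is not the committer's code. `RenameEscalationSketch`, `FakeStore` and `renameOrFail` are hypothetical names, and the in-memory store only stands in for a real filesystem.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: escalate a boolean rename() result to an exception. */
final class RenameEscalationSketch {

  /** Tiny in-memory stand-in for a filesystem; it counts status probes. */
  static final class FakeStore {
    final Set<String> paths = new HashSet<>();
    int statusProbes = 0;

    /** Stand-in for a getFileStatus()-style existence probe. */
    boolean exists(String path) {
      statusProbes++;
      return paths.contains(path);
    }

    /** Returns false on failure instead of throwing, like a classic rename(). */
    boolean rename(String src, String dst) {
      if (!paths.contains(src) || paths.contains(dst)) {
        return false;
      }
      paths.remove(src);
      paths.add(dst);
      return true;
    }
  }

  /** Rename src to dst, raising an IOException with diagnostics on failure. */
  static void renameOrFail(FakeStore store, String src, String dst)
      throws IOException {
    if (store.rename(src, dst)) {
      return; // success path: no extra probes
    }
    // Failure path only: two extra probes to build a useful error message.
    boolean srcExists = store.exists(src);
    boolean dstExists = store.exists(dst);
    throw new IOException("rename " + src + " to " + dst + " failed;"
        + " source exists=" + srcExists
        + ", destination exists=" + dstExists);
  }

  public static void main(String[] args) {
    FakeStore store = new FakeStore();
    store.paths.add("manifest.tmp");
    try {
      renameOrFail(store, "manifest.tmp", "manifest.json");
      // Second call fails: the source is gone and the destination exists.
      renameOrFail(store, "manifest.tmp", "manifest.json");
    } catch (IOException e) {
      System.out.println(e.getMessage()
          + " (probes=" + store.statusProbes + ")");
    }
  }
}
```

Skipping the probes, as suggested above, would remove the two `exists()` calls at the cost of a less informative error message.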

Contributor Author

You mean use commitFile() after creating a file entry, so pushing more of the recovery down? We could do that. We won't have the etag of the created file though.

Contributor

This would be better: if the recovery also fails, we would not make the additional HEAD calls for escalateRenameFailure. This looks good!

@steveloughran
Contributor Author

One thing I'm considering here: making that "initial attempt at base dir delete" a numeric threshold.

good: agile
bad: harder to test, less consistent; harder to replicate

for now, leaving a simple switch

* and use delay from retry class for sleeping

Change-Id: I4f5ea48f6c22412d55ecb1bfd00c82b6cc7e4be5
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 12m 17s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+0 🆗 shelldocs 0m 1s Shelldocs was not available.
+0 🆗 markdownlint 0m 1s markdownlint was not available.
+0 🆗 xmllint 0m 1s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 10 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 15m 14s Maven dependency ordering for branch
+1 💚 mvninstall 32m 12s trunk passed
+1 💚 compile 17m 30s trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 compile 16m 28s trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 checkstyle 4m 23s trunk passed
+1 💚 mvnsite 3m 35s trunk passed
+1 💚 javadoc 3m 15s trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 2m 56s trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 6m 38s trunk passed
+1 💚 shadedclient 34m 4s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 34s Maven dependency ordering for patch
+1 💚 mvninstall 2m 5s the patch passed
+1 💚 compile 16m 51s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javac 16m 51s the patch passed
+1 💚 compile 15m 57s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 javac 15m 57s the patch passed
+1 💚 blanks 0m 1s The patch has no blanks issues.
+1 💚 checkstyle 4m 23s the patch passed
+1 💚 mvnsite 3m 25s the patch passed
+1 💚 shellcheck 0m 24s No new issues.
+1 💚 javadoc 3m 6s the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1
+1 💚 javadoc 3m 1s the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
+1 💚 spotbugs 7m 16s the patch passed
+1 💚 shadedclient 34m 33s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 8m 27s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 162m 7s hadoop-mapreduce-project in the patch passed.
+1 💚 unit 2m 56s hadoop-azure in the patch passed.
+1 💚 asflicense 1m 18s The patch does not generate ASF License warnings.
422m 41s
Subsystem Report/Notes
Docker ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/11/artifact/out/Dockerfile
GITHUB PR #6716
Optional Tests dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint
uname Linux 8700e5c7281c 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 2b38434
Default Java Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/11/testReport/
Max. process+thread count 1623 (vs. ulimit of 5500)
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6716/11/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2 shellcheck=0.7.0
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
_ Prechecks _
+1 💚 dupname 0m 05s No case conflicting files found.
+0 🆗 codespell 0m 05s codespell was not available.
+0 🆗 detsecrets 0m 05s detect-secrets was not available.
+0 🆗 shellcheck 0m 05s Shellcheck was not available.
+0 🆗 shelldocs 0m 05s Shelldocs was not available.
+0 🆗 spotbugs 0m 01s spotbugs executables are not available.
+0 🆗 markdownlint 0m 01s markdownlint was not available.
+0 🆗 xmllint 0m 01s xmllint was not available.
+1 💚 @author 0m 00s The patch does not contain any @author tags.
+1 💚 test4tests 0m 00s The patch appears to include 10 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 2m 13s Maven dependency ordering for branch
+1 💚 mvninstall 89m 13s trunk passed
+1 💚 compile 39m 10s trunk passed
+1 💚 checkstyle 6m 15s trunk passed
+1 💚 mvnsite 16m 30s trunk passed
+1 💚 javadoc 14m 00s trunk passed
+1 💚 shadedclient 170m 25s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 2m 13s Maven dependency ordering for patch
+1 💚 mvninstall 11m 14s the patch passed
+1 💚 compile 37m 21s the patch passed
+1 💚 javac 37m 21s the patch passed
+1 💚 blanks 0m 01s The patch has no blanks issues.
+1 💚 checkstyle 5m 47s the patch passed
+1 💚 mvnsite 16m 05s the patch passed
+1 💚 javadoc 13m 42s the patch passed
+1 💚 shadedclient 178m 33s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 asflicense 5m 29s The patch does not generate ASF License warnings.
543m 45s
Subsystem Report/Notes
GITHUB PR #6716
Optional Tests dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint
uname MINGW64_NT-10.0-17763 b033f3aa81e5 3.4.10-87d57229.x86_64 2024-02-14 20:17 UTC x86_64 Msys
Build tool maven
Personality /c/hadoop/dev-support/bin/hadoop.sh
git revision trunk / 2b38434
Default Java Azul Systems, Inc.-1.8.0_332-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6716/1/testReport/
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6716/1/console
versions git=2.44.0.windows.1
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Move from classic rename() to commitFile() to rename the file,
after calling getFileStatus() to get its length and possibly etag.

This becomes a rename() on gcs/hdfs anyway, but on abfs it does reach
the ResilientCommitByRename callbacks in abfs, which report on
the outcome to the caller...which is then logged at WARN.

test changes to match the codepath changes, including improvements
in fault injection.

Change-Id: I757a77c8d2b7a7f1cf2ce32d109ce1baa6a90ec2
@steveloughran
Contributor Author

I've now moved to commitFile() to rename the task manifest, after doing a getFileStatus() call first, which means its IO cost is the same as a rename with recovery enabled. It does let us see what happened, which we log at WARN.

Contributor

@saxenapranav saxenapranav left a comment

Thanks for taking the comment!

@steveloughran
Contributor Author

@saxenapranav what do you think of the patch now?

Contributor

@saxenapranav saxenapranav left a comment

From my perspective, the change looks good to me. Thanks for taking all the thoughts!

However, since I am new to the component, would be great if we can get +1 from other reviewers as well.

Thanks!

Contributor

@mukund-thakur mukund-thakur left a comment

LGTM +1

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
_ Prechecks _
+1 💚 dupname 0m 05s No case conflicting files found.
+0 🆗 codespell 0m 05s codespell was not available.
+0 🆗 detsecrets 0m 05s detect-secrets was not available.
+0 🆗 shellcheck 0m 05s Shellcheck was not available.
+0 🆗 shelldocs 0m 05s Shelldocs was not available.
+0 🆗 spotbugs 0m 01s spotbugs executables are not available.
+0 🆗 markdownlint 0m 01s markdownlint was not available.
+0 🆗 xmllint 0m 01s xmllint was not available.
+1 💚 @author 0m 00s The patch does not contain any @author tags.
+1 💚 test4tests 0m 00s The patch appears to include 11 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 2m 35s Maven dependency ordering for branch
+1 💚 mvninstall 87m 02s trunk passed
+1 💚 compile 38m 10s trunk passed
+1 💚 checkstyle 6m 01s trunk passed
+1 💚 mvnsite 16m 14s trunk passed
+1 💚 javadoc 14m 05s trunk passed
+1 💚 shadedclient 165m 47s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 2m 13s Maven dependency ordering for patch
+1 💚 mvninstall 11m 25s the patch passed
+1 💚 compile 36m 11s the patch passed
+1 💚 javac 36m 11s the patch passed
+1 💚 blanks 0m 01s The patch has no blanks issues.
+1 💚 checkstyle 5m 43s the patch passed
+1 💚 mvnsite 16m 13s the patch passed
+1 💚 javadoc 13m 47s the patch passed
+1 💚 shadedclient 174m 45s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 asflicense 6m 07s The patch does not generate ASF License warnings.
532m 21s
Subsystem Report/Notes
GITHUB PR #6716
Optional Tests dupname asflicense codespell detsecrets shellcheck shelldocs compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle markdownlint xmllint
uname MINGW64_NT-10.0-17763 54377fa890b8 3.4.10-87d57229.x86_64 2024-02-14 20:17 UTC x86_64 Msys
Build tool maven
Personality /c/hadoop/dev-support/bin/hadoop.sh
git revision trunk / 68dff78
Default Java Azul Systems, Inc.-1.8.0_332-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6716/4/testReport/
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6716/4/console
versions git=2.44.0.windows.1
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran steveloughran merged commit c927060 into apache:trunk May 13, 2024
steveloughran added a commit to steveloughran/hadoop that referenced this pull request May 13, 2024
Improve task commit resilience everywhere
and add an option to reduce delete IO requests on
job cleanup (relevant for ABFS and HDFS).

Task Commit Resilience
----------------------

Task manifest saving is re-attempted on failure; the number of 
attempts made is configurable with the option:

  mapreduce.manifest.committer.manifest.save.attempts

* The default is 5.
* The minimum is 1; asking for less is ignored.
* A retry policy adds 500ms of sleep per attempt.
* Move from classic rename() to commitFile() to rename the file,
  after calling getFileStatus() to get its length and possibly etag.
  This becomes a rename() on gcs/hdfs anyway, but on abfs it does reach
  the ResilientCommitByRename callbacks in abfs, which report on
  the outcome to the caller...which is then logged at WARN.
* New statistic task_stage_save_summary_file to distinguish from
  other saving operations (job success/report file).
  This is only saved to the manifest on task commit retries, and
  provides statistics on all previous unsuccessful attempts to save
  the manifests.
* Test changes to match the codepath changes, including improvements
  in fault injection.
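
The save retry behaviour listed above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the committer's implementation: `SaveRetrySketch`, `SaveOperation` and `saveWithRetries` are hypothetical names, the sleep is modelled as a fixed delay between failed attempts, and "asking for less is ignored" is read as clamping the attempt count to 1.

```java
import java.io.IOException;

/** Hypothetical sketch of a bounded retry around a manifest save. */
final class SaveRetrySketch {

  /** Hypothetical stand-in for the "write temp file, then rename" step. */
  interface SaveOperation {
    void save() throws IOException;
  }

  /**
   * Run the save up to {@code attempts} times, sleeping after each failure
   * except the last. Values below 1 are treated as 1.
   *
   * @return the number of attempts actually made
   */
  static int saveWithRetries(SaveOperation op, int attempts, long sleepMillis)
      throws IOException, InterruptedException {
    int limit = Math.max(1, attempts);
    IOException lastFailure = null;
    for (int attempt = 1; attempt <= limit; attempt++) {
      try {
        op.save();
        return attempt;
      } catch (IOException e) {
        lastFailure = e;
        if (attempt < limit) {
          Thread.sleep(sleepMillis);
        }
      }
    }
    // Every attempt failed: rethrow the most recent failure.
    throw lastFailure;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated save which fails on its first two calls.
    int used = saveWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new IOException("simulated timeout");
      }
    }, 5, 500);
    System.out.println("saved after " + used + " attempts");
  }
}
```

The values 5 and 500 in `main` mirror the default attempt count and sleep quoted above.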

Directory size for deletion
---------------------------

New option

  mapreduce.manifest.committer.cleanup.parallel.delete.base.first

This makes an initial attempt at deleting the base dir, only falling
back to parallel deletes if there's a timeout.

This option is disabled by default; consider enabling it for abfs to
reduce IO load. Consult the documentation for more details.
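
A rough sketch of the "base dir first" cleanup order is given below. It is not the committer's code: `BaseFirstCleanupSketch`, `Deleter` and `cleanup` are hypothetical names, the child deletes run sequentially here where the committer runs them in parallel, and any IOException from the first base dir delete is treated as the trigger for the fallback.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of "delete base dir first, then fall back". */
final class BaseFirstCleanupSketch {

  /** Hypothetical stand-in for a recursive directory delete call. */
  interface Deleter {
    void delete(String path) throws IOException;
  }

  /**
   * Clean up a job directory.
   *
   * @return the number of delete calls issued
   */
  static int cleanup(Deleter deleter, String baseDir, List<String> taskDirs,
      boolean baseFirst) throws IOException {
    int calls = 0;
    if (baseFirst) {
      calls++;
      try {
        deleter.delete(baseDir);
        return calls; // a single request was enough
      } catch (IOException e) {
        // For example a timeout on a deep tree: fall through to
        // per-directory deletes followed by a second base dir attempt.
      }
    }
    for (String dir : taskDirs) {
      calls++;
      deleter.delete(dir);
    }
    calls++;
    deleter.delete(baseDir);
    return calls;
  }

  public static void main(String[] args) throws IOException {
    List<String> taskDirs = Arrays.asList("job/task_0001", "job/task_0002");
    int calls = cleanup(path -> System.out.println("delete " + path),
        "job", taskDirs, true);
    System.out.println("delete calls issued: " + calls);
  }
}
```

When the first delete succeeds, only one request reaches the store, which is where the reduction in IO load comes from; when it fails, the cost is one request more than the parallel-only path.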

Success file printing
---------------------

The command to print a JSON _SUCCESS file from this committer and
any S3A committer can now be invoked from the mapred command:

  mapred successfile <path to file>

Contributed by Steve Loughran
steveloughran added a commit to steveloughran/hadoop that referenced this pull request May 13, 2024
steveloughran added a commit that referenced this pull request May 15, 2024
steveloughran added a commit that referenced this pull request May 15, 2024
K0K0V0K pushed a commit to K0K0V0K/hadoop that referenced this pull request May 17, 2024
K0K0V0K pushed a commit to K0K0V0K/hadoop that referenced this pull request May 17, 2024