-
Notifications
You must be signed in to change notification settings - Fork 9.4k
[Backport] Implement Better Error Handling and Fix Waits on Null PIDs in Parallel SCD Execution #22610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Backport] Implement Better Error Handling and Fix Waits on Null PIDs in Parallel SCD Execution #22610
Conversation
… in Parallel SCD Execution Relates to magento#22607 magento#21852 magento#22563
Hi @davidalger. Thank you for your contribution
For more details, please, review the Magento Contributor Assistant documentation |
Moving this PR on hold till original #22607 will be merged |
|
||
unset($this->inProgress[$package->getPath()]); | ||
return pcntl_wexitstatus($status) === 0; | ||
} else if ($result === -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use elseif
instead of else if
there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch. That was a mistake. I'll get it fixed.
Why didn't you added changes in |
Honestly? Because the changes were originally developed as a patch applied to the 2.3.1 and 2.2.6 releases inside a CI/CD pipeline. The I'll update the method on this back port though. Actually there is really no reason not to just pull back the entire updated class from 2.3 to include the strict types and phpcs:ignores, etc as well. Thanks for the notes! |
@ihor-sviziev PR updated. Let me know if there is anything I can do to help move this (and the PR to 2.3-develop) forward. Thanks again! |
Hi @sidolov, thank you for the review. |
Hi @ihor-sviziev, thank you for the review. |
Hi @davidalger, thank you for your contribution! |
… on Null PIDs in Parallel SCD Execution #22610
Original Pull Request
#22607
Description of Issue and Resolution
I've been building a Concourse CI/CD pipeline and using a project which has 14 themes to deploy given the multi-site setup. Around 50% of the jobs running in Concourse would fail with the following error (prior to the changes made in this PR) which is precisely the issue recorded on issue #21852:
Every time this error occured, the SCD would reach 100% completion and then hang until the 400 second timeout (as defined on
\Magento\Deploy\Process\Queue::DEFAULT_MAX_EXEC_TIME
) elapses and the queue exits and attempts to cleanup. The error came from__destruct
but the output indicates all packages had reached 100% and there are actually no child processes still present on the system by the time this occurs:With the better error logging and messaging in
__destruct
(without fixes in other areas of this class), theRuntimeException
thrown reveals what I expected: the__destruct
method is attempting to wait for a child process that has already exited and been reaped by a prior call topcntl_waitpid
in\Magento\Deploy\Process\Queue::isDeployed
resulting in aPCNTL_ECHILD
(errno 10) error:In the above execution, PID 540 was the parent process execution of
bin/magento
. PID 543 (the one that tripped up__destruct
call) is one of the first child processes that was spawned. Watching the process list inside the container this was running in showed this PID had cleaned up long before the procesess got close to the__destruct
call in which theQueue
implementationn is attempting to wait for it to exit. This indicated for some reason the PID was remaining ininProgress
(by all appearances, possibly a race condition given the sporadicness of the error on most environment, except this one where I can reproduce it 50% of the time)Similar behaviour of
pcntl_waitpid
can also be observed by running the following single-line command:Errno 10 is equivelant to the constant
PCNTL_ECHILD
which per the linux man page (https://linux.die.net/man/2/waitpid) indicates the following:From here I added the same error logging I'd setup in
__destruct
to the usage ofpcntl_waitpid
in\Magento\Deploy\Process\Queue::isDeployed
as there was no error checking there, just the assumption that it was succeeding with a valid PID or returning because no child had exited (per theWNOHANG
option).After adding error handling to
isDeployed
it was revealaed thatisDeployed
was callingpcntl_waitpid
withnull
incorrectly and this resulted in a silent error on 2.2x which was not caught due to lack of error checking on thepcntl_waitpid
call (note the missing PID value):On
2.3-develop
afterdeclare(strict_types=1);
was added to theQueue
class file, this resulted in aPHP Fatal error: Uncaught TypeError
sincepcntl_waitpid
is expecting anint
and not anull
argument. Complicating things worse on2.3-develop
was commit 7421dfb which inadvertantly changed the behaviour ofgetPid
to return a boolean vs actually returning the child PID.This resolves the issue with
getPid
introduced in commit 7421dfb as well as the issue with passing anull
value topcntl_waitpid
that ultimately resulted in the SCD process inconsistently failing when parallel execution is used under specific circumstances (such as having a large number of themes).This PR resolves both #22563 and #21852 reliably in my test case. At the beginning of this description I stated I was reproducing the issue on about 50% of builds. After applying this patch to the build I have now run over 75+ builds succesfully without SCD hanging until the timeout elapses and without resulting in any
RuntimeException
errors.Contribution checklist (*)