-
Notifications
You must be signed in to change notification settings - Fork 9.4k
Implement Better Error Handling and Fix Waits on Null PIDs in Parallel SCD Execution #22607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Better Error Handling and Fix Waits on Null PIDs in Parallel SCD Execution #22607
Conversation
Reverts inadvertent change in 7421dfb causing the improper type to be returned as noted on magento#22563
….3-develop and issue where SCD hangs ending in RuntimeException on 2.2-develop Resolves magento#22563 and magento#21852
Hi @davidalger. Thank you for your contribution
For more details, please, review the Magento Contributor Assistant documentation |
@davidalger: awesome! But on my local installation after applying this fix, the amount of files which get generated when using 2 jobs is very small. In your testing, does it generated all the files?
|
@hostep That is indeed strange. On my local checkout on macOS I'm getting the following:
What version of PHP are you using? I'm running PHP 7.1 on my Mac at the moment. I'll test on 7.2 as well. Right now I'm getting integration tests running because 7.2 tests are failing with a weird error on travis:
On 7.1 however it's failing on an assertion:
|
I was using PHP 7.1.27, just tested again with PHP 7.2.16 but with the same result :( Can you test with the directories |
@hostep Here is what I get. PHP on my local env is installed via
What happens if you tack on -vvv for verbosity? |
… in Parallel SCD Execution Relates to magento#22607 magento#21852 magento#22563
Here's with the
And php:
Not sure if this is helpful though... Would be good if somebody else could do some testing as well, it still feels like an environmental issue over here, but not sure why. |
@hostep Yeah I'm not sure it is. Sorry. That's definitely a different issue, and I'm not sure what would cause it since I can't reproduce it either before or after these changes. General update on my notes above about the integration tests:
So I think we're good here…once Travis decides to stop hanging and run the test on this PR specifically vs just my branch on my forked repo. |
Changes looks good to me, but I was trying to test this change and got really strange result. When I tried to compile few times my themes with the same params (with @hostep did you have such issue? Then I reverted these changes and ... got the same issue. Very strange... Looks like it's not caused by this PR, but definitely it's not expected. I think issue that @hostep got - also not related to this PR. |
Just informational for people stumbling upon this thread. I just found the cause of my issue, after debugging the code, PHP execution just stopped (no error or anything) when calling And following the advise from the homebrew thread, I can make the multithreaded static content deploy work over here when I add @ihor-sviziev: after the above was fixed, I tested using 1, 2 or 3 jobs, and I'm always getting 6035 files in |
@ihor-sviziev @hostep That is interesting indeed. Yeah…I can't say I've ever run into varying number of files. Do you guys thinks this is something we should take care of by disabling |
Hi @davidalger, @hostep could you make such PR? |
Hmm, I'm not sure, I think it's not the responsibility of an application to change environmental problems, but not sure about it. But if we do need to implement it, it should probably be done in a similar fashion as over here, where the original value is restored, right? |
@hostep I would say yes, should definitely restore the original value before the method returns, that way it has no additional side-effects (even if it's not likely to be called from anywhere other than the two CLI command that run SCD step, just good practice). I'm thinking probably makes the most sense to do it in |
Ok, I've changed my mind in the mean time, I see the benefit of adding this now. (Update - |
Ok, I've got some updates regarding the When I also started experimenting with different versions of the Anyways, this is more just to document everything in case other people run against the same problem. Anyways, sorry for intruding in here, let's try to get this PR merged! 🙂 |
Note for QA: |
QA Passed ✔️ |
…Ds in Parallel SCD Execution magento#22607
Hi @davidalger, thank you for your contribution! |
Description of Issue and Resolution
I've been building a Concourse CI/CD pipeline and using a project which has 14 themes to deploy given the multi-site setup. Around 50% of the jobs running in Concourse would fail with the following error (prior to the changes made in this PR) which is precisely the issue recorded on issue #21852:
Every time this error occured, the SCD would reach 100% completion and then hang until the 400 second timeout (as defined on
\Magento\Deploy\Process\Queue::DEFAULT_MAX_EXEC_TIME
) elapses and the queue exits and attempts to cleanup. The error came from__destruct
but the output indicates all packages had reached 100% and there are actually no child processes still present on the system by the time this occurs:With the better error logging and messaging in
__destruct
(without fixes in other areas of this class), theRuntimeException
thrown reveals what I expected: the__destruct
method is attempting to wait for a child process that has already exited and been reaped by a prior call topcntl_waitpid
in\Magento\Deploy\Process\Queue::isDeployed
resulting in aPCNTL_ECHILD
(errno 10) error:In the above execution, PID 540 was the parent process execution of
bin/magento
. PID 543 (the one that tripped up__destruct
call) is one of the first child processes that was spawned. Watching the process list inside the container this was running in showed this PID had cleaned up long before the procesess got close to the__destruct
call in which theQueue
implementationn is attempting to wait for it to exit. This indicated for some reason the PID was remaining ininProgress
(by all appearances, possibly a race condition given the sporadicness of the error on most environment, except this one where I can reproduce it 50% of the time)Similar behaviour of
pcntl_waitpid
can also be observed by running the following single-line command:Errno 10 is equivelant to the constant
PCNTL_ECHILD
which per the linux man page (https://linux.die.net/man/2/waitpid) indicates the following:From here I added the same error logging I'd setup in
__destruct
to the usage ofpcntl_waitpid
in\Magento\Deploy\Process\Queue::isDeployed
as there was no error checking there, just the assumption that it was succeeding with a valid PID or returning because no child had exited (per theWNOHANG
option).After adding error handling to
isDeployed
it was revealaed thatisDeployed
was callingpcntl_waitpid
withnull
incorrectly and this resulted in a silent error on 2.2x which was not caught due to lack of error checking on thepcntl_waitpid
call (note the missing PID value):On
2.3-develop
afterdeclare(strict_types=1);
was added to theQueue
class file, this resulted in aPHP Fatal error: Uncaught TypeError
sincepcntl_waitpid
is expecting anint
and not anull
argument. Complicating things worse on2.3-develop
was commit 7421dfb which inadvertantly changed the behaviour ofgetPid
to return a boolean vs actually returning the child PID.This resolves the issue with
getPid
introduced in commit 7421dfb as well as the issue with passing anull
value topcntl_waitpid
that ultimately resulted in the SCD process inconsistently failing when parallel execution is used under specific circumstances (such as having a large number of themes).This PR resolves both #22563 and #21852 reliably in my test case. At the beginning of this description I stated I was reproducing the issue on about 50% of builds. After applying this patch to the build I have now run over 75+ builds succesfully without SCD hanging until the timeout elapses and without resulting in any
RuntimeException
errors.Fixed Issues (if relevant)
Contribution checklist (*)