Skip to content

Implement Better Error Handling and Fix Waits on Null PIDs in Parallel SCD Execution #22607

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 23, 2019

Conversation

davidalger
Copy link
Member

@davidalger davidalger commented May 2, 2019

Description of Issue and Resolution

I've been building a Concourse CI/CD pipeline and using a project which has 14 themes to deploy given the multi-site setup. Around 50% of the jobs running in Concourse would fail with the following error (prior to the changes made in this PR) which is precisely the issue recorded on issue #21852:

[RuntimeException]
Error while waiting for package deployed: 24; Status: 0

Every time this error occured, the SCD would reach 100% completion and then hang until the 400 second timeout (as defined on \Magento\Deploy\Process\Queue::DEFAULT_MAX_EXEC_TIME) elapses and the queue exits and attempts to cleanup. The error came from __destruct but the output indicates all packages had reached 100% and there are actually no child processes still present on the system by the time this occurs:

/tmp/build/e3206652 # ps | grep -E [p]hp
PID   USER     TIME  COMMAND
  540 root      0:14 php -d memory_limit=768M bin/magento setup:static-content:deploy --jobs 8

With the better error logging and messaging in __destruct (without fixes in other areas of this class), the RuntimeException thrown reveals what I expected: the __destruct method is attempting to wait for a child process that has already exited and been reaped by a prior call to pcntl_waitpid in \Magento\Deploy\Process\Queue::isDeployed resulting in a PCNTL_ECHILD (errno 10) error:

/tmp/build/e3206652/workspace # rm -rf pub/static/* && php -d memory_limit=768M bin/magento setup:static-content:deploy --jobs 8

Deploy using quick strategy
frontend/Magento/blank/en_US            3087/3087           ============================ 100% %  19 secs             
adminhtml/Magento/backend/en_US         2887/2887           ============================ 100% %  13 secs             
frontend/Magento/luma/en_US             3103/3103           ============================ 100% %  22 secs             
adminhtml/<redacted>/<redacted/en_US    2887/2887           ============================ 100% %  13 secs             
frontend/<redacted>/<redacted>/en_US    3197/3197           ============================ 100% %  25 secs             
frontend/<redacted>/<redacted>/en_US    3223/3223           ============================ 100% %  25 secs             
frontend/<redacted>/<redacted>/en_US    3198/3198           ============================ 100% %  24 secs             
frontend/<redacted>/<redacted>/en_US    3223/3223           ============================ 100% %  24 secs             
frontend/<redacted>/<redacted>/en_US    3232/3232           ============================ 100% %  24 secs             
frontend/<redacted>/<redacted>/en_US    3237/3237           ============================ 100% %  20 secs             
frontend/<redacted>/<redacted>/en_US    3239/3239           ============================ 100% %  22 secs             
frontend/<redacted>/<redacted>/en_US    3264/3264           ============================ 100% %  22 secs             
frontend/<redacted>/<redacted>/en_US    3247/3247           ============================ 100% %  22 secs             
frontend/<redacted>/<redacted>/en_US    3255/3255           ============================ 100% %  18 secs

                                                                                    
  [RuntimeException]                                                                    
  Error encountered waiting for child process (PID: 543): No child process (errno: 10)  

In the above execution, PID 540 was the parent process execution of bin/magento. PID 543 (the one that tripped up __destruct call) is one of the first child processes that was spawned. Watching the process list inside the container this was running in showed this PID had cleaned up long before the procesess got close to the __destruct call in which the Queue implementationn is attempting to wait for it to exit. This indicated for some reason the PID was remaining in inProgress (by all appearances, possibly a race condition given the sporadicness of the error on most environment, except this one where I can reproduce it 50% of the time)

Similar behaviour of pcntl_waitpid can also be observed by running the following single-line command:

$ php -r '$r = pcntl_waitpid(1234, $s); var_dump($r, $s); echo pcntl_strerror(pcntl_errno()) . " (errno: " . pcntl_errno(); echo ")\n";'
int(-1)
int(0)
No child processes (errno: 10)

Errno 10 is equivelant to the constant PCNTL_ECHILD which per the linux man page (https://linux.die.net/man/2/waitpid) indicates the following:

(for waitpid() or waitid()) The process specified by pid (waitpid()) or idtype and id (waitid()) does not exist or is not a child of the calling process.

From here I added the same error logging I'd setup in __destruct to the usage of pcntl_waitpid in \Magento\Deploy\Process\Queue::isDeployed as there was no error checking there, just the assumption that it was succeeding with a valid PID or returning because no child had exited (per the WNOHANG option).

After adding error handling to isDeployed it was revealaed that isDeployed was calling pcntl_waitpid with null incorrectly and this resulted in a silent error on 2.2x which was not caught due to lack of error checking on the pcntl_waitpid call (note the missing PID value):

[RuntimeException]
Error encountered waiting for child process (PID: ): No child process (errno: 10)

Exception trace:
 () at /tmp/build/e3206652/workspace/vendor/magento/module-deploy/Process/Queue.php:336
 Magento\Deploy\Process\Queue->isDeployed() at /tmp/build/e3206652/workspace/vendor/magento/module-deploy/Process/Queue.php:214
 Magento\Deploy\Process\Queue->assertAndExecute() at /tmp/build/e3206652/workspace/vendor/magento/module-deploy/Process/Queue.php:163
 Magento\Deploy\Process\Queue->process() at /tmp/build/e3206652/workspace/vendor/magento/module-deploy/Strategy/QuickDeploy.php:76
 Magento\Deploy\Strategy\QuickDeploy->deploy() at /tmp/build/e3206652/workspace/vendor/magento/module-deploy/Service/DeployStaticContent.php:109
 Magento\Deploy\Service\DeployStaticContent->deploy() at /tmp/build/e3206652/workspace/setup/src/Magento/Setup/Console/Command/DeployStaticContentCommand.php:140
 Magento\Setup\Console\Command\DeployStaticContentCommand->execute() at /tmp/build/e3206652/workspace/vendor/symfony/console/Command/Command.php:245
 Symfony\Component\Console\Command\Command->run() at /tmp/build/e3206652/workspace/vendor/symfony/console/Application.php:835
 Symfony\Component\Console\Application->doRunCommand() at /tmp/build/e3206652/workspace/vendor/symfony/console/Application.php:185
 Symfony\Component\Console\Application->doRun() at /tmp/build/e3206652/workspace/vendor/magento/framework/Console/Cli.php:104
 Magento\Framework\Console\Cli->doRun() at /tmp/build/e3206652/workspace/vendor/symfony/console/Application.php:117
 Symfony\Component\Console\Application->run() at /tmp/build/e3206652/workspace/bin/magento:23

On 2.3-develop after declare(strict_types=1); was added to the Queue class file, this resulted in a PHP Fatal error: Uncaught TypeError since pcntl_waitpid is expecting an int and not a null argument. Complicating things worse on 2.3-develop was commit 7421dfb which inadvertantly changed the behaviour of getPid to return a boolean vs actually returning the child PID.

This resolves the issue with getPid introduced in commit 7421dfb as well as the issue with passing a null value to pcntl_waitpid that ultimately resulted in the SCD process inconsistently failing when parallel execution is used under specific circumstances (such as having a large number of themes).

// When $pid comes back as null the child process for this package has not yet started; prevents both
// hanging until timeout expires (which was behaviour in 2.2.x) and the type error from strict_types
if ($pid === null) {
    return false;
}

This PR resolves both #22563 and #21852 reliably in my test case. At the beginning of this description I stated I was reproducing the issue on about 50% of builds. After applying this patch to the build I have now run over 75+ builds succesfully without SCD hanging until the timeout elapses and without resulting in any RuntimeException errors.

Fixed Issues (if relevant)

  1. Parallelised execution of static content deploy is broken on 2.3-develop #22563: Parallelised execution of static content deploy is broken on 2.3-develop
  2. Random Error while waiting for package deployed #21852: Random Error while waiting for package deployed

Contribution checklist (*)

  • Pull request has a meaningful description of its purpose
  • All commits are accompanied by meaningful commit messages
  • All new or changed code is covered with unit/integration tests (if applicable)
  • All automated tests passed successfully (all builds on Travis CI are green)

@davidalger davidalger requested a review from hostep May 2, 2019 19:04
@m2-assistant
Copy link

m2-assistant bot commented May 2, 2019

Hi @davidalger. Thank you for your contribution
Here is some useful tips how you can test your changes using Magento test environment.
Add the comment under your pull request to deploy test or vanilla Magento instance:

  • @magento-engcom-team give me test instance - deploy test instance based on PR changes
  • @magento-engcom-team give me 2.3-develop instance - deploy vanilla Magento instance

For more details, please, review the Magento Contributor Assistant documentation

@hostep
Copy link
Contributor

hostep commented May 2, 2019

@davidalger: awesome!

But on my local installation after applying this fix, the amount of files which get generated when using 2 jobs is very small. In your testing, does it generated all the files?
This might as well be a problem on my local environment though... (maybe those pcntl functions work slightly different between macOS & Linux)

$ rm -R var/cache/* var/view_preprocessed/* pub/static/*

$ bin/magento setup:static-content:deploy -f --jobs=2

Deploy using quick strategy
frontend/Magento/blank/en_US            0/2166              >--------------------------- 0% %    8 secs
adminhtml/Magento/backend/en_US         0/2184              >--------------------------- 0% %    8 secs
frontend/Magento/luma/en_US             0/2182              >--------------------------- 0% %    5 secs

Execution time: 10.173676013947

$ find pub/static -type f | wc -l
      14

$ rm -R var/cache/* var/view_preprocessed/* pub/static/*

$ bin/magento setup:static-content:deploy -f --jobs=1

Deploy using quick strategy
frontend/Magento/blank/en_US            2166/2166           ============================ 100% %  45 secs
adminhtml/Magento/backend/en_US         2184/2184           ============================ 100% %  46 secs
frontend/Magento/luma/en_US             2182/2182           ============================ 100% %  47 secs

Execution time: 147.12678790092

$ find pub/static -type f | wc -l
    6034

@davidalger
Copy link
Member Author

@hostep That is indeed strange. On my local checkout on macOS I'm getting the following:

davidalger:11:46 AM:/sites/pcntl-scd-fix (pcntl-scd-fix-23) $ bin/magento setup:static-content:deploy -f --jobs=2

Deploy using quick strategy
frontend/Magento/blank/en_US            2166/2166           ============================ 100% %  < 1 sec             
adminhtml/Magento/backend/en_US         2184/2184           ============================ 100% %  0                   
frontend/Magento/luma/en_US             2182/2182           ============================ 100% %  0

Execution time: 9.0232081413269

What version of PHP are you using? I'm running PHP 7.1 on my Mac at the moment. I'll test on 7.2 as well. Right now I'm getting integration tests running because 7.2 tests are failing with a weird error on travis:

PHP Fatal error:  Uncaught PHPUnit\Framework\Exception: Warning: exec(): Unable to fork [tasklist.exe /fi 'PID eq 8572' /fo CSV /nh 2>&1] in /home/travis/build/davidalger/magento2/lib/internal/Magento/Framework/Shell.php:58.

On 7.1 however it's failing on an assertion:

There was 1 failure:
1) Magento\MysqlMq\Model\QueueManagementTest::testAllFlows
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
-'32'
+'33'
/home/travis/build/davidalger/magento2/dev/tests/integration/testsuite/Magento/MysqlMq/Model/QueueManagementTest.php:100

@hostep
Copy link
Contributor

hostep commented May 2, 2019

I was using PHP 7.1.27, just tested again with PHP 7.2.16 but with the same result :(
PHP is installed using Macports. And I'm on macOS 10.14.4 if that should matter.

Can you test with the directories var/view_preprocessed/ & pub/static/ emptied as well? Because it would surprice me if it can run in 9 seconds :)

@davidalger
Copy link
Member Author

@hostep Here is what I get. PHP on my local env is installed via [email protected]

davidalger:03:08 PM:/sites/pcntl-scd-fix (pcntl-scd-fix-23) $ redis-cli flushall
OK
davidalger:03:08 PM:/sites/pcntl-scd-fix (pcntl-scd-fix-23) $ rm -R var/cache/* var/view_preprocessed/* pub/static/* generated/*
rm: var/cache/*: No such file or directory
davidalger:03:08 PM:/sites/pcntl-scd-fix (pcntl-scd-fix-23) $ bin/magento setup:static-content:deploy -f --jobs=2

Deploy using quick strategy
frontend/Magento/blank/en_US            2167/2167           ============================ 100% %  6 secs              
adminhtml/Magento/backend/en_US         2185/2185           ============================ 100% %  6 secs              
frontend/Magento/luma/en_US             2183/2183           ============================ 100% %  3 secs

Execution time: 25.305968999863
davidalger:03:09 PM:/sites/pcntl-scd-fix (pcntl-scd-fix-23) $ php -v
PHP 7.1.25 (cli) (built: Jan 13 2019 07:33:43) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.1.0, Copyright (c) 1998-2018 Zend Technologies
    with Zend OPcache v7.1.25, Copyright (c) 1999-2018, by Zend Technologies

What happens if you tack on -vvv for verbosity?

davidalger added a commit to davidalger/magento2 that referenced this pull request May 2, 2019
@hostep
Copy link
Contributor

hostep commented May 2, 2019

Here's with the -vvv flag:

$ rm -R var/cache/* var/view_preprocessed/* pub/static/* generated/*

$ bin/magento setup:static-content:deploy -f --jobs=2 -vvv

Deploy using quick strategy
Execute: frontend/Magento/blank/en_US
Execute: adminhtml/Magento/backend/en_US
frontend/Magento/blank/en_US            0/2166              >--------------------------- 0% %    < 1 sec
adminhtml/Magento/backend/en_US         0/2184              >--------------------------- 0% %    < 1 sec
.
Exited: frontend/Magento/blank/en_US(status: 0)
frontend/Magento/blank/en_US            0/2166              >--------------------------- 0% %    3 secs
adminhtml/Magento/backend/en_US         0/2184              >--------------------------- 0% %    3 secs
frontend/Magento/blank/en_US            0/2166              >--------------------------- 0% %    8 secs
adminhtml/Magento/backend/en_US         0/2184              >--------------------------- 0% %    8 secs
frontend/Magento/luma/en_US             0/2182              >--------------------------- 0% %    5 secs

Execution time: 8.4032649993896

And php:

$ php -v
PHP 7.1.27 (cli) (built: Mar 22 2019 08:12:27) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.1.0, Copyright (c) 1998-2018 Zend Technologies
    with Zend OPcache v7.1.27, Copyright (c) 1999-2018, by Zend Technologies

Not sure if this is helpful though...

Would be good if somebody else could do some testing as well, it still feels like an environmental issue over here, but not sure why.

@davidalger
Copy link
Member Author

@hostep Yeah I'm not sure it is. Sorry. That's definitely a different issue, and I'm not sure what would cause it since I can't reproduce it either before or after these changes.

General update on my notes above about the integration tests:

  • The 7.2 test failure emitting the Warning: exec(): Unable to fork is occurring on other PRs as well as does not appear to be related to this work.
  • The 7.1 failure is transient. Ran the same failed job on Travis again and it turned green. Also ran this test locally, with my debugger active, and the test in /Magento/MysqlMq/Model/QueueManagementTest.php does not ever touch any file modified by this PR.

So I think we're good here…once Travis decides to stop hanging and run the test on this PR specifically vs just my branch on my forked repo.

@hostep hostep removed their request for review May 3, 2019 13:19
@ihor-sviziev
Copy link
Contributor

ihor-sviziev commented May 6, 2019

Changes looks good to me, but I was trying to test this change and got really strange result.

When I tried to compile few times my themes with the same params (with --jobs $(nproc), 8 in my case) and calculated files list - each time I got different results!

@hostep did you have such issue?

Then I reverted these changes and ... got the same issue.

Very strange... Looks like it's not caused by this PR, but definitely it's not expected.
Will additionally double check if it works good, but since I applied this patch - I hadn't any issues like "Error while waiting for package deployed..."

I think issue that @hostep got - also not related to this PR.

@hostep
Copy link
Contributor

hostep commented May 6, 2019

Just informational for people stumbling upon this thread.

I just found the cause of my issue, after debugging the code, PHP execution just stopped (no error or anything) when calling preg_match somewhere. After some searching around, it looks like it has to do with the following bug in PHP: https://bugs.php.net/bug.php?id=77260 (found through Homebrew/homebrew-core#35491)

And following the advise from the homebrew thread, I can make the multithreaded static content deploy work over here when I add pcre.jit=0 to my php.ini file 🎉

@ihor-sviziev: after the above was fixed, I tested using 1, 2 or 3 jobs, and I'm always getting 6035 files in pub/static (using 6a9860f + https://patch-diff.githubusercontent.com/raw/magento/magento2/pull/22607.diff)

@davidalger
Copy link
Member Author

@ihor-sviziev @hostep That is interesting indeed. Yeah…I can't say I've ever run into varying number of files.

Do you guys thinks this is something we should take care of by disabling pcre.jit in the SCD command via an ini_set? That would ensure that underlying bug won't break SCD where applicable. I could incorporate this into this PR or open a new one dedicated fixing that specific issue.

@ihor-sviziev
Copy link
Contributor

ihor-sviziev commented May 8, 2019

Hi @davidalger,
I think it should be different PR. I’ll try to apply the same option in php.ini and will see, maybe issue with different amount of files will gone away

@hostep could you make such PR?

@hostep
Copy link
Contributor

hostep commented May 8, 2019

Hmm, I'm not sure, I think it's not the responsibility of an application to change environmental problems, but not sure about it.

But if we do need to implement it, it should probably be done in a similar fashion as over here, where the original value is restored, right?

@davidalger
Copy link
Member Author

@hostep I would say yes, should definitely restore the original value before the method returns, that way it has no additional side-effects (even if it's not likely to be called from anywhere other than the two CLI command that run SCD step, just good practice). I'm thinking probably makes the most sense to do it in \Magento\Deploy\Service\DeployStaticContent::deploy

@hostep
Copy link
Contributor

hostep commented May 8, 2019

Ok, I've changed my mind in the mean time, I see the benefit of adding this now.
I'll see if I can prepare a PR later this week!

(Update - I don't know how @davidalger is doing this, but he's posting from the future: "commented an hour from now" 😄 Nevermind, was probably some bug in Github, is fixed right now)

@hostep
Copy link
Contributor

hostep commented May 9, 2019

Ok, I've got some updates regarding the pcre.jit bug I'm experiencing.

When pcre.jit is set to 1 in my php.ini file and I add ini_set('pcre.jit', '0'); to the static content deploy code (in Magento\Setup\Console\Command\DeployStaticContentCommand::execute, first line), it's not fixed.
But I do notice that the counter next to the themes in the output changed from 0/2166 to 1/2166. So it does have some influence but changing it at runtime doesn't appear to fix the problem.
Changing it in my php.ini file to 0 does fix it entirely.
So I'm afraid I can't create a PR with a fix, since it wouldn't work anyway.

I also started experimenting with different versions of the pcre library, just to see if that got me somewhere. And I found that using version 8.41 and 8.42 is stable and fixes my issue, but using version 8.43 causes the problem. Here's the changelog for people interested (possibly point 9 might have caused it since it targets macOS 10.14 specifically, but that's only guessing here).
If you want to see which pcre version PHP is using, you can execute: php -i | grep 'PCRE Library Version'

Anyways, this is more just to document everything in case other people run against the same problem.
Conclusion: don't use the pcre library version 8.43 (on macOS 10.14 at least) or set pcre.jit=0 in your php.ini file.

Anyways, sorry for intruding in here, let's try to get this PR merged! 🙂

@ihor-sviziev
Copy link
Contributor

ihor-sviziev commented May 13, 2019

Note for QA:
I started getting this issue almost all the time on CI when I switched to option -j $(nproc) (detects how many CPUs available on server), it equals to -j 8 on CI, but equals to -j 4 on my pc. I think that's main issue why I hadn't it locally.
In addition to that I'm compiling multi language website with 6 locales and 4 themes including default ones

@Nazar65
Copy link
Member

Nazar65 commented May 16, 2019

QA Passed ✔️

@m2-assistant
Copy link

m2-assistant bot commented May 23, 2019

Hi @davidalger, thank you for your contribution!
Please, complete Contribution Survey, it will take less than a minute.
Your feedback will help us to improve contribution process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants