[Backport] Implement Better Error Handling and Fix Waits on Null PIDs in Parallel SCD Execution #22610

Merged (4 commits) on May 24, 2019

davidalger (Member) commented May 2, 2019

Original Pull Request

#22607

Description of Issue and Resolution

I've been building a Concourse CI/CD pipeline for a project that deploys 14 themes because of its multi-site setup. Prior to the changes made in this PR, around 50% of the jobs running in Concourse failed with the following error, which is precisely the problem recorded in issue #21852:

[RuntimeException]
Error while waiting for package deployed: 24; Status: 0

Every time this error occurred, the SCD would reach 100% completion and then hang until the 400-second timeout (as defined by \Magento\Deploy\Process\Queue::DEFAULT_MAX_EXEC_TIME) elapsed, at which point the queue exited and attempted to clean up. The error came from __destruct, but the output indicates that all packages had reached 100%, and no child processes were actually still present on the system by the time this occurred:

/tmp/build/e3206652 # ps | grep -E [p]hp
PID   USER     TIME  COMMAND
  540 root      0:14 php -d memory_limit=768M bin/magento setup:static-content:deploy --jobs 8

With the better error logging and messaging in __destruct (without fixes in other areas of this class), the RuntimeException thrown reveals what I expected: the __destruct method is attempting to wait for a child process that has already exited and been reaped by a prior call to pcntl_waitpid in \Magento\Deploy\Process\Queue::isDeployed, resulting in a PCNTL_ECHILD (errno 10) error:

/tmp/build/e3206652/workspace # rm -rf pub/static/* && php -d memory_limit=768M bin/magento setup:static-content:deploy --jobs 8

Deploy using quick strategy
frontend/Magento/blank/en_US            3087/3087           ============================ 100% %  19 secs             
adminhtml/Magento/backend/en_US         2887/2887           ============================ 100% %  13 secs             
frontend/Magento/luma/en_US             3103/3103           ============================ 100% %  22 secs             
adminhtml/<redacted>/<redacted>/en_US   2887/2887           ============================ 100% %  13 secs
frontend/<redacted>/<redacted>/en_US    3197/3197           ============================ 100% %  25 secs             
frontend/<redacted>/<redacted>/en_US    3223/3223           ============================ 100% %  25 secs             
frontend/<redacted>/<redacted>/en_US    3198/3198           ============================ 100% %  24 secs             
frontend/<redacted>/<redacted>/en_US    3223/3223           ============================ 100% %  24 secs             
frontend/<redacted>/<redacted>/en_US    3232/3232           ============================ 100% %  24 secs             
frontend/<redacted>/<redacted>/en_US    3237/3237           ============================ 100% %  20 secs             
frontend/<redacted>/<redacted>/en_US    3239/3239           ============================ 100% %  22 secs             
frontend/<redacted>/<redacted>/en_US    3264/3264           ============================ 100% %  22 secs             
frontend/<redacted>/<redacted>/en_US    3247/3247           ============================ 100% %  22 secs             
frontend/<redacted>/<redacted>/en_US    3255/3255           ============================ 100% %  18 secs

                                                                                    
  [RuntimeException]                                                                    
  Error encountered waiting for child process (PID: 543): No child process (errno: 10)  

In the above execution, PID 540 was the parent bin/magento process. PID 543 (the one that tripped up the __destruct call) is one of the first child processes that was spawned. Watching the process list inside the container this was running in showed that this PID had been cleaned up long before the process got close to the __destruct call, in which the Queue implementation attempts to wait for it to exit. This indicated that, for some reason, the PID was remaining in inProgress. By all appearances this is possibly a race condition, given how sporadic the error is on most environments; this one is the exception, where I can reproduce it 50% of the time.

Similar behaviour of pcntl_waitpid can also be observed by running the following single-line command:

$ php -r '$r = pcntl_waitpid(1234, $s); var_dump($r, $s); echo pcntl_strerror(pcntl_errno()) . " (errno: " . pcntl_errno(); echo ")\n";'
int(-1)
int(0)
No child processes (errno: 10)

Errno 10 is equivalent to the constant PCNTL_ECHILD, which per the Linux man page (https://linux.die.net/man/2/waitpid) indicates the following:

(for waitpid() or waitid()) The process specified by pid (waitpid()) or idtype and id (waitid()) does not exist or is not a child of the calling process.
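The double-wait scenario described above can be reproduced in isolation. The following is a minimal sketch, assuming the pcntl extension is available on the CLI; the function name reapTwice is made up for illustration and is not part of Magento. It forks a child, reaps it once, and then waits on the same PID a second time:

```php
<?php
// Sketch only: reproduces "wait on an already-reaped child" outside of Magento.
// reapTwice() is a hypothetical helper name, not part of the Queue class.
function reapTwice(): array
{
    $pid = pcntl_fork();
    if ($pid === -1) {
        throw new RuntimeException('Unable to fork');
    }
    if ($pid === 0) {
        exit(0); // child: exit immediately
    }

    // First wait blocks until the child exits, then reaps it.
    $first = pcntl_waitpid($pid, $status);

    // Second wait targets a PID that is no longer a child of this process.
    $second = pcntl_waitpid($pid, $status);
    $errno = pcntl_get_last_error();

    return [$first === $pid, $second, $errno === PCNTL_ECHILD];
}

[$reaped, $second, $isEchild] = reapTwice();
var_dump($reaped, $second, $isEchild);
```

The second call is expected to return -1 with the last error set to PCNTL_ECHILD, which is the same condition __destruct ran into.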

From here I added the same error logging I'd set up in __destruct to the pcntl_waitpid call in \Magento\Deploy\Process\Queue::isDeployed, as there was no error checking there, just the assumption that it either succeeded with a valid PID or returned because no child had exited (per the WNOHANG option).
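As a rough illustration of that kind of check (not the actual isDeployed code), a non-blocking wait can distinguish the three possible outcomes of pcntl_waitpid with WNOHANG. The helper name hasChildFinished is hypothetical, and the message format simply mirrors the error text quoted in this description:

```php
<?php
// Sketch only: hasChildFinished() is a hypothetical helper, not Queue::isDeployed.
// Returns true when the child exited with status 0, false when it is still
// running or exited non-zero, and throws when the wait itself fails.
function hasChildFinished(int $pid): bool
{
    $result = pcntl_waitpid($pid, $status, WNOHANG);

    if ($result === $pid) {
        // The child has exited and has now been reaped.
        return pcntl_wexitstatus($status) === 0;
    }

    if ($result === -1) {
        // The wait failed, e.g. PCNTL_ECHILD when $pid is not our child.
        $errno = pcntl_get_last_error();
        throw new RuntimeException(sprintf(
            'Error encountered waiting for child process (PID: %s): %s (errno: %d)',
            $pid,
            pcntl_strerror($errno),
            $errno
        ));
    }

    // 0 means the child exists but has not exited yet.
    return false;
}
```

Throwing on -1 instead of treating it like "still running" is what surfaces the problem immediately rather than after the timeout.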

After adding error handling to isDeployed, it was revealed that isDeployed was incorrectly calling pcntl_waitpid with null. This resulted in a silent error on 2.2.x, which was not caught due to the lack of error checking on the pcntl_waitpid call (note the missing PID value):

[RuntimeException]
Error encountered waiting for child process (PID: ): No child process (errno: 10)

Exception trace:
 () at /tmp/build/e3206652/workspace/vendor/magento/module-deploy/Process/Queue.php:336
 Magento\Deploy\Process\Queue->isDeployed() at /tmp/build/e3206652/workspace/vendor/magento/module-deploy/Process/Queue.php:214
 Magento\Deploy\Process\Queue->assertAndExecute() at /tmp/build/e3206652/workspace/vendor/magento/module-deploy/Process/Queue.php:163
 Magento\Deploy\Process\Queue->process() at /tmp/build/e3206652/workspace/vendor/magento/module-deploy/Strategy/QuickDeploy.php:76
 Magento\Deploy\Strategy\QuickDeploy->deploy() at /tmp/build/e3206652/workspace/vendor/magento/module-deploy/Service/DeployStaticContent.php:109
 Magento\Deploy\Service\DeployStaticContent->deploy() at /tmp/build/e3206652/workspace/setup/src/Magento/Setup/Console/Command/DeployStaticContentCommand.php:140
 Magento\Setup\Console\Command\DeployStaticContentCommand->execute() at /tmp/build/e3206652/workspace/vendor/symfony/console/Command/Command.php:245
 Symfony\Component\Console\Command\Command->run() at /tmp/build/e3206652/workspace/vendor/symfony/console/Application.php:835
 Symfony\Component\Console\Application->doRunCommand() at /tmp/build/e3206652/workspace/vendor/symfony/console/Application.php:185
 Symfony\Component\Console\Application->doRun() at /tmp/build/e3206652/workspace/vendor/magento/framework/Console/Cli.php:104
 Magento\Framework\Console\Cli->doRun() at /tmp/build/e3206652/workspace/vendor/symfony/console/Application.php:117
 Symfony\Component\Console\Application->run() at /tmp/build/e3206652/workspace/bin/magento:23

On 2.3-develop, after declare(strict_types=1); was added to the Queue class file, this resulted in a PHP Fatal error: Uncaught TypeError, since pcntl_waitpid expects an int and not a null argument. Making matters worse on 2.3-develop was commit 7421dfb, which inadvertently changed the behaviour of getPid to return a boolean instead of the child PID.

This PR resolves the issue with getPid introduced in commit 7421dfb, as well as the issue of passing a null value to pcntl_waitpid, which ultimately caused the SCD process to fail inconsistently when parallel execution is used under specific circumstances (such as having a large number of themes).

// When $pid comes back as null the child process for this package has not yet started; prevents both
// hanging until timeout expires (which was behaviour in 2.2.x) and the type error from strict_types
if ($pid === null) {
    return false;
}
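To show how a nullable PID lookup and the guard above fit together, here is a small self-contained sketch. The ProcessTable class, its register method, and the canWaitOn function are hypothetical stand-ins for illustration; only the idea of getPid returning the PID or null via the null coalescing operator comes from this discussion:

```php
<?php
// Sketch only: ProcessTable and canWaitOn() are illustrative names,
// not the real Magento\Deploy\Process\Queue API.
final class ProcessTable
{
    /** @var array<string, int> package path => child PID */
    private $processIds = [];

    public function register(string $path, int $pid): void
    {
        $this->processIds[$path] = $pid;
    }

    // Returns the child PID, or null when no child has been started yet.
    public function getPid(string $path): ?int
    {
        return $this->processIds[$path] ?? null;
    }
}

function canWaitOn(ProcessTable $table, string $path): bool
{
    $pid = $table->getPid($path);
    if ($pid === null) {
        // Nothing to wait for yet; never hand null to pcntl_waitpid.
        return false;
    }

    return true;
}
```

Returning null rather than a boolean keeps "no child started yet" distinguishable from a real PID, so the caller can skip the wait entirely.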

This PR resolves both #22563 and #21852 reliably in my test case. At the beginning of this description I stated that I was reproducing the issue on about 50% of builds. After applying this patch to the build, I have now run more than 75 builds successfully, without SCD hanging until the timeout elapses and without any RuntimeException errors.

Contribution checklist (*)

  • Pull request has a meaningful description of its purpose
  • All commits are accompanied by meaningful commit messages
  • All new or changed code is covered with unit/integration tests (if applicable)
  • All automated tests passed successfully (all builds on Travis CI are green)

m2-assistant bot commented May 2, 2019

Hi @davidalger. Thank you for your contribution.
Here are some useful tips on how you can test your changes using the Magento test environment.
Add one of the following comments under your pull request to deploy a test or vanilla Magento instance:

  • @magento-engcom-team give me test instance - deploy test instance based on PR changes
  • @magento-engcom-team give me 2.2-develop instance - deploy vanilla Magento instance

For more details, please, review the Magento Contributor Assistant documentation

@ihor-sviziev (Contributor)

Putting this PR on hold until the original #22607 is merged.


unset($this->inProgress[$package->getPath()]);
return pcntl_wexitstatus($status) === 0;
} else if ($result === -1) {
Contributor

Please use elseif instead of else if there.

Member Author

Ah, good catch. That was a mistake. I'll get it fixed.

@ihor-sviziev (Contributor)

Why didn't you add the changes to the getPid method, to match 2.3-develop?

@ihor-sviziev ihor-sviziev self-assigned this May 3, 2019
davidalger (Member Author) commented May 3, 2019

@ihor-sviziev

Why didn't you add the changes to the getPid method, to match 2.3-develop?

Honestly? Because the changes were originally developed as a patch applied to the 2.3.1 and 2.2.6 releases inside a CI/CD pipeline. The getPid method didn't use the null coalescing operator until after 2.3.1 was released, and the change to it in the 2.3 PR merely fixes the boolean return value, which a manual change in a sweeping code-standards update introduced, breaking the method entirely.

I'll update the method on this backport though. Actually, there is really no reason not to just pull back the entire updated class from 2.3 to include the strict types, phpcs:ignores, etc. as well. Thanks for the notes!

@davidalger (Member Author)

@ihor-sviziev PR updated. Let me know if there is anything I can do to help move this (and the PR to 2.3-develop) forward. Thanks again!

@magento-engcom-team (Contributor)

Hi @sidolov, thank you for the review.
ENGCOM-5150 has been created to process this Pull Request

@magento-engcom-team (Contributor)

Hi @ihor-sviziev, thank you for the review.
ENGCOM-5150 has been created to process this Pull Request

@Nazar65 Nazar65 force-pushed the pcntl-scd-fix-22 branch from 108ec67 to ba771cf Compare May 24, 2019 11:10
m2-assistant bot commented May 24, 2019

Hi @davidalger, thank you for your contribution!
Please complete the Contribution Survey; it will take less than a minute.
Your feedback will help us improve the contribution process.

magento-engcom-team pushed a commit that referenced this pull request May 24, 2019
@magento-engcom-team magento-engcom-team added this to the Release: 2.2.10 milestone May 24, 2019
@davidalger davidalger deleted the pcntl-scd-fix-22 branch March 30, 2020 15:14