runtime: random panics when running tests on RHEL 6.6 #13968
Rerun tests with GOTRACEBACK=2 |
11 is ECHILD, which clone(2) doesn't say it returns. I don't think the changes you highlighted are directly responsible; they just changed the pattern of access that pushed this machine over some limit. Is AppArmor or SELinux in play? Are there any odd entries in /etc/security (not 100% sure of the name)? What is the output from ulimit -a? I don't think qemu is related, unless the machine inside the qemu host is starved for memory. |
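For reference, here is a minimal standalone sketch (not from this thread) that decodes raw errno numbers as they appear in the crash output; the names and numbers below assume linux/amd64:

```go
// errno_decode.go: print what the raw errno numbers mean on linux/amd64.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// syscall.Errno implements error, so printing it yields the usual
	// strerror() text for the current platform.
	for _, e := range []syscall.Errno{syscall.ECHILD, syscall.EAGAIN} {
		fmt.Printf("errno %d = %v\n", int(e), e)
	}
	// On linux/amd64 this prints:
	//   errno 10 = no child processes
	//   errno 11 = resource temporarily unavailable
}
```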
Can you remove NFS from the equation? |
Yes, I don't think that running tests in parallel causes problems in newosproc and pthread_create; I just described how it changed the behavior on the virtual system. I reran the tests on the virtual machine in /tmp so that there is no NFS access. On the physical system there is no NFS at all. |
The output of ulimit -a is the same on both systems: core file size (blocks, -c) unlimited
SELinux seems to be present, because there are files in /etc/security and the command selinuxenabled returns 0. But I am quite sure that all SELinux settings are at their default values from the distribution installation. |
I disabled SELinux in /etc/selinux/config (SELINUX=disabled, and selinuxenabled returns 1 now), but it didn't change anything. |
I found a system with RHEL 7.1 (kernel version 3.10.0-229.el7.x86_64) and tried running the tests on it too. No problems encountered. So I am starting to think that there may be some bug specific to the RHEL 6.6 kernel. |
It turns out that error 11 is EAGAIN, not ECHILD. I have long suspected that there is a potential bug in the GNU/Linux support, but I have never been able to write a test case for it. The Linux kernel source code shows that if one thread calls clone while a different thread is calling exec, the call to clone can return EAGAIN (look for uses of in_exec in the kernel source code). That suggests that newosproc in runtime/os1_linux.go should check for that case, and loop calling clone again. But since I've never been able to write a test case showing the problem, I've never made the change. This kind of problem, if it is indeed the problem, could certainly be kernel specific. You could try applying this patch to runtime/os_linux.go to see if it fixes the problem.
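The proposed change lives inside the runtime (retry the clone call when it returns -_EAGAIN instead of crashing), so it cannot be demonstrated as a standalone program. As an illustration only, here is a rough stress sketch, not a test case from this thread and quite possibly unable to reproduce the failure, that creates many OS threads while another goroutine repeatedly execs a child process; the child command /bin/true, the thread count, and the sleep duration are arbitrary assumptions:

```go
// clone_stress.go: try to provoke clone(2) failures by creating OS
// threads while another thread is frequently inside exec.
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"sync"
	"time"
)

func main() {
	// Keep some thread busy in exec for the whole run.
	stop := make(chan struct{})
	go func() {
		for {
			select {
			case <-stop:
				return
			default:
				_ = exec.Command("/bin/true").Run()
			}
		}
	}()

	// Each goroutine locks itself to an OS thread and sleeps, so all of
	// them must hold a thread at the same time; the runtime has to
	// clone() a few hundred new threads to satisfy this. A failing
	// clone() shows up as "runtime: failed to create new OS thread".
	var wg sync.WaitGroup
	for i := 0; i < 300; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()
			time.Sleep(200 * time.Millisecond)
		}()
	}
	wg.Wait()
	close(stop)
	fmt.Println("finished without crashing; the failure was not reproduced")
}
```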
|
Thank you for the patch, but it didn't help so far. I tried modifying it a bit, because the comparison should be done with -_EAGAIN, but the panics still remain. My patch looks like this now:
and it produces errors like this:
There are also some cgo tests which fail in pthread_create: |
Thanks for trying it. In that case you should check how many processes are running in total on the machine, and how many are permitted (the limit that ulimit -u reports). I suppose you could also try adding a call to |
It is really a surprise to me, but increasing the ulimit from 1024 to 2048 helped. All tests passed, both on the virtual and the physical system. Ubuntu doesn't have any such limit set, and RHEL 7.1 has 4096, which is why the tests passed on those systems. I think this bug can be closed. |
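For anyone checking the same limit programmatically, here is a small Linux-only sketch (not from this thread). RLIMIT_NPROC, the resource behind ulimit -u, is not exported by the syscall package, so its Linux value 6 is hard-coded here as an assumption:

```go
// nproc_limit.go: print the per-user process/thread limit (ulimit -u).
package main

import (
	"fmt"
	"syscall"
)

const rlimitNproc = 6 // RLIMIT_NPROC on Linux

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(rlimitNproc, &lim); err != nil {
		fmt.Println("Getrlimit:", err)
		return
	}
	// Every OS thread the Go runtime clones counts against this limit,
	// so a soft limit of 1024 can be exhausted by a large parallel test
	// run plus everything else the user is running.
	fmt.Printf("max user processes: soft=%d hard=%d\n", lim.Cur, lim.Max)
}
```

Raising the soft limit before running the tests (for example with ulimit -u 2048 in the shell, or permanently via the files under /etc/security/limits.d on RHEL) corresponds to the fix described above.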
I first found this bug on an RHEL (Red Hat Enterprise Linux Server) 6.6 system, in a virtual qemu machine with 24 virtual processors.
log-virtual.txt
The Linux kernel version is 2.6.32-504.el6.x86_64. Everything used to work at revision 732e2cd, which I had previously checked out, so I bisected the problem and found the problematic commits: d513ee7 and f034ee8. I even created a patch which reverts these two commits and fixes the problem for me on the virtual machine.
fix-virtual.patch.txt
But later I found an RHEL 6.6 system which runs on physical hardware with 48 processors and no virtualization layer, and I tried running the tests on it. It turns out that the tests always crash on this machine. Even revision 732e2cd, which used to work for me on the virtual machine, produces the same result.
log-physical.txt
A similar system running Ubuntu 15.04 (kernel 3.19.0-15-generic) doesn't have any such problems.
The Go bootstrap compiler used in both cases is the recently released version 1.5.3.