Description
I first hit this bug on a RHEL (Red Hat Enterprise Linux Server) 6.6 system running in a QEMU virtual machine with 24 virtual processors.
log-virtual.txt
The Linux kernel version is 2.6.32-504.el6.x86_64. Everything used to work at revision 732e2cd, which I had previously checked out, so I bisected the problem and found two problematic commits: d513ee7 and f034ee8. I even created a patch that reverts these two commits, and it fixed the problem for me on the virtual machine.
fix-virtual.patch.txt
Later I found a RHEL 6.6 system that runs on physical hardware with 48 processors and no virtualization layer, and I tried running the tests on it. The tests always crash on this machine. Even revision 732e2cd, which used to work for me on the virtual machine, produces the same result.
log-physical.txt
A similar system running Ubuntu 15.04 (kernel 3.19.0-15-generic) doesn't have any such problems.
The Go bootstrap compiler used in both cases is the recently released version 1.5.3.
Activity
gshimansky commented on Jan 15, 2016
I reran the tests with GOTRACEBACK=2
log-virtual.txt
log-physical.txt
davecheney commented on Jan 15, 2016
11 is ECHILD, which clone(2) doesn't say it returns. I don't think the changes you highlighted are directly responsible; they just changed the pattern of access that pushed this machine over some limit.
Is AppArmor or SELinux in play? Are there any odd entries in /etc/security (I'm not 100% sure of the name)? What is the output from ulimit -a on unaffected and affected machines?
I don't think qemu is related, unless the machine inside the qemu host is starved for memory.
davecheney commented on Jan 15, 2016
Can you remove NFS from the equation?
gshimansky commented on Jan 15, 2016
Yes, I don't think that running tests in parallel causes the problems in newosproc and pthread_create. I just described how it changed the behavior on the virtual system.
I reran the tests on the virtual machine in /tmp so that there is no NFS access.
log-virtual.txt
The physical system does not use NFS in the first place.
gshimansky commented on Jan 15, 2016
The output of ulimit -a is the same on both systems:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515268
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
SELinux seems to be present because there are files in /etc/security and the command selinuxenabled returns 0. But I am quite sure that all SELinux settings have kept their default values since the distribution was installed.
gshimansky commented on Jan 15, 2016
I disabled SELinux in /etc/selinux/config (SELINUX=disabled, and selinuxenabled returns 1 now), but it didn't change anything.
log-virtual-noselinux.txt
gshimansky commented on Jan 15, 2016
I found a system with RHEL 7.1 (kernel version 3.10.0-229.el7.x86_64) and tried running the tests on it too. No problems encountered. So I am starting to think that there may be some bugs specific to the RHEL 6.6 kernel.
Title changed from "Random panics when running tests on RHEL 6.6" to "runtime: random panics when running tests on RHEL 6.6"
ianlancetaylor commented on Jan 15, 2016
It turns out that error 11 is EAGAIN, not ECHILD.
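The errno mapping can be checked from Go itself. The following is a minimal sketch, assuming a Linux target (errno numbers differ on other operating systems); describeErrno is a hypothetical helper written for this illustration, not part of the runtime.

```go
package main

import (
	"fmt"
	"syscall"
)

// describeErrno reports whether the raw errno value n equals EAGAIN or
// ECHILD on the platform this program is built for, plus its message text.
// Hypothetical helper for illustration only.
func describeErrno(n int) (isEAGAIN, isECHILD bool, msg string) {
	e := syscall.Errno(n)
	return e == syscall.EAGAIN, e == syscall.ECHILD, e.Error()
}

func main() {
	isEAGAIN, isECHILD, msg := describeErrno(11)
	fmt.Printf("errno 11: EAGAIN=%v ECHILD=%v message=%q\n", isEAGAIN, isECHILD, msg)
}
```

On Linux this should report EAGAIN=true and ECHILD=false, since ECHILD is 10 there.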
I have long suspected that there is a potential bug in the GNU/Linux support, but I have never been able to write a test case for it. The Linux kernel source code shows that if one thread calls clone while a different thread is calling exec, the call to clone can return EAGAIN (look for uses of in_exec in the kernel source code). That suggests that newosproc in runtime/os1_linux.go should check for that case, and loop calling clone again. But since I've never been able to write a test case showing the problem, I've never made the change.
This kind of problem, if it is indeed the problem, could certainly be kernel specific.
You could try applying this patch to runtime/os_linux.go to see if it fixes the problem.
gshimansky commentedon Jan 15, 2016
Thank you for the patch, but it hasn't helped so far. I modified it a bit because the comparison should be done against -_EAGAIN, but the panics remain. My patch looks like this now
and it produces errors like this:
There are also some cgo tests which fail in pthread_create:
runtime/cgo: pthread_create failed: Resource temporarily unavailable
ianlancetaylor commented on Jan 15, 2016
Thanks for trying it. In that case you should check how many processes are running in total on the machine, and how many are permitted. The latter is ulimit -u.
I suppose you could also try adding a call to usleep(1) in the loop.
gshimansky commented on Jan 18, 2016
It really surprised me, but increasing ulimit -u (max user processes) from 1024 to 2048 helped. All tests passed on both the virtual and physical systems. Ubuntu doesn't have any limit set, and RHEL 7.1 has 4096, which is why the tests passed on those systems.
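The same limit can be read from inside a Go program, which may help when comparing machines. This is a rough sketch assuming Linux on x86-64: as far as I know the standard syscall package does not export RLIMIT_NPROC, so its number (6 on that platform) is hard-coded as an assumption, and maxUserProcesses is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"syscall"
)

// rlimitNPROC is the resource number of RLIMIT_NPROC on Linux x86-64.
// Assumption: hard-coded here because the standard syscall package does
// not appear to export it; other architectures may use a different value.
const rlimitNPROC = 6

// maxUserProcesses returns the soft and hard "max user processes" limits,
// the values that ulimit -u and ulimit -H -u report in a shell.
func maxUserProcesses() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	err = syscall.Getrlimit(rlimitNPROC, &lim)
	if err != nil {
		return 0, 0, err
	}
	return lim.Cur, lim.Max, nil
}

func main() {
	soft, hard, err := maxUserProcesses()
	if err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	// An unlimited setting shows up as the largest uint64 value.
	fmt.Printf("max user processes: soft=%d hard=%d\n", soft, hard)
}
```

On Linux this limit counts threads as well as processes for the user, so a heavily parallel test run on a machine with many processors can reach 1024 sooner than the process count alone would suggest.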
I think this bug can be closed.