Skip to content

misc/cgo/test: TestParallelSleep is unreliable on arm #4008

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
davecheney opened this issue Aug 27, 2012 · 16 comments
Closed

misc/cgo/test: TestParallelSleep is unreliable on arm #4008

davecheney opened this issue Aug 27, 2012 · 16 comments

Comments

@davecheney
Copy link
Contributor

What steps will reproduce the problem?

diff -r c552fb2b6a6c misc/cgo/test/issue1560.go
--- a/misc/cgo/test/issue1560.go        Sun Aug 26 11:17:55 2012 -0700
+++ b/misc/cgo/test/issue1560.go        Mon Aug 27 12:49:04 2012 +1000
@@ -40,7 +40,7 @@
        if runtime.GOARCH == "arm" {
                // on ARM, the 1.3s deadline is frequently missed,
                // so increase sleep time to 2s
-               sleepSec = 2
+               sleepSec = 20
        }
        start := time.Now()
        parallelSleep(sleepSec)

cd $GOROOT/misc/cgo/test
go test -c
while ./test.test ; do echo ; done

What is the expected output? What do you see instead?

expected: PASS

actual:

pando(~/go/misc/cgo/test) % while ./test.test ; do echo ; done
testLibgcc is disabled on ARM because 5l cannot handle thumb library.
testLibgcc is disabled on ARM because 5l cannot handle thumb library.
testLibgcc is disabled on ARM because 5l cannot handle thumb library.
testLibgcc is disabled on ARM because 5l cannot handle thumb library.
testLibgcc is disabled on ARM because 5l cannot handle thumb library.
PASS

--- FAIL: TestParallelSleep (35.00 seconds)
_cgo_gotypes.go:751:    parallel 20-second sleeps slept for 34.997584 seconds
testLibgcc is disabled on ARM because 5l cannot handle thumb library.
testLibgcc is disabled on ARM because 5l cannot handle thumb library.
testLibgcc is disabled on ARM because 5l cannot handle thumb library.
testLibgcc is disabled on ARM because 5l cannot handle thumb library.
testLibgcc is disabled on ARM because 5l cannot handle thumb library.
FAIL

Please use labels and text to provide additional information.

This test fails on dual core ARM7 hosts. I cannot trigger the failure, even using the
original 2s timeout, on a single core ARM5 host.
@minux
Copy link
Member

minux commented Aug 27, 2012

Comment 1:

could you please attach a copy of the log file produced by:
strace -olog -f -tt ./test.test -test.v -test.run=TestParallelSleep
when the test fails? I can't reproduce it on my pandaboard es.
i suspect it has something to do with thread scheduling (if i lower the cpu frequency,
the test has a bigger probability to fail).

@davecheney
Copy link
Contributor Author

Comment 2:

Thank you for your suggestions, here is a log file from the first failed test.
note: for this test I reverted the timeout to 2 seconds.

Attachments:

  1. 4008.log (17160 bytes)

@davecheney
Copy link
Contributor Author

Comment 3:

Here is another example at the 20 second delay mark, which should show the problem more
clearly.
pando(~/go/misc/cgo/test) % strace -olog -f -tt ./test.test -test.v
-test.run=TestParallelSleep
=== RUN TestParallelSleep
--- FAIL: TestParallelSleep (27.77 seconds)
_cgo_gotypes.go:751:    parallel 20-second sleeps slept for 27.772286 seconds
FAIL

Attachments:

  1. 4008-2.log (17110 bytes)

@minux
Copy link
Member

minux commented Aug 29, 2012

Comment 4:

nothing suspicious found in the strace log.
it simply turns out the nanosleep(2) syscall is very unreliable on ARM.
in my experience, it might have something to do with cpu frequency scaling.
you can try this:
put a cpu intensive single thread program in the background, and rerun the test
to see if this helps. (i think it all threads are sleeping, the CPU might enter
low power mode which will take quite some time to wakeup)
on my builder, after i switched to userspace cpufreq governor, this problem
disappeared for quite some time.
(my userspace governor simply change frequency by measuring temperature,
and it will try hard to maintain at least 700MHz, as cpu frequency lower than
that has much higher probability to fail this test)

@davecheney
Copy link
Contributor Author

Comment 5:

Yeah, this is very weird. I'll try some experiments tomorrow. I would buy
aggressive power management would cause the machine to miss the deadline by
a few hundred ms, but I can't see that explaining the failure when the
timeout is raised to 20 seconds.

@minux
Copy link
Member

minux commented Aug 29, 2012

Comment 6:

my hypothesis is that the kernel sleep granularity is proportional to
the time you want it to sleep.

@minux
Copy link
Member

minux commented Aug 29, 2012

Comment 7:

issue #3964 is also about this test.

@davecheney
Copy link
Contributor Author

Comment 8:

re: comment 7, I wouldn't expect nanosleep to be accurate inside a VM unless something
like vmware-tools (no idea what this is for FreeBSD/VirtualBox) is installed.
After playing around with my pandaboard this morning, the following is clear
pando(~/src) % cat usleep.c 
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
        struct timeval start, stop;
        struct timespec sleep;
        sleep.tv_sec = 1;
        if (argc > 1) 
                sleep.tv_sec = atoi(argv[1]);
        sleep.tv_nsec = 0;
        gettimeofday(&start, NULL);
        nanosleep(&sleep, NULL);
        gettimeofday(&stop, NULL);
        printf("Elapsed %d\n", (int)((stop.tv_sec-start.tv_sec)*1000000ULL+(stop.tv_usec-start.tv_usec)));
        return 0;
}
with /sys/devices/system/cpu/cpu?/cpufreq/scaling_governor set to performance the
results are very stable
% ./a.out 1; ./a.out 2; ./a.out 3
Elapsed 1000093
Elapsed 2000121
Elapsed 3000030
With the governor left the default, ondemand, the results flail wildly based on the load
on the machine at the time, and if that load occured after the nanosleep started.
% ./a.out 1; ./a.out 2; ./a.out 3
Elapsed 1704405
Elapsed 2071116
Elapsed 3232047
So this function sleeps for longer (sometimes MUCH longer) than requested, which is both
correct according to POSIX and unhelpful at the same time.
Comparing these results to the older ARM5 systems I have, none of them have the ability
to vary their clock frequency so this explains why it doesn't fail there.
My suggestion is to either keep extending the timeout for arm, possibly by as much as
3x, or skip it entirely.

Status changed to Accepted.

@davecheney
Copy link
Contributor Author

Comment 9:

http://golang.org/cl/6488056 contains a fix which I am testing now.
An alternative suggestion could be to start a CPU wasting goroutine to peg the CPU,
causing the power management to raise the CPU frequency, which will make nanosleep more
accurate.

@davecheney
Copy link
Contributor Author

Comment 10:

http://golang.org/cl/6498060/ contains the alternative fix which turns out to
be more reliable, but might be overkill.

@davecheney
Copy link
Contributor Author

Comment 11:

Owner changed to @davecheney.

@alberts
Copy link
Contributor

alberts commented Aug 30, 2012

Comment 12:

Rather than trying to trick the governor (which might stop working tomorrow), how about
some code that warns the user that they don't have the performance governor selected and
that their test might fail? This could potentially help other people who fire up
linux/arm for the first time to realise that they need to be careful about benchmarks.

@alberts
Copy link
Contributor

alberts commented Aug 30, 2012

Comment 13:

In fact, it could help almost anyone benchmarking on Linux. I've seen governor-related
effects in many of our own tests and benchmarks.

@davecheney
Copy link
Contributor Author

Comment 14:

@fullung: thank you for your comments.
I want to be careful about the scope of this ticket. This is not a benchmarking issue,
it just happens to be testing a function that has a fixed time envelope. I think your
comments about benchmarking are valuable, and echo comments from the 'profiling go
programs' blog post, but maybe they should go in a Wiki page. 
With respect to this ticket, if there is little support for this devious and disgusting
workaround, maybe the test should be skipped on arm, or at least only enabled
conditionally.

@alberts
Copy link
Contributor

alberts commented Aug 30, 2012

Comment 15:

I think the most important goal is reliable tests. You probably agree, since you filed
this bug.
There is a tradeoff between skipping tests on unreliable platforms and missing bugs, and
having to deal with noise from spurious test failures.
There's also the question of whether it's the platform or the test itself that is
unreliable.
I guess this clever trick tries to fix the test without changing the platform. I think
my suggestion is: skip the test if you can reliably detect that the platform is
unreliable.
The clever trick fails if the platform is more unreliable than it anticipates.
The skip approach fails if and when it misses a bug.
Finally, to figure out if this specific clever trick is good enough at preventing
spurious failures, you probably have to send it out into the wild to face a hoard of
linux/arm systems for a few months.
All that being said, I think both approaches have about equal merit. Good luck. :-)

@davecheney
Copy link
Contributor Author

Comment 16:

This issue was closed by revision d073677.

Status changed to Fixed.

@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants