-
Notifications
You must be signed in to change notification settings - Fork 18k
misc/cgo/test: TestParallelSleep is unreliable on arm #4008
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Labels
Comments
could you please attach a copy of the log file produced by: strace -olog -f -tt ./test.test -test.v -test.run=TestParallelSleep when the test fails? I can't reproduce it on my pandaboard es. i suspect it has something to do with thread scheduling (if i lower the cpu frequency, the test has a bigger probability to fail). |
Thank you for your suggestions, here is a log file from the first failed test. note: for this test I reverted the timeout to 2 seconds. Attachments:
|
Here is another example at the 20 second delay mark, which should show the problem more clearly. pando(~/go/misc/cgo/test) % strace -olog -f -tt ./test.test -test.v -test.run=TestParallelSleep === RUN TestParallelSleep --- FAIL: TestParallelSleep (27.77 seconds) _cgo_gotypes.go:751: parallel 20-second sleeps slept for 27.772286 seconds FAIL Attachments:
|
nothing suspicious found in the strace log. it simply turns out the nanosleep(2) syscall is very unreliable on ARM. in my experience, it might have something to do with cpu frequency scaling. you can try this: put a cpu intensive single thread program in the background, and rerun the test to see if this helps. (i think it all threads are sleeping, the CPU might enter low power mode which will take quite some time to wakeup) on my builder, after i switched to userspace cpufreq governor, this problem disappeared for quite some time. (my userspace governor simply change frequency by measuring temperature, and it will try hard to maintain at least 700MHz, as cpu frequency lower than that has much higher probability to fail this test) |
issue #3964 is also about this test. |
re: comment 7, I wouldn't expect nanosleep to be accurate inside a VM unless something like vmware-tools (no idea what this is for FreeBSD/VirtualBox) is installed. After playing around with my pandaboard this morning, the following is clear pando(~/src) % cat usleep.c #include <time.h> #include <stdio.h> #include <stdlib.h> int main(int argc, char **argv) { struct timeval start, stop; struct timespec sleep; sleep.tv_sec = 1; if (argc > 1) sleep.tv_sec = atoi(argv[1]); sleep.tv_nsec = 0; gettimeofday(&start, NULL); nanosleep(&sleep, NULL); gettimeofday(&stop, NULL); printf("Elapsed %d\n", (int)((stop.tv_sec-start.tv_sec)*1000000ULL+(stop.tv_usec-start.tv_usec))); return 0; } with /sys/devices/system/cpu/cpu?/cpufreq/scaling_governor set to performance the results are very stable % ./a.out 1; ./a.out 2; ./a.out 3 Elapsed 1000093 Elapsed 2000121 Elapsed 3000030 With the governor left the default, ondemand, the results flail wildly based on the load on the machine at the time, and if that load occured after the nanosleep started. % ./a.out 1; ./a.out 2; ./a.out 3 Elapsed 1704405 Elapsed 2071116 Elapsed 3232047 So this function sleeps for longer (sometimes MUCH longer) than requested, which is both correct according to POSIX and unhelpful at the same time. Comparing these results to the older ARM5 systems I have, none of them have the ability to vary their clock frequency so this explains why it doesn't fail there. My suggestion is to either keep extending the timeout for arm, possibly by as much as 3x, or skip it entirely. Status changed to Accepted. |
http://golang.org/cl/6488056 contains a fix which I am testing now. An alternative suggestion could be to start a CPU wasting goroutine to peg the CPU, causing the power management to raise the CPU frequency, which will make nanosleep more accurate. |
http://golang.org/cl/6498060/ contains the alternative fix which turns out to be more reliable, but might be overkill. |
Owner changed to @davecheney. |
Rather than trying to trick the governor (which might stop working tomorrow), how about some code that warns the user that they don't have the performance governor selected and that their test might fail? This could potentially help other people who fire up linux/arm for the first time to realise that they need to be careful about benchmarks. |
@fullung: thank you for your comments. I want to be careful about the scope of this ticket. This is not a benchmarking issue, it just happens to be testing a function that has a fixed time envelope. I think your comments about benchmarking are valuable, and echo comments from the 'profiling go programs' blog post, but maybe they should go in a Wiki page. With respect to this ticket, if there is little support for this devious and disgusting workaround, maybe the test should be skipped on arm, or at least only enabled conditionally. |
I think the most important goal is reliable tests. You probably agree, since you filed this bug. There is a tradeoff between skipping tests on unreliable platforms and missing bugs, and having to deal with noise from spurious test failures. There's also the question of whether it's the platform or the test itself that is unreliable. I guess this clever trick tries to fix the test without changing the platform. I think my suggestion is: skip the test if you can reliably detect that the platform is unreliable. The clever trick fails if the platform is more unreliable than it anticipates. The skip approach fails if and when it misses a bug. Finally, to figure out if this specific clever trick is good enough at preventing spurious failures, you probably have to send it out into the wild to face a hoard of linux/arm systems for a few months. All that being said, I think both approaches have about equal merit. Good luck. :-) |
This issue was closed by revision d073677. Status changed to Fixed. |
This issue was closed.
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
The text was updated successfully, but these errors were encountered: