Description
Go's locks make no guarantee of fairness. However, there exists at least one common pattern that exhibits severely unfair behavior, to the extent that it might be considered a serious bug. I propose we look into whether we can do anything in the sync.Mutex implementation that would ameliorate this severe unfairness without hurting the common case. I have one concrete suggestion at the bottom, but I hope to spur thought and discussion about this. I don't claim that this specific solution is the right one.
Problem
I've now seen two different reports of severe lock unfairness under roughly the same conditions, in two independent programs. The general problem occurs when two goroutines contend for a lock and one of them holds the lock almost all the time, releasing it only briefly, while the other leaves it free almost all the time and acquires it only briefly.
The demo program, available via go get rsc.io/tmp/lockskew, boils down to this:
// shared state
done := make(chan bool, 1)
var mu sync.Mutex

// goroutine 1
go func() {
	for {
		select {
		case <-done:
			return
		default:
			mu.Lock()
			time.Sleep(100 * time.Microsecond)
			mu.Unlock()
		}
	}
}()

// goroutine 2
for i := 0; i < n; i++ {
	time.Sleep(100 * time.Microsecond)
	mu.Lock()
	mu.Unlock()
}
done <- true
Here, goroutine 1 basically wants to hold the lock all the time, releasing it briefly, while goroutine 2 only wants to hold the lock for brief amounts of time. Developers naively (but I don't think completely unreasonably) expect that although goroutine 2 may not get the lock quite as much as goroutine 1, it will get the lock once in a while, maybe after a second or two at worst. In fact, on my Linux workstation it's common for goroutine 2 to take 100+ seconds to acquire the lock just once.
When goroutine 1 executes mu.Unlock(), it marks the lock unlocked and wakes up goroutine 2, intending that goroutine 2 will run and acquire the lock itself. But goroutine 2 does not run immediately. Goroutine 1 keeps running, goes around the loop again, calls mu.Lock(), and reacquires the lock. By the time goroutine 2 actually runs and tries to acquire the lock, goroutine 1 has taken it back. The basic pattern is:
g1 holds lock, sleeping
g2 calls mu.Lock
- finds lock locked
- adds itself to waiting queue
- sleeps
g1 wakes up from time.Sleep
g1 calls mu.Unlock
- marks lock unlocked
- tells runtime g2 is allowed to run
- returns
g1 checks channel (never ready)
g1 calls mu.Lock
- finds lock unlocked
- acquires lock
g1 calls time.Sleep, sleeps
g2 scheduled to run
- finds lock locked
- adds itself to waiting queue
- sleeps
Given that g2 is woken by g1 only once per 100µs, the fact that g2 takes 100+ seconds to acquire the lock means that this cycle repeats on the order of a million times (100 s / 100 µs = 1,000,000) before g1 is interrupted enough to let g2 run.
If you run lockskew, it will acquire the lock as much as it can and print details about its acquisitions, at most once per second. The performance is sensitive to the specific length of the sleep, but the results are all bad. On my system, here is lockskew with 100µs sleeps, as in the program above:
$ ./lockskew
2015/10/28 12:30:24 lock#0 took 180.773982s; average 180.773982s
2015/10/28 12:38:28 lock#1 took 484.603155s; average 332.688568s
2015/10/28 12:40:16 lock#2 took 107.852508s; average 257.743215s
2015/10/28 12:44:00 lock#3 took 223.592777s; average 249.205605s
2015/10/28 12:54:00 lock#4 took 600.067715s; average 319.378027s
2015/10/28 13:03:40 lock#5 took 579.896472s; average 362.797768s
2015/10/28 13:05:11 lock#6 took 91.223369s; average 324.001425s
2015/10/28 13:06:10 lock#8 took 59.019365s; average 258.622936s
2015/10/28 13:11:14 lock#9 took 303.023746s; average 263.063017s
2015/10/28 13:12:22 lock#10 took 68.975316s; average 245.418681s
2015/10/28 13:15:08 lock#11 took 165.555347s; average 238.763403s
2015/10/28 13:15:32 lock#12 took 23.713157s; average 222.221076s
2015/10/28 13:18:58 lock#13 took 205.748355s; average 221.044453s
2015/10/28 13:20:39 lock#14 took 101.183532s; average 213.053725s
2015/10/28 13:21:36 lock#15 took 56.859756s; average 203.291602s
2015/10/28 13:22:59 lock#16 took 83.056960s; average 196.218976s
2015/10/28 13:24:05 lock#17 took 66.815677s; average 189.029904s
And here is lockskew with 100ms sleeps:
$ ./lockskew -t 100ms
2015/10/28 13:26:01 lock#0 took 22.828276s; average 22.828276s
2015/10/28 13:28:04 lock#1 took 123.553891s; average 73.191083s
2015/10/28 13:31:17 lock#2 took 192.438909s; average 112.940359s
2015/10/28 13:31:24 lock#3 took 6.708422s; average 86.382375s
2015/10/28 13:33:04 lock#4 took 100.719121s; average 89.249724s
2015/10/28 13:34:06 lock#5 took 61.773499s; average 84.670353s
2015/10/28 13:34:11 lock#6 took 5.004274s; average 73.289485s
2015/10/28 13:34:14 lock#7 took 2.902557s; average 64.491119s
2015/10/28 13:34:17 lock#8 took 2.702412s; average 57.625707s
2015/10/28 13:34:34 lock#9 took 16.418097s; average 53.504946s
2015/10/28 13:34:39 lock#10 took 4.904286s; average 49.086704s
2015/10/28 13:35:31 lock#11 took 52.456904s; average 49.367554s
2015/10/28 13:37:03 lock#12 took 91.809099s; average 52.632288s
2015/10/28 13:37:17 lock#13 took 13.213271s; average 49.816644s
2015/10/28 13:37:23 lock#15 took 5.204634s; average 43.958641s
Barging vs Handoff
The fundamental problem here is that when g1 decides to wake up g2, it leaves the sync.Mutex unlocked and lets g2 deal with reacquiring it. In the time between the unlock and g2's eventual acquisition, any goroutine other than g2 (here g1, but in general any other goroutine) can arrive at mu.Lock and grab the mutex instead. This "drive-by acquisition" is sometimes called barging.
The alternative to barging is for g1 to leave the lock locked when waking g2, explicitly handing off the lock to g2. This coordinated handoff leaves the lock in the "locked" state the entire time, eliminating the window in which barging can occur.
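For intuition only, here is a toy handoff-style lock built from a channel. It is not a suggested replacement for sync.Mutex; the point is that, at least in the current runtime, a send delivers the token directly to a goroutine already blocked in a receive, so the lock never sits unowned where a late arrival could barge in.
// handoffLock is a toy illustration of handoff semantics, not a real mutex.
// A single token circulates through the channel; Unlock sends it back, and a
// goroutine already parked in Lock receives it directly.
type handoffLock struct {
	token chan struct{}
}

func newHandoffLock() *handoffLock {
	l := &handoffLock{token: make(chan struct{}, 1)}
	l.token <- struct{}{} // the lock starts out available
	return l
}

func (l *handoffLock) Lock()   { <-l.token }
func (l *handoffLock) Unlock() { l.token <- struct{}{} }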
In his work on java.util.concurrent, Doug Lea found that barging helped throughput [1], and that fact has entered the received wisdom surrounding locking. Following standard Java at the time, the evaluation used operating system threads and an operating system scheduler (specifically, Linux 2.4 NPTL on Pentium3/4 and Solaris 9 on UltraSparc2/3). As I understand it, barging is a win because it makes some bad operating system scheduling decisions impossible. Specifically, if the operating system delays the scheduling of g2, making the handoff delay long, not holding the lock during the handoff lets another thread do useful work in the interim.
As the test program and the real programs that inspired it show, however, at least in the Go implementation, barging enables very bad states. Also, the delays that barging helps avoid may not apply to Go, which primarily uses goroutines and its own goroutine scheduler.
Alternatives
It's worth noting that there is a known workaround to this problem. If you use a pair of locks, with mu2 protecting mu.Lock, then that makes barging impossible in the two-goroutine case and at least less of an issue in the case of more goroutines. Specifically, the above code changes to:
// shared state
done := make(chan bool, 1)
var mu sync.Mutex
var mu2 sync.Mutex

// goroutine 1
go func() {
	for {
		select {
		case <-done:
			return
		default:
			mu2.Lock()
			mu.Lock()
			mu2.Unlock()
			time.Sleep(100 * time.Microsecond)
			mu.Unlock()
		}
	}
}()

// goroutine 2
for i := 0; i < n; i++ {
	time.Sleep(100 * time.Microsecond)
	mu2.Lock()
	mu.Lock()
	mu2.Unlock()
	mu.Unlock()
}
done <- true
Now, because g1 sleeps without holding mu2, goroutine 2 can get mu2 and block in mu.Lock, and then when goroutine 1 goes back around its loop, it blocks on mu2.Lock and cannot barge in on goroutine 2's acquisition of mu.
If we take this approach in the lockskew test, it works much better:
$ ./lockskew -2
2015/10/28 12:27:17 lock#0 took 0.000163s; average 0.000163s
2015/10/28 12:27:18 lock#2910 took 0.000164s; average 0.000164s
2015/10/28 12:27:19 lock#5823 took 0.000163s; average 0.000164s
2015/10/28 12:27:20 lock#8727 took 0.000174s; average 0.000164s
2015/10/28 12:27:21 lock#11651 took 0.000166s; average 0.000164s
Note that there is only one print per second but the program is averaging over 2500 loop iterations per second, about 500,000X faster than above.
Ideally developers would not need to know about, much less use, this kind of hack. Another possibility is to wrap this idiom or some other implementation into a (say) sync.FairMutex with guaranteed fairness properties. But again ideally developers would not need to know or think about these awful details. I do not want to see blog posts explaining when to use sync.Mutex and when to use sync.FairMutex.
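For concreteness, such a wrapper amounts to packaging up the two-lock idiom shown above; the name FairMutex and this exact API are hypothetical, not a concrete proposal.
// Illustrative only: the two-lock idiom from the previous example, packaged up.
type FairMutex struct {
	gate sync.Mutex // serializes acquisition attempts on mu
	mu   sync.Mutex // the lock that actually protects the shared state
}

func (f *FairMutex) Lock() {
	f.gate.Lock()
	f.mu.Lock()
	f.gate.Unlock()
}

func (f *FairMutex) Unlock() {
	f.mu.Unlock()
}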
If Go's sync.Mutex can do something better by default without sacrificing performance, it seems it should.
Proposal
The proposal is that we detect severe unfairness and revert to handoffs in that case. How to do that is an open question, which is why this is not a design doc.
Performance
As an approximate lower bound on the cost of this in the general case, I added one line to the implementation of sync.Mutex's Unlock:
// Grab the right to wake someone.
new = (old - 1<<mutexWaiterShift) | mutexWoken
if atomic.CompareAndSwapInt32(&m.state, old, new) {
	runtime_Semrelease(&m.sema)
	runtime.Gosched() // <<< ADDED
	return
}
Today at least, the implementation of Gosched essentially guarantees that the goroutine just woken by the Semrelease runs next. This is not strictly speaking a handoff, since barging can still happen between the unlock and the acquisition in g2, but the window in which barging can occur is much smaller than it used to be, and in particular g1 is kept from reacquiring the lock. Anyway, the point is to evaluate how costly this might be, not to do a perfect job. A perfect job would require more lines changed in both sync.Mutex and the runtime (although perhaps not many).
With this line, lockskew with only a single lock runs at about the same speed as with two locks:
$ ./lockskew.fast
2015/10/28 13:49:02 lock#0 took 0.000155s; average 0.000155s
2015/10/28 13:49:03 lock#2934 took 0.000165s; average 0.000162s
2015/10/28 13:49:04 lock#5861 took 0.000163s; average 0.000162s
2015/10/28 13:49:05 lock#8783 took 0.000166s; average 0.000162s
2015/10/28 13:49:06 lock#11712 took 0.000162s; average 0.000162s
The pathological case is gone, but what about the common case? Barging is supposed to help throughput, so one expects that average throughput must have gone down. Doug Lea's paper uses as its benchmark multiple threads calling a random number generator protected by a lock: each call acquires the lock, does a tiny amount of work, and releases the lock. The paper found that with 256 threads running, even on single-processor systems, implementations allowing barging ran 10-100X faster than handoffs. Again, this was with operating system threads, not goroutines, so the result may not carry over directly; a goroutine switch is much cheaper than a thread switch.
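For reference, here is a rough sketch of that style of benchmark in Go. It is illustrative only, not the actual code in rsc.io/tmp/lockskew; a one-step 64-bit generator stands in for the small amount of work done under the lock.
var (
	benchMu   sync.Mutex
	benchSeed uint64 = 1
)

// lockedStep does a tiny amount of work while holding the lock.
func lockedStep() uint64 {
	benchMu.Lock()
	benchSeed = benchSeed*6364136223846793005 + 1442695040888963407
	x := benchSeed
	benchMu.Unlock()
	return x
}

func BenchmarkLockedStep(b *testing.B) {
	// RunParallel spreads b.N iterations across GOMAXPROCS goroutines,
	// all contending for benchMu.
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			lockedStep()
		}
	})
}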
In fact, the result does not seem to carry over to goroutines, at least not in the random number case. Running go test -bench . rsc.io/tmp/lockskew benchmarks Go's rand.Int (with the same mutex-around-simple-math pattern) called from multiple goroutines simultaneously, from 1 to 256 of them. As a worst case, I ran it with GOMAXPROCS=256 even though my system has only 12 cores. Here is the effect of adding the Gosched:
name old time/op new time/op delta
Rand1-256 33.2ns ± 2% 32.8ns ± 1% -1.24% (p=0.008 n=10+9)
Rand2-256 40.4ns ±60% 33.4ns ±16% ~ (p=0.203 n=10+11)
Rand4-256 143ns ± 9% 140ns ± 3% ~ (p=0.098 n=9+11)
Rand8-256 230ns ± 4% 208ns ± 3% -9.46% (p=0.000 n=9+12)
Rand12-256 267ns ± 2% 236ns ± 1% -11.81% (p=0.000 n=10+11)
Rand16-256 272ns ± 1% 250ns ± 2% -8.12% (p=0.000 n=9+10)
Rand32-256 283ns ± 1% 269ns ± 2% -4.92% (p=0.000 n=10+10)
Rand64-256 284ns ± 2% 279ns ± 2% -1.90% (p=0.000 n=10+11)
Rand128-256 279ns ± 3% 280ns ± 2% ~ (p=0.591 n=10+11)
Rand256-256 276ns ± 1% 288ns ± 2% +4.23% (p=0.000 n=10+10)
For the most part, the Gosched makes things better, not worse.
Although this microbenchmark doesn't show it, there may be a small degradation in real programs. I ran the golang.org/x/benchmarks/bench program with -bench=http, with and without the change; the change seemed to cause a roughly 5% increase in the reported "RunTime" result.
Perhaps with work to detect and fall back instead of having the "handoff" on always, we could keep real programs running as they do today but fix these pathological (but real) behaviors at the same time. For example, one idea is to have the blocked mu.Lock call do the detection. In the pathological behavior, it goes to sleep, is awoken, and then can't acquire the lock and goes back to sleep. If it is woken up to find the lock unavailable (say) three times in a row, it could set a "now we're in handoff mode" bit in the lock word. The handoff mode bit could be sticky for the life of the lock, or it could be cleared on a call to mu.Unlock that finds no goroutines waiting.
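A toy, self-contained sketch of that shape is below. It is emphatically not the real sync.Mutex, which packs its state into a single atomically-updated word and parks waiters on a runtime semaphore; the sketch only shows where the detection and the handoff would live, using an ordinary mutex and channels for the bookkeeping.
// toyLock starts out allowing barging and switches itself to handoff mode
// after a waiter is woken three times in a row without getting the lock.
type toyLock struct {
	mu      sync.Mutex  // protects the fields below
	held    bool        // whether the lock is currently held
	handoff bool        // if set, Unlock hands the lock directly to a waiter
	waiters []chan bool // one wakeup channel per blocked Lock call
}

func (l *toyLock) Lock() {
	failures := 0
	woken := false
	for {
		l.mu.Lock()
		if !l.held {
			l.held = true
			l.mu.Unlock()
			return
		}
		if woken {
			// Woken by an Unlock but found the lock taken again: barging won.
			failures++
			if failures >= 3 {
				l.handoff = true // enter handoff mode for this lock
			}
		}
		ch := make(chan bool, 1)
		l.waiters = append(l.waiters, ch)
		l.mu.Unlock()
		if <-ch {
			return // handoff: the unlocker left the lock held, and it is now ours
		}
		woken = true
	}
}

func (l *toyLock) Unlock() {
	l.mu.Lock()
	if len(l.waiters) == 0 {
		l.held = false
		l.handoff = false // clear the sticky bit when nobody is waiting
		l.mu.Unlock()
		return
	}
	ch := l.waiters[0]
	l.waiters = l.waiters[1:]
	if l.handoff {
		l.mu.Unlock()
		ch <- true // lock stays marked held; ownership passes to the waiter
		return
	}
	l.held = false // barging allowed: unlock, then let the waiter race for it
	l.mu.Unlock()
	ch <- false
}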
[1] http://gee.cs.oswego.edu/dl/papers/aqs.pdf but ignore section 5.2. @RLH says the theoretical analysis there is known to be flawed.