
runtime: fall back to fair locks after repeated sleep-acquire failures #13086

Closed · rsc opened this issue Oct 28, 2015 · 18 comments

rsc (Contributor) commented Oct 28, 2015

Go's locks make no guarantee of fairness. However, there is at least one common pattern that exhibits severely unfair behavior, to the extent that it might be considered a serious bug. I propose we look into whether we can do anything in the sync.Mutex implementation that would ameliorate this severe unfairness without hurting the common case. I have one concrete suggestion at the bottom, but I hope to spur thought and discussion about this. I don't claim that this specific solution is the right one.

Problem

I've seen two different reports now of severe lock unfairness under roughly the same conditions, in two independent programs. The general problem occurs when there are two goroutines contending for a lock and one of them holds it for a long time while only briefly releasing it, and the other releases it for a long time and only briefly acquires it.

The demo program, available via go get rsc.io/tmp/lockskew, boils down to this:

// shared state
done := make(chan bool, 1)
var mu sync.Mutex

// goroutine 1
go func() {
    for {
        select {
        case <-done:
            return
        default:
            mu.Lock()
            time.Sleep(100 * time.Microsecond)
            mu.Unlock()
        }
    }
}()

// goroutine 2
for i := 0; i < n; i++ {
    time.Sleep(100 * time.Microsecond)
    mu.Lock()
    mu.Unlock()
}
done <- true

Here, goroutine 1 basically wants to hold the lock all the time, releasing it briefly, while goroutine 2 only wants to hold the lock for brief amounts of time. Developers naively (but I don't think completely unreasonably) expect that although goroutine 2 may not get the lock quite as much as goroutine 1, it will get the lock once in a while, maybe after a second or two at worst. In fact, on my Linux workstation it's common for goroutine 2 to take 100+ seconds to acquire the lock just once.

When goroutine 1 executes mu.Unlock(), it marks the lock unlocked and wakes up goroutine 2, intending that goroutine 2 will run and acquire the lock itself. But goroutine 2 does not run immediately. Goroutine 1 keeps running, goes around the loop again, calls mu.Lock(), and reacquires the lock. By the time goroutine 2 actually runs and tries to acquire the lock, goroutine 1 has taken it back. The basic pattern is:

g1 holds lock, sleeping
    g2 calls mu.Lock
    - finds lock locked
    - adds itself to waiting queue
    - sleeps
g1 wakes up from time.Sleep
g1 calls mu.Unlock
- marks lock unlocked
- tells runtime g2 is allowed to run
- returns
g1 checks channel (never ready)
g1 calls mu.Lock
- finds lock unlocked
- acquires lock
g1 calls time.Sleep, sleeps
    g2 scheduled to run
    - finds lock locked
    - adds itself to waiting queue
    - sleeps

Given that g2 is woken by g1 only once per 100µs, the fact that g2 takes 100+ seconds to acquire the lock means that this cycle repeats millions of times before g1 is interrupted enough to let g2 run.

If you run lockskew, it will acquire the lock as much as it can and print about its acquisitions at most once per second. The behavior is sensitive to the specific length of the sleep, but all variants are bad. On my system, here is lockskew with 100µs sleeps, as in the program above:

$ ./lockskew
2015/10/28 12:30:24 lock#0 took 180.773982s; average 180.773982s
2015/10/28 12:38:28 lock#1 took 484.603155s; average 332.688568s
2015/10/28 12:40:16 lock#2 took 107.852508s; average 257.743215s
2015/10/28 12:44:00 lock#3 took 223.592777s; average 249.205605s
2015/10/28 12:54:00 lock#4 took 600.067715s; average 319.378027s
2015/10/28 13:03:40 lock#5 took 579.896472s; average 362.797768s
2015/10/28 13:05:11 lock#6 took 91.223369s; average 324.001425s
2015/10/28 13:06:10 lock#8 took 59.019365s; average 258.622936s
2015/10/28 13:11:14 lock#9 took 303.023746s; average 263.063017s
2015/10/28 13:12:22 lock#10 took 68.975316s; average 245.418681s
2015/10/28 13:15:08 lock#11 took 165.555347s; average 238.763403s
2015/10/28 13:15:32 lock#12 took 23.713157s; average 222.221076s
2015/10/28 13:18:58 lock#13 took 205.748355s; average 221.044453s
2015/10/28 13:20:39 lock#14 took 101.183532s; average 213.053725s
2015/10/28 13:21:36 lock#15 took 56.859756s; average 203.291602s
2015/10/28 13:22:59 lock#16 took 83.056960s; average 196.218976s
2015/10/28 13:24:05 lock#17 took 66.815677s; average 189.029904s

And here is lockskew with 100ms sleeps:

$ ./lockskew -t 100ms
2015/10/28 13:26:01 lock#0 took 22.828276s; average 22.828276s
2015/10/28 13:28:04 lock#1 took 123.553891s; average 73.191083s
2015/10/28 13:31:17 lock#2 took 192.438909s; average 112.940359s
2015/10/28 13:31:24 lock#3 took 6.708422s; average 86.382375s
2015/10/28 13:33:04 lock#4 took 100.719121s; average 89.249724s
2015/10/28 13:34:06 lock#5 took 61.773499s; average 84.670353s
2015/10/28 13:34:11 lock#6 took 5.004274s; average 73.289485s
2015/10/28 13:34:14 lock#7 took 2.902557s; average 64.491119s
2015/10/28 13:34:17 lock#8 took 2.702412s; average 57.625707s
2015/10/28 13:34:34 lock#9 took 16.418097s; average 53.504946s
2015/10/28 13:34:39 lock#10 took 4.904286s; average 49.086704s
2015/10/28 13:35:31 lock#11 took 52.456904s; average 49.367554s
2015/10/28 13:37:03 lock#12 took 91.809099s; average 52.632288s
2015/10/28 13:37:17 lock#13 took 13.213271s; average 49.816644s
2015/10/28 13:37:23 lock#15 took 5.204634s; average 43.958641s

Barging vs Handoff

The fundamental problem here is that when g1 decides to wake up g2, it leaves the sync.Mutex unlocked and lets g2 deal with reacquiring it. In the time between those two events, any goroutine other than g2 (here g1, but in general any other goroutine) can arrive at mu.Lock and grab the mutex instead of g2. This "drive-by acquisition" is sometimes called barging.

The alternative to barging is for g1 to leave the lock locked when waking g2, explicitly handing off the lock to g2. This coordinated handoff leaves the lock in the "locked" state the entire time, eliminating the window in which barging can occur.
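
For intuition, here is a toy sketch of handoff semantics using a one-slot channel as the mutex. It leans on a current-runtime implementation detail (a receive that finds a blocked sender moves that sender's token into the buffer atomically), which is not a documented guarantee, so treat it as an illustration of the semantics rather than a fix:

// handoffMutex hands the lock directly to a blocked waiter: when
// Unlock receives while a goroutine is blocked in Lock's send, the
// runtime refills the buffer with that sender's token immediately,
// leaving no window in which a later Lock call can barge in.
type handoffMutex chan struct{}

func newHandoffMutex() handoffMutex { return make(handoffMutex, 1) }

func (m handoffMutex) Lock()   { m <- struct{}{} }
func (m handoffMutex) Unlock() { <-m }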

In his work on java.util.concurrent, Doug Lea found that barging helped throughput [1], and that fact has entered the received wisdom surrounding locking. Following standard Java at the time, the evaluation used operating system threads and an operating system scheduler (specifically, Linux 2.4 NPTL on Pentium3/4 and Solaris 9 on UltraSparc2/3). As I understand it, barging is a win because it makes some bad operating system scheduling decisions impossible. Specifically, if the operating system delays the scheduling of g2, making the handoff delay long, not holding the lock during the handoff lets another thread do useful work in the interim.

As the test program and the real programs that inspired it show, however, at least in the Go implementation, barging enables very bad states. Also, the delays that barging helps avoid may not apply to Go, which is primarily using goroutines and a goroutine scheduler.

Alternatives

It's worth noting that there is a known workaround for this problem. If you use a pair of locks, with mu2 protecting mu.Lock, barging becomes impossible in the two-goroutine case and at least less of an issue with more goroutines. Specifically, the code above changes to:

// shared state
done := make(chan bool, 1)
var mu sync.Mutex
var mu2 sync.Mutex

// goroutine 1
go func() {
    for {
        select {
        case <-done:
            return
        default:
            mu2.Lock()
            mu.Lock()
            mu2.Unlock()
            time.Sleep(100 * time.Microsecond)
            mu.Unlock()
        }
    }
}()

// goroutine 2
for i := 0; i < n; i++ {
    time.Sleep(100 * time.Microsecond)
    mu2.Lock()
    mu.Lock()
    mu2.Unlock()
    mu.Unlock()
}
done <- true

Now, because g1 sleeps without holding mu2, goroutine 2 can acquire mu2 and block in mu.Lock; then, when goroutine 1 goes back around its loop, it blocks on mu2.Lock and cannot barge in on goroutine 2's acquisition of mu.

If we take this approach in the lockskew test, it works much better:

$ ./lockskew -2
2015/10/28 12:27:17 lock#0 took 0.000163s; average 0.000163s
2015/10/28 12:27:18 lock#2910 took 0.000164s; average 0.000164s
2015/10/28 12:27:19 lock#5823 took 0.000163s; average 0.000164s
2015/10/28 12:27:20 lock#8727 took 0.000174s; average 0.000164s
2015/10/28 12:27:21 lock#11651 took 0.000166s; average 0.000164s

Note that there is only one print per second but the program is averaging over 2500 loop iterations per second, about 500,000X faster than above.

Ideally developers would not need to know about, much less use, this kind of hack. Another possibility is to wrap this idiom or some other implementation into a (say) sync.FairMutex with guaranteed fairness properties. But again ideally developers would not need to know or think about these awful details. I do not want to see blog posts explaining when to use sync.Mutex and when to use sync.FairMutex.

If Go's sync.Mutex can do something better by default without sacrificing performance, it seems it should.

Proposal

The proposal is that we detect severe unfairness and revert to handoffs in that case. How to do that is an open question, which is why this is not a design doc.

Performance

As an approximate lower bound on the cost of this in the general case, I added one line to the implementation of sync.Mutex's Unlock:

    // Grab the right to wake someone.
    new = (old - 1<<mutexWaiterShift) | mutexWoken // drop one waiter; set the woken bit
    if atomic.CompareAndSwapInt32(&m.state, old, new) {
        runtime_Semrelease(&m.sema) // wake the longest-waiting goroutine
        runtime.Gosched()           // <<< ADDED: yield so the woken goroutine runs now
        return
    }

Today at least, the implementation of Gosched essentially guarantees to run the goroutine just woken by the Semrelease. This is not strictly speaking a handoff, since barging can still happen between the unlock and the acquisition in g2, but the window is much smaller than it used to be, and in particular g1 is kept from reacquiring the lock. In any event, the point is to evaluate how costly this might be, not to do a perfect job. A perfect job would require more lines changed in both sync.Mutex and the runtime (although perhaps not many).

With this line, lockskew with only a single lock runs at about the same speed as with two locks:

$ ./lockskew.fast 
2015/10/28 13:49:02 lock#0 took 0.000155s; average 0.000155s
2015/10/28 13:49:03 lock#2934 took 0.000165s; average 0.000162s
2015/10/28 13:49:04 lock#5861 took 0.000163s; average 0.000162s
2015/10/28 13:49:05 lock#8783 took 0.000166s; average 0.000162s
2015/10/28 13:49:06 lock#11712 took 0.000162s; average 0.000162s

The pathological case is gone, but what about the common case? Barging is supposed to help throughput, so one expects that average throughput must have gone down. Doug Lea's paper uses as its benchmark multiple threads calling a random number generator protected by a lock. Each call acquires a lock, does a tiny amount of work, and releases the lock. The paper found that with 256 threads running on even single-processor systems, implementations allowing barging ran 10-100X faster than handoffs. Again this was with operating system threads, not goroutines, so the result may not carry over directly. A goroutine switch is much cheaper than a thread switch.

In fact, the result does not seem to carry over to goroutines, at least not in the random number case. Running go test -bench . rsc.io/tmp/lockskew will benchmark Go's rand.Int (with the same mutex around simple math) called from multiple goroutines simultaneously, from 1 to 256. As a worst case, I ran it with GOMAXPROCS=256 even though my system has only 12 cores. Here is the effect of adding the Gosched:

name         old time/op  new time/op  delta
Rand1-256    33.2ns ± 2%  32.8ns ± 1%   -1.24%   (p=0.008 n=10+9)
Rand2-256    40.4ns ±60%  33.4ns ±16%     ~     (p=0.203 n=10+11)
Rand4-256     143ns ± 9%   140ns ± 3%     ~      (p=0.098 n=9+11)
Rand8-256     230ns ± 4%   208ns ± 3%   -9.46%   (p=0.000 n=9+12)
Rand12-256    267ns ± 2%   236ns ± 1%  -11.81%  (p=0.000 n=10+11)
Rand16-256    272ns ± 1%   250ns ± 2%   -8.12%   (p=0.000 n=9+10)
Rand32-256    283ns ± 1%   269ns ± 2%   -4.92%  (p=0.000 n=10+10)
Rand64-256    284ns ± 2%   279ns ± 2%   -1.90%  (p=0.000 n=10+11)
Rand128-256   279ns ± 3%   280ns ± 2%     ~     (p=0.591 n=10+11)
Rand256-256   276ns ± 1%   288ns ± 2%   +4.23%  (p=0.000 n=10+10)

For the most part, the Gosched makes things better, not worse.

Although this microbenchmark doesn't show it, there may be a small degradation in real programs. I ran the golang.org/x/benchmarks/bench program with -bench=http with and without the change. The change seemed to cause a roughly 5% increase in the reported "RunTime" result.

Perhaps with work to detect and fall back instead of having the "handoff" on always, we could keep real programs running as they do today but fix these pathological (but real) behaviors at the same time. For example, one idea is to have the blocked mu.Lock call do the detection. In the pathological behavior, it goes to sleep, is awoken, and then can't acquire the lock and goes back to sleep. If it is woken up to find the lock unavailable (say) three times in a row, it could set a "now we're in handoff mode" bit in the lock word. The handoff mode bit could be sticky for the life of the lock, or it could be cleared on a call to mu.Unlock that finds no goroutines waiting.
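
As a very rough sketch of that detection idea (the simplified state word and all names here are mine, not sync.Mutex's real layout, which also packs a waiter count and a woken bit):

import "sync/atomic"

const (
    mutexLocked  int32 = 1 << 0 // lock is held
    mutexHandoff int32 = 1 << 1 // "now we're in handoff mode" bit
    failLimit          = 3      // straight sleep-acquire failures before flipping it
)

// wokeWithoutLock is called by a waiter that Unlock woke up but that
// found the mutex already retaken. After failLimit such failures in a
// row it sets the handoff bit, asking future Unlocks to pass the lock
// to the head waiter instead of unlocking and letting callers barge.
func wokeWithoutLock(state *int32, fails *int) {
    *fails++
    if *fails < failLimit {
        return
    }
    for {
        old := atomic.LoadInt32(state)
        if old&mutexHandoff != 0 ||
            atomic.CompareAndSwapInt32(state, old, old|mutexHandoff) {
            return
        }
    }
}

Per the last sentence above, an Unlock that finds no goroutines waiting could clear mutexHandoff again, making the mode non-sticky.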

[1] http://gee.cs.oswego.edu/dl/papers/aqs.pdf but ignore section 5.2. @RLH says the theoretical analysis there is known to be flawed.

/cc @randall77 @dvyukov @aclements @RLH @dr2chase

rsc added this to the Proposal milestone Oct 28, 2015
dvyukov (Member) commented Oct 29, 2015

Some notes on the measuring strategy.

The same unfair algorithm is used on two levels: sync.Mutex and runtime semaphores. If you change only one of them you can get all kinds of weird non-linear effects that can both improve and worsen fairness and performance.

In the test you statically partition iterations among goroutines, and this strategy strongly favors fair execution -- all goroutines need to finish at the same time. That can explain the better numbers in the benchmarks.

I did a quick prototype of a fully fair (FIFO) sync.Mutex and runtime sema:
https://go-review.googlesource.com/#/c/16469/
and changed the test to (RunParallel knows how to do it properly):

func benchmarkRand(b *testing.B, n int) {
    b.SetParallelism(n) // n goroutines per GOMAXPROCS
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            rand.Int() // takes the global rand mutex internally
        }
    })
}

Here are results on a silly 2-core hyperthreaded notebook (and the number of goroutines is not as high as in typical Go servers):

benchmark             old ns/op     new ns/op     delta
BenchmarkRand1        50.8          48.1          -5.31%
BenchmarkRand1-2      63.1          576           +812.84%
BenchmarkRand1-4      206           584           +183.50%
BenchmarkRand1-8      209           505           +141.63%
BenchmarkRand2        59.8          112           +87.29%
BenchmarkRand2-2      82.1          500           +509.01%
BenchmarkRand2-4      219           496           +126.48%
BenchmarkRand2-8      238           472           +98.32%
BenchmarkRand4        65.5          230           +251.15%
BenchmarkRand4-2      290           472           +62.76%
BenchmarkRand4-4      261           477           +82.76%
BenchmarkRand4-8      275           478           +73.82%
BenchmarkRand8        64.2          385           +499.69%
BenchmarkRand8-2      85.8          471           +448.95%
BenchmarkRand8-4      285           476           +67.02%
BenchmarkRand8-8      283           485           +71.38%
BenchmarkRand16       65.7          295           +349.01%
BenchmarkRand16-2     305           470           +54.10%
BenchmarkRand16-4     290           498           +71.72%
BenchmarkRand16-8     287           495           +72.47%

You can see that the effects Doug is talking about are not fake. Even with GOMAXPROCS=1 the slowdown is significant (that's our goroutine-switching overhead).

dvyukov (Member) commented Oct 29, 2015

I think we can do something along the lines of your proposal. However, we should not rely on an absolute number of failed acquire attempts. What matters for both user-perceived latency and performance is real time (for performance it is probably the frequency of acquisitions, but we can approximate that with real time). For a frequently acquired mutex (like the one in the rand test), 3 failures is effectively the same as 1 (see the benchmark results). Users don't care about the number of mutex acquisitions; they care about milliseconds.

10ms looks like a good threshold to trigger handoff mode (that's the GC latency guarantee and the goroutine preemption period).

rsc (Contributor, Author) commented Oct 29, 2015

@dvyukov, thanks for building that prototype and showing what I did wrong. :-)

You reported new benchmark results with two different things changed: both the implementation and the benchmark. What about your correct benchmark with my pseudo-fair implementation? Perhaps your results show only that we don't want complete fairness on by default (or perhaps even at all). My pseudo-fairness change fixed the real-world pathology with only a 5% slowdown in the macrobenchmarks (bench -bench=http), so maybe that's a better starting place.

If you wanted to switch the detection to time instead of number of spurious wakeups, how would you incorporate real time efficiently? My concern is that I didn't want to put new time calls into every Lock acquisition.

dr2chase (Contributor) commented

If 10ms is the Go preemption period, might we have a crude+cheap counter available based on that? If the thread at the head of the queue is denied access 3 times in a row, it can check that counter on each subsequent denial, and if the counter changes twice, between 10 and 19.9ms have elapsed.
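
A sketch of that check, assuming a hypothetical coarse tick maintained by the runtime (all names here are illustrative, not real runtime API):

import "sync/atomic"

var preemptTicks uint32 // hypothetical: incremented elsewhere every ~10ms

// onDenied is called each time the head waiter wakes but is denied
// the lock. After 3 straight denials it samples the tick; seeing the
// tick change twice means 10-19.9ms have passed while starving.
func onDenied(denials *int, firstTick *uint32, armed *bool) (forceHandoff bool) {
    *denials++
    if *denials < 3 {
        return false
    }
    t := atomic.LoadUint32(&preemptTicks)
    if !*armed {
        *armed, *firstTick = true, t
        return false
    }
    return t-*firstTick >= 2
}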

dvyukov (Member) commented Oct 29, 2015

@dr2chase The sysmon thread can provide such a counter; sysmon runs periodically and queries the time anyway. However, my plan has always been to get rid of sysmon entirely in favor of Windows UMS and Linux fibers.

@rsc Hmm... a context switch after unlock is exactly what we want to avoid. The microbenchmark does not test the real thing here. Two points: (1) the context switch destroys cache locality (in the benchmark there is no significant data); (2) the context switch increases latency and memory consumption. Consider a server with 10 requests to process, each taking 1ms: if you process them one-by-one to completion, the first request is serviced in 1ms, the second in 2ms, and so on; if you constantly round-robin between them, all requests are serviced in 10ms and the working set is the sum of all 10 requests. Servers need to prioritize the oldest work and finish it as soon as possible, freeing all associated resources before switching to new work.

Also, I suspect that in the benchmark most of the goroutines are blocked on the mutex, so there is no significant number of runnable goroutines and you basically switch directly to the just-unblocked goroutine. That is faster and also gives a false sense of fairness.

i3d (Contributor) commented Oct 29, 2015

With multiple goroutines, the effect is still there, just not as severe as with the 2 goroutines illustrated in the test. I think the fundamental cause is that the scheduler always favors keeping the currently runnable goroutine running, instead of looking at the goroutines in the waiting queue, some of which may also be ready to run...

I think in Linux there is a third queue (if my memory is correct) used to describe threads that can actually run (not blocked on anything) but have not yet been scheduled. At a scheduling point, all of them are picked relatively fairly. (I don't know the exact algorithm, but I'd imagine even a random pick would provide some basic fairness.) I wonder whether this could be useful in Go, since it seems a very common case that a bunch of goroutines block waiting on a lock while one goroutine does work holding it. When the holder releases the lock, all of the parked goroutines waiting on that lock should be in the runnable queue and picked fairly along with the other runnables. I think the Gosched() is pretty much doing this artificially.

Although dvyukov@ also has a point about destroying cache locality... I wonder whether blocking for seconds, preventing a goroutine from making any progress, isn't actually much worse than context switches measured in microseconds (and we aren't even talking about constant context switches; in practice, you don't lock again right after unlocking...).

dvyukov (Member) commented Oct 29, 2015

I think in Linux there is a third queue (if my memory is correct) used to describe threads that can actually run

The Go scheduler has just one queue, and it holds exactly such goroutines. The queue is serviced in a fair manner (FIFO).

in practice, you don't lock again right after unlocking...

It happens quite frequently in practice, e.g. in some helper function that is called frequently and takes a mutex internally.

binarycrusader (Contributor) commented

Inspiration might be found in an older version of the Solaris Nevada sources:

"LIFO queue ordering is unfair and can lead to starvation, but it gives better performance for heavily contended locks..."

https://github.com/alhazred/onarm/blob/2b3c7a498208e08afc7734be7f08cdad7ac51fbe/usr/src/lib/libc/port/threads/synch.c#L569

Obviously, this applies to OS-level thread scheduling, but the principles should remain applicable.

adg (Contributor) commented Sep 26, 2016

@rsc ping!

rakitzis commented

NB for your consideration: we have seen this in production in a similar variation:

We have many incoming requests contending on a lock protecting a long-running (order of milliseconds) journal transaction. The lock unfairness stretched the long tail into hours(!). The solution was to track the waiters explicitly in a queue and guard the queue with a mutex; this sorted out nearly all of the unfairness.
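
For concreteness, here is a minimal sketch of that workaround as I read the description (the names and the channel-per-waiter mechanism are guesses at the described design, not their code). The mutex guarding the queue is only ever held for a few instructions, so it does not develop the pathology itself, and the long-held lock is granted strictly in FIFO order:

import "sync"

// fairLock grants the lock to waiters in FIFO order by tracking them
// explicitly in a queue guarded by a briefly-held mutex.
type fairLock struct {
    mu      sync.Mutex      // guards held and waiters; held only briefly
    held    bool            // whether the main lock is taken
    waiters []chan struct{} // one wakeup channel per waiter, FIFO
}

func (l *fairLock) Lock() {
    l.mu.Lock()
    if !l.held {
        l.held = true
        l.mu.Unlock()
        return
    }
    ch := make(chan struct{})
    l.waiters = append(l.waiters, ch)
    l.mu.Unlock()
    <-ch // granted the lock by Unlock; held stays true throughout
}

func (l *fairLock) Unlock() {
    l.mu.Lock()
    if len(l.waiters) == 0 {
        l.held = false
        l.mu.Unlock()
        return
    }
    ch := l.waiters[0]
    l.waiters = l.waiters[1:]
    l.mu.Unlock()
    close(ch) // hand off directly to the head waiter; no barging window
}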

rsc (Contributor, Author) commented Sep 29, 2016

I wrote initially "The proposal is that we detect severe unfairness and revert to handoffs in that case. How to do that is an open question, which is why this is not a design doc." That's still true, unfortunately. We should probably plan to investigate this for Go 1.9 though. Thanks for the ping.

rakitzis commented

Right, it would be great if locks didn't have to observe perfect fairness but nevertheless followed the principle of least surprise. Thanks for keeping the issue open.

rsc (Contributor, Author) commented Dec 12, 2016

We don't know what to do here. It seems like maybe I should retract this proposal, but this is a real problem for some people, and it would be nice to solve it before they realize they have it (that is, without requiring people to type things like "FairMutex").

Does anyone have any suggestions for what we might do?

/cc @dvyukov @aclements @RLH @dr2chase

dr2chase (Contributor) commented

Do something crudely random, say, allow barging (N-1)/N of the time and hand off 1/N of the time? Pick N a power of two, use a super-crude (i.e., fast) RNG, and just check whether the lower log2(N) bits are zero.
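
For concreteness, a sketch with N=8 and xorshift32, both choices arbitrary (a real version would keep the RNG state per-P or per-lock rather than in a global):

// handOffThisTime reports true roughly 1/8 of the time using a
// super-crude xorshift32 PRNG; callers allow barging otherwise.
var rngState uint32 = 2463534242 // any nonzero seed works

func handOffThisTime() bool {
    x := rngState
    x ^= x << 13
    x ^= x >> 17
    x ^= x << 5
    rngState = x
    const n = 8 // power of two; hand off 1/n of the time
    return x&(n-1) == 0
}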

philhofer (Contributor) commented

It seems the unfairness only becomes a noticeable problem when the lock is held for a "very long time" (in other words, when the lock holder reschedules while holding the lock). You could detect that condition by looking at how long the lock is held on average, or at the lock queue depth (or the wait time of the goroutine at the head of the queue).

philhofer (Contributor) commented

In other words, if the semaphore were fair (I know it's not now), you could choose lock handoff whenever the next goroutine on the semaphore had been waiting longer than 10ms.
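
A sketch of that criterion, assuming a FIFO semaphore queue where each waiter records when it began waiting (the names and the 10ms constant follow the suggestion above; none of this is a real API):

import "time"

const starvationThreshold = 10 * time.Millisecond

type waiter struct {
    since time.Time     // when this goroutine started waiting
    wake  chan struct{} // closed to grant the lock
}

// shouldHandoff reports whether Unlock should hand the lock directly
// to the head waiter instead of releasing it for anyone to grab.
func shouldHandoff(head *waiter) bool {
    return time.Since(head.since) >= starvationThreshold
}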

gopherbot commented

CL https://golang.org/cl/34310 mentions this issue.

rsc changed the title from "proposal: runtime: fall back to fair locks after repeated sleep-acquire failures" to "runtime: fall back to fair locks after repeated sleep-acquire failures" Dec 19, 2016
rsc (Contributor, Author) commented Dec 19, 2016

Since Dmitriy figured out a way to do this, proposal accepted.

rmohr added a commit to rmohr/kubevirt that referenced this issue Jan 11, 2017
Preparation to switch to workqueue, to work around [1]. The Informer is written in a way that new updates starve when the process loop holds the lock too long.

[1] golang/go#13086
golang locked and limited conversation to collaborators Feb 17, 2018
rsc unassigned dvyukov Jun 23, 2022