Description
On a machine with many cores, the performance of sync.RWMutex.R{Lock,Unlock} degrades dramatically as GOMAXPROCS increases.
This test program:
```go
package benchmarks_test

import (
	"fmt"
	"sync"
	"testing"
)

func BenchmarkRWMutex(b *testing.B) {
	for ng := 1; ng <= 256; ng <<= 2 {
		b.Run(fmt.Sprint(ng), func(b *testing.B) {
			var mu sync.RWMutex
			mu.Lock()

			var wg sync.WaitGroup
			wg.Add(ng)

			n := b.N
			quota := n / ng

			for g := ng; g > 0; g-- {
				if g == 1 {
					quota = n
				}

				go func(quota int) {
					for i := 0; i < quota; i++ {
						mu.RLock()
						mu.RUnlock()
					}
					wg.Done()
				}(quota)

				n -= quota
			}

			if n != 0 {
				b.Fatalf("Incorrect quota assignments: %v remaining", n)
			}

			b.StartTimer()
			mu.Unlock()
			wg.Wait()
			b.StopTimer()
		})
	}
}
```
degrades by a factor of 8 as it saturates threads and cores, presumably due to cache contention on &rw.readerCount:
```
# ./benchmarks.test -test.bench . -test.cpu 1,4,16,64
testing: warning: no tests to run
BenchmarkRWMutex/1        20000000    72.6 ns/op
BenchmarkRWMutex/1-4      20000000    72.4 ns/op
BenchmarkRWMutex/1-16     20000000    72.8 ns/op
BenchmarkRWMutex/1-64     20000000    72.5 ns/op
BenchmarkRWMutex/4        20000000    72.6 ns/op
BenchmarkRWMutex/4-4      20000000     105 ns/op
BenchmarkRWMutex/4-16     10000000     130 ns/op
BenchmarkRWMutex/4-64     20000000     160 ns/op
BenchmarkRWMutex/16       20000000    72.4 ns/op
BenchmarkRWMutex/16-4     10000000     125 ns/op
BenchmarkRWMutex/16-16    10000000     263 ns/op
BenchmarkRWMutex/16-64     5000000     287 ns/op
BenchmarkRWMutex/64       20000000    72.6 ns/op
BenchmarkRWMutex/64-4     10000000     137 ns/op
BenchmarkRWMutex/64-16     5000000     306 ns/op
BenchmarkRWMutex/64-64     3000000     517 ns/op
BenchmarkRWMutex/256      20000000    72.4 ns/op
BenchmarkRWMutex/256-4    20000000     137 ns/op
BenchmarkRWMutex/256-16    5000000     280 ns/op
BenchmarkRWMutex/256-64    3000000     602 ns/op
PASS
```
A "control" test, calling a no-op function instead of the RWMutex methods, displays no such degradation: the problem does not appear to be due to runtime scheduling overhead.
josharian commented on Nov 18, 2016
Possibly of interest for this: http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf
josharian commented on Nov 18, 2016
cc @dvyukov
ianlancetaylor commented on Nov 18, 2016
It may be difficult to apply the algorithm described in that paper to our existing sync.RWMutex type. The algorithm requires an association between the read-lock operation and the read-unlock operation. It can be implemented by having read-lock/read-unlock always occur on the same thread or goroutine, or by having the read-lock operation return a pointer that is passed to the read-unlock operation. Basically the algorithm builds a tree to avoid contention, and requires each read-lock/read-unlock pair to operate on the same node of the tree.

It would be feasible to implement the algorithm as part of a new type, in which the read-lock operation returned a pointer to be passed to the read-unlock operation. I think that new type could be implemented entirely in terms of sync/atomic functions, sync.Mutex, and sync.Cond. That is, it doesn't seem to require any special relationship with the runtime package.

dvyukov commented on Nov 18, 2016
Locking per-P slot may be enough and is much simpler:
https://codereview.appspot.com/4850045/diff2/1:3001/src/pkg/co/distributedrwmutex.go
ianlancetaylor commented on Nov 18, 2016
What happens when a goroutine moves to a different P between read-lock and read-unlock?
dvyukov commented on Nov 18, 2016
RLock must return a proxy object to unlock; that object must hold the locked P index.
bcmills commented on Nov 18, 2016
The existing RWMutex API allows the RLock call to occur on a different goroutine from the RUnlock call. We can certainly assume that most RLock/RUnlock pairs occur on the same goroutine (and optimize for that case), but I think there needs to be a slow-path fallback for the general case.

(For example, you could envision an algorithm that attempts to unlock the slot for the current P, then falls back to a linear scan if the current P's slot wasn't already locked.)
bcmills commented on Nov 18, 2016
At any rate: general application code can work around the problem (in part) by using per-goroutine or per-goroutine-pool caches rather than global caches shared throughout the process.

The bigger issue is that sync.RWMutex is used fairly extensively within the standard library for package-level locks (the various caches in reflect, http.statusMu, json.encoderCache, mime.mimeLock, etc.), so it's easy for programs to fall into contention traps and hard to apply workarounds without avoiding large portions of the standard library. For those use-cases, it might actually be feasible to switch to something with a different API (such as having RLock return an unlocker).

dvyukov commented on Nov 18, 2016
For these cases in the std lib, atomic.Value is a much better fit. It is already used in json, gob, and http. atomic.Value is perfectly scalable and has virtually zero overhead for readers.

bcmills commented on Nov 18, 2016
I agree in general, but it's not obvious to me how one could use atomic.Value to guard lookups in a map acting as a cache. (It's perfect for maps which do not change, but how would you add new entries to the caches with that approach?)
What kind of cache do you mean?
ianlancetaylor commented on Nov 18, 2016
E.g., the one in reflect.ptrTo.

dvyukov commented on Nov 18, 2016
See e.g. encoding/json/encode.go:cachedTypeFields
bcmills commented on Nov 18, 2016
Hmm... that trades a higher allocation rate (and O(N) insertion cost) in exchange for getting the lock out of the reader path. (It basically pushes the "read lock" out to the garbage collector.)
And since most of these maps are insert-only (never deleted from), you can at least suspect that the O(N) insert won't be a huge drag: if there were many inserts, the maps would end up enormously large.
It would be interesting to see whether the latency tradeoff favors the RWMutex overhead or the O(N) insert overhead for more of the standard library.

ianlancetaylor commented on Nov 18, 2016
@dvyukov Thanks. For the reflect.ptrTo case I wrote it up as https://golang.org/cl/33411. It needs some realistic benchmarks--microbenchmarks won't prove anything one way or another.

89 remaining items
ntsd commented on Dec 22, 2023
I made a benchmark, but I'm not sure if the code is correct. Hope this helps.

With 10 concurrent goroutines at 100k iterations each, RWMutex is fine for writes and slightly better for reads.
With 100 concurrent goroutines at 10k iterations each, RWMutex is slower for writes and impressive for reads.

What surprises me is that sync.Map did better at higher concurrency.
https://github.com/ntsd/go-mutex-comparison?tab=readme-ov-file#test-scenarios