
sync: RWMutex scales poorly with CPU count #17973

Open

Description

@bcmills

On a machine with many cores, the performance of sync.RWMutex.R{Lock,Unlock} degrades dramatically as GOMAXPROCS increases.

This test program:

package benchmarks_test

import (
	"fmt"
	"sync"
	"testing"
)

func BenchmarkRWMutex(b *testing.B) {
	for ng := 1; ng <= 256; ng <<= 2 {
		b.Run(fmt.Sprint(ng), func(b *testing.B) {
			var mu sync.RWMutex
			mu.Lock() // hold the write lock so the readers all block until released below

			var wg sync.WaitGroup
			wg.Add(ng)

			// Split b.N iterations across ng goroutines; the last
			// goroutine takes whatever remains after integer division.
			n := b.N
			quota := n / ng

			for g := ng; g > 0; g-- {
				if g == 1 {
					quota = n
				}

				go func(quota int) {
					for i := 0; i < quota; i++ {
						mu.RLock()
						mu.RUnlock()
					}
					wg.Done()
				}(quota)

				n -= quota
			}

			if n != 0 {
				b.Fatalf("Incorrect quota assignments: %v remaining", n)
			}

			b.StartTimer()
			mu.Unlock() // release all readers at once
			wg.Wait()
			b.StopTimer()
		})
	}
}

degrades roughly 8x as it saturates threads and cores, presumably due to cache contention on &rw.readerCount:

# ./benchmarks.test -test.bench . -test.cpu 1,4,16,64
testing: warning: no tests to run
BenchmarkRWMutex/1      20000000                72.6 ns/op
BenchmarkRWMutex/1-4    20000000                72.4 ns/op
BenchmarkRWMutex/1-16   20000000                72.8 ns/op
BenchmarkRWMutex/1-64   20000000                72.5 ns/op
BenchmarkRWMutex/4      20000000                72.6 ns/op
BenchmarkRWMutex/4-4    20000000               105 ns/op
BenchmarkRWMutex/4-16   10000000               130 ns/op
BenchmarkRWMutex/4-64   20000000               160 ns/op
BenchmarkRWMutex/16     20000000                72.4 ns/op
BenchmarkRWMutex/16-4   10000000               125 ns/op
BenchmarkRWMutex/16-16  10000000               263 ns/op
BenchmarkRWMutex/16-64   5000000               287 ns/op
BenchmarkRWMutex/64     20000000                72.6 ns/op
BenchmarkRWMutex/64-4   10000000               137 ns/op
BenchmarkRWMutex/64-16   5000000               306 ns/op
BenchmarkRWMutex/64-64   3000000               517 ns/op
BenchmarkRWMutex/256                    20000000                72.4 ns/op
BenchmarkRWMutex/256-4                  20000000               137 ns/op
BenchmarkRWMutex/256-16                  5000000               280 ns/op
BenchmarkRWMutex/256-64                  3000000               602 ns/op
PASS

A "control" test, calling a no-op function instead of RWMutex methods, displays no such degradation: the problem does not appear to be due to runtime scheduling overhead.

Activity

josharian (Contributor) commented on Nov 18, 2016

josharian (Contributor) commented on Nov 18, 2016

ianlancetaylor (Contributor) commented on Nov 18, 2016

It may be difficult to apply the algorithm described in that paper to our existing sync.RWMutex type. The algorithm requires an association between the read-lock operation and the read-unlock operation. It can be implemented by having read-lock/read-unlock always occur on the same thread or goroutine, or by having the read-lock operation return a pointer that is passed to the read-unlock operation. Basically the algorithm builds a tree to avoid contention, and requires each read-lock/read-unlock pair to operate on the same node of the tree.

It would be feasible to implement the algorithm as part of a new type, in which the read-lock operation returned a pointer to be passed to the read-unlock operation. I think that new type could be implemented entirely in terms of sync/atomic functions, sync.Mutex, and sync.Cond. That is, it doesn't seem to require any special relationship with the runtime package.
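
As a rough sketch of that kind of API (a flat sharded lock rather than the tree from the paper; the names, the shard count, and the round-robin shard choice are illustrative assumptions, not code from this thread):

package shardedrw

import (
	"sync"
	"sync/atomic"
)

const shards = 64 // assumption: sized roughly for large core counts

// RToken records which shard a reader locked, so RUnlock can be called
// from any goroutine.
type RToken struct{ shard *sync.RWMutex }

// RWMutex spreads readers across shards to avoid contending on a single
// reader count.
type RWMutex struct {
	next   uint32
	shards [shards]sync.RWMutex // padding would be needed to avoid false sharing
}

// RLock read-locks one shard, chosen round-robin, and returns its token.
func (m *RWMutex) RLock() RToken {
	mu := &m.shards[atomic.AddUint32(&m.next, 1)%shards]
	mu.RLock()
	return RToken{shard: mu}
}

// RUnlock releases the shard named by the token.
func (m *RWMutex) RUnlock(t RToken) { t.shard.RUnlock() }

// Lock acquires every shard in order, excluding all readers.
func (m *RWMutex) Lock() {
	for i := range m.shards {
		m.shards[i].Lock()
	}
}

// Unlock releases every shard.
func (m *RWMutex) Unlock() {
	for i := range m.shards {
		m.shards[i].Unlock()
	}
}

The shared round-robin counter here is itself a contended cache line; a real implementation would want to pick the shard from the current P or CPU instead, which is exactly why the unlock needs a token identifying where the read lock was taken.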

dvyukov (Member) commented on Nov 18, 2016

ianlancetaylor (Contributor) commented on Nov 18, 2016

What happens when a goroutine moves to a different P between read-lock and read-unlock?

dvyukov (Member) commented on Nov 18, 2016

RLock must return a proxy object to unlock; that object must hold the locked P index.

bcmills (Contributor, Author) commented on Nov 18, 2016

The existing RWMutex API allows the RLock call to occur on a different goroutine from the RUnlock call. We can certainly assume that most RLock / RUnlock pairs occur on the same goroutine (and optimize for that case), but I think there needs to be a slow-path fallback for the general case.

(For example, you could envision an algorithm that attempts to unlock the slot for the current P, then falls back to a linear scan if the current P's slot wasn't already locked.)

bcmills (Contributor, Author) commented on Nov 18, 2016

At any rate: general application code can work around the problem (in part) by using per-goroutine or per-goroutine-pool caches rather than global caches shared throughout the process.

The bigger issue is that sync.RWMutex is used fairly extensively within the standard library for package-level locks (the various caches in reflect, http.statusMu, json.encoderCache, mime.mimeLock, etc.), so it's easy for programs to fall into contention traps and hard to apply workarounds without avoiding large portions of the standard library. For those use-cases, it might actually be feasible to switch to something with a different API (such as having RLock return an unlocker).

dvyukov (Member) commented on Nov 18, 2016

For these cases in the standard library, atomic.Value is a much better fit. It is already used in json, gob, and http. atomic.Value is perfectly scalable and has virtually zero overhead for readers.

bcmills (Contributor, Author) commented on Nov 18, 2016

I agree in general, but it's not obvious to me how one could use atomic.Value to guard lookups in a map acting as a cache. (It's perfect for maps which do not change, but how would you add new entries to the caches with that approach?)

dvyukov (Member) commented on Nov 18, 2016

What kind of cache do you mean?

ianlancetaylor (Contributor) commented on Nov 18, 2016

E.g., the one in reflect.ptrTo.

dvyukov (Member) commented on Nov 18, 2016

See e.g. encoding/json/encode.go:cachedTypeFields
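
(Paraphrasing that pattern rather than quoting the stdlib source, with illustrative names: readers do a single atomic.Value load, and writers copy the entire map under a mutex before publishing the new version.)

package cowcache

import (
	"sync"
	"sync/atomic"
)

// cache publishes an immutable map via atomic.Value; only writers take mu.
type cache struct {
	mu    sync.Mutex   // serializes writers
	value atomic.Value // holds a map[string]int that is never mutated once stored
}

func (c *cache) get(key string, compute func(string) int) int {
	m, _ := c.value.Load().(map[string]int)
	if v, ok := m[key]; ok {
		return v // fast path: one atomic load, no lock
	}

	v := compute(key)

	c.mu.Lock()
	defer c.mu.Unlock()
	m, _ = c.value.Load().(map[string]int) // reload: another writer may have published
	if prev, ok := m[key]; ok {
		return prev
	}
	next := make(map[string]int, len(m)+1)
	for k, val := range m { // O(N) copy: the cost moves from readers to writers
		next[k] = val
	}
	next[key] = v
	c.value.Store(next)
	return v
}

Each insert allocates and re-copies the whole map, which is the allocation-rate / O(N) tradeoff discussed in the next comment.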

bcmills (Contributor, Author) commented on Nov 18, 2016

Hmm... that trades a higher allocation rate (and O(N) insertion cost) in exchange for getting the lock out of the reader path. (It basically pushes the "read lock" out to the garbage collector.)

And since most of these maps are insert-only (never deleted from), you can at least suspect that the O(N) insert won't be a huge drag: if there were many inserts, the maps would end up enormously large.

It would be interesting to see whether the latency tradeoff favors the RWMutex overhead or the O(N) insert overhead for more of the standard library.

ianlancetaylor (Contributor) commented on Nov 18, 2016

@dvyukov Thanks. For the reflect.ptrTo case I wrote it up as https://golang.org/cl/33411. It needs some realistic benchmarks--microbenchmarks won't prove anything one way or another.

89 remaining items

added a commit that references this issue on Aug 27, 2023
ntsd commented on Dec 22, 2023

I made a benchmark, but I'm not sure the code is correct. Hope this helps.

For 10 concurrent goroutines with 100k iterations each, RWMutex is fine for writes and slightly better for reads.
For 100 concurrent goroutines with 10k iterations each, RWMutex is slower for writes and impressive for reads.
What surprises me is that sync.Map did better at higher concurrency.

https://github.com/ntsd/go-mutex-comparison?tab=readme-ov-file#test-scenarios


Metadata

Labels: NeedsFix (the path to resolution is known, but the work has not been done), Performance, compiler/runtime (issues related to the Go compiler and/or runtime)

Participants: @bradfitz, @davidlazar, @josharian, @rsc, @baryluk