Description
Per-CPU sharded values are a useful and common way to reduce contention on shared write-mostly values. However, this technique is currently difficult or impossible to use in Go (though there have been attempts, such as @jonhoo's https://github.com/jonhoo/drwmutex and @bcmills' https://go-review.googlesource.com/#/c/35676/).
We propose providing an API for creating and working with sharded values. Sharding would be encapsulated in a type, say `sync.Sharded`, that would have `Get() interface{}`, `Put(interface{})`, and `Do(func(interface{}))` methods. `Get` and `Put` would always have to be paired to make `Do` possible. (This is actually the same API that was proposed in #8281 (comment) and rejected, but perhaps we have a better understanding of the issues now.) This idea came out of off-and-on discussions between at least @rsc, @hyangah, @RLH, @bcmills, @Sajmani, and myself.
This is a counter-proposal to various proposals to expose the current thread/P ID as a way to implement sharded values (#8281, #18590). These have been turned down as exposing low-level implementation details, tying Go to an API that may be inappropriate or difficult to support in the future, being difficult to use correctly (since the ID may change at any time), being difficult to specify, and as being broadly susceptible to abuse.
There are several dimensions to the design of such an API.

`Get` and `Put` can be blocking or non-blocking:

- With non-blocking `Get` and `Put`, `sync.Sharded` behaves like a collection. `Get` returns immediately with the current shard's value or nil if the shard is empty. `Put` stores a value for the current shard if the shard's slot is empty (which may be different from where `Get` was called, but would often be the same). If the shard's slot is not empty, `Put` could either put to some overflow list (in which case the state is potentially unbounded), or run some user-provided combiner (which would bound the state).

- With blocking `Get` and `Put`, `sync.Sharded` behaves more like a lock. `Get` returns and locks the current shard's value, blocking further `Get`s from that shard. `Put` sets the shard's value and unlocks it. In this case, `Put` has to know which shard the value came from, so `Get` can either return a put function (though that would require allocating a closure) or some opaque value that must be passed to `Put` that internally identifies the shard.

- It would also be possible to combine these behaviors by using an overflow list with a bounded size. Specifying 0 would yield lock-like behavior, while specifying a larger value would give some slack where `Get` and `Put` remain non-blocking without allowing the state to become completely unbounded.
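To make the non-blocking variant with a user-provided combiner concrete, here is a toy sketch. It is not per-CPU: the explicit shard index and the single mutex are assumptions standing in for the runtime picking the current P's slot, so only the intended `Get`/`Put`/`Do` semantics are illustrated, not the contention behavior.

```go
package main

import (
	"fmt"
	"sync"
)

// Sharded is a toy model of the proposed API. A real implementation would
// pin one slot per P and locate "the current" slot via the runtime; here
// the caller passes an explicit shard index. The combiner bounds the
// state, as described above.
type Sharded struct {
	mu      sync.Mutex
	slots   []interface{}
	combine func(a, b interface{}) interface{}
}

func NewSharded(shards int, combine func(a, b interface{}) interface{}) *Sharded {
	return &Sharded{slots: make([]interface{}, shards), combine: combine}
}

// Get removes and returns the shard's value, or nil if the slot is empty
// (the non-blocking "collection" variant).
func (s *Sharded) Get(shard int) interface{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	v := s.slots[shard]
	s.slots[shard] = nil
	return v
}

// Put stores v into the shard's slot, running the combiner if the slot is
// already occupied (e.g. because the goroutine migrated between Get and Put).
func (s *Sharded) Put(shard int, v interface{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if old := s.slots[shard]; old != nil {
		v = s.combine(old, v)
	}
	s.slots[shard] = v
}

// Do calls f on each non-empty slot; as written this is the inconsistent form.
func (s *Sharded) Do(f func(interface{})) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, v := range s.slots {
		if v != nil {
			f(v)
		}
	}
}

func main() {
	add := func(a, b interface{}) interface{} { return a.(int) + b.(int) }
	c := NewSharded(4, add)
	for shard := 0; shard < 4; shard++ {
		c.Put(shard, 1) // each "P" counts one event
	}
	total := 0
	c.Do(func(v interface{}) { total += v.(int) })
	fmt.Println(total) // 4
}
```

Note that because `Put` runs the combiner when the slot is occupied, the state stays bounded even when a goroutine migrates between `Get` and `Put`.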
`Do` could be consistent or inconsistent:

- If it's consistent, then it passes the callback a snapshot at a single instant. I can think of two ways to do this: block until all outstanding values are `Put` and also block further `Get`s until the `Do` can complete; or use the "current" value of each shard even if it's checked out. The latter requires that shard values be immutable, but it makes `Do` non-blocking.

- If it's inconsistent, then it can wait on each shard independently. This is faster and doesn't affect `Get` and `Put`, but the caller can only get a rough idea of the combined value. This is fine for uses like approximate statistics counters.

It may be that we can't make this decision at the API level and have to provide both forms of `Do`.
I think this is a good base API, but I can think of a few reasonable extensions:

- Provide `Peek` and `CompareAndSwap`. If a user of the API can be written in terms of these, then `Do` would always be able to get an immediate consistent snapshot.

- Provide a `Value` operation that uses the user-provided combiner (if we go down that API route) to get the combined value of the `sync.Sharded`.
Activity
aclements commented on Jan 26, 2017
My own inclination is towards the non-blocking API with a bounded overflow list. A blocking API seems antithetical to the goal of reducing contention and may lead to performance anomalies if a goroutine or OS thread is descheduled while it has a shard checked out, and a non-blocking API with a required combiner may prevent certain use cases (e.g., large structures, or uses that never read the whole sharded value). It also devolves to the blocking API if the bound is 0.
ianlancetaylor commented on Jan 26, 2017
The proposal as written is rather abstract. I think it would help to examine the specific use cases that people have for such a thing.
For example, it's clear that one use case is collecting metrics. Presumably the idea is that you have some sort of server, and it wants to log various metrics for each request. The metrics only need to be accumulated when they are reported, and reporting happens much less often than collection. Using a lock;update;unlock sequence will lead to lock contention. But (let's say) we need the metrics to be accurate. So the idea of sharding for this case is a lock;update;unlock sequence with a sharded lock, and an accumulate step that does lock;collect;zero;unlock for each sharded metric. That gives us the values we need while minimizing lock contention.
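The lock;update;unlock / lock;collect;zero;unlock scheme described here can be sketched as follows. The fixed shard count and the caller-supplied shard index are assumptions of the sketch (the point of the proposal is that the runtime would pick the shard); in a server the index might be derived from a worker ID.

```go
package main

import (
	"fmt"
	"sync"
)

type shard struct {
	mu sync.Mutex
	n  int64
	_  [48]byte // padding to keep shards on separate cache lines
}

// shardedCounter: each update takes only its own shard's lock, so
// contention is limited to goroutines that happen to share a shard.
type shardedCounter struct {
	shards []shard
}

func newShardedCounter(n int) *shardedCounter {
	return &shardedCounter{shards: make([]shard, n)}
}

// Add is the lock;update;unlock step.
func (c *shardedCounter) Add(i int, delta int64) {
	s := &c.shards[i%len(c.shards)]
	s.mu.Lock()
	s.n += delta
	s.mu.Unlock()
}

// Collect is the accumulate step: lock;collect;zero;unlock for each shard,
// so the reported total is exact while updates stay cheap.
func (c *shardedCounter) Collect() int64 {
	var total int64
	for i := range c.shards {
		c.shards[i].mu.Lock()
		total += c.shards[i].n
		c.shards[i].n = 0
		c.shards[i].mu.Unlock()
	}
	return total
}

func main() {
	c := newShardedCounter(8)
	var wg sync.WaitGroup
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.Add(w, 1)
			}
		}(w)
	}
	wg.Wait()
	fmt.Println(c.Collect()) // 8000
}
```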
One way to implement this use case is for the `sync.Sharded` to require a combiner method as you describe. Conceptually, then:

For typical metrics the `Do` method does not need to be consistent. However, it's not hard to have a consistent `Do` as long as the function passed to `Do` does not use the `sync.Sharded` value itself.

With this outline, we see that there is no need for `sync.Sharded` to maintain an overflow list. Any case that wants to use an overflow list will do so in the `Combine` function. Obviously the `Combine` function must not use the `sync.Sharded` value, as that may lead to deadlock, but otherwise it can do whatever it likes.

What other uses are there for `sync.Sharded`, and what sorts of implementation do they suggest?

bcmills commented on Jan 26, 2017
I had been considering a somewhat narrower API containing only `Get` and one of {`Range`, `Do`, `ForEach`}, bounding the number of distinct values to the number of threads in the program. The calling code would provide a `func() interface{}` at construction time to use when `Get` is called on a thread without an existing value.

The semantics would be similar to the non-blocking proposal: `Get` returns the current shard's value (but does not guarantee exclusiveness), and `Range` iterates over all existing values.

Because of the lack of exclusiveness, application code would still have to use `atomic` and/or `sync` to manipulate the values, but if the value is uncontended and usually owned by the same core's cache, the overhead of that application-side synchronization would be relatively small (compared to locking overhead for non-sharded values).

That approach has a few advantages over the alternatives in the current proposal:

- `atomic` values (as for the stats-counter use case in sync: per-cpu storage #8281) would not have to deal with lock-ordering (as it would with a blocking `Get` and `Put`).

- It avoids the need for a paired `Put` in application code. This is perhaps less significant if we can make trivial `defer` usage less expensive (runtime: defer is slow #14939) and/or add a `vet` check (along the lines of `lostcancel`), but it seems simpler to avoid the problem in the first place.

It has one disadvantage that I'm aware of:

Are there other tradeoffs for or against the narrower `Get`/`Range` API?

bcmills commented on Jan 26, 2017
It doesn't even require immutability: "externally synchronized" and/or "atomic" would suffice, although "externally synchronized" carries the risk of lock-ordering issues.
bcmills commented on Jan 26, 2017
Anything that reduces values seems tricky to get right: you'd have to ensure that `Do` iterates in an order such that `Combine` cannot move a value that `Do` has already iterated over into one that it has yet to encounter (and vice-versa), otherwise you risk under- or double-counting that value.

I don't immediately see how to provide that property for `Do` in the general case without reintroducing a cross-thread contention point in `Put`, but it may be possible.

ianlancetaylor commented on Jan 26, 2017
For a consistent `Do`, first lock all the shards, then run the function on each value, then unlock all the shards. For an inconsistent `Do`, it doesn't matter.

bcmills commented on Jan 26, 2017
That essentially makes `Do` a stop-the-world operation: it not only blocks all of those threads until `Do` completes, but also invalidates the cache lines containing the per-shard locks in each of the local CPU caches.

Ideally, `Do` should produce much less interference in the steady state: it should only acquire/invalidate locks that are not in the fast path of `Get` and `Put`. If the values are read using `atomic`, that doesn't need to invalidate any cache lines at all: the core processing `Do` might need to wait to receive an up-to-date value, but since there is no write to the cross-core data the existing cached value doesn't need to be invalidated.

I guess that means I'm in favor of an inconsistent `Do`, provided that we don't discover a very compelling use-case for making it consistent.

funny-falcon commented on Jan 26, 2017
For some usages there should be strict knowledge of the number of allocated values, i.e. the number of allocated values should not change. And preferably, values should be allocated at a predictable time, for example, at container (`Sharded`) creation. For that kind of usage, an interface with `Put` is not useful.

Probably, it should be a separate container:
If the size never changes, there is no need for `Do` or `ForEach` or locks. Application code must include its own synchronization code.

Rationale: GOMAXPROCS changes rarely (almost never), so dynamic allocation is excessive. I could be mistaken about GOMAXPROCS constantness.
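A sketch of such a fixed-size container: all slots are allocated once at creation, sized by GOMAXPROCS, and never grow, with application-side atomic synchronization. The caller-supplied slot index is an assumption of the sketch, standing in for the current P.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// FixedSharded allocates all of its slots up front, at a predictable
// time, and never resizes. Cache-line padding is omitted for brevity.
type FixedSharded struct {
	slots []int64 // one counter per P
}

func NewFixedSharded() *FixedSharded {
	return &FixedSharded{slots: make([]int64, runtime.GOMAXPROCS(0))}
}

// Add performs the application-side synchronization itself (atomics),
// since the container provides none.
func (f *FixedSharded) Add(slot int, delta int64) {
	atomic.AddInt64(&f.slots[slot%len(f.slots)], delta)
}

// Sum reads every slot; no locks are needed because the slot count is fixed.
func (f *FixedSharded) Sum() int64 {
	var total int64
	for i := range f.slots {
		total += atomic.LoadInt64(&f.slots[i])
	}
	return total
}

func main() {
	f := NewFixedSharded()
	for i := 0; i < 100; i++ {
		f.Add(i, 1)
	}
	fmt.Println(f.Sum()) // 100
}
```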
ianlancetaylor commented on Jan 26, 2017
@bcmills Well, as I said earlier, I think we need to look at specific use cases. For the specific use case I was discussing, I assert that the cost of a consistent `Do` is irrelevant, because it is run very infrequently.

What specific use case do you have in mind?
bcmills commented on Jan 26, 2017
@ianlancetaylor I'm specifically thinking about counting (as in #8281) and CPU-local caching (e.g. buffering unconditional stores to a shared map, a potential optimization avenue for #18177).
funny-falcon commented on Jan 26, 2017
I'm thinking about stat-collectors and high-performance RPC.
ianlancetaylor commented on Jan 26, 2017
@bcmills For counting, it seems to me you would use an inconsistent `Do`. If you need to avoid inconsistency while still using an inconsistent `Do`, have the combiner store the additional counts elsewhere and not modify the previously stored value. Presumably the combiner is only called in rare cases, so the speed of that rare case is not too important. You could even mitigate that cost by stacking `sync.Sharded` values.

I don't actually see how to write a consistent `Do` that does not disturb the fast path of `Get` and `Put` at all.

One approach for buffering stores to a shared map would be a `Do` that removes all the values, replacing them with `nil`. Come to think of it, that would work for counters also. But it does interfere with the fast path.

@funny-falcon Can you expand on what you mean by "high-performance RPC"? I don't see why you need a global distributed value for RPC.
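For the counter case, the remove-and-replace idea can be sketched with an atomic swap: each shard's count is taken and replaced with zero in one step, so in-flight increments are never lost, only deferred to the next drain. The fixed shard slice is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// drain collects every shard's count and resets it, mimicking a Do that
// removes all the values: SwapInt64 reads and zeroes atomically.
func drain(shards []int64) int64 {
	var total int64
	for i := range shards {
		total += atomic.SwapInt64(&shards[i], 0)
	}
	return total
}

func main() {
	shards := make([]int64, 4)
	for i := range shards {
		atomic.AddInt64(&shards[i], int64(10*(i+1)))
	}
	fmt.Println(drain(shards)) // 100
	fmt.Println(drain(shards)) // 0
}
```

As noted above, the swap does write to the cross-core data, so unlike a pure atomic read it invalidates the writers' cache lines and interferes with the fast path.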