proposal: sync: support for sharded values #18802

Open

Description

@aclements

Per-CPU sharded values are a useful and common way to reduce contention on shared write-mostly values. However, this technique is currently difficult or impossible to use in Go (though there have been attempts, such as @jonhoo's https://github.com/jonhoo/drwmutex and @bcmills' https://go-review.googlesource.com/#/c/35676/).

We propose providing an API for creating and working with sharded values. Sharding would be encapsulated in a type, say sync.Sharded, that would have Get() interface{}, Put(interface{}), and Do(func(interface{})) methods. Get and Put would always have to be paired to make Do possible. (This is actually the same API that was proposed in #8281 (comment) and rejected, but perhaps we have a better understanding of the issues now.) This idea came out of off-and-on discussions between at least @rsc, @hyangah, @RLH, @bcmills, @Sajmani, and myself.
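To make the shape of this API concrete, here is a hypothetical sharded counter written against the proposed methods, assuming the non-blocking form discussed below. sync.Sharded does not exist, so this is a sketch of intended usage, not working code:

var hits sync.Sharded

func inc() {
    v := hits.Get() // this shard's value, or nil if the slot is empty
    n, ok := v.(*int64)
    if !ok {
        n = new(int64)
    }
    (*n)++
    hits.Put(n) // store back into the current shard's slot
}

func total() int64 {
    var sum int64
    hits.Do(func(v interface{}) {
        if n, ok := v.(*int64); ok {
            sum += *n
        }
    })
    return sum
}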

This is a counter-proposal to various proposals to expose the current thread/P ID as a way to implement sharded values (#8281, #18590). These have been turned down as exposing low-level implementation details, tying Go to an API that may be inappropriate or difficult to support in the future, being difficult to use correctly (since the ID may change at any time), being difficult to specify, and as being broadly susceptible to abuse.

There are several dimensions to the design of such an API.

Get and Put can be blocking or non-blocking:

  • With non-blocking Get and Put, sync.Sharded behaves like a collection. Get returns immediately with the current shard's value, or nil if the shard is empty. Put stores a value for the current shard if the shard's slot is empty (the current shard may differ from the one where Get was called, but would often be the same). If the shard's slot is not empty, Put could either put to some overflow list (in which case the state is potentially unbounded), or run some user-provided combiner (which would bound the state).

  • With blocking Get and Put, sync.Sharded behaves more like a lock. Get returns and locks the current shard's value, blocking further Gets from that shard. Put sets the shard's value and unlocks it. In this case, Put has to know which shard the value came from, so Get can either return a put function (though that would require allocating a closure) or some opaque value that must be passed to Put that internally identifies the shard. (A sketch of this variant follows the list.)

  • It would also be possible to combine these behaviors by using an overflow list with a bounded size. Specifying 0 would yield lock-like behavior, while specifying a larger value would give some slack where Get and Put remain non-blocking without allowing the state to become completely unbounded.
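To make the blocking variant concrete, here is a toy model using sync.Mutex and sync/atomic. Shard selection is round-robin rather than per-CPU, and every name here is illustrative rather than proposed:

type shardHandle int // opaque value identifying the shard a value came from

type toySharded struct {
    mu   []sync.Mutex
    vals []interface{}
    next uint64 // round-robin stand-in for "current shard" selection
}

func (s *toySharded) Get() (interface{}, shardHandle) {
    h := shardHandle(atomic.AddUint64(&s.next, 1) % uint64(len(s.vals)))
    s.mu[h].Lock() // blocks further Gets on this shard until Put
    return s.vals[h], h
}

func (s *toySharded) Put(v interface{}, h shardHandle) {
    s.vals[h] = v
    s.mu[h].Unlock()
}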

Do could be consistent or inconsistent:

  • If it's consistent, then it passes the callback a snapshot at a single instant. I can think of two ways to do this: block until all outstanding values are Put and also block further Gets until the Do can complete; or use the "current" value of each shard even if it's checked out. The latter requires that shard values be immutable, but it makes Do non-blocking.

  • If it's inconsistent, then it can wait on each shard independently. This is faster and doesn't affect Get and Put, but the caller can only get a rough idea of the combined value. This is fine for uses like approximate statistics counters. (A sketch of this form follows below.)

It may be that we can't make this decision at the API level and have to provide both forms of Do.
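For instance, an inconsistent Do could visit the shards one at a time, waiting only on the shard it is currently visiting. The shards field and its methods here are hypothetical internals, not part of the proposal:

func (s *Sharded) Do(f func(interface{})) {
    for i := range s.shards {
        v := s.shards[i].takeWhenFree() // waits for shard i only
        f(v)
        s.shards[i].put(v)
    }
}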

I think this is a good base API, but I can think of a few reasonable extensions:

  • Provide Peek and CompareAndSwap. If a user of the API can be written in terms of these, then Do would always be able to get an immediate consistent snapshot.

  • Provide a Value operation that uses the user-provided combiner (if we go down that API route) to get the combined value of the sync.Sharded. (A sketch follows below.)
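For example, a hypothetical Value built on Do and a user-provided combiner (here assumed to be stored in an unexported combine field at construction) might look like:

func (s *Sharded) Value() interface{} {
    var acc interface{}
    s.Do(func(v interface{}) {
        if acc == nil {
            acc = v
        } else {
            acc = s.combine(acc, v) // user-provided combiner
        }
    })
    return acc
}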

Activity

aclements (Member, Author) commented on Jan 26, 2017

My own inclination is towards the non-blocking API with a bounded overflow list. A blocking API seems antithetical to the goal of reducing contention and may lead to performance anomalies if a goroutine or OS thread is descheduled while it has a shard checked out, and a non-blocking API with a required combiner may prevent certain use cases (e.g., large structures, or uses that never read the whole sharded value). It also devolves to the blocking API if the bound is 0.

added this to the Proposal milestone on Jan 26, 2017
ianlancetaylor (Contributor) commented on Jan 26, 2017

The proposal as written is rather abstract. I think it would help to examine the specific use cases that people have for such a thing.

For example, it's clear that one use case is collecting metrics. Presumably the idea is that you have some sort of server, and it wants to log various metrics for each request. The metrics only need to be accumulated when they are reported, and reporting happens much less often than collection. Using a lock;update;unlock sequence will lead to lock contention. But (let's say) we need the metrics to be accurate. So the idea of sharding for this case is a lock;update;unlock sequence with a sharded lock, and an accumulate step that does lock;collect;zero;unlock for each sharded metric. That gives us the values we need while minimizing lock contention.

One way to implement this use case is for the sync.Sharded to require a combiner method as you describe. Conceptually, then:

func (s *Sharded) Get() interface{} {
    s.LockCurrentShard()
    r := s.CurrentShardValue()
    s.SetCurrentShardValue(nil)
    s.UnlockCurrentShard()
    return r
}

func (s *Sharded) Put(v interface{}) {
    s.LockCurrentShard()
    defer s.UnlockCurrentShard()
    c := s.CurrentShardValue()
    if c == nil {
        s.SetCurrentShardValue(v)
    } else {
        m := s.Combine(c, v) // Combine function defined by user.
        s.SetCurrentShardValue(m)
    }
}

For typical metrics the Do method does not need to be consistent. However, it's not hard to have a consistent Do as long as the function passed to Do does not use the sync.Sharded value itself.

With this outline, we see that there is no need for sync.Sharded to maintain an overflow list. Any case that wants to use an overflow list will do so in the Combine function. Obviously the Combine function must not use the sync.Sharded value, as that may lead to deadlock, but otherwise it can do whatever it likes.
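To illustrate, a user-defined Combine for the metrics case might simply merge two partial metric values. The type and function here are illustrative, not part of any proposed API:

type metrics struct {
    requests, errors int64
}

// combineMetrics merges two partial metric values. It never touches
// the sync.Sharded itself, so it cannot deadlock with Get or Put.
func combineMetrics(a, b interface{}) interface{} {
    m, n := a.(*metrics), b.(*metrics)
    m.requests += n.requests
    m.errors += n.errors
    return m
}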

What other uses are there for sync.Sharded, and what sorts of implementation do they suggest?

bcmills (Contributor) commented on Jan 26, 2017

I had been considering a somewhat narrower API containing only Get and one of {Range, Do, ForEach}, bounding the number of distinct values to the number of threads in the program. The calling code would provide a func() interface{} at construction time to use when Get is called on a thread without an existing value.

The semantics would be similar to the non-blocking proposal: Get returns the current shard's value (but does not guarantee exclusiveness), and Range iterates over all existing values.

Because of the lack of exclusiveness, application code would still have to use atomic and/or sync to manipulate the values, but if the value is uncontended and usually owned by the same core's cache, the overhead of that application-side synchronization would be relatively small (compared to locking overhead for non-sharded values).
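A sketch of that narrower API and a stats counter on top of it; the constructor and method names are illustrative, and the atomic operations are the application-side synchronization described above:

// Assumed API:
//     NewSharded(newValue func() interface{}) *Sharded
//     (*Sharded).Get() interface{}          // this thread's value, no exclusiveness
//     (*Sharded).Range(f func(interface{})) // visits every existing value

var requests = NewSharded(func() interface{} { return new(int64) })

func record() { atomic.AddInt64(requests.Get().(*int64), 1) }

func approxTotal() (sum int64) {
    requests.Range(func(v interface{}) { sum += atomic.LoadInt64(v.(*int64)) })
    return sum
}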

That approach has a few advantages over the alternatives in the current proposal.

  1. There is no "overflow list" to manage. The number of values is strictly bounded by the number of threads, and the value for a given thread cannot accidentally migrate away or be dropped.
  2. Application code using atomic values (as for the stats-counter use case in sync: per-cpu storage #8281) would not have to deal with lock-ordering (as it would with a blocking Get and Put).
  3. There is no possibility of deadlock (or overallocation of values) due to a missing Put in application code. This is perhaps less significant if we can make trivial defer usage less expensive (runtime: defer is slow #14939) and/or add a vet check (along the lines of lostcancel), but it seems simpler to avoid the problem in the first place.

It has one disadvantage that I'm aware of:

  1. Application code must include its own synchronization code.

Are there other tradeoffs for or against the narrower Get/Range API?

bcmills (Contributor) commented on Jan 26, 2017

[Using the "current" value of each shard even if it's checked out] requires that shard values be immutable, but it makes Do non-blocking.

It doesn't even require immutability: "externally synchronized" and/or "atomic" would suffice, although "externally synchronized" carries the risk of lock-ordering issues.

bcmills (Contributor) commented on Jan 26, 2017

One way to implement [consistent counting] is for the sync.Sharded to require a combiner method as you describe.

Anything that reduces values seems tricky to get right: you'd have to ensure that Do iterates in an order such that Combine cannot move a value that Do has already iterated over into one that it has yet to encounter (and vice-versa), otherwise you risk under- or double-counting that value.

I don't immediately see how to provide that property for Do in the general case without reintroducing a cross-thread contention point in Put, but it may be possible.

ianlancetaylor (Contributor) commented on Jan 26, 2017

For a consistent Do, first lock all the shards, then run the function on each value, then unlock all the shards. For an inconsistent Do, it doesn't matter.

bcmills (Contributor) commented on Jan 26, 2017

For a consistent Do, first lock all the shards, then run the function on each value, then unlock all the shards.

That essentially makes Do a stop-the-world operation: it not only blocks all of those threads until Do completes, but also invalidates the cache lines containing the per-shard locks in each of the local CPU caches.

Ideally, Do should produce much less interference in the steady state: it should only acquire/invalidate locks that are not in the fast path of Get and Put. If the values are read using atomic, that doesn't need to invalidate any cache lines at all: the core processing Do might need to wait to receive an up-to-date value, but since there is no write to the cross-core data the existing cached value doesn't need to be invalidated.

I guess that means I'm in favor of an inconsistent Do, provided that we don't discover a very compelling use-case for making it consistent.

funny-falcon (Contributor) commented on Jan 26, 2017

For some usages there should be strict control over the number of allocated "values", i.e. the number of allocated values should not change. And preferably, values should be allocated at a predictable time, for example, at container (Sharded) creation. For that kind of usage, an interface with Put is not useful.

Probably it should be a separate container:

// NewFixSharded preallocates all values by calling the alloc function, and returns a new FixSharded.
// A FixSharded never changes its size, i.e. it never allocates a new value after construction.
func NewFixSharded(alloc func() interface{}) *FixSharded
// NewFixShardedN preallocates exactly n values by calling the alloc function, and returns a new FixSharded.
func NewFixShardedN(n int, alloc func() interface{}) *FixSharded
func (a *FixSharded) Get() interface{}

If the size never changes, there is no need for Do, ForEach, or locks.
Application code must include its own synchronization code.

Rationale: GOMAXPROCS changes rarely (almost never), so dynamic allocation is excessive.

I could be mistaken about GOMAXPROCS being effectively constant.

ianlancetaylor (Contributor) commented on Jan 26, 2017

@bcmills Well, as I said earlier, I think we need to look at specific use cases. For the specific use case I was discussing, I assert that the cost of a consistent Do is irrelevant, because it is run very infrequently.

What specific use case do you have in mind?

bcmills (Contributor) commented on Jan 26, 2017

@ianlancetaylor I'm specifically thinking about counting (as in #8281) and CPU-local caching (e.g. buffering unconditional stores to a shared map, a potential optimization avenue for #18177).

funny-falcon (Contributor) commented on Jan 26, 2017

I'm thinking about stat-collectors and high-performance RPC.

ianlancetaylor (Contributor) commented on Jan 26, 2017

@bcmills For counting, it seems to me you would use an inconsistent Do. If you need to avoid inconsistency while still using an inconsistent Do, have the combiner store the additional counts elsewhere and not modify the previously stored value. Presumably the combiner is only called in rare cases, so the speed of that rare case is not too important. You could even mitigate that cost by stacking sync.Sharded values.

I don't actually see how to write a consistent Do that does not disturb the fast path of Get and Put at all.

One approach for buffering stores to a shared map would be a Do that removes all the values, replacing them with nil. Come to think of it, that would work for counters also. But it does interfere with the fast path.
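In terms of the hypothetical per-shard primitives from the outline earlier in this thread, that draining Do might look like the following; the slot reset is exactly where it interferes with the fast path:

func (s *Sharded) TakeAll(f func(interface{})) {
    for i := range s.shards {
        s.lockShard(i)
        v := s.shardValue(i)
        s.setShardValue(i, nil) // reset the slot; this is the fast-path interference
        s.unlockShard(i)
        if v != nil {
            f(v) // caller accumulates drained values
        }
    }
}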

@funny-falcon Can you expand on what you mean by "high-performance RPC"? I don't see why you need a global distributed value for RPC.

(129 remaining items not shown)
