sync: reduce contention between Map operations with new-but-disjoint keys #21035

Closed

Description (opened by @bcmills)

The Go 1.9 implementation of sync.Map uses a single Mutex to guard the read-write map containing new keys. That makes Store calls with different new keys always contend with each other, and also contend with Load calls with different new keys, even if the Loads and Stores for each key are confined to a single thread.

That doesn't really matter for the sync.Map use-cases in the standard library because they do not operate on new keys in the steady state, but it limits the utility of sync.Map for use-cases involving a high rate of churn. Such use-cases may include:

  • caches with high eviction rates, such as caches fronting key-value storage services
  • maps of RPC or HTTP stream ID to handler state
  • maps of opaque handles to Go pointers in order to write C-exported APIs that comply with cgo pointer-passing rules

We should explore ways to address new-key contention, such as sharding the read-write maps and associated locks (as suggested in #20360), journaling writes (and using a Bloom or HyperLogLog filter to avoid reading the journal, along the lines of #21032), or storing the read-write map in an atomic tree data structure instead of a built-in map.

Activity

bcmills commented on Jul 16, 2017 (Contributor, Author)

(@orcaman and/or @OneOfOne might be interested in this one?)

added this to the Go1.10 milestone on Jul 16, 2017

OneOfOne commented on Jul 17, 2017 (Contributor)

I'm interested, but I can't commit right now. The one idea that comes to mind would require hacking runtime/hashmap*; otherwise we'd have to double-hash the keys.

orcaman commented on Jul 17, 2017

Very interesting task; I'd love to take this one. I think it should definitely be doable for the Go1.10 milestone. I'll think about the design some more before I can fully commit to getting it done in time.

OneOfOne commented on Jul 27, 2017 (Contributor)

@orcaman What I wanted to do is something like https://github.com/OneOfOne/cmap/tree/master/v2.

I use a slightly modified version of this in my code, and I wanted to do something similar with sync.Map, but that would require a lot more runtime knowledge: using runtime/map's hasher directly to do the sharding rather than double-hashing the keys.

Sadly I don't have the knowledge nor the time to learn the internals of runtime/map right now.

By all means, if you can do that, it'd be great; or we can discuss it.

// very simple set/get benchmark using cmap, rwmutex map and sync.Map.

BenchmarkCMap/128-8     20000000               105 ns/op               0 B/op          0 allocs/op
BenchmarkCMap/256-8     20000000                88.6 ns/op             0 B/op          0 allocs/op
BenchmarkCMap/512-8     20000000                67.6 ns/op             0 B/op          0 allocs/op
BenchmarkCMap/1024-8    30000000                42.3 ns/op             0 B/op          0 allocs/op
BenchmarkCMap/2048-8    30000000                34.5 ns/op             0 B/op          0 allocs/op
BenchmarkCMap/4096-8    50000000                33.0 ns/op             0 B/op          0 allocs/op
BenchmarkMutexMap-8     10000000               196 ns/op               0 B/op          0 allocs/op
BenchmarkSyncMap-8      30000000                39.3 ns/op            16 B/op          1 allocs/op
modified the milestones: Go1.10, Unplanned on Nov 29, 2017

robaho commented on Nov 26, 2018

I think you can just use a slice of locks and dirty maps, with the low-order bits of the hash selecting the shard; but this requires access to the internal hash code.

robaho commented on Nov 27, 2018

@bcmills I am willing to give this a try. Is there a document that describes using internal facilities when an implementation is part of the stdlib?

[20 comments collapsed]

thepudds commented on Dec 3, 2024 (Member)

Hi @akavel, new APIs need to go through the proposal process, but improving the internal implementation of existing APIs does not, so my guess is there probably was not a specific proposal for updating the implementation of sync.Map.

As I understand the history, this new implementation was originally merged in CL 573956 as an internal HashTrieMap to support the new unique package proposal (#62483), with some discussion in that proposal.

There is a large stack of more recent CLs that add more functionality, I think with the goal of supporting sync.Map, including a comment about enabling it by default in CL 608335 along with some performance-related commentary.

There are probably other interesting bits of commentary if you poke around the history some more.

If people are curious about performance improvements, it could be helpful to test it out via benchmarks for their use cases, which is relatively easy via gotip:

$ go install golang.org/dl/gotip@latest
$ gotip download
$ gotip test -bench=.                    # use gotip as your go command

mknyszek commented on Dec 3, 2024 (Contributor)

Thanks @thepudds. Yeah, there wasn't even a tracking issue for this since the fact that it came to be at all was somewhat coincidental. It turned out the new design was faster than sync.Map in its own benchmarks, so I pushed a little to get the rest of the operations implemented before the freeze.

The commit messages didn't include the benchmark results on sync's benchmark suite, but the geomean across all of them was something like a ~20% improvement for GOMAXPROCS=1 and GOMAXPROCS=2, and more for higher GOMAXPROCS (1 and 2 are probably more realistic, just because very few applications are going to be hammering on a sync.Map on all cores). I can re-run the benchmarks and post them here, and you're also welcome to run them yourself. You can also disable the new map implementation for now with GOEXPERIMENT=nosynchashtriemap.

akavel commented on Dec 9, 2024 (Contributor)

@mknyszek I think it would be super cool if you could post them here; I suppose other/future readers might also be interested. I got here during some quick research trying to evaluate the feasibility of sync.Map for my use, which led me to reach for third-party replacements. If others trek a similar path in the future, it could be useful to them.

thepudds commented on Dec 9, 2024 (Member)

Hi @akavel and anyone else interested -- @mknyszek posted some micro-benchmark results for the new sync.Map implementation in #70683, along with some brief discussion.

I would definitely encourage anyone curious to try out the new implementation (e.g., via gotip or otherwise). Bugs reported now are especially helpful.

Also note that he mentioned you can currently disable via GOEXPERIMENT=nosynchashtriemap if you want to switch back and forth between implementations (when testing, for example).

rabbbit commented on Dec 11, 2024

I would definitely encourage anyone curious to try out the new implementation (e.g., via gotip or otherwise). Bugs reported now are especially helpful.

I was curious, and I ran some of our internal benchmarks against gotip. We're testing a single sync.Map "graph" with a set of "elements". The test is effectively a "LoadOrStore" test where creating a new "element" is 200x more expensive, in allocs, than a lookup.

  • the graph is large, equal to b.N
  • we test with 100%, 75%, 50%, ..., 0% new elements inserted into the graph
  • in a second version, we also run an "export" in parallel that iterates over all elements in the graph.

The results are good in the pathological case: a ~20% win when creating a lot of new elements while iterating at the same time. Comparing gotip with the experiment enabled vs. disabled:

GraphNewCallNewElementParallel/http2-100-12                      33.10µ ±  6%   31.86µ ±  7%        ~ (p=0.280 n=10)
GraphNewCallNewElementParallel/http2-75-12                       24.43µ ±  2%   24.05µ ±  3%        ~ (p=0.063 n=10)
GraphNewCallNewElementParallel/http2-50-12                       16.80µ ±  4%   16.54µ ±  8%        ~ (p=0.529 n=10)
GraphNewCallNewElementParallel/http2-25-12                       9.272µ ±  6%   8.951µ ±  3%        ~ (p=0.063 n=10)
GraphNewCallNewElementParallel/http2-10-12                       4.818µ ±  6%   4.880µ ±  7%        ~ (p=0.247 n=10)
GraphNewCallNewElementParallel/http2-5-12                        3.689µ ±  9%   3.693µ ±  8%        ~ (p=0.592 n=10)
GraphNewCallNewElementParallel/http2-2-12                        3.572µ ± 18%   3.021µ ±  5%  -15.43% (p=0.023 n=10)
GraphNewCallNewElementParallel/http2-1-12                        2.965µ ±  7%   2.864µ ±  9%        ~ (p=0.481 n=10)
GraphNewCallNewElementParallel/http2-0-12                        2.847µ ±  7%   2.656µ ± 16%        ~ (p=0.118 n=10)
GraphNewCallNewElementParallelWithExport/http2-100-12            38.99µ ±  9%   31.15µ ± 11%  -20.12% (p=0.000 n=10)
GraphNewCallNewElementParallelWithExport/http2-75-12             29.40µ ±  8%   23.27µ ± 10%  -20.83% (p=0.000 n=10)
GraphNewCallNewElementParallelWithExport/http2-50-12             18.59µ ±  4%   16.28µ ±  6%  -12.42% (p=0.000 n=10) 
GraphNewCallNewElementParallelWithExport/http2-25-12            10.139µ ±  3%   9.183µ ±  6%   -9.43% (p=0.000 n=10) 
GraphNewCallNewElementParallelWithExport/http2-10-12             5.321µ ±  5%   5.246µ ±  6%        ~ (p=0.190 n=10) 
GraphNewCallNewElementParallelWithExport/http2-5-12              3.853µ ± 12%   3.882µ ±  5%        ~ (p=1.000 n=10) 
GraphNewCallNewElementParallelWithExport/http2-2-12              3.129µ ± 17%   3.213µ ±  4%        ~ (p=0.684 n=10) 
GraphNewCallNewElementParallelWithExport/http2-1-12              2.879µ ±  7%   2.937µ ±  6%        ~ (p=0.184 n=10) 
GraphNewCallNewElementParallelWithExport/http2-0-12              2.738µ ± 50%   2.716µ ±  7%        ~ (p=0.306 n=10)

More interesting were the results on 1.23.4 vs. gotip (so unrelated to this change):

  • 4-5% CPU degradation for the "sequential" case
  • 1-2% more bytes per op
  • 10% more allocs per op.
                                                      │ gotip_before_3 │         gotip_after_enabled3         │
                                                      │      B/op      │     B/op      vs base                │
GraphNewCallNewElement/http-12                            19.75Ki ± 0%   20.24Ki ± 0%  +2.44% (p=0.000 n=10)

                                                      │ gotip_before_3 │         gotip_after_enabled3          │                                                                                             
                                                      │      B/op      │     B/op      vs base                 │
GraphNewCallNewElementParallel/http2-100-12               20.00Ki ± 0%   20.27Ki ± 0%  +1.34% (p=0.000 n=10) 
GraphNewCallNewElementParallel/http2-75-12                15.10Ki ± 0%   15.33Ki ± 1%  +1.52% (p=0.000 n=10) 
GraphNewCallNewElementParallel/http2-50-12                10.23Ki ± 0%   10.35Ki ± 1%  +1.13% (p=0.000 n=10) 
GraphNewCallNewElementParallel/http2-25-12                5.328Ki ± 1%   5.396Ki ± 0%  +1.28% (p=0.000 n=10) 
GraphNewCallNewElementParallel/http2-10-12                2.392Ki ± 0%   2.427Ki ± 0%  +1.47% (p=0.000 n=10) 
GraphNewCallNewElementParallel/http2-5-12                 1.422Ki ± 1%   1.427Ki ± 1%       ~ (p=0.223 n=10) 
GraphNewCallNewElementParallel/http2-2-12                   850.0 ± 1%     853.0 ± 1%       ~ (p=0.208 n=10) 
GraphNewCallNewElementParallel/http2-1-12                   649.5 ± 1%     651.5 ± 0%       ~ (p=0.132 n=10) 
GraphNewCallNewElementParallel/http2-0-12                   448.0 ± 0%     448.0 ± 0%       ~ (p=1.000 n=10) ¹
GraphNewCallNewElementParallelWithExport/http2-100-12     20.07Ki ± 0%   20.27Ki ± 0%  +1.01% (p=0.000 n=10) 
GraphNewCallNewElementParallelWithExport/http2-75-12      15.20Ki ± 0%   15.30Ki ± 0%  +0.61% (p=0.000 n=10) 
GraphNewCallNewElementParallelWithExport/http2-50-12      10.26Ki ± 0%   10.34Ki ± 0%  +0.79% (p=0.000 n=10) 
GraphNewCallNewElementParallelWithExport/http2-25-12      5.345Ki ± 0%   5.392Ki ± 0%  +0.89% (p=0.000 n=10) 
GraphNewCallNewElementParallelWithExport/http2-10-12      2.406Ki ± 0%   2.424Ki ± 0%  +0.73% (p=0.002 n=10) 
GraphNewCallNewElementParallelWithExport/http2-5-12       1.423Ki ± 1%   1.430Ki ± 1%       ~ (p=0.211 n=10) 
GraphNewCallNewElementParallelWithExport/http2-2-12         852.0 ± 1%     854.0 ± 1%       ~ (p=0.383 n=10) 
GraphNewCallNewElementParallelWithExport/http2-1-12         650.5 ± 1%     651.5 ± 0%       ~ (p=0.356 n=10) 
GraphNewCallNewElementParallelWithExport/http2-0-12         448.0 ± 0%     448.0 ± 0%       ~ (p=1.000 n=10) 

                                                      │ gotip_before_3 │        gotip_after_enabled3         │
                                                      │   allocs/op    │ allocs/op   vs base                 │
GraphNewCallNewElement/http-12                              209.0 ± 0%   229.0 ± 0%  +9.57% (p=0.000 n=10)  
GraphNewCallNewElement/tchannel-12                          209.0 ± 0%   229.0 ± 0%  +9.57% (p=0.000 n=10)  
GraphNewCallNewElement/http2-12                             209.0 ± 0%   229.0 ± 0%  +9.57% (p=0.000 n=10)  
GraphNewCallNewElementParallel/http2-100-12                 215.0 ± 0%   230.0 ± 0%  +6.98% (p=0.000 n=10)  
GraphNewCallNewElementParallel/http2-75-12                  161.0 ± 0%   173.0 ± 1%  +7.45% (p=0.000 n=10)  
GraphNewCallNewElementParallel/http2-50-12                  108.0 ± 0%   115.0 ± 2%  +6.48% (p=0.000 n=10)  
GraphNewCallNewElementParallel/http2-25-12                  54.00 ± 0%   58.00 ± 0%  +7.41% (p=0.000 n=10)  
GraphNewCallNewElementParallel/http2-10-12                  22.00 ± 0%   24.00 ± 4%  +9.09% (p=0.000 n=10)  
GraphNewCallNewElementParallel/http2-5-12                   11.00 ± 0%   12.00 ± 0%  +9.09% (p=0.000 n=10)  
GraphNewCallNewElementParallel/http2-2-12                   5.000 ± 0%   5.000 ± 0%       ~ (p=1.000 n=10) ¹
GraphNewCallNewElementParallel/http2-1-12                   3.000 ± 0%   3.000 ± 0%       ~ (p=1.000 n=10) ¹  
GraphNewCallNewElementParallel/http2-0-12                   1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
GraphNewCallNewElementParallelWithExport/http2-100-12       215.0 ± 0%   230.0 ± 0%  +6.98% (p=0.000 n=10)  
GraphNewCallNewElementParallelWithExport/http2-75-12        162.0 ± 1%   173.0 ± 1%  +6.79% (p=0.000 n=10)  
GraphNewCallNewElementParallelWithExport/http2-50-12        108.0 ± 1%   115.0 ± 1%  +6.48% (p=0.000 n=10)  
GraphNewCallNewElementParallelWithExport/http2-25-12        54.00 ± 0%   58.00 ± 0%  +7.41% (p=0.000 n=10)  
GraphNewCallNewElementParallelWithExport/http2-10-12        22.00 ± 0%   24.00 ± 4%  +9.09% (p=0.000 n=10)  
GraphNewCallNewElementParallelWithExport/http2-5-12         11.00 ± 0%   12.00 ± 0%  +9.09% (p=0.000 n=10)  
GraphNewCallNewElementParallelWithExport/http2-2-12         5.000 ± 0%   5.000 ± 0%       ~ (p=1.000 n=10) ¹
GraphNewCallNewElementParallelWithExport/http2-1-12         3.000 ± 0%   3.000 ± 0%       ~ (p=1.000 n=10) ¹
GraphNewCallNewElementParallelWithExport/http2-0-12         1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹

Tangentially, we're using Bazel, so it took me some time to figure out whether I had wired the experiment through correctly. Was printing the enabled experiments as part of the benchmark preamble ever considered?

goos: darwin
goarch: arm64        <<< experiments here?
cpu: Apple M2 Max

mknyszek commented on Dec 12, 2024 (Contributor)

The performance degradation is interesting. Where are the allocations coming from? What's the diff in the memory profile? I'm wondering if it's from the callsite or something else.

Also, if you're up for checking it, what does Go 1.23.4 vs. gotip-with-experiment-off look like?

rabbbit commented on Dec 12, 2024

Also, if you're up for checking it, what does Go 1.23.4 vs. gotip-with-experiment-off look like?

I'm perhaps missing part of your question (and, experiment off means new map on? :)), but experiment on/off shows no difference in most benchmarks, so the comparison vs. 1.23.4 should also be stable?

The performance degradation is interesting. Where are the allocations coming from? What's the diff in the memory profile? I'm wondering if it's from the callsite or something else.

I rechecked; the degradation is there:

goos: darwin
goarch: arm64
cpu: Apple M2 Max
                               │ gotip_allocs_before │      gotip_allocs_after       │
                               │       sec/op        │   sec/op     vs base          │
GraphNewCallNewElement/http-12           120.3µ ± 5%   122.0µ ± 5%  ~ (p=0.280 n=10)

                               │ gotip_allocs_before │         gotip_allocs_after          │
                               │        B/op         │     B/op      vs base               │
GraphNewCallNewElement/http-12          19.89Ki ± 1%   20.24Ki ± 0%  +1.73% (p=0.000 n=10)

                               │ gotip_allocs_before │        gotip_allocs_after         │
                               │      allocs/op      │ allocs/op   vs base               │
GraphNewCallNewElement/http-12            212.0 ± 1%   229.0 ± 0%  +8.02% (p=0.000 n=10)

Comparing the two profiles, gotip shows (I apologize for this being an image; it appears I don't know how to use pprof):

[image: memory profile diff]

Disabling swissmaps results in:

goos: darwin
goarch: arm64
cpu: Apple M2 Max
                               │ gotip_allocs_before │   gotip_allocs_after_swiss    │
                               │       sec/op        │   sec/op     vs base          │
GraphNewCallNewElement/http-12           120.3µ ± 5%   118.4µ ± 6%  ~ (p=0.529 n=10)

                               │ gotip_allocs_before │      gotip_allocs_after_swiss       │
                               │        B/op         │     B/op      vs base               │
GraphNewCallNewElement/http-12          19.89Ki ± 1%   20.10Ki ± 0%  +1.06% (p=0.000 n=10)

                               │ gotip_allocs_before │     gotip_allocs_after_swiss      │
                               │      allocs/op      │ allocs/op   vs base               │
GraphNewCallNewElement/http-12            212.0 ± 1%   215.0 ± 0%  +1.42% (p=0.000 n=10)

Not entirely sure what to make of this, or even how to debug it further right now.

prattmic commented on Dec 13, 2024 (Member)

I'll come back to this next week to read the context more carefully, but for now I'll quickly say that an increase in alloc count with swissmaps isn't inherently surprising, as these maps contain more small allocations. Overall size should be similar.

prattmic commented on Dec 20, 2024 (Member)

The new sync.Map implementation does not use the builtin map type at all, so the effect from enabling/disabling swissmaps must be from some other part of your benchmark that is using maps.

rabbbit commented on Feb 11, 2025

The new sync.Map implementation does not use the builtin map type at all, so the effect from enabling/disabling swissmaps must be from some other part of your benchmark that is using maps.

Sorry, I missed your response. Yes indeed: there's a large sync.Map, where each of the objects internally contains/creates many standard maps. In benchmarks those sub-maps are mostly very sparse (typically 1-2 objects), so I would expect the allocs to come from something related to map initialization.

mknyszek commented on May 13, 2025 (Contributor)

I'm going to mark this issue as fixed now, since we no longer have a warm-up time for new keys; accessing disjoint keys already scales nicely from the moment they're inserted.
