CSHARP-2450: Reduce locking costs in BsonSerializer #482


Merged
JamesKovacs merged 2 commits into mongodb:master from csharp2450 on Mar 29, 2021

Conversation

JamesKovacs
Contributor

CSHARP-2450: Improved deserialization performance by switching from HashSet<T> protected by a ReaderWriterLockSlim to a ConcurrentDictionary<K,V> outside the ReaderWriterLockSlim.

The following PR is based heavily on work done by @dnickless in PR#433. I analyzed his changes to determine the origin of the observed 5-10% performance improvement and realized that it was not due to a change in data structure for the registration check but due to reduced locking in the hot deserialization path, notably StandardDiscriminatorConvention.GetActualType, which calls BsonSerializer.EnsureKnownTypesAreRegistered. By changing __typesWithRegisteredKnownTypes to a ConcurrentDictionary<Type, Type> (instead of a HashSet<Type>), I was able to move the registration check outside the ReaderWriterLockSlim __config. The deserialization test below shows a 46% performance improvement in this scenario. The class being deserialized contains only an Id field, to magnify the impact of the class registration lookup by minimizing the time spent actually deserializing the data. More complex classes still show a significant improvement, though not as dramatic (e.g. ~36% for a class with an Id and 3 string fields).
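To make the locking change concrete, here is a minimal before/after sketch of the registration check. This is an illustrative, self-contained class, not the driver's actual code; the field names mirror the driver's.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

internal static class RegistrationCheckSketch
{
    private static readonly ReaderWriterLockSlim __configLock = new ReaderWriterLockSlim();

    // Before: a HashSet<Type> guarded by the read lock, paid on every deserialization.
    private static readonly HashSet<Type> __typesBefore = new HashSet<Type>();

    public static bool IsRegisteredBefore(Type nominalType)
    {
        __configLock.EnterReadLock();
        try
        {
            return __typesBefore.Contains(nominalType);
        }
        finally
        {
            __configLock.ExitReadLock();
        }
    }

    // After: a ConcurrentDictionary<Type, Type> checked without touching __configLock.
    private static readonly ConcurrentDictionary<Type, Type> __typesAfter =
        new ConcurrentDictionary<Type, Type>();

    public static bool IsRegisteredAfter(Type nominalType)
    {
        return __typesAfter.ContainsKey(nominalType);   // lock-free read
    }
}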

To reproduce the results, replace tests/MongoDB.Driver.TestConsoleApplication/Program.cs with the following code:

using System;
using System.Linq;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Bson.Serialization;

var threadsCount = Environment.ProcessorCount / 2;  // Use actual number of cores rather than hyperthreads

Console.WriteLine($"Deserializing with {threadsCount} threads");

// Serialize one instance up front; every thread deserializes the same BSON bytes.
var obj = new Simple();
var bson = obj.ToBson();

// Each worker deserializes the document in a tight loop to maximize contention
// on the serializer's shared state.
Task RunTest(int index) => Task.Run(() =>
{
    var iterNum = 1_000_000;

    for (int i = 0; i < iterNum; i++)
    {
        var rehydrated = BsonSerializer.Deserialize<Simple>(bson);
    }
});

var start = Environment.TickCount;

var tasks = Enumerable.Range(0, threadsCount)
                      .Select(RunTest)
                      .ToArray();

await Task.WhenAll(tasks);

var end = Environment.TickCount;

Console.WriteLine($"Elapsed time: {end - start} ms");

// Deliberately minimal class (Id only) so the registration lookup dominates the cost.
public class Simple
{
    public ObjectId Id { get; set; } = ObjectId.GenerateNewId();
}

You can then run the test code via the following bash script, using both the master and csharp2450 branches, to observe the performance improvement:

dotnet build --configuration=Release
for i in {1..10}; do dotnet run --project tests/MongoDB.Driver.TestConsoleApplication --configuration Release; sleep 5; done

I'm cautiously optimistic that I haven't missed any multi-threaded subtlety that precludes this optimization in our serialization code. I look forward to your feedback and thoughts on this improvement. Big props to @dnickless for drawing our attention to the potential improvement and for prototyping these changes.

CSHARP-2450: Improved deserialization performance by switching from HashSet<T> protected by a ReaderWriterLockSlim to a ConcurrentDictionary<K,V> outside the ReaderWriterLockSlim.
@@ -41,7 +42,7 @@ public static class BsonSerializer
        private static HashSet<Type> __discriminatedTypes = new HashSet<Type>();
        private static BsonSerializerRegistry __serializerRegistry;
        private static TypeMappingSerializationProvider __typeMappingSerializationProvider;
-       private static HashSet<Type> __typesWithRegisteredKnownTypes = new HashSet<Type>();
+       private static ConcurrentDictionary<Type, Type> __typesWithRegisteredKnownTypes = new ConcurrentDictionary<Type, Type>();
Contributor

Are any of the other dictionary lookups on hot paths (e.g. __discriminatorConventions)? If so, can we change any of them to ConcurrentDictionary?

Contributor

There is certainly still __discriminatedTypes, which is used in LookupActualType (such a damn hot path...). This is something I tackled in #347, which also contains this screenshot of a profiling session that seems to indicate that the remaining locking in LookupActualType would be worth getting rid of:
[screenshot: profiling session showing lock contention in LookupActualType]

Contributor Author

Agreed. This is still something that we want to take a closer look at. Since the first real work done in LookupActualType is a call to EnsureKnownTypesAreRegistered (this PR), my desire is to get eyes and agreement on this PR - to ensure that it is correct and I didn't miss any threading edge cases - and then rebase PR#347 on top of this so we can get some accurate performance metrics.

I really appreciate the feedback and collaboration on these issues, as well as your patience, as it has taken a while to find time to really dig in and understand the performance and threading implications.

            }

            __configLock.EnterWriteLock();
            try
            {
-               if (!__typesWithRegisteredKnownTypes.Contains(nominalType))
+               if (!__typesWithRegisteredKnownTypes.ContainsKey(nominalType))
Contributor

Nit: consider using GetOrAdd instead of the if-ContainsKey-then-assign sequence, as the latter causes two lookups.

Contributor Author

Thank you for the feedback!

GetOrAdd(TKey key, Func<TKey,TValue> valueFactory) internally performs multiple lookups: first to determine whether the key is present, and then again after executing valueFactory but prior to inserting into the collection. If the value is not present in the collection, GetOrAdd performs two lookups, and the current ContainsKey/[]=value sequence does the same. If the value is already present, both perform only a single lookup. Testing both implementations, I see the same performance numbers.

I will add a comment to __typesWithRegisteredKnownTypes[nominalType] = nominalType; noting that it must be performed as the last step to ensure that other threads don't see partially initialized types.
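For reference, here is a simplified, self-contained sketch of the pattern discussed in this thread. The class name is hypothetical, and the real EnsureKnownTypesAreRegistered does the actual known-type registration where the placeholder comment sits.

using System;
using System.Collections.Concurrent;
using System.Threading;

public static class KnownTypeRegistrationSketch
{
    private static readonly ReaderWriterLockSlim __configLock = new ReaderWriterLockSlim();
    private static readonly ConcurrentDictionary<Type, Type> __typesWithRegisteredKnownTypes =
        new ConcurrentDictionary<Type, Type>();

    public static void EnsureKnownTypesAreRegistered(Type nominalType)
    {
        // Fast path: a lock-free read on the ConcurrentDictionary; __configLock is not touched.
        if (__typesWithRegisteredKnownTypes.ContainsKey(nominalType))
        {
            return;
        }

        __configLock.EnterWriteLock();
        try
        {
            // Re-check under the write lock in case another thread registered the type first.
            if (!__typesWithRegisteredKnownTypes.ContainsKey(nominalType))
            {
                // ... register the known types for nominalType here ...

                // Marking the type as registered must be the last step so that other
                // threads never observe a type that is only partially initialized.
                __typesWithRegisteredKnownTypes[nominalType] = nominalType;
            }
        }
        finally
        {
            __configLock.ExitWriteLock();
        }
    }
}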

Contributor

You're right about the actual lookups, of course. I was thinking about the GetHashCode() calls, of which you'd be saving one, but in this case they're not costly, so let's ignore that. It's just a standard thing that I constantly tell my devs: "Always use GetOrAdd instead of the other sequence!"... ;)

In fact, I had originally intended to suggest TryAdd(), which would have saved the lookups, but that doesn't work here as there's additional work to be done inside the if branch and it needs to happen before the dictionary insert... Anyway, the code is fine the way it is now, I suppose.

@dnickless
Contributor

Thanks for looking into this and thanks for the kudos. I'm super excited to see the final effect of this once it's been released. Two small remarks:

  1. The more cores you have the stronger the effect, of course. And we've got clients with 12-20 cores...
  2. The reason why I had chosen to go with a Hashtable in #347 (CSHARP-2450: Performance: Reduced lock contention in BsonSerializer.LookupActualType) and #433 (second attempt) instead of ConcurrentDictionary was that the write scenarios are so rare that having to lock explicitly, and also incurring the costs of boxing, appeared to be an acceptable trade-off considering the entirely lock-free reads we'd get with this approach (sketched below). ConcurrentDictionary still has some overhead for managing thread safety; Hashtable is a bit more "basic", so reading concurrently should be even faster than with ConcurrentDictionary.
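For illustration, the Hashtable-based approach described in item 2 would look roughly like the following. This is a hypothetical sketch, not the actual code from #347 or #433; Hashtable is documented as safe for many concurrent readers plus a single writer, so only the rare write path needs an explicit lock.

using System;
using System.Collections;

internal static class HashtableRegistrySketch
{
    private static readonly Hashtable __registeredTypes = new Hashtable();
    private static readonly object __writeLock = new object();

    // Reads take no lock at all.
    public static bool IsRegistered(Type type) => __registeredTypes.ContainsKey(type);

    // Writes are rare, so serializing them through an explicit lock is an acceptable cost.
    public static void MarkRegistered(Type type)
    {
        lock (__writeLock)
        {
            __registeredTypes[type] = type;
        }
    }
}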

@JamesKovacs
Contributor Author

  1. The more cores you have the stronger the effect, of course. And we've got clients with 12-20 cores...

Absolutely agreed. The effect is stronger with more cores and more contention on these locks. I ran with between 4 and 64 threads, but settled on 8 to maximize concurrency while minimizing CPU context switching. (My test machine has 16 logical processors with hyper-threading, which translates into 8 physical cores.)

I should also note that this test was intentionally written as a worst-case scenario maximizing contention on these data structures. I did this by keeping the serialized class as small as possible (only an Id property) to minimize the time spent deserializing values from BSON into C# properties. Also everything is in memory and we are deserializing in a tight loop. In more real-world scenarios, time on the wire to retrieve the data from the server would likely dominate the performance numbers. That's not to say that we shouldn't improve deserialization performance if we can, but only that the effects of these changes are likely to be less pronounced in real-world scenarios.

  2. The reason why I had chosen to go with a Hashtable in #347 (CSHARP-2450: Performance: Reduced lock contention in BsonSerializer.LookupActualType) and #433 (second attempt) instead of ConcurrentDictionary was that the write scenarios are so rare that having to lock explicitly, and also incurring the costs of boxing, appeared to be an acceptable trade-off considering the entirely lock-free reads we'd get with this approach. ConcurrentDictionary still has some overhead for managing thread safety; Hashtable is a bit more "basic", so reading concurrently should be even faster than with ConcurrentDictionary.

I tried a variety of data structures including HashSet<T>, ConcurrentDictionary<K,V>, and Hashtable. (Note that HashSet<T> would have required additional locking for correctness, but I tested it without for the sake of comparison.) All data structures showed the same performance improvements and no marked difference from each other. I chose ConcurrentDictionary<K,V> for the type safety. The biggest difference is that StandardDiscriminatorConvention.GetActualType no longer requires acquisition of the __config read lock and instead relies on the multi-reader concurrency of ConcurrentDictionary<K,V>. At steady state, once all types are registered, this should be a very fast check to verify that the desired nominalType is registered.
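For clarity, the "ConcurrentDictionary acting as a set" pattern amounts to storing each Type as both key and value and only ever calling ContainsKey and the indexer. An illustrative snippet:

using System;
using System.Collections.Concurrent;

var registered = new ConcurrentDictionary<Type, Type>();

// Thread-safe "add to set": key and value are the same Type.
registered[typeof(string)] = typeof(string);

// Lock-free "set membership" check on the hot path.
bool isRegistered = registered.ContainsKey(typeof(string));

Console.WriteLine(isRegistered);   // True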

@BorisDog (Contributor) left a comment

LGTM. Please add the full EG (Evergreen) link when ready.

@JamesKovacs
Contributor Author

I've started a full Evergreen run across all variants and tasks to increase our confidence that we are not breaking anything with these changes. Note that the Evergreen results are only visible internally.

https://spruce.mongodb.com/version/605cf2f130661542696bde44/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

Thanks for reminding me, @BorisDog.

@dnickless
Contributor

Also everything is in memory and we are deserializing in a tight loop. In more real-world scenarios, time on the wire to retrieve the data from the server would likely dominate the performance numbers. That's not to say that we shouldn't improve deserialization performance if we can, but only that the effects of these changes are likely to be less pronounced in real-world scenarios.

You'd be surprised to see our "real-world" scenario. ;) We're using the driver to do a ton of in-memory de-/serialization, e.g. when we send things across the wire to our fat client and receive them on the other end, or to parallelize serialization upfront before writing to MongoDB from a single thread. So it's by no means always an I/O-limited database and network sitting at the other end. You might argue that this is not the primary objective of the driver, and that's certainly correct. But I am pretty certain that we'll be taking the achieved performance gains home with a smile - more or less the way you measured them!

@rstam (Contributor) left a comment

LGTM

What I like about this PR compared to earlier PRs is that as far as I can tell it totally preserves the semantics of a single write lock controlling access to ALL serialization configuration options.

Since the only thing that is different is switching from a HashSet<Type> to a ConcurrentDictionary<Type, Type> (acting as a set), everything should behave exactly the same as before.

@JamesKovacs JamesKovacs merged commit 2d021c5 into mongodb:master Mar 29, 2021
@JamesKovacs JamesKovacs deleted the csharp2450 branch March 29, 2021 17:25
JamesKovacs added a commit that referenced this pull request Mar 30, 2021
CSHARP-2450: Improved deserialization performance by switching from HashSet<T> protected by a ReaderWriterLockSlim to a ConcurrentDictionary<K,V> outside the ReaderWriterLockSlim. (#482)
@JamesKovacs JamesKovacs restored the csharp2450 branch September 6, 2024 21:10