CSHARP-2450: Reduce locking costs in BsonSerializer #482


Merged
JamesKovacs merged 2 commits into mongodb:master from csharp2450 on Mar 29, 2021

Conversation

JamesKovacs
Contributor

CSHARP-2450: Improved deserialization performance by switching from HashSet<T> protected by a ReaderWriterLockSlim to a ConcurrentDictionary<K,V> outside the ReaderWriterLockSlim.

The following PR is based heavily on work done by @dnickless in PR#433. I analyzed his changes to determine the origin of the observed 5-10% performance improvement and realized that it was not due to a change in data structure for the registration check but due to reduced locking in the hot deserialization path, notably StandardDiscriminatorConvention.GetActualType, which calls BsonSerializer.EnsureKnownTypesAreRegistered. By changing __typesWithRegisteredKnownTypes to a ConcurrentDictionary<Type, Type> (instead of a HashSet<Type>), I was able to move the registration check outside the ReaderWriterLockSlim __config. The deserialization test below shows a 46% performance improvement in this scenario. The class being deserialized contains only an Id field, to magnify the impact of the class registration lookup by minimizing the time spent actually deserializing the data. More complex classes still show a significant improvement, though not as dramatic (e.g. ~36% for a class with an Id and 3 string fields).
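To make the locking change concrete, here is a minimal before/after sketch of the registration check. This is an illustrative, self-contained class, not the driver's actual code; the field names mirror the driver's.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

internal static class RegistrationCheckSketch
{
    private static readonly ReaderWriterLockSlim __configLock = new ReaderWriterLockSlim();

    // Before: a HashSet<Type> guarded by the read lock, paid on every deserialization.
    private static readonly HashSet<Type> __typesBefore = new HashSet<Type>();

    public static bool IsRegisteredBefore(Type nominalType)
    {
        __configLock.EnterReadLock();
        try
        {
            return __typesBefore.Contains(nominalType);
        }
        finally
        {
            __configLock.ExitReadLock();
        }
    }

    // After: a ConcurrentDictionary<Type, Type> checked without touching __configLock.
    private static readonly ConcurrentDictionary<Type, Type> __typesAfter =
        new ConcurrentDictionary<Type, Type>();

    public static bool IsRegisteredAfter(Type nominalType)
    {
        return __typesAfter.ContainsKey(nominalType);   // lock-free read
    }
}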

To reproduce the results, replace tests/MongoDB.Driver.TestConsoleApplication/Program.cs with the following code:

using System;
using System.Linq;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Bson.Serialization;

var threadsCount = Environment.ProcessorCount / 2;  // Use actual number of cores rather than hyperthreads

Console.WriteLine($"Deserializing with {threadsCount} threads");

// Serialize one instance up front; every thread deserializes the same BSON bytes.
var obj = new Simple();
var bson = obj.ToBson();

// Each worker deserializes the document in a tight loop to maximize contention
// on the serializer's shared state.
Task RunTest(int index) => Task.Run(() =>
{
    var iterNum = 1_000_000;

    for (int i = 0; i < iterNum; i++)
    {
        var rehydrated = BsonSerializer.Deserialize<Simple>(bson);
    }
});

var start = Environment.TickCount;

var tasks = Enumerable.Range(0, threadsCount)
                      .Select(RunTest)
                      .ToArray();

await Task.WhenAll(tasks);

var end = Environment.TickCount;

Console.WriteLine($"Elapsed time: {end - start} ms");

// Deliberately minimal class (Id only) so the registration lookup dominates the cost.
public class Simple
{
    public ObjectId Id { get; set; } = ObjectId.GenerateNewId();
}

You can then run the test code via the following bash script, using both the master and csharp2450 branches, to observe the performance improvement:

dotnet build --configuration=Release
for i in {1..10}; do dotnet run --project tests/MongoDB.Driver.TestConsoleApplication --configuration Release; sleep 5; done

I'm cautiously optimistic that I haven't missed any multi-threaded subtlety that precludes this optimization in our serialization code. I look forward to your feedback and thoughts on this improvement. Big props to @dnickless for drawing our attention to the potential improvement and for prototyping these changes.

CSHARP-2450: Improved deserialization performance by switching from HashSet<T> protected by a ReaderWriterLockSlim to a ConcurrentDictionary<K,V> outside the ReaderWriterLockSlim.
@@ -41,7 +42,7 @@ public static class BsonSerializer
        private static HashSet<Type> __discriminatedTypes = new HashSet<Type>();
        private static BsonSerializerRegistry __serializerRegistry;
        private static TypeMappingSerializationProvider __typeMappingSerializationProvider;
-       private static HashSet<Type> __typesWithRegisteredKnownTypes = new HashSet<Type>();
+       private static ConcurrentDictionary<Type, Type> __typesWithRegisteredKnownTypes = new ConcurrentDictionary<Type, Type>();
Contributor

Are any of the other dictionary lookups on hot paths (e.g. __discriminatorConventions)? If so, can we change any of them to ConcurrentDictionary?

Contributor

There is certainly still __discriminatedTypes, which is used in LookupActualType (such a damn hot path...). This is something I tackled in #347, which also contains this screenshot of a profiling session that seems to indicate that the remaining locking in LookupActualType would be worth getting rid of:
[screenshot: profiling session showing lock contention in LookupActualType]

Contributor Author

Agreed. This is still something that we want to take a closer look at. Since the first real work done in LookupActualType is a call to EnsureKnownTypesAreRegistered (this PR), my desire is to get eyes and agreement on this PR - to ensure that it is correct and I didn't miss any threading edge cases - and then rebase PR#347 on top of this so we can get some accurate performance metrics.

I really appreciate the feedback and collaboration on these issues, as well as your patience, as it has taken a while to find time to really dig in and understand the performance and threading implications.

            }

            __configLock.EnterWriteLock();
            try
            {
-               if (!__typesWithRegisteredKnownTypes.Contains(nominalType))
+               if (!__typesWithRegisteredKnownTypes.ContainsKey(nominalType))
Contributor

Nit: consider using GetOrAdd instead of the if-ContainsKey-then-assign sequence, as the latter causes two lookups.

Contributor Author

Thank you for the feedback!

GetOrAdd(TKey key, Func<TKey,TValue> valueFactory) internally performs multiple lookups: first to determine whether the key is present, and then again after executing valueFactory but prior to inserting into the collection. If the value is not present in the collection, GetOrAdd performs two lookups, and the current ContainsKey/[]=value sequence does the same. If the value is already present, both perform only a single lookup. Testing both implementations, I see the same performance numbers.

I will add a comment to __typesWithRegisteredKnownTypes[nominalType] = nominalType; noting that it must be performed as the last step to ensure that other threads don't see partially initialized types.
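For reference, here is a simplified, self-contained sketch of the pattern discussed in this thread. The class name is hypothetical, and the real EnsureKnownTypesAreRegistered does the actual known-type registration where the placeholder comment sits.

using System;
using System.Collections.Concurrent;
using System.Threading;

public static class KnownTypeRegistrationSketch
{
    private static readonly ReaderWriterLockSlim __configLock = new ReaderWriterLockSlim();
    private static readonly ConcurrentDictionary<Type, Type> __typesWithRegisteredKnownTypes =
        new ConcurrentDictionary<Type, Type>();

    public static void EnsureKnownTypesAreRegistered(Type nominalType)
    {
        // Fast path: a lock-free read on the ConcurrentDictionary; __configLock is not touched.
        if (__typesWithRegisteredKnownTypes.ContainsKey(nominalType))
        {
            return;
        }

        __configLock.EnterWriteLock();
        try
        {
            // Re-check under the write lock in case another thread registered the type first.
            if (!__typesWithRegisteredKnownTypes.ContainsKey(nominalType))
            {
                // ... register the known types for nominalType here ...

                // Marking the type as registered must be the last step so that other
                // threads never observe a type that is only partially initialized.
                __typesWithRegisteredKnownTypes[nominalType] = nominalType;
            }
        }
        finally
        {
            __configLock.ExitWriteLock();
        }
    }
}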

Contributor

You're right about the actual lookups, of course. I was thinking about the GetHashCode() calls, of which you'd be saving one, but in this case they're not costly, so let's ignore that. It's just a standard thing that I constantly tell my devs: "Always use GetOrAdd instead of the other sequence!"... ;)

In fact, I had originally intended to suggest TryAdd(), which would have saved the lookups, but that doesn't work here as there's additional work to be done inside the if branch and it needs to happen before the dictionary insert... Anyway, the code is fine the way it is now, I suppose.

@dnickless
Contributor

Thanks for looking into this and thanks for the kudos. I'm super excited to see the final effect of this once it's been released. Two small remarks:

  1. The more cores you have the stronger the effect, of course. And we've got clients with 12-20 cores...
  2. The reason why I had chosen to go with a Hashtable in #347 (CSHARP-2450: Performance: Reduced lock contention in BsonSerializer.LookupActualType) and #433 (second attempt) instead of ConcurrentDictionary was that the write scenarios are so rare that having to lock explicitly, and also incurring the costs of boxing, appeared to be an acceptable trade-off considering the entirely lock-free reads we'd get with this approach (sketched below). ConcurrentDictionary still has some overhead for managing thread safety; Hashtable is a bit more "basic", so reading concurrently should be even faster than with ConcurrentDictionary.
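For illustration, the Hashtable-based approach described in item 2 would look roughly like the following. This is a hypothetical sketch, not the actual code from #347 or #433; Hashtable is documented as safe for many concurrent readers plus a single writer, so only the rare write path needs an explicit lock.

using System;
using System.Collections;

internal static class HashtableRegistrySketch
{
    private static readonly Hashtable __registeredTypes = new Hashtable();
    private static readonly object __writeLock = new object();

    // Reads take no lock at all.
    public static bool IsRegistered(Type type) => __registeredTypes.ContainsKey(type);

    // Writes are rare, so serializing them through an explicit lock is an acceptable cost.
    public static void MarkRegistered(Type type)
    {
        lock (__writeLock)
        {
            __registeredTypes[type] = type;
        }
    }
}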

@JamesKovacs
Contributor Author

  1. The more cores you have the stronger the effect, of course. And we've got clients with 12-20 cores...

Absolutely agreed. The effect is stronger with more cores and more contention on these locks. I ran with between 4 and 64 threads, but settled on 8 to maximize concurrency while minimizing CPU context switching. (My test machine has 16 logical processors with hyper-threading, which translates into 8 physical cores.)

I should also note that this test was intentionally written as a worst-case scenario maximizing contention on these data structures. I did this by keeping the serialized class as small as possible (only an Id property) to minimize the time spent deserializing values from BSON into C# properties. Also everything is in memory and we are deserializing in a tight loop. In more real-world scenarios, time on the wire to retrieve the data from the server would likely dominate the performance numbers. That's not to say that we shouldn't improve deserialization performance if we can, but only that the effects of these changes are likely to be less pronounced in real-world scenarios.

  2. The reason why I had chosen to go with a Hashtable in #347 (CSHARP-2450: Performance: Reduced lock contention in BsonSerializer.LookupActualType) and #433 (second attempt) instead of ConcurrentDictionary was that the write scenarios are so rare that having to lock explicitly, and also incurring the costs of boxing, appeared to be an acceptable trade-off considering the entirely lock-free reads we'd get with this approach. ConcurrentDictionary still has some overhead for managing thread safety; Hashtable is a bit more "basic", so reading concurrently should be even faster than with ConcurrentDictionary.

I tried a variety of data structures including HashSet<T>, ConcurrentDictionary<K,V>, and Hashtable. (Note that HashSet<T> would have required additional locking for correctness, but I tested it without for the sake of comparison.) All data structures showed the same performance improvements and no marked difference from each other. I chose ConcurrentDictionary<K,V> for the type safety. The biggest difference is that StandardDiscriminatorConvention.GetActualType no longer requires acquisition of the __config read lock and instead relies on the multi-reader concurrency of ConcurrentDictionary<K,V>. At steady state, once all types are registered, this should be a very fast check to verify that the desired nominalType is registered.
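For clarity, the "ConcurrentDictionary acting as a set" pattern amounts to storing each Type as both key and value and only ever calling ContainsKey and the indexer. An illustrative snippet:

using System;
using System.Collections.Concurrent;

var registered = new ConcurrentDictionary<Type, Type>();

// Thread-safe "add to set": key and value are the same Type.
registered[typeof(string)] = typeof(string);

// Lock-free "set membership" check on the hot path.
bool isRegistered = registered.ContainsKey(typeof(string));

Console.WriteLine(isRegistered);   // True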

@BorisDog (Contributor) left a comment

LGTM. Please add the full EG (Evergreen) link when ready.

@JamesKovacs
Contributor Author

I've started a full Evergreen run across all variants and tasks to increase our confidence that we are not breaking anything with these changes. Note that the Evergreen results are only visible internally.

https://spruce.mongodb.com/version/605cf2f130661542696bde44/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

Thanks for reminding me, @BorisDog.

@dnickless
Contributor

Also everything is in memory and we are deserializing in a tight loop. In more real-world scenarios, time on the wire to retrieve the data from the server would likely dominate the performance numbers. That's not to say that we shouldn't improve deserialization performance if we can, but only that the effects of these changes are likely to be less pronounced in real-world scenarios.

You'd be surprised to see our "real-world" scenario. ;) We're using the driver to do a ton of in-memory de-/serialization, e.g. when we send things across the wire to our fat client and receive them on the other end, or to parallelize serialization upfront before writing to MongoDB from a single thread. So it's by no means always an I/O-limited database and network sitting at the other end. You might argue that this is not the primary objective of the driver, and that's certainly correct. But I am pretty certain that we'll be taking the achieved performance gains home with a smile - more or less the way you measured them!

@rstam (Contributor) left a comment

LGTM

What I like about this PR compared to earlier PRs is that as far as I can tell it totally preserves the semantics of a single write lock controlling access to ALL serialization configuration options.

Since the only thing that is different is switching from a HashSet<Type> to a ConcurrentDictionary<Type, Type> (acting as a set), everything should behave exactly the same as before.

@JamesKovacs JamesKovacs merged commit 2d021c5 into mongodb:master Mar 29, 2021
@JamesKovacs JamesKovacs deleted the csharp2450 branch March 29, 2021 17:25
JamesKovacs added a commit that referenced this pull request Mar 30, 2021
CSHARP-2450: Improved deserialization performance by switching from HashSet<T> protected by a ReaderWriterLockSlim to a ConcurrentDictionary<K,V> outside the ReaderWriterLockSlim. (#482)
@JamesKovacs JamesKovacs restored the csharp2450 branch September 6, 2024 21:10