
GODRIVER-2677 Improve memory pooling. #1157


Merged
merged 4 commits into mongodb:master on Feb 23, 2023

Conversation

@qingyang-hu qingyang-hu (Collaborator) commented Jan 7, 2023

GODRIVER-2677

Summary

Reduce high memory consumption introduced by GODRIVER-2021.

Background & Motivation

I limited the pool to 512 slices. Get() allocates directly from the system once that limit is reached, so the pool should not grow beyond 16MB * 512 ≈ 8GB in theory, compared to the ~20GB of memory consumption reported in the ticket.
I did not reset the capacity of returned byte slices because doing so would cause more allocations.
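For illustration, the sketch below shows the count-limited pool idea described above in minimal form; the names, the 256-byte starting capacity, and the Put-side accounting are assumptions for the example rather than the driver's actual implementation.

package pool

import "sync"

const (
	maxPooledCap = 16 * 1024 * 1024 // drop slices larger than ~16MB on Put
	maxPooled    = 512              // limit on slices handed out from the pool
)

type byteslicePool struct {
	pool     sync.Pool
	mutex    sync.Mutex // guards count and countmax
	count    int        // slices currently checked out of the pool
	countmax int
}

func newByteslicePool() *byteslicePool {
	p := &byteslicePool{countmax: maxPooled}
	p.pool.New = func() interface{} {
		b := make([]byte, 0, 256)
		return &b
	}
	return p
}

// Get hands out a pooled slice while the checked-out count is below the
// limit; past the limit it allocates directly so pooled memory stays bounded.
func (p *byteslicePool) Get() []byte {
	p.mutex.Lock()
	defer p.mutex.Unlock()
	if p.count < p.countmax {
		p.count++
		return (*p.pool.Get().(*[]byte))[:0]
	}
	return make([]byte, 0, 256)
}

// Put returns a slice without resetting its capacity, so a grown backing
// array is reused; oversized slices are left for the garbage collector.
func (p *byteslicePool) Put(b []byte) {
	p.mutex.Lock()
	defer p.mutex.Unlock()
	if p.count == 0 {
		return // slice was allocated directly, not checked out of the pool
	}
	p.count--
	if cap(b) <= maxPooledCap {
		p.pool.Put(&b)
	}
}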

2nd attempt:

  1. Stop pooling on reading;
  2. Don't recycle low-occupied byte slices;
  3. Update the benchmark tests to use a mix of short and long documents.

New benchstat results from the updated benchmark/operation_test.go:

name                           old time/op    new time/op    delta
ClientWrite/not_compressed-10    36.3µs ± 3%    38.4µs ±13%     ~     (p=0.218 n=10+10)
ClientWrite/snappy-10            40.0µs ± 3%    39.6µs ± 2%     ~     (p=0.279 n=8+8)
ClientWrite/zlib-10               134µs ± 4%     128µs ± 5%   -3.98%  (p=0.002 n=10+10)
ClientWrite/zstd-10              48.2µs ± 3%    51.8µs ± 9%   +7.33%  (p=0.001 n=9+10)

name                           old alloc/op   new alloc/op   delta
ClientWrite/not_compressed-10    14.0kB ± 1%    17.3kB ± 1%  +23.76%  (p=0.000 n=10+9)
ClientWrite/snappy-10            25.9kB ± 0%    23.1kB ± 1%  -10.74%  (p=0.000 n=9+10)
ClientWrite/zlib-10               884kB ± 0%     884kB ± 0%     ~     (p=0.739 n=10+10)
ClientWrite/zstd-10              36.7kB ± 0%    32.9kB ± 1%  -10.18%  (p=0.000 n=10+10)

name                           old allocs/op  new allocs/op  delta
ClientWrite/not_compressed-10      57.0 ± 0%      57.0 ± 0%     ~     (all equal)
ClientWrite/snappy-10              65.0 ± 0%      60.0 ± 0%   -7.69%  (p=0.000 n=10+10)
ClientWrite/zlib-10                 102 ± 1%        97 ± 0%   -4.53%  (p=0.000 n=10+10)
ClientWrite/zstd-10                 153 ± 0%       148 ± 0%   -3.27%  (p=0.000 n=9+9)

name                          old time/op    new time/op    delta
ClientRead/not_compressed-10    29.5µs ± 3%    33.8µs ±15%  +14.42%  (p=0.001 n=9+10)
ClientRead/snappy-10            32.9µs ± 1%    36.6µs ±17%  +11.35%  (p=0.016 n=8+10)
ClientRead/zlib-10               141µs ± 3%     138µs ± 4%     ~     (p=0.077 n=9+9)
ClientRead/zstd-10              42.9µs ± 7%    46.3µs ±13%   +7.88%  (p=0.040 n=9+9)

name                          old alloc/op   new alloc/op   delta
ClientRead/not_compressed-10    16.6kB ± 1%    17.3kB ± 1%   +4.33%  (p=0.000 n=10+10)
ClientRead/snappy-10            24.9kB ± 0%    25.3kB ± 0%   +1.64%  (p=0.000 n=10+10)
ClientRead/zlib-10               884kB ± 0%     881kB ± 0%   -0.28%  (p=0.000 n=10+10)
ClientRead/zstd-10              82.9kB ± 0%    84.1kB ± 1%   +1.38%  (p=0.000 n=10+10)

name                          old allocs/op  new allocs/op  delta
ClientRead/not_compressed-10      73.0 ± 0%      73.0 ± 0%     ~     (all equal)
ClientRead/snappy-10              78.0 ± 0%      76.0 ± 0%   -2.56%  (p=0.000 n=10+9)
ClientRead/zlib-10                 122 ± 0%       118 ± 0%   -3.28%  (p=0.000 n=8+7)
ClientRead/zstd-10                 181 ± 0%       178 ± 0%   -1.66%  (p=0.002 n=8+10)

Notes:

  1. The running time varies across benchmark runs, likely due to system load; the memory results, however, are consistent.
  2. The alloc/op increase for ClientWrite/not_compressed-10 could be eliminated by recycling slices regardless of their occupancy. However, I adopted the less aggressive policy of pooling only highly occupied slices to avoid holding on to large, mostly empty slices.

@benjirewis benjirewis (Contributor) left a comment

Nice, simple design. Given the benchmark results, this seems like a solid change to me 🧑‍🔧. Just a few questions.

I need to take a closer look at the benchmarks; this seems more nuanced after discussing offline with @qingyang-hu.

// Proper usage of a sync.Pool requires each entry to have approximately the same memory
// cost. To obtain this property when the stored type contains a variably-sized buffer,
// we add a hard limit on the maximum buffer to place back in the pool. We limit the
// size to 16MiB because that's the maximum wire message size supported by MongoDB.
@benjirewis benjirewis (Contributor) commented Jan 10, 2023

Suggested change
// size to 16MiB because that's the maximum wire message size supported by MongoDB.
// size to 16MB because that's the maximum BSON document size supported by MongoDB.

[nit] The size limitation you're referring to is 16MB (megabytes, not mebibytes abbreviated MiB), and it's a limit on the BSON document size, not on the wire message size.

)

type byteslicePool struct {
pool interface {
Contributor

Why not just make this *sync.Pool?

Collaborator Author

It is mostly for the convenience of testing.
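For context, here is a hedged sketch of that testability pattern: declaring the pool field as a small interface that *sync.Pool already satisfies lets tests substitute a deterministic fake. The names below are illustrative, not the driver's.

package pool

import "sync"

// getPutter is the subset of *sync.Pool's behavior the byte-slice pool needs.
type getPutter interface {
	Get() interface{}
	Put(x interface{})
}

// *sync.Pool satisfies getPutter, so production code can plug in a real pool.
var _ getPutter = (*sync.Pool)(nil)

// fakePool is a test double that records calls and returns fresh slices,
// avoiding sync.Pool's non-deterministic reuse in assertions.
type fakePool struct {
	gets, puts int
}

func (f *fakePool) Get() interface{} {
	f.gets++
	b := make([]byte, 0)
	return &b
}

func (f *fakePool) Put(x interface{}) {
	f.puts++
}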


countmax int
count int
mutex *sync.Mutex
Contributor

Suggested change
mutex *sync.Mutex
// mutex guards countmax and count.
mutex *sync.Mutex

[opt] we usually have comments describing the function of mutexes.

@qingyang-hu qingyang-hu marked this pull request as ready for review January 10, 2023 16:11
	defer p.mutex.Unlock()
	if p.count < p.countmax {
		p.count++
		return (*p.pool.Get().(*[]byte))[:0]
@matthewdale matthewdale (Collaborator) commented Jan 13, 2023

The global count approach suffers from two issues:

  1. pool.Get() doesn't guarantee that it will always return an element from the pool (it may call New instead). As a result, the p.count value may not actually reflect the number of buffers "checked out" and may eventually diverge significantly from the actual number of buffers held, leading to unexpected behavior.
  2. sync.Pool is optimized to prevent locking whenever possible. Holding a global lock defeats that optimization and adds a global contention point to running operations.

Issues with sync.Pool usage leading to high memory usage very similar to the ones we observed in the Go driver are discussed in golang/go#23199 and golang/go#27735. We should consider the approach suggested in golang/go#27735 (comment) which works with the non-deterministic behavior of sync.Pool and preserves its performance characteristics.

Another solution to the same problem is implemented in the "encoding/json" library, which keeps different buffer pools for different size buffers (see this PR). That approach seems to require significantly more code and is much harder to understand, so I don't recommend it.
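As a rough illustration of the filter-on-Put direction referenced here (and reflected in the merged condition quoted below), the sketch keeps sync.Pool lock-free and uncounted and instead decides at Put time whether a buffer is worth retaining; the names and the 256-byte starting capacity are assumptions for the example, not the driver's code.

package pool

import "sync"

const maxPooledCap = 16 * 1024 * 1024 // don't retain buffers larger than this

var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, 256)
		return &b
	},
}

// getBuffer never blocks on a global lock and never tracks a global count.
func getBuffer() []byte {
	return (*bufPool.Get().(*[]byte))[:0]
}

// putBuffer only pools buffers that are small enough and at least half used,
// so a single huge or mostly empty buffer cannot pin memory in the pool.
func putBuffer(b []byte) {
	if c := cap(b); c <= maxPooledCap && c/2 < len(b) {
		bufPool.Put(&b)
	}
}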

@benjirewis benjirewis requested review from benjirewis and removed request for benjirewis January 19, 2023 16:49
@qingyang-hu qingyang-hu changed the title from "GODRIVER-2677 Limit the maximum number of items in pool." to "GODRIVER-2677 Improve memory pooling." on Feb 3, 2023
Comment on lines +437 to 440
// Recycle byte slices that are smaller than 16MiB and at least half occupied.
if c := cap(*wm); c < 16*1024*1024 && c/2 < len(*wm) {
	memoryPool.Put(wm)
}
Collaborator Author

Actually, if I recycle the low-occupancy slices as well, the benchmark shows even lower memory consumption.

@@ -8,7 +8,9 @@ package benchmark

import (
"context"
"math/rand"
Member

crypto/rand is the secure way to generate random numbers. It probably doesn't matter in a test case, but it could be good to enforce using crypto/rand over math/rand in all use cases in the project. What are your thoughts? We currently skip this linter check on _test.go files.

Comment on lines 56 to 60
if random.Int()%2 == 0 {
t = text
} else {
t = "hello"
}
Member

The t variable is kind of hard to read; can we just change the value of text on even random numbers?

Suggested change
if random.Int()%2 == 0 {
t = text
} else {
t = "hello"
}
if random.Int()%2 == 0 {
text = "hello"
}

Comment on lines 104 to 108
if random.Int()%2 == 0 {
t = text
} else {
t = "hello"
}
Member

The t variable is kind of hard to read; can we just change the value of text on even random numbers?

Suggested change
if random.Int()%2 == 0 {
t = text
} else {
t = "hello"
}
if random.Int()%2 == 0 {
text = "hello"
}


for _, tc := range testCases {
t.Run(tc.name, func(t *testing.T) {
gotWM, _, gotErr := Operation{}.roundTrip(context.Background(), tc.conn, tc.paramWM)
Member

Why remove this test?

Collaborator Author

In my opinion, this test mostly checks whether the input slice is returned. Since we no longer reuse the slice, the test is redundant.

@@ -855,18 +856,18 @@ func (op Operation) retryable(desc description.Server) bool {

// roundTrip writes a wiremessage to the connection and then reads a wiremessage. The wm parameter
// is reused when reading the wiremessage.
func (op Operation) roundTrip(ctx context.Context, conn Connection, wm []byte) (result, pooledSlice []byte, err error) {
func (op Operation) roundTrip(ctx context.Context, conn Connection, wm []byte) (result []byte, err error) {
Member

What are your thoughts on the named return parameters? A lot of these functions are small, and the documentation on them is fairly clear. IMO we should remove them.

Collaborator Author

I agree, at least for this one. result is not even used.

@@ -48,7 +53,11 @@ func BenchmarkClientWrite(b *testing.B) {
b.ResetTimer()
b.RunParallel(func(p *testing.PB) {
for p.Next() {
_, err := coll.InsertOne(context.Background(), bson.D{{"text", text}})
n, err := rand.Int(rand.Reader, big.NewInt(int64(len(teststrings))))
Collaborator

Reading from crypto/rand may be intermittently slow on some systems and may negatively impact the reliability of the benchmark. If the goal is to get a consistent distribution of small and large messages, consider incrementing a counter and selecting teststrings[i % len(teststrings)] instead of using a random number.
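If it helps, here is a hedged sketch of that counter-based selection inside the parallel benchmark loop; it assumes the teststrings slice and coll collection from the benchmark's existing setup plus a sync/atomic import, and is not the code as merged.

var i uint64 // shared across the goroutines started by RunParallel

b.ResetTimer()
b.RunParallel(func(p *testing.PB) {
	for p.Next() {
		// Round-robin through the test strings instead of drawing from
		// crypto/rand, keeping the small/large mix deterministic and cheap.
		n := atomic.AddUint64(&i, 1)
		doc := bson.D{{"text", teststrings[n%uint64(len(teststrings))]}}
		if _, err := coll.InsertOne(context.Background(), doc); err != nil {
			b.Error(err)
		}
	}
})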

}

b.ResetTimer()
b.RunParallel(func(p *testing.PB) {
for p.Next() {
n, err := rand.Int(rand.Reader, big.NewInt(2))
Collaborator

Same as above: Consider incrementing a counter and using teststrings[i % len(teststrings)] instead of using a random number.

@matthewdale matthewdale (Collaborator) left a comment

Looks good 👍

@qingyang-hu qingyang-hu merged commit 0d0b23b into mongodb:master Feb 23, 2023
matthewdale pushed a commit that referenced this pull request Mar 17, 2023