Skip to content

sweepbatcher: fix race conditions in unit tests #889

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 4, 2025

Conversation

starius
Copy link
Collaborator

@starius starius commented Feb 26, 2025

Go 1.24 apparently improved its race detector, so regular go test -race started discovering all of these. Now it passes. Some changes were needed not only in unit tests to make accesses of batcher and batch fields thread safe.

Pull Request Checklist

  • Update release_notes.md if your PR contains major features, breaking changes or bugfixes

Copy link
Member

@bhandras bhandras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes! Looks good, just a few questions.


// Check if the sweep is already in a batch. If that is the case, we
// provide the sweep to that batch and return.
for _, batch := range b.batches {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, now I got it :)

@@ -412,9 +413,22 @@ func NewBatchFromDB(cfg batchConfig, bk batchKit) (*batch, error) {
publishErrorHandler: bk.publishErrorHandler,
purger: bk.purger,
store: bk.store,
log: bk.log,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate why this caused a race?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some tests replace the logger with wrappedLogger (defined in sweep_batcher_test.go) to wait for a particular log message.

// This is used only in unit tests! The reason to have this is to
// avoid data races in require.Eventually calls running in parallel
// to the event loop. See method testRunInEventLoop().
testReqs chan *testRequest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could avoid this by instead using checkpoints in within mocks?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the checks that use this feature access batcher.batches and fields of a batch. The fields (e.g. fee rate, the list of sweeps) are updated in the main code, not in the mocks.

}

// UseLogger uses a specified Logger to output package logging info.
// This should be used in preference to SetLogWriter if the caller is also
// using btclog.
func UseLogger(logger btclog.Logger) {
log = logger
log_.Store(&logger)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we only really call UseLogger once it should not race iiuc?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call it in runTests function in each test. It increases logging to include debug messages. I caught a race between tests: in one test a sweepbatcher was shutting down and logging something, while another test was calling UseLogger.

@starius starius marked this pull request as ready for review February 26, 2025 13:24
@starius starius requested a review from bhandras February 26, 2025 14:33
Copy link
Member

@bhandras bhandras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🎉

@@ -430,7 +444,7 @@ func (b *batch) addSweep(ctx context.Context, sweep *sweep) (bool, error) {
// If the provided sweep is nil, we can't proceed with any checks, so
// we just return early.
if sweep == nil {
b.log.Infof("the sweep is nil")
b.log().Infof("the sweep is nil")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid the extra step of calling b.log() we could create log wrappers in the form of:

func (b *batch) Infof(format string, args ...interface{}) {
	b.log().Infof(...)
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks!

// When batch is successfully created it will execute it's first step,
// which leads to a spend monitor of the primary sweep.
<-lnd.RegisterSpendChannel

// Eventually request will be consumed and a new batch will spin up.
require.Eventually(t, func() bool {
return len(getBatches(ctx, batcher)) == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we could add a Batcher.NumBatches() and a batch.NumSweeps() so most of these repeating checks are easier to write.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could also add a batch.Sweeps() which would return the swap hashes for the sweeps. So then we could completely remove the intrusive check.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added private methods Batcher.numBatches(ctx), batch.numSweeps(ctx) and batch.snapshot(ctx) and managed to get rid of the majority of testRunInEventLoop calls.

@@ -535,13 +535,15 @@ func (b *Batcher) Run(ctx context.Context) error {
case sweepReq := <-b.sweepReqs:
sweep, err := b.fetchSweep(runCtx, sweepReq)
if err != nil {
log.Warnf("fetchSweep failed: %v.", err)
log().Warnf("fetchSweep failed: %v.", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before, although there it's just one function call, but the wrapped log might be a bit nicer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added private functions debuf, infof, warnf, and errorf.

@@ -20,7 +20,7 @@ env:

# If you change this value, please change it in the following files as well:
# /Dockerfile
GO_VERSION: 1.21.10
GO_VERSION: 1.24.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@starius starius force-pushed the fix-unit-test-races branch 2 times, most recently from 48b25bd to 51196e7 Compare February 28, 2025 18:23
Copy link
Collaborator

@hieblmi hieblmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ✅ Thanks for the fixes, only a few nits.

for _, batch := range b.batches {
// This is a check to see if a batch is completed. In that case
// we just lazily delete it and continue our scan.
if batch.isComplete() {
delete(b.batches, batch.id)
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove the continue here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks!


// Check if the sweep is already in a batch. If that is the case, we
// provide the sweep to that batch and return.
for _, batch := range b.batches {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, now I got it :)

b.log.Warnf("Method timeout() returned %v. Number of"+
" sweeps: %d. It may be an empty batch.",
b.Warnf("Method timeout() returned %v. Number of "+
"sweeps: %d. It may be an empty batch.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls clarify, is there another option to being an empty batch here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the following comment:

        if timeout <= 0 {
+               // This may happen if the batch is empty or if SweepInfo.Timeout
+               // is not set, may be possible in tests or if there is a bug.
                b.Warnf("Method timeout() returned %v. Number of "+
                        "sweeps: %d. It may be an empty batch.",

"batchConfTarget")
}

b.log.Infof("initializing rbf fee rate for conf target=%v",
b.Infof("initializing rbf fee rate for conf target=%v",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: newline after this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

return false
}

// Make sure the batch has proper batchConfTarget.
return batch.cfg.batchConfTarget == 123
return snapshot.cfg.batchConfTarget == 123
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

place 123 into var...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// Wait for tx to be published.
<-lnd.TxPublishChannel
batch := getOnlyBatch(t, ctx, batcher).snapshot(ctx)
require.Equal(t, int32(123), batch.cfg.batchConfTarget)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and use it here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

return false
}

// Make sure the batch has proper batchConfTarget.
return batch.cfg.batchConfTarget == 123
return snapshot.cfg.batchConfTarget == 123
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

require.Equal(t, feeRateMedium, batch.rbfCache.FeeRate)
batch := getOnlyBatch(t, ctx, batcher)
snapshot := batch.snapshot(ctx)
require.Equal(t, 1, len(snapshot.sweeps))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

require.Len

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

require.Len(t, batch.sweeps, 2)
require.Equal(t, feeRateMedium, batch.rbfCache.FeeRate)
snapshot = batch.snapshot(ctx)
require.Equal(t, 2, len(snapshot.sweeps))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

require.Len

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

starius added 8 commits March 4, 2025 10:25
It should use the c variable passed into the lambda, not the parent t.
It should use assert, not require package.
Previously, if a completed batch was visited after a batch to which the
sweep was added, it was not deleted because the function returned early.

This has been separated into two loops: the first one removes completed batches,
and the second one adds the sweep to a batch.
This is needed to fix crashes in unit tests under -race.
Several structures were accessed without protection causing crashes under -race.
The code inside require.Eventually runs in parallel with the event loops
of the batcher and its batches. Accessing fields of the batcher and batches
must be done within an event loop.

To address this, testRunInEventLoop methods were added to the Batcher and batch
types. Unit tests were then rewritten to use this approach when accessing
batcher and batch fields.

Additionally, in many cases, receive operations from RegisterSpendChannel
were moved before require.Eventually. This prevents testRunInEventLoop from
getting stuck in an event loop while blocked on a RegisterSpendChannel send
operation.
Try to catch more race conditions in unit tests.
@starius starius force-pushed the fix-unit-test-races branch from 51196e7 to 60f4906 Compare March 4, 2025 16:45
@starius starius merged commit 800f0e0 into lightninglabs:master Mar 4, 2025
4 checks passed
@starius starius deleted the fix-unit-test-races branch March 4, 2025 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants