Skip to content

Refactor Fluent Logger for Improved Thread Safety and Error Handling #130

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

aaithal
Copy link

@aaithal aaithal commented May 8, 2025

Summary

This PR refactors the Fluent logger implementation to improve certain aspects of thread safety and error handling.

Key Changes

1. Thread Safety
  • Introduced thread-safe wrapper methods with sync prefix that handle their own locking
  • Improved mutex usage patterns to prevent potential deadlocks
  • Added proper double-checking of the closed flag after acquiring locks
2. Error Handling
  • Added proper error handling for SetWriteDeadline() and SetReadDeadline()
  • Implemented context cancellation checks throughout the codebase
  • Used Go 1.13+ error wrapping with %w for better error context preservation
  • Added more specific error messages with consistent prefixes
3. Connection Management
  • Created syncClose() method to safely close connections
  • Added syncConnectWithRetry() to handle connection establishment in a thread-safe manner
  • Improved connection state checking to prevent operations on nil connections
4. Read Timeout Support
  • Added support for read timeouts when reading acknowledgments
  • Added a new ReadTimeout configuration option with appropriate defaults
  • Set read deadlines on connections
5. Context Propagation
  • Ensured context is properly propagated to all network operations
  • Added context cancellation checks at critical points to allow early exit
  • Improved handling of context cancellation in the run() goroutine

Testing

All existing tests pass with these changes. The refactoring maintains backward compatibility while improving the robustness of the code.

aaithal added 6 commits May 5, 2025 17:08
The channel and its usage is redundant since we already wire Fluent
object with a cancellable context. Removed stopRunning to use the
context instead.

Instead of waiting for stopRunning in run(), we can just wait for
context to be done as well.

Signed-off-by: Anirudh Aithal <[email protected]>
Added a new test to ensure that pending channel is accessed in a
thread safe manner. The code can be simplified by removing the
pendingMutex lock. There's a risk in this codebase wrt how muconn
and pendingMutex are acquired and released in different orders in
different methods.

This test is a precursor to the change to remove the pendingMutex
to ensure that nothing is broken.

Signed-off-by: Anirudh Aithal <[email protected]>
Changed the "closed" variable to be an atomic int. There is a race condition b/w
appendBuffer and Close() method wrt how this variable is accessed. This could be
an atomic.Bool, but the CI/CD config uses golang version 1.17. Since atomic.Bool
is supported 1.19 onwards, using that would fail tests and build.

Data race trace without this fix (using `go test -race`):
==================
WARNING: DATA RACE
Write at 0x00c0003324e0 by goroutine 95:
  github.com/fluent/fluent-logger-golang/fluent.(*Fluent).close()
      /Users/aithal/workspace/src/github.com/aaithal/fluent-logger-golang/fluent/fluent.go:426 +0x13c
  github.com/fluent/fluent-logger-golang/fluent.(*Fluent).Close()
      /Users/aithal/workspace/src/github.com/aaithal/fluent-logger-golang/fluent/fluent.go:393 +0x108
  github.com/fluent/fluent-logger-golang/fluent.TestCloseWhileWaitingForAckResponse.func1()
      /Users/aithal/workspace/src/github.com/aaithal/fluent-logger-golang/fluent/fluent_test.go:737 +0x2c
  github.com/fluent/fluent-logger-golang/fluent.timeout.func1()
      /Users/aithal/workspace/src/github.com/aaithal/fluent-logger-golang/fluent/fluent_test.go:557 +0x38

Previous read at 0x00c0003324e0 by goroutine 94:
  github.com/fluent/fluent-logger-golang/fluent.(*Fluent).write()
      /Users/aithal/workspace/src/github.com/aaithal/fluent-logger-golang/fluent/fluent.go:619 +0x1ac
  github.com/fluent/fluent-logger-golang/fluent.(*Fluent).writeWithRetry()
      /Users/aithal/workspace/src/github.com/aaithal/fluent-logger-golang/fluent/fluent.go:553 +0x78
  github.com/fluent/fluent-logger-golang/fluent.(*Fluent).run()
      /Users/aithal/workspace/src/github.com/aaithal/fluent-logger-golang/fluent/fluent.go:527 +0x1d0
  github.com/fluent/fluent-logger-golang/fluent.newWithDialer.gowrap1()
      /Users/aithal/workspace/src/github.com/aaithal/fluent-logger-golang/fluent/fluent.go:184 +0x4c

Goroutine 95 (running) created at:
  github.com/fluent/fluent-logger-golang/fluent.timeout()
      /Users/aithal/workspace/src/github.com/aaithal/fluent-logger-golang/fluent/fluent_test.go:556 +0xe8
  github.com/fluent/fluent-logger-golang/fluent.TestCloseWhileWaitingForAckResponse()
      /Users/aithal/workspace/src/github.com/aaithal/fluent-logger-golang/fluent/fluent_test.go:736 +0x444
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.24.2/libexec/src/testing/testing.go:1792 +0x180
  testing.(*T).Run.gowrap1()
      /opt/homebrew/Cellar/go/1.24.2/libexec/src/testing/testing.go:1851 +0x40

Goroutine 94 (running) created at:
  github.com/fluent/fluent-logger-golang/fluent.newWithDialer()
      /Users/aithal/workspace/src/github.com/aaithal/fluent-logger-golang/fluent/fluent.go:184 +0x368
  github.com/fluent/fluent-logger-golang/fluent.TestCloseWhileWaitingForAckResponse()
      /Users/aithal/workspace/src/github.com/aaithal/fluent-logger-golang/fluent/fluent_test.go:722 +0xd8
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.24.2/libexec/src/testing/testing.go:1792 +0x180
  testing.(*T).Run.gowrap1()
      /opt/homebrew/Cellar/go/1.24.2/libexec/src/testing/testing.go:1851 +0x40
==================
[2025-05-06T08:42:08-07:00] Discarding queued events...
    testing.go:1490: race detected during execution of test

Signed-off-by: Anirudh Aithal <[email protected]>
Refactored the code to follow the pattern of releasing the mutex lock
in a defer block as much as possible for the muconn lock. This should make
the code more maintainable, avoiding issues with not accidentally releasing
the lock.

Signed-off-by: Anirudh Aithal <[email protected]>
…nnection

There are a number of places in the code where we do not check if the context
is cancelled before reading from or writing to the connection. This commit adds
those checks.

Signed-off-by: Anirudh Aithal <[email protected]>

cr: https://code.amazon.com/reviews/CR-194284782
Signed-off-by: Anirudh Aithal <[email protected]>
This can be used by clients to exit early when the connection is
unresponsive while reading acks from the server.

Signed-off-by: Anirudh Aithal <[email protected]>
@aaithal aaithal force-pushed the refactor_potential_races branch from 64e3b2b to 1f759a4 Compare May 8, 2025 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants