
WebSockets may occasionally fail to establish normal connections #70265


Closed
anodiebird opened this issue Nov 9, 2024 · 3 comments

@anodiebird

Go version

go version go1.23.1 windows/amd64

Output of go env in your module/workspace:

set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\Administrator.DESKTOP-3AFC8C4\AppData\Local\go-build
set GOENV=C:\Users\Administrator.DESKTOP-3AFC8C4\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\Administrator.DESKTOP-3AFC8C4\go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\Administrator.DESKTOP-3AFC8C4\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=D:\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLCHAIN=auto
set GOTOOLDIR=D:\go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.23.1
set GODEBUG=
set GOTELEMETRY=local
set GOTELEMETRYDIR=C:\Users\Administrator.DESKTOP-3AFC8C4\AppData\Roaming\go\telemetry
set GCCGO=gccgo
set GOAMD64=v1
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=0
set GOMOD=NUL
set GOWORK=
set CGO_CFLAGS=-O2 -g
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-O2 -g
set CGO_FFLAGS=-O2 -g
set CGO_LDFLAGS=-O2 -g
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -fno-caret-diagnostics -Qunused-arguments -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=C:\Users\ADMINI~1.DES\AppData\Local\Temp\go-build1276363962=/tmp/go-build -gno-record-gcc-switches

What did you do?

We are developing a WebSocket-based service with Go 1.23.1 and have run into a problem when establishing connections.

Sometimes the WebSocket connection cannot be established. The failure reproduces consistently on a few Windows 10 machines, while most Windows 10 machines are unaffected.

What did you see happen?

To make the issue easier to reproduce, I wrote the minimal server below. It uses the Gin framework to route HTTP requests and then upgrades each request to a WebSocket connection with the github.com/gorilla/websocket library. I also tried other WebSocket libraries, and none of them avoided the problem.

package main

import (
	"log"
	"net/http"

	"github.com/gin-gonic/gin"
	"github.com/gorilla/websocket"
)

// upgrader accepts every origin so that any local test page can connect.
var upgrader = websocket.Upgrader{
	CheckOrigin: func(r *http.Request) bool {
		return true
	},
}

// LogWriter forwards Gin's output to the standard logger.
type LogWriter struct{}

func (w *LogWriter) Write(data []byte) (int, error) {
	log.Printf("%s", data)
	return len(data), nil
}

// wsHandle upgrades the request to a WebSocket connection. The connection
// is not used further because the reported hang happens before the
// handshake completes.
func wsHandle(c *gin.Context) {
	_, err := upgrader.Upgrade(c.Writer, c.Request, nil)
	if err != nil {
		log.Printf("upgrade error: %s", err)
		return
	}
}

func main() {
	gin.DefaultWriter = &LogWriter{}
	gin.DefaultErrorWriter = &LogWriter{}
	server := gin.Default()
	server.GET("/ws", wsHandle)
	if err := server.Run("127.0.0.1:10020"); err != nil {
		panic(err)
	}
}

I ran this program on an affected machine. Once the server was listening on port 10020, I opened a connection from the client with the following code.

ws = new WebSocket("ws://127.0.0.1:10020/ws")

ws.readyState then stayed at 0 (CONNECTING) and never changed to 1 (OPEN), which means the opening handshake never completed.

I traced the whole connection attempt with Wireshark. The capture is shown in the figure below.
[Figure: Wireshark capture of the WebSocket connection attempt.]

What did you expect to see?

Normally the capture should next show a 101 Switching Protocols response, after which the WebSocket connection is established. On the affected machines that response never appears, which indicates that the server blocks during the handshake.
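The same check can be made without a browser or Wireshark. The following is a minimal sketch, assuming the server from the report is listening on 127.0.0.1:10020; handshakeRequest and probeUpgrade are hypothetical helper names, and the Sec-WebSocket-Key is the sample nonce from RFC 6455. It sends the opening handshake over a plain TCP connection and waits, with a deadline, for the status line of the reply.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"net/http"
	"time"
)

// handshakeRequest (hypothetical helper) builds a minimal WebSocket
// opening handshake. The key is the sample nonce from RFC 6455.
func handshakeRequest(host, path string) string {
	return "GET " + path + " HTTP/1.1\r\n" +
		"Host: " + host + "\r\n" +
		"Upgrade: websocket\r\n" +
		"Connection: Upgrade\r\n" +
		"Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n" +
		"Sec-WebSocket-Version: 13\r\n\r\n"
}

// probeUpgrade (hypothetical helper) sends the handshake to addr and
// returns the HTTP status code of the reply. It returns an error if the
// connection fails or no reply arrives within timeout.
func probeUpgrade(addr, path string, timeout time.Duration) (int, error) {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	// One deadline covers both the write and the read of the reply.
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return 0, err
	}
	if _, err := conn.Write([]byte(handshakeRequest(addr, path))); err != nil {
		return 0, err
	}
	resp, err := http.ReadResponse(bufio.NewReader(conn), nil)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	code, err := probeUpgrade("127.0.0.1:10020", "/ws", 3*time.Second)
	if err != nil {
		fmt.Println("no handshake response:", err)
		return
	}
	fmt.Println("server answered with status", code)
}
```

On an unaffected machine the probe should report status 101. If the server blocks during the handshake as described, the read should instead fail with a timeout error once the deadline passes.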

@seankhliao
Member

That would be an error in the websocket library; please report it to them.

@seankhliao closed this as not planned on Nov 9, 2024
@anodiebird
Author


Problem location
I set breakpoints to locate the problem and found where the blocking occurs: in connReader.backgroundRead and connReader.abortPendingRead in net/http (https://cs.opensource.google/go/go/+/master:src/net/http/server.go;l=689). The relevant code is shown below.

1. func (cr *connReader) backgroundRead() {
2. 	n, err := cr.conn.rwc.Read(cr.byteBuf[:])
3. 	cr.lock()
4. 	if n == 1 {
5. 		cr.hasByte = true
6. 		// We were past the end of the previous request's body already
7. 		// (since we wouldn't be in a background read otherwise), so
8. 		// this is a pipelined HTTP request. Prior to Go 1.11 we used to
9. 		// send on the CloseNotify channel and cancel the context here,
10. 		// but the behavior was documented as only "may", and we only
11. 		// did that because that's how CloseNotify accidentally behaved
12. 		// in very early Go releases prior to context support. Once we
13. 		// added context support, people used a Handler's
14. 		// Request.Context() and passed it along. Having that context
15. 		// cancel on pipelined HTTP requests caused problems.
16. 		// Fortunately, almost nothing uses HTTP/1.x pipelining.
17. 		// Unfortunately, apt-get does, or sometimes does.
18. 		// New Go 1.11 behavior: don't fire CloseNotify or cancel
19. 		// contexts on pipelined requests. Shouldn't affect people, but
20. 		// fixes cases like Issue 23921. This does mean that a client
21. 		// closing their TCP connection after sending a pipelined
22. 		// request won't cancel the context, but we'll catch that on any
23. 		// write failure (in checkConnErrorWriter.Write).
24. 		// If the server never writes, yes, there are still contrived
25. 		// server & client behaviors where this fails to ever cancel the
26. 		// context, but that's kinda why HTTP/1.x pipelining died
27. 		// anyway.
28. 	}
29. 	if ne, ok := err.(net.Error); ok && cr.aborted && ne.Timeout() {
30. 		// Ignore this error. It's the expected error from
31. 		// another goroutine calling abortPendingRead.
32. 	} else if err != nil {
33. 		cr.handleReadError(err)
34. 	}
35. 	cr.aborted = false
36. 	cr.inRead = false
37. 	cr.unlock()
38. 	cr.cond.Broadcast()
39. }

40. func (cr *connReader) abortPendingRead() {
41. 	cr.lock()
42. 	defer cr.unlock()
43. 	if !cr.inRead {
44. 		return
45. 	}
46. 	cr.aborted = true
47. 	cr.conn.rwc.SetReadDeadline(aLongTimeAgo)
48. 	for cr.inRead {
49. 		cr.cond.Wait()
50. 	}
51. 	cr.conn.rwc.SetReadDeadline(time.Time{})
52. }

The two goroutines block on line 2 and line 49 respectively. Line 47 sets a read deadline in the past on rwc, which should make the pending Read on line 2 return with a timeout error, yet on the affected machines that Read stays blocked. Tracing the same code on an unaffected machine shows that as soon as line 47 completes, the Read on line 2 returns. The comparison indicates that this is why the WebSocket connection cannot be established, and it suggests a problem in net/http or in a lower-level package that net/http depends on.
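The mechanism that lines 2 and 47 rely on can be exercised in isolation. The following is a minimal sketch, not the net/http code itself: pendingReadAborts is a hypothetical helper that blocks one goroutine in Read on an idle loopback TCP connection and then sets a read deadline in the past from another goroutine. It uses time.Unix(1, 0) as the past deadline, which is the value I believe net/http uses for aLongTimeAgo.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// pendingReadAborts (hypothetical helper) reports whether a Read that is
// already blocked on an idle TCP connection returns a deadline error
// within wait after a read deadline in the past is set.
func pendingReadAborts(wait time.Duration) (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer ln.Close()

	// The client never writes, so the server-side Read has nothing to return.
	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return false, err
	}
	defer client.Close()

	server, err := ln.Accept()
	if err != nil {
		return false, err
	}
	defer server.Close()

	readErr := make(chan error, 1)
	go func() {
		var b [1]byte
		_, err := server.Read(b[:]) // plays the role of line 2
		readErr <- err
	}()

	time.Sleep(50 * time.Millisecond) // give Read time to block

	// Plays the role of line 47.
	if err := server.SetReadDeadline(time.Unix(1, 0)); err != nil {
		return false, err
	}

	select {
	case err := <-readErr:
		return errors.Is(err, os.ErrDeadlineExceeded), nil
	case <-time.After(wait):
		return false, nil // Read is still blocked
	}
}

func main() {
	ok, err := pendingReadAborts(2 * time.Second)
	fmt.Println("read aborted by deadline:", ok, "error:", err)
}
```

If this program reports false on an affected machine and true on an unaffected one, that would point below net/http, at the net package or the operating system, rather than at any WebSocket library.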


Solution
I originally intended to dig further into the net package to look for a lower-level bug. However, when I added a short sleep (shown below) before line 2 of the code above, the problem disappeared immediately, which surprised me. The WebSocket connection is now established normally even on the affected machines.

1. func (cr *connReader) backgroundRead() {
2. 	time.Sleep(time.Microsecond)
3. 	n, err := cr.conn.rwc.Read(cr.byteBuf[:])
4. 	cr.lock()

What I expect
This workaround makes the problem go away, but I still do not understand the root cause, and I wonder whether there is a better fix. It also needs to be clarified whether this points to a bug in the net package and, if so, how it should be fixed. It is hard to trust code whose behavior changes when a single sleep is added before one line.

I hope someone can identify the root cause and suggest a more reliable and safe way to make sure the WebSocket connection can always be established.
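One way to separate net/http from the WebSocket libraries might be a server that uses only the standard library. The following is a sketch under assumptions: it performs the handshake by hand through http.Hijacker, checks only that a Sec-WebSocket-Key is present, and never exchanges WebSocket frames. As far as I can tell from server.go, Hijack is the call that reaches abortPendingRead, so this should exercise the same path as the libraries' Upgrade functions. acceptKey and wsGUID are names chosen for this example; the GUID value is the fixed one from RFC 6455.

```go
package main

import (
	"crypto/sha1"
	"encoding/base64"
	"io"
	"log"
	"net/http"
)

// wsGUID is the fixed GUID that RFC 6455 appends to the client key.
const wsGUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

// acceptKey computes the Sec-WebSocket-Accept value for a client key.
func acceptKey(key string) string {
	sum := sha1.Sum([]byte(key + wsGUID))
	return base64.StdEncoding.EncodeToString(sum[:])
}

// wsHandle answers the opening handshake without any third-party code.
func wsHandle(w http.ResponseWriter, r *http.Request) {
	key := r.Header.Get("Sec-WebSocket-Key")
	if key == "" {
		http.Error(w, "not a WebSocket handshake", http.StatusBadRequest)
		return
	}
	hj, ok := w.(http.Hijacker)
	if !ok {
		http.Error(w, "hijacking not supported", http.StatusInternalServerError)
		return
	}

	log.Printf("before Hijack: %s", r.RemoteAddr)
	conn, _, err := hj.Hijack() // the reported hang would be expected here
	if err != nil {
		log.Printf("hijack error: %v", err)
		return
	}
	defer conn.Close()
	log.Printf("after Hijack: %s", r.RemoteAddr)

	resp := "HTTP/1.1 101 Switching Protocols\r\n" +
		"Upgrade: websocket\r\n" +
		"Connection: Upgrade\r\n" +
		"Sec-WebSocket-Accept: " + acceptKey(key) + "\r\n\r\n"
	if _, err := conn.Write([]byte(resp)); err != nil {
		log.Printf("write error: %v", err)
		return
	}

	// Keep the connection open until the client closes it.
	io.Copy(io.Discard, conn)
}

func main() {
	http.HandleFunc("/ws", wsHandle)
	log.Fatal(http.ListenAndServe("127.0.0.1:10020", nil))
}
```

If "before Hijack" is logged without a matching "after Hijack" on an affected machine, the hang reproduces with no WebSocket library involved, which would make the report easier to act on for the Go maintainers.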
