Skip to content

conn.go panics if TLS connection drops #644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
banks opened this issue Aug 7, 2017 · 6 comments
Closed

conn.go panics if TLS connection drops #644

banks opened this issue Aug 7, 2017 · 6 comments

Comments

@banks
Copy link

banks commented Aug 7, 2017

I have started hitting a panic in production where a service is monitoring many thousand database instances, hundreds of which are PostgreSQL and using this driver to connect.

The panic comes from: https://github.com/lib/pq/blob/master/conn.go#L1030

which is for convenience:

	b := cn.scratch[:1]
	_, err := io.ReadFull(cn.c, b)
	if err != nil {
		panic(err)
	}

I don't understand how panicing on a IO error is ever an acceptable way to handle errors in a DB driver? Connection errors must surely be an expected occurrence and one that the driver should handle gracefully if not transparently?

I didn't see any other issues closed or open that seemed to document this, is panicing a design choice that consumers are expected to rescue from for normal operation of their DB?

In my case the logs for my service look like:

panic: EOF

goroutine 13745112 [running]:
[my service path]/vendor/github.com/lib/pq.(*conn).ssl(0xc420868b00, 0xc4214cbe90)
	/go/src/[my service path]/vendor/github.com/lib/pq/conn.go:1030 +0x2a5
[my service path]/vendor/github.com/lib/pq.(*conn).cancel(0xc428d61600, 0x0, 0x0)
	/go/src/[my service path]/vendor/github.com/lib/pq/conn_go18.go:111 +0x164
[my service path]/vendor/github.com/lib/pq.(*conn).watchCancel.func1(0xc42b25f860, 0xc428d61600, 0xc42b682540)
	/go/src/[my service path]/vendor/github.com/lib/pq/conn_go18.go:85 +0xa7
created by [my service path]/vendor/github.com/lib/pq.(*conn).watchCancel
	/go/src/[my service path]/vendor/github.com/lib/pq/conn_go18.go:89 +0x9c

The git sha of the vendored package for the record is dd1fe20 which is master at time of posting.

@banks
Copy link
Author

banks commented Aug 7, 2017

A panic cannot be recovered by a different goroutine.

After closer inspection, I think it's impossible to avoid this issue as a consumer of this library since it occurs in the internal watchCancel go routine.

I will see if I can find a simple way to reproduce. I only see it after hours of running in production against hundreds of TLS-enabled Postgres instances and I don't have very good tracing to identify what specific error condition causes the EOF.

Just from reading the code though, I suspect it might require a race between TCP connection dropping and the Context being cancelled. That makes is sound similar to #614 but I believe the error and root-cause are entirely different despite same effect.

It also interrelates somewhat with my other bug report #620 where using Context cancelation violates timeouts on errors due to this same cancellation procedure which is causing the crash here.

It may be relevant to the reproduction to note that I'm using https://github.com/Kount/pq-timeouts to wrap pq in a customer Dialler/Conn that support read/write timeouts as a work around to #620.

I'll update if I manage to find more info on how to reproduce, but it seems to me like an inherent bug for networking code to ever panic on IO errors in a normal read path. If there is a good rationale I'm missing please let me know though!

@hexdigest
Copy link

Hi, @banks

Did you find any solution to this problem?

@banks
Copy link
Author

banks commented Oct 23, 2017

@hexdigest yes. Sadly it was to switch to https://github.com/jackc/pgx

@ainar-g
Copy link
Contributor

ainar-g commented Mar 27, 2018

@banks @hexdigest #734 might have fixed the issue.

@banks
Copy link
Author

banks commented Mar 27, 2018

Looks like it would fix this issue so Fee free to close when that lands. I’ve moved on from the codebase and company at this point so I can’t confirm that use-case now works flawlesly although the issue here was panics being bad and the PR seems to fix that.

@ainar-g
Copy link
Contributor

ainar-g commented Mar 27, 2018

@mjibson Should we do something else, or can this issue be closed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants