net: DNS broken on darwin without cgo (1.13 regression) #31705

Closed
@bradfitz

Description

I was testing some new code for the Go build system and found that a simple TCP dial doesn't work on Mac anymore, at least when the binary is cross-compiled.

Code is just:

var coordDialer = &net.Dialer{
        Timeout:   10 * time.Second,
        KeepAlive: 15 * time.Second,
}

// dialCoordinatorTCP returns a TCP connection to the coordinator, making
// a CONNECT request to a proxy as a fallback.
func dialCoordinatorTCP(ctx context.Context, addr string) (net.Conn, error) {
        tcpConn, err := coordDialer.DialContext(ctx, "tcp", addr)

... with a context.Background() for ctx.

It always times out after 10 seconds.

But if I redeploy the same code but built with Go 1.12.x, it works fine.

Activity

added this to the Go1.13 milestone on Apr 26, 2019
bradfitz (Contributor, Author) commented on Apr 26, 2019

My best guess is f6b42a5 ("net: use libSystem bindings for DNS resolution on macos if cgo is unavailable").

We might need some more test coverage. Or a no-cgo darwin builder.

/cc @ianlancetaylor @grantseltzer

changed the title from "net: buildlet doesn't work on darwin-amd64 with Go master" to "net: cross-compiled cgo-less buildlet doesn't work on darwin-amd64 with Go master" on Apr 26, 2019
groob (Contributor) commented on Apr 28, 2019

FWIW I can't seem to reproduce with a binary compiled on a 10.14.4 mac. I tested building with cgo disabled and setting GODEBUG=netdns=go.

bradfitz (Contributor, Author) commented on Apr 29, 2019

I built on Linux and ran on a Mac, without setting any special environment variables.

grantseltzer (Contributor) commented on Apr 29, 2019

Could this be because when you cross compile on Linux the linker doesn't have access to libSystem?

Not sure how this is done for every other binding when cross-compiling.

Also, is this with cgo enabled, netcgo not specified, cross-compiled for darwin on Linux?

bradfitz (Contributor, Author) commented on Apr 29, 2019
bradfitz (Contributor, Author) commented on Apr 30, 2019

We just disabled cgo support for darwin/386 (per #31751) so we now have a CGO_ENABLED=0 Mac builder, which now hits this issue. Which is good in that we can reproduce it.

Looks like it's stuck in DNS queries, so f6b42a5 looks implicated.

https://build.golang.org/log/289a154e730768cccbc64dd0ea2af16b4b48db88

ok  	mime	0.017s
ok  	mime/multipart	0.365s
ok  	mime/quotedprintable	0.112s
panic: test timed out after 3m0s

goroutine 343 [running]:
testing.(*M).startAlarm.func1()
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:1380 +0xc5
created by time.goFunc
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/time/sleep.go:169 +0x31

goroutine 1 [chan receive, 2 minutes]:
testing.(*T).Run(0x11720f00, 0x239388, 0xf, 0x2482c8, 0x1)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:964 +0x2c5
testing.runTests.func1(0x114da000)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:1205 +0x54
testing.tRunner(0x114da000, 0x11498f10)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
testing.runTests(0x114a8020, 0x3eef00, 0xdf, 0xdf, 0x0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:1203 +0x227
testing.(*M).Run(0x11474200, 0x0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:1120 +0x111
net.TestMain(0x11474200)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/main_test.go:52 +0x25
main.main()
	_testmain.go:552 +0xfa

goroutine 612 [chan receive, 2 minutes]:
testing.(*T).Parallel(0x11720aa0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:817 +0x18d
net.TestLookupGoogleSRV(0x11720aa0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/lookup_test.go:70 +0x29
testing.tRunner(0x11720aa0, 0x2482f0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6

goroutine 613 [chan receive, 2 minutes]:
testing.(*T).Parallel(0x11720b40)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:817 +0x18d
net.TestLookupGmailMX(0x11720b40)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/lookup_test.go:119 +0x1f
testing.tRunner(0x11720b40, 0x2482d8)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6

goroutine 614 [chan receive, 2 minutes]:
testing.(*T).Parallel(0x11720be0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:817 +0x18d
net.TestLookupGmailNS(0x11720be0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/lookup_test.go:165 +0x1f
testing.tRunner(0x11720be0, 0x2482dc)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6

goroutine 615 [chan receive, 2 minutes]:
testing.(*T).Parallel(0x11720c80)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:817 +0x18d
net.TestLookupGmailTXT(0x11720c80)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/net/lookup_test.go:214 +0x29
testing.tRunner(0x11720c80, 0x2482e0)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:912 +0x90
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6

goroutine 619 [running]:
	goroutine running on other thread; stack unavailable
created by testing.(*T).Run
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/testing/testing.go:963 +0x2a6
FAIL	net	180.028s
randall77 (Contributor) commented on Apr 30, 2019

> Could this be because when you cross compile on Linux the linker doesn't have access to libSystem?

I don't think this should matter. We don't actually need access to libSystem to build a binary which dynamically links to it. Building on Linux and running on a Mac should work fine with regards to this feature.

gopherbot (Contributor) commented on Apr 30, 2019

Change https://golang.org/cl/174637 mentions this issue: dashboard: add darwin-amd64-nocgo config, remove nacl-386 trybot

40 remaining items

randall77 (Contributor) commented on Jun 5, 2019

I don't think the res_search in /usr/lib/system/libsystem_info.dylib ever makes it to res_9_search in /usr/lib/libresolv.dylib. The following files all just reference each other; there's no path to libresolv.dylib from our root (libSystem.B.dylib):

	/usr/lib/libobjc.A.dylib
	/usr/lib/libc++abi.dylib
	/usr/lib/libc++.1.dylib
	/usr/lib/libSystem.B.dylib
	/usr/lib/system/*.dylib
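
One way to check this kind of dependency claim is to list the libraries each Mach-O file imports and follow them by hand. The sketch below uses the standard debug/macho package for that. The helper name importedLibs is made up for illustration, it only looks at the first slice of a universal binary, and it assumes the dylibs exist as files on disk, which may not hold on newer macOS releases that ship system libraries only inside the dyld shared cache.

```go
package main

import (
	"debug/macho"
	"errors"
	"fmt"
	"os"
)

// importedLibs (hypothetical helper) returns the install names of the
// dynamic libraries that the Mach-O file at path links against.
func importedLibs(path string) ([]string, error) {
	// Universal (fat) binaries wrap one Mach-O image per architecture.
	fat, err := macho.OpenFat(path)
	if err == nil {
		defer fat.Close()
		// Assumption: the first architecture slice is representative.
		return fat.Arches[0].ImportedLibraries()
	}
	if !errors.Is(err, macho.ErrNotFat) {
		return nil, err
	}

	// Not a fat file, so try it as a single-architecture Mach-O image.
	f, err := macho.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	return f.ImportedLibraries()
}

func main() {
	// Each argument is a path to a dylib or executable to inspect.
	for _, path := range os.Args[1:] {
		libs, err := importedLibs(path)
		if err != nil {
			fmt.Printf("%s: %v\n", path, err)
			continue
		}
		fmt.Println(path)
		for _, lib := range libs {
			fmt.Println("\t" + lib)
		}
	}
}
```

Running it over the files listed above and checking whether /usr/lib/libresolv.dylib ever appears in the output would confirm or refute the claim for a given macOS version.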
randall77 (Contributor) commented on Jun 5, 2019

It wouldn't be hard to try adding libresolv.dylib to our imports and renaming our resolver from res_search to res_9_search.

gopherbot (Contributor) commented on Jun 5, 2019

Change https://golang.org/cl/180838 mentions this issue: net: skip questions before parsing answers

rsc (Contributor) commented on Jun 5, 2019

I can't figure out what CL 166297 intended to fix or why it was important to improve anything in non-cgo-only mode. I have a bunch of fixes for it that make it resolve names successfully, which I will send out, but then I will send a CL deleting it entirely. If at some later point someone wants to bring it back, that's fine, provided they explain why.

gopherbot (Contributor) commented on Jun 6, 2019

Change https://golang.org/cl/180843 mentions this issue: net: remove non-cgo macOS resolver code

gopherbot (Contributor) commented on Jun 6, 2019

Change https://golang.org/cl/180842 mentions this issue: net: fix non-cgo macOS resolver code

grantseltzer (Contributor) commented on Jun 6, 2019

@rsc the idea is that Darwin's DNS logic can be used instead of the netgo resolver when cgo is disabled. There's a lot of overlap, but /etc/resolver files are the particular example.

The problem I see is that when distributing binaries to darwin hosts, it's likely most common to disable cgo. This means that before CL 166297, tools like the ones by HashiCorp would not support /etc/resolver files. At my company and those of others I talked to in person and on Slack, this was an issue.

In terms of testing, shamefully I was just doing it manually but will gladly work to write up tests. I'm taking a look at your CLs now, thanks for the help and feedback!

rsc (Contributor) commented on Jun 6, 2019

@grantseltzer, I don't believe non-cgo builds of such tools are working at all today; literally all DNS lookups seem to fail, not just ones involving /etc/resolver. Even once Go-side bugs are fixed, libsystem_info's res_search fails at PTR and CNAME queries; it may also not be thread safe; and it appears not to pay any attention to /etc/resolver. Perhaps switching to libresolv will help; perhaps not. If you'd like to reintroduce a revised version of the code for Go 1.14, that's fine. For Go 1.13, though, we'll revert things to the way they were in Go 1.12. Thanks.

grantseltzer (Contributor) commented on Jun 6, 2019

@rsc although your changes in 180842 make a lot of sense to me, I'm quite sure that they at least work enough to honor the /etc/resolver files. The thing I don't understand is that they're only working in non-test files. That behavior is worth exploring.

With that in mind, it is a lot of added complexity for a corner case. I'll investigate to see how else I can get this to work, even if you are removing the code for now. Thanks to you as well.

added a commit that references this issue on Jun 6, 2019
locked and limited conversation to collaborators on Jun 5, 2020

Participants: @bradfitz, @rsc, @groob, @agnivade, @ianlancetaylor
