
Data race in httptest.Server - pkg/storage/etcd, pkg/registry/generic/etcd #18928


Closed
wojtek-t opened this issue Dec 19, 2015 · 78 comments
Labels
area/test-infra kind/flake Categorizes issue or PR as related to a flaky test. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@wojtek-t
Member

The failure is the following:

WARNING: DATA RACE
Write by goroutine 95:
  sync.raceWrite()
      /tmp/workdir/go/src/sync/race.go:41 +0x2e
  sync.(*WaitGroup).Wait()
      /tmp/workdir/go/src/sync/waitgroup.go:124 +0xf9
  net/http/httptest.(*Server).Close()
      /tmp/workdir/go/src/net/http/httptest/server.go:168 +0x80
  k8s.io/kubernetes/plugin/pkg/auth/authenticator/token/oidc.TestOIDCAuthentication()
      /workspace/kubernetes/_output/local/go/src/k8s.io/kubernetes/plugin/pkg/auth/authenticator/token/oidc/oidc_test.go:395 +0x1fc1
  testing.tRunner()
      /tmp/workdir/go/src/testing/testing.go:456 +0xdc

Previous read by goroutine 142:
  sync.raceRead()
      /tmp/workdir/go/src/sync/race.go:37 +0x2e
  sync.(*WaitGroup).Add()
      /tmp/workdir/go/src/sync/waitgroup.go:66 +0xfa
  net/http/httptest.(*waitGroupHandler).ServeHTTP()
      /tmp/workdir/go/src/net/http/httptest/server.go:198 +0x5c
  net/http.serverHandler.ServeHTTP()
      /tmp/workdir/go/src/net/http/server.go:1862 +0x206
  net/http.(*conn).serve()
      /tmp/workdir/go/src/net/http/server.go:1361 +0x117c
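
For context: the race reported above is between (*Server).Close(), which calls WaitGroup.Wait(), and the test server's request handler, which calls WaitGroup.Add() for a request that is still being accepted. A minimal, hypothetical Go sketch of the pattern that can trigger it (not the actual Kubernetes test code) might look like:

package main

import (
    "net/http"
    "net/http/httptest"
)

func main() {
    ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    }))

    // Background goroutine that keeps hitting the server, e.g. a watcher the
    // test forgot to stop (or that stops asynchronously).
    go func() {
        for {
            resp, err := http.Get(ts.URL)
            if err != nil {
                return // the server is gone
            }
            resp.Body.Close()
        }
    }()

    // Close() calls wg.Wait(); if a new request is being accepted at the same
    // moment, the handler's wg.Add(1) races with that Wait().
    ts.Close()
}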

@timothysc @kubernetes/goog-testing @yifan-gu

@wojtek-t wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. area/test-infra kind/flake Categorizes issue or PR as related to a flaky test. labels Dec 19, 2015
@wojtek-t
Member Author

The failure comes from: TestOIDCAuthentication

@timothysc
Member

xref: #18917

Same spot I fail on: https://paste.fedoraproject.org/303002/50488092/

@yifan-gu
Contributor

cc @yifan-gu

@wojtek-t
Member Author

I've just seen the same error in a different test:

Write by goroutine 48:
  sync.raceWrite()
      /tmp/workdir/go/src/sync/race.go:41 +0x2e
  sync.(*WaitGroup).Wait()
      /tmp/workdir/go/src/sync/waitgroup.go:124 +0xf9
  net/http/httptest.(*Server).Close()
      /tmp/workdir/go/src/net/http/httptest/server.go:168 +0x80
  k8s.io/kubernetes/pkg/storage/etcd/testing.(*EtcdTestServer).Terminate()
      /workspace/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/storage/etcd/testing/utils.go:133 +0x138
  k8s.io/kubernetes/pkg/registry/generic/etcd.TestEtcdWatch()
      /workspace/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/registry/generic/etcd/etcd_test.go:610 +0xb5f
  testing.tRunner()
      /tmp/workdir/go/src/testing/testing.go:456 +0xdc

Previous read by goroutine 62:
  sync.raceRead()
      /tmp/workdir/go/src/sync/race.go:37 +0x2e
  sync.(*WaitGroup).Add()
      /tmp/workdir/go/src/sync/waitgroup.go:66 +0xfa
  net/http/httptest.(*waitGroupHandler).ServeHTTP()
      /tmp/workdir/go/src/net/http/httptest/server.go:198 +0x5c
  net/http.serverHandler.ServeHTTP()
      /tmp/workdir/go/src/net/http/server.go:1862 +0x206
  net/http.(*conn).serve()

This failure comes from pkg/registry/generic/etcd (TestEtcdWatch).

So this may not be related to OIDC itself, but rather to httptest.Server.

@wojtek-t wojtek-t added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Dec 20, 2015
@wojtek-t
Member Author

I've also seen exactly the same in pkg/storage/etcd.

@wojtek-t wojtek-t changed the title Data race in TestOIDCAuthentication (httptest.Server)? Data race in httptest.Server - pkg/storage/etcd, pkg/registry/generic/etcd and TestOIDCAuthentication Dec 20, 2015
@wojtek-t
Member Author

I think I have a hypothesis about what is happening here (although I'm not able to reproduce it locally).
Basically, it seems to me that there is a race in etcdWatcher (it has always been there, but Go 1.5 exposed it to the world).

What I think is happening here is that:

  1. when you call Watch() on etcdHelper, it starts a new goroutine calling etcdWatch()
  2. however, there is no synchronization, so the goroutine may only start running after some delay (e.g. after Stop() has already been called in the test mentioned above)
  3. if that happens, we may be calling the etcd server while it is already terminating...

I will prepare a PR fixing it (although unfortunately I'm not able to reproduce it locally, so this will just be a guess...)
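
For illustration only, a minimal sketch of the suspected pattern; the names here are hypothetical and do not match the real etcdHelper/etcdWatcher code. One simple mitigation is to check the stop signal before the goroutine ever talks to the server:

package watchsketch

type watcher struct {
    stopCh chan struct{}
}

// Watch starts the watch loop asynchronously. Nothing waits for the goroutine
// to begin, so the test can call Stop() and terminate the etcd test server
// before etcdWatch() has issued its first request.
func (w *watcher) Watch() {
    go w.etcdWatch()
}

// Stop signals the watch loop to exit.
func (w *watcher) Stop() {
    close(w.stopCh)
}

func (w *watcher) etcdWatch() {
    select {
    case <-w.stopCh:
        return // already stopped; do not touch the (possibly terminating) server
    default:
    }
    // ... issue watch requests against the etcd server here ...
}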

@wojtek-t
Member Author

I've sent #18956 - I hope that it will solve most of the problems here.

@wojtek-t
Member Author

However, my PR will not fix the OIDC test - I think that one is unrelated...

@wojtek-t
Member Author

It seems that my fix #18956 didn't help...

@wojtek-t
Member Author

One more attempt to fix the tests has just been merged: #18999 (although I'm still not able to reproduce the failures locally).

@wojtek-t
Member Author

Since TestOIDCAuthentication is a different issue - I'm creating a separate issue for that one.

@wojtek-t wojtek-t changed the title Data race in httptest.Server - pkg/storage/etcd, pkg/registry/generic/etcd and TestOIDCAuthentication Data race in httptest.Server - pkg/storage/etcd, pkg/registry/generic/etcd Dec 22, 2015
@wojtek-t
Member Author

Unfortunately, the second fix also didn't help (the flakes are rare even after the first fix, but they are still present). I will debug more.

@wojtek-t
Member Author

One more attempt is in #19006

@wojtek-t
Member Author

@fgrzadkowski - hopefully it was finally fixed in the fourth attempt :)
I will keep this issue open for some more time to monitor it, but it seems it is finally done now.

@gmarek
Contributor

gmarek commented Jan 4, 2016

I'll merge #19187 to make these flakes go away.

@gmarek
Contributor

gmarek commented Jan 4, 2016

If anyone sees this happening again, please ping this issue.

@davidopp
Member

@gmarek should we close this since you merged #19187? If so, please close.

@davidopp davidopp assigned gmarek and unassigned wojtek-t Jan 10, 2016
@davidopp
Member

Oh sorry, I see @xiang90 is working on a less hacky solution, as described here
#18928 (comment)

Re-assigning to him.

@davidopp davidopp assigned xiang90 and unassigned gmarek Jan 10, 2016
@davidopp
Member

But if we think @wojtek-t's sleep will fix the problem well enough for now, we should lower the priority of this from P0 to P1 or P2...

@xiang90
Contributor

xiang90 commented Jan 10, 2016

@davidopp Right. Agree to de-prioritize it. I will try to get a more robust one next week.

@davidopp davidopp added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Jan 10, 2016
@davidopp
Member

Thanks, bumped to P2.

@ikehz
Contributor

ikehz commented Jan 21, 2016

@xiang90 We're prioritizing flakes for the time being. Is this something you'll be able to get to soon, or should we find someone else to work on it?

@xiang90
Contributor

xiang90 commented Jan 21, 2016

@ihmccreery I am in China at the moment. I can look into this later, but I assume this issue has been fixed by a workaround, and Go will also fix the root cause soon. This issue is just about making the test shutdown more graceful.

@ikehz ikehz assigned ikehz and unassigned xiang90 Jan 22, 2016
@davidopp
Member

As @brendandburns mentioned, this might be fixed by #19458

@ikehz ikehz assigned davidopp and unassigned ikehz Jan 25, 2016
@ikehz
Contributor

ikehz commented Jan 25, 2016

Reassigning to @davidopp based on #19458.

@timothysc
Member

IIRC, the long-term fix is in etcd proper.

@brendandburns
Contributor

I'm taking this, since #20123 has the fix.

@xiang90
Contributor

xiang90 commented Jan 25, 2016

@timothysc etcd/client does not really manage the transport layer. We generate and pass in the transport in k8s, and k8s has full control of the shutdown. We need to change that part if we want a more graceful shutdown.
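
For illustration only, a sketch of the ownership model described above, using plain net/http rather than the real etcd client API: the caller builds the transport, hands it to the client, and is therefore the one that has to wind it down before the test server terminates.

package transportsketch

import (
    "net"
    "net/http"
    "time"
)

// newTransport builds the transport that the caller (k8s in this case) would
// construct and pass in to the etcd client.
func newTransport() *http.Transport {
    return &http.Transport{
        Dial: (&net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: 30 * time.Second,
        }).Dial,
        TLSHandshakeTimeout: 10 * time.Second,
    }
}

// shutdown is the part the caller owns: the etcd client did not create the
// transport, so it will not tear it down; idle connections to the test server
// have to be closed here before the server is terminated.
func shutdown(tr *http.Transport) {
    tr.CloseIdleConnections()
}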

@brendandburns
Contributor

I'm going to close this issue, since I believe that #20123 fixes the original issue in the title.

@wojtek-t @xiang90 if you are still seeing issues in etcd, please file a separate issue.
