Tests time out: System.Net.Sockets.Tests.DnsEndPointTest - mostly on SLES.15 in release/6.0 branches #57929

Closed
karelz opened this issue Aug 23, 2021 · 8 comments · Fixed by #58129
Labels
area-System.Net.Sockets test-bug Problem in test source code (most likely) test-run-core Test failures in .NET Core test runs

Member

karelz commented Aug 23, 2021

Failures 7/11-9/9 (incl. PRs):

| Day | Failures | Runs |
|---|---|---|
| 7/11-7/15 | 6.2x | Only 6.0-preview* branches -- 31x (5 days = 6.2x each day) |
| 7/16 | 10x | 2x main=6.0, 8x 6.0-preview* branches |
| 7/17-7/21 | 8.6x | Only 6.0-preview* branches -- 43x (5 days = 8.6x each day) |
| 7/22 | 11x | 1x main=6.0, 10x 6.0-preview* branches |
| 7/23-8/13 | 5.8x | Only 6.0-preview* branches -- 127x (22 days = 5.8x each day) |
| 8/14 | 1x | 1x 6.0-preview7 |
| 8/15 | 2x | 2x 6.0-preview7 |
| 8/16 | 4x | 4x 6.0-preview7 |
| 8/17 | 1x | 1x 6.0-preview7 |
| 8/18 | 6x | 2x 6.0, 1x 6.0-rc1, 3x 6.0-preview7 |
| 8/19 | 6x | 3x 6.0, 2x 6.0-rc1, 1x 6.0-preview7 |
| 8/20 | 3x | 2x 6.0, 1x 6.0-rc1 |
| 8/21 | 3x | 2x 6.0, 1x 6.0-rc1 |
| 8/22 | 6x | 3x 6.0, 3x 6.0-rc1 |
| 8/23 | 4x | 2x 6.0, 1x 6.0-rc1, 1x main=7.0 (1x PR #57745) |
| 8/24 | 5x | 3x 6.0, 2x 6.0-rc1 |
| 8/25 | 2x | 1x 6.0, 1x 6.0-rc1 |
| 8/26 | 5x | 3x 6.0, 2x 6.0-rc1 |
| 8/27 | 5x | 3x 6.0, 2x 6.0-rc1 |
| 8/28 | 3x | 3x 6.0-rc1 |
| 8/29 | 6x | 3x 6.0, 3x 6.0-rc1 |
| 8/30 | 5x | 4x 6.0 (1x PR #58336) & 1x 6.0-rc1 |
| 8/31 | 8x | 3x 6.0, 3x 6.0-rc1, 2x main=7.0 |
| 9/1 | 2x | 2x 6.0-rc1 |
| 9/2 | 7x | 6x 6.0 (3x PR #58336), 1x 6.0-rc1 |
| 9/3 | -- | Fixed in main=7.0 branch |
| 9/3 | 1x | 1x 6.0 |
| 9/4 | 6x | 3x 6.0, 3x 6.0-rc1 |
| 9/5 | 4x | 2x 6.0, 2x 6.0-rc1 |
| 9/6 | 3x | 1x 6.0 (1x PR #58336), 2x 6.0-rc1 |
| 9/7 | 4x | 3x 6.0, 1x 6.0-rc1 |
| 9/8 | -- | Fixed in 6.0 branch |
| 9/8 | 3x | 2x 6.0-rc1 |
| 9/9 | 1x | 1x 6.0-rc1 |
| 9/10 | 3x | 3x 6.0-rc1 |

Failures:

  • Socket_ConnectAsyncDnsEndPoint_HostNotFound
    • 6/24-8/23 (incl. PRs): 241 -- each day at least one except 8/4 and 8/7.
      • Failures are on release/6.0* branches (preview3+) -- only 1 failure on 7/16 is in main branch, no failure on any PR
      • OS failures:
        • SLES.15.Amd64.Open - 236x
        • Windows.Nano.1809.Amd64.Open - 2x (1 occurrence on main branch - 7/16, 7/18)
        • Windows.10.Amd64.Server19H1.Open - 1x (7/30)
        • Windows.7.Amd64.Open - 1x (8/12)
        • Windows.10.Amd64.ServerRS5.Open - 1x (8/16)
    • 8/24-8/29 (incl. PRs): 15 - all release/6.0*, official builds, SLES.15.Amd64.Open
    • 8/30-9/5 (incl. PRs): 20 - all release/6.0*, 2x PR [release/6.0] add Http3LoopbackConnection.ShutdownAsync and use in AcceptConnectionAsync #58336, SLES.15.Amd64.Open
  • Socket_StaticConnectAsync_HostNotFound
    • 6/24-8/23 (incl. PRs): 95 -- each day at least one except 7/7, 7/18, 8/5, 8/7-8/10, 8/12, 8/14-8/15, 8/17-8/18, 8/20.
      • All failures are on release/6.0* branches (preview3+)
      • OS failures:
        • SLES.15.Amd64.Open - 93x
        • Windows.Nano.1809.Amd64.Open - 1x (7/17)
        • Windows.10.Amd64.Server19H1.Open - 1x (7/22)
    • 8/24-8/29 (incl. PRs): 9 -- all release/6.0*, official builds, SLES.15.Amd64.Open
    • 8/30-9/5 (incl. PRs): 12 -- 1x main=7.0 on Windows.10.Amd64.Server19H1.Open, otherwise release/6.0* on SLES.15.Amd64.Open, 2x PR [release/6.0] add Http3LoopbackConnection.ShutdownAsync and use in AcceptConnectionAsync #58336
  • Socket_StaticConnectAsync_Success -- main branch failures tracked in Tests failed: System.Net.Sockets.Tests.DnsEndPointTest / * #21924
    • 6/24-8/23 (incl. PRs): 21
      • Mix of branches and OSs
        • 4x main + 1x PR + 16x release/6.0-preview* branches
        • OSs:
          • SLES.15.Amd64.Open - 4x
          • Debian.9.Arm32.Open - 4x
          • OSX.1015.Amd64.Open - 2x (incl. 1x Mono)
          • Ubuntu.1804.ArmArch.Open - 2x
          • Ubuntu.1910.Amd64.Open - 2x
          • Fedora.32.Amd64.Open - 3x
          • OSX.1013.Amd64.Open - 1x
          • Windows.10.Amd64.ServerRS5.Open - 1x
          • Debian.9.Amd64.Open - 1x
          • Windows.10.Amd64.Server19H1.Open - 1x
    • 8/24-8/29 (incl. PRs): 2 -- all release/6.0*, official builds -- 1x SLES.15.Amd64.Open, 1x Debian.9.Amd64.Open
    • 8/30-9/5 (incl. PRs): 1 -- main=7.0 official run on Ubuntu.1804.ArmArch.Open
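To make the concentration on SLES concrete, the per-queue counts for 6/24-8/23 listed above can be turned into shares. This is only a small arithmetic sketch in Python; the counts are copied from the lists above, and the `share` helper is a hypothetical name introduced for illustration.

```python
# Failure counts per queue for 6/24-8/23, copied from the lists above.
counts = {
    "Socket_ConnectAsyncDnsEndPoint_HostNotFound": {
        "SLES.15.Amd64.Open": 236,
        "Windows.Nano.1809.Amd64.Open": 2,
        "Windows.10.Amd64.Server19H1.Open": 1,
        "Windows.7.Amd64.Open": 1,
        "Windows.10.Amd64.ServerRS5.Open": 1,
    },
    "Socket_StaticConnectAsync_HostNotFound": {
        "SLES.15.Amd64.Open": 93,
        "Windows.Nano.1809.Amd64.Open": 1,
        "Windows.10.Amd64.Server19H1.Open": 1,
    },
    "Socket_StaticConnectAsync_Success": {
        "SLES.15.Amd64.Open": 4,
        "Debian.9.Arm32.Open": 4,
        "OSX.1015.Amd64.Open": 2,
        "Ubuntu.1804.ArmArch.Open": 2,
        "Ubuntu.1910.Amd64.Open": 2,
        "Fedora.32.Amd64.Open": 3,
        "OSX.1013.Amd64.Open": 1,
        "Windows.10.Amd64.ServerRS5.Open": 1,
        "Debian.9.Amd64.Open": 1,
        "Windows.10.Amd64.Server19H1.Open": 1,
    },
}


def share(per_queue, queue):
    """Return the fraction of a test's failures that landed on `queue`."""
    total = sum(per_queue.values())
    return per_queue.get(queue, 0) / total


for test, per_queue in counts.items():
    total = sum(per_queue.values())
    pct = 100 * share(per_queue, "SLES.15.Amd64.Open")
    print(f"{test}: {total} failures, {pct:.1f}% on SLES.15.Amd64.Open")
```

The two HostNotFound tests come out at roughly 98% on SLES.15, while the Success test is spread across queues at roughly 19%, which matches the "mix of branches and OSs" note.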

Kusto query:

let failedTests = (methodName : string, includePR : bool, messageSubstr: string) {
cluster('engsrvprod.kusto.windows.net').database('engineeringdata').TestResults
| where Type startswith "System.Net" 
    and Result == 'Fail'
| where Method == methodName
  or Method == 'Socket_StaticConnectAsync_Success'
  or Method == 'Socket_StaticConnectAsync_HostNotFound'
  or Method == 'Socket_ConnectAsyncDnsEndPoint_HostNotFound'
| where Message contains messageSubstr
| distinct JobId, WorkItemId, Message, Exception, StackTrace, Method, Type //, Arguments
| join kind=inner (cluster('engsrvprod.kusto.windows.net').database('engineeringdata').Jobs
    | where ((Branch == 'refs/heads/main') or (Branch == 'refs/heads/master') or (includePR and (Source startswith "pr/")) or (Branch startswith 'refs/heads/release/6.0'))
    | where Type startswith "test/functional/cli/"
        and not(Properties contains "runtime-staging")
    | summarize arg_max(Finished, Properties, Type, Branch, Source, Started, QueueName) by JobId
| project-rename JobType = Type) on JobId
};
failedTests('', true, '');

Failure:

Timed out while waiting for connection
Expected: True
Actual:   False

   at System.Net.Sockets.Tests.DnsEndPointTest.Socket_ConnectAsyncDnsEndPoint_HostNotFound() in /_/src/libraries/System.Net.Sockets/tests/FunctionalTests/DnsEndPointTest.cs:line 275
@karelz karelz added area-System.Net.Sockets test-bug Problem in test source code (most likely) labels Aug 23, 2021
@karelz karelz added this to the 6.0.0 milestone Aug 23, 2021
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Aug 23, 2021

ghost commented Aug 23, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Test type: System.Net.Sockets.Tests.DnsEndPointTest

Failures 6/24-8/23 (incl. PRs):

  • Total failures: 241 -- each day at least one except 8/4 and 8/7.
  • Failures are on release/6.0* branches (preview3+) -- only 1 failure on 7/16 is in main branch, no failure on any PR
  • OS failures:
    • SLES.15.Amd64.Open - 236x
    • Windows.Nano.1809.Amd64.Open - 2x (1 occurrence on main branch)
    • Windows.10.Amd64.Server19H1.Open - 1x
    • Windows.7.Amd64.Open - 1x
    • Windows.10.Amd64.ServerRS5.Open - 1x

Failure:

Timed out while waiting for connection
Expected: True
Actual:   False

   at System.Net.Sockets.Tests.DnsEndPointTest.Socket_ConnectAsyncDnsEndPoint_HostNotFound() in /_/src/libraries/System.Net.Sockets/tests/FunctionalTests/DnsEndPointTest.cs:line 275
Author: karelz
Assignees: -
Labels:

area-System.Net.Sockets, test bug

Milestone: 6.0.0

Member Author

karelz commented Aug 23, 2021

@antonfirsov any idea why only this test fails?
Also, 99% of the failures are on release/6.0 branches on SLES 15.

@karelz karelz removed the untriaged New issue has not been triaged by the area owner label Aug 23, 2021
@karelz karelz changed the title Test time out: Socket_ConnectAsyncDnsEndPoint_HostNotFound - mostly on SLES.15 in release/6.0 branches Tests time out: System.Net.Sockets.Tests.DnsEndPointTest - mostly on SLES.15 in release/6.0 branches Aug 23, 2021
Member

antonfirsov commented Aug 23, 2021

Async NameResolution is async-over-sync on Unix, queuing work on the ThreadPool in parallel with other socket tests. One explanation could be that the 10-second timeout is too strict in some cases. Another is that resolution of invalid names fails too slowly on these SLES machines. The two may be related.

I wonder if there is anything special about the release test runs compared to the main runs (images? machines? some sort of helix config?). Is the fix for #48751 applicable to these test runs? /cc @MattGal
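To illustrate the first explanation, here is a minimal sketch of how async-over-sync lookups queue up behind each other on a shared thread pool. It uses Python's `ThreadPoolExecutor` as a stand-in, not the .NET ThreadPool or the actual NameResolution code; `blocking_lookup`, the host names, and the delay values are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_lookup(host, delay):
    """Stand-in for a synchronous resolver call that blocks a worker thread."""
    time.sleep(delay)
    return f"{host}: host not found"


def time_queued_lookups(workers, lookups, delay):
    """Submit `lookups` blocking calls to a pool with `workers` threads.

    Returns, in submission order, the elapsed time at which each result
    became available to the caller.
    """
    latencies = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        start = time.monotonic()
        futures = [
            pool.submit(blocking_lookup, f"invalid-{i}.test", delay)
            for i in range(lookups)
        ]
        for future in futures:
            future.result()
            latencies.append(time.monotonic() - start)
    return latencies


if __name__ == "__main__":
    for workers in (1, 4):
        observed = time_queued_lookups(workers, lookups=4, delay=0.05)
        print(workers, [round(t, 2) for t in observed])
```

With a single worker the last lookup cannot complete before the earlier ones finish, so its observed latency is the sum of all the delays. A fixed timeout measured from submission therefore has to cover queueing time as well as the lookup itself.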

Member Author

karelz commented Aug 23, 2021

@antonfirsov FYI -- the 3rd test is a similar failure (added above), but it is not unique to SLES and also reproduces more often on the main branch (5x).

@karelz karelz added the test-run-core Test failures in .NET Core test runs label Aug 24, 2021
Member

MattGal commented Aug 24, 2021

> Async NameResolution is async-over-sync on Unix, queuing work on the ThreadPool in parallel with other socket tests. One explanation could be that the 10-second timeout is too strict in some cases. Another is that resolution of invalid names fails too slowly on these SLES machines. The two may be related.
>
> I wonder if there is anything special about the release test runs compared to the main runs (images? machines? some sort of helix config?). Is the fix for #48751 applicable to these test runs? /cc @MattGal

We had a chat, and I noted that .svc is using D2_v4 (Intel-based) processors while all the other SLES.15 queues are using D2a_v4 (AMD EPYC-based) processors. Nothing else is particularly interesting (same image, same size, no change in image since 7/29...), and the same test has passed on .svc in the past. Anton is going to investigate; depending on the outcome, it may be worth just upping the timeout.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Aug 25, 2021
Member

wfurt commented Aug 25, 2021

I don't think this is a CPU-intensive test.
One option we have is to do the lookup before running the actual test. We can assert that it fails with the expected error, and we also get a timing baseline to use instead of a fixed timeout value.
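A rough sketch of that idea, written in Python rather than against the actual test code: time a lookup that is expected to fail, assert that it did fail, and scale the measured time into a timeout. The helper names, the `factor` of 5, and the use of the 10-second value as a floor are assumptions for illustration only.

```python
import socket
import time


def measure_failed_lookup(host, resolve=socket.getaddrinfo):
    """Resolve `host`, require that it fails, and return the elapsed seconds."""
    start = time.monotonic()
    try:
        resolve(host, None)
    except socket.gaierror:
        return time.monotonic() - start
    raise AssertionError(f"{host!r} resolved, but a lookup failure was expected")


def derive_timeout(baseline, factor=5.0, floor=10.0):
    """Scale the measured baseline, never going below the fixed floor.

    `factor` and `floor` are assumed values, not taken from the test suite.
    """
    return max(floor, baseline * factor)
```

A test could then call `derive_timeout(measure_failed_lookup("invalid.invalid"))` once up front and use the result in place of the fixed value. The `resolve` parameter exists so the measurement can be exercised with a fake resolver without touching the network.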

karelz pushed a commit that referenced this issue Sep 3, 2021
Long story short:
I have spent several hours trying to get a test run on SLES in order to repro #57929, with no success so far, and it may take days to make progress, since I'm unfamiliar with SLES.

I recommend bumping the timeout values for now and seeing if that helps with the issue. If not, we may invest in another round of investigation.

(hopefully) fixes #57929
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 3, 2021
Member Author

karelz commented Sep 3, 2021

Reopening to track backport to 6.0

@karelz karelz reopened this Sep 3, 2021
@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Sep 3, 2021
Member Author

karelz commented Sep 8, 2021

Fixed in 6.0 RC2 in PR #58617.
Fixed in 7.0 in PR #58129.

@karelz karelz closed this as completed Sep 8, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Oct 10, 2021