Debug dotnet new timeouts #32889


Closed

Conversation

dougbu
Contributor

@dougbu dougbu commented May 20, 2021

  • hang long enough to grab a dump (a sketch of the idea follows these bullets)

This will not help us understand later inability to acquire DotNetNewLock

  • guessing but this might have something to do w/ the 20 minute default timeout for that acquisition
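
To make the first bullet concrete, here is a hypothetical sketch of the "hang long enough to grab a dump" idea; the helper name, timeouts, and structure are illustrative assumptions, not the actual diff in this PR.

```csharp
// Hypothetical sketch only -- not the change in this PR. It illustrates the idea of
// keeping a hung `dotnet new` process alive so a dump can be collected from it,
// instead of killing it the moment it exceeds its expected runtime.
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class DotNetNewDebugHelper
{
    static readonly TimeSpan ExpectedRuntime = TimeSpan.FromMinutes(5);   // illustrative
    static readonly TimeSpan DumpGracePeriod = TimeSpan.FromMinutes(15);  // illustrative

    public static async Task RunDotNetNewAsync(string args)
    {
        var psi = new ProcessStartInfo("dotnet", $"new {args}")
        {
            RedirectStandardOutput = true,
            RedirectStandardError = true,
        };
        using var process = Process.Start(psi)!;

        var exited = process.WaitForExitAsync();
        if (await Task.WhenAny(exited, Task.Delay(ExpectedRuntime)) != exited)
        {
            // Don't kill the process yet; leave it hanging so something like
            // `dotnet-dump collect -p <pid>` can capture the live state.
            Console.WriteLine($"dotnet new appears hung; keeping pid {process.Id} alive for dump collection.");
            await Task.WhenAny(exited, Task.Delay(DumpGracePeriod));
        }

        if (!process.HasExited)
        {
            process.Kill(entireProcessTree: true);
        }
    }
}
```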

@ghost ghost added the area-mvc Includes: MVC, Actions and Controllers, Localization, CORS, most templates label May 20, 2021
@dougbu
Contributor Author

dougbu commented May 20, 2021

I don't plan to merge this. I added reviewers partly for awareness and partly because I'll likely need help making heads or tails of whatever dump this captures. I'll rerun 'til I hit the failures, perhaps after removing all non-template, non-Helix jobs from the pipeline YAML.

@BrennanConroy
Member

This will not help us understand later inability to acquire DotNetNewLock

Well, the reason other tests can't get the lock is that it's global for the process, so when one test is hanging on dotnet new, all the other tests are stuck waiting for it to finish.
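
A minimal sketch of the pattern being described, assuming the lock is a single process-wide SemaphoreSlim wrapped around every `dotnet new` invocation; the names and shape are illustrative, not the repository's actual helper:

```csharp
// Illustrative sketch, not the repository's actual helper. It shows why one hung
// `dotnet new` blocks every other test: the lock is a single process-wide semaphore,
// so all concurrent tests queue behind whichever test currently holds it.
using System;
using System.Threading;
using System.Threading.Tasks;

static class DotNetNewLockSketch
{
    // One slot for the entire test process.
    private static readonly SemaphoreSlim _lock = new SemaphoreSlim(1, 1);

    public static async Task RunSerializedAsync(Func<Task> runDotNetNew, TimeSpan acquireTimeout)
    {
        // If another test is stuck inside runDotNetNew, everyone else waits here and can
        // time out even though their own `dotnet new` never ran.
        if (!await _lock.WaitAsync(acquireTimeout))
        {
            throw new TimeoutException($"Could not acquire DotNetNewLock within {acquireTimeout}.");
        }

        try
        {
            await runDotNetNew();
        }
        finally
        {
            _lock.Release();
        }
    }
}
```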

@dougbu
Contributor Author

dougbu commented May 20, 2021

all the other tests are stuck waiting for it to finish.

You mean one test hangs for 15 minutes and enough tests get backed up waiting that some exceed their 20 minute WaitAsync(...) timeout❔ That seems like a solid explanation.

@HaoK
Member

HaoK commented May 21, 2021

Why do we need to take a global lock? Can we just change these tests to run sequentially instead? Maybe that would help? I'd imagine those locks were added pre-Helix, when the templates didn't have an isolated Helix machine all to themselves.
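
For reference, a rough sketch of what "run sequentially" could look like with xunit itself instead of a shared lock; the attributes are standard xunit 2.x, but the class and collection names, and whether this fits the templates test suite, are assumptions:

```csharp
// Sketch of serializing tests with xunit rather than a global lock. Class and
// collection names below are placeholders, not the repository's actual types.
using Xunit;

// Option 1: disable parallelization for the whole test assembly.
[assembly: CollectionBehavior(DisableTestParallelization = true)]

// Option 2: serialize only the dotnet-new-driven tests by putting them in one collection.
[CollectionDefinition("Template tests", DisableParallelization = true)]
public class TemplateTestsCollection { }

[Collection("Template tests")]
public class SomeTemplateTest
{
    // dotnet new / publish / run assertions go here; no DotNetNewLock is needed if
    // nothing else in the process runs dotnet new concurrently.
}
```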

@dougbu
Contributor Author

dougbu commented May 21, 2021

@HaoK sequential runs might be fine but we need to figure out the dotnet new timeout first. That part is already serialized by the lock itself.

@pranavkm
Contributor

Can we just change these tests to run sequentially instead

We used to do that and the tests would take very long to run. @javiercn or @SteveSandersonMS might have more insights about this.

@dougbu
Contributor Author

dougbu commented May 21, 2021

We used to do that and the tests would take very long to run.

Has the math perhaps changed since we removed the uninstallation of the default templates❔

@dougbu
Contributor Author

dougbu commented May 21, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@dougbu
Contributor Author

dougbu commented May 21, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@dougbu
Contributor Author

dougbu commented May 21, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@dougbu
Contributor Author

dougbu commented May 21, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@BrennanConroy
Member

Heisenbug? 😄

@dougbu
Contributor Author

dougbu commented May 21, 2021

Heisenbug?

I can't tell 😀

@dougbu
Contributor Author

dougbu commented May 22, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@dougbu
Contributor Author

dougbu commented May 22, 2021

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-aspnetcore-refs-pull-32889-merge-6632ca1f8ec347a89a/ProjectTemplates.Tests--net6.0/dotnet.32172.dmp

We may need to lengthen other lock timeouts as well. Logs in https://dev.azure.com/dnceng/public/_build/results?buildId=1151130&view=ms.vss-test-web.build-test-results-tab&runId=34807950&resultId=123880&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab make it appear the first failing test timed out after 15 minutes, but other tests ran after it, some failing to acquire DotNetNewLock. Also, the "15 minutes" part makes me suspicious my change didn't actually work.
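
For context, there seem to be at least two independent timeouts in play here; a small sketch, with purely illustrative names and values, of why a 15-minute per-test limit can fire before the 20-minute lock-acquisition timeout ever reports anything:

```csharp
// Illustrative values only -- not the repository's actual settings.
using System;

var perTestTimeout     = TimeSpan.FromMinutes(15); // e.g. a per-test or Helix work-item limit
var lockAcquireTimeout = TimeSpan.FromMinutes(20); // the DotNetNewLock WaitAsync(...) timeout

// A test queued behind a hung `dotnet new` waits roughly:
//   waitTime ≈ hungTestDuration + (testsAheadOfIt * normalDotNetNewDuration)
// Any test whose waitTime crosses 15 minutes fails with the per-test timeout first,
// so the logs can show a "15 minute" failure even when the 20-minute lock-acquisition
// timeout never fires for that test.
```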

@BrennanConroy
Member

Also, the "15 minutes" part makes me suspicious my change didn't actually work.

Looks like there was another timeout we needed to update
01065ff

@dougbu
Contributor Author

dougbu commented May 25, 2021

Thanks @BrennanConroy. Unfortunately, more Heisenbug behaviour…

@BrennanConroy
Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@BrennanConroy
Member

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@BrennanConroy
Member

Why is this running quarantined jobs only? Aren't most of the templates tests not quarantined?

@dougbu
Contributor Author

dougbu commented May 26, 2021

Why is this running quarantined jobs only? Aren't most of the templates tests not quarantined?

We saw the timeouts in regular test runs and aspnetcore-quarantined-pr covers the quarantined tests. That said, we don't need another pipeline for this debugging and aspnetcore-quarantined-pr isn't running a focused set of tests.

@dougbu dougbu force-pushed the dougbu/template.timeout.debugging branch from e8d9be4 to 0f87949 Compare May 29, 2021 06:17
@dougbu dougbu force-pushed the dougbu/template.timeout.debugging branch from 0f87949 to db7f554 Compare June 11, 2021 00:20
@HaoK
Member

HaoK commented Jun 11, 2021

Not sure you want to add the Blazor templates to this, since those will fail a lot due to actual flakiness unrelated to dotnet new. Another random data point for you: when I was looking at the Blazor template with retry runs, occasionally there was a test run that passed after a little more than 10 minutes, which is an extreme outlier from the normal 1-2 minute runs. That looks like some kind of timeout, since it's such a round number.

@dougbu
Contributor Author

dougbu commented Jun 11, 2021

Another dotnet publish failure in https://dev.azure.com/dnceng/public/_build/results?buildId=1182397&view=results. The logging is insane; is any of that detail useful❔

Relevant bits may be

Launch host: C:\h\w\B57109AD\p\dotnet-cli\dotnet.exe, app: C:\h\w\B57109AD\p\dotnet-cli\sdk\6.0.100-preview.6.21308.9\dotnet.dll, argc: 5, args: publish,--no-restore,-c,Release,/bl,
--- Begin breadcrumb write
Breadcrumbs will be written using a background thread
Breadcrumb thread write callback...
--- End breadcrumb write 1
Execute managed assembly exit code: 0x1
Waiting for breadcrumb thread to exit...
Done waiting for breadcrumb thread to exit...

and

 | Expected: True
 | Actual:   False
 |    at Xunit.Assert.True(Nullable`1 condition, String userMessage) in C:\Dev\xunit\xunit\src\xunit.assert\Asserts\BooleanAsserts.cs:line 95
 |    at Templates.Test.IdentityUIPackageTest.IdentityUIPackage_WorksWithDifferentOptions() in /_/src/ProjectTemplates/test/IdentityUIPackageTest.cs:line 91
 |    at Xunit.Sdk.TestInvoker`1.<>c__DisplayClass48_1.<<InvokeTestMethodAsync>b__1>d.MoveNext() in C:\Dev\xunit\xunit\src\xunit.execution\Sdk\Frameworks\Runners\TestInvoker.cs:line 264
 | --- End of stack trace from previous location ---
 |    at Xunit.Sdk.ExecutionTimer.AggregateAsync(Func`1 asyncAction) in C:\Dev\xunit\xunit\src\xunit.execution\Sdk\Frameworks\ExecutionTimer.cs:line 48
 |    at Xunit.Sdk.ExceptionAggregator.RunAsync(Func`1 code) in C:\Dev\xunit\xunit\src\xunit.core\Sdk\ExceptionAggregator.cs:line 96

Binary log at https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-aspnetcore-refs-pull-32889-merge-6a5d0c1edf6647f58f/ProjectTemplates.Tests--net6.0/AspNet.nniqxrj5imm.binlog may help too.

@dougbu dougbu force-pushed the dougbu/template.timeout.debugging branch from db7f554 to f36ec2d Compare June 11, 2021 04:04
@BrennanConroy
Member

The logging is insane; is any of that detail useful❔

I turned it on hoping it would be useful, but either it contains nothing for us or I don't know how to find the useful information in it. We should probably turn it off at this point.
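
For reference, the verbose host output quoted above (the breadcrumb lines) comes from the .NET host's COREHOST_TRACE switch. A minimal sketch, assuming the tests launch dotnet via ProcessStartInfo, of scoping that tracing to a single process and keeping it off by default; the COREHOST_TRACE and COREHOST_TRACEFILE environment variables are real, while the surrounding helper shape is an assumption:

```csharp
// Sketch only: scope host tracing to one launched process instead of the whole pipeline.
using System.Diagnostics;

var psi = new ProcessStartInfo("dotnet", "publish --no-restore -c Release")
{
    RedirectStandardOutput = true,
    RedirectStandardError = true,
};

// Opt in only while actively debugging a hang; otherwise leave the variables unset so
// the test logs stay readable.
const bool captureHostTrace = false;
if (captureHostTrace)
{
    psi.Environment["COREHOST_TRACE"] = "1";
    psi.Environment["COREHOST_TRACEFILE"] = "corehost-trace.log"; // keep the noise out of stdout
}

using var process = Process.Start(psi)!;
process.WaitForExit();
```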

@dougbu
Contributor Author

dougbu commented Jun 18, 2021

/azp run

@azure-pipelines

Pull request contains merge conflicts.

dougbu and others added 4 commits June 21, 2021 15:35
- hang long enough to grab a dump

This will _not_ help us understand later inability to acquire `DotNetNewLock`
- guessing but this might have something to do w/ the 20 minute default timeout for that acquisition
@BrennanConroy BrennanConroy force-pushed the dougbu/template.timeout.debugging branch from f36ec2d to a29a46c Compare June 21, 2021 22:37
@dougbu
Contributor Author

dougbu commented Jun 25, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dougbu dougbu mentioned this pull request Jun 25, 2021
@dougbu
Contributor Author

dougbu commented Jun 25, 2021

Now that @sebastienros removed $env:COREHOST_TRACE in e03e116, do we need this draft PR❔ I think we had an issue about the timeouts we've been debugging here but can't find it at the moment…

@dougbu dougbu closed this Jul 13, 2021
@dougbu dougbu deleted the dougbu/template.timeout.debugging branch July 13, 2021 17:26
@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2023