[main] Update Docker images, queues, etc. #38427
c9b383f
to
ec21abe
Compare
--- ".\\helix.matrix.before" 2021-11-16 13:38:59.593746300 -0800
+++ ".\\helix.matrix.after" 2021-11-16 13:21:31.644538900 -0800
@@ -1,14 +1,11 @@
-(Alpine.312.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:alpine-3.12-helix-20200908125345-56c6673
-(Debian.11.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:debian-11-helix-amd64-20210304164428-5a7c380
-(Debian.9.Arm64.Open)[email protected]/dotnet-buildtools/prereqs:debian-9-helix-arm64v8-a12566d-20190807161036
-(Fedora.34.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:fedora-34-helix-20210728124700-4f64125
+(Alpine.314.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19
+(Debian.11.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:debian-11-helix-amd64-20211001171307-0ece9b3
+(Debian.11.Arm64.Open)[email protected]/dotnet-buildtools/prereqs:debian-11-helix-arm64v8-20211001171229-97d8652
+(Fedora.34.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:fedora-34-helix-20210924174119-4f64125
(Mariner)[email protected]/dotnet-buildtools/prereqs:cbl-mariner-1.0-helix-20210528192219-92bf620
-OSX.1014.Amd64.Open
-OSX.1100.Amd64.Open
+OSX.1015.Amd64.Open
Redhat.7.Amd64.Open
-Ubuntu.1804.Amd64.Open
Ubuntu.2004.Amd64.Open
Windows.10.Amd64.Server20H2.Open
Windows.10.Arm64v8.Open
-Windows.11.Amd64.ClientPre.Open
Windows.Amd64.Server2022.Open
--- ".\\quarantined.pr.before" 2021-11-16 13:57:28.157821100 -0800
+++ ".\\quarantined.pr.after" 2021-11-16 13:53:39.075019400 -0800
@@ -1,4 +1,4 @@
-(Fedora.34.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:fedora-34-helix-20210728124700-4f64125
-OSX.1014.Amd64.Open
+OSX.1100.Amd64.Open
Ubuntu.1804.Amd64.Open
Windows.11.Amd64.ClientPre.Open
Notes
- aspnetcore-helix-matrix failures are about flakiness that exists in our regular runs
- aspnetcore-quarantined-pr pipeline times out for 'main' builds fairly frequently
<PropertyGroup Condition="'$(TestDependsOnPlaywright)' == 'true'">
<SkipHelixQueues>
$(HelixQueueAlpine312);
$(HelixQueueAlpine314);
Worth removing this line and running aspnetcore-helix-matrix again❔ It's unclear how to determine whether Playwright is working everywhere given all or most Playwright tests are quarantined and at least the BlazorWasmTemplateTest tests mostly fail in any case. @javiercn ❔
We can basically safely ignore Playwright now. I believe the only place that's running now is in the components pipeline. @TanayParikh can you confirm that's true?
Here are the original PRs where I:
- Moved to 3.14: Update test matrix #36845
- Updated to a 3.14 queue w/ Python: Fix alpine queue used for Helix #36929
- Associated Helix-Matrix: https://dev.azure.com/dnceng/public/_build/results?buildId=1384007&view=results
- Reverted back to 3.12 after persistent failures: Go back to alpine 3.12 #36951
The linked helix-matrix might have info that isn't in the PRs - all I remember is that the failure I mentioned in the issue was persistent. I couldn't find anything more in emails.
Maybe @HaoK remembers something I don't
believe the only place that's running now is in the components pipeline.
The Components pipeline mainly runs Microsoft.AspNetCore.Components.E2ETests and those rely on Selenium, not Playwright. In addition, those tests don't run on Helix.
On the other hand, Playwright is used only in BlazorTemplates.Tests. Those tests are all quarantined and never run on any Linux e.g. from a recent aspnetcore-quarantined-tests run of 'main':
In general, can that restriction be removed or reduced (adding a Linux platform)❔
In general, can that restriction be removed or reduced (adding a Linux platform)❔
@javiercn and @dotnet/aspnet-blazor-eng (because you're likely the most familiar w/ the $(TestDependsOnPlaywright) exclusions) please chime in here. I see Playwright is fully supported on Ubuntu 18.04 and 20.04 but we're running our Playwright tests on neither. I haven't searched enough to understand why Ubuntu 18.04 is also not running Playwright tests.
@javiercn @dotnet/aspnet-blazor-eng any thoughts on Doug's question above?
I'm now leaning toward leaving everything but the duplicate Fedora testing unchanged and getting this in tonight or tomorrow. The experts can deal with our limited coverage for the Blazor template tests and Playwright system compatibility.
Generally looking to resolve questions before merging this PR…
We have one test failure on OSX 10.15:
----- Inner Stack Trace -----
@HaoK have you seen this one before? Should we just quarantine it?
External networking issue? We wouldn't normally quarantine a test for that. That said, why is a test hitting external resources?
The failure is in the
Separately, @Tratcher could you answer my question at https://github.com/dotnet/aspnetcore/pull/38427/files#r749957765 please❔
We need the external hit as we want to ensure the contents of identity UI match the CDN, so it's literally comparing against the CDN contents.
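To make the comparison described above concrete, here is a minimal offline sketch of the two checks the test names suggest: a byte-for-byte match between a bundled fallback file and the CDN copy, and a Subresource Integrity value over the same bytes. The function names and sample bytes are hypothetical and this is not the code in IdentityUIScriptsTest; it also assumes `sha384` as the integrity algorithm.

```python
import base64
import hashlib


def sri_sha384(content: bytes) -> str:
    """Return an integrity value of the form 'sha384-<base64 digest>'."""
    digest = hashlib.sha384(content).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def fallback_matches_cdn(local_bytes: bytes, cdn_bytes: bytes) -> bool:
    """Compare the bundled fallback file with the CDN response, byte for byte."""
    return local_bytes == cdn_bytes


# Hypothetical stand-ins for a bundled script and a CDN response body.
local_copy = b"console.log('identity ui');\n"
cdn_copy = b"console.log('identity ui');\n"

print(fallback_matches_cdn(local_copy, cdn_copy))
print(sri_sha384(local_copy) == sri_sha384(cdn_copy))
```

In a real test the CDN bytes come from a network fetch, which is the step that can time out.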
For that specific test, we can also just configure Helix retries since that has a known external network dependency: https://github.com/dotnet/aspnetcore/blob/main/eng/test-configuration.json#L10. We should add a glob line for all of the script UI tests: "Microsoft.AspNetCore.Identity.Test.IdentityUIScriptsTest.*"
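As a rough illustration of what that glob would cover, the sketch below uses Python's `fnmatch` to check fully qualified test names against the pattern. This only shows the scoping idea; the matching rules Helix applies to test-configuration.json may differ, and `should_retry` plus the second test name are hypothetical.

```python
from fnmatch import fnmatchcase

# The glob proposed in the comment above.
RETRY_GLOBS = ["Microsoft.AspNetCore.Identity.Test.IdentityUIScriptsTest.*"]


def should_retry(test_name: str, globs=RETRY_GLOBS) -> bool:
    """Return True when the fully qualified test name matches any retry glob."""
    return any(fnmatchcase(test_name, glob) for glob in globs)


in_scope = (
    "Microsoft.AspNetCore.Identity.Test.IdentityUIScriptsTest."
    "IdentityUI_ScriptTags_SubresourceIntegrityCheck"
)
# Hypothetical test name outside the script UI test class.
out_of_scope = "Microsoft.AspNetCore.Identity.Test.SomeOtherTest.SomeMethod"

print(should_retry(in_scope))
print(should_retry(out_of_scope))
```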
Note the entire work item failed in this case even though it was only
And, on @Tratcher's overall point, we would be in a much better place if the exponential retry in
Bottom line, the test fails almost never but the
We could just get rid of using the retry handler and let Helix retry the entire work item. It doesn't take long, it's already on the machine, and all of the identity tests take a few seconds to run in their entirety.
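For readers unfamiliar with the in-test approach being weighed here, this is a minimal sketch of exponential-backoff retry around a single operation. It is not the repository's retry handler. The names, the `OSError` failure type, and the delay values are assumptions made for illustration.

```python
import time


def backoff_delays(attempts: int, base_seconds: float = 1.0, factor: float = 2.0):
    """Delays before each retry: base, base*factor, base*factor^2, ..."""
    return [base_seconds * factor ** i for i in range(attempts)]


def call_with_retries(operation, attempts: int = 3, base_seconds: float = 1.0,
                      sleep=time.sleep):
    """Run operation, retrying on OSError with exponentially growing pauses."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts, base_seconds)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the test runner
            sleep(delay)


print(backoff_delays(4))
```

The alternative suggested above moves this loop out of the test: the test fails fast and the Helix work item is rerun as a whole.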
Does the Helix SDK retry support retry just the single
ec21abe
to
fb0a87f
Compare
Why does the scope matter? If it only fails on OSX, it doesn't really hurt to specify a broader retry policy for that test class.
@HaoK and I chatted offline and the scope is a single Helix agent i.e. retries just retry one test class on that one platform. More generally, we're in agreement to
I'm worried that macOS-10.15 Helix machines are significantly slower or have worse network connections than the macOS 10.14 machines we were using before. The most recent
Separately from that and somewhat strangely, we aren't seeing timeouts of the
To clarify, the move that seems to be hurting our
We could just choose to skip these identity tests on the OSX queue. We don't really need the OS coverage for this; there's no variation between OSes for this feature area.
WFM though this remains a general problem. @wtgodbe is hitting similar problems w/ tests on macOS 10.15 e.g. in #38536. I'll add the
Suggest we move to macOS-11 and OSX.1100.Amd64[.Open] in all of your PRs for dotnet/aspnetcore-internal#3950 except where we have
- part of dotnet/aspnetcore-internal#3950
- also touches on #36032
- update Helix queues from Alpine 3.12 to 3.14, OSX 10.14 to 10.15, and (for Arm64) Debian 9 to 11
- use OSX 11.00 when testing PRs and rolling builds; reduce 10.15 usage to scheduled runs
- remove overlap (all 3 queues) between PRs / rolling builds and scheduled runs
- build source-index on `windows-latest` (not `vs2017-win2016`)
- update build and Helix Docker images to latest tags

nits:
- don't skip unused Helix queues
- remove versions from pipeline job display names
  - some were already outdated; rest will be confusing in the future
- remove most comments about unused Helix queues
fb0a87f
to
f4ca1c5
Compare
namespace Microsoft.AspNetCore.Identity.Test;

[SkipOnHelix("https://github.com/dotnet/aspnetcore/issues/38542", Queues="OSX.1015.Amd64.Open;OSX.1015.Amd64")] //slow
Note I'm skipping the test class in case the failure would just move to IdentityUI_ScriptTags_SubresourceIntegrityCheck(...) if I skipped only IdentityUI_ScriptTags_FallbackSourceContent_Matches_CDNContent(...). I double-checked and don't see evidence IdentityUI_ScriptTags_SubresourceIntegrityCheck(...) ran before (and of course not after) the timeouts of IdentityUI_ScriptTags_FallbackSourceContent_Matches_CDNContent(...).
Yeah, seems fine. We could go so far as to skip All.OSX if you wanted, since we really only need to run this on any single queue to ensure we are including the right file versions.
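A small sketch of how a semicolon-separated `Queues` value like the one in the attribute above could be evaluated against the queue a test is running on. This is an assumed model, not the SkipOnHelixAttribute implementation: `should_skip` is hypothetical, matching is assumed to be case-insensitive, and `All.OSX` is treated here as "any queue whose name starts with OSX.".

```python
from typing import Optional


def should_skip(current_queue: Optional[str], queues_spec: str) -> bool:
    """Decide whether a test is skipped on the given Helix queue.

    queues_spec is a semicolon-separated list such as
    "OSX.1015.Amd64.Open;OSX.1015.Amd64".
    """
    if not current_queue:
        return False  # assumed: no queue means the test is not running on Helix
    current = current_queue.lower()
    wanted = [q.strip().lower() for q in queues_spec.split(";") if q.strip()]
    for queue in wanted:
        if queue == current:
            return True
        # Assumed group syntax: "All.<prefix>" matches every "<prefix>.*" queue.
        if queue.startswith("all.") and current.startswith(queue[4:] + "."):
            return True
    return False


spec = "OSX.1015.Amd64.Open;OSX.1015.Amd64"
print(should_skip("OSX.1015.Amd64.Open", spec))
print(should_skip("OSX.1100.Amd64.Open", spec))
print(should_skip("OSX.1100.Amd64.Open", "All.OSX"))
```

Under this model the explicit list keeps the tests running on OSX 11.00 queues, while `All.OSX` would drop macOS coverage for the class entirely.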
<whining>I don't wanna do another iteration</whining> 😭
Plus, we haven't seen similar issues in other queues. We could add a [RunOnlyOnLatestUbuntu] attribute at some point in the future 😄
This wasn't correct.
/backport to release/6.0
Started backporting to release/6.0: https://github.com/dotnet/aspnetcore/actions/runs/1483876410
@dougbu backporting to release/6.0 failed, the patch most likely resulted in conflicts:
$ git am --3way --ignore-whitespace --keep-non-patch changes.patch
Applying: [main] Update Docker images, queues, etc. - part of dotnet/aspnetcore-internal#3950 - also touches on #36032 - update Helix queues from Alpine 3.12 to 3.14, OSX 10.14 to 10.15, and (for Arm64) Debian 9 to 11 - use OSX 11.00 when testing PRs and rolling builds; reduce 10.15 usage to scheduled runs - remove overlap (all 3 queues) between PRs / rolling builds and scheduled runs - build source-index on `windows-latest` (not `vs2017-win2016`) - update build and Helix Docker images to latest tags
Using index info to reconstruct a base tree...
M .azure/pipelines/ci.yml
M .azure/pipelines/quarantined-pr.yml
M docs/Helix.md
M eng/targets/Helix.Common.props
M eng/targets/Helix.targets
M src/ProjectTemplates/test/GrpcTemplateTest.cs
M src/Testing/src/xunit/HelixConstants.cs
M src/Testing/src/xunit/SkipOnHelixAttribute.cs
Falling back to patching base and 3-way merge...
Auto-merging src/Testing/src/xunit/SkipOnHelixAttribute.cs
CONFLICT (content): Merge conflict in src/Testing/src/xunit/SkipOnHelixAttribute.cs
Auto-merging src/Testing/src/xunit/HelixConstants.cs
CONFLICT (content): Merge conflict in src/Testing/src/xunit/HelixConstants.cs
Auto-merging src/ProjectTemplates/test/GrpcTemplateTest.cs
CONFLICT (content): Merge conflict in src/ProjectTemplates/test/GrpcTemplateTest.cs
Auto-merging eng/targets/Helix.targets
CONFLICT (content): Merge conflict in eng/targets/Helix.targets
Auto-merging eng/targets/Helix.Common.props
CONFLICT (content): Merge conflict in eng/targets/Helix.Common.props
Auto-merging docs/Helix.md
Auto-merging .azure/pipelines/quarantined-pr.yml
Auto-merging .azure/pipelines/ci.yml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 [main] Update Docker images, queues, etc. - part of dotnet/aspnetcore-internal#3950 - also touches on #36032 - update Helix queues from Alpine 3.12 to 3.14, OSX 10.14 to 10.15, and (for Arm64) Debian 9 to 11 - use OSX 11.00 when testing PRs and rolling builds; reduce 10.15 usage to scheduled runs - remove overlap (all 3 queues) between PRs / rolling builds and scheduled runs - build source-index on `windows-latest` (not `vs2017-win2016`) - update build and Helix Docker images to latest tags
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Error: The process '/usr/bin/git' failed with exit code 128
Please backport manually!
Pasting reply from Teams:
I think it's a bit of a stretch to extrapolate "machines are not as good" from some HttpClient timeouts. These machines aren't in an Azure data center so literally everything they do (fetch Helix work items from Azure Service Bus, send events to Azure Event Hub, download payloads from Azure Storage accounts, etc.) involves communicating with and downloading successfully from external resources. There is some variance in hardware to be found (some pools have a mix of minis and Pros) but even the oldest, worst Mac minis we have are using gigabit adapters built into their main boards, connected to the same general network topology as the others. If we see a pattern of a specific machine having this problem, we can certainly investigate, but my suspicions lie with some kind of DoS prevention system with Cloudflare. When the vendors aren't using every port on the KVM I will fetch some hardware specs off a few machines, but I don't think we can realistically blame compute power for a failure to download an 88 KB file from an external source in 100 seconds.
Hi @MattGal. It looks like you just commented on a closed PR. The team will most probably miss it. If you'd like to bring something important up to their attention, consider filing a new issue and add enough details to build context.