Skip to content

Optimize the build process #1112

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ThomasBarnekow opened this issue Nov 17, 2021 · 8 comments · Fixed by #1173
Closed

Optimize the build process #1112

ThomasBarnekow opened this issue Nov 17, 2021 · 8 comments · Fixed by #1173

Comments

@ThomasBarnekow
Copy link

Context

This issue is linked to a side discussion in #1109 on the very long build times. Here are some bits and pieces of the discussion in #1109, starting with my comment on a statement made by @bart-degreed:

Running the cibuild already takes a while today (and occasionally times out, the VMs are quite slow)

That is definitely true. One set of activities that takes relatively long is this:

RunInspectCode
RunCleanupCode

On Linux (Ubuntu image), RunInspectCode takes a whopping 16 to 18 minutes (yes, minutes) whereas it "only" takes 4 to 5 minutes on Windows (Visual Studio 2019 image). RunCleanupCode takes 2 to 4 minutes on Linux and 1 to 2 minutes on Windows. In total, we are talking about roughly 18 to 22 minutes on Linux and 5 to 7 minutes on Windows. Overall RunInspectCode accounts for roughly half of the total time on Linux.

The question is why you need to run:

  • both steps as part of the cibuild and
  • those two steps in both images (Ubuntu and Visual Studio 2019).

If possible, you should only run this on Windows.

@bart-degreed responded as follows:

We need to run Linux before Windows, because the Windows build publishes the NuGet package and the documentation website. We'd like to fail fast on formatting issues, not wait for the whole Linux build to complete first. During regular work, formatting issues are usually the cause for failure. Instead of running cleanup locally, we queue up a build and work on something else meanwhile.

I know how frustrating it can be to have to wait on cibuild when working on that, so by all means disable steps temporarily to be productive.

Problem

The two steps in question take 18 to 22 minutes on Linux and 5 to 7 minutes on Windows, i.e., ca. 23 to 29 minutes in total!

The following example highlights how frustrating this long time can be. The build started for my commit e65face related to PR #1103 led to a formatting-related failure after 22 min 8 sec. Here are relevant parts of the output:

Build succeeded.
    0 Warning(s)
    0 Error(s)
Time Elapsed 00:03:53.77
JetBrains Inspect Code 2021.2.2
Running on AMD 64 in 64-bit mode, .NET runtime 3.1.18 under Unix 5.4.0.1056
Running code cleanup on changed files in pull request
JetBrains Cleanup Code 2021.2.2
Running on AMD 64 in 64-bit mode, .NET runtime 3.1.18 under Unix 5.4.0.1056
Version: 2021.2.2
dotnet tool run jb cleanupcode "JsonApiDotNetCore.sln"  --include=";JsonApiDotNetCore.sln;src/Examples/JsonApiDotNetCoreExample.Cosmos/Controllers/NonJsonApiController.cs;src/Examples/JsonApiDotNetCoreExample.Cosmos/Controllers/OperationsController.cs;src/Examples/JsonApiDotNetCoreExample.Cosmos/Controllers/PeopleController.cs;src/Examples/JsonApiDotNetCoreExample.Cosmos/Controllers/TodoItemsController.cs;src/Examples/JsonApiDotNetCoreExample.Cosmos/Data/AppDbContext.cs;src/Examples/JsonApiDotNetCoreExample.Cosmos/Definitions/TodoItemDefinition.cs;src/Examples/JsonApiDotNetCoreExample.Cosmos/JsonApiDotNetCoreExample.Cosmos.csproj;src/Examples/JsonApiDotNetCoreExample.Cosmos/Models/Person.cs;src/Examples/JsonApiDotNetCoreExample.Cosmos/Models/Tag.cs;src/Examples/JsonApiDotNetCoreExample.Cosmos/Models/TodoItem.cs;src/Examples/JsonApiDotNetCoreExample.Cosmos/Models/TodoItemPriority.cs;src/Examples/JsonApiDotNetCoreExample.Cosmos/Program.cs;src/Examples/JsonApiDotNetCoreExample.Cosmos/Properties/launchSettings.json;src/Examples/JsonApiDotNetCoreExample.Cosmos/Startup.cs;src/Examples/JsonApiDotNetCoreExample.Cosmos/appsettings.json;src/JsonApiDotNetCore/Configuration/NoSqlServiceCollectionExtensions.cs;src/JsonApiDotNetCore/Queries/INoSqlQueryLayerComposer.cs;src/JsonApiDotNetCore/Queries/NoSqlQueryLayerComposer.cs;src/JsonApiDotNetCore/Resources/Annotations/NoSqlHasForeignKeyAttribute.cs;src/JsonApiDotNetCore/Resources/Annotations/NoSqlOwnsManyAttribute.cs;src/JsonApiDotNetCore/Resources/Annotations/NoSqlResourceAttribute.cs;src/JsonApiDotNetCore/Services/NoSqlResourceService.cs" --profile --profile="JADNC Full Cleanup" --properties:Configuration=Release --verbosity=WARN -dsl=GlobalAll -dsl=SolutionPersonal -dsl=ProjectPersonal
JetBrains Cleanup Code 2021.2.2
Running on AMD 64 in 64-bit mode, .NET runtime 3.1.18 under Unix 5.4.0.1056
!!!! Process Aborted !!!!
The following files do not match .editorconfig:
 * src/Examples/JsonApiDotNetCoreExample.Cosmos/appsettings.json
Run the following command to fix style issues:
    dotnet regitlint -s JsonApiDotNetCore.sln --print-command --jb --profile --jb --profile="JADNC Full Cleanup" --jb --properties:Configuration=Release --jb --verbosity=WARN -f commits -a ffe5d62ec78d70b5dcf9f7c7ba080f04f53c908f -b b7930dee25fd506fd4beb55f5b3f4280f08f3244 --fail-on-diff --print-diff
diff --git a/src/Examples/JsonApiDotNetCoreExample.Cosmos/appsettings.json b/src/Examples/JsonApiDotNetCoreExample.Cosmos/appsettings.json
index 38ca8947..8fc2aef8 100644
--- a/src/Examples/JsonApiDotNetCoreExample.Cosmos/appsettings.json
+++ b/src/Examples/JsonApiDotNetCoreExample.Cosmos/appsettings.json
@@ -1,6 +1,7 @@
 {
   "Data": {
-    "DefaultConnection": "AccountEndpoint=https://localhost:8081/;AccountKey=C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="
+    "DefaultConnection":
+      "AccountEndpoint=https://localhost:8081/;AccountKey=C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="
   },
   "Logging": {
     "LogLevel": {

The situation was that the relevant part of appsettings.json looked like this:

{
  "Data": {
    "DefaultConnection": "AccountEndpoint=https://localhost:8081/;AccountKey=C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="
  },
}

However, it was supposed to look like that:

{
  "Data": {
    "DefaultConnection":
      "AccountEndpoint=https://localhost:8081/;AccountKey=C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="
  },
}

In this case, the error was reported by RunCleanupCode (which is relatively fast) rather than RunInspectCode (which is super-slow at least on Linux).

Solution Options

To kick this off, I'd see multiple solutions.

Run on Windows Only and Possibly Fail Just a Little Later While Creating a Much Better Developer Experience

In PR #1103, I changed the relevant part of Build.ps as follows to only run the two checks on Windows:

if ($isWindows) {
    RunInspectCode
    RunCleanupCode
}

The Linux build took 16 min 13 sec while the Windows build took 14 min 21 sec. Roughly 8 min 20 sec into the Windows build, RunInspectCode was done. 9 min 30 sec into the build, RunCleanupCode finished. Thus, it would have taken approximately 24 min 30 sec instead of 22 min 8 sec to fail the build, e.g., in case of the above formatting issue. However, this way would have led to a much better developer experience. The build on one platform would have confirmed some more material facts ("does it work?") while also pointing out issues that are relatively less material ("does the code look consistent?"). In that case, I would have been more than happy to also fix that formatting issue. However, having to wait 22 minutes to only learn that a JSON file was not laid out as expected was super-frustrating.

Consider Changing the Order of the Images

You said you need to run Linux before Windows, because the Windows build publishes the NuGet package and the documentation website while also wanting to fail fast on formatting issues.

Looking at appveyor.yml, my question would be whether this must run on Windows. If it were possible to publish the NuGet package and the documentation website on Linux, we could:

  1. run the Windows build first, including the RunInspectCode and RunCleanupCode steps, where those two steps only take 5 to 7 minutes in total as opposed to 18 to 22 minutes on Linux; and
  2. run the Linux build second, without the RunInspectCode and RunCleanupCode steps but with the NuGet package and documentation website publishing steps.

We would fail much faster on formatting issues, i.e., after 5 to 7 rather than 18 to 22 minutes, while at the same time reducing the total build time by 18 to 22 minutes (as we also would in the first solution option).

@bart-degreed
Copy link
Contributor

bart-degreed commented Nov 17, 2021

DocFx is used to produce the documentation website, which doesn't work on .NET Core but requires the full .NET Framework or Mono.

In my experience, contrary to what Microsoft wants us all to believe through their marketing, running .NET Core on Windows is still more streamlined, more optimized, and better supported due to its history. Linux support has caught up in the last years, but there are still gaps due to fundamental differences in hardware/memory architecture, file systems, multi-threading, etc. In the runtime, there are various low-level constructs such as memory barriers and volatile fields that provide stronger guarantees on x86/x64 than documented, which has become something existing code is relying on. Just as we rely on the fact that Dictionary<TKey, TValue> is documented as unordered, while the current implementation preserves insertion order as long as no entries are removed.

We build against Linux to protect ourselves against Windows-only line breaks or encodings, slashes in the wrong direction, etc. in an attempt to support multiple platforms. That doesn't mean the platform we produce our binaries on is irrelevant. I'd like to avoid analyzing subtle bugs in the runtime/compiler infrastructure, which are reported quite frequently, such as here.

@ThomasBarnekow
Copy link
Author

OK, if we say that the documentation website and the NuGet packages must be produced on the Windows image, would you be open to exploring the other solution options?

Specifically, would you be fine with the first solution that I've called "Run on Windows Only and Possibly Fail Just a Little Later While Creating a Much Better Developer Experience"? Just to be sure, note that "Run on Windows Only" means that RunInspectCode and RunCleanupCode should be run on Windows only because they take such a long time on Linux?

@ThomasBarnekow
Copy link
Author

Using a separate repo, I have played a little with appveyor to see how we could optimize further. Here's a sample appveyor.yml file that tries to show the logic. The key idea is to use three jobs:

  • one that only builds and runs the code inspection on Windows;
  • one that builds and tests on Linux but does not run the code inspection; and
  • one that builds and tests on Windows, does not run the code inspection, and publishes the documentation and the NuGet package.

This gives you the fastest possible failure for build and code style-related issues while saving the time required for running the code inspection on Linux.

environment:
  matrix:
  - APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2019
    PERFORM_CODE_INSPECTION: yes
    PERFORM_TEST: no
    PERFORM_PUBLISH_STEP: no
  - APPVEYOR_BUILD_WORKER_IMAGE: Ubuntu
    PERFORM_CODE_INSPECTION: no
    PERFORM_TEST: yes
    PERFORM_PUBLISH_STEP: no
  - APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2019
    PERFORM_CODE_INSPECTION: no
    PERFORM_TEST: yes
    PERFORM_PUBLISH_STEP: yes

matrix:
  fast_finish: true

# We are just using the install phase to display information for demo purposes
install:
- pwsh: |
    Write-Host "Image = $Env:APPVEYOR_BUILD_WORKER_IMAGE`n"
    Write-Host "PERFORM_CODE_INSPECTION = $Env:PERFORM_CODE_INSPECTION"
    Write-Host "PERFORM_TEST = $Env:PERFORM_TEST"
    Write-Host "PERFORM_PUBLISH_STEP = $Env:PERFORM_PUBLISH_STEP`n"

before_build:
- pwsh: Write-Host "BEFORE-BUILD PHASE`n"
- pwsh: Write-Host "dotnet tool restore EXECUTED`n"

build_script:
- pwsh: Write-Host "BUILD PHASE`n"
- pwsh: Write-Host "dotnet build EXECUTED`n"
    
after_build:
- pwsh: Write-Host "AFTER-BUILD PHASE`n"
- pwsh: |
    if ($Env:PERFORM_CODE_INSPECTION -eq "yes") {
      Write-Host "RunInspectCode function EXECUTED"
      Write-Host "RunCleanupCode function EXECUTED`n"
    }

test: off

before_test:
- pwsh: Write-Host "BEFORE-TEST PHASE`n"
- pwsh: |
    if ($isWindows) {
      if ($Env:PERFORM_TEST -eq "yes") {
        Write-Host "Starting Azure Cosmos Emulator on Windows`n"
      }
    }
- sh: |
    if [[ "$PERFORM_TEST" == "yes" ]]; then
      echo "Starting Azure Cosmos Emulator on Linux"
    fi

test_script:
- pwsh: Write-Host "TEST PHASE`n"
- pwsh: |
    if ($Env:PERFORM_TEST -eq "yes") {
      Write-Host "dotnet test EXECUTED`n"
    }

for:
- # Publishing of documentation and NuGet packages
  matrix:
    only:
    - APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2019
      PERFORM_PUBLISH_STEP: yes
  
  # Using README.md as a placeholder for the NuGet package
  artifacts:
  - path: README.md

  # Using FTP provider as a placeholder for the NuGet provider
  deploy:
  - provider: FTP
    protocol: ftp
    host: speedtest.tele2.net
    username: anonymous
    password: anonymous
    
  before_deploy:
  - pwsh: Write-Host "BEFORE-DEPLOY PHASE`n"
  - pwsh: |
      if ($Env:PERFORM_PUBLISH_STEP -eq "yes") {
        Write-Host "docfx INSTALLED`n"
      }

  deploy_script:
  - pwsh: Write-Host "DEPLOY PHASE`n"
  - pwsh: |
      if ($Env:PERFORM_PUBLISH_STEP -eq "yes") {
        Write-Host "Documentation PUBLISHED`n"
      }

@bart-degreed
Copy link
Contributor

In my opinion, you're trying to solve the wrong problem. Allow me to explain.

The cibuild is intended as a final check when all development has been completed. Checks such as: does building in release mode produce compiler warnings, do all the tests succeed on all platforms, does the code style comply with the rules, does the documentation website build, is the NuGet package produced properly, etc. All these checks are based on a fixed set of tool versions so that multiple developers using different versions locally can work together towards a single outcome. Asking for a PR review with the intent to get it merged because the developer believes all the work has been completed is pointless when the cibuild still fails.

In all cases, running the cibuild takes a long time. At least long enough that it's not something to go sit and wait for, staring at your screen doing nothing else. Its an async process by nature and its not uncommon for cibuilds to take several hours, such as in the runtime (example here) and the compiler (example here).

Based on that, it makes the most sense to optimize the duration of the cibuild for the common case: all green. Today we have a single build script that runs all the stages, which can be run locally too. Running it locally is many times faster than on AppVeyor, so if anyone must wait for its outcome, running locally is by far the fastest way to find out. But usually, there is no reason to have to wait for it. Any issues resulting from it can be addressed at a later time.

Now it may be tempting to (mis)use the cibuild for kicking off work that is best run locally, because running locally means you'd need to wait for it, interrupting your coding process. Not only does pushing all the time pollute commit history, it also takes up VM resources that are scarce. In that case, the solution is to be able to run that (on your own fast system) without blocking. For example, you can continue writing code when the compiler is running in the background, tests are running in the background, Resharper is adding squiggles in the background, Windows Updates are running in the background, etc.

We already provide various ways to run partial steps of the full build locally, such as docs/build-dev.ps1, docs/generate-examples.ps1, cleanupcode.ps1, inspectcode.ps1, dotnet build and dotnet test.

One solution to make them non-blocking is to add a local script (with a desktop shortcut pointing to it) that copies the repo into a temp folder, then runs all the checks on the copy in the background. The way the styling settings are configured today is actually optimized to not distract while writing code: lots of "issues" that result in a cibuild failure are intentionally suppressed in Visual Studio and Rider. I always got quite distracted by the pointless warnings added by StyleCop when a single space was missing. That's not at all what I care about when trying to write a complex algorithm!

When the cibuild runs cleanupcode for a PR, it runs on only the changed files. This can be accomplished locally too using ReGitLint, passing it the commit hash of your base branch and HEAD, which makes it a lot faster.

So to summarize: the cibuild is and will always be very slow by nature, compared to running locally. Our project does not have funding to buy parallel jobs, but if I had some extra money I'd get a faster computer instead of spending it on agents. So if you need fast feedback, run locally (there's room for improvement there). On the other hand, if you want final checks and don't care how long that takes, use the cibuild.

@ThomasBarnekow
Copy link
Author

I simply acted on your earlier comments, some of which are at least partially repeated in the description of the issue.

In the context of my current PR ##1103, the very long build times and your focus on "code style first" posed an issue. I am trying to contribute back to this project. However, my contribution is originally developed and fully integration-tested on Windows, not Linux. Therefore, running the integration tests on Linux as well is something that I am doing only for this project. Thus, my main concern was to make it work on Linux, which proved to be particularly hard with the Azure Cosmos Emulator for Linux on appveyor specifically. What worked on Ubuntu on WSL locally did not necessarily work on appveyor. Therefore, the only (!) way to find out whether a different configuration or some other change worked was to run it on appveyor (after it had already worked locally of course). When the code style check (which is run before the first integration test is performed) adds 18 to 22 minutes to the build time on Linux, you'll get value-adding feedback much less frequently, and the turnaround time decreases dramatically.

On to issue ##1109, where @ktoim asked whether it would be "possible to get a version of .net6.0 on the AppVeyor feed"? You (@bart-degreed ) responded:

About the AppVeyor feed: we use the free tier, which means jobs cannot be run in parallel and each job is restricted to a maximum duration of 1 hour. Running the cibuild already takes a while today (and occasionally times out, the VMs are quite slow), so I'm not looking forward to doubling our build times. However, before crafting a new release I usually run some tests locally against the latest versions of frameworks, just to make sure the basics still work.

I responded and told you about the fact that RunInspectCode and RunCleanupCode alone take 18 to 22 minutes on Linux. You responded:

We need to run Linux before Windows, because the Windows build publishes the NuGet package and the documentation website. We'd like to fail fast on formatting issues, not wait for the whole Linux build to complete first. During regular work, formatting issues are usually the cause for failure. Instead of running cleanup locally, we queue up a build and work on something else meanwhile.

I know how frustrating it can be to have to wait on cibuild when working on that, so by all means disable steps temporarily to be productive.

I then proposed a change in which we run those two functions only on Windows, where they are MUCH FASTER. This means we are saving 18 to 22 minutes while still running those code-style checks. We would just fail a little later (in ca. 24 minutes rather than 18 to 22 minutes) in case of a code-style issue.

If you see the cibuild as the last check, code style issues should actually be an exception. Failing fast on those code-style checks would not be the highest. You would mainly want to avoid code style issues.

However, I still thought about how we could reach both objectives, i.e., reduce build times and fail fast on code-style issues. The result is the appveyor.yml I shared above. It runs the code style checks on a Windows image first and only in that first job. While this is a little slower than the solution option that I've called "Run on Windows Only and Possibly Fail Just a Little Later While Creating a Much Better Developer Experience", you might like it better if you still wanted to fail fast(er) on code style issues.

Since you have not answered my question, let me ask it again: Would you be fine with the first solution that I've called "Run on Windows Only and Possibly Fail Just a Little Later While Creating a Much Better Developer Experience"? This option is in line with your most recent comment.

@bart-degreed
Copy link
Contributor

My latest response was based on new insights. I've come to realize that we should focus our efforts on improving the local development experience, instead of adapting the cibuild to optimize the workflow for individuals. Tomorrow someone may ask us to run the tests first because they don't have docker running locally. Or to build the documentation website first, because it takes so long to build it locally. This is solving the wrong problem, so I don't think it's a good idea to change the cibuild for such purposes.

In the exceptional case that someone is debugging the cibuild itself (such as installing the Cosmos Emulator for Linux), I already mentioned that turning off steps temporarily in a PR to get faster feedback is fine, so that's not the discussion here.

The cibuild partly validates that the code complies with our principles and quality norms, which typically vary between organizations, teams, projects, and products. The sole fact that "it works" is not good enough, as pointed out in our guidelines:

The worst response a developer can give you to these principles is: “But it works?”

So instead of changing the cibuild, I'd like to focus our efforts on improving the local tools. Things we can do:

  1. Run cleanupcode locally only on changed files, by passing the target branch name as an optional parameter to cleanupcode.ps1. This makes it complete quickly. It would resolve the specified branch name to a commit hash and pass it to regitlint. We'll need to decide what happens when there are staged/unstaged changes.
  2. Run inspectcode on a copy of the repo so it does not block the development process (changing files, switching branches). Its running time does not depend on how many files were changed and it does not change the sources. On my laptop, it takes two and a half minutes to run. By passing an optional parameter, it would copy to a temporary directory, run on that, open the browser with results and remove the copy.

@bart-degreed
Copy link
Contributor

@ThomasBarnekow I've spent some time trying to improve cleanupcode for local usage as described earlier, and in the process created sethreno/ReGitLint#15 and sethreno/ReGitLint#16.

As for inspectcode, apart from a "copy" switch, I thought of something else that may help. While in Visual Studio, you can import the file WarningSeverities.DotSettings as an extra layer, then run Resharper > Inspect > Code Issues in Solution. This applies the same rules as the command-line runner, but faster. Note, however, that the list also includes spelling errors, which are not reported by the command-line running, so they can be ignored.

Do you think the proposed changes are going to improve the local development flow for you?

@ThomasBarnekow
Copy link
Author

@bart-degreed, sorry for the very late response. I did not see this.

I think the answer is "yes".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants