Improve performance of output cache #48328

Merged: mgravell merged 5 commits into main on Aug 3, 2023

Conversation

mgravell
Member

@mgravell mgravell commented May 19, 2023

Improve performance of output cache

Reduce allocations and CPU impact of output cache by making changes to the internal implementation (numbers and evidence at bottom)

Description

The output cache feature is complex; to understand the changes, first we need to discuss the implementation:

  • the middleware adds an OutputCacheContext / feature
  • this attempts to fetch an appropriate OutputCacheEntry to use for the existing payload
    • normally, it gets this via a byte[] returned from the IOutputCacheStore, which it loads into a MemoryStream with BinaryReader, and manually deserializes into a FormatterEntry (parsing the headers into strings, etc - note that the original segments are reconstructed into new byte[] copies of the data)
    • FormatterEntry has a headers Dictionary<string, string?[]>, a string[] tags, and a list of byte[] buffers
    • it then rewrites this into an OutputCacheEntry which has a HeaderDictionary?, the same string[], and a CachedResponseBody (which wraps a List<byte[]>)
  • if it is unable to get a matching entry, the output stream is shimmed via an OutputCacheStream which writes to the underlying stream but also duplicates the data to a SegmentWriteStream (which wraps a List<byte[]>) and uses this to write a new OutputCacheEntry
    • it may then need to serialize this, so it forms a FormatterEntry from the composed OutputCacheEntry, writes this via MemoryStream and BinaryWriter to a byte[], which it passes to the IOutputCacheStore

Key observations:

  • most of the types cited are internal; we have a lot of scope for change
  • all of the byte[] mentioned are dropped on the floor
  • the only time we actually need a right-sized byte[] is for the IOutputCacheStore API; this will be addressed separately
  • FormatterEntry and OutputCacheEntry are basically the same thing
  • lots of intermediate wrapper objects - List<T>, BinaryReader, BinaryWriter, MemoryStream
  • the List<T> and MemoryStream use array-doubling internally, so may involve multiple array allocations behind the scenes
  • BinaryReader / BinaryWriter may use their own internal buffers
  • the multiple dictionaries and arrays for headers are redundant; this is pure storage - we can use simpler constructs

Work items:

merge FormatterEntry and OutputCacheEntry into a simpler store

we throw away FormatterEntry, and simplify OutputCacheEntry; we throw away the tags completely (they are not needed in the storage model), use ReadOnlyMemory<(string, StringValues)> for the headers (allowing us to use oversized leased arrays), and ReadOnlySequence<byte> for the body payload (deleting CachedResponseBody)

for the very few places where headers on the OutputCacheEntry are inspected, we add a TryFindHeader API; this is incredibly rare, so a few O(N) scans will be much more efficient than paying the same O(N) cost to build a dictionary (plus additional storage), for only a few O(1) fetches
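the trade-off above can be sketched as follows; this is a minimal illustration with simplified types (the real entry stores ReadOnlyMemory<(string, StringValues)>; plain strings stand in for StringValues here), and the TryFindHeader below is modeled on the description above, not copied from the PR:

```csharp
using System;

// usage: a handful of lookups against a small header set
var headers = new (string Name, string Value)[]
{
    ("Content-Type", "text/html"),
    ("Content-Length", "42"),
};

Console.WriteLine(TryFindHeader(headers, "content-length", out var length)
    ? $"found: {length}"
    : "not found");

// linear scan: cheaper overall than building a dictionary when
// only one or two headers are ever looked up per entry
static bool TryFindHeader(ReadOnlyMemory<(string Name, string Value)> headers,
    string name, out string? value)
{
    foreach (var (candidate, candidateValue) in headers.Span)
    {
        if (string.Equals(candidate, name, StringComparison.OrdinalIgnoreCase))
        {
            value = candidateValue;
            return true;
        }
    }
    value = null;
    return false;
}
```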

introduce a custom recyclable sequence API

the new RecyclableReadOnlySequenceSegment : ReadOnlySequenceSegment<byte> allows ReadOnlySequence<byte> to be constructed cheaply, using two definitions of recycling:

  • a small number of segments is kept via a pool, allowing chain details to be recycled
  • optionally, recycling a chain may also attempt to detect buffers that can be recycled
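the two definitions of recycling can be sketched roughly like this; this is a toy version under stated assumptions (a single-slot [ThreadStatic] spare instead of a real pool, hypothetical type name), not the PR's actual RecyclableReadOnlySequenceSegment:

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

var first = PooledSegment.Create(new byte[] { 1, 2, 3 }, previous: null);
var last = PooledSegment.Create(new byte[] { 4, 5 }, previous: first);
var sequence = new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);
Console.WriteLine(sequence.Length); // 5

// recycle the chain details (segment objects); buffers here are ordinary
// arrays, so we don't attempt to return them to the ArrayPool
PooledSegment.RecycleChain(first, returnBuffersToPool: false);

sealed class PooledSegment : ReadOnlySequenceSegment<byte>
{
    [ThreadStatic] private static PooledSegment? _spare; // toy single-slot pool

    public static PooledSegment Create(ReadOnlyMemory<byte> memory, PooledSegment? previous)
    {
        var segment = _spare ?? new PooledSegment();
        _spare = null;
        segment.Memory = memory;
        if (previous is not null)
        {
            segment.RunningIndex = previous.RunningIndex + previous.Memory.Length;
            previous.Next = segment;
        }
        return segment;
    }

    public static void RecycleChain(PooledSegment? head, bool returnBuffersToPool)
    {
        while (head is not null)
        {
            var next = (PooledSegment?)head.Next;
            // optionally detect array-backed buffers that can be recycled
            if (returnBuffersToPool && MemoryMarshal.TryGetArray(head.Memory, out ArraySegment<byte> array))
            {
                ArrayPool<byte>.Shared.Return(array.Array!);
            }
            head.Memory = default;
            head.Next = null;
            head.RunningIndex = 0;
            _spare = head; // keep a segment around for reuse
            head = next;
        }
    }
}
```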

implement a v2 serialization format

fortunately the existing format included a version preamble, so we can add a v2 effortlessly; key differences in v2:

  1. instead of storing segment subdata, just store the payload length, and write a single contiguous body payload
  2. test header name/values against a hard-coded set of known values; if found, store an integer key rather than the length-prefixed UTF8 data
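the header-name half of difference 2 can be sketched like this; the well-known table below is illustrative, not the actual hard-coded set the PR ships, and BinaryWriter stands in for the custom writer:

```csharp
using System;
using System.IO;

// illustrative well-known table (assumption: not the PR's actual list)
string[] commonHeaders =
{
    "Cache-Control", "Content-Encoding", "Content-Length",
    "Content-Type", "ETag", "Vary",
};

using var ms = new MemoryStream();
using var writer = new BinaryWriter(ms);
WriteHeaderName(writer, "Content-Type"); // known: single-byte integer key
WriteHeaderName(writer, "X-Custom");     // unknown: sentinel + raw string
Console.WriteLine(ms.Length);

// v2 idea: known names become a small integer key; unknown names fall back
// to the v1-style length-prefixed UTF-8 string after a 0 sentinel
void WriteHeaderName(BinaryWriter writer, string name)
{
    for (int i = 0; i < commonHeaders.Length; i++)
    {
        if (string.Equals(commonHeaders[i], name, StringComparison.OrdinalIgnoreCase))
        {
            writer.Write7BitEncodedInt(i + 1); // reserve 0 as "not known"
            return;
        }
    }
    writer.Write7BitEncodedInt(0);
    writer.Write(name); // BinaryWriter length-prefixes the UTF-8 payload
}
```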

implement custom reader/writer for the serializer

specifically, we introduce new ref struct reader/writer that are wire-compatible with BinaryReader/BinaryWriter, but minimal; the reader works directly on the byte[] returned from IOutputCacheStore (later we may want to make this work against ReadOnlySequence<byte> for a revised API, but we can defer this), using oversized leased arrays for the headers; if parsing pre-existing v1 data, the body uses the leased segment chain as mentioned above but pointing directly at the original byte[] data (which is never reused, so: safe); for v2 data, the body uses a single-segment (no chain) sequence of the payload bytes

the writer targets an IBufferWriter<byte>, and in particular a new RecyclableArrayBufferWriter<byte> which uses leased arrays (using the single-array model as per ArrayBufferWriter<byte>); we still need a byte[] ToArray() API to satisfy the IOutputCacheStore demands (we can address that later)
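a compressed sketch of the writer shape, with hypothetical names (the PR's actual types differ); the 7-bit varint encoding shown is what makes the output wire-compatible with BinaryWriter:

```csharp
using System;
using System.Buffers;
using System.Text;

var buffer = new ArrayBufferWriter<byte>();
var writer = new FormatterWriter(buffer);
writer.Write7BitEncodedInt(300); // 0xAC 0x02 - same bytes BinaryWriter emits
writer.WriteString("hi");        // length prefix + UTF-8 payload
Console.WriteLine(BitConverter.ToString(buffer.WrittenSpan.ToArray())); // AC-02-02-68-69

// minimal writer over IBufferWriter<byte>: no MemoryStream, no BinaryWriter,
// no intermediate buffers beyond what the IBufferWriter hands out
ref struct FormatterWriter
{
    private readonly IBufferWriter<byte> _target;

    public FormatterWriter(IBufferWriter<byte> target) => _target = target;

    public void Write7BitEncodedInt(int value)
    {
        Span<byte> scratch = stackalloc byte[5];
        uint remaining = (uint)value;
        int length = 0;
        while (remaining >= 0x80)
        {
            scratch[length++] = (byte)(remaining | 0x80);
            remaining >>= 7;
        }
        scratch[length++] = (byte)remaining;
        _target.Write(scratch[..length]);
    }

    public void WriteString(string value)
    {
        Write7BitEncodedInt(Encoding.UTF8.GetByteCount(value));
        _target.Write(Encoding.UTF8.GetBytes(value)); // sketch only: real code avoids this alloc
    }
}
```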

change the output capture to use sequence chains

we change OutputCacheStream to use a new RecyclableSequenceBuilder instead of SegmentWriteStream (which was in "shared", and is now removed), which has similar Write APIs, but no longer pretends to be a Stream, and uses RecyclableReadOnlySequenceSegment and leased buffers for the backing store; the DetachAndReset() API hands back a ReadOnlySequence<byte> of the captured payload
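the capture side can be sketched roughly like this (assumption: hypothetical simplified types; the real RecyclableSequenceBuilder also recycles its segment objects, and the caller is responsible for returning the leased buffers once the detached sequence is released):

```csharp
using System;
using System.Buffers;

var builder = new SequenceBuilder();
builder.Write(new byte[] { 1, 2, 3 });
builder.Write(new byte[] { 4, 5 });
ReadOnlySequence<byte> payload = builder.DetachAndReset();
Console.WriteLine(payload.Length); // 5

sealed class ChunkSegment : ReadOnlySequenceSegment<byte>
{
    public ChunkSegment(ReadOnlyMemory<byte> memory, ChunkSegment? previous)
    {
        Memory = memory;
        if (previous is not null)
        {
            RunningIndex = previous.RunningIndex + previous.Memory.Length;
            previous.Next = this;
        }
    }
}

// write-only capture: appends into leased buffers, sealing each full buffer
// into a segment chain; DetachAndReset hands the chain back as a sequence
sealed class SequenceBuilder
{
    private const int ChunkSize = 8192;
    private byte[] _current = ArrayPool<byte>.Shared.Rent(ChunkSize);
    private int _used;
    private ChunkSegment? _head, _tail;

    public void Write(ReadOnlySpan<byte> data)
    {
        while (!data.IsEmpty)
        {
            if (_used == _current.Length)
            {
                SealCurrentChunk();
                _current = ArrayPool<byte>.Shared.Rent(ChunkSize);
                _used = 0;
            }
            int take = Math.Min(data.Length, _current.Length - _used);
            data[..take].CopyTo(_current.AsSpan(_used));
            _used += take;
            data = data[take..];
        }
    }

    public ReadOnlySequence<byte> DetachAndReset()
    {
        SealCurrentChunk();
        var result = new ReadOnlySequence<byte>(_head!, 0, _tail!, _tail!.Memory.Length);
        _head = _tail = null; // caller now owns the leased buffers
        _current = ArrayPool<byte>.Shared.Rent(ChunkSize);
        _used = 0;
        return result;
    }

    private void SealCurrentChunk()
    {
        _tail = new ChunkSegment(_current.AsMemory(0, _used), _tail);
        _head ??= _tail;
    }
}
```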

recycle the cache-entry as needed

in the middleware, we use a new ReleaseCachedResponse() API that cleans up and recycles (if necessary) the OutputCacheEntry; this works similarly regardless of whether the current request built the data, or whether it was materialized; to help with this, we decorate the cached response as nullable to formalize that the state could be empty, and fix up call paths

intentionally, we only do this in "known good" code paths - we don't use finally / using to avoid timing problems with async exceptions (which could still have sight of the leased buffers)

add new performance profiling project

(and backport the same to the original code)

before vs after

key observations:

  • write path before could allocate ~5x the payload size; now we allocate ~1x, which is explained by the need for the byte[] that the IOutputCacheStore API demands
  • read path before allocated ~1x the payload size; now there are no lost overheads
  • performance is improved in all scenarios, but especially on the read path, which is now O(1) in allocations thanks to zero-copy of the main payload (at least for the in-process scenario)

@adityamandaleeka
Member

cc @sebastienros

@sebastienros
Member

Let's build a benchmark ;)

Member

@BrennanConroy BrennanConroy left a comment

Couple high level nits, haven't looked much at the actual code yet

@mgravell
Member Author

@gfoidl @BrennanConroy nits resolved - great feedback, thanks!

@sebastienros totally agree we should add a crank profile that stresses output-cache (ideally with both small and large payloads - maybe query-string length?), and get that merged before this (so we can compare with/without) - I'll take a stab at that, but I might be looking for some of your input there

@mgravell mgravell marked this pull request as ready for review May 24, 2023 11:05
@mgravell mgravell requested a review from Tratcher as a code owner May 24, 2023 11:05
Member

@BrennanConroy BrennanConroy left a comment

The removal of "Tags" is unclear to me.

<ItemGroup>
<Reference Include="Microsoft.AspNetCore.OutputCaching" />
<Reference Include="BenchmarkDotNet" />
<Reference Remove="Microsoft.CodeAnalysis.PublicApiAnalyzers" />
Member

What is this for?

Member Author

the project? investigating the overheads; I could, however, support burning this now if it is ugly to maintain

Member

No, this specific line
<Reference Remove="Microsoft.CodeAnalysis.PublicApiAnalyzers" />

Member

@captainsafia captainsafia Jul 31, 2023

investigating the overheads;

Can you clarify what you mean here? I would think that it's not necessary to include the PublicApi analyzers on a benchmark app where we are not shipping an API.

EDIT: Nevermind, I see this is actually disabling the analyzers.

Member Author

yeah, exactly that; I didn't want warning noise about public API in something that isn't public API - this might have originally been an incorrect path thing, but... it doesn't do anything painful


result.Tags = tags;
return result;
static readonly string[] CommonHeaders = new string[]
Member

@Tratcher probably has opinions about this

@mgravell
Member Author

The removal of "Tags" is unclear to me.

@BrennanConroy from the entry? It is never used, and isn't part of the public API, yet by existing it requires deserialization into string each time, so that's allocations and UTF8 decode (plus an allocation/lease for the collection to put them in), every time, for absolutely no reason

@amcasey amcasey added the area-middleware Includes: URL rewrite, redirect, response cache/compression, session, and other general middlewares label Jun 2, 2023
@ghost

ghost commented Jun 14, 2023

Looks like this PR hasn't been active for some time and the codebase could have been changed in the meantime.
To make sure no breaking changes are introduced, please leave an /azp run comment here to rerun the CI pipeline and confirm success before merging the change.

@ghost ghost added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Jun 14, 2023
@mgravell
Member Author

(rebased on top of redis output cache changes)

- unify OutputCacheEntry and FormatterEntry
- leased buffers for headers, tags, etc (dispose on way out)
- use ReadOnlySequence<byte> instead of List<byte[]> with recyclable segments
- avoid copying the payload data once fetched
- serialization tweak: use common headers (not yet listed)

finish porting tests; all good

migrate writes to IBufferWriter<byte>, using recyclable array buffer writer (similar to MemoryStream, but: faster)

use pooled buffers when buffering output-cache payloads; RecyclableReadOnlySequenceSegment (RROSS) should be able to recycle buffers as needed

implement header name/value lookup buffer

add bench project

note memory overhead in bench

full bench suite

add benchmark result

- don't store tags in the cache payload
- don't store segment details in the cache payload

whitespace

include test that uses body-writer rather than stream

tidy up benchmarks

remove pipe impl; we don't need it (proven in bench)

add a few more known headers; don't store request-id

don't write empty segments

fix buffer cleanup paths

revert changes to shared SegmentWriteStream

update numbers

use fixed size stackalloc

Update src/Middleware/OutputCaching/src/OutputCacheEntryFormatter.cs

Co-authored-by: Günther Foidl <[email protected]>

- remove global SkipLocalsInit
- add accessibility on all new members
- in length checks, use - over + to avoid overflow

nits

relocate perf to microbenchmarks

use standard project structure for micro-benchmarks

use "actual" instead of "bytes" to avoid a "pop"/"ld" in the "release" case

avoid a memcopy by writing directly to the buffer bytes (and increasing DRY)

Update src/Middleware/OutputCaching/src/OutputCacheEntry.cs

Co-authored-by: Brennan <[email protected]>

Update src/Middleware/OutputCaching/test/OutputCacheEntryFormatterTests.cs

Co-authored-by: Brennan <[email protected]>

fix PR nits from #48450

fix sln (dead project)

Update src/Middleware/OutputCaching/src/OutputCacheEntry.cs

Co-authored-by: Brennan <[email protected]>

DRY nit

moar nits

use leased buffer for tags when calling SetAsync

Update src/Middleware/OutputCaching/src/OutputCacheEntryFormatter.cs

Co-authored-by: Brennan <[email protected]>

use [LoggerMessage] for logging

merge submodule delete

bump

optimize Output Cache; no API changes yet - all internal:
- unify OutputCacheEntry and FormatterEntry
- leased buffers for headers, tags, etc (dispose on way out)
- use ReadOnlySequence<byte> instead of List<byte[]> with recyclable segments
- avoid copying the payload data once fetched
- serialization tweak: use common headers (not yet listed)

sln fix

dammit
@@ -20,19 +20,13 @@ internal static long EstimateCachedResponseSize(OutputCacheEntry cachedResponse)
long size = sizeof(int);

// Headers
if (cachedResponse.Headers != null)
foreach (var item in cachedResponse.Headers.Span)
Member

Why is it safe to remove the null-check here?

Member Author

@captainsafia because it is now a ReadOnlyMemory<T> rather than a HeaderDictionary?

{
size += cachedResponse.Body.Length;
}
size += cachedResponse.Body.Length;
Member

Ditto here. Should we Debug.Assert if we know that Body and Headers are set at this point?

Member Author

@captainsafia it is now a ReadOnlySequence<byte> - worst case is that it is empty, which doesn't need any extra work

// additionally, we add support for reading a string with length specified by the caller (rather than handled automatically),
// and in-place (zero-copy) BLOB reads

private readonly ReadOnlyMemory<byte> original; // used to allow us to zero-copy chunks out of the payload
Member

Privates should be _ prefixed. _original, _root, etc.

Member Author

applied

private void RequestNewBuffer()
{
Flush();
var span = target.GetSpan(1024);
Member

Any reason to use 1024 specifically? Why even pass a value in?

Member Author

IMO "arbitrary size that we've at least hinted" > "arbitrary size that is implementation specific"; we don't want trivially small - ultimately any implementation can ignore us, but this seems a reasonable size to not be constantly swapping buffers; you're right that there's nothing special about the number; want me to add a // fairly arbitrary non-trivial size comment?

Member

ultimately any implementation can ignore us

Well, ignore the lower bound, they should always give a buffer >= the value passed in.

Sure, a comment is fine. It'd be interesting to see if real world examples could benefit from a larger buffer. 😃

Member Author

applied

@@ -110,8 +108,12 @@ private async Task InvokeAwaited(HttpContext httpContext, IReadOnlyList<IOutputC
// Can this request be served from cache?
if (context.AllowCacheLookup)
{
if (await TryServeFromCacheAsync(context, policies))
bool served = await TryServeFromCacheAsync(context, policies);
context.ReleaseCachedResponse(); // release even if not served due to failing conditions
Member

Should this be put in the finally instead of sprinkling it everywhere?

Member Author

the lifetime is a bit more complex than that - the cached response is changed in a few spots; in all the return cases you may be right, but the one you highlight here: can get overstomped again later, validly; I was also trying to not recycle buffers in any "cancellation token abort" scenarios, as I don't want to recycle anything if there's even a remote chance that async code (the output cache implementation) is still touching that buffer - I'd rather drop them on the floor than get competition

that was my logic; happy to add some comments, and maybe a bool releaseResponse for the common return case (so the only thing repeated is releaseResponse = true)?

Member

Maybe I'm missing something, but it looks like every single spot this method exits from, the cached response is released. And if I am missing something and it's more complex than that, there should probably be comments explaining it.

I don't want to recycle anything if there's even a remote chance that async code (the output cache implementation) is still touching that buffer

You can preserve that behavior:

var hasException = false;
try
{
}
catch
{
    hasException = true;
    throw;
}
finally
{
    if (!hasException)
    {
        // release
    }
}

Member Author

applied

@mgravell mgravell merged commit da234b9 into main Aug 3, 2023
@mgravell mgravell deleted the marc/ocbin branch August 3, 2023 15:42
@ghost ghost added this to the 8.0-rc1 milestone Aug 3, 2023
Labels
  • area-middleware (Includes: URL rewrite, redirect, response cache/compression, session, and other general middlewares)
  • area-perf (Performance infrastructure issues)
  • feature-output-caching
  • pending-ci-rerun (When assigned to a PR indicates that the CI checks should be rerun)