Improve performance of output cache #48328
Conversation
src/Middleware/OutputCaching/src/RecyclableReadOnlySequenceSegment.cs
Let's build a benchmark ;)
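A minimal BenchmarkDotNet harness along these lines might look like the following sketch; the type name, payload sizes, and benchmark body are illustrative stand-ins, not the PR's actual suite:

```csharp
using System.IO;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser] // allocation counts are the headline metric for this PR
public class FormatterBenchmarks
{
    private byte[] _payload = null!;

    [Params(1024, 64 * 1024)] // small vs large cached payloads
    public int Size { get; set; }

    [GlobalSetup]
    public void Setup() => _payload = new byte[Size];

    [Benchmark(Baseline = true)]
    public long MemoryStreamCopy()
    {
        // stand-in for the old MemoryStream/BinaryWriter serialization path
        using var ms = new MemoryStream();
        ms.Write(_payload, 0, _payload.Length);
        return ms.Length;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<FormatterBenchmarks>();
}
```

With MemoryDiagnoser enabled, the before/after comparison shows allocated bytes per operation as well as timings, which is what the later benchmark results in this PR report.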
Couple high level nits, haven't looked much at the actual code yet
src/Middleware/OutputCaching/perf/Microsoft.AspNetCore.OutputCaching.Performance.csproj
@gfoidl @BrennanConroy nits resolved - great feedback, thanks! @sebastienros totally agree we should add a crank profile that stresses output-cache (ideally with both small and large payloads - maybe query-string length?), and get that merged before this (so we can compare with/without) - I'll take a stab at that, but I might be looking for some of your input there
The removal of "Tags" is unclear to me.
<ItemGroup>
  <Reference Include="Microsoft.AspNetCore.OutputCaching" />
  <Reference Include="BenchmarkDotNet" />
  <Reference Remove="Microsoft.CodeAnalysis.PublicApiAnalyzers" />
What is this for?
the project? investigating the overheads; I could, however, support burning this now if it is ugly to maintain
No, this specific line
<Reference Remove="Microsoft.CodeAnalysis.PublicApiAnalyzers" />
investigating the overheads;
Can you clarify what you mean here? I would think that it's not necessary to include the PublicApi analyzers on a benchmark app where we are not shipping an API.
EDIT: Nevermind, I see this is actually disabling the analyzers.
yeah, exactly that; I didn't want warning noise about public API in something that isn't public API - this might have originally been an incorrect path thing, but... it doesn't do anything painful
result.Tags = tags;
return result;

static readonly string[] CommonHeaders = new string[]
@Tratcher probably has opinions about this
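For context, the common-headers idea in the v2 serialization format is roughly: well-known header names are written as a small index into a fixed table instead of a length-prefixed string. A minimal sketch of the lookup side (the table contents here are hypothetical; the PR's actual CommonHeaders list differs):

```csharp
using System;

public static class CommonHeaderTable
{
    // Hypothetical table for illustration; the real CommonHeaders list in the PR differs.
    static readonly string[] CommonHeaders = new string[]
    {
        "Content-Type",
        "Content-Length",
        "Cache-Control",
        "Vary",
    };

    // Returns the table index for a known header, or -1 so the caller
    // falls back to writing the name as a plain length-prefixed string.
    public static int TryGetCommonHeaderIndex(string name)
    {
        for (var i = 0; i < CommonHeaders.Length; i++)
        {
            if (string.Equals(CommonHeaders[i], name, StringComparison.OrdinalIgnoreCase))
            {
                return i;
            }
        }
        return -1;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(CommonHeaderTable.TryGetCommonHeaderIndex("content-type")); // 0
        Console.WriteLine(CommonHeaderTable.TryGetCommonHeaderIndex("X-Custom"));     // -1
    }
}
```

A linear scan over a short fixed table is cheap here because it only runs once per header during serialization.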
src/Middleware/OutputCaching/test/OutputCacheEntryFormatterTests.cs
@BrennanConroy from the entry? It is never used, and isn't part of the public API, yet by existing it requires deserialization into
Looks like this PR hasn't been active for some time and the codebase could have been changed in the meantime.
(rebased on top of redis output cache changes)
- unify OutputCacheEntry and FormatterEntry; leased buffers for headers, tags, etc (dispose on way out); use ReadOnlySequence<byte> instead of List<byte[]> with recyclable segments; avoid copying the payload data once fetched; serialization tweak: use common headers (not yet listed)
- finish porting tests; all good
- migrate writes to IBufferWriter<byte>, using recyclable array buffer writer (similar to MemoryStream, but faster)
- use pooled buffers when buffering output-cache payloads
- RROSS should be able to recycle buffers as needed
- implement header name/value lookup buffer
- add bench project
- note memory overhead in bench
- full bench suite
- add benchmark result
- don't store tags in the cache payload; don't store segment details in the cache payload
- whitespace
- include test that uses body-writer rather than stream
- tidy up benchmarks
- remove pipe impl; we don't need it (proven in bench)
- add a few more known headers; don't store request-id
- don't write empty segments
- fix buffer cleanup paths
- revert changes to shared SegmentWriteStream
- update numbers
- use fixed size stackalloc
- Update src/Middleware/OutputCaching/src/OutputCacheEntryFormatter.cs (Co-authored-by: Günther Foidl)
- remove global SkipLocalsInit; add accessibility on all new members; in length checks, use - over + to avoid overflow
- nits
- relocate perf to microbenchmarks
- use standard project structure for micro-benchmarks
- use "actual" instead of "bytes" to avoid a "pop"/"ld" in the "release" case
- avoid a memcopy by writing directly to the buffer bytes (and increasing DRY)
- Update src/Middleware/OutputCaching/src/OutputCacheEntry.cs (Co-authored-by: Brennan)
- Update src/Middleware/OutputCaching/test/OutputCacheEntryFormatterTests.cs (Co-authored-by: Brennan)
- fix PR nits from #48450
- fix sln (dead project)
- Update src/Middleware/OutputCaching/src/OutputCacheEntry.cs (Co-authored-by: Brennan)
- DRY nit
- moar nits
- use leased buffer for tags when calling SetAsync
- Update src/Middleware/OutputCaching/src/OutputCacheEntryFormatter.cs (Co-authored-by: Brennan)
- use [LoggerMessage] for logging
- merge
- submodule delete
- bump
- optimize Output Cache; no API changes yet - all internal: unify OutputCacheEntry and FormatterEntry; leased buffers for headers, tags, etc (dispose on way out); use ReadOnlySequence<byte> instead of List<byte[]> with recyclable segments; avoid copying the payload data once fetched; serialization tweak: use common headers (not yet listed)
- sln fix dammit
@@ -20,19 +20,13 @@ internal static long EstimateCachedResponseSize(OutputCacheEntry cachedResponse)
long size = sizeof(int);

// Headers
if (cachedResponse.Headers != null)
foreach (var item in cachedResponse.Headers.Span)
Why is it safe to remove the null-check here?
@captainsafia because it is now a ReadOnlyMemory<T> rather than a HeaderDictionary?
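To illustrate why the null check can go: ReadOnlyMemory<T> is a struct, so an "unset" value is an empty memory rather than null, and iterating its span is a safe no-op. A small self-contained example:

```csharp
using System;

class Program
{
    static void Main()
    {
        // ReadOnlyMemory<T> is a value type: default means empty, never null,
        // so the foreach body simply never runs for an unset value.
        ReadOnlyMemory<(string Name, string Value)> headers = default;

        var count = 0;
        foreach (var item in headers.Span)
        {
            count++;
        }

        Console.WriteLine(count);          // 0
        Console.WriteLine(headers.Length); // 0
    }
}
```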
{
    size += cachedResponse.Body.Length;
}
size += cachedResponse.Body.Length;
Ditto here. Should we Debug.Assert if we know that Body and Headers are set at this point?
@captainsafia it is now a ReadOnlySequence<byte> - worst case is that it is empty, which doesn't need any extra work
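The same value-type argument applies here: ReadOnlySequence<byte> is a struct, and an empty sequence contributes zero length with no special casing. A minimal illustration:

```csharp
using System;
using System.Buffers;

class Program
{
    static void Main()
    {
        // ReadOnlySequence<byte> is a struct; the empty sequence needs no null check.
        ReadOnlySequence<byte> body = ReadOnlySequence<byte>.Empty;

        long size = sizeof(int);
        size += body.Length; // adds 0 in the empty/worst case

        Console.WriteLine(body.IsEmpty); // True
        Console.WriteLine(size);         // 4
    }
}
```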
// additionally, we add support for reading a string with length specified by the caller (rather than handled automatically),
// and in-place (zero-copy) BLOB reads

private readonly ReadOnlyMemory<byte> original; // used to allow us to zero-copy chunks out of the payload
Privates should be _ prefixed: _original, _root, etc.
applied
private void RequestNewBuffer()
{
    Flush();
    var span = target.GetSpan(1024);
Any reason to use 1024 specifically? Why even pass a value in?
IMO "arbitrary size that we've at least hinted" > "arbitrary size that is implementation specific"; we don't want trivially small - ultimately any implementation can ignore us, but this seems a reasonable size to avoid constantly swapping buffers; you're right that there's nothing special about the number - want me to add a "// fairly arbitrary non-trivial size" comment?
> ultimately any implementation can ignore us

Well, ignore the lower bound; they should always give a buffer >= the value passed in.
Sure, a comment is fine. It'd be interesting to see if real world examples could benefit from a larger buffer. 😃
applied
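For reference, the GetSpan(sizeHint) lower-bound contract discussed above can be demonstrated with the stock ArrayBufferWriter<byte>:

```csharp
using System;
using System.Buffers;

class Program
{
    static void Main()
    {
        var writer = new ArrayBufferWriter<byte>();

        // GetSpan(sizeHint) is a lower bound: the writer must return a span
        // of at least sizeHint bytes, and is free to return more.
        var span = writer.GetSpan(1024);
        Console.WriteLine(span.Length >= 1024); // True

        // Passing no hint lets the implementation pick any non-empty size.
        var any = writer.GetSpan();
        Console.WriteLine(any.Length > 0); // True
    }
}
```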
@@ -110,8 +108,12 @@ private async Task InvokeAwaited(HttpContext httpContext, IReadOnlyList<IOutputC
// Can this request be served from cache?
if (context.AllowCacheLookup)
{
    if (await TryServeFromCacheAsync(context, policies))
    bool served = await TryServeFromCacheAsync(context, policies);
    context.ReleaseCachedResponse(); // release even if not served due to failing conditions
Should this be put in the finally instead of sprinkling it everywhere?
the lifetime is a bit more complex than that - the cached response is changed in a few spots; in all the return cases you may be right, but the one you highlight here can get overstomped again later, validly; I was also trying to not recycle buffers in any "cancellation token abort" scenarios, as I don't want to recycle anything if there's even a remote chance that async code (the output cache implementation) is still touching that buffer - I'd rather drop them on the floor than get competition.
That was my logic; happy to add some comments, and maybe a bool releaseResponse for the common return case (so the only thing repeated is releaseResponse = true)?
Maybe I'm missing something, but it looks like every single spot this method exits from, the cached response is released. And if I am missing something and it's more complex than that, there should probably be comments explaining it.
> I don't want to recycle anything if there's even a remote chance that async code (the output cache implementation) is still touching that buffer
You can preserve that behavior:
var hasException = false;
try
{
}
catch
{
hasException = true;
throw;
}
finally
{
if (!hasException)
{
// release
}
}
applied
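The release-on-success pattern suggested above can be exercised end to end; here is a runnable sketch where CacheScope and Invoke are hypothetical stand-ins for the middleware's context and invocation path:

```csharp
using System;

class CacheScope
{
    public bool Released { get; private set; }
    public void Release() => Released = true;
}

class Program
{
    // Release-on-success: recycle only when no exception escaped, so an
    // aborted async path can never observe a recycled (reused) buffer.
    static void Invoke(CacheScope scope, Action body)
    {
        var hasException = false;
        try
        {
            body();
        }
        catch
        {
            hasException = true;
            throw;
        }
        finally
        {
            if (!hasException)
            {
                scope.Release(); // stand-in for context.ReleaseCachedResponse()
            }
        }
    }

    static void Main()
    {
        var ok = new CacheScope();
        Invoke(ok, () => { });
        Console.WriteLine(ok.Released); // True

        var failed = new CacheScope();
        try { Invoke(failed, () => throw new InvalidOperationException()); }
        catch (InvalidOperationException) { }
        Console.WriteLine(failed.Released); // False - buffers deliberately dropped
    }
}
```

On the failure path the buffers are intentionally leaked to the GC rather than returned to the pool, trading a little allocation for safety.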
Improve performance of output cache
Reduce allocations and CPU impact of output cache by making changes to the internal implementation (numbers and evidence at bottom)
Description
The output cache feature is complex; to understand the changes, first we need to discuss the implementation:
- OutputCacheContext / feature
- OutputCacheEntry to use for the existing payload
- byte[] returned from the IOutputCacheStore, which it loads into a MemoryStream with BinaryReader, and manually deserializes into a FormatterEntry (parsing the headers into strings, etc - note that the original segments are reconstructed into new byte[] copies of the data)
- FormatterEntry has a headers Dictionary<string, string?[]>, a string[] tags, and a list of byte[] buffers
- OutputCacheEntry which has a HeaderDictionary?, the same string[], and a CachedResponseBody (which wraps a List<byte[]>)
- OutputCacheStream which writes to the underlying stream but also duplicates the data to a SegmentWriteStream (which wraps a List<byte[]>) and uses this to write a new OutputCacheEntry
- FormatterEntry from the composed OutputCacheEntry, writes this via MemoryStream and BinaryWriter to a byte[], which it passes to the IOutputCacheStore
Key observations:
- all of this is internal; we have a lot of scope for change
- all the byte[] mentioned are dropped on the floor
- the byte[] is for the IOutputCacheStore API; this will be addressed separately
- FormatterEntry and OutputCacheEntry are basically the same thing
- lots of intermediaries: List<T>, BinaryReader, BinaryWriter, MemoryStream
- List<T> and MemoryStream use array-doubling internally, so may involve multiple array allocations behind the scenes
- BinaryReader / BinaryWriter may use their own internal buffers

Work items:
merge FormatterEntry and OutputCacheEntry into a simpler store

we throw away FormatterEntry, and simplify OutputCacheEntry; we throw away the tags completely (they are not needed in the storage model), use ReadOnlyMemory<(string, StringValues)> for the headers (allowing us to use oversized leased arrays), and ReadOnlySequence<byte> for the body payload (deleting CachedResponseBody)

for the very few places where headers on the OutputCacheEntry are inspected, we add a TryFindHeader API; this is incredibly rare, so a few O(N) scans will be much more efficient than paying the same O(N) cost to build a dictionary (plus additional storage), for only a few O(1) fetches

introduce a custom recyclable sequence API
the new RecyclableReadOnlySequenceSegment : ReadOnlySequenceSegment<byte> allows ReadOnlySequence<byte> to be constructed cheaply, using two definitions of recycling:

implement a v2 serialization format
fortunately the existing format included version preamble, so we can add a v2 effortlessly; key differences in v2:
implement custom reader/writer for the serializer
specifically, we introduce new ref struct reader/writer that are wire-compatible with BinaryReader / BinaryWriter, but minimal; the reader works directly on the byte[] returned from IOutputCacheStore (later we may want to make this work against ReadOnlySequence<byte> for a revised API, but we can defer this), using oversized leased arrays for the headers; if parsing pre-existing v1 data, the body uses the leased segment chain as mentioned above but pointing directly at the original byte[] data (which is never reused, so: safe); for v2 data, the body uses a single-segment (no chain) sequence of the payload bytes

the writer targets an IBufferWriter<byte>, and in particular a new RecyclableArrayBufferWriter<byte> which uses leased arrays (using the single-array model as per ArrayBufferWriter<byte>); we still need a byte[] ToArray() API to satisfy the IOutputCacheStore demands (we can address that later)

change the output capture to use sequence chains
we change OutputCacheStream to use a new RecyclableSequenceBuilder instead of SegmentWriteStream (which was in "shared", and is now removed), which has similar Write APIs, but no longer pretends to be a Stream, and uses RecyclableReadOnlySequenceSegment and leased buffers for the backing store; the DetachAndReset() API hands back a ReadOnlySequence<byte> of the captured payload

recycle the cache-entry as needed
in the middleware, we use a new ReleaseCachedResponse() API that cleans up and recycles (if necessary) the OutputCacheEntry; this works similarly regardless of whether the current request built the data, or whether it was materialized; to help with this, we decorate the cached response as nullable to formalize that the state could be empty, and fixup call paths

intentionally, we only do this in "known good" code paths - we don't use finally / using, to avoid timing problems with async exceptions (which could still have sight of the leased buffers)

add new performance profiling project
(and backport the same to the original code)
before vs after
key observations:
- byte[] for the IOutputCacheStore API
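The segment-chaining idea behind RecyclableReadOnlySequenceSegment can be sketched like this; the sketch is simplified (no pooling or recycling, which is the part the PR adds) and ChainSegment is a hypothetical name:

```csharp
using System;
using System.Buffers;

// Minimal ReadOnlySequenceSegment<byte> subclass used to stitch discontiguous
// buffers into one ReadOnlySequence<byte> without copying the payload.
sealed class ChainSegment : ReadOnlySequenceSegment<byte>
{
    public ChainSegment(ReadOnlyMemory<byte> memory)
    {
        Memory = memory;
    }

    // Links a new segment after this one and returns it, maintaining the
    // RunningIndex invariant the sequence relies on.
    public ChainSegment Append(ReadOnlyMemory<byte> memory)
    {
        var next = new ChainSegment(memory)
        {
            RunningIndex = RunningIndex + Memory.Length,
        };
        Next = next;
        return next;
    }
}

class Program
{
    static void Main()
    {
        var first = new ChainSegment(new byte[] { 1, 2, 3 });
        var last = first.Append(new byte[] { 4, 5 });

        // Zero-copy view over both buffers as one logical payload.
        var sequence = new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);
        Console.WriteLine(sequence.Length);           // 5
        Console.WriteLine(sequence.ToArray().Length); // 5
    }
}
```

The PR's version additionally pools the segment objects and can return the leased byte buffers to ArrayPool when the entry is released.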