
Conversation

ChiaHungDuan
Contributor

We used to update the header of a deallocated block with atomic_compare_exchange_strong to ensure that a concurrent double-free is detected. However, this operation incurs a huge performance overhead, accounting for over 50% of the execution time of deallocate(). Given that we already have the checksum to guard against most double-free cases, plus other block verifications in the primary allocator, use an atomic store instead.
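For context, a minimal standalone sketch (simplified, not the actual Scudo code in chunk.h) contrasting the old cmpxchg-based header update with the plain atomic store this change switches to:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// The chunk header is modeled here as a single packed 64-bit word stored
// atomically in front of the user chunk, as in Scudo's chunk.h.
using PackedHeader = std::uint64_t;

// Old approach: detect a concurrent writer by requiring that the header
// still holds the value we loaded earlier; abort on mismatch.
void updateHeaderCmpXchg(std::atomic<PackedHeader> *Header, PackedHeader Old,
                         PackedHeader New) {
  if (!Header->compare_exchange_strong(Old, New, std::memory_order_relaxed)) {
    std::fprintf(stderr, "race on chunk header\n");
    std::abort();
  }
}

// New approach: just store the updated (re-checksummed) header. Concurrent
// double-frees are expected to be caught by the header checksum and the
// primary allocator's block verifications instead.
void updateHeaderStore(std::atomic<PackedHeader> *Header, PackedHeader New) {
  Header->store(New, std::memory_order_relaxed);
}
```

Only the cmpxchg variant can notice that another thread rewrote the header between the load and the update, which is the security trade-off discussed below.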

@llvmbot
Member

llvmbot commented Sep 20, 2023

@llvm/pr-subscribers-compiler-rt-sanitizer

Changes



Full diff: https://github.com/llvm/llvm-project/pull/66955.diff

3 Files Affected:

  • (modified) compiler-rt/lib/scudo/standalone/chunk.h (-13)
  • (modified) compiler-rt/lib/scudo/standalone/combined.h (+30-34)
  • (modified) compiler-rt/lib/scudo/standalone/tests/chunk_test.cpp (-23)
diff --git a/compiler-rt/lib/scudo/standalone/chunk.h b/compiler-rt/lib/scudo/standalone/chunk.h
index 32874a8df6421c2..9228df047189091 100644
--- a/compiler-rt/lib/scudo/standalone/chunk.h
+++ b/compiler-rt/lib/scudo/standalone/chunk.h
@@ -128,19 +128,6 @@ inline void loadHeader(u32 Cookie, const void *Ptr,
     reportHeaderCorruption(const_cast<void *>(Ptr));
 }
 
-inline void compareExchangeHeader(u32 Cookie, void *Ptr,
-                                  UnpackedHeader *NewUnpackedHeader,
-                                  UnpackedHeader *OldUnpackedHeader) {
-  NewUnpackedHeader->Checksum =
-      computeHeaderChecksum(Cookie, Ptr, NewUnpackedHeader);
-  PackedHeader NewPackedHeader = bit_cast<PackedHeader>(*NewUnpackedHeader);
-  PackedHeader OldPackedHeader = bit_cast<PackedHeader>(*OldUnpackedHeader);
-  if (UNLIKELY(!atomic_compare_exchange_strong(
-          getAtomicHeader(Ptr), &OldPackedHeader, NewPackedHeader,
-          memory_order_relaxed)))
-    reportHeaderRace(Ptr);
-}
-
 inline bool isValid(u32 Cookie, const void *Ptr,
                     UnpackedHeader *NewUnpackedHeader) {
   PackedHeader NewPackedHeader = atomic_load_relaxed(getConstAtomicHeader(Ptr));
diff --git a/compiler-rt/lib/scudo/standalone/combined.h b/compiler-rt/lib/scudo/standalone/combined.h
index 29589cdd99fa78a..5bc7dd32890be11 100644
--- a/compiler-rt/lib/scudo/standalone/combined.h
+++ b/compiler-rt/lib/scudo/standalone/combined.h
@@ -67,14 +67,13 @@ class Allocator {
       if (UNLIKELY(Header.State != Chunk::State::Quarantined))
         reportInvalidChunkState(AllocatorAction::Recycling, Ptr);
 
-      Chunk::UnpackedHeader NewHeader = Header;
-      NewHeader.State = Chunk::State::Available;
-      Chunk::compareExchangeHeader(Allocator.Cookie, Ptr, &NewHeader, &Header);
+      Header.State = Chunk::State::Available;
+      Chunk::storeHeader(Allocator.Cookie, Ptr, &Header);
 
       if (allocatorSupportsMemoryTagging<Config>())
         Ptr = untagPointer(Ptr);
-      void *BlockBegin = Allocator::getBlockBegin(Ptr, &NewHeader);
-      Cache.deallocate(NewHeader.ClassId, BlockBegin);
+      void *BlockBegin = Allocator::getBlockBegin(Ptr, &Header);
+      Cache.deallocate(Header.ClassId, BlockBegin);
     }
 
     // We take a shortcut when allocating a quarantine batch by working with the
@@ -117,9 +116,8 @@ class Allocator {
       DCHECK_EQ(Header.Offset, 0);
       DCHECK_EQ(Header.SizeOrUnusedBytes, sizeof(QuarantineBatch));
 
-      Chunk::UnpackedHeader NewHeader = Header;
-      NewHeader.State = Chunk::State::Available;
-      Chunk::compareExchangeHeader(Allocator.Cookie, Ptr, &NewHeader, &Header);
+      Header.State = Chunk::State::Available;
+      Chunk::storeHeader(Allocator.Cookie, Ptr, &Header);
       Cache.deallocate(QuarantineClassId,
                        reinterpret_cast<void *>(reinterpret_cast<uptr>(Ptr) -
                                                 Chunk::getHeaderSize()));
@@ -610,47 +608,46 @@ class Allocator {
     if (UNLIKELY(!isAligned(reinterpret_cast<uptr>(OldPtr), MinAlignment)))
       reportMisalignedPointer(AllocatorAction::Reallocating, OldPtr);
 
-    Chunk::UnpackedHeader OldHeader;
-    Chunk::loadHeader(Cookie, OldPtr, &OldHeader);
+    Chunk::UnpackedHeader Header;
+    Chunk::loadHeader(Cookie, OldPtr, &Header);
 
-    if (UNLIKELY(OldHeader.State != Chunk::State::Allocated))
+    if (UNLIKELY(Header.State != Chunk::State::Allocated))
       reportInvalidChunkState(AllocatorAction::Reallocating, OldPtr);
 
     // Pointer has to be allocated with a malloc-type function. Some
     // applications think that it is OK to realloc a memalign'ed pointer, which
     // will trigger this check. It really isn't.
     if (Options.get(OptionBit::DeallocTypeMismatch)) {
-      if (UNLIKELY(OldHeader.OriginOrWasZeroed != Chunk::Origin::Malloc))
+      if (UNLIKELY(Header.OriginOrWasZeroed != Chunk::Origin::Malloc))
         reportDeallocTypeMismatch(AllocatorAction::Reallocating, OldPtr,
-                                  OldHeader.OriginOrWasZeroed,
+                                  Header.OriginOrWasZeroed,
                                   Chunk::Origin::Malloc);
     }
 
-    void *BlockBegin = getBlockBegin(OldTaggedPtr, &OldHeader);
+    void *BlockBegin = getBlockBegin(OldTaggedPtr, &Header);
     uptr BlockEnd;
     uptr OldSize;
-    const uptr ClassId = OldHeader.ClassId;
+    const uptr ClassId = Header.ClassId;
     if (LIKELY(ClassId)) {
       BlockEnd = reinterpret_cast<uptr>(BlockBegin) +
                  SizeClassMap::getSizeByClassId(ClassId);
-      OldSize = OldHeader.SizeOrUnusedBytes;
+      OldSize = Header.SizeOrUnusedBytes;
     } else {
       BlockEnd = SecondaryT::getBlockEnd(BlockBegin);
       OldSize = BlockEnd - (reinterpret_cast<uptr>(OldTaggedPtr) +
-                            OldHeader.SizeOrUnusedBytes);
+                            Header.SizeOrUnusedBytes);
     }
     // If the new chunk still fits in the previously allocated block (with a
     // reasonable delta), we just keep the old block, and update the chunk
     // header to reflect the size change.
     if (reinterpret_cast<uptr>(OldTaggedPtr) + NewSize <= BlockEnd) {
       if (NewSize > OldSize || (OldSize - NewSize) < getPageSizeCached()) {
-        Chunk::UnpackedHeader NewHeader = OldHeader;
-        NewHeader.SizeOrUnusedBytes =
+        Header.SizeOrUnusedBytes =
             (ClassId ? NewSize
                      : BlockEnd -
                            (reinterpret_cast<uptr>(OldTaggedPtr) + NewSize)) &
             Chunk::SizeOrUnusedBytesMask;
-        Chunk::compareExchangeHeader(Cookie, OldPtr, &NewHeader, &OldHeader);
+        Chunk::storeHeader(Cookie, OldPtr, &Header);
         if (UNLIKELY(useMemoryTagging<Config>(Options))) {
           if (ClassId) {
             resizeTaggedChunk(reinterpret_cast<uptr>(OldTaggedPtr) + OldSize,
@@ -672,7 +669,7 @@ class Allocator {
     void *NewPtr = allocate(NewSize, Chunk::Origin::Malloc, Alignment);
     if (LIKELY(NewPtr)) {
       memcpy(NewPtr, OldTaggedPtr, Min(NewSize, OldSize));
-      quarantineOrDeallocateChunk(Options, OldTaggedPtr, &OldHeader, OldSize);
+      quarantineOrDeallocateChunk(Options, OldTaggedPtr, &Header, OldSize);
     }
     return NewPtr;
   }
@@ -1110,31 +1107,30 @@ class Allocator {
                                    Chunk::UnpackedHeader *Header,
                                    uptr Size) NO_THREAD_SAFETY_ANALYSIS {
     void *Ptr = getHeaderTaggedPointer(TaggedPtr);
-    Chunk::UnpackedHeader NewHeader = *Header;
     // If the quarantine is disabled, the actual size of a chunk is 0 or larger
     // than the maximum allowed, we return a chunk directly to the backend.
     // This purposefully underflows for Size == 0.
     const bool BypassQuarantine = !Quarantine.getCacheSize() ||
                                   ((Size - 1) >= QuarantineMaxChunkSize) ||
-                                  !NewHeader.ClassId;
+                                  !Header->ClassId;
     if (BypassQuarantine)
-      NewHeader.State = Chunk::State::Available;
+      Header->State = Chunk::State::Available;
     else
-      NewHeader.State = Chunk::State::Quarantined;
-    NewHeader.OriginOrWasZeroed = useMemoryTagging<Config>(Options) &&
-                                  NewHeader.ClassId &&
-                                  !TSDRegistry.getDisableMemInit();
-    Chunk::compareExchangeHeader(Cookie, Ptr, &NewHeader, Header);
+      Header->State = Chunk::State::Quarantined;
+    Header->OriginOrWasZeroed = useMemoryTagging<Config>(Options) &&
+                                Header->ClassId &&
+                                !TSDRegistry.getDisableMemInit();
+    Chunk::storeHeader(Cookie, Ptr, Header);
 
     if (UNLIKELY(useMemoryTagging<Config>(Options))) {
       u8 PrevTag = extractTag(reinterpret_cast<uptr>(TaggedPtr));
       storeDeallocationStackMaybe(Options, Ptr, PrevTag, Size);
-      if (NewHeader.ClassId) {
+      if (Header->ClassId) {
         if (!TSDRegistry.getDisableMemInit()) {
           uptr TaggedBegin, TaggedEnd;
           const uptr OddEvenMask = computeOddEvenMaskForPointerMaybe(
-              Options, reinterpret_cast<uptr>(getBlockBegin(Ptr, &NewHeader)),
-              NewHeader.ClassId);
+              Options, reinterpret_cast<uptr>(getBlockBegin(Ptr, Header)),
+              Header->ClassId);
           // Exclude the previous tag so that immediate use after free is
           // detected 100% of the time.
           setRandomTag(Ptr, Size, OddEvenMask | (1UL << PrevTag), &TaggedBegin,
@@ -1145,8 +1141,8 @@ class Allocator {
     if (BypassQuarantine) {
       if (allocatorSupportsMemoryTagging<Config>())
         Ptr = untagPointer(Ptr);
-      void *BlockBegin = getBlockBegin(Ptr, &NewHeader);
-      const uptr ClassId = NewHeader.ClassId;
+      void *BlockBegin = getBlockBegin(Ptr, Header);
+      const uptr ClassId = Header->ClassId;
       if (LIKELY(ClassId)) {
         bool UnlockRequired;
         auto *TSD = TSDRegistry.getTSDAndLock(&UnlockRequired);
diff --git a/compiler-rt/lib/scudo/standalone/tests/chunk_test.cpp b/compiler-rt/lib/scudo/standalone/tests/chunk_test.cpp
index 7a29f3c11b70ffb..1b2c1eb5a7d0c5c 100644
--- a/compiler-rt/lib/scudo/standalone/tests/chunk_test.cpp
+++ b/compiler-rt/lib/scudo/standalone/tests/chunk_test.cpp
@@ -37,29 +37,6 @@ TEST(ScudoChunkDeathTest, ChunkBasic) {
   free(Block);
 }
 
-TEST(ScudoChunkTest, ChunkCmpXchg) {
-  initChecksum();
-  const scudo::uptr Size = 0x100U;
-  scudo::Chunk::UnpackedHeader OldHeader = {};
-  OldHeader.OriginOrWasZeroed = scudo::Chunk::Origin::Malloc;
-  OldHeader.ClassId = 0x42U;
-  OldHeader.SizeOrUnusedBytes = Size;
-  OldHeader.State = scudo::Chunk::State::Allocated;
-  void *Block = malloc(HeaderSize + Size);
-  void *P = reinterpret_cast<void *>(reinterpret_cast<scudo::uptr>(Block) +
-                                     HeaderSize);
-  scudo::Chunk::storeHeader(Cookie, P, &OldHeader);
-  memset(P, 'A', Size);
-  scudo::Chunk::UnpackedHeader NewHeader = OldHeader;
-  NewHeader.State = scudo::Chunk::State::Quarantined;
-  scudo::Chunk::compareExchangeHeader(Cookie, P, &NewHeader, &OldHeader);
-  NewHeader = {};
-  EXPECT_TRUE(scudo::Chunk::isValid(Cookie, P, &NewHeader));
-  EXPECT_EQ(NewHeader.State, scudo::Chunk::State::Quarantined);
-  EXPECT_FALSE(scudo::Chunk::isValid(InvalidCookie, P, &NewHeader));
-  free(Block);
-}
-
 TEST(ScudoChunkDeathTest, CorruptHeader) {
   initChecksum();
   const scudo::uptr Size = 0x100U;

Contributor

@cferris1000 left a comment


LGTM.

@cryptoad

This is risky.

This is a race detection mechanism where we detect whether a chunk is modified by two threads concurrently. There is a window of time in which two threads can free the same chunk: both read the header, both pass the check, and both write it back as a free chunk header. If we do not detect this, the free chunk will be added to two caches, and further badness will ensue.

I agree this is an expensive check, and I initially considered both options (cmpxchg vs read-store). If you look for "race on chunk header at address" in Google, you'll find stuff like Instabug/Instabug-Android#435. So those races do happen in the wild, and while it would be complicated to exploit, I wouldn't think it to be impossible.

If you do want to remove this - at the cost of security - you might want to remove reportHeaderRace as well though.
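To make that window concrete, here is a small self-contained demo (hypothetical code, not Scudo internals) of the race: with a plain load-check-store sequence both threads can pass the state check and both "free" the chunk, whereas a cmpxchg lets only one of them win:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

enum State : std::uint64_t { Available = 0, Allocated = 1 };

std::atomic<std::uint64_t> Header{Allocated};
std::atomic<int> FreedCount{0};

// Load-check-store: both threads can observe "Allocated" before either of
// them writes "Available", so both may add the block to their caches.
void freeWithPlainStore() {
  if (Header.load(std::memory_order_relaxed) == Allocated) {
    Header.store(Available, std::memory_order_relaxed);
    FreedCount.fetch_add(1);
  }
}

// Cmpxchg: only one thread can win the exchange; the loser detects the race.
void freeWithCmpXchg() {
  std::uint64_t Expected = Allocated;
  if (Header.compare_exchange_strong(Expected, Available,
                                     std::memory_order_relaxed))
    FreedCount.fetch_add(1);
  else
    std::puts("race on chunk header detected");
}

int main() {
  std::thread T1(freeWithPlainStore), T2(freeWithPlainStore);
  T1.join();
  T2.join();
  // With plain stores this can print 2: the same chunk freed twice.
  std::printf("chunk freed %d time(s)\n", FreedCount.load());
}
```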

@ChiaHungDuan
Contributor Author

This is risky.

This is a race detection mechanism where we detect whether a chunk is modified by two threads concurrently. There is a window of time in which two threads can free the same chunk: both read the header, both pass the check, and both write it back as a free chunk header. If we do not detect this, the free chunk will be added to two caches, and further badness will ensue.

I agree this is an expensive check, and I initially considered both options (cmpxchg vs read-store). If you look for "race on chunk header at address" in Google, you'll find stuff like Instabug/Instabug-Android#435. So those races do happen in the wild, and while it would be complicated to exploit, I wouldn't think it to be impossible.

If you do want to remove this - at the cost of security - you might want to remove reportHeaderRace as well though.

Thanks, we really wanted to get your input.

We have measured cmpxchg, xchg, and atomic-store. The former two show the same performance overhead (especially on platforms with more cores). We also simulated the racy behavior, and the success rate is about 1 success in 2 million attempts on an x86_64 machine with 96 cores. As you said, it's not impossible, and it wouldn't take long to hit the case.
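For reference, the kind of comparison described above can be sketched as a micro-benchmark like the following (an assumed setup with made-up thread and iteration counts, not the harness actually used for the numbers quoted):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kThreads = 16;
constexpr int kIters = 1000000;
std::atomic<std::uint64_t> Header{0};

// Run the given header-update operation from many threads and time it.
template <typename Op> double run(Op Update) {
  const auto Start = std::chrono::steady_clock::now();
  std::vector<std::thread> Workers;
  for (int T = 0; T < kThreads; ++T)
    Workers.emplace_back([&] {
      for (int I = 0; I < kIters; ++I)
        Update();
    });
  for (auto &W : Workers)
    W.join();
  return std::chrono::duration<double>(std::chrono::steady_clock::now() - Start)
      .count();
}

int main() {
  std::printf("cmpxchg: %.3fs\n", run([] {
                std::uint64_t Expected = Header.load(std::memory_order_relaxed);
                Header.compare_exchange_strong(Expected, Expected + 1,
                                               std::memory_order_relaxed);
              }));
  std::printf("xchg:    %.3fs\n",
              run([] { Header.exchange(1, std::memory_order_relaxed); }));
  std::printf("store:   %.3fs\n",
              run([] { Header.store(1, std::memory_order_relaxed); }));
}
```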

I tried to study some double-free attacks, and they usually try to overwrite the metadata in the freelist. The way Scudo stores its freelist is quite different from those cases, which makes me think the trade-off may be acceptable. Besides, attacking this may have lower value than compromising the checksum: once the checksum is compromised, pulling off a double-free becomes easier anyway. Even so, I'm still uncertain.

Another thought is to slightly harden the double-free protection: we could also check the header state at allocation (still with atomic-load/atomic-store), as in the sketch below. We may only be able to check whether the block is still marked allocated, but the benefit is that we rely on the mutex already taken when accessing the primary allocator and don't need to introduce additional synchronization. We would still miss the case where only the cache is accessed, but I think it's a light enhancement.
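A minimal sketch of that allocation-time check, with hypothetical types and helpers rather than actual Scudo internals:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

enum : std::uint64_t { Available = 0, Allocated = 1, Quarantined = 2 };

// Toy stand-in for a block managed by the allocator: an atomically accessed
// header word followed by the user data.
struct Chunk {
  std::atomic<std::uint64_t> Header;
  char Data[64];
};

void *allocateFrom(Chunk &C) {
  // The block was just popped from a freelist, so its stale header must not
  // still say "Allocated". If it does, the block was very likely freed twice
  // and is now being handed out from two caches.
  if (C.Header.load(std::memory_order_relaxed) == Allocated) {
    std::fprintf(stderr, "invalid chunk state at allocation\n");
    std::abort();
  }
  C.Header.store(Allocated, std::memory_order_relaxed); // plain atomic store
  return C.Data;
}

int main() {
  Chunk C;
  C.Header.store(Available, std::memory_order_relaxed);
  void *P = allocateFrom(C); // fine: the header said Available
  (void)P;
}
```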

What do you think?

@cryptoad

What do you think?

I mean, it's fair. If the performance gains are significant, the protection might not be worth it; I initially went with the security side. I agree that the window for the race makes it difficult to exploit, and failing to win the race would mean an abort(), which would prevent further attempts, or at least make them very noisy.

You might want to run that through the Fuchsia people as well, make sure that everyone knows it's coming. I do think that the race will happen in the wild (another example @ tensorflow/tensorflow#58143), and will end up manifesting in some weird way that will be hard to debug - which is probably not worth the performance cost either.

I'll leave it to you and Christopher to make the decision.

@cferris1000
Contributor

Yeah, we agree there is a small loss of security. We are trying to think of another method to detect this error that might not give as much information but would prevent the error from occurring. For example, if this happened, the same allocation likely winds up twice in a list, and maybe there is a way to modify the header when doing an allocation that prevents the same allocation from being used twice. If you've got any ideas along that line, let us know.

Our thinking is that, as a pure bug, the chance of this happening is low. If you trigger this problem, you will likely also trigger a normal double free, so the chance of only ever seeing multiple threads double-freeing at exactly the same time is low. Thus you will still see the double-free error.

llvmbot added the compiler-rt:scudo (Scudo Hardened Allocator) label on Sep 27, 2023
@ChiaHungDuan
Contributor Author

Also created an internal bug to bring back some of these enhancements later.

ChiaHungDuan merged commit 54ddd07 into llvm:main on Sep 28, 2023
legrosbuffle pushed a commit to legrosbuffle/llvm-project that referenced this pull request Sep 29, 2023