
Conversation

ChiaHungDuan
Contributor

We used to update the header of a deallocated block with atomic_compare_exchange_strong to ensure that a concurrent double-free is detected. However, this operation incurs a huge performance overhead, accounting for over 50% of the execution time of deallocate(). Given that we already have the checksum to guard against most double-free cases, plus other block verifications in the primary allocator, use an atomic store instead.
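For context, a minimal standalone sketch (simplified, not the actual Scudo code in chunk.h) contrasting the old cmpxchg-based header update with the plain atomic store this change switches to:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// The chunk header is modeled here as a single packed 64-bit word stored
// atomically in front of the user chunk, as in Scudo's chunk.h.
using PackedHeader = std::uint64_t;

// Old approach: detect a concurrent writer by requiring that the header
// still holds the value we loaded earlier; abort on mismatch.
void updateHeaderCmpXchg(std::atomic<PackedHeader> *Header, PackedHeader Old,
                         PackedHeader New) {
  if (!Header->compare_exchange_strong(Old, New, std::memory_order_relaxed)) {
    std::fprintf(stderr, "race on chunk header\n");
    std::abort();
  }
}

// New approach: just store the updated (re-checksummed) header. Concurrent
// double-frees are expected to be caught by the header checksum and the
// primary allocator's block verifications instead.
void updateHeaderStore(std::atomic<PackedHeader> *Header, PackedHeader New) {
  Header->store(New, std::memory_order_relaxed);
}
```

Only the cmpxchg variant can notice that another thread rewrote the header between the load and the update, which is the security trade-off discussed below.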

@llvmbot
Member

llvmbot commented Sep 20, 2023

@llvm/pr-subscribers-compiler-rt-sanitizer

Changes



Full diff: https://github.com/llvm/llvm-project/pull/66955.diff

3 Files Affected:

  • (modified) compiler-rt/lib/scudo/standalone/chunk.h (-13)
  • (modified) compiler-rt/lib/scudo/standalone/combined.h (+30-34)
  • (modified) compiler-rt/lib/scudo/standalone/tests/chunk_test.cpp (-23)
diff --git a/compiler-rt/lib/scudo/standalone/chunk.h b/compiler-rt/lib/scudo/standalone/chunk.h
index 32874a8df6421c2..9228df047189091 100644
--- a/compiler-rt/lib/scudo/standalone/chunk.h
+++ b/compiler-rt/lib/scudo/standalone/chunk.h
@@ -128,19 +128,6 @@ inline void loadHeader(u32 Cookie, const void *Ptr,
     reportHeaderCorruption(const_cast<void *>(Ptr));
 }
 
-inline void compareExchangeHeader(u32 Cookie, void *Ptr,
-                                  UnpackedHeader *NewUnpackedHeader,
-                                  UnpackedHeader *OldUnpackedHeader) {
-  NewUnpackedHeader->Checksum =
-      computeHeaderChecksum(Cookie, Ptr, NewUnpackedHeader);
-  PackedHeader NewPackedHeader = bit_cast<PackedHeader>(*NewUnpackedHeader);
-  PackedHeader OldPackedHeader = bit_cast<PackedHeader>(*OldUnpackedHeader);
-  if (UNLIKELY(!atomic_compare_exchange_strong(
-          getAtomicHeader(Ptr), &OldPackedHeader, NewPackedHeader,
-          memory_order_relaxed)))
-    reportHeaderRace(Ptr);
-}
-
 inline bool isValid(u32 Cookie, const void *Ptr,
                     UnpackedHeader *NewUnpackedHeader) {
   PackedHeader NewPackedHeader = atomic_load_relaxed(getConstAtomicHeader(Ptr));
diff --git a/compiler-rt/lib/scudo/standalone/combined.h b/compiler-rt/lib/scudo/standalone/combined.h
index 29589cdd99fa78a..5bc7dd32890be11 100644
--- a/compiler-rt/lib/scudo/standalone/combined.h
+++ b/compiler-rt/lib/scudo/standalone/combined.h
@@ -67,14 +67,13 @@ class Allocator {
       if (UNLIKELY(Header.State != Chunk::State::Quarantined))
         reportInvalidChunkState(AllocatorAction::Recycling, Ptr);
 
-      Chunk::UnpackedHeader NewHeader = Header;
-      NewHeader.State = Chunk::State::Available;
-      Chunk::compareExchangeHeader(Allocator.Cookie, Ptr, &NewHeader, &Header);
+      Header.State = Chunk::State::Available;
+      Chunk::storeHeader(Allocator.Cookie, Ptr, &Header);
 
       if (allocatorSupportsMemoryTagging<Config>())
         Ptr = untagPointer(Ptr);
-      void *BlockBegin = Allocator::getBlockBegin(Ptr, &NewHeader);
-      Cache.deallocate(NewHeader.ClassId, BlockBegin);
+      void *BlockBegin = Allocator::getBlockBegin(Ptr, &Header);
+      Cache.deallocate(Header.ClassId, BlockBegin);
     }
 
     // We take a shortcut when allocating a quarantine batch by working with the
@@ -117,9 +116,8 @@ class Allocator {
       DCHECK_EQ(Header.Offset, 0);
       DCHECK_EQ(Header.SizeOrUnusedBytes, sizeof(QuarantineBatch));
 
-      Chunk::UnpackedHeader NewHeader = Header;
-      NewHeader.State = Chunk::State::Available;
-      Chunk::compareExchangeHeader(Allocator.Cookie, Ptr, &NewHeader, &Header);
+      Header.State = Chunk::State::Available;
+      Chunk::storeHeader(Allocator.Cookie, Ptr, &Header);
       Cache.deallocate(QuarantineClassId,
                        reinterpret_cast<void *>(reinterpret_cast<uptr>(Ptr) -
                                                 Chunk::getHeaderSize()));
@@ -610,47 +608,46 @@ class Allocator {
     if (UNLIKELY(!isAligned(reinterpret_cast<uptr>(OldPtr), MinAlignment)))
       reportMisalignedPointer(AllocatorAction::Reallocating, OldPtr);
 
-    Chunk::UnpackedHeader OldHeader;
-    Chunk::loadHeader(Cookie, OldPtr, &OldHeader);
+    Chunk::UnpackedHeader Header;
+    Chunk::loadHeader(Cookie, OldPtr, &Header);
 
-    if (UNLIKELY(OldHeader.State != Chunk::State::Allocated))
+    if (UNLIKELY(Header.State != Chunk::State::Allocated))
       reportInvalidChunkState(AllocatorAction::Reallocating, OldPtr);
 
     // Pointer has to be allocated with a malloc-type function. Some
     // applications think that it is OK to realloc a memalign'ed pointer, which
     // will trigger this check. It really isn't.
     if (Options.get(OptionBit::DeallocTypeMismatch)) {
-      if (UNLIKELY(OldHeader.OriginOrWasZeroed != Chunk::Origin::Malloc))
+      if (UNLIKELY(Header.OriginOrWasZeroed != Chunk::Origin::Malloc))
         reportDeallocTypeMismatch(AllocatorAction::Reallocating, OldPtr,
-                                  OldHeader.OriginOrWasZeroed,
+                                  Header.OriginOrWasZeroed,
                                   Chunk::Origin::Malloc);
     }
 
-    void *BlockBegin = getBlockBegin(OldTaggedPtr, &OldHeader);
+    void *BlockBegin = getBlockBegin(OldTaggedPtr, &Header);
     uptr BlockEnd;
     uptr OldSize;
-    const uptr ClassId = OldHeader.ClassId;
+    const uptr ClassId = Header.ClassId;
     if (LIKELY(ClassId)) {
       BlockEnd = reinterpret_cast<uptr>(BlockBegin) +
                  SizeClassMap::getSizeByClassId(ClassId);
-      OldSize = OldHeader.SizeOrUnusedBytes;
+      OldSize = Header.SizeOrUnusedBytes;
     } else {
       BlockEnd = SecondaryT::getBlockEnd(BlockBegin);
       OldSize = BlockEnd - (reinterpret_cast<uptr>(OldTaggedPtr) +
-                            OldHeader.SizeOrUnusedBytes);
+                            Header.SizeOrUnusedBytes);
     }
     // If the new chunk still fits in the previously allocated block (with a
     // reasonable delta), we just keep the old block, and update the chunk
     // header to reflect the size change.
     if (reinterpret_cast<uptr>(OldTaggedPtr) + NewSize <= BlockEnd) {
       if (NewSize > OldSize || (OldSize - NewSize) < getPageSizeCached()) {
-        Chunk::UnpackedHeader NewHeader = OldHeader;
-        NewHeader.SizeOrUnusedBytes =
+        Header.SizeOrUnusedBytes =
             (ClassId ? NewSize
                      : BlockEnd -
                            (reinterpret_cast<uptr>(OldTaggedPtr) + NewSize)) &
             Chunk::SizeOrUnusedBytesMask;
-        Chunk::compareExchangeHeader(Cookie, OldPtr, &NewHeader, &OldHeader);
+        Chunk::storeHeader(Cookie, OldPtr, &Header);
         if (UNLIKELY(useMemoryTagging<Config>(Options))) {
           if (ClassId) {
             resizeTaggedChunk(reinterpret_cast<uptr>(OldTaggedPtr) + OldSize,
@@ -672,7 +669,7 @@ class Allocator {
     void *NewPtr = allocate(NewSize, Chunk::Origin::Malloc, Alignment);
     if (LIKELY(NewPtr)) {
       memcpy(NewPtr, OldTaggedPtr, Min(NewSize, OldSize));
-      quarantineOrDeallocateChunk(Options, OldTaggedPtr, &OldHeader, OldSize);
+      quarantineOrDeallocateChunk(Options, OldTaggedPtr, &Header, OldSize);
     }
     return NewPtr;
   }
@@ -1110,31 +1107,30 @@ class Allocator {
                                    Chunk::UnpackedHeader *Header,
                                    uptr Size) NO_THREAD_SAFETY_ANALYSIS {
     void *Ptr = getHeaderTaggedPointer(TaggedPtr);
-    Chunk::UnpackedHeader NewHeader = *Header;
     // If the quarantine is disabled, the actual size of a chunk is 0 or larger
     // than the maximum allowed, we return a chunk directly to the backend.
     // This purposefully underflows for Size == 0.
     const bool BypassQuarantine = !Quarantine.getCacheSize() ||
                                   ((Size - 1) >= QuarantineMaxChunkSize) ||
-                                  !NewHeader.ClassId;
+                                  !Header->ClassId;
     if (BypassQuarantine)
-      NewHeader.State = Chunk::State::Available;
+      Header->State = Chunk::State::Available;
     else
-      NewHeader.State = Chunk::State::Quarantined;
-    NewHeader.OriginOrWasZeroed = useMemoryTagging<Config>(Options) &&
-                                  NewHeader.ClassId &&
-                                  !TSDRegistry.getDisableMemInit();
-    Chunk::compareExchangeHeader(Cookie, Ptr, &NewHeader, Header);
+      Header->State = Chunk::State::Quarantined;
+    Header->OriginOrWasZeroed = useMemoryTagging<Config>(Options) &&
+                                Header->ClassId &&
+                                !TSDRegistry.getDisableMemInit();
+    Chunk::storeHeader(Cookie, Ptr, Header);
 
     if (UNLIKELY(useMemoryTagging<Config>(Options))) {
       u8 PrevTag = extractTag(reinterpret_cast<uptr>(TaggedPtr));
       storeDeallocationStackMaybe(Options, Ptr, PrevTag, Size);
-      if (NewHeader.ClassId) {
+      if (Header->ClassId) {
         if (!TSDRegistry.getDisableMemInit()) {
           uptr TaggedBegin, TaggedEnd;
           const uptr OddEvenMask = computeOddEvenMaskForPointerMaybe(
-              Options, reinterpret_cast<uptr>(getBlockBegin(Ptr, &NewHeader)),
-              NewHeader.ClassId);
+              Options, reinterpret_cast<uptr>(getBlockBegin(Ptr, Header)),
+              Header->ClassId);
           // Exclude the previous tag so that immediate use after free is
           // detected 100% of the time.
           setRandomTag(Ptr, Size, OddEvenMask | (1UL << PrevTag), &TaggedBegin,
@@ -1145,8 +1141,8 @@ class Allocator {
     if (BypassQuarantine) {
       if (allocatorSupportsMemoryTagging<Config>())
         Ptr = untagPointer(Ptr);
-      void *BlockBegin = getBlockBegin(Ptr, &NewHeader);
-      const uptr ClassId = NewHeader.ClassId;
+      void *BlockBegin = getBlockBegin(Ptr, Header);
+      const uptr ClassId = Header->ClassId;
       if (LIKELY(ClassId)) {
         bool UnlockRequired;
         auto *TSD = TSDRegistry.getTSDAndLock(&UnlockRequired);
diff --git a/compiler-rt/lib/scudo/standalone/tests/chunk_test.cpp b/compiler-rt/lib/scudo/standalone/tests/chunk_test.cpp
index 7a29f3c11b70ffb..1b2c1eb5a7d0c5c 100644
--- a/compiler-rt/lib/scudo/standalone/tests/chunk_test.cpp
+++ b/compiler-rt/lib/scudo/standalone/tests/chunk_test.cpp
@@ -37,29 +37,6 @@ TEST(ScudoChunkDeathTest, ChunkBasic) {
   free(Block);
 }
 
-TEST(ScudoChunkTest, ChunkCmpXchg) {
-  initChecksum();
-  const scudo::uptr Size = 0x100U;
-  scudo::Chunk::UnpackedHeader OldHeader = {};
-  OldHeader.OriginOrWasZeroed = scudo::Chunk::Origin::Malloc;
-  OldHeader.ClassId = 0x42U;
-  OldHeader.SizeOrUnusedBytes = Size;
-  OldHeader.State = scudo::Chunk::State::Allocated;
-  void *Block = malloc(HeaderSize + Size);
-  void *P = reinterpret_cast<void *>(reinterpret_cast<scudo::uptr>(Block) +
-                                     HeaderSize);
-  scudo::Chunk::storeHeader(Cookie, P, &OldHeader);
-  memset(P, 'A', Size);
-  scudo::Chunk::UnpackedHeader NewHeader = OldHeader;
-  NewHeader.State = scudo::Chunk::State::Quarantined;
-  scudo::Chunk::compareExchangeHeader(Cookie, P, &NewHeader, &OldHeader);
-  NewHeader = {};
-  EXPECT_TRUE(scudo::Chunk::isValid(Cookie, P, &NewHeader));
-  EXPECT_EQ(NewHeader.State, scudo::Chunk::State::Quarantined);
-  EXPECT_FALSE(scudo::Chunk::isValid(InvalidCookie, P, &NewHeader));
-  free(Block);
-}
-
 TEST(ScudoChunkDeathTest, CorruptHeader) {
   initChecksum();
   const scudo::uptr Size = 0x100U;

Contributor

@cferris1000 left a comment


LGTM.

@cryptoad

This is risky.

This is a race detection mechanism where we detect whether a chunk is modified by two threads concurrently. There is a window of time in which two threads can free the same chunk: both read the header, both pass the check, and both write it back as a free chunk header. If we do not detect this, the free chunk will be added to two caches, and further badness will ensue.

I agree this is an expensive check, and I initially considered both options (cmpxchg vs read-store). If you look for "race on chunk header at address" in Google, you'll find stuff like Instabug/Instabug-Android#435. So those races do happen in the wild, and while it would be complicated to exploit, I wouldn't think it to be impossible.

If you do want to remove this - at the cost of security - you might want to remove reportHeaderRace as well though.
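To make that window concrete, here is a small self-contained demo (hypothetical code, not Scudo internals) of the race: with a plain load-check-store sequence both threads can pass the state check and both "free" the chunk, whereas a cmpxchg lets only one of them win:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

enum State : std::uint64_t { Available = 0, Allocated = 1 };

std::atomic<std::uint64_t> Header{Allocated};
std::atomic<int> FreedCount{0};

// Load-check-store: both threads can observe "Allocated" before either of
// them writes "Available", so both may add the block to their caches.
void freeWithPlainStore() {
  if (Header.load(std::memory_order_relaxed) == Allocated) {
    Header.store(Available, std::memory_order_relaxed);
    FreedCount.fetch_add(1);
  }
}

// Cmpxchg: only one thread can win the exchange; the loser detects the race.
void freeWithCmpXchg() {
  std::uint64_t Expected = Allocated;
  if (Header.compare_exchange_strong(Expected, Available,
                                     std::memory_order_relaxed))
    FreedCount.fetch_add(1);
  else
    std::puts("race on chunk header detected");
}

int main() {
  std::thread T1(freeWithPlainStore), T2(freeWithPlainStore);
  T1.join();
  T2.join();
  // With plain stores this can print 2: the same chunk freed twice.
  std::printf("chunk freed %d time(s)\n", FreedCount.load());
}
```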

@ChiaHungDuan
Contributor Author

This is risky.

This is a race detection mechanism where we detect whether a chunk is modified by two threads concurrently. There is a window of time in which two threads can free the same chunk: both read the header, both pass the check, and both write it back as a free chunk header. If we do not detect this, the free chunk will be added to two caches, and further badness will ensue.

I agree this is an expensive check, and I initially considered both options (cmpxchg vs read-store). If you look for "race on chunk header at address" in Google, you'll find stuff like Instabug/Instabug-Android#435. So those races do happen in the wild, and while it would be complicated to exploit, I wouldn't think it to be impossible.

If you do want to remove this - at the cost of security - you might want to remove reportHeaderRace as well though.

Thanks, we really wanted to get your input.

We have measured cmpxchg, xchg, and atomic-store. The former two show the same performance overhead (especially on platforms with more cores). We also simulated the racy behavior, and the success rate is about 1 success in 2 million attempts on an x86_64 machine with 96 cores. As you said, it's not impossible, and it wouldn't take long to hit the case.
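For reference, the kind of comparison described above can be sketched as a micro-benchmark like the following (an assumed setup with made-up thread and iteration counts, not the harness actually used for the numbers quoted):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kThreads = 16;
constexpr int kIters = 1000000;
std::atomic<std::uint64_t> Header{0};

// Run the given header-update operation from many threads and time it.
template <typename Op> double run(Op Update) {
  const auto Start = std::chrono::steady_clock::now();
  std::vector<std::thread> Workers;
  for (int T = 0; T < kThreads; ++T)
    Workers.emplace_back([&] {
      for (int I = 0; I < kIters; ++I)
        Update();
    });
  for (auto &W : Workers)
    W.join();
  return std::chrono::duration<double>(std::chrono::steady_clock::now() - Start)
      .count();
}

int main() {
  std::printf("cmpxchg: %.3fs\n", run([] {
                std::uint64_t Expected = Header.load(std::memory_order_relaxed);
                Header.compare_exchange_strong(Expected, Expected + 1,
                                               std::memory_order_relaxed);
              }));
  std::printf("xchg:    %.3fs\n",
              run([] { Header.exchange(1, std::memory_order_relaxed); }));
  std::printf("store:   %.3fs\n",
              run([] { Header.store(1, std::memory_order_relaxed); }));
}
```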

I tried to study some double-free attacks, and they usually try to overwrite the metadata in the freelist. The way Scudo stores its freelist is quite different from those cases, which makes me think the trade-off may be acceptable. Besides, attacking this may have lower value than compromising the checksum: once the checksum is compromised, pulling off a double-free becomes easier anyway. Even so, I'm still uncertain.

Another thought is to slightly harden the double-free protection: we could also check the header state at allocation (still with atomic-load/atomic-store), as in the sketch below. We may only be able to check whether the block is still marked allocated, but the benefit is that we rely on the mutex already taken when accessing the primary allocator and don't need to introduce additional synchronization. We would still miss the case where only the cache is accessed, but I think it's a light enhancement.
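A minimal sketch of that allocation-time check, with hypothetical types and helpers rather than actual Scudo internals:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

enum : std::uint64_t { Available = 0, Allocated = 1, Quarantined = 2 };

// Toy stand-in for a block managed by the allocator: an atomically accessed
// header word followed by the user data.
struct Chunk {
  std::atomic<std::uint64_t> Header;
  char Data[64];
};

void *allocateFrom(Chunk &C) {
  // The block was just popped from a freelist, so its stale header must not
  // still say "Allocated". If it does, the block was very likely freed twice
  // and is now being handed out from two caches.
  if (C.Header.load(std::memory_order_relaxed) == Allocated) {
    std::fprintf(stderr, "invalid chunk state at allocation\n");
    std::abort();
  }
  C.Header.store(Allocated, std::memory_order_relaxed); // plain atomic store
  return C.Data;
}

int main() {
  Chunk C;
  C.Header.store(Available, std::memory_order_relaxed);
  void *P = allocateFrom(C); // fine: the header said Available
  (void)P;
}
```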

What do you think?

@cryptoad

What do you think?

I mean, it's fair. If the performance gains are significant, the protection might not be worth it; I initially went with the security side. I agree that the window for the race makes it difficult to exploit, and failing to win the race would mean an abort(), which would prevent further attempts, or at least make them very noisy.

You might want to run that through the Fuchsia people as well, make sure that everyone knows it's coming. I do think that the race will happen in the wild (another example @ tensorflow/tensorflow#58143), and will end up manifesting in some weird way that will be hard to debug - which is probably not worth the performance cost either.

I'll leave it to you and Christopher to make the decision.

@cferris1000
Contributor

Yeah, we agree there is a small loss of security. We are trying to think of another method to detect this error that might not give as much information but would prevent the error from occurring. For example, if this happened, the same allocation likely winds up twice in a list, and maybe there is a way to modify the header when doing an allocation that prevents the same allocation from being used twice. If you've got any ideas along that line, let us know.

Our thinking is that, as a pure bug, the chance of this happening is low. If you trigger this problem, you will likely also trigger a normal double free, so the chance of only ever seeing multiple threads double-freeing at exactly the same time is low. Thus you will still see the double-free error.

llvmbot added the compiler-rt:scudo (Scudo Hardened Allocator) label on Sep 27, 2023
@ChiaHungDuan
Contributor Author

Also created an internal bug to bring back some of these enhancements later.

ChiaHungDuan merged commit 54ddd07 into llvm:main on Sep 28, 2023
legrosbuffle pushed a commit to legrosbuffle/llvm-project that referenced this pull request Sep 29, 2023