Skip to content

[scudo] Fix release to OS logic in secondary cache. #103303

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

JoshuaMBa
Copy link
Contributor

@JoshuaMBa JoshuaMBa commented Aug 13, 2024

In order to preserve the least recently used (LRU) ordering of the list of decommitted entries in the secondary cache, the cache entries are released from least recent to most recent. This also helps to avoid unnecessary scans of the cache since entries ready to be released (specifically, entries that are considered old relative to the configurable release interval) will always be at the tail of the list of committed entries by the LRU ordering. For this same reason, the OldestTime variable is no longer needed to indicate when releases are necessary so it has been removed.

This addresses an issue where the order of the entries in the decommitted list could be inverted if they were released without respecting the LRU ordering.

@llvmbot
Copy link
Member

llvmbot commented Aug 13, 2024

@llvm/pr-subscribers-compiler-rt-sanitizer

Author: Joshua Baehring (JoshuaMBa)

Changes

In order to preserve the least recently used (LRU) ordering of the cache in the list of decommitted entries, the cache entries are released from least recent to most recent. This also helps to avoid unnecessary scans of the cache since entries ready to be released (specifically, entries that are considered old relative to the configurable release interval) will always be at the tail of the list of committed entries by the LRU ordering.


Full diff: https://github.com/llvm/llvm-project/pull/103303.diff

1 Files Affected:

  • (modified) compiler-rt/lib/scudo/standalone/secondary.h (+16-25)
diff --git a/compiler-rt/lib/scudo/standalone/secondary.h b/compiler-rt/lib/scudo/standalone/secondary.h
index e3dc8d4ef8247c..956fbb508f2638 100644
--- a/compiler-rt/lib/scudo/standalone/secondary.h
+++ b/compiler-rt/lib/scudo/standalone/secondary.h
@@ -321,8 +321,6 @@ class MapAllocatorCache {
         }
         CachedBlock PrevEntry = Quarantine[QuarantinePos];
         Quarantine[QuarantinePos] = Entry;
-        if (OldestTime == 0)
-          OldestTime = Entry.Time;
         Entry = PrevEntry;
       }
 
@@ -344,8 +342,6 @@ class MapAllocatorCache {
 
       insert(Entry, (Entry.Time == 0) ? DECOMMITTED : COMMITTED);
 
-      if (OldestTime == 0)
-        OldestTime = Entry.Time;
     } while (0); // ScopedLock L(Mutex);
 
     for (MemMapT &EvictMemMap : EvictionMemMaps)
@@ -585,32 +581,28 @@ class MapAllocatorCache {
     }
   }
 
-  void releaseIfOlderThan(CachedBlock &Entry, u64 Time) REQUIRES(Mutex) {
-    if (!Entry.isValid() || !Entry.Time)
-      return;
-    if (Entry.Time > Time) {
-      if (OldestTime == 0 || Entry.Time < OldestTime)
-        OldestTime = Entry.Time;
-      return;
-    }
+  inline void release(CachedBlock &Entry) {
     Entry.MemMap.releaseAndZeroPagesToOS(Entry.CommitBase, Entry.CommitSize);
     Entry.Time = 0;
   }
 
   void releaseOlderThan(u64 Time) EXCLUDES(Mutex) {
     ScopedLock L(Mutex);
-    if (!EntriesCount || OldestTime == 0 || OldestTime > Time)
-      return;
-    OldestTime = 0;
-    for (uptr I = 0; I < Config::getQuarantineSize(); I++)
-      releaseIfOlderThan(Quarantine[I], Time);
-    for (u16 I = EntryLists[COMMITTED].Head; I != CachedBlock::InvalidEntry;
-         I = Entries[I].Next) {
-      if (Entries[I].Time && Entries[I].Time <= Time) {
-        unlink(I, COMMITTED);
-        pushFront(I, DECOMMITTED);
-      }
-      releaseIfOlderThan(Entries[I], Time);
+    for (uptr I = 0; I < Config::getQuarantineSize(); I++) {
+      CachedBlock &Entry = Quarantine[I];
+      if (!Entry.isValid() || Entry.Time > Time)
+        continue;
+      release(Entry);
+    }
+
+    // Release oldest entries first by releasing from tail
+    u16 ReleaseIndex = EntryLists[COMMITTED].Tail;
+    while (ReleaseIndex != CachedBlock::InvalidEntry &&
+           Entries[ReleaseIndex].Time <= Time) {
+      release(Entries[ReleaseIndex]);
+      unlink(ReleaseIndex, COMMITTED);
+      pushFront(ReleaseIndex, DECOMMITTED);
+      ReleaseIndex = Entries[ReleaseIndex].Prev;
     }
   }
 
@@ -619,7 +611,6 @@ class MapAllocatorCache {
   u32 QuarantinePos GUARDED_BY(Mutex) = 0;
   atomic_u32 MaxEntriesCount = {};
   atomic_uptr MaxEntrySize = {};
-  u64 OldestTime GUARDED_BY(Mutex) = 0;
   atomic_s32 ReleaseToOsIntervalMs = {};
   u32 CallsToRetrieve GUARDED_BY(Mutex) = 0;
   u32 SuccessfulRetrieves GUARDED_BY(Mutex) = 0;

Copy link
Contributor

@ChiaHungDuan ChiaHungDuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fix, please mention that in the title and the description. In addition, I think this can be covered by the test, right? If so, please add a test case.

@JoshuaMBa
Copy link
Contributor Author

This is a fix, please mention that in the title and the description. In addition, I think this can be covered by the test, right? If so, please add a test case.

The main issue with testing this is that the interval at which the issue occurs is somewhat arbitrary. I can try different cache sizes and interval values to determine a point where the error occurs, but I don't think that's a robust way of testing this case. I will try coming up with a better way of testing this outside of the current CacheOrder test.

@JoshuaMBa JoshuaMBa changed the title [scudo] Add time-based release to OS logic. [scudo] Fix release to OS logic in secondary cache. Aug 13, 2024
@ChiaHungDuan
Copy link
Contributor

This is a fix, please mention that in the title and the description. In addition, I think this can be covered by the test, right? If so, please add a test case.

The main issue with testing this is that the interval at which the issue occurs is somewhat arbitrary. I can try different cache sizes and interval values to determine a point where the error occurs, but I don't think that's a robust way of testing this case. I will try coming up with a better way of testing this outside of the current CacheOrder test.

IIRC, we want to ensure the eviction order is still LRU. If we do a page releasing before the first eviction, we will see the wrong order happens, right?

@JoshuaMBa JoshuaMBa force-pushed the release_to_os_logic branch from 1e34ab0 to d1a4ef4 Compare August 13, 2024 22:07
@JoshuaMBa
Copy link
Contributor Author

This is a fix, please mention that in the title and the description. In addition, I think this can be covered by the test, right? If so, please add a test case.

The main issue with testing this is that the interval at which the issue occurs is somewhat arbitrary. I can try different cache sizes and interval values to determine a point where the error occurs, but I don't think that's a robust way of testing this case. I will try coming up with a better way of testing this outside of the current CacheOrder test.

IIRC, we want to ensure the eviction order is still LRU. If we do a page releasing before the first eviction, we will see the wrong order happens, right?

Not necessarily. The case I showed was when all cached entries are old and ready to be released. That is when we see a complete inversion of the cache order. If the interval is not large enough, then it's possible that the entries are released in the proper order.

For example, if the committed entries are (1 2 3 4 5), from most recent to least recent and only 5 is old enough to be released on a store() call, then the committed list becomes (1 2 3 4) and the decommitted list becomes (5). On another store() call, if only 4 is old enough to be released then the committed list becomes (1 2 3) and the decommitted list becomes (4 5). In this case, the LRU ordering is actually being preserved.

However, I have now uploaded a test case that solves this issue by just first disabling releases during the initial store() calls and then reenabling the releases.

@ChiaHungDuan
Copy link
Contributor

This is a fix, please mention that in the title and the description. In addition, I think this can be covered by the test, right? If so, please add a test case.

The main issue with testing this is that the interval at which the issue occurs is somewhat arbitrary. I can try different cache sizes and interval values to determine a point where the error occurs, but I don't think that's a robust way of testing this case. I will try coming up with a better way of testing this outside of the current CacheOrder test.

IIRC, we want to ensure the eviction order is still LRU. If we do a page releasing before the first eviction, we will see the wrong order happens, right?

Not necessarily. The case I showed was when all cached entries are old and ready to be released. That is when we see a complete inversion of the cache order. If the interval is not large enough, then it's possible that the entries are released in the proper order.

For example, if the committed entries are (1 2 3 4 5), from most recent to least recent and only 5 is old enough to be released on a store() call, then the committed list becomes (1 2 3 4) and the decommitted list becomes (5). On another store() call, if only 4 is old enough to be released then the committed list becomes (1 2 3) and the decommitted list becomes (4 5). In this case, the LRU ordering is actually being preserved.

However, I have now uploaded a test case that solves this issue by just first disabling releases during the initial store() calls and then reenabling the releases.

So what I'm saying is doing an explicit release to dump all commited entries to decommited list

@JoshuaMBa
Copy link
Contributor Author

This is a fix, please mention that in the title and the description. In addition, I think this can be covered by the test, right? If so, please add a test case.

The main issue with testing this is that the interval at which the issue occurs is somewhat arbitrary. I can try different cache sizes and interval values to determine a point where the error occurs, but I don't think that's a robust way of testing this case. I will try coming up with a better way of testing this outside of the current CacheOrder test.

IIRC, we want to ensure the eviction order is still LRU. If we do a page releasing before the first eviction, we will see the wrong order happens, right?

Not necessarily. The case I showed was when all cached entries are old and ready to be released. That is when we see a complete inversion of the cache order. If the interval is not large enough, then it's possible that the entries are released in the proper order.
For example, if the committed entries are (1 2 3 4 5), from most recent to least recent and only 5 is old enough to be released on a store() call, then the committed list becomes (1 2 3 4) and the decommitted list becomes (5). On another store() call, if only 4 is old enough to be released then the committed list becomes (1 2 3) and the decommitted list becomes (4 5). In this case, the LRU ordering is actually being preserved.
However, I have now uploaded a test case that solves this issue by just first disabling releases during the initial store() calls and then reenabling the releases.

So what I'm saying is doing an explicit release to dump all commited entries to decommited list

Yes that will work as long as we disable the release interval.

In order to preserve the least recently used (LRU)
ordering of the cache in the list of decommitted entries,
the cache entries are released from least recent to most
recent. This also helps to avoid unnecessary scans of the cache
since entries ready to be released (specifically, entries that
are considered old relative to the configurable release interval)
will always be at the tail of the list of committed entries
by the LRU ordering.
@JoshuaMBa JoshuaMBa force-pushed the release_to_os_logic branch from d1a4ef4 to ee985d4 Compare August 14, 2024 00:03
Copy link
Contributor

@cferris1000 cferris1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@JoshuaMBa JoshuaMBa closed this Aug 14, 2024
@JoshuaMBa JoshuaMBa deleted the release_to_os_logic branch August 14, 2024 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants