Commit b1f2020

Yu Zhao authored and akpm00 committed
mm: remap unused subpages to shared zeropage when splitting isolated thp
Patch series "mm: split underused THPs", v5.

The current upstream default policy for THP is always. However, Meta uses madvise in production as the current THP=always policy vastly overprovisions THPs in sparsely accessed memory areas, resulting in excessive memory pressure and premature OOM killing. Using madvise + relying on khugepaged has certain drawbacks over THP=always. Using madvise hints means THPs aren't "transparent" and require userspace changes. Waiting for khugepaged to scan memory and collapse pages into THPs can be slow and unpredictable in terms of performance (i.e. you don't know when the collapse will happen), while production environments require predictable performance. If there is enough memory available, it's better for both performance and predictability to have a THP from fault time, i.e. THP=always, rather than wait for khugepaged to collapse it, and deal with sparsely populated THPs when the system is running out of memory.

This patch series is an attempt to mitigate the issue of running out of memory when THP is always enabled. During runtime, whenever a THP is being faulted in or collapsed by khugepaged, the THP is added to a list. Whenever memory reclaim happens, the kernel runs the deferred_split shrinker, which goes through the list and checks whether the THP is underused, i.e. how many of the base 4K pages of the entire THP are zero-filled. If this number goes above a certain threshold, the shrinker will attempt to split that THP. Then at remap time, the pages that were zero-filled are mapped to the shared zeropage, hence saving memory. This method avoids the downside of wasting memory in areas where THP is sparsely filled when THP is always enabled, while still providing the upsides of THPs, like reduced TLB misses, without having to use madvise.

Meta production workloads that were CPU bound (>99% CPU utilization) were tested with the THP shrinker. The results after 2 hours are as follows:

                          | THP=madvise | THP=always        | THP=always
                          |             | + shrinker series | + max_ptes_none=409
--------------------------------------------------------------------------------
Performance improvement   |      -      |      +1.8%        |      +1.7%
(over THP=madvise)        |             |                   |
--------------------------------------------------------------------------------
Memory usage              |    54.6G    |  58.8G (+7.7%)    |  55.9G (+2.4%)
--------------------------------------------------------------------------------

max_ptes_none=409 means that any THP that has more than 409 out of 512 (80%) zero-filled pages will be split.

To test out the patches, the below commands without the shrinker will invoke the OOM killer immediately and kill stress, but will not fail with the shrinker:

echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
echo 20M > /sys/fs/cgroup/test/memory.max
echo 0 > /sys/fs/cgroup/test/memory.swap.max
# allocate twice memory.max for each stress worker and touch 40/512 of
# each THP, i.e. vm-stride 50K.
# With the shrinker, max_ptes_none of 470 and below won't invoke the OOM
# killer.
# Without the shrinker, the OOM killer is invoked immediately irrespective
# of the max_ptes_none value and kills stress.
stress --vm 1 --vm-bytes 40M --vm-stride 50K

This patch (of 5):

Here, "unused" means containing only zeros and inaccessible to userspace. When splitting an isolated thp under reclaim or migration, the unused subpages can be mapped to the shared zeropage, hence saving memory. This is particularly helpful when the internal fragmentation of a thp is high, i.e. it has many untouched subpages.

This is also a prerequisite for the THP low-utilization shrinker which will be introduced in later patches, where underutilized THPs are split, and the zero-filled pages are freed, saving memory.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Signed-off-by: Usama Arif <[email protected]>
Tested-by: Shuang Zhai <[email protected]>
Cc: Alexander Zhu <[email protected]>
Cc: Barry Song <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nico Pache <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuang Zhai <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
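The "underused" check described in the cover letter above is only sketched here for orientation; the actual shrinker is introduced in later patches of the series. The helper name (thp_underused_sketch) and the use of khugepaged's max_ptes_none as the threshold are illustrative assumptions, not part of this patch:

/*
 * Illustrative only: count zero-filled 4K subpages of a THP and report it
 * as underused once more than 'max_ptes_none' of its 512 subpages contain
 * nothing but zeros (e.g. more than 409 of 512 = 80%).
 */
static bool thp_underused_sketch(struct folio *folio, int max_ptes_none)
{
	int nr_zero_filled = 0;
	long i;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);

		if (!memchr_inv(kaddr, 0, PAGE_SIZE))
			nr_zero_filled++;
		kunmap_local(kaddr);
	}
	return nr_zero_filled > max_ptes_none;
}

The same memchr_inv()-based zero check is what try_to_map_unused_to_zeropage() in the mm/migrate.c hunk below applies to each subpage at remap time.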
1 parent 903edea · commit b1f2020

4 files changed, +75 −16 lines changed

include/linux/rmap.h

Lines changed: 6 additions & 1 deletion

@@ -745,7 +745,12 @@ int folio_mkclean(struct folio *);
 int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
		      struct vm_area_struct *vma);
 
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
+enum rmp_flags {
+	RMP_LOCKED		= 1 << 0,
+	RMP_USE_SHARED_ZEROPAGE	= 1 << 1,
+};
+
+void remove_migration_ptes(struct folio *src, struct folio *dst, int flags);
 
 /*
  * rmap_walk_control: To control rmap traversing for specific needs
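For orientation only, a hedged sketch (not part of this patch) of how callers combine the new flags; the in-tree call sites appear in the mm/huge_memory.c and mm/migrate.c hunks below:

/*
 * Illustrative only: restore migration entries on a folio the caller still
 * holds locked, and allow zero-filled anon subpages to be remapped to the
 * shared zeropage. Passing 0 keeps the behaviour of the former
 * 'bool locked == false' callers.
 */
remove_migration_ptes(folio, folio, RMP_LOCKED | RMP_USE_SHARED_ZEROPAGE);
remove_migration_ptes(src, dst, 0);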

mm/huge_memory.c

Lines changed: 4 additions & 4 deletions

@@ -3004,15 +3004,15 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
 	return false;
 }
 
-static void remap_page(struct folio *folio, unsigned long nr)
+static void remap_page(struct folio *folio, unsigned long nr, int flags)
 {
 	int i = 0;
 
 	/* If unmap_folio() uses try_to_migrate() on file, remove this check */
 	if (!folio_test_anon(folio))
 		return;
 	for (;;) {
-		remove_migration_ptes(folio, folio, true);
+		remove_migration_ptes(folio, folio, RMP_LOCKED | flags);
 		i += folio_nr_pages(folio);
 		if (i >= nr)
 			break;
@@ -3222,7 +3222,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	if (nr_dropped)
 		shmem_uncharge(folio->mapping->host, nr_dropped);
-	remap_page(folio, nr);
+	remap_page(folio, nr, PageAnon(head) ? RMP_USE_SHARED_ZEROPAGE : 0);
 
 	/*
	 * set page to its compound_head when split to non order-0 pages, so
@@ -3498,7 +3498,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	if (mapping)
 		xas_unlock(&xas);
 	local_irq_enable();
-	remap_page(folio, folio_nr_pages(folio));
+	remap_page(folio, folio_nr_pages(folio), 0);
 	ret = -EAGAIN;
 }

mm/migrate.c

Lines changed: 63 additions & 9 deletions

@@ -204,13 +204,57 @@ bool isolate_folio_to_list(struct folio *folio, struct list_head *list)
 	return true;
 }
 
+static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
+					  struct folio *folio,
+					  unsigned long idx)
+{
+	struct page *page = folio_page(folio, idx);
+	bool contains_data;
+	pte_t newpte;
+	void *addr;
+
+	VM_BUG_ON_PAGE(PageCompound(page), page);
+	VM_BUG_ON_PAGE(!PageAnon(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
+
+	if (folio_test_mlocked(folio) || (pvmw->vma->vm_flags & VM_LOCKED) ||
+	    mm_forbids_zeropage(pvmw->vma->vm_mm))
+		return false;
+
+	/*
+	 * The pmd entry mapping the old thp was flushed and the pte mapping
+	 * this subpage has been non present. If the subpage is only zero-filled
+	 * then map it to the shared zeropage.
+	 */
+	addr = kmap_local_page(page);
+	contains_data = memchr_inv(addr, 0, PAGE_SIZE);
+	kunmap_local(addr);
+
+	if (contains_data)
+		return false;
+
+	newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
+				       pvmw->vma->vm_page_prot));
+	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
+
+	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
+	return true;
+}
+
+struct rmap_walk_arg {
+	struct folio *folio;
+	bool map_unused_to_zeropage;
+};
+
 /*
  * Restore a potential migration pte to a working pte entry
  */
 static bool remove_migration_pte(struct folio *folio,
-		struct vm_area_struct *vma, unsigned long addr, void *old)
+		struct vm_area_struct *vma, unsigned long addr, void *arg)
 {
-	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+	struct rmap_walk_arg *rmap_walk_arg = arg;
+	DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		rmap_t rmap_flags = RMAP_NONE;
@@ -234,6 +278,9 @@ static bool remove_migration_pte(struct folio *folio,
 			continue;
 		}
 #endif
+		if (rmap_walk_arg->map_unused_to_zeropage &&
+		    try_to_map_unused_to_zeropage(&pvmw, folio, idx))
+			continue;
 
 		folio_get(folio);
 		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
@@ -312,14 +359,21 @@ static bool remove_migration_pte(struct folio *folio,
 * Get rid of all migration entries and replace them by
 * references to the indicated page.
 */
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
+void remove_migration_ptes(struct folio *src, struct folio *dst, int flags)
 {
+	struct rmap_walk_arg rmap_walk_arg = {
+		.folio = src,
+		.map_unused_to_zeropage = flags & RMP_USE_SHARED_ZEROPAGE,
+	};
+
 	struct rmap_walk_control rwc = {
 		.rmap_one = remove_migration_pte,
-		.arg = src,
+		.arg = &rmap_walk_arg,
 	};
 
-	if (locked)
+	VM_BUG_ON_FOLIO((flags & RMP_USE_SHARED_ZEROPAGE) && (src != dst), src);
+
+	if (flags & RMP_LOCKED)
 		rmap_walk_locked(dst, &rwc);
 	else
 		rmap_walk(dst, &rwc);
@@ -934,7 +988,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
	 * At this point we know that the migration attempt cannot
	 * be successful.
	 */
-	remove_migration_ptes(folio, folio, false);
+	remove_migration_ptes(folio, folio, 0);
 
 	rc = mapping->a_ops->writepage(&folio->page, &wbc);
 
@@ -1098,7 +1152,7 @@ static void migrate_folio_undo_src(struct folio *src,
				   struct list_head *ret)
 {
 	if (page_was_mapped)
-		remove_migration_ptes(src, src, false);
+		remove_migration_ptes(src, src, 0);
 	/* Drop an anon_vma reference if we took one */
 	if (anon_vma)
 		put_anon_vma(anon_vma);
@@ -1336,7 +1390,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
 	lru_add_drain();
 
 	if (old_page_state & PAGE_WAS_MAPPED)
-		remove_migration_ptes(src, dst, false);
+		remove_migration_ptes(src, dst, 0);
 
 out_unlock_both:
 	folio_unlock(dst);
@@ -1474,7 +1528,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
 
 	if (page_was_mapped)
 		remove_migration_ptes(src,
-			rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
+			rc == MIGRATEPAGE_SUCCESS ? dst : src, 0);
 
 unlock_put_anon:
 	folio_unlock(dst);

mm/migrate_device.c

Lines changed: 2 additions & 2 deletions

@@ -424,7 +424,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
 			continue;
 
 		folio = page_folio(page);
-		remove_migration_ptes(folio, folio, false);
+		remove_migration_ptes(folio, folio, 0);
 
 		src_pfns[i] = 0;
 		folio_unlock(folio);
@@ -840,7 +840,7 @@ void migrate_device_finalize(unsigned long *src_pfns,
 			dst = src;
 		}
 
-		remove_migration_ptes(src, dst, false);
+		remove_migration_ptes(src, dst, 0);
 		folio_unlock(src);
 
 		if (folio_is_zone_device(src))