Commit f1dd2cd

Michal Hocko authored and torvalds committed
mm, memory_hotplug: do not associate hotadded memory to zones until online
The current memory hotplug implementation relies on having all the
struct pages associated with a zone/node during the physical hotplug
phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the
vast majority of cases this means that they are added to ZONE_NORMAL.
This has been so since 9d99aaa ("[PATCH] x86_64: Support memory hotadd
without sparsemem") and it wasn't a big deal back then because movable
onlining didn't exist yet.

Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
onlining 511c2ab ("mm, memory-hotplug: dynamic configure movable memory
and portion memory") and then things got more complicated. Rather than
reconsidering the zone association, which was no longer needed (because
memory hotplug already depended on SPARSEMEM), a convoluted semantic of
zone shifting has been developed. Only the currently last memblock or
the one adjacent to the zone_movable can be onlined movable. This
essentially means that the online type changes as new memblocks are
added.

Let's simulate memory hot online manually:

$ echo 0x100000000 > /sys/devices/system/memory/probe
$ grep . /sys/devices/system/memory/memory32/valid_zones
Normal Movable

$ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
$ grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

$ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
$ grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

$ echo online_movable > /sys/devices/system/memory/memory34/state
$ grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable Normal

This is an awkward semantic because a udev event is sent as soon as the
block is onlined and a udev handler might want to online it based on
some policy (e.g. association with a node) but it will inherently race
with new blocks showing up.

This patch changes the physical online phase to not associate pages
with any zone at all. All the pages are just marked reserved and wait
for the onlining phase to be associated with the zone as per the online
request. There are only two requirements:

	- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap
	- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses

The latter is not an inherent requirement and can be changed in the
future. It preserves the current behavior and makes the code slightly
simpler; this is subject to change in future.

This means that the same physical online steps as above will lead to
the following state:

Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable

Implementation: the current move_pfn_range is reimplemented to check
the above requirements (allow_online_pfn_range) and then update the
respective zone (move_pfn_range_to_zone) and the pgdat, and link all
the pages in the pfn range with the zone/node. __add_pages is updated
to not require the zone and only initializes sections in the range.
This allowed simplifying the arch_add_memory code (s390 could get rid
of quite some code).

devm_memremap_pages is the only user of arch_add_memory which relies
on the zone association because it only hooks into the memory hotplug
half way. It uses it to associate the new memory with ZONE_DEVICE but
doesn't allow it to be {on,off}lined via sysfs. This means that this
particular code path has to call move_pfn_range_to_zone explicitly.

The original zone shifting code is kept in place and will be removed in
the follow up patch for an easier review.

Please note that this patch also changes the original behavior where
offlining a memory block adjacent to another zone (Normal vs. Movable)
used to allow changing its movable type. This will be handled later.

[[email protected]: simplify zone_intersects()]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: remove duplicate call for set_page_links]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: remove unused local `i']
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Wei Yang <[email protected]>
Tested-by: Dan Williams <[email protected]>
Tested-by: Reza Arbab <[email protected]>
Acked-by: Heiko Carstens <[email protected]>	# For s390 bits
Acked-by: Vlastimil Babka <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daniel Kiper <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Igor Mammedov <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tobias Regnery <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Xishi Qiu <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent d336e94 commit f1dd2cd

File tree

12 files changed: +185 −175 lines changed

arch/ia64/mm/init.c

Lines changed: 1 addition & 8 deletions

@@ -648,18 +648,11 @@ mem_init (void)
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 {
-	pg_data_t *pgdat;
-	struct zone *zone;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	pgdat = NODE_DATA(nid);
-
-	zone = pgdat->node_zones +
-		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
-	ret = __add_pages(nid, zone, start_pfn, nr_pages, !for_device);
-
+	ret = __add_pages(nid, start_pfn, nr_pages, !for_device);
 	if (ret)
 		printk("%s: Problem encountered in __add_pages() as ret=%d\n",
 		       __func__, ret);

arch/powerpc/mm/mem.c

Lines changed: 1 addition & 9 deletions

@@ -128,16 +128,12 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
 
 int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 {
-	struct pglist_data *pgdata;
-	struct zone *zone;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int rc;
 
 	resize_hpt_for_hotplug(memblock_phys_mem_size());
 
-	pgdata = NODE_DATA(nid);
-
 	start = (unsigned long)__va(start);
 	rc = create_section_mapping(start, start + size);
 	if (rc) {
@@ -147,11 +143,7 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 		return -EFAULT;
 	}
 
-	/* this should work for most non-highmem platforms */
-	zone = pgdata->node_zones +
-			zone_for_memory(nid, start, size, 0, for_device);
-
-	return __add_pages(nid, zone, start_pfn, nr_pages, !for_device);
+	return __add_pages(nid, start_pfn, nr_pages, !for_device);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE

arch/s390/mm/init.c

Lines changed: 2 additions & 28 deletions

@@ -168,41 +168,15 @@ unsigned long memory_block_size_bytes(void)
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 {
-	unsigned long zone_start_pfn, zone_end_pfn, nr_pages;
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long size_pages = PFN_DOWN(size);
-	pg_data_t *pgdat = NODE_DATA(nid);
-	struct zone *zone;
-	int rc, i;
+	int rc;
 
 	rc = vmem_add_mapping(start, size);
 	if (rc)
 		return rc;
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zone = pgdat->node_zones + i;
-		if (zone_idx(zone) != ZONE_MOVABLE) {
-			/* Add range within existing zone limits, if possible */
-			zone_start_pfn = zone->zone_start_pfn;
-			zone_end_pfn = zone->zone_start_pfn +
-				       zone->spanned_pages;
-		} else {
-			/* Add remaining range to ZONE_MOVABLE */
-			zone_start_pfn = start_pfn;
-			zone_end_pfn = start_pfn + size_pages;
-		}
-		if (start_pfn < zone_start_pfn || start_pfn >= zone_end_pfn)
-			continue;
-		nr_pages = (start_pfn + size_pages > zone_end_pfn) ?
-			   zone_end_pfn - start_pfn : size_pages;
-		rc = __add_pages(nid, zone, start_pfn, nr_pages, !for_device);
-		if (rc)
-			break;
-		start_pfn += nr_pages;
-		size_pages -= nr_pages;
-		if (!size_pages)
-			break;
-	}
+	rc = __add_pages(nid, start_pfn, size_pages, !for_device);
 	if (rc)
 		vmem_remove_mapping(start, size);
 	return rc;

arch/sh/mm/init.c

Lines changed: 1 addition & 7 deletions

@@ -487,18 +487,12 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 {
-	pg_data_t *pgdat;
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	pgdat = NODE_DATA(nid);
-
 	/* We only have ZONE_NORMAL, so this is easy.. */
-	ret = __add_pages(nid, pgdat->node_zones +
-			zone_for_memory(nid, start, size, ZONE_NORMAL,
-			for_device),
-			start_pfn, nr_pages, !for_device);
+	ret = __add_pages(nid, start_pfn, nr_pages, !for_device);
 	if (unlikely(ret))
 		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
 

arch/x86/mm/init_32.c

Lines changed: 1 addition & 4 deletions

@@ -825,13 +825,10 @@ void __init mem_init(void)
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 {
-	struct pglist_data *pgdata = NODE_DATA(nid);
-	struct zone *zone = pgdata->node_zones +
-		zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	return __add_pages(nid, zone, start_pfn, nr_pages, !for_device);
+	return __add_pages(nid, start_pfn, nr_pages, !for_device);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE

arch/x86/mm/init_64.c

Lines changed: 1 addition & 8 deletions

@@ -772,22 +772,15 @@ static void update_end_of_memory_vars(u64 start, u64 size)
 	}
 }
 
-/*
- * Memory is added always to NORMAL zone. This means you will never get
- * additional DMA/DMA32 memory.
- */
 int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 {
-	struct pglist_data *pgdat = NODE_DATA(nid);
-	struct zone *zone = pgdat->node_zones +
-		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
 	init_memory_mapping(start, start + size);
 
-	ret = __add_pages(nid, zone, start_pfn, nr_pages, !for_device);
+	ret = __add_pages(nid, start_pfn, nr_pages, !for_device);
 	WARN_ON_ONCE(ret);
 
 	/* update max_pfn, max_low_pfn and high_memory */

drivers/base/memory.c

Lines changed: 28 additions & 24 deletions

@@ -392,39 +392,43 @@ static ssize_t show_valid_zones(struct device *dev,
 				  struct device_attribute *attr, char *buf)
 {
 	struct memory_block *mem = to_memory_block(dev);
-	unsigned long start_pfn, end_pfn;
-	unsigned long valid_start, valid_end, valid_pages;
+	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
-	struct zone *zone;
-	int zone_shift = 0;
+	unsigned long valid_start_pfn, valid_end_pfn;
+	bool append = false;
+	int nid;
 
-	start_pfn = section_nr_to_pfn(mem->start_section_nr);
-	end_pfn = start_pfn + nr_pages;
-
-	/* The block contains more than one zone can not be offlined. */
-	if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start, &valid_end))
+	/*
+	 * The block contains more than one zone can not be offlined.
+	 * This can happen e.g. for ZONE_DMA and ZONE_DMA32
+	 */
+	if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages, &valid_start_pfn, &valid_end_pfn))
 		return sprintf(buf, "none\n");
 
-	zone = page_zone(pfn_to_page(valid_start));
-	valid_pages = valid_end - valid_start;
-
-	/* MMOP_ONLINE_KEEP */
-	sprintf(buf, "%s", zone->name);
+	start_pfn = valid_start_pfn;
+	nr_pages = valid_end_pfn - start_pfn;
 
-	/* MMOP_ONLINE_KERNEL */
-	zone_can_shift(valid_start, valid_pages, ZONE_NORMAL, &zone_shift);
-	if (zone_shift) {
-		strcat(buf, " ");
-		strcat(buf, (zone + zone_shift)->name);
+	/*
+	 * Check the existing zone. Make sure that we do that only on the
+	 * online nodes otherwise the page_zone is not reliable
+	 */
+	if (mem->state == MEM_ONLINE) {
+		strcat(buf, page_zone(pfn_to_page(start_pfn))->name);
+		goto out;
 	}
 
-	/* MMOP_ONLINE_MOVABLE */
-	zone_can_shift(valid_start, valid_pages, ZONE_MOVABLE, &zone_shift);
-	if (zone_shift) {
-		strcat(buf, " ");
-		strcat(buf, (zone + zone_shift)->name);
+	nid = pfn_to_nid(start_pfn);
+	if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_KERNEL)) {
+		strcat(buf, NODE_DATA(nid)->node_zones[ZONE_NORMAL].name);
+		append = true;
 	}
 
+	if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_MOVABLE)) {
+		if (append)
+			strcat(buf, " ");
+		strcat(buf, NODE_DATA(nid)->node_zones[ZONE_MOVABLE].name);
+	}
+out:
 	strcat(buf, "\n");
 
 	return strlen(buf);

include/linux/memory_hotplug.h

Lines changed: 7 additions & 6 deletions

@@ -123,8 +123,8 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
-/* reasonably generic interface to expand the physical pages in a zone */
-extern int __add_pages(int nid, struct zone *zone, unsigned long start_pfn,
+/* reasonably generic interface to expand the physical pages */
+extern int __add_pages(int nid, unsigned long start_pfn,
 	unsigned long nr_pages, bool want_memblock);
 
 #ifdef CONFIG_NUMA
@@ -299,15 +299,16 @@ extern int add_memory_resource(int nid, struct resource *resource, bool online);
 extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
 		bool for_device);
 extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
+extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
+		unsigned long nr_pages);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
-extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn);
+extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
 extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
 		unsigned long map_offset);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 		unsigned long pnum);
-extern bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
-		enum zone_type target, int *zone_shift);
-
+extern bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_pages,
+		int online_type);
 #endif /* __LINUX_MEMORY_HOTPLUG_H */

include/linux/mmzone.h

Lines changed: 16 additions & 0 deletions

@@ -532,6 +532,22 @@ static inline bool zone_is_empty(struct zone *zone)
 	return zone->spanned_pages == 0;
 }
 
+/*
+ * Return true if [start_pfn, start_pfn + nr_pages) range has a non-empty
+ * intersection with the given zone
+ */
+static inline bool zone_intersects(struct zone *zone,
+		unsigned long start_pfn, unsigned long nr_pages)
+{
+	if (zone_is_empty(zone))
+		return false;
+	if (start_pfn >= zone_end_pfn(zone) ||
+	    start_pfn + nr_pages <= zone->zone_start_pfn)
+		return false;
+
+	return true;
+}
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the

kernel/memremap.c

Lines changed: 4 additions & 0 deletions

@@ -359,6 +359,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 
 	mem_hotplug_begin();
 	error = arch_add_memory(nid, align_start, align_size, true);
+	if (!error)
+		move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
+					align_start >> PAGE_SHIFT,
+					align_size >> PAGE_SHIFT);
 	mem_hotplug_done();
 	if (error)
 		goto err_add_memory;
