Add allocation profile export and zhack subcommand for import #17576

pcd1193182 · 2025-07-28T20:27:33Z

Sponsored by: [Klara, Inc.; Wasabi Technology, Inc.]

Motivation and Context

When attempting to debug performance problems on large systems, one of the major factors affecting performance is free space fragmentation. This heavily affects the allocation process, which is an area of active development in ZFS. Unfortunately, fragmenting a large pool for testing purposes is time-consuming; it usually involves filling the pool and then repeatedly overwriting data until the free space becomes fragmented, which can take many hours. And even if the time is available, artificial workloads rarely generate the same fragmentation patterns as the natural workloads they're attempting to mimic. Finally, for budgetary or procurement reasons, it may also be difficult to source storage large enough to match what is used at customer/production sites.

Description

The core idea of the solution is this: if we know which regions are allocated on the production system we're trying to mimic, we don't actually need to repeat the process that got it there. We can skip straight to the final state by doing raw allocations of the regions allocated on that system, with no data underlying them and no block pointers pointing to them.

This patch has two parts. First, in zdb, we add the ability to export the full allocation map of the pool. zdb iterates over each vdev, printing every allocated segment in the ms_allocatable range tree. This can be done while the pool is online, though if the process takes long enough, the older TXG being read can start to get overwritten. A checkpoint is a good way to preserve the system state at a single point in time for analysis while the system is serving I/O.
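To picture what an exported allocation map contains, here is a minimal sketch in Python. It assumes a metaslab's free space is available as (offset, size) ranges and derives the allocated segments as the gaps between them. The representation and the name `allocated_segments` are illustrative assumptions, not the zdb implementation or its output format.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start offset, size), both in bytes


def allocated_segments(ms_start: int, ms_size: int,
                       free: List[Segment]) -> List[Segment]:
    """Return the allocated segments of one metaslab.

    Hypothetical helper: 'free' lists the free ranges inside the
    metaslab [ms_start, ms_start + ms_size). Everything not covered
    by a free range is reported as allocated.
    """
    allocated = []
    cursor = ms_start
    for start, size in sorted(free):
        if start > cursor:
            # Gap between the previous free range and this one.
            allocated.append((cursor, start - cursor))
        cursor = max(cursor, start + size)
    end = ms_start + ms_size
    if cursor < end:
        # Allocated tail after the last free range.
        allocated.append((cursor, end - cursor))
    return allocated
```

Running this per metaslab, per vdev, would yield a list of segments that fully describes where space is in use without referencing any of the data stored there.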

The second part is a new zhack subcommand, zhack metaslab leak (and its supporting kernel changes). It imports a pool and then modifies the range trees of the metaslabs, allowing the sync process to write them out normally. It does not currently store those allocations anywhere to make them reversible, and there is no corresponding free subcommand (which would be extremely dangerous); this is an irreversible process, intended only for performance testing. The only way to reclaim the space afterwards is to destroy the pool or roll back to a checkpoint.
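Conceptually, leaking a segment means removing it from the set of free ranges without recording an owner for it. The sketch below illustrates that idea on a plain list of (offset, size) ranges. The function name `leak_range` and the list representation are assumptions made for illustration; the real change operates on metaslab range trees in the kernel code.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start offset, size), both in bytes


def leak_range(free: List[Segment], start: int, size: int) -> List[Segment]:
    """Return the free list with [start, start + size) carved out.

    Hypothetical helper: the removed space has no data and no block
    pointer behind it, so nothing can ever free it again.
    """
    end = start + size
    remaining = []
    for f_start, f_size in free:
        f_end = f_start + f_size
        if f_end <= start or f_start >= end:
            # No overlap with the leaked range: keep unchanged.
            remaining.append((f_start, f_size))
            continue
        if f_start < start:
            # Keep the free space to the left of the leaked range.
            remaining.append((f_start, start - f_start))
        if f_end > end:
            # Keep the free space to the right of the leaked range.
            remaining.append((end, f_end - end))
    return remaining
```

Because nothing tracks which ranges were leaked, there is no way to hand the space back later, which matches the irreversibility described above.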

We verify that the system receiving the allocation profile has the same layout as the source system, to prevent any issues with violating ZFS's expectations or triggering assertions. This includes number of vdevs, number of metaslabs per vdev, and metaslab size. There is a -f option to allow profiles to skip the check for number of metaslabs per vdev, in which case allocations beyond the last metaslab will be dropped.
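The layout check can be sketched as follows. This is a simplified model under stated assumptions: `VdevLayout`, `check_layout`, and `drop_out_of_range` are hypothetical names, and the `force` argument stands in for the -f option. It is not the code from the patch.

```python
from dataclasses import dataclass
from typing import List, Tuple

Segment = Tuple[int, int]  # (start offset, size), both in bytes


@dataclass(frozen=True)
class VdevLayout:
    ms_count: int  # number of metaslabs on the vdev
    ms_size: int   # metaslab size in bytes


def check_layout(source: List[VdevLayout], target: List[VdevLayout],
                 force: bool = False) -> None:
    """Raise ValueError if the target cannot accept the source profile."""
    if len(source) != len(target):
        raise ValueError("number of vdevs differs")
    for vdev_id, (src, dst) in enumerate(zip(source, target)):
        if src.ms_size != dst.ms_size:
            raise ValueError(f"vdev {vdev_id}: metaslab size differs")
        # Only the metaslab-count check may be skipped when forcing.
        if src.ms_count != dst.ms_count and not force:
            raise ValueError(f"vdev {vdev_id}: metaslab count differs")


def drop_out_of_range(segments: List[Segment],
                      layout: VdevLayout) -> List[Segment]:
    """Keep only segments that end within the target's last metaslab."""
    limit = layout.ms_count * layout.ms_size
    return [(start, size) for start, size in segments
            if start + size <= limit]
```

With the check forced, a profile from a vdev with more metaslabs can still be applied to a smaller one; the segments past the end are simply not replayed.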

How Has This Been Tested?

Tested with the ZFS test suite to ensure no regressions. The new utility and functionality have been used to perform performance testing multiple times. Space map contents were also manually verified to ensure that the allocation mapping matches the original system.

In addition, a new test has been added to the test suite to verify the functionality of the utility.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

cmd/zdb/zdb_il.c

cmd/zdb/zdb.c

cmd/zleak

behlendorf

That was fast. I haven't had a chance to test it locally but I like where this is going.

contrib/pyzfs/libzfs_core/bindings/libzfs_core.py

contrib/pyzfs/libzfs_core/exceptions.py

include/sys/fs/zfs.h

lib/libuutil/libuutil.abi

man/man8/zdb.8

module/zfs/metaslab.c

cmd/zdb/zdb.c

behlendorf

Looks good. Can you simply add a basic test case to tests/functional/cli_root/zhack/.

When attempting to debug performance problems on large systems, one of the major factors that affect performance is free space fragmentation. This heavily affects the allocation process, which is an area of active development in ZFS. Unfortunately, fragmenting a large pool for testing purposes is time consuming; it usually involves filling the pool and then repeatedly overwriting data until the free space becomes fragmented, which can take many hours. And even if the time is available, artificial workloads rarely generate the same fragmentation patterns as the natural workloads they're attempting to mimic.

This patch has two parts. First, in zdb, we add the ability to export the full allocation map of the pool. It iterates over each vdev, printing every allocated segment in the ms_allocatable range tree. This can be done while the pool is online, though in that case the allocation map may actually be from several different TXGs as new ones are loaded on demand.

The second is a new subcommand for zhack, zhack metaslab leak (and its supporting kernel changes). This is a zhack subcommand that imports a pool and then modifies the range trees of the metaslabs, allowing the sync process to write them out normally. It does not currently store those allocations anywhere to make them reversible, and there is no corresponding free subcommand (which would be extremely dangerous); this is an irreversible process, only intended for performance testing. The only way to reclaim the space afterwards is to destroy the pool or roll back to a checkpoint.

Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Paul Dagnelie <[email protected]>

Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #17576

pcd1193182 force-pushed the frag_copy branch 2 times, most recently from 1a44f2e to 0f38484, on July 28, 2025 21:01

behlendorf added the "Status: Code Review Needed" label on Jul 29, 2025

pcd1193182 force-pushed the frag_copy branch from 0f38484 to ab50973 on July 30, 2025 21:40

pcd1193182 force-pushed the frag_copy branch 2 times, most recently from 27be603 to 3c9e82a, on September 2, 2025 22:39

behlendorf reviewed Sep 3, 2025

cmd/zdb/zdb_il.c (outdated, resolved)

cmd/zdb/zdb.c (outdated, resolved)

cmd/zleak (outdated, resolved)

pcd1193182 force-pushed the frag_copy branch from 7f97e48 to 4bea873 on September 4, 2025 22:08

behlendorf reviewed Sep 4, 2025

pcd1193182 force-pushed the frag_copy branch from 4bea873 to adecfd4 on September 5, 2025 17:46

behlendorf approved these changes Sep 5, 2025

pcd1193182 force-pushed the frag_copy branch from d68a9ab to 833b2bb on September 8, 2025 16:12

behlendorf requested a review from amotin September 9, 2025 18:08

pcd1193182 changed the title ~~Add allocation profile export and zleak utility for import~~ Add allocation profile export and zhack subcommand for import Sep 9, 2025

Paul Dagnelie added 2 commits September 9, 2025 13:36

Enable zhack to work properly with 4k sector size disks

541f69f

Signed-off-by: Paul Dagnelie <[email protected]>

pcd1193182 force-pushed the frag_copy branch from 6379d4e to 541f69f on September 9, 2025 20:36

behlendorf added the "Status: Accepted" label and removed the "Status: Code Review Needed" label on Sep 9, 2025

behlendorf approved these changes Sep 10, 2025

behlendorf closed this in 8f15d2e Sep 10, 2025

behlendorf pushed a commit that referenced this pull request Sep 10, 2025

Enable zhack to work properly with 4k sector size disks

bc4aac0

Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes #17576

behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Sep 10, 2025

Enable zhack to work properly with 4k sector size disks

e2e7082

Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#17576

pcd1193182 mentioned this pull request Sep 11, 2025

Make new zhack test a little more reliable #17728

Merged

14 tasks
