@pcd1193182 pcd1193182 commented Jul 28, 2025

Sponsored by: [Klara, Inc.; Wasabi Technology, Inc.]

Motivation and Context

When attempting to debug performance problems on large systems, one of the major factors that affect performance is free space fragmentation. This heavily affects the allocation process, which is an area of active development in ZFS. Unfortunately, fragmenting a large pool for testing purposes is time consuming; it usually involves filling the pool and then repeatedly overwriting data until the free space becomes fragmented, which can take many hours. And even if the time is available, artificial workloads rarely generate the same fragmentation patterns as the natural workloads they're attempting to mimic. Finally, it may also be difficult to source storage that is large enough to match what's being used at customer/production sites for budgetary or procurement reasons.

Description

The core idea of the solution is this: If we know what regions are allocated on the production system we're trying to mimic, we don't actually need to do the process that got us there. We can skip straight to the final state by doing raw allocations of the allocated regions on that system, with no data underlying them or block pointers pointing to them.

This patch has two parts. First, in zdb, we add the ability to export the full allocation map of the pool. It iterates over each vdev, printing every allocated segment in the ms_allocatable range tree. This can be done while the pool is online, though if the process takes long enough we can run into issues with our older TXG starting to get overwritten. A checkpoint is a good way to preserve the system state at a single point in time for analysis while the system is serving IO.
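
As a rough illustration of the export step (not the actual zdb code or its output format), the sketch below assumes the range tree for a metaslab holds its free extents, so the allocated segments are the gaps between them. The function name `allocated_segments` and the `(offset, length)` tuple representation are hypothetical.

```python
def allocated_segments(ms_start, ms_size, free_ranges):
    """Yield (offset, length) for each allocated segment in one metaslab.

    Assumption: free_ranges is an iterable of non-overlapping
    (offset, length) free extents that lie inside the metaslab.
    """
    cursor = ms_start
    for off, length in sorted(free_ranges):
        if off > cursor:
            # The gap between the previous free extent and this one is allocated.
            yield (cursor, off - cursor)
        cursor = max(cursor, off + length)
    end = ms_start + ms_size
    if cursor < end:
        # Everything after the last free extent is allocated.
        yield (cursor, end - cursor)


if __name__ == "__main__":
    for seg in allocated_segments(0, 100, [(10, 20), (50, 10)]):
        print(seg)
```

Walking the free extents in offset order keeps the sketch linear after the sort, which is the same shape of traversal an ordered range tree would give.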

The second is a new subcommand for zhack, zhack metaslab leak (and its supporting kernel changes). This subcommand imports a pool and then modifies the range trees of the metaslabs, allowing the sync process to write them out normally. It does not currently store those allocations anywhere to make them reversible, and there is no corresponding free subcommand (which would be extremely dangerous); this is an irreversible process, intended only for performance testing. The only way to reclaim the space afterwards is to destroy the pool or roll back to a checkpoint.
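
Conceptually, "leaking" a segment means removing it from the set of free extents without recording any owner for it. The following is a minimal sketch of that idea on a plain list of `(offset, length)` tuples; `leak_range` is a hypothetical helper and does not reflect the actual zhack or kernel implementation.

```python
def leak_range(free_ranges, off, length):
    """Return free_ranges with [off, off + length) removed.

    free_ranges is a list of non-overlapping (offset, length) free extents.
    The removed space is simply no longer free; nothing records who owns it,
    which is why the operation cannot be undone from this data alone.
    """
    end = off + length
    result = []
    for f_off, f_len in free_ranges:
        f_end = f_off + f_len
        if f_end <= off or f_off >= end:
            # No overlap with the leaked range: keep the extent untouched.
            result.append((f_off, f_len))
            continue
        if f_off < off:
            # Keep the free piece to the left of the leaked range.
            result.append((f_off, off - f_off))
        if f_end > end:
            # Keep the free piece to the right of the leaked range.
            result.append((end, f_end - end))
    return result


if __name__ == "__main__":
    print(leak_range([(0, 100)], 10, 20))
```

Note how a single leak inside a large free extent splits it in two, which is exactly how replaying a fragmented allocation map reproduces the fragmentation.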

We verify that the system receiving the allocation profile has the same layout as the source system, to prevent any issues with violating ZFS's expectations or triggering assertions. This includes number of vdevs, number of metaslabs per vdev, and metaslab size. There is a -f option to allow profiles to skip the check for number of metaslabs per vdev, in which case allocations beyond the last metaslab will be dropped.
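
The layout check described above can be sketched as follows. This is an illustration under assumptions, not the code in the patch: `VdevLayout`, `check_layout`, and `clip_segments` are hypothetical names, and "dropped" is modeled here as discarding any segment that extends past the target's last metaslab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VdevLayout:
    ms_count: int  # number of metaslabs on the vdev
    ms_size: int   # size of each metaslab in bytes


def check_layout(source, target, force=False):
    """Return a list of mismatch descriptions; an empty list means compatible.

    source and target are lists of VdevLayout, one entry per vdev.
    With force=True the metaslab-count check is skipped.
    """
    if len(source) != len(target):
        return [f"vdev count differs: {len(source)} vs {len(target)}"]
    problems = []
    for i, (s, t) in enumerate(zip(source, target)):
        if s.ms_size != t.ms_size:
            problems.append(f"vdev {i}: metaslab size {s.ms_size} vs {t.ms_size}")
        if s.ms_count != t.ms_count and not force:
            problems.append(f"vdev {i}: metaslab count {s.ms_count} vs {t.ms_count}")
    return problems


def clip_segments(segments, target):
    """Keep only segments that fit entirely within the target vdev's metaslabs."""
    limit = target.ms_count * target.ms_size
    return [(off, length) for off, length in segments if off + length <= limit]


if __name__ == "__main__":
    src = [VdevLayout(ms_count=10, ms_size=1 << 30)]
    dst = [VdevLayout(ms_count=8, ms_size=1 << 30)]
    print(check_layout(src, dst))
    print(check_layout(src, dst, force=True))
```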

How Has This Been Tested?

Tested with the ZFS test suite to ensure no regressions. The new utility and functionality have been used for performance testing multiple times. Space map contents were also manually verified to ensure that the allocation mapping matches the original system.

In addition, a new test has been added to the test suite to verify the functionality of the utility.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@pcd1193182 pcd1193182 force-pushed the frag_copy branch 2 times, most recently from 1a44f2e to 0f38484 Compare July 28, 2025 21:01
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jul 29, 2025
@pcd1193182 pcd1193182 force-pushed the frag_copy branch 2 times, most recently from 27be603 to 3c9e82a Compare September 2, 2025 22:39
@behlendorf behlendorf left a comment

That was fast. I haven't had a chance to test it locally but I like where this is going.

@behlendorf behlendorf left a comment

Looks good. Can you add a basic test case to tests/functional/cli_root/zhack/?

@behlendorf behlendorf requested a review from amotin September 9, 2025 18:08
@pcd1193182 pcd1193182 changed the title from "Add allocation profile export and zleak utility for import" to "Add allocation profile export and zhack subcommand for import" Sep 9, 2025
Paul Dagnelie added 2 commits September 9, 2025 13:36
When attempting to debug performance problems on large systems, one of
the major factors that affect performance is free space
fragmentation. This heavily affects the allocation process, which is an
area of active development in ZFS. Unfortunately, fragmenting a large
pool for testing purposes is time consuming; it usually involves filling
the pool and then repeatedly overwriting data until the free space
becomes fragmented, which can take many hours. And even if the time is
available, artificial workloads rarely generate the same fragmentation
patterns as the natural workloads they're attempting to mimic.

This patch has two parts. First, in zdb, we add the ability to export
the full allocation map of the pool. It iterates over each vdev,
printing every allocated segment in the ms_allocatable range tree. This
can be done while the pool is online, though in that case the allocation
map may actually be from several different TXGs as new ones are loaded
on demand.

The second is a new subcommand for zhack, zhack metaslab leak (and its
supporting kernel changes). This is a zhack subcommand that imports a
pool and then modifies the range trees of the metaslabs, allowing the
sync process to write them out normally. It does not currently store
those allocations anywhere to make them reversible, and there is no
corresponding free subcommand (which would be extremely dangerous); this
is an irreversible process, only intended for performance testing. The
only way to reclaim the space afterwards is to destroy the pool or roll
back to a checkpoint.

Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Sep 9, 2025
behlendorf pushed a commit that referenced this pull request Sep 10, 2025
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #17576
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Sep 10, 2025
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#17576
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Sep 10, 2025
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#17576