Add allocation profile export and zhack subcommand for import #17576
Conversation
That was fast. I haven't had a chance to test it locally but I like where this is going.
Looks good. Can you add a basic test case to tests/functional/cli_root/zhack/?
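A basic test along those lines might look like the following sketch, loosely following ZTS conventions. The file name, the zdb/zhack option names, and the stdin usage are assumptions for illustration, not the PR's actual test.

```ksh
#!/bin/ksh -p
# Hypothetical sketch of a basic zhack test, loosely following ZTS conventions.
# The zdb/zhack option names below are assumptions, not the PR's verbatim CLI.

. $STF_SUITE/include/libtest.shlib

verify_runnable "global"

typeset mapfile=$TEST_BASE_DIR/allocmap.$$

function cleanup
{
	rm -f $mapfile
}

log_onexit cleanup
log_assert "zhack metaslab leak can replay an exported allocation profile"

# Export the allocation map from the test pool (option name assumed).
log_must eval "zdb --allocated-map $TESTPOOL > $mapfile"

# zhack imports the pool itself, so export it first.
log_must zpool export $TESTPOOL

# Replay the profile; the layout matches, so no -f is needed
# (stdin input is an assumption).
log_must eval "zhack metaslab leak $TESTPOOL < $mapfile"

log_must zpool import $TESTPOOL
log_pass "zhack metaslab leak replayed the allocation profile"
```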
When attempting to debug performance problems on large systems, one of the major factors that affect performance is free space fragmentation. This heavily affects the allocation process, which is an area of active development in ZFS. Unfortunately, fragmenting a large pool for testing purposes is time consuming; it usually involves filling the pool and then repeatedly overwriting data until the free space becomes fragmented, which can take many hours. And even if the time is available, artificial workloads rarely generate the same fragmentation patterns as the natural workloads they're attempting to mimic.

This patch has two parts. First, in zdb, we add the ability to export the full allocation map of the pool. It iterates over each vdev, printing every allocated segment in the ms_allocatable range tree. This can be done while the pool is online, though in that case the allocation map may actually be from several different TXGs, as new ones are loaded on demand.

The second is a new subcommand for zhack, zhack metaslab leak (and its supporting kernel changes). This zhack subcommand imports a pool and then modifies the range trees of the metaslabs, allowing the sync process to write them out normally. It does not currently store those allocations anywhere to make them reversible, and there is no corresponding free subcommand (which would be extremely dangerous); this is an irreversible process, intended only for performance testing. The only way to reclaim the space afterwards is to destroy the pool or roll back to a checkpoint.

Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #17576
Sponsored by: Klara, Inc.; Wasabi Technology, Inc.
Motivation and Context
When attempting to debug performance problems on large systems, one of the major factors that affect performance is free space fragmentation. This heavily affects the allocation process, which is an area of active development in ZFS. Unfortunately, fragmenting a large pool for testing purposes is time consuming; it usually involves filling the pool and then repeatedly overwriting data until the free space becomes fragmented, which can take many hours. And even if the time is available, artificial workloads rarely generate the same fragmentation patterns as the natural workloads they're attempting to mimic. Finally, it may also be difficult to source storage that is large enough to match what's being used at customer/production sites for budgetary or procurement reasons.
Description
The core idea of the solution is this: If we know what regions are allocated on the production system we're trying to mimic, we don't actually need to do the process that got us there. We can skip straight to the final state by doing raw allocations of the allocated regions on that system, with no data underlying them or block pointers pointing to them.
This patch has two parts. First, in zdb, we add the ability to export the full allocation map of the pool. It iterates over each vdev, printing every allocated segment in the ms_allocatable range tree. This can be done while the pool is online, though if the process takes long enough, the older TXGs it is reading from can start to be overwritten. A checkpoint is a good way to preserve the system state at a single point in time for analysis while the system is serving IO.
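For concreteness, an export might look something like the sketch below. The option name and the exact output format are assumptions based on the description above, not the PR's verbatim interface.

```sh
# Hypothetical invocation; the --allocated-map option name is an assumption.
# Conceptually, the export is one line per allocated segment: the vdev id,
# the segment's offset, and its length (format assumed for illustration).
zdb --allocated-map tank > allocmap.txt
# head allocmap.txt might then show something like:
#   0 0x400000 0x20000
#   0 0x428000 0x8000
#   1 0x1000000 0x100000
```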
The second is a new subcommand for zhack, zhack metaslab leak (and its supporting kernel changes). This is a zhack subcommand that imports a pool and then modifies the range trees of the metaslabs, allowing the sync process to write them out normally. It does not currently store those allocations anywhere to make them reversible, and there is no corresponding free subcommand (which would be extremely dangerous); this is an irreversible process, intended only for performance testing. The only way to reclaim the space afterwards is to destroy the pool or roll back to a checkpoint.
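End to end, the intended workflow is roughly the following sketch. The zdb option name and zhack reading the profile from stdin are assumptions; only the zhack metaslab leak subcommand name comes from the description above.

```sh
# On the production system: capture the allocation profile
# (the --allocated-map option name is an assumption).
zdb --allocated-map prodpool > prod.allocmap

# On the test system: replay the profile into a pool with the same
# vdev/metaslab layout. Because the leak is irreversible, take a
# checkpoint first so the space can be reclaimed by rolling back.
zpool checkpoint testpool
zhack metaslab leak testpool < prod.allocmap  # stdin input is an assumption
```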
We verify that the system receiving the allocation profile has the same layout as the source system, to prevent any issues with violating ZFS's expectations or triggering assertions. This includes the number of vdevs, the number of metaslabs per vdev, and the metaslab size. There is a -f option that skips the check on the number of metaslabs per vdev; in that case, allocations beyond the target's last metaslab are dropped, as illustrated below.
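As an illustration of the -f escape hatch (command syntax assumed, as in the sketches above):

```sh
# Hypothetical: the profile's vdevs have more metaslabs than the target's.
# Without -f the layout check fails; with -f, segments that fall beyond
# the target's last metaslab are dropped.
zhack metaslab leak -f testpool < prod.allocmap
```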
How Has This Been Tested?
Tested with the ZFS test suite to ensure no regressions. The new utility and functionality have been used to perform performance testing multiple times. Space map contents were also manually verified to ensure that the allocation map matches the original system.
In addition, a new test has been added to the test suite to verify the functionality of the utility.