Conversation

Member

@GregoryComer GregoryComer commented Sep 8, 2025

Summary

Allow build-time configuration of the thread pool size and default to a performance heuristic.

There are 2 modes that we want to support:

  • Heuristic-based. Choose the number of threads according to a performance heuristic. Use threads equal to the number of detected performance cores, but we can continue to iterate on this by adding fine-grained heuristics for specific chipsets in the future.
  • All cores (threads=cores). This is the current behavior. We need to maintain this as an option for some use cases.

With this PR, the default (for OSS) is to use performance cores. From testing with CV models on ~10 representative devices across the performance spectrum, this gives anywhere from parity with the existing perf to up to a 13x speedup (measured on Pixel 6). Many common devices (S20, S22, iPhone 15 Pro) show a 2-4x speedup.

Specifying Threadpool Size

To specify the threadpool size, I've added two preprocessor options (and corresponding CMake options):

  • EXECUTORCH_THREADPOOL_USE_PERFORMANCE_CORES - Use threads = detected perf cores.
  • EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES - Use threads = logical cores.
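
Assuming the CMake options share the spelling of the preprocessor symbols (an assumption — check the project's CMake cache for the exact names), selecting a mode at configure time might look like:

```shell
# Hypothetical configure lines; option names assumed to mirror the
# preprocessor symbols above.
cmake -B cmake-out -DEXECUTORCH_THREADPOOL_USE_PERFORMANCE_CORES=ON .
# or, to keep the historical threads-equals-logical-cores behavior:
cmake -B cmake-out -DEXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES=ON .
```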

Test plan

I've verified that the logic functions correctly in OSS by building executor_runner on an M1 Mac and observing the existing logging in cpuinfo_utils.

Measuring MobileNet V3 (exported from examples) on XNNPACK, time to run 100 iterations drops from ~450ms to ~230ms on M1 Pro with this change.


pytorch-bot bot commented Sep 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14090

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 43 Pending

As of commit aa5b494 with merge base f2eb38e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 8, 2025
@facebook-github-bot
Contributor

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D81965471.


```cpp
 * the threadpool to use 4 threads.
 */

#ifndef EXECUTORCH_THREADPOOL_SIZE
```
Contributor

why is this a compile time option? How do you expect users to use it?

Member Author

I've piped it through CMake options now. In theory, I'd prefer to have a runtime API, but I don't really see a good way to do this given that it needs to be set prior to threadpool creation, which is implicit. I'm open to ideas. My expectation is that 99.9% of users will use the default perf core option, so needing to override at build time is an acceptable compromise.
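
To make the constraint concrete, here is a minimal sketch (hypothetical names, not the ExecuTorch API) of what a runtime override would have to look like when the pool is created lazily and implicitly — the override is only honored before first use:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

// Hypothetical sketch only -- these names are NOT the ExecuTorch API.
// It illustrates the constraint described above: a runtime override is
// only honored if it happens before the lazily created pool exists.
namespace sketch {

inline std::atomic<std::size_t> requested_threads{0}; // 0 = use heuristic
inline std::atomic<bool> pool_created{false};

// Returns false once the pool has already been (implicitly) created.
inline bool set_num_threads(std::size_t n) {
  if (pool_created.load()) {
    return false;
  }
  requested_threads.store(n);
  return true;
}

// Called once, at first use of the pool, to pick its size.
inline std::size_t resolve_pool_size() {
  pool_created.store(true);
  std::size_t n = requested_threads.load();
  if (n == 0) {
    // Stand-in for the perf-core heuristic.
    n = std::thread::hardware_concurrency();
  }
  return n > 0 ? n : 1;
}

} // namespace sketch
```

The awkwardness is visible even in the sketch: nothing stops the pool from being created (and the heuristic locked in) before user code gets a chance to call the setter, which is why a build-time default is the pragmatic compromise here.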

Contributor

Yeah, but a compile-time option is just not useful. You compile once and deploy on many platforms; you cannot possibly know what the right size is.

@kimishpatel kimishpatel left a comment

@swolchok
Contributor

swolchok commented Sep 9, 2025

I like the 3 mutually exclusive options better. I don't see why we are vague about the "size heuristic" given that we are intentionally making the user select a sizing algorithm; I would just s/EXECUTORCH_THREADPOOL_USE_SIZE_HEURISTIC/EXECUTORCH_THREADPOOL_USE_PERFORMANCE_CORES/. Also, it is unclear to me whether EXECUTORCH_THREADPOOL_USE_ALL_CORES will count hyperthreading as creating extra cores; might want to call it EXECUTORCH_THREADPOOL_USE_ALL_PHYSICAL_CORES or EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES depending on which is correct. (FWIW, not counting hyperthreading would IMO be the correct choice for memory-bandwidth-bound applications.)

Agree with Kimish that it is unclear how users are supposed to correctly choose a fixed size for the thread pool at compile time.

@GregoryComer
Member Author

> I like the 3 mutually exclusive options better. I don't see why we are vague about the "size heuristic" given that we are intentionally making the user select a sizing algorithm; I would just s/EXECUTORCH_THREADPOOL_USE_SIZE_HEURISTIC/EXECUTORCH_THREADPOOL_USE_PERFORMANCE_CORES/. Also, it is unclear to me whether EXECUTORCH_THREADPOOL_USE_ALL_CORES will count hyperthreading as creating extra cores; might want to call it EXECUTORCH_THREADPOOL_USE_ALL_PHYSICAL_CORES or EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES depending on which is correct. (FWIW, not counting hyperthreading would IMO be the correct choice for memory-bandwidth-bound applications.)
>
> Agree with Kimish that it is unclear how users are supposed to correctly choose a fixed size for the thread pool at compile time.

I've updated to use the discrete options. Regarding physical vs. logical cores, from a look at cpuinfo, it initially looks inconsistent: Windows counts logical cores, but reading the code, it looks like it's counting physical cores on Linux. I'm going to double-check the behavior on a few systems. The x86 systems I have easy access to at this exact moment don't report multiple threads per core.

The main intent of this option is to be backwards compatible with the existing behavior. I'd imagine that very few OSS users would want this behavior, as it's slow and there's no OSS way to use fewer threads.

Regarding the user-facing interface, I've added CMake options. I responded to Kimish's comment above - I'd prefer a runtime API in theory but it's awkward to do with the current design. I'm open to ideas.


@GregoryComer GregoryComer changed the title Add EXECUTORCH_THREADPOOL_SIZE option and heuristic Add EXECUTORCH_THREADPOOL_SIZE options, default to using only performance cores Sep 10, 2025
@kimishpatel
Contributor

> whether EXECUTORCH_THREADPOOL_USE_ALL_CORES will count hyperthreading as creating extra cores

Normally it would, as in, hyperthreads would be reported as separate cores. You will have to explicitly ask, I think, for the actual core count.
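
As a standalone illustration of the distinction (standard library only, not cpuinfo; the helper name here is ours): the count most APIs hand back is the logical processor count, where each SMT hyperthread is reported separately, and a physical-core count has to be requested explicitly through a platform-specific query.

```cpp
#include <thread>

// Illustrative helper (our name, not an ExecuTorch or cpuinfo API).
// std::thread::hardware_concurrency() reports LOGICAL processors: with
// SMT/hyperthreading enabled, each hardware thread counts separately,
// matching the behavior described above for cpuinfo's processor count.
unsigned logical_processor_count() {
  unsigned n = std::thread::hardware_concurrency();
  return n > 0 ? n : 1; // the standard allows the hint to return 0
}
```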

@kimishpatel kimishpatel left a comment

I am not convinced that using EXECUTORCH_THREADPOOL_SIZE as a build-time option is a good idea. For OSS users trying to use this option, it feels like you are better off using the unsafe option.

The other possibility is that the set_num_thread option will create a new threadpool and use that as the current threadpool without destroying the existing one.

In any case, I think the other two compile-time options look good to me, and I am not sure we have to add EXECUTORCH_THREADPOOL_SIZE right now. The immediate need should be met by the other two, no?

@GregoryComer
Member Author

> I am not convinced that using EXECUTORCH_THREADPOOL_SIZE as a build-time option is a good idea. For OSS users trying to use this option, it feels like you are better off using the unsafe option.
>
> The other possibility is that the set_num_thread option will create a new threadpool and use that as the current threadpool without destroying the existing one.
>
> In any case, I think the other two compile-time options look good to me, and I am not sure we have to add EXECUTORCH_THREADPOOL_SIZE right now. The immediate need should be met by the other two, no?

Sure. I don't necessarily see a strong use case for EXECUTORCH_THREADPOOL_SIZE, so I'm happy to drop it. The perf core option as default is what I really care about. I'll update the PR.

Comment on lines +124 to +134

```cpp
#if defined(EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES)
// Use threads=cores.
static int num_threads = cpuinfo_get_processors_count();
#else
// Set threads equal to the number of performance cores.
static int num_threads =
    ::executorch::extension::cpuinfo::get_num_performant_cores();
#endif
```
Contributor

The default behavior then seems to be the performance-core count? I thought you would want this the other way around: by default you get logical cores, and in the OSS CMake we can make performance cores the default build option.

The issue is that for internal uses you now only have performance cores.

Member Author

In extension/threadpool/targets.bzl, I changed it to define EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES when not in OSS, so that should cover this case. If there's a better way to ensure this, I'm definitely open to it. I could add an API to retrieve the threadpool size and add an internal test to verify the behavior, if you'd like.

@kimishpatel kimishpatel left a comment

Left one comment around preserving the default behavior for internal uses

@GregoryComer
Member Author

@kimishpatel I believe I've addressed all the comments. Do you have any additional concerns?

@kimishpatel kimishpatel left a comment

Synced offline: make the default behavior choose all cores, and add a CMake flag to choose perf cores for OSS.

@GregoryComer GregoryComer force-pushed the thread-pool-size branch 3 times, most recently from ceb2242 to 110746c Compare September 11, 2025 22:16

GregoryComer added a commit to GregoryComer/executorch that referenced this pull request Sep 12, 2025
…ance cores (pytorch#14090)

Reviewed By: kimishpatel

Differential Revision: D81965471

Pulled By: GregoryComer
@facebook-github-bot
Contributor

@GregoryComer has exported this pull request. If you are a Meta employee, you can view the originating diff in D81965471.


@GregoryComer
Member Author

GregoryComer commented Sep 12, 2025

Fighting some out-of-sync issues with the imported diff, maybe due to diff-train issues. Finally got it resolved. I haven't really touched this PR's contents since earlier today, and CI was green. Going to merge, as this is a critical feature.

@GregoryComer GregoryComer merged commit 72d50b2 into pytorch:main Sep 12, 2025
120 of 124 checks passed
GregoryComer added a commit to GregoryComer/executorch that referenced this pull request Sep 15, 2025
GregoryComer added a commit that referenced this pull request Sep 15, 2025
…sing only performance cores (#14090)"

This reverts commit 72d50b2.

### Summary
Seeing crashes in the macOS unittest job.
StrycekSimon pushed a commit to nxp-upstream/executorch that referenced this pull request Sep 23, 2025
…ance cores (pytorch#14090)
StrycekSimon pushed a commit to nxp-upstream/executorch that referenced this pull request Sep 23, 2025
…h#14307)

…sing only performance cores (pytorch#14090)"

This reverts commit 72d50b2.

### Summary
Seeing crashes in the macOS unittest job.
Labels
CLA Signed · fb-exported · meta-exported · release notes: misc