Add EXECUTORCH_THREADPOOL_SIZE options, default to using only performance cores #14090

GregoryComer · 2025-09-08T21:37:09Z

Summary

Allow build-time configuration of the thread pool size and default to a performance heuristic.

There are 2 modes that we want to support:

Heuristic-based. Choose the number of threads according to a performance heuristic. Use threads equal to the number of detected performance cores, but we can continue to iterate on this by adding fine-grained heuristics for specific chipsets in the future.
All cores (threads=cores). This is the current behavior. We need to maintain this as an option for some use cases.

With this PR, the default (for OSS) is to use performance cores. From testing with CV models on ~10 representative devices across the performance spectrum, this gives anywhere from parity with the existing perf to up to a 13x speedup (measured on Pixel 6). Many common devices (S20, S22, iPhone 15 Pro) show a 2-4x speedup.

Specifying Threadpool Size

To specify the threadpool size, I've added two preprocessor options (and corresponding CMake options):

EXECUTORCH_THREADPOOL_USE_PERFORMANCE_CORES- Use threads = detected perf cores.
EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES - Use threads = logical cores.

Test plan

I've verified that logic functions correctly in OSS by building the executor_runner on M1 Mac and observing the existing logging in cpuinfo_utils.

Measuring MobileNet V3 (exported from examples) on XNNPACK, time to run 100 iterations drops from ~450ms to ~230ms on M1 Pro with this change.

pytorch-bot · 2025-09-08T21:37:13Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14090

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 43 Pending

As of commit aa5b494 with merge base f2eb38e ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-09-08T21:39:48Z

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D81965471.

facebook-github-bot · 2025-09-08T21:54:12Z

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D81965471.

facebook-github-bot · 2025-09-08T22:06:37Z

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D81965471.

facebook-github-bot · 2025-09-08T22:27:31Z

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D81965471.

kimishpatel · 2025-09-09T00:48:21Z

extension/threadpool/threadpool.h

+ *   the threadpool to use 4 threads.
+ */
+
+#ifndef EXECUTORCH_THREADPOOL_SIZE


why is this a compile time option? How do you expect users to use it?

I've piped it through CMake options now. In theory, I'd prefer to have a runtime API, but I don't really see a good way to do this given that it needs to be set prior to threadpool creation, which is implicit. I'm open to ideas. My expectation is that 99.9% of users will use the default perf core option, so needing to override at build time is an acceptable compromise.

Yeah but compile option is just not useful. You compile once deploy on many platforms, you cannot possibly know what the right size is

kimishpatel

left a comment

swolchok · 2025-09-09T17:50:25Z

I like the 3 mutually exclusive options better. I don't see why we are vague about the "size heuristic" given that we are intentionally making the user select a sizing algorithm; I would just s/EXECUTORCH_THREADPOOL_USE_SIZE_HEURISTIC/EXECUTORCH_THREADPOOL_USE_PERFORMANCE_CORES/. Also, it is unclear to me whether EXECUTORCH_THREADPOOL_USE_ALL_CORES will count hyperthreading as creating extra cores; might want to call it EXECUTORCH_THREADPOOL_USE_ALL_PHYSICAL_CORES or EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES depending on which is correct. (FWIW, not counting hyperthreading would IMO be the correct choice for memory-bandwidth-bound applications.)

Agree with Kimish that it is unclear how users are supposed to correctly choose a fixed size for the thread pool at compile time.

GregoryComer · 2025-09-09T23:02:18Z

I like the 3 mutually exclusive options better. I don't see why we are vague about the "size heuristic" given that we are intentionally making the user select a sizing algorithm; I would just s/EXECUTORCH_THREADPOOL_USE_SIZE_HEURISTIC/EXECUTORCH_THREADPOOL_USE_PERFORMANCE_CORES/. Also, it is unclear to me whether EXECUTORCH_THREADPOOL_USE_ALL_CORES will count hyperthreading as creating extra cores; might want to call it EXECUTORCH_THREADPOOL_USE_ALL_PHYSICAL_CORES or EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES depending on which is correct. (FWIW, not counting hyperthreading would IMO be the correct choice for memory-bandwidth-bound applications.)

Agree with Kimish that it is unclear how users are supposed to correctly choose a fixed size for the thread pool at compile time.

I've updated to use the discrete options. Regarding physical or logical cores, from a look at cpuinfo, it initially looks inconsistent - Windows counts logical cores but reading the code, it looks like it's counting physical cores on Linux?. I'm going to double check the behavior on a few systems. The x86 system I have easy access to at this exact moment don't report multiple threads per core.

The main intent of this option is to be backwards compatible with the existing behavior. I'd imagine that very few OSS users would want this behavior, as it's slow and there's no OSS way to use fewer threads.

Regarding the user-facing interface, I've added CMake options. I responded to Kimish's comment above - I'd prefer a runtime API in theory but it's awkward to do with the current design. I'm open to ideas.

facebook-github-bot · 2025-09-09T23:09:40Z

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D81965471.

kimishpatel · 2025-09-10T03:04:31Z

whether EXECUTORCH_THREADPOOL_USE_ALL_CORES will count hyperthreading as creating extra cores

normally it would

kimishpatel · 2025-09-10T03:05:18Z

whether EXECUTORCH_THREADPOOL_USE_ALL_CORES will count hyperthreading as creating extra cores

normally it would

As in it would reported as separate cores. you will have to explicitly ask, I think, for actual core count

extension/threadpool/CMakeLists.txt

kimishpatel

I am not convinced that using EXECUTORCH_THREADPOOL_SIZE as build time option is a good. For oss users that are trying to use this option, it feels you are better off with using the unsafe option.

The other possibility is that set_num_thread option will create a new threadpool and use that as a current threadpool without destorying the existing one.

In any case, I think the other two compile time options look good to me and I am not sure if we have to add EXECUTORCH_THREADPOOL_SIZE right now. So the immediate need should be met by the other two, no?

GregoryComer · 2025-09-10T03:43:33Z

I am not convinced that using EXECUTORCH_THREADPOOL_SIZE as build time option is a good. For oss users that are trying to use this option, it feels you are better off with using the unsafe option.

The other possibility is that set_num_thread option will create a new threadpool and use that as a current threadpool without destorying the existing one.

In any case, I think the other two compile time options look good to me and I am not sure if we have to add EXECUTORCH_THREADPOOL_SIZE right now. So the immediate need should be met by the other two, no?

Sure. I don't necessarily see a strong use case for EXECUTORCH_THREADPOOL_SIZE, so I'm happy to drop it. The perf core option as default is what I really care about. I'll update the PR.

extension/threadpool/targets.bzl

kimishpatel · 2025-09-10T23:25:21Z

extension/threadpool/threadpool.cpp

+#if defined(EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES)
+  // Use threads=cores.
+  static int num_threads = cpuinfo_get_processors_count();
+#else
+  // Set threads equal to the number of performance cores.
+  static int num_threads =
+      ::executorch::extension::cpuinfo::get_num_performant_cores();
+#endif


default behavior than seems get num performance cores? I thought you would want this the other way around. That is by default you have logical cores and in oss cmake we can make performant core as default build option.

Issue is that for internal uses now you only have performant cores

In extension/threadpool/targets.bzl, I changed it to define EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES when not in OSS, so that should cover this case. If there's a better way to ensure this, I'm definitely open to it. I could add an API to retrieve the threadpool size and add an internal test to verify the behavior, if you'd like.

kimishpatel

Left one comment around preserving the default behavior for internal uses

GregoryComer · 2025-09-11T20:42:59Z

@kimishpatel I believe I've addressed all the comments. Do you have any additional concerns?

kimishpatel

synced offline for making default behavior chose all core and add cmake flag to chose perf core for oss

facebook-github-bot · 2025-09-11T22:18:19Z

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D81965471.

facebook-github-bot · 2025-09-11T22:37:03Z

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D81965471.

facebook-github-bot · 2025-09-12T04:38:11Z

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D81965471.

facebook-github-bot · 2025-09-12T05:00:46Z

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D81965471.

…ance cores (pytorch#14090) Summary: Allow build-time configuration of the thread pool size and default to a performance heuristic. There are 2 modes that we want to support: * Heuristic-based. Choose the number of threads according to a performance heuristic. Use threads equal to the number of detected performance cores, but we can continue to iterate on this by adding fine-grained heuristics for specific chipsets in the future. * All cores (threads=cores). This is the current behavior. We need to maintain this as an option for some use cases. With this PR, the default (for OSS) is to use performance cores. From testing with CV models on ~10 representative devices across the performance spectrum, this gives anywhere from parity with the existing perf to up to a 13x speedup (measured on Pixel 6). Many common devices (S20, S22, iPhone 15 Pro) show a 2-4x speedup. #### Specifying Threadpool Size To specify the threadpool size, I've added two preprocessor options (and corresponding CMake options): * `EXECUTORCH_THREADPOOL_USE_PERFORMANCE_CORES`- Use threads = detected perf cores. * `EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES` - Use threads = logical cores. Test Plan: I've verified that logic functions correctly in OSS by building the executor_runner on M1 Mac and observing the existing logging in cpuinfo_utils. Measuring MobileNet V3 (exported from examples) on XNNPACK, time to run 100 iterations drops from ~450ms to ~230ms on M1 Pro with this change. Reviewed By: kimishpatel Differential Revision: D81965471 Pulled By: GregoryComer

facebook-github-bot · 2025-09-12T05:12:03Z

@GregoryComer has exported this pull request. If you are a Meta employee, you can view the originating diff in D81965471.

…ance cores (pytorch#14090) Summary: Allow build-time configuration of the thread pool size and default to a performance heuristic. There are 2 modes that we want to support: * Heuristic-based. Choose the number of threads according to a performance heuristic. Use threads equal to the number of detected performance cores, but we can continue to iterate on this by adding fine-grained heuristics for specific chipsets in the future. * All cores (threads=cores). This is the current behavior. We need to maintain this as an option for some use cases. With this PR, the default (for OSS) is to use performance cores. From testing with CV models on ~10 representative devices across the performance spectrum, this gives anywhere from parity with the existing perf to up to a 13x speedup (measured on Pixel 6). Many common devices (S20, S22, iPhone 15 Pro) show a 2-4x speedup. #### Specifying Threadpool Size To specify the threadpool size, I've added two preprocessor options (and corresponding CMake options): * `EXECUTORCH_THREADPOOL_USE_PERFORMANCE_CORES`- Use threads = detected perf cores. * `EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES` - Use threads = logical cores. Test Plan: I've verified that logic functions correctly in OSS by building the executor_runner on M1 Mac and observing the existing logging in cpuinfo_utils. Measuring MobileNet V3 (exported from examples) on XNNPACK, time to run 100 iterations drops from ~450ms to ~230ms on M1 Pro with this change. Rollback Plan: Reviewed By: kimishpatel Differential Revision: D81965471 Pulled By: GregoryComer

facebook-github-bot · 2025-09-12T05:16:41Z

@GregoryComer has exported this pull request. If you are a Meta employee, you can view the originating diff in D81965471.

GregoryComer · 2025-09-12T05:18:01Z

Fighting some out of sync issues with the imported diff, maybe due to diff train issues. Finally got it resolved now. I haven't touched this PR's contents really at all since earlier today and CI was green. Going to merge as this is a critical feature.

…sing only performance cores (pytorch#14090)" This reverts commit 72d50b2.

…sing only performance cores (#14090)" This reverts commit 72d50b2. ### Summary Seeing crashes in macos unittest job.

…ance cores (pytorch#14090) ### Summary Allow build-time configuration of the thread pool size and default to a performance heuristic. There are 2 modes that we want to support: * Heuristic-based. Choose the number of threads according to a performance heuristic. Use threads equal to the number of detected performance cores, but we can continue to iterate on this by adding fine-grained heuristics for specific chipsets in the future. * All cores (threads=cores). This is the current behavior. We need to maintain this as an option for some use cases. With this PR, the default (for OSS) is to use performance cores. From testing with CV models on ~10 representative devices across the performance spectrum, this gives anywhere from parity with the existing perf to up to a 13x speedup (measured on Pixel 6). Many common devices (S20, S22, iPhone 15 Pro) show a 2-4x speedup. #### Specifying Threadpool Size To specify the threadpool size, I've added two preprocessor options (and corresponding CMake options): * `EXECUTORCH_THREADPOOL_USE_PERFORMANCE_CORES`- Use threads = detected perf cores. * `EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES` - Use threads = logical cores. ### Test plan I've verified that logic functions correctly in OSS by building the executor_runner on M1 Mac and observing the existing logging in cpuinfo_utils. Measuring MobileNet V3 (exported from examples) on XNNPACK, time to run 100 iterations drops from ~450ms to ~230ms on M1 Pro with this change.

…h#14307) …sing only performance cores (pytorch#14090)" This reverts commit 72d50b2. ### Summary Seeing crashes in macos unittest job.

GregoryComer requested review from digantdesai and mergennachin September 8, 2025 21:37

GregoryComer requested review from kimishpatel and swolchok as code owners September 8, 2025 21:37

meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 8, 2025

GregoryComer force-pushed the thread-pool-size branch from ecb0cbf to 7c03267 Compare September 8, 2025 21:50

GregoryComer added the release notes: misc Miscellaneous label Sep 8, 2025

GregoryComer force-pushed the thread-pool-size branch from 7c03267 to 2052781 Compare September 8, 2025 22:06

GregoryComer force-pushed the thread-pool-size branch from 2052781 to 825ff42 Compare September 8, 2025 22:25

kimishpatel reviewed Sep 9, 2025

View reviewed changes

kimishpatel requested changes Sep 9, 2025

View reviewed changes

GregoryComer requested review from larryliu0820 and kirklandsign as code owners September 9, 2025 22:35

GregoryComer force-pushed the thread-pool-size branch from 1ee7835 to 01cc300 Compare September 9, 2025 23:03

GregoryComer changed the title ~~Add EXECUTORCH_THREADPOOL_SIZE option and heuristic~~ Add EXECUTORCH_THREADPOOL_SIZE options, default to using only performance cores Sep 10, 2025

kimishpatel reviewed Sep 10, 2025

View reviewed changes

extension/threadpool/CMakeLists.txt Show resolved Hide resolved

kimishpatel reviewed Sep 10, 2025

View reviewed changes

mergennachin reviewed Sep 10, 2025

View reviewed changes

extension/threadpool/targets.bzl Outdated Show resolved Hide resolved

GregoryComer force-pushed the thread-pool-size branch from 01cc300 to a586ecf Compare September 10, 2025 22:07

kimishpatel reviewed Sep 10, 2025

View reviewed changes

kimishpatel requested changes Sep 10, 2025

View reviewed changes

GregoryComer requested review from kimishpatel and mergennachin September 11, 2025 20:42

kimishpatel approved these changes Sep 11, 2025

View reviewed changes

GregoryComer force-pushed the thread-pool-size branch 3 times, most recently from ceb2242 to 110746c Compare September 11, 2025 22:16

GregoryComer force-pushed the thread-pool-size branch from 110746c to 50c3ca7 Compare September 11, 2025 22:36

GregoryComer force-pushed the thread-pool-size branch from 27201f1 to 7b52e42 Compare September 12, 2025 05:11

facebook-github-bot added fb-exported meta-exported labels Sep 12, 2025

GregoryComer force-pushed the thread-pool-size branch from 7b52e42 to aa5b494 Compare September 12, 2025 05:16

GregoryComer merged commit 72d50b2 into pytorch:main Sep 12, 2025
120 of 124 checks passed

GregoryComer added a commit to GregoryComer/executorch that referenced this pull request Sep 15, 2025

Revert "Add EXECt switch UTORCH_THREADPOOL_SIZE options, default to u…

0ac9b64

…sing only performance cores (pytorch#14090)" This reverts commit 72d50b2.

GregoryComer mentioned this pull request Sep 15, 2025

Revert "Add EXECUTORCH_THREADPOOL_SIZE options, default to u… #14307

Merged

GregoryComer added a commit that referenced this pull request Sep 15, 2025

Revert "Add EXECUTORCH_THREADPOOL_SIZE options, default to u… (#14307)

750cba7

…sing only performance cores (#14090)" This reverts commit 72d50b2. ### Summary Seeing crashes in macos unittest job.

GregoryComer mentioned this pull request Sep 15, 2025

Out-of-bounds access in pthreadpool when reducing threadpool size on macos #14321

Open

Add EXECUTORCH_THREADPOOL_SIZE options, default to using only performance cores #14090

Add EXECUTORCH_THREADPOOL_SIZE options, default to using only performance cores #14090

Uh oh!

Conversation

GregoryComer commented Sep 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Specifying Threadpool Size

Test plan

Uh oh!

pytorch-bot bot commented Sep 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14090

⏳ No Failures, 43 Pending

Uh oh!

facebook-github-bot commented Sep 8, 2025

Uh oh!

facebook-github-bot commented Sep 8, 2025

Uh oh!

facebook-github-bot commented Sep 8, 2025

Uh oh!

facebook-github-bot commented Sep 8, 2025

Uh oh!

kimishpatel Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

GregoryComer Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

kimishpatel Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

kimishpatel left a comment

Choose a reason for hiding this comment

Uh oh!

swolchok commented Sep 9, 2025

Uh oh!

GregoryComer commented Sep 9, 2025

Uh oh!

facebook-github-bot commented Sep 9, 2025

Uh oh!

kimishpatel commented Sep 10, 2025

Uh oh!

kimishpatel commented Sep 10, 2025

Uh oh!

Uh oh!

kimishpatel left a comment

Choose a reason for hiding this comment

Uh oh!

GregoryComer commented Sep 10, 2025

Uh oh!

Uh oh!

kimishpatel Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

GregoryComer Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

kimishpatel left a comment

Choose a reason for hiding this comment

Uh oh!

GregoryComer commented Sep 11, 2025

Uh oh!

kimishpatel left a comment

Choose a reason for hiding this comment

Uh oh!

facebook-github-bot commented Sep 11, 2025

Uh oh!

facebook-github-bot commented Sep 11, 2025

Uh oh!

facebook-github-bot commented Sep 12, 2025

Uh oh!

facebook-github-bot commented Sep 12, 2025

Uh oh!

facebook-github-bot commented Sep 12, 2025

Uh oh!

facebook-github-bot commented Sep 12, 2025

Uh oh!

GregoryComer commented Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

GregoryComer commented Sep 8, 2025 •

edited

Loading

pytorch-bot bot commented Sep 8, 2025 •

edited

Loading

GregoryComer commented Sep 12, 2025 •

edited

Loading