Add EXECUTORCH_THREADPOOL_SIZE options, default to using only performance cores #14090
Conversation
Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14090

Note: Links to docs will display an error until the docs builds have been completed. No Failures, 43 Pending as of commit aa5b494 with merge base f2eb38e. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D81965471.
Force-pushed ecb0cbf to 7c03267.
Force-pushed 7c03267 to 2052781.
Force-pushed 2052781 to 825ff42.
extension/threadpool/threadpool.h (outdated):

 * the threadpool to use 4 threads.
 */

#ifndef EXECUTORCH_THREADPOOL_SIZE
Why is this a compile-time option? How do you expect users to use it?
I've piped it through CMake options now. In theory, I'd prefer a runtime API, but I don't see a good way to do this given that the size needs to be set prior to threadpool creation, which happens implicitly. I'm open to ideas. My expectation is that 99.9% of users will use the default perf-core option, so needing to override at build time is an acceptable compromise.
Yeah, but a compile-time option is just not useful. You compile once and deploy on many platforms; you cannot possibly know what the right size is.
I like the 3 mutually exclusive options better. I don't see why we are vague about the "size heuristic" given that we are intentionally making the user select a sizing algorithm. I agree with Kimish that it is unclear how users are supposed to correctly choose a fixed size for the thread pool at compile time.
I've updated to use the discrete options.

Regarding physical vs. logical cores: from a look at cpuinfo, it initially looks inconsistent. Windows counts logical cores, but reading the code, it looks like it's counting physical cores on Linux. I'm going to double-check the behavior on a few systems; the x86 systems I have easy access to at this exact moment don't report multiple threads per core. The main intent of this option is to be backwards compatible with the existing behavior. I'd imagine that very few OSS users would want this behavior, as it's slow and there's no OSS way to use fewer threads.

Regarding the user-facing interface, I've added CMake options. I responded to Kimish's comment above; I'd prefer a runtime API in theory, but it's awkward to do with the current design. I'm open to ideas.
Force-pushed 1ee7835 to 01cc300.
Normally it would. As in, they would be reported as separate cores; you will have to explicitly ask, I think, for the actual core count.
I am not convinced that using EXECUTORCH_THREADPOOL_SIZE as a build-time option is a good idea. For OSS users trying to use this option, it feels like you are better off using the unsafe option. The other possibility is that the set_num_thread option would create a new threadpool and use that as the current threadpool, without destroying the existing one.

In any case, the other two compile-time options look good to me, and I am not sure we have to add EXECUTORCH_THREADPOOL_SIZE right now. So the immediate need should be met by the other two, no?
Sure. I don't necessarily see a strong use case for
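The "create a new threadpool instead of destroying the existing one" idea suggested above can be sketched as follows. This is a hypothetical illustration, not the ExecuTorch API: `MiniPool`, `get_threadpool`, and `set_num_threads` are made-up names, and `std::thread::hardware_concurrency()` stands in for the cpuinfo-based default.

```cpp
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical sketch: a lazily created global pool plus a runtime
// override that swaps in a replacement pool rather than mutating the
// existing one, so callers already holding a reference stay valid.
class MiniPool {
 public:
  explicit MiniPool(int n) : num_threads_(n) {}
  int num_threads() const { return num_threads_; }

 private:
  int num_threads_;
};

std::mutex g_pool_mutex;
std::unique_ptr<MiniPool> g_pool;

MiniPool& get_threadpool() {
  std::lock_guard<std::mutex> lock(g_pool_mutex);
  if (!g_pool) {
    // Default sizing: logical core count as a stand-in for the
    // perf-core heuristic, clamped to at least one thread.
    int n = static_cast<int>(std::thread::hardware_concurrency());
    g_pool = std::make_unique<MiniPool>(n > 0 ? n : 1);
  }
  return *g_pool;
}

// Runtime override: replace the pool under the lock instead of
// resizing it in place.
void set_num_threads(int n) {
  std::lock_guard<std::mutex> lock(g_pool_mutex);
  g_pool = std::make_unique<MiniPool>(n);
}
```

A real version would need to decide what happens to work already queued on the old pool; this sketch sidesteps that by keeping the old pool alive until its last user releases it.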
Force-pushed 01cc300 to a586ecf.
#if defined(EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES)
  // Use threads = cores.
  static int num_threads = cpuinfo_get_processors_count();
#else
  // Set threads equal to the number of performance cores.
  static int num_threads =
      ::executorch::extension::cpuinfo::get_num_performant_cores();
#endif
The default behavior then seems to be performance cores? I thought you would want this the other way around: by default you use logical cores, and in the OSS CMake we can make performance cores the default build option.

The issue is that internal uses now only get performance cores.
In extension/threadpool/targets.bzl, I changed it to define EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES when not in OSS, so that should cover this case. If there's a better way to ensure this, I'm definitely open to it. I could add an API to retrieve the threadpool size and add an internal test to verify the behavior, if you'd like.
Left one comment around preserving the default behavior for internal uses
@kimishpatel I believe I've addressed all the comments. Do you have any additional concerns?
Synced offline: make the default behavior choose all cores, and add a CMake flag to choose perf cores for OSS.
Force-pushed ceb2242 to 110746c.
Force-pushed 110746c to 50c3ca7.
Add EXECUTORCH_THREADPOOL_SIZE options, default to using only performance cores (pytorch#14090)

Summary:
Allow build-time configuration of the thread pool size and default to a performance heuristic. There are 2 modes that we want to support:

* Heuristic-based. Choose the number of threads according to a performance heuristic. Use threads equal to the number of detected performance cores, but we can continue to iterate on this by adding fine-grained heuristics for specific chipsets in the future.
* All cores (threads = cores). This is the current behavior. We need to maintain this as an option for some use cases.

With this PR, the default (for OSS) is to use performance cores. From testing with CV models on ~10 representative devices across the performance spectrum, this gives anywhere from parity with the existing perf up to a 13x speedup (measured on Pixel 6). Many common devices (S20, S22, iPhone 15 Pro) show a 2-4x speedup.

#### Specifying Threadpool Size

To specify the threadpool size, I've added two preprocessor options (and corresponding CMake options):

* `EXECUTORCH_THREADPOOL_USE_PERFORMANCE_CORES` - Use threads = detected perf cores.
* `EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES` - Use threads = logical cores.

Test Plan:
I've verified that the logic functions correctly in OSS by building the executor_runner on an M1 Mac and observing the existing logging in cpuinfo_utils. Measuring MobileNet V3 (exported from examples) on XNNPACK, the time to run 100 iterations drops from ~450ms to ~230ms on M1 Pro with this change.

Reviewed By: kimishpatel
Differential Revision: D81965471
Pulled By: GregoryComer
Force-pushed 27201f1 to 7b52e42.
@GregoryComer has exported this pull request. If you are a Meta employee, you can view the originating diff in D81965471.
Force-pushed 7b52e42 to aa5b494.
Fighting some out-of-sync issues with the imported diff, maybe due to diff-train issues. Finally got it resolved. I haven't really touched this PR's contents since earlier today, and CI was green. Going to merge, as this is a critical feature.
Revert "Add EXECUTORCH_THREADPOOL_SIZE options, default to using only performance cores (pytorch#14090)". This reverts commit 72d50b2.
Revert "Add EXECUTORCH_THREADPOOL_SIZE options, default to using only performance cores (pytorch#14090)" (pytorch#14307). This reverts commit 72d50b2. Summary: Seeing crashes in the macOS unittest job.
Summary

Allow build-time configuration of the thread pool size and default to a performance heuristic.

There are 2 modes that we want to support:

* Heuristic-based. Choose the number of threads according to a performance heuristic. Use threads equal to the number of detected performance cores, but we can continue to iterate on this by adding fine-grained heuristics for specific chipsets in the future.
* All cores (threads = cores). This is the current behavior. We need to maintain this as an option for some use cases.

With this PR, the default (for OSS) is to use performance cores. From testing with CV models on ~10 representative devices across the performance spectrum, this gives anywhere from parity with the existing perf up to a 13x speedup (measured on Pixel 6). Many common devices (S20, S22, iPhone 15 Pro) show a 2-4x speedup.

Specifying Threadpool Size

To specify the threadpool size, I've added two preprocessor options (and corresponding CMake options):

* `EXECUTORCH_THREADPOOL_USE_PERFORMANCE_CORES` - Use threads = detected perf cores.
* `EXECUTORCH_THREADPOOL_USE_ALL_LOGICAL_CORES` - Use threads = logical cores.

Test plan

I've verified that the logic functions correctly in OSS by building the executor_runner on an M1 Mac and observing the existing logging in cpuinfo_utils.

Measuring MobileNet V3 (exported from examples) on XNNPACK, the time to run 100 iterations drops from ~450ms to ~230ms on M1 Pro with this change.
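The timing measurement in the test plan can be sketched as below. This is a hedged illustration of the harness shape only: `run_inference` is a placeholder for the actual model execution (e.g. MobileNet V3 through the executor_runner), not the real benchmark code.

```cpp
#include <chrono>
#include <cstdint>

// Sink to keep the placeholder workload from being optimized away.
volatile std::uint64_t g_sink = 0;

// Placeholder standing in for one model forward pass.
void run_inference() {
  std::uint64_t acc = 0;
  for (int i = 0; i < 1000; ++i) {
    acc += static_cast<std::uint64_t>(i) * i;
  }
  g_sink = acc;
}

// Time 100 iterations, as in the test plan, using a monotonic clock.
double time_100_iterations_ms() {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < 100; ++i) {
    run_inference();
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}
```

For a fair before/after comparison one would typically add warm-up iterations and take the median of several runs, since the first iterations can include one-time initialization cost.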