
Provide a method to set thread count and set sane defaults on mobile #8763

@GregoryComer


📚 The doc issue

The (p)threadpool extension provides thread-level parallelism in XNNPACK and the optimized kernel library. Thread count has a significant impact on mobile performance, especially when it exceeds the number of performant cores. The effect differs by phone model, but some devices (such as Pixels) can be 20x slower with the current behavior than when restricted to a single thread. A 2-3x speedup from restricting the thread count is common on many other phones, including flagship models.

This issue has caused significant friction with existing adopters. We are responsible for out-of-box performance, so we should do the fast thing out of the box. First impressions tend to be negative when new users observe poor performance. If we put the fix only in specific places, such as the JNI or Meta-internal code, consumers using a different entry point will miss it.

Controlling thread count

From existing discussions, there are two main options to control thread count:

  1. Initialize the threadpool with fewer threads based on a performance heuristic. Add a build-time configuration to configure the threadpool thread count.
  2. Upstream a feature to pthreadpool to allow for execution with fewer threads than are available. @kimishpatel has attempted this, but Maratyszcza has not been interested in merging it. We may need to either fork pthreadpool, switch to Google/XNNPACK's forked pthreadpool and merge the PR there, or provide a clean implementation using the pthreadpool interface.

Option (1) is the simplest. However, @kimishpatel has expressed concerns that it prevents users from opting in to more threads than we default to. I'm not too worried about this, as I haven't ever seen anyone do it; people just ask us how many threads to use to go fast. We can also provide a build-time configuration to construct a larger threadpool, which lets people opt out of our defaults.
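To make option (1) concrete, here is a minimal sketch of how a build-time override could be combined with a heuristic default. The macro `EXAMPLE_THREADPOOL_NUM_THREADS` and both function names are hypothetical and do not exist in ExecuTorch today. The resulting value would presumably be passed to `pthreadpool_create` when the pool is constructed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical build-time override, e.g. -DEXAMPLE_THREADPOOL_NUM_THREADS=4.
// A value of 0 means "no override, use the heuristic default".
#ifndef EXAMPLE_THREADPOOL_NUM_THREADS
#define EXAMPLE_THREADPOOL_NUM_THREADS 0
#endif

// Picks the threadpool size: an explicit override wins, otherwise the
// heuristic default is used. The result is never less than 1.
inline uint32_t resolve_threadpool_size(
    uint32_t build_time_override,
    uint32_t heuristic_default) {
  if (build_time_override > 0) {
    return build_time_override;
  }
  return heuristic_default > 0 ? heuristic_default : 1u;
}

// Convenience wrapper that reads the (hypothetical) build-time macro.
inline uint32_t configured_threadpool_size(uint32_t heuristic_default) {
  const uint32_t size =
      resolve_threadpool_size(EXAMPLE_THREADPOOL_NUM_THREADS, heuristic_default);
  assert(size >= 1);
  return size;
}
```

Because the override takes precedence over the heuristic, a user who wants a larger pool than the default can still get one, which is the opt-out described above.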

Option (2) is more work, but would allow the thread count to be configured independently of the threadpool size. This is likely nice to have. Given the difficulty in upstreaming the required PR, aligning with Google and using their threadpool fork may be the easiest path.
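As a rough illustration of decoupling the thread count from the pool size, one possible approach is to split the work into at most N chunks, so that no more than N workers have anything to execute. The sketch below shows only the chunking step. `split_range` is a hypothetical helper, not part of pthreadpool or ExecuTorch, and whether idle workers still incur wake-up cost depends on the pool implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits [0, total) into at most max_threads contiguous, near-equal chunks.
// Each pair is a half-open [begin, end) range. A max_threads of 0 is
// treated as 1, and an empty range yields no chunks.
inline std::vector<std::pair<std::size_t, std::size_t>> split_range(
    std::size_t total,
    std::size_t max_threads) {
  std::vector<std::pair<std::size_t, std::size_t>> chunks;
  if (total == 0) {
    return chunks;
  }
  const std::size_t n =
      std::min(std::max<std::size_t>(max_threads, 1), total);
  const std::size_t base = total / n;   // minimum chunk length
  const std::size_t extra = total % n;  // first `extra` chunks get one more
  std::size_t begin = 0;
  for (std::size_t i = 0; i < n; ++i) {
    const std::size_t len = base + (i < extra ? 1 : 0);
    chunks.emplace_back(begin, begin + len);
    begin += len;
  }
  assert(begin == total);
  return chunks;
}
```

Each chunk could then be dispatched as a single task on the existing pool, with the per-call cap chosen by the caller rather than fixed at pool construction.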

I don't personally have a strong opinion on which path we take, so long as we can solve the problem.

Where do we put the defaults?

It is critical that the defaults work "out-of-box", regardless of whether someone is using module.h from C++, JNI bindings, iOS bindings, or Meta-internal bindings. Ideally, I'd like to put the defaults in the threadpool extension itself, though I could live with putting them in module.h.

How many threads?

In addition to being able to configure the thread count, we need to know how many threads to default to. We don't necessarily have to get 100% performance in every conceivable circumstance - we just need to provide reasonable defaults that work well for the majority of use cases. What we have right now is effectively unshippable, so any improvement here will be night and day.

Using cores / 2 on Android and 2 on iOS is a reasonable baseline, though we may want to avoid doing this on homogeneous SoCs. We can iterate on the heuristic over time, adding deeper hardware-specific and performant-core detection as we gather more data. This logic could potentially live in the cpuinfo repository.
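The baseline heuristic could be sketched roughly as follows. The `Platform` enum, the `default_thread_count` name, and the `homogeneous_cores` flag are all hypothetical; a real implementation would more likely use preprocessor platform checks and query cpuinfo for the core topology. Falling back to all cores on homogeneous SoCs and on other platforms is an assumption made for this sketch, not something the proposal specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical platform tag used only for this sketch.
enum class Platform { Android, iOS, Other };

// Baseline default: cores / 2 on Android, 2 on iOS.
// Assumption: homogeneous SoCs and other platforms use every core.
// The result is always clamped to [1, core_count].
inline uint32_t default_thread_count(
    Platform platform,
    uint32_t core_count,
    bool homogeneous_cores = false) {
  // Treat a failed core-count query (0) as a single core.
  const uint32_t cores = std::max<uint32_t>(core_count, 1u);
  uint32_t threads = cores;
  if (!homogeneous_cores) {
    switch (platform) {
      case Platform::Android:
        threads = cores / 2;
        break;
      case Platform::iOS:
        threads = 2;
        break;
      case Platform::Other:
        break;
    }
  }
  threads = std::min(std::max<uint32_t>(threads, 1u), cores);
  assert(threads >= 1 && threads <= cores);
  return threads;
}
```

For example, an 8-core heterogeneous Android device would default to 4 threads under this sketch, while a single-core device would still get 1 thread rather than 0.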

cc @mergennachin @kimishpatel @iseeyuan @byjlw

Labels

high priority, module: user experience, triage review, triaged
