Allow to hint number of threads for CPU MLContext #436
Comments
(for naming, I'd follow the "use whole words" identifier advice and avoid fragments like "num")
If it's a hint rather than a configuration, and if setting the exact number depends on information not exposed to developers (the number of cores), maybe a better approach would be an enum that hints towards single- or multi-threaded execution?
Single vs multi-thread doesn't provide sufficient granularity. @huningxin Is there public data we can share here? Like an X-Y plot of (thread count, performance) for the models we want to support?
my understanding is that …
Yes. We collected inference latency for some MediaPipe models on the Chromium WebNN XNNPACK CPU prototype with different thread-count settings (1, 2 and 4). In the current Chromium prototype implementation, the number of threads is capped at the minimum of 4 and the number of available cores. And because the parallel inference jobs are scheduled by Chromium's ThreadPool, there is no guarantee that the number of threads set by the user will actually be allocated. In the following chart, the multi-threaded inference speedup is normalized to single-thread (numThreads=1) performance. As the chart illustrates, for some models, such as SelfieSegmenter (landscape), MobileNetV3 (small_075), BlazeFace (short-range), Blendshape and FaceDetector, setting a higher number of threads doesn't help. These models are usually small.
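For concreteness, a minimal sketch of the capping rule described above (the constant 4 and the min come from the prototype as described; `navigator.hardwareConcurrency` stands in for the system core count):

```js
// Sketch of the Chromium prototype's capping rule described above:
// the effective thread count is the minimum of the user's hint, 4,
// and the number of cores the system reports.
const requested = 8; // the user's numThreads hint
const effective = Math.min(requested, 4, navigator.hardwareConcurrency);
```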
If I read the chart correctly, there is only one case where setting the number of threads to something other than 1 or the maximum leads to better performance (GestureClassifier). Can anyone hint as to why 2 threads are optimal for that particular model?
I suppose the context-switching / job-scheduling overhead would outweigh the inference-time reduction from adding two more threads/jobs for that particular model.
🤔 It's also possible, due to graph topology, that an odd number of threads assigns nodes such that more sequential dependencies occur with 3 threads (edit: oops, you said 4 above), whereas with 2 threads, more long-running operators happen to align nicely. @huningxin: Would this new hint apply to intra-operator or inter-operator threading?
This seems possible, although we didn't test with 3 threads.
This is a good point. The current prototype implementation interprets it as intra-operator threading. Should we allow developers to hint inter-operator threading and intra-operator threading separately?
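A hypothetical shape for such separate hints, purely to illustrate the question (neither option name exists in any spec):

```js
// Hypothetical, illustrative option names; not part of the WebNN spec.
const context = await navigator.ml.createContext({
  deviceType: 'cpu',
  intraOpNumThreads: 4, // parallelism within a single operator
  interOpNumThreads: 2, // independent operators executed in parallel
});
```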
This is a temporary solution to close the perf gap between the TFLite backend and the XNNPACK backend, which will allow us to delete the XNNPACK backend. Long-term discussions of how to specify this behavior are happening on webmachinelearning/webnn#436

Bug: 338162119
Change-Id: I42199744f4a8f3e685cc550dcd013183be65aeb3
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5506116
Reviewed-by: Reilly Grant <[email protected]>
Reviewed-by: Alex Gough <[email protected]>
Auto-Submit: Austin Sullivan <[email protected]>
Commit-Queue: Alex Gough <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1295088}
Closing.
Framework use cases
Multi-core architectures are widely available in modern CPUs and are commonly utilized by ML frameworks to parallelize operator computation when inferring a model.
However, the preferred number of threads (degree of parallelism) may depend on the usage scenario; e.g., for small models, single-threaded execution may be preferred because the task-scheduling overhead may outweigh the speedup of parallel execution.
So ML frameworks usually allow users to control the number of threads according to their requirements. For example, ONNX Runtime allows configuring `intra_op_num_threads` for its CPU execution provider, and TensorFlow Lite provides a `setNumThreads` method for its interpreter.
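As a browser-side analogue of those options, a sketch using onnxruntime-web's WASM backend (`ort.env.wasm.numThreads` plays the role of the native `intra_op_num_threads` here; the model path is a placeholder):

```js
// Thread-count configuration in onnxruntime-web; must be set before
// the first InferenceSession is created.
import * as ort from 'onnxruntime-web';

ort.env.wasm.numThreads = 4;
const session = await ort.InferenceSession.create('model.onnx');
```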
Native ML APIs
Native CPU ML APIs/libraries commonly employ a thread pool for thread-level parallelism, and the thread pool usually allows configuring the number of threads it contains, for example:
- XNNPACK utilizes pthreadpool, which allows configuring `threads_count` when creating the thread pool.
- MLAS utilizes `onnxruntime::concurrency::ThreadPool`, which constructs a thread pool running with `degree_of_parallelism` threads.
- BNNS allows setting `n_threads`, which controls the number of worker threads used to execute a kernel.
Other references
The Model Loader API already extends `MLContextOptions` with `numThreads`, which allows JS code to set the number of threads to use when computing a model.
Proposal
WebNN may adopt the `MLContextOptions.numThreads` extension and allow frameworks to hint the number of threads used to run operators in parallel for a CPU MLContext.
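A usage sketch under this proposal, assuming the Model Loader API's option name carries over unchanged; the value is a hint that the implementation may cap or ignore:

```js
// numThreads follows the Model Loader API's MLContextOptions extension;
// it is a hint, not a guarantee of the allocated thread count.
const context = await navigator.ml.createContext({
  deviceType: 'cpu',
  numThreads: 4,
});
```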
/cc @pyu10055 @wacky6