Disclaimer: the improvements described below aren't improvements due to optimizations in the transforms code; they mostly relate to the dataset wrapper. While they are still improvements over what is currently released in 0.15, the conclusion should not be that the current transforms are problematic (again: the "issue" was in the dataset wrapper).
We've been running some benchmarks with @pmeier and here are our conclusions:
1. There is a huge cost to having many entries in a sample, even if those are pass-through. That cost comes from `pytree`'s [un]flattening of the inputs. By default our dataset wrapper returned all entries in a sample even if they are unused - we are changing the default behaviour to only return the entries strictly needed for a given task in "only return small set of targets by default from dataset wrapper" #7488. Our benchmarks (below) show that this increases the perf of the SSDLite detection pipeline by 2X.
2. As reported in "Detection references are needlessly transforming masks" #7489, our training references currently handle masks when they're present, and they're present even in pure-detection tasks. While this is something that can/should be addressed in the "COCO wrapper" of the references, it should also be prevented by modifying the V2 dataset wrapper. #7488 only returns masks when strictly needed, which improves perf by an additional 20%.
3. We have observed that the dataset wrapper for COCO (even prior to the optimizations above) is still faster than the one we have in the training references. While the transforms themselves aren't always significantly faster, the V2 versions are ~20% faster than the V1 versions for the SSDLite detection pipeline (most of the improvement comes from the dataset wrapping). See details below.
4. For a typical classification pipeline (small number of inputs), we do not observe any improvement when removing both the `pytree` logic (wrapping / unwrapping) and the "tensor pass-through heuristic" logic. This suggests that neither of those is a bottleneck right now, as long as the number of inputs stays small. It also suggests that "Only flatten a pytree once per container transform" #6769 isn't a priority.

BTW, point 1. suggests that we should probably enforce having only one `BoundingBox` instance per sample, instead of multiple ones (#7319).

Details for points 1. and 2.:
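To see why point 1. holds even for pass-through entries, it helps to picture what [un]flattening does: every leaf of the sample is visited on the way in and on the way out of each container transform, regardless of whether any transform touches it. Below is a minimal pure-Python sketch of that mechanism - a toy model, not torch's actual `pytree` implementation - comparing a lean sample against one carrying many unused entries:

```python
import time

def tree_flatten(sample):
    """Toy model of pytree flattening: pull the leaves out of a (possibly
    nested) dict and record the structure needed to rebuild it."""
    if not isinstance(sample, dict):
        return [sample], None
    leaves, spec = [], {}
    for key, value in sample.items():
        sub_leaves, sub_spec = tree_flatten(value)
        spec[key] = (len(sub_leaves), sub_spec)
        leaves.extend(sub_leaves)
    return leaves, spec

def tree_unflatten(leaves, spec):
    """Inverse of tree_flatten: rebuild the original structure."""
    if spec is None:
        return leaves[0]
    out, i = {}, 0
    for key, (n, sub_spec) in spec.items():
        out[key] = tree_unflatten(leaves[i:i + n], sub_spec)
        i += n
    return out

# A lean detection sample vs. one dragging along many unused
# pass-through entries (the pre-#7488 wrapper default).
small = {"image": 0, "boxes": 1, "labels": 2}
large = dict(small, **{f"unused_{i}": i for i in range(20)})

for name, sample in [("small", small), ("large", large)]:
    start = time.perf_counter()
    for _ in range(10_000):
        leaves, spec = tree_flatten(sample)
        rebuilt = tree_unflatten(leaves, spec)
    assert rebuilt == sample
    # Per-sample [un]flatten cost scales with the number of leaves,
    # even though no leaf is ever transformed.
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```

The work done per sample is proportional to the number of leaves, which is why trimming unused targets at the wrapper level pays off before any transform-level optimization.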
Benchmarks ran from `pmeier/detection-reference-benchmark@0ae9027`.
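The wrapper-side change benchmarked here (#7488) amounts to dropping target entries before the transforms ever see them. A hypothetical sketch of that idea - key names and the `filter_target` helper are illustrative, not torchvision's actual API:

```python
# Hypothetical defaults: which target entries each task actually consumes.
DEFAULT_TARGET_KEYS = {
    "detection": {"boxes", "labels"},
    "segmentation": {"boxes", "labels", "masks"},
}

def filter_target(target, task="detection", target_keys=None):
    """Drop target entries the pipeline will never use, so the transforms'
    pytree [un]flattening touches fewer leaves per sample and expensive
    entries like masks aren't transformed needlessly."""
    keys = target_keys if target_keys is not None else DEFAULT_TARGET_KEYS[task]
    return {k: v for k, v in target.items() if k in keys}

# A raw COCO-style target carries more than a detection pipeline needs.
raw_target = {
    "boxes": [[0, 0, 10, 10]],
    "labels": [1],
    "masks": ["<large mask>"],  # costly to transform, unused for detection
    "area": [100.0],
    "iscrowd": [0],
}

print(filter_target(raw_target))                       # boxes + labels only
print(filter_target(raw_target, task="segmentation"))  # masks kept when needed
```

An explicit `target_keys` override would let a user opt back into extra fields when their pipeline genuinely needs them.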
Details for point 4.: