Description
Disclaimer: the improvements described below aren't due to optimizations in the transforms code; they mostly come from the dataset wrapper. While they are still improvements over what is currently released in 0.15, the conclusion should not be that the current transforms are problematic (again: the "issue" was in the dataset wrapper).
We've been running some benchmarks with @pmeier and here are our conclusions:
1. There is a significant cost to having many entries in a sample, even if those entries are pass-through. That cost comes from `pytree`'s [un]flattening of the inputs. By default our dataset wrapper returned all entries in a sample even when they are unused; we are changing the default behaviour to only return the strictly needed entries for a given task in #7488 ("only return small set of targets by default from dataset wrapper"). Our benchmarks (below) show that this increases the perf of the SSDLite detection pipeline by 2X.
2. As reported in #7489 ("Detection references are needlessly transforming masks"), our training references currently handle masks whenever they are present, and they are present even in pure-detection tasks. While this can/should be addressed in the "COCO wrapper" of the references, it should also be prevented by the V2 dataset wrapper: #7488 only returns masks when strictly needed, which improves perf by an additional 20%.
3. We have observed that the dataset wrapper for COCO (even prior to the optimizations above) is still faster than the one in the training references. While the transforms themselves aren't always significantly faster, the V2 versions are ~20% faster than the V1 versions for the SSDLite detection pipeline (most of the improvement comes from the dataset wrapping). See details below.
4. For a typical classification pipeline (small number of inputs), we do not observe any improvement when removing both the `pytree` logic (flattening / unflattening) and the "tensor pass-through heuristic" logic. This suggests that neither of those is a bottleneck right now, as long as the number of inputs stays small. It also suggests that #6769 ("Only flatten a pytree once per container transform") isn't a priority.
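The per-entry cost in point 1. can be illustrated with a minimal pure-Python flatten/unflatten. This is a simplified stand-in for `torch.utils._pytree`, not torchvision's actual code; the point is only that the [un]flattening work scales with the number of entries, even for entries a transform merely passes through:

```python
import timeit


def tree_flatten(obj):
    """Flatten a nested dict/list structure into (leaves, spec)."""
    if isinstance(obj, dict):
        keys, leaves, specs = list(obj.keys()), [], []
        for k in keys:
            sub_leaves, sub_spec = tree_flatten(obj[k])
            leaves.extend(sub_leaves)
            specs.append(sub_spec)
        return leaves, ("dict", keys, specs)
    if isinstance(obj, list):
        leaves, specs = [], []
        for item in obj:
            sub_leaves, sub_spec = tree_flatten(item)
            leaves.extend(sub_leaves)
            specs.append(sub_spec)
        return leaves, ("list", None, specs)
    return [obj], ("leaf", None, None)


def tree_unflatten(leaves, spec):
    """Rebuild the structure; returns (value, number_of_leaves_consumed)."""
    kind, keys, specs = spec
    if kind == "leaf":
        return leaves[0], 1
    pos, children = 0, []
    for sub_spec in specs:
        value, used = tree_unflatten(leaves[pos:], sub_spec)
        children.append(value)
        pos += used
    return (dict(zip(keys, children)) if kind == "dict" else children), pos


def roundtrip(sample):
    leaves, spec = tree_flatten(sample)
    return tree_unflatten(leaves, spec)[0]


# A detection-style sample: the more target keys, the more leaves to walk,
# even if a transform only touches "image" and "boxes".
small = {"image": 0, "boxes": 1}
large = {"image": 0, "boxes": 1, "masks": 2, "labels": 3,
         "area": 4, "iscrowd": 5, "image_id": 6, "keypoints": 7}

t_small = timeit.timeit(lambda: roundtrip(small), number=10_000)
t_large = timeit.timeit(lambda: roundtrip(large), number=10_000)
print(f"2 keys: {t_small:.3f}s, 8 keys: {t_large:.3f}s")
```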
BTW, point 1. suggests that we should probably enforce having only one `BoundingBox` instance per sample, instead of multiple ones (#7319).
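The idea behind #7488 can be sketched as a simple per-task key filter applied by the dataset wrapper. The key sets and function names below are illustrative, not torchvision's actual API:

```python
# Hypothetical per-task key sets; torchvision's actual defaults may differ.
TARGET_KEYS = {
    "detection": {"boxes", "labels"},
    "segmentation": {"boxes", "labels", "masks"},
}


def filter_target(target: dict, task: str) -> dict:
    """Drop target entries the task does not need, so downstream
    transforms have fewer pytree leaves to [un]flatten."""
    keep = TARGET_KEYS[task]
    return {k: v for k, v in target.items() if k in keep}


# A full COCO-style target as the wrapper would otherwise return it.
full_target = {
    "boxes": [[0, 0, 10, 10]],
    "labels": [1],
    "masks": [[0] * 4],
    "area": [100],
    "iscrowd": [0],
    "image_id": 42,
}
print(filter_target(full_target, "detection"))
# only "boxes" and "labels" remain for a pure-detection task
```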
Details for points 1. and 2.:
Benchmarks were run from pmeier/detection-reference-benchmark@0ae9027
```
############################################################
detection-ssdlite
############################################################
input_type='PIL', api_version='v2'
Results computed for 1_000 samples
```

With current dataset wrapper (returning all target keys, including masks):

```
                                median         std
WrapCocoSampleForTransformsV2   2389 µs  +-  3534 µs
RandomIoUCrop                   3055 µs  +- 15973 µs
RandomHorizontalFlip             900 µs  +-  4436 µs
PILToTensor                     1057 µs  +-  4692 µs
ConvertDtype                    1303 µs  +-  7031 µs
SanitizeBoundingBox             1605 µs  +-  6134 µs

total                          10309 µs
```

Removing unnecessary target keys, but still including masks:

```
                                median         std
WrapCocoSampleForTransformsV2   2367 µs  +-  3514 µs
RandomIoUCrop                   1352 µs  +- 15292 µs
RandomHorizontalFlip             335 µs  +-   447 µs
PILToTensor                      369 µs  +-   220 µs
ConvertDtype                     408 µs  +-   187 µs
SanitizeBoundingBox              816 µs  +-  1183 µs

total                           5648 µs
```

Removing unnecessary target keys and also masks:

```
                                median         std
WrapCocoSampleForTransformsV2   2349 µs  +-  3454 µs
RandomIoUCrop                    880 µs  +- 10715 µs
RandomHorizontalFlip             226 µs  +-   187 µs
PILToTensor                      351 µs  +-   151 µs
ConvertDtype                     388 µs  +-   173 µs
SanitizeBoundingBox              318 µs  +-   130 µs

total                           4512 µs
```
Details for point 4.:
```
############################################################
detection-ssdlite
############################################################
loading annotations into memory...
Done (t=14.21s)
creating index...
index created!
Caching 1000 ([89444, 73295, 101719] ... [31395, 96727, 47807]) COCO samples
100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:07<00:00, 132.17it/s]

input_type='Tensor', api_version='v1'
Results computed for 1_000 samples

                          median         std
ConvertCocoPolysToMaskV1  4591 µs  +-  6946 µs
PILToTensorV1              497 µs  +-   103 µs
RandomIoUCropV1           1078 µs  +- 16846 µs
RandomHorizontalFlipV1      38 µs  +-   457 µs
ConvertImageDtypeV1        557 µs  +-   303 µs

total                     6761 µs
------------------------------------------------------------
loading annotations into memory...
Done (t=12.38s)
creating index...
index created!
Caching 1000 ([89444, 73295, 101719] ... [31395, 96727, 47807]) COCO samples
100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:07<00:00, 134.83it/s]

input_type='Tensor', api_version='v2'
Results computed for 1_000 samples

                                median         std
WrapCocoSampleForTransformsV2   2364 µs  +-  3673 µs
PILToTensor                      528 µs  +-   221 µs
RandomIoUCrop                    717 µs  +- 10495 µs
RandomHorizontalFlip             300 µs  +-   473 µs
ConvertDtype                     402 µs  +-   297 µs
SanitizeBoundingBox              729 µs  +-  1231 µs

total                           5039 µs
------------------------------------------------------------
loading annotations into memory...
Done (t=13.04s)
creating index...
index created!
Caching 1000 ([89444, 73295, 101719] ... [31395, 96727, 47807]) COCO samples
100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:07<00:00, 141.10it/s]

input_type='PIL', api_version='v1'
Results computed for 1_000 samples

                          median         std
ConvertCocoPolysToMaskV1  4471 µs  +-  6949 µs
RandomIoUCropV1           1051 µs  +- 11616 µs
RandomHorizontalFlipV1      41 µs  +-   410 µs
PILToTensorV1              344 µs  +-   256 µs
ConvertImageDtypeV1        543 µs  +-   389 µs

total                     6450 µs
------------------------------------------------------------
loading annotations into memory...
Done (t=13.48s)
creating index...
index created!
Caching 1000 ([89444, 73295, 101719] ... [31395, 96727, 47807]) COCO samples
100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:07<00:00, 139.58it/s]

input_type='PIL', api_version='v2'
Results computed for 1_000 samples

                                median         std
WrapCocoSampleForTransformsV2   2393 µs  +-  3537 µs
RandomIoUCrop                    896 µs  +- 10620 µs
RandomHorizontalFlip             298 µs  +-   416 µs
PILToTensor                      375 µs  +-   167 µs
ConvertDtype                     421 µs  +-   239 µs
SanitizeBoundingBox              727 µs  +-  1140 µs

total                           5110 µs
------------------------------------------------------------
loading annotations into memory...
Done (t=13.00s)
creating index...
index created!
Caching 1000 ([89444, 73295, 101719] ... [31395, 96727, 47807]) COCO samples
100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:07<00:00, 138.44it/s]

input_type='Datapoint', api_version='v2'
Results computed for 1_000 samples

                                median         std
WrapCocoSampleForTransformsV2   2446 µs  +-  3707 µs
ToImageTensor                    588 µs  +-   162 µs
RandomIoUCrop                    967 µs  +- 15236 µs
RandomHorizontalFlip             350 µs  +-   503 µs
ConvertDtype                     507 µs  +-   328 µs
SanitizeBoundingBox              828 µs  +-  1232 µs

total                           5686 µs
------------------------------------------------------------
```
Summaries:

```
v2 / v1

Tensor  0.75
PIL     0.79
```

Slowdown as row / col:

```
                    [a]   [b]   [c]   [d]   [e]
Tensor,    v1 [a]  1.00  1.34  1.05  1.32  1.19
Tensor,    v2 [b]  0.75  1.00  0.78  0.99  0.89
PIL,       v1 [c]  0.95  1.28  1.00  1.26  1.13
PIL,       v2 [d]  0.76  1.01  0.79  1.00  0.90
Datapoint, v2 [e]  0.84  1.13  0.88  1.11  1.00
```
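The summary ratios are just ratios of the per-pipeline `total` times reported above. A quick sketch of how to reproduce them from those totals:

```python
# Pipeline totals (µs) copied from the benchmark runs above.
totals = {
    "Tensor, v1":    6761,
    "Tensor, v2":    5039,
    "PIL, v1":       6450,
    "PIL, v2":       5110,
    "Datapoint, v2": 5686,
}

# v2 / v1 ratio per input type (lower is faster for v2)
print(round(totals["Tensor, v2"] / totals["Tensor, v1"], 2))  # 0.75
print(round(totals["PIL, v2"] / totals["PIL, v1"], 2))        # 0.79

# Full slowdown matrix: entry (row, col) = total[row] / total[col]
names = list(totals)
for row in names:
    ratios = [round(totals[row] / totals[col], 2) for col in names]
    print(f"{row:14s}", ratios)
```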