Description
Disclaimer: the improvements described below aren't due to optimizations in the transforms code; they mostly come from the dataset wrapper. While they are still improvements over what is currently released in 0.15, the conclusion should not be that the current transforms are problematic (again: the "issue" was in the dataset wrapper).
We've been running some benchmarks with @pmeier and here are our conclusions:
1. There is a significant cost to having many entries in a sample, even if those entries are pass-through. That cost comes from `pytree`'s [un]flattening of the inputs. By default our dataset wrapper returned all entries in a sample even when they are unused; we are changing the default behaviour to only return the strictly needed entries for a given task in #7488 ("only return small set of targets by default from dataset wrapper"). Our benchmarks (below) show that this increases the perf of the SSDLite detection pipeline by 2X.
2. As reported in #7489 ("Detection references are needlessly transforming masks"), our training references currently handle masks whenever they are present, and they are present even in pure-detection tasks. While this can/should be addressed in the "COCO wrapper" of the references, it should also be prevented by the V2 dataset wrapper: #7488 only returns masks when strictly needed, which improves perf by an additional 20%.
3. We have observed that the dataset wrapper for COCO (even prior to the optimizations above) is still faster than the one in the training references. While the transforms themselves aren't always significantly faster, the V2 versions are ~20% faster than the V1 versions for the SSDLite detection pipeline (most of the improvement comes from the dataset wrapping). See details below.
4. For a typical classification pipeline (small number of inputs), we do not observe any improvement when removing both the `pytree` logic (flattening / unflattening) and the "tensor pass-through heuristic" logic. This suggests that neither of those is a bottleneck right now, as long as the number of inputs stays small. It also suggests that #6769 ("Only flatten a pytree once per container transform") isn't a priority.
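The per-entry cost in point 1. can be illustrated with a minimal pure-Python flatten/unflatten. This is a simplified stand-in for `torch.utils._pytree`, not torchvision's actual code; the point is only that the [un]flattening work scales with the number of entries, even for entries a transform merely passes through:

```python
import timeit


def tree_flatten(obj):
    """Flatten a nested dict/list structure into (leaves, spec)."""
    if isinstance(obj, dict):
        keys, leaves, specs = list(obj.keys()), [], []
        for k in keys:
            sub_leaves, sub_spec = tree_flatten(obj[k])
            leaves.extend(sub_leaves)
            specs.append(sub_spec)
        return leaves, ("dict", keys, specs)
    if isinstance(obj, list):
        leaves, specs = [], []
        for item in obj:
            sub_leaves, sub_spec = tree_flatten(item)
            leaves.extend(sub_leaves)
            specs.append(sub_spec)
        return leaves, ("list", None, specs)
    return [obj], ("leaf", None, None)


def tree_unflatten(leaves, spec):
    """Rebuild the structure; returns (value, number_of_leaves_consumed)."""
    kind, keys, specs = spec
    if kind == "leaf":
        return leaves[0], 1
    pos, children = 0, []
    for sub_spec in specs:
        value, used = tree_unflatten(leaves[pos:], sub_spec)
        children.append(value)
        pos += used
    return (dict(zip(keys, children)) if kind == "dict" else children), pos


def roundtrip(sample):
    leaves, spec = tree_flatten(sample)
    return tree_unflatten(leaves, spec)[0]


# A detection-style sample: the more target keys, the more leaves to walk,
# even if a transform only touches "image" and "boxes".
small = {"image": 0, "boxes": 1}
large = {"image": 0, "boxes": 1, "masks": 2, "labels": 3,
         "area": 4, "iscrowd": 5, "image_id": 6, "keypoints": 7}

t_small = timeit.timeit(lambda: roundtrip(small), number=10_000)
t_large = timeit.timeit(lambda: roundtrip(large), number=10_000)
print(f"2 keys: {t_small:.3f}s, 8 keys: {t_large:.3f}s")
```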
BTW, point 1. suggests that we should probably enforce having only one `BoundingBox` instance per sample, instead of multiple ones (#7319).
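The idea behind #7488 can be sketched as a simple per-task key filter applied by the dataset wrapper. The key sets and function names below are illustrative, not torchvision's actual API:

```python
# Hypothetical per-task key sets; torchvision's actual defaults may differ.
TARGET_KEYS = {
    "detection": {"boxes", "labels"},
    "segmentation": {"boxes", "labels", "masks"},
}


def filter_target(target: dict, task: str) -> dict:
    """Drop target entries the task does not need, so downstream
    transforms have fewer pytree leaves to [un]flatten."""
    keep = TARGET_KEYS[task]
    return {k: v for k, v in target.items() if k in keep}


# A full COCO-style target as the wrapper would otherwise return it.
full_target = {
    "boxes": [[0, 0, 10, 10]],
    "labels": [1],
    "masks": [[0] * 4],
    "area": [100],
    "iscrowd": [0],
    "image_id": 42,
}
print(filter_target(full_target, "detection"))
# only "boxes" and "labels" remain for a pure-detection task
```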
Details for points 1. and 2.:
Benchmarks were run from pmeier/detection-reference-benchmark@0ae9027
```
############################################################
detection-ssdlite
############################################################
input_type='PIL', api_version='v2'
Results computed for 1_000 samples
```

With current dataset wrapper (returning all target keys, including masks):

```
                                median         std
WrapCocoSampleForTransformsV2   2389 µs  +-  3534 µs
RandomIoUCrop                   3055 µs  +- 15973 µs
RandomHorizontalFlip             900 µs  +-  4436 µs
PILToTensor                     1057 µs  +-  4692 µs
ConvertDtype                    1303 µs  +-  7031 µs
SanitizeBoundingBox             1605 µs  +-  6134 µs

total                          10309 µs
```

Removing unnecessary target keys, but still including masks:

```
                                median         std
WrapCocoSampleForTransformsV2   2367 µs  +-  3514 µs
RandomIoUCrop                   1352 µs  +- 15292 µs
RandomHorizontalFlip             335 µs  +-   447 µs
PILToTensor                      369 µs  +-   220 µs
ConvertDtype                     408 µs  +-   187 µs
SanitizeBoundingBox              816 µs  +-  1183 µs

total                           5648 µs
```

Removing unnecessary target keys and also masks:

```
                                median         std
WrapCocoSampleForTransformsV2   2349 µs  +-  3454 µs
RandomIoUCrop                    880 µs  +- 10715 µs
RandomHorizontalFlip             226 µs  +-   187 µs
PILToTensor                      351 µs  +-   151 µs
ConvertDtype                     388 µs  +-   173 µs
SanitizeBoundingBox              318 µs  +-   130 µs

total                           4512 µs
```
Details for point 4.:
```
############################################################
detection-ssdlite
############################################################
loading annotations into memory...
Done (t=14.21s)
creating index...
index created!
Caching 1000 ([89444, 73295, 101719] ... [31395, 96727, 47807]) COCO samples
100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:07<00:00, 132.17it/s]

input_type='Tensor', api_version='v1'
Results computed for 1_000 samples

                          median         std
ConvertCocoPolysToMaskV1  4591 µs  +-  6946 µs
PILToTensorV1              497 µs  +-   103 µs
RandomIoUCropV1           1078 µs  +- 16846 µs
RandomHorizontalFlipV1      38 µs  +-   457 µs
ConvertImageDtypeV1        557 µs  +-   303 µs

total                     6761 µs
------------------------------------------------------------
loading annotations into memory...
Done (t=12.38s)
creating index...
index created!
Caching 1000 ([89444, 73295, 101719] ... [31395, 96727, 47807]) COCO samples
100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:07<00:00, 134.83it/s]

input_type='Tensor', api_version='v2'
Results computed for 1_000 samples

                                median         std
WrapCocoSampleForTransformsV2   2364 µs  +-  3673 µs
PILToTensor                      528 µs  +-   221 µs
RandomIoUCrop                    717 µs  +- 10495 µs
RandomHorizontalFlip             300 µs  +-   473 µs
ConvertDtype                     402 µs  +-   297 µs
SanitizeBoundingBox              729 µs  +-  1231 µs

total                           5039 µs
------------------------------------------------------------
loading annotations into memory...
Done (t=13.04s)
creating index...
index created!
Caching 1000 ([89444, 73295, 101719] ... [31395, 96727, 47807]) COCO samples
100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:07<00:00, 141.10it/s]

input_type='PIL', api_version='v1'
Results computed for 1_000 samples

                          median         std
ConvertCocoPolysToMaskV1  4471 µs  +-  6949 µs
RandomIoUCropV1           1051 µs  +- 11616 µs
RandomHorizontalFlipV1      41 µs  +-   410 µs
PILToTensorV1              344 µs  +-   256 µs
ConvertImageDtypeV1        543 µs  +-   389 µs

total                     6450 µs
------------------------------------------------------------
loading annotations into memory...
Done (t=13.48s)
creating index...
index created!
Caching 1000 ([89444, 73295, 101719] ... [31395, 96727, 47807]) COCO samples
100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:07<00:00, 139.58it/s]

input_type='PIL', api_version='v2'
Results computed for 1_000 samples

                                median         std
WrapCocoSampleForTransformsV2   2393 µs  +-  3537 µs
RandomIoUCrop                    896 µs  +- 10620 µs
RandomHorizontalFlip             298 µs  +-   416 µs
PILToTensor                      375 µs  +-   167 µs
ConvertDtype                     421 µs  +-   239 µs
SanitizeBoundingBox              727 µs  +-  1140 µs

total                           5110 µs
------------------------------------------------------------
loading annotations into memory...
Done (t=13.00s)
creating index...
index created!
Caching 1000 ([89444, 73295, 101719] ... [31395, 96727, 47807]) COCO samples
100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:07<00:00, 138.44it/s]

input_type='Datapoint', api_version='v2'
Results computed for 1_000 samples

                                median         std
WrapCocoSampleForTransformsV2   2446 µs  +-  3707 µs
ToImageTensor                    588 µs  +-   162 µs
RandomIoUCrop                    967 µs  +- 15236 µs
RandomHorizontalFlip             350 µs  +-   503 µs
ConvertDtype                     507 µs  +-   328 µs
SanitizeBoundingBox              828 µs  +-  1232 µs

total                           5686 µs
------------------------------------------------------------
```
Summaries:

```
v2 / v1

Tensor  0.75
PIL     0.79
```

Slowdown as row / col:

```
                    [a]   [b]   [c]   [d]   [e]
Tensor,    v1 [a]  1.00  1.34  1.05  1.32  1.19
Tensor,    v2 [b]  0.75  1.00  0.78  0.99  0.89
PIL,       v1 [c]  0.95  1.28  1.00  1.26  1.13
PIL,       v2 [d]  0.76  1.01  0.79  1.00  0.90
Datapoint, v2 [e]  0.84  1.13  0.88  1.11  1.00
```
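The summary ratios are just ratios of the per-pipeline `total` times reported above. A quick sketch of how to reproduce them from those totals:

```python
# Pipeline totals (µs) copied from the benchmark runs above.
totals = {
    "Tensor, v1":    6761,
    "Tensor, v2":    5039,
    "PIL, v1":       6450,
    "PIL, v2":       5110,
    "Datapoint, v2": 5686,
}

# v2 / v1 ratio per input type (lower is faster for v2)
print(round(totals["Tensor, v2"] / totals["Tensor, v1"], 2))  # 0.75
print(round(totals["PIL, v2"] / totals["PIL, v1"], 2))        # 0.79

# Full slowdown matrix: entry (row, col) = total[row] / total[col]
names = list(totals)
for row in names:
    ratios = [round(totals[row] / totals[col], 2) for col in names]
    print(f"{row:14s}", ratios)
```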