22 changes: 11 additions & 11 deletions configs/recognition/swin/README.md
@@ -20,18 +20,18 @@ The vision community is witnessing a modeling shift from CNNs to Transformers, w

### Kinetics-400

| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | inference time(video/s) | gpu_mem(M) | params | config | ckpt | log |
| :---------------------: | :-----------: | :--: | :------: | :----------: | :------: | :------: | :--------------------: | :--------------------: | :--------------: | :---------------------: | :--------: | :----: | :--------: | :------: | :-----: |
| 32x2x1 | short-side 320 | 8 | Swin-T | ImageNet-1k | 78.29 | 93.58 | [78.46](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py) | [93.46](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py) | 4 clips x 3 crop | x | 21072 | 28.2M | [config](/configs/recognition/swin/swin-tiny_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-tiny_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-tiny_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-241016b2.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-tiny_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-tiny_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.log) |
| 32x2x1 | short-side 320 | 8 | Swin-S | ImageNet-1k | 80.23 | 94.32 | [80.23](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_small_patch244_window877_kinetics400_1k.py) | [94.16](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_small_patch244_window877_kinetics400_1k.py) | 4 clips x 3 crop | x | 33632 | 49.8M | [config](/configs/recognition/swin/swin-small_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-small_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-small_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-e91ab986.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-small_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-small_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.log) |
| 32x2x1 | short-side 320 | 8 | Swin-B | ImageNet-1k | 80.21 | 94.32 | [80.27](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_base_patch244_window877_kinetics400_1k.py) | [94.42](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_base_patch244_window877_kinetics400_1k.py) | 4 clips x 3 crop | x | 45143 | 88.0M | [config](/configs/recognition/swin/swin-base_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-base_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-base_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-182ec6cc.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-base_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-base_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.log) |
| 32x2x1 | short-side 320 | 8 | Swin-L | ImageNet-22k | 83.15 | 95.76 | 83.1\* | 95.9\* | 4 clips x 3 crop | x | 68881 | 197.0M | [config](/configs/recognition/swin/swin-large_p244-w877-in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-large_p244-w877-in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-large_p244-w877-in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-78ad8b11.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-large_p244-w877-in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-large_p244-w877-in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.log) |
| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | gpu_mem(M) | FLOPs | params | config | ckpt | log |
| :---------------------: | :------------: | :--: | :------: | :----------: | :------: | :------: | :-----------------------: | :-----------------------: | :---------------: | :--------: | :---: | :----: | :-----------: | :---------: | :---------: |
| 32x2x1 | short-side 320 | 8 | Swin-T | ImageNet-1k | 78.29 | 93.58 | 78.46 \[[VideoSwin](https://github.com/SwinTransformer/Video-Swin-Transformer)\] | 93.46 \[[VideoSwin](https://github.com/SwinTransformer/Video-Swin-Transformer)\] | 4 clips x 3 crops | 21072 | 88G | 28.2M | [config](/configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-241016b2.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.log) |
| 32x2x1 | short-side 320 | 8 | Swin-S | ImageNet-1k | 80.23 | 94.32 | 80.23 \[[VideoSwin](https://github.com/SwinTransformer/Video-Swin-Transformer)\] | 94.16 \[[VideoSwin](https://github.com/SwinTransformer/Video-Swin-Transformer)\] | 4 clips x 3 crops | 33632 | 166G | 49.8M | [config](/configs/recognition/swin/swin-small-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-small-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-small-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-e91ab986.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-small-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-small-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.log) |
| 32x2x1 | short-side 320 | 8 | Swin-B | ImageNet-1k | 80.21 | 94.32 | 80.27 \[[VideoSwin](https://github.com/SwinTransformer/Video-Swin-Transformer)\] | 94.42 \[[VideoSwin](https://github.com/SwinTransformer/Video-Swin-Transformer)\] | 4 clips x 3 crops | 45143 | 282G | 88.0M | [config](/configs/recognition/swin/swin-base-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-base-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-base-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-182ec6cc.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-base-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-base-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.log) |
| 32x2x1 | short-side 320 | 8 | Swin-L | ImageNet-22k | 83.15 | 95.76 | 83.1\* | 95.9\* | 4 clips x 3 crops | 68881 | 604G | 197M | [config](/configs/recognition/swin/swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-78ad8b11.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.log) |

### Kinetics-700

| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | inference time(video/s) | gpu_mem(M) | params | config | ckpt | log |
| :---------------------: | :------------: | :--: | :------: | :----------: | :------: | :------: | :--------------: | :---------------------: | :--------: | :----: | :----------------------: | :--------------------: | :--------------------: |
| 32x2x1 | short-side 320 | 16 | Swin-L | ImageNet-22k | 75.26 | 92.44 | 4 clips x 3 crop | x | 68898 | 197.4M | [config](/configs/recognition/swin/swin-large_p244-w877-in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-large_p244-w877-in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb/swin-large_p244-w877-in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb_20220930-f8d74db7.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-large_p244-w877-in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb/swin-large_p244-w877-in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb.py.log) |
| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | gpu_mem(M) | FLOPs | params | config | ckpt | log |
| :---------------------: | :------------: | :--: | :------: | :----------: | :------: | :------: | :---------------: | :--------: | :---: | :----: | :----------------------------: | :--------------------------: | :-------------------------: |
| 32x2x1 | short-side 320 | 16 | Swin-L | ImageNet-22k | 75.26 | 92.44 | 4 clips x 3 crops | 68898 | 604G | 197M | [config](/configs/recognition/swin/swin-large-p244-w877_in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-large-p244-w877_in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb/swin-large-p244-w877_in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb_20220930-f8d74db7.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-large-p244-w877_in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb/swin-large-p244-w877_in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb.py.log) |

1. The **gpus** column indicates the number of GPUs used to obtain the checkpoint. If you want to use a different number of GPUs or videos per GPU, the best way is to set `--auto-scale-lr` when calling `tools/train.py`; this flag automatically scales the learning rate according to the ratio between the actual batch size and the original batch size (see the sketch after these notes).
2. The values in the "reference" columns are the results obtained by testing the checkpoints released by the authors, with the same model settings, on our dataset. `*` means that the numbers are copied from the paper.
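As a minimal sketch of note 1 (the 4-GPU count and the `tools/dist_train.sh` launcher are illustrative assumptions, not part of this PR):

```shell
# Illustration: train Swin-T on Kinetics-400 with 4 GPUs instead of the 8 used above.
# --auto-scale-lr rescales the learning rate by (actual batch size / original batch size).
bash tools/dist_train.sh \
    configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py \
    4 \
    --auto-scale-lr
```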
@@ -51,7 +51,7 @@ python tools/train.py ${CONFIG_FILE} [optional arguments]
Example: train VideoSwin model on Kinetics-400 dataset in a deterministic option with periodic validation.

```shell
python tools/train.py configs/recognition/swin/swin-tiny_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py \
python tools/train.py configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py \
--cfg-options randomness.seed=0 randomness.deterministic=True
```

@@ -68,7 +68,7 @@ python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Example: test VideoSwin model on Kinetics-400 dataset and dump the result to a pkl file.

```shell
python tools/test.py configs/recognition/swin/swin-tiny_p244-w877-in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py \
python tools/test.py configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py \
checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```
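The dumped `result.pkl` can then be inspected offline. A minimal sketch, assuming the dump is simply a pickled list with one prediction record per test video (the exact record structure depends on the mmaction2 version):

```python
import pickle

# Load the predictions dumped by tools/test.py above.
with open('result.pkl', 'rb') as f:
    results = pickle.load(f)

print(type(results), len(results))  # typically one entry per test video
print(results[0])                   # inspect the first prediction record
```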

122 changes: 122 additions & 0 deletions configs/recognition/swin/metafile.yml
@@ -0,0 +1,122 @@
Collections:
  - Name: Swin
    README: configs/recognition/swin/README.md
    Paper:
      URL: https://arxiv.org/abs/2106.13230
      Title: 'Video Swin Transformer'

Models:
  - Name: swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb
    Config: configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py
    In Collection: Swin
    Metadata:
      Architecture: Swin-T
      Batch Size: 8
      Epochs: 30
      FLOPs: 88G
      Parameters: 28.2M
      Pretrained: ImageNet-1K
      Resolution: short-side 320
      Training Data: Kinetics-400
      Training Resources: 8 GPUs
    Modality: RGB
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 78.29
          Top 5 Accuracy: 93.58
    Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.log
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-241016b2.pth

  - Name: swin-small-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb
    Config: configs/recognition/swin/swin-small-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py
    In Collection: Swin
    Metadata:
      Architecture: Swin-S
      Batch Size: 8
      Epochs: 30
      FLOPs: 166G
      Parameters: 49.8M
      Pretrained: ImageNet-1K
      Resolution: short-side 320
      Training Data: Kinetics-400
      Training Resources: 8 GPUs
    Modality: RGB
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 80.23
          Top 5 Accuracy: 94.32
    Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-small-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-small-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.log
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-small-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-small-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-e91ab986.pth

  - Name: swin-base-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb
    Config: configs/recognition/swin/swin-base-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py
    In Collection: Swin
    Metadata:
      Architecture: Swin-B
      Batch Size: 8
      Epochs: 30
      FLOPs: 282G
      Parameters: 88.0M
      Pretrained: ImageNet-1K
      Resolution: short-side 320
      Training Data: Kinetics-400
      Training Resources: 8 GPUs
    Modality: RGB
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 80.21
          Top 5 Accuracy: 94.32
    Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-base-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-base-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.log
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-base-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-base-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-182ec6cc.pth

  - Name: swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb
    Config: configs/recognition/swin/swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py
    In Collection: Swin
    Metadata:
      Architecture: Swin-L
      Batch Size: 8
      Epochs: 30
      FLOPs: 604G
      Parameters: 197M
      Pretrained: ImageNet-22K
      Resolution: short-side 320
      Training Data: Kinetics-400
      Training Resources: 8 GPUs
    Modality: RGB
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 83.15
          Top 5 Accuracy: 95.76
    Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.log
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-78ad8b11.pth

  - Name: swin-large-p244-w877_in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb
    Config: configs/recognition/swin/swin-large-p244-w877_in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb.py
    In Collection: Swin
    Metadata:
      Architecture: Swin-L
      Batch Size: 8
      Epochs: 30
      FLOPs: 604G
      Parameters: 197M
      Pretrained: ImageNet-22K
      Resolution: short-side 320
      Training Data: Kinetics-700
      Training Resources: 16 GPUs
    Modality: RGB
    Results:
      - Dataset: Kinetics-700
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 75.26
          Top 5 Accuracy: 92.44
    Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-large-p244-w877_in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb/swin-large-p244-w877_in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb.log
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-large-p244-w877_in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb/swin-large-p244-w877_in22k-pre_16xb8-amp-32x2x1-30e_kinetics700-rgb_20220930-f8d74db7.pth
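For a quick sanity check, the config/checkpoint pairs listed in this metafile can be loaded through the high-level inference API. A rough sketch, assuming the `mmaction.apis` entry points `init_recognizer`/`inference_recognizer` and a local `demo.mp4` (both assumptions, not defined in this PR):

```python
from mmaction.apis import inference_recognizer, init_recognizer
from mmaction.utils import register_all_modules

# Register mmaction2 components with the default scope before building the model.
register_all_modules()

config = 'configs/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py'
checkpoint = 'https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb/swin-tiny-p244-w877_in1k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb_20220930-241016b2.pth'

model = init_recognizer(config, checkpoint, device='cpu')  # or 'cuda:0'
result = inference_recognizer(model, 'demo.mp4')  # assumed local test video
print(result)
```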
@@ -1,5 +1,5 @@
_base_ = [
'swin-large_p244-w877-in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py'
'swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py'
]

model = dict(cls_head=dict(num_classes=700))
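The hunk above shows the inheritance pattern used by these configs: the Kinetics-700 config pulls everything from the Kinetics-400 config via `_base_` and overrides only the classification head. As a rough sketch of the same pattern for a hypothetical 101-class dataset (file name and class count are made up for illustration):

```python
# my_swin_101cls.py -- illustrative only, not part of this PR
_base_ = [
    'swin-large-p244-w877_in22k-pre_8xb8-amp-32x2x1-30e_kinetics400-rgb.py'
]

# Only the classifier head changes; the backbone and schedule are inherited.
model = dict(cls_head=dict(num_classes=101))
```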
@@ -15,7 +15,7 @@
# io_backend='petrel',
# path_mapping=dict(
# {'data/kinetics700': 's3://openmmlab/datasets/action/Kinetics700'}))
file_client_args = dict(backend='disk')
file_client_args = dict(io_backend='disk')
train_pipeline = [
dict(type='DecordInit', **file_client_args),
dict(type='SampleFrames', clip_len=32, frame_interval=2, num_clips=1),
1 change: 1 addition & 0 deletions model-index.yml
@@ -13,6 +13,7 @@ Import:
- configs/recognition/tanet/metafile.yml
- configs/recognition/x3d/metafile.yml
- configs/recognition/trn/metafile.yml
- configs/recognition/swin/metafile.yml
- configs/detection/ava/metafile.yml
- configs/detection/acrn/metafile.yml
- configs/skeleton/stgcn/metafile.yml
10 changes: 10 additions & 0 deletions tests/models/recognizers/test_recognizer3d.py
@@ -115,3 +115,13 @@ def test_tpn_slowonly():
loss_vars, _ = train_test_step(config, input_shape=input_shape)
assert 'loss_aux' in loss_vars
assert loss_vars['loss_cls'] + loss_vars['loss_aux'] == loss_vars['loss']


def test_swin():
    register_all_modules()
    config = get_recognizer_cfg('swin/swin-tiny-p244-w877_in1k-pre_'
                                '8xb8-amp-32x2x1-30e_kinetics400-rgb.py')
    config.model['backbone']['pretrained2d'] = False
    config.model['backbone']['pretrained'] = None
    input_shape = (1, 3, 4, 64, 64)  # N C T H W
    train_test_step(config, input_shape=input_shape)