Problem: see notebooks/analysis.ipynb
Currently:
- MAE(downsampled_pil, downsampled_torch) >> 1
- MaxAbsE(downsampled_pil, downsampled_torch) > 100
We would like:
- MAE(downsampled_pil, downsampled_torch) ~ 1
- MaxAbsE(downsampled_pil, downsampled_torch) < 10
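For reference, a minimal sketch of how these two metrics can be computed between the PIL result and the tensor result (the function name and the assumed CHW uint8 layout are illustrative, not the notebook's code):

```python
import numpy as np
import torch
import PIL.Image

def mae_and_max_abs_err(pil_out: PIL.Image.Image, torch_out: torch.Tensor):
    # Compare in float to avoid uint8 wrap-around; torch output assumed CHW uint8.
    a = np.asarray(pil_out).astype(np.float32)                # (H, W, C)
    b = torch_out.permute(1, 2, 0).to(torch.float32).numpy()  # (H, W, C)
    diff = np.abs(a - b)
    return float(diff.mean()), float(diff.max())
```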
Algorithm (PIL implementation)
For the 1D case, an output pixel is computed as
output[ox] = input[xmin + 0] * kernel[x + 0]
output[ox] += input[xmin + 1] * kernel[x + 1]
output[ox] += input[xmin + 2] * kernel[x + 2]
...
output[ox] += input[xmin + n] * kernel[x + n]
where n = ceil(support * scale) * 2 + 1
and
support = 1 # for bilinear
support = 2 # for bicubic
scale = input_size / output_size
center = (ox + 0.5) * scale
xmin = max( round(center - support * scale), 0 )
Kernel values are computed using triangle filtering (bilinear mode)
kernel[x + k] = triangle(...)
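Below is a minimal NumPy sketch of that 1D pass, assuming the triangle (bilinear) filter and normalized weights; names and the exact rounding are illustrative, not a line-for-line port of PIL:

```python
import math
import numpy as np

def triangle(x):
    # Bilinear (triangle) filter: 1 - |x| on [-1, 1], zero elsewhere.
    x = abs(x)
    return 1.0 - x if x < 1.0 else 0.0

def downsample_1d(row, out_size, support=1.0):
    in_size = len(row)
    scale = in_size / out_size                       # > 1 when downsampling
    ksize = int(math.ceil(support * scale)) * 2 + 1  # n in the description above
    out = np.zeros(out_size)
    for ox in range(out_size):
        center = (ox + 0.5) * scale
        xmin = max(int(round(center - support * scale)), 0)
        xmax = min(xmin + ksize, in_size)
        # The filter argument is the distance to the window center divided by the
        # scale, so the kernel footprint grows with the downsampling factor
        # (this is what provides the antialiasing).
        weights = [triangle((x + 0.5 - center) / scale) for x in range(xmin, xmax)]
        norm = sum(weights) or 1.0
        out[ox] = sum(w * row[x] for w, x in zip(weights, range(xmin, xmax))) / norm
    return out
```

(For upsampling, PIL clamps the filter scale at 1 so the kernel keeps its native footprint; that branch is omitted here.)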
Result 1: default cxxflags and non-separable version
PYTHONPATH=/pytorch/ python test.py
Input tensor: [1, 3, 438, 906]
Input is_contiguous memory_format torch.channels_last: true
Input is_contiguous memory_format torch.channels_last_3d: false
Input is_contiguous : false
Output tensor: [1, 3, 196, 320]
Output is_contiguous memory_format torch.channels_last: false
Output is_contiguous memory_format torch.channels_last_3d: false
Output is_contiguous : true
-> Antialias option: scale=2.23469
-> Antialias option: scale=2.83125
Size of indices_weights: 2
- dim 1 size: 14
- dim 2 size: 14
AA TI_SHOW: N=320
AA TI_SHOW: interp_size=7
AA TI_SHOW_STRIDES: 4 0 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 8 4 8 4 8 4 8 4 8 4 8 4 8 4 |
PyTorch vs PIL: Mean Absolute Error: 6.302572250366211
PyTorch vs PIL: Max Absolute Error: 151.0
Proto vs PIL: Mean Absolute Error: 0.5034226179122925
Proto vs PIL: Max Absolute Error: 1.0
Saved downsampled proto output: data/proto_aa_interp_lin_s0_output.png
OMP_NUM_THREADS=6 PYTHONPATH=/pytorch/ python test.py --bench
PyTorch vs PIL: Mean Absolute Error: 6.302572250366211
PyTorch vs PIL: Max Absolute Error: 151.0
Proto vs PIL: Mean Absolute Error: 0.5034226179122925
Proto vs PIL: Max Absolute Error: 1.0
Saved downsampled proto output: data/proto_aa_interp_lin_s0_output.png
Torch config: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- CPU capability usage: AVX2
- Build settings: BUILD_TYPE=Release, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON,
Num threads: 6
[---------- Downsampling: torch.Size([3, 438, 906]) -> (320, 196) -----------]
| PIL 8.1.2 | 1.9.0a0+git8518b0e | aa_interp_lin_s0
6 threads: -------------------------------------------------------------------
channels_first | 2.0 | 1.2 | 10.2
Times are in milliseconds (ms).
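From Result 2 onward the prototype is described as a separable version. For orientation only, a separable antialiased resize can be composed from two 1D passes, e.g. reusing the downsample_1d sketch above; this illustrates the general idea, not the prototype's C++ implementation:

```python
import numpy as np

def downsample_2d_separable(img, out_h, out_w):
    # img: float array of shape (H, W); run the horizontal pass, then the vertical pass.
    tmp = np.stack([downsample_1d(r, out_w) for r in img])       # (H, out_w)
    return np.stack([downsample_1d(c, out_h) for c in tmp.T]).T  # (out_h, out_w)
```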
Result 2: cxxflags `-O3` and separable version
We are using PIL-SIMD here
OMP_NUM_THREADS=1 PYTHONPATH=/pytorch/ python test.py --bench
mem_format: channels_first
is_contiguous: True
PyTorch vs PIL: Mean Absolute Error: 6.302402019500732
PyTorch vs PIL: Max Absolute Error: 151.0
Proto vs PIL: Mean Absolute Error: 0.5035501718521118
Proto vs PIL: Max Absolute Error: 1.0
Saved downsampled proto output: data/proto_aa_interp_lin_s0_output.png
Torch config: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- CPU capability usage: AVX2
- Build settings: BUILD_TYPE=Release, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON,
Num threads: 1
[------------------- Downsampling: torch.Size([3, 438, 906]) -> (320, 196) -------------------]
| PIL 7.0.0.post3 | 1.9.0a0+gitb5647dd | aa_interp_lin_s0
1 threads: ------------------------------------------------------------------------------------
channels_first contiguous | 350.6 | 668.4 | 5630.3
Times are in microseconds (us).
OMP_NUM_THREADS=6 PYTHONPATH=/pytorch/ python test.py --bench
mem_format: channels_first
is_contiguous: True
PyTorch vs PIL: Mean Absolute Error: 6.302402019500732
PyTorch vs PIL: Max Absolute Error: 151.0
Proto vs PIL: Mean Absolute Error: 0.5035501718521118
Proto vs PIL: Max Absolute Error: 1.0
Saved downsampled proto output: data/proto_aa_interp_lin_s0_output.png
Torch config: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- CPU capability usage: AVX2
- Build settings: BUILD_TYPE=Release, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON,
Num threads: 6
[------------------- Downsampling: torch.Size([3, 438, 906]) -> (320, 196) -------------------]
| PIL 7.0.0.post3 | 1.9.0a0+gitb5647dd | aa_interp_lin_s0
6 threads: ------------------------------------------------------------------------------------
channels_first contiguous | 339.9 | 153.6 | 1123.4
Times are in microseconds (us).
Result 3: cxxflags `-O3` and separable version, indices as bounds
We are using PIL-SIMD here
OMP_NUM_THREADS=1 PYTHONPATH=/pytorch/ python test.py --bench --step=step_one
mem_format: channels_first
is_contiguous: True
PyTorch vs PIL: Mean Absolute Error: 6.302402019500732
PyTorch vs PIL: Max Absolute Error: 151.0
Proto vs PIL: Mean Absolute Error: 0.5035501718521118
Proto vs PIL: Max Absolute Error: 1.0
Saved downsampled proto output: data/proto_aa_interp_lin_step_one_output.png
Torch config: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- CPU capability usage: AVX2
- Build settings: BUILD_TYPE=Release, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON,
Num threads: 1
[---------------------- Downsampling: torch.Size([3, 438, 906]) -> (320, 196) ----------------------]
| PIL 7.0.0.post3 | 1.9.0a0+gitb5647dd | aa_interp_lin_step_one
1 threads: ------------------------------------------------------------------------------------------
channels_first contiguous | 351.2 | 666.6 | 3376.9
Times are in microseconds (us).
OMP_NUM_THREADS=6 PYTHONPATH=/pytorch/ python test.py --bench --step=step_one
mem_format: channels_first
is_contiguous: True
PyTorch vs PIL: Mean Absolute Error: 6.302402019500732
PyTorch vs PIL: Max Absolute Error: 151.0
Proto vs PIL: Mean Absolute Error: 0.5035501718521118
Proto vs PIL: Max Absolute Error: 1.0
Saved downsampled proto output: data/proto_aa_interp_lin_step_one_output.png
Torch config: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- CPU capability usage: AVX2
- Build settings: BUILD_TYPE=Release, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON,
Num threads: 6
[---------------------- Downsampling: torch.Size([3, 438, 906]) -> (320, 196) ----------------------]
| PIL 7.0.0.post3 | 1.9.0a0+gitb5647dd | aa_interp_lin_step_one
6 threads: ------------------------------------------------------------------------------------------
channels_first contiguous | 345.7 | 155.0 | 989.0
Times are in microseconds (us).
Result 4: cxxflags `-O3` and separable version, indices as bounds, single weights tensor
We are using PIL-SIMD here
OMP_NUM_THREADS=1 PYTHONPATH=/pytorch/ python test.py --bench --step=step_two
mem_format: channels_first
is_contiguous: True
PyTorch vs PIL: Mean Absolute Error: 6.302402019500732
PyTorch vs PIL: Max Absolute Error: 151.0
Proto vs PIL: Mean Absolute Error: 0.5035501718521118
Proto vs PIL: Max Absolute Error: 1.0
Saved downsampled proto output: data/proto_aa_interp_lin_step_two_output.png
Torch config: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- CPU capability usage: AVX2
- Build settings: BUILD_TYPE=Release, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON,
Num threads: 1
[---------------------- Downsampling: torch.Size([3, 438, 906]) -> (320, 196) ----------------------]
| PIL 7.0.0.post3 | 1.9.0a0+gitb5647dd | aa_interp_lin_step_two
1 threads: ------------------------------------------------------------------------------------------
channels_first contiguous | 345.9 | 670.7 | 2927.0
Times are in microseconds (us).
OMP_NUM_THREADS=6 PYTHONPATH=/pytorch/ python test.py --bench --step=step_two
mem_format: channels_first
is_contiguous: True
PyTorch vs PIL: Mean Absolute Error: 6.302402019500732
PyTorch vs PIL: Max Absolute Error: 151.0
Proto vs PIL: Mean Absolute Error: 0.5035501718521118
Proto vs PIL: Max Absolute Error: 1.0
Saved downsampled proto output: data/proto_aa_interp_lin_step_two_output.png
Torch config: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- CPU capability usage: AVX2
- Build settings: BUILD_TYPE=Release, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON,
Num threads: 6
[---------------------- Downsampling: torch.Size([3, 438, 906]) -> (320, 196) ----------------------]
| PIL 7.0.0.post3 | 1.9.0a0+gitb5647dd | aa_interp_lin_step_two
6 threads: ------------------------------------------------------------------------------------------
channels_first contiguous | 341.7 | 154.1 | 574.9
Times are in microseconds (us).
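For context on the "indices as bounds, single weights tensor" wording: one common layout is to precompute, per output pixel, a (start, length) pair plus one packed row of kernel weights, instead of one tensor per kernel tap. The sketch below shows that layout in PyTorch terms; it illustrates the general technique under assumed names, not the prototype's actual code:

```python
import math
import torch

def precompute_bounds_and_weights(in_size, out_size, support=1.0):
    scale = in_size / out_size
    interp_size = int(math.ceil(support * scale)) * 2 + 1
    bounds = torch.empty(out_size, 2, dtype=torch.int64)  # (xmin, xsize) per output pixel
    weights = torch.zeros(out_size, interp_size)           # one packed weights tensor
    for ox in range(out_size):
        center = (ox + 0.5) * scale
        xmin = max(int(round(center - support * scale)), 0)
        xmax = min(xmin + interp_size, in_size)
        # Triangle filter weights for this output pixel, normalized to sum to 1.
        w = torch.tensor([max(0.0, 1.0 - abs((x + 0.5 - center) / scale))
                          for x in range(xmin, xmax)])
        weights[ox, : xmax - xmin] = w / w.sum()
        bounds[ox, 0], bounds[ox, 1] = xmin, xmax - xmin
    return bounds, weights
```

The inner interpolation loop can then read bounds[ox] to know where to start and how many taps to apply, and index a single contiguous weights row, which is generally friendlier to vectorization than per-tap tensors.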
Result 5: cxxflags `-O3` and separable version, indices as bounds, single weights tensor + optim tricks
We are using PIL-SIMD here
OMP_NUM_THREADS=1 PYTHONPATH=/pytorch/ python test.py --bench --step=step_two_dot_one
mem_format: channels_first
is_contiguous: True
PyTorch vs PIL: Mean Absolute Error: 6.302402019500732
PyTorch vs PIL: Max Absolute Error: 151.0
Proto vs PIL: Mean Absolute Error: 0.5035501718521118
Proto vs PIL: Max Absolute Error: 1.0
Saved downsampled proto output: data/proto_aa_interp_lin_step_two_dot_one_output.png
Torch config: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- CPU capability usage: AVX2
- Build settings: BUILD_TYPE=Release, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON,
Num threads: 1
[-------------------------------------------------------- Downsampling: torch.Size([3, 438, 906]) -> (320, 196) --------------------------------------------------------]
| PIL 7.0.0.post3 | 1.9.0a0+gitb5647dd | aa_interp_lin_step_two_dot_one | aa_interp_lin_step_two_dot_one wo float/byte conversion
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------
channels_first contiguous | 333.3 | 659.9 | 2500.4 | 2175.3
Times are in microseconds (us).
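The "wo float/byte conversion" column excludes the uint8/float round trip around the interpolation itself. As a rough illustration of the excluded step (names are illustrative):

```python
import torch

def uint8_roundtrip(img_u8: torch.Tensor) -> torch.Tensor:
    f = img_u8.to(torch.float32)                     # byte -> float before interpolating
    # ... antialiased interpolation would run on `f` here ...
    return f.round().clamp(0, 255).to(torch.uint8)   # float -> byte afterwards
```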
Result 6: cxxflags `-O3` and single TI version with indices as bounds, single weights tensor + optim tricks
We are using PIL-SIMD here
OMP_NUM_THREADS=1 PYTHONPATH=/pytorch/ python test.py --bench --step=step_three
mem_format: channels_first
is_contiguous: True
PyTorch vs PIL: Mean Absolute Error: 6.302402019500732
PyTorch vs PIL: Max Absolute Error: 151.0
Proto vs PIL: Mean Absolute Error: 0.5035501718521118
Proto vs PIL: Max Absolute Error: 1.0
Saved downsampled proto output: data/proto_aa_interp_lin_step_two_dot_one_output.png
Torch config: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- CPU capability usage: AVX2
- Build settings: BUILD_TYPE=Release, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON,
Num threads: 1
[-------------------------------------------------------- Downsampling: torch.Size([3, 438, 906]) -> (320, 196) --------------------------------------------------------]
| PIL 7.0.0.post3 | 1.9.0a0+gitb5647dd | aa_interp_lin_step_two_dot_one | aa_interp_lin_step_two_dot_one wo float/byte conversion
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------
channels_first contiguous | 335.7 | 661.3 | 3069.2 | 2744.7
Times are in microseconds (us).
Result 7: cxxflags `-O3` and separable version, indices as bounds, single weights tensor + more optim tricks
We are using Pillow without SIMD
OMP_NUM_THREADS=1 PYTHONPATH=/pytorch/ python test.py --bench --step=step_two_dot_two
mem_format: channels_first
is_contiguous: True
PyTorch vs PIL: Mean Absolute Error: 6.3022003173828125
PyTorch vs PIL: Max Absolute Error: 151.0
Proto vs PIL: Mean Absolute Error: 0.5035820603370667
Proto vs PIL: Max Absolute Error: 1.0
Saved downsampled proto output: data/proto_aa_interp_lin_step_two_dot_two_output_320_196.png
mem_format: channels_first
is_contiguous: True
PyTorch vs PIL: Mean Absolute Error: 13.175492286682129
PyTorch vs PIL: Max Absolute Error: 172.0
Proto vs PIL: Mean Absolute Error: 0.5021122694015503
Proto vs PIL: Max Absolute Error: 1.0
Saved downsampled proto output: data/proto_aa_interp_lin_step_two_dot_two_output_120_96.png
Torch config: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- CPU capability usage: AVX2
- Build settings: BUILD_TYPE=Release, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON,
Num threads: 1
[----------------------------------------------------- Downsampling: torch.Size([3, 438, 906]) -> (320, 196) -----------------------------------------------------]
| PIL 8.2.0 | 1.9.0a0+gitb5647dd | aa_interp_lin_step_two_dot_two | aa_interp_lin_step_two_dot_two wo float/byte conversion
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------
channels_first contiguous | 1842.8 | 664.0 | 2272.6 | 1938.4
Times are in microseconds (us).
[------------------------------------------------------ Downsampling: torch.Size([3, 438, 906]) -> (120, 96) -----------------------------------------------------]
| PIL 8.2.0 | 1.9.0a0+gitb5647dd | aa_interp_lin_step_two_dot_two | aa_interp_lin_step_two_dot_two wo float/byte conversion
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------
channels_first contiguous | 1296.6 | 350.2 | 1918.2 | 1674.0
Times are in microseconds (us).
- https://tcapelle.github.io/capeblog/pytorch/fastai/2021/02/26/image_resizing.html
- https://medium.com/@wenrudong/what-is-opencvs-inter-area-actually-doing-282a626a09b3
Docker container setup
docker run --rm -it \
--name=tv-interpolate \
-v $PWD:/interpolate-antialiasing \
-w /interpolate-antialiasing \
--network=host --security-opt seccomp:unconfined --privileged --shm-size 16G \
nvidia/cuda:11.1-cudnn8-devel-ubuntu20.04 \
/bin/bash
apt-get update && ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime && \
apt-get install -y tzdata && \
dpkg-reconfigure --frontend noninteractive tzdata && \
apt-get install -y git cmake python3 python3-pip numactl && \
ln -s /usr/bin/python3 /usr/bin/python && \
pip install numpy typing_extensions Pillow ninja expecttest
- Install PyTorch nightly with CUDA support
pip install --pre torch -f https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html
- Install linux perf
echo "deb http://archive.ubuntu.com/ubuntu/ bionic main universe\n" >> /etc/apt/sources.list && \
apt-get update && apt-get install -y linux-tools-4.15.0-20-generic linux-tools-4.15.0-20 linux-tools-4.15.0-20-lowlatency && \
rm -rf /usr/bin/perf && \
ln -s /usr/lib/linux-tools-4.15.0-20/perf /usr/bin/perf
- Install Pillow-SIMD
apt-get install -y libpng-dev libjpeg-turbo8-dev
pip uninstall -y pillow && CC="cc -mavx2" pip install -U --force-reinstall pillow-simd
- Debug segmentation faults
apt-get install gdb
OMP_NUM_THREADS=1 PYTHONPATH=/pytorch/ gdb --args python test.py --step=step_two_dot_two
b step_two_dot_two/aa_interpolation_impl.h:134
run
- To activate tui: ctrl+x -> ctrl+a
- To switch focus between window and cmd: fs n
- Run torchvision test with ASAN
- extra_compile_args = {'cxx': []}
+ extra_compile_args = {'cxx': ["-fsanitize=address", "-fno-omit-frame-pointer"]}
python setup.py develop
LD_PRELOAD=/usr/lib/gcc/x86_64-linux-gnu/9/libasan.so python test_resize_aa.py