
Create some benchmarks #1


Open
timholy opened this issue Sep 14, 2021 · 12 comments

@timholy
Member

timholy commented Sep 14, 2021

I think it may be getting close to time to promote JuliaImages a bit more heavily. I still have a few things I want to do first wrt cleaning up the code base (getting a release of Images.jl out that is compatible with ImageCore 0.9 is high on the list, as is the abs2 issue). But it will take some time to assemble good benchmarks, and I think it's best to do that out in the open so that others can pitch in. @mkitti, you'll note from the README that I'd love to have Fiji benchmarks; is that by any chance something you'd be able and interested in contributing?

FWIW, here's where we are on my machine today:

[plot: "generic" benchmarks]

[plot: "specialized" benchmarks]

Note the log scaling. The "generic" tests (top) are expressed per voxel; the "specialized" tests (bottom) are absolute.

Together with OpenCV, we pretty much dominate already (in one case by more than three orders of magnitude), but there are a few exceptions that merit investigation. (That Gaussian blur algorithm in OpenCV is worth emulating...) Missing points generally represent unsupported operations; the "specialized" suite for OpenCV is still a WIP, but they support very little 3D, so a lot of points will be missing anyway.

This also doesn't capture some of our other advantages, like scaling to big data. I've been playing a bit with dask but it seems like a pretty far cry from what we offer. scikit-image has a habit of immediately converting everything to a numpy array.

@timholy
Member Author

timholy commented Sep 23, 2021

Time for a fun update. This does rely on some stuff that either hasn't been released yet or is still local on my machine:
[plot: "generics" benchmarks]

[plot: "special" benchmarks]

and one that now combines both kinds of benchmarking:

[plot: "compact" combined benchmarks]

OpenCV's performance advantage has, on average, essentially been erased, and I believe there is more to come.

@Tokazama

Tokazama commented Sep 23, 2021

I'm assuming things are going to get better as LoopVectorization becomes easier to incorporate?

@timholy
Member Author

timholy commented Sep 23, 2021

Yep. The RGB filtering benchmarks (blur and gradient, aka "big kernels" and "small kernels" respectively) still aren't using LoopVectorization. It looks like OpenCV struggles a bit there too, and I think it's possible that a really good implementation may ultimately beat them almost across the board. Even the grayscale ones could use more work, as my thread utilization is hovering at around 67%, but I haven't taken a close look. But beating Matlab & skimage by an order of magnitude on a lot of benchmarks is already pretty satisfying.
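For context, the kind of loop nest LoopVectorization targets can be sketched like this. This is a hypothetical `filter3x3!` doing a "valid"-region 3×3 correlation on a grayscale Float32 image, not the actual ImageFiltering internals; `out` is assumed to be 2 smaller than `img` in each dimension:

```julia
using LoopVectorization  # provides @turbo

# Hypothetical sketch (not ImageFiltering's implementation): a "valid"-region
# 3x3 correlation with the whole loop nest handed to LoopVectorization.
function filter3x3!(out::AbstractMatrix, img::AbstractMatrix, w::AbstractMatrix)
    @turbo for j in axes(out, 2), i in axes(out, 1)
        s = zero(eltype(out))
        for b in 1:3, a in 1:3
            s += w[a, b] * img[i + a - 1, j + b - 1]
        end
        out[i, j] = s
    end
    return out
end
```

Replacing `@turbo` with `@tturbo` would additionally use LoopVectorization's own threading, which is the "multithreading paradigm" referred to above.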

@timholy
Member Author

timholy commented Sep 23, 2021

Really want ImageJ benchmarks too, as that's kind of the standard in much of biology.

@Tokazama

Would be nice to have ITK benchmarks too, but that's pretty specialized and the last time I used ITK I nearly ripped out my hair.

@timholy
Member Author

timholy commented Sep 23, 2021

I checked a decade or so ago, when I was still using Matlab and looking for something faster, and to my surprise Matlab blew ITK away on most things I tried. That said, it has been a decade... I agree, ITK would be quite nice if at all possible.

@mkitti
Member

mkitti commented Sep 23, 2021

I am looking into the ImgLib2 benchmarks (ImageJ2 / FIJI). Another technology to benchmark against is the OpenCL-based CLIJ2, which is available as an ImageJ plugin as well as on other platforms (Python, MATLAB). For ITK, perhaps SimpleITK (https://simpleitk.org/) would be appropriate.

timholy added a commit to JuliaImages/ImageMorphology.jl that referenced this issue Sep 24, 2021
If Julia is being run with multiple threads, this now will default
to using all threads in `feature_transform`. The multithreaded
implementation is relatively straightforward, although the recursive algorithm
makes it a little hard to wrap your head around:

- `computeft!` for dimension `d` is dependent only on slices of dimensions `1:d`,
  and indeed for all except the final call to `voronoift!` just on dimensions `1:d-1`.
  Hence you can divide the last dimension `N` into chunks and give each to a thread.

- For the final `voronoift!` call along dimension `N`, it's a
  one-dimensional operation along this dimension. Hence you can just
  split the next-to-last dimension into chunks instead.

This change was responsible for performance improvements in `distance_transform`
noted in JuliaImages/image_benchmarks#1.
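The chunking strategy the commit message describes can be sketched as follows. This is an illustrative helper under assumed names (`process_chunks!` is hypothetical, not the actual ImageMorphology.jl code): split the last dimension of an array into per-thread slabs and hand each to its own task.

```julia
using Base.Threads: @spawn

# Illustrative sketch (not ImageMorphology.jl's code): divide the last
# dimension into roughly equal chunks, run f! on each chunk's view in a
# separate task, and wait for all of them.
function process_chunks!(f!, A::AbstractArray{T,N}) where {T,N}
    n = size(A, N)
    nchunks = min(Threads.nthreads(), n)
    ranges = [round(Int, (i - 1) * n / nchunks) + 1 : round(Int, i * n / nchunks)
              for i in 1:nchunks]
    tasks = map(ranges) do r
        # each task owns a disjoint slab along the last dimension
        @spawn f!(view(A, ntuple(_ -> Colon(), N - 1)..., r))
    end
    foreach(wait, tasks)
    return A
end
```

Because the slabs along the last dimension are disjoint, the tasks never write to overlapping memory, which is what makes the commit's per-dimension split safe.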
timholy added further commits with the same message to JuliaImages/ImageMorphology.jl referencing this issue on Sep 24–25, 2021, and johnnychen94 did likewise on May 21, 2022; the later versions additionally note "Julia 1.3 is required for `@spawn`" and "Co-authored-by: Johnny Chen <[email protected]>".
@RoyiAvital

I am not sure about Blur in those benchmarks, as Gaussian blur can be applied in many forms with different trade-offs between quality and run time.

But even working with a simple kernel, I can't reproduce the results above on the latest version of Images.jl (currently 0.25.2).

For instance:

using Images;
using BenchmarkTools;

# Images
mA = rand(RGB{Float32}, 2048, 2048);

# Kernels
mW = Float32.(centered([0.25 0 0.25; 0 -1 0; 0.25 0 0.25])); #<! The Filter

@btime imfilter(mA, mW, "replicate", ImageFiltering.Algorithm.FIR());

I get:

julia> @btime imfilter(mA, mW, "replicate", ImageFiltering.Algorithm.FIR());
  66.387 ms (30 allocations: 96.13 MiB)

On MATLAB:

mA = rand(2048, 2048, 3, 'single');
mW = [0.25, 0, 0.25; 0, -1, 0; 0.25, 0, 0.25];
hF = @() imfilter(mA, mW, 'replicate');
timeit(hF)

I get:

ans =

    0.0113

So MATLAB is faster on my machine.

@timholy
Member Author

timholy commented Nov 29, 2022

Thanks for the benchmarks! A few thoughts:

  • Currently we only multithread separable kernels, and yours isn't. That's not a difficult fix and could be added by anyone; it's just that my own work only requires separable kernels, so that's where my priorities have been.
  • Using an @SMatrix for the kernel should be a bit better; care to try?
  • In local tests, LoopVectorization makes filtering a whole lot better, especially for SoA layouts. LoopVectorization has its own multithreading paradigm, so it might simply be best to transition to LV.

@RoyiAvital

RoyiAvital commented Nov 29, 2022

OK, I chose a non-separable kernel on purpose; for a separable kernel I'd probably use conv2() in MATLAB. Let me try it and report back.

@RoyiAvital

Just a sample point with a static array:

using Images;
using StaticArrays;
using StaticKernels;
using BenchmarkTools;

# Images
mA = rand(RGB{Float32}, 2048, 2048);
mB = rand(Float32, 2048, 2048, 3);

# Kernels
# mW = @SMatrix Float32.(centered([0.25 0 0.25; 0 -1 0; 0.25 0 0.25])); #<! The Filter
mW = centered(SMatrix{3, 3, Float32}([0.25 0 0.25; 0 -1 0; 0.25 0 0.25])); #<! The Filter
mK = @kernel w -> 0.25 * w[-1, -1] + 0.25 * w[-1, 1] - w[0, 0] + 0.25 * w[1, -1] + 0.25 * w[1, 1];

# Pre Allocated Results
mAO = Array{RGB{Float32}}(undef, 2048, 2048);
mBO = Array{Float32, 3}(undef, 2048, 2048, 3);

# Planar
@btime begin
    for ii in 1:3
        @views map!(mK, mBO[:, :, ii], extend(mB[:, :, ii], StaticKernels.ExtensionReplicate()));
    end
end

# Packed
@btime imfilter!(mAO, mA, mW, "replicate", ImageFiltering.Algorithm.FIR());
# @btime imfilter(mA, mW, "replicate", ImageFiltering.Algorithm.FIR());

What I get is:

  13.101 ms (39 allocations: 1.22 KiB)
  59.660 ms (28 allocations: 48.13 MiB)

So I don't see any improvement.
In a few minutes I will share the comparison to MATLAB on a separable kernel.

@RoyiAvital

RoyiAvital commented Nov 29, 2022

OK, here it is for a simple separable filter:

using Images;
using StaticArrays;
using StaticKernels;
using BenchmarkTools;

mA = rand(RGB{Float32}, 2048, 2048);

vW = [0.2f0 -0.5f0 0.3f0];
mW = vW' * vW;

vW = centered(vW);
mW = centered(mW);

@btime imfilter(mA, mW, "replicate", ImageFiltering.Algorithm.FIR());
@btime imfilter(mA, (vW', vW), "replicate", ImageFiltering.Algorithm.FIR());

I get ~85 ms for both (which makes sense, since according to the documentation imfilter() checks for separability).
The strange thing is that this is slower than the previous kernel.
I guess in the previous case it optimized out the zero coefficients. Could that be?
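As a side note, the separability that imfilter() detects can be checked directly: a 2-D kernel is separable exactly when it has rank 1, and the SVD recovers the 1-D factors. This is an illustrative helper (`separable_factors` is a hypothetical name, not ImageFiltering's own check):

```julia
using LinearAlgebra

# Illustrative sketch: a 2-D kernel W is separable iff rank(W) == 1;
# the leading singular triple then gives factors u, v with W ≈ u * v'.
function separable_factors(W::AbstractMatrix)
    rank(W) == 1 || return nothing
    F = svd(W)
    u = F.U[:, 1] .* sqrt(F.S[1])
    v = F.V[:, 1] .* sqrt(F.S[1])
    return u, v
end
```

For `mW = vW' * vW` above this recovers (up to scale and sign) the two 1-D factors, which is exactly what the `imfilter(mA, (vW', vW), ...)` call passes explicitly.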

In MATLAB:

mA = rand(2048, 2048, 3, 'single');
vW = [0.2, -0.5, 0.3];
mW = vW.' * vW;
hF = @() imfilter(mA, mW, 'replicate');
timeit(hF)

I get ~8 ms: about 10 times faster, and indeed faster than the previous kernel for MATLAB.

By the way, in MATLAB conv2() is even more performant if one is only after the valid part of the convolution.
Something like:

function [ mO ] = Conv2DRgb( vW, mI )

mO = zeros(size(mI), 'like', mI);

for ii = 1:3
    mO(:, :, ii) = conv2(vW, vW', mI(:, :, ii), 'same');
end

end

The 'valid' version is faster by a further ~10%, though above, for simplicity, I used 'same' (which is still faster).

Just for reference, I use MATLAB R2022b on Windows 10.
