Following our Slack discussion:
`thread = True()` makes the broadcasting in the time integration significantly faster on multiple threads, but it's not used in the elixirs.
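For context, a minimal sketch of how this option is passed to the integrator in an elixir, assuming the `CarpenterKennedy2N54` algorithm and the usual `ode`/`callbacks` variables from the elixir (the exact integrator and options are illustrative):

```julia
using OrdinaryDiffEq

# Pass `thread = OrdinaryDiffEq.True()` to the algorithm so that its internal
# broadcasts (u-updates etc.) are multithreaded via Polyester.jl:
sol = solve(ode,
            CarpenterKennedy2N54(williamson_condition = false,
                                 thread = OrdinaryDiffEq.True());
            dt = 1.0, # will be overwritten by the stepsize_callback
            save_everystep = false, callback = callbacks);
```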
Without multithreaded time integration:
```
julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_source_terms.jl", initial_refinement_level=6, save_solution=nothing)
 ───────────────────────────────────────────────────────────────────────────────────
             Trixi.jl                      Time                    Allocations
                                  ───────────────────────   ────────────────────────
        Tot / % measured:              4.68s /  94.0%           12.6MiB /  82.4%

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────────────────
 perform step                 914    4.26s   96.6%  4.66ms   7.61MiB   73.4%  8.52KiB
   ~perform step~             914    2.71s   61.5%  2.96ms      672B    0.0%    0.74B
   rhs!                     4.57k    1.55s   35.1%   339μs   7.61MiB   73.4%  1.70KiB
     reset ∂u/∂t            4.57k    489ms   11.1%   107μs     0.00B    0.0%    0.00B
     volume integral        4.57k    441ms   10.0%  96.5μs   1.32MiB   12.8%     304B
     source terms           4.57k    212ms    4.8%  46.5μs   1.26MiB   12.1%     288B
     interface flux         4.57k    120ms    2.7%  26.3μs   1.46MiB   14.1%     336B
     prolong2interfaces     4.57k    107ms    2.4%  23.4μs   1.32MiB   12.8%     304B
     surface integral       4.57k   72.0ms    1.6%  15.8μs   1.19MiB   11.4%     272B
     ~rhs!~                 4.57k   70.0ms    1.6%  15.3μs   4.78KiB    0.0%    1.07B
     Jacobian               4.57k   34.6ms    0.8%  7.58μs   1.05MiB   10.1%     240B
     prolong2mortars        4.57k    327μs    0.0%  71.6ns     0.00B    0.0%    0.00B
     prolong2boundaries     4.57k    258μs    0.0%  56.4ns     0.00B    0.0%    0.00B
     mortar flux            4.57k    250μs    0.0%  54.7ns     0.00B    0.0%    0.00B
     boundary flux          4.57k    139μs    0.0%  30.4ns     0.00B    0.0%    0.00B
 calculate dt                 915   97.5ms    2.2%   107μs    257KiB    2.4%     288B
 analyze solution              11   50.1ms    1.1%  4.55ms   2.50MiB   24.1%   233KiB
 ───────────────────────────────────────────────────────────────────────────────────
```
With `thread = True()`:
```
julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_source_terms.jl", initial_refinement_level=6, save_solution=nothing)
────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi.jl simulation finished.  Final time: 2.0  Time steps: 914 (accepted), 914 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

 ───────────────────────────────────────────────────────────────────────────────────
             Trixi.jl                      Time                    Allocations
                                  ───────────────────────   ────────────────────────
        Tot / % measured:              1.34s /  77.7%           12.6MiB /  82.4%

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────────────────
 perform step                 914    986ms   94.5%  1.08ms   7.61MiB   73.4%  8.52KiB
   rhs!                     4.57k    793ms   76.0%   174μs   7.61MiB   73.4%  1.70KiB
     volume integral        4.57k    206ms   19.8%  45.2μs   1.32MiB   12.8%     304B
     source terms           4.57k    186ms   17.8%  40.8μs   1.26MiB   12.1%     288B
     interface flux         4.57k    134ms   12.9%  29.4μs   1.46MiB   14.1%     336B
     prolong2interfaces     4.57k   86.2ms    8.3%  18.9μs   1.32MiB   12.8%     304B
     ~rhs!~                 4.57k   69.1ms    6.6%  15.1μs   4.78KiB    0.0%    1.07B
     surface integral       4.57k   62.2ms    6.0%  13.6μs   1.19MiB   11.4%     272B
     reset ∂u/∂t            4.57k   24.9ms    2.4%  5.44μs     0.00B    0.0%    0.00B
     Jacobian               4.57k   22.7ms    2.2%  4.96μs   1.05MiB   10.1%     240B
     mortar flux            4.57k    222μs    0.0%  48.5ns     0.00B    0.0%    0.00B
     prolong2mortars        4.57k    205μs    0.0%  44.9ns     0.00B    0.0%    0.00B
     prolong2boundaries     4.57k    197μs    0.0%  43.1ns     0.00B    0.0%    0.00B
     boundary flux          4.57k    140μs    0.0%  30.6ns     0.00B    0.0%    0.00B
   ~perform step~             914    193ms   18.5%   211μs      672B    0.0%    0.74B
 analyze solution              11   48.4ms    4.6%  4.40ms   2.50MiB   24.1%   233KiB
 calculate dt                 915   9.15ms    0.9%  10.0μs    257KiB    2.4%     288B
 ───────────────────────────────────────────────────────────────────────────────────
```
As @JoshuaLampert pointed out:

> If we decide to use `thread = True()` by default, we could also add it to the `ode_default_options` instead of putting them manually into each elixir.
As @ranocha pointed out:

> And issues when switching to Base threads since OrdinaryDiffEq.jl will still use Polyester.jl.

We should check these aspects before switching everything.
For my multi-GPU prototype of TrixiParticles, I built a custom wrapper data type with a custom broadcasting style to make sure that all broadcasting in the time integration itself is also done on multiple GPUs with `@threaded`.
I can easily adapt this to have a `ThreadedBroadcastArray` data type that defines broadcasting with `Trixi.@threaded`. This way, all time integration schemes are automatically multithreaded in the same way as the rest of Trixi, even if (when?) we move away from Polyester.jl.
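A minimal sketch of what such a wrapper type could look like, assuming the ODE state is a plain `Vector` as returned by `compute_coefficients` (all names and details here are illustrative, not the actual prototype):

```julia
using Trixi
using Base.Broadcast: Broadcasted, ArrayStyle

# Hypothetical sketch: a vector wrapper whose broadcasts are executed
# with Trixi.@threaded instead of a serial loop.
struct ThreadedBroadcastArray{T, V <: AbstractVector{T}} <: AbstractVector{T}
    array::V
end

Base.size(x::ThreadedBroadcastArray) = size(x.array)
Base.@propagate_inbounds Base.getindex(x::ThreadedBroadcastArray, i::Int) = x.array[i]
Base.@propagate_inbounds function Base.setindex!(x::ThreadedBroadcastArray, v, i::Int)
    x.array[i] = v
    return x
end

# Custom broadcast style: any broadcast involving a ThreadedBroadcastArray
# dispatches to the `similar` and `copyto!` methods below.
Base.BroadcastStyle(::Type{<:ThreadedBroadcastArray}) = ArrayStyle{ThreadedBroadcastArray}()

function Base.similar(bc::Broadcasted{ArrayStyle{ThreadedBroadcastArray}}, ::Type{T}) where {T}
    return ThreadedBroadcastArray(similar(Vector{T}, axes(bc)))
end

function Base.copyto!(dest::ThreadedBroadcastArray,
                      bc::Broadcasted{ArrayStyle{ThreadedBroadcastArray}})
    # `preprocess` takes care of aliasing checks and index extrusion
    bc′ = Broadcast.preprocess(dest.array, bc)
    Trixi.@threaded for i in eachindex(dest.array)
        @inbounds dest.array[i] = bc′[i]
    end
    return dest
end
```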
We only need to change

```julia
u0_ode = compute_coefficients(first(tspan), semi)
```

to

```julia
u0_ode_ = compute_coefficients(first(tspan), semi)
u0_ode = ThreadedBroadcastArray(u0_ode_)
```
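For illustration, a typical integrator-style broadcast on the wrapped state would then run through the threaded `copyto!` above (toy example with hypothetical values):

```julia
# Toy example: broadcasts on the wrapped state run with Trixi.@threaded.
u  = ThreadedBroadcastArray(rand(10^6))
du = ThreadedBroadcastArray(rand(10^6))
dt = 1.0e-3

u .= u .+ dt .* du
```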