
Performance regression in RxInfer code on current master #50704

Closed
@bvdmitri

Description

We have a set of small benchmarks to quickly test our code in RxInfer. The aim of the package is to run efficient Bayesian inference, potentially on low-power, low-memory devices like a Raspberry Pi. We just noticed that on Julia 1.10 we have quite a noticeable GC regression. Consider this notebook. It is not an MWE, but it computes Bayesian posteriors in a simple linear Gaussian state-space probabilistic model. There are two settings (sketched in code below):

  • Filtering: for each time step $t$, use observations up to time step $t$.
  • Smoothing: for each time step $t$, use observations up to a final time step $T > t$.

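For concreteness, here is a minimal sketch of the two settings on a scalar linear Gaussian state-space model. This is illustrative only, not the notebook's RxInfer code; the model $x_t = x_{t-1} + w_t$, $y_t = x_t + v_t$ and the noise variances q and r are assumptions.

# Illustrative only -- not the notebook's RxInfer code. Scalar linear Gaussian
# state-space model x_t = x_{t-1} + w_t, y_t = x_t + v_t, with assumed
# process/observation noise variances q and r.
function kalman_filter(ys, q, r; m0 = 0.0, p0 = 1.0)
    ms, ps = similar(ys), similar(ys)
    m, p = m0, p0
    for (t, y) in enumerate(ys)
        p = p + q                 # predict x_t given y_{1:t-1}
        k = p / (p + r)           # Kalman gain
        m = m + k * (y - m)       # filtering mean, uses observations y_{1:t}
        p = (1 - k) * p           # filtering variance
        ms[t], ps[t] = m, p
    end
    return ms, ps
end

# Smoothing adds a backward (RTS) pass, so each estimate uses all T observations.
function rts_smoother(ms, ps, q)
    sm, sp = copy(ms), copy(ps)
    for t in length(ms)-1:-1:1
        g = ps[t] / (ps[t] + q)   # smoother gain
        sm[t] = ms[t] + g * (sm[t+1] - ms[t])
        sp[t] = ps[t] + g^2 * (sp[t+1] - (ps[t] + q))
    end
    return sm, sp
end
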
Here are the results on the current Julia release:

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)

julia> @benchmark run_filtering($datastream, $n, $v)
BenchmarkTools.Trial: 1504 samples with 1 evaluation.
 Range (min … max):  2.633 ms … 13.932 ms  ┊ GC (min … max): 0.00% … 69.28%
 Time  (median):     3.073 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.319 ms ±  1.058 ms  ┊ GC (mean ± σ):  7.08% ± 13.05%

   ▅▇▇██▇▅▃▂                                                 ▁
  ██████████▇▇▅▇█▇▇▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▇██▇▆█▇▆▄▄▅ █
  2.63 ms      Histogram: log(frequency) by time     7.92 ms <

 Memory estimate: 2.35 MiB, allocs estimate: 63823.

julia> @benchmark run_smoothing($data, $n, $v)
BenchmarkTools.Trial: 288 samples with 1 evaluation.
 Range (min … max):  13.868 ms … 29.987 ms  ┊ GC (min … max):  0.00% … 35.63%
 Time  (median):     15.545 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   17.411 ms ±  3.975 ms  ┊ GC (mean ± σ):  10.81% ± 14.33%

   ▄▃█▁▄▅▁
  ▇███████▇▆▅▅▃▃▄▂▄▃▂▁▃▃▁▁▁▁▁▁▁▁▁▂▃▅▃▃▅▅▄▃▂▃▃▃▃▃▂▂▁▄▂▁▃▁▁▂▂▂▂ ▃
  13.9 ms         Histogram: frequency by time        28.4 ms <

 Memory estimate: 10.05 MiB, allocs estimate: 220417.

Here are the results on 1.10-beta1:

julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)

julia> @benchmark run_filtering($datastream, $n, $v)
BenchmarkTools.Trial: 1308 samples with 1 evaluation.
 Range (min … max):  3.260 ms … 78.207 ms  ┊ GC (min … max): 0.00% … 94.71%
 Time  (median):     3.479 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.818 ms ±  3.293 ms  ┊ GC (mean ± σ):  6.64% ±  7.41%

      ▄▆██▅▁
  ▂▃▄▇██████▇▅▅▃▃▃▃▃▂▂▃▃▃▃▃▃▂▃▃▂▂▂▁▂▂▂▂▁▂▁▂▂▁▂▂▁▁▁▂▂▂▂▂▂▂▂▂▂ ▃
  3.26 ms        Histogram: frequency by time        4.94 ms <

 Memory estimate: 2.51 MiB, allocs estimate: 69824.

julia> @benchmark run_smoothing($data, $n, $v)
BenchmarkTools.Trial: 291 samples with 1 evaluation.
 Range (min … max):  15.160 ms … 88.841 ms  ┊ GC (min … max): 0.00% … 79.71%
 Time  (median):     15.757 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.336 ms ±  7.862 ms  ┊ GC (mean ± σ):  7.05% ± 11.57%

  █▅▁
  █████▇▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▄ ▅
  15.2 ms      Histogram: log(frequency) by time      57.9 ms <

 Memory estimate: 10.12 MiB, allocs estimate: 222915.

As you can see, in the case of run_filtering the maximum time jumped from 13 ms to 78 ms, and the GC max indicates a jump from 69% to 94%. In the case of run_smoothing the situation is similar: the maximum time jumped from 29 ms to 88 ms, and the GC max jumped from 35% to 79%.

The inference procedure allocates a lot of intermediate "messages" in the form of distributions from the Distributions.jl package, but does not use any sampling. Instead, it computes analytical solutions for the posteriors. These analytical solutions also rely on dynamic multiple dispatch in many places. Eliminating dynamic dispatch is not really an option; that is just how the package works, and it was quite efficient until now.
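
To illustrate the pattern (a hypothetical sketch, not RxInfer internals; gaussprod is an invented helper): messages are distributions of varying concrete types stored behind an abstract type, so operations on them are resolved by runtime dispatch and every result is a fresh heap allocation.

using Distributions

# Invented helper: the (normalized) product of two Gaussian densities,
# i.e. an analytical posterior update with no sampling involved.
gaussprod(a::Normal, b::Normal) = Normal(
    (mean(a) * var(b) + mean(b) * var(a)) / (var(a) + var(b)),
    sqrt(var(a) * var(b) / (var(a) + var(b))))

# Messages live behind the abstract Distribution type, so the call below is
# resolved by dynamic dispatch, and the result is a new heap allocation.
messages = Distribution[Normal(0.0, 1.0), Normal(1.0, 2.0)]
posterior = gaussprod(messages[1], messages[2])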

The major difference between the two functions is that run_filtering allocates a lot of information (messages) that is not used afterwards and could probably be freed right away, while run_smoothing retains this information until the end of the procedure. You can also see that the resulting minimum execution time is worse in both cases.
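
A minimal sketch of the two allocation patterns, with hypothetical stand-ins for the real message types; the point is that the GC sees very different heaps in the two cases:

# filtering_style produces short-lived garbage that dies immediately;
# smoothing_style keeps every allocation alive until the function returns.
function filtering_style(n)
    acc = 0.0
    for i in 1:n
        msg = [float(i)]          # allocated, read once, then dead
        acc += msg[1]
    end
    return acc
end

function smoothing_style(n)
    msgs = Vector{Vector{Float64}}()
    for i in 1:n
        push!(msgs, [float(i)])   # retained until the end of the procedure
    end
    return sum(first, msgs)
end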

I think this is quite a severe regression, especially for the filtering case, which is supposed to run real-time Bayesian inference with as few GC pauses as possible. We can of course refine our code base, but in the meantime, can this be improved in general? What can cause this? How should we proceed and debug this? How can we help figure it out further?
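
For anyone reproducing this, stock tooling already surfaces the GC behavior; a minimal sketch, assuming the notebook's run_filtering/run_smoothing and their inputs are in scope:

julia> GC.enable_logging(true)    # print a line per collection (available since Julia 1.8)

julia> run_filtering(datastream, n, v);

julia> GC.enable_logging(false)

julia> stats = @timed run_smoothing(data, n, v);

julia> (stats.gctime, stats.bytes)   # seconds spent in GC and bytes allocated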

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 12 × Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 12 virtual cores


Metadata

Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)
