
Performance regression in RxInfer code on current master #50704

Closed
@bvdmitri

Description

We have a set of small benchmarks to quickly test our code in RxInfer. The aim of the package is to run efficient Bayesian inference, potentially on low-power, low-memory devices like a Raspberry Pi. We just noticed that on Julia 1.10 we have quite a noticeable GC regression. Consider this notebook. It is not an MWE, but it computes Bayesian posteriors in a simple linear Gaussian state-space probabilistic model. There are two settings (sketched in code below):

  • Filtering: for each time step $t$, use observations up to time step $t$.
  • Smoothing: for each time step $t$, use observations up to a final time step $T > t$.

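For concreteness, here is a minimal sketch of the two settings on a scalar linear Gaussian state-space model. This is illustrative only, not the notebook's RxInfer code; the model $x_t = x_{t-1} + w_t$, $y_t = x_t + v_t$ and the noise variances q and r are assumptions.

# Illustrative only -- not the notebook's RxInfer code. Scalar linear Gaussian
# state-space model x_t = x_{t-1} + w_t, y_t = x_t + v_t, with assumed
# process/observation noise variances q and r.
function kalman_filter(ys, q, r; m0 = 0.0, p0 = 1.0)
    ms, ps = similar(ys), similar(ys)
    m, p = m0, p0
    for (t, y) in enumerate(ys)
        p = p + q                 # predict x_t given y_{1:t-1}
        k = p / (p + r)           # Kalman gain
        m = m + k * (y - m)       # filtering mean, uses observations y_{1:t}
        p = (1 - k) * p           # filtering variance
        ms[t], ps[t] = m, p
    end
    return ms, ps
end

# Smoothing adds a backward (RTS) pass, so each estimate uses all T observations.
function rts_smoother(ms, ps, q)
    sm, sp = copy(ms), copy(ps)
    for t in length(ms)-1:-1:1
        g = ps[t] / (ps[t] + q)   # smoother gain
        sm[t] = ms[t] + g * (sm[t+1] - ms[t])
        sp[t] = ps[t] + g^2 * (sp[t+1] - (ps[t] + q))
    end
    return sm, sp
end
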
Here are the results on the current Julia release:

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)

julia> @benchmark run_filtering($datastream, $n, $v)
BenchmarkTools.Trial: 1504 samples with 1 evaluation.
 Range (min … max):  2.633 ms … 13.932 ms  ┊ GC (min … max): 0.00% … 69.28%
 Time  (median):     3.073 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.319 ms ±  1.058 ms  ┊ GC (mean ± σ):  7.08% ± 13.05%

   ▅▇▇██▇▅▃▂                                                 ▁
  ██████████▇▇▅▇█▇▇▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▇██▇▆█▇▆▄▄▅ █
  2.63 ms      Histogram: log(frequency) by time     7.92 ms <

 Memory estimate: 2.35 MiB, allocs estimate: 63823.

julia> @benchmark run_smoothing($data, $n, $v)
BenchmarkTools.Trial: 288 samples with 1 evaluation.
 Range (min … max):  13.868 ms … 29.987 ms  ┊ GC (min … max):  0.00% … 35.63%
 Time  (median):     15.545 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   17.411 ms ±  3.975 ms  ┊ GC (mean ± σ):  10.81% ± 14.33%

   ▄▃█▁▄▅▁
  ▇███████▇▆▅▅▃▃▄▂▄▃▂▁▃▃▁▁▁▁▁▁▁▁▁▂▃▅▃▃▅▅▄▃▂▃▃▃▃▃▂▂▁▄▂▁▃▁▁▂▂▂▂ ▃
  13.9 ms         Histogram: frequency by time        28.4 ms <

 Memory estimate: 10.05 MiB, allocs estimate: 220417.

Here are the results on 1.10-beta1:

julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)

julia> @benchmark run_filtering($datastream, $n, $v)
BenchmarkTools.Trial: 1308 samples with 1 evaluation.
 Range (min … max):  3.260 ms … 78.207 ms  ┊ GC (min … max): 0.00% … 94.71%
 Time  (median):     3.479 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.818 ms ±  3.293 ms  ┊ GC (mean ± σ):  6.64% ±  7.41%

      ▄▆██▅▁
  ▂▃▄▇██████▇▅▅▃▃▃▃▃▂▂▃▃▃▃▃▃▂▃▃▂▂▂▁▂▂▂▂▁▂▁▂▂▁▂▂▁▁▁▂▂▂▂▂▂▂▂▂▂ ▃
  3.26 ms        Histogram: frequency by time        4.94 ms <

 Memory estimate: 2.51 MiB, allocs estimate: 69824.

julia> @benchmark run_smoothing($data, $n, $v)
BenchmarkTools.Trial: 291 samples with 1 evaluation.
 Range (min … max):  15.160 ms … 88.841 ms  ┊ GC (min … max): 0.00% … 79.71%
 Time  (median):     15.757 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.336 ms ±  7.862 ms  ┊ GC (mean ± σ):  7.05% ± 11.57%

  █▅▁
  █████▇▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▄ ▅
  15.2 ms      Histogram: log(frequency) by time      57.9 ms <

 Memory estimate: 10.12 MiB, allocs estimate: 222915.

As you can see, in the case of run_filtering the maximum time jumped from 13 ms to 78 ms, and the GC max indicates a jump from 69% to 94%. In the case of run_smoothing the situation is similar: the maximum time jumped from 29 ms to 88 ms, and the GC max jumped from 35% to 79%.

The inference procedure allocates a lot of intermediate "messages" in the form of distributions from the Distributions.jl package, but does not use any sampling. Instead, it computes analytical solutions for the posteriors. These analytical solutions also rely on dynamic multiple dispatch in many places. Eliminating dynamic dispatch is not really an option; that is just how the package works, and it was quite efficient until now.
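
To illustrate the pattern (a hypothetical sketch, not RxInfer internals; gaussprod is an invented helper): messages are distributions of varying concrete types stored behind an abstract type, so operations on them are resolved by runtime dispatch and every result is a fresh heap allocation.

using Distributions

# Invented helper: the (normalized) product of two Gaussian densities,
# i.e. an analytical posterior update with no sampling involved.
gaussprod(a::Normal, b::Normal) = Normal(
    (mean(a) * var(b) + mean(b) * var(a)) / (var(a) + var(b)),
    sqrt(var(a) * var(b) / (var(a) + var(b))))

# Messages live behind the abstract Distribution type, so the call below is
# resolved by dynamic dispatch, and the result is a new heap allocation.
messages = Distribution[Normal(0.0, 1.0), Normal(1.0, 2.0)]
posterior = gaussprod(messages[1], messages[2])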

The major difference between the two functions is that run_filtering allocates a lot of information (messages) that is not used afterwards and could probably be freed right away, while run_smoothing retains this information until the end of the procedure. You can also see that the resulting minimum execution time is worse in both cases.
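
A minimal sketch of the two allocation patterns, with hypothetical stand-ins for the real message types; the point is that the GC sees very different heaps in the two cases:

# filtering_style produces short-lived garbage that dies immediately;
# smoothing_style keeps every allocation alive until the function returns.
function filtering_style(n)
    acc = 0.0
    for i in 1:n
        msg = [float(i)]          # allocated, read once, then dead
        acc += msg[1]
    end
    return acc
end

function smoothing_style(n)
    msgs = Vector{Vector{Float64}}()
    for i in 1:n
        push!(msgs, [float(i)])   # retained until the end of the procedure
    end
    return sum(first, msgs)
end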

I think this is quite a severe regression, especially for the filtering case, which is supposed to run real-time Bayesian inference with as few GC pauses as possible. We can of course refine our code base, but in the meantime, can this be improved in general? What can cause this? How should we proceed and debug this? How can we help figure it out further?
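
For anyone reproducing this, stock tooling already surfaces the GC behavior; a minimal sketch, assuming the notebook's run_filtering/run_smoothing and their inputs are in scope:

julia> GC.enable_logging(true)    # print a line per collection (available since Julia 1.8)

julia> run_filtering(datastream, n, v);

julia> GC.enable_logging(false)

julia> stats = @timed run_smoothing(data, n, v);

julia> (stats.gctime, stats.bytes)   # seconds spent in GC and bytes allocated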

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 12 × Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 12 virtual cores


Metadata

Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)
