Description
This is mostly a condensed repost of #44019. I was told that julia 1.7.1 was a "very old release to be relying on threading," but the issue still occurs on the latest nightly.
Running the following code will sometimes segfault. It is more common with 32 threads, but will sometimes occur with 16 or fewer threads.
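For reference (not part of the original report): the crash only shows up when Julia is started with multiple threads, so a quick sanity check before running the reproducer might look like this hypothetical snippet.
using Base.Threads
# Hypothetical check, not from the report: the reproducer relies on Julia being
# started with many threads (e.g. `julia -t 32`).
@assert nthreads() >= 16 "start Julia with more threads, e.g. julia -t 32"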
Current Code
using Base.Iterators
using Base.Threads
using Serialization
using Distributions
# Bin values in [lo, hi] into nbins equal-width bins and return them as clamped UInt8 indices.
function bin_data(data, lo, hi, nbins)
    dx = (hi - lo) / nbins
    bins = ((data .- lo) ./ dx) .|> floor
    bins = UInt8.(bins)
    clamp.(bins, UInt8(0), UInt8(nbins))
end
l = SpinLock()
# Write data to a temporary file, compress it with xz, and return the compressed size in bytes.
# tempname() is guarded by the spinlock.
function compress_data(data)
    lock(l)
    tmpfn = tempname()
    unlock(l)
    write(tmpfn, data)
    run(
        pipeline(
            `xz -9e --keep --format=raw --suffix=.xz $(tmpfn)`,
            stdout = devnull,
            stderr = devnull,
        ),
    )
    nbytes = filesize(tmpfn * ".xz")
    rm(tmpfn * ".xz")
    rm(tmpfn)
    return nbytes
end
compressed_size_bytes(data) = compress_data(data)
compressed_size_bits(data) = compress_data(data) * 8
# Draw n exponential waiting times with mean (k + Γ) / (k * Γ).
function emission_times_exp(n, k, Γ)
    η = (k + Γ) / (k * Γ)
    dist = Exponential(η)
    rand(dist, n)
end

# Keep each event independently with probability γ and return the lag times
# between the surviving events.
function lose_data(lagtimes, γ)
    @assert(all(lagtimes .>= 0.0))
    ind = Int[]
    fixed_times = cumsum(lagtimes)
    for i = 1:length(lagtimes)
        x = rand()
        if x < γ
            push!(ind, i)
        end
    end
    detected_times = fixed_times[ind]
    detected_times |> diff
end
ns = [100_000, 1_000_000, 10_000_000]
# ns = [1_000] # testing only
ks = [0.1, 0.5, 1.0, 5.0, 10.0]
Γs = [0.1, 0.5, 1.0, 5.0, 10.0]
γs = range(0.1, 1.0, step = 0.1)
ntrials = 1000
smrates = Iterators.product(ks, Γs) |> collect |> vec
l = SpinLock()
@threads for trialnum = 1:ntrials
    data = Dict()
    for p in smrates
        (k, Γ) = p
        for n in ns
            # nm_times = get_emission_dt(n, k, Γ)
            # mar_times = emission_times_exp(n, k, Γ)
            nm_times = 10.0 .* rand(n)
            mar_times = 10.0 .* rand(n)
            for γ in γs
                nm_lost = lose_data(nm_times, γ)
                mar_lost = lose_data(mar_times, γ)
                hi = max(maximum(nm_lost), maximum(mar_lost))
                @assert(all(nm_lost .>= 0.0))
                @assert(all(mar_lost .>= 0.0))
                nm_binned = bin_data(nm_lost, 0.0, hi, 100)
                mar_binned = bin_data(mar_lost, 0.0, hi, 100)
                nm_size = compressed_size_bytes(nm_binned)
                mar_size = compressed_size_bytes(mar_binned)
                experiment_index = (n = n, k = k, Γ = Γ, γ = γ, trial = trialnum)
                try
                    lock(l)
                    data[experiment_index] = (1.0, 1.0)
                finally
                    unlock(l)
                end
            end
        end
    end
end
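Side note (an editorial observation, not part of the original report): the reproducer acquires the spinlock inside the try block. That is unrelated to the segfault, but the usual Julia pattern acquires the lock before entering try, or uses the do-block form, so that finally only releases a lock that was actually acquired. A minimal sketch:
lk = ReentrantLock()

# Acquire before `try` so that `finally` only runs unlock after the lock is held.
lock(lk)
try
    # ... critical section ...
finally
    unlock(lk)
end

# Equivalent do-block form, which pairs lock/unlock automatically.
lock(lk) do
    # ... critical section ...
end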
Output of `versioninfo()`
Julia Version 1.9.0-DEV.109
Commit 3a47c1c4e1 (2022-03-01 20:58 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
Threads: 1 on 32 virtual cores
Stack trace from the one run (out of many) where the program managed to print one before dying
The line that originates in my code (line 65, at the top of the trace) is just the @threads loop.
signal (11): Segmentation fault
in expression starting at /mnt/ssd-data/Experiments/02-2022/simple-photon-model/code/buggy/gen_paramsweep_segfault.jl:65
jl_uv_call_close_callback at /buildworker/worker/package_linux64/build/src/jl_uv.c:83 [inlined]
jl_uv_closeHandle at /buildworker/worker/package_linux64/build/src/jl_uv.c:106
uv__finish_close at /workspace/srcdir/libuv/src/unix/core.c:301
uv__run_closing_handles at /workspace/srcdir/libuv/src/unix/core.c:315
uv_run at /workspace/srcdir/libuv/src/unix/core.c:393
ijl_process_events at /buildworker/worker/package_linux64/build/src/jl_uv.c:210
ijl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:598
poptask at ./task.jl:884
wait at ./task.jl:893
wait at ./condition.jl:124
wait at ./process.jl:647
success at ./process.jl:509
Exception: /tmp/julia-3a47c1c4e1/bin/julia killed by signal segmentation fault (core dumped)
A valgrind report from the previous issue (taken from this comment):
snellius paulm@tcn116 17:23 ~$ valgrind --smc-check=all-non-file --suppressions=$HOME/c/julia-git/contrib/valgrind-julia.supp ~/software/julia-1.7.1/bin/julia -t 128 segfault.jl
==1988794== Memcheck, a memory error detector
==1988794== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1988794== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1988794== Command: /home/paulm/software/julia-1.7.1/bin/julia -t 128 segfault.jl
==1988794==
--1988794-- WARNING: unhandled amd64-linux syscall: 1008
--1988794-- You may be able to write your own handler.
--1988794-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--1988794-- Nevertheless we consider this a bug. Please report
--1988794-- it at http://valgrind.org/support/bug_reports.html.
==1988794== Warning: client switching stacks? SP change: 0x311d17d8 --> 0xe2a5fff8
==1988794== to suppress, use: --max-stackframe=2978539552 or greater
==1988794== Warning: invalid file descriptor -1 in syscall close()
==1988794== Warning: invalid file descriptor -1 in syscall close()
==1988794== Warning: client switching stacks? SP change: 0x30bce7d8 --> 0xef554ff8
==1988794== to suppress, use: --max-stackframe=3197659168 or greater
==1988794== Warning: client switching stacks? SP change: 0x34dd57d8 --> 0xf3958ff8
==1988794== to suppress, use: --max-stackframe=3199744032 or greater
==1988794== further instances of this message will not be shown.
==1988794== Thread 3:
==1988794== Syscall param write(buf) points to uninitialised byte(s)
==1988794== at 0x4F4A52D: syscall (in /usr/lib64/libc-2.28.so)
==1988794== Address 0xef54e000 is in a rw- anonymous segment
==1988794==
==1988794== Syscall param write(buf) points to unaddressable byte(s)
==1988794== at 0x4F4A52D: syscall (in /usr/lib64/libc-2.28.so)
==1988794== Address 0xef54e000 is in a rw- anonymous segment
==1988794==
==1988794== Thread 80:
==1988794== Invalid read of size 8
==1988794== at 0x5B87E84: maybe_collect (julia_threads.h:325)
==1988794== by 0x5B87E84: jl_gc_big_alloc (gc.c:947)
==1988794== Address 0xfffffffffe49ff10 is not stack'd, malloc'd or (recently) free'd
==1988794==
==1988794== Thread 88:
==1988794== Invalid read of size 8
==1988794== at 0x5B7CC12: jl_gc_state_set (julia_threads.h:325)
==1988794== by 0x5B7CC12: jl_task_get_next (partr.c:523)
==1988794== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1988794==
==1988794==
==1988794== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==1988794== Access not within mapped region at address 0x0
==1988794== at 0x5B7CC12: jl_gc_state_set (julia_threads.h:325)
==1988794== by 0x5B7CC12: jl_task_get_next (partr.c:523)
==1988794== If you believe this happened as a result of a stack
==1988794== overflow in your program's main thread (unlikely but
==1988794== possible), you can try to increase the size of the
==1988794== main thread stack using the --main-stacksize= flag.
==1988794== The main thread stack size used in this run was 16777216.
==1988794==
==1988794== HEAP SUMMARY:
==1988794== in use at exit: 603,293,029 bytes in 49,164 blocks
==1988794== total heap usage: 1,022,326 allocs, 973,162 frees, 3,449,011,340 bytes allocated
==1988794==
==1988794== LEAK SUMMARY:
==1988794== definitely lost: 163 bytes in 12 blocks
==1988794== indirectly lost: 0 bytes in 0 blocks
==1988794== possibly lost: 1,273,212 bytes in 12,672 blocks
==1988794== still reachable: 602,018,922 bytes in 36,477 blocks
==1988794== of which reachable via heuristic:
==1988794== newarray : 56,448 bytes in 10 blocks
==1988794== multipleinheritance: 7,992 bytes in 15 blocks
==1988794== suppressed: 732 bytes in 3 blocks
==1988794== Rerun with --leak-check=full to see details of leaked memory
==1988794==
==1988794== Use --track-origins=yes to see where uninitialised values come from
==1988794== For lists of detected and suppressed errors, rerun with: -s
==1988794== ERROR SUMMARY: 15 errors from 4 contexts (suppressed: 46 from 6)
Segmentation fault
An rr trace of the bug under Julia 1.7.1 can be found here. However, I have not been able to catch the program crashing under rr on the nightly build: trying to use BugReporting.jl simply runs the program to successful completion, and running julia itself under rr has not produced any useful results yet (it has been running for over 24 hours at this point with no apparent progress).
Activity
chipbuster commented on Mar 4, 2022
Running under an external rr has crashed with an internal rr error. It looks like this is an rr issue and not the actual Julia crash (the error message was about the wrong signal being received for a stop), but I have uploaded this trace in case it winds up being useful: https://www.dropbox.com/s/ln805f9tfq3tsau/julia-rr-crash-nightly.tar.zst?dl=0
process: ensure uvfinalize and _uv_close_cb are synchronized (#44476)
dpinol commented on Apr 22, 2022
I also get a crash in my application when pushing into a local array variable inside Threads.@threads.
I then tested the code from this issue's description, and it still crashes on my Ubuntu machine on both 1.8.0-beta3 and a Julia built from master yesterday. @chipbuster, can you reproduce the crash with current Julia versions?
This is what I get with valgrind on my application:
This is what I get without valgrind on my application with julia 1.9-dev:
With Julia 1.8-beta3
KristofferC commented on Apr 22, 2022
Could you be a bit more explicit (perhaps with a code snippet)? As you describe it, that sounds like a data race.
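For context (an illustration added here, not part of the original comment): the classic data race with @threads is pushing into one shared vector from every task without synchronization, whereas a lock-protected container is safe. A minimal sketch:
using Base.Threads

# Racy: push! on a shared Vector from many tasks is unsynchronized mutation
# and can corrupt the vector or crash.
shared = Float64[]
@threads for i in 1:10_000
    push!(shared, rand())   # data race
end

# Safe: guard the shared vector with a lock.
safe = Float64[]
lk = ReentrantLock()
@threads for i in 1:10_000
    r = rand()
    lock(lk) do
        push!(safe, r)
    end
end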
dpinol commented on Apr 22, 2022
The crash involves a local variable only accessed from a single thread. The snippet below illustrates what my application does, but Julia never crashes with the snippet (nor does valgrind report any issue).
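The snippet dpinol posted is not preserved in this extract; a hypothetical minimal sketch of the pattern described (each iteration pushing into its own local vector inside Threads.@threads, with no sharing between tasks) might look like:
using Base.Threads

# Hypothetical illustration only, not dpinol's actual code.
function push_locally(niter, m)
    results = Vector{Vector{Float64}}(undef, niter)
    @threads for i in 1:niter
        acc = Float64[]          # local to this iteration; no other task touches it
        for _ in 1:m
            push!(acc, rand())
        end
        results[i] = acc         # each iteration writes a distinct slot
    end
    return results
end

push_locally(100, 10_000)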
dpinol commented on Apr 22, 2022
I uploaded the result of running #44460 (comment) with julia 1.9-dev with julia --bug-report=rr-local --project=. --threads=32 (5 Gb 😿; link edited with the packed rr data).
https://www.dropbox.com/s/ia8b159h59x9uv4/issue-44460-rr.tgz?dl=0
vtjnash commented on Apr 22, 2022
It looks like you need to run rr pack before uploading that trace (should only make it slightly bigger).
dpinol commented on Apr 22, 2022
chipbuster commented on Apr 23, 2022
Yeah, the example snippet is not really the best because it generates a lot of data in the rr trace and doesn't always crash quickly.
Something that I have found sometimes persuades it to crash more quickly is clearing all caches and buffers on Linux before starting a recording, but it's not really foolproof.
vtjnash commented on Apr 26, 2022
@dpinol there doesn't seem to be anything remarkable in your trace that I can see. The process runs for a while (about 4138 iterations of the loop happen), then something external terminated it with a SIGKILL (9).
chipbuster commented on May 1, 2022
@dpinol I'm not able to reproduce the crash running with the latest Julia.
Based on the description provided by vtjnash, it's possible that my example is crashing for you simply because it runs out of memory.
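(Not from the thread: one way to check the out-of-memory hypothesis is to log free memory while the reproducer runs; Base's Sys.free_memory and Sys.total_memory report this in bytes.) A minimal sketch:
# Hypothetical monitoring task, added for illustration only.
monitor = @async while true
    free_gib = Sys.free_memory() / 2^30
    total_gib = Sys.total_memory() / 2^30
    println(stderr, "free memory: $(round(free_gib; digits=2)) / $(round(total_gib; digits=2)) GiB")
    sleep(5)
end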
dpinol commented on May 5, 2022
Yes, I agree that in this case it runs out of memory. I created a different MWE at #45196. Thanks.