Description
This is mostly a condensed repost of #44019. I was told that Julia 1.7.1 was a "very old release to be relying on threading," but the issue still occurs on the latest nightly.
Running the following code sometimes segfaults. The crash is more common with 32 threads, but it also occurs occasionally with 16 or fewer threads.
Current Code
using Base.Iterators
using Base.Threads
using Serialization
using Distributions
function bin_data(data, lo, hi, nbins)
    dx = (hi - lo) / nbins
    bins = ((data .- lo) ./ dx) .|> floor
    bins = UInt8.(bins)
    clamp.(bins, UInt8(0), UInt8(nbins))
end
l = SpinLock()
function compress_data(data)
    lock(l)
    tmpfn = tempname()
    unlock(l)
    write(tmpfn, data)
    run(
        pipeline(
            `xz -9e --keep --format=raw --suffix=.xz $(tmpfn)`,
            stdout = devnull,
            stderr = devnull,
        ),
    )
    nbytes = filesize(tmpfn * ".xz")
    rm(tmpfn * ".xz")
    rm(tmpfn)
    return nbytes
end
compressed_size_bytes(data) = compress_data(data)
compressed_size_bits(data) = compress_data(data) * 8
function emission_times_exp(n, k, Γ)
    η = (k + Γ) / (k * Γ)
    dist = Exponential(η)
    rand(dist, n)
end
function lose_data(lagtimes, γ)
    @assert(all(lagtimes .>= 0.0))
    ind = Int[]
    fixed_times = cumsum(lagtimes)
    for i = 1:length(lagtimes)
        x = rand()
        if x < γ
            push!(ind, i)
        end
    end
    detected_times = fixed_times[ind]
    detected_times |> diff
end
ns = [100_000, 1_000_000, 10_000_000]
# ns = [1_000] # testing only
ks = [0.1, 0.5, 1.0, 5.0, 10.0]
Γs = [0.1, 0.5, 1.0, 5.0, 10.0]
γs = range(0.1, 1.0, step = 0.1)
ntrials = 1000
smrates = Iterators.product(ks, Γs) |> collect |> vec
l = SpinLock()
@threads for trialnum = 1:ntrials
    data = Dict()
    for p in smrates
        (k, Γ) = p
        for n in ns
            # nm_times = get_emission_dt(n, k, Γ)
            # mar_times = emission_times_exp(n, k, Γ)
            nm_times = 10.0 .* rand(n)
            mar_times = 10.0 .* rand(n)
            for γ in γs
                nm_lost = lose_data(nm_times, γ)
                mar_lost = lose_data(mar_times, γ)
                hi = max(maximum(nm_lost), maximum(mar_lost))
                @assert(all(nm_lost .>= 0.0))
                @assert(all(mar_lost .>= 0.0))
                nm_binned = bin_data(nm_lost, 0.0, hi, 100)
                mar_binned = bin_data(mar_lost, 0.0, hi, 100)
                nm_size = compressed_size_bytes(nm_binned)
                mar_size = compressed_size_bytes(mar_binned)
                experiment_index = (n = n, k = k, Γ = Γ, γ = γ, trial = trialnum)
                try
                    lock(l)
                    data[experiment_index] = (1.0, 1.0)
                finally
                    unlock(l)
                end
            end
        end
    end
end
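For anyone trying to narrow this down: the stack trace in this report passes through `wait at ./process.jl` and libuv's handle-closing code, which suggests (but does not confirm) that spawning external processes from threaded tasks is involved. The following is only a sketch of a reduced test under that assumption, not a verified reproducer. The function name `spawn_many` is hypothetical, and the `true` command stands in for `xz` so that no temporary files or other packages are needed.

```julia
using Base.Threads

# Hypothetical reduced test: spawn one short-lived external process per
# loop iteration and do nothing else. `true` is the standard Unix no-op command.
function spawn_many(ntrials)
    completed = Atomic{Int}(0)
    @threads for i in 1:ntrials
        # Same run/pipeline shape as compress_data above, minus the file I/O.
        run(pipeline(`true`, stdout = devnull, stderr = devnull))
        atomic_add!(completed, 1)
    end
    return completed[]
end
```

Calling `spawn_many(100_000)` with `julia -t 32` would exercise only the process-spawning path. Whether it crashes as well is untested here; if it does not, the allocation-heavy parts of the original loop are the next candidates.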
Output of `versioninfo()`
Julia Version 1.9.0-DEV.109
Commit 3a47c1c4e1 (2022-03-01 20:58 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
Threads: 1 on 32 virtual cores
Stack trace from the one time in many that the program managed to print one before dying
The line that originates in my code (line 65 at the top) is just the @threads loop.
signal (11): Segmentation fault
in expression starting at /mnt/ssd-data/Experiments/02-2022/simple-photon-model/code/buggy/gen_paramsweep_segfault.jl:65
jl_uv_call_close_callback at /buildworker/worker/package_linux64/build/src/jl_uv.c:83 [inlined]
jl_uv_closeHandle at /buildworker/worker/package_linux64/build/src/jl_uv.c:106
uv__finish_close at /workspace/srcdir/libuv/src/unix/core.c:301
uv__run_closing_handles at /workspace/srcdir/libuv/src/unix/core.c:315
uv_run at /workspace/srcdir/libuv/src/unix/core.c:393
ijl_process_events at /buildworker/worker/package_linux64/build/src/jl_uv.c:210
ijl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:598
poptask at ./task.jl:884
wait at ./task.jl:893
wait at ./condition.jl:124
wait at ./process.jl:647
success at ./process.jl:509
Exception: /tmp/julia-3a47c1c4e1/bin/julia killed by signal segmentation fault (core dumped)
A valgrind report from the previous issue
Taken from this comment
snellius paulm@tcn116 17:23 ~$ valgrind --smc-check=all-non-file --suppressions=$HOME/c/julia-git/contrib/valgrind-julia.supp ~/software/julia-1.7.1/bin/julia -t 128 segfault.jl
==1988794== Memcheck, a memory error detector
==1988794== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1988794== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1988794== Command: /home/paulm/software/julia-1.7.1/bin/julia -t 128 segfault.jl
==1988794==
--1988794-- WARNING: unhandled amd64-linux syscall: 1008
--1988794-- You may be able to write your own handler.
--1988794-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--1988794-- Nevertheless we consider this a bug. Please report
--1988794-- it at http://valgrind.org/support/bug_reports.html.
==1988794== Warning: client switching stacks? SP change: 0x311d17d8 --> 0xe2a5fff8
==1988794== to suppress, use: --max-stackframe=2978539552 or greater
==1988794== Warning: invalid file descriptor -1 in syscall close()
==1988794== Warning: invalid file descriptor -1 in syscall close()
==1988794== Warning: client switching stacks? SP change: 0x30bce7d8 --> 0xef554ff8
==1988794== to suppress, use: --max-stackframe=3197659168 or greater
==1988794== Warning: client switching stacks? SP change: 0x34dd57d8 --> 0xf3958ff8
==1988794== to suppress, use: --max-stackframe=3199744032 or greater
==1988794== further instances of this message will not be shown.
==1988794== Thread 3:
==1988794== Syscall param write(buf) points to uninitialised byte(s)
==1988794== at 0x4F4A52D: syscall (in /usr/lib64/libc-2.28.so)
==1988794== Address 0xef54e000 is in a rw- anonymous segment
==1988794==
==1988794== Syscall param write(buf) points to unaddressable byte(s)
==1988794== at 0x4F4A52D: syscall (in /usr/lib64/libc-2.28.so)
==1988794== Address 0xef54e000 is in a rw- anonymous segment
==1988794==
==1988794== Thread 80:
==1988794== Invalid read of size 8
==1988794== at 0x5B87E84: maybe_collect (julia_threads.h:325)
==1988794== by 0x5B87E84: jl_gc_big_alloc (gc.c:947)
==1988794== Address 0xfffffffffe49ff10 is not stack'd, malloc'd or (recently) free'd
==1988794==
==1988794== Thread 88:
==1988794== Invalid read of size 8
==1988794== at 0x5B7CC12: jl_gc_state_set (julia_threads.h:325)
==1988794== by 0x5B7CC12: jl_task_get_next (partr.c:523)
==1988794== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1988794==
==1988794==
==1988794== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==1988794== Access not within mapped region at address 0x0
==1988794== at 0x5B7CC12: jl_gc_state_set (julia_threads.h:325)
==1988794== by 0x5B7CC12: jl_task_get_next (partr.c:523)
==1988794== If you believe this happened as a result of a stack
==1988794== overflow in your program's main thread (unlikely but
==1988794== possible), you can try to increase the size of the
==1988794== main thread stack using the --main-stacksize= flag.
==1988794== The main thread stack size used in this run was 16777216.
==1988794==
==1988794== HEAP SUMMARY:
==1988794== in use at exit: 603,293,029 bytes in 49,164 blocks
==1988794== total heap usage: 1,022,326 allocs, 973,162 frees, 3,449,011,340 bytes allocated
==1988794==
==1988794== LEAK SUMMARY:
==1988794== definitely lost: 163 bytes in 12 blocks
==1988794== indirectly lost: 0 bytes in 0 blocks
==1988794== possibly lost: 1,273,212 bytes in 12,672 blocks
==1988794== still reachable: 602,018,922 bytes in 36,477 blocks
==1988794== of which reachable via heuristic:
==1988794== newarray : 56,448 bytes in 10 blocks
==1988794== multipleinheritance: 7,992 bytes in 15 blocks
==1988794== suppressed: 732 bytes in 3 blocks
==1988794== Rerun with --leak-check=full to see details of leaked memory
==1988794==
==1988794== Use --track-origins=yes to see where uninitialised values come from
==1988794== For lists of detected and suppressed errors, rerun with: -s
==1988794== ERROR SUMMARY: 15 errors from 4 contexts (suppressed: 46 from 6)
Segmentation fault
An rr trace of the bug under Julia 1.7.1 can be found here. However, I have not been able to catch the crash under rr with the nightly build: with BugReporting.jl the program simply runs to completion, while running julia itself under rr has not produced any useful results yet (it has been running for over 24 hours at this point with no apparent progress).