Description
This is mostly a condensed repost of #44019. I was told that julia 1.7.1 was a "very old release to be relying on threading," but the issue still occurs on the latest nightly.
Running the following code will sometimes segfault. It is more common with 32 threads, but will sometimes occur with 16 or fewer threads.
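For reference (not part of the original report): the crash only shows up when Julia is started with multiple threads, so a quick sanity check before running the reproducer might look like this hypothetical snippet.
using Base.Threads
# Hypothetical check, not from the report: the reproducer relies on Julia being
# started with many threads (e.g. `julia -t 32`).
@assert nthreads() >= 16 "start Julia with more threads, e.g. julia -t 32"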
Current Code
using Base.Iterators
using Base.Threads
using Serialization
using Distributions
# Bin values in [lo, hi] into nbins equal-width bins and return them as clamped UInt8 indices.
function bin_data(data, lo, hi, nbins)
    dx = (hi - lo) / nbins
    bins = ((data .- lo) ./ dx) .|> floor
    bins = UInt8.(bins)
    clamp.(bins, UInt8(0), UInt8(nbins))
end
l = SpinLock()
# Write data to a temporary file, compress it with xz, and return the compressed size in bytes.
# tempname() is guarded by the spinlock.
function compress_data(data)
    lock(l)
    tmpfn = tempname()
    unlock(l)
    write(tmpfn, data)
    run(
        pipeline(
            `xz -9e --keep --format=raw --suffix=.xz $(tmpfn)`,
            stdout = devnull,
            stderr = devnull,
        ),
    )
    nbytes = filesize(tmpfn * ".xz")
    rm(tmpfn * ".xz")
    rm(tmpfn)
    return nbytes
end
compressed_size_bytes(data) = compress_data(data)
compressed_size_bits(data) = compress_data(data) * 8
# Draw n exponential waiting times with mean (k + Γ) / (k * Γ).
function emission_times_exp(n, k, Γ)
    η = (k + Γ) / (k * Γ)
    dist = Exponential(η)
    rand(dist, n)
end

# Keep each event independently with probability γ and return the lag times
# between the surviving events.
function lose_data(lagtimes, γ)
    @assert(all(lagtimes .>= 0.0))
    ind = Int[]
    fixed_times = cumsum(lagtimes)
    for i = 1:length(lagtimes)
        x = rand()
        if x < γ
            push!(ind, i)
        end
    end
    detected_times = fixed_times[ind]
    detected_times |> diff
end
ns = [100_000, 1_000_000, 10_000_000]
# ns = [1_000] # testing only
ks = [0.1, 0.5, 1.0, 5.0, 10.0]
Γs = [0.1, 0.5, 1.0, 5.0, 10.0]
γs = range(0.1, 1.0, step = 0.1)
ntrials = 1000
smrates = Iterators.product(ks, Γs) |> collect |> vec
l = SpinLock()
@threads for trialnum = 1:ntrials
    data = Dict()
    for p in smrates
        (k, Γ) = p
        for n in ns
            # nm_times = get_emission_dt(n, k, Γ)
            # mar_times = emission_times_exp(n, k, Γ)
            nm_times = 10.0 .* rand(n)
            mar_times = 10.0 .* rand(n)
            for γ in γs
                nm_lost = lose_data(nm_times, γ)
                mar_lost = lose_data(mar_times, γ)
                hi = max(maximum(nm_lost), maximum(mar_lost))
                @assert(all(nm_lost .>= 0.0))
                @assert(all(mar_lost .>= 0.0))
                nm_binned = bin_data(nm_lost, 0.0, hi, 100)
                mar_binned = bin_data(mar_lost, 0.0, hi, 100)
                nm_size = compressed_size_bytes(nm_binned)
                mar_size = compressed_size_bytes(mar_binned)
                experiment_index = (n = n, k = k, Γ = Γ, γ = γ, trial = trialnum)
                try
                    lock(l)
                    data[experiment_index] = (1.0, 1.0)
                finally
                    unlock(l)
                end
            end
        end
    end
end
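Side note (an editorial observation, not part of the original report): the reproducer acquires the spinlock inside the try block. That is unrelated to the segfault, but the usual Julia pattern acquires the lock before entering try, or uses the do-block form, so that finally only releases a lock that was actually acquired. A minimal sketch:
lk = ReentrantLock()

# Acquire before `try` so that `finally` only runs unlock after the lock is held.
lock(lk)
try
    # ... critical section ...
finally
    unlock(lk)
end

# Equivalent do-block form, which pairs lock/unlock automatically.
lock(lk) do
    # ... critical section ...
end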
Output of `versioninfo()`
Julia Version 1.9.0-DEV.109
Commit 3a47c1c4e1 (2022-03-01 20:58 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
Threads: 1 on 32 virtual cores
Stack trace from the one run (out of many) where the program managed to print one before dying
The line that originates in my code (line 65, at the top of the trace) is just the @threads loop.
signal (11): Segmentation fault
in expression starting at /mnt/ssd-data/Experiments/02-2022/simple-photon-model/code/buggy/gen_paramsweep_segfault.jl:65
jl_uv_call_close_callback at /buildworker/worker/package_linux64/build/src/jl_uv.c:83 [inlined]
jl_uv_closeHandle at /buildworker/worker/package_linux64/build/src/jl_uv.c:106
uv__finish_close at /workspace/srcdir/libuv/src/unix/core.c:301
uv__run_closing_handles at /workspace/srcdir/libuv/src/unix/core.c:315
uv_run at /workspace/srcdir/libuv/src/unix/core.c:393
ijl_process_events at /buildworker/worker/package_linux64/build/src/jl_uv.c:210
ijl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:598
poptask at ./task.jl:884
wait at ./task.jl:893
wait at ./condition.jl:124
wait at ./process.jl:647
success at ./process.jl:509
Exception: /tmp/julia-3a47c1c4e1/bin/julia killed by signal segmentation fault (core dumped)
A valgrind report from the previous issue (taken from this comment):
snellius paulm@tcn116 17:23 ~$ valgrind --smc-check=all-non-file --suppressions=$HOME/c/julia-git/contrib/valgrind-julia.supp ~/software/julia-1.7.1/bin/julia -t 128 segfault.jl
==1988794== Memcheck, a memory error detector
==1988794== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1988794== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1988794== Command: /home/paulm/software/julia-1.7.1/bin/julia -t 128 segfault.jl
==1988794==
--1988794-- WARNING: unhandled amd64-linux syscall: 1008
--1988794-- You may be able to write your own handler.
--1988794-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--1988794-- Nevertheless we consider this a bug. Please report
--1988794-- it at http://valgrind.org/support/bug_reports.html.
==1988794== Warning: client switching stacks? SP change: 0x311d17d8 --> 0xe2a5fff8
==1988794== to suppress, use: --max-stackframe=2978539552 or greater
==1988794== Warning: invalid file descriptor -1 in syscall close()
==1988794== Warning: invalid file descriptor -1 in syscall close()
==1988794== Warning: client switching stacks? SP change: 0x30bce7d8 --> 0xef554ff8
==1988794== to suppress, use: --max-stackframe=3197659168 or greater
==1988794== Warning: client switching stacks? SP change: 0x34dd57d8 --> 0xf3958ff8
==1988794== to suppress, use: --max-stackframe=3199744032 or greater
==1988794== further instances of this message will not be shown.
==1988794== Thread 3:
==1988794== Syscall param write(buf) points to uninitialised byte(s)
==1988794== at 0x4F4A52D: syscall (in /usr/lib64/libc-2.28.so)
==1988794== Address 0xef54e000 is in a rw- anonymous segment
==1988794==
==1988794== Syscall param write(buf) points to unaddressable byte(s)
==1988794== at 0x4F4A52D: syscall (in /usr/lib64/libc-2.28.so)
==1988794== Address 0xef54e000 is in a rw- anonymous segment
==1988794==
==1988794== Thread 80:
==1988794== Invalid read of size 8
==1988794== at 0x5B87E84: maybe_collect (julia_threads.h:325)
==1988794== by 0x5B87E84: jl_gc_big_alloc (gc.c:947)
==1988794== Address 0xfffffffffe49ff10 is not stack'd, malloc'd or (recently) free'd
==1988794==
==1988794== Thread 88:
==1988794== Invalid read of size 8
==1988794== at 0x5B7CC12: jl_gc_state_set (julia_threads.h:325)
==1988794== by 0x5B7CC12: jl_task_get_next (partr.c:523)
==1988794== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1988794==
==1988794==
==1988794== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==1988794== Access not within mapped region at address 0x0
==1988794== at 0x5B7CC12: jl_gc_state_set (julia_threads.h:325)
==1988794== by 0x5B7CC12: jl_task_get_next (partr.c:523)
==1988794== If you believe this happened as a result of a stack
==1988794== overflow in your program's main thread (unlikely but
==1988794== possible), you can try to increase the size of the
==1988794== main thread stack using the --main-stacksize= flag.
==1988794== The main thread stack size used in this run was 16777216.
==1988794==
==1988794== HEAP SUMMARY:
==1988794== in use at exit: 603,293,029 bytes in 49,164 blocks
==1988794== total heap usage: 1,022,326 allocs, 973,162 frees, 3,449,011,340 bytes allocated
==1988794==
==1988794== LEAK SUMMARY:
==1988794== definitely lost: 163 bytes in 12 blocks
==1988794== indirectly lost: 0 bytes in 0 blocks
==1988794== possibly lost: 1,273,212 bytes in 12,672 blocks
==1988794== still reachable: 602,018,922 bytes in 36,477 blocks
==1988794== of which reachable via heuristic:
==1988794== newarray : 56,448 bytes in 10 blocks
==1988794== multipleinheritance: 7,992 bytes in 15 blocks
==1988794== suppressed: 732 bytes in 3 blocks
==1988794== Rerun with --leak-check=full to see details of leaked memory
==1988794==
==1988794== Use --track-origins=yes to see where uninitialised values come from
==1988794== For lists of detected and suppressed errors, rerun with: -s
==1988794== ERROR SUMMARY: 15 errors from 4 contexts (suppressed: 46 from 6)
Segmentation fault
An rr trace of the bug under Julia 1.7.1 can be found here. However, I have not been able to catch the program crashing under rr on the nightly build: trying to use BugReporting.jl simply runs the program to successful completion, and running julia itself under rr has not produced any useful results yet (it has been running for over 24 hours at this point with no apparent progress).
Activity
chipbuster commented on Mar 4, 2022
Running under an external rr has crashed with an internal rr error. It looks like this is an rr issue and not the actual Julia crash (the error message was about the wrong signal being received for a stop), but I have uploaded this trace in case it winds up being useful: https://www.dropbox.com/s/ln805f9tfq3tsau/julia-rr-crash-nightly.tar.zst?dl=0
process: ensure uvfinalize and _uv_close_cb are synchronized (#44476)
dpinol commented on Apr 22, 2022
I also get a crash in my application when pushing into a local array variable inside Threads.@threads.
I then tested the code from this issue's description, and it still crashes on my Ubuntu machine on both 1.8.0-beta3 and a Julia built from master yesterday. @chipbuster, can you reproduce the crash with current Julia versions?
This is what I get with valgrind on my application:
This is what I get without valgrind on my application with julia 1.9-dev:
With Julia 1.8-beta3
KristofferC commented on Apr 22, 2022
Could you be a bit more explicit (perhaps with a code snippet)? As you describe it, that sounds like a data race.
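For context (an illustration added here, not part of the original comment): the classic data race with @threads is pushing into one shared vector from every task without synchronization, whereas a lock-protected container is safe. A minimal sketch:
using Base.Threads

# Racy: push! on a shared Vector from many tasks is unsynchronized mutation
# and can corrupt the vector or crash.
shared = Float64[]
@threads for i in 1:10_000
    push!(shared, rand())   # data race
end

# Safe: guard the shared vector with a lock.
safe = Float64[]
lk = ReentrantLock()
@threads for i in 1:10_000
    r = rand()
    lock(lk) do
        push!(safe, r)
    end
end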
dpinol commented on Apr 22, 2022
The crash involves a local variable only accessed from a single thread. The snippet below illustrates what my application does, but Julia never crashes with the snippet (nor does valgrind report any issue).
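The snippet dpinol posted is not preserved in this extract; a hypothetical minimal sketch of the pattern described (each iteration pushing into its own local vector inside Threads.@threads, with no sharing between tasks) might look like:
using Base.Threads

# Hypothetical illustration only, not dpinol's actual code.
function push_locally(niter, m)
    results = Vector{Vector{Float64}}(undef, niter)
    @threads for i in 1:niter
        acc = Float64[]          # local to this iteration; no other task touches it
        for _ in 1:m
            push!(acc, rand())
        end
        results[i] = acc         # each iteration writes a distinct slot
    end
    return results
end

push_locally(100, 10_000)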
dpinol commented on Apr 22, 2022
I uploaded the result of running #44460 (comment) with julia 1.9-dev with julia --bug-report=rr-local --project=. --threads=32 (5 Gb 😿; link edited with the packed rr data).
https://www.dropbox.com/s/ia8b159h59x9uv4/issue-44460-rr.tgz?dl=0
vtjnash commented on Apr 22, 2022
It looks like you need to run rr pack before uploading that trace (should only make it slightly bigger).
dpinol commented on Apr 22, 2022
chipbuster commented on Apr 23, 2022
Yeah, the example snippet is not really the best because it generates a lot of data in the rr trace and doesn't always crash quickly.
Something that I have found sometimes persuades it to crash more quickly is clearing all caches and buffers on Linux before starting a recording, but it's not really foolproof.
vtjnash commented on Apr 26, 2022
@dpinol there doesn't seem to be anything remarkable in your trace that I can see. The process runs for a while (about 4138 iterations of the loop happen), then something external terminated it with a SIGKILL (9).
chipbuster commented on May 1, 2022
@dpinol I'm not able to reproduce the crash running with the latest Julia.
Based on the description provided by vtjnash, it's possible that my example is crashing for you simply because it runs out of memory.
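(Not from the thread: one way to check the out-of-memory hypothesis is to log free memory while the reproducer runs; Base's Sys.free_memory and Sys.total_memory report this in bytes.) A minimal sketch:
# Hypothetical monitoring task, added for illustration only.
monitor = @async while true
    free_gib = Sys.free_memory() / 2^30
    total_gib = Sys.total_memory() / 2^30
    println(stderr, "free memory: $(round(free_gib; digits=2)) / $(round(total_gib; digits=2)) GiB")
    sleep(5)
end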
dpinol commented on May 5, 2022
Yes, I agree that in this case it runs out of memory. I created a different MWE at #45196. Thanks.