
w64devkit build segfaults at 0xFFFFFFFFFFFFFFFF #2922


Closed
cebtenzzre opened this issue Aug 31, 2023 · 5 comments · Fixed by #2945

@cebtenzzre
Collaborator

cebtenzzre commented Aug 31, 2023

Steps to reproduce:

  1. Install latest w64devkit
  2. Build with make LLAMA_DEBUG=1
  3. Simply run ./main, regardless of whether you have a model in the default location (I don't)

50% of the time, it will fail. I cannot reproduce it if I build with MSYS2's mingw-w64 toolchain instead.

I bisected it to commit 0c44427, which adds -march=native to CXXFLAGS.

If cv2pdb is to be trusted (confirmed below), the crash happens here:
https://github.com/ggerganov/llama.cpp/blob/8afe2280009ecbfc9de2c93b8f41283dc810609a/common/common.cpp#L723

Something is going wrong before that function call:

    llama_model * model  = llama_load_model_from_file(params.model.c_str(), lparams);
00007FF7CBE59448  mov         rax,qword ptr [params]  
00007FF7CBE5944F  add         rax,0C8h  
00007FF7CBE59455  mov         rcx,rax  
00007FF7CBE59458  call        _M_range_check+0F70h (07FF7CBEA0340h)  
00007FF7CBE5945D  mov         rcx,rax  
00007FF7CBE59460  vmovdqu     ymm0,ymmword ptr [lparams]  
00007FF7CBE59465  vmovdqa     ymmword ptr [rbp-60h],ymm0  <-- segfault is here
00007FF7CBE5946A  vmovdqu     ymm0,ymmword ptr [rbp+10h]  
00007FF7CBE5946F  vmovdqa     ymmword ptr [rbp-40h],ymm0  
00007FF7CBE59474  lea         rax,[rbp-60h]  
00007FF7CBE59478  mov         rdx,rax  
00007FF7CBE5947B  call        llama_load_model_from_file (07FF7CBE4B9D7h)

rbp is 0x0000007CA91FE0D0, so I'm not sure where 0xFFFFFFFFFFFFFFFF comes from. And it's a read violation, but that instruction is only reading from a register.

@cebtenzzre cebtenzzre added bug Something isn't working windows Issues specific to Windows labels Aug 31, 2023
@cebtenzzre cebtenzzre self-assigned this Aug 31, 2023
@staviq
Contributor

staviq commented Aug 31, 2023

Gdb decided to play nice with me for once, and I got this (after adding the debug fprintf described below):

[screenshots of the gdb session omitted]

I added a debug fprintf here:

std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(gpt_params & params) {
    struct llama_context_params lparams = llama_context_params_from_gpt_params(params);
	
    fprintf(stderr, "Model: %s, lparams %p\n", params.model.c_str(), (void *)&lparams);
    fflush(stderr);

    llama_model * model  = llama_load_model_from_file(params.model.c_str(), lparams);
    if (model == NULL) {
        fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
        return std::make_tuple(nullptr, nullptr);
    }

In the disassembly, the segfault happens right after the fflush call.

Note that the call to <_ZNKSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE5c_strEv> (std::string::c_str) for my debug fprintf works.

After it, in preparation for the <llama_load_model_from_file(char const*, llama_context_params)> call, there are what appear to be AVX instructions that are not invoked directly in the code, so I believe they come from -march=native.

I haven't played with x86 assembly since uni, so maybe you can see something in here:

(gdb) info all-registers
rax            0x1e4fb0            1986480
rbx            0x1e4fb0            1986480
rcx            0x1e4fb0            1986480
rdx            0x2                 2
rsi            0x5fe520            6284576
rdi            0x1e164a            1971786
rbp            0x5fe530            0x5fe530
rsp            0x5fe4b0            0x5fe4b0
r8             0x7ffcf9f55940      140724502092096
r9             0x7ff6bfa30998      140697753815448
r10            0x0                 0
r11            0x246               582
r12            0x1                 1
r13            0x1e15e0            1971680
r14            0x1e1570            1971568
r15            0x0                 0
rip            0x7ff6bf9195ea      0x7ff6bf9195ea <llama_init_from_gpt_params(gpt_params&)+165>
eflags         0x10202             [ IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x2b                43
es             0x2b                43
fs             0x53                83
gs             0x2b                43
st0            0                   (raw 0x00000000000000000000)
st1            0                   (raw 0x00000000000000000000)
st2            0                   (raw 0x00000000000000000000)
st3            1                   (raw 0x3fff8000000000000000)
st4            0.571250147471054165729 (raw 0x3ffe923d731d392dc6dc)
st5            2.82235073047193707641e-324 (raw 0x3bcc923d731d392dc6dc)
st6            2.82235073047193707641e-324 (raw 0x3bcc923d731d392dc6dc)
st7            2.82235073047193707641e-324 (raw 0x3bcc923d731d392dc6dc)
fctrl          0x230037f           36701055
fstat          0x230               560
ftag           0x0                 0
fiseg          0x0                 0
fioff          0x0                 0
foseg          0x0                 0
fooff          0x0                 0
fop            0x0                 0
xmm0           {v8_bfloat16 = {0xcb8d, 0x64f0, 0x200, 0x0, 0x200, 0x0, 0x0, 0x0}, v8_half = {0xcb8d, 0x64f0, 0x200, 0x0, 0x200, 0x0, 0x0, 0x0}, v4_float = {0x64f0cb8d, 0x200, 0x200, 0x0}, v2_double = {0x20064f0cb8d, 0x200}, v16_int8 = {0x8d, 0xcb, 0xf0, 0x64, 0x0, 0x2, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0xcb8d, 0x64f0, 0x200, 0x0, 0x200, 0x0, 0x0, 0x0}, v4_int32 = {0x64f0cb8d, 0x200, 0x200, 0x0}, v2_int64 = {0x20064f0cb8d, 0x200}, uint128 = 0x2000000020064f0cb8d}
xmm1           {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm2           {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm3           {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm4           {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm5           {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm6           {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm7           {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm8           {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm9           {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm10          {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm11          {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm12          {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm13          {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm14          {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
xmm15          {v8_bfloat16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v2_int64 = {0x0, 0x0}, uint128 = 0x0}
mxcsr          0x1fbb              [ IE DE OE UE PE IM DM ZM OM UM PM ]
(gdb)

@cebtenzzre
Collaborator Author

cebtenzzre commented Aug 31, 2023

I can reproduce the issue if I replace -march=native -mtune=native with -mavx -mtune=znver1 or -mavx -mtune=icelake-client, but not with -mavx -mtune=bdver4 or just -mtune=znver1.

I can make it crash on both a Ryzen 5 3600 and an Intel i5-3570K.

@staviq
Contributor

staviq commented Aug 31, 2023

I did some digging, and the disassembly from the w64devkit build around the crashing code looks very similar to a native Linux GCC build.

The native Linux build also produces identical AVX instructions for what appears to be the copying of arguments for the llama_load_model_from_file call.

I thought this might be a memory alignment issue, but vmovdqu is supposed to handle unaligned addresses, so it doesn't look like memory alignment to me.

Which makes me think the problem is actually created earlier or elsewhere.

In the meantime, I managed to get VS Code working with w64devkit's gdb: if you launch VS Code from the w64devkit shell, it inherits the environment automatically, you can set breakpoints from the source view, and the disassembly view can single-step ASM instructions too.

@cebtenzzre
Collaborator Author

cebtenzzre commented Aug 31, 2023

Actually, an alignment issue would explain everything. If the destination operand of a vmovdqa store from a ymm register is not 32-byte aligned, the instruction raises a general protection fault. The rbp values you and I have seen do not leave rbp-0x60 32-byte aligned: in my dump above, rbp was 0x0000007CA91FE0D0, so rbp-0x60 is 0x0000007CA91FE070, which is 16-byte aligned but not 32-byte aligned. The stack must sometimes happen to be aligned, though, which explains why it doesn't crash consistently.

So, w64devkit's gcc is trying to copy llama_context_params using AVX registers, but uses an aligned instruction with an unaligned stack address (where the function arguments are stored on Windows), which fails. It must be a compiler bug.
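
To see the fault in isolation, here is a minimal standalone sketch (my own reduction, not llama.cpp code; build with AVX enabled, e.g. g++ -mavx): the aligned 256-bit store to an address that is 16-byte but not 32-byte aligned faults just like the vmovdqa above, while the unaligned store is fine.

// Minimal repro sketch of the failure mode (not llama.cpp code).
// Build with AVX enabled, e.g.: g++ -mavx align_crash.cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) unsigned char buf[64] = {};
    unsigned char * p = buf + 16;            // 16-byte aligned, not 32-byte aligned
    __m256i v = _mm256_set1_epi8(0x42);

    _mm256_storeu_si256((__m256i *)p, v);    // vmovdqu: succeeds on any address
    printf("unaligned store ok\n");

    _mm256_store_si256((__m256i *)p, v);     // vmovdqa: #GP, reported as an access violation
    printf("never reached\n");
    return 0;
}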

As a workaround, we could disable -mtune=native on certain versions of MinGW - or MinGW in general. -march=native is the important part, anyway.

There is also -muse-unaligned-vector-move according to GCC bug 54412.

cebtenzzre added a commit to cebtenzzre/llama.cpp that referenced this issue Aug 31, 2023
cebtenzzre added a commit to cebtenzzre/llama.cpp that referenced this issue Aug 31, 2023
@staviq
Contributor

staviq commented Aug 31, 2023

Idea:
Maybe we should add some compiler detection to build-info.h and include it in the logs, or print it below the build ID in main? Something like the sketch below.
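
Just to sketch what I mean (hypothetical names, not existing llama.cpp symbols), the compiler could be identified at build time from predefined macros:

// Hypothetical addition to build-info.h: a compiler description string
// derived from predefined macros (the BUILD_COMPILER name is made up).
#if defined(__clang__)
    #define BUILD_COMPILER "clang " __clang_version__
#elif defined(__MINGW64__)
    #define BUILD_COMPILER "MinGW-w64 GCC " __VERSION__
#elif defined(__GNUC__)
    #define BUILD_COMPILER "GCC " __VERSION__
#elif defined(_MSC_VER)
    #define BUILD_COMPILER "MSVC"
#else
    #define BUILD_COMPILER "unknown compiler"
#endif

// e.g. in main: fprintf(stderr, "built with %s\n", BUILD_COMPILER);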
