Segfaults with gcc 13 and -fstrict-aliasing #261

@gszr

Here are some notes on a segfault I have been facing with lua-protobuf in Kong Gateway, which I traced to the -fstrict-aliasing optimization (part of -O2) used when building lua-protobuf. Long write-up, be warned : )

2024/03/06 12:50:52 [notice] 184162#0: *1 [kong] init.lua:589 declarative config loaded from /home/gs/code/work/issues/KAG-3949/kong.yml, context: init_worker_by_lua*
2024/03/06 12:50:52 [notice] 184162#0: *644 [kong] process.lua:217 Starting go-hello, context: ngx.timer
2024/03/06 12:50:55 [notice] 184161#0: signal 17 (SIGCHLD) received from 184162
2024/03/06 12:50:55 [alert] 184161#0: worker process 184162 exited on signal 11 (core dumped)
2024/03/06 12:50:55 [notice] 184161#0: start worker process 184186
2024/03/06 12:50:55 [notice] 184186#0: *1291 [kong] process.lua:217 Starting go-hello, context: ngx.timer

The worker process is segfaulting. I fired up gdb and got the following:

Core was generated by `nginx: worker process                                                         '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000772185e0dc7c in pb_nextfield () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
(gdb) bt
#0  0x0000772185e0dc7c in pb_nextfield () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#1  0x0000772185e1321e in lpb_setdeffields () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#2  0x0000772185e13518 in lpb_pushtypetable () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#3  0x0000772185e172e6 in lpbD_decode () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#4  0x0000772185e17352 in Lpb_decode () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#5  0x0000772186796b93 in lj_BC_FUNCC ()

So the crash originates in the lua-protobuf C library. I rebuilt lua-protobuf with debug info (-g2):

$ CFLAGS="-O2 -g2" luarocks install lua-protobuf 0.5.0

I preserved optimization level 2 since that is the default the library is built with.

Restarted Kong and reran...

Core was generated by `nginx: worker process                                                         '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  pb_nextfield (t=t@entry=0x5a68f7bdcfa0, pfield=pfield@entry=0x7ffffb7092f0)
    at /tmp/luarocks_lua-protobuf-0.5.0-1-4338781/lua-protobuf/pb.h:1220

warning: Source file is more recent than executable.
1220	            if ((*pfield = e->value) != NULL)
(gdb) bt
#0  pb_nextfield (t=t@entry=0x5a68f7bdcfa0, pfield=pfield@entry=0x7ffffb7092f0)
    at /tmp/luarocks_lua-protobuf-0.5.0-1-4338781/lua-protobuf/pb.h:1220

So we can see that the segfault is caused by the assignment:

if ((*pfield = e->value) != NULL)

which is on line 1220 of the pb.h file: https://github.com/starwing/lua-protobuf/blame/master/pb.h#L1220. Looking at the code around that line did not help much. If pfield itself were NULL, the segfault would have happened on line 1217 instead, so the pointer must be valid, but the region it points to isn't writable.

Back to the compilation command, I tried a few other things.
Since the default is to build with -O2, what happens if I use -O0?

$ CFLAGS="-g2" luarocks install lua-protobuf 0.5.0
Installing https://luarocks.org/lua-protobuf-0.5.0-1.src.rock

lua-protobuf 0.5.0-1 depends on lua >= 5.1 (5.1-1 provided by VM)
gcc -g2 -fPIC -I/home/gs/code/work/kong/bazel-bin/build/kong-dev/openresty/luajit/include/luajit-2.1 -c pb.c -o pb.o
gcc -shared -o pb.so pb.o
lua-protobuf 0.5.0-1 is now installed in /home/gs/code/work/kong/bazel-bin/build/kong-dev (license: MIT)

... it now works. So it must be some optimization enabled by the -O2 flag.
I reran with -O1 and it works; then I reran with -O3 and it also works (which is strange, since each optimization level adds to the previous one).

Pursuing another line of investigation, I tested other compilers (with the exact same flags):
clang version 16.0.6 -> works
gcc version 12.3.0 -> works

Back to optimization levels: we know that -O0 and -O1 work but -O2 does not, so the issue must be triggered by a subset of the optimizations introduced by -O2.
I tried putting together a list of all individual optimizations done by -O1 and -O2. It turns out that a complete list isn't really possible, because gcc does not expose all optimizations done by -Ox as individual flags [1] (and there are no plans for doing so [2]). However, at least some (perhaps most) of those optimizations are exposed through individual -f flags [3], so I compiled the following list: https://gist.github.com/gszr/6164e79c596bd98d3a4fd170ff351611

I wrote a small script to compile the library with all of these flags and "binary search" for at least one that triggers the issue (note on the flags: listing them is not enough; at least -O1 must also be specified for them to take effect).
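The narrowing logic of that search can be sketched in C as follows. This is not the script that was actually used: `bisect_flags`, `crash_fn`, and `fake_crashes` are hypothetical names, the predicate is a stand-in for "rebuild with -O1 plus these flags and rerun Kong", and the sketch assumes a single flag triggers the crash on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate: nonzero if a build with -O1 plus flags[0..n)
 * reproduces the crash. A real one would rebuild and rerun the workload. */
typedef int (*crash_fn)(const char *const *flags, size_t n);

/* Narrow flags[0..n) down to one flag.
 * Assumption: exactly one flag in the list triggers the crash by itself. */
static const char *bisect_flags(const char *const *flags, size_t n,
                                crash_fn crashes) {
    size_t lo = 0, hi = n;               /* the culprit is in [lo, hi) */
    assert(crashes != NULL);
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (crashes(flags + lo, mid - lo))
            hi = mid;                    /* first half still crashes */
        else
            lo = mid;                    /* so it must be in the second half */
    }
    return n > 0 ? flags[lo] : NULL;
}

/* Stand-in predicate for the demo: pretends the crash shows up whenever
 * -fstrict-aliasing is part of the tested subset. */
static int fake_crashes(const char *const *flags, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(flags[i], "-fstrict-aliasing") == 0)
            return 1;
    return 0;
}

/* Runs the search over a few flags that gcc enables at -O2. */
static const char *demo_bisect(void) {
    static const char *const o2_flags[] = {
        "-fcaller-saves", "-fcrossjumping", "-fgcse",
        "-fstrict-aliasing", "-ftree-vrp", "-fpeephole2",
    };
    return bisect_flags(o2_flags, sizeof o2_flags / sizeof o2_flags[0],
                        fake_crashes);
}
```

If more than one flag were involved, or the crash needed a combination of flags, the halving step would no longer be sound, and removing or adding flags a few at a time would be the safer approach.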

After a few rounds of removing and adding flags, I pinpointed the issue to the following optimization: -fstrict-aliasing. We can confirm this by enabling all of -O1 plus -fstrict-aliasing, which belongs to -O2:

$ CC=/usr/bin/gcc CFLAGS="-O1 -fstrict-aliasing" luarocks install lua-protobuf 0.5.0

... and it breaks.

Could there be some other optimization that could also lead to the issue? To test this, I reran with -O2, but disabling specifically -fstrict-aliasing:

$ CC=/usr/bin/gcc CFLAGS="-O2 -fno-strict-aliasing" luarocks install lua-protobuf 0.5.0
... and things work normally, no segfaults : )
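For context on the flag: -fstrict-aliasing lets gcc assume that an object is only accessed through an lvalue of a compatible type (or a character type). The sketch below illustrates one common way C code runs afoul of that rule, alongside a conforming alternative. The names (`Entry`, `FieldEntry`, `Table`, `next_entry`, `count_values`) are hypothetical stand-ins, not the lua-protobuf definitions, and whether this is the pattern at fault in pb.h is not established here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only. */
typedef struct Entry { int key; } Entry;
typedef struct FieldEntry { Entry entry; const char *value; } FieldEntry;
typedef struct Table { const FieldEntry *slots; size_t count; } Table;

/* Generic iterator: stores the next entry into *pentry, returns 0 at the end. */
static int next_entry(const Table *t, const Entry **pentry) {
    size_t i = 0;
    assert(t != NULL && pentry != NULL);
    if (*pentry != NULL)  /* index of the current slot, plus one */
        i = (size_t)((const FieldEntry *)(const void *)*pentry - t->slots) + 1;
    if (i >= t->count) { *pentry = NULL; return 0; }
    *pentry = &t->slots[i].entry;
    return 1;
}

/* Risky variant, shown only as a comment:
 *
 *     const FieldEntry *e = NULL;
 *     while (next_entry(t, (const Entry **)&e))
 *         use(e->value);
 *
 * The callee stores a `const Entry *` into an object declared as
 * `const FieldEntry *`. The two pointer types are not compatible, so the
 * access is undefined, and with -fstrict-aliasing the compiler may assume
 * that the store does not modify `e`. */

/* Conforming variant: the iterator variable has exactly the type the callee
 * writes, and the pointer value is converted afterwards. */
static size_t count_values(const Table *t) {
    const Entry *e = NULL;
    size_t n = 0;
    while (next_entry(t, &e)) {
        /* Valid conversion: `entry` is the first member of FieldEntry. */
        const FieldEntry *fe = (const FieldEntry *)(const void *)e;
        if (fe->value != NULL)
            n++;
    }
    return n;
}

/* Small usage example: three slots, two of them with a non-NULL value. */
static size_t demo_count(void) {
    static const FieldEntry slots[] = {
        { {1}, "a" }, { {2}, NULL }, { {3}, "c" },
    };
    Table t = { slots, sizeof slots / sizeof slots[0] };
    return count_values(&t);
}
```

Code that cannot be restructured this way is often built with -fno-strict-aliasing instead, which matches the workaround above.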

[1] https://gcc.gnu.org/wiki/FAQ#Is_-O1_.28-O2.2C-O3.2C_-Os_or_-Og.29_equivalent_to_individual_-foptimization_options.3F
[2] https://gcc.gnu.org/legacy-ml/gcc-help/2009-10/msg00134.html
[3] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
