-
Notifications
You must be signed in to change notification settings - Fork 396
Description
Here are some notes of a segfault I have been facing with lua-protobuf in Kong Gateway, which I pinpointed to the optimization -fstrict-aliasing
(part of -O2
) in lua-protobuf -- long write-up, be warned : )
2024/03/06 12:50:52 [notice] 184162#0: *1 [kong] init.lua:589 declarative config loaded from /home/gs/code/work/issues/KAG-3949/kong.yml, context: init_worker_by_lua*
2024/03/06 12:50:52 [notice] 184162#0: *644 [kong] process.lua:217 Starting go-hello, context: ngx.timer
2024/03/06 12:50:55 [notice] 184161#0: signal 17 (SIGCHLD) received from 184162
2024/03/06 12:50:55 [alert] 184161#0: worker process 184162 exited on signal 11 (core dumped)
2024/03/06 12:50:55 [notice] 184161#0: start worker process 184186
2024/03/06 12:50:55 [notice] 184186#0: *1291 [kong] process.lua:217 Starting go-hello, context: ngx.timer
A segfault coming from the worker process. I fired up gdb and got the following:
Core was generated by `nginx: worker process '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000772185e0dc7c in pb_nextfield () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
(gdb) bt
#0 0x0000772185e0dc7c in pb_nextfield () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#1 0x0000772185e1321e in lpb_setdeffields () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#2 0x0000772185e13518 in lpb_pushtypetable () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#3 0x0000772185e172e6 in lpbD_decode () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#4 0x0000772185e17352 in Lpb_decode () from /home/gs/code/work/kong/bazel-bin/build/kong-dev/lib/lua/5.1/pb.so
#5 0x0000772186796b93 in lj_BC_FUNCC ()
So I saw it originated in the lua-protobuf C library and rebuilt lua-protobuf with -g2
$ CFLAGS="-O2 -g2" luarocks install lua-protobuf 0.5.0
I preserved optimization level 2 since that is the default the library is built with.
Restarted Kong and reran...
Core was generated by `nginx: worker process '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 pb_nextfield (t=t@entry=0x5a68f7bdcfa0, pfield=pfield@entry=0x7ffffb7092f0)
at /tmp/luarocks_lua-protobuf-0.5.0-1-4338781/lua-protobuf/pb.h:1220
warning: Source file is more recent than executable.
1220 if ((*pfield = e->value) != NULL)
(gdb) bt
#0 pb_nextfield (t=t@entry=0x5a68f7bdcfa0, pfield=pfield@entry=0x7ffffb7092f0)
at /tmp/luarocks_lua-protobuf-0.5.0-1-4338781/lua-protobuf/pb.h:1220
So we can see that the segfault is caused by the assignment:
if ((*pfield = e->value) != NULL)
which happens on line 1220 of the pb.h file: https://github.com/starwing/lua-protobuf/blame/master/pb.h#L1220. Looking at the code around that line did not help much; if pfield
itself was NULL
, the segfault would have happened on line 1217 instead, so the pointer must be valid, but that region it points to isn't writeable.
Back to the compilation command, I tried a few other things
Since the default is to build with -O2
, what happens if I do -O0
?
$ CFLAGS="-g2" luarocks install lua-protobuf 0.5.0 master?
Installing https://luarocks.org/lua-protobuf-0.5.0-1.src.rock
lua-protobuf 0.5.0-1 depends on lua >= 5.1 (5.1-1 provided by VM)
gcc -g2 -fPIC -I/home/gs/code/work/kong/bazel-bin/build/kong-dev/openresty/luajit/include/luajit-2.1 -c pb.c -o pb.o
gcc -shared -o pb.so pb.o
lua-protobuf 0.5.0-1 is now installed in /home/gs/code/work/kong/bazel-bin/build/kong-dev (license: MIT)
... it now works. So it must be some optimization that gets enabled by the -O2
flag.
I rerun with -O1
- it works; then I rerun with -O3
- and it also works -- (which is strange, since each level of optimization adds to the previous one)
Pursuing another line, I decided to test with other compilers:
clang version 16.0.6 -> works
gcc version 12.3.0 -> works
(with same exact flags)
Back to compilation levels; we know that -O0
and -O1
work, but -O2 does not, so the issue must be triggered by a subset of optimizations introduced by -O2
.
I tried putting together a list of all individual optimizations done by -O1
and -O2
; it turns out that such a list isn't really possible, because gcc does not expose all optimizations done by -Ox in individual flags (and there are no plans for doing so [2]). However, at least some (perhaps most) of those optimizations are exposed through individual -f flags [3], so I compiled the following list: https://gist.github.com/gszr/6164e79c596bd98d3a4fd170ff351611
I wrote a small script to compile the library with all these and "binary search" for at least one that triggers the issue (note on the flags: it is not enough to list all flags; at least -O1 must be specified so they take effect).
After a few rounds of removing / adding flags, I pinpointed the issue to the following optimization: -fstrict-aliasing
. Which we can confirm with the following -- enabling all of -O1
+ -fstrict-aliasing
which belongs to -O2
:
$ CC=/usr/bin/gcc CFLAGS="-O2 -fstrict-aliasing" luarocks install lua-protobuf 0.5.0
... and it breaks.
Could there be some other optimization that could also lead to the issue? To test this, I reran with -O2
, but disabling specifically -fstrict-aliasing
:
$ CC=/usr/bin/gcc CFLAGS="-O2 -fno-strict-aliasing" luarocks install lua-protobuf 0.5.0
... and things work normally, no segfaults : )
[1] https://gcc.gnu.org/wiki/FAQ#Is_-O1_.28-O2.2C-O3.2C_-Os_or_-Og.29_equivalent_to_individual_-foptimization_options.3F
[2] https://gcc.gnu.org/legacy-ml/gcc-help/2009-10/msg00134.html
[3] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html