0. 背景 前段时间线上一台服务器突然无法登录的情况。
1. 排查过程 从服务器运行日志发现,有个服务陷入了endless loop。每5秒出现一条告警打印。
1 2 3 [56.35 I] [0 skynet] :A message from [ :00000000 ] to [ :0000002 c ] maybe in an endless loop (version = 1431 ) [01.35 I] [0 skynet] :A message from [ :00000000 ] to [ :0000002 c ] maybe in an endless loop (version = 1431 ) [06.35 I] [0 skynet] :A message from [ :00000000 ] to [ :0000002 c ] maybe in an endless loop (version = 1431 )
进入skynet控制台,ping 2c这个地址也没有返回。初步判断该服务陷入某种死循环。
查看机器运行cpu并没有什么负载,没有一个cpu处于满载状态。说明不像陷入了循环。
1 2 3 4 5 6 7 > top top - 01:35:01 up 1 day, 23:04, 1 user, load average: 0.02, 0.05, 0.04 Tasks: 128 total, 1 running, 127 sleeping, 0 stopped, 0 zombie %Cpu0 : 4.0 us, 0.3 sy, 0.0 ni, 95.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu1 : 2.0 us, 0.7 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 1.0 us, 0.3 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 0.3 us, 1.0 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
可以简要得出结论:
处于endless loop中,ping没有返回,说明3b这个服务的协程一直没有让出,怀疑死循环
cpu又没有负载,特别是没有某一个核心处于高负载中,也就是cpu让出来了,说明协程不处于死循环中
至此可以推测3b服务陷入某种io中,让出了cpu却未让出lua协程。
2. 查看代码逻辑 此服务lua逻辑不多,c代码逻辑比较多。服务启动了一个用于RabbitMQ代理的AMQP客户端,用于跟一个RabbitMQ服务器进行数据交互。rabbitmq-c客户端是一个开源c库。https://github.com/alanxz/rabbitmq-c
此时可以怀疑网络IO的问题了
因为用的版本比较旧,怀疑是库的问题,且推测已经做了修复,先看了一眼库的issues。
3. 重现 线上偶发,出现频率很低,大约一年出现一次。推测是网络IO的问题,怀疑在某些极端网络条件下,出现丢包情况。导致rabbitmq-c获取数据介于某种中间状态。故,先设置mq连接的丢包率尝试复现。
1 2 3 4 5 6 sudo iptables -A INPUT -p tcp --sport <port> -m statistic --mode random --probability 0.5 -j DROP sudo iptables -A OUTPUT -p tcp --dport <port> -m statistic --mode random --probability 0.5 -j DROP sudo iptables -L INPUT --line-numbers sudo iptables -D INPUT 1
丢包率太低(低于30%)无法复现
丢包率太高(高于70%)也不容易复现,会触发断线重连
丢包率适中的时候,大约两三个小时能触发一次。
进入gdb调试,获取堆栈
1 2 3 4 5 6 7 8 9 (gdb) #0 0x00007d3d22205c3f in __GI___poll (fds=0x7d3d1ebf90d0 , nfds=1 , timeout=-1 ) at ../sysdeps/unix/sysv/linux/poll.c:29 #1 0x00007d3d1daf2f29 in amqp_poll () from ./mq.so #2 0x00007d3d1daf3a71 in recv_with_timeout () from ./mq.so #3 0x00007d3d1daf3deb in wait_frame_inner () from ./mq.so #4 0x00007d3d1daf4189 in amqp_simple_wait_frame_on_channel () from ./mq.so #5 0x00007d3d1dae7aeb in amqp_read_message () from ./mq.so #6 0x00007d3d1dae7963 in amqp_consume_message () from ./mq.so #7 0x00007d3d1dae4030 in consume_message (L=0x7d3d1a20ea68 ) at lualib-src/lua-mq.c
从堆栈可以看到:
skynet的某一个工作线程(也就是出问题service)卡在了poll函数。
从参数timeout=-1可以看到,poll是没有超时时间的,直到读取到数据为止(或网络完全断开?)
4. 再次重现 由于无法查看 mq.so 的参数。看了默认情况下rabbitmq-c 的编译是release版本的。需要调整为debug版本重新编译。
1 2 3 4 5 6 7 8 # rabbitmq-c/CMakeLists.txt:36 set(default_build_type "Release") # Release修改为Debug if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES) message(STATUS "Setting build type to '${default_build_type}' as none was specified.") set(CMAKE_BUILD_TYPE "${default_build_type}" CACHE STRING "Choose the type of build." FORCE) set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release" "MinSizeRel" "RelWithDebInfo") endif()
堆栈
1 2 3 4 5 6 7 8 9 10 (gdb) bt #0 0x0000715e54949c3f in __GI___poll (fds=0x715e51bfa230 , nfds=1 , timeout=-1 ) at ../sysdeps/unix/sysv/linux/poll.c:29 #1 0x0000715e50234fa4 in amqp_poll (fd=16 , event=2 , deadline=...) at librabbitmq/amqp_socket.c:286 #2 0x0000715e50235aec in recv_with_timeout (state=0x715d47114380 , timeout=...) at librabbitmq/amqp_socket.c:725 #3 0x0000715e50235e66 in wait_frame_inner (state=0x715d47114380 , decoded_frame=0x715e51bfa3d0 , timeout=0x0 ) at librabbitmq/amqp_socket.c:846 #4 0x0000715e50236204 in amqp_simple_wait_frame_on_channel (state=0x715d47114380 , channel=1 , decoded_frame=0x715e51bfa3d0 ) at librabbitmq/amqp_socket.c:952 #5 0x0000715e50229b66 in amqp_read_message (state=0x715d47114380 , channel=1 , message=0x715e51bfa588 , flags=0 ) at librabbitmq/amqp_consumer.c:217 #6 0x0000715e502299de in amqp_consume_message (state=0x715d47114380 , envelope=0x715e51bfa540 , timeout=0x715e51bfa510 , flags=0 ) at librabbitmq/amqp_consumer.c:186 #7 0x0000715e50226480 in consume_message (L=0x715e4da16d08 ) at lualib-src/lua-mq.c
参数
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 (gdb) frame 1 #1 0x0000715e50234fa4 in amqp_poll (fd=16 , event=2 , deadline=...) at librabbitmq/amqp_socket.c:286 286 res = poll(&pfd, 1 , timeout_ms);(gdb) p timeout_ms $1 = -1 (gdb) frame 2 #2 0x0000715e50235aec in recv_with_timeout (state=0x715d47114380 , timeout=...) at librabbitmq/amqp_socket.c:725 725 res = amqp_poll(fd, AMQP_SF_POLLIN, timeout);(gdb) p timeout $2 = {time_point_ns = 18446744073709551615 } (gdb) frame 3 #3 0x0000715e50235e66 in wait_frame_inner (state=0x715d47114380 , decoded_frame=0x715e51bfa3d0 , timeout=0x0 ) at librabbitmq/amqp_socket.c:846 846 res = recv_with_timeout(state, deadline);(gdb) p deadline $3 = {time_point_ns = 18446744073709551615 } (gdb) p state->next_recv_heartbeat $4 = {time_point_ns = 18446744073709551615 } (gdb) p state->next_send_heartbeat $5 = {time_point_ns = 18446744073709551615 } (gdb) p timeout_deadline $6 = {time_point_ns = 18446744073709551615 } (gdb)
5. 原因 超时设置的地方
#3 0x0000715e50235e66 in wait_frame_inner (state=0x715d47114380, decoded_frame=0x715e51bfa3d0, timeout=0x0) at librabbitmq/amqp_socket.c:846
https://github.com/alanxz/rabbitmq-c/blob/master/librabbitmq/amqp_socket.c#L778C16-L778C31
1 2 3 4 5 6 7 8 # deadline从timeout_deadline,state->next_recv_heartbeat,state->next_send_heartbeat取最小值 deadline = amqp_time_first(timeout_deadline, amqp_time_first(state->next_recv_heartbeat, state->next_send_heartbeat)); res = recv_with_timeout(state, deadline);
这里是在timeout_deadline和心跳超时取最小值。但是不凑巧这里都是UINT64_MAX。
5.1. 心跳超时设置 项目未开启心跳,原因是在select io模型下开启心跳有异常。
5.2. 作者修复 https://github.com/alanxz/rabbitmq-c/blob/8b7471eab8d09536b3c104dbb30a65699cf48104/librabbitmq/amqp_socket.c#L888
作者在该处确实有改动。但是这里代码没有区别,最终取到的都是UINT64_MAX。
作者的大量建议都是开启心跳超时检测来解决该问题。
6. 尝试开启心跳 其实已经使用poll模型了。select模型超时会coredump的问题可能并不会出现。
6.1. 仍然复现 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 (gdb) thread 8 [Switching to thread 8 (Thread 0x774bea9fe640 (LWP 67748 ))] #0 0x0000774bef38bc3f in __GI___poll (fds=0x774bea9fc1b0 , nfds=1 , timeout=-1 ) at ../sysdeps/unix/sysv/linux/poll.c:29 29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.(gdb) bt #0 0x0000774bef38bc3f in __GI___poll (fds=0x774bea9fc1b0 , nfds=1 , timeout=-1 ) at ../sysdeps/unix/sysv/linux/poll.c:29 #1 0x0000774beb476fa4 in amqp_poll (fd=115 , event=2 , deadline=...) at librabbitmq/amqp_socket.c:286 #2 0x0000774beb477aec in recv_with_timeout (state=0x774be8068680 , timeout=...) at librabbitmq/amqp_socket.c:725 #3 0x0000774beb477e66 in wait_frame_inner (state=0x774be8068680 , decoded_frame=0x774bea9fc330 , timeout=0x0 ) at librabbitmq/amqp_socket.c:846 #4 0x0000774beb478346 in amqp_simple_wait_frame_noblock (state=0x774be8068680 , decoded_frame=0x774bea9fc330 , timeout=0x0 ) at librabbitmq/amqp_socket.c:988 #5 0x0000774beb478278 in amqp_simple_wait_frame (state=0x774be8068680 , decoded_frame=0x774bea9fc330 ) at librabbitmq/amqp_socket.c:972 #6 0x0000774beb47838c in amqp_simple_wait_method_list (state=0x774be8068680 , expected_channel=0 , expected_methods=0x774bea9fc3a0 , output =0x774bea9fc460 ) at librabbitmq/amqp_socket.c:997 #7 0x0000774beb47844e in amqp_simple_wait_method (state=0x774be8068680 , expected_channel=0 , expected_method=655370 , output =0x774bea9fc460 ) at librabbitmq/amqp_socket.c:1018 #8 0x0000774beb478f86 in amqp_login_inner (state=0x774be8068680 , vhost=0x774be804a86f "XXX" , channel_max=0 , frame_max=131072 , heartbeat=60 , client_properties=0x774beb47d4f0 <amqp_empty_table>, sasl_method=AMQP_SASL_METHOD_PLAIN, vl=0x774bea9fc6a0 ) at librabbitmq/amqp_mq.c:1270 #9 0x0000774beb4798e3 in amqp_login (state=0x774be8068680 , vhost=0x774be804a86f "XXX" , channel_max=0 , frame_max=131072 , heartbeat=60 , sasl_method=AMQP_SASL_METHOD_PLAIN) at librabbitmq/amqp_socket.c:1472
参数
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 (gdb) frame 1 #1 0x0000774beb476fa4 in amqp_poll (fd=115 , event=2 , deadline=...) at librabbitmq/amqp_socket.c:286 286 res = poll(&pfd, 1 , timeout_ms);(gdb) p deadline $1 = {time_point_ns = 18446744073709551615 } (gdb) frame 2 #2 0x0000774beb477aec in recv_with_timeout (state=0x774be8068680 , timeout=...) at librabbitmq/amqp_socket.c:725 725 res = amqp_poll(fd, AMQP_SF_POLLIN, timeout);(gdb) p timeout $5 = {time_point_ns = 18446744073709551615 } (gdb) frame 3 #3 0x0000774beb477e66 in wait_frame_inner (state=0x774be8068680 , decoded_frame=0x774bea9fc330 , timeout=0x0 ) at librabbitmq/amqp_socket.c:846 846 res = recv_with_timeout(state, deadline);(gdb) p deadline $6 = {time_point_ns = 18446744073709551615 } (gdb) p timeout_deadline $7 = {time_point_ns = 18446744073709551615 } (gdb) p state->next_recv_heartbeat $8 = {time_point_ns = 18446744073709551615 } (gdb) p state->next_send_heartbeat $9 = {time_point_ns = 18446744073709551615 }
从参数可以看出来心跳超时仍然是UINT64_MAX。这是为什么呢。
因为重连登录的时候,心跳尚未赋值,仍然会在登录的时候出现类似的情况。
6.2. 修复 https://github.com/alanxz/rabbitmq-c/commit/7f92d532360dd254bf0484085e7a51b812cc578f
作者为登录添加了一个超时时间。
7. 如何修复
开启超时心跳;
登录增加超时参数。
1 2 cd 3rd/rabbitmq-c git checkout 7f92d532360dd254bf0484085e7a51b812cc578f