Ron's Blog

学习、消化、沉淀

0%

记一次服务器mq无限超时问题

0. 背景

前段时间线上一台服务器突然无法登录的情况。

1. 排查过程

从服务器运行日志发现,有个服务陷入了endless loop。每5秒出现一条告警打印。

1
2
3
[56.35 I] [0 skynet] :A message from [ :00000000 ] to [ :0000002c ] maybe in an endless loop (version = 1431)
[01.35 I] [0 skynet] :A message from [ :00000000 ] to [ :0000002c ] maybe in an endless loop (version = 1431)
[06.35 I] [0 skynet] :A message from [ :00000000 ] to [ :0000002c ] maybe in an endless loop (version = 1431)

进入skynet控制台,ping 2c这个地址也没有返回。初步判断该服务陷入某种死循环。

1
skynet> ping 2c

查看机器运行cpu并没有什么负载,没有一个cpu处于满载状态。说明不像陷入了循环。

1
2
3
4
5
6
7
> top
top - 01:35:01 up 1 day, 23:04, 1 user, load average: 0.02, 0.05, 0.04
Tasks: 128 total, 1 running, 127 sleeping, 0 stopped, 0 zombie
%Cpu0 : 4.0 us, 0.3 sy, 0.0 ni, 95.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 2.0 us, 0.7 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 1.0 us, 0.3 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.3 us, 1.0 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

可以简要得出结论:

  • 处于endless loop中,ping没有返回,说明3b这个服务的协程一直没有让出,怀疑死循环
  • cpu又没有负载,特别是没有某一个核心处于高负载中,也就是cpu让出来了,说明协程不处于死循环中

至此可以推测3b服务陷入某种io中,让出了cpu却未让出lua协程。

2. 查看代码逻辑

此服务lua逻辑不多,c代码逻辑比较多。服务启动了一个用于RabbitMQ代理的AMQP客户端,用于跟一个RabbitMQ服务器进行数据交互。rabbitmq-c客户端是一个开源c库。https://github.com/alanxz/rabbitmq-c

此时可以怀疑网络IO的问题了

因为用的版本比较旧,怀疑是库的问题,且推测已经做了修复,先看了一眼库的issues。

3. 重现

线上偶发,出现频率很低,大约一年出现一次。推测是网络IO的问题,怀疑在某些极端网络条件下,出现丢包情况。导致rabbitmq-c获取数据介于某种中间状态。故,先设置mq连接的丢包率尝试复现。

1
2
3
4
5
6
sudo iptables -A INPUT -p tcp --sport <port> -m statistic --mode random --probability 0.5 -j DROP
sudo iptables -A OUTPUT -p tcp --dport <port> -m statistic --mode random --probability 0.5 -j DROP

# 删除防火墙设置
sudo iptables -L INPUT --line-numbers
sudo iptables -D INPUT 1
  • 丢包率太低(低于30%)无法复现
  • 丢包率太高(高于70%)也不容易复现,会触发断线重连
  • 丢包率适中的时候,大约两三个小时能触发一次。

进入gdb调试,获取堆栈

1
2
3
4
5
6
7
8
9
(gdb)
#0 0x00007d3d22205c3f in __GI___poll (fds=0x7d3d1ebf90d0, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007d3d1daf2f29 in amqp_poll () from ./mq.so
#2 0x00007d3d1daf3a71 in recv_with_timeout () from ./mq.so
#3 0x00007d3d1daf3deb in wait_frame_inner () from ./mq.so
#4 0x00007d3d1daf4189 in amqp_simple_wait_frame_on_channel () from ./mq.so
#5 0x00007d3d1dae7aeb in amqp_read_message () from ./mq.so
#6 0x00007d3d1dae7963 in amqp_consume_message () from ./mq.so
#7 0x00007d3d1dae4030 in consume_message (L=0x7d3d1a20ea68) at lualib-src/lua-mq.c

从堆栈可以看到:

  • skynet的某一个工作线程(也就是出问题service)卡在了poll函数。
  • 从参数timeout=-1可以看到,poll是没有超时时间的,直到读取到数据为止(或网络完全断开?)

4. 再次重现

由于无法查看 mq.so 的参数。看了默认情况下rabbitmq-c 的编译是release版本的。需要调整为debug版本重新编译。

1
2
3
4
5
6
7
8
# rabbitmq-c/CMakeLists.txt:36

set(default_build_type "Release") # Release修改为Debug
if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
message(STATUS "Setting build type to '${default_build_type}' as none was specified.")
set(CMAKE_BUILD_TYPE "${default_build_type}" CACHE STRING "Choose the type of build." FORCE)
set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release" "MinSizeRel" "RelWithDebInfo")
endif()

堆栈

1
2
3
4
5
6
7
8
9
10
(gdb) bt
#0 0x0000715e54949c3f in __GI___poll (fds=0x715e51bfa230, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x0000715e50234fa4 in amqp_poll (fd=16, event=2, deadline=...) at librabbitmq/amqp_socket.c:286
#2 0x0000715e50235aec in recv_with_timeout (state=0x715d47114380, timeout=...) at librabbitmq/amqp_socket.c:725
#3 0x0000715e50235e66 in wait_frame_inner (state=0x715d47114380, decoded_frame=0x715e51bfa3d0, timeout=0x0) at librabbitmq/amqp_socket.c:846
#4 0x0000715e50236204 in amqp_simple_wait_frame_on_channel (state=0x715d47114380, channel=1, decoded_frame=0x715e51bfa3d0) at librabbitmq/amqp_socket.c:952
#5 0x0000715e50229b66 in amqp_read_message (state=0x715d47114380, channel=1, message=0x715e51bfa588, flags=0) at librabbitmq/amqp_consumer.c:217
#6 0x0000715e502299de in amqp_consume_message (state=0x715d47114380, envelope=0x715e51bfa540, timeout=0x715e51bfa510, flags=0) at librabbitmq/amqp_consumer.c:186
#7 0x0000715e50226480 in consume_message (L=0x715e4da16d08) at lualib-src/lua-mq.c

参数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
(gdb) frame 1
#1 0x0000715e50234fa4 in amqp_poll (fd=16, event=2, deadline=...) at librabbitmq/amqp_socket.c:286
286 res = poll(&pfd, 1, timeout_ms);
(gdb) p timeout_ms
$1 = -1
(gdb) frame 2
#2 0x0000715e50235aec in recv_with_timeout (state=0x715d47114380, timeout=...) at librabbitmq/amqp_socket.c:725
725 res = amqp_poll(fd, AMQP_SF_POLLIN, timeout);
(gdb) p timeout
$2 = {time_point_ns = 18446744073709551615}
(gdb) frame 3
#3 0x0000715e50235e66 in wait_frame_inner (state=0x715d47114380, decoded_frame=0x715e51bfa3d0, timeout=0x0) at librabbitmq/amqp_socket.c:846
846 res = recv_with_timeout(state, deadline);
(gdb) p deadline
$3 = {time_point_ns = 18446744073709551615}
(gdb) p state->next_recv_heartbeat
$4 = {time_point_ns = 18446744073709551615}
(gdb) p state->next_send_heartbeat
$5 = {time_point_ns = 18446744073709551615}
(gdb) p timeout_deadline
$6 = {time_point_ns = 18446744073709551615}
(gdb)

5. 原因

超时设置的地方

#3 0x0000715e50235e66 in wait_frame_inner (state=0x715d47114380, decoded_frame=0x715e51bfa3d0, timeout=0x0) at librabbitmq/amqp_socket.c:846

https://github.com/alanxz/rabbitmq-c/blob/master/librabbitmq/amqp_socket.c#L778C16-L778C31

1
2
3
4
5
6
7
8
# deadline从timeout_deadline,state->next_recv_heartbeat,state->next_send_heartbeat取最小值
deadline = amqp_time_first(timeout_deadline,
amqp_time_first(state->next_recv_heartbeat,
state->next_send_heartbeat));

/* TODO this needs to wait for a _frame_ and not anything written from the
* socket */
res = recv_with_timeout(state, deadline);

这里是在timeout_deadline和心跳超时取最小值。但是不凑巧这里都是UINT64_MAX。

5.1. 心跳超时设置

项目未开启心跳,原因是在select io模型下开启心跳有异常。

5.2. 作者修复

https://github.com/alanxz/rabbitmq-c/blob/8b7471eab8d09536b3c104dbb30a65699cf48104/librabbitmq/amqp_socket.c#L888

作者在该处确实有改动。但是这里代码没有区别,最终取到的都是UINT64_MAX。

作者的大量建议都是开启心跳超时检测来解决该问题。

6. 尝试开启心跳

其实已经使用poll模型了。select模型超时会coredump的问题可能并不会出现。

6.1. 仍然复现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
(gdb) thread 8
[Switching to thread 8 (Thread 0x774bea9fe640 (LWP 67748))]
#0 0x0000774bef38bc3f in __GI___poll (fds=0x774bea9fc1b0, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0 0x0000774bef38bc3f in __GI___poll (fds=0x774bea9fc1b0, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x0000774beb476fa4 in amqp_poll (fd=115, event=2, deadline=...) at librabbitmq/amqp_socket.c:286
#2 0x0000774beb477aec in recv_with_timeout (state=0x774be8068680, timeout=...) at librabbitmq/amqp_socket.c:725
#3 0x0000774beb477e66 in wait_frame_inner (state=0x774be8068680, decoded_frame=0x774bea9fc330, timeout=0x0) at librabbitmq/amqp_socket.c:846
#4 0x0000774beb478346 in amqp_simple_wait_frame_noblock (state=0x774be8068680, decoded_frame=0x774bea9fc330, timeout=0x0) at librabbitmq/amqp_socket.c:988
#5 0x0000774beb478278 in amqp_simple_wait_frame (state=0x774be8068680, decoded_frame=0x774bea9fc330) at librabbitmq/amqp_socket.c:972
#6 0x0000774beb47838c in amqp_simple_wait_method_list (state=0x774be8068680, expected_channel=0, expected_methods=0x774bea9fc3a0, output=0x774bea9fc460) at librabbitmq/amqp_socket.c:997
#7 0x0000774beb47844e in amqp_simple_wait_method (state=0x774be8068680, expected_channel=0, expected_method=655370, output=0x774bea9fc460) at librabbitmq/amqp_socket.c:1018
#8 0x0000774beb478f86 in amqp_login_inner (state=0x774be8068680, vhost=0x774be804a86f "XXX", channel_max=0, frame_max=131072, heartbeat=60, client_properties=0x774beb47d4f0 <amqp_empty_table>,
sasl_method=AMQP_SASL_METHOD_PLAIN, vl=0x774bea9fc6a0) at librabbitmq/amqp_mq.c:1270
#9 0x0000774beb4798e3 in amqp_login (state=0x774be8068680, vhost=0x774be804a86f "XXX", channel_max=0, frame_max=131072, heartbeat=60, sasl_method=AMQP_SASL_METHOD_PLAIN)
at librabbitmq/amqp_socket.c:1472

参数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
(gdb) frame 1
#1 0x0000774beb476fa4 in amqp_poll (fd=115, event=2, deadline=...) at librabbitmq/amqp_socket.c:286
286 res = poll(&pfd, 1, timeout_ms);
(gdb) p deadline
$1 = {time_point_ns = 18446744073709551615}
(gdb) frame 2
#2 0x0000774beb477aec in recv_with_timeout (state=0x774be8068680, timeout=...) at librabbitmq/amqp_socket.c:725
725 res = amqp_poll(fd, AMQP_SF_POLLIN, timeout);
(gdb) p timeout
$5 = {time_point_ns = 18446744073709551615}
(gdb) frame 3
#3 0x0000774beb477e66 in wait_frame_inner (state=0x774be8068680, decoded_frame=0x774bea9fc330, timeout=0x0) at librabbitmq/amqp_socket.c:846
846 res = recv_with_timeout(state, deadline);
(gdb) p deadline
$6 = {time_point_ns = 18446744073709551615}
(gdb) p timeout_deadline
$7 = {time_point_ns = 18446744073709551615}
(gdb) p state->next_recv_heartbeat
$8 = {time_point_ns = 18446744073709551615}
(gdb) p state->next_send_heartbeat
$9 = {time_point_ns = 18446744073709551615}

从参数可以看出来心跳超时仍然是UINT64_MAX。这是为什么呢。

因为重连登录的时候,心跳尚未赋值,仍然会在登录的时候出现类似的情况。

6.2. 修复

https://github.com/alanxz/rabbitmq-c/commit/7f92d532360dd254bf0484085e7a51b812cc578f

作者为登录添加了一个超时时间。

7. 如何修复

  1. 开启超时心跳;
  2. 登录增加超时参数。
1
2
cd 3rd/rabbitmq-c
git checkout 7f92d532360dd254bf0484085e7a51b812cc578f