GreenPlum SQL queries hang or run slowly, with the error or warning "Interconnect encountered a network error, please check your network"
Tags: Original, Troubleshooting, GreenPlum, GP, Network issues, net.ipv4.ipfrag_high_thresh, net.ipv4.ipfrag_low_thresh, net.ipv4.ipfrag_max_dist, net.ipv4.ipfrag_time
Symptoms
Environment: GreenPlum 6.25.3, CentOS 7.6
Querying a user-created table hangs for a long time and finally fails with:
```text
ERROR: Interconnect encountered a network error, please check your network (seg3 slice1 gp2.ops.bj1:33001 pid=69361)
DETAIL: Failed to send packet (seq 1) to 10.0.3.33:56292 (pid 37236 cid 6) after 3580 retries in 3600 seconds

gpdb=> select * from test; //The table does not have any data
......... long long long time.
ERROR: Interconnect encountered a network error, please check your network (seg0 slice1 10.60.80.29:40000 pid=23670)
DETAIL: Failed to send packet (seq 1) to 10.60.80.28:10670 (pid 15645 cid -1) after 3574 retries in 3600 seconds
```
Querying the system catalog tables, however, does not fail.
Warnings from another system:
```text
WARNING: interconnect may encountered a network error, please check your network (seg10 slice1 118.88.23.3:6010 pid=2732467)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:50494 (pid 3169149 cid 77) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg0 slice1 118.88.23.3:6000 pid=2732457)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:49520 (pid 3169167 cid 96) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg3 slice1 118.88.23.3:6003 pid=2732460)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:36031 (pid 3169160 cid 87) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg6 slice1 118.88.23.3:6006 pid=2732462)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:35373 (pid 3169150 cid 79) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg9 slice1 118.88.23.3:6009 pid=2732465)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:38228 (pid 3169154 cid 81) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg13 slice1 118.88.23.3:6013 pid=2732469)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:60469 (pid 3169164 cid 93) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg14 slice1 118.88.23.3:6014 pid=2732470)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:51999 (pid 3169166 cid 95) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg20 slice1 118.88.23.3:6020 pid=2732477)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:35373 (pid 3169150 cid 79) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg22 slice1 118.88.23.3:6022 pid=2732479)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:46447 (pid 3169162 cid 91) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg34 slice1 118.88.23.4:6009 pid=53881)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:47159 (pid 3169158 cid 90) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg37 slice1 118.88.23.4:6012 pid=53886)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:38228 (pid 3169154 cid 81) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg38 slice1 118.88.23.4:6013 pid=53885)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:50494 (pid 3169149 cid 77) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg39 slice1 118.88.23.4:6014 pid=53889)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:43048 (pid 3169153 cid 83) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg43 slice1 118.88.23.4:6018 pid=53894)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:49520 (pid 3169167 cid 96) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg45 slice1 118.88.23.4:6020 pid=53890)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:49520 (pid 3169167 cid 96) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg48 slice1 118.88.23.4:6023 pid=53896)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:51999 (pid 3169166 cid 95) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg52 slice1 118.88.23.5:6002 pid=3699426)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:47532 (pid 3169165 cid 92) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg54 slice1 118.88.23.5:6004 pid=3699427)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:35373 (pid 3169150 cid 79) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg55 slice1 118.88.23.5:6005 pid=3699431)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:51999 (pid 3169166 cid 95) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg56 slice1 118.88.23.5:6006 pid=3699430)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:37415 (pid 3169155 cid 84) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg57 slice1 118.88.23.5:6007 pid=3699432)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:54803 (pid 3169159 cid 88) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg60 slice1 118.88.23.5:6010 pid=3699434)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:50494 (pid 3169149 cid 77) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg61 slice1 118.88.23.5:6011 pid=3699436)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:60469 (pid 3169164 cid 93) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg63 slice1 118.88.23.5:6013 pid=3699437)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:47532 (pid 3169165 cid 92) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg66 slice1 118.88.23.5:6016 pid=3699440)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:46447 (pid 3169162 cid 91) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg68 slice1 118.88.23.5:6018 pid=3699444)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:53822 (pid 3169157 cid 86) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg71 slice1 118.88.23.5:6021 pid=3699445)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:47532 (pid 3169165 cid 92) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg72 slice1 118.88.23.5:6022 pid=3699446)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:41666 (pid 3169146 cid 75) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg50 slice1 118.88.23.5:6000 pid=3699425)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:44234 (pid 3169147 cid 76) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg53 slice1 118.88.23.5:6003 pid=3699428)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:36031 (pid 3169160 cid 87) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg29 slice1 118.88.23.4:6004 pid=53878)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:55308 (pid 3169168 cid 97) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg5 slice1 118.88.23.3:6005 pid=2732461)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:36031 (pid 3169160 cid 87) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg44 slice1 118.88.23.4:6019 pid=53891)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:37415 (pid 3169155 cid 84) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg64 slice1 118.88.23.5:6014 pid=3699438)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:37415 (pid 3169155 cid 84) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg16 slice1 118.88.23.3:6016 pid=2732472)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:55308 (pid 3169168 cid 97) after 100 retries.
WARNING: interconnect may encountered a network error, please check your network (seg8 slice1 118.88.23.3:6008 pid=2732464)
DETAIL: Failed to send packet (seq 1) to 118.88.23.6:46447 (pid 3169162 cid 91) after 100 retries.
```
The "100 retries" threshold in these warnings is controlled by a parameter:
```text
[gpadmin@mdw conf]$ gpconfig -s gp_interconnect_min_retries_before_timeout
Values on all segments are consistent
GUC : gp_interconnect_min_retries_before_timeout
Master value: 100
Segment value: 100

# Set it lower to help reproduce and diagnose SQL that is sometimes fast and sometimes slow
gpconfig -c gp_interconnect_min_retries_before_timeout -v 5
gpstop -u
-- or, for the current session only:
set gp_interconnect_min_retries_before_timeout=5;

SET client_min_messages=DEBUG3;
-- SET client_min_messages=notice;
show client_min_messages;
```
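In addition, the failure messages repeatedly show other segments failing to send packets to 118.88.23.6. This suggests that the UDP packets do reach node 118.88.23.6, but that node fails to reassemble them.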
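The observation that every failure points at the same destination can be checked mechanically. The following is a minimal sketch, assuming the warnings have been saved to a text file; the function name `count_failed_destinations` and the log path in the usage comment are hypothetical and not part of GreenPlum.

```shell
#!/usr/bin/env bash
# Count "Failed to send packet" lines per destination IP address.
# Reads log text on stdin and prints "<count> <ip>", highest count first.
# (Hypothetical helper for illustration; not a GreenPlum tool.)
count_failed_destinations() {
    grep -o 'Failed to send packet (seq [0-9]*) to [0-9.]*:[0-9]*' \
        | sed -e 's/.* to //' -e 's/:[0-9]*$//' \
        | sort | uniq -c | sort -rn \
        | awk '{print $1, $2}'
}

# Usage (the log path is an assumption):
#   count_failed_destinations < /tmp/query_warnings.log
```

If a single address dominates the output, as 118.88.23.6 does in the warnings above, the receiving side of that host is the first place to look.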
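Possible causes
1. Firewall
2. UDP interconnect (fix: switch from UDP to TCP)
3. NIC MTU set too large
4. /etc/hosts misconfigured
5. If the problem is intermittent, packet loss is the likely cause and parameters need to be tuned (recommended)
Solutions
Firewall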
```shell
# Check the firewall and SELinux status
systemctl status firewalld
getenforce

# Option 1: keep firewalld running and open all TCP and UDP ports
systemctl start firewalld
firewall-cmd --add-port=0-65535/tcp --permanent
firewall-cmd --add-port=0-65535/udp --permanent
firewall-cmd --reload
firewall-cmd --list-ports

# Option 2: stop and disable firewalld
systemctl stop firewalld
systemctl disable firewalld
systemctl status firewalld

# If iptables is in use: flush rules, delete custom chains, zero counters, then save and stop
sudo iptables -F
sudo iptables -X
sudo iptables -Z
sudo service iptables save
sudo service iptables stop
```
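Switching from UDP to TCP
A SQL statement that used to be very slow became much faster after the interconnect was switched from UDP to TCP, which indicates that much of the time had been spent on UDP retransmission.
If network or configuration problems cause severe UDP packet loss, the interconnect type can be changed to TCP: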
```shell
gpconfig -s gp_interconnect_type
# gpconfig -c gp_interconnect_type -v udpifc
gpconfig -c gp_interconnect_type -v tcp
gpconfig -c gp_interconnect_tcp_listener_backlog -v 10240
gpstop -M fast -ar

# gp_interconnect_tcp_listener_backlog does not require a restart

# New kernel values (for poor network performance or a large number of segments)
cat >> /etc/sysctl.conf <<"EOF"
net.ipv4.ipfrag_time = 240
net.ipv4.ipfrag_high_thresh = 161943040
net.ipv4.ipfrag_low_thresh = 101457280
net.ipv4.ipfrag_max_dist = 1000
net.core.somaxconn = 65535
EOF

sysctl -p
```
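The UDP interconnect relies heavily on IP fragmentation. Use the commands below to check whether the reassembly failure counters are increasing. If they grow quickly, switching to TCP mode is recommended. After switching, test whether SQL execution speed is normal, and watch for very simple SQL statements that take a long time (for example, fast in Navicat but slow from other web clients):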
```shell
gpssh -f all_hosts "netstat -s | grep failed | grep reassemblies"

gpssh -f all_hosts "netstat -s | grep timeout | grep dropp"
```
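Whether a counter is "increasing quickly" is easier to judge from two samples taken a known interval apart. The sketch below reads the same kernel counters directly from `/proc/net/snmp`, which is where `netstat -s` obtains its IP statistics on Linux. The function names `ip_snmp_counter` and `reasm_fails_delta` are hypothetical, and the 10-second default interval is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Print one counter from the "Ip:" section of a /proc/net/snmp style file.
# The first "Ip:" line holds the column names, the second holds the values.
# usage: ip_snmp_counter <counter-name> [file]   (file defaults to /proc/net/snmp)
ip_snmp_counter() {
    awk -v name="$1" '
        $1 == "Ip:" && !have_header { for (i = 2; i <= NF; i++) col[$i] = i; have_header = 1; next }
        $1 == "Ip:" && have_header  { if (name in col) print $(col[name]); exit }
    ' "${2:-/proc/net/snmp}"
}

# Print how much ReasmFails (failed IP reassemblies) grew during an interval.
# usage: reasm_fails_delta [seconds]   (defaults to 10, an arbitrary choice)
reasm_fails_delta() {
    local before after
    before=$(ip_snmp_counter ReasmFails)
    sleep "${1:-10}"
    after=$(ip_snmp_counter ReasmFails)
    echo $((after - before))
}
```

Running `reasm_fails_delta 60` on the suspect host while the slow query executes gives a per-minute failure count. `ReasmTimeout` can be sampled the same way with `ip_snmp_counter ReasmTimeout`.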
NIC MTU set too large (the default is 1500)
Set it to a smaller value:
```shell
ifconfig eth0 mtu 1100
```
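A larger MTU such as 9000 also works, but it is risky: if any device connecting the machines, such as a switch or router, does not support an MTU that large, the machines will be unable to communicate with each other.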
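One way to find out whether every device on the path accepts a given MTU is to send pings with the "don't fragment" bit set. The sketch below assumes Linux iputils `ping` (which provides `-M do`) and IPv4 without header options. The function names are hypothetical, and `sdw1` in the usage comment is just one of the hosts listed in this article.

```shell
#!/usr/bin/env bash
# ICMP payload size that exactly fills one packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Ping with fragmentation prohibited (-M do). The pings succeed only if a
# packet of the given MTU can cross the whole path without being fragmented.
# usage: probe_mtu <host> <mtu>
probe_mtu() {
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$2")" "$1"
}

# Usage:
#   probe_mtu sdw1 1500
#   probe_mtu sdw1 9000
```

If `probe_mtu sdw1 9000` fails while `probe_mtu sdw1 1500` succeeds, some device between the two hosts does not pass 9000-byte frames.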
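Lowering gp_max_packet_size
First lower gp_max_packet_size to the MTU or below, which generally means under 1500.
Lowering gp_max_packet_size limits the size of the packets that compute nodes send, which avoids IP fragmentation. However, this effectively makes the compute node's own software do the splitting. Compared with IP fragmentation, which may be offloaded to NIC hardware, performance drops sharply as gp_max_packet_size is lowered. Moreover, current Linux transmit optimizations such as TSO and GSO push segmentation as far down the network stack as possible, which significantly reduces the number of packets exchanged between the upper layers of the stack and yields a good performance gain.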
```shell
[gpadmin@mdw conf]$ gpconfig -s gp_max_packet_size
Values on all segments are consistent
GUC : gp_max_packet_size
Master value: 8192
Segment value: 8192

gpconfig -c gp_max_packet_size -v 1024
gpstop -u
```
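To see why the default of 8192 depends on fragmentation while 1024 does not, the arithmetic can be worked through. This is a rough sketch that assumes gp_max_packet_size is the size of the UDP payload and that IPv4 headers carry no options; the function name `udp_fragment_count` is hypothetical.

```shell
#!/usr/bin/env bash
# Number of IPv4 fragments needed to carry one UDP datagram.
# Assumption: the first argument is the UDP payload size in bytes.
# usage: udp_fragment_count <udp_payload_bytes> <mtu>
udp_fragment_count() {
    local ip_payload=$(( $1 + 8 ))               # add the 8-byte UDP header
    local per_fragment=$(( ($2 - 20) / 8 * 8 ))  # MTU minus 20-byte IPv4 header, rounded down to a multiple of 8
    echo $(( (ip_payload + per_fragment - 1) / per_fragment ))
}

# Usage:
#   udp_fragment_count 8192 1500
#   udp_fragment_count 1024 1500
```

Under these assumptions, an 8192-byte packet on a 1500-byte MTU is split into 6 fragments, and losing any one of them forces the whole packet to be resent. A 1024-byte packet fits in a single unfragmented packet.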
/etc/hosts misconfigured
While initializing the GP system, an unusual message appeared:
```text
-Host sdw5 is assigned as localhost in /etc/hosts
-This will cause segment->master communication failures
- Remove sdw5 from local host line in /etc/hosts
```
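I ignored it at the time and initialized the system anyway. After initialization, queries against the system catalog tables worked, but user-created tables could not be queried: the queries hung indefinitely.
A close look at the host's /etc/hosts file showed something odd, namely the line "127.0.0.1 localhost sdw5":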
```text
[gpadmin@sdw5 ~]$ cat /etc/hosts
127.0.0.1 localhost sdw5
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.72.6.50 mdw
172.72.6.51 smdw
172.72.6.52 sdw1
172.72.6.53 sdw2
172.72.6.54 sdw3
172.72.6.55 sdw4
172.72.6.52 sdw1
```
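What a trap.
I changed the line to "127.0.0.1 localhost", with sdw5 removed, and re-initialized the GP system. After that, creating and querying tables worked normally. This problem cost several days.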
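This misconfiguration is easy to check for before initialization. The following is a minimal sketch of such a check. The function name `host_on_loopback` is hypothetical, and it only inspects lines whose address starts with `127.` or equals `::1`.

```shell
#!/usr/bin/env bash
# Print every loopback line in a hosts file that also names the given host.
# Exit status is 0 if such a line exists (misconfigured), 1 otherwise.
# usage: host_on_loopback <hostname> [hosts-file]   (defaults to /etc/hosts)
host_on_loopback() {
    awk -v host="$1" '
        /^[[:space:]]*#/ { next }                      # skip comment lines
        ($1 ~ /^127\./ || $1 == "::1") {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break                   # stop at a trailing comment
                if ($i == host) { print; found = 1; break }
            }
        }
        END { exit (found ? 0 : 1) }
    ' "${2:-/etc/hosts}"
}

# Usage, on each segment host:
#   host_on_loopback "$(hostname)" && echo "remove this host name from the loopback line"
```

Run against the file shown above with `sdw5` as the host name, the function prints the offending `127.0.0.1 localhost sdw5` line.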
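If the problem is intermittent and SQL is sometimes fast and sometimes slow, packet loss may be the cause (recommended)
Investigate network performance and check whether packets are being lost.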
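As a first, coarse check for packet loss between two hosts, the summary line of `ping` can be parsed. This is only a sketch: the function name `ping_loss_percent` is hypothetical, `sdw1` is one of the hosts listed in this article, and the 8192-byte payload is an assumption chosen so that the test packets are fragmented in the same way as large interconnect packets.

```shell
#!/usr/bin/env bash
# Extract the packet-loss percentage from the summary line of ping output.
# Reads ping output on stdin and prints the number without the percent sign.
ping_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage:
#   ping -c 20 sdw1 | ping_loss_percent            # small packets
#   ping -c 20 -s 8192 sdw1 | ping_loss_percent    # large, fragmented packets
```

Loss that shows up only with the large payload points toward fragment reassembly rather than general link quality.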