StarRocks alert: backend heartbeat got exception... Read timed out
Symptom
2023-06-05 02:09:40,935 WARN (heartbeat-mgr-pool-2|119) [HeartbeatMgr$BackendHeartbeatHandler.call():337] backend heartbeat got exception, addr: 192.17.0.185:9050
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:455) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:354) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:243) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.13.0.jar:0.13.0]
    at com.starrocks.thrift.HeartbeatService$Client.recv_heartbeat(HeartbeatService.java:61) ~[starrocks-fe.jar:?]
    at com.starrocks.thrift.HeartbeatService$Client.heartbeat(HeartbeatService.java:48) ~[starrocks-fe.jar:?]
    at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:297) [starrocks-fe.jar:?]
    at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:275) [starrocks-fe.jar:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_372]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_372]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_372]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_372]
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_372]
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_372]
    at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_372]
    at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_372]
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_372]
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_372]
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_372]
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:125) ~[libthrift-0.13.0.jar:0.13.0]
    ... 13 more
2023-06-05 02:09:40,938 WARN (heartbeat mgr|32) [HeartbeatMgr.runAfterCatalogReady():177] get bad heartbeat response: type: BACKEND, status: BAD, msg: java.net.SocketTimeoutException: Read timed out
2023-06-05 02:11:35,998 WARN (heartbeat mgr|32) [HeartbeatMgr.runAfterCatalogReady():177] get bad heartbeat response: type: BROKER, status: BAD, msg: java.net.SocketTimeoutException: Read timed out, name: broker_name, host: 192.17.0.185, port: 8000
2023-06-05 02:11:46,007 WARN (heartbeat-mgr-pool-3|120) [HeartbeatMgr$BackendHeartbeatHandler.call():337] backend heartbeat got exception, addr: 192.17.0.185:9050
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: connect timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:226) ~[libthrift-0.13.0.jar:0.13.0]
    at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:144) ~[starrocks-fe.jar:?]
    at com.starrocks.common.GenericPool$ThriftClientFactory.create(GenericPool.java:129) ~[starrocks-fe.jar:?]
    at org.apache.commons.pool2.BaseKeyedPooledObjectFactory.makeObject(BaseKeyedPooledObjectFactory.java:62) ~[commons-pool2-2.3.jar:2.3]
    at org.apache.commons.pool2.impl.GenericKeyedObjectPool.create(GenericKeyedObjectPool.java:1036) ~[commons-pool2-2.3.jar:2.3]
    at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:356) ~[commons-pool2-2.3.jar:2.3]
    at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:278) ~[commons-pool2-2.3.jar:2.3]
    at com.starrocks.common.GenericPool.borrowObject(GenericPool.java:101) ~[starrocks-fe.jar:?]
    at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:289) [starrocks-fe.jar:?]
    at com.starrocks.system.HeartbeatMgr$BackendHeartbeatHandler.call(HeartbeatMgr.java:275) [starrocks-fe.jar:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_372]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_372]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_372]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_372]
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_372]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_372]
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_372]
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_372]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_372]
    at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_372]
    at org.apache.thrift.transport.TSocket.open(TSocket.java:221) ~[libthrift-0.13.0.jar:0.13.0]
    ... 13 more
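When the FE log holds many such stack traces, it can help to reduce them to one line per failed heartbeat: when it happened, which backend address was involved, and whether the failure was a read timeout or a connect timeout. The following is a minimal Python sketch under stated assumptions, not a StarRocks tool: the function name `summarize_heartbeat_failures` and the `SAMPLE` list are hypothetical, and the regular expressions only assume the log layout shown above.

```python
import re

# First line of a failed backend heartbeat entry, in the layout shown above.
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} WARN .*"
    r"backend heartbeat got exception, addr: (?P<addr>[\d.]+:\d+)"
)
# The exception line right after it names the kind of timeout.
TIMEOUT_RE = re.compile(
    r"SocketTimeoutException: (?P<kind>Read timed out|connect timed out)"
)


def summarize_heartbeat_failures(lines):
    """Return (timestamp, addr, kind) tuples for failed backend heartbeats."""
    failures = []
    pending = None  # (timestamp, addr) still waiting for its exception line
    for line in lines:
        header = HEARTBEAT_RE.match(line)
        if header:
            if pending is not None:
                failures.append((*pending, "unknown"))
            pending = (header.group("ts"), header.group("addr"))
            continue
        if pending is not None:
            # Only the line directly after the header is inspected.
            timeout = TIMEOUT_RE.search(line)
            kind = timeout.group("kind") if timeout else "unknown"
            failures.append((*pending, kind))
            pending = None
    if pending is not None:
        failures.append((*pending, "unknown"))
    return failures


# Hypothetical sample: the header and exception lines copied from the log above.
SAMPLE = [
    "2023-06-05 02:09:40,935 WARN (heartbeat-mgr-pool-2|119) "
    "[HeartbeatMgr$BackendHeartbeatHandler.call():337] "
    "backend heartbeat got exception, addr: 192.17.0.185:9050",
    "org.apache.thrift.transport.TTransportException: "
    "java.net.SocketTimeoutException: Read timed out",
    "2023-06-05 02:11:46,007 WARN (heartbeat-mgr-pool-3|120) "
    "[HeartbeatMgr$BackendHeartbeatHandler.call():337] "
    "backend heartbeat got exception, addr: 192.17.0.185:9050",
    "org.apache.thrift.transport.TTransportException: "
    "java.net.SocketTimeoutException: connect timed out",
]

if __name__ == "__main__":
    for ts, addr, kind in summarize_heartbeat_failures(SAMPLE):
        print(ts, addr, kind)
```

In this log the first failure is a read timeout on an already open connection, and the second, about two minutes later, is a connect timeout, so by then the FE could not even open a new connection to 192.17.0.185:9050.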
Analysis
MySQL [(none)]> show backends;
+-----------+--------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+
| BackendId | IP           | HeartbeatPort | BePort | HttpPort | BrpcPort | LastStartTime       | LastHeartbeat       | Alive | SystemDecommissioned | ClusterDecommissioned | TabletNum | DataUsedCapacity | AvailCapacity | TotalCapacity | UsedPct | MaxDiskUsedPct | ErrMsg | Version       | Status                                                 | DataTotalCapacity | DataUsedPct | CpuCores | NumRunningQueries | MemUsedPct | CpuUsedPct |
+-----------+--------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+
| 10007     | 192.16.0.104 | 9050          | 9060   | 8040     | 8060     | 2023-05-30 16:16:48 | 2023-06-05 09:21:21 | true  | false                | false                 | 156       | 8.506 MB         | 472.245 GB    | 490.989 GB    | 3.82 %  | 3.82 %         |        | 3.0.0-48f4d81 | {"lastSuccessReportTabletsTime":"2023-06-05 09:21:02"} | 472.253 GB        | 0.00 %      | 8        | 0                 | 1.82 %     | 0.1 %      |
| 10008     | 192.16.0.186 | 9050          | 9060   | 8040     | 8060     | 2023-05-30 16:16:48 | 2023-06-05 09:21:21 | true  | false                | false                 | 154       | 8.650 MB         | 445.186 GB    | 459.989 GB    | 3.22 %  | 3.22 %         |        | 3.0.0-48f4d81 | {"lastSuccessReportTabletsTime":"2023-06-05 09:20:28"} | 445.194 GB        | 0.00 %      | 8        | 0                 | 3.29 %     | 0.2 %      |
| 10009     | 192.17.0.185 | 9050          | 9060   | 8040     | 8060     | 2023-06-05 02:20:01 | 2023-06-05 09:21:21 | true  | false                | false                 | 154       | 8.708 MB         | 476.169 GB    | 490.989 GB    | 3.02 %  | 3.02 %         |        | 3.0.0-48f4d81 | {"lastSuccessReportTabletsTime":"2023-06-05 09:20:33"} | 476.178 GB        | 0.00 %      | 8        | 0                 | 3.21 %     | 0.1 %      |
+-----------+--------------+---------------+--------+----------+----------+---------------------+---------------------+-------+----------------------+-----------------------+-----------+------------------+---------------+---------------+---------+----------------+--------+---------------+--------------------------------------------------------+-------------------+-------------+----------+-------------------+------------+------------+
3 rows in set (0.00 sec)
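The node that raised the alert (192.17.0.185, backend 10009) is on a different network segment from the other two nodes (192.16.0.104 and 192.16.0.186), which is why its heartbeat failed. It has since recovered on its own: show backends reports Alive = true for all three nodes, and the LastStartTime of backend 10009 (2023-06-05 02:20:01) falls a few minutes after the errors in the log.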
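The network segment comparison can be checked quickly with Python's standard `ipaddress` module. This is a minimal sketch, not part of the original diagnosis: the helper name `group_by_network` is hypothetical, and the /16 prefix length is an assumption, since the real netmask of these hosts is not given here. The three addresses are taken from the show backends output.

```python
import ipaddress
from collections import defaultdict


def group_by_network(hosts, prefix_len=16):
    """Group host IPs by the network they fall into for a given prefix length.

    prefix_len=16 is an assumption; substitute the netmask actually
    configured on the hosts.
    """
    groups = defaultdict(list)
    for host in hosts:
        # strict=False lets a host address stand in for its enclosing network.
        network = ipaddress.ip_network(f"{host}/{prefix_len}", strict=False)
        groups[str(network)].append(host)
    return dict(groups)


# Backend IPs as listed by show backends.
BACKENDS = ["192.16.0.104", "192.16.0.186", "192.17.0.185"]

if __name__ == "__main__":
    for network, members in group_by_network(BACKENDS).items():
        print(network, members)
```

With a /16 prefix, 192.17.0.185 ends up alone in 192.17.0.0/16 while the other two share 192.16.0.0/16. Any prefix length of 16 or longer separates it from the other two in the same way.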