[Troubleshooting] RAC node fails to start with the error "has a disk HB, but no network HB"
Today a colleague told me that a 19c RAC environment of his had stopped working and asked me to take a look.
This RAC runs on Huawei Cloud ECS instances with CentOS 7.6. In my experience, a RAC that will not start usually comes down to one of two things: shared storage or the network. Common storage causes are a dropped disk, a failed disk, or multipath software problems; common network causes are a failed private-interconnect NIC or broken connectivity between the nodes. (Note: changing the ssh port, or the oracle and grid passwords, does not affect a running RAC.)
Unfortunately, this environment had problems with both. Let's walk through them.
Cause 1: a shared disk had dropped
First, check whether the shared disks on the two nodes match. It turned out node 2 was missing one disk, so I had the customer re-attach it. After that, the shared disks were consistent again:
[root@oracle-rac2 ~]# ll /dev/asm*
lrwxrwxrwx 1 root root 3 Jul 30 11:09 /dev/asm-diska -> sde
lrwxrwxrwx 1 root root 3 Jul 30 11:09 /dev/asm-diskb -> sdd
lrwxrwxrwx 1 root root 3 Jul 30 11:03 /dev/asm-diskc -> sdc
lrwxrwxrwx 1 root root 3 Jul 30 11:03 /dev/asm-diskd -> sdb
lrwxrwxrwx 1 root root 3 Jul 30 10:55 /dev/asm-diske -> sda

[root@oracle-rac1 trace]# ll /dev/asm*
lrwxrwxrwx 1 root root 3 Jul 30 11:10 /dev/asm-diska -> sde
lrwxrwxrwx 1 root root 3 Jul 30 11:03 /dev/asm-diskb -> sdb
lrwxrwxrwx 1 root root 3 Jul 30 10:23 /dev/asm-diskc -> sda
lrwxrwxrwx 1 root root 3 Jul 30 11:10 /dev/asm-diskd -> sdd
lrwxrwxrwx 1 root root 3 Jul 30 11:03 /dev/asm-diske -> sdc

[root@oracle-rac2 ~]# $GRID_HOME/bin/kfod disks=asm st=true ds=true cluster=true
--------------------------------------------------------------------------------
 Disk          Size Header    Path                 Disk Group   User     Group
================================================================================
   1:      81920 MB MEMBER    /dev/asm-diska       DATA         grid     asmadmin
   2:      81920 MB MEMBER    /dev/asm-diskb       OCR          grid     asmadmin
   3:      81920 MB MEMBER    /dev/asm-diskc       DATA         grid     asmadmin
   4:      81920 MB MEMBER    /dev/asm-diskd       DATA         grid     asmadmin
   5:      81920 MB MEMBER    /dev/asm-diske       DATA         grid     asmadmin
--------------------------------------------------------------------------------
ORACLE_SID ORACLE_HOME HOST_NAME
================================================================================

[root@oracle-rac1 trace]# $GRID_HOME/bin/kfod disks=asm st=true ds=true cluster=true
--------------------------------------------------------------------------------
 Disk          Size Header    Path                 Disk Group   User     Group
================================================================================
   1:      81920 MB MEMBER    /dev/asm-diska       DATA         grid     asmadmin
   2:      81920 MB MEMBER    /dev/asm-diskb       DATA         grid     asmadmin
   3:      81920 MB MEMBER    /dev/asm-diskc       DATA         grid     asmadmin
   4:      81920 MB MEMBER    /dev/asm-diskd       OCR          grid     asmadmin
   5:      81920 MB MEMBER    /dev/asm-diske       DATA         grid     asmadmin
--------------------------------------------------------------------------------
ORACLE_SID ORACLE_HOME HOST_NAME
================================================================================
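As a quick sanity check, the two kfod listings can be compared while ignoring per-node device order: what must match is the set of disk-group memberships, not which /dev/asm-* name points at which group. Below is a minimal sketch using the group columns from the sample output above; in practice you would extract the real column from each node's kfod output instead of hard-coding it:

```shell
# Disk-group membership per node, taken from the kfod listings above.
# The path -> group mapping may legitimately differ between nodes
# (udev naming order), so compare only the sorted list of group names.
node1_groups="DATA OCR DATA DATA DATA"
node2_groups="DATA DATA DATA OCR DATA"

sorted1=$(tr ' ' '\n' <<<"$node1_groups" | sort)
sorted2=$(tr ' ' '\n' <<<"$node2_groups" | sort)

if [ "$sorted1" = "$sorted2" ]; then
    echo "disk-group sets match"
else
    echo "disk-group sets differ: a disk may be missing or mis-attached"
fi
```

With one disk missing on node 2 (as in this incident), the sorted lists would differ and the check would flag it immediately.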
The disk order differs between the nodes, but that does not matter: the devices are bound with udev rules, so the ordering has no effect on RAC.
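For reference, the udev binding mentioned here is typically done with a rules file along the following lines. This is an illustrative sketch only: the RESULT value would be the scsi_id of the actual LUN and is a hypothetical placeholder here, not taken from this environment.

```
# /etc/udev/rules.d/99-oracle-asmdevices.rules (illustrative values only)
# Match each LUN by its scsi_id and create a stable /dev/asm-* symlink
# owned by grid:asmadmin, regardless of the kernel's sdX enumeration order.
KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$name", \
  RESULT=="3600xxxxxxxxxxxxxxxxxxxxxxxxxxxxx", SYMLINK+="asm-diska", \
  OWNER="grid", GROUP="asmadmin", MODE="0660"
```

Because the match is on the LUN's WWID rather than the sdX name, the same shared disk always gets the same /dev/asm-* alias on every node, which is why differing sdX order is harmless.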
Once the RAC nodes start normally, you can see the following:
-- node 1
SYS@orcl1> set line 9999
SYS@orcl1> set pagesize 9999
SYS@orcl1> col path format a60
SYS@orcl1> SELECT a.group_number, disk_number, mount_status, a.name, path FROM v$asm_disk a order by a.disk_number;
select instance_name,status from v$instance;

GROUP_NUMBER DISK_NUMBER MOUNT_STATUS NAME          PATH
------------ ----------- ------------ ------------- ---------------
           1           0 CACHED       DATA_0000     /dev/asm-diskc
           2           0 CACHED       OCR_0000      /dev/asm-diskd
           1           1 CACHED       DATA_0001     /dev/asm-diske
           1           2 CACHED       DATA_0002     /dev/asm-diska
           1           3 CACHED       DATA_0003     /dev/asm-diskb

-- node 2
SQL> set line 9999
SQL> set pagesize 9999
SQL> col path format a60
SQL> SELECT a.group_number, disk_number, mount_status, a.name, path FROM v$asm_disk a order by a.disk_number;
select instance_name,status from v$instance;

GROUP_NUMBER DISK_NUMBER MOUNT_S NAME       PATH
------------ ----------- ------- ---------- ---------------
           2           0 CACHED  OCR_0000   /dev/asm-diskb
           1           0 CACHED  DATA_0000  /dev/asm-diske
           1           1 CACHED  DATA_0001  /dev/asm-diskc
           1           2 CACHED  DATA_0002  /dev/asm-diska
           1           3 CACHED  DATA_0003  /dev/asm-diskd
Cause 2: the security group was blocking traffic
Logging in to the ECS instances, I found that only node 1 was running the cluster stack; node 2 was not running its cluster services.
[root@oracle-rac1 ~]# crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       oracle-rac1              STABLE
ora.chad
               ONLINE  ONLINE       oracle-rac1              STABLE
ora.net1.network
               ONLINE  ONLINE       oracle-rac1              STABLE
ora.ons
               ONLINE  ONLINE       oracle-rac1              STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr(ora.asmgroup)
      1        ONLINE  ONLINE       oracle-rac1              STABLE
      2        ONLINE  OFFLINE                               STABLE
      3        ONLINE  OFFLINE                               STABLE
ora.DATA.dg(ora.asmgroup)
      1        ONLINE  ONLINE       oracle-rac1              STABLE
      2        ONLINE  OFFLINE                               STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       oracle-rac1              STABLE
ora.OCR.dg(ora.asmgroup)
      1        ONLINE  ONLINE       oracle-rac1              STABLE
      2        ONLINE  OFFLINE                               STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.asm(ora.asmgroup)
      1        ONLINE  ONLINE       oracle-rac1              Started,STABLE
      2        ONLINE  OFFLINE                               STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.asmnet1.asmnetwork(ora.asmgroup)
      1        ONLINE  ONLINE       oracle-rac1              STABLE
      2        ONLINE  OFFLINE                               STABLE
      3        OFFLINE OFFLINE                               STABLE
ora.cvu
      1        ONLINE  ONLINE       oracle-rac1              STABLE
ora.oracle-rac1.vip
      1        ONLINE  ONLINE       oracle-rac1              STABLE
ora.oracle-rac2.vip
      1        ONLINE  INTERMEDIATE oracle-rac1              FAILED OVER,STABLE
ora.orcl.db
      1        ONLINE  ONLINE       oracle-rac1              Open,HOME=/u01/app/o
                                                             racle/product/19.3.0
                                                             /dbhome_1,STABLE
      2        ONLINE  OFFLINE                               Instance Shutdown,ST
                                                             ABLE
ora.qosmserver
      1        ONLINE  ONLINE       oracle-rac1              STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       oracle-rac1              STABLE
--------------------------------------------------------------------------------
I started node 2's cluster services with crsctl start has, then watched the startup with crsctl stat res -t -init:
[root@oracle-rac2 ~]# crsctl start has
CRS-4123: Oracle High Availability Services has been started.
[root@oracle-rac2 ~]# crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               STABLE
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE                               STABLE
ora.crf
      1        ONLINE  ONLINE       oracle-rac2              STABLE
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  OFFLINE      oracle-rac2              STARTING
ora.cssdmonitor
      1        ONLINE  ONLINE       oracle-rac2              STABLE
ora.ctssd
      1        ONLINE  OFFLINE                               STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  INTERMEDIATE oracle-rac2              STABLE
ora.gipcd
      1        ONLINE  ONLINE       oracle-rac2              STABLE
ora.gpnpd
      1        ONLINE  ONLINE       oracle-rac2              STABLE
ora.mdnsd
      1        ONLINE  ONLINE       oracle-rac2              STABLE
ora.storage
      1        ONLINE  OFFLINE                               STABLE
--------------------------------------------------------------------------------
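When polling this output repeatedly, it helps to filter it down to resources whose target is ONLINE but whose actual state is not. A small sketch over sample crsctl-style lines; on a real node you would pipe `crsctl stat res -t -init` into the same awk filter:

```shell
# Flag resources whose Target is ONLINE but whose State is not.
# The sample mimics the two-line resource layout shown above.
crsctl_output='ora.cssd
      1        ONLINE  OFFLINE      oracle-rac2              STARTING
ora.gipcd
      1        ONLINE  ONLINE       oracle-rac2              STABLE'

echo "$crsctl_output" | awk '
  /^ora\./ { res = $1; next }                                # remember the resource name line
  $2 == "ONLINE" && $3 != "ONLINE" { print res " -> " $3 }   # target ONLINE, state lagging
'
```

On the sample input this prints `ora.cssd -> OFFLINE`, which is exactly the stuck resource in this incident.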
Running crsctl stat res -t -init several times showed the startup stuck on the ora.cssd resource, so I checked the logs:
[root@oracle-rac2 ~]# cd /u01/app/grid/diag/crs/oracle-rac2/crs/trace/
[root@oracle-rac2 trace]# tailf ocssd.trc
...
2021-07-30 09:49:44.851 : CSSD:1881474816: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization
2021-07-30 09:49:45.579 : CSSD:1530889984: [ INFO] clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186
2021-07-30 09:49:45.835 : CSSD:1493997312: [ INFO] clssnmvDHBValidateNCopy: node 1, oracle-rac1, has a disk HB, but no network HB, DHB has rcfg 523049052, wrtcnt, 1472163, LATS 2100744, lastSeqNo 1472162, uniqueness 1627554226, timestamp 1627609785/55291034
2021-07-30 09:49:45.852 : CSSD:1881474816: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization
2021-07-30 09:49:45.874 : CSSD:1486112512: [ INFO] clssscSelect: gipcwait returned with status gipcretTimeout (16)
2021-07-30 09:49:46.579 : CSSD:1530889984: [ INFO] clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186
2021-07-30 09:49:46.837 : CSSD:1493997312: [ INFO] clssnmvDHBValidateNCopy: node 1, oracle-rac1, has a disk HB, but no network HB, DHB has rcfg 523049052, wrtcnt, 1472164, LATS 2101744, lastSeqNo 1472163, uniqueness 1627554226, timestamp 1627609786/55292034
2021-07-30 09:49:46.850 : CSSD:1487689472: [ INFO] clssnmRcfgMgrThread: Local Join
2021-07-30 09:49:46.850 : CSSD:1487689472: [ INFO] clssnmLocalJoinEvent: begin on node(2), waittime 193000
2021-07-30 09:49:46.850 : CSSD:1487689472: [ INFO] clssnmLocalJoinEvent: set curtime (2101764) for my node
2021-07-30 09:49:46.850 : CSSD:1487689472: [ INFO] clssnmLocalJoinEvent: scanning 32 nodes
2021-07-30 09:49:46.850 : CSSD:1487689472: [ INFO] clssnmLocalJoinEvent: Node oracle-rac1, number 1, is in an existing cluster with disk state 3
2021-07-30 09:49:46.850 : CSSD:1487689472: [ WARNING] clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
...
This output repeats over and over, and it contains the key phrase "has a disk HB, but no network HB": there is a disk heartbeat but no network heartbeat. That points to a private-interconnect (network heartbeat) problem. The alert log and the ohasd trace had nothing useful:
[root@oracle-rac2 trace]# tailf alert.log
2021-07-30 09:47:56.758 [GIPCD(13240)]CRS-8500: Oracle Clusterware GIPCD process is starting with operating system process ID 13240
2021-07-30 09:48:01.127 [OSYSMOND(13348)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 13348
2021-07-30 09:48:01.234 [CSSDMONITOR(13346)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 13346
2021-07-30 09:48:02.268 [CSSDAGENT(13407)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 13407
2021-07-30 09:48:03.024 [OCSSD(13467)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 13467
2021-07-30 09:48:04.135 [OCSSD(13467)]CRS-1713: CSSD daemon is started in hub mode
2021-07-30 09:48:06.416 [OCSSD(13467)]CRS-1707: Lease acquisition for node oracle-rac2 number 2 completed
2021-07-30 09:48:07.528 [OCSSD(13467)]CRS-1621: The IPMI configuration data for this node stored in the Oracle registry is incomplete; details at (:CSSNK00002:) in /u01/app/grid/diag/crs/oracle-rac2/crs/trace/ocssd.trc
2021-07-30 09:48:07.528 [OCSSD(13467)]CRS-1617: The information required to do node kill for node oracle-rac2 is incomplete; details at (:CSSNM00004:) in /u01/app/grid/diag/crs/oracle-rac2/crs/trace/ocssd.trc
2021-07-30 09:48:07.534 [OCSSD(13467)]CRS-1605: CSSD voting file is online: /dev/asm-diskb; details in /u01/app/grid/diag/crs/oracle-rac2/crs/trace/ocssd.trc.
2021-07-30 09:58:02.736 [CSSDAGENT(13407)]CRS-5818: Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:5:3} in /u01/app/grid/diag/crs/oracle-rac2/crs/trace/ohasd_cssdagent_root.trc.
2021-07-30 09:58:02.737 [OCSSD(13467)]CRS-1609: This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00086:) in /u01/app/grid/diag/crs/oracle-rac2/crs/trace/ocssd.trc.
2021-07-30 09:58:02.785 [OHASD(12890)]CRS-2757: Command 'Start' timed out waiting for response from the resource 'ora.cssd'. Details at (:CRSPE00221:) {0:5:3} in /u01/app/grid/diag/crs/oracle-rac2/crs/trace/ohasd.trc.
2021-07-30 09:58:03.738 [OCSSD(13467)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/diag/crs/oracle-rac2/crs/trace/ocssd.trc
2021-07-30 09:58:03.738 [OCSSD(13467)]CRS-1603: CSSD on node oracle-rac2 has been shut down.
2021-07-30 09:58:03.840 [OCSSD(13467)]CRS-1609: This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00086:) in /u01/app/grid/diag/crs/oracle-rac2/crs/trace/ocssd.trc.
2021-07-30T09:58:08.746077+08:00
Errors in file /u01/app/grid/diag/crs/oracle-rac2/crs/trace/ocssd.trc (incident=353):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/oracle-rac2/crs/incident/incdir_353/ocssd_i353.trc
2021-07-30 09:58:08.739 [OCSSD(13467)]CRS-8503: Oracle Clusterware process OCSSD with operating system process ID 13467 experienced fatal signal or exception code 6.
2021-07-30 09:58:09.829 [CSSDMONITOR(20631)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 20631

[root@oracle-rac2 trace]# tailf ohasd.trc
2021-07-30 09:50:51.551 : CRSPE:1226725120: [ INFO] {0:0:110} Processing PE command id=133 origin:oracle-rac2. Description: [Stat Resource : 0x7f2cec183100]
2021-07-30 09:50:51.551 : CRSPE:1226725120: [ INFO] {0:0:110} Expression Filter : ((LAST_SERVER == oracle-rac2) AND (NAME == ora.cssd))
2021-07-30 09:50:51.562 :UiServer:1220421376: [ INFO] {0:0:110} Done for ctx=0x7f2cf803a9e0
2021-07-30 09:51:02.159 :GIPCHTHR:1243535104: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30080 loopCount 38 sendCount 0 recvCount 102 postCount 0 sendCmplCount 0 recvCmplCount 0
2021-07-30 09:51:04.819 :GIPCHTHR:1216218880: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30860 loopCount 28
2021-07-30 09:51:13.992 :UiServer:1220421376: [ INFO] {0:0:111} Sending to PE. ctx= 0x7f2cf803d490, ClientPID=13128 set Properties (grid,101398)
2021-07-30 09:51:13.992 : CRSPE:1226725120: [ INFO] {0:0:111} Processing PE command id=134 origin:oracle-rac2. Description: [Stat Resource : 0x7f2cec183100]
2021-07-30 09:51:14.002 :UiServer:1220421376: [ INFO] {0:0:111} Done for ctx=0x7f2cf803d490
2021-07-30 09:51:32.232 :GIPCHTHR:1243535104: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30070 loopCount 31 sendCount 0 recvCount 36 postCount 0 sendCmplCount 0 recvCmplCount 0
2021-07-30 09:51:35.322 :GIPCHTHR:1216218880: gipchaDaemonWork: DaemonThread heart beat, time interval since last heartBeat 30500 loopCount 28
2021-07-30 09:51:49.722 : CRSPE:1226725120: [ INFO] {0:0:2} waiting for message 'RESOURCE_START[ora.cssd 1 1] ID 4098:339, tint:{0:5:3}' to be completed on server : oracle-rac2
2021-07-30 09:51:51.431 :UiServer:1220421376: [ INFO] {0:0:112} Sending to PE. ctx= 0x7f2cf803a810, ClientPID=12992 set Properties (root,102335), orig.tint: {0:0:2}
2021-07-30 09:51:51.431 : CRSPE:1226725120: [ INFO] {0:0:112} Processing PE command id=135 origin:oracle-rac2. Description: [Stat Resource : 0x7f2cec183100]
2021-07-30 09:51:51.438 :UiServer:1220421376: [ INFO] {0:0:112} Done for ctx=0x7f2cf803a810
2021-07-30 09:51:51.487 :UiServer:1220421376: [ INFO] {0:0:113} Sending to PE. ctx= 0x7f2cf803da20, ClientPID=12992 set Properties (root,103205), orig.tint: {0:0:2}
2021-07-30 09:51:51.487 : CRSPE:1226725120: [ INFO] {0:0:113} Processing PE command id=136 origin:oracle-rac2. Description: [Stat Resource : 0x7f2cec183100]
2021-07-30 09:51:51.494 :UiServer:1220421376: [ INFO] {0:0:113} Done for ctx=0x7f2cf803da20
2021-07-30 09:51:51.539 :UiServer:1220421376: [ INFO] {0:0:114} Sending to PE. ctx= 0x7f2cf80012a0, ClientPID=12992 set Properties (root,103573), orig.tint: {0:0:2}
2021-07-30 09:51:51.539 : CRSPE:1226725120: [ INFO] {0:0:114} Processing PE command id=137 origin:oracle-rac2. Description: [Stat Resource : 0x7f2cec183100]
2021-07-30 09:51:51.539 : CRSPE:1226725120: [ INFO] {0:0:114} Expression Filter : ((LAST_SERVER == oracle-rac2) AND (NAME == ora.cssd))
2021-07-30 09:51:51.545 :UiServer:1220421376: [ INFO] {0:0:114} Done for ctx=0x7f2cf80012a0
Node 1's logs were similarly unhelpful.
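The repeating clssnmvDHBValidateNCopy lines are worth parsing: if the wrtcnt value keeps increasing between samples, node 1 really is writing disk heartbeats to the voting disk, which isolates the fault to the network heartbeat path. A small sketch over shortened copies of the trace lines above; on a real system you would grep ocssd.trc directly:

```shell
# Extract the wrtcnt (disk-heartbeat write count) from successive trace lines.
# An increasing count confirms the disk heartbeat is alive.
trc='clssnmvDHBValidateNCopy: node 1, oracle-rac1, has a disk HB, but no network HB, DHB has rcfg 523049052, wrtcnt, 1472163, LATS 2100744
clssnmvDHBValidateNCopy: node 1, oracle-rac1, has a disk HB, but no network HB, DHB has rcfg 523049052, wrtcnt, 1472164, LATS 2101744'

echo "$trc" | grep -o 'wrtcnt, [0-9]*' | awk '{ print $2 }'
```

Here the counter advances from 1472163 to 1472164 between the two samples, matching what the trace shows: the disk path is healthy and only the network heartbeat is missing.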
Pinging both the public and the private addresses worked fine:
[root@oracle-rac2 ~]# ping oracle-rac1
PING oracle-rac1 (172.18.0.66) 56(84) bytes of data.
64 bytes from oracle-rac1 (172.18.0.66): icmp_seq=1 ttl=64 time=0.250 ms
^C
--- oracle-rac1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.250/0.250/0.250/0.000 ms
[root@oracle-rac2 ~]# ping oracle-rac1-priv
PING oracle-rac1-priv (172.18.1.66) 56(84) bytes of data.
64 bytes from oracle-rac1-priv (172.18.1.66): icmp_seq=1 ttl=64 time=0.360 ms
^C
--- oracle-rac1-priv ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.360/0.360/0.360/0.000 ms
Next I checked with traceroute:
[root@oracle-rac2 ~]# traceroute oracle-rac1
traceroute to oracle-rac1 (172.18.0.66), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *^C
[root@oracle-rac2 ~]# traceroute oracle-rac1-priv
traceroute to oracle-rac1-priv (172.18.1.66), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  *^C
[root@oracle-rac2 ~]# traceroute oracle-rac2
traceroute to oracle-rac2 (127.0.0.1), 30 hops max, 60 byte packets
 1  localhost (127.0.0.1)  0.021 ms  0.003 ms  0.003 ms
traceroute failed to both the public and the private addresses, so the network really did have a problem. The lesson: a successful ping does not prove the network is healthy; it only shows that ICMP traffic between the nodes is getting through.
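The pattern seen here (ICMP passes, UDP-based traceroute does not) can be captured as a tiny triage helper. The function name and messages below are my own invention, not part of any tool; on a real node you would feed it the actual ping and traceroute results, and `traceroute -I` (which switches the probes to ICMP) makes a useful cross-check:

```shell
# Rough triage of the ping-vs-traceroute combination (hypothetical helper).
classify() {
    # $1 = ping result (ok|fail), $2 = default UDP traceroute result (ok|fail)
    case "$1:$2" in
        ok:ok)   echo "path looks clean for ICMP and UDP" ;;
        ok:fail) echo "ICMP passes but UDP is filtered: check cloud security group / firewall" ;;
        *)       echo "basic IP connectivity broken: check NIC, routes, switch" ;;
    esac
}

# This incident: ping succeeded, traceroute timed out on every hop.
classify ok fail
```

Running `classify ok fail` points straight at protocol-level filtering, which is consistent with the security-group cause identified in the title.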
I had not run into this error before, so I turned to MOS and the search engines. Here is a summary of the causes others have reported:
CSSD not starting up on second Node in a 2 Node Cluster. (Doc ID 2519544.1): the cause there was security software (or malware) running on the server; the fix was to stop the security software.
Disable HAIP: this environment already had HAIP disabled; see https://www.dbaup.com/dbbao44oracle-racjiqunzhongdeipleixingjianjie.html
A faulty switch or NIC: I deleted and recreated the NIC (roughly equivalent to unplugging it and plugging it back in), which did not help; bringing the NIC down and up again did not help either, so in my case the NIC itself was not the problem.
Firewall: the OS firewall was already disabled on this system.
Note: traceroute probes with UDP packets by default. That is exactly why it failed here while ping succeeded: the cloud security group permitted ICMP but filtered UDP, which would also block the UDP traffic the CSS daemons use for the network heartbeat over the private interconnect.