合 大页配置错误导致内存耗尽进而导致PMON (ospid: 12917): terminating the instance due to error 471
现象
客户反馈说有一个库不能启动,报错:
1 2 3 4 5 6 7 8 9 10 11 12 | SQL> startup ORACLE instance started. Total System Global Area 7918845952 bytes Fixed Size 8639328 bytes Variable Size 2197818528 bytes Database Buffers 5687476224 bytes Redo Buffers 24911872 bytes Database mounted. ORA-03113: end-of-file on communication channel Process ID: 12459 Session ID: 201 Serial number: 23649 |
分析
于是我连接到服务器上,使用nomount到mount再到open的方式来启动,nomount和mount都正常,但是最后open阶段报错了,告警日志如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | 2022-03-03T21:10:21.910113+08:00 alter database open 2022-03-03T21:10:21.963402+08:00 Ping without log force is disabled: instance mounted in exclusive mode. 2022-03-03T21:10:21.998833+08:00 Beginning crash recovery of 1 threads 2022-03-03T21:10:25.417324+08:00 Warning: VKTM detected a time stall. Time drifts can result in unexpected behavior such as time-outs. Please see the VKTM trace file for more details: /u01/app/oracle/diag/rdbms/ais/ais/trace/ais_vktm_12924.trc 2022-03-03T21:11:11.410737+08:00 Instance Critical Process (pid: 19, ospid: 12954, DBW0) died unexpectedly 2022-03-03T21:11:20.449081+08:00 PMON (ospid: 12917): terminating the instance due to error 471 2022-03-03T21:11:20.449081+08:00 PMON (ospid: 12917): terminating the instance due to error 471 2022-03-03T21:11:21.377591+08:00 System state dump requested by (instance=1, osid=12917 (PMON)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/ais/ais/trace/ais_diag_12938_20220303211121.trc 2022-03-03T21:11:22.482920+08:00 Dumping diagnostic data in directory=[cdmp_20220303211118], requested by (instance=1, osid=12917 (PMON)), summary=[abnormal instance termination]. 2022-03-03T21:11:23.604548+08:00 Instance terminated by PMON, pid = 12917 2022-03-03T21:11:23.608875+08:00 Warning: 2 processes are still attach to shmid 15: 2022-03-03T21:11:23.608875+08:00 Warning: 2 processes are still attach to shmid 15: (size: 36864 bytes, creator pid: 12913, last attach/detach pid: 12934) 2022-03-03T21:11:24.438847+08:00 USER (ospid: 13315): terminating the instance due to error 471 2022-03-03T21:11:24.443487+08:00 Instance terminated by USER, pid = 13315 |
报错是“PMON (ospid: 12917): terminating the instance due to error 471”,PMON死掉了。
查询系统日志如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 | [root@ais ~]# more /var/log/messages | grep oom | tail -n 10more /var/log/messages | grep oom | tail -n 10 Mar 3 21:05:31 ais kernel: [<ffffffffb57c1c6d>] ? oom_unkillable_task+0xcd/0x120 Mar 3 21:05:31 ais kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Mar 3 21:05:31 ais kernel: systemd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0 Mar 3 21:05:31 ais kernel: [<ffffffffb57c227d>] oom_kill_process+0x2cd/0x490 Mar 3 21:05:31 ais kernel: [<ffffffffb57c1c6d>] ? oom_unkillable_task+0xcd/0x120 Mar 3 21:05:31 ais kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Mar 3 21:11:05 ais kernel: systemd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0 Mar 3 21:11:05 ais kernel: [<ffffffffb57c227d>] oom_kill_process+0x2cd/0x490 Mar 3 21:11:05 ais kernel: [<ffffffffb57c1c6d>] ? oom_unkillable_task+0xcd/0x120 Mar 3 21:11:05 ais kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [root@ais ~]# [root@ais ~]# |
可以看到,发生了OOM,oracle是被系统杀死了,出现这个情况,一般是因为内存或资源要耗尽,系统强制kill掉进程,查看内存:
1 2 3 4 5 6 7 | [root@ais ~]# free -h total used free shared buff/cache available Mem: 125G 124G 611M 8.8M 346M 245M Swap: 0B 0B 0B [root@ais ~]# [root@ais ~]# cat /etc/sysctl.conf | grep nr_hugepages vm.nr_hugepages=88850 |
可以看到,系统剩余不到250MB的内存,而大页配置了88850*2/1024=174GB,远超系统内存。
这才想起来,原来我在配置大页的时候,使用了另一个系统的脚本,然后下午重启了主机,所以导致问题出现。