合 Oracle服务器不返回消息给客户端(失去连接),告警日志中“Fatal NI connect error 12170”、TNS-12535、TNS-00505、nt OS err code、nt main err code等错误处理
Tags: Oracle故障处理DCDora-12170TNS-00505TNS-12535丢失连接假死长连接
现象
现象1
在数据库服务器上执行一个update语句或执行一个存储过程,时间大概40分钟,可以正常结束,可以获取到执行成功的返回消息,但是拿到远程客户端去执行的时候(无论是使用sqlplus还是plsql developer或者Navicat执行),执行了很久也没有返回结果,,,但是,通过查询会话,发现该SQL已经正常执行完成,只是没有返回消息给客户端:
1 | SELECT dbms_sqltune.report_sql_monitor(sql_id => '99qfh6psd3cpy',type => 'html',report_level => 'all') FROM dual; |
查看告警日志报错:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | Fatal NI connect error 12170. VERSION INFORMATION: TNS for Linux: Version 11.2.0.4.0 - Production Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production Time: 08-MAY-2017 10:24:32 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12560 nt main err code: 505 TNS-00505: Operation timed out nt secondary err code: 110 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.176.172.44)(PORT=55353)) Fatal NI connect error 12170. VERSION INFORMATION: TNS for 64-bit Windows: Version 19.0.0.0.0 - Production Oracle Bequeath NT Protocol Adapter for 64-bit Windows: Version 19.0.0.0.0 - Production Windows NT TCP/IP NT Protocol Adapter for 64-bit Windows: Version 19.0.0.0.0 - Production Version 19.3.0.0.0 Time: 13-4月 -2023 11:08:23 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS: 操作超时 ns secondary err code: 12560 nt main err code: 505 TNS-00505: 操作超时 nt secondary err code: 60 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.142.28.51)(PORT=53889)) Fatal NI connect error 12170. VERSION INFORMATION: TNS for Linux: Version 19.0.0.0.0 - Production Oracle Bequeath NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production Version 19.3.0.0.0 Time: 08-MAY-2023 19:04:19 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12560 nt main err code: 505 TNS-00505: Operation timed out nt secondary err code: 110 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.100.2.63)(PORT=62813)) 2023-05-08T19:04:23.016284+08:00 |
仔细分析出问题的时间点,就是SQL执行完成的时间。。。。
现象2
有意思的是,INACTIVE状态的会话并不会被断开,而ACTIVE状态的会话跑着跑着却会被断开,但其实服务端会话仍然在,这个通过开多个sqlplus客户端可以观察到。
分析
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | [oracle@lhrora19c ~]$ oerr ora 12170 12170, 00000, "TNS:Connect timeout occurred" // *Cause: The server shut down because connection establishment or // communication with a client failed to complete within the allotted time // interval. This may be a result of network or system delays; or this may // indicate that a malicious client is trying to cause a Denial of Service // attack on the server. // *Action: If the error occurred because of a slow network or system, // reconfigure one or all of the parameters SQLNET.INBOUND_CONNECT_TIMEOUT, // SQLNET.SEND_TIMEOUT, SQLNET.RECV_TIMEOUT in sqlnet.ora to larger values. // If a malicious client is suspected, use the address in sqlnet.log to // identify the source and restrict access. Note that logged addresses may // not be reliable as they can be forged (e.g. in TCP/IP). [oracle@lhrora19c ~]$ |
从MOS上的信息反馈看,这个类型错误提示是一种正常的Oracle工作机制。当客户端进程Client Process与服务器进程Server Process建立联系之后,两者就形成了“同生共死”的关系(专有连接模式)。除非客户端主动发起中断或者Server Process被异常kill。
在实际运行环境中,这种理想状态常常被打破。如果Client Process只是保持连接,不执行语句,会话就处于idle状态。这种连接很容易被诸如防火墙等网络层面设备切断。
在Oracle11gR2中,如果长期没有连接动作的Server Process被外力切断,Oracle就会自动将信息作为提示错误写入到alert log中,作为一种提示。在11R1版本中,这种信息是会写入到sqlnet.log中。
归纳MOS和网络中的各种方法,大体有两重策略,分别为使用DCD和禁用ADR。
DCD全称Dead Connection Detection,是一种基于主动测探方式检查Oracle僵尸客户端进程Client Process的策略。配置DCD的关键是设置sqlnet.expire_time参数在SQL Net体系下,Oracle会依据这个时间间隔给所有的Client Process发送网络通信包,用来确定Client是否存活。正是借助这个包通信,可以让防火墙认为这个网络连接还是处在active状态,不会进行强制断开动作。类似的机制还有Linux上的tcp keep live机制,也是使用类似的策略进行检查。
另一种方式也是Oracle推荐的,就是关闭11g的ADR机制。ADR(Automatic Diagnostic Repository)是Oracle进行自动诊断、自动提醒的工具组件。Oracle认为如果用户不需要在SQL Net组件中应用ADR,可以再sqlnet.ora中进行配置关闭。
nt secondary err code返回值的不同
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | The 'nt secondary err code' translates to underlying network transport timeouts for the following Operating Systems: For the Solaris system: nt secondary err code: 145: ETIMEDOUT 145 /* Connection timed out */ For the Linux operating system: nt secondary err code: 110 ETIMEDOUT 110 Connection timed out For the HP-UX system: nt secondary err code: 238: ETIMEDOUT 238 /* Connection timed out */ For AIX: nt secondary err code: 78: ETIMEDOUT 78 /* Connection timed out */ For Windows based platforms: nt secondary err code: 60 (which translates to Winsock Error: 10060) |
生成日志
可以配置trace进行跟踪,生成日志来分析:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | -- 服务端trace TRACE_LEVEL_SERVER = 16 TRACE_FILE_SERVER = SERVER TRACE_DIRECTORY_SERVER= /u01/app/oracle/network/trace TRACE_TIMESTAMP_SERVER = ON TRACE_UNIQUE_SERVER = ON DIAG_ADR_ENABLED=OFF -- 客户端trace DIAG_ADR_ENABLED = off TRACE_LEVEL_CLIENT = 16 TRACE_UNIQUE_CLIENT = on TRACE_DIRECTORY_CLIENT = C:\Users\Administrator\Desktop\instantclient_21_9\network\admin\log TRACE_FILE_CLIENT = client |
也可以使用事件跟踪“ORA-03135: 连接失去联系错误”生成trace:
1 2 3 4 5 6 7 8 9 10 11 | SELECT VALUE FROM V$DIAG_INFO WHERE NAME LIKE '%Default%'; 15:24:14 SQL> alter session set events '3135 trace name errorstack level 3'; 会话已更改。 已用时间: 00: 00: 00.02 15:24:16 SQL> show parameter name ORA-03135: 连接失去联系 进程 ID: 1000 会话 ID: 371 序列号: 15777 |
SQL查询排查
1 2 3 4 | select a.LAST_CALL_ET,a.sid,a.SERIAL#,a.status,a.sql_id,a.USERNAME,a.EVENT,port from v$session a where a.username is not null and a.PROGRAM='sqlplus.exe'; SELECT dbms_sqltune.report_sql_monitor(sql_id => 'dgrq7u1w7vp9j',type => 'html',report_level => 'all') FROM dual; |
解决
1、修改数据库服务器端的listener.ora 和 sqlnet.ora文件
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | -- listener.ora INBOUND_CONNECT_TIMEOUT_LISTENER = 604800 DIAG_ADR_ENABLED_LISTENER = OFF -- sqlnet.ora USE_NS_PROBES_FOR_DCD=TRUE sqlnet.expire_time = 5 SQLNET.INBOUND_CONNECT_TIMEOUT=604800 SQLNET.OUTBOUND_CONNECT_TIMEOUT=604800 SQLNET.RECV_TIMEOUT=604800 SQLNET.SEND_TIMEOUT=604800 TCP.CONNECT_TIMEOUT=604800 DISABLE_OOB=ON -- 12c SQLNET.EXPIRE_ACTION=trace |
2、若是Linux服务器,则配置如下参数:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | cat /proc/sys/net/ipv4/tcp_keepalive_time cat /proc/sys/net/ipv4/tcp_keepalive_intvl cat /proc/sys/net/ipv4/tcp_keepalive_probes cat /proc/sys/net/ipv4/tcp_max_tw_buckets cat /proc/sys/net/ipv4/tcp_max_syn_backlog cat /proc/sys/net/core/somaxconn cat >> /etc/sysctl.conf <<"EOF" net.ipv4.tcp_keepalive_time=300 net.ipv4.tcp_keepalive_intvl=30 net.ipv4.tcp_keepalive_probes=10 net.ipv4.tcp_max_tw_buckets = 262144 net.core.somaxconn = 4096 net.ipv4.tcp_max_syn_backlog = 4096 EOF sysctl -p |
对于Linux服务器,如果参数net.ipv4.tcp_max_tw_buckets
、net.core.somaxconn
和net.ipv4.tcp_max_syn_backlog
配置过小,则可能会导致该文中的问题。
对于Windows平台请参考:https://www.dbaup.com/linuxhewindowspingtaishangtcp_keepalive_timetcp_keepalive_intvlhetcp_keepalive_probesdepeizhi.html
3、重新reload监听器配置,或者重启监听器
其它可能原因排查
1、数据库防火墙原因,请关闭防火墙或配置防火墙的长连接。这里的防火墙涉及数据库服务端和客户端2个主机的防火墙。
2、检查客户端和服务器之间是否有丢包及延迟现象,可以使用ping或mtr测试,参考:https://www.dbaup.com/shiyong-mtr-fenxiwangluoyanchijidiubaowenti.html
3、检查profile文件是否有限制:
1 2 3 4 5 6 7 | show parameter resource_limit select * from dba_profiles d where d.resource_name='IDLE_TIME'; select u.username,d.profile, d.resource_name ,d.limit from dba_users u , dba_profiles d where u.PROFILE=d.profile and d.resource_name='IDLE_TIME' and U.username='LHR'; alter profile ORA_STIG_PROFILE limit IDLE_TIME UNLIMITED; |
4、若数据库是12.2版本及以上,请检查是否配置了参数MAX_IDLE_TIME
和MAX_IDLE_BLOCKER_TIME
。
5、是否配置了定时任务,定时清理会话,请查询:dba_jobs
和dba_scheduler_jobs
。
6、检查pga_aggregate_limit
和pga_aggregate_target
是否过小。
7、其它杀毒软件导致的短连接,请关闭Windows Defender 防火墙等。
8、是否有云层面、高层次、顶层的防火墙配置,例如,新华三防火墙“H3C SECPATH F5000-M”需要配置老化时间为较长时间:
参考:
https://zhiliao.h3c.com/questions/dispcont/114689
https://zhiliao.h3c.com/Theme/details/164108
另外,例如天融信防火墙配置TCP维持超时等时间,如下: