原 GreenPlum在何时会发生自动故障切换、节点漂移及如何排查（OOM）

发布日期 2023年7月17日 · 已更新 2024年7月1日

0 578 2

Tags：原创故障处理 GreenPlum OOM segment 实例漂移自动故障转移节点切换节点漂移

👉 本文共约2192个字，系统预计阅读时间或需9分钟。

简介
情况1（大部分情况）：发生了OOM
情况2：最大进程数超限导致系统资源不足
情况3：kill掉实例
情况4：master的主机名不能解析
情况5：负载过大或CPU压力或主机压力较大或发生大量的swap置换
情况6：某个segment的主机网络不通
排查SQL
自动故障切换模拟
原来环境
模拟故障
日志
恢复
自动故障切换+产生大量日志模拟
总结

简介

若发生了OOM，则在gpcc的告警通知里会有“[告警]Out of memory errors”，例如：

情况1（大部分情况）：发生了OOM

发生自动切换的一个示例是发生了OOM，在master的日志文件中会有如下的内容：

“FTS: cannot establish libpq connection (content=0, dbid=11): could not fork new process for connection: Cannot allocate memory”或“FATAL: Out of memory. Failed on request of size 144 bytes. (context 'GPORCAmemory pool') ”或“ATAL: the database system is in recovery mode”，

若没有swap内存配置，会发生OOM，特别严重时会导致segment自动故障切换。

FTS: cannot establish libpq connection (content=0, dbid=11): could not fork new process for connection: Cannot allocate memory
The previous session was reset because its gang was disconnected (session id = 6072). The new session id = 109485
FATAL:  Out of memory.  Failed on request of size 144 bytes. (context 'GPORCAmemory pool')
FATAL:  the database system is in recovery mode
gang was lost due to cluster reconfiguration(cdbgang_async.c:97)
rejecting TCP connection to master using internalconnection protocol
Any temporary tables for this session have been dropped because the gang was disconnected (session id = 85341)
failed to acquire resources on one or more segments

FTS: cannot establish libpq connection (content=0, dbid=11): could not fork new process for connection: Cannot allocate memory

The previous session was reset because its gang was disconnected (session id = 6072). The new session id = 109485

FATAL: Out of memory. Failed on request of size 144 bytes. (context 'GPORCAmemory pool')

FATAL: the database system is in recovery mode

gang was lost due to cluster reconfiguration(cdbgang_async.c:97)

rejecting TCP connection to master using internalconnection protocol

Any temporary tables for this session have been dropped because the gang was disconnected (session id = 85341)

failed to acquire resources on one or more segments

模拟OOM错误：


docker rm -f gpdbtest
docker run -itd  --name gpdbtest -h gpdb6 \
-m 2GB   --memory-swap=2GB  \
-p 28180:28080 \
--privileged=true  lhrbest/greenplum:6.23.1  /usr/sbin/init

docker exec -it gpdbtest bash
su - gpadmin
gpstart -am

gpconfig -c gp_vmem_protect_limit -v 1024
gpstop -M fast -ar
gpcc start

http://192.168.88.162:28180

docker stats gpdbtest

psql -c 'drop database sbtest2;'
psql -c 'create database sbtest2;'
sysbench /usr/share/sysbench/oltp_common.lua --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-port=5432 \
--pgsql-user=gpadmin --pgsql-password=lhr --pgsql-db=sbtest2  \
--time=300 --table-size=2000000 --tables=10 --threads=10  \
--events=999999999   prepare

select * from gp_segment_configuration ;

docker rm -f gpdbtest

docker run -itd --name gpdbtest -h gpdb6 \

-m 2GB --memory-swap=2GB \

-p 28180:28080 \

--privileged=true lhrbest/greenplum:6.23.1 /usr/sbin/init

docker exec -it gpdbtest bash

su - gpadmin

gpstart -am

gpconfig -c gp_vmem_protect_limit -v 1024

gpstop -M fast -ar

gpcc start

http://192.168.88.162:28180

docker stats gpdbtest

psql -c 'drop database sbtest2;'

psql -c 'create database sbtest2;'

sysbench /usr/share/sysbench/oltp_common.lua --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-port=5432 \

--pgsql-user=gpadmin --pgsql-password=lhr --pgsql-db=sbtest2 \

--time=300 --table-size=2000000 --tables=10 --threads=10 \

--events=999999999 prepare

select * from gp_segment_configuration ;

情况2：最大进程数超限导致系统资源不足

最大进程数超限，此时，日志报错：

could not fork new process for connection: Resource temporarily unavailable

could not fork new process for connection: Resource temporarily unavailable
 (seg0 119.10.25.26:6000)

FATAL:  InitMotionLayerIPC: failed to create thread (ic_udpifc.c:1488)
DETAIL:  pthread_create() failed with err 11
 (seg11 19.10.25.26:7003)

could not fork new process for connection: Resource temporarily unavailable

(seg0 119.10.25.26:6000)

FATAL: InitMotionLayerIPC: failed to create thread (ic_udpifc.c:1488)

DETAIL: pthread_create() failed with err 11

(seg11 19.10.25.26:7003)

该报错，多半是因为内核参数没有做正确修改，修复如下：


ll  /lib64/security/pam_limits.so
echo "session required /lib64/security/pam_limits.so" >> /etc/pam.d/login

cat >> /etc/security/limits.conf <<"EOF"

* soft nofile 655350 
* hard nofile 655350 
* soft nproc 655350
* hard nproc 655350

gpadmin soft priority -20

EOF

sed  -i 's/4096/655350/' /etc/security/limits.d/20-nproc.conf 
cat /etc/security/limits.d/20-nproc.conf

cat >> /etc/sysctl.conf <<"EOF"
fs.file-max=9000000
fs.inotify.max_user_instances = 1000000
fs.inotify.max_user_watches = 1000000
kernel.pid_max=4194304
EOF

sysctl -p

ll /lib64/security/pam_limits.so

echo "session required /lib64/security/pam_limits.so" >> /etc/pam.d/login

cat >> /etc/security/limits.conf <<"EOF"

* soft nofile 655350

* hard nofile 655350

* soft nproc 655350

* hard nproc 655350

gpadmin soft priority -20

EOF

sed -i 's/4096/655350/' /etc/security/limits.d/20-nproc.conf

cat /etc/security/limits.d/20-nproc.conf

cat >> /etc/sysctl.conf <<"EOF"

fs.file-max=9000000

fs.inotify.max_user_instances = 1000000

fs.inotify.max_user_watches = 1000000

kernel.pid_max=4194304

EOF

sysctl -p

重启主机生效。

参考：https://www.dbaup.com/linuxzhongdesoft-nproc-hard-nprocsoft-nofilehard-nofileulimitdezuidajinchengshuhezuidaai.html

模拟“DETAIL: pthread_create() failed with err 11”错误：

docker rm -f gpdbtest
docker run -itd  --name gpdbtest -h gpdb6 \
-p 28180:28080 \
--privileged=true  lhrbest/greenplum:6.23.1  /usr/sbin/init

docker exec -it gpdbtest bash

cat >> /etc/security/limits.conf <<"EOF"

* soft nofile 800
* hard nofile 800
* soft nproc 800
* hard nproc 800

EOF

sed  -i 's/1024/800/' /etc/security/limits.d/20-nproc.conf 
cat /etc/security/limits.d/20-nproc.conf

gpstart -m
gpconfig -c max_connections -v 2000  -m 1000
gpstop -M fast -ar

psql -c 'drop database sbtest2;'
psql -c 'create database sbtest2;'
sysbench /usr/share/sysbench/oltp_common.lua --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-port=5432 \
--pgsql-user=gpadmin --pgsql-password=lhr --pgsql-db=sbtest2  \
--time=300 --table-size=10 --tables=200 --threads=200  \
--events=999999999   prepare

docker rm -f gpdbtest

docker run -itd --name gpdbtest -h gpdb6 \

-p 28180:28080 \

--privileged=true lhrbest/greenplum:6.23.1 /usr/sbin/init

docker exec -it gpdbtest bash

cat >> /etc/security/limits.conf <<"EOF"

* soft nofile 800

* hard nofile 800

* soft nproc 800

* hard nproc 800

EOF

sed -i 's/1024/800/' /etc/security/limits.d/20-nproc.conf

cat /etc/security/limits.d/20-nproc.conf

gpstart -m

gpconfig -c max_connections -v 2000 -m 1000

gpstop -M fast -ar

psql -c 'drop database sbtest2;'

psql -c 'create database sbtest2;'

sysbench /usr/share/sysbench/oltp_common.lua --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-port=5432 \

--pgsql-user=gpadmin --pgsql-password=lhr --pgsql-db=sbtest2 \

--time=300 --table-size=10 --tables=200 --threads=200 \

--events=999999999 prepare

情况3：kill掉实例

当手工kill掉某个PG实例后，也会自动发生切换。

情况4：master的主机名不能解析

所有的segment节点不能解析master的主机，效果类似于segment宕机，所以，这个时候会发生切换。

情况5：负载过大或CPU压力或主机压力较大或发生大量的swap置换

当某个节点的负载过大或CPU压力或主机压力较大时也会发生故障启动切换。

之前碰到过1条SQL语句，每分钟插入8万条数据导致主备切换（按秒统计时，每秒1万条）。

情况6：某个segment的主机网络不通

某个主机网络临时不通的话，或出现丢包严重，忘了抖动等，也会发生切换。 和情况4差不多。

此时，报错信息一般为：“Connection reset by peer” ，或“error code = 110 (Connection timed out)”，出现丢包现象，若频繁出现，则可以使用长ping来观察网络情况。

参考：https://dbaup.com/shiyong-mtr-fenxiwangluoyanchijidiubaowenti.html#zong_jie

排查SQL

-- 找到发生切换的时间点
select * from gp_configuration_history ORDER BY  time desc limit 100;

-- 查询时间视图
select 'gp_log_master_ext' tb,min(logtime),max(logtime) from gpmetrics.gp_log_master_ext
union all
select 'gpcc_pg_log_history' tb,min(logtime),max(logtime) from gpmetrics.gpcc_pg_log_history;

-- 查询切换时的详细日志（有时候查不出来，需要去查询gp_log_master_ext或直接查询日志文件）
SELECT
    d.logtime,
    d.loguser,
    d.logdatabase,
    d.loghost,
    d.logsessiontime,
    d.logseverity,
    d.logmessage,
    d.logdebug,
    d.logdetail,
    logcontext
from
    gpmetrics.gpcc_pg_log_history d 
where
    d.logseverity not in ( 'LOG' ) 
    and
     d.logstate not in ( '58P01' ) -- and d.logtime >= now()-interval '7 day'    
and
     d.logtime >= '2024-04-10 09:30' 
    and d.logtime <= '2024-04-10 09:40'
and ( d.logmessage like 'Out of memory%' 
     or d.logmessage like 'failed to acquire resources on one or more segments%' 
     or d.logmessage like 'FTS detected connection lost during dispatch to%' 
     or d.logmessage like 'ERROR: FTS double fault detected%'
     or d.logmessage like 'Canceling query because of high VMEM usage%'
     or d.logmessage like '%server closed the connection unexpectedly%'
     or d.logmessage like '%broadcast failed to one or more segments%'
     or d.logmessage like '%gang was lost due to cluster reconfiguration%'
     or d.logmessage like '%because the gang was disconnected%'
     or d.logmessage like '%interconnect error: connection closed prematurely%'
     or d.logdetail like '%Cannot allocate memory%' 
     or d.logdetail like '%FATAL:  out of memory%' 
     or d.logdetail like '%System memory limit reached, failed to allocate %' 
     or d.logdetail like 'FTS detected one or more segments are down'
     or d.logdetail like '%Vmem limit reached, failed to allocate%'
     or d.logdetail like 'FATAL:  InitMotionLayerIPC: failed to create thread%'
     or d.logdetail like 'could not fork new process for connection%'
     or d.logdetail like '%Resource temporarily unavailable%'
     or d.logdetail like '%Connection reset by peer%'
     or d.logdetail like '%Connection refused%'
    )
order by d.logtime 
    limit 500;

SELECT d.logtime,
    d.loguser,
    d.logdatabase,
    d.loghost,
    d.logsessiontime,
    d.logseverity,
    d.logmessage,
    d.logdebug,
    d.logdetail,
    logcontext
FROM gpmetrics.gp_log_master_ext d 
where d.logseverity not in ('LOG') 
and d.logstate not in ('58P01')
-- and d.logtime >= now()-interval '7 day' 
 and d.logtime >= '2024-04-15 15:30' 
 and d.logtime <= '2024-04-15 15:40'
and ( d.logmessage like 'Out of memory%' 
     or d.logmessage like 'failed to acquire resources on one or more segments%' 
     or d.logmessage like 'FTS detected connection lost during dispatch to%' 
     or d.logmessage like 'ERROR: FTS double fault detected%'
     or d.logmessage like 'Canceling query because of high VMEM usage%'
     or d.logmessage like '%server closed the connection unexpectedly%'
     or d.logmessage like '%broadcast failed to one or more segments%'
     or d.logmessage like '%gang was lost due to cluster reconfiguration%'
     or d.logmessage like '%because the gang was disconnected%'
     or d.logmessage like '%interconnect error: connection closed prematurely%'
     or d.logdetail like '%Cannot allocate memory%' 
     or d.logdetail like '%FATAL:  out of memory%' 
     or d.logdetail like '%System memory limit reached, failed to allocate %' 
     or d.logdetail like 'FTS detected one or more segments are down'
     or d.logdetail like '%Vmem limit reached, failed to allocate%'
     or d.logdetail like 'FATAL:  InitMotionLayerIPC: failed to create thread%'
     or d.logdetail like 'could not fork new process for connection%'
     or d.logdetail like '%Resource temporarily unavailable%'
     or d.logdetail like '%Connection reset by peer%'
     or d.logdetail like '%Connection refused%'
    )
order by d.logtime  
;

-- 查看日志
ps -ef|grep post| grep bin

cd /opt/greenplum/data/master/gpseg-1/pg_log

-- 每隔15秒收集1次所有主机的系统信息
SELECT
   ctime::date || ' ' 
     || CASE WHEN extract(hour from ctime) < 10 THEN '0' ELSE '' END ||extract(hour from ctime)
     ||':'|| CASE WHEN extract(MINUTE from ctime) < 10 THEN '0' ELSE '' END || extract(MINUTE from ctime)
     AS time1
 , hostname
 , ROUND(AVG(cpu_sys::numeric),2) AS cpu_sys 
 , ROUND(avg(cpu_user::numeric),2) AS cpu_user
 , ROUND(avg((cpu_sys+cpu_user)::numeric),2) AS cpu_use
 , ROUND(avg((cpu_iowait)::numeric),2) AS cpu_iowait
 , ROUND(avg(gsh.cpu_idle::numeric),2) AS cpu_idle
 , ROUND(avg(mem_total::numeric/1024/1024/1024)) AS mem_total_GB
 , ROUND(avg(mem_used::numeric/1024/1024/1024),2) AS mem_used_GB
 -- , ROUND(avg(mem_actual_used::numeric/1024/1024/1024),2) AS mem_actual_GB
 , ROUND(avg((mem_buffers+mem_cached)::numeric/1024/1024/1024),2) AS mem_buffer_cache_GB
 -- , ROUND(avg(((gsh.mem_total-mem_used)::numeric)/1024/1024/1024),2) AS mem_available_GB
 , ROUND(avg(gsh.swap_total::numeric/1024/1024/1024)) as swap_total
 , ROUND(avg(gsh.swap_used::numeric/1024/1024/1024),2) as swap_used
 , ROUND(avg(gsh.swap_page_in::numeric),2) as swap_page_in
 , ROUND(avg(gsh.swap_page_out::numeric),2) as swap_page_out
 , ROUND(avg(load0::numeric),2) AS load0
 , ROUND(avg(load1::numeric),2) AS load1
 , ROUND(avg(load2::numeric),2) AS load2
 , ROUND((avg(disk_rb_rate::numeric) / 1024 / 1024),2) AS disk_R_MBs
 , ROUND((avg(disk_wb_rate::numeric) / 1024 / 1024),2) AS disk_W_MBs
 , ROUND((avg(net_rb_rate::numeric) / 1024 / 1024),2) AS net_I_MBs
 , ROUND((avg(net_wb_rate::numeric) / 1024 / 1024),2) AS net_O_MBs
FROM
   gpmetrics.gpcc_system_history gsh
WHERE hostname LIKE 'mdw%'
AND ctime >= '2024-09-14 16:30:00'
AND ctime < '2024-09-14 19:00:00'
GROUP BY 1,2
ORDER BY 1
;

-- 系统内存和swap信息
SELECT DISTINCT gsh.hostname,
ROUND(gsh.mem_total::numeric/1024/1024/1024) AS mem_total_G, 
ROUND(gsh.swap_total::numeric/1024/1024/1024) AS swap_total_G
FROM  gpmetrics.gpcc_system_history gsh
where ctime >=  NOW() - INTERVAL '1 minute'
and gsh.mem_total > 0
order by 1
limit 10;

 -- 出问题时间段的SQL语句
 SELECT substr(gqh.ctime::VARCHAR,0,17) ctime,
 substr(gqh.query_text,0,50000) query_text,
 count(*) cnt, 
 TRUNC(sum(gqh.disk_read_bytes)/1024/1024/1024,2) disk_read_GB,
 TRUNC(sum(disk_write_bytes)/1024/1024/1024,2) disk_write_GB
from  gpmetrics.gpcc_queries_history gqh
WHERE ctime >= '2024-10-27 20:30' AND ctime <= '2024-10-27 22:30'
and lower(query_text) like '%result_lis_xh_ls%'
group by substr(gqh.ctime::VARCHAR,0,17),substr(gqh.query_text,0,50000)
ORDER BY 4 desc , 5 desc ;

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

-- 找到发生切换的时间点

select * from gp_configuration_history ORDER BY time desc limit 100;

-- 查询时间视图

select 'gp_log_master_ext' tb,min(logtime),max(logtime) from gpmetrics.gp_log_master_ext

union all

select 'gpcc_pg_log_history' tb,min(logtime),max(logtime) from gpmetrics.gpcc_pg_log_history;

-- 查询切换时的详细日志（有时候查不出来，需要去查询gp_log_master_ext或直接查询日志文件）

SELECT

d.logtime,

d.loguser,

d.logdatabase,

d.loghost,

d.logsessiontime,

d.logseverity,

d.logmessage,

d.logdebug,

d.logdetail,

logcontext

from

gpmetrics.gpcc_pg_log_history d

where

d.logseverity not in ( 'LOG' )

and

d.logstate not in ( '58P01' ) -- and d.logtime >= now()-interval '7 day'

and

d.logtime >= '2024-04-10 09:30'

and d.logtime <= '2024-04-10 09:40'

and ( d.logmessage like 'Out of memory%'

or d.logmessage like 'failed to acquire resources on one or more segments%'

or d.logmessage like 'FTS detected connection lost during dispatch to%'

or d.logmessage like 'ERROR: FTS double fault detected%'

or d.logmessage like 'Canceling query because of high VMEM usage%'

or d.logmessage like '%server closed the connection unexpectedly%'

or d.logmessage like '%broadcast failed to one or more segments%'

or d.logmessage like '%gang was lost due to cluster reconfiguration%'

or d.logmessage like '%because the gang was disconnected%'

or d.logmessage like '%interconnect error: connection closed prematurely%'

or d.logdetail like '%Cannot allocate memory%'

or d.logdetail like '%FATAL: out of memory%'

or d.logdetail like '%System memory limit reached, failed to allocate %'

or d.logdetail like 'FTS detected one or more segments are down'

or d.logdetail like '%Vmem limit reached, failed to allocate%'

or d.logdetail like 'FATAL: InitMotionLayerIPC: failed to create thread%'

or d.logdetail like 'could not fork new process for connection%'

or d.logdetail like '%Resource temporarily unavailable%'

or d.logdetail like '%Connection reset by peer%'

or d.logdetail like '%Connection refused%'

)

order by d.logtime

limit 500;

SELECT d.logtime,

d.loguser,

d.logdatabase,

d.loghost,

d.logsessiontime,

d.logseverity,

d.logmessage,

d.logdebug,

d.logdetail,

logcontext

FROM gpmetrics.gp_log_master_ext d

where d.logseverity not in ('LOG')

and d.logstate not in ('58P01')

-- and d.logtime >= now()-interval '7 day'

and d.logtime >= '2024-04-15 15:30'

and d.logtime <= '2024-04-15 15:40'

and ( d.logmessage like 'Out of memory%'

or d.logmessage like 'failed to acquire resources on one or more segments%'

or d.logmessage like 'FTS detected connection lost during dispatch to%'

or d.logmessage like 'ERROR: FTS double fault detected%'

or d.logmessage like 'Canceling query because of high VMEM usage%'

or d.logmessage like '%server closed the connection unexpectedly%'

or d.logmessage like '%broadcast failed to one or more segments%'

or d.logmessage like '%gang was lost due to cluster reconfiguration%'

or d.logmessage like '%because the gang was disconnected%'

or d.logmessage like '%interconnect error: connection closed prematurely%'

or d.logdetail like '%Cannot allocate memory%'

or d.logdetail like '%FATAL: out of memory%'

or d.logdetail like '%System memory limit reached, failed to allocate %'

or d.logdetail like 'FTS detected one or more segments are down'

or d.logdetail like '%Vmem limit reached, failed to allocate%'

or d.logdetail like 'FATAL: InitMotionLayerIPC: failed to create thread%'

or d.logdetail like 'could not fork new process for connection%'

or d.logdetail like '%Resource temporarily unavailable%'

or d.logdetail like '%Connection reset by peer%'

or d.logdetail like '%Connection refused%'

)

order by d.logtime

;

-- 查看日志

ps -ef|grep post| grep bin

cd /opt/greenplum/data/master/gpseg-1/pg_log

-- 每隔15秒收集1次所有主机的系统信息

SELECT

ctime::date || ' '

|| CASE WHEN extract(hour from ctime) < 10 THEN '0' ELSE '' END ||extract(hour from ctime)

||':'|| CASE WHEN extract(MINUTE from ctime) < 10 THEN '0' ELSE '' END || extract(MINUTE from ctime)

AS time1

, hostname

, ROUND(AVG(cpu_sys::numeric),2) AS cpu_sys

, ROUND(avg(cpu_user::numeric),2) AS cpu_user

, ROUND(avg((cpu_sys+cpu_user)::numeric),2) AS cpu_use

, ROUND(avg((cpu_iowait)::numeric),2) AS cpu_iowait

, ROUND(avg(gsh.cpu_idle::numeric),2) AS cpu_idle

, ROUND(avg(mem_total::numeric/1024/1024/1024)) AS mem_total_GB

, ROUND(avg(mem_used::numeric/1024/1024/1024),2) AS mem_used_GB

-- , ROUND(avg(mem_actual_used::numeric/1024/1024/1024),2) AS mem_actual_GB

, ROUND(avg((mem_buffers+mem_cached)::numeric/1024/1024/1024),2) AS mem_buffer_cache_GB

-- , ROUND(avg(((gsh.mem_total-mem_used)::numeric)/1024/1024/1024),2) AS mem_available_GB

, ROUND(avg(gsh.swap_total::numeric/1024/1024/1024)) as swap_total

, ROUND(avg(gsh.swap_used::numeric/1024/1024/1024),2) as swap_used

, ROUND(avg(gsh.swap_page_in::numeric),2) as swap_page_in

, ROUND(avg(gsh.swap_page_out::numeric),2) as swap_page_out

, ROUND(avg(load0::numeric),2) AS load0

, ROUND(avg(load1::numeric),2) AS load1

, ROUND(avg(load2::numeric),2) AS load2

, ROUND((avg(disk_rb_rate::numeric) / 1024 / 1024),2) AS disk_R_MBs

, ROUND((avg(disk_wb_rate::numeric) / 1024 / 1024),2) AS disk_W_MBs

, ROUND((avg(net_rb_rate::numeric) / 1024 / 1024),2) AS net_I_MBs

, ROUND((avg(net_wb_rate::numeric) / 1024 / 1024),2) AS net_O_MBs

FROM

gpmetrics.gpcc_system_history gsh

WHERE hostname LIKE 'mdw%'

AND ctime >= '2024-09-14 16:30:00'

AND ctime < '2024-09-14 19:00:00'

GROUP BY 1,2

ORDER BY 1

;

-- 系统内存和swap信息

SELECT DISTINCT gsh.hostname,

ROUND(gsh.mem_total::numeric/1024/1024/1024) AS mem_total_G,

ROUND(gsh.swap_total::numeric/1024/1024/1024) AS swap_total_G

FROM gpmetrics.gpcc_system_history gsh

where ctime >= NOW() - INTERVAL '1 minute'

and gsh.mem_total > 0

order by 1

limit 10;

-- 出问题时间段的SQL语句

SELECT substr(gqh.ctime::VARCHAR,0,17) ctime,

substr(gqh.query_text,0,50000) query_text,

count(*) cnt,

TRUNC(sum(gqh.disk_read_bytes)/1024/1024/1024,2) disk_read_GB,

TRUNC(sum(disk_write_bytes)/1024/1024/1024,2) disk_write_GB

from gpmetrics.gpcc_queries_history gqh

WHERE ctime >= '2024-10-27 20:30' AND ctime <= '2024-10-27 22:30'

and lower(query_text) like '%result_lis_xh_ls%'

group by substr(gqh.ctime::VARCHAR,0,17),substr(gqh.query_text,0,50000)

ORDER BY 4 desc , 5 desc ;

本人提供Oracle(OCP、OCM)、MySQL(OCP)、PostgreSQL(PGCA、PGCE、PGCM)等数据库的培训和考证业务，私聊QQ646634621或微信dbaup66，谢谢！

后续精彩内容已被站长无情隐藏，请输入验证码解锁本文！

获取验证码：请先关注本站微信公众号，然后回复“验证码”，获取验证码。在微信里搜索“AiDBA”或者“dbaup6”或者微信扫描右侧二维码都可以关注本站微信公众号。

打赏赞(2)

标签：原创故障处理 GreenPlum OOM segment 实例漂移自动故障转移节点切换节点漂移

小麦苗

学习或考证，均可联系麦老师，请加微信db_bao或QQ646634621

发表回复取消回复

要发表评论，您必须先登录。

原 GreenPlum在何时会发生自动故障切换、节点漂移及如何排查（OOM）

简介

情况1（大部分情况）：发生了OOM

情况2：最大进程数超限导致系统资源不足

情况3：kill掉实例

情况4：master的主机名不能解析

情况5：负载过大或CPU压力或主机压力较大或发生大量的swap置换

情况6：某个segment的主机网络不通

排查SQL

相关文章

您可能还喜欢...

发表回复取消回复

网站公告

网站寄语

本站其它工具

搜索本网站

标签云☁

网站日历

网站归档

网站分类

2024 年 11 月
一	二	三	四	五	六	日
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30

原 GreenPlum在何时会发生自动故障切换、节点漂移及如何排查（OOM）

简介

情况1（大部分情况）：发生了OOM

情况2：最大进程数超限导致系统资源不足

情况3：kill掉实例

情况4：master的主机名不能解析

情况5：负载过大或CPU压力或主机压力较大或发生大量的swap置换

情况6：某个segment的主机网络不通

排查SQL

相关文章

您可能还喜欢...

Linux中使用ssh连接服务器很慢的几种情况

可以ping通，但ssh报错：ssh: connect to host 100.8.6.26 port 22: Connection refused

ssh_exchange_identification: read: Connection reset by peer

发表回复 取消回复

网站公告

网站寄语

本站其它工具

搜索本网站

标签云☁

网站日历

网站归档

网站分类

发表回复取消回复