Help: how to replace a failed node in a 6.0 one-primary, one-standby cluster

I am studying on my own and trying out failure recovery with the HA feature.

Environment:

192.168.122.146 db1

192.168.122.195 db2

192.168.122.200 VIP

Test steps
Simulate a failure, then try to recover from it.

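  1. Default installation; the data directory is /opt/openGauss/install/data/dn1. Primary and standby both started successfully. The cluster state was normal with db1 as primary, and tests such as data replication passed.
  2. On db1, ran rm -rf /opt/openGauss/ and rebooted the OS. db2 was automatically promoted to primary; db1 is down.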
[omm@db2 dn1]$ gs_om -t status --detail
[  CMServer State   ]

node   node_ip         instance                                            state
----------------------------------------------------------------------------------
1  db1 192.168.122.146  1    /opt/openGauss/install/data/cmserver/cm_server Down
2  db2 192.168.122.195  2    /opt/openGauss/install/data/cmserver/cm_server Primary

[   Cluster State   ]

cluster_state   : Degraded
redistributing  : No
balanced        : No
current_az      : AZ_ALL

[  Datanode State   ]

node   node_ip         instance                                    state            
------------------------------------------------------------------------------------
1  db1 192.168.122.146  6001 15000  /opt/openGauss/install/data/dn1 P Down    Unknown
2  db2 192.168.122.195  6002 15000  /opt/openGauss/install/data/dn1 S Primary Normal
[omm@db2 dn1]$ 
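  3. Tried to recover the db1 node. I could not find any official documentation, so I followed the 墨天轮 article "openGauss集群故障节点替换操作" (replacing a failed node in an openGauss cluster). The build step fails with the following log: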
[omm@db1 dn1]$ gs_ctl build -D /opt/openGauss/install/data/dn1 -b standby_full -C "localhost=192.168.122.146 localport=15000 remotehost=192.168.122.195 remoteport=15000"  
0 LOG:  [Alarm Module]can not read GAUSS_WARNING_TYPE env.

0 LOG:  [Alarm Module]Host Name: db1 

0 LOG:  [Alarm Module]Host IP: db1. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP>

0 LOG:  [Alarm Module]Cluster Name: Cluster_template 

0 LOG:  [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 58

0 WARNING:  failed to open feature control file, please check whether it exists: FileName=gaussdb.version, Errno=2, Errmessage=No such file or directory.
0 WARNING:  failed to parse feature control file: gaussdb.version.
0 WARNING:  Failed to load the product control file, so gaussdb cannot distinguish product version.
The core dump path is an invalid directory
[2025-11-28 12:37:40.129][8735][][gs_ctl]: gs_ctl standby full build ,datadir is /opt/openGauss/install/data/dn1,conn_str is 'localhost=192.168.122.146 localport=15000 remotehost=192.168.122.195 remoteport=15000'
[2025-11-28 12:37:40.129][8735][][gs_ctl]: fopen build pid file "/opt/openGauss/install/data/dn1/gs_build.pid" success
[2025-11-28 12:37:40.129][8735][][gs_ctl]: fprintf build pid file "/opt/openGauss/install/data/dn1/gs_build.pid" success
[2025-11-28 12:37:40.132][8735][][gs_ctl]: fsync build pid file "/opt/openGauss/install/data/dn1/gs_build.pid" success
[2025-11-28 12:37:40.132][8735][][gs_ctl]: stop failed, killing gaussdb by force ...
[2025-11-28 12:37:40.132][8735][][gs_ctl]: command [ps c -eo pid,euid,cmd | grep gaussdb | grep -v grep | awk '{if($2 == curuid && $1!="-n") print "/proc/"$1"/cwd"}' curuid=`id -u`| xargs ls -l | awk '{if ($NF=="/opt/openGauss/install/data/dn1")  print $(NF-2)}' | awk -F/ '{print $3 }' | xargs kill -9 >/dev/null 2>&1 ] path: [/opt/openGauss/install/data/dn1] 
[2025-11-28 12:37:40.140][8735][][gs_ctl]: server stopped
[2025-11-28 12:37:40.140][8735][][gs_ctl]: current workdir is (/opt/openGauss/install/data/dn1).
[2025-11-28 12:37:40.140][8735][][gs_ctl]: set gaussdb state file when standby full build build:db state(BUILDING_STATE), server mode(STANDBY_MODE), build mode(FULL_BUILD).
[2025-11-28 12:37:40.140][8735][dn_6001_6002][gs_ctl]: Get repl_auth_mode is  and repl_uuid is 
[2025-11-28 12:37:40.144][8735][dn_6001_6002][gs_ctl]: standby build try host(192.168.122.195) port(15000) failed
[2025-11-28 12:37:40.144][8735][dn_6001_6002][gs_ctl]: could not connect to server.
[2025-11-28 12:37:40.144][8735][dn_6001_6002][gs_ctl]: standby full build failed(/opt/openGauss/install/data/dn1).
[omm@db1 dn1]$ 
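
While writing this up I noticed that the -C string passed to gs_ctl build uses port 15000 on both sides, whereas replconninfo1 in postgresql.conf uses 15001. Whether that mismatch has anything to do with the "could not connect to server" error is an assumption I have not verified. The sketch below only compares the two strings field by field; both strings are copied from this post, and get_field is a hypothetical helper, not an openGauss tool.

```shell
#!/usr/bin/env bash
# Sketch: compare host/port fields of the gs_ctl build -C string with replconninfo1.
# Both strings are copied verbatim from this post; get_field is a hypothetical helper.

# Print the value of field $2 from the space-separated key=value string $1.
get_field() {
    printf '%s\n' "$1" | tr ' ' '\n' | awk -F= -v k="$2" '$1 == k { print $2 }'
}

build_conn="localhost=192.168.122.146 localport=15000 remotehost=192.168.122.195 remoteport=15000"
repl_conn="localhost=192.168.122.146 localport=15001 localheartbeatport=15005 localservice=15004 remotehost=192.168.122.195 remoteport=15001 remoteheartbeatport=15005 remoteservice=15004"

for key in localhost localport remotehost remoteport; do
    a=$(get_field "$build_conn" "$key")
    b=$(get_field "$repl_conn" "$key")
    if [ "$a" = "$b" ]; then
        echo "$key: same ($a)"
    else
        echo "$key: differs (build -C: $a, replconninfo1: $b)"
    fi
done
```

With the values above, localhost and remotehost match, while localport and remoteport differ (15000 in the build command, 15001 in replconninfo1).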

Related environment details

  1. Modified the corresponding values in postgresql.conf:
local_bind_address = '192.168.122.146'
replconninfo1 = 'localhost=192.168.122.146 localport=15001 localheartbeatport=15005 localservice=15004 remotehost=192.168.122.195 remoteport=15001 remoteheartbeatport=15005 remoteservice=15004'

application_name = 'dn_6001'

synchronous_standby_names = 'ANY 1(dn_6001,dn_6002)'
  2. From 146, port 15000 on 195 is reachable (15001 and 15005 as well):
[omm@db1 dn1]$ nc -v 192.168.122.195 15001
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.122.195:15001.
^C
[omm@db1 dn1]$ nc -v 192.168.122.195 15000
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.122.195:15000.
^C
[omm@db1 dn1]$ nc -v 192.168.122.195 15005
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.122.195:15005.
^C
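
The three nc checks above can be repeated in one go with a small loop. This is a minimal sketch that assumes bash with /dev/tcp support and the coreutils timeout command; check_ports is a hypothetical helper. A successful TCP connect only shows that something is listening on the port. It does not show that the server accepts the replication connection.

```shell
#!/usr/bin/env bash
# Sketch: report whether TCP ports on a host accept connections.
# Assumes bash built with /dev/tcp support and coreutils `timeout`.
# check_ports is a hypothetical helper, not part of openGauss.

check_ports() {
    local host=$1
    shift
    local port
    for port in "$@"; do
        # Try to open a TCP connection; give up after 3 seconds.
        if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
            echo "$host:$port open"
        else
            echo "$host:$port closed or unreachable"
        fi
    done
}

# Example (run on db1): check_ports 192.168.122.195 15000 15001 15005
```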

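Questions:

  1. Is there official documentation that covers fully rebuilding a node?
  2. What is wrong in the steps above?
  3. What other information should I provide?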