Personal study: trying out failure recovery with the HA feature.
Environment:
192.168.122.146 db1
192.168.122.195 db2
192.168.122.200 VIP
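Test steps
Simulate a failure, then try to recover from it.
- Default installation; the data path is /opt/openGauss/install/data/dn1. Primary and standby started successfully. The cluster state was normal with db1 as primary, and data synchronization and other tests passed.
- On db1, ran rm -rf /opt/openGauss/ and rebooted the OS. db2 automatically became the primary; db1 is down.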
[omm@db2 dn1]$ gs_om -t status --detail
[ CMServer State ]
node node_ip instance state
----------------------------------------------------------------------------------
1 db1 192.168.122.146 1 /opt/openGauss/install/data/cmserver/cm_server Down
2 db2 192.168.122.195 2 /opt/openGauss/install/data/cmserver/cm_server Primary
[ Cluster State ]
cluster_state : Degraded
redistributing : No
balanced : No
current_az : AZ_ALL
[ Datanode State ]
node node_ip instance state
------------------------------------------------------------------------------------
1 db1 192.168.122.146 6001 15000 /opt/openGauss/install/data/dn1 P Down Unknown
2 db2 192.168.122.195 6002 15000 /opt/openGauss/install/data/dn1 S Primary Normal
[omm@db2 dn1]$
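- Tried to recover the db1 node. I could not find official documentation, so I followed the article "openGauss集群故障节点替换操作" on 墨天轮 (roughly, "openGauss cluster failed-node replacement procedure"). The build step fails with the following error log: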
[omm@db1 dn1]$ gs_ctl build -D /opt/openGauss/install/data/dn1 -b standby_full -C "localhost=192.168.122.146 localport=15000 remotehost=192.168.122.195 remoteport=15000"
0 LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env.
0 LOG: [Alarm Module]Host Name: db1
0 LOG: [Alarm Module]Host IP: db1. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP>
0 LOG: [Alarm Module]Cluster Name: Cluster_template
0 LOG: [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 58
0 WARNING: failed to open feature control file, please check whether it exists: FileName=gaussdb.version, Errno=2, Errmessage=No such file or directory.
0 WARNING: failed to parse feature control file: gaussdb.version.
0 WARNING: Failed to load the product control file, so gaussdb cannot distinguish product version.
The core dump path is an invalid directory
[2025-11-28 12:37:40.129][8735][][gs_ctl]: gs_ctl standby full build ,datadir is /opt/openGauss/install/data/dn1,conn_str is 'localhost=192.168.122.146 localport=15000 remotehost=192.168.122.195 remoteport=15000'
[2025-11-28 12:37:40.129][8735][][gs_ctl]: fopen build pid file "/opt/openGauss/install/data/dn1/gs_build.pid" success
[2025-11-28 12:37:40.129][8735][][gs_ctl]: fprintf build pid file "/opt/openGauss/install/data/dn1/gs_build.pid" success
[2025-11-28 12:37:40.132][8735][][gs_ctl]: fsync build pid file "/opt/openGauss/install/data/dn1/gs_build.pid" success
[2025-11-28 12:37:40.132][8735][][gs_ctl]: stop failed, killing gaussdb by force ...
[2025-11-28 12:37:40.132][8735][][gs_ctl]: command [ps c -eo pid,euid,cmd | grep gaussdb | grep -v grep | awk '{if($2 == curuid && $1!="-n") print "/proc/"$1"/cwd"}' curuid=`id -u`| xargs ls -l | awk '{if ($NF=="/opt/openGauss/install/data/dn1") print $(NF-2)}' | awk -F/ '{print $3 }' | xargs kill -9 >/dev/null 2>&1 ] path: [/opt/openGauss/install/data/dn1]
[2025-11-28 12:37:40.140][8735][][gs_ctl]: server stopped
[2025-11-28 12:37:40.140][8735][][gs_ctl]: current workdir is (/opt/openGauss/install/data/dn1).
[2025-11-28 12:37:40.140][8735][][gs_ctl]: set gaussdb state file when standby full build build:db state(BUILDING_STATE), server mode(STANDBY_MODE), build mode(FULL_BUILD).
[2025-11-28 12:37:40.140][8735][dn_6001_6002][gs_ctl]: Get repl_auth_mode is and repl_uuid is
[2025-11-28 12:37:40.144][8735][dn_6001_6002][gs_ctl]: standby build try host(192.168.122.195) port(15000) failed
[2025-11-28 12:37:40.144][8735][dn_6001_6002][gs_ctl]: could not connect to server.
[2025-11-28 12:37:40.144][8735][dn_6001_6002][gs_ctl]: standby full build failed(/opt/openGauss/install/data/dn1).
[omm@db1 dn1]$
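One thing that can be checked mechanically is whether the -C string passed to gs_ctl build agrees with the replconninfo1 value from postgresql.conf (quoted in the environment section of this post). The sketch below assumes both are plain space-separated key=value strings; the helper names parse_conninfo and diff_conninfo are hypothetical and are not part of any openGauss tool. It only reports keys whose values differ. Whether such a difference is related to the build failure is not something this sketch can establish.

```python
def parse_conninfo(conninfo):
    """Split a space-separated key=value string into a dict (assumed format)."""
    return dict(item.split("=", 1) for item in conninfo.split())


def diff_conninfo(a, b):
    """Return {key: (a_value, b_value)} for keys present in both with different values."""
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


# Value passed to `gs_ctl build -C` in this post.
build_conn = (
    "localhost=192.168.122.146 localport=15000 "
    "remotehost=192.168.122.195 remoteport=15000"
)

# replconninfo1 from postgresql.conf in this post.
replconninfo1 = (
    "localhost=192.168.122.146 localport=15001 localheartbeatport=15005 "
    "localservice=15004 remotehost=192.168.122.195 remoteport=15001 "
    "remoteheartbeatport=15005 remoteservice=15004"
)

mismatches = diff_conninfo(parse_conninfo(build_conn), parse_conninfo(replconninfo1))
for key, (build_value, conf_value) in mismatches.items():
    print(f"{key}: build -C has {build_value}, replconninfo1 has {conf_value}")
```

With the values from this post, the two strings differ in localport and remoteport (15000 in the build command, 15001 in replconninfo1).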
Related environment
- Modified the corresponding values in postgresql.conf:
local_bind_address = '192.168.122.146'
replconninfo1 = 'localhost=192.168.122.146 localport=15001 localheartbeatport=15005 localservice=15004 remotehost=192.168.122.195 remoteport=15001 remoteheartbeatport=15005 remoteservice=15004'
application_name = 'dn_6001'
synchronous_standby_names = 'ANY 1(dn_6001,dn_6002)'
- Port 15000 on 195 is reachable from 146 (15001 and 15005 were tested as well):
[omm@db1 dn1]$ nc -v 192.168.122.195 15001
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.122.195:15001.
^C
[omm@db1 dn1]$ nc -v 192.168.122.195 15000
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.122.195:15000.
^C
[omm@db1 dn1]$ nc -v 192.168.122.195 15005
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.122.195:15005.
^C
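The nc checks above can be repeated in a script. This is a minimal sketch using only the Python standard library; port_open and check_ports are hypothetical helper names. A successful TCP connect only shows that something is listening on the port. It says nothing about whether the service behind it accepts a build or replication connection.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports, timeout=3.0):
    """Return {port: reachable} for each port in ports."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For the hosts in this post, calling check_ports("192.168.122.195", [15000, 15001, 15005]) from db1 would correspond to the three nc runs.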
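Help wanted:
- Is there official documentation that covers fully rebuilding a node?
- What is wrong in the steps above?
- What information should I provide?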