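Before we start
1. The issue is fixed in 6.0.2. I haven't tested later versions, but they are most likely fine too.
2. The install ran on a cloud provider's virtual machine. The VM may have something to do with it, though probably not much.
3. Nothing was actually fixed. I only found the spot where it breaks and worked around it. it's-not-like-it-doesn't-work.jpg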
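The symptom
I came to work in a great mood this morning and found a task sitting quietly in the project tracker: upgrade openGauss from 5.0.0 to 6.0.0. It seemed to be saying "come on then", so fine, let's get to it.
First, the install package. I opened the download page on the openGauss website and, hang on, there's a 6.0.1 LTS as well. On the principle of "newer beats older", I checked with my lead right away: it's an LTS either way, we haven't started yet, so let's just go with 6.0.1. Permission granted, off we go!
I opened the cloud console, spun up a VM, swapped the 6.0.1 package into my old script directory, ran bash install.sh, and bingo. You keep running, I'm off to get some water.
I came back and everything looked fine. Smooth as always, Gauss-chan. Let's check the status.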
[root@sss003 gauss]# su - omm -c "gs_om -t status --detail"
-bash:行1: gs_om:未找到命令
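Uh, what? Something is off. Isn't it a bit early in the morning for a bug? Nothing for it, let's see where it went wrong.
Tracking it down
Familiar path, familiar little log, here I come. I know the way by heart, and the log promptly smacked me in the face.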
---------------Execute pre installation script-----------------
Parsing the configuration file.
Successfully parsed the configuration file.
Installing the tools on the local node.
Successfully installed the tools on the local node.
Setting host ip env
Successfully set host ip env.
Preparing SSH service.
Successfully prepared SSH service.
Checking OS software.
Successfully check OS software.
Checking OS version.
Successfully checked OS version.
Checking cpu instructions.
Successfully checked cpu instructions.
Creating cluster's path.
Successfully created cluster's path.
Set and check OS parameter.
Setting OS parameters.
Successfully set OS parameters.
[GAUSS-51400] : Failed to execute the command: python3 '/opt/openGauss/script/gs_checkos' -h sss003 -i A -l '/var/log/vdi/gaussdb/omm/om/gs_local.log' -X '/opt/openGauss/cluster_config.xml'.Error:
Checking items:
A1. [ OS version status ] : Normal
A2. [ Kernel version status ] : Normal
A3. [ Unicode status ] : Normal
A4. [ Time zone status ] : Normal
A5. [ Swap memory status ] : Normal
A6. [ System control parameters status ] : Warning
A7. [ File system configuration status ] : Warning
A8. [ Disk configuration status ] : Normal
A9. [ Pre-read block size status ] : Normal
A10.[ IO scheduler status ] : Normal
[sss003]:
[GAUSS-51632] : Failed to do python3 '/opt/openGauss/script/local/LocalCheckOS.py' -t Check_Network_Bond_Mode -X '/opt/openGauss/cluster_config.xml' -l '/var/log/vdi/gaussdb/omm/om/gs_local.log'. Error:
[GAUSS-50604] : Failed to obtain network interface card of backIp(fe80::4417:bdb2:313c:5971). Error:
[GAUSS-50604] : Failed to obtain network interface card of backIp(fe80::4417:bdb2:313c:5971)..
What on earth is this? My IP is clearly 10.125.8.133, so why are you looking up an IPv6 address I've never seen, fe80::4417:bdb2:313c:5971? Let's check the IPs.
[root@sss003 gauss]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 0c:da:41:1d:da:60 brd ff:ff:ff:ff:ff:ff
inet 10.125.8.133/22 brd 10.125.11.255 scope global dynamic noprefixroute ens7
valid_lft 20150sec preferred_lft 20150sec
inet6 fe80::17c1:5060:87e8:8ea9/64 scope link dadfailed tentative noprefixroute
valid_lft forever preferred_lft forever
inet6 fe80::4417:bdb2:313c:5971/64 scope link dadfailed tentative noprefixroute
valid_lft forever preferred_lft forever
inet6 fe80::308:ee7:aab3:7a19/64 scope link noprefixroute
valid_lft forever preferred_lft forever
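Okay, my bad, that address really does exist. But so what if it does? Why throw an error over it?
The message says /opt/openGauss/script/local/LocalCheckOS.py failed while running the Check_Network_Bond_Mode check item, so let's dig in.
Time to summon the trusty breakpoint. I opened my notebook and started reciting my notes.
In the Python file:
import pdb;pdb.set_trace()
drops into the debugger at that line
p prints a variable's value
n executes the next line
c continues running until the next breakpoint or the end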
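Those notes can be sketched as a tiny script. This is only an illustration: `pick_nic` and its table of addresses are made up for the example and are not openGauss code. The breakpoint is guarded by a flag so the function can also run unattended.

```python
import pdb


def pick_nic(ip_address, debug=False):
    """Hypothetical stand-in for the function under investigation."""
    nics = {"lo": "127.0.0.1", "ens7": "10.125.8.133"}
    if debug:
        # Execution pauses here; use p / n / c at the (Pdb) prompt.
        pdb.set_trace()
    for name, address in nics.items():
        if address == ip_address:
            return name
    return ""


if __name__ == "__main__":
    # Pass debug=True to stop inside the function before the loop runs.
    print(pick_nic("10.125.8.133"))
```

Calling pick_nic("10.125.8.133", debug=True) stops just before the loop, where p ip_address shows the argument and n walks through the comparison one line at a time.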
Right, open LocalCheckOS.py, slam import pdb;pdb.set_trace() in at line 811, and run it once more.
[root@sss003 ~]# python3 '/opt/openGauss/script/local/LocalCheckOS.py' -t Check_Network_Bond_Mode -X '/opt/openGauss/cluster_config.xml' -l '/var/log/vdi/gaussdb/omm/om/gs_local.log'
> /opt/openGauss/script/local/LocalCheckOS.py(813)CheckNetWorkBonding()
-> networkCardNum = NetUtil.getNICNum(serviceIP)
(Pdb) p serviceIP
'fe80::4417:bdb2:313c:5971'
(Pdb)
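Hm? Something is wrong here, eleven-out-of-ten wrong. Why is it already an IPv6 address by the time it gets passed down? What is serviceIP, and where does it come from?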
from gspylib.common.Common import DefaultValue
...
elif (g_opts.action == ACTION_CHECK_NETWORK_BOND_MODE):
    CheckNetWorkBonding(DefaultValue.getIpByHostName(), True)
Oops, I was looking in the wrong place. So it's DefaultValue.getIpByHostName(); one more time, then.
Find /opt/openGauss/script/gspylib/common/Common.py:
@staticmethod
def getIpByHostName():
    '''
    function: get local host ip by the hostname
    input : NA
    output: hostIp
    '''
    # get hostname
    hostname = socket.gethostname()
    # get local host in /etc/hosts
    cmd = "grep -E \"^[1-9 \\t].*%s[ \\t]*#Gauss.* IP Hosts Mapping$\" " \
          "/etc/hosts | grep -E \" %s \"" % (hostname, hostname)
    (status, output) = subprocess.getstatusoutput(cmd)
    if (status == 0 and output != ""):
        hostIp = output.strip().split(' ')[0].strip()
        return hostIp
    # get local host by os function
    addr_info = socket.getaddrinfo(hostname, None)
    for info in addr_info:
        # Extract IPv4 or IPv6 addresses from address information
        hostIp = info[NetUtil.ADDRESS_FAMILY_INDEX][NetUtil.IP_ADDRESS_INDEX]
    # due to two loopback address in ubuntu, 127.0.1.1 are choosed by hostname.
    # there is need to choose 127.0.0.1
    distname, version, idnum = LinuxDistro.linux_distribution()
    version = LinuxDistro.linux_distribution()[1].split('/')[0]
    if distname in (UBUNTU, DEBIAN) and hostIp == "127.0.1.1":
        hostIp = "127.0.0.1"
    return hostIp
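To see what the first branch is looking for, here is a rough Python approximation of the two chained grep filters. It is a sketch under assumptions: `ip_from_hosts_lines` is a name invented for this example, the sample /etc/hosts lines are hypothetical, and Python's re is not identical to grep -E in every corner case.

```python
import re


def ip_from_hosts_lines(lines, hostname):
    """Approximate the two chained grep filters applied to /etc/hosts."""
    tagged = re.compile(
        r"^[1-9 \t].*%s[ \t]*#Gauss.* IP Hosts Mapping$" % re.escape(hostname)
    )
    for line in lines:
        # Second filter: the hostname must have a space on both sides.
        if tagged.search(line) and (" %s " % hostname) in line:
            # Same extraction as the method: first space-separated field.
            return line.strip().split(" ")[0].strip()
    return None


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = [
        "127.0.0.1 localhost",
        "10.125.8.133 sss003 #Gauss OM IP Hosts Mapping",
    ]
    print(ip_from_hosts_lines(sample, "sss003"))
```

A plain "10.125.8.133 sss003" entry without the trailing "#Gauss ... IP Hosts Mapping" comment does not satisfy the pattern, in which case the method falls through to the getaddrinfo branch.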
Ugh, it's a staticmethod. Now what? How do I even run that on its own? Better ask the all-knowing AI.
Add an if __name__ == "__main__": block at the end of the file that calls DefaultValue.getIpByHostName() directly.
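In miniature, that advice looks like the sketch below. `DefaultValueDemo` and its placeholder method are invented for the example; only the pattern (call the static method on the class from a __main__ block) carries over.

```python
class DefaultValueDemo:
    """Hypothetical stand-in for a class that only exposes static methods."""

    @staticmethod
    def get_ip_by_host_name():
        # Placeholder body; the real method resolves the local hostname.
        return "127.0.0.1"


if __name__ == "__main__":
    # A staticmethod needs no instance: call it straight on the class.
    print(DefaultValueDemo.get_ip_by_host_name())
```

The block only runs when the file is executed directly, so modules that import the file are unaffected. When a file inside a package is run this way, pointing PYTHONPATH at the package root lets its own imports resolve.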
Okay, no sooner said than done. Again and again.
[root@sss003 ~]# PYTHONPATH=/opt/openGauss/script/ python3 /opt/openGauss/script/gspylib/common/Common.py
> /opt/openGauss/script/gspylib/common/Common.py(720)getIpByHostName()
-> hostname = socket.gethostname()
(Pdb) p hostname
*** NameError: name 'hostname' is not defined
(Pdb) n
> /opt/openGauss/script/gspylib/common/Common.py(723)getIpByHostName()
-> cmd = "grep -E \"^[1-9 \\t].*%s[ \\t]*#Gauss.* IP Hosts Mapping$\" " \
(Pdb) p hostname
'sss003'
(Pdb) n
> /opt/openGauss/script/gspylib/common/Common.py(724)getIpByHostName()
-> "/etc/hosts | grep -E \" %s \"" % (hostname, hostname)
(Pdb) n
> /opt/openGauss/script/gspylib/common/Common.py(723)getIpByHostName()
-> cmd = "grep -E \"^[1-9 \\t].*%s[ \\t]*#Gauss.* IP Hosts Mapping$\" " \
(Pdb) n
> /opt/openGauss/script/gspylib/common/Common.py(725)getIpByHostName()
-> (status, output) = subprocess.getstatusoutput(cmd)
(Pdb) n
> /opt/openGauss/script/gspylib/common/Common.py(726)getIpByHostName()
-> if (status == 0 and output != ""):
(Pdb) p status
1
(Pdb) n
> /opt/openGauss/script/gspylib/common/Common.py(731)getIpByHostName()
-> addr_info = socket.getaddrinfo(hostname, None)
(Pdb) n
> /opt/openGauss/script/gspylib/common/Common.py(732)getIpByHostName()
-> for info in addr_info:
(Pdb) p addr_info
[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('10.125.8.133', 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('10.125.8.133', 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('10.125.8.133', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('fe80::308:ee7:aab3:7a19', 0, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('fe80::308:ee7:aab3:7a19', 0, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_RAW: 3>, 0, '', ('fe80::308:ee7:aab3:7a19', 0, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('fe80::17c1:5060:87e8:8ea9', 0, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('fe80::17c1:5060:87e8:8ea9', 0, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_RAW: 3>, 0, '', ('fe80::17c1:5060:87e8:8ea9', 0, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('fe80::4417:bdb2:313c:5971', 0, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('fe80::4417:bdb2:313c:5971', 0, 0, 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_RAW: 3>, 0, '', ('fe80::4417:bdb2:313c:5971', 0, 0, 0))]
(Pdb) n
> /opt/openGauss/script/gspylib/common/Common.py(734)getIpByHostName()
-> hostIp = info[NetUtil.ADDRESS_FAMILY_INDEX][NetUtil.IP_ADDRESS_INDEX]
(Pdb)
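At this point I looked back at the code and relaxed. It doesn't seem wrong, really: the /etc/hosts lookup came back with status 1, so the method fell through to getaddrinfo, fetched every IP for the hostname, looped over them, and handed the last one to CheckNetWorkBonding as its return value. So why does that blow up?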
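The "last one wins" behaviour is easy to reproduce with the addresses from the session above. A sketch, with two assumptions: NetUtil.ADDRESS_FAMILY_INDEX and NetUtil.IP_ADDRESS_INDEX are taken to be 4 and 0 (the sockaddr tuple and its first element), and `first_ipv4_address` is a hypothetical alternative, not what openGauss does.

```python
import socket

# Shape of the getaddrinfo() result seen above, trimmed to one entry per
# address: (family, type, proto, canonname, sockaddr).
ADDR_INFO = [
    (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.125.8.133", 0)),
    (socket.AF_INET6, socket.SOCK_STREAM, 6, "", ("fe80::308:ee7:aab3:7a19", 0, 0, 0)),
    (socket.AF_INET6, socket.SOCK_STREAM, 6, "", ("fe80::17c1:5060:87e8:8ea9", 0, 0, 0)),
    (socket.AF_INET6, socket.SOCK_STREAM, 6, "", ("fe80::4417:bdb2:313c:5971", 0, 0, 0)),
]


def last_address(addr_info):
    """Mimic the loop: every iteration overwrites the previous result."""
    host_ip = None
    for info in addr_info:
        host_ip = info[4][0]
    return host_ip


def first_ipv4_address(addr_info):
    """Hypothetical alternative: stop at the first IPv4 entry."""
    for family, _type, _proto, _canonname, sockaddr in addr_info:
        if family == socket.AF_INET:
            return sockaddr[0]
    return None


if __name__ == "__main__":
    print(last_address(ADDR_INFO))
    print(first_ipv4_address(ADDR_INFO))
```

Which address comes last depends on the order the resolver returns them in, so the same code can give a perfectly usable IPv4 address on one machine and a link-local IPv6 address on another.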
Ugh, how annoying. Back to pestering LocalCheckOS.
[root@sss003 ~]# python3 '/opt/openGauss/script/local/LocalCheckOS.py' -t Check_Network_Bond_Mode -X '/opt/openGauss/cluster_config.xml' -l '/var/log/vdi/gaussdb/omm/om/gs_local.log'
> /opt/openGauss/script/local/LocalCheckOS.py(813)CheckNetWorkBonding()
-> networkCardNum = NetUtil.getNICNum(serviceIP)
(Pdb) n
Exception: [GAUSS-50604] : Failed to obtain network interface card of backIp(fe80::4417:bdb2:313c:5971). Error:
[GAUSS-50604] : Failed to obtain network interface card of backIp(fe80::4417:bdb2:313c:5971).
...
from base_utils.os.net_util import NetUtil
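Ah, so now we've landed in NetUtil.getNICNum. On we go; it gets the same treatment as Common.py.
# Add the following code to the end of net_util.py
if __name__ == "__main__":
    # This assumes getNICNum takes an IP address as its argument
    ip_address = "fe80::4417:bdb2:313c:5971"  # replace with the IP address you want to look up
    nic_num = NetUtil.getNICNum(ip_address)
    print(f"The NIC number for IP {ip_address} is {nic_num}")
Run NetUtil once more: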
[root@sss003 ~]# PYTHONPATH=/opt/openGauss/script/ python3 /opt/openGauss/script/base_utils/os/net_util.py
> /opt/openGauss/script/base_utils/os/net_util.py(552)getNICNum()
-> net_work_num = ""
(Pdb) n
> /opt/openGauss/script/base_utils/os/net_util.py(553)getNICNum()
-> net_work_info = psutil.net_if_addrs()
(Pdb) n
> /opt/openGauss/script/base_utils/os/net_util.py(554)getNICNum()
-> for nic_num in list(net_work_info.keys()):
(Pdb) p net_work_info
{'lo': [snicaddr(family=<AddressFamily.AF_INET: 2>, address='127.0.0.1', netmask='255.0.0.0', broadcast=None, ptp=None), snicaddr(family=<AddressFamily.AF_INET6: 10>, address='::1', netmask='ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff', broadcast=None, ptp=None), snicaddr(family=<AddressFamily.AF_PACKET: 17>, address='00:00:00:00:00:00', netmask=None, broadcast=None, ptp=None)], 'ens7': [snicaddr(family=<AddressFamily.AF_INET: 2>, address='10.125.8.133', netmask='255.255.252.0', broadcast='10.125.11.255', ptp=None), snicaddr(family=<AddressFamily.AF_INET6: 10>, address='fe80::17c1:5060:87e8:8ea9%ens7', netmask='ffff:ffff:ffff:ffff::', broadcast=None, ptp=None), snicaddr(family=<AddressFamily.AF_INET6: 10>, address='fe80::4417:bdb2:313c:5971%ens7', netmask='ffff:ffff:ffff:ffff::', broadcast=None, ptp=None), snicaddr(family=<AddressFamily.AF_INET6: 10>, address='fe80::308:ee7:aab3:7a19%ens7', netmask='ffff:ffff:ffff:ffff::', broadcast=None, ptp=None), snicaddr(family=<AddressFamily.AF_PACKET: 17>, address='0c:da:41:1d:da:60', netmask=None, broadcast='ff:ff:ff:ff:ff:ff', ptp=None)]}
Now compare that result against the code of NetUtil.getNICNum:
@staticmethod
def getNICNum(ip_address):
    """
    function: Obtain network interface card number by psutil module
    input: ip_address
    output: netWorkNum
    """
    try:
        net_work_num = ""
        net_work_info = psutil.net_if_addrs()
        for nic_num in list(net_work_info.keys()):
            for net_info in net_work_info[nic_num]:
                if net_info.address == ip_address:
                    net_work_num = nic_num
                    break
        if net_work_num == "":
            raise Exception(ErrorCode.GAUSS_506["GAUSS_50604"] % ip_address)
        return net_work_num
    except Exception as excep:
        raise Exception(ErrorCode.GAUSS_506["GAUSS_50604"] % ip_address +
                        " Error: \n%s" % str(excep))
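The method takes the result of psutil.net_if_addrs() and matches it against the IP passed in to get the name of the network interface. But in that result, fe80::4417:bdb2:313c:5971 carries a %ens7 suffix, so the string comparison never matches and the method raises. The address itself came along with the VM template; it serves no real purpose and can't be removed either. So what do we do about it?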
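The mismatch can be reproduced without psutil by copying the addresses from the dump above into a plain dict. This is a sketch only: `Addr` is a minimal stand-in for psutil's snicaddr, and `find_nic_ignoring_zone` is a hypothetical variant for comparison, not the fix that shipped in openGauss.

```python
from collections import namedtuple

# Minimal stand-in for psutil's snicaddr, holding only the field we compare.
Addr = namedtuple("Addr", ["address"])

NET_IF_ADDRS = {
    "lo": [Addr("127.0.0.1"), Addr("::1")],
    "ens7": [
        Addr("10.125.8.133"),
        Addr("fe80::17c1:5060:87e8:8ea9%ens7"),
        Addr("fe80::4417:bdb2:313c:5971%ens7"),
        Addr("fe80::308:ee7:aab3:7a19%ens7"),
    ],
}


def find_nic_exact(net_if_addrs, ip_address):
    """Same comparison as the loop above: plain string equality."""
    for nic, addrs in net_if_addrs.items():
        for addr in addrs:
            if addr.address == ip_address:
                return nic
    return ""


def find_nic_ignoring_zone(net_if_addrs, ip_address):
    """Hypothetical variant: drop any '%zone' suffix on both sides first."""
    wanted = ip_address.split("%", 1)[0]
    for nic, addrs in net_if_addrs.items():
        for addr in addrs:
            if addr.address.split("%", 1)[0] == wanted:
                return nic
    return ""


if __name__ == "__main__":
    ip = "fe80::4417:bdb2:313c:5971"
    print(repr(find_nic_exact(NET_IF_ADDRS, ip)))
    print(repr(find_nic_ignoring_zone(NET_IF_ADDRS, ip)))
```

With this table, an exact match on 10.125.8.133 finds ens7 and an exact match on 127.0.0.1 finds lo; only the link-local IPv6 addresses miss, because of the suffix.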
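Well, this is where we go back to DefaultValue.getIpByHostName(). The code already uses 127.0.0.1 directly in a few special cases, so let's be bold and use 127.0.0.1 across the board. it's-not-like-it-doesn't-work.jpg
Changed it, ran it again, and OK, problem solved.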
[root@sss003 ~]# python3 '/opt/openGauss/script/local/LocalCheckOS.py' -t Check_Network_Bond_Mode -X '/opt/openGauss/cluster_config.xml' -l '/var/log/vdi/gaussdb/omm/om/gs_local.log'
> /opt/openGauss/script/local/LocalCheckOS.py(813)CheckNetWorkBonding()
-> networkCardNum = NetUtil.getNICNum(serviceIP)
(Pdb) c
> /opt/openGauss/script/base_utils/os/net_util.py(552)getNICNum()
-> net_work_num = ""
(Pdb) c
BondMode Null
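One last thing: why does fe80::4417:bdb2:313c:5971 come back from psutil.net_if_addrs() with %ens7 attached? I couldn't verify that. It seems to come from a lower-level, kernel-side call, so there was no way in from that end. (The suffix has the form of an IPv6 zone identifier, which ties a link-local address to its interface.) I also never worked out why getIpByHostName loops over every address and then returns the last one. In 6.0.2, though, this code is gone: it was rewritten from scratch and now filters on /etc/hosts instead.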

