This morning I noticed that bvmhost-p09-04 was not responding, so I restarted it using ipmitool from noc01. After the reboot, the three build machines in the title alerted as being down. They couldn't be accessed through virsh console.
It would be nice to have them back, but it's not urgent.
I started reinstalling buildvm-ppc64le-25. I hadn't done this before, so starting with just one should be enough. After destroying the VM first, I ran ansible-playbook /srv/web/infra/ansible/playbooks/groups/buildvm.yml -l buildvm-ppc64le-25.iad2.fedoraproject.org and encountered the following issues:
'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_python'. This was easy to work around by adding -e ansible_python_interpreter=/usr/bin/python3. (It should probably be added to some vars file; I tried adding it to the buildvm_ppc64le file, but that didn't help, so I reverted the commit.)
Failed to resolve server ntap-iad2-c02-fedora01-nfs01a: Name or service not known when mounting the /mnt/fedora koji mount. This looks like a name resolution issue, as the IP of the machine doesn't seem to be the problem. I found that /etc/resolv.conf differs from the one on buildvm-ppc64le-24, which I didn't touch. The machine is managed by systemd-resolved, so I don't think editing /etc/resolv.conf directly is a good approach here.
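To illustrate what the resolv.conf difference means in practice: without the iad2 search domain, the short NFS server name can't be expanded to a resolvable FQDN. The helper below is purely hypothetical (not from the playbooks), just a sketch of the check:

```shell
# Hypothetical helper: does a resolv.conf-style snippet carry the iad2
# search domain needed to expand short names like
# ntap-iad2-c02-fedora01-nfs01a?
check_search() {
  echo "$1" | grep -q 'search.*iad2\.fedoraproject\.org' \
    && echo ok || echo missing
}

# What buildvm-ppc64le-24 (the untouched machine) effectively has:
check_search "search iad2.fedoraproject.org fedoraproject.org"   # → ok
# What a stub-only resolv.conf looks like on the broken host:
check_search "nameserver 127.0.0.53"                             # → missing
```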
I thought this playbook was being run often, so I'm surprised I wasn't even able to get through the nfs/client role when running it.
I fixed the DNS issue with
nmcli connection modify eth0 ipv4.dns-search "iad2.fedoraproject.org,fedoraproject.org"
nmcli device disconnect eth0 && nmcli device connect eth0
The buildvm-ppc64le-25 is back!
DNS should be configured correctly. The problem here is a long-standing bug where linux-system-roles.network sets up everything correctly in NetworkManager, but somehow systemd-resolved isn't notified of the configuration, so it's still running with some default. I guess we should file an upstream bug on it and ask for advice.
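The desync can be thought of as NetworkManager and systemd-resolved disagreeing about the search domains. The comparison function below is a made-up illustration; on a real host the two inputs would come from something like `nmcli -g ipv4.dns-search connection show eth0` and `resolvectl domain eth0` (exact output formats vary):

```shell
# Hypothetical sketch of the bug: NetworkManager has the search domains
# configured, but systemd-resolved is still running with an empty list.
compare_domains() {
  # $1: what NetworkManager is configured with
  # $2: what systemd-resolved is actually using
  [ "$1" = "$2" ] && echo in-sync || echo desynced
}

compare_domains "iad2.fedoraproject.org,fedoraproject.org" ""   # → desynced
```

Re-activating the connection (as in the fix above) makes NetworkManager push its DNS settings to resolved again, bringing the two back in sync.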
On the 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_python' error: that happens when the vmhost has no facts cache entry for this. You can 'fix' it by running 'ansible -m setup vmhostname' and then re-running the playbook. I think we could just look at dropping all the python3 detection stuff here, since as far as I know there are no python2-using targets left. ;)
ok. I removed the special handling from virt_instance_create and tested on buildvm-x86-01.stg and it seems to work fine. ;)
Now we just need to figure out how to fix the resolved thing.
ok. I put an ugly workaround in roles/nfs/client to just always run 'nmcli c up eth0'. This should work on the new hosts, and old ones shouldn't change, so they should be ok too.
Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)
Let me add this as a comment in the playbook, as I couldn't find anything about it anywhere.