#12219 buildvm-ppc64le-{25,26,29} not responding
Closed: Fixed with Explanation 5 months ago by kevin. Opened 5 months ago by zlopez.

Describe what you would like us to do:


This morning I noticed that bvmhost-p09-04 was not responding, so I restarted it using ipmitool from noc01. After the reboot, the three build machines in the title alerted as being down. They couldn't be accessed through virsh console.

When do you need this to be done by? (YYYY/MM/DD)


It would be nice to have them back, but not urgent


I started reinstalling buildvm-ppc64le-25. I haven't done this before, so starting with just one should be enough. When running ansible-playbook /srv/web/infra/ansible/playbooks/groups/buildvm.yml -l buildvm-ppc64le-25.iad2.fedoraproject.org after first destroying the VM, I encountered the following issues:

  • 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_python'
    It was easy to work around by adding -e ansible_python_interpreter=/usr/bin/python3 (this should probably be added to a vars file; I tried adding it to the buildvm_ppc64le file, but that didn't help, so I reverted the commit)

  • Failed to resolve server ntap-iad2-c02-fedora01-nfs01a: Name or service not known when mounting /mnt/fedora koji mount
    This looks like a name resolution issue, as the IP of the machine doesn't seem to be the problem. I found that /etc/resolv.conf differs from the one on buildvm-ppc64le-24, which I didn't touch. DNS on the machine is managed by systemd-resolved, so I don't think editing /etc/resolv.conf directly is a good approach here.
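
To compare the two hosts, it can help to print only the search domains from each /etc/resolv.conf. A minimal sketch, assuming a standard resolv.conf-style file; search_domains and has_search_domain are hypothetical helper names, not existing Fedora infra tooling:

```shell
# Print the domains from every "search" line of a resolv.conf-style file,
# one per line. (Hypothetical helper, for illustration only.)
search_domains() {
    # $1: path to the file (defaults to /etc/resolv.conf)
    awk '$1 == "search" { for (i = 2; i <= NF; i++) print $i }' "${1:-/etc/resolv.conf}"
}

# Succeed if the given domain appears in the file's search list.
has_search_domain() {
    # $1: domain to look for, $2: path to the file (optional)
    search_domains "${2:-/etc/resolv.conf}" | grep -qxF "$1"
}
```

Running 'has_search_domain iad2.fedoraproject.org' on both the reinstalled host and an untouched one would show whether the search list is what differs.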

I thought this playbook was run often, so I'm surprised I couldn't even get through the nfs/client role when running it.

I fixed the DNS issue with

nmcli connection modify eth0 ipv4.dns-search "iad2.fedoraproject.org,fedoraproject.org"
nmcli device disconnect eth0 && nmcli device connect eth0
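
One way to confirm the change took effect is to look at the search domains systemd-resolved reports for the link and then resolve the short NFS server name. This is a sketch, not what was run on the ticket; check_dns is a hypothetical helper, and the RESOLVECTL and GETENT variables are assumptions added so the commands can be previewed without running them:

```shell
# Show the search domains systemd-resolved uses on a link, then try to
# resolve a short host name through NSS. (Hypothetical helper.)
# RESOLVECTL and GETENT may be overridden, e.g. with "echo" for a dry run.
check_dns() {
    # $1: interface (eth0 in this ticket), $2: short host name to resolve
    "${RESOLVECTL:-resolvectl}" domain "$1" || return 1
    "${GETENT:-getent}" hosts "$2"
}
```

For example, 'check_dns eth0 ntap-iad2-c02-fedora01-nfs01a' should list the iad2.fedoraproject.org search domain and return an address once the fix is in place.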

The buildvm-ppc64le-25 is back!

DNS should be configured correctly. The problem here is a long-standing bug where linux-system-roles.network sets up everything correctly in NetworkManager, but somehow systemd-resolved isn't notified of the configuration, so it's still running with some default. I guess we should file an upstream bug on it and ask for advice.

On the 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_python' error: that happens when the vmhost has no facts cache entry for this. You can 'fix' it by doing an 'ansible -m setup vmhostname' and then re-running the playbook. I think we could just look at dropping all the python3 detection stuff here since, as far as I know, we have no python2-using targets left. ;)
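
The two steps can be sketched as a small wrapper. refresh_and_rerun is a hypothetical name, and the ANSIBLE and ANSIBLE_PLAYBOOK override variables are assumptions added so the commands can be previewed; the playbook path is the one from the original report:

```shell
# Gather facts for the VM host (repopulating its fact cache entry), then
# re-run the buildvm playbook limited to one guest. (Hypothetical wrapper.)
# ANSIBLE and ANSIBLE_PLAYBOOK may be overridden, e.g. with "echo".
refresh_and_rerun() {
    # $1: VM host name, $2: guest to limit the playbook run to
    "${ANSIBLE:-ansible}" "$1" -m setup || return 1
    "${ANSIBLE_PLAYBOOK:-ansible-playbook}" \
        /srv/web/infra/ansible/playbooks/groups/buildvm.yml -l "$2"
}
```

Here the first argument would be the vmhost that hosts the guest, and the second the guest itself, for example buildvm-ppc64le-25.iad2.fedoraproject.org.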

ok. I removed the special handling from virt_instance_create and tested on buildvm-x86-01.stg and it seems to work fine. ;)

Now we just need to figure out how to fix the resolved thing.

ok. I put an ugly workaround in roles/nfs/client to just always run 'nmcli c up eth0'
This should work on the new hosts and old ones shouldn't change, so they should be ok too.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

5 months ago

On the 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_python' that happens when the vmhost has no facts cache entry for this. You can 'fix' it by doing a 'ansible -m setup vmhostname' and then re-running the playbook. I think we could just look at dropping all the python3 detection stuff here since as far as I know there's no python2 using targets that we have left. ;)

Let me add this as a comment in the playbook, as I couldn't find anything about it.
