Last few days we've been getting nagios notifications about testcloud having more than 100 running instances on taskotron-dev (qa11.qa). Note that it does not happen on production and staging. The only difference I can think of is that we run Fedora 24 on qa11 and Fedora 25 on the rest.
I did some digging and these are two examples that cause it: https://taskotron-dev.fedoraproject.org/execdb/jobs/0072eb5c-ef4f-11e6-8b74-525400cb0b45 https://taskotron-dev.fedoraproject.org/execdb/jobs/5dcc5f80-ef50-11e6-8b74-525400cb0b45
Looking at the logs of one of them (https://taskotron-dev.fedoraproject.org/taskmaster/builders/x86_64/builds/308325/steps/runtask/logs/taskotron-overlord.log) we can see this is suspicious:
[libtaskotron:minion.py:245] 2017-02-10 05:09:11 INFO Shutting down the minion... [testcloud.instance:instance.py:446] 2017-02-10 05:09:11 DEBUG removing instance taskotron-0072eb5c-ef4f-11e6-8b74-525400cb0b45 from libvirt. [testcloud.instance:instance.py:421] 2017-02-10 05:09:12 DEBUG stopping instance taskotron-0072eb5c-ef4f-11e6-8b74-525400cb0b45. [libtaskotron:minion.py:257] 2017-02-10 05:09:13 ERROR Domain not found: no domain with matching uuid '287714b4-8250-4c8e-a541-2f5a267e0acf' (taskotron-008e0a86-ef4f-11e6-8068-525400cb0b45) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/libtaskotron/minion.py", line 246, in execute task_vm.teardown() File "/usr/lib/python2.7/site-packages/libtaskotron/ext/disposable/vm.py", line 179, in teardown tc_instance.remove(autostop=True) File "/usr/lib/python2.7/site-packages/testcloud/instance.py", line 457, in remove self.stop() File "/usr/lib/python2.7/site-packages/testcloud/instance.py", line 423, in stop system_domains = _list_system_domains("qemu:///system") File "/usr/lib/python2.7/site-packages/testcloud/instance.py", line 82, in _list_system_domains domains[domain.name()] = DOMAIN_STATUS_ENUM[domain.state()[0]] File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2767, in state if ret is None: raise libvirtError ('virDomainGetState() failed', dom=self) libvirtError: Domain not found: no domain with matching uuid '287714b4-8250-4c8e-a541-2f5a267e0acf' (taskotron-008e0a86-ef4f-11e6-8068-525400cb0b45)
It can be seen that first, libtaskotron is trying to stop taskotron-0072eb5c-ef4f-11e6-8b74-525400cb0b45 but then complains that taskotron-008e0a86-ef4f-11e6-8068-525400cb0b45, which is different testcloud instance, cannot be found so the minion is not stopped/destroyed. Somehow the testcloud instance related to the taskotron task is swapped in the process?
taskotron-0072eb5c-ef4f-11e6-8b74-525400cb0b45
taskotron-008e0a86-ef4f-11e6-8068-525400cb0b45
Find out what the problem is and propose a fix.
This ticket is a duplicate of https://pagure.io/testcloud/issue/47
This ticket had assigned some Differential requests: D1135
It's been happening on prod as well :/
! In #905#12966, @mkrizek wrote: It's been happening on prod as well :/
Yeah, I've been trying to clean off the prod virthosts by hand to keep them running
So, the problem seems to be in this testcloud code (instance.py:_list_system_domains()): ```lang=python for domain in conn.listAllDomains(): # the libvirt docs seem to indicate that the second int is for state # details, only used when state is ERROR, so only looking at the first # int returned for domain.state()
instance.py:_list_system_domains()
domains[domain.name()] = DOMAIN_STATUS_ENUM[domain.state()[0]]
There's a race condition. Since many VMs are running and being destroyed at the same time, a domain foo might exist during `listAllDomains()`, but might no longer exist during `domain.state()`. Which raises a libvirt error:
libvirtError: Domain not found: no domain with matching uuid ...<snip>
which in turn crashes testcloud, which aborts our teardown method in libtaskotron. Thus the task VM (which is a different one from the one "missing") stays active. As an example, here's a task 1 that couldn't spawn a VM properly, but testcloud cleaned all the files just fine: http://taskotron-dev.fedoraproject.org/taskmaster/builders/x86_64/builds/317643/steps/runtask/logs/taskotron-overlord.log
[testcloud.instance:instance.py:468] 2017-02-17 12:21:35 DEBUG removing instance /var/lib/testcloud/instances/taskotron-6e5e695c-efb8-11e6-97cb-525400cb0b45 from disk
Notice the timestamp and the uuid. Now here's a task 2 that crashed because VM for task 1 was removed during listing system domains: http://taskotron-dev.fedoraproject.org/taskmaster/builders/x86_64/builds/317644/steps/runtask/logs/taskotron-overlord.log
[testcloud.instance:instance.py:446] 2017-02-17 12:21:34 DEBUG removing instance taskotron-6e68700a-efb8-11e6-ba5c-525400cb0b45 from libvirt. [libtaskotron:minion.py:257] 2017-02-17 12:21:35 ERROR Domain not found: no domain with matching uuid 'be18ca20-1e68-4e67-9c1b-b402674d6f9b' (taskotron-6e5e695c-efb8-11e6-97cb-525400cb0b45) ``` Notice the timestamp (the very same second) and also the uuid (it matches task 1's uuid). As a result, the VM for task 2 was not properly stopped and removed.
[[ https://koji.fedoraproject.org/koji/taskinfo?taskID=17919265 | Testcloud scratch build]] deployed to dev, seems to be working well (it's hard to say whether the race condition was fixed).
testcloud-0.1.11 is in Bodhi, we'll update dev and stg once it hits updates-testing
We still see some zombie VMs around. I found another cause - in [[http://taskotron-dev.fedoraproject.org/taskmaster/builders/x86_64/builds/384547 | this job]] buildbot killed the task for taking more than 40 minutes to run:
command timed out: 2400 seconds without output running ['runtask', '-i', 'java-1.8.0-openjdk-1.8.0.121-5.b14.fc24', '-t', 'koji_build', '-a', 'x86_64', '-j', 'x86_64/384547', '--uuid', 'e702f95e-f7e2-11e6-8068-525400cb0b45', 'runtask.yml'], attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=2810.013207
But that only kills runtask, and the VM is kept running. I have no idea why that happens, because in libtaskotron we have a timer for 15 minutes that should kill the VM if no output is received. But as a workaround, let's add this buildbot step to all jobs:
$ testcloud instance remove --force taskotron-$uuid; true
Also, we should put such packages on abipkgdiff blacklist, here's a list:
mariadb java-1.8.0-openjdk libreoffice
New testcloud, new buildbot step and extended abicheck blacklist has been deployed to dev and stg.
Everything has been deployed to production. Things are fixed now, hopefully.
Metadata Update from @mkrizek: - Issue tagged with: infrastructure
Log in to comment on this ticket.