Every time I've seen this, it was right after a reboot of the master node: the buildmaster service fails because systemd kills the process when startup doesn't complete quickly enough.
If I start the buildmaster outside of systemd, stop it and restart the systemd service, everything works fine.
Figure out how to keep the buildmaster service from failing on startup, either by making the initial startup faster or by increasing the timeout.
Metadata Update from @tflink: - Issue tagged with: infrastructure
Metadata Update from @frantisekz: - Issue assigned to frantisekz
I've found two possible solutions so far (or their power combined :D ):
- Increase Buildbot's timeout for the buildmaster: 10 seconds is a hardcoded value in ./master/buildbot/scripts/logwatcher.py and/or ./slave/buildbot/scripts/logwatcher.py.
- Tweak the buildmaster .service file; a possible solution might look like this (lines marked with + are additions):
  [Unit]
  Description=Buildmaster for taskbot
  After=network.target
+ StartLimitInterval=300
+ StartLimitBurst=5

  [Service]
  Type=forking
+ Restart=on-failure
+ RestartSec=90
I'll try them tomorrow morning. Due to the nature of this bug, I'll have to reboot the server each time, so expect some outages on -dev :P .
I've deployed the .service file fix to -dev.
With the modified .service file, the buildmaster service started fine after reboot (after one failed attempt).
Fixed by #254
Metadata Update from @frantisekz: - Issue close_status updated to: Fixed
I just got pinged about playbooks failing. Apparently, all the buildmaster restarts are failing in Ansible:
fatal: [taskotron01.qa.fedoraproject.org]: FAILED! => {"changed": false, "msg": "Unable to restart service buildmaster: Job for buildmaster.service failed because the control process exited with error code.\nSee \"systemctl status buildmaster.service\" and \"journalctl -xe\" for details.\n"}
Taking a quick look at the journal on taskotron01.qa, it looks like the start did fail but a restart was rescheduled. Since this is a new failure as far as I know and this was deployed this week, I'm assuming that this change had something to do with the ansible-reported failure.
Metadata Update from @tflink: - Issue status updated to: Open (was: Closed)
The fix above solved the "service doesn't start on boot" problem. But the same failure can also happen when restarting the service if I/O access is too slow (I assume there are simply too many files to read). You can easily simulate it with:
sync; echo 1 > /proc/sys/vm/drop_caches
systemctl restart buildmaster
Thanks to auto-restart being enabled, the service does get started after a while. But of course the playbook still fails.
The service gets restarted only in playbook roles/taskotron/buildmaster-configure/tasks/main.yml and only when the service content changes. Yesterday, it got deployed to all systems due to some changes that were not deployed before. In daily operations, the service should not get restarted.
I see the following options:
a. Patch Buildbot's logwatcher.py (see #comment-494658) to increase the timeout from 10 seconds to a higher value.
b. Ignore the result of starting/restarting the service in the playbook.
c. Make the playbook wait a minute and try again if the first attempt to restart the service fails.
d. Ignore this problem (document it as a comment in the playbook) and live with the fact that, if the service content changes, the first invocation might fail and you might need to wait a while and run it again.
Preferences?
From my point of view, option C seems simple enough and good enough.
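For illustration only, the "wait a minute and try again" idea from option C can be sketched as a small shell helper. This is not the playbook change itself; the function name `retry` and the attempt count and delay in the usage comment are assumptions made up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of the playbook or of Buildbot):
#   retry ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY seconds after each
# failed attempt. Returns 0 on the first success, 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        # Stop as soon as the command succeeds.
        if "$@"; then
            return 0
        fi
        # Give up once the allowed number of attempts is used up.
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Assumed usage: two attempts, 60 seconds apart.
# retry 2 60 systemctl restart buildmaster
```

In a playbook the same effect would more likely come from the task's own retry settings than from a shell wrapper; the sketch only shows the control flow.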
Frantisek "Teobald" implemented a fix along the lines of option C in: https://pagure.io/fedora-qa/qa-ansible/pull-request/2
Fixed by https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=49611b2e6e12efc2bcebde3b92617fbb6443702b