#139 buildmaster process starts too slowly for systemd unit files
Closed: Fixed (7 years ago). Opened 9 years ago by tflink.

The times I've seen this have been right after a reboot of the master node - the buildmaster service fails because systemd kills the process when startup doesn't complete quickly enough.

If I start the buildmaster outside of systemd, stop it, and then restart the systemd service, everything works fine.

Figure out how to keep the buildmaster service from failing on startup - either by making the initial startup process faster or by increasing the timeout.
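
(If the limit being hit is systemd's own start timeout, a drop-in override along these lines would raise it; this is only a sketch, the 300-second value is an arbitrary example, and it won't help if the timeout is inside buildbot itself:)

# /etc/systemd/system/buildmaster.service.d/timeout.conf
# Sketch only: raise systemd's start timeout for the buildmaster unit.
[Service]
TimeoutStartSec=300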


Metadata Update from @tflink:
- Issue tagged with: infrastructure

7 years ago

Metadata Update from @frantisekz:
- Issue assigned to frantisekz

7 years ago

I've found two possible solutions so far (or their powers combined :D):

  • increase buildbot's timeout for the buildmaster - the 10-second value is hardcoded in ./master/buildbot/scripts/logwatcher.py and/or ./slave/buildbot/scripts/logwatcher.py

  • try playing a little with the buildmaster's .service file; a possible solution might look like:

[Unit]
Description=Buildmaster for taskbot
After=network.target
+ StartLimitInterval=300
+ StartLimitBurst=5

[Service]
Type=forking
+ Restart=on-failure
+ RestartSec=90

I'll be trying them tomorrow morning, and due to the nature of this bug I'll have to reboot the server each time, so expect some outages on -dev :P

I've deployed the .service file fix to -dev.

The buildmaster service started just fine after a reboot with the modified .service file (after one failed attempt).

Metadata Update from @frantisekz:
- Issue close_status updated to: Fixed

7 years ago

I just got pinged about playbooks failing. Apparently, all the buildmaster restarts are failing in Ansible:

fatal: [taskotron01.qa.fedoraproject.org]: FAILED! => {"changed": false, "msg": "Unable to restart service buildmaster: Job for buildmaster.service failed because the control process exited with error code.\nSee \"systemctl  status buildmaster.service\" and \"journalctl  -xe\" for details.\n"}

Taking a quick look at the journal on taskotron01.qa, it looks like the start did fail but a restart was scheduled. Since this failure is new as far as I know, and the change was deployed this week, I'm assuming the change had something to do with the Ansible-reported failure.

Metadata Update from @tflink:
- Issue status updated to: Open (was: Closed)

7 years ago

The fix above fixed the "service doesn't start on boot" problem. But the same failure can also happen when restarting the service, if I/O access is too slow (I assume there are simply far too many files to read). You can easily simulate it with:

sync; echo 1 > /proc/sys/vm/drop_caches
systemctl restart buildmaster

Thanks to the auto-restart now being enabled, the service does get started after a while. But of course the playbook still fails.
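
(To watch that happen, plain systemctl/journalctl is enough; nothing here is project-specific:)

# show the failed first attempt and the pending automatic restart
systemctl status buildmaster
# full log of the unit for the last few minutes
journalctl -u buildmaster --since "10 minutes ago"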

The service only gets restarted by the playbook roles/taskotron/buildmaster-configure/tasks/main.yml, and only when the service content changes. Yesterday it got deployed to all systems because of some changes that had not been deployed before. In day-to-day operation, the service should not get restarted.

I see the following options:
a. Patch buildbot's logwatcher.py (see #comment-494658) to increase the timeout from 10 seconds to a higher value.
b. Ignore the result of starting/restarting the service in the playbook.
c. Make the playbook wait a minute and try again if the first attempt to restart the service fails (see the sketch below the list).
d. Ignore this problem (document it as a comment in the playbook) and live with the fact that if the service content changes, the first run might fail and you might need to wait a while and run it again.
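
A minimal sketch of what option c could look like in Ansible, assuming the restart is done by an ordinary service/systemd task; the task name, module choice, and retry/delay values below are illustrative, not taken from the actual role:

# Hypothetical retry wrapper for the restart task (option c).
# Instead of failing the play on the first error, wait and try again.
- name: restart buildmaster
  systemd:
    name: buildmaster
    state: restarted
  register: buildmaster_restart
  until: buildmaster_restart is succeeded
  retries: 2       # allow a couple of attempts
  delay: 60        # wait a minute between attempts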

Preferences?

From my point of view, C seems simple enough and good enough.

Metadata Update from @frantisekz:
- Issue close_status updated to: Fixed

7 years ago
