#11829 Recurring issue that Bugzilla stops sending messages to fedora-messaging
Opened 2 months ago by frostyx. Modified 3 days ago

This issue appears several times a week. For some reason, Bugzilla randomly stops sending messages to fedora-messaging and the Fedora Infra team needs to "poke it", which probably means redeploying Bugzilla2fedmsg in OpenShift.

Describe what you would like us to do:

Is this issue fixable? Or can we bandage it by an automatic daily redeploy of Bugzilla2fedmsg?

When do you need this to be done by? (YYYY/MM/DD)

At your convenience. But the sooner the better. I get pinged at least once a week and I need to ping you at least once a week to fix it.


Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 months ago

I was looking at the pod today and there isn't anything helpful in the logs. So I'm not sure what actually causes bugzilla2fedmsg to stop processing.

Yeah, it's pretty odd. It used to be pretty reliable, but now it seems to get stuck pretty often. ;(

Perhaps @abompard has some ideas here?

I have restarted bugzilla2fedmsg with debug logging. Hopefully it'll tell us what's going on the next time it's stuck.

I've had to restart it a few times... in the logs it looked like it was getting heartbeat messages back and forth, but nothing else. ;(

So, I don't think the debugging as it is now is going to tell us much.

This is happening again :-/
Can anybody poke it, please?

Done. Would it help if we lowered the threshold for the nagios alert?

I think so. I am not sure what the nagios alert threshold is, but the last time we talked about this issue it occurred multiple times a day, so I guess nagios needs to check on a scale of hours?

Right now it's set in roles/nagios_client/templates/check_datanommer_history.cfg.j2

command[check_datanommer_bugzilla]={{libdir}}/nagios/plugins/check_datanommer_timesince.py bugzilla 86400 259200

so, warning is 24 hours and critical is 72 hours.

Note that it's a balancing act: if we set it too low, an actual gap in bug changes could trigger the alert, but if we set it too high we miss when they stop flowing.
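
To make the trade-off concrete, here is a minimal sketch of how a "time since last message" check could map onto nagios states. It is not the code of check_datanommer_timesince.py; the function name classify_silence, the >= comparison, and the 4h/12h alternative thresholds are assumptions for illustration only. Only the values 86400 and 259200 come from the config line above.

```python
# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_silence(age_seconds, warn_seconds, crit_seconds):
    """Map the age of the newest message to a nagios state.

    Hypothetical helper: the real plugin may compare differently.
    """
    if age_seconds >= crit_seconds:
        return CRITICAL
    if age_seconds >= warn_seconds:
        return WARNING
    return OK


# Thresholds from the current config: 86400 s and 259200 s.
CURRENT_WARN, CURRENT_CRIT = 86400, 259200
# Hypothetical tighter thresholds (4 h and 12 h), chosen only for illustration.
TIGHT_WARN, TIGHT_CRIT = 4 * 3600, 12 * 3600

six_hours = 6 * 3600

# A 6-hour stall is invisible with the current thresholds...
current_state = classify_silence(six_hours, CURRENT_WARN, CURRENT_CRIT)
# ...but raises a warning with the tighter ones. A genuinely quiet
# 6-hour period with no bug changes would raise the same warning.
tight_state = classify_silence(six_hours, TIGHT_WARN, TIGHT_CRIT)

print(CURRENT_WARN // 3600, CURRENT_CRIT // 3600)  # threshold sizes in hours
print(current_state, tight_state)
```

With thresholds on a scale of hours, the check cannot tell a stuck bugzilla2fedmsg from a quiet night in Bugzilla, which is the balancing act described above.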

Do you think we can solve this caveman style and set a timer to reboot the Bugzilla2fedmsg service every hour? At this point, it sounds like the least painful solution to me

I restarted it again today, so I think restarting it every hour would probably be a good solution.
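
For reference, the "caveman" hourly restart could be sketched roughly as below. This is only an illustration under assumptions: the resource name deployment/bugzilla2fedmsg and the use of `oc rollout restart` are guesses, not taken from the actual OpenShift configuration, and in practice a cron job or an OpenShift CronJob would likely replace the Python loop.

```python
import subprocess
import time

# Assumed restart command: the resource kind and name are hypothetical and
# would need to match how bugzilla2fedmsg is actually deployed.
RESTART_CMD = ["oc", "rollout", "restart", "deployment/bugzilla2fedmsg"]


def restart_periodically(interval_seconds, run=subprocess.run,
                         sleep=time.sleep, max_runs=None):
    """Run RESTART_CMD every interval_seconds.

    run and sleep are injectable so the loop can be exercised without
    touching a cluster; max_runs=None means loop forever.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        sleep(interval_seconds)
        # check=True raises if the restart command fails.
        run(RESTART_CMD, check=True)
        runs += 1
    return runs


if __name__ == "__main__":
    restart_periodically(3600)  # once an hour
```

This restarts unconditionally, so it hides the underlying bug rather than fixing it, but it bounds the length of any outage to roughly one interval.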

Metadata
Boards 1
ops Status: Backlog