#11321 ResultsDB pods get OOMKilled
Closed: Fixed 2 years ago by lholecek. Opened 2 years ago by lholecek.

As one of the maintainers of the ResultsDB web service, I started getting frequent alerts from Alertmanager about pods running out of memory.

description = Container api in Pod resultsdb/resultsdb-api-3-7fzqz ran out of memory and has been killed.
summary = Containers in pod resultsdb-api-3-7fzqz has been OOMKilled.

The underlying issue might be different because logs show some errors and process restarts.
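One way to check whether the restarts really are OOM kills is to look at the container statuses in the pod status. Below is a minimal sketch, assuming the pod JSON was obtained with something like `oc get pod <name> -o json`. The function name `oom_killed_containers` and the sample data are hypothetical; the field names follow the Kubernetes Pod status schema.

```python
import json


def oom_killed_containers(pod):
    """Return (container name, restart count) for every container whose
    current or previous termination reason is OOMKilled."""
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((status["name"], status.get("restartCount", 0)))
                break  # report each container at most once
    return hits


# Hypothetical sample, shaped like the output of `oc get pod ... -o json`.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "api",
        "restartCount": 4,
        "state": {"running": {}},
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
      }
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```

A container that restarts with a different termination reason (for example `Error`) would not be listed, which helps separate real OOM kills from crashes caused by the errors seen in the logs.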


I wanted to update the container images to see if that fixes the problem, but it seems that the ResultsDB backend does not use the upstream repository quay.io/factory2/resultsdb; it uses quay.io/fedora-kube-sig/resultsdb-backend instead.

Would it be possible to use the upstream image repository?

I'm not sure what the differences between the two images are, but one thing that comes to mind is that the upstream image uses an entrypoint script to initialize the virtual environment before running the given command. For example, the command to initialize the database is:

/app/entrypoint.sh resultsdb init_db

It shouldn't be hard to switch the image and create a PR against the ansible repository. I'm not sure whether resultsdb has a staging deployment as well; if it does, we can test the change there first.

Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 years ago

I think @lrossett set it up this way. I am not sure whether there was a reason not to use those other images, or we just didn't know about them. ;)

We started providing the upstream image relatively recently. The image quay.io/fedora-kube-sig/resultsdb-backend:latest-f35 is almost 1 year old and uses resultsdb Python packages that we no longer maintain.

The change is now deployed on staging, so if everything works as it should we can deploy it on production.

A new staging instance is deployed now.

I see some errors when it tries to send a message (when a new result is created):

[WARNING] pika.channel: Received remote Channel.Close (403): "ACCESS_REFUSED - access to topic 'org.fedoraproject.stg.taskotron.result.new' in exchange 'amq.topic' in vhost '/pubsub' refused for user 'resultsdb.stg'" on <Channel number=2 OPEN conn=<pika.adapters.twisted_connection._TwistedConnectionAdapter object at 0x7fc7302e4c10>>
[ERROR] fedora_messaging.twisted.protocol: Message was forbidden by the broker: ACCESS_REFUSED - access to topic 'org.fedoraproject.stg.taskotron.result.new' in exchange 'amq.topic' in vhost '/pubsub' refused for user 'resultsdb.stg'

Strangely, I still see recent messages from resultsdb being sent: https://apps.stg.fedoraproject.org/datagrepper/raw?topic=org.fedoraproject.stg.resultsdb.result.new&delta=1000
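Note that the topic refused by the broker and the topic in the datagrepper query are not the same string. A small sketch that extracts both for comparison is shown here. The helper name `refused_access` is hypothetical, and the regular expression assumes the message format seen in the log lines above.

```python
import re
from urllib.parse import parse_qs, urlparse

# Matches the ACCESS_REFUSED text as it appears in the log lines above.
ACCESS_REFUSED = re.compile(
    r"access to topic '(?P<topic>[^']+)' in exchange '(?P<exchange>[^']+)' "
    r"in vhost '(?P<vhost>[^']+)' refused for user '(?P<user>[^']+)'"
)


def refused_access(log_line):
    """Return topic, exchange, vhost and user from an ACCESS_REFUSED line,
    or None if the line does not match."""
    match = ACCESS_REFUSED.search(log_line)
    return match.groupdict() if match else None


log_line = (
    "[ERROR] fedora_messaging.twisted.protocol: Message was forbidden by the "
    "broker: ACCESS_REFUSED - access to topic "
    "'org.fedoraproject.stg.taskotron.result.new' in exchange 'amq.topic' "
    "in vhost '/pubsub' refused for user 'resultsdb.stg'"
)
datagrepper_url = (
    "https://apps.stg.fedoraproject.org/datagrepper/raw"
    "?topic=org.fedoraproject.stg.resultsdb.result.new&delta=1000"
)

refused = refused_access(log_line)
queried_topic = parse_qs(urlparse(datagrepper_url).query)["topic"][0]

print("refused topic:", refused["topic"])
print("queried topic:", queried_topic)
print("same topic:", refused["topic"] == queried_topic)
```

The comparison only shows that the two topics differ (`taskotron.result.new` versus `resultsdb.result.new`); it does not by itself explain why the broker refuses the first one.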

@abompard Any idea what could be the issue here?

I don't see any more problems with resultsdb service.

I will close this and update the image for production later.

Metadata Update from @lholecek:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

