Issue #9234: Bodhi gives unusual number of 500 errors today: NoSuchColumnError: "Could not locate column in row for column 'comments.id'" - fedora-infrastructure

fedora-infrastructure

#9234 Bodhi gives unusual number of 500 errors today: NoSuchColumnError: "Could not locate column in row for column 'comments.id'"

Closed: Fixed 2 years ago by kevin. Opened 3 years ago by churchyard.

I've checked https://status.fedoraproject.org/ and there is nothing out of the ordinary for bodhi, so I assume this is not known.

Every single time I load almost any bodhi page I get:

500 Internal Server Error

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/pyramid_mako/__init__.py", line 148, in __call__
    result = template.render_unicode(**system)
  File "/usr/lib/python3.8/site-packages/mako/template.py", line 481, in render_unicode
    return runtime._render(
  File "/usr/lib/python3.8/site-packages/mako/runtime.py", line 878, in _render
    _render_context(
  File "/usr/lib/python3.8/site-packages/mako/runtime.py", line 920, in _render_context
    _exec_template(inherit, lclcontext, args=args, kwargs=kwargs)
  File "/usr/lib/python3.8/site-packages/mako/runtime.py", line 947, in _exec_template
    callable_(context, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/bodhi/server/templates/master.html", line 142, in render_body
    ${self.body()}
  File "/usr/local/lib/python3.8/site-packages/bodhi/server/templates/update.html", line 198, in render_body
    <!--<h4>Comments <span class="badge badge-secondary">${len(update.comments)}</span>
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/orm/attributes.py", line 287, in __get__
    return self.impl.get(instance_state(instance), dict_)
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/orm/attributes.py", line 723, in get
    value = self.callable_(state, passive)
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/orm/strategies.py", line 759, in _load_for_state
    return self._emit_lazyload(
  File "<string>", line 1, in <lambda>

  File "/usr/lib64/python3.8/site-packages/sqlalchemy/orm/strategies.py", line 900, in _emit_lazyload
    q(session)
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/ext/baked.py", line 545, in all
    return list(self)
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/orm/loading.py", line 101, in instances
    cursor.close()
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/orm/loading.py", line 81, in instances
    rows = [proc(row) for row in fetch]
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/orm/loading.py", line 81, in <listcomp>
    rows = [proc(row) for row in fetch]
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/orm/loading.py", line 525, in _instance
    tuple([row[column] for column in pk_cols]),
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/orm/loading.py", line 525, in <listcomp>
    tuple([row[column] for column in pk_cols]),
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/engine/result.py", line 681, in _key_fallback
    util.raise_(
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
NoSuchColumnError: "Could not locate column in row for column 'comments.id'"

A quick refresh makes it go away until I load a new page.

I also got the Could not locate column in row for column 'comments.id' message in the bottom right corner when I was:

creating an update (according to the UI, the update was not successful, but it was in fact created)
edition an update to switch builds (according to the UI, the edit was not successful, and it indeed wasn't, next submit made it successful but the update does not admit the build swap ever happened - https://bodhi.fedoraproject.org/updates/FEDORA-2020-982b2950db (I can only see in me email notification that the build NEVR got changed)

smooge commented 3 years ago

@cverna can you help on this and also let us know who to hand this to in future.

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: bodhi, groomed, medium-gain, medium-trouble, ops

3 years ago

cverna commented 3 years ago

I think I have fixed it for now with https://github.com/fedora-infra/bodhi/commit/8f7d7859330e1b370bbeb02e9c9f6f7a53f94dc3.

I am mostly away from keyboard this weekend but i ll keep monitoring this ticket, and look for a proper fix next week.

cverna commented 3 years ago

Ok happens again :( so going to try reverting back to bodhi 5.4.1

cverna commented 3 years ago

Ok so this is still happening in 5.4.1 so I reverted back to 5.5 we are getting the following traceback

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/pyramid/tweens.py", line 41, in excview_tween
    response = handler(request)
  File "/usr/lib/python3.8/site-packages/pyramid/router.py", line 147, in handle_request
    response = _call_view(
  File "/usr/lib/python3.8/site-packages/pyramid/view.py", line 667, in _call_view
    response = view_callable(context, request)
  File "/usr/lib/python3.8/site-packages/pyramid/config/views.py", line 169, in __call__
    return view(context, request)
  File "/usr/lib/python3.8/site-packages/pyramid/config/views.py", line 188, in attr_view
    return view(context, request)
  File "/usr/lib/python3.8/site-packages/pyramid/config/views.py", line 214, in predicate_wrapper
    return view(context, request)
  File "/usr/lib/python3.8/site-packages/pyramid/viewderivers.py", line 436, in rendered_view
    result = view(context, request)
  File "/usr/lib/python3.8/site-packages/pyramid/viewderivers.py", line 144, in _requestonly_view
    response = view(request)
  File "/usr/lib/python3.8/site-packages/cornice/service.py", line 502, in wrapper
    response = view_(request)
  File "/usr/local/lib/python3.8/site-packages/bodhi/server/services/releases.py", line 327, in query_releases_json
    total = db.execute(count_query).scalar()
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 163, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/orm/session.py", line 1291, in execute
    return self._connection_for_bind(bind, close_with_result=True).execute(
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/engine/base.py", line 1014, in execute
    return meth(self, multiparams, params)
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/engine/base.py", line 1127, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/engine/base.py", line 1317, in _execute_context
    self._handle_dbapi_exception(
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/engine/base.py", line 1511, in _handle_dbapi_exception
    util.raise_(
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
    self.dialect.do_execute(
  File "/usr/lib64/python3.8/site-packages/sqlalchemy/engine/default.py", line 593, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) lost synchronization with server: got message type "e", length 1937009006

[SQL: SELECT count(DISTINCT releases.id) AS count_1 
FROM releases]
(Background on this error at: http://sqlalche.me/e/13/e3q8 )

I have also attached the pod logs

mattia commented 3 years ago

Could it be caused by some dependency change? I see the infra repo was added to the container two days ago, maybe something was upgraded to a version that causes problems.

churchyard commented 3 years ago

Cross reference for the GitHub issue: https://github.com/fedora-infra/bodhi/issues/4105

pingou commented 3 years ago

Every single time I load almost any bodhi page I get

I don't seem to be able to reproduce this, though I don't see a resolution here or in the upstream ticket.

What's the current status? Can someone still reproduce this?

churchyard commented 3 years ago

~30 minutes ago over the API:

$ fedora-update-feedback 
Username: churchyard
FAS Password: 
Authenticating with bodhi ...
Querying dnf for installed packages ...
Querying bodhi for updates ...
 Updates (testing): [ =======================================                                                                                                                                                             ]  20% Error: "Remote bodhi instance returned an error message: BodhiError {\n    errors: [\n        {\n            \"location\": \"body\",\n            \"description\": \"\\\"Could not locate column in row for column \\\'comments.id\\\'\\\"\",\n            \"name\": \"NoSuchColumnError\",\n        },\n    ],\n    status: \"error\",\n}"

The web interface does not give the error that often (possibly not at all) now.

churchyard commented 3 years ago

Actually, just now:

kevin commented 3 years ago

I think perhaps @cverna fixed this? Is anyone still seeing it?

Issue status updated to: Closed (was: Open)
Issue close_status updated to: Fixed

3 years ago

Issue status updated to: Open (was: Closed)

3 years ago

otaylor commented 3 years ago

Still seeing this :-( - a couple of minutes ago [Aug 25 around 15:50 UTC]

curl 'https://bodhi.fedoraproject.org/updates/?rows_per_page=10&releases=F32F'

Gave me a response:

{"status": "error", "errors": [{"location": "body", "name": "NoSuchColumnError", "description": "\"Could not locate column in row for column 'testcases.id'\""}]}

churchyard commented 3 years ago

Might be a different problem, previously I always got comments.id.

otaylor commented 3 years ago

The underlying traceback that @cverna pasted: https://pagure.io/fedora-infrastructure/issue/9234#comment-671224 seems to indicate a problem with the connection from postgresql getting in a bad state - so I doubt the exact reported error matters.

From some quick google searching, it seems likely that the problem is either:
a) a concurrency problem - perhaps the two concurrent threads in the WSGI daemon process are sharing resources but not locking them [or same with Bodhi's composer threads]
b) the postgresql server unexpectedly closing the connections to Bodhi - because of query timeouts or something

Temporarily changing the WSGI configuration to a single thread per process might one place to start.

kevin commented 3 years ago

Submitted https://pagure.io/fedora-infra/ansible/pull-request/227 as a freeze break request.

kevin commented 3 years ago

However, before that I scaled the web pod to 1 instance instead of 2 and that seems to be helping?

kevin commented 3 years ago

So, yeah, when there's just 1 web pod it seems to be happy?

So what should we do here? Does that help us isolate what the problem might be?

praiskup commented 3 years ago

The error I got now:

500 Internal Server Error

"Could not locate column in row for column 'builds.type'"

kevin commented 3 years ago

So, this has not happened in a long while. I am going to close it. Please let us know if anyone sees it again.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

adamwill commented 2 years ago

This is back with 5.7.1.

Theory: this shows up for a while whenever we update production Bodhi, then gradually eases off? Some kind of caching effect?

https://openqa.fedoraproject.org/tests/1059650#step/_advisory_update/27 - openQA test attempt to download with Bodhi failing on "Could not locate column in row" three times in a row.

Metadata Update from @kevin:
- Issue status updated to: Open (was: Closed)

2 years ago

kevin commented 2 years ago

Ping @asaleh this seems to have started after your bodhi upgrade. ;)

asaleh commented 2 years ago

@adamwill I looked into it, but once again, can't reproduce, i.e. when trying to download the same update you mention in the advisory update 27, it just downloads.

adamwill commented 2 years ago

It's intermittent, and not related to any particular update. It just happens on some attempts. To reproduce it you have to just run bodhi updates download --updateid (some update) a bunch of times and it'll happen on some of them. At least, two days ago it was. As I said above it may be one of those effects where a cache or something similar is being filled after the version upgrade / restart, and once that's done, the bug doesn't happen any more...

Edited 2 years ago by adamwill

kevin commented 2 years ago

I think this is fixed now?

adamwill commented 2 years ago

well, "fixed", or "gone away again until the next time we update bodhi"...

mattia commented 2 years ago

well, "fixed", or "gone away again until the next time we update bodhi"...

So far, so good. After 12 hours Bodhi logs don't show any sign of 500 Errors.
I think we found the root cause, it was a change in joins I made some time ago [1]... @asaleh has hot patched the prod instance, provided everything runs smoothly we will land the patch in Bodhi code permanently, so next updates should not start again going crazy.

[1] https://github.com/fedora-infra/bodhi/issues/4302#issuecomment-982433941

kevin commented 2 years ago

Awesome! Thanks @asaleh and @mattia

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

adamwill commented 2 years ago

Did someone update bodhi again?

kevin commented 2 years ago

Is this still happening? We were doing a mass update/reboot cycle, so it might have done this while things like the database rebooted?

adamwill commented 2 years ago

Yeah...so far it seems to have been a one-off, I guess it was just a coincidence that it was this error again.

mattia commented 2 years ago

I've just restarted Bodhi because since yesterday ansible change was throwing a lot of errors again...

Shall we look at switching bodhi workers to async workers as proposed in https://pagure.io/fork/mattia/fedora-infra/ansible/c/2a918955d91d982b466ae9c153668c52d43c98c1?branch=bodhi_workers ? Last time it was said that my proposed change was to be applied during the last Bodhi upgrade, but apparently didn't.

kevin commented 2 years ago

We could sure. We are however in freeze... is this worth making a freeze break out of?

If so, can you make a PR and propose it?

mattia commented 2 years ago

We could sure. We are however in freeze... is this worth making a freeze break out of?

If so, can you make a PR and propose it?

I'm going to propose a PR to start changing staging (I assume staging is not in freeze?) and wait the freeze end for prod.

kevin commented 2 years ago

Yeah, staging is never frozen. :)

I merged your pr and pushed it out.

Metadata

Assignee

None

Tags

Blocking

None

Depending on

None

Priority

Waiting on Assignee

Boards 1

ops Status: Done

Attachments 2

bodhi-web-58.log

Attached 3 years ago View Comment

png

Attached 3 years ago View Comment

fedora-infrastructure

Source Code

#9234 Bodhi gives unusual number of 500 errors today: NoSuchColumnError: "Could not locate column in row for column 'comments.id'" Closed: Fixed 2 years ago by kevin. Opened 3 years ago by churchyard.

Metadata

medium-gain medium-trouble bodhi ops

Boards 1

Attachments 2

#9234 Bodhi gives unusual number of 500 errors today: NoSuchColumnError: "Could not locate column in row for column 'comments.id'"

Closed: Fixed 2 years ago by kevin. Opened 3 years ago by churchyard.