#10865 mbs frontend down and returning errors
Closed: Upstream 2 years ago by kevin. Opened 2 years ago by kevin.

[07:59:08] <zodbot> PROBLEM - proxy10.fedoraproject.org/http-mbs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string items not found on https://mbs.fedoraproject.org:443/module-build-service/1/component-builds/ - 3891 bytes in 0.055 second response time (noc02)
[07:59:34] <zodbot> PROBLEM - proxy11.fedoraproject.org/http-mbs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string items not found on https://mbs.fedoraproject.org:443/module-build-service/1/component-builds/ - 3886 bytes in 0.155 second response time (noc02)
[07:59:41] <zodbot> PROBLEM - proxy13.fedoraproject.org/http-mbs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string items not found on https://mbs.fedoraproject.org:443/module-build-service/1/component-builds/ - 3886 bytes in 0.047 second response time (noc01)

The error log on mbs-frontend01 has a bunch of debugging, but not sure what it means.

Restarting httpd gets it working for a bit, but then it goes down again. :(

CC: @breilly


httpd on the frontend does not seem to be stuck, but it is processing a very large amount of solvables for the input modules of a perl build:

[Mon Aug 29 15:47:49.904238 2022] [:error] [pid 1194] DEBUG:MBS:Jobs:
[Mon Aug 29 15:47:49.904312 2022] [:error] [pid 1193] DEBUG:MBS:Found: ['perl:5.36:3620220819085903:f36']
[Mon Aug 29 15:47:49.904878 2022] [:error] [pid 1194] DEBUG:MBS:  - favor perl-libwww-perl:6.67:3520220828203120:f35p534-3520220828203120.x86_64
[Mon Aug 29 15:47:49.905575 2022] [:error] [pid 1194] DEBUG:MBS:  - favor platform:f37:1:00000000-1.x86_64
[Mon Aug 29 15:47:49.906190 2022] [:error] [pid 1194] DEBUG:MBS:  - favor perl:5.34:3520220822104650:f35-3520220822104650.x86_64
[Mon Aug 29 15:47:49.906720 2022] [:error] [pid 1193] DEBUG:MBS:Looking for perl:5.36 buildrequiring platform:f35:1:00000000
[Mon Aug 29 15:47:49.906952 2022] [:error] [pid 1194] DEBUG:MBS:  - install perl-XML-Parser:2.46:20220829151528:0-20220829151528.src
[Mon Aug 29 15:47:49.914336 2022] [:error] [pid 1193] DEBUG:MBS:Adding module perl:5.34:3620220822104650:f36 with requires: <Id #-2147483642 (module(platform) with module(platform:f36))>
[Mon Aug 29 15:47:49.920328 2022] [:error] [pid 1194] DEBUG:MBS:Transaction:
[Mon Aug 29 15:47:49.921494 2022] [:error] [pid 1194] DEBUG:MBS:  - perl-XML-Parser:2.46:20220829151528:0-20220829151528.src
[Mon Aug 29 15:47:49.922478 2022] [:error] [pid 1194] DEBUG:MBS:  - platform:f37:1:00000000-1.x86_64
[Mon Aug 29 15:47:49.923397 2022] [:error] [pid 1194] DEBUG:MBS:  - perl:5.34:3720220822104650:f37-3720220822104650.x86_64
[Mon Aug 29 15:47:49.924384 2022] [:error] [pid 1194] DEBUG:MBS:  - perl-libwww-perl:6.67:3720220828203120:f37p534-3720220828203120.x86_64
[Mon Aug 29 15:47:49.924831 2022] [:error] [pid 1193] DEBUG:MBS:Adding module perl:5.34:3520220822104650:f35 with requires: <Id #-2147483640 (module(platform) with module(platform:f35))>
[Mon Aug 29 15:47:49.925303 2022] [:error] [pid 1194] DEBUG:MBS:  - perl-IO-Socket-SSL:2.074:3720220823103554:f37p534-3720220823103554.x86_64
[Mon Aug 29 15:47:49.929329 2022] [:error] [pid 1194] DEBUG:MBS:Adding module perl:5.36:3720220819085903:f37 with requires: <Id #-2147483644 (module(platform) with module(platform:f37))>
[Mon Aug 29 15:47:49.935884 2022] [:error] [pid 1193] DEBUG:MBS:Adding module perl:5.34:3420220503131308:f34 with requires: <Id #-2147483634 (module(platform) with module(platform:f34))>
[Mon Aug 29 15:47:49.940425 2022] [:error] [pid 1194] DEBUG:MBS:  - ^ Not all favored solvables found in the result, skipping.
[Mon Aug 29 15:47:49.940605 2022] [:error] [pid 1193] DEBUG:MBS:Found: ['perl:5.36:3520220819085903:f35']
[Mon Aug 29 15:47:49.941500 2022] [:error] [pid 1193] DEBUG:MBS:Looking for perl:5.36 buildrequiring platform:f34:1:00000000
[Mon Aug 29 15:47:49.942335 2022] [:error] [pid 1194] DEBUG:MBS:Testing perl-XML-Parser:2.46:20220829151528:0-20220829151528.src with combination: (<Solvable #1672 perl-libwww-perl:6.67:3520220828203120:f35p534-3520220828203120.x86_64>, <Solvable #1674 platform:f37:1:00000000-1.x86_64>, <Solvable #1178 perl:5.34:3420220503131308:f34-3420220503131308.x86_64>)

Here's the input module list, which is quite large:

[Mon Aug 29 14:34:08.599766 2022] [:error] [pid 28918] INFO:MBS:Starting resolving with following input modules: ['perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220819085903:f37', 'perl:5.36:3620220819085903:f36', 'perl:5.36:3520220819085903:f35', 'perl:5.36:3820220819085903:f38', 'perl:5.36:3720220

There were two httpd processes processing the build out of a couple dozen - the others were waiting on requests. We killed the two build processes and restarted httpd and things started moving again for wsgi requests.

It's as yet unclear what the root cause for the stall was, nor if resubmitting that build will cause the same outcome. @ppisar please resubmit your build, and we'll see if it reoccurs.

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 years ago

There were a number of build submissions that caused the frontend to spend a very long time resolving dependencies. This is one of the few complex parts that happens in the frontend rather than the backend. We were seeing the frontend take ~15 minutes or longer to complete the request (and generating a ton of log spew in the process).

Not all submissions were triggering this. The problem seems confined to perl-CGI, and even a few of these finished quickly.

It is still unclear how this caused the entire system to be unresponsive. At the time we were looking, there were only two httpd processes working on requests. Other requests were not getting through. The proxy would return a 503 and local curl requests from the machine itself would stall. Yet it appeared that there were many httpd processes waiting to take a request.

a testmodule submission currently goes through with no issue.

@ppisar when you resubmit, please be aware that the command may take 15-20 minutes to return (assuming we see the same dependency slowness).

The 60/120s timeout is hard-coded in fedpkg client. Will MBS continue processing my submission when fedpkg disconnects on TCP level? Otherwise, I will have to locate all relevant timeouts in fedpkg code and that will require many resubmission.

I think it's a timeout in pyrpkg.module_submit_build(). I increased it to 4200 seconds and submitted the build now.

Before submitting, I verified that MBS is responsive with "fedpkg module-overview --unfinished". After submission, the client waited and then, without a timetout reportded:

Could not execute module_build: The build failed with an unexpected error: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>

So we have timeout on the httpd side. I will wait another 20 minutes and then check whether a build task was spawned.

No build task were spawned.

I noticed before this incident that submitting builds takes long. I suspect that MBS uses very inefficient way of resolving modular dependencies. I can try removing some context from the module which we won't be able to send to Bodhi because of a Bodhi bug https://github.com/fedora-infra/bodhi/issues/4198, but in the and I'd like to see this inefficiency resolved in MBS.

This module in question has 16 contexts. It's at 5th level of depth:

platform (4) ← perl (2) ← perl-IO-Socket-SSL (1) ← perl-libwww-perl (2) ← perl-XML-Parser (1)

The numbers of parentheses denotes a number of active streams. I don't think that this dependency graph is so huge.

I also noticed that when I submitted another module (with shallower dependency graph) and the module resulted into incompatible streams (e.g. platforms), the resolution before giving an error took long (e.g. 30 seconds) and spanned my terminal repeating "platform:f35 and platform:f36, " again and again. Maybe that is a bug in the error path when building the error message. Though this should not the case of perl-XML-Parser which should have all dependencies satisfied.

Metadata Update from @mizdebsk:
- Issue tagged with: mbs

2 years ago

So, where are we here? Can we close this? Or do we still need to track down an underlying cause?

We need a debugging and a fix from MBS maintainers. Exploring 16-state world should be negligible for MBS resources. Something is terribly wrong there.

I reduced the contexts to 4 and it still report 504 after 8 and half minutes:

 $ time fedpkg module-build
Submitting the module build...
submitting

submitted

Could not execute module_build: The build failed with an unexpected error: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>


real    8m23,612s
user    0m0,648s
sys     0m0,077s

I will try reduce it further. But this bug seems to depend on something else then the number of contexts.

But it spawned the 4 tasks (15414..15417).

I tried 7 contexts: got a 500 error (internal server error) after 2m46,584s and no tasks were spawned.
Retried 7 contexts: got a 504 error (gateway timeout) after 8m21,936s and no tasks were spawned.
Retried 7 contexts: got a 504 error (gateway timeout) after 8m21,501s and no tasks were spawned.

So, where are we here @breilly @mikem ?

Is this something you are working on upstream (and should be tracked there), or anything we can do here?

I don't believe the frontend was ever intended to handle overly long requests. The real work of building modules is supposed to happen in the backend. I believe the frontend was intended to accept the build parameters, perform basic validation, and pass the rest off to the backend. Unfortunately, dependency resolution happens at submission time and it appears we've found some cases where that resolution takes entirely too long, causing timeouts and jamming up the frontend.

So, I think that gives us two ways to attack this. FIrst, make the resolution faster if we can (there could be a bug or infra situation here slowing things down). Second, shift this now-more-expensive calculation to the backend, where it can be handled asynchronously. The catch is the latter is likely a significant api change.

The first option is likely safest, if it works. We should probably start by replicating the problematic submission in a way that we can trace it and see where the slowdowns are happening.

Shall we keep this ticket open to track this? Or can we track it upstream?

Thanks. Lets track it there and upgrade down here once there's a fix. :)

Metadata Update from @kevin:
- Issue close_status updated to: Upstream
- Issue status updated to: Closed (was: Open)

2 years ago

Log in to comment on this ticket.

Metadata
Boards 1
ops Status: Backlog