#11439 rpm-ostree installer image builds sometimes fail with 502 errors since Fedora-Rawhide-20230524.n.0
Opened 2 years ago by adamwill. Modified 3 months ago

  • Describe the issue

Since Fedora-Rawhide-20230524.n.0, all rpm-ostree-based installer image builds are failing in Rawhide composes. When we reach the lorax-embed-repo.tmpl stage, an ostree command fails with a 502 error:

2023-05-24 06:43:09,577: running /mnt/koji/compose/rawhide/Fedora-Rawhide-20230524.n.0/work/x86_64/Silverblue/lorax_templates/ostree-based-installer/lorax-embed-repo.tmpl
2023-05-24 06:43:45,597: command output:
error: Server returned HTTP 502
2023-05-24 06:43:45,597: command returned failure (1)
2023-05-24 06:43:45,597: template command error in /mnt/koji/compose/rawhide/Fedora-Rawhide-20230524.n.0/work/x86_64/Silverblue/lorax_templates/ostree-based-installer/lorax-embed-repo.tmpl:
2023-05-24 06:43:45,597:   runcmd ostree --repo=/var/tmp/lorax/lorax.xsb2zkxh/installtree/ostree/repo pull --mirror fedora fedora/rawhide/x86_64/silverblue
2023-05-24 06:43:45,601:   subprocess.CalledProcessError: Command '['ostree', '--repo=/var/tmp/lorax/lorax.xsb2zkxh/installtree/ostree/repo', 'pull', '--mirror', 'fedora', 'fedora/rawhide/x86_64/silverblue']' returned non-zero exit status 1.

I can't immediately see what might be causing this. I don't think it's RPM 4.19 - that didn't land in the 0524.n.0 compose.

@dustymabe @walters @jlebon @kevin

  • When do you need this? (YYYY/MM/DD)
    ASAP

  • When is this no longer needed or useful? (YYYY/MM/DD)
    When we stop making Fedora, or ostree installers

  • If we cannot complete your request, what is the impact?
    We won't ever get new ostree installer images again


Metadata Update from @phsmoura:
- Issue tagged with: high-gain, high-trouble, ops

2 years ago

Yeah I really don't know what's going on. It failed again today.

502 is "Bad gateway" - I think there's some sort of proxy set up here that's failing.

It does seem like the failures vary somewhat from day to day. Yesterday only a couple failed; today a lot failed again...

Yeah, can we do anything to find out the exact URL that's giving that 502?

IIUC it's trying to hit https://kojipkgs.fedoraproject.org/compose/ostree/repo/ (see source). If we could cross-check logs from the server hosting that URL against the timestamp of the failure, we might be able to see what server-side error is happening?
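
To make that cross-check concrete, here is a rough sketch (not from the thread; the log path, timestamp window, and log format are assumptions) of pulling matching 502 lines out of an Apache access log around the failure time:

# Hypothetical helper for the cross-check suggested above; the log path and
# timestamp are assumptions, not Fedora infra specifics.
import re

LOG = "/var/log/httpd/kojipkgs.fedoraproject.org-access.log"  # assumed path
WINDOW = "24/May/2023:06:43"  # minute of the failure, assumed format
PATTERN = re.compile(r'"GET (/compose/ostree/repo/\S+) HTTP/[\d.]+" 502 ')

with open(LOG, errors="replace") as fh:
    for line in fh:
        if WINDOW in line and (match := PATTERN.search(line)):
            # print the client IP and the object that got the 502
            print(line.split()[0], match.group(1))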

Sadly, 'the server' here is 4 servers. ;)

There are two internal proxies, which proxy to 2 varnish cache servers.

And there are... a lot of requests without knowing the full URL. ;(

That said, I looked at the last Rawhide Kinoite failure, and during that second:

[Tue Jun 06 07:32:51.033654 2023] [proxy_http:error] [pid 1658589:tid 1661428] (20014)Internal error (specific information not available): [client 10.3.170.93:38168] AH01102: error reading status line from remote server localhost:10062
[Tue Jun 06 07:32:51.092135 2023] [proxy_http:error] [pid 1658591:tid 1661494] (20014)Internal error (specific information not available): [client 10.3.170.93:38146] AH01102: error reading status line from remote server localhost:10062
[Tue Jun 06 07:32:51.226408 2023] [proxy_http:error] [pid 1658593:tid 1661473] (20014)Internal error (specific information not available): [client 10.3.170.93:38220] AH01102: error reading status line from remote server localhost:10062
[Tue Jun 06 07:32:51.226449 2023] [proxy:error] [pid 1658593:tid 1661473] [client 10.3.170.93:38220] AH00898: Error reading from remote server returned by /compose/ostree/repo/objects/d5/d9a47372c5167e241bde5b201906fa421fbc3630b8ff5499c0a20fe8af6a54.filez
[Tue Jun 06 07:32:51.306888 2023] [proxy_http:error] [pid 1658591:tid 1661549] (20014)Internal error (specific information not available): [client 10.3.170.91:46796] AH01102: error reading status line from remote server localhost:10062
[Tue Jun 06 07:32:51.475373 2023] [proxy_http:error] [pid 1658591:tid 1661550] (20014)Internal error (specific information not available): [client 10.3.170.91:46764] AH01102: error reading status line from remote server localhost:10062
[Tue Jun 06 07:32:51.512124 2023] [proxy_http:error] [pid 1658591:tid 1661516] (20014)Internal error (specific information not available): [client 10.3.170.91:46802] AH01102: error reading status line from remote server localhost:10062
[Tue Jun 06 07:32:51.757229 2023] [proxy_http:error] [pid 1658593:tid 1661463] (20014)Internal error (specific information not available): [client 10.3.170.91:46728] AH01102: error reading status line from remote server localhost:10062
[Tue Jun 06 07:32:51.899661 2023] [proxy_http:error] [pid 1658583:tid 1661247] (20014)Internal error (specific information not available): [client 10.3.170.91:46714] AH01102: error reading status line from remote server localhost:10062

10062 is a haproxy proxy to the 2 kojipkgs servers.

The kojipkgs servers have tons of requests for that second, but none of them look like errors. I don't see any haproxy errors at that time.
I can keep looking

I went looking today and found something similar, with a matching 502 line in the access logs:

kojipkgs.fedoraproject.org-error.log:[Wed Nov 01 14:49:09.574630 2023] [proxy_http:error] [pid 342992:tid 345842] (20014)Internal error (specific information not available): [client 10.3.163.123:45150] AH01102: error reading status line from remote server localhost:10062

kojipkgs.fedoraproject.org-access.log:10.3.163.123 - - [01/Nov/2023:14:49:09 +0000] "GET /compose/ostree/repo/objects/66/faa0fd385133310e825e68e463a3027755272b3e956cffbaf68463dd47d698.filez HTTP/1.1" 502 - "-" "libostree/2023.6"

Not that it really helps much. Maybe we can bump the haproxy log level? Currently we're only logging warning or above; maybe whatever's going on here is logged as info for some reason?

That's from something on the OpenShift cluster... not this compose issue, but it could of course be the same thing or related.

We could bump the log level... likely it will be a lot, but perhaps it will help.

Of course proxies are frozen...

Since https://github.com/ostreedev/ostree/releases/tag/v2022.7 we retry on all errors...is this still happening with that ostree version?

Looks like we need to use --network-retries=x instead, as we're not passing it right now.
Edit: this already defaults to 5 retries (https://github.com/ostreedev/ostree/pull/2759).

note that we are seeing 502 errors in our extended upgrade tests in Fedora CoreOS. We can trigger the tests anytime so maybe that would be an easier way to test config changes?

So, one thing I notice poking through the ostree code a bit - is there any backoff logic on the retries? I don't think there is. Would it potentially help if we gave ostree some backoff logic, like curl itself has - initial backoff of one second, then double on each retry?
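
As an illustration of the kind of backoff being proposed here (a minimal sketch, not ostree's actual code, which is C; the names and numbers are assumptions), the curl-style pattern looks like:

import random
import time

def fetch_with_backoff(fetch, max_retries=5, initial_delay=1.0):
    # Retry `fetch` on a transient failure, sleeping 1s, 2s, 4s, ... between
    # attempts; `fetch` is any callable that raises on error.
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except OSError:  # stand-in for "transient network error"
            if attempt == max_retries:
                raise
            # a little jitter so many clients don't retry in lockstep
            time.sleep(delay + random.uniform(0, 0.5))
            delay *= 2

Without any delay, five retries against a nearby proxy are burned through in well under a second, so a brief backend hiccup still fails the whole pull.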

That's a good observation, and probably yes; moving to https://github.com/ostreedev/ostree/issues/3207

ok, I was finally annoyed by this enough to dig into it more.

I ran it until I could duplicate it on a builder with verbose:

OT: _ostree_fetcher_should_retry_request: error: 121:0 Server returned HTTP 502, n_retries_remaining: 5
OT: Request caught error: Server returned HTTP 502

Which makes me think it's not retrying on server errors?

Doing OSTREE_DEBUG_HTTP=1 gives:

* Connection #1115 to host kojipkgs.fedoraproject.org left intact
* processing: https://kojipkgs.fedoraproject.org/compose/ostree/repo/objects/d3/1f64c2505d6c0376c0e6b48bc780fd7e731d13729a4c11051e28c00d8abc7e.filez
* Found bundle for host: 0x55f7e8ff55f0 [serially]
* Can not multiplex, even if we wanted to
* Re-using existing connection with host kojipkgs.fedoraproject.org
> GET /compose/ostree/repo/objects/d3/1f64c2505d6c0376c0e6b48bc780fd7e731d13729a4c11051e28c00d8abc7e.filez HTTP/1.1
Host: kojipkgs.fedoraproject.org
User-Agent: libostree/2024.1
Accept: */*

* old SSL session ID is stale, removing
< HTTP/1.1 502 Proxy Error
< Date: Thu, 14 Mar 2024 21:55:02 GMT
< Server: Apache
< Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
< X-Frame-Options: SAMEORIGIN
< X-Xss-Protection: 1; mode=block
< X-Content-Type-Options: nosniff
< Referrer-Policy: same-origin
< Content-Length: 418
< Content-Type: text/html; charset=iso-8859-1

What this looks like is it's hitting a proxy (apache) and that proxy gives:

[Thu Mar 14 21:55:00.597074 2024] [proxy_http:error] [pid 2810208:tid 2813178] (20014)Internal error (specific information not available): [client 10.3.169.53:45680] AH01102: error reading status line from remote server localhost:10062
[Thu Mar 14 21:55:01.153465 2024] [proxy_http:error] [pid 2810208:tid 2813186] (20014)Internal error (specific information not available): [client 10.3.169.53:45818] AH01102: error reading status line from remote server localhost:10062
[Thu Mar 14 21:55:01.355611 2024] [proxy_http:error] [pid 2810208:tid 2813185] (20014)Internal error (specific information not available): [client 10.3.169.53:45784] AH01102: error reading status line from remote server localhost:10062
[Thu Mar 14 21:55:01.643045 2024] [proxy_http:error] [pid 2810194:tid 2813074] (20014)Internal error (specific information not available): [client 10.3.169.53:45868] AH01102: error reading status line from remote server localhost:10062
[Thu Mar 14 21:55:02.539307 2024] [proxy_http:error] [pid 2810214:tid 2813085] (20014)Internal error (specific information not available): [client 10.3.169.53:46098] AH01102: error reading status line from remote server localhost:10062
[Thu Mar 14 21:55:02.810906 2024] [proxy_http:error] [pid 2810194:tid 2813081] (20014)Internal error (specific information not available): [client 10.3.169.53:46230] AH01102: error reading status line from remote server localhost:10062

localhost:10062 is haproxy. It has 2 varnish servers it proxies to. Behind those is an apache again, which varnish just gets the files from locally. I never see this request there.
So, the problem is on the haproxy -> varnish boundary, but with the logs available I am not sure which side it is.
varnish currently logs almost nothing. :(

There's also:
proxy-initial-not-pooled
If this variable is set, no pooled connection will be reused if the client request is the initial request on the frontend connection. This avoids the "proxy: error reading status line from remote server" error message caused by the race condition that the backend server closed the pooled connection after the connection check by the proxy and before data sent by the proxy reached the backend. It has to be kept in mind that setting this variable downgrades performance, especially with HTTP/1.0 clients.

Which we could try setting?
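
For what it's worth, that variable is set per request in the Apache config; a minimal sketch (the Location block is an assumption about where the kojipkgs proxy config lives):

# Assumed location for the /compose/ proxy config; per the mod_proxy_http
# docs quoted above, this env variable disables reuse of pooled backend
# connections for the first request on a client connection.
<Location "/compose/">
    SetEnv proxy-initial-not-pooled 1
</Location>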

So, I will work on trying to add some logging to see which place is at fault.
Perhaps we could see if it's not retrying on network errors and get it to do so?

I will try and dig back into the ostree retry code a bit and see what might be going on there. Based on my earlier look at it I was pretty sure it would be doing 5 retries here (but burning through them almost instantly because there is no backoff code and this is a very close network connection), but that output kinda suggests it isn't, so I'll look again...

I think perhaps setting keep-alive on haproxy might help here...

https://pagure.io/fedora-infra/ansible/pull-request/1867

Or at least worth trying. What might be happening is that haproxy is closing connections to the backend as soon as they are done, but that's causing a lot of connections to the varnish side and it's not processing them all in time. Keeping the backend connections alive should result in less churn.
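
As a generic illustration only (placeholder names; not necessarily what that pull request changes), backend keep-alive in haproxy looks roughly like:

# Placeholder backend stanza: keep connections to the varnish servers open
# and reuse them, instead of opening and closing one per request.
backend kojipkgs
    mode http
    option http-keep-alive
    http-reuse safe
    server varnish01 varnish01.example:6081 check
    server varnish02 varnish02.example:6081 check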

According to the docs that option is on by default, right?

So on the retry topic - the logs @kevin is seeing are from ostree's internal "should we retry this request?" logic, _ostree_fetcher_should_retry_request. They do indeed indicate that that function decides not to retry because 502 isn't considered 'transient' (and we could, theoretically, patch that). But...the commit Colin referred to as retrying 'all' errors is, I think, this one. IIUC, that is setting an option that should make libcurl retry the request.

I'm not totally sure about that, though. If I'm wrong, it might be worth trying patching that ostree retry logic to retry on 502.
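
To make that concrete, a simplified illustration (Python, not ostree's actual C logic; the set of "transient" status codes is an assumption) of a should-retry check that gives up on a 502:

# Illustrative only: mirrors the behaviour described above, where 502 is not
# treated as a transient error and therefore never retried.
TRANSIENT_HTTP_CODES = {408, 429, 500, 503, 504}  # assumed set; note 502 is absent

def should_retry(http_status: int, retries_remaining: int) -> bool:
    return retries_remaining > 0 and http_status in TRANSIENT_HTTP_CODES

print(should_retry(502, 5))  # False: the fetcher gives up immediately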

The thing that's confusing me right now is that the error message here should include the text
"While fetching <url>".

Wait...ohhhh, I see it now. We missed a code path here in retries when processing mirrorlists. That less-than-helpful error message is from here:
https://github.com/ostreedev/ostree/blob/7fdc792f856c835560a89b829bbe186765a1e6d2/src/libostree/ostree-fetcher-curl.c#L370

And that would explain why it's biting in Fedora infra where we use mirrorlists, but other ostree users (who usually don't) aren't impacted.

ohhh, right, and now I look at that I remember how the retry_all thing works - by changing what error is returned depending on whether it's set...

Nevermind, indeed the logic here was not doing what it said on the tin. https://github.com/ostreedev/ostree/pull/3216

thanks! I just asked what you thought, but doing it is good for bonus points :)

simultaneously @kevin trialled a server end fix with apparently good results:

(nirik) well, could be coincidence, but so far 5 passes with 0 errors.

so, we have gone from not enough fixes to too many fixes :D I guess we could do something like turning off the server-side fix and doing a rawhide compose to see if the ostree-side fix helps, then turn on the server-side fix and do an f40 compose (i.e. no updated ostree) to confirm that the server-side fix helps?

Well, I think we want to do both of these things anyhow right?

So, how about we see what tomorrow brings. If it's still not fixed we need to dig more?

Does ostree print anything when retrying? can we see if it is in logs? Seems like it does? So we can tell if it's still hitting retries, right?

Well, long term we want both the belt and the britches sure, but short term, we want to know if the britches work, right? And if the belt is too good, we can't really tell about the britches. If your server side fix fixes it so hard that ostree never needs to retry, we won't know if the retry code is fixed now, and if so, whether it helps.

And yeah, it does log, at least at debug level. Before, you were seeing this message, indicating we were in the "should I retry?" logic:

OT: _ostree_fetcher_should_retry_request: error: 121:0 Server returned HTTP 502, n_retries_remaining: 5

but you did not see any further messages, indicating it decided not to retry. If the function decides to retry, you should also see:

Should retry request (remaining: <NUM> retries), due to transient error: <message about a timeout, probably>

Woah, thanks for the debugging and help here!

Well, right now we have the ostree with the retry stuff in rawhide, so it should be used in the next rawhide compose, but we don't in f40...

So if rawhide fails, neither fix fixes it. If it works, but we see retries in the logs, we know the ostree retry worked but the server fix didn't. If it works and there are no retries in the logs, we know the server fix worked.

f40 doesn't have the ostree fix, so if it works there we know the server fix worked, if it doesn't then we know it didn't.

Did I miss any cases?

Yes. There is no case where we find out whether the ostree fix works if the server fix works. We will only know whether the ostree fix works if the server fix doesn't work.

Ah, true. OK. I can revert the server-side fix for now if you want... Or we could just wait until after we get a beta and revert it after that?

Yeah, I agree we might want to hold off on the fiddling till we're done with beta composes.

So, circling back... shall I drop the change I made to haproxy and see if the rpm-ostree retry is working?

ah, yeah, sounds like a plan. of course, we should also remember to put it back afterwards, as it does feel like the 'real' fix here.

ok. I reverted that haproxy change yesterday and everything seems to have worked in the last rawhide.

I don't see any retry messages, but perhaps I didn't dig enough?

As far as I know, we haven't had this issue in a while so it should be safe to move on and only use the "correct" config.

Folks, should we close this ticket now, or wait and test more to confirm the fixes?

