#11926 mildly urgent: proxy/cache Discussion URLs hit by start.fedoraproject.org
Closed: Fixed 9 months ago by kevin. Opened 10 months ago by mattdm.

Describe what you would like us to do:


There are a number of URLs, including

which get hit a lot by the default Fedora web browser start page (see https://gitlab.com/fedora/websites-apps/fedora-websites/fedora-websites-3.0/-/blob/develop/pages/start.vue?ref_type=heads#L97)

There are also follow-up requests to topics that are returned in the json from these index queries.

This is causing a lot of extra load on Fedora Discussion (particularly the search query), but it might actually be that the others are bad too, just obscured in the reports I get.

Could we put something in the middle here that caches these responses? I don't think that Discourse itself should be hit more than, say, every fifteen minutes? (Even limiting to one per minute would be an order-of-magnitude improvement. But there's no real reason for it to be that up-to-date.)
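For illustration, here is a minimal sketch of the kind of time-limited cache being asked for, assuming a 15-minute TTL. The names `createTtlCache` and `load` are hypothetical, and the clock is injectable only so the behaviour can be shown without waiting. It is not what was eventually deployed; it only shows why a TTL caps the request rate that reaches Discourse.

```javascript
// Minimal TTL cache sketch. All names here are hypothetical.
function createTtlCache(ttlMs, now = Date.now) {
  const entries = new Map(); // key -> { value, expiresAt }
  return {
    // Return the cached value for `key`, or call `load()` and cache its result.
    get(key, load) {
      const hit = entries.get(key);
      if (hit && hit.expiresAt > now()) {
        return hit.value;
      }
      const value = load();
      entries.set(key, { value, expiresAt: now() + ttlMs });
      return value;
    },
  };
}

// Usage with a fake clock and a counter standing in for the upstream request.
let clock = 0;
let upstreamHits = 0;
const cache = createTtlCache(15 * 60 * 1000, () => clock);
const load = () => {
  upstreamHits += 1;
  return { fetchedAt: clock };
};

cache.get("/c/news/announce-list/76.json", load); // miss: goes upstream
clock += 60 * 1000;
cache.get("/c/news/announce-list/76.json", load); // still fresh: served from cache
clock += 15 * 60 * 1000;
cache.get("/c/news/announce-list/76.json", load); // expired: goes upstream again
console.log(upstreamHits);
```

With this shape, the upstream is contacted at most once per TTL window per URL, no matter how many browsers ask.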

When do you need this to be done by? (YYYY/MM/DD)


We're not, presently, in trouble, but we're getting close to thresholds where we would need to pay more. This seems like a bad reason to pay thousands of dollars more for hosting. So.... by the end of June, maybe?


Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

10 months ago

I was thinking about a solution for this, but couldn't find any that is easy to implement.

  • We could implement a proxy/cache system (e.g. squid) to cache those requests.
    This only moves the load from discussion.fp-o to our own infra. And I don't think we have any squid running so that requires setting one up.
    Since the requests come from the browser side, it must be open to the whole world.
    That is my least favorite option.
  • We drop client-side requests on start.fp-o and go full static.
    Discussion.fp-o is then only queried at build time and we build the site once a day, as for the previous version. We could rebuild more often but start.fp-o is currently tied to the fedoraproject.org website, which takes quite long to build. And even that generates a lot of unnecessary internal traffic as we need to upload that build to all fedora proxies, and it's not a small one.
    It would be much better if start.fp-o was separated from fedoraproject.org (it's planned, but just not yet).
  • We go full dynamic and build a server-side app for start.fp-o.
    That server-side part handles any fetching/caching of queries to third-party applications such as Discussion.
    This is my favorite solution but probably the most difficult to implement as we need to rework the website and the deployment (to OpenShift).
  • Or we drop all Discussion queries from start.fp-o.

cc @glb

The third option does sound best to me as well for start.fp.o. However, it would be a lot of work.

The first option seems OK to me as a quick fix. Could the server be configured to only proxy the specific json queries that we need for start.fp.o? Also, I don't know the details of what you are running server-side, but if you are already running Apache httpd, you might be able to configure the caching directly in httpd instead of installing Squid. I haven't used it personally, but here is a link to the documentation for httpd's mod_cache:

https://httpd.apache.org/docs/2.4/mod/mod_cache.html#cacheenable

Edit: FWIW, this example might be similar to what we would want?:

https://taylor.callsen.me/creating-a-caching-proxy-server-with-apache/#:~:text=4.-,Bringing%20it%20all%20together,-The%20complete%20Apache

I'm not sure I like the idea of another app for this. ;)

Crazy idea: How about we set up a cloudfront distribution for these and start hits cloudfront which caches them?
That unfortunately puts cloudfront in our critical path, but it would be super easy to do.

Otherwise IMHO, just doing static and once a day builds should be ok. The only thing that likely changes much is the 'recently solved' and perhaps we could switch that just to a link instead of pulling from it?

I think a static build of the whole site is a bit intensive. Is there a way to only rebuild the start page automatically? If so, I think that would be OK, but I'd like to do it at least four times per day so things like announcements from discussion.fp.o (or things like release announcements or CVE announcements from Fedora Magazine) wouldn't be potentially delayed for as long as 24 hours.

I like the idea of splitting off the start page build from the rest. Then we could just build every hour or something and it would be super quick...

I'm okay with the spun-off static start page idea, but FWIW Cloudfront was what I had in mind. Isn't this kind of thing what it's for? :)

I don't know how Cloudfront works, but if it creates caches at specific times, I'd suggest using something like 5 minutes past the hour so that the Fedora Magazine posts that typically run on the hour (08:00 UTC for normal posts, 14:00 UTC for release announcements) will show up quickly. Otherwise, it is all good with me. I don't (fore)see any obvious problems.

We will split the start page and statically build it eventually, but it will take some time to get there.
In the meantime, I think the Cloudfront solution is the fastest to implement.
Once the distribution point is created, it should not take long for us to update the start page to use it.

ok. I can set that up... although I don't like putting cloudfront so directly in our production path. :)

So what exactly do we need on the cloudfront setup? What origin should it use? Just all of discussion.fedoraproject.org? Or more targeted?

Ideally, we would need the origin paths mentioned by mattdm:

https://discussion.fedoraproject.org/c/news/announce-list/76.json
https://discussion.fedoraproject.org/tags/c/ask/common-issues/82/none/f40.json (or /f*.json to catch all versions)
https://discussion.fedoraproject.org/search.json?q=%23ask%20status%3Asolved%20order%3Alatest_topic

or, if that is easier to configure, all of https://discussion.fedoraproject.org.

edit: Discourse disables caching on those urls with the cache-control: no-cache, no-store header, so the Cloudfront distribution needs to set its own cache setting (TTL of 1 or 2 hours should be a good start).

I've created https://d36melcmqgchij.cloudfront.net with the entire site... easier than particular parts.

Let me know if that works or if anything more is needed for now.

So I need to update the start page to query d36melcmqgchij.cloudfront.net instead of discussion.fedoraproject.org right?

For gathering the data, yes. Ideally I think any actual links people would end up clicking on should still go direct to discussion if that's possible?

I've submitted MR !986 to route the start page API queries through this new proxy.

Ideally I think any actual links people would end up clicking on should still go direct to discussion if that's possible?

Yes, I've made sure the links still go to the "real" discourse site. Thanks for pointing out that those should be maintained. 🙂
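A minimal sketch of that split, assuming the two base URLs are held in constants named like the ones in the start page code (`discourse_uri` and `_dc_proxy_uri`). The helper names `dataUrl` and `linkUrl` are hypothetical, and the `/t/<id>` link form is an assumption based on the topic URLs used elsewhere in this ticket.

```javascript
// Base URLs; the constant names mirror the start page code, the helpers are hypothetical.
const discourse_uri = "https://discussion.fedoraproject.org";
const _dc_proxy_uri = "https://d36melcmqgchij.cloudfront.net";

// URL used to fetch JSON data: goes through the caching proxy.
function dataUrl(path) {
  return new URL(path, _dc_proxy_uri).href;
}

// URL rendered as a clickable link: goes to the real Discourse site.
function linkUrl(topicId) {
  return new URL(`/t/${topicId}`, discourse_uri).href;
}

console.log(dataUrl("/c/news/announce-list/76.json"));
console.log(linkUrl(12345)); // 12345 is a made-up topic id
```

Keeping the two helpers separate makes it harder to accidentally send a user to the proxy host, or a data request to the origin.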

While updating the start page code, I noticed that the user avatars (displayed next to the latest solved issues) are being fetched from https://sea1.discourse-cdn.com/fedoraproject. Is that endpoint "metered"? If so, it might be necessary to add that to the proxy as well.

Does this new proxy adjust/remove the same origin headers? If not, this might not work.

Edit: https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-cloudfront-supports-cors-security-custom-http-response-headers/

curl https://d36melcmqgchij.cloudfront.net/c/news/announce-list/76.json -I
HTTP/2 200 
content-type: application/json; charset=utf-8
...
cache-control: no-cache, no-store
access-control-allow-origin: https://fedoraproject.org
x-cache: Miss from cloudfront

The CORS headers are preserved so we are good on that front.
But the cache-control is still not set, and I get a cache miss for every request I make, so I think we need to enforce a cache TTL on cloudfront.
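For anyone repeating this check, here is a small sketch that reads a `curl -I` style header block and reports the three things that matter here: whether CloudFront served from cache, whether the origin forbids caching, and which origin CORS allows. The function names are hypothetical, and the `x-cache` test only treats a value starting with "Hit" as a cache hit, which is an assumption on my part.

```javascript
// Parse "name: value" lines into a Map keyed by lower-cased header name.
function parseHeaderBlock(raw) {
  const headers = new Map();
  for (const line of raw.split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx <= 0) continue; // skips the status line, "..." and blank lines
    headers.set(line.slice(0, idx).trim().toLowerCase(), line.slice(idx + 1).trim());
  }
  return headers;
}

// Summarise the caching-related headers (hypothetical helper).
function describeCaching(raw) {
  const headers = parseHeaderBlock(raw);
  const cacheControl = headers.get("cache-control") || "";
  const xCache = headers.get("x-cache") || "";
  return {
    servedFromCache: /^hit\b/i.test(xCache),
    originForbidsCaching: /\bno-(store|cache)\b/i.test(cacheControl),
    allowOrigin: headers.get("access-control-allow-origin") || null,
  };
}

// The header block reported above.
const sample = [
  "HTTP/2 200",
  "content-type: application/json; charset=utf-8",
  "cache-control: no-cache, no-store",
  "access-control-allow-origin: https://fedoraproject.org",
  "x-cache: Miss from cloudfront",
].join("\n");

console.log(describeCaching(sample));
```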

This cache-control configuration is the root cause of this issue and is set by Discourse. With this current configuration, no one will cache this request (browsers or proxies).

https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/distribution-web-values-specify.html#DownloadDistValuesObjectCaching

If it is difficult, don't worry about it, but if you could add stg.fedoraproject.org, fedora.gitlab.io, and localhost to that access-control-allow-origin list from cloudfront, that should enable some of this functionality to be previewed/tested on the staging sites before the start page goes live on the production site.

just FYI, I read most of these tickets via email and pagure doesn't send any edits via email... :) So, IMHO, it would always be better to just add a new comment over editing.

Anyhow, so I guess this won't work unless we can get cloudfront to ignore the cache control settings on the discussion side?

I am not sure what you mean by access-control-allow-origin list here? Currently it's open to everyone. We could of course lock it down if desired, but I didn't bother until we got it working...

Not sure localhost will work here as the development env uses a specific port. Maybe http://localhost:3000.
But I agree that having any of the 2 other domains would be really helpful as we can only test this in production right now.

I'm seeing the same results that darknao reported in his earlier comment. It looks like both cache-control and access-control-allow-origin are set. The documentation for cloudfront appears to indicate that these headers can be adjusted on the proxy:

https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/modifying-response-headers.html

Not sure localhost will work here as the development env uses a specific port. Maybe http://localhost:3000.

Yes! Thanks for catching that darknao. If possible (and not difficult), I would like http://localhost:3000 added to the access-control-allow-origin list (along with https://fedoraproject.org, https://stg.fedoraproject.org, and https://fedora.gitlab.io).

Anyhow, so I guess this won't work unless we can get cloudfront to ignore the cache control settings on the discussion side?

Yes, in the CloudFront configuration, that should be:
Object Caching: Customize
Minimum TTL: 2h
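To show why a non-zero Minimum TTL gets around the `no-cache, no-store` header, here is a simplified model of how the effective cache lifetime is chosen, based on my reading of the CloudFront expiration docs. The function name and option names are hypothetical, and the model ignores `s-maxage` and `Expires`, so treat it as a sketch rather than a description of CloudFront's exact behaviour.

```javascript
// Simplified model of the TTL an edge cache would apply (all values in seconds).
// Assumption: with minTtl > 0, "no-cache" / "no-store" / "private" from the
// origin result in caching for minTtl; with minTtl === 0 the object is not cached.
function effectiveTtlSeconds(cacheControl, { minTtl, defaultTtl, maxTtl }) {
  const directives = (cacheControl || "")
    .toLowerCase()
    .split(",")
    .map((d) => d.trim())
    .filter(Boolean);

  const clamp = (seconds) => Math.min(Math.max(seconds, minTtl), maxTtl);

  const forbidsCaching = directives.some(
    (d) => d === "no-cache" || d === "no-store" || d === "private"
  );
  if (forbidsCaching) return minTtl;

  const maxAge = directives.find((d) => d.startsWith("max-age="));
  if (maxAge === undefined) return clamp(defaultTtl);

  const seconds = Number(maxAge.slice("max-age=".length));
  return Number.isFinite(seconds) ? clamp(seconds) : clamp(defaultTtl);
}

const twoHours = 2 * 60 * 60;
const policy = { minTtl: twoHours, defaultTtl: 86400, maxTtl: 31536000 };
console.log(effectiveTtlSeconds("no-cache, no-store", policy));
console.log(effectiveTtlSeconds("no-cache, no-store", { ...policy, minTtl: 0 }));
```

In this model the first call yields the two-hour minimum, while the second (Minimum TTL of 0) yields 0, which matches the cache miss on every request seen earlier.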

I am not sure what you mean by access-control-allow-origin list here? Currently it's open to everyone. We could of course lock it down if desired, but I didn't bother until we got it working...

It's the cross-origin access control (CORS) used by browsers. Currently, only pages served from fedoraproject.org can request assets from discourse from within the browser (via javascript, for example).
I think you can override this by following https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/creating-response-headers-policies.html (Origin override checkbox)

The cloudfront console interface is... a bit confusing to me.

But I think I have adjusted things. Can you test and see if I missed anything?

It doesn't look like there is any access-control-allow-origin now. I don't think that will work.

Huh... I have:

Access-Control-Allow-Origin

https://fedoraproject.org
http://localhost:3000
https://stg.fedoraproject.org
https://fedora.gitlab.io

Origin override

I'm just going by what I see in the output from the curl command that darknao provided earlier. Do those values (https://fedoraproject.org https://stg.fedoraproject.org https://fedora.gitlab.io http://localhost:3000) need to be on the same line as the key (Access-Control-Allow-Origin)? The typical format for showing/setting HTTP headers is <key>: <value>, [value, ...], but I have no idea how CloudFront works.

Actually, it looks like Access-Control-Allow-Origin only allows one value, so you might have to set it to * for all our staging sites to work. I don't know if that would be a problem or not?

Excerpted from https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Access-Control-Allow-Origin:

Limiting the possible Access-Control-Allow-Origin values to a set of allowed origins requires code on the server side to check the value of the Origin request header, compare that to a list of allowed origins, and then if the Origin value is in the list, set the Access-Control-Allow-Origin value to the same value as the Origin value.
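A minimal sketch of the server-side check that excerpt describes, assuming the four origins requested in this ticket. The names `ALLOWED_ORIGINS` and `corsHeadersFor` are hypothetical, and this is not how the CloudFront distribution itself is configured; it only illustrates the "echo the Origin if it is on the list" logic.

```javascript
// Origins that may read responses from the browser (taken from this ticket).
const ALLOWED_ORIGINS = new Set([
  "https://fedoraproject.org",
  "https://stg.fedoraproject.org",
  "https://fedora.gitlab.io",
  "http://localhost:3000",
]);

// Return the CORS response headers for a given Origin request header.
// Hypothetical helper: an unknown or missing origin gets no CORS headers.
function corsHeadersFor(requestOrigin) {
  if (!ALLOWED_ORIGINS.has(requestOrigin)) {
    return {};
  }
  return {
    "Access-Control-Allow-Origin": requestOrigin, // exactly one value, echoed back
    Vary: "Origin", // tells caches the response differs per Origin
  };
}

console.log(corsHeadersFor("http://localhost:3000"));
console.log(corsHeadersFor("https://example.org"));
```

The `Vary: Origin` header matters once a cache sits in front: without it, a response cached for one origin could be served to another with the wrong allow-origin value.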

Well, if darknao will approve my MR, we should be able to see if it is working on GitLab pages. 🙂

I just tested this on a local build and I see a problem.

startpage.jpg

The "Common Issues" query is working, but the "Latest Solved Issues" query is not. The difference between them is that the latest solved issues query requires query string parameters. I don't know why the latter would be blocked, but that is the only difference between the queries (unless I have a typo somewhere, but I don't see any).

Can we try it with Access-Control-Allow-Origin set to "All origins"? Are we concerned about this proxy being used to harvest data from our discourse instance?

sure. Set it to allow all...

Ok so the cache setting is correct and I get a cache hit every time. That's good.
The CORS headers are good too, and work for all mentioned domains. Perfect.
One of the URLs uses query parameters, but CloudFront ignores them when caching, so it doesn't return the correct result.

There should be a "Query string forwarding and caching" parameter that you can set to "Forward all, cache based on all" (I assume it's currently set to None)
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/distribution-web-values-specify.html#DownloadDistValuesQueryString
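A rough sketch of why that setting matters, assuming the cache looks up responses by a key derived from the request URL. The function `cacheKey` and its option are hypothetical and much simpler than what CloudFront really does; the point is only that two different searches collapse to one entry when the query string is left out of the key.

```javascript
// Build a cache key from a request URL (hypothetical, simplified).
function cacheKey(requestUrl, { includeQueryString }) {
  const url = new URL(requestUrl);
  return includeQueryString ? url.pathname + url.search : url.pathname;
}

const base = "https://d36melcmqgchij.cloudfront.net/search.json";
const solvedQuery = `${base}?q=%23ask%20status%3Asolved%20order%3Alatest_topic`;
const otherQuery = `${base}?q=something-else`; // made-up second search

// Without the query string, both searches share one key (and one cached response).
console.log(
  cacheKey(solvedQuery, { includeQueryString: false }) ===
    cacheKey(otherQuery, { includeQueryString: false })
);

// With the query string, each search gets its own key.
console.log(
  cacheKey(solvedQuery, { includeQueryString: true }) ===
    cacheKey(otherQuery, { includeQueryString: true })
);
```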

No dice with it set to allow all. Hopefully what darknao suggests will work instead.

also note that the previous cache setting (if you already changed it) was working just fine

ok, set the query string thing.

I currently have Access-Control-Allow-Origin set to all... should I switch it back to that list of domains?

ok, set the query string thing.

Hmm, it still doesn't seem to be working. 😕

I currently have Access-Control-Allow-Origin set to all... should I switch it back to that list of domains?

It doesn't matter to me. I guess the explicit list is, in theory, a little more secure.

explicit list is best I think, it's working in both cases.

For the query string, it's working now. So I think everything is good and we can push the updated start.fp-o to production.

@glb the "latest solved issues" is not working on your side because the function tries to query each topic's details on discussion.fp-o (instead of the cloudfront proxy) and fails due to the CORS headers set there.
Should be fine on production (or you could use the cloudfront URL for that too, since it forwards all requests in the end)

I'm sure I have the solved query pointing at the cloudfront proxy:

[/home/glb/Repositories/fedora-websites-3.0]$ git diff HEAD~1
...

 const fp_solved_issues = async () => {
-  let dcdata = await $fetch(`${discourse_uri}/search.json?${solved_query}`);
+  let dcdata = await $fetch(`${_dc_proxy_uri}/search.json?${solved_query}`);
   let solved = dcdata.posts;

   let i = 0;
[/home/glb/Repositories/fedora-websites-3.0]$

I'm running/testing the route-startpage-queries-through-proxy branch locally with npm run dev.

I'm talking about the lines just below:

    let topic;
    if (solved[i].topic_id) {
      topic = await $fetch(`${discourse_uri}/t/${solved[i].topic_id}.json`);
    }

Yeah, I get it now. 🤦

I guess I should change that to go through the proxy.

That's the only request left that doesn't use the proxy. Now that we have it, I'd say let's use it for all requests :) (and this one also sets the no-cache, no-store header, so even if it's not the most queried URL, it will still be beneficial to have a cached version).

Side effect to this: start.fp-o is now loading super fast :D

I think I've made the needed update to my MR on GitLab. Let me know if I've missed anything.

BTW: There is still the avatars query. Do you have any ideas about how that should be handled?

avatars are cached (thanks to the cache-control header properly set this time) so I don't think there is a need to cache them on that proxy.

ok. I am going to put back the Access-Control-Allow-Origin list... if everything looks ok after that, is there anything left to do here?

I don't think so. If it still works on https://fedora.gitlab.io/websites-apps/fedora-websites/fedora-websites-3.0/start then we are (probably) OK. 🙂

Everything looks good on staging so I'm pushing it on prod right now.
I think we can close this ticket. I would be interested in the stats on Discourse following this change just to see how much of an improvement it makes.

Ditto. If we are well under whatever the threshold is where the cost would increase, we might consider reducing the cache time so that the solved issues will update more frequently (just to keep the start page a little more interesting/active).

Thanks! Agree it would be good to look at this after a bit and see what effect it's having...

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

9 months ago
