#2280 Yet another AWS hitcounter follow-up
Merged 2 years ago by praiskup. Opened 2 years ago by frostyx.
copr/ frostyx/copr hitcounter-pt-6  into  main

@@ -5,6 +5,7 @@ 

  import os

  import re

  from datetime import datetime

+ from requests.utils import unquote

  from copr_common.request import SafeRequest

  from copr_backend.helpers import BackendConfigReader

  
@@ -118,6 +119,21 @@ 

                        url, bot.group(1))

              continue

  

+         # Convert encoded characters from their %40 values back to @.

+         url = unquote(url)

+ 

+         # I don't know how or why, but occasionally there is a URL that is

+         # encoded twice (%2540oamg -> %40oamg -> @oamg), and yet its status

+         # code is 200. AFAIK these appear only for EPEL-7 chroots and their

+         # User-Agent is something like urlgrabber/3.10%20yum/3.4.3.

+         # I wasn't able to reproduce such accesses, and we decided to not count

+         # them.

+         if url != unquote(url):

+             log.warning("Skipping: %s (double encoded URL, user-agent: '%s', "

+                         "status: %s)", access["cs-uri-stem"],

+                         access["cs(User-Agent)"], access["sc-status"])

+             continue

+ 

          # We don't want to count every accessed URL, only those pointing to

          # RPM files and repo file

          key_strings = url_to_key_strings(url)
@@ -125,6 +141,12 @@ 

              log.debug("Skipping: %s", url)

              continue

  

+         if any(x for x in key_strings

+                if x.startswith("chroot_rpms_dl_stat|")

+                and x.endswith("|srpm-builds")):

+             log.debug("Skipping %s (SRPM build)", url)

+             continue

+ 

          log.debug("Processing: %s", url)

  

          # When counting RPM access, we want to iterate both project hits and

Fixing the issues that we discovered in PR#2274
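
The SRPM-build check from the last hunk can be tried in isolation. The following is only a sketch: `is_srpm_build` is a hypothetical helper name, and the sample key strings are made up for illustration. The only parts taken from the diff are the `chroot_rpms_dl_stat|` prefix and the `|srpm-builds` suffix.

```python
def is_srpm_build(key_strings):
    """Return True if any key string is a chroot stat for srpm-builds."""
    return any(
        key.startswith("chroot_rpms_dl_stat|") and key.endswith("|srpm-builds")
        for key in key_strings
    )


# Hypothetical key strings; the real ones come from url_to_key_strings().
srpm_keys = ["chroot_rpms_dl_stat|someowner|someproject|srpm-builds"]
rpm_keys = ["chroot_rpms_dl_stat|someowner|someproject|fedora-rawhide-x86_64"]

print(is_srpm_build(srpm_keys))
print(is_srpm_build(rpm_keys))
```

Putting the condition directly inside `any()` yields booleans rather than the matching strings, which behaves the same here as the form used in the diff because key strings are never empty.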

Build succeeded.

A single unquote() call is the correct thing to do, +1. But double unquoting... such a URL should never return a successful response (only a 404 or some other error).
I would prefer to just skip these URLs and log them as errors (without stopping the script), so that we can later better understand where they come from (AWS CloudFront? lighttpd? Which status code?)
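
To illustrate the difference between a singly and a doubly encoded URL, here is a minimal sketch of the check discussed above. It assumes the standard library's `urllib.parse.unquote` in place of the `requests.utils.unquote` import used in the diff; `decode_once` and the sample paths are hypothetical names chosen for the example.

```python
from urllib.parse import unquote


def decode_once(url):
    """Decode a percent-encoded URL once.

    Returns None when the decoded result can still be decoded further,
    which is how a doubly encoded URL shows up.
    """
    decoded = unquote(url)
    if decoded != unquote(decoded):
        return None
    return decoded


# Hypothetical paths, modelled on the %40oamg example from the diff.
print(decode_once("/results/%40oamg/example/epel-7-x86_64/"))
print(decode_once("/results/%2540oamg/example/epel-7-x86_64/"))
```

The first call prints the path with `@oamg`, the second prints `None`. Note that this heuristic would also reject a URL whose decoded form legitimately contains a percent escape.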

rebased onto ff995346cf1e4183aa96be282383c5b8ebe00977

2 years ago

Build succeeded.

+1 (I would use log.error instead of log.debug, but it is a nit)

rebased onto ca284b3

2 years ago

Build succeeded.

Commit 72d6210 fixes this pull-request

Pull-Request has been merged by praiskup

2 years ago

Commit cd9178e fixes this pull-request

Pull-Request has been merged by praiskup

2 years ago