Smooge tells me that the current system's "rawdb" output is threatening to overwhelm the system. Will originally estimated that rawdb would grow by something like 10GB/year, but I think we underestimated Fedora growth, and EPEL usage in particular.
The current system processes raw Apache log files into rawdb and then generates totals.db from that file. It would be possible to instead accumulate directly into the totals.db.
    for every line in the http log:
        if the line matches the data format we're expecting
            collapse date to just "weeknum"
            extract (os_name, os_variant, os_version, os_arch, sys_age, repo_tag, repo_arch)
            if line matching weeknum + extracted info isn't in the db
                add it with hits = 1
            else if it is in the db
                increment hits
The last part could be done with a sqlite "upsert".
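To make the upsert idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `countme_totals`, the column types, and the helper names `open_totals` and `record_hit` are assumptions for illustration, not the current tool's schema. It relies on SQLite's `ON CONFLICT ... DO UPDATE` clause, which needs SQLite 3.24 or newer and a UNIQUE constraint over the key columns.

```python
import sqlite3

# Key columns follow the tuple in the pseudocode above, plus weeknum.
KEY_COLS = ("weeknum", "os_name", "os_variant", "os_version",
            "os_arch", "sys_age", "repo_tag", "repo_arch")


def open_totals(path):
    """Open (or create) a totals database. The schema is an assumption."""
    con = sqlite3.connect(path)
    con.execute(
        """
        CREATE TABLE IF NOT EXISTS countme_totals (
            hits       INTEGER NOT NULL,
            weeknum    INTEGER NOT NULL,
            os_name    TEXT NOT NULL,
            os_variant TEXT NOT NULL,
            os_version TEXT NOT NULL,
            os_arch    TEXT NOT NULL,
            sys_age    INTEGER NOT NULL,
            repo_tag   TEXT NOT NULL,
            repo_arch  TEXT NOT NULL,
            UNIQUE (weeknum, os_name, os_variant, os_version,
                    os_arch, sys_age, repo_tag, repo_arch)
        )
        """
    )
    return con


def record_hit(con, key):
    """Insert the key with hits = 1, or increment hits if it already exists."""
    cols = ", ".join(KEY_COLS)
    marks = ", ".join("?" for _ in KEY_COLS)
    con.execute(
        f"INSERT INTO countme_totals ({cols}, hits) VALUES ({marks}, 1) "
        f"ON CONFLICT ({cols}) DO UPDATE SET hits = hits + 1",
        key,
    )
```

Calling `record_hit` once per matching log line, then `con.commit()` at the end of each log file, would keep everything in a single pass with no intermediate rawdb.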
The main complications, as I understand it, are:
For the first, it might just be a matter of "well, that's too bad", if we don't have the space.
For the second, though, we could do this a different way.
Am I missing anything here? It's very possible I am!
This could, at the same time, have another accumulator, the "IP hits" accumulator (see https://pagure.io/fedora-infrastructure/issue/10443). The logic for that would be slightly different:
    for every line in the http log:
        if the line matches the data format we're expecting
            collapse date to "daynum" and "weeknum"
            extract the IP address
            if the IP address is in a table of daynum:IP
                continue to next
            else
                add IP address to daynum:IP table
                extract (repo_tag, repo_arch)
                if line matching weeknum + extracted info isn't in the db
                    add it with iphits = 1
                else if it is in the db
                    increment iphits for that line
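The IP accumulator above could be sketched the same way. This is only an illustration: the table names `seen_ips` and `ip_totals`, the schema, and the function names are all hypothetical, and it assumes the caller has already parsed the log line into daynum, weeknum, IP, repo_tag and repo_arch.

```python
import sqlite3


def open_iphits(path):
    """Open (or create) the IP-hits database. Both tables are assumptions."""
    con = sqlite3.connect(path)
    con.executescript(
        """
        CREATE TABLE IF NOT EXISTS seen_ips (
            daynum INTEGER NOT NULL,
            ip     TEXT NOT NULL,
            PRIMARY KEY (daynum, ip)
        );
        CREATE TABLE IF NOT EXISTS ip_totals (
            weeknum   INTEGER NOT NULL,
            repo_tag  TEXT NOT NULL,
            repo_arch TEXT NOT NULL,
            iphits    INTEGER NOT NULL,
            PRIMARY KEY (weeknum, repo_tag, repo_arch)
        );
        """
    )
    return con


def record_ip_hit(con, daynum, weeknum, ip, repo_tag, repo_arch):
    """Count an IP at most once per day. Returns True if it was counted."""
    seen = con.execute(
        "SELECT 1 FROM seen_ips WHERE daynum = ? AND ip = ?", (daynum, ip)
    ).fetchone()
    if seen is not None:
        return False  # already in the daynum:IP table, continue to next line
    con.execute("INSERT INTO seen_ips (daynum, ip) VALUES (?, ?)", (daynum, ip))
    # Upsert the weekly total for this repo.
    con.execute(
        "INSERT INTO ip_totals (weeknum, repo_tag, repo_arch, iphits) "
        "VALUES (?, ?, ?, 1) "
        "ON CONFLICT (weeknum, repo_tag, repo_arch) "
        "DO UPDATE SET iphits = iphits + 1",
        (weeknum, repo_tag, repo_arch),
    )
    return True
```

As in the pseudocode, an IP is counted once per day overall, for whichever (repo_tag, repo_arch) it is first seen with. The `seen_ips` rows for a day are only needed while that day's logs are being processed, so they could presumably be dropped afterwards.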
And with this
Another observation: totals.db should be "append only". So, rather than growing one file, we could write to a new week-####.db every week. This is mildly more annoying to process, but means that data-processing clients can rsync just what they need (or, wget with proper timestamping, or whatever), and cache that, rather than an ever-increasing totals.db file.
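For the per-week files, something like the following would do. The directory layout, the helper names, and zero-padding the week number to four digits (my reading of "week-####.db") are assumptions.

```python
import sqlite3
from pathlib import Path


def week_db_path(base_dir, weeknum):
    """Path of the database file for one week, e.g. week-0042.db."""
    # Four-digit zero padding is an assumption based on "week-####.db".
    return Path(base_dir) / f"week-{weeknum:04d}.db"


def open_week_db(base_dir, weeknum):
    """Open (creating if needed) the database for the given week."""
    path = week_db_path(base_dir, weeknum)
    path.parent.mkdir(parents=True, exist_ok=True)
    return sqlite3.connect(path)
```

Once a week is over its file never changes again, so a client that already has it cached has nothing to re-download.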