Looks like data is not making it through the system? This graph basically says it all:
<img alt="fedora-updates-timeseries-age-all-stacked.png" src="/fedora-infrastructure/issue/raw/files/0d764a14c5c8640febf5e86f9d291611440263ed65a6c0d6df62e0b13947a12b-fedora-updates-timeseries-age-all-stacked.png" />
If this can be a fixed in a way that pulls in the old data, this isn't urgent. If we have it by mid-next month, that's fine. However, if it's losing data that won't be filled in later, I'd really like it to be fixed ASAP.
Thanks.
How I 'fixed' this the last 2 times.. was to run the programs over past logs in a 'force' mode.
./parse-access-log.py --progress --sqlite /var/lib/countme/raw.db --no-dup-check /mnt/fedora_stats/combined-http/2021/04/01/mirrors.fedoraproject.org-access.log /mnt/fedora_stats/combined-http/2021/04/02/mirrors.fedoraproject.org-access.log /mnt/fedora_stats/combined-http/2021/04/03/mirrors.fedoraproject.org-access.log /mnt/fedora_stats/combined-http/2021/04/04/mirrors.fedoraproject.org-access.log /mnt/fedora_stats/combined-http/2021/04/05/mirrors.fedoraproject.org-access.log /mnt/fedora_stats/combined-http/2021/04/06/mirrors.fedoraproject.org-access.log /mnt/fedora_stats/combined-http/2021/04/07/mirrors.fedoraproject.org-access.log
Doing it without dup-check it 'runs' but does not seem to add all the entries for dates so we get an 'undercount'. There will be a large drop for one week in late march where we did not get logs from proxies. The raw db is a large binary db so I do not know what its 'state' is and this maybe the problem.
Metadata Update from @mohanboddu: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: medium-gain, medium-trouble, ops
Trying that parse-access-log.py now over the april logs.
Where is that image generated? I don't see it in the normal ones on https://data-analysis.fedoraproject.org/csv-reports/images/ ?
The picture is some program Matt has and not part of the data-analysis images. There will be a drop on the week of March 25 as we did not get logs from most servers for 4 days. however dates after April 1 should show up.
I forgot to say that I run the script as the countme user to keep permissions on dbs and such correct. After the db's are finished.. I run a
sh /usr/local/bin/countme-update-totals.sh --rawdb /var/lib/countme/raw.db --totals-db /var/lib/countme/totals.db --totals-csv /var/lib/countme/totals.csv
which should update the db that matt uses and the csv file I have used to see if things are working. If it doesn't then it will be a dev issue to work through the code
@kevin Yeah, this is from my own graphing code for the new DNF Better Counting. Those reports are from the Classic IP Addr Counting System. Here's a direct look at the database:
$ sqlite3 totals.db SQLite version 3.34.1 2021-01-20 14:10:07 Enter ".help" for usage hints. sqlite> SELECT date(julianday('1970-01-05')+ weeknum * 7) as week,sum(hits) from countme_totals WHERE os_name="Fedora" AND os_variant = "workstation" AND repo_tag LIKE "updates-released-%" AND weeknum>2654 GROUP BY weeknum; 2020-11-23|366463 2020-11-30|370701 2020-12-07|366514 2020-12-14|364404 2020-12-21|337926 2020-12-28|331814 2021-01-04|364301 2021-01-11|375153 2021-01-18|379698 2021-01-25|383402 2021-02-01|379734 2021-02-08|388700 2021-02-15|389459 2021-02-22|390081 2021-03-01|398500 2021-03-08|401790 2021-03-15|403854 2021-03-22|254365 2021-03-29|262491 2021-04-05|3 2021-04-12|2 sqlite>
As you can see, we do have a drop on the week of March 25, but... rather than coming back after that, it just drops out.
Hmmm, in fact:
sqlite> select * from countme_totals where weeknum>2673 ...> ; 2|2674|Fedora|32|generic|x86_64|1|fedora-modular-32|x86_64 2|2674|Fedora|32|generic|x86_64|1|fedora-32|x86_64 2|2674|Fedora|32|generic|ppc64le|1|updates-released-modular-f32|ppc64le 2|2674|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64 2|2674|Fedora|32|generic|x86_64|1|updates-released-modular-f32|x86_64 2|2674|Fedora|32|generic|x86_64|1|updates-released-f32|x86_64 2|2674|Fedora|32|generic|ppc64le|1|fedora-modular-32|ppc64le 2|2674|Fedora|32|generic|ppc64le|1|updates-released-f32|ppc64le 12|2674|Fedora|35|silverblue|x86_64|1|rawhide|x86_64 2|2674|Fedora|32|workstation|x86_64|1|fedora-modular-32|x86_64 119|2674|Fedora|33|coreos|x86_64|1|updates-released-f33|x86_64 119|2674|Fedora|33|coreos|x86_64|1|fedora-33|x86_64 2|2674|Fedora|35|silverblue|x86_64|2|rawhide|x86_64 1|2674|Fedora|32|workstation|x86_64|1|updates-released-modular-f32|x86_64 7|2674|Fedora|35|coreos|x86_64|1|rawhide|x86_64 1|2675|Fedora|32|generic|ppc64le|1|updates-released-modular-f32|ppc64le 1|2675|Fedora|32|generic|ppc64le|1|updates-released-f32|ppc64le 1|2675|Fedora|32|generic|x86_64|1|fedora-32|x86_64 1|2675|Fedora|32|generic|x86_64|1|updates-released-modular-f32|x86_64 3|2675|Fedora|35|silverblue|x86_64|2|rawhide|x86_64 123|2675|Fedora|33|coreos|x86_64|1|updates-released-f33|x86_64 123|2675|Fedora|33|coreos|x86_64|1|fedora-33|x86_64 1|2675|Fedora|35|coreos|x86_64|1|rawhide|x86_64 3|2675|Fedora|35|silverblue|x86_64|1|rawhide|x86_64 1|2675|Fedora|32|generic|x86_64|1|fedora-modular-32|x86_64 1|2675|Fedora|32|generic|x86_64|1|updates-released-f32|x86_64 1|2675|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64 1|2675|Fedora|32|workstation|x86_64|1|updates-released-modular-f32|x86_64 1|2675|Fedora|32|generic|ppc64le|1|fedora-modular-32|ppc64le 1|2675|Fedora|32|workstation|x86_64|1|fedora-modular-32|x86_64 44|2675|Fedora|34|silverblue|x86_64|1|updates-testing-f34|x86_64 72|2675|Fedora|34|silverblue|x86_64|1|updates-released-f34|x86_64 76|2675|Fedora|34|silverblue|x86_64|1|fedora-34|x86_64 32|2675|Fedora|34|coreos|x86_64|1|updates-testing-f34|x86_64 41|2675|Fedora|34|coreos|x86_64|1|updates-released-f34|x86_64 41|2675|Fedora|34|coreos|x86_64|1|fedora-34|x86_64 1|2675|Fedora|34|silverblue|x86_64|1|updates-testing-modular-f34|x86_64 2|2675|Fedora|34|iot|arm|1|updates-testing-f34|armhfp 2|2675|Fedora|34|iot|arm|1|updates-released-f34|armhfp 1|2675|Fedora|34|iot|arm|1|fedora-34|armhfp
^ those are ALL of the rows we've gotten for all of April.
Try now? I just ran the update-totals smooge noted above.
Same, I'm afraid.
OK I am going to say this is a devel issue then. The code is running, but the db is not getting updated correctly. We had this happen once a long time ago and it needed a rerun of all the logs from 2020. Out of the blue hypothesis: I am guessing the raw.db is 'trashed' somewhere and needs fixing and then whatever in the code which is causing the 'trash' needs to be found.
ok theroy wrong.. raw.db says its at least right somewhat
sqlite> PRAGMA integrity_check; ok
OK so in /var/lib/countme I mv out the 'old' db and reran the commands completely;
$ ./countme-totals.py --update-from /var/lib/countme/raw.db --csv-dump /var/lib/countme/totals.csv --progress /var/lib/countme/totals.db $ ls -l /var/www/html/csv-reports/countme/ /var/lib/countme/ /var/lib/countme/: total 6584688 -rw-r--r--. 1 countme countme 6714486784 Apr 23 16:56 raw.db -rw-rw-r--. 1 countme countme 7322886 Apr 23 17:20 totals.csv -rw-r--r--. 1 countme countme 7081984 Apr 23 17:20 totals.db -rw-r--r--. 1 countme countme 6733824 Mar 25 09:03 totals.db-old -rw-r--r--. 1 countme countme 7086080 Apr 22 09:04 totals.db-old2 /var/www/html/csv-reports/countme/: total 14072 -rw-rw-r--. 1 countme countme 7322881 Apr 23 16:38 totals.csv -rw-r--r--. 1 countme countme 7086080 Apr 23 16:38 totals.db
The data in the file did not change for those last two weeks. The raw.db has the data in it
> SELECT * from countme_raw WHERE os_name="Fedora" AND os_variant = "workstation" AND repo_tag LIKE "updates-released-%" AND date(julianday('2021-04-18')) limit 10; 1581323291|2001:470:ddd2:1:b8f9:4fe2:f0a4:de1e|Fedora|31|workstation|x86_64|3|updates-released-modular-f31|x86_64 1581345132|2001:470:ddd2:1:b8f9:4fe2:f0a4:de1e|Fedora|31|workstation|x86_64|3|updates-released-f31|x86_64 1581512575|77.20.132.93|Fedora|32|workstation|x86_64|1|updates-released-modular-f32|x86_64 1581521049|2604:3d08:1381:8800::93cf|Fedora|32|workstation|x86_64|1|updates-released-modular-f32|x86_64 1581521084|2604:3d08:1381:8800::93cf|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64 1581589087|2001:470:ddd2:1:b8f9:4fe2:f0a4:de1e|Fedora|31|workstation|x86_64|3|updates-released-f31|x86_64 1581612017|2804:14c:4a:8950:3dc8:ff26:127c:8644|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64 1581613651|2804:14c:4a:8950:80be:9ff8:4219:230e|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64 1581615305|2804:14c:4a:8950:d401:6525:8ca7:fce|Fedora|32|workstation|x86_64|1|updates-released-modular-f32|x86_64 1581615337|2804:14c:4a:8950:d401:6525:8ca7:fce|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64
so the problem must be in countme-totals.py
countme-totals.py
And I found that my sql was wrong so I was counting dates in the raw.db incorrectly. There is a problem with counting after 2021-03-29 (if my improved query is correct) so the problem is in the parse-access-log program.
sqlite> SELECT count(*) from countme_raw where date(timestamp, 'unixepoch') >= '2021-03-28'; 1807618 sqlite> SELECT count(*) from countme_raw where date(timestamp, 'unixepoch') >= '2021-03-29'; 1728833 sqlite> SELECT count(*) from countme_raw where date(timestamp, 'unixepoch') >= '2021-04-01'; 441117
Related to the changes for rpm-ostree maybe?
OK I have checked with the original log files to make sure that this wasn't a 'countme is right and we dont' have countme= in the logs'.
$ grep -c countme= */mirrors.fedoraproject.org-access.log 01/mirrors.fedoraproject.org-access.log:277038 02/mirrors.fedoraproject.org-access.log:247856 03/mirrors.fedoraproject.org-access.log:192878 04/mirrors.fedoraproject.org-access.log:151194 05/mirrors.fedoraproject.org-access.log:143573 06/mirrors.fedoraproject.org-access.log:1130080 07/mirrors.fedoraproject.org-access.log:376237 08/mirrors.fedoraproject.org-access.log:302063 09/mirrors.fedoraproject.org-access.log:259353 10/mirrors.fedoraproject.org-access.log:236679
I am going to git revert the commits, rerun the logs and see if that fixes it.
OK I reran the countme with older code and it updated all the items and databases. So I believe the changes in commit 2872a824a04c8557500a5ea257fc7cac0a9907d5 may be causing this issue. My guess is the regex pattern:
@@ -103,8 +104,8 @@ LOG_PATTERN_FIELDS = { # # For more info on the User-Agent header, see RFC7231, Section 5.5.3: # https://tools.ietf.org/html/rfc7231#section-5.5.3) -LIBDNF_USER_AGENT_PATTERN = ( - r'(?P<product>libdnf(?:/(?P<product_version>\S+))?)\s+' +COUNTME_USER_AGENT_PATTERN = ( + r'(?P<product>(libdnf(?:/(?P<product_version>\S+)))|rpm-ostree?)\s+' r'\(' r'(?P<os_name>.*)\s' r'(?P<os_version>[0-9a-z._-]*?);\s'
I think I have applied a fix to the regex so that it will evaluate the logs.
<img alt="fedora-updates-timeseries-age-all-stacked.png" src="/fedora-infrastructure/issue/raw/files/bde81a9fc4ecb3bd1feaec6f18e1860e323a09a3e875795cb6ea0887cc232c76-fedora-updates-timeseries-age-all-stacked.png" />
That looks better!
Awesome. So, shall we close this for now?
Re-open if you see anything further.
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
Broken again this week :(
<img alt="fedora-updates-timeseries-age-all.png" src="/fedora-infrastructure/issue/raw/files/f6fe323f40d1bbb4c28286eac8819801f88c55849fe1041712691956a5390a7d-fedora-updates-timeseries-age-all.png" />
Metadata Update from @mattdm: - Issue status updated to: Open (was: Closed)
sqlite> select * from countme_totals where weeknum=2677; 160|2677|Fedora|34|silverblue|x86_64|1|updates-released-f34|x86_64 161|2677|Fedora|34|silverblue|x86_64|1|fedora-34|x86_64 1|2677|Fedora|32|generic|x86_64|1|updates-released-f32|x86_64 92|2677|Fedora|34|silverblue|x86_64|2|fedora-34|x86_64 90|2677|Fedora|34|silverblue|x86_64|2|updates-released-f34|x86_64 2|2677|Fedora|35|silverblue|x86_64|3|rawhide|x86_64 3|2677|Fedora|34|iot|aarch64|1|updates-released-f34|aarch64 3|2677|Fedora|34|iot|aarch64|1|fedora-34|aarch64 1|2677|Fedora|32|workstation|x86_64|1|updates-released-f32|x86_64 120|2677|Fedora|33|coreos|x86_64|1|updates-released-f33|x86_64 120|2677|Fedora|33|coreos|x86_64|1|fedora-33|x86_64 1|2677|Fedora|32|generic|x86_64|1|fedora-modular-32|x86_64 1|2677|Fedora|32|generic|ppc64le|1|fedora-modular-32|ppc64le 1|2677|Fedora|32|generic|ppc64le|1|updates-released-f32|ppc64le 1|2677|Fedora|32|workstation|x86_64|1|fedora-modular-32|x86_64 1|2677|Fedora|32|workstation|x86_64|1|updates-released-modular-f32|x86_64 1|2677|Fedora|32|generic|x86_64|1|updates-released-modular-f32|x86_64 1|2677|Fedora|32|generic|x86_64|1|fedora-32|x86_64 1|2677|Fedora|32|generic|ppc64le|1|updates-released-modular-f32|ppc64le 48|2677|Fedora|34|coreos|x86_64|1|updates-released-f34|x86_64 48|2677|Fedora|34|coreos|x86_64|1|fedora-34|x86_64 2|2677|Fedora|34|silverblue|x86_64|1|fedora-modular-34|x86_64 11|2677|Fedora|35|silverblue|x86_64|1|rawhide|x86_64 1|2677|Fedora|34|silverblue|x86_64|2|updates-testing-modular-f34|x86_64 1|2677|Fedora|34|silverblue|x86_64|1|updates-testing-modular-f34|x86_64 1|2677|Fedora|34|silverblue|x86_64|1|updates-testing-f34|x86_64 1|2677|Fedora|34|silverblue|x86_64|1|updates-released-modular-f34|x86_64 1|2677|Fedora|34|iot|x86_64|2|fedora-34|x86_64 1|2677|Fedora|34|iot|x86_64|2|updates-released-f34|x86_64 3|2677|Fedora|34|silverblue|x86_64|2|updates-testing-f34|x86_64 1|2677|Fedora|35|coreos|x86_64|1|rawhide|x86_64 3|2677|Fedora|34|iot|x86_64|1|fedora-34|x86_64 3|2677|Fedora|34|iot|x86_64|1|updates-released-f34|x86_64 1|2677|Fedora|34|iot|arm|2|fedora-34|armhfp 1|2677|Fedora|34|iot|arm|2|updates-released-f34|armhfp
Any guesses? It looks like the ones that are coming through are mostly rpm-ostree (but maybe not all of those, even?). I'm eager to see the F34 numbers!
So I have rerun all the jobs with the code and it still counted nothing. I have removed the rpm-ostree regex from the code and am trying one more time to see if that fixes it. If it does then this needs to be a dev ticket/mini-initiative to fix the code properly.
I am not sure what is causing this at all. I am going to reconstruct the raw.db in a different db and see if that is the issue. This will probably take all weekend.
@smooge Thanks, I appreciate it. This is all going to be worth it :)
I have rerun all the data from 2020-01-01 until 2021-05-05?. It seems to have changed a lot of the csv lines so I am guessing the counts have been off or there are other problems in the code. I think we are at 'need a developer to fix'.
@smooge The new graphs look better:
<img alt="fedora-updates-timeseries-age-all.png" src="/fedora-infrastructure/issue/raw/files/cceed9a8e9a2be556f302190927d374d039fd6ea76d3a3dc24920b684740ea71-fedora-updates-timeseries-age-all.png" /><img alt="fedora-updates-timeseries-release-all.png" src="/fedora-infrastructure/issue/raw/files/96a583d11582c4c37995106f6ae924f9f8868fa4cc20464df3467181b4b56fb8-fedora-updates-timeseries-release-all.png" />
Do you think this is going to break again next week?
DNF worked for 1 week without breaking.
Yeah. The numbers seem lower this week than last week for some reason, but I'm not sure if that's something broken or just what the data looks like. I guess we'll see!
Not sure myself. I recreated the databases from scratch using just the current code which was used this week on the logs. I expect that this still needs someone to look at the code long term.
Metadata Update from @smooge: - Issue assigned to smooge
Yeah, looks broken again. Can we escalate this to "mini-initiative"?
<img alt="fedora-updates-timeseries-age-all-stacked.png" src="/fedora-infrastructure/issue/raw/files/f6dd6059ac38bb2aff895c830ce0fc57660247627b35efa547ecfb5d906e35f8-fedora-updates-timeseries-age-all-stacked.png" />
We could but I am worried it would sit around not getting touched.
A mini-initiative means there are people who know the project and can work on this while doing their own regular day to day jobs. There needs to be someone on the team who is pushing it forward.
An initiative is meant for when there are not people who know what the project and/or the work is going to take out from their day to day jobs but mgmt can say 'that is ok, this other thing is more important than XYZ thing group B wanted'. It also means that it gets reviewed and pushed to conclusion.
There is no one of the CPE team who knows this project or its code. Also currently most of CPE's people are working CentOS Stream work. That would push it more towards being an initiative to get a focus on it.
I will see what I can look at over the weekend but I think for long term work statistics needs to be a 'real person' job versus volunteer.
OK I don't know if things are 'broken'. The raw data for the last week has been 'weird'
[smooge@log01 05][PROD-IAD2]$ for i in *; do echo -n "$i "; grep -c countme= $i/mirrors.fedoraproject.org-access.log; done 01 285863 02 211358 03 193647 04 1622168 05 403975 06 332682 07 298408 08 271678 09 201907 10 193090 11 1662151 12 410051 13 347021 14 294174 15 71515 16 121151 17 200247 18 979033
The 15th and 16th are well below normal. The raw files are the usual sizes but the number of counts from clients took a nose dive. I do not have the sql guruhood to compare this to what the DB says.
For what it's worth there's weird dip in Old Counting Method numbers as well.
So not sure what is up.. the xz sizes of the days are the following:
14 507388 total 15 501832 total 16 476472 total
but the combined logs are much smaller
8.0G 14/mirrors.fedoraproject.org-access.log 2.2G 15/mirrors.fedoraproject.org-access.log 4.5G 16/mirrors.fedoraproject.org-access.log
OK something went wrong on the 15th and 16th for combining logs so we logs 6 gigs of lines. Redoing the merge did not give any errors and a new file. I will see about reprocessing the data for those days.
OK I have remerged the files and I have rerun the countme on those dates. The totals.db and totalscvs show new data so please check to see if this 'fixes' things. I will try to fix the old data also.
Thank you so much @smooge. I think we still need some CPE official attention on this, because you fixing it every week or so is not sustainable. :)
Yeah.. it definitely is still broken. I just looked over my simple read the EPEL ones and the counts for last week are 1/2 of what they were the week before. There seem to be multiple problems going on with the log combining, the log parsing and the raw and total databases. I don't what they all are, and my 'fix' is currently to rebuild the raw db from scratch which is doubly unsustainable as it takes 4-6 days to do so.
OK ran various 'test' runs and the bug seems to be something to do with the raw.db where all data is put in first before being condensed into the totals.db. If I start with a new db for the last week and then run the totals I get 'correct' looking data counts.
I then made a copy of the known raw.db and ran the logs for the last 1 week again on it telling it to not check if it had an entry with it.. it did fix that week but it also 'fix' the week previous by adding some entries to it. At this point we are going to make this an development initiative with help from someone who really understands DB's. Either the writing to the db is getting corrupted or it is too large for what it is doing (the file is 8.3 GB in size now). It may be that there are some sort of cleaning and reordering which is needed to 'regularly' keep it nice.
In any case, at this point I am going to say this will be continually 'broke'
We now have a iniative working on this... I'm gonna go ahead and close this now. Feel free to re-open if there's some need to track things here.
Metadata Update from @kevin: - Issue close_status updated to: Upstream - Issue status updated to: Closed (was: Open)
Log in to comment on this ticket.