Moved from: https://pagure.io/taskotron/taskotron-trigger/issue/60 because this seems to be a buildbot issue, not a trigger one.
There's a race condition that seems to occur once or multiple times per hour that causes exactly the same tasks (including the same UUID) to be executed twice. Both tasks fail, because they overwrite their files in the output directories. Here's an example:
https://taskotron-dev.fedoraproject.org/taskmaster/#/builders/2/builds/5910
https://taskotron-dev.fedoraproject.org/taskmaster/#/builders/2/builds/5911
Compare the Build Properties tabs: it's exactly the same job.
I wonder if this can be an issue with fedmsg when we receive some message twice. But in that case, it should receive a different UUID, right? If that's true, this seems to be an issue with trigger instead, or trigger<->buildbot integration.
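If duplicate fedmsg deliveries do turn out to be the cause, the consumer could reject repeats before they ever reach buildbot. A minimal sketch, keyed on the message ID; the class name and integration point are hypothetical, not trigger's actual API:

```python
# Hypothetical duplicate-message guard: remember the last `maxlen` message
# IDs and refuse to act on one we have already seen.
from collections import deque

class Deduplicator:
    def __init__(self, maxlen=10000):
        self._order = deque(maxlen=maxlen)  # eviction order
        self._seen = set()                  # O(1) membership test

    def is_duplicate(self, msg_id):
        if msg_id in self._seen:
            return True
        if len(self._order) == self._order.maxlen:
            # the deque is about to evict its oldest entry; forget it too
            self._seen.discard(self._order[0])
        self._order.append(msg_id)
        self._seen.add(msg_id)
        return False
```

Of course, if each redelivered message carried a fresh UUID (as suspected above), this guard would not fire, which is what makes the identical UUIDs in builds 5910/5911 suspicious.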
I haven't seen this before, so this might also be an issue with the latest buildbot 1.7+ (currently deployed on dev).
CC @frantisekz @tflink @jskladan
In jobs.csv, it's mentioned only once for armhfp:
```
[root@taskotron-dev01 ~][PROD]# grep 'boost-nowide-0-20171029.gitec9672b.fc30' /var/log/taskotron-trigger/jobs.csv
2019-02-11 21:30:06.436575;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;rpmlint;x86_64;noarch
2019-02-11 21:30:06.592033;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;rpmlint;armhfp;noarch
2019-02-11 21:30:06.675796;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;python-versions;noarch;noarch
2019-02-11 21:30:06.760844;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;rpmgrill;noarch;noarch
2019-02-11 21:30:06.858730;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;abicheck;x86_64;noarch
2019-02-11 21:30:06.987246;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;abicheck;armhfp;noarch
```
But I can't find a job for the abicheck;x86_64 part, neither in buildbot nor in resultsdb! It almost seems like the 5910 job should have been scheduled for x86_64 (with a unique UUID), but it was scheduled for armhfp instead...
This is not limited to abicheck or to specific arches. Here's an example of a double-scheduled rpmgrill run on noarch:
https://taskotron-dev.fedoraproject.org/taskmaster/#/builders/2/builds/6040
https://taskotron-dev.fedoraproject.org/taskmaster/#/builders/2/builds/6041
And that job should have been scheduled just once:
```
[root@taskotron-dev01 ~][PROD]# grep libcue-2.1.0-8.fc30 /var/log/taskotron-trigger/jobs.csv
2019-02-11 22:18:01.989463;libcue-2.1.0-8.fc30;koji_build;rpmlint;x86_64;noarch
2019-02-11 22:18:02.107293;libcue-2.1.0-8.fc30;koji_build;rpmlint;armhfp;noarch
2019-02-11 22:18:02.167086;libcue-2.1.0-8.fc30;koji_build;python-versions;noarch;noarch
2019-02-11 22:18:02.244493;libcue-2.1.0-8.fc30;koji_build;rpmgrill;noarch;noarch
2019-02-11 22:18:02.325802;libcue-2.1.0-8.fc30;koji_build;abicheck;x86_64;noarch
2019-02-11 22:18:02.425296;libcue-2.1.0-8.fc30;koji_build;abicheck;armhfp;noarch
```
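To rule trigger out systematically rather than grepping one NVR at a time, jobs.csv can be scanned for any job tuple logged more than once. A small sketch, assuming the column layout shown in the grep output (timestamp;item;item_type;taskname;arch;host_arch); the function name is mine, not part of trigger:

```python
# Count how many times each (item, item_type, taskname, arch) tuple appears
# in trigger's jobs.csv. A count > 1 would implicate trigger; a single entry
# with a duplicated buildbot build implicates buildbot (or the integration).
import csv
from collections import Counter

def find_duplicates(lines):
    """`lines` is an open jobs.csv file or any iterable of its lines."""
    counts = Counter()
    for row in csv.reader(lines, delimiter=";"):
        if len(row) >= 5:
            # row: timestamp, item, item_type, taskname, arch, host_arch
            counts[tuple(row[1:5])] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

Running this over the file for the examples above should return an empty dict, matching the observation that trigger logged each job exactly once.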
Here's trigger log for the first example:
```
[2019-02-11 21:30:06][fedmsg.consumers INFO] triggering abicheck for boost-nowide-0-20171029.gitec9672b.fc30 on x86_64 (run on noarch host)
[2019-02-11 21:30:06][fedmsg.consumers INFO] BuildbotRunner received: item: boost-nowide-0-20171029.gitec9672b.fc30 item_type: koji_build taskname: abicheck arch: x86_64 host_arch: noarch
[2019-02-11 21:30:06][fedmsg.consumers INFO] post-hook build request status: 200
[2019-02-11 21:30:06][fedmsg.consumers INFO] triggering abicheck for boost-nowide-0-20171029.gitec9672b.fc30 on armhfp (run on noarch host)
[2019-02-11 21:30:06][fedmsg.consumers INFO] BuildbotRunner received: item: boost-nowide-0-20171029.gitec9672b.fc30 item_type: koji_build taskname: abicheck arch: armhfp host_arch: noarch
[2019-02-11 21:30:07][fedmsg.consumers INFO] post-hook build request status: 200
```
And for the second one:
```
[2019-02-11 22:18:02][fedmsg.consumers INFO] triggering rpmgrill for libcue-2.1.0-8.fc30 on noarch (run on noarch host)
[2019-02-11 22:18:02][fedmsg.consumers INFO] BuildbotRunner received: item: libcue-2.1.0-8.fc30 item_type: koji_build taskname: rpmgrill arch: noarch host_arch: noarch
[2019-02-11 22:18:02][fedmsg.consumers INFO] post-hook build request status: 200
```
My theory was that the POST request gets repeated on error or something, but it doesn't seem so. I'm starting to think this might be a buildbot issue.
Buildbot's twistd.log is not very helpful either:
```
2019-02-11 21:30:06+0000 [-] added buildset 5910 to database
...
2019-02-11 21:30:08+0000 [-] starting build <Build all number:5910 results:success>.. pinging the worker <WorkerForBuilder builder='all' worker='qa11-15' state=BUILDING>
...
2019-02-11 21:30:08+0000 [-] acquireLocks(build <Build all number:5910 results:success>, locks [(<WorkerLock(worker_builds, 1)[qa11-15] 140601999098040>, <buildbot.locks.LockAccess object at 0x7fe0764ff2e8>)])
...
2019-02-11 21:31:30+0000 [-] <Build all number:5910 results:failure>: build finished
2019-02-11 21:31:30+0000 [-] releaseLocks(<Build all number:5910 results:failure>): [(<WorkerLock(worker_builds, 1)[qa11-15] 140601999098040>, <buildbot.locks.LockAccess object at 0x7fe0764ff2e8>)]
```
It doesn't explain why the build was run twice.
I have pushed a workaround to make sure the first task always finishes fine and the duplicate task is killed as soon as possible:
https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=988522a54450d92ee2e433548e6e355c7c212e63

It doesn't solve the problem, but it at least makes sure we have results in resultsdb and as artifacts. The downside is that we'll keep receiving email notifications about all those duplicate tasks marked as failures. They are easy to spot in the log, but their volume can obscure real errors.
The same problem is happening on taskotron-stg, so unfortunately this is not a fluke on dev; it will most likely also affect production once it is upgraded.
The workarounds seem to be working fine even on production, lowering priority.
Metadata Update from @kparal:
- Issue priority set to: Low (was: High)