#273 Jobs are sometimes executed twice (with the same UUID)
Opened 6 years ago by kparal. Modified 5 years ago

Moved from:
https://pagure.io/taskotron/taskotron-trigger/issue/60
because this seems to be a buildbot issue, not a trigger one.


There's a race condition that seems to occur one or more times per hour, causing exactly the same task (including the same UUID) to be executed twice. Both runs fail, because they overwrite each other's files in the output directory. Here's an example:

https://taskotron-dev.fedoraproject.org/taskmaster/#/builders/2/builds/5910
https://taskotron-dev.fedoraproject.org/taskmaster/#/builders/2/builds/5911

Compare the Build Properties tabs: it's exactly the same job.
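To make the failure mode concrete, here's a minimal sketch of the collision, assuming the output directory is derived from the job UUID (the path layout below is hypothetical, not necessarily the real artifacts scheme):

```python
# Sketch of the collision, assuming output paths are keyed by the job
# UUID -- the directory layout here is hypothetical.
import os

def artifacts_dir(job_uuid):
    return os.path.join("/srv/taskotron/artifacts", job_uuid)

# Builds 5910 and 5911 carry the same uuid property, so both runs
# resolve to the same directory and clobber each other's files mid-run,
# which is why *both* of them end up failing.
```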

I wonder whether this could be an issue with fedmsg, if we receive some message twice. But in that case, each execution should receive a different UUID, right? If that's true, this seems to be an issue with trigger instead, or with the trigger<->buildbot integration.
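To spell out that reasoning: a minimal sketch, assuming (as the question above does) that a fresh UUID is minted for every scheduled job rather than carried inside the fedmsg itself; the function below is hypothetical, not trigger's actual code:

```python
# Sketch: if the UUID is generated per scheduled job, a fedmsg that is
# delivered twice would produce two jobs with *different* UUIDs.
import uuid

def schedule_job(item, taskname, arch):
    job_uuid = str(uuid.uuid4())   # fresh UUID on every call
    # ... hand the job (including job_uuid) over to buildbot ...
    return job_uuid

# Consuming the same message twice therefore cannot explain two builds
# sharing one UUID -- the duplication must happen after this point.
a = schedule_job("boost-nowide-0-20171029.gitec9672b.fc30", "abicheck", "x86_64")
b = schedule_job("boost-nowide-0-20171029.gitec9672b.fc30", "abicheck", "x86_64")
assert a != b
```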

I haven't seen this before, so it might also be an issue with the latest buildbot 1.7+ (currently deployed on dev).

CC @frantisekz @tflink @jskladan


In jobs.csv, the job is mentioned only once for armhfp:

[root@taskotron-dev01 ~][PROD]# grep 'boost-nowide-0-20171029.gitec9672b.fc30' /var/log/taskotron-trigger/jobs.csv
2019-02-11 21:30:06.436575;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;rpmlint;x86_64;noarch
2019-02-11 21:30:06.592033;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;rpmlint;armhfp;noarch
2019-02-11 21:30:06.675796;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;python-versions;noarch;noarch
2019-02-11 21:30:06.760844;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;rpmgrill;noarch;noarch
2019-02-11 21:30:06.858730;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;abicheck;x86_64;noarch
2019-02-11 21:30:06.987246;boost-nowide-0-20171029.gitec9672b.fc30;koji_build;abicheck;armhfp;noarch

But I can't find a job for the abicheck;x86_64 entry in either buildbot or resultsdb! It almost seems like job 5910 should have been scheduled for x86_64 (with a unique UUID), but was scheduled for armhfp instead...
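As a sanity check against miscounting by eye, a quick script can confirm that each (item, task, arch) tuple appears exactly once on the trigger side (column order taken from the jobs.csv excerpt above):

```python
# Count scheduled (item, taskname, arch) tuples in trigger's jobs.csv
# to spot anything scheduled more than once.
import csv
from collections import Counter

counts = Counter()
with open("/var/log/taskotron-trigger/jobs.csv") as f:
    for row in csv.reader(f, delimiter=";"):
        # columns: timestamp;item;item_type;taskname;arch;host_arch
        if len(row) != 6:
            continue
        timestamp, item, item_type, taskname, arch, host_arch = row
        counts[(item, taskname, arch)] += 1

for key, n in sorted(counts.items()):
    if n > 1:
        print("scheduled %d times: %s" % (n, key))
```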


This is not limited to abicheck or to specific arches. Here's an example of a double-scheduled rpmgrill on noarch:

https://taskotron-dev.fedoraproject.org/taskmaster/#/builders/2/builds/6040
https://taskotron-dev.fedoraproject.org/taskmaster/#/builders/2/builds/6041

And that job should have been scheduled just once:

[root@taskotron-dev01 ~][PROD]# grep libcue-2.1.0-8.fc30 /var/log/taskotron-trigger/jobs.csv
2019-02-11 22:18:01.989463;libcue-2.1.0-8.fc30;koji_build;rpmlint;x86_64;noarch
2019-02-11 22:18:02.107293;libcue-2.1.0-8.fc30;koji_build;rpmlint;armhfp;noarch
2019-02-11 22:18:02.167086;libcue-2.1.0-8.fc30;koji_build;python-versions;noarch;noarch
2019-02-11 22:18:02.244493;libcue-2.1.0-8.fc30;koji_build;rpmgrill;noarch;noarch
2019-02-11 22:18:02.325802;libcue-2.1.0-8.fc30;koji_build;abicheck;x86_64;noarch
2019-02-11 22:18:02.425296;libcue-2.1.0-8.fc30;koji_build;abicheck;armhfp;noarch

Here's the trigger log for the first example:

[2019-02-11 21:30:06][fedmsg.consumers    INFO] triggering abicheck for boost-nowide-0-20171029.gitec9672b.fc30 on x86_64 (run on noarch host)
[2019-02-11 21:30:06][fedmsg.consumers    INFO] BuildbotRunner received:
item: boost-nowide-0-20171029.gitec9672b.fc30
item_type: koji_build
taskname: abicheck
arch: x86_64
host_arch: noarch
[2019-02-11 21:30:06][fedmsg.consumers    INFO] post-hook build request status: 200
[2019-02-11 21:30:06][fedmsg.consumers    INFO] triggering abicheck for boost-nowide-0-20171029.gitec9672b.fc30 on armhfp (run on noarch host)
[2019-02-11 21:30:06][fedmsg.consumers    INFO] BuildbotRunner received:
item: boost-nowide-0-20171029.gitec9672b.fc30
item_type: koji_build
taskname: abicheck
arch: armhfp
host_arch: noarch
[2019-02-11 21:30:07][fedmsg.consumers    INFO] post-hook build request status: 200

And for the second one:

[2019-02-11 22:18:02][fedmsg.consumers    INFO] triggering rpmgrill for libcue-2.1.0-8.fc30 on noarch (run on noarch host)
[2019-02-11 22:18:02][fedmsg.consumers    INFO] BuildbotRunner received:
item: libcue-2.1.0-8.fc30
item_type: koji_build
taskname: rpmgrill
arch: noarch
host_arch: noarch
[2019-02-11 22:18:02][fedmsg.consumers    INFO] post-hook build request status: 200

My theory was that the POST request gets repeated on error or something, but that doesn't seem to be the case. I'm starting to think this might be a buildbot issue.
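For reference, the scheduling call is a single HTTP POST per job, matching the single "post-hook build request status: 200" line per job in the logs above. A minimal sketch of such a request against buildbot's REST API; the scheduler name, URL, and payload shape are assumptions for illustration, not necessarily what trigger really sends:

```python
# Sketch of a one-shot scheduling POST to a buildbot force scheduler.
# The scheduler name ("all"), URL, and property names are assumptions.
import requests

def post_job(master_url, props):
    payload = {
        "jsonrpc": "2.0",   # buildbot's control endpoints speak JSON-RPC 2.0
        "method": "force",
        "id": 1,
        "params": props,    # e.g. item, item_type, taskname, arch
    }
    # a single POST with no retry loop -- if this returns 200 once, the
    # trigger side has only asked for one build
    r = requests.post(master_url + "/api/v2/forceschedulers/all", json=payload)
    return r.status_code
```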


Buildbot's twistd.log is not much help:

2019-02-11 21:30:06+0000 [-] added buildset 5910 to database
...
2019-02-11 21:30:08+0000 [-] starting build <Build all number:5910 results:success>.. pinging the worker <WorkerForBuilder builder='all' worker='qa11-15' state=BUILDING>
...
2019-02-11 21:30:08+0000 [-] acquireLocks(build <Build all number:5910 results:success>, locks [(<WorkerLock(worker_builds, 1)[qa11-15] 140601999098040>, <buildbot.locks.LockAccess object at 0x7fe0764ff2e8>)])
...
2019-02-11 21:31:30+0000 [-]  <Build all number:5910 results:failure>: build finished
2019-02-11 21:31:30+0000 [-] releaseLocks(<Build all number:5910 results:failure>): [(<WorkerLock(worker_builds, 1)[qa11-15] 140601999098040>, <buildbot.locks.LockAccess object at 0x7fe0764ff2e8>)]

It doesn't explain why the build was run twice.
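One way to enumerate all the duplicated pairs (rather than spotting them one by one) could be to group recent builds by their uuid build property via buildbot's REST API. A rough sketch; the API root and the exact response shape are assumptions based on buildbot's documented v2 API:

```python
# Sketch: group recent builds by their "uuid" build property to list
# duplicated pairs like 5910/5911. The API root, the "property" query
# filter, and the [value, source] property shape are assumptions.
import requests
from collections import defaultdict

API = "https://taskotron-dev.fedoraproject.org/taskmaster/api/v2"

resp = requests.get(API + "/builds",
                    params={"limit": 500, "order": "-buildid",
                            "property": "uuid"}).json()

by_uuid = defaultdict(list)
for build in resp["builds"]:
    props = build.get("properties", {})
    if "uuid" in props:
        by_uuid[props["uuid"][0]].append(build["number"])

for u, numbers in by_uuid.items():
    if len(numbers) > 1:
        print(u, sorted(numbers))
```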


I have pushed a workaround to make sure the first task always finishes fine and the duplicate task is killed as soon as possible:
https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=988522a54450d92ee2e433548e6e355c7c212e63
It doesn't solve the problem, but it at least makes sure we have results in resultsdb and as artifacts.
The downside is that we'll keep receiving email notifications about all those duplicate tasks marked as failures. You can easily spot them by looking at the log, but their sheer number can obscure real errors.
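For illustration, the general shape of such a "first run wins" guard (a sketch of the idea only; see the ansible commit above for what is actually deployed): the first run atomically claims a marker, and the duplicate notices the claim and exits immediately.

```python
# Sketch of a duplicate-run guard, assuming both runs share an output
# directory derived from the job UUID. os.mkdir() is atomic, so exactly
# one of two racing runs can create the marker.
import os
import sys

def claim_or_die(artifacts_dir):
    marker = os.path.join(artifacts_dir, ".claimed")
    try:
        os.mkdir(marker)            # succeeds for exactly one run
    except FileExistsError:
        print("duplicate run detected, bailing out", file=sys.stderr)
        sys.exit(1)                 # the duplicate dies fast; the first
                                    # run's results and artifacts survive
```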

The same problem is happening on taskotron-stg, so unfortunately this is not a fluke on dev; it will most likely also affect production once it is upgraded.

The workarounds seem to be working fine even on production, so I'm lowering the priority.

Metadata Update from @kparal:
- Issue priority set to: Low (was: High)

5 years ago
