#4358 Fix min_avail calculation
Merged 2 months ago by mikem. Opened 3 months ago by mikem.
mikem/koji fix-min-avail  into  master

file modified
+2 -2
@@ -377,7 +377,7 @@ 

          refusals = self.get_refusals()

          for task in self.free_tasks:

              task['_hosts'] = []

-             min_avail = min(0, task['weight'] - self.capacity_overcommit)

+             min_avail = max(0, task['weight'] - self.capacity_overcommit)

              h_refused = refusals.get(task['task_id'], {})

              for host in self.hosts_by_bin.get(task['_bin'], []):

                  if (host['ready'] and
@@ -402,7 +402,7 @@ 

  

          # tasks are already in priority order

          for task in self.free_tasks:

-             min_avail = task['weight'] - self.capacity_overcommit

+             min_avail = max(0, task['weight'] - self.capacity_overcommit)

              task['_hosts'].sort(key=lambda h: h['_rank'])

              logger.debug('Task %i choices: %s', task['task_id'],

                           [(h['name'], "%(_rank).2f" % h) for h in task['_hosts']])

@@ -222,6 +222,23 @@ 

          self.assertEqual(t_assigned, list(range(5)))

          self.assertEqual(h_used, list(range(5)))

  

+     def test_stop_at_capacity(self):

+         # just one host

+         host = self.mkhost(id=1, capacity=5.0)

+         self.sched._get_hosts.return_value = [host]

+         self.sched.get_hosts()

+         # and more tasks than will fit

+         self.sched.free_tasks = [self.mktask(task_id=n, weight=1.0) for n in range(10)]

+ 

+         self.sched.do_schedule()

+ 

+         # 5 tasks with weight=1.0 should fill up capacity

+         # (overcommit only applies for a single task going over)

+         self.assertEqual(len(self.assigns), 5)

+         t_assigned = [t['task_id'] for t, h in self.assigns]

+         t_assigned.sort()

+         self.assertEqual(t_assigned, list(range(5)))

+ 

      def test_active_tasks(self):

          self.context.opts['CapacityOvercommit'] = 1.0

          hosts = [self.mkhost(id=n, capacity=2.0) for n in range(5)]

The scheduler was incorrectly calculating min_avail (the amount of needed capacity) for tasks, in two different ways.

In the first task loop, using min instead of max is very wrong and causes the scheduler to miscalculate demand. This won't necessarily lead to obvious problems, but mostly defeats the code's attempt to rank hosts based on demand.

In the second task loop, the calculation is less wrong, but can be negative for tasks with weight less than the overcommit setting. This treats overcommit as "extra capacity" rather than the intention, which was to allow a single task to overshoot the capacity.

The added unit test demonstrates the problem from the second loop, and will fail without this fix.

Fixes https://pagure.io/koji/issue/4359

Currently Includes one of the fixes from #4357 to get tests to pass, but that presumably disappear in a future rebase

Metadata Update from @tkopecek:
- Pull-request tagged with: testing-basic

3 months ago

Metadata Update from @mfilip:
- Pull-request tagged with: testing-done

2 months ago

rebased onto caeac8e

2 months ago

rebased to master, which dropped the already merged cherry-pick from #4357
unit tests still pass

Commit 4729cf8 fixes this pull-request

Pull-Request has been merged by mikem

2 months ago