wdmd: close device when test fails

Instead of just not petting the device after a test fails,
close the device. Because the close generates a ping, we
want to get it done early. Otherwise, if wdmd exited (e.g.
crash or SIGKILL) just before the device was ready to fire,
the close generated by the kernel would extend the life of
the machine by an extra 60 seconds. This means we need to
re-open the device if we want to resume petting it.
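
The close-on-failure and re-open-on-recovery behavior described above
can be sketched as a small decision function. This is only an
illustration, not wdmd's actual code: the enum and function names are
hypothetical, and the real daemon performs the open, ping and close on
the watchdog device itself.

```c
/* Hypothetical sketch: which watchdog action to take after a test run.
 * None of these names come from wdmd; they are assumptions for
 * illustration only. */
enum wd_action {
    WD_NONE,            /* device already closed, tests still failing */
    WD_PET,             /* device open, tests pass: ping as usual */
    WD_CLOSE,           /* tests failed: close now so the close ping comes early */
    WD_REOPEN_AND_PET   /* tests pass again: device must be re-opened first */
};

enum wd_action wd_next_action(int device_open, int tests_passed)
{
    if (!tests_passed)
        return device_open ? WD_CLOSE : WD_NONE;
    return device_open ? WD_PET : WD_REOPEN_AND_PET;
}
```

The point of the WD_CLOSE case is that the ping generated by the close
is taken at a known time, instead of whenever the daemon happens to exit.
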
So, depending on whether the tests happen just prior
to the expiry or just after the expiry, the watchdog
will fire between 60 and 70 seconds after the expiry
time.
It would be 70 seconds if:
we do the check just before the expiration, the client
expires, and 10 seconds (TEST_INTERVAL) later we see the
expiration and close the device. The close generates a ping,
so the firing is 60 seconds after the close, which is
already 10 seconds after the expiration.
It would be 60 seconds if:
we do the check just after the expiration, see the
expiration, and close the device. The close generates a
ping, so the firing is 60 seconds after the close, which
is just after the expiration time.
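
The two bounds above follow from simple arithmetic, sketched here
under the assumptions stated in this message: a 60 second fire timeout
and a 10 second TEST_INTERVAL. FIRE_TIMEOUT and the function name are
hypothetical labels, not identifiers taken from wdmd.

```c
#define FIRE_TIMEOUT  60  /* seconds from the last ping to the reset (assumed name) */
#define TEST_INTERVAL 10  /* seconds between test runs */

/* Seconds after the client expiry at which the watchdog fires.
 * test_lag is how long after the expiry the next test runs,
 * anywhere from 0 to TEST_INTERVAL. The close at that test
 * generates the final ping, so the timeout counts from there. */
int fire_delay_after_expiry(int test_lag)
{
    return test_lag + FIRE_TIMEOUT;
}
```

With test_lag at 0 the result is 60 seconds, and with test_lag at
TEST_INTERVAL it is 70 seconds.
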
Previously, the assumption was that the host would
be reset between 50 and 60 seconds from the expiration
time, but this did not account for the fact that
the daemon could exit just before the host reset,
which would lead the kernel to generate a new ping.
If we can patch the kernel so that a device close
does not generate a ping, then we do not need to
close the device when a test fails, and can simply
stop petting the device, as we've been doing.
Signed-off-by: David Teigland <teigland@redhat.com>