sanlock: change paxos_acquire error for initial host state
paxos_acquire will return IDLIVE for a dead host if the acquire happens
soon after add_lockspace, before two host renewal checks have happened.
This differs from the normal behavior of paxos_acquire, where it will
wait for the dead host to time out, and then run a ballot to acquire the
lease. The difference in behavior depends on how long after
add_lockspace the paxos_acquire is called, which makes it hard for
applications to know how to respond (for example, they may want to
retry the acquire for a while).
Internally:
After add_lockspace, the first time check_other_leases() (checking
renewals of other host_ids) is called, first_check and last_check are
both set to 'now', and last_live is also set to 'now' because no
previous renewal timestamp from that host_id has been recorded. This
also applies to the host_id of a dead host.
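As a rough illustration of the bookkeeping described above (a sketch
only, not the actual check_other_leases() code; the struct layout, the
helper name and the "timestamp" field are assumptions made for this
example):

```c
#include <stdint.h>

/* Hypothetical per-host_id state; the first three names follow the text. */
struct host_status_sketch {
	uint64_t first_check; /* time of the first renewal check */
	uint64_t last_check;  /* time of the most recent renewal check */
	uint64_t last_live;   /* last time the host was treated as renewed */
	uint64_t timestamp;   /* assumed: last lease timestamp read from disk */
};

/* One renewal check of another host_id, as described in the text. */
static void check_host_sketch(struct host_status_sketch *hs,
			      uint64_t disk_timestamp, uint64_t now)
{
	if (!hs->first_check)
		hs->first_check = now;
	hs->last_check = now;

	/*
	 * On the first check nothing has been recorded yet, so last_live
	 * is set to now even for a dead host.  Later it only advances
	 * when the on-disk timestamp is seen to change.
	 */
	if (!hs->last_live || hs->timestamp != disk_timestamp)
		hs->last_live = now;

	hs->timestamp = disk_timestamp;
}
```

With this sketch, a dead host and a live host have identical state
after the first check, and only diverge once a second check has run.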
The core of the issue is that after add_lockspace, a dead host looks no
different from a live host. The dead/alive difference is based on
seeing the host_id lease timestamp change. Until at least a renewal
period after add_lockspace, the timestamp won't be changed, so there's
no basis for concluding dead vs alive. In most cases, the only safe
choice is to treat this unknown case as alive. Here, however, the
two ways of handling the unknown case are fail and retry for a while
vs block for a while.
Before change:
paxos_acquire is called just after add_lockspace, and the
liveness check of the current lease owner was:
last_live is non-zero, and last_check equals last_live.
A dead host is considered alive by that condition, causing
paxos_acquire to return IDLIVE. The acquire would need to be
retried for about another renewal interval before paxos_acquire
would wait for the dead host to time out.
A live host is also considered alive, and IDLIVE is correctly
returned immediately.
After change:
paxos_acquire is called just after add_lockspace, and the
liveness check of the current lease owner is:
last_live is non-zero, and last_check equals last_live, and
first_check is not equal to last_live.
A dead host would not be considered alive by that condition,
causing paxos_acquire to wait for the dead host to time out,
just as if acquire was called long after add_lockspace.
A live host would also not be considered alive, causing
paxos_acquire to wait until another renewal from the live
host is seen before returning IDLIVE. This may be an
unwelcome change: previously the correct result was
returned immediately, and now it is only returned after
a delay. However, SANLK_ACQUIRE_OWNER_NOWAIT
can be used in sanlock_acquire() to avoid waiting for a dead
host's lease to expire, and may already be used by callers
wanting to avoid blocking in sanlock_acquire for long periods
waiting for lease expiration.
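
The two liveness conditions can be compared side by side. This is a
minimal sketch, assuming a hypothetical struct holding the three
timestamps named above; the function names are invented for the
example and are not the names used in the sanlock source:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical holder for the three timestamps named in the text. */
struct host_times_sketch {
	uint64_t first_check;
	uint64_t last_check;
	uint64_t last_live;
};

/* Condition before the change. */
static bool seen_alive_before(const struct host_times_sketch *t)
{
	return t->last_live && t->last_check == t->last_live;
}

/*
 * Condition after the change: a host whose only "live" observation is
 * the first check itself is no longer treated as alive.
 */
static bool seen_alive_after(const struct host_times_sketch *t)
{
	return t->last_live && t->last_check == t->last_live &&
	       t->first_check != t->last_live;
}
```

Right after the first check all three timestamps are equal, so the old
condition is true and the new one is false for any host. After a second
check, a live host has last_check equal to last_live and both later than
first_check, so both conditions are true. A dead host has last_check
later than last_live, so both conditions are false.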