c325044 sanlock: change paxos_acquire error for initial host state

Authored and Committed by teigland 7 months ago
    sanlock: change paxos_acquire error for initial host state
    
    paxos_acquire will return IDLIVE for a dead host if the acquire happens
    soon after add_lockspace, before two host renewal checks have happened.
    This differs from the normal behavior of paxos_acquire, where it will
    wait for the dead host to time out, and then run a ballot to acquire the
    lease.  The difference in behavior depends on how long after
    add_lockspace the paxos_acquire is called, which can make it complicated
    for applications to know how to respond (for example, whether to
    retry the acquire for a while).
    
    Internally:
    
    After add_lockspace, the first time check_other_leases() (checking
    renewals of other host_ids) is called, first_check and last_check are
    both set to 'now', and last_live is also set to 'now' because no
    previous renewal timestamp from that host_id has been recorded.  This
    also applies to the host_id of a dead host.
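
    The first-check behavior described above can be sketched roughly as
    follows.  This is a simplified model, not the sanlock source: the
    struct host_state, its timestamp field, and the check_host() helper
    are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical, simplified per-host record: the three fields named in
 * the commit message plus the last lease timestamp read from disk. */
struct host_state {
    uint64_t first_check;  /* time of the first renewal check */
    uint64_t last_check;   /* time of the most recent renewal check */
    uint64_t last_live;    /* last time the host was seen (or assumed) live */
    uint64_t timestamp;    /* last host_id lease timestamp seen on disk */
};

/* Simplified model of one renewal check for a single host_id. */
static void check_host(struct host_state *hs, uint64_t disk_timestamp,
                       uint64_t now)
{
    if (!hs->first_check)
        hs->first_check = now;
    hs->last_check = now;

    /* On the first check nothing has been recorded yet, so last_live is
     * set to 'now' even for a dead host.  On later checks it only moves
     * forward when the on-disk timestamp has changed. */
    if (!hs->last_live || hs->timestamp != disk_timestamp) {
        hs->last_live = now;
        hs->timestamp = disk_timestamp;
    }
}
```

    In this model, a dead host whose on-disk timestamp never changes ends
    its first check with first_check, last_check and last_live all equal;
    only on the second check does last_check move ahead of last_live.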
    
    The core of the issue is that after add_lockspace, a dead host looks no
    different from a live host.  The dead/alive difference is based on
    seeing the host_id lease timestamp change.  Until at least a renewal
    period after add_lockspace, the timestamp won't be changed, so there's
    no basis for concluding dead vs alive.  In most cases, it is only
    safe to treat this unknown case as alive.  But here, the two ways
    of handling the unknown case are to fail and retry for a while, or
    to block for a while.
    
    Before change:
    
    paxos_acquire is called just after add_lockspace, and the
    liveness check of the current lease owner was:
      last_live is non-zero, and last_check equals last_live.
    
    A dead host is considered alive by that condition, causing
    paxos_acquire to return IDLIVE.  It would need to be retried
    for a period of another renewal interval, before paxos_acquire
    would wait for the dead host to time out.
    A live host is also considered alive, and IDLIVE is correctly
    returned immediately.
    
    After change:
    
    paxos_acquire is called just after add_lockspace, and the
    liveness check of the current lease owner is:
      last_live is non-zero, and last_check equals last_live, and
      first_check is not equal to last_live.
    
    A dead host would not be considered alive by that condition,
    causing paxos_acquire to wait for the dead host to time out,
    just as if acquire was called long after add_lockspace.
    A live host would also not be considered alive, causing
    paxos_acquire to wait until another renewal from the live
    host is seen before returning IDLIVE.  This may be an
    unwelcome change: where previously the correct result
    was returned immediately, now the correct result is only
    returned after a delay.  However, SANLK_ACQUIRE_OWNER_NOWAIT
    can be used in sanlock_acquire() to avoid waiting for a dead
    host's lease to expire, and may already be used by callers
    wanting to avoid blocking in sanlock_acquire for long periods
    waiting for lease expiration.
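
    The two liveness conditions described above can be compared side by
    side in a small sketch.  The struct host_times and both function
    names are hypothetical; only the field names and the conditions
    themselves come from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three times used by the check. */
struct host_times {
    uint64_t first_check;
    uint64_t last_check;
    uint64_t last_live;
};

/* Condition before the change. */
static bool alive_before_change(const struct host_times *t)
{
    return t->last_live != 0 && t->last_check == t->last_live;
}

/* Condition after the change: additionally require that last_live has
 * moved past the first check, i.e. a renewal has actually been seen. */
static bool alive_after_change(const struct host_times *t)
{
    return t->last_live != 0 && t->last_check == t->last_live &&
           t->first_check != t->last_live;
}
```

    Right after the first check (all three times equal), the old
    condition reports the host as alive and the new one does not, for
    dead and live hosts alike.  Once a later check sees a renewal
    (last_check and last_live equal, both later than first_check), both
    conditions report alive.  For a dead host on a later check
    (last_check later than last_live), both report not alive.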
    
        
file modified
+21 -1