sanlock: do not close connection in error handling
When processing a client connection, a problem with the message
or with the client handling would cause the sanlock daemon to
close the client connection (the socket fd) and release any
leases if the connection was "registered". If this happened,
the client may be unaware of it, and may continue running,
using the leases that have been dropped. This could lead to
different clients on different hosts believing that they hold
the same lease concurrently.
A known cause of this is when a sanlock client program makes
libsanlock calls on a registered connection concurrently
from multiple threads without serialization. These calls are:
sanlock_acquire, sanlock_release, sanlock_inquire,
sanlock_convert, sanlock_restrict, sanlock_killpath.
These calls involve:
- sending a header to the sanlock daemon
- sending a body to the sanlock daemon
- receving a reply from the sanlock daemon
If these steps are interleaved from multiple threads, the
sanlock daemon will read incorrect data when processing
a request, or the wrong result could be received by the caller.
A specific example that's been seen involves two concurrent
sanlock_release calls from different threads. The proper
sequence would be:
sanlock_release_1
send header_1
send body_1
recv result_1
sanlock_release_2
send header_2
send body_2
recv result_2
Without locking, data from both requests are interleaved,
causing a sequence to be received in the daemon:
header_1
header_2
body_1
The sanlock daemon expects the correct sequence of data,
so it misinterprets the mixed data and reports errors when
it finds unexpected fields in what it thinks are headers
and body structs.
The sanlock daemon recvs header_1, then recvs header_2,
and thinks header_2 is body_1. When it finds an unknown
resource name in what it thinks is body_1 (really header_2),
it logs an error to sanlock.log:
cmd_release 19,86,41799 no resource ...
Then the sanlock dameon recvs data from body_1, and thinks
it is header_2. When it finds an invalid magic number in
header_2 (really body_1), it logs an error to sanlock.log:
ci 19 recv 32 magic 0 vs 4282010
The error path after seeing a bad magic number in a message
header is to call deadfn(). For a registered connection,
this is client_pid_dead() which closes the socket fd for
the client and drops leases held by the client.
Closing the connection is the wrong way to handle a bad message
because the client is still running, and a closed connection
implies that a registered client has exited.
The fix is to simply ignore the bad data (logging an error).
By ignoring the messages, the sanlock clients will likely be
stuck waiting for replies, or possibly receive errors from
their calls. So, this fix only prevents leases from being
dropped incorrectly. Clients must still serialize access
to sockets.
Other error conditions in processing a client connection also
use this same incorrect error handling, and they are also
changed to simply ignore the issue and log an error.