From 4759dd1e97ef4ca0aaa5587f4b299f6f141c7f67 Mon Sep 17 00:00:00 2001 From: David Teigland Date: Dec 18 2023 14:45:24 +0000 Subject: update README.rst --- diff --git a/README.rst b/README.rst index 59df2f1..a3feb1e 100644 --- a/README.rst +++ b/README.rst @@ -2,19 +2,20 @@ See https://pagure.io/sanlock Mailing list https://lists.fedorahosted.org/admin/lists/sanlock-devel.lists.fedorahosted.org/ -From sanlock(8) at sanlock.git/src/sanlock.8 +See sanlock(8) from sanlock.git/src/sanlock.8 +and wdmd(8) from sanlock.git/wdmd/wdmd.8 :: - SANLOCK(8) System Manager's Manual SANLOCK(8) +SANLOCK(8) System Manager's Manual SANLOCK(8) - NAME +NAME sanlock - shared storage lock manager - SYNOPSIS +SYNOPSIS sanlock [COMMAND] [ACTION] ... - DESCRIPTION +DESCRIPTION sanlock is a lock manager built on shared storage. Hosts with access to the storage can perform locking. An application running on the hosts is given a small amount of space on the shared block device or @@ -22,193 +23,192 @@ From sanlock(8) at sanlock.git/src/sanlock.8 tion. Internally, the sanlock daemon manages locks using two disk- based lease algorithms: delta leases and paxos leases. - · delta leases are slow to acquire and demand regular i/o to shared + • delta leases are slow to acquire and demand regular i/o to shared storage. sanlock only uses them internally to hold a lease on its "host_id" (an integer host identifier from 1-2000). They prevent two - hosts from using the same host identifier. The delta lease renewals + hosts from using the same host identifier. The delta lease renewals also indicate if a host is alive. ("Light-Weight Leases for Storage- Centric Coordination", Chockler and Malkhi.) - · paxos leases are fast to acquire and sanlock makes them available to - applications as general purpose resource leases. 
The disk paxos - algorithm uses host_id's internally to represent different hosts, and + • paxos leases are fast to acquire and sanlock makes them available to + applications as general purpose resource leases. The disk paxos al‐ + gorithm uses host_id's internally to represent different hosts, and the owner of a paxos lease. delta leases provide unique host_id's for implementing paxos leases, and delta lease renewals serve as a proxy for paxos lease renewal. ("Disk Paxos", Eli Gafni and Leslie Lamport.) Externally, the sanlock daemon exposes a locking interface through lib‐ - sanlock in terms of "lockspaces" and "resources". A lockspace is a - locking context that an application creates for itself on shared stor‐ - age. When the application on each host is started, it "joins" the + sanlock in terms of "lockspaces" and "resources". A lockspace is a + locking context that an application creates for itself on shared stor‐ + age. When the application on each host is started, it "joins" the lockspace. It can then create "resources" on the shared storage. Each resource represents an application-specific entity. The application can acquire and release leases on resources. To use sanlock from an application: - · Allocate shared storage for an application, e.g. a shared LUN or LV + • Allocate shared storage for an application, e.g. a shared LUN or LV from a SAN, or files from NFS. - · Provide the storage to the application. + • Provide the storage to the application. - · The application uses this storage with libsanlock to create a - lockspace and resources for itself. + • The application uses this storage with libsanlock to create a lock‐ + space and resources for itself. - · The application joins the lockspace when it starts. + • The application joins the lockspace when it starts. - · The application acquires and releases leases on resources. + • The application acquires and releases leases on resources. 
How lockspaces and resources translate to delta leases and paxos leases within sanlock: Lockspaces - · A lockspace is based on delta leases held by each host using the + • A lockspace is based on delta leases held by each host using the lockspace. - · A lockspace is a series of 2000 delta leases on disk, and requires + • A lockspace is a series of 2000 delta leases on disk, and requires 1MB of storage. (See Storage below for size variations.) - · A lockspace can support up to 2000 concurrent hosts using it, each + • A lockspace can support up to 2000 concurrent hosts using it, each using a different delta lease. - · Applications can i) create, ii) join and iii) leave a lockspace, + • Applications can i) create, ii) join and iii) leave a lockspace, which corresponds to i) initializing the set of delta leases on disk, ii) acquiring one of the delta leases and iii) releasing the delta lease. - · When a lockspace is created, a unique lockspace name and disk loca‐ + • When a lockspace is created, a unique lockspace name and disk loca‐ tion is provided by the application. - · When a lockspace is created/initialized, sanlock formats the sequence - of 2000 on-disk delta lease structures on the file or disk, e.g. + • When a lockspace is created/initialized, sanlock formats the sequence + of 2000 on-disk delta lease structures on the file or disk, e.g. /mnt/leasefile (NFS) or /dev/vg/lv (SAN). - · The 2000 individual delta leases in a lockspace are identified by + • The 2000 individual delta leases in a lockspace are identified by number: 1,2,3,...,2000. - · Each delta lease is a 512 byte sector in the 1MB lockspace, offset by + • Each delta lease is a 512 byte sector in the 1MB lockspace, offset by its number, e.g. delta lease 1 is offset 0, delta lease 2 is offset 512, delta lease 2000 is offset 1023488. (See Storage below for size variations.) 
- · When an application joins a lockspace, it must specify the lockspace - name, the lockspace location on shared disk/file, and the local - host's host_id. sanlock then acquires the delta lease corresponding - to the host_id, e.g. joining the lockspace with host_id 1 acquires + • When an application joins a lockspace, it must specify the lockspace + name, the lockspace location on shared disk/file, and the local + host's host_id. sanlock then acquires the delta lease corresponding + to the host_id, e.g. joining the lockspace with host_id 1 acquires delta lease 1. - · The terms delta lease, lockspace lease, and host_id lease are used - interchangably. + • The terms delta lease, lockspace lease, and host_id lease are used + interchangeably. - · sanlock acquires a delta lease by writing the host's unique name to + • sanlock acquires a delta lease by writing the host's unique name to the delta lease disk sector, reading it back after a delay, and veri‐ fying it is the same. - · If a unique host name is not specified, sanlock generates a uuid to - use as the host's name. The delta lease algorithm depends on hosts - using unique names. + • If a unique host name is not specified, sanlock uses the product_uuid + if one is available, otherwise generates a uuid to use as the host's + name. The delta lease algorithm depends on hosts using unique names. - · The application on each host should be configured with a unique + • The application on each host should be configured with a unique host_id, where the host_id is an integer 1-2000. - · If hosts are misconfigured and have the same host_id, the delta lease + • If hosts are misconfigured and have the same host_id, the delta lease algorithm is designed to detect this conflict, and only one host will be able to acquire the delta lease for that host_id. 
- · A delta lease ensures that a lockspace host_id is being used by a + • A delta lease ensures that a lockspace host_id is being used by a single host with the unique name specified in the delta lease. - · Resolving delta lease conflicts is slow, because the algorithm is - based on waiting and watching for some time for other hosts to write - to the same delta lease sector. If multiple hosts try to use the - same delta lease, the delay is increased substantially. So, it is - best to configure applications to use unique host_id's that will not + • Resolving delta lease conflicts is slow, because the algorithm is + based on waiting and watching for some time for other hosts to write + to the same delta lease sector. If multiple hosts try to use the + same delta lease, the delay is increased substantially. So, it is + best to configure applications to use unique host_id's that will not conflict. - · After sanlock acquires a delta lease, the lease must be renewed until + • After sanlock acquires a delta lease, the lease must be renewed until the application leaves the lockspace (which corresponds to releasing the delta lease on the host_id.) - · sanlock renews delta leases every 20 seconds (by default) by writing + • sanlock renews delta leases every 20 seconds (by default) by writing a new timestamp into the delta lease sector. - · When a host acquires a delta lease in a lockspace, it can be referred - to as "joining" the lockspace. Once it has joined the lockspace, it + • When a host acquires a delta lease in a lockspace, it can be referred + to as "joining" the lockspace. Once it has joined the lockspace, it can use resources associated with the lockspace. Resources - · A lockspace is a context for resources that can be locked and - unlocked by an application. + • A lockspace is a context for resources that can be locked and un‐ + locked by an application. - · sanlock uses paxos leases to implement leases on resources. 
The - terms paxos lease and resource lease are used interchangably. + • sanlock uses paxos leases to implement leases on resources. The + terms paxos lease and resource lease are used interchangeably. - · A paxos lease exists on shared storage and requires 1MB of space. It + • A paxos lease exists on shared storage and requires 1MB of space. It contains a unique resource name and the name of the lockspace. - · An application assigns its own meaning to a sanlock resource and the + • An application assigns its own meaning to a sanlock resource and the leases on it. A sanlock resource could represent some shared object like a file, or some unique role among the hosts. - · Resource leases are associated with a specific lockspace and can only - be used by hosts that have joined that lockspace (they are holding a + • Resource leases are associated with a specific lockspace and can only + be used by hosts that have joined that lockspace (they are holding a delta lease on a host_id in that lockspace.) - · An application must keep track of the disk locations of its - lockspaces and resources. sanlock does not maintain any persistent - index or directory of lockspaces or resources that have been created - by applications, so applications need to remember where they have - placed their own leases (which files or disks and offsets). + • An application must keep track of the disk locations of its lock‐ + spaces and resources. sanlock does not maintain any persistent index + or directory of lockspaces or resources that have been created by ap‐ + plications, so applications need to remember where they have placed + their own leases (which files or disks and offsets). - · sanlock does not renew paxos leases directly (although it could). - Instead, the renewal of a host's delta lease represents the renewal - of all that host's paxos leases in the associated lockspace. In - effect, many paxos lease renewals are factored out into one delta - lease renewal. 
This reduces i/o when many paxos leases are used. + • sanlock does not renew paxos leases directly (although it could). + Instead, the renewal of a host's delta lease represents the renewal + of all that host's paxos leases in the associated lockspace. In ef‐ + fect, many paxos lease renewals are factored out into one delta lease + renewal. This reduces i/o when many paxos leases are used. - · The disk paxos algorithm allows multiple hosts to all attempt to - acquire the same paxos lease at once, and will produce a single win‐ + • The disk paxos algorithm allows multiple hosts to all attempt to ac‐ + quire the same paxos lease at once, and will produce a single win‐ ner/owner of the resource lease. (Shared resource leases are also possible in addition to the default exclusive leases.) - · The disk paxos algorithm involves a specific sequence of reading and + • The disk paxos algorithm involves a specific sequence of reading and writing the sectors of the paxos lease disk area. Each host has a dedicated 512 byte sector in the paxos lease disk area where it writes its own "ballot", and each host reads the entire disk area to see the ballots of other hosts. The first sector of the disk area is - the "leader record" that holds the result of the last paxos ballot. + the "leader record" that holds the result of the last paxos ballot. The winner of the paxos ballot writes the result of the ballot to the leader record (the winner of the ballot may have selected another contending host as the owner of the paxos lease.) - · After a paxos lease is acquired, no further i/o is done in the paxos + • After a paxos lease is acquired, no further i/o is done in the paxos lease disk area. - · Releasing the paxos lease involves writing a single sector to clear + • Releasing the paxos lease involves writing a single sector to clear the current owner in the leader record. 
- · If a host holding a paxos lease fails, the disk area of the paxos + • If a host holding a paxos lease fails, the disk area of the paxos lease still indicates that the paxos lease is owned by the failed host. If another host attempts to acquire the paxos lease, and finds - the lease is held by another host_id, it will check the delta lease + the lease is held by another host_id, it will check the delta lease of that host_id. If the delta lease of the host_id is being renewed, then the paxos lease is owned and cannot be acquired. If the delta - lease of the owner's host_id has expired, then the paxos lease is - expired and can be taken (by going through the paxos lease algo‐ - rithm.) + lease of the owner's host_id has expired, then the paxos lease is ex‐ + pired and can be taken (by going through the paxos lease algorithm.) - · The "interaction" or "awareness" between hosts of each other is lim‐ + • The "interaction" or "awareness" between hosts of each other is lim‐ ited to the case where they attempt to acquire the same paxos lease, and need to check if the referenced delta lease has expired or not. - · When hosts do not attempt to lock the same resources concurrently, + • When hosts do not attempt to lock the same resources concurrently, there is no host interaction or awareness. The state or actions of one host have no effect on others. - · To speed up checking delta lease expiration (in the case of a paxos + • To speed up checking delta lease expiration (in the case of a paxos lease conflict), sanlock keeps track of past renewals of other delta leases in the lockspace. @@ -216,72 +216,72 @@ From sanlock(8) at sanlock.git/src/sanlock.8 The resource index (rindex) is an optional sanlock feature that appli‐ cations can use to keep track of resource lease offsets. Without the - rindex, an application must keep track of where its resource leases - exist on disk and find available locations when creating new leases. 
+ rindex, an application must keep track of where its resource leases ex‐ + ist on disk and find available locations when creating new leases. - The sanlock rindex uses two align-size areas on disk following the - lockspace. The first area holds rindex entries; each entry records a - resource lease name and location. The second area holds a private + The sanlock rindex uses two align-size areas on disk following the + lockspace. The first area holds rindex entries; each entry records a + resource lease name and location. The second area holds a private paxos lease, used by sanlock internally to protect rindex updates. - The application creates the rindex on disk with the "format" function. - Format is a disk-only operation and does not interact with the live - lockspace, so it can be called without first calling add_lockspace. + The application creates the rindex on disk with the "format" function. + Format is a disk-only operation and does not interact with the live + lockspace, so it can be called without first calling add_lockspace. The application needs to follow the convention of writing the lockspace at the start of the device (offset 0) and formatting the rindex immedi‐ - ately following the lockspace area. When formatting, the application - must set flags for sector size and align size to match those for the + ately following the lockspace area. When formatting, the application + must set flags for sector size and align size to match those for the lockspace. To use the rindex, the application: - · Uses the "create" function to create a new resource lease on disk. - This takes the place of the write_resource function. The create - function requires the location of the rindex and the name of the new - resource lease. sanlock finds a free lease area, writes the new - resource lease at that location, updates the rindex with the - name:offset, and returns the offset to the caller. The caller uses - this offset when acquiring the resource lease. 
+ • Uses the "create" function to create a new resource lease on disk. + This takes the place of the write_resource function. The create + function requires the location of the rindex and the name of the new + resource lease. sanlock finds a free lease area, writes the new re‐ + source lease at that location, updates the rindex with the name:off‐ + set, and returns the offset to the caller. The caller uses this off‐ + set when acquiring the resource lease. - · Uses the "delete" function to remove a resource disk on disk (also + • Uses the "delete" function to remove a resource lease on disk (also corresponding to the write_resource function.) sanlock clears the resource lease and the rindex entry for it. A subsequent call to create may use this same disk location for a different resource lease. - · Uses the "lookup" function to discover the offset of a resource lease - given the resource lease name. The caller would typically call this + • Uses the "lookup" function to discover the offset of a resource lease + given the resource lease name. The caller would typically call this prior to acquiring the resource lease. - · Uses the "rebuild" function to recreate the rindex if it is damaged - or becomes inconsistent. This function scans the disk for resource + • Uses the "rebuild" function to recreate the rindex if it is damaged + or becomes inconsistent. This function scans the disk for resource leases and creates new rindex entries to match the leases it finds. - · The "update" function manipulates rindex entries directly and should + • The "update" function manipulates rindex entries directly and should not normally be used by the application. In normal usage, the create and delete functions manipulate rindex entries. Update is mainly useful for testing or repairs. Expiration - · If a host fails to renew its delta lease, e.g. it looses access to + • If a host fails to renew its delta lease, e.g. 
it loses access to the storage, its delta lease will eventually expire and another host will be able to take over any resource leases held by the host. san‐ - lock must ensure that the application on two different hosts is not + lock must ensure that the application on two different hosts is not holding and using the same lease concurrently. - · When sanlock has failed to renew a delta lease for a period of time, - it will begin taking measures to stop local processes (applications) + • When sanlock has failed to renew a delta lease for a period of time, + it will begin taking measures to stop local processes (applications) from using any resource leases associated with the expiring lockspace delta lease. sanlock enters this "recovery mode" well ahead of the time when another host could take over the locally owned leases. sanlock must have sufficient time to stop all local processes that are using the expiring leases. - · sanlock uses three methods to stop local processes that are using - expiring leases: + • sanlock uses three methods to stop local processes that are using ex‐ + piring leases: - 1. Graceful shutdown. sanlock will execute a "graceful shutdown" + 1. Graceful shutdown. sanlock will execute a "graceful shutdown" program that the application previously specified for this case. The shutdown program tells the application to shut down because its leases are expiring. The application must respond by stopping its @@ -292,27 +292,27 @@ From sanlock(8) at sanlock.git/src/sanlock.8 next method of stopping. 2. Forced shutdown. sanlock will send SIGKILL to processes using the - expiring leases. The processes have a fixed amount of time to exit - after receiving SIGKILL. If any do not exit in this time, sanlock + expiring leases. The processes have a fixed amount of time to exit + after receiving SIGKILL. If any do not exit in this time, sanlock will proceed to the next method. - 3. Host reset. 
sanlock will trigger the host's watchdog device to - forcibly reset it. sanlock carefully manages the timing of the - watchdog device so that it fires shortly before any other host could + 3. Host reset. sanlock will trigger the host's watchdog device to + forcibly reset it. sanlock carefully manages the timing of the + watchdog device so that it fires shortly before any other host could take over the resource leases held by local processes. Failures - If a process holding resource leases fails or exits without releasing - its leases, sanlock will release the leases for it automatically - (unless persistent resource leases were used.) + If a process holding resource leases fails or exits without releasing + its leases, sanlock will release the leases for it automatically (un‐ + less persistent resource leases were used.) - If the sanlock daemon cannot renew a lockspace delta lease for a spe‐ - cific period of time (see Expiration), sanlock will enter "recovery - mode" where it attempts to stop and/or kill any processes holding - resource leases in the expiring lockspace. If the processes do not - exit in time, sanlock will force the host to be reset using the local - watchdog device. + If the sanlock daemon cannot renew a lockspace delta lease for a spe‐ + cific period of time (see Expiration), sanlock will enter "recovery + mode" where it attempts to stop and/or kill any processes holding re‐ + source leases in the expiring lockspace. If the processes do not exit + in time, sanlock will force the host to be reset using the local watch‐ + dog device. If the sanlock daemon crashes or hangs, it will not renew the expiry time of the per-lockspace connections it had to the wdmd daemon. This @@ -322,18 +322,18 @@ From sanlock(8) at sanlock.git/src/sanlock.8 Watchdog sanlock uses the wdmd(8) daemon to access /dev/watchdog. wdmd multi‐ - plexes multiple timeouts onto the single watchdog timer. 
This is - required because delta leases for each lockspace are renewed and expire + plexes multiple timeouts onto the single watchdog timer. This is re‐ + quired because delta leases for each lockspace are renewed and expire independently. - sanlock maintains a wdmd connection for each lockspace delta lease - being renewed. Each connection has an expiry time for some seconds in + sanlock maintains a wdmd connection for each lockspace delta lease be‐ + ing renewed. Each connection has an expiry time for some seconds in the future. After each successful delta lease renewal, the expiry time - is renewed for the associated wdmd connection. If wdmd finds any con‐ - nection expired, it will not renew the /dev/watchdog timer. Given - enough successive failed renewals, the watchdog device will fire and - reset the host. (Given the multiplexing nature of wdmd, shorter over‐ - lapping renewal failures from multiple lockspaces could cause spurious + is renewed for the associated wdmd connection. If wdmd finds any con‐ + nection expired, it will not renew the /dev/watchdog timer. Given + enough successive failed renewals, the watchdog device will fire and + reset the host. (Given the multiplexing nature of wdmd, shorter over‐ + lapping renewal failures from multiple lockspaces could cause spurious watchdog firing.) The direct link between delta lease renewals and watchdog renewals pro‐ @@ -342,17 +342,17 @@ From sanlock(8) at sanlock.git/src/sanlock.8 the watchdog on another host has fired based on the delta lease time. Furthermore, if the watchdog device on another host fails to fire when it should, the continuation of delta lease renewals from the other host - will make this evident and prevent leases from being taken from the + will make this evident and prevent leases from being taken from the failed host. - If sanlock is able to stop/kill all processing using an expiring - lockspace, the associated wdmd connection for that lockspace is - removed. 
The expired wdmd connection will no longer block /dev/watch‐ - dog renewals, and the host should avoid being reset. + If sanlock is able to stop/kill all processing using an expiring lock‐ + space, the associated wdmd connection for that lockspace is removed. + The expired wdmd connection will no longer block /dev/watchdog re‐ + newals, and the host should avoid being reset. Storage - The sector size and the align size should be specified when creating + The sector size and the align size should be specified when creating lockspaces and resources (and rindex). The "align size" is the size on disk of a lockspace or a resource, i.e. the amount of disk space it uses. Lockspaces and resources should use matching sector and align @@ -372,19 +372,19 @@ From sanlock(8) at sanlock.git/src/sanlock.8 sector_size 4096, align_size 8M, max_hosts 2000 When sector_size and align_size are not specified, the behavior matches - the behavior before these sizes could be configured: on devices which - report sector size 512, 512/1M/2000 is used, on devices which report - sector size 4096, 4096/8M/2000 is used, and on files, 512/1M/2000 is - always used. (Other combinations are not compatible with sanlock ver‐ + the behavior before these sizes could be configured: on devices which + report sector size 512, 512/1M/2000 is used, on devices which report + sector size 4096, 4096/8M/2000 is used, and on files, 512/1M/2000 is + always used. (Other combinations are not compatible with sanlock ver‐ sion 3.6 or earlier.) - Using sanlock on shared block devices that do host based mirroring or - replication is not likely to work correctly. When using sanlock on + Using sanlock on shared block devices that do host based mirroring or + replication is not likely to work correctly. When using sanlock on shared files, all sanlock io should go to one file server. 
Example - This is an example of creating and using lockspaces and resources from + This is an example of creating and using lockspaces and resources from the command line. (Most applications would use sanlock through libsan‐ lock rather than through the command line.) @@ -446,9 +446,9 @@ From sanlock(8) at sanlock.git/src/sanlock.8 8. Acquire resource leases for the application on host2. - Acquiring the exclusive lease on the first resource will fail - because it is held by host1. Acquiring the shared lease on the - second resource will succeed. + Acquiring the exclusive lease on the first resource will fail be‐ + cause it is held by host1. Acquiring the shared lease on the sec‐ + ond resource will succeed. # export P=`pidof sleep` # sanlock client acquire -r test:RA:/dev/leases:1048576 -p $P @@ -473,7 +473,7 @@ From sanlock(8) at sanlock.git/src/sanlock.8 # sanlock shutdown - OPTIONS +OPTIONS COMMAND can be one of three primary top level choices sanlock daemon start daemon @@ -497,12 +497,16 @@ From sanlock(8) at sanlock.git/src/sanlock.8 -G gid group id + -H num renewal history size + -t num max worker threads -g sec seconds for graceful recovery -w 0|1 use watchdog through wdmd + -o sec io timeout + -h 0|1 use high priority (RR) scheduling -l num use mlockall (0 none, 1 current, 2 current and future) @@ -517,21 +521,21 @@ From sanlock(8) at sanlock.git/src/sanlock.8 sanlock client status Print processes, lockspaces, and resources being managed by the sanlock - daemon. Add -D to show extra internal daemon status for debugging. - Add -o p to show resources by pid, or -o s to show resources by - lockspace. + daemon. Add -D to show extra internal daemon status for debugging. + Add -o p to show resources by pid, or -o s to show resources by lock‐ + space. sanlock client host_status - Print state of host_id delta leases read during the last renewal. - State of all lockspaces is shown (use -s to select one). 
Add -D to + Print state of host_id delta leases read during the last renewal. + State of all lockspaces is shown (use -s to select one). Add -D to show extra internal daemon status for debugging. sanlock client gets - Print lockspaces being managed by the sanlock daemon. The LOCKSPACE - string will be followed by ADD or REM if the lockspace is currently - being added or removed. Add -h 1 to also show hosts in each lockspace. + Print lockspaces being managed by the sanlock daemon. The LOCKSPACE + string will be followed by ADD or REM if the lockspace is currently be‐ + ing added or removed. Add -h 1 to also show hosts in each lockspace. sanlock client renewal -s LOCKSPACE @@ -546,19 +550,19 @@ From sanlock(8) at sanlock.git/src/sanlock.8 Ask the sanlock daemon to exit. Without the force option (-f 0), the command will be ignored if any lockspaces exist. With the force option - (-f 1), any registered processes will be killed, their resource leases - released, and lockspaces removed. With the wait option (-w 1), the - command will wait for a result from the daemon indicating that it has - shut down and is exiting, or cannot shut down because lockspaces exist + (-f 1), any registered processes will be killed, their resource leases + released, and lockspaces removed. With the wait option (-w 1), the + command will wait for a result from the daemon indicating that it has + shut down and is exiting, or cannot shut down because lockspaces exist (command fails). sanlock client init -s LOCKSPACE - Tell the sanlock daemon to initialize a lockspace on disk. The -o - option can be used to specify the io timeout to be written in the - host_id leases. The -Z and -A options can be used to specify the sec‐ - tor size and align size, and both should be set together. (Also see - sanlock direct init.) + Tell the sanlock daemon to initialize a lockspace on disk. The -o op‐ + tion can be used to specify the io timeout to be written in the host_id + leases. 
The -Z and -A options can be used to specify the sector size + and align size, and both should be set together. (Also see sanlock di‐ + rect init.) sanlock client init -r RESOURCE @@ -568,8 +572,8 @@ From sanlock(8) at sanlock.git/src/sanlock.8 sanlock client read -s LOCKSPACE - Tell the sanlock daemon to read a lockspace from disk. Only the - LOCKSPACE path and offset are required. If host_id is zero, the first + Tell the sanlock daemon to read a lockspace from disk. Only the LOCK‐ + SPACE path and offset are required. If host_id is zero, the first record at offset (host_id 1) is used. The complete LOCKSPACE is printed. Add -D to print other details. (Also see sanlock direct read_leader.) @@ -583,10 +587,10 @@ From sanlock(8) at sanlock.git/src/sanlock.8 sanlock client add_lockspace -s LOCKSPACE - Tell the sanlock daemon to acquire the specified host_id in the - lockspace. This will allow resources to be acquired in the lockspace. - The -o option can be used to specify the io timeout of the acquiring - host, and will be written in the host_id lease. + Tell the sanlock daemon to acquire the specified host_id in the lock‐ + space. This will allow resources to be acquired in the lockspace. The + -o option can be used to specify the io timeout of the acquiring host, + and will be written in the host_id lease. sanlock client inq_lockspace -s LOCKSPACE @@ -595,9 +599,9 @@ From sanlock(8) at sanlock.git/src/sanlock.8 sanlock client rem_lockspace -s LOCKSPACE - Tell the sanlock daemon to release the specified host_id in the - lockspace. Any processes holding resource leases in this lockspace - will be killed, and the resource leases not released. + Tell the sanlock daemon to release the specified host_id in the lock‐ + space. Any processes holding resource leases in this lockspace will be + killed, and the resource leases not released. 
sanlock client command -r RESOURCE -c path args @@ -610,9 +614,9 @@ From sanlock(8) at sanlock.git/src/sanlock.8 Tell the sanlock daemon to acquire or release the specified resource lease for the given pid. The pid must be registered with the sanlock - daemon. acquire can optionally take a versioned RESOURCE string - RESOURCE:lver, where lver is the version of the lease that must be - acquired, or fail. + daemon. acquire can optionally take a versioned RESOURCE string RE‐ + SOURCE:lver, where lver is the version of the lease that must be ac‐ + quired, or fail. sanlock client convert -r RESOURCE -p pid @@ -630,18 +634,18 @@ From sanlock(8) at sanlock.git/src/sanlock.8 sanlock client request -r RESOURCE -f force_mode - Request the owner of a resource do something specified by force_mode. - A versioned RESOURCE:lver string must be used with a greater version + Request the owner of a resource do something specified by force_mode. + A versioned RESOURCE:lver string must be used with a greater version than is presently held. Zero lver and force_mode clears the request. sanlock client examine -r RESOURCE - Examine the request record for the currently held resource lease and + Examine the request record for the currently held resource lease and carry out the action specified by the requested force_mode. sanlock client examine -s LOCKSPACE - Examine requests for all resource leases currently held in the named + Examine requests for all resource leases currently held in the named lockspace. Only lockspace_name is used from the LOCKSPACE argument. sanlock client set_event -s LOCKSPACE -i host_id -g gen -e num -d num @@ -651,27 +655,27 @@ From sanlock(8) at sanlock.git/src/sanlock.8 its bitmap, and set the generation, event and data values in its own delta lease. An application that has registered for events from this lockspace on the destination host will get the event that has been set - when the destination sees the event during its next delta lease - renewal. 
+ when the destination sees the event during its next delta lease re‐ + newal. sanlock client set_config -s LOCKSPACE Set a configuration value for a lockspace. Only lockspace_name is used - from the LOCKSPACE argument. The USED flag has the same effect on a - lockspace as a process holding a resource lease that will not exit. - The USED_BY_ORPHANS flag means that an orphan resource lease will have + from the LOCKSPACE argument. The USED flag has the same effect on a + lockspace as a process holding a resource lease that will not exit. + The USED_BY_ORPHANS flag means that an orphan resource lease will have the same effect as the USED. -u 0|1 Set (1) or clear (0) the USED flag. -O 0|1 Set (1) or clear (0) the USED_BY_ORPHANS flag. sanlock client format -x RINDEX - Create a resource index on disk. Use -Z and -A to set the sector size + Create a resource index on disk. Use -Z and -A to set the sector size and align size to match the lockspace. sanlock client create -x RINDEX -e resource_name - Create a new resource lease on disk, using the rindex to find a free + Create a new resource lease on disk, using the rindex to find a free offset. sanlock client delete -x RINDEX -e resource_name[:offset] @@ -711,7 +715,7 @@ From sanlock(8) at sanlock.git/src/sanlock.8 max_hosts in the given space. When initializing a resource, sanlock initializes a single paxos lease in the space. With -s, the -o option specifies the io timeout to be written in the host_id leases. With -r, - the -z 1 option invalidates the resource lease on disk so it cannot be + the -z 1 option invalidates the resource lease on disk so it cannot be used until reinitialized normally. sanlock direct read_leader -s LOCKSPACE @@ -773,62 +777,62 @@ From sanlock(8) at sanlock.git/src/sanlock.8 sanlock version shows the build version. - OTHER +OTHER Request/Examine - The first part of making a request for a resource is writing the - request record of the resource (the sector following the leader - record). 
To make a successful request: + The first part of making a request for a resource is writing the re‐ + quest record of the resource (the sector following the leader record). + To make a successful request: - · RESOURCE:lver must be greater than the lver presently held by the + • RESOURCE:lver must be greater than the lver presently held by the other host. This implies the leader record must be read to discover the lver, prior to making a request. - · RESOURCE:lver must be greater than or equal to the lver presently + • RESOURCE:lver must be greater than or equal to the lver presently written to the request record. Two hosts may write a new request at the same time for the same lver, in which case both would succeed, but the force_mode from the last would win. - · The force_mode must be greater than zero. + • The force_mode must be greater than zero. - · To unconditionally clear the request record (set both lver and + • To unconditionally clear the request record (set both lver and force_mode to 0), make request with RESOURCE:0 and force_mode 0. The owner of the requested resource will not know of the request unless - it is explicitly told to examine its resources via the "examine" + it is explicitly told to examine its resources via the "examine" api/command, or otherwise notfied. - The second part of making a request is notifying the resource lease - owner that it should examine the request records of its resource - leases. The notification will cause the lease owner to automatically - run the equivalent of "sanlock client examine -s LOCKSPACE" for the + The second part of making a request is notifying the resource lease + owner that it should examine the request records of its resource + leases. The notification will cause the lease owner to automatically + run the equivalent of "sanlock client examine -s LOCKSPACE" for the lockspace of the requested resource. - The notification is made using a bitmap in each host_id delta lease. 
- Each bit represents each of the possible host_ids (1-2000). If host A - wants to notify host B to examine its resources, A sets the bit in its - own bitmap that corresponds to the host_id of B. When B next renews - its delta lease, it reads the delta leases for all hosts and checks - each bitmap to see if its own host_id has been set. It finds the bit - for its own host_id set in A's bitmap, and examines its resource - request records. (The bit remains set in A's bitmap for set_bit‐ - map_seconds.) + The notification is made using a bitmap in each host_id delta lease. + Each bit represents each of the possible host_ids (1-2000). If host A + wants to notify host B to examine its resources, A sets the bit in its + own bitmap that corresponds to the host_id of B. When B next renews + its delta lease, it reads the delta leases for all hosts and checks + each bitmap to see if its own host_id has been set. It finds the bit + for its own host_id set in A's bitmap, and examines its resource re‐ + quest records. (The bit remains set in A's bitmap for set_bitmap_sec‐ + onds.) force_mode determines the action the resource lease owner should take: - · FORCE (1): kill the process holding the resource lease. When the + • FORCE (1): kill the process holding the resource lease. When the process has exited, the resource lease will be released, and can then be acquired by anyone. The kill signal is SIGKILL (or SIGTERM if SIGKILL is restricted.) - · GRACEFUL (2): run the program configured by sanlock_killpath against + • GRACEFUL (2): run the program configured by sanlock_killpath against the process holding the resource lease. If no killpath is defined, then FORCE is used. Persistent and orphan resource leases A resource lease can be acquired with the PERSISTENT flag (-P 1). If the process holding the lease exits, the lease will not be released, - but kept on an orphan list. 
Another local process can acquire an - orphan lease using the ORPHAN flag (-O 1), or release the orphan lease + but kept on an orphan list. Another local process can acquire an or‐ + phan lease using the ORPHAN flag (-O 1), or release the orphan lease using the ORPHAN flag (-O 1). All orphan leases can be released by setting the lockspace name (-s lockspace_name) with no resource name. @@ -842,14 +846,14 @@ From sanlock(8) at sanlock.git/src/sanlock.8 For each successful renewal, a record is saved that includes: - · the timestamp written in the delta lease by the renewal + • the timestamp written in the delta lease by the renewal - · the time in milliseconds taken by the delta lease read + • the time in milliseconds taken by the delta lease read - · the time in milliseconds taken by the delta lease write + • the time in milliseconds taken by the delta lease write - Also counted and recorded are the number io timeouts and other io - errors that occur between successful renewals. + Also counted and recorded are the number io timeouts and other io er‐ + rors that occur between successful renewals. Two consecutive successful renewals would be recorded as: timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0 @@ -857,18 +861,18 @@ From sanlock(8) at sanlock.git/src/sanlock.8 Those fields are: - · timestamp is the value written into the delta lease during that - renewal. + • timestamp is the value written into the delta lease during that re‐ + newal. - · read_ms/write_ms are the milliseconds taken for the renewal + • read_ms/write_ms are the milliseconds taken for the renewal read/write ios. - · next_timeouts are the number of io timeouts that occured after the + • next_timeouts are the number of io timeouts that occurred after the renewal recorded on that line, and before the next successful renewal on the following line. 
- · next_errors are the number of io errors (not timeouts) that occured - after renewal recorded on that line, and before the next successful + • next_errors are the number of io errors (not timeouts) that occurred + after renewal recorded on that line, and before the next successful renewal on the following line. The command 'sanlock client renewal -s lockspace_name' reports the full @@ -876,44 +880,65 @@ From sanlock(8) at sanlock.git/src/sanlock.8 about 1 hour of history when using a 20 second renewal interval for a 10 second io timeout. - INTERNALS + Configurable watchdog timeout + Watchdog devices usually have a 60 second timeout, but some devices + have a configurable timeout. To use a different watchdog timeout, set + sanlock.conf watchdog_fire_timeout (in seconds) to a value supported by + the device. The same watchdog_fire_timeout must be configured on all + hosts (so all hosts must have watchdog devices that support the same + timeout). Unmatching values will invalidate the lease protection pro‐ + vided by the watchdog. + + watchdog_fire_timeout and io_timeout should usually be configured to‐ + gether. By default, sanlock uses watchdog_fire_timeout=60 with + io_timeout=10. Other combinations to consider are: + watchdog_fire_timeout=30 with io_timeout=5 + watchdog_fire_timeout=10 with io_timeout=2 + + Smaller values make it more likely that a host will be reset by the + watchdog while waiting for slow io to complete or for temporary io + failures to be resolved. Spurious watchdog resets will also become + more likely due to independent, overlapping lockspace outages, each of + which would be inconsequential by itself. + +INTERNALS Disk Format - · This example uses 512 byte sectors. + • This example uses 512 byte sectors. - · Each lockspace is 1MB. It holds 2000 delta_leases, one per sector, + • Each lockspace is 1MB. It holds 2000 delta_leases, one per sector, supporting up to 2000 hosts. - · Each paxos_lease is 1MB. It is used as a lease for one resource. 
+ • Each paxos_lease is 1MB. It is used as a lease for one resource. - · The leader_record structure is used differently by each lease type. + • The leader_record structure is used differently by each lease type. - · To display all leader_record fields, see sanlock direct read_leader. + • To display all leader_record fields, see sanlock direct read_leader. - · A lockspace is often followed on disk by the paxos_leases used within + • A lockspace is often followed on disk by the paxos_leases used within that lockspace, but this layout is not required. - · The request_record and host_id bitmap are used for requests/events. + • The request_record and host_id bitmap are used for requests/events. - · The mode_block contains the SHARED flag indicating a lease is held in + • The mode_block contains the SHARED flag indicating a lease is held in the shared mode. - · In a lockspace, the host using host_id N writes to a single + • In a lockspace, the host using host_id N writes to a single delta_lease in sector N-1. No other hosts write to this sector. All hosts read all lockspace sectors when renewing their own delta_lease, and are able to monitor renewals of all delta_leases. - · In a paxos_lease, each host has a dedicated sector it writes to, con‐ + • In a paxos_lease, each host has a dedicated sector it writes to, con‐ taining its own paxos_dblock and mode_block structures. Its sector is based on its host_id; host_id 1 writes to the dblock/mode_block in sector 2 of the paxos_lease. - · The paxos_dblock structures are used by the paxos_lease algorithm, + • The paxos_dblock structures are used by the paxos_lease algorithm, and the result is written to the leader_record. 
0x000000 lockspace foo:0:/path:0 - (There is no representation on disk of the lockspace in general, only - the sequence of specific delta_leases which collectively represent the + (There is no representation on disk of the lockspace in general, only + the sequence of specific delta_leases which collectively represent the lockspace.) delta_lease foo:1:/path:0 @@ -991,15 +1016,15 @@ From sanlock(8) at sanlock.git/src/sanlock.8 0xFA280 mode_block (paxos_dblock + 128) Lease ownership - Not shown in the leader_record structures above are the owner_id, - owner_generation and timestamp fields. These are the fields that - define the lease owner. - - The delta_lease at sector N for host_id N+1 has leader_record.owner_id - N+1. The leader_record.owner_generation is incremented each time the - delta_lease is acquired. When a delta_lease is acquired, the - leader_record.timestamp field is set to the time of the host and the - leader_record.resource_name is set to the unique name of the host. + Not shown in the leader_record structures above are the owner_id, + owner_generation and timestamp fields. These are the fields that de‐ + fine the lease owner. + + The delta_lease at sector N for host_id N+1 has leader_record.owner_id + N+1. The leader_record.owner_generation is incremented each time the + delta_lease is acquired. When a delta_lease is acquired, the + leader_record.timestamp field is set to the time of the host and the + leader_record.resource_name is set to the unique name of the host. When the host renews the delta_lease, it writes a new leader_record.timestamp. When a host releases a delta_lease, it writes zero to leader_record.timestamp. @@ -1014,110 +1039,150 @@ From sanlock(8) at sanlock.git/src/sanlock.8 leader_record.timestamp is set. When a host releases a paxos_lease, it sets leader_record.timestamp to 0. - When a paxos_lease is free (leader_record.timestamp is 0), multiple - hosts may attempt to acquire it. 
The paxos algorithm, using the - paxos_dblock structures, will select only one of the hosts as the new + When a paxos_lease is free (leader_record.timestamp is 0), multiple + hosts may attempt to acquire it. The paxos algorithm, using the + paxos_dblock structures, will select only one of the hosts as the new owner, and that owner is written in the leader_record. The paxos_lease will no longer be free (non-zero timestamp). Other hosts will see this and will not attempt to acquire the paxos_lease until it is free again. - If a paxos_lease is owned (non-zero timestamp), but the owner has not - renewed its delta_lease for a specific length of time, then the owner - value in the paxos_lease becomes expired, and other hosts will use the + If a paxos_lease is owned (non-zero timestamp), but the owner has not + renewed its delta_lease for a specific length of time, then the owner + value in the paxos_lease becomes expired, and other hosts will use the paxos algorithm to acquire the paxos_lease, and set a new owner. - FILES +FILES /etc/sanlock/sanlock.conf - · quiet_fail = 1 + • quiet_fail = 1 See -Q - · debug_renew = 0 + • debug_renew = 0 See -R - · logfile_priority = 4 + • logfile_priority = 4 See -L - · logfile_use_utc = 0 + • logfile_use_utc = 0 Use UTC instead of local time in log messages. - · syslog_priority = 3 + • syslog_priority = 3 See -S - · names_log_priority = 4 - Log resource names at this priority level (uses syslog priority num‐ - bers). If this is greater than or equal to logfile_priority, each + • names_log_priority = 4 + Log resource names at this priority level (uses syslog priority num‐ + bers). If this is greater than or equal to logfile_priority, each requested resource name and location is recorded in sanlock.log. 
- · use_watchdog = 1 + • use_watchdog = 1 See -w - · high_priority = 1 + • high_priority = 1 See -h - · mlock_level = 1 + • mlock_level = 1 See -l - · sh_retries = 8 - The number of times to try acquiring a paxos lease when acquiring a + • sh_retries = 8 + The number of times to try acquiring a paxos lease when acquiring a shared lease when the paxos lease is held by another host acquiring a shared lease. - · uname = sanlock + • uname = sanlock See -U - · gname = sanlock + • gname = sanlock See -G - · our_host_name = + • our_host_name = See -e - · renewal_read_extend_sec = + • renewal_read_extend_sec = If a renewal read i/o times out, wait this many additional seconds - for that read to complete at the start of the subsequent renewal - attempt. When not configured, sanlock waits for an additional - io_timeout seconds for a previous timed out read to complete. + for that read to complete at the start of the subsequent renewal at‐ + tempt. When not configured, sanlock waits for an additional io_time‐ + out seconds for a previous timed out read to complete. - · renewal_history_size = 180 + • renewal_history_size = 180 See -H - · paxos_debug_all = 0 + • paxos_debug_all = 0 Include all details in the paxos debug logging. - · debug_io = - Add debug logging for each i/o. "submit" (no quotes) produces debug - output at submission time, "complete" produces debug output at com‐ + • debug_io = + Add debug logging for each i/o. "submit" (no quotes) produces debug + output at submission time, "complete" produces debug output at com‐ pletion time, and "submit,complete" (no space) produces both. - · max_sectors_kb = | - Set to "ignore" (no quotes) to prevent sanlock from checking or - changing max_sectors_kb for the lockspace disk when starting a - lockspace. Set to "align" (no quotes) to set max_sectors_kb for the - lockspace disk to the align size of the lockspace. 
Set to a number + • max_sectors_kb = | + Set to "ignore" (no quotes) to prevent sanlock from checking or + changing max_sectors_kb for the lockspace disk when starting a lock‐ + space. Set to "align" (no quotes) to set max_sectors_kb for the + lockspace disk to the align size of the lockspace. Set to a number to set a specific number of KB for all lockspace disks. - SEE ALSO + • debug_clients = 0 + Enable or disable debug logging for all client connections to the + sanlock daemon. + + • debug_cmd = +|- + Enable (+name) or disable (-name) debug logging at the command pro‐ + cessing level for specifically named commands, e.g. "debug_cmd = +ac‐ + quire", or "debug_cmd = -inq_lockspace". Repeat this line for each + command name. Use a plus prefix before the name to enable and a mi‐ + nus prefix to disable. By default sanlock disables some command + level debugging for commands that are often repetitive and fill the + in memory debug buffer. This only affects debug logging, not errors + or warnings, and disabling command level debugging for a command does + not disable lower level debugging for that command. Special values + +all and -all can be used to enable or disable all commands, and can + be used before or after other debug_cmd lines. + + • write_init_io_timeout = + The io timeout to use when initializing ondisk lease structures for a + lockspace or resource. This timeout is not used as a part of either + lease algorithm (as the standard io_timeout is.) + + • max_worker_threads = + See -t + + • io_timeout = + The io timeout for disk operations, most notably delta lease re‐ + newals. This value is basis for calculating most other timeout val‐ + ues. (Some special cases may use a different io timeout.) Tune this + value with caution, it can substantially alter the overall sanlock + behavior. + + • watchdog_fire_timeout = + The watchdog device timeout. The watchdog device must support the + specified value. It is critical that all hosts use the same value. 
+ Not doing so will invalidate the lease protection provided by san‐ + lock. The io_timeout should usually be tuned along with this value, + e.g. watchdog_fire_timeout = 30 with io_timeout = 5. + +SEE ALSO wdmd(8) 2015-01-23 SANLOCK(8) +:: - WDMD(8) System Manager's Manual WDMD(8) +WDMD(8) System Manager's Manual WDMD(8) - NAME +NAME wdmd - watchdog multiplexing daemon - SYNOPSIS +SYNOPSIS wdmd [OPTIONS] - DESCRIPTION +DESCRIPTION This daemon opens /dev/watchdog and allows multiple independent sources - to detmermine whether each KEEPALIVE is done. Every test interval (10 - seconds), the daemon tests each source. If any test fails, the - KEEPALIVE is not done. In a standard configuration, the watchdog timer - will reset the system if no KEEPALIVE is done for 60 seconds ("fire - timeout"). This means that if a single test fails 5-6 times in row, - the watchdog will fire and reset the system. With multiple test + to detmermine whether each KEEPALIVE is done. Every test interval (de‐ + fault 10 seconds), the daemon tests each source. If any test fails, + the KEEPALIVE is not done. In the default configuration, the watchdog + timer will reset the system if no KEEPALIVE is done for 60 seconds + ("fire timeout"). This means that if a single test fails 5-6 times in + row, the watchdog will fire and reset the system. With multiple test sources, fewer separate failures back to back can also cause a reset, e.g. @@ -1135,19 +1200,20 @@ From sanlock(8) at sanlock.git/src/sanlock.8 T60, and the tests at T60 would not be run.) A crucial aspect to the design and function of wdmd is that if any sin‐ - gle source does not pass tests for the fire timeout, the watchdog is - guaranteed to fire, regardless of whether other sources on the system - have passed or failed. A spurious reset due to the combined effects of - multiple failing tests as shown above, is an accepted side effect. 
+ gle source does not pass the test for the length of the fire timeout,
+ the watchdog is guaranteed to fire, regardless of whether other sources
+ on the system have passed or failed. A spurious reset due to the com‐
+ bined effects of multiple failing tests as shown above, is an accepted
+ side effect.
- The wdmd init script will load the softdog module if no other watchdog
+ The wdmd init script will load the softdog module if no other watchdog
  module has been loaded.
- wdmd cannot be used on the system with any other program that needs to
+ wdmd cannot be used on the system with any other program that needs to
  open /dev/watchdog, e.g. watchdog(8).
  Test Source: clients
- Using libwdmd, programs connect to wdmd via a unix socket, and send
+ Using libwdmd, programs connect to wdmd via a unix socket, and send
  regular messages to wdmd to update an expiry time for their connection.
  Every test interval, wdmd will check if the expiry time for a connec‐
  tion has been reached. If so, the test for that client fails.
@@ -1158,7 +1224,7 @@ From sanlock(8) at sanlock.git/src/sanlock.8
  failure. If a script does not exit by the end of the test interval, it
  is considered a failure.
- OPTIONS
+OPTIONS
  --version, -V
  Print version.
@@ -1170,18 +1236,15 @@ From sanlock(8) at sanlock.git/src/sanlock.8
  --probe, -p
  Print path of functional watchdog device. Exit code 0 indi‐
- cates a
- functional device was found. Exit code 1 indicates a func‐
- tional device
- was not found.
+ cates a functional device was found. Exit code 1 indicates
+ a functional device was not found.
  -D
  Enable debugging to stderr and don't fork.
  -H 0|1
  Enable (1) or disable (0) high priority features such as real‐
- time
- scheduling priority and mlockall.
+ time scheduling priority and mlockall.
  -G name
  Group ownership for the socket.
@@ -1198,7 +1261,12 @@ From sanlock(8) at sanlock.git/src/sanlock.8
  -w path
  The path to the watchdog device to try first.
- 2011-08-01 WDMD(8)
+ --trytimeout, -t seconds
+ Set the timeout for the watchdog device. Use this to check
+ for supported timeout values.
-::
+ --forcefire, -F
+ Force the watchdog to fire and reset the machine.
+ Use with -t.
+ 2011-08-01 WDMD(8)