The Corosync/Pacemaker pair is the successor to the long obsolete and unsupported Heartbeat v2 software package. While Heartbeat v2 played the role of both the Group Communication System (GCS) and the Cluster Resource Manager (CRM), these roles are split under the new system: Corosync is the GCS and in charge of communication, while Pacemaker sits on top of the GCS and plays the role of the CRM, managing the resources and responding to changes in the cluster status.
An existing cluster under Heartbeat v2 can be migrated to Corosync/Pacemaker
using the script ngcp-migrate-ha-crm. This script automates the following
steps:

* Modifying /etc/ha.d/haresources to disable stopping of resources by Heartbeat v2
* Switching the config.yml variables ha.gcs and ha.crm to corosync and pacemaker respectively
The script must be run on the standby node. The steps described above are performed first locally (on the standby node) and then on the node's peer (the active node), in order to minimise resource downtime.
Caution: Switching the values of ha.gcs and ha.crm in config.yml by hand does not perform a proper migration of a running cluster; the ngcp-migrate-ha-crm script should be used instead.
The migration needs to be run on non-migrated pairs of nodes, from the
standby node of each pair, and all services need to be in a good state
(from the ngcp-service summary point of view). The program performs these
sanity checks before making it possible to proceed. The user will also be
prompted for confirmation, which can be skipped for non-interactive use with
the FORCE environment variable.
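For example, assuming sp2 is the standby node of the pair, a non-interactive run could look like this:

root@sp2:~# FORCE=yes ngcp-migrate-ha-crm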
For a Carrier system the configuration settings are set per pair of nodes, so that the switch does not affect the entire cluster. Once the whole cluster has been migrated, these configuration files should be merged into the global one. If the switch does not need to be staged, the ngcp-parallel-ssh(8) command can help with that, using its inactive host selector: ngcp-parallel-ssh inactive "FORCE=yes ngcp-migrate-ha-crm".
Corosync is the Group Communication System (GCS). Its configuration resides in
/etc/corosync/corosync.conf
and describes the following details:
* The cluster name (sp)
* Config details for both nodes (sp1 and sp2, or a and b node names)
* The corresponding node IDs (1 and 2 respectively)
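For illustration, a minimal corosync.conf covering these details might look like this (a sketch with example values; the actual file is generated by the NGCP templates):

totem {
        version: 2
        cluster_name: sp
}

nodelist {
        node {
                ring0_addr: sp1
                nodeid: 1
        }
        node {
                ring0_addr: sp2
                nodeid: 2
        }
}

quorum {
        provider: corosync_votequorum
}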
Corosync uses a voting system to determine the state of the cluster. Each
configured node in the cluster receives one vote. A quorum is defined as a
majority presence within the cluster, meaning more than half of the configured
nodes, or q = floor(n / 2) + 1. For example, if 8 nodes were configured, a
quorum would be present if at least 5 nodes are communicating with each other.
In this state, the cluster is said to be quorate, which means it can operate
normally. (Any remaining nodes, 3 in the worst case, would see the cluster as
inquorate and would relinquish all their resources.)
A two-node cluster is a special case as under the formula above, a quorum would
consist of 2 functioning nodes. The Corosync config setting two_node: 1
overrides this and artificially sets the quorum to 1. This means that under a
split-brain scenario (each node seeing only 1 vote), both nodes would see the
cluster as quorate and try to become active, instead of both nodes going standby.
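In the Corosync config, this corresponds to a quorum section along these lines:

quorum {
        provider: corosync_votequorum
        two_node: 1
}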
In addition to this, Pacemaker itself also uses an internal scoring system for individual resources. This mechanism is described below and not directly related to the quorum.
Pacemaker uses the communication service provided by Corosync to manage local resources. All status and configuration information is shared between all Pacemaker instances within the cluster as long as communication is up. This means that any configuration change done on any node will immediately and automatically be propagated to all other nodes in the cluster.
Pacemaker internally uses an XML document to store its configuration, called
the "CIB" and stored in /var/lib/pacemaker/cib/cib.xml. However, this XML document
must never be edited or modified directly. Instead, a shell-like interface
crm
is provided to talk to Pacemaker, query status information, alter cluster
state, view and modify configuration, etc. Any configuration change done through
crm
is immediately reflected in the CIB XML, locally as well as on all other
nodes.
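(Read-only inspection of the raw XML is still possible if ever needed, e.g. via cibadmin --query or crm configure show xml.)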
Warning: To repeat, do not ever directly modify Pacemaker's XML configuration.
As an added bonus, to make things more awkward, the syntax used by crm
is not XML at all, but rather uses a Cisco-like hierarchy.
Commands can be issued to crm
either directly from the shell as command-line
arguments, or interactively by entering a Cisco-like shell. So for example, the
current config can be viewed either from the shell with:
root@sp1:~# crm config show
...
Or interactively, as either:
root@sp1:~# crm
crm(live/sp1)# config
crm(live/sp1)configure# show
...
or
root@sp1:~# crm
crm(live/sp1)# config show
...
Interactive online help is provided by the ls
command to list commands valid
in the current context, or using the help
command for a more verbose help
output.
The current cluster status can be viewed with the top-level status
command:
crm(live/sp1)# status
Stack: corosync
Current DC: sp2 (version unknown) - partition with quorum
Last updated: Fri Nov 22 18:38:06 2019
Last change: Fri Nov 22 18:25:28 2019 by hacluster via crmd on sp1

2 nodes configured
7 resources configured

Online: [ sp1 sp2 ]

Full list of resources:

 Resource Group: g_vips
     p_vip_eth1_v4_1    (ocf::heartbeat:IPaddr):        Started sp1
     p_vip_eth2_v4_1    (ocf::heartbeat:IPaddr):        Started sp1
 Resource Group: g_ngcp
     p_monit_services   (ocf::ngcp:monit-services):     Started sp1
 Clone Set: c_ping [p_ping]
     Started: [ sp1 sp2 ]
 Clone Set: fencing [st-null]
     Started: [ sp1 sp2 ]
If the status is queried from sp2
instead, the output will be the same. Most
importantly, the resources will not show up as "stopped" on sp2
but instead
will be reported as running on sp1
.
The resources reported are described in the configuration section below.
The NGCP templates do not operate on Pacemaker’s CIB XML directly, but instead
produce a file in CRM syntax in /etc/pacemaker/cluster.crm
. This file is not
handled by Pacemaker directly, but instead is loaded into Pacemaker via the
crm command config load replace. It shouldn't be necessary to do this
manually, as the script ngcp-ha-crm-reload, which is called from the config
file's postbuild script, handles this automatically.
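Should a manual reload ever be needed, it roughly boils down to the following (a sketch of what ngcp-ha-crm-reload automates):

crm configure property maintenance-mode=true
crm configure load replace /etc/pacemaker/cluster.crm
crm configure property maintenance-mode=false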
Changes to the config don’t need to be saved explicitly. This is done automatically by Pacemaker, as well as sharing any changes with all other members of the cluster.
In crm
, changes made to the config are cached until made active with
commit
, or discarded with refresh
. Changes to resource status can be
avoided by enabling maintenance mode (see below).
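A hypothetical interactive session illustrating this caching, with a change only taking effect on commit:

root@sp1:~# crm
crm(live/sp1)# configure
crm(live/sp1)configure# property maintenance-mode=true
crm(live/sp1)configure# commit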
Important: However, since our config is loaded from a template, any changes done to the config through crm will be lost the next time the config is rebuilt and reloaded from the template.
The currently active config can be shown with config show
and should be
logically identical to the contents of /etc/pacemaker/cluster.crm
:
crm(live/sp1)# config show
node 1: sp1
node 2: sp2
primitive p_monit_services ocf:ngcp:monit-services \
        meta migration-threshold=20 \
        meta failure-timeout=800 \
        op monitor interval=20 timeout=60 on-fail=restart \
        op_params on-fail=restart
primitive p_ping ocf:pacemaker:ping \
        params host_list="10.15.20.30 192.168.211.1" multiplier=1000 dampen=5s \
        meta failure-timeout=800 \
        op monitor interval=1 timeout=60 on-fail=restart \
        op_params timeout=60 on-fail=restart
primitive p_vip_eth1_v4_1 IPaddr \
        params ip=192.168.255.250 nic=eth1 cidr_netmask=24 \
        op monitor interval=5 timeout=60 on-fail=restart \
        op_params on-fail=restart
primitive p_vip_eth2_v4_1 IPaddr \
        params ip=192.168.1.161 nic=eth2 cidr_netmask=24 \
        op monitor interval=5 timeout=60 on-fail=restart \
        op_params on-fail=restart
primitive st-null stonith:null \
        params hostlist="sp1 sp2"
group g_ngcp p_monit_services
group g_vips p_vip_eth1_v4_1 p_vip_eth2_v4_1
clone c_ping p_ping
clone fencing st-null
location l_ngcp g_ngcp \
        rule pingd: defined pingd
colocation l_ngcp_with_vip inf: g_ngcp g_vips
location l_vips g_vips \
        rule pingd: defined pingd
order o_vip_then_ngcp Mandatory: g_vips g_ngcp
property cib-bootstrap-options: \
        have-watchdog=false \
        cluster-infrastructure=corosync \
        cluster-name=sp \
        stonith-enabled=yes \
        no-quorum-policy=ignore \
        startup-fencing=yes \
        maintenance-mode=false \
        last-lrm-refresh=1574443528
rsc_defaults rsc-options: \
        resource-stickiness=100
Every object in the config has a type and a name. For example, the line
clone c_ping p_ping defines a clone type object with the name c_ping.
These names are used when modifying or deleting objects (config del ...), when
starting or stopping a resource, when referring to resources from a group, etc.
Resources are the primary type of objects that Pacemaker handles. A resource is
anything that can be started or stopped, and a resource is normally allowed to
run on one node only. A resource is defined as a primitive
type object.
Pacemaker supports many types of resources, all of which have different options
that can be given to them. The config syntax defines that options given to a
resource itself are prefixed with params
, while options that influence how a
resource should be managed are prefixed with meta
. Options that are relevant
to operations that can be performed on a resource are prefixed with op
.
Resources are grouped into classes, providers, and types. Details about them
(e.g. which options they support) can be obtained through the ra
menu.
crm(live/sp1)ra# info IPaddr
Manages virtual IPv4 addresses (portable version) (ocf:heartbeat:IPaddr)
...
primitive p_vip_eth1_v4_1 IPaddr \
        params ip=192.168.255.250 nic=eth1 cidr_netmask=24 \
        op monitor interval=5 timeout=60 on-fail=restart \
        op_params on-fail=restart
This defines a resource of type IPaddr
with name p_vip_eth1_v4_1
and the
given parameters (address, netmask, interface). Pacemaker will check for the
existence of the address every 5 seconds, with an action timeout of 60 seconds.
If the monitor action fails, the resource is restarted.
primitive p_monit_services ocf:ngcp:monit-services \
        meta migration-threshold=20 \
        meta failure-timeout=800 \
        op monitor interval=20 timeout=60 on-fail=restart \
        op_params on-fail=restart
While Pacemaker has support for native systemd services, for the time being
we’re still relying on monit to manage our services. Therefore, services are
defined in Pacemaker virtually identically to how they were defined in Heartbeat
v2, through a monit-services
start/stop script. The old Heartbeat script was
/etc/ha.d/resource.d/monit-services
and the new script used by Pacemaker is
/etc/ngcp-ocf/monit-services
.
Info: The primary difference between the two scripts is the support for a monitor action.
meta migration-threshold=20
means that the resource will be migrated away
(instead of restarted) after 20 failures. See the discussion on failure counts
below.
meta failure-timeout=800
means that the failure count should be reset to
zero if the last failure occurred more than 800 seconds ago. (However, the
actual timer depends on the cluster-recheck-interval
.)
primitive p_ping ocf:pacemaker:ping \
        params host_list="10.15.20.30 192.168.211.1" multiplier=1000 dampen=5s \
        meta failure-timeout=800 \
        op monitor interval=1 timeout=60 on-fail=restart \
        op_params timeout=60 on-fail=restart
The builtin pingd service (whose resource agent, somewhat confusingly, is named
ping rather than pingd) replaces Heartbeat's ping nodes. It supports multiple
ping backends, and uses fping by default.
Each configured ping node (each entry in host_list
) produces a score of 1
if that ping node is up. The scores are summed up and multiplied by the
multiplier
. So in the example above, a score of 2000 is generated if both
ping nodes are up. Pacemaker will then prefer the node which produces the
higher score.
dampen=5s
means to wait 5 seconds after a change occurred to prevent
transient glitches from causing service flapping.
primitive st-null stonith:null \
        params hostlist="sp1 sp2"
Pacemaker will generate a warning if no fencing mechanism is configured,
therefore we configure the null
fencing mechanism.
Pacemaker supports several proper fencing mechanisms, and these might eventually be supported in the future.
group g_ngcp p_monit_services
group g_vips p_vip_eth1_v4_1 p_vip_eth2_v4_1
To manage, control, and restrict multiple resources at the same time, resources
can be grouped into single objects. The group g_ngcp
is pointless for the time
being (it contains only a single other resource) but will become useful once
native systemd resources are in use. The group g_vips
ensures that all shared
IP addresses are active at the same time.
clone c_ping p_ping
clone fencing st-null
Since a single resource normally only runs on one node, a clone can be defined
to allow a resource to run on all nodes. We want the pingd
service and the
fencing service to always run on all nodes.
colocation l_ngcp_with_vip inf: g_ngcp g_vips
This tells Pacemaker that we want to force the g_ngcp
resource on the same
node that is running the g_vips
resource.
location l_ngcp g_ngcp \
        rule pingd: defined pingd
location l_vips g_vips \
        rule pingd: defined pingd
This tells Pacemaker that these resources depend on the pingd
service being
healthy. If pingd
fails on one node (ping nodes are unavailable), then
Pacemaker will shut down the constrained resources.
order o_vip_then_ngcp Mandatory: g_vips g_ngcp
This tells Pacemaker that the shared IP addresses must be up and running before the system services can be started.
property cib-bootstrap-options: \
        have-watchdog=false \
        cluster-infrastructure=corosync \
        cluster-name=sp \
        stonith-enabled=yes \
        no-quorum-policy=ignore \
        startup-fencing=yes \
        maintenance-mode=false \
        last-lrm-refresh=1574443528
Relevant options are:
have-watchdog=false
indicates that no external watchdog service such as
SBD
is in use.
cluster-name=sp
is to match the configuration of Corosync.
stonith-enabled=yes
is required to suppress a warning message, even though
no real STONITH (null
fencing mechanism) is in use.
no-quorum-policy=ignore
tells Pacemaker to continue normally if quorum is
lost. This is the only setting that makes sense in a two-node cluster.
startup-fencing=yes
is also needed to suppress a warning even though no
real fencing is in use. This tells Pacemaker to shoot nodes that are not
present immediately after startup.
maintenance-mode=false
tells Pacemaker to actually perform resource actions.
If maintenance mode is enabled, Pacemaker will continue to run, but will not
start or stop any services. This should be enabled before loading a new config,
and then disabled afterwards. The script ngcp-ha-crm-reload
does this.
Pacemaker keeps a failure count for each resource, which is somewhat hidden from
view, but can largely influence its behaviour. Each time a service fails (either
during runtime or during startup), the failure count is increased by one. If the
failure count exceeds the configured migration-threshold
, Pacemaker will cease
trying to start the service and will migrate the service away to another node.
In crm status
this shows up as stopped
.
Failure counts can be cleared automatically if the failure-timeout
setting is
configured for a resource. This timeout is counted after the last time the
resource has failed, and is checked periodically according to the
cluster-recheck-interval
. In other words, a very short failure timeout won’t
have any effect unless the recheck interval is also very short.
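For example, to have failure counts expire reasonably promptly, the recheck interval could be shortened (a hypothetical value; note that on NGCP such properties are normally managed through the config template):

crm configure property cluster-recheck-interval=60s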
Important: If no failure timeout is configured, any existing failure count must be cleared manually.
The failure count for a resource can be checked from the shell via
crm_failcount
, for example:
root@sp1:~# crm_failcount -G -r p_monit_services
scope=status  name=fail-count-p_monit_services  value=0
The failure count on a different node can also be examined:
root@sp1:~# crm_failcount -G -r g_ngcp -N sp2
scope=status  name=fail-count-g_ngcp  value=0
The same can be done via crm
:
crm(live/sp1)resource# failcount g_vips show sp1
scope=status  name=fail-count-g_vips  value=0
crm(live/sp1)resource# failcount c_ping show sp2
scope=status  name=fail-count-c_ping  value=0
As a shortcut, the script ngcp-ha-show-failcounts
is provided:
root@sp1:~# ngcp-ha-show-failcounts
p_vip_eth1_v4_1:
    sp1: 0
    sp2: 0
p_vip_eth2_v4_1:
    sp1: 0
    sp2: 0
p_monit_services:
    sp1: 0
    sp2: 0
Analogous to checking a failure count, it can be cleared using any one of these methods:
root@sp1:~# crm_failcount -D -r p_monit_services
Cleaned up p_monit_services on sp1

root@sp1:~# crm resource failcount g_ngcp delete sp2
Cleaned up p_monit_services on sp2

root@sp1:~# crm
crm(live/sp1)# resource
crm(live/sp1)resource# failcount c_ping delete sp1
Cleaned up p_ping:0 on sp1
Cleaned up p_ping:1 on sp1
crm(live/sp1)resource# bye

root@sp1:~# ngcp-ha-clear-failcounts
Cleaned up p_monit_services on sp2
Cleaned up p_monit_services on sp1
In addition, the crm
command resource cleanup
also resets failure counts.
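For example, cleaning up a single resource (the exact output may vary by version):

root@sp1:~# crm resource cleanup p_monit_services
Cleaned up p_monit_services on sp1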
Pacemaker uses an internal scoring system to determine which resources to run where. A resource will be run on the node on which it received the highest score. If a resource has a negative score, that resource will not be run at all. If a resource has the same score on multiple nodes, then the resource will be run on any one of those nodes. Scores can be calculated and acted upon through various config settings.
A special score value of infinity (and negative infinity) is provided to force
certain states. Despite its name, it does not evaluate to infinity at all, but
rather to a static value of one million. It can be used to artificially
manipulate resource scores, e.g. to force a resource to run on a particular
node, or to forbid a resource from running on particular nodes.
Scores can be inspected through the crm
command resource scores
.
The commands ngcp-make-active
and ngcp-make-standby
work normally. Under
Pacemaker, they function through the crm
command resource move
to create a
temporary location constraint on g_vips
. This can be done manually through:
crm resource move g_vips sp1 30
The lifetime of 30 seconds is needed because g_ngcp
depends on the location of
g_vips
, and therefore g_ngcp
needs to be stopped before g_vips
can be
stopped. The location constraint must remain active until g_ngcp
has been
completely and successfully stopped.
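If the temporary constraint needs to be removed before its lifetime expires, this can be done with the crm command resource clear (known as unmove in older crmsh versions), e.g.:

root@sp1:~# crm resource clear g_vips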
Info: These commands only affect the status of the running resources, and not the status of the node itself. This means that after going standby, Pacemaker will immediately be ready to take over the resources again if needed. See below for a discussion on node status.
Similarly, ngcp-check-active
uses the output of crm resource status g_vips
to determine whether the local node is active or not.
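For instance, on the active node the output would resemble:

root@sp1:~# crm resource status g_vips
resource g_vips is running on: sp1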
In addition to the status and location of individual resources, nodes themselves
can also go into standby mode. The submenu node
in crm
has the relevant
options.
A node in standby mode will not only give up all of its resources, but will also refuse to take them over until it’s back online. Therefore, it’s possible to set both nodes to standby mode and shut down all resources on both nodes.
Info: A node in standby mode will still participate in GCS communications and remain visible to the rest of the cluster.
Use crm node standby
to set the local node to standby mode. A remote node can
be set to standby using e.g. crm node standby sp2
.
By default, the standby mode remains active until it’s cancelled manually (a
lifetime of forever
). Alternatively, a lifetime of reboot
can be specified
to tell Pacemaker that after the next reboot, the node should automatically come
back online. Example: crm node standby sp2 reboot.
To cancel standby mode, use crm node online
, optionally followed by the node
name.
To show the current status of all nodes, use crm node show
. The top-level
crm status
also shows this.
If Pacemaker’s maintenance mode is enabled, it will continue to operate
normally, i.e. continue to run and monitor resources, but will refuse to stop or
start any resources. This is useful to make changes to the running config, and
is done automatically by ngcp-ha-crm-reload
.
To enable and disable maintenance mode:
crm maintenance on
crm maintenance off
or using the lower-level method:
crm configure property maintenance-mode=true
crm configure property maintenance-mode=false