10. Corosync/Pacemaker

10.1. Overview

The Corosync/Pacemaker pair is the successor of the long obsolete and unsupported Heartbeat v2 software package. Whereas Heartbeat v2 acted as both the Group Communication System (GCS) and the Cluster Resource Manager (CRM), these roles are split under the new system. Corosync is the GCS and is in charge of communication, while Pacemaker sits on top of the GCS and acts as the CRM, managing the resources and responding to changes in the cluster status.

10.2. Migration

Before the migration, install the latest hotfixes for the following packages: ngcp-system-tools-pro, ngcp-ngcpcfg, ngcp-ngcpcfg-ha.

An existing cluster under Heartbeat v2 can be migrated to Corosync/Pacemaker using the script ngcp-migrate-ha-crm. This script automates the following steps:

Locally download the Heartbeat v2 Debian package to allow rollback
Download and prefetch the Debian packages for Corosync, Pacemaker, and their dependencies
Remove /etc/ha.d/haresources to disable stopping of resources by Heartbeat v2
Stop Heartbeat v2 and remove its Debian package including all configuration
Install Corosync, Pacemaker, and all dependencies
Switch config.yml variables ha.gcs and ha.crm to corosync and pacemaker respectively
Rebuild all config from templates
Start the new GCS and CRM services

The script must be run on the standby node, and the steps described above are first performed locally (on the standby node), followed by the node’s peer (the active node). This is to minimise resource downtime.

caution	Switching the values of `ha.gcs` or `ha.crm` is not enough to actually effect any change. These values are merely reflective of which software is installed and must not be modified without also altering the underlying software.

The migration needs to be run on non-migrated pairs of nodes, from the standby node of each pair, and all services need to be in a good state (as reported by ngcp-service summary). The script performs these sanity checks before it proceeds. The user is also prompted for confirmation, which can be skipped for non-interactive use with the FORCE environment variable.

For a Carrier system the configuration settings will be set per pair, so that it does not affect the entire cluster. Once the whole cluster has been migrated these configuration files should be merged into the global one. If the switch does not need to be staged, then the ngcp-parallel-ssh(8) command can help with that, with its inactive host selector, such as ngcp-parallel-ssh inactive "FORCE=yes ngcp-migrate-ha-crm".

10.2.1. Rollback

Rollback can be done by using ngcp-migrate-ha-crm --rollback, which reverses the steps outlined above: removal of Corosync and Pacemaker, installation of the downloaded heartbeat-2 Debian package, and reverting ha.gcs and ha.crm to their previous value, heartbeat-2.

10.3. Corosync

Corosync is the Group Communication System (GCS). Its configuration resides in /etc/corosync/corosync.conf and describes the following details:

Shared cluster name of sp
Quorum config as a two-node cluster (see below)
Config details for both nodes:
- Name (sp1 and sp2, or a and b node names)
- Node ID (1 and 2 respectively)
- Local IP address for communication
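
To make the listed details concrete, the following is a minimal sketch that renders a corosync.conf fragment containing them. It is not the NGCP template. The helper name render_corosync_conf and the IP addresses are hypothetical, and the exact set of keys in the generated file on a real system may differ.

```python
def render_corosync_conf(cluster_name, nodes):
    """Render a minimal two-node corosync.conf as a string.

    nodes is a list of (name, nodeid, address) tuples.
    """
    lines = [
        "totem {",
        "    version: 2",
        f"    cluster_name: {cluster_name}",
        "}",
        "",
        "nodelist {",
    ]
    for name, nodeid, address in nodes:
        lines += [
            "    node {",
            f"        name: {name}",
            f"        nodeid: {nodeid}",
            f"        ring0_addr: {address}",
            "    }",
        ]
    lines += [
        "}",
        "",
        "quorum {",
        "    provider: corosync_votequorum",
        "    two_node: 1",
        "}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical addresses, for illustration only.
conf = render_corosync_conf(
    "sp",
    [("sp1", 1, "192.168.255.251"), ("sp2", 2, "192.168.255.252")],
)
print(conf)
```

The quorum block carries the two-node setting that the Quorum section explains.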

10.3.1. Quorum

Corosync uses a voting system to determine the state of the cluster. Each configured node in the cluster receives one vote. A quorum is defined as a majority presence within the cluster, meaning more than half of the configured nodes, or q = floor(n / 2) + 1. For example, if 8 nodes were configured, a quorum would be present if at least 5 nodes are communicating with each other. In this state, the cluster is said to be quorate, which means it can operate normally. (Any remaining nodes, 3 in the worst case, would see the cluster as inquorate and would relinquish all their resources.)

A two-node cluster is a special case as under the formula above, a quorum would consist of 2 functioning nodes. The Corosync config setting two_node: 1 overrides this and artificially sets the quorum to 1. This means that under a split-brain scenario (each node seeing only 1 vote), both nodes would see the cluster as quorate and try to become active, instead of both nodes going standby.
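
The arithmetic above can be sketched in a few lines. This is only an illustration of the formula and of the two_node override as described here, not Corosync's implementation; the function names are made up for the example.

```python
def quorum(configured_nodes, two_node=False):
    """Number of votes needed for the cluster to be quorate."""
    if two_node and configured_nodes == 2:
        # two_node: 1 artificially lowers the quorum to a single vote.
        return 1
    return configured_nodes // 2 + 1


def is_quorate(visible_votes, configured_nodes, two_node=False):
    """True if the nodes that can see each other form a quorum."""
    return visible_votes >= quorum(configured_nodes, two_node)


print(quorum(8))                          # 5
print(is_quorate(3, 8))                   # False
print(is_quorate(1, 2))                   # False
print(is_quorate(1, 2, two_node=True))    # True
```

The last two calls show the split-brain case: without the override a lone node in a two-node cluster is inquorate, with it both halves consider themselves quorate.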

In addition to this, Pacemaker itself also uses an internal scoring system for individual resources. This mechanism is described below and not directly related to the quorum.

10.4. Pacemaker

Pacemaker uses the communication service provided by Corosync to manage local resources. All status and configuration information is shared between all Pacemaker instances within the cluster as long as communication is up. This means that any configuration change done on any node will immediately and automatically be propagated to all other nodes in the cluster.

Pacemaker internally stores its configuration in an XML document called the CIB (Cluster Information Base), located at /var/lib/pacemaker/cib/cib.xml. However, this XML document must never be edited or modified directly. Instead, a shell-like interface crm is provided to talk to Pacemaker, query status information, alter cluster state, view and modify configuration, etc. Any configuration change done through crm is immediately reflected in the CIB XML, locally as well as on all other nodes.

warning	To repeat, do not ever directly modify Pacemaker’s XML configuration.

As an added bonus, just to make things more awkward, the syntax used by crm is not XML at all, but rather uses a Cisco-like hierarchy.

Commands can be issued to crm either directly from the shell as command-line arguments, or interactively by entering a Cisco-like shell. So for example, the current config can be viewed either from the shell with:

root@sp1:~# crm config show
...

Or interactively, as either:

root@sp1:~# crm
crm(live/sp1)# config
crm(live/sp1)configure# show
...

root@sp1:~# crm
crm(live/sp1)# config show
...

Interactive online help is available through the ls command, which lists the commands valid in the current context, and through the help command, which gives more verbose output.

10.5. Query Status

The current cluster status can be viewed with the top-level status command:

crm(live/sp1)# status
Stack: corosync
Current DC: sp2 (version unknown) - partition with quorum
Last updated: Fri Nov 22 18:38:06 2019
Last change: Fri Nov 22 18:25:28 2019 by hacluster via crmd on sp1

2 nodes configured
7 resources configured

Online: [ sp1 sp2 ]

Full list of resources:

Resource Group: g_vips
    p_vip_eth1_v4_1  (ocf::heartbeat:IPaddr):        Started sp1
    p_vip_eth2_v4_1  (ocf::heartbeat:IPaddr):        Started sp1
Resource Group: g_ngcp
    p_monit_services (ocf::ngcp:monit-services):     Started sp1
Clone Set: c_ping [p_ping]
    Started: [ sp1 sp2 ]
Clone Set: fencing [st-null]
    Started: [ sp1 sp2 ]

If the status is queried from sp2 instead, the output will be the same. Most importantly, the resources will not show up as "stopped" on sp2 but instead will be reported as running on sp1.
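
For scripting purposes, the resource lines of the status output can be picked apart with a small parser. The following is a sketch that assumes the output format shown above; the regular expression and the function name are illustrative and are not part of any NGCP tool.

```python
import re

# Matches lines such as:
#     p_vip_eth1_v4_1  (ocf::heartbeat:IPaddr):        Started sp1
RESOURCE_LINE = re.compile(r"^\s*(\S+)\s+\((\S+)\):\s+(\S+)(?:\s+(\S+))?\s*$")

SAMPLE = """\
Online: [ sp1 sp2 ]

Resource Group: g_vips
    p_vip_eth1_v4_1  (ocf::heartbeat:IPaddr):        Started sp1
    p_vip_eth2_v4_1  (ocf::heartbeat:IPaddr):        Started sp1
Resource Group: g_ngcp
    p_monit_services (ocf::ngcp:monit-services):     Started sp1
Clone Set: c_ping [p_ping]
    Started: [ sp1 sp2 ]
"""


def parse_resources(status_text):
    """Map each primitive resource to its agent, state, and node."""
    resources = {}
    for line in status_text.splitlines():
        match = RESOURCE_LINE.match(line)
        if match:
            name, agent, state, node = match.groups()
            resources[name] = {"agent": agent, "state": state, "node": node}
    return resources


for name, info in parse_resources(SAMPLE).items():
    print(name, info["state"], info["node"])
```

Because the status is cluster-wide, the parsed result is the same regardless of the node on which the command was run.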

The resources reported are described in the configuration section below.

10.6. Config Management

The NGCP templates do not operate on Pacemaker’s CIB XML directly, but instead produce a file in CRM syntax in /etc/pacemaker/cluster.crm. This file is not handled by Pacemaker directly, but instead is loaded into Pacemaker via the crm command config load replace. It shouldn’t be necessary to do this manually, as the script ngcp-ha-crm-reload handles this automatically, which is called from the config file’s postbuild script.

Changes to the config don’t need to be saved explicitly. This is done automatically by Pacemaker, as well as sharing any changes with all other members of the cluster.

In crm, changes made to the config are cached until made active with commit, or discarded with refresh. Changes to resource status can be avoided by enabling maintenance mode (see below).

important	However, since our config is loaded from a template, any changes done to the config through `crm` manually are transient and will be lost the next time a config reload happens.

The currently active config can be shown with config show and should be logically identical to the contents of /etc/pacemaker/cluster.crm:

crm(live/sp1)# config show
node 1: sp1
node 2: sp2
primitive p_monit_services ocf:ngcp:monit-services \
      meta migration-threshold=20 \
      meta failure-timeout=800 \
      op monitor interval=20 timeout=60 on-fail=restart \
      op_params on-fail=restart
primitive p_ping ocf:pacemaker:ping \
      params host_list="10.15.20.30 192.168.211.1" multiplier=1000 dampen=5s \
      meta failure-timeout=800 \
      op monitor interval=1 timeout=60 on-fail=restart \
      op_params timeout=60 on-fail=restart
primitive p_vip_eth1_v4_1 IPaddr \
      params ip=192.168.255.250 nic=eth1 cidr_netmask=24 \
      op monitor interval=5 timeout=60 on-fail=restart \
      op_params on-fail=restart
primitive p_vip_eth2_v4_1 IPaddr \
      params ip=192.168.1.161 nic=eth2 cidr_netmask=24 \
      op monitor interval=5 timeout=60 on-fail=restart \
      op_params on-fail=restart
primitive st-null stonith:null \
      params hostlist="sp1 sp2"
group g_ngcp p_monit_services
group g_vips p_vip_eth1_v4_1 p_vip_eth2_v4_1
clone c_ping p_ping
clone fencing st-null
location l_ngcp g_ngcp \
      rule pingd: defined pingd
colocation l_ngcp_with_vip inf: g_ngcp g_vips
location l_vips g_vips \
      rule pingd: defined pingd
order o_vip_then_ngcp Mandatory: g_vips g_ngcp
property cib-bootstrap-options: \
      have-watchdog=false \
      cluster-infrastructure=corosync \
      cluster-name=sp \
      stonith-enabled=yes \
      no-quorum-policy=ignore \
      startup-fencing=yes \
      maintenance-mode=false \
      last-lrm-refresh=1574443528
rsc_defaults rsc-options: \
      resource-stickiness=100

10.6.1. General Concepts

The configuration consists of a collection of objects of various types with various attributes.
Each object has a unique identifying name that can be used to refer to it.
Usually the type of the object is the first word and the identifying name is the second. For example, clone c_ping p_ping defines a clone type object with the name c_ping.
The unique name is used e.g. when deleting an object (config del ...), when starting or stopping a resource, when referring to resources from a group, etc.
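
The "type first, name second" convention can be illustrated with a short sketch that indexes the objects of a config snippet by name. It assumes the CRM syntax shown above, with a trailing backslash as line continuation; index_objects is a hypothetical helper, not part of crm.

```python
SNIPPET = r"""
primitive st-null stonith:null \
      params hostlist="sp1 sp2"
clone c_ping p_ping
clone fencing st-null
"""


def index_objects(config_text):
    """Return a dict mapping each object name to its object type."""
    # Join continuation lines so that every object sits on one line.
    joined = config_text.replace("\\\n", " ")
    objects = {}
    for line in joined.splitlines():
        words = line.split()
        if len(words) < 2:
            continue
        obj_type, name = words[0], words[1].rstrip(":")
        if name in objects:
            raise ValueError(f"duplicate object name: {name}")
        objects[name] = obj_type
    return objects


print(index_objects(SNIPPET))
```

The duplicate check mirrors the rule that every object name must be unique.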

10.6.2. Resources

Resources are the primary type of objects that Pacemaker handles. A resource is anything that can be started or stopped, and a resource is normally allowed to run on one node only. A resource is defined as a primitive type object.

Pacemaker supports many types of resources, all of which have different options that can be given to them. The config syntax defines that options given to a resource itself are prefixed with params, while options that influence how a resource should be managed are prefixed with meta. Options that are relevant to operations that can be performed on a resource are prefixed with op.

Resources are grouped into classes, providers, and types. Details about them (e.g. which options they support) can be obtained through the ra menu.

crm(live/sp1)ra# info IPaddr
Manages virtual IPv4 addresses (portable version) (ocf:heartbeat:IPaddr)
...

10.6.2.1. Shared IP Addresses

primitive p_vip_eth1_v4_1 IPaddr \
      params ip=192.168.255.250 nic=eth1 cidr_netmask=24 \
      op monitor interval=5 timeout=60 on-fail=restart \
      op_params on-fail=restart

This defines a resource of type IPaddr with name p_vip_eth1_v4_1 and the given parameters (address, netmask, interface). Pacemaker will check for the existence of the address every 5 seconds, with an action timeout of 60 seconds. If the monitor action fails, the resource is restarted.

10.6.2.2. System Services

primitive p_monit_services ocf:ngcp:monit-services \
      meta migration-threshold=20 \
      meta failure-timeout=800 \
      op monitor interval=20 timeout=60 on-fail=restart \
      op_params on-fail=restart

While Pacemaker has support for native systemd services, for the time being we’re still relying on monit to manage our services. Services are therefore defined in Pacemaker almost exactly as they were in Heartbeat v2, through a monit-services start/stop script. The old Heartbeat script was /etc/ha.d/resource.d/monit-services and the new script used by Pacemaker is /etc/ngcp-ocf/monit-services.

info	The primary difference between the two scripts is the support for a monitor action for Pacemaker, which can be configured via the config.yml variable ha.monitor_services. It can be set to full to periodically use the output of ngcp-service summary to determine whether all services are running or not. The default value is none, which preserves backwards compatibility with the behavior of Heartbeat v2 by performing no periodic checks of the status of the services.

meta migration-threshold=20 means that the resource will be migrated away (instead of restarted) after 20 failures. See the discussion on failure counts below.
meta failure-timeout=800 means that the failure count should be reset to zero if the last failure occurred more than 800 seconds ago. (However, the actual timer depends on the cluster-recheck-interval.)
op monitor interval=20 timeout=60 on-fail=restart means that the monitor action is run every 20 seconds with a timeout of 60 seconds, and that the resource is restarted on failure.

10.6.2.3. Ping Nodes

primitive p_ping ocf:pacemaker:ping \
      params host_list="10.15.20.30 192.168.211.1" multiplier=1000 dampen=5s \
      meta failure-timeout=800 \
      op monitor interval=1 timeout=60 on-fail=restart \
      op_params timeout=60 on-fail=restart

The builtin pingd service, using a resource name that intelligently is not named pingd but rather just ping, replaces Heartbeat’s ping nodes. It supports multiple ping backends, and uses fping by default.

Each configured ping node (each entry in host_list) produces a score of 1 if that ping node is up. The scores are summed up and multiplied by the multiplier. So in the example above, a score of 2000 is generated if both ping nodes are up. Pacemaker will then prefer the node which produces the higher score.

dampen=5s means to wait 5 seconds after a change occurred to prevent transient glitches from causing service flapping.
Other options are the same as described above.
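
The score calculation described above amounts to the following. This is a minimal sketch of the arithmetic only; the function name and the reachability data are made up for the example.

```python
def ping_score(reachable, multiplier=1000):
    """Score for one node: number of reachable ping nodes times multiplier.

    reachable maps each entry of host_list to True (up) or False (down).
    """
    return sum(1 for up in reachable.values() if up) * multiplier


# Hypothetical situation: sp1 reaches both ping nodes, sp2 only one.
scores = {
    "sp1": ping_score({"10.15.20.30": True, "192.168.211.1": True}),
    "sp2": ping_score({"10.15.20.30": True, "192.168.211.1": False}),
}
print(scores)                       # {'sp1': 2000, 'sp2': 1000}
print(max(scores, key=scores.get))  # sp1
```

In this situation the node with the higher score, sp1, is the preferred one.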

10.6.2.4. Fencing/STONITH

primitive st-null stonith:null \
      params hostlist="sp1 sp2"

Pacemaker will generate a warning if no fencing mechanism is configured, therefore we configure the null fencing mechanism.

Pacemaker supports several proper fencing mechanisms, and these might be supported in the future.

10.6.2.5. Groups

group g_ngcp p_monit_services
group g_vips p_vip_eth1_v4_1 p_vip_eth2_v4_1

To manage, control, and restrict multiple resources at the same time, resources can be grouped into single objects. The group g_ngcp is pointless for the time being (it contains only a single other resource) but will become useful once native systemd resources are in use. The group g_vips ensures that all shared IP addresses are active at the same time.

10.6.2.6. Clones

clone c_ping p_ping
clone fencing st-null

Since a single resource normally only runs on one node, a clone can be defined to allow a resource to run on all nodes. We want the pingd service and the fencing service to always run on all nodes.

10.6.2.7. Constraints

colocation l_ngcp_with_vip inf: g_ngcp g_vips

This tells Pacemaker that we want to force the g_ngcp resource on the same node that is running the g_vips resource.

location l_ngcp g_ngcp \
      rule pingd: defined pingd
location l_vips g_vips \
      rule pingd: defined pingd

This tells Pacemaker that these resources depend on the pingd service being healthy. If pingd fails on one node (ping nodes are unavailable), then Pacemaker will shut down the constrained resources.

order o_vip_then_ngcp Mandatory: g_vips g_ngcp

This tells Pacemaker that the shared IP addresses must be up and running before the system services can be started.

10.6.3. Cluster Options

property cib-bootstrap-options: \
      have-watchdog=false \
      cluster-infrastructure=corosync \
      cluster-name=sp \
      stonith-enabled=yes \
      no-quorum-policy=ignore \
      startup-fencing=yes \
      maintenance-mode=false \
      last-lrm-refresh=1574443528

Relevant options are:

have-watchdog=false indicates that no external watchdog service such as SBD is in use.
cluster-name=sp is to match the configuration of Corosync.
stonith-enabled=yes is required to suppress a warning message, even though no real STONITH (null fencing mechanism) is in use.
no-quorum-policy=ignore tells Pacemaker to continue normally if quorum is lost. This is the only setting that makes sense in a two-node cluster.
startup-fencing=yes is also needed to suppress a warning even though no real fencing is in use. This tells Pacemaker to shoot nodes that are not present immediately after startup.
maintenance-mode=false tells Pacemaker to actually perform resource actions. If maintenance mode is enabled, Pacemaker will continue to run, but will not start or stop any services. This should be enabled before loading a new config, and then disabled afterwards. The script ngcp-ha-crm-reload does this.

10.6.4. Failure Counts

Pacemaker keeps a failure count for each resource, which is somewhat hidden from view but can strongly influence its behaviour. Each time a service fails (either during runtime or during startup), the failure count is increased by one. If the failure count reaches the configured migration-threshold, Pacemaker will cease trying to start the service on that node and will migrate the service away to another node. In crm status this simply shows up as stopped.

Failure counts can be cleared automatically if the failure-timeout setting is configured for a resource. This timeout is counted after the last time the resource has failed, and is checked periodically according to the cluster-recheck-interval. In other words, a very short failure timeout won’t have any effect unless the recheck interval is also very short.
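
The interaction between the two timers can be illustrated with a simplified model. It assumes that rechecks happen at fixed multiples of the recheck interval, counted from time zero, which is a simplification for illustration and not a description of Pacemaker's internal scheduling. The function name and the numbers are chosen for the example.

```python
import math


def failcount_cleared_at(last_failure, failure_timeout, recheck_interval):
    """Earliest time (in seconds) at which the failure count is reset.

    The timeout starts at the last failure, but expiry is only noticed
    at the next recheck, assumed to occur at multiples of recheck_interval.
    """
    due = last_failure + failure_timeout
    return math.ceil(due / recheck_interval) * recheck_interval


# Last failure at t=100 s, recheck interval assumed to be 900 s.
print(failcount_cleared_at(100, 800, 900))  # 900
print(failcount_cleared_at(100, 10, 900))   # 900
print(failcount_cleared_at(100, 10, 5))     # 110
```

With a 900-second recheck interval, a 10-second failure timeout clears the count no earlier than an 800-second one. Only when the recheck interval is also short does the short timeout take effect.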

important	If no failure timeout is configured, any existing failure count must be cleared manually.

10.6.4.1. Checking Failure Counts

The failure count for a resource can be checked from the shell via crm_failcount, for example:

root@sp1:~# crm_failcount -G -r p_monit_services
scope=status  name=fail-count-p_monit_services value=0

The failure count on a different node can also be examined:

root@sp1:~# crm_failcount -G -r g_ngcp -N sp2
scope=status  name=fail-count-g_ngcp value=0

The same can be done via crm:

crm(live/sp1)resource# failcount g_vips show sp1
scope=status  name=fail-count-g_vips value=0
crm(live/sp1)resource# failcount c_ping show sp2
scope=status  name=fail-count-c_ping value=0
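
The output lines shown above are simple key=value pairs and are easy to consume from a script. The sketch below assumes exactly this format. Treating a value of INFINITY as one million is an assumption based on the score handling described in the Resource Scores section, and parse_failcount is a hypothetical helper.

```python
PREFIX = "fail-count-"


def parse_failcount(line):
    """Turn one line of failcount output into (resource, count)."""
    fields = dict(part.split("=", 1) for part in line.split())
    name = fields["name"]
    if name.startswith(PREFIX):
        name = name[len(PREFIX):]
    value = fields["value"]
    # Assumption: an infinite failure count is reported as INFINITY.
    count = 1_000_000 if value == "INFINITY" else int(value)
    return name, count


print(parse_failcount("scope=status  name=fail-count-p_monit_services value=0"))
```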

As a shortcut, the script ngcp-ha-show-failcounts is provided:

root@sp1:~# ngcp-ha-show-failcounts
p_vip_eth1_v4_1:
  sp1: 0
  sp2: 0
p_vip_eth2_v4_1:
  sp1: 0
  sp2: 0
p_monit_services:
  sp1: 0
  sp2: 0

10.6.4.2. Clearing Failure Counts

Analogous to checking a failure count, it can be cleared using any one of these methods:

root@sp1:~# crm_failcount -D -r p_monit_services
Cleaned up p_monit_services on sp1
root@sp1:~# crm resource failcount g_ngcp delete sp2
Cleaned up p_monit_services on sp2
root@sp1:~# crm
crm(live/sp1)# resource
crm(live/sp1)resource# failcount c_ping delete sp1
Cleaned up p_ping:0 on sp1
Cleaned up p_ping:1 on sp1
crm(live/sp1)resource# bye
root@sp1:~# ngcp-ha-clear-failcounts
Cleaned up p_monit_services on sp2
Cleaned up p_monit_services on sp1

In addition, the crm command resource cleanup also resets failure counts.

10.6.5. Resource Scores

Pacemaker uses an internal scoring system to determine which resources to run where. A resource will be run on the node on which it received the highest score. If a resource has a negative score, that resource will not be run at all. If a resource has the same score on multiple nodes, then the resource will be run on any one of those nodes. Scores can be calculated and acted upon through various config settings.

A score value of infinity (and negative infinity) is provided to force certain states. Despite the name, it is not infinite at all, but evaluates to a static value of one million. This can be used to artificially manipulate resource scores to force running a resource on a particular node, or to forbid a resource from running on particular nodes.

Scores can be inspected through the crm command resource scores.
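
The placement rules above can be summarised in a short sketch. It models only what is described in this section (highest score wins, negative scores are excluded, infinity is one million) and is not Pacemaker's actual allocation algorithm. The names and score values are illustrative.

```python
INFINITY = 1_000_000


def choose_node(scores):
    """Pick the node for a resource from a {node: score} mapping.

    Returns None if the resource has a negative score everywhere,
    in which case it is not run at all.
    """
    candidates = {node: s for node, s in scores.items() if s >= 0}
    if not candidates:
        return None
    # On a tie, any one of the best nodes may be chosen.
    return max(candidates, key=candidates.get)


print(choose_node({"sp1": 2100, "sp2": 2000}))            # sp1
print(choose_node({"sp1": -INFINITY, "sp2": 0}))          # sp2
print(choose_node({"sp1": -INFINITY, "sp2": -INFINITY}))  # None
```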

10.7. Common Tasks

10.7.1. Takeover and Standby

The commands ngcp-make-active and ngcp-make-standby work normally. Under Pacemaker, they function through the crm command resource move to create a temporary location constraint on g_vips. This can be done manually through:

crm resource move g_vips sp1 30

The lifetime of 30 seconds is needed because g_ngcp depends on the location of g_vips, and therefore g_ngcp needs to be stopped before g_vips can be stopped. The location constraint must remain active until g_ngcp has been completely and successfully stopped.

info	These commands only affect the status of the running resources, and not the status of the node itself. This means that after going standby, Pacemaker will immediately be ready to take over the resources again if needed. See below for a discussion on node status.

Similarly, ngcp-check-active uses the output of crm resource status g_vips to determine whether the local node is active or not.

10.7.2. Node Status (Online/Standby)

In addition to the status and location of individual resources, nodes themselves can also go into standby mode. The submenu node in crm has the relevant options.

A node in standby mode will not only give up all of its resources, but will also refuse to take them over until it’s back online. Therefore, it’s possible to set both nodes to standby mode and shut down all resources on both nodes.

info	A node in standby mode will still participate in GCS communications and remain visible to the rest of the cluster.

Use crm node standby to set the local node to standby mode. A remote node can be set to standby using e.g. crm node standby sp2.

By default, the standby mode remains active until it’s cancelled manually (a lifetime of forever). Alternatively, a lifetime of reboot can be specified to tell Pacemaker that after the next reboot, the node should automatically come back online. Example: crm node standby sp2 reboot

To cancel standby mode, use crm node online, optionally followed by the node name.

To show the current status of all nodes, use crm node show. The top-level crm status also shows this.

10.7.3. Maintenance Mode

If Pacemaker’s maintenance mode is enabled, it will continue to operate normally, i.e. continue to run and monitor resources, but will refuse to stop or start any resources. This is useful to make changes to the running config, and is done automatically by ngcp-ha-crm-reload.

To enable and disable maintenance mode:

crm maintenance on
crm maintenance off

or using the lower-level method:

crm configure property maintenance-mode=true
crm configure property maintenance-mode=false
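
The pattern of wrapping a config load in maintenance mode can be sketched as follows. This is not the implementation of ngcp-ha-crm-reload, only an illustration of the sequence of crm calls it is described as performing. The function names are hypothetical, and the runner parameter exists only so that the sequence can be exercised without a live cluster.

```python
import subprocess
from contextlib import contextmanager


def run_crm(*args, runner=subprocess.run):
    """Run one crm command and fail loudly if it returns an error."""
    return runner(["crm", *args], check=True)


@contextmanager
def maintenance_mode(runner=subprocess.run):
    """Enable maintenance mode for the duration of the with-block."""
    run_crm("maintenance", "on", runner=runner)
    try:
        yield
    finally:
        # Always leave maintenance mode, even if the load failed.
        run_crm("maintenance", "off", runner=runner)


def reload_config(path="/etc/pacemaker/cluster.crm", runner=subprocess.run):
    """Load the templated CRM file while resources are left untouched."""
    with maintenance_mode(runner=runner):
        run_crm("configure", "load", "replace", path, runner=runner)


# On a cluster node, as root:
# reload_config()
```

Using a context manager ensures that maintenance mode is switched off again even when loading the config raises an error.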

10.7.4. CLI Alternatives

Several of the commands available through crm are also available through standalone CLI tools, such as crm_failcount, crm_standby, crm_resource, etc. They generally have a less friendly syntax, so researching them is left as an exercise for the reader.