Thursday, March 05, 2015

Heartbeat Inclusion Protocol Handling in MySQL Cluster

The below description was added to the MySQL Cluster 7.4 source code
and describes how new nodes are added into the heartbeat protocol at
startup of a node.

The protocol to include our node in the heartbeat protocol starts when
we call execCM_INFOCONF. We start by opening communication to all nodes
in the cluster. When we start this protocol we don't know anything about
which nodes are up and running and we don't which node is currently the
president of the heartbeat protocol.

For us to be successful with being included in the heartbeat protocol we
need to be connected to all nodes currently in the heartbeat protocol. It
is important to remember that QMGR (the source code module that
controls the heartbeat handling) sees a node as alive if it is included
in the heartbeat protocol. Higher level notions of aliveness is handled
primarily by the DBDIH block (DBDIH is responsible for the database
level of distribution such as which nodes have up-to-date replicas of a
certain database fragment), but also to some extent by NDBCNTR
(NDBCNTR is a source module that controls restart phases and is a
layer on top of QMGR and below the database handling level).

The protocol starts by the new node sending CM_REGREQ to all nodes it is
connected to. Only the president will respond to this message. We could
have a situation where there currently isn't a president choosen. In this
case an election is held whereby a new president is assigned. In the rest
of this comment we assume that a president already exists.

So if we were connected to the president we will get a response to the
CM_REGREQ from the president with CM_REGCONF. The CM_REGCONF contains
the set of nodes currently included in the heartbeat protocol.

The president will send in parallel to sending CM_REGCONF a CM_ADD(prepare)
message to all nodes included in the protocol.

When receiving CM_REGCONF the new node will send CM_NODEINFOREQ with
information about version of the binary, number of LDM workers and
MySQL version of binary.

The nodes already included in the heartbeat protocol will wait until it
receives both the CM_ADD(prepare) from the president and the
CM_NODEINFOREQ from the starting node. When it receives those two
messages it will send CM_ACKADD(prepare) to the president and
CM_NODEINFOCONF to the starting node with its own node information.

When the president received CM_ACKADD(prepare) from all nodes included
in the heartbeat protocol then it sends CM_ADD(AddCommit) to all nodes
included in the heartbeat protocol.

When the nodes receives CM_ADD(AddCommit) from the president then
they will enable communication to the new node and immediately start
sending heartbeats to the new node. They will also include the new
node in their view of the nodes included in the heartbeat protocol.
Next they will send CM_ACKADD(AddCommit) back to the president.

When the president has received CM_ACKADD(AddCommit) from all nodes
included in the heartbeat protocol then it sends CM_ADD(CommitNew)
to the starting node.

This is also the point where we report the node as included in the
heartbeat protocol to DBDIH as from here the rest of the protocol is
only about informing the new node about the outcome of inclusion
protocol. When we receive the response to this message the new node
can already have proceeded a bit into its restart.

The starting node after receiving CM_REGCONF waits for all nodes
included in the heartbeat protocol to send CM_NODEINFOCONF and
also for receiving the CM_ADD(CommitNew) from the president. When
all this have been received the new nodes adds itself and all nodes
it have been informed about into its view of the nodes included in
the heartbeat protocol and enables communication to all other
nodes included therein. Finally it sends CM_ACKADD(CommitNew) to
the president.

When the president has received CM_ACKADD(CommitNew) from the starting
node the inclusion protocol is completed and the president is ready
to receive a new node into the cluster.

It is the responsibility of the starting nodes to retry after a failed
node inclusion, they will do so with 3 seconds delay. This means that
at most one node per 3 seconds will normally be added to the cluster.
So this phase of adding nodes to the cluster can add up to a little bit
more than a minute of delay in a large cluster starting up.

We try to depict the above in a graph here as well:

New node           Nodes included in the heartbeat protocol     President
------------------------------------------------------------------------------------
----CM_REGREQ--------------------->->
----CM_REGREQ---------------------------------------------------------->

< ---------------CM_REGCONF---------------------------------------------
                                  << ------CM_ADD Prepare ---------------

-----CM_NODEINFOREQ--------------- >>

Nodes included in heartbeat protocol can receive CM_ADD(Prepare) and
CM_NODEINFOREQ in any order.

<< ---CM_NODEINFOCONF-------------- --------CM_ACKADD(Prepare)--------->>

                                  << -------CM_ADD(AddCommit)------------

Here nodes enables communication to new node and starts sending heartbeats

                                  ---------CM_ACKADD(AddCommit)------- >>

Here we report to DBDIH about new node included in heartbeat protocol
in master node.

< ----CM_ADD(CommitNew)--------------------------------------------------

Here new node enables communication to new nodes and starts sending
heartbeat messages.

-----CM_ACKADD(CommitNew)---------------------------------------------- >

Here the president can complete the inclusion protocol and is ready to
receive new nodes into the heartbeat protocol.

No comments: