Monday, March 23, 2015

Quick setup to run sysbench using MySQL Cluster 7.4

In developing MySQL Server and MySQL Cluster we use four types of testing.
We use unit testing, we use functional testing, we use system testing and
we use benchmark testing.

For the unit tests and functional tests we use the MTR test framework. This
is dealing with most issues of how to use SQL to query the MySQL database
engine.

For system testing the InnoDB team has a test framework that is used to ensure
that InnoDB will survive crashes of many sorts. In a similar fashion MySQL
Cluster uses the autotest framework that I described in an earlier blog:
Link to earlier blog.

For benchmark testing we have a few frameworks, I am going to describe the one
that I use on a regular basis. This framework can be downloaded from:
Link to download.

I started developing this framework in 2006. So it has been extended for quite
some time and it is a lot simpler to run it now than it ever was before.

In this blog I am going to describe a very minimal setup needed to run a
sysbench benchmark against MySQL Cluster 7.4.

These benchmarks can be used to run the following types of tests:
1) Sysbench on a single server with InnoDB as storage engine
2) Sysbench on a single server with NDB as storage engine
3) Sysbench against InnoDB with InnoDB and sysbench on different servers
4) Sysbench against NDB with different programs on different servers
5) flexAsynch for MySQL Cluster testing on a set of servers
6) DBT2 on a single server with InnoDB as storage engine
7) DBT2 on a single server with NDB as storage engine
8) DBT2 using NDB as a storage engine against a lot of servers

Running against NDB as a storage engine means using MySQL Cluster.
For InnoDB testing I mostly focus on 1) above. For MySQL Cluster I mostly
focus on 2) and 5), but occasionally I also test 8). 3), 4), 6) and
7) are perfectly possible setups and have been tested as well, but I don't
personally use them as much.

I will write a few blogs about how to set up a benchmark using these benchmark
scripts. In this blog I will first describe the basics of setting up a sysbench
benchmark against MySQL Cluster. Then a basic flexAsynch benchmark, and finally
a basic DBT2 benchmark on a single server.

After describing the basic configuration setup, I will describe in detail the
various parts to configure. There are numerous things to configure:

Mandatory items:
1) Which benchmark to run

2) Pointer to file locations for tarballs, data directory and install directory

3) Pointers to the tarballs for MySQL, sysbench and DBT2; the benchmark
   scripts themselves are located in the DBT2 tarball.

4) Definition of which servers the various programs under test will
   execute on.

Optional parts of configuring the SUT (System Under Test)
5) Configuration of the MySQL Server. A MySQL Server is also started for
   flexAsynch, but it is not really used in that benchmark.

6) Configuration of InnoDB if running test against InnoDB.

7) Configuration of NDB storage engine and NDB nodes if running against NDB
   storage engine.

Optional parts of configuring the actual benchmark
8) Sysbench setup. This configures how the actual sysbench benchmark is run:
for how long to run, which sysbench variant to use, how many tables to use and
many other details. This part is obviously only interesting when running a
sysbench test.

9) DBT2 setup. There are numerous ways to adapt the basic DBT2 benchmark
   although the benchmark transactions are always the same.

10) flexAsynch benchmark setup. Also in flexAsynch there are multiple ways to
    adapt the benchmark.

11) In connection with the flexAsynch benchmark one can also benchmark restarts
    in MySQL Cluster. First the scripts use flexAsynch to load the database and
    then various forms of restarts can be tested.

The full capabilities of these benchmark scripts are quite extensive. They are
the result of 9 years of incremental design, where we have tried to extend the
use cases of these benchmarks to support as many environments as possible, but
also to automate as much as possible.

Obviously, understanding all of this in one go is a tad difficult. So we will
start by describing how to set up a very simple sysbench benchmark using
one machine running sysbench against the NDB storage engine, using the latest
development version of MySQL Cluster 7.4.

This is a good first step to get acquainted with the benchmark scripts. I will
add more details in later blogs about running more advanced setups.

The scripts have mostly been used on Linux, with a fair amount of use on
Solaris as well. When preparing this blog I ran them on Mac OS X. This requires
some changes that will soon arrive in the dbt2-0.37.50.7 update of the
benchmark scripts, so preferably use Linux when trying this out for the
first time.

The first step in running this test is to create a directory on disk where the
test files will be located. I usually use my home directory and create a
directory called bench there. So on Mac OS X that would be /Users/mikael/bench.
On a Linux machine I would normally use /home/mikael/bench.

In this directory I create four directories. One directory is called
tarballs; this is where I will place the DBT2, sysbench and MySQL tarballs.
Another directory is called ndb; this will be used as the data directory
and will be filled automatically by the scripts. Next I create a directory
called mysql; this is where the binary installation will be placed.
Finally I create a directory called sysbench.
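
For this example setup the commands look like this (a sketch; use your own
bench directory):

mkdir -p /Users/mikael/bench #use your own bench directory here
cd /Users/mikael/bench
mkdir tarballs ndb mysql sysbench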

Next step is to download the DBT2 tarball and the sysbench tarball from the
website:
http://dev.mysql.com/downloads/benchmarks.html

Copy those tarballs to the tarballs directory created above.

The next step is to get the MySQL tarball that you want to test. Both source
tarballs and binary tarballs work. Personally I obviously mostly work with
source tarballs since I want to have full control over everything, but it's
perfectly fine to use a binary tarball as well.

Place this tarball also in the tarballs directory.
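
Assuming the tarballs were downloaded to the Downloads directory, copying them
into place could look something like this (adjust the paths to wherever your
tarballs actually ended up):

cd /Users/mikael/bench/tarballs
cp ~/Downloads/dbt2-0.37.50.6.tar.gz .
cp ~/Downloads/sysbench-0.4.12.7.tar.gz .
cp ~/Downloads/mysql-cluster-gpl-7.4.5.tar.gz .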

Now in order to be able to run the benchmark scripts it is necessary to
get the top-level script copied from the DBT2 tarball. To do this perform
the following commands:

cd /Users/mikael/bench #use your bench directory here
cd tarballs
tar xfz dbt2-0.37.50.6.tar.gz
cp dbt2-0.37.50.6/scripts/bench_prepare.sh ..
rm -rf dbt2-0.37.50.6

Now ls from /Users/mikael/bench should look like this:
Mikaels-MacBook-Pro:bench mikael$ ls
bench_prepare.sh ndb tarballs
mysql sysbench
Mikaels-MacBook-Pro:bench mikael$ pwd
/Users/mikael/bench

and ls from tarballs should look something like this:
Mikaels-MacBook-Pro:bench mikael$ cd tarballs
Mikaels-MacBook-Pro:tarballs mikael$ ls
dbt2-0.37.50.6.tar.gz mysql-cluster-gpl-7.4.5.tar.gz sysbench-0.4.12.7.tar.gz

Now it's time to create the config file.
This file is placed in a directory of its own under the bench directory (the
sysbench directory in this case) and it is always called autobench.conf.

Here is the content of the file in my simplistic test run.
#Define benchmark to run
BENCHMARK_TO_RUN="sysbench"
#Define file locations of datadir, installdir and tarballs
TARBALL_DIR="/Users/mikael/bench/tarballs"
REMOTE_BIN_INSTALL_DIR="/Users/mikael/bench/mysql"
DATA_DIR_BASE="/Users/mikael/bench/ndb"
#Define tarball versions
MYSQL_VERSION="mysql-cluster-gpl-7.4.5"
MYSQL_BASE="5.6"
SYSBENCH_VERSION="sysbench-0.4.12.7"
DBT2_VERSION="dbt2-0.37.50.7"
#Define engine to use
ENGINE="ndb"
#Define way to build
USE_BINARY_MYSQL_TARBALL="no"
#Define servers to use
SERVER_HOST="127.0.0.1"
NDB_MGMD_NODES="127.0.0.1"
NDBD_NODES="127.0.0.1"
#Configure MySQL Server
SERVER_PORT="3316"
USE_MALLOC_LIB="no"
#Configure NDB part
NDB_REPLICAS="1"
NDB_DATA_MEMORY="2G"
NDB_INDEX_MEMORY="200M"
NDB_MAX_NO_OF_EXECUTION_THREADS="8"
#Configure sysbench
RUN_RW="yes"
SYSBENCH_ROWS="100000"
THREAD_COUNTS_TO_RUN="16;32"
MAX_TIME="60"
NUM_TEST_RUNS="1"

First we need to define the benchmark to run, which is sysbench in our case.
We need to point out the tarball directory, the directory where the
installation will be placed (REMOTE_BIN_INSTALL_DIR) and the data
directory.

The next step is to provide the names of the tarballs; this is done in
MYSQL_VERSION, SYSBENCH_VERSION and DBT2_VERSION. Since the scripts
have to adapt to different MySQL versions we also need to specify the base
version of MySQL, which in the case of MySQL Cluster 7.4 is 5.6. The scripts
currently support 5.1, 5.5 and 5.6. Some adaptation is needed for each new
MySQL version, so 5.7 is not yet supported, but work is underway on this.

Next we define whether we want to use innodb or ndb as the storage engine
in this test run.

We also define in the variable USE_BINARY_MYSQL_TARBALL whether we use a
source tarball or a binary tarball. In this case I use a source tarball.

We need to define the servers. In the case of sysbench there is always
exactly one MySQL Server, while there can be one or more NDB management
servers and one or more NDB data nodes. In this case we want a simple
setup on one machine, so there is one of each sort and they are all
located on 127.0.0.1, the local host.

We always configure the MySQL Server port to ensure that we don't have a
problem running the benchmark if another MySQL Server is running on the
host. By default the scripts use libtcmalloc instead of libc malloc. We avoid
this default since there are better malloc libraries available, and this
variable should be set deliberately rather than by default in most cases.
Not using a special malloc library will definitely work, so that is what we
do here as a first step.

Since we run with the NDB storage engine we need to set a few configuration
variables. The NDB configuration can be quite extensive, but here we strive
for a simple one. We want only one replica, which we set using NDB_REPLICAS.
We need to set the size of the NDB data memory and the NDB index memory.
Finally we set the number of execution threads to 8, which is a fairly
standard setting. This should be sufficient for a simple benchmark run of
sysbench.

The final section sets up a minimal set of parameters to run sysbench.
We run the OLTP complex RW case. We insert 100,000 rows into the table and
run two tests, one with 16 MySQL connections and one with 32 connections.
Each test will run for 1 minute and each test is run only once.

This is a minimal configuration. It is possible to make it even smaller,
but then it becomes hard to understand what the test actually does, since
one needs to be an expert in the default settings, which even I don't
remember for all variables.

Now this file autobench.conf is placed in
/Users/mikael/bench/sysbench/autobench.conf
in my case.

Now it is time to run the test case.

The test is executed by the following commands:
cd /Users/mikael/bench
./bench_prepare.sh --default-directory /Users/mikael/bench/sysbench

The test will run everything to completion. Actually, I more or less always
use one more parameter, --skip-cleanup, which means that the built source
code isn't cleaned away at the end of the test run, so I don't have to repeat
the build of the source code if I want to rerun the test. If I then run it a
second time I use both --skip-cleanup and --skip-build to ensure that I skip
rebuilding the source and again skip cleaning away the built source code at
the end of the test run.
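
So in practice my runs typically look like this (using the directories from
this example):

cd /Users/mikael/bench
./bench_prepare.sh --default-directory /Users/mikael/bench/sysbench --skip-cleanup
#second and later runs reuse the already built binaries
./bench_prepare.sh --default-directory /Users/mikael/bench/sysbench --skip-cleanup --skip-build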

The test result is located in the file:
/Users/mikael/bench/sysbench/final_results.txt

The output from the sysbench program in this particular case can be found
in the file:
/Users/mikael/bench/sysbench/sysbench_results/oltp_complex_rw_1.res

Good luck in benchmarking.

My favourite tool for following the benchmarks is top. I usually lock the
various threads onto different CPUs to make it easier to understand how the
different threads behave. I haven't done this here since this is just a basic
setup, but top is still a good tool to follow what is going on in the server.

One final note is that running this basic setup will require 20-30 GByte of
disk space, and during the test run there will be a fairly high load on
the disk. Since many developer machines are not suitable for benchmarks of
database engines, one way to adapt the benchmark for a developer
environment is to set the config parameter NDB_DISKLESS="yes".
This essentially makes the file system behave as if it were located on
/dev/null.
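
In this setup that simply means adding one more line to the NDB section of
the autobench.conf shown above:

#Configure NDB part (addition for a developer environment)
NDB_DISKLESS="yes"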

Friday, March 20, 2015

Controlling checkpoint speed in MySQL Cluster 7.4

A question from digitalpoint on a previous blog about 7.4 restarts requires
an explanation of how we control checkpoint speed in MySQL Cluster 7.4.
Since the explanation is fairly lengthy I will do it in this blog instead
of as a reply to a blog comment.

First of all, some introduction into checkpoints in MySQL Cluster 7.4. We
actually use the term checkpoint for two things: LCPs (local checkpoints)
and GCPs (global checkpoints). LCPs are the traditional checkpoints and the
kind described in this blog. GCPs are about forming groups of transactions
that will be durable after a restart. GCPs happen very often, whereas an LCP
is a fairly long process.

So first I'll explain why we're doing checkpoints (LCPs) in the first
place. There are two reasons for doing LCPs. The first is that MySQL
Cluster uses a log-based approach for recovery. Every update, insert,
delete and write is logged into the REDO log in MySQL Cluster. We also
have an UNDO log for disk data. To prevent the logs from growing
indefinitely we use LCPs to be able to cut the log tail.

The logs have a fixed size; the REDO log cannot be changed in size, whereas
for the UNDO log you can add new files to a logfile group. When a log is full
no more updates are allowed, since we can no longer recover the database if
we allow non-logged updates.

All the above is true for tables that require durability. MySQL Cluster
actually also supports non-durable tables. These tables will be available
as long as the cluster survives; if the entire cluster fails, those
tables will be restored to an empty state. Most applications need durable
tables, but there are also applications where durability doesn't make
sense and for those it is better to run the tables without durability.

Checkpointing is obviously only relevant to durable tables where we recover
based on a combination of logs and checkpoints.

An LCP is used to cut the log tail. This means that we always need to write
the checkpoint at such a speed that we never reach a point where we run out
of log (both REDO log and UNDO log). As an example, assume that at top speed
we can execute 100,000 updates per second, where each update changes 20
fields totalling 200 bytes. The REDO log record in this case will be about
200 bytes plus 20 * 4 bytes of field info and around 70 bytes of fixed
overhead. So about 350 bytes times 100,000 per second of REDO log records
are created, which is 35 MByte per second. Assume that the data size is
50 GByte. We can either calculate the REDO log size based on the chosen
checkpoint write speed, or the other way around, calculate the checkpoint
write speed based on the REDO log size. If we start with a checkpoint write
speed of 100 MByte per second then it will take about 500 seconds to write
an LCP. We need to have enough REDO log for at least 2 checkpoints and it is
good to have at least a 100% safety margin. So the REDO log needs to be big
enough for 2000 seconds of log generation, which means that 70 GByte of REDO
log should be sufficient. The size of the REDO log is the product of the
number of log parts, the number of log files and the log file size.
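
As a small shell sketch of the same arithmetic (the numbers are just the
assumptions of the example above, not recommendations):

BYTES_PER_REDO_RECORD=350 #about 200 bytes data + 20 * 4 bytes field info + 70 bytes overhead
UPDATES_PER_SECOND=100000
CHECKPOINT_SPEED_MB=100   #checkpoint write speed in MByte per second
DATA_SIZE_MB=50000        #50 GByte of data
REDO_RATE_MB=$(( BYTES_PER_REDO_RECORD * UPDATES_PER_SECOND / 1000000 )) #35 MByte/second
LCP_SECONDS=$(( DATA_SIZE_MB / CHECKPOINT_SPEED_MB ))                    #500 seconds per LCP
REDO_SECONDS=$(( 4 * LCP_SECONDS ))                     #2 LCPs plus a 100% safety margin
echo "REDO log needed: $(( REDO_RATE_MB * REDO_SECONDS / 1000 )) GByte"  #70 GByte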

So being able to cut the log tail is the only reason for running LCPs in
normal operation when no restarts are ongoing. Obviously keeping the log
short also means shorter restart times since there is less log to execute.
So increasing the checkpoint write speed means faster restart times at the
expense of more work in normal operation.

We also use the LCPs as part of restarts. We need to run an entire LCP where
the starting node participates to ensure that the starting node is
recoverable. In this scenario we simply want the LCP to run as fast as
possible to complete the restart as soon as possible.

So this means that we have two different speeds for checkpoint writes: the
speed needed during normal operation and the speed needed during restarts.
Actually we can even separate a total cluster restart (we call this a system
restart) from node restarts. The reason is that during an LCP in a system
restart there is no activity other than performing the LCP, so we can set
the checkpoint write speed to the absolute maximum possible. During another
node's restart we also need to execute user transactions, so we want to find
a balance between speeding up the restart and being able to serve the
current users of the cluster.

So this means that we need three config parameters. We call them
MaxDiskWriteSpeed (speed during normal operation),
MaxDiskWriteSpeedOtherNodeRestart (speed while another node is performing
a node restart) and MaxDiskWriteSpeedOwnRestart (speed during our own node
restart and, more importantly, of all nodes during a system restart).

Obviously there is a cap on the write speed we can handle; there is a limit
both to what the disk drives can handle and to how much each CPU running an
LDM thread can output.

My experiments show that one CPU running an LDM thread can output about
100 MByte per second if used only for generation of checkpoint writes.
The checkpoint format is essentially packed rows in a format that can be
restored by simply running normal inserts. We have changed back and
forth between different checkpoint formats; we have used page-based schemes
that required UNDO logging also of memory tables in earlier versions of
MySQL Cluster. The benefit of the current scheme is that it enables us to
avoid UNDO logs entirely for memory tables.

Modern disks can write a few hundred MBytes per second. Even cheap SSDs can
handle up to about 500 MBytes per second of writes. The disk has to handle
both writing of checkpoints and writes of the REDO log (and also of the
UNDO log for disk data).

In 7.3 and earlier the above config parameters were named differently.
They were named DiskCheckpointSpeed (more or less equal to MaxDiskWriteSpeed
in 7.4) and DiskCheckpointSpeedRestart (more or less equal to
MaxDiskWriteSpeedOwnRestart). MaxDiskWriteSpeedOtherNodeRestart has no
equivalent in 7.3 and earlier versions; there DiskCheckpointSpeed is used
also when another node is restarting. This is one major factor in how we can
speed up node restarts and rolling upgrades.

One more thing is that the execution of LCPs has been improved in 7.4.
In 7.4 all LDM threads can write checkpoints in parallel. The config
parameter specifies the total speed of the node, so each LDM thread will
use a part of this speed. With the write speed set to 100 MByte per second
and 4 LDM threads, each LDM thread will write at about 25 MByte
per second and thus use about 25% of the available CPU capacity in the
LDM thread for checkpoints. In earlier versions up to 2 LDM threads
could write in parallel and each such LDM thread used the config
parameter DiskCheckpointSpeed/DiskCheckpointSpeedRestart. Thus the
write speed of the node was usually twice as much as the config
parameter was set to.

The change in 7.4 to let all LDM threads write in parallel means that
we can set the write speed higher in 7.4 than earlier. The CPU overhead of
checkpointing is spread over all LDM threads, which ensures that we are
able to increase the speed without hurting normal user operation.

There is also one more innovation in 7.4. This is seen from the config
parameter MinDiskWriteSpeed.

The problem here is that checkpoint writes compete for CPU capacity with
normal user operations. So if the user needs to run transactions at the
very highest speed, it is a good idea to cut down on the CPU used for
writing checkpoints.

Additionally if we experience an IO lag where the disk gets overloaded
we also want to slow down the writing of checkpoints.

What we have done here is that we have implemented an adaptive algorithm
that will slow down the checkpoint speed when we either have a CPU
overload situation or we have an IO overload situation. To get a balanced
adaptive algorithm we decrease speed faster than we increase the speed.

For CPU overload we start reacting when the CPU usage reaches 95% and more.
We use a mechanism to see how much CPU time the LDM thread has had each
second. This means that this algorithm will work best in an environment
where the LDM threads always have access to CPU time if they need to.
If the LDM threads are sharing CPU time with many other threads and processes
the data node will still function but this particular adaptive algorithm
won't work very well.

Personally I always set up my data nodes using CPU locking with the
ThreadConfig config parameter to get the optimal stable performance from
the data nodes.
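
As a rough illustration of what such CPU locking can look like in config.ini
(the thread counts and CPU numbers below are purely illustrative and must be
adapted to your own hardware; see the ThreadConfig documentation for the full
syntax):

[ndbd default]
ThreadConfig=ldm={count=4,cpubind=1,2,3,4},tc={count=2,cpubind=5,6},recv={count=1,cpubind=7},send={count=1,cpubind=8},main={cpubind=9},rep={cpubind=9}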

It is actually even possible to setup a real-time environment with MySQL
Cluster, more on the how-tos of that in a separate blog.

IO lag is discovered by analysing how many seconds worth of REDO log is
on its way to the disk but hasn't yet been written to disk.

The adaptive algorithm will always ensure that we never go below
MinDiskWriteSpeed, independent of CPU load and IO load. This parameter should
be set at a level that never jeopardizes the system, so that we don't risk
the log tail meeting the log head due to LCPs not completing quickly enough.

The maximum checkpoint write speed is then based on whether we are in normal
operation, whether another node is currently restarting or whether we are
restarting our own node.

With this thorough introduction to the execution of LCPs in MySQL Cluster 7.4
it is now possible to answer digitalpoint's question.

In normal operation, when there is no high load on the node and no
overload of the IO subsystem, we will write LCPs at a constant
speed of MaxDiskWriteSpeed. We will approach MinDiskWriteSpeed in
scenarios where either the CPUs or the IO subsystem are overloaded due to
many transactions being executed by the users of MySQL Cluster.

So MinDiskWriteSpeed should be set at the minimum level required to ensure
that we don't run out of REDO log space between LCPs. MaxDiskWriteSpeed should
be set to a level dependent on how fast we want restarts to be; since this is
the speed used most of the time, one should also take into account the
possibility of wearing out the disks when setting this parameter.

MaxDiskWriteSpeedOtherNodeRestart should be set quite high to enable fast
restarts. We should still not set it drastically high, to avoid running into
unnecessary overload situations during the node restart.

MaxDiskWriteSpeedOwnRestart can be set to a level that can be handled by
the IO subsystem and CPU subsystem since it has very little other
activity to handle during a restart.
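
As a concrete illustration, these parameters are set in the cluster
configuration file (config.ini). The values below are only meant to show the
relative ordering discussed above, not tuning advice for any particular
system:

[ndbd default]
MinDiskWriteSpeed=10M
MaxDiskWriteSpeed=20M
MaxDiskWriteSpeedOtherNodeRestart=50M
MaxDiskWriteSpeedOwnRestart=200M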

The adaptive algorithm makes it possible to be fairly aggressive in setting
the last two config parameters, since the adaptive algorithm will ensure
that we still maintain a healthy level of CPU usage for user transactions.

A final word is that it is possible to change these parameters in any
restart. Actually, we even implemented an undocumented feature whereby you
can change these parameters in live nodes. The reason we haven't documented
it is that it is intended for people who completely understand how their
system behaves, and who are thus able to read the code of Backup.cpp to
understand how to change these parameters. A change made through this
interface won't survive a restart; for that one needs to change the
configuration.

We have also added a new NDBINFO table making it possible to follow how
the checkpoint write speed and other disk writes are developing in your own
cluster. More on that in a later blog.

Wednesday, March 11, 2015

Methods of optimised module interaction in MySQL Cluster 7.4

One of the important optimisations we did in MySQL Cluster 7.4 was to
make more use of direct function calls as part of SCAN processing instead
of relying on asynchronous messages.

We want each message execution to be short; we should strive to keep it
below around 10 microseconds of execution per message. However, if we
decrease the time spent executing a message down to a microsecond or less,
then we spend too much time in the message scheduler. So it is important
to strike a balance here.

For primary key lookups we do most of the work in one message
execution, so here we already were doing the optimal behaviour
in our model of execution.

For scans, however, we often scan hundreds or thousands and in some cases
even millions of rows, so here there is a lot of opportunity to decide how
to execute the scan.

Previously the scan model used a number of messages per row
scanned. This was quite obviously a bit too conservative. We have
now made optimisations such that we can scan up to 5 rows within
the execution of one message.

When converting an asynchronous message into a function call
there is more than one way to achieve this. There are ways that
are extremely fast, but these leave no extra opportunity for
tracing. So for each such call we have to make a decision.

The excerpt below is taken from the code in MySQL Cluster 7.4
and explains the different ways to convert asynchronous messages
into function calls.

When calling another block immediately there is a set of ways to
do this.

Standard, non-optimised manner:
Use EXECUTE_DIRECT with four parameters that specify block reference
of receiver, the signal object, the global signal number and the
length of the signal (There is also a variant used when sending such
a signal to a different instance). This method is fairly optimised
but has quite a lot of potential for performance improvement.

Standard, optimised manner:
In this case we optimise things by translating to block object and
retrieving the function pointer in the block call. This gives
the compiler assistance to separate the loads and stores more from
each other.
The call will be seen as:
block->EXECUTE_DIRECT(signal, f);

This manner optimises the code but retains the flexibility and also
the possibility to trace signal execution between blocks.

Non-standard, optimised manner:
In this case we remove some of the flexibility of the call to enhance
the performance yet a bit more. In this case we remove the possibility
for a flexible receiver of the signal, it is directed to a certain
block and we also call the method directly without any indirection.
The call will be seen as e.g.:
c_tup->execTUPKEYREQ(signal);

Both standard manners of calling EXECUTE_DIRECT always call
functions with one parameter, the signal object, and no return
value. There are however two ways of sending signals back. In some
cases a signal is always sent back when returning from the
EXECUTE_DIRECT call; in this case the signal object returned contains
the data for the return signal. In other cases whether one gets a
return signal in this manner depends on a combination of the signal
number and the parameters. In the example below with NEXT_SCANREQ
we always get a return signal when specifying ZSCAN_COMMIT
but not in other cases.

The other manner of sending a return signal is to perform a new
EXECUTE_DIRECT call. In those cases one needs to ensure that the
call chain is bounded so that we don't run out of stack. It is also a good
idea to try to ensure that the EXECUTE_DIRECT can use the
tail-call optimisation to avoid using too much stack, which is bad
for CPU caching. This means returning immediately after
EXECUTE_DIRECT, but also avoiding objects that need
destruction after return and avoiding taking references to
stack variables.

Using the non-standard manner one can obviously also change more
things: one can return a bool, for example, as in the example with
execTUPKEYREQ, one can add parameters and one can even change the
name to something that does not follow the standard naming
conventions. Obviously doing this reduces the flexibility with which
blocks can be used.

Tuesday, March 10, 2015

LCP Pausing module in MySQL Cluster 7.4

A new feature that assists in making node restarts much faster is
the new PAUSE LCP protocol. This is an excerpt from the MySQL
Cluster 7.4 source code. There are also a fair number of new
comments in the 7.4 source code which are only meaningful in the code
context.

This module contains code that executes for the purpose of pausing
LCP reporting to our meta data for a short time while we are copying the
meta data to a new starting node.

In order to better understand the handling of the LCP protocol we will
describe the LCP protocol, this includes both the old and the new protocol.

The LCP protocol is controlled by the DIH in the master node.
When an LCP has been completed we will immediately start checking for
the need for a new LCP to be started.

The first step here is to ensure that we have had sufficient activity in
the cluster to necessitate an LCP to be executed again.

To check this we send TCGETOPSIZEREQ to all DBTCs in the cluster. This
gathers an estimate of how many writes we've had in the cluster
since the last LCP was started. There are also various ways to ensure
that we start an LCP immediately if so needed.

If the activity was sufficient we will start the LCP. Before starting the LCP
we will calculate a number of GCI values that are important, oldest restorable
GCI and so forth. Next we will send TC_CLOPSIZEREQ to all DBTCs in
the cluster to clear the activity counter in DBTC as preparation for the next
LCP start.

In the old way we will then grab a mutex on the fragment info; this
mutex will be held until the LCP is completed. The mutex is held in
the master node, and in a master takeover the mutex needs to be taken
also in the new master node. Since all LCPs go through the master
node this has the same effect as a distributed mutex on the fragment
info.

In the new way we will start the LCP immediately without grabbing
the mutex.

The first step in starting is to calculate the set of LQHs involved in
the LCP and the set of DIHs involved in the LCP. A node is involved in
the LCP in DIH if it has had the meta data copied to it. It will
participate in an LCP in LQH if the data has been restored and we're
ready to perform a full LCP.

Next we update the LCP id to that of the new LCP.

The next step is performed in the master node by walking through all
fragment replicas of all active tables to see how much of the REDO log
we can cut away when starting the new LCP. At the first order of an
LCP of a fragment in an LDM instance we will set the new log tail in
that LDM instance.

After calculating the new GCI values and setting the LCP id we will
synchronize this information with all other nodes in the cluster.
This information will also be synchronized to the file system in
the Sysfile. This file is where all restarts start by looking at
the state of our database on disk.
The COPY_GCIREQ signal is used to distribute this message.

When all nodes have synchronized this information to disk and confirmed
this to the master then we are ready to start sending orders to perform
the individual checkpoints of the fragment replicas.

The next step is that we want to set the tables to be involved in the
LCP. At this point we want to ensure that the same set of tables is
calculated in all nodes. To ensure this we grab the mutex that ensures
no tables are able to commit their CREATE TABLE statements until we are
done with this step.
This step is started by the signal START_LCP_REQ. This signal also contains
the list of nodes involved in the LCP, both for LQH and DIH.

CREATE TABLE can create new tables prior to this point, which we will
include, and that's ok as they cannot possibly affect the new redo tail
position. DROP TABLE can drop tables prior to this point, which could
remove the need to maintain some old redo, but that will be handled in
the following LCP.

Each table to execute the LCP on is marked with a proper state in the
variable tabLcpStatus. Also each fragment replica to execute the LCP
on is marked with true in the lcpOngoingFlag and we set the number of
replicas to perform LCP on per fragment as well.

These preparatory steps are done in a synchronized manner, so all nodes
have received the information about the COPY_GCIREQ and all nodes have
heard the START_LCP_REQ signals. So in a master takeover we can ask all
nodes about their LCP state and derive whether we sent the COPY_GCIREQ
to all nodes, and similarly we can derive whether we sent and completed the
START_LCP_REQ step. To conclude that a step is complete requires that all
nodes have heard the signals, not just one of them, since a crash can occur
in the middle of signal sending.

In a master takeover, if we haven't completed the COPY_GCIREQ step then
we can start the next LCP from the beginning again. If COPY_GCIREQ has
been completed but not the START_LCP_REQ, then we can restart the
START_LCP_REQ step. Finally, if the START_LCP_REQ has been completed
then we know that the execution of checkpoints on individual fragment
replicas is ongoing. Obviously in a master takeover we should ensure
that the processing of START_LCP_REQ is completed before we report
back our state to the master node, to make the master
takeover handling as simple as possible.

So now that we know exactly which tables and fragment replicas to
checkpoint it is time to start the actual checkpoint phase.

The master node will send LCP_FRAG_ORD to DBLQH for each of the fragment
replicas to execute the LCP on.

In the old way there was a queue of such LCP_FRAG_ORD with limited size in
DBDIH (queue size was 2 in 7.3 and earlier and 128 in 7.4 versions).
Also DBLQH had a queue for LCP_FRAG_ORDs, in 7.3 this was 2 in size and
in early versions of 7.4 it was 64.

In the new version we can send LCP_FRAG_ORD to LQH as before, and LQH has
an unbounded queue size (it simply stores the LCP_FRAG_ORD on the fragment
record, so there is no limit to the queue size since all fragments can
be in the queue). In addition, at master takeover we also support receiving
the same order two or more times. By keeping track of whether we have
already received an LCP_FRAG_ORD for a fragment we can easily discard
LCP_FRAG_ORDs that we have already received.

These features mean that LQH can process a Local Checkpoint without much
interaction with DIH / DIH Master, which enables simplifications at DIH
and DIH Master in later versions. In principle we could send off all
LCP_FRAG_ORDs immediately if we like and more or less turn the LDM
instances into independent LCP execution engines. This is a step in the
direction of more local control in LQH over LCP execution.

When all LCP_FRAG_ORDs have been sent, a special LCP_FRAG_ORD
is sent to all participating LQH nodes. This signal has the lastFragmentFlag
set; it doesn't contain any fragment to checkpoint, it only
indicates that we've sent the last LCP_FRAG_ORD.

LQH will execute the orders to run an LCP on a fragment in the order they
are received. As a fragment completes its LCP it will generate a new
message, LCP_FRAG_REP. This message is broadcast to all participating DIHs:
first the message is sent from DBLQH to the local DIH, then the local DIH
broadcasts it to all participating DIHs.

The new LCP pausing module comes in here by also being able to queue
LCP_FRAG_REPs before they are broadcast to the participating DIHs. They are
queued on the fragment replica records in the local DIH and thus we have
no limits on the queue size.

This allows the DIH Master state to be stabilised as necessary during an
LCP, removing the need in some cases to wait for an LCP to complete before
performing some other activity.

When LQH has executed all the LCP_FRAG_ORDs and has received the
last fragment flag, the LDM will perform a number of activities to
complete the local checkpoint. These activities are mostly needed by the
disk data tables.

After all these activities have completed the LQH will send
LCP_COMPLETE_REP to the local DIH. The local DIH will broadcast it to all
participating DIHs.

When every LQH has sent all its LCP_FRAG_REPs and also its
LCP_COMPLETE_REP, then the LCP is completed. So a node that has seen
LCP_COMPLETE_REP from all nodes participating in the LCP knows that
it has received all the LCP_FRAG_REPs for the LCP.

In a master takeover, in the old way we could not resend the LCP_FRAG_ORD
to the LQH again. To handle this we used an extra master takeover
protocol, EMPTY_LCP_REQ. This protocol ensures that all LQHs have emptied
their queues and that all LCP_FRAG_REPs have been sent to all participating
DIHs, and likewise with the LCP_COMPLETE_REP, so that the new master has
a precise view of which fragment replicas have completed the LCP execution
so far.

Thus when the master takeover is completed we know that each DIH has received
all the LCP_FRAG_REP for which an LCP_FRAG_ORD have been sent and also
all LCP_COMPLETE_REP that have been produced. This means that we are now
ready to restart the process of sending LCP_FRAG_ORD again.

The problem with this approach is that it can take a very long time to
execute the entire LCP fragment queue in LQH if the queue size increases
(it increased from 2 to 64 going from 7.3 to 7.4) and the size of the
fragments also increases. So the master takeover can take a substantial
time in this case.

So the new manner is to allow for the LQH to get LCP_FRAG_ORD and also
the special last LCP_FRAG_ORD several times with the same LCP id and
discard those that it receives for a second time. In this manner we can
simply restart sending the LCP_FRAG_ORD from the beginning. When we are
done with this we can start checking for completion of the LCP in the
normal way.

When the master has sent the last special LCP_FRAG_ORD and these have been
received by the receiving nodes, then the master will actually itself not
do anything more to execute the LCP. The non-master nodes will however send
LCP_COMPLETE_REP to the master node. So this means that a new LCP won't
start until all participating DIHs have completed the processing of the
last LCP.

So effectively taking over as master in this phase doesn't really require
any specific work other than redirecting the LCP_COMPLETE_REP from the
non-masters to the new master. If it has already been sent it should be
seen in the response to the MASTER_LCPREQ from the node. So after
receiving the last MASTER_LCPCONF we have enough information about whether
we need to send more LCP_FRAG_ORDs or not.

We can still optimise the sending of LCP_FRAG_ORD a little bit by avoiding
sending it for a fragment replica for which we have already received
an LCP_FRAG_REP. It would be possible to avoid sending extra
LCP_FRAG_ORDs in various ways, but it doesn't really cost much: LCP_FRAG_ORD
is a small signal and the number of signals sent is limited to the number
of fragment replicas. So this would only make sense if we have to support
extremely large clusters combined with extremely many tables.

As this description shows some interesting places to test master failures
are:
1) Master failure while clearing TC counters (TC_CLOPSIZEREQ).
2) Master failure while distributing COPY_GCIREQ.
3) Master failure while distributing START_LCP_REQ
4) Master failure while processing the LCP and sending LCP_FRAG_ORDs
4.1) Before any LCP_FRAG_REP received
4.2) After receiving many LCP_FRAG_REPs, but not all
4.3) After receiving all LCP_FRAG_REPs, but not all LCP_COMPLETE_REPs
4.4) After receiving all LCP_FRAG_REPs, and all LCP_COMPLETE_REPs.

"While distributing" above can be interpreted as three test cases: one
before distributing, one in the middle of distributing and one when all
responses have been received.

It is also important to similarly test PAUSE_LCP_REQ handling in all of
the above states. This can be handled by inserting an ERROR_INSERT that
effectively stops the process of copying meta data at some point and then
setting some variable that triggers the copying of meta data to continue
once we have reached the state we wanted to test.