Monday, March 02, 2015

200M reads per second in MySQL Cluster 7.4

By courtesy of Intel we had access to a very large cluster of Intel servers for a few
weeks. We took the opportunity to see the improvements that the new Haswell
implementation of the Intel Xeon chips brings to these servers. We also took
the opportunity to see how far we can now scale flexAsynch, the NoSQL benchmark
we've developed for testing MySQL Cluster.

Last time we tested we were using MySQL Cluster 7.2 and the main bottleneck
then was that the API nodes could not push through more than around 300k reads
per second, and there is a limit of 255 nodes in total. This meant that we
were able to reach a bit more than 70M reads per second using MySQL Cluster 7.2.

In MySQL Cluster 7.3 we improved the handling of thread contention in the NDB API
which means that we are now able to process much more traffic per API node.
In MySQL Cluster 7.4 we further improved the NDB API receive processing, and we
also improved the handling of scans and PK lookups in the data nodes. This means
that each API node can now process more than
1M reads per second. This is very good throughput given that each read contains
about 150 bytes. So this means that each socket can handle more than 1Gb/second.

To describe what we achieved we'll first describe the HW involved.
The machines had 2 sockets with Intel E5-2697 v3 processors. These are
Haswell-based Intel Xeons with 14 cores and 28 CPU threads per CPU socket.
Thus a total of 28 cores and 56 CPU threads in each server operating at 2.6GHz base
frequency and a turbo frequency of 3.6GHz. The machines were equipped with
64 GByte of memory each. They had an Infiniband connection and
a gigabit ethernet port for communication.

The communication to the outside was actually limited by the Infiniband interrupt
handling. The Infiniband interrupt handling was set up to be latency-optimised
which results in higher interrupt rates. We did however manage to push flexAsynch
such that this limitation was minor; it limited the performance loss to within 10%
of the maximum performance available.

We started testing using just 2 data nodes with 2 replicas. In this test we were able
to reach 13.94M reads per second. Using 4 data nodes we reached
28.53M reads per second. Using 8 data nodes we were able to scale it almost
linearly up to 55.30M reads per second. We managed to continue the
almost linear scaling even up to 24 data nodes where we achieved
156.5M reads per second. We also achieved 104.7M reads per second on a
16-node cluster and 131.7M reads per second on a 20-node cluster. Finally we took the
benchmark to 32 data nodes where we were able to achieve a new record of
205.6M reads per second.



The configuration we used in most of these tests had:
 12 LDM threads, non-HT
 12 TC threads, HT
 2 send threads, non-HT
 8 receive threads, HT
where HT means that we used both CPU threads in a core and non-HT means that
we only used one CPU thread per core.
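
To make this concrete, a config along these lines would describe the above setup. The CPU numbers are purely illustrative: they assume that CPUs 0-27 are the first hyperthread of each of the 28 cores and CPUs 28-55 the second, and the actual numbering depends on how the OS enumerates the cores:

ThreadConfig="ldm={count=12,cpubind=0-11},tc={count=12,cpuset=12-17,40-45},send={count=2,cpubind=18-19},recv={count=8,cpuset=20-23,48-51},main={count=1,cpuset=24-27,52-55},rep={count=1,cpuset=24-27,52-55},io={cpuset=24-27,52-55},wd={cpuset=24-27,52-55}"

Here the ldm and send threads use only one CPU thread per core (non-HT), while the tc and recv threads use both CPU threads of their cores (HT), and the remaining thread types share the last few cores.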

We also tried with 20 LDM threads HT, which gave similar results to 12 LDM
threads non-HT. Finally we had threads for replication, main, io and other activities
that were not used much in those benchmarks.

We compared the improvement of Haswell versus Ivy Bridge (Intel Xeon v2) servers
by running a similar configuration with 24 data nodes. With Ivy Bridge
(which had 12 cores per socket and thus 24 cores and 48 CPU threads in total) we
reached 117.0M reads per second and with Haswell we reached
156.5M reads per second. So this is a 33.8% improvement. Important to note here
is that Haswell was slightly limited by the interrupt handling of Infiniband
whereas the Ivy Bridge servers were not limited by this. So the real difference is
probably more in the order of 40-45%.

At 24 nodes we tested scaling on number of API nodes. We started at 1 API machine
using 4 API node connections. This gave 4.24M reads per second. We then tried with
3 API machines using a total of 12 API node connections where we achieved
12.84M reads per second. We then added 3 machines at a time with 12 new API
connections, and each step added more than 12M reads per second, giving 62.71M
reads per second at 15 API machines and 122.8M reads per second at 30 API machines.
The near-linear scaling continued up to 37 API machines, where we achieved the best
result of 156.5M reads per second. Performance with 40 API machines was about the
same, at 156.0M reads per second. The performance was saturated here since the
interrupt handling could not handle more packets per second. Even without this the
data nodes were close to saturating the CPUs for the LDM, TC and send threads.

Running with clusters like this is interesting. The bottlenecks can be more tricky
to find than in the normal case. One must remember that a benchmark with
37 API machines and 24 data nodes, where each machine has 28 CPU cores, involves
more than 1000 CPU cores, and understanding it requires understanding a complex
queueing network.

What is interesting here is that the queueing network behaves best if there is some
well behaved bottleneck in the system. This bottleneck ensures that the flow
through the remainder of the system behaves well. However in some cases where
there is no bottleneck in the system one can enter into a wave of increasing and
decreasing performance. We have all experienced this type of
behaviour of queueing networks while being stuck in car queues.

What we discovered is that MySQL Cluster can enter such waves if the config doesn't
have any natural bottlenecks. What happens here is that the data nodes are able to
send results back to the API nodes in an eager fashion. This means that the API nodes
receive many small packets to process. Since small packets take longer to process
per byte compared to large packets, this has the consequence that the API nodes slow
down. This in turn means that the benchmark slows down. After a while the data nodes
start sending larger packets again to speed things up, and again the sending becomes
too eager.

To handle this we introduced a new configuration parameter MaxSendDelay in
MySQL Cluster 7.4. This parameter ensures that we are not so eager in sending
responses back to the API nodes. We will send immediately if there is no other
competing traffic, but if there is other competing traffic, we will delay sending
a bit to ensure that we're sending larger packets. One can say that we're
introducing an artificial bottleneck into the send part. This artificial bottleneck
can in some cases improve throughput by 100% or even more.
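
For those who want to try this out, MaxSendDelay is set on the data nodes in the config.ini file and the value is given in microseconds. The value below is purely illustrative; the best setting depends on the configuration and the load:

[ndbd default]
MaxSendDelay=500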

The conclusion is that MySQL Cluster 7.4 using the new Haswell computers is
capable of stunning performance. It can deliver 205.6M reads per second of
records a bit larger than 100 bytes, thus providing a data flow of more than
20 GBytes per second of key lookups or 12.3 billion reads per minute.

Monday, March 31, 2014

MySQL Cluster 7.4.0 Lab Release Improvements

The year 2014 seems to be the year of In-Memory Databases. We've seen a lot of commotion around the Hekaton product from Microsoft and other new products. At the same time everyone talks about WebScale and how to build distributed databases that can handle millions of queries per second. In addition we've seen the development of quite efficient communication mechanisms over the last few years. Databases that make use of In-Memory technology, WebScale and efficient communication mechanisms are the talk of today in the database world.

The background of NDB

The design of the NDB storage engine started out with exactly those base requirements more than 20 years ago, when the aim was to build the next generation telecom database. The base requirements called for a shared-nothing database (for superior scalability and to meet telecom requirements on fail-over times). Already in those days one could buy a machine equipped with 1 GB of memory. Given that the telecom databases were used for extremely quick lookups of small amounts of data, it was natural to consider an in-memory database design. My research into databases at the time showed that they spent a considerable amount of time in the operating system to handle communication and hard disks. So by moving data in-memory and by using communication mechanisms that avoided the operating system we were able to deliver extremely efficient database operations already in the late 90s.

NDB Today

Today NDB is the storage engine of MySQL Cluster and has been in production usage for more than 10 years. Almost everyone on the globe is touched by its operation in some telecom system, in some computer game or in some other type of web property application. We have already delivered benchmarks with billions of reads per minute, and the scalability of MySQL Cluster is so high that we simply don't have big enough computers or computer sites to show off its limits. A while ago we had access to a computer lab with hundreds of computers that were connected using Infiniband with a total bandwidth between the machines of 1 Tbit/sec. This means we can transport 128 GBytes per second between the machines. However MySQL Cluster could theoretically produce enough parallel read operations to swamp even such a network. So we're getting to the point where it becomes more and more uninteresting to show the scalability limits of MySQL Cluster. This means that we also want to focus on efficiency and not only on scalability.

The Many-Core CPU Challenge

I gave a presentation to a set of students at Uppsala University about MySQL and MySQL Cluster. In this presentation I showed how the development of multi-core CPUs presented a tremendous challenge to software designers. In a short time of only 8 years Intel and other HW developers gave us the challenge to scale our software 60x. At MySQL we were up to the challenge: we've increased the scalability of MySQL using InnoDB 20x in a time span of 5 years. At the same time we've actually increased the scalability of the NDB storage engine more than 30x, which means that MySQL Cluster, where we use MySQL together with the NDB storage engine, has actually scaled 60x in total. This means that my test machine with 8 sockets and 96 CPU threads is now the limiting factor in my benchmarks.

The Many-Core CPU Solution

How have we achieved this? With the MySQL Server it has been a long series of fixes to various bottlenecks, such as splitting the InnoDB buffer pool and handling the LOCK_open mutex, together with many more changes that collectively have made it possible to scale much beyond what our software in 2008 could achieve. This improvement of scalability continues, so stay tuned for more; there are blogs to read now about what is currently going on in MySQL 5.7 development.

With the NDB storage engine the solution has been quite different. We started out building a distributed database from the beginning, consisting of a set of nodes that replicate data synchronously using transactions. In order to avoid usage of the operating system we built the architecture based on a set of independent modules that interact with messages. This was built on the architecture of AXE, a telecom switch operating system of unique efficiency. The first version had each node implemented as a single thread. The development based on independent modules meant that dividing this thread into a number of functional modules was a simple task. Currently we have separated it into the local database part, the transaction part, the network send part, the network receive part, an asynchronous event part and finally the main part containing features for meta-data handling. Given that we developed a shared-nothing architecture it was simple to continue the partitioning to gain even more independent LDM parts by having each LDM thread handle different parts of the data. The transaction part can use a simple round-robin scheme and the network parts can easily be divided per socket we handle. In the future we could perform even more divisions of some functions.


NDB Layer by Layer approach


So what does this mean in effect? It means that we actually built a distributed database inside each node of our distributed database. Given that we can also replicate using MySQL replication we can actually go even further and have multiple clusters connected together. Those who want to think even further about how NDB could be used to build systems with millions of CPUs can google the word iClaustron. iClaustron is a hobby project I've played with since 2006 and I presented the aims of the project in a tech talk at Google which is available on YouTube.

The world is organised down into microcosmos and continues growing into macrocosmos. So why would software be different? We need to build systems of any size by using layers upon layers of distribution.

So building MySQL Cluster is an interesting project in building layer by layer of distribution into the system.


The big challenge ahead of us


So what could be the next challenge that the hardware engineers will deliver to us software engineers? Personally I am preparing for it already now. The challenge that I hope they will bring to us is persistent memory. This means that we will have to build databases where all of a sudden we can make persistent writes at a similar speed as we are currently writing to main memory. This will be an interesting challenge and personally I think that main memory databases have a unique advantage here since they already work at memory speed. So I feel a bit like a horse in the gates before a race, kicking and just eagerly waiting to get out onto the track to see how fast we can run home the next big challenge. But we have to wait until the hardware engineers first solve the issue of which technology will be the winner in this category and can be commercialised.


So after these small philosophical thoughts, let's get into what we're doing in the first Lab Release of MySQL Cluster version 7.4 to get further along the path to these goals.

The improvements in MySQL Cluster 7.4.0 Labs release

As mentioned we are working on improving the efficiency of MySQL Cluster, and we have specifically worked on scans in the NDB storage engine, which have been heavily optimised. In benchmarks using a lot of scans, such as Sysbench, we have managed to scale up performance per data node by 46% comparing 7.4.0 to 7.3.5. Compared to 7.2.16 the difference is even bigger than 100%, but going from 7.2 to 7.3 it was mainly inefficiencies in the MySQL Server that were fixed.





Another important thing we've done in 7.4.0 is add a lot of documentation about both our restarts and our scans in the form of extended comments in the code. We've also gone through the log messages presented to the operator during restarts and made them much more accessible and extensive.

MySQL Cluster 7.4.0 improvements for virtual machine environments

With 7.4 we're working hard on making MySQL Cluster more stable even when the underlying system isn't as stable as we would expect. MySQL Cluster is designed for high availability environments, and now we're working on making sure that the system can continue to operate even when systems are overcommitted. When we're working in a virtual machine environment where we cannot be certain of the exact resources we have available, it is hard to operate a high availability environment, but we still want to operate as reliably as possible.

MySQL Cluster 7.4.0 Stability improvements

We have also been working on improving the availability of the system by improving restart times. There are many areas where we can work on this: we can remove subtle delays that add up to longer restart times, and we can use more parallelism in certain phases of the restarts. We have also made our local checkpoints more parallelised, which gives a more balanced load on the various LDM threads in our system. This has the nice side effect of paying off in 5-10% improved performance for any application. Naturally it also means that we can run the local checkpoints faster, since we no longer risk imbalances in the CPU load when doing so.

Another unique feature of MySQL Cluster is supporting Active-Active environments using MySQL replication. We've been working to extend the support of this feature even further.


Benchmark environment description

We executed a set of benchmarks using Sysbench 0.4.12.6 in our dbt2-0.37.50.6 environment. We used a big machine with 8 sockets of Intel Xeon CPUs running at 2GHz. Each socket has 6 cores and 12 CPU threads. In most cases we run with hyperthreading enabled, but we have found that running LDM threads without hyperthreading is a good idea. This decreases the number of partitions to manage and the number of threads to manage, which has a positive effect on performance.

We used 8 LDM threads, and in this case the NDB data node used 2 sockets while the benchmark program and the MySQL Server had access to 5 sockets. The MySQL Server used about 40 CPU threads out of the 60 it had access to, so in this configuration we had spare resources to use. But in the next step, where we went to 12 LDM threads, we could not use the full potential of the SW. In this case the data node needed 3 sockets and the benchmark program used 1 socket, so the MySQL Server only had access to 4 sockets. This meant that it could increase performance by 25% and not the 50% made possible by going to 12 LDM threads (actually we squeezed a bit and made 52 CPU threads available to the MySQL Server and thus got about a 30% improvement over 8 LDM threads). Using 7.3 the data nodes are less efficient, so there we could scale the LDM threads all the way to the 50% improvement (actually we even got to a 52.7% improvement, so perfect scaling of performance as more LDM threads are added).

So with 12 LDM threads we need a 54-core machine to make full use of the potential of the data node. With 16 LDM threads we need even more: we need 4 sockets for the data node, 2 sockets for the benchmark program and 6 sockets to run the MySQL Server, thus a total of 12 sockets or 72 cores. This is probably as far as MySQL 5.6 can take us before the MySQL Server can no longer scale. But this is an important area of focus for MySQL 5.7, which has already had a set of improvements implemented in the 5.7.4 DMR released now.

Final words

So with this kind of scalability we are now in a position to deliver more performance from one single node than we previously could deliver from an entire cluster. Imagine what performance we can then get when connecting many nodes together, and we're still working on making each MySQL Cluster thread more efficient in its execution.

Monday, November 25, 2013

How to make an efficient Scalable Key Lookup engine of MySQL Cluster

MySQL Cluster has all the ingredients to be designed as a very scalable and extremely efficient key lookup engine for the Cloud. As we have shown in earlier entries of my blog, we've been able to scale MySQL Cluster 7.2 to handle 72 million key lookups per second, or 4.3 billion key lookups per minute. That benchmark was actually limited by the NDB API nodes, which could not handle more than about 300k lookups per second, and so with a maximum of 255 nodes we got to around 72 million per second in total. However in MySQL Cluster 7.3 we have removed this limitation, and in addition we have enabled scaling to even bigger data nodes, so it should now be possible to reach even higher numbers.

The aim of this blog is however not to give any new benchmark results, rather it is to provide details about how the benchmark program works and how this benchmark program architecture can be used to design an efficient scalable key lookup data store.

To obtain the best possible performance we want to ensure that the data node can operate as efficiently as possible. This is done by ensuring that a connection to the data node sends many key lookups bundled together. Operating on individual key lookups is possible of course, but as usual it is more efficient to operate on bigger entities than one key lookup at a time. To provide this we use a concept we call the Executor Thread. The Executor Thread will only execute key lookups aimed for a certain data node. This means that the number of Executor Threads will be a multiple of the number of data nodes (there could be more than one thread per data node if necessary). The Executor Thread will receive key lookups from an internal queue handled by the application program (in our case the flexAsynch benchmark program). The key lookups are prepared by the Definer Threads. A Definer Thread will receive a key lookup aimed for any data node, and it will take this key lookup and calculate the receiving data node for it (there are API calls in the NDB API to handle this). Based on this calculation the Definer Thread will put the key lookup in the queue of the proper Executor Thread.

The architecture in front of the Definer Thread is dependent on the application. In the figure provided here we have shown one possible architecture where we have one receive thread that receives a flow of messages from somewhere. To process those messages we need to interpret the packets and process them, which could entail one or more key lookups. In the figure we have assumed there is one key lookup per message and that the Executor Thread can format the packet back to the sender based on the information in the internal key lookup order.



So the important part of the architecture is the Executor Thread that handles messages to one data node based on an internal data structure that defines one key lookup and defines how to process the response (this thread should do as little work as possible to ensure it can focus on communication with the data node). There should also be a Definer Thread that prepares the key lookup request and puts the request in the queue of the proper Executor Thread. The Definer Thread could also do other things and there could be few or many Definer Threads in the architecture.

So how does flexAsynch work? In this case we don't have any input traffic; we generate the key lookups in the Definer Threads. The Definer Thread has a very simple operation. It starts by preparing a set of key lookups aimed at any data node. For each of those key lookups it puts the request in the queue of the proper Executor Thread. After placing all requests in a queue it starts waiting for all operations to complete. After all requests have received their response from the Executor Threads we simply continue with the next batch.

The operation of the Executor Thread is also very simple: it gets the current set of key lookups waiting in its queue and prepares those for execution through the NDB API. It sends off all the operations to the data node. When the data node has executed all of the operations, the Executor Thread reports the results back to the Definer Threads and updates some benchmark statistics, then it continues with the next batch of key lookups.
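
To make the structure concrete, below is a minimal sketch in C++ of the Definer Thread/Executor Thread pattern described above. It is not the actual flexAsynch or NDB API code: the queue class, the batch size and the use of a simple modulo to pick the data node are all illustrative placeholders, and the places where the real NDB API calls and response handling would go are marked with comments.

#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One key lookup request. In the real benchmark this would also carry
// information about how to process the response.
struct KeyLookup {
  std::uint64_t key;
};

// The internal queue placed in front of each Executor Thread.
class LookupQueue {
 public:
  void push(const KeyLookup &op) {
    std::lock_guard<std::mutex> guard(mutex_);
    queue_.push(op);
    cond_.notify_one();
  }

  // Drain everything currently waiting so the executor works on a batch.
  // Returns false once the queue has been shut down and fully drained.
  bool popBatch(std::vector<KeyLookup> &batch) {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
    batch.clear();
    while (!queue_.empty()) {
      batch.push_back(queue_.front());
      queue_.pop();
    }
    return !batch.empty() || !stopping_;
  }

  void shutdown() {
    std::lock_guard<std::mutex> guard(mutex_);
    stopping_ = true;
    cond_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::queue<KeyLookup> queue_;
  bool stopping_ = false;
};

// One Executor Thread per data node: it drains its queue and handles the
// whole batch against its data node in one go.
void executorThread(LookupQueue &queue) {
  std::vector<KeyLookup> batch;
  while (queue.popBatch(batch)) {
    // Here the real code prepares the batch as NDB API operations, sends
    // them to the data node, waits for the responses, reports the results
    // back to the Definer Threads and updates benchmark statistics.
  }
}

// A Definer Thread prepares a batch of key lookups and routes each one to
// the Executor Thread responsible for the owning data node.
void definerThread(std::vector<LookupQueue> &queues, int batchSize) {
  for (int i = 0; i < batchSize; i++) {
    KeyLookup op{static_cast<std::uint64_t>(i)};
    // The real code asks the NDB API which data node owns this key;
    // a simple modulo stands in for that calculation here.
    std::size_t dataNode = static_cast<std::size_t>(op.key % queues.size());
    queues[dataNode].push(op);
  }
  // The real Definer Thread now waits for all responses to arrive before
  // continuing with the next batch.
}

int main() {
  const std::size_t numDataNodes = 4;            // illustrative cluster size
  std::vector<LookupQueue> queues(numDataNodes); // one queue per data node
  std::vector<std::thread> executors;
  for (auto &queue : queues)
    executors.emplace_back(executorThread, std::ref(queue));

  definerThread(queues, 1000);                   // one batch of 1000 lookups

  for (auto &queue : queues)
    queue.shutdown();
  for (auto &executor : executors)
    executor.join();
  return 0;
}

The important property is the same as in flexAsynch: each Executor Thread drains everything waiting in its queue and sends it as one batch to its data node, so the data node always operates on bundles of key lookups rather than one lookup at a time.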

So the operation of an efficient key lookup data store is not difficult at all. To make it scale one can then add up to 48 data nodes per cluster (each capable of handling more than 5 million key lookups per second of around 100 bytes in size). Each cluster can handle a total of 255 nodes. Obviously it is also straightforward to operate more than one cluster to scale even further.

The benchmark code exists in storage/ndb/test/ndbapi/flexAsynch.cpp; the interesting code is in the NEW module (the file also contains a lot of legacy code for old variants of the flexAsynch benchmark).

Tuesday, November 19, 2013

MySQL Cluster run-time environment: Part 3: Configuration recommendations

Binding threads to CPUs in the MySQL Cluster data nodes can have great benefits. So what about hyperthreading, should we use all CPU threads, or only 1 CPU thread per CPU core? The recommendation actually differs. For the most part it is beneficial to use hyperthreading. For most thread types, running 2 threads on the 2 CPU threads of a hyperthreaded core gives about 40% higher performance compared to running 1 thread on the core without hyperthreading. There are a few cases where it might be beneficial to not use hyperthreading though.

The first example is for LDM threads. Using hyperthreading means we will increase the number of partitions of a table by a factor of two. In many cases this isn't beneficial, particularly when the number of LDM threads is high. I tried using 24 LDM threads with hyperthreading on 12 CPU cores and compared it to 12 LDM threads on 12 CPU cores. This case didn't benefit from using hyperthreading for the LDM threads. However, if the number of LDM threads is low it would probably still pay off, so going from 2 to 4 LDM threads is still beneficial, probably also going from 4 to 8. But going from 8 to 16 is less likely to be beneficial.

I have tested send and recv threads with and without hyperthreading; my benchmarks have always improved when using hyperthreading. I've seen the same with tc threads. Obviously if the main thread or the rep thread for some reason becomes the major bottleneck, then it makes sense to remove the use of hyperthreading there.

Avoiding the use of hyperthreading can simply be done by not configuring any threads to use the second CPU thread on each of the CPU cores where we want to avoid hyperthreading. As an example we will configure a machine with one data node; the machine has 4 sockets with 24 cores and 48 CPU threads in total. CPUs 0-5 represent cores 0-5, thread 0 on socket 0, and CPUs 24-29 represent cores 0-5, thread 1 on socket 0. So if we want to configure with 12 LDM threads not using hyperthreading we could use the config:

ThreadConfig="ldm={count=12,cpubind=0-11},tc={count=4,cpuset=12-17,36-41},send={count=4,cpuset=12-17,36-41},recv={count=4,cpuset=12-17,36-41},io={cpuset=18-19,42-43},main={count=1,cpuset=18-19,42-43},rep={count=1,cpuset=18-19,42-43},wd={cpuset=18-19,42-43}"

In this configuration the LDM threads will use CPUs 0-11, which covers all cores on sockets 0 and 1. No other thread is configured to use any CPU thread on those CPU sockets. The OS might still decide to use some of them, but we have left a number of empty CPUs that the OS hopefully discovers as idle and uses to schedule OS activities. We can actually do even better than this: there is a boot option in Linux whereby one can control which CPUs the general OS scheduler uses. Similarly there is a config variable for irqbalance to ensure interrupts are not scheduled on any CPU used by the MySQL Cluster data node. The tc, send and recv threads are scheduled on any of the CPU threads on socket 2, and the main, rep, wd and io threads use 2 cores on socket 3. The OS and other processes aren't blocked from using other CPUs, but will most likely be scheduled on any of those free CPUs with no activity on them.

The first question when specifying the MySQL Cluster run-time environment is how many ldm threads one should use. This depends on how many CPUs are accessible to the data nodes. So assume we have access to 24 CPU threads with hyperthreading. In this case it would be natural to start by setting the number of ldm threads to 6 and not use hyperthreading. A natural initial start is to use half of the available CPU cores for ldm threads. Next one assigns about a quarter of the ldm CPU resources to tc threads; in this case we land at 3 tc threads. Next one assigns a similar number of CPUs to send and recv threads. Then one assigns the rest of the CPUs to the main, rep, io and wd threads. This should give a fair amount of resources available also to the OS. A sketch of such a starting configuration is shown below.
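
As a sketch, such a starting configuration could look as follows. The CPU numbers, and the exact thread counts for send and recv, are purely illustrative; here we assume that CPUs 0-11 are the first hyperthread of each of the 12 cores and CPUs 12-23 the second:

ThreadConfig="ldm={count=6,cpubind=0-5},tc={count=3,cpuset=6-8,18-20},send={count=2,cpuset=9-10,21-22},recv={count=2,cpuset=9-10,21-22},main={count=1,cpuset=11,23},rep={count=1,cpuset=11,23},io={cpuset=11,23},wd={cpuset=11,23}"
NoOfFragmentLogParts=6

Here the ldm threads use only one hyperthread of cores 0-5, leaving the sibling CPU threads 12-17 mostly free for the OS, while the remaining thread types share the other cores using both hyperthreads. NoOfFragmentLogParts is set to match the 6 ldm threads (see the note further down).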

After this initial assignment it is a good idea to run some benchmark which is close to the workload of your application to see whether the config works well. For this test run one should use cpubind for all thread types to ensure that we know how much CPU resources each thread type consumes (this can easily be derived by looking at top in per-CPU load mode when using cpubind). First check whether the ldm threads are a bottleneck; if they are, then check whether it is possible to increase to the next level. In this example this would mean going to 8 ldm threads, thus using 2/3 of the CPU cores. If this isn't possible then just make sure that the rest of the thread types have a fair amount of CPU resources to avoid any unneeded bottlenecks.

If the bottleneck isn't the ldm threads, then assign more resources to the thread type that is the bottleneck, in most cases by taking resources from other non-ldm threads. There could be cases where less than half of the CPU resources are needed by the ldm threads, but I would deem those very unusual. Given that the actual database processing is done in the ldm threads, it would be an exceptional case if other threads consumed more than half of the resources.

Always remember to update the NoOfFragmentLogParts variable if changing the number of ldm threads.

After a few trials we have most likely found a decent configuration. After finding this configuration we can also consider where to use cpuset and if any threads should use the realtime or spintime variables.

So the next question is when to use cpubind, cpuset, realtime and spintime.

Both cpubind and cpuset are about locking threads to an individual CPU or a set of CPUs. We could actually consider even using no cpubind and no cpuset as a form of CPU locking: we are locking the threads to the set of CPUs available to the OS. Given that we might be running on a virtualised OS, this might actually already be a subset of the existing CPUs. To make any configuration of CPUs using cpubind/cpuset one has to have knowledge of how CPU ids map to CPU sockets and CPU cores.

So the default configuration, not using any cpubind/cpuset, is to allow the OS to schedule the thread onto any of the available CPUs. The OS scheduler is optimised towards an interactive environment where processes need to react to human interaction. It also does a good job in server environments where a fairly high number of threads compete for a small number of CPUs, and it does a decent job of handling server environments where only a handful of threads compete for a number of CPUs.

The type of threads that normal OS schedulers have the most problems handling are long-running threads that consume a lot of CPU resources, in particular threads that run more or less constantly. What happens is that the OS scheduler eventually downgrades their priority such that other processes are given a chance to execute; since the thread still wants to execute, the OS then searches for a free CPU to use. This is often successful, but the problem is that it means we migrate the thread onto a new CPU. This happens many hundreds of times per second for those busy threads. The effect is that the thread comes to a new CPU where it has no data or instructions cached in the CPU caches. So a migrated thread will spend quite a lot of time warming up the CPU caches before it can run as efficiently as it did before the migration. In addition this requires more bandwidth on the memory bus, which in some cases can become a bottleneck.

In the above case it is better for the thread to stay on the same CPU even if a new job is scheduled there, at least as long as we can be sure that the new process isn't yet another long-running thread. So this means that in order to optimise the MySQL Cluster run-time environment we need to be in control of all usage of the CPUs on the machine we're using. In many cases we colocate data nodes and MySQL Server processes. We can control placement of MySQL Server processes using numactl or taskset, which can be applied when starting the process or, using the pid, when the process is already started. This ensures that the MySQL Server process is never scheduled outside the set of CPUs we gave it access to through taskset/numactl. This is how I control the environment when running any MySQL Cluster benchmark. Similarly I also control the benchmark processes (sysbench/dbt2 client processes/flexAsynch...).

In this manner I am certain that no application process is using the CPUs provided for execution of the data node threads. So the only threads that will execute on those CPUs are either OS kernel threads or interrupt handlers. Even this can be controlled by using the special boot option isolcpus, listing the CPU numbers that should be isolated from the general OS scheduler. When setting this option, only the CPUs not listed get normal scheduler handling with the possibility to migrate threads between CPUs. The isolated CPUs are only used if the application asks for them through a locking call such as cpuset/cpubind in the cluster config or taskset/numactl. The final thing to control is the execution of interrupts. This is normally handled by the irqbalance process, and this can be configured to avoid a set of CPUs given as a bitmask in the irqbalance configuration variable IRQBALANCE_BANNED_CPUS. So it is possible to create a completely compartmentalised load with interrupts on certain CPUs, the OS and other applications on another set of CPUs, a set of CPUs for the data node, a set of CPUs for the MySQL Server and a set of CPUs for any other consuming application process. Providing this is a combination of boot options, irqbalance configuration, MySQL Cluster configuration and finally taskset/numactl on certain processes.
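
As a purely illustrative example (the CPU numbers depend entirely on the machine): one could boot the kernel with isolcpus=0-17 so that CPUs 0-17 are kept away from the general scheduler and only used by threads explicitly bound there (for example the data node threads through cpubind/cpuset), set IRQBALANCE_BANNED_CPUS=0003ffff (a hexadecimal bitmask with bits 0-17 set) in the irqbalance configuration so that interrupts also avoid those CPUs, and start the MySQL Server with for example taskset -c 18-23 so that it stays on the remaining CPUs.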

So when to use cpubind and when to use cpuset? cpubind should mainly be used on threads that can consume up to 100% of a CPU. This would normally be the ldm threads, but it can definitely be the case also for other threads dependent on the workload and the configuration. So one way of configuring MySQL Cluster environments is to use cpubind for ldm threads and put the tc, send and recv threads in a common cpuset. The nice thing with this configuration is that we can run with more threads than strictly necessary, so e.g. we can run with 4 tc threads, 4 send threads and 4 recv threads and put these into a cpuset consisting of e.g. 8 CPUs. In this manner the OS can easily handle the case where a certain thread type requires more resources for a while.

So configuring a MySQL Cluster run-time environment is about making use of static scheduling for the most critical resources and using the flexibility of the OS scheduling to handle the less critical resources. A good design philosophy for all environments like this is to design the run-time environment with a well-known bottleneck. For MySQL Cluster data nodes we recommend always making the ldm threads the bottleneck. The nice thing about this is that it makes it easier to understand how to handle overload and reason around it. As an example, if the ldm threads are overloaded and the tc threads have available resources, we can ensure that the tc threads handle sending error messages about overload without even contacting the ldm threads. This can be achieved by decreasing the available resources in the ldm threads through configuration, at least for scan operations.

Hopefully these recommendations will help you find the optimal configuration and still a safe configuration. Default configurations will work for most installations, but to get the last 10-100% performance out of the system one might need to dive a bit deeper into the configuration of the MySQL Cluster run-time environment.

The next configuration item to consider is the realtime setting. This can now be set on each thread type. Traffic queries normally arrive in the recv thread, are sent to the tc or the ldm threads, and then the reply is sent through the send thread. The main thread is mainly involved in meta-data operations, which are rarely time-critical. The rep threads can have a high load, but they are not part of any critical path except for asynchronous replication to other clusters. The io thread is only involved in time-critical operations if disk data is used in MySQL Cluster. The wd threads are different; obviously it is important that the watchdog thread gets an opportunity to execute every now and then. So if other threads are using realtime it's a good idea to use realtime also on the wd thread type. The recv, send and tc threads are usually doing small jobs, and thus realtime scheduling might be beneficial for those threads to cut the response time. For ldm threads that execute close to 100% of the time it's debatable whether realtime is such a good idea. The OS cannot handle threads executing on realtime priority for a long time, so we have implemented protection for this in the data nodes. This ensures that we decrease the priority to normal user priority even for realtime threads if they execute for too long. There are rarely any throughput benefits from using the realtime configuration for threads; it is mainly intended to enable less variation in response time.

Another use case for realtime is when there is a mix of data node threads and other application threads where the application threads have lower priority. In this case we ensure that the data node threads get prioritised access to CPUs before the other application threads get access.

The final configuration item to consider is the spintime. This means that the data node thread will execute for a bit longer before entering sleep mode. So if the recv thread or any other thread sends a new message to the thread during this spintime, we decrease the wake-up time. The only case where this can increase throughput is if the spinning thread is the bottleneck of the data node and the spinning doesn't steal resources from any other critical thread. The main usage of spintime is as a tool to improve response time in an environment with plentiful CPU resources. It is important to consider that the use of spintime will increase the use of CPU resources for those threads it is set on. It only applies to the ldm, tc, main, rep, send and recv thread types. One should in most cases avoid mixing spintime and realtime settings on the same thread type.
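
As an illustration of how these settings could be combined in ThreadConfig (the CPU numbers and the spintime value are purely illustrative and follow the 12-core example above), one could give the latency-sensitive recv, send, tc and wd threads realtime scheduling while letting only the ldm threads spin, thereby avoiding mixing the two settings on the same thread type:

ThreadConfig="ldm={count=6,cpubind=0-5,spintime=100},tc={count=3,cpuset=6-8,18-20,realtime=1},send={count=2,cpuset=9-10,21-22,realtime=1},recv={count=2,cpuset=9-10,21-22,realtime=1},wd={cpuset=11,23,realtime=1},main={count=1,cpuset=11,23},rep={count=1,cpuset=11,23},io={cpuset=11,23}"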