The Value of Using Klustron in Small Data Volume Scenarios

Preface

In the initial stage of the development of a new business in a company or department, the amount of data is usually not large, for example, only within a few tens of GB. A MySQL master-slave replication cluster can easily handle it. So what is the value of using Klustron in this case? In fact, there is still huge value, and this article will comprehensively answer this question.

High availability

The high performance & strong synchronization functionality of Klustron-storage Fullsync ensures the high availability of each Klustron-storage shard.

At the same time, the cluster_mgr module of Klustron will automatically complete the master node probing, automatic & manual selection of the master node, and automatic switching to the new master node, making Klustron have a complete automatic high availability capability without user intervention.

Compared with MySQL Group Replication or semisync replication, Klustron has higher performance, shorter delay, and lower consumption of computing resources.

The reasons are as follows:

MGR needs to hold locks and wait for the slave machine to confirm (ACK) that it has received the transaction binlog before the Flush transaction binlog, which will block all active transactions that have transaction lock conflicts with these transactions that are waiting for ack for a long time, thereby reducing the system's concurrent processing capacity and prolonging the running time of those blocked transactions, affecting the response speed of the business system. Klustron-storage Fullsync waits for the slave machine to confirm after-commit, and has released the transaction lock it holds before waiting for the confirmation of the binlog received by the slave machine, so it will not block those transactions that conflict with it due to waiting for the slave machine to respond.
Meanwhile, fullsync does not need to occupy working threads while waiting for the slave machine ACK, while semisync and MGR both need to occupy working threads, which means that MySQL needs to start more working threads to deal with business loads. Each working thread will consume CPU time slices, memory, and need the OS to manage their states, all of which are consumption of computing resources.
Due to these technical advantages, Klustron-storage fullsync has up to 5 times the performance advantage over MGR in performance tests such as sysbench.

High Performance

Not only does Klustron-storage have better replication performance than MySQL, but Klustron also has a series of high-performance and scalable capabilities in data analysis, query processing, and other aspects, as follows.

Read-write separation

When a computing node executes a read-only query statement (i.e., select statement), it can automatically send a select query to the slave node of the storage cluster to obtain data to execute the select query statement.

This behavior does not require user intervention, so the user program can use this function without modification. Of course, users can also turn off this feature on a connection or global level. Executing read-only queries on the slave machine can reduce the load on the master node.

However, it should be noted that some read-only queries cannot be sent to the slave machine. For example, if a table is updated and then the data in the table is queried in the same transaction, in order to ensure that the latest data is queried and to avoid inconsistent query results, this select statement cannot be sent to the slave machine for execution by default, unless the user reduces the consistency level and forces the reading of the possible old data on the slave machine.

That is to say, read-write separation technology is not always able to share the load of the master node, which is the same for any distributed database or middleware in the middle of database sharding.

Parallel select queries

Klustron supports parallel select queries, including parallel query processing within computing nodes, parallel queries between computing nodes and storage nodes, and parallel queries within storage nodes.

The parallel processing capabilities at these three levels greatly improve the query performance of Klustron, especially for complex queries in data analysis.

Parallel within computing nodes refers to multiple threads executing specific nodes in the query plan. Some query processing functions in Klustron-server have parallel query capabilities, including NestedLoopJoin, HashJoin, MergeJoin, Append, Aggregate, etc.; this parallel capability is inherited from PostgreSQL.

Parallel between computing nodes and storage nodes: refers to computing nodes that can asynchronously send SQL statements to read and write target data of storage nodes, so that all storage nodes containing target data will execute the received SQL statements in parallel; this is a capability that Klustron has had since the initial public release of version 0.6, and it is also our self-developed technology.

Parallel within storage nodes: when a computing node sends multiple read-only select statements to the same storage cluster, the computing node will open multiple connections to the target storage node and execute a select statement in each connection. These select statements are executed in parallel in this storage node.

In this way, the computing node can use more computing resources of the storage node to serve the same query request at the same time. This function will be released in Klustron 1.0, and it is also our self-developed technology.

Data Analysis Functionality

Regardless of the number of storage shards in a Klustron cluster, users can deploy several computing nodes specifically for data analysis queries. These computing nodes use read-write separation technology to obtain data from the slave nodes of these storage shards to execute analysis statements, so they do not affect the performance of OLTP loads.

Klustron's computing nodes support complete data analysis functionality and have passed all TPC-H and TPC-DS performance tests, which is an ability that MySQL does not have.

Specifically, Klustron-server supports all window functions, grouping sets, cube, rollup, and other functions, while MySQL only supports rollup and some window function functions.

Klustron's full-cluster multi-level parallel query processing capability brings a performance leap for data analysis queries.

Fully Utilize Hardware Resources

Using the simplest MySQL one master and two slave cluster requires three machines, but only the master node can bear the write load among these three machines, and its CPU and IO loads are usually higher than those of the slave machines.

After using Klustron, deploy a Klustron cluster in peer-to-peer deployment mode on three machines. This cluster has three storage shards, and each machine has the master node of one shard and a slave node of the other two shards (as shown below). In this way, the computing and write loads and the total number of connections initiated by the application are evenly distributed among the three machines, fully utilizing the computing resources of each machine, and achieving higher throughput and performance for the entire cluster.

![](小数据量用户场景使用 Klustron_的价值/1.png)

Comprehensive and flexible cluster management functions

Programmable basic cluster management API

Klustron supports complete cluster management functions, including automatic cluster slave and globally consistent recovery, horizontal elastic expansion, addition, deletion, start and stop of computing nodes, addition, deletion of storage clusters and storage nodes, redo standby, automatic or manual master-slave switching, business-unaware Online DDL, etc.

And all these functions have corresponding API interfaces, so external software systems can operate and use Klustron's cluster management functions in a programmable way, for example, it can be easily integrated into the user's database cluster management interface.

Reserved horizontal expansion capability

As the user's business grows, even if a Klustron cluster currently uses only one storage shard, it can automatically apply a business-unaware way to horizontally and elastically expand to multiple storage shards in the future, as long as the user allocates more server nodes to the cluster, without other intervention by the DBA. This way, users don't have to worry about how to deal with the continuous growth of data scale in the future.

During the automatic horizontal elastic expansion, the Klustron cluster does not lock the table and does not affect the operation of the application program.

Graphical operation monitoring, fault diagnosis and alarm

The operation logs, slow query logs and other log files of each module of Klustron contain rich runtime information of each module of the cluster, which Klustron uses for graphical semi-automatic fault diagnosis to help DBAs quickly and accurately locate problems.

At the same time, with the popular monitoring system combination of prometheus+grafana, the monitoring of cluster nodes has been realized. Alarms can also be made in various forms such as SMS and telephone to notify DBAs to deal with problems in a timely manner.

Database Migration Workload

For migrating from other database systems to Klustron, the data loading part can be completed using third-party tools.

Usually, the difficulty and workload are relatively large in terms of the transformation of the application system. In this regard, we have done a lot of work to help users easily migrate from MySQL and Oracle Server to Klustron.

MySQL Compatibility

Klustron supports two connection protocols: PostgreSQL and MySQL. In each protocol, any SQL statement supported by Klustron can be sent.

In this way, more extensive data storage management and utilization capabilities than MySQL can be utilized, such as Klustron's OLAP analysis capabilities, which are far superior to MySQL in performance.

At the same time, Klustron supports MySQL's private DML syntax, refer to this article (Klustron's support for MySQL's private DML syntaxopen in new window).

For SQL functions unique to MySQL (that is, not defined in the SQL standard), we will add support as needed, including all MySQL-specific functions except GIS functions and JSON functions. Adding such a function is not difficult, and we may also draw resources later to implement all such functions in the Klustron computing node.

In this way, application software that originally used MySQL can directly connect to Klustron without any code modification or recompilation.

Oracle Compatibility

Klustron inherits PostgreSQL's compatibility with Oracle databases, including support for most of the features of PL/SQL and SQL-2003.

For other Oracle-specific features, users need to complete application-side code modifications (usually requiring minor modifications to stored procedures and SQL queries), and can use some third-party tools to simplify and speed up these tasks.

If the application program originally used ODBC or JDBC to connect to the Oracle database, then it can be connected to Klustron without any code modification; otherwise, it is necessary to modify the application-side code and use the client library for PostgreSQL in the corresponding programming language to connect to Klustron.

The Value of Using Klustron in Small Data Volume Scenarios

The Value of Using Klustron in Small Data Volume Scenarios

Preface

High availability

High Performance

Comprehensive and flexible cluster management functions

Database Migration Workload

END