Klustron Shard 故障隔离测试
大约 4 分钟
Klustron Shard 故障隔离测试
注意:
如无特别说明,文中的版本号可以使用任何已发布版本的版本号代替。所有已发布版本详见:http://doc.klustron.com/zh/Release_notes.html。
本文内容:
Klustron的金融级高可靠性,由一系列能力和技术组成了一个完整的技术体系,这些能力和机制坚如磐石,组成了Klustron坚不可摧的金融级更高可靠性技术体系,这其中当然也包括了故障隔离能力。本文主要测试当集群中的某一个shard主备完全故障之后,Klustron集群仍旧能够部分正常工作。故障shard上面存储的数据无法操作,但其他shard的数据则可以进行正常的读写和提交。
1. 测试环境
测试集群相关机器的配置信息如下:
| 名称 | 节点类别 | IP | 端口 | Shard_ID | 
|---|---|---|---|---|
| comp3 | 计算节点 | 192.168.40.152 | 47001 | N/A | 
| shard_1 | 存储主节点 | 192.168.40.151 | 57003 | 6 | 
| 存储备节点 | 192.168.40.152 | |||
| 存储备节点 | 192.168.40.153 | |||
| shard_2 | 存储主节点 | 192.168.40.152 | 57005 | 7 | 
| 存储备节点 | 192.168.40.153 | |||
| 存储备节点 | 192.168.40.151 | |||
| shard_3 | 存储主节点 | 192.168.40.153 | 57007 | 5 | 
| 存储备节点 | 192.168.40.152 | |||
| 存储备节点 | 192.168.40.151 | 
2. 环境准备
创建测试数据库和用户
psql -h 192.168.40.152 -p 47001 postgres
create user kunlun_test with password 'kunlun';
create database testdb owner kunlun_test;
grant all privileges on database testdb to kunlun_test;
准备测试表sales_order,其按月分区。2023年1月分区的数据放在了shard_3上,2月数据放在了shard_1上,3月的数据则放在了shard_2上。
psql -h 192.168.40.152 -p 47001 -U kunlun_test testdb
CREATE TABLE sales_order
(
   order_number         INT NOT NULL,
   customer_number      INT NOT NULL,
   product_code         INT NOT NULL,
   order_date           DATETIME NOT NULL,
   entry_date           DATETIME NOT NULL,
   order_amount         DECIMAL(18,2) NOT NULL,
   PRIMARY KEY (order_number,order_date)
) partition by range(order_date);
CREATE TABLE sales_order_202301 PARTITION OF sales_order
FOR VALUES FROM ('2023-01-01') TO ('2023-02-01') with (Shard=5);
CREATE TABLE sales_order_202302 PARTITION OF sales_order
FOR VALUES FROM ('2023-02-01') TO ('2023-03-01') with (Shard=6);
CREATE TABLE sales_order_202303 PARTITION OF sales_order
FOR VALUES FROM ('2023-03-01') TO ('2023-04-01') with (Shard=7);
并往其中灌入10000条数据
create or replace procedure generate_order_data()
AS $$
DECLARE
  v_customer_number integer;
  v_product_code integer;
  v_order_date date;
  v_amount integer;
  start_date date := to_date('2023-01-01','yyyy-mm-dd');
  i integer :=1;
BEGIN
	while i<=10000 loop
		v_customer_number := FLOOR(1+RANDOM()*6);
		v_product_code := FLOOR(1+RANDOM()*1000);
		v_order_date := start_date + CAST(FLOOR(RANDOM()*90) AS INT);
		v_amount := FLOOR(1000+RANDOM()*9000);
		INSERT INTO sales_order VALUES (i,v_customer_number,v_product_code,v_order_date,v_order_date,v_amount);
		commit;
		i := i+1;
	end loop;
END; $$
LANGUAGE plpgsql;
call generate_order_data();
analyze sales_order;
testdb=> select count(*) from sales_order;
 count 
-------
 10000
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
 count 
-------
  3417
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-02-01' and '2023-02-28';
 count 
-------
  3086
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3497
(1 row)
3. 模拟整个 shard 故障
此时需要模拟shard_1所有的主备节点都发生故障。由于Klustron有自动拉起失败服务的功能,所以先需要将shard_1的数据文件目录移动位置,在shard_1的主备节点都执行
cd /kunlun/storage_datadir
mv 57003 57003_bak
然后再分别在shard_1的主备节点使用kill -9命令杀掉在57003端口上的mysqld和mysqld_safe进程。
4. 故障隔离测试
回到计算节点重新查询2023年2月的数据,则会抛出异常:
testdb=> select count(*) from sales_order where order_date between '2023-02-01' and '2023-02-28';
ERROR:  Kunlun-db: Failed to connect to mysql storage node at (192.168.40.151, 57003): 2002, Can't connect to server on '192.168.40.151' (111)
查询1月和3月的数据则能正常返回结果
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
 count 
-------
  3417
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3497
(1 row)
同样的,1月和3月数据对应的shard还能正常插入。
testdb=> begin;
BEGIN
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
 count 
-------
  3417
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3497
(1 row)
testdb=> insert into sales_order values(10001,1,1000,'2023-01-31','2023-01-31',1800);
INSERT 0 1
testdb=> insert into sales_order values(10002,1,1000,'2023-03-31','2023-03-31',2000);
INSERT 0 1
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
 count 
-------
  3418
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3498
(1 row)
testdb=> commit;
但是如果事务中涉及操作故障shard的数据,则会报错。
testdb=> begin;
BEGIN
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3498
(1 row)
testdb=> update sales_order set order_amount=1000 where order_date='2023-03-31';
UPDATE 113
testdb=> insert into sales_order values(10003,1,1000,'2023-02-14','2023-02-14',1800);
ERROR:  Kunlun-db: Failed to connect to mysql storage node at (192.168.40.151, 57003): 2002, Can't connect to server on '192.168.40.151' (111)
testdb=> insert into sales_order values(10005,2,1000,'2023-03-30','2023-03-30',2000);
ERROR:  current transaction is aborted, commands ignored until end of transaction block
testdb=> rollback;
ROLLBACK
通过上面的测试可以看到,故障shard上面存储的数据无法提供任何服务了,但其他正常shard的数据还可以进行正常的读写和事务。
【END】
