Data Import and Export

Klustron (formerly KunlunBase) supports full data import from common database systems and stream-based incremental data updates.

Full Data Import

To import data from PostgreSQL, simply perform a full dump of the data from the source database instance, then use Klustron's copy command to load it. When exporting data from other database systems, ensure that the SQL syntax is compatible with Klustron.
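As a minimal sketch, the per-table flow might look like this, using psql's \copy (a client-side wrapper around COPY); the hosts, ports, database, and table names below are placeholders, not fixed Klustron values:

```bash
# Placeholders (assumptions): a source PostgreSQL instance and a Klustron compute node.
SRC_HOST=192.0.2.10
KLUSTRON_HOST=192.0.2.20
KLUSTRON_PORT=5432

# Export one table's data from the source instance to CSV.
psql -h "$SRC_HOST" -d source_db \
     -c "\copy my_table TO 'my_table.csv' WITH (FORMAT csv)"

# Load the CSV into the same table on Klustron (the table schema must already exist there).
psql -h "$KLUSTRON_HOST" -p "$KLUSTRON_PORT" -d target_db \
     -c "\copy my_table FROM 'my_table.csv' WITH (FORMAT csv)"
```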

Importing from MySQL

If the source database system is MySQL, then starting from Klustron version 1.2, Klustron supports common MySQL DDL syntax and can directly handle export files produced by tools such as mysqldump.
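For example, a hedged sketch of exporting with mysqldump and replaying the dump on Klustron 1.2 or later, assuming the compute node accepts MySQL-protocol connections; the hosts, ports, users, and database names are placeholders:

```bash
# Placeholders (assumptions): a source MySQL server and a Klustron compute node
# reachable over the MySQL protocol.
MYSQL_HOST=192.0.2.30
KLUSTRON_HOST=192.0.2.20
KLUSTRON_MYSQL_PORT=3306

# Export the source database (schema and data) with mysqldump.
mysqldump -h "$MYSQL_HOST" -u dump_user -p source_db > mysql_full.sql

# Replay the dump against Klustron, which accepts common MySQL DDL from version 1.2 on.
mysql -h "$KLUSTRON_HOST" -P "$KLUSTRON_MYSQL_PORT" -u import_user -p target_db < mysql_full.sql
```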

Special Steps for Klustron 1.1

Klustron 1.1 does not support MySQL DDL syntax. However, Klustron provides the DDL2Kunlun tool, which converts MySQL DDL statements into Klustron-compatible DDL syntax. Pre-process the exported SQL statements with this tool before feeding them into Klustron.

Speeding Up Full Data Import

If you need to import a large amount of data, it is best to open multiple connections, with each connection processing a portion of the data files; ideally, each connection handles one data file. This maximizes parallelism, makes the best use of system resources, and significantly reduces the time required to import the data. When performing parallel imports, if the data files are not split by table and the transaction isolation level is not READ COMMITTED (Klustron's default), set transaction_isolation='read committed' in the configuration file and then run the pg_ctl reload command to reload it.
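For instance, a minimal sketch of that configuration change on a compute node, assuming a placeholder data directory path:

```bash
# Placeholder (assumption): the compute node's data directory.
COMPUTE_DATADIR=/data/klustron/compute1

# Force READ COMMITTED for the parallel import.
echo "transaction_isolation = 'read committed'" >> "$COMPUTE_DATADIR/postgresql.conf"

# Reload the configuration without restarting the node.
pg_ctl reload -D "$COMPUTE_DATADIR"
```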

If the data files to be imported contain numerous auto-commit insert statements rather than inserting a large number of rows within an explicit transaction, the overhead of millions or even billions of transaction commits can impose a heavy load on the disk, significantly slowing down the data import speed. In this case, either modify the source data files to insert a large number of rows within each INSERT statement or explicit transaction to reduce the number of commits, or temporarily adjust the cluster node settings as described below to significantly reduce the IO overhead of transaction commits.
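As an illustration, the sketch below batches rows into multi-row INSERT statements inside one explicit transaction so that many rows share a single commit; the connection parameters and table are placeholders:

```bash
# Placeholders (assumptions): a Klustron compute node and a target table t1(id, val).
COMPUTE_HOST=192.0.2.20
COMPUTE_PORT=5432

psql -h "$COMPUTE_HOST" -p "$COMPUTE_PORT" -d target_db <<'SQL'
BEGIN;
INSERT INTO t1 (id, val) VALUES (1, 'a'), (2, 'b'), (3, 'c');  -- multi-row INSERT
INSERT INTO t1 (id, val) VALUES (4, 'd'), (5, 'e'), (6, 'f');
COMMIT;                                                        -- one commit for all rows
SQL
```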

Temporary Speed Settings for Data Import

If you want to speed up the full data import and the database cluster will not be used for production until the data import is complete, consider applying the following settings to the compute nodes and storage shard masters. After the import is complete, revert to the original values. This will significantly increase the import speed, but if a compute node or shard master node exits abnormally (e.g., power failure, hardware/software fault) during the import, you will need to reinstall the database cluster and re-import the data from scratch.

These settings can also speed up stream-based imports. However, since a stream-based import is usually a continuous process and the cluster may already be in production use, these speed settings should normally not be applied. If the conditions for the temporary speed settings are still met during a stream-based import, you can modify these variables in the same way and revert them after the import completes.

Storage Nodes

| Variable Name                  | Temporary Value Before Import | Value After Import |
|--------------------------------|-------------------------------|--------------------|
| innodb_doublewrite             | 0                             | 1                  |
| innodb_flush_log_at_trx_commit | 0                             | 1                  |
| sync_binlog                    | 0                             | 1                  |
| enable_fullsync                | false                         | true               |
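For example, a hedged sketch of applying the storage-node values above through each shard master's MySQL option file; the file path is a placeholder, the shard typically needs a restart (or the dynamic variables set at runtime) for the values to take effect, and the settings must be reverted to the right-hand column once the import finishes:

```bash
# Placeholder (assumption): the shard master's option file.
MYCNF=/data/klustron/shard1/my.cnf

# Temporary import-speed settings from the table above; revert them after the import.
cat >> "$MYCNF" <<'EOF'
[mysqld]
innodb_doublewrite = 0
innodb_flush_log_at_trx_commit = 0
sync_binlog = 0
# enable_fullsync is a Klustron (kunlun-storage) variable, not stock MySQL.
enable_fullsync = false
EOF
```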

Stream-Based Incremental Data Import

For stream-based data imports from most database systems (such as Oracle, DB2, SQL Server, Sybase, Informix, PostgreSQL, etc.), you can use third-party tools like OGG, DSG, etc. For stream-based imports from MySQL, Klustron version 1.1 provides the Binlog2sync tool for importing data updates. Starting from Klustron version 1.2, you can use Klustron's CDC tool for stream-based imports.

If you are importing full and incremental data from TiDB or other databases that provide their own full and incremental export tools, refer to the relevant tool documentation for details.