ClickHouse Data Backup and Migration

This article describes how to migrate ClickHouse data between different machines.

Data migration is a common task in day-to-day operations. There are generally three ways to migrate ClickHouse data:

Using the remote function

This approach is mainly suitable for tables that do not hold much data. It is simple; the statement is as follows:

INSERT INTO db.table SELECT * FROM remote('source_ip', db.table, 'user', 'password');
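
This statement is run on the destination server and pulls rows from the source; the destination table must already exist with the same schema, since INSERT INTO only fills an existing table. A quick sanity check afterwards, assuming clickhouse-client is available and source_ip/user/password are placeholders, is to compare row counts:

# Row counts on the destination and on the source should match once the INSERT finishes
clickhouse-client --query "SELECT count() FROM db.table"
clickhouse-client --query "SELECT count() FROM remote('source_ip', db.table, 'user', 'password')"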

Using the clickhouse-copier tool

This is a data migration and backup tool shipped with ClickHouse. It is more efficient for tables with large amounts of data and supports parallel copying, but it requires ZooKeeper.

clickhouse-copier synchronizes data on a per-cluster basis and uses two configuration files: one for connecting to ZooKeeper and one describing the copy task. The ZooKeeper configuration file (zookeeper.xml) looks like the following; adjust its contents to your environment:

<yandex>
    <logger>
        <level>trace</level>
        <size>100M</size>
        <count>3</count>
    </logger>

    <zookeeper>
        <node index="1">
            <host>127.0.0.1</host>
            <port>2181</port>
        </node>
    </zookeeper>
</yandex>

The copy task configuration file (named copy-job.xml here; the content is taken from the official documentation) should likewise be adjusted to your environment:

<yandex>
    <!-- Configuration of clusters as in an ordinary server config -->
    <remote_servers>
        <source_cluster>
            <shard>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>127.0.0.1</host>
                    <port>9000</port>
                </replica>
            </shard>
            ...
        </source_cluster>

        <destination_cluster>
            <shard>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>remote_ip</host>
                    <port>9000</port>
                </replica>
            </shard>
            ...
        </destination_cluster>
    </remote_servers>

    <!-- How many simultaneously active workers are possible. If you run more workers, superfluous workers will sleep. -->
    <max_workers>2</max_workers>

    <!-- Settings used to fetch (pull) data from source cluster tables -->
    <settings_pull>
        <readonly>1</readonly>
    </settings_pull>

    <!-- Settings used to insert (push) data into destination cluster tables -->
    <settings_push>
        <readonly>0</readonly>
    </settings_push>

    <!-- Common settings for fetch (pull) and insert (push) operations. The copier process context also uses them.
         They are overridden by <settings_pull/> and <settings_push/> respectively. -->
    <settings>
        <connect_timeout>3</connect_timeout>
        <!-- Sync insert is set forcibly, leave it here just in case. -->
        <insert_distributed_sync>1</insert_distributed_sync>
    </settings>

    <!-- Copying task descriptions.
         You can specify several table tasks in the same task description (in the same ZooKeeper node); they will be performed
         sequentially.
    -->
    <tables>
        <!-- A table task; copies one table. -->
        <table_hits>
            <!-- Source cluster name (from the <remote_servers/> section) and tables in it that should be copied -->
            <cluster_pull>source_cluster</cluster_pull>
            <database_pull>test</database_pull>
            <table_pull>hits</table_pull>

            <!-- Destination cluster name and tables into which the data should be inserted -->
            <cluster_push>destination_cluster</cluster_push>
            <database_push>test</database_push>
            <table_push>hits2</table_push>

            <!-- Engine of destination tables.
                 If the destination tables have not been created, workers create them using the column definitions from the source tables and the engine
                 definition from here.

                 NOTE: If the first worker starts inserting data and detects that the destination partition is not empty, the partition will
                 be dropped and refilled; take this into account if you already have some data in the destination tables. You can explicitly
                 specify the partitions that should be copied in <enabled_partitions/>; they should be in quoted format, like the partition column of the
                 system.parts table.
            -->
            <engine>
            ENGINE=ReplicatedMergeTree('/clickhouse/tables/{cluster}/{shard}/hits2', '{replica}')
            PARTITION BY toMonday(date)
            ORDER BY (CounterID, EventDate)
            </engine>

            <!-- Sharding key used to insert data into the destination cluster -->
            <sharding_key>jumpConsistentHash(intHash64(UserID), 2)</sharding_key>

            <!-- Optional expression that filters data while pulling it from the source servers -->
            <where_condition>CounterID != 0</where_condition>

            <!-- This section specifies the partitions that should be copied; other partitions will be ignored.
                 Partition names should have the same format as the
                 partition column of the system.parts table (i.e. quoted text).
                 Since the partition keys of the source and destination clusters could be different,
                 these partition names specify destination partitions.

                 NOTE: Although this section is optional (if it is not specified, all partitions will be copied),
                 it is strongly recommended to specify the partitions explicitly.
                 If you already have some ready partitions on the destination cluster, they
                 will be removed at the start of the copying, since they will be interpreted
                 as unfinished data from the previous copying!
            -->
            <enabled_partitions>
                <partition>'2018-02-26'</partition>
                <partition>'2018-03-05'</partition>
                ...
            </enabled_partitions>
        </table_hits>

        <!-- Next table to copy. It is not copied until the previous table has finished copying. -->
        <table_visits>
            ...
        </table_visits>
        ...
    </tables>
</yandex>

The two key parts of this configuration file are remote_servers and tables. remote_servers defines the source and destination clusters, in the same way clusters are defined for distributed tables. tables defines the tables to synchronize, and multiple tables can be listed. If a table does not yet exist on the destination cluster, the engine element defines how to create it, and it will be created automatically during synchronization. You can also specify which partitions to synchronize; the available partitions can be queried from the system.parts table.
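
For example, the partition names to list in <enabled_partitions> can be looked up on the source server; a minimal sketch, assuming the test.hits table from the task file above:

clickhouse-client --query "SELECT DISTINCT partition FROM system.parts WHERE database = 'test' AND table = 'hits' AND active"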

After creating the task description file, it needs to be uploaded to a ZooKeeper node with the following commands:

./zkCli.sh -server 127.0.0.1:2181 create /clickhouse/copytasks ""
./zkCli.sh -server 127.0.0.1:2181 create /clickhouse/copytasks/task1 ""
./zkCli.sh -server 127.0.0.1:2181 create /clickhouse/copytasks/task1/description "`cat copy-job.xml`"

To update the task later, use the following command:

./zkCli.sh -server 127.0.0.1:2181 set /clickhouse/copytasks/task1/description "`cat copy-job.xml`"

zkCli.sh is a script located in the bin directory of the ZooKeeper installation.
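
To verify that the task description was stored correctly, the node can be read back (assuming the same ZooKeeper address as above):

./zkCli.sh -server 127.0.0.1:2181 get /clickhouse/copytasks/task1/description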

Then start a clickhouse-copier instance to run the copy task:

clickhouse-copier --daemon --config zookeeper.xml --task-path /clickhouse/copytasks/task1 --base-dir /path/to/dir

The clickhouse-copier parameters above are explained as follows (a sketch using the task-file option follows the list):

  • daemon: Starts clickhouse-copier in daemon mode.
  • config: The path to the zookeeper.xml file with the parameters for the connection to ZooKeeper.
  • task-path: The path to the ZooKeeper node. This node is used for syncing clickhouse-copier processes and storing tasks. Tasks are stored in $task-path/description.
  • task-file: Optional path to file with task configuration for initial upload to ZooKeeper.
  • task-upload-force: Force upload task-file even if node already exists.
  • base-dir: The path to logs and auxiliary files. When it starts, clickhouse-copier creates clickhouse-copier_YYYYMMHHSS_ subdirectories in $base-dir. If this parameter is omitted, the directories are created in the directory where clickhouse-copier was launched.
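
As an alternative to uploading the description with zkCli.sh, the task-file option lets clickhouse-copier upload copy-job.xml to ZooKeeper itself on first start; a hedged sketch (behaviour may differ slightly between versions):

clickhouse-copier --daemon --config zookeeper.xml --task-path /clickhouse/copytasks/task1 --task-file copy-job.xml --base-dir /path/to/dir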

Offline copy

This approach uses DETACH/FREEZE on partitions, copies the data files to the remote machine, and then uses ATTACH to restore them. The main statements are as follows (the file-copy step is sketched after them):

-- Take the partition offline
ALTER TABLE db.table DETACH PARTITION [partition];
-- Back up (snapshot) the partition
ALTER TABLE db.table FREEZE PARTITION [partition];
-- Bring the partition back online
ALTER TABLE db.table ATTACH PARTITION [partition];
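
FREEZE only creates a local snapshot of the partition (hard links under the shadow directory); moving the data to another machine is a plain file copy followed by ATTACH on the destination. A minimal sketch, assuming the default data directory /var/lib/clickhouse, shadow increment 1, and remote_ip as the destination host:

# Copy the frozen partition files into the destination table's detached directory,
# then run the ATTACH statement above on the destination server.
rsync -avz /var/lib/clickhouse/shadow/1/data/db/table/ remote_ip:/var/lib/clickhouse/data/db/table/detached/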
