Greenplum Data Migration Tools: gpcopy

Published May 16, 2023 · Updated September 3, 2024

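Introduction

gpcopy is a data migration utility that transfers data between clusters: it copies the metadata and data of a Greenplum database in one cluster to a Greenplum database in another cluster. gpcopy can migrate the entire contents of a database, including schemas, table data, indexes, views, roles, user-defined functions, resource queues, and resource groups.

gpcopy is the new generation of Greenplum data migration tools. It helps users migrate data quickly and reliably between clusters and between versions. Compared with the previous-generation tool, gptransfer, it is faster, more stable, easier to use, and richer in features; it is effectively the successor to gptransfer.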

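What gpcopy can do

gpcopy can migrate an entire cluster, or only specific databases, schemas, and tables. It can read the list of tables to copy or to skip from a file, and table names support regular expressions. For tables that already exist in the destination cluster it can skip them, append to them, or replace their data. It can copy tables in parallel, copy only the structure (metadata), and run unattended without prompts.

In short, it migrates the information stored in one cluster to another, so that the user's workloads can move with it.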
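The capabilities listed above map onto flags that appear in the gpcopy help output. The following is a minimal sketch, not a tested migration plan: the host names, the database name sales, the table pattern, and the file skip.txt are all hypothetical, and gpcopy may impose constraints on which flags can be combined. The commands are only printed, so the snippet is safe to run without a cluster.

```shell
# Hypothetical master host names; replace with real ones.
SRC=src-mdw
DST=dst-mdw

# Print a gpcopy command line instead of executing it.
plan() { printf 'gpcopy --source-host %s --dest-host %s %s\n' "$SRC" "$DST" "$*"; }

# Whole cluster, skipping tables that already exist, without prompting.
plan --full --skip-existing --yes

# One database, replacing destination data, copying up to 8 tables at a time.
plan --dbname sales --dest-dbname sales --truncate --jobs 8

# Only the tables matched by a regular expression, appending to existing tables.
plan --include-table 'sales.public.orders_.*' --append

# One database, except the tables listed in a file (hypothetical skip.txt).
plan --dbname sales --dest-dbname sales --exclude-table-file skip.txt --skip-existing

# Structure only, no table data.
plan --dbname sales --dest-dbname sales --metadata-only
```

Once a printed line looks right, it can be run directly; adding --dry-run first is a cautious way to test it without touching schema or data.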

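Speed comparison with gptransfer

(1) Faster data copying. Note that this refers to data, not metadata. gpcopy is faster for three reasons: direct segment-to-segment transfer, Snappy-compressed transfer, and data validation.

① Direct segment-to-segment transfer: when a table has more rows than the configured threshold (--on-segment-threshold, 10000 rows by default), gpcopy uses the COPY ON SEGMENT feature to transfer data concurrently between the segments of the two clusters. In addition, gpcopy moves data with the COPY command, whereas gptransfer uses SELECT and INSERT on external tables, which operate row by row. Because COPY works in bulk, it is naturally faster than INSERT.

② Snappy-compressed transfer: by default gpcopy compresses the transferred data in Google's Snappy format, whereas gptransfer uses zlib. Published comparisons of the two show that Snappy compresses and decompresses considerably faster.

③ Data validation: gpcopy and gptransfer each offer two validation methods. The first is the same in both: compare the row counts of the source and destination tables. The second is MD5-based in both, but gptransfer first sorts the source and destination tables, then computes an MD5 hash over the sorted rows and compares them row by row. gpcopy instead converts all columns of each row to text, computes the MD5 of each row, and then XORs the MD5 values together and compares the results.

(2) More stable migration. A named pipe exists as a file in the file system, and any process with permission can open that file and communicate through it. This makes named pipe files hard to manage and prone to problems. gpcopy does not use named pipe files. gptransfer, by contrast, moves data from the source database to the destination database using writable and readable external tables, Greenplum's gpfdist parallel loading utility, and named pipes, so it cannot do without them.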

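Why gpcopy is faster

· Direct segment-to-segment transfer

gpcopy transfers data with Greenplum's COPY ON SEGMENT feature. First, COPY is faster than the external tables that gptransfer relies on exclusively, because COPY is a bulk operation while SELECT and INSERT on external tables work row by row. Second, COPY ON SEGMENT lets gpcopy transfer data concurrently between the segments of the two clusters, which makes it faster still.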
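To make the COPY ON SEGMENT feature concrete, here is a small sketch that generates the SQL for a per-segment unload. It only illustrates the underlying Greenplum feature; the statements gpcopy itself issues are not shown in this article. The table name and output path are hypothetical.

```shell
# Print SQL that unloads a table with COPY ... ON SEGMENT.
# $1 = table name, $2 = output file prefix (both hypothetical here).
copy_on_segment_sql() {
  table=$1
  prefix=$2
  cat <<EOF
-- Every primary segment writes its own rows to its own file in parallel.
-- Greenplum replaces <SEGID> with the content ID of the segment.
COPY ${table} TO '${prefix}_<SEGID>' ON SEGMENT;
EOF
}

copy_on_segment_sql public.orders /tmp/unload/orders
```

The generated statement can be piped to psql against a test database. Without ON SEGMENT, the same COPY would funnel every row through the master, which is the bottleneck this feature removes.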

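(Figure: gpcopy architecture for transfers between Greenplum clusters with the same number of segments. The design is simple and direct.)

· Snappy-compressed transfer

gpcopy enables compression by default and compresses the transferred data in Google's Snappy format, which greatly reduces the load on the network and speeds up the transfer.

For most inputs, Snappy is about an order of magnitude faster than the fastest mode of zlib. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/s or more and decompresses at about 500 MB/s or more.

· Faster data validation

Deciding whether tables in two database systems are identical has never been simple. A plain hash check has to take row order into account, and sorting slows things down further. If both systems are clusters, as Greenplum is, the problem gets harder. gpcopy solves it flexibly: no sorting is needed, and validation is several times faster than hashing an exported CSV file.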
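The order-independent check described earlier (take the MD5 of each row, then XOR the digests) can be sketched in a few lines of shell. This is an illustration of the idea, not gpcopy's implementation: it assumes rows arrive as text lines on standard input and that md5sum from GNU coreutils is available. A plain XOR also cancels out pairs of identical rows, and the sketch says nothing about how gpcopy handles that case.

```shell
# md5xor: read rows on stdin, print the XOR of their MD5 digests as 32 hex digits.
md5xor() {
  a=0; b=0; c=0; d=0
  while IFS= read -r row || [ -n "$row" ]; do
    h=$(printf '%s' "$row" | md5sum | cut -c1-32)
    # Fold the 128-bit digest into four 32-bit accumulators.
    a=$(( a ^ 0x$(printf '%s' "$h" | cut -c1-8) ))
    b=$(( b ^ 0x$(printf '%s' "$h" | cut -c9-16) ))
    c=$(( c ^ 0x$(printf '%s' "$h" | cut -c17-24) ))
    d=$(( d ^ 0x$(printf '%s' "$h" | cut -c25-32) ))
  done
  printf '%08x%08x%08x%08x\n' "$a" "$b" "$c" "$d"
}

# The same rows in a different order produce the same checksum.
src=$(printf '1,alice\n2,bob\n3,carol\n' | md5xor)
dst=$(printf '3,carol\n1,alice\n2,bob\n' | md5xor)
[ "$src" = "$dst" ] && echo "tables match"
```

Because XOR is commutative and associative, each segment can compute a partial result over its own rows and the partial results can be combined in any order, which is why no sort is needed.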

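Why gpcopy is more stable

· No named pipe files

A named pipe exists as a file in the file system, and any process with permission can open it and communicate through it. A named pipe follows first-in, first-out order: reads always return data from the front, data that has been read is removed from the pipe, writes append to the end, and operations such as lseek are not supported.

Named pipe files are hard to manage and prone to problems. Consider two of their properties together: other processes are not prevented from reading, and data that has been read is gone. If antivirus software on the user's system reads every file to sample it, it consumes data that was meant for the migration. (This is a real case.)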
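The two properties above are easy to reproduce. The sketch below assumes Linux, where a FIFO can be opened read-write so that a single process can play the writer and both readers; the variable names are illustrative only.

```shell
# Show that data read from a named pipe is consumed and never reaches a later reader.
dir=$(mktemp -d)
mkfifo "$dir/pipe"

# Open the pipe read-write on fd 3 so that neither end blocks in this one-process demo.
exec 3<>"$dir/pipe"
printf 'row1\nrow2\n' >&3

IFS= read -r sampled <&3    # an unrelated process samples the pipe first
IFS= read -r received <&3   # the intended reader arrives second

echo "sampler took: $sampled"
echo "reader got:   $received"

exec 3>&-
rm -r "$dir"
```

The intended reader never sees row1. In a migration that loss is a silently missing or corrupted chunk of table data.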

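· Thorough logging and error handling

gpcopy puts a lot of effort into this area. Every step, every query and command executed, and every result is written to the log file and echoed to standard output at the level the user selects.

Migration operations also run inside transactions, so an error can be rolled back at the table level. When a run finishes, gpcopy prints a detailed summary of successes and failures, and it generates and displays a command the user can run to retry everything that failed.

If something does go wrong in a user's environment, support staff can combine the gpcopy and Greenplum log files to locate the problem quickly and propose a fix, which gives customers the best chance of a smooth migration.

· Data validation that is both workable and practical

As mentioned earlier, gptransfer validates data by sorting it and then hashing it. Most users had to skip validation because it was too slow, so stability and consistency could not be guaranteed.

· gpcopy can be used for upgrades

A Greenplum version upgrade usually changes the catalog, so upgrading only the executables leaves the system incompatible. gpcopy makes an in-place upgrade possible, and because validation is fast and usable, users can safely free space as the data is migrated. (Even so, a backup is strongly recommended.)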
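One way the "migrate while freeing space" idea might look on the command line is sketched below, using only flags from the gpcopy help output. The layout is an assumption made for illustration: the old cluster listens on port 5432 and the new one on port 6432 of the same host, mdw. The command is printed rather than executed, because --truncate-source-after removes source data.

```shell
# Print (do not run) a gpcopy command that validates each table and then
# truncates it in the source to release storage space.
upgrade_cmd() {
  echo gpcopy \
    --source-host mdw --source-port 5432 \
    --dest-host mdw --dest-port 6432 \
    --full --drop \
    --validate md5xor --truncate-source-after
}

upgrade_cmd
```

Take a backup before running anything like this for real: once a source table has been truncated, the destination copy is the only one left.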

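gpcopy version history

The gpcopy versions with the largest changes are 1.0.0, 1.1.0, and 1.5.0.

gpcopy 1.0.0: introduced with greenplum-db-5.9.0; supports migration only between Greenplum databases with the same number of segments.

gpcopy 1.1.0: introduced with greenplum-db-5.12.0; supports transfer between Greenplum clusters with different numbers of segments, in two cases:

① transfer from a smaller cluster to a larger cluster

② transfer from a larger cluster to a smaller cluster

gpcopy 1.5.0: starting with Greenplum 4.3.33.0 and 5.21.0, gpcopy is no longer bundled in the Greenplum installation package. This is the first standalone release of Pivotal gpcopy. Compared with earlier versions, gpcopy 1.5.0 makes the following changes:

① When copying table data, the destination schema and table name can be changed, provided the destination table already exists and has exactly the same structure as the source table.

② Table ownership and privilege information is copied by default. Previously it was copied only when the --full option was used.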

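Transfer between Greenplum clusters with different numbers of segments

gpcopy 1.1.0 supports transfer between Greenplum clusters with different numbers of segments.

At this stage, export still uses COPY ON SEGMENT, the fastest path, while import uses external tables. Concurrent transfer between multiple segments, compression, and faster data validation are all retained. Further optimizations for this scenario are planned.

(Figure: basic architecture of a gpcopy transfer from a smaller cluster to a larger one. Beyond what the figure shows, transfer-volume skew is also optimized.)

(Figure: basic architecture of a gpcopy transfer from a larger cluster to a smaller one, which likewise includes an optimization to avoid skew.)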

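Download

https://network.pivotal.io/products/gpdb-data-copy

Size: 12 MB

Installation

After extracting the package, copy the gpcopy and gpcopy_helper files to the $GPHOME/bin directory on every host of both the source and destination clusters.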

# 1. Extract the package
tar xzvf gpcopy-2.6.0.tar.gz
cd gpcopy-2.6.0/

# 2. On the master host, copy gpcopy and gpcopy_helper to the Greenplum bin directory
cp gpcopy $GPHOME/bin
cp gpcopy_helper $GPHOME/bin

# 3. Set permissions
chmod 755 $GPHOME/bin/gpcopy
chmod 755 $GPHOME/bin/gpcopy_helper

# 4. On the segment hosts only gpcopy_helper is required, with the same permissions
#    (copying both files is recommended)
scp gpcopy gpadmin@sdw1:/usr/local/greenplum-db-6.23.1/bin/
scp gpcopy_helper gpadmin@sdw1:/usr/local/greenplum-db-6.23.1/bin/

# Alternatively, copy both files to every host listed in ~/conf/all_hosts
gpscp -v -f ~/conf/all_hosts gpcopy =:$GPHOME/bin/
gpscp -v -f ~/conf/all_hosts gpcopy_helper =:$GPHOME/bin/

In short:

tar xzvf gpcopy-2.6.0.tar.gz 
cd gpcopy-2.6.0/

gpscp -v -f ~/conf/all_hosts gpcopy =:$GPHOME/bin/
gpscp -v -f ~/conf/all_hosts gpcopy_helper =:$GPHOME/bin/
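After copying the files it is worth confirming that both binaries are in place and executable. The helper below is a hypothetical convenience function, not part of gpcopy; it checks one directory on the local host.

```shell
# Return 0 if gpcopy and gpcopy_helper are both present and executable in the given directory.
check_gpcopy_install() {
  bindir=$1
  for f in gpcopy gpcopy_helper; do
    if [ ! -x "$bindir/$f" ]; then
      echo "missing or not executable: $bindir/$f" >&2
      return 1
    fi
  done
  echo "gpcopy and gpcopy_helper found in $bindir"
}

# Check the local installation only when GPHOME is set.
if [ -n "${GPHOME:-}" ]; then
  check_gpcopy_install "$GPHOME/bin" || true
fi
```

To check every host at once, something like gpssh -f ~/conf/all_hosts -e 'ls -l $GPHOME/bin/gpcopy_helper' should list the file on each host, assuming the same host file used with gpscp above.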

Help

[gpadmin@lhrgp40 ~]$ gpcopy -h
gpcopy utility is for copying objects from a Greenplum cluster to another

Usage:
  gpcopy [flags]

Flags:
  -a, --analyze                          Analyze tables after copy
      --append                           Append destination table if it exists
      --data-port-range DashInt          The range of port number destination helper chooses to transfer data in
  -d, --dbname string                    The database to be copied
      --debug                            Print debug log messages
  -D, --dest-dbname string               The database in destination cluster to copy to
      --dest-host string                 The host of destination cluster
      --dest-mapping-file string         Use the host to IP map file in case of destination cluster IP auto-resovling fails
      --dest-port int                    The port of destination cluster (default 5432)
      --dest-table string                The renamed dest table(s) for include-table, separated by commas, supports regular expression
      --dest-user string                 The user of destination cluster (default "gpadmin")
      --drop                             Drop destination table if it exists prior to copying data
      --dry-run                          Just run for test without affecting gpdb schema or data
      --dumper string                    The dll dumper to be used. "pgdump" or "gpbackup" ("gpbackup" is an experimental option) (default "pgdump")
      --enable-receive-daemon            Use a daemon helper process with a single port on destination to receive data (default true)
  -e, --exclude-table string             Copy all tables except the specified table(s), separated by commas
  -E, --exclude-table-file ArrayString   Copy all tables except the specified table(s) listed in the file
  -F, --full                             Copy full data cluster
  -h, --help                             help for gpcopy
  -t, --include-table string             Copy only the specified table(s), separated by commas, supports regular expression
  -T, --include-table-file string        Copy only the specified table(s) listed in the file
      --include-table-json string        Copy only the specified table(s) listed in the json format, can contain destination table name and filter SQL.
      --jobs int                         The maximum number of tables that concurrently copies, valid values are between 1 and 64512 (default 4)
  -m, --metadata-only                    Only copy metadata, do not copy data
      --no-compression                   Transfer the plain data, instead of compressing as Snappy format
      --no-distribution-check            Don't check distribution while copying
      --no-ownership                     Don't copy owner and privileges for table or sequence
  -o, --on-segment-threshold int         Copy between masters directly, if the table has smaller or same number of rows (default 10000)
  -p, --parallelize-leaf-partitions      Copy the leaf partition tables in parallel (default true)
      --quiet                            Suppress non-warning, non-error log messages
      --skip-existing                    Skip tables that exist in destination cluster
      --source-host string               The host of source cluster (default "127.0.0.1")
      --source-port int                  The port of source cluster (default 5432)
      --source-user string               The user of source cluster (default "gpadmin")
      --ssl-ca string                    SSL ca root cert file path for helper's TLS data socket
      --ssl-cert string                  SSL cert file path for helper's TLS data socket
      --ssl-key string                   SSL key file path for helper's TLS data socket
      --ssl-min-tls string               Minus version of helper's TLS data socket
      --timeout int                      The timeout in second to wait until source and destination are both ready for data transferring. '0' means waiting forever. (default 30)
      --truncate                         Truncate destination table if it exists prior to copying data
      --truncate-source-after            Truncate the source table after it's copied to release storage space
  -v, --validate string                  The method performing data validation on table data, "count" or "md5xor"
      --verbose                          Print verbose log messages
      --version                          version for gpcopy
      --yes                              Do not prompt
[gpadmin@lhrgp40 ~]$

Command examples

# --debug prints the log in the foreground
# Migrate the table datadev.public.dmtestone from server 21 to server 102
# Database: datadev  Schema: public  Table: dmtestone

export PGSSLMODE=disable
gpcopy --source-host 192.168.100.21 --dest-host 192.168.100.102 \
--include-table datadev.public.dmtestone --drop --dest-table postgres2.public.dmtestone --debug

# Migrate the dc database from server 21 to server 102
gpcopy --source-host 192.168.100.21 --dest-host 192.168.100.102 \
--dbname dc --dest-dbname dc --skip-existing --debug

# Copy the full cluster, dropping existing destination tables and validating by row count
gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
    --dest-host demohost --dest-port 1234 --dest-user gpuser \
    --full --drop --validate count

About the --on-segment-threshold parameter

The --on-segment-threshold setting determines where gpcopy performs the data transfer:

When --on-segment-threshold is set to -1 (the default), gpcopy copies table data using the source and destination Greenplum Database segment instances.
When --on-segment-threshold is set to -2, gpcopy copies table data using the source and destination Greenplum Database coordinators.
When --on-segment-threshold is set to a positive value, it identifies a row number threshold. If a table contains this number of rows or less, gpcopy copies the table data using the source and destination coordinators. If the number of rows in a table is more than the threshold, gpcopy copies the data using the source and destination segment instances. gpcopy uses the source table statistics to determine the number of table rows. If a source table is not analyzed, gpcopy assumes that the table is a small table, ignores the threshold setting, and copies the table data using only the coordinators. If your database has large tables without statistics, set the --on-segment-threshold option to -1 to force gpcopy to copy table data using segment instances.
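The rules above can be restated as a small decision function. This is a sketch of the documented behaviour only, not gpcopy code; the function name and its arguments are hypothetical.

```shell
# copy_path THRESHOLD ROWS ANALYZED
# Prints "segments" or "coordinators" according to the rules quoted above.
# ROWS is the row count from table statistics; ANALYZED is "yes" or "no".
copy_path() {
  threshold=$1; rows=$2; analyzed=$3
  if [ "$threshold" -eq -1 ]; then
    echo segments        # always copy between segment instances
  elif [ "$threshold" -eq -2 ]; then
    echo coordinators    # always copy between coordinators
  elif [ "$analyzed" != yes ]; then
    echo coordinators    # no statistics: assumed to be a small table
  elif [ "$rows" -le "$threshold" ]; then
    echo coordinators    # at or below the row threshold
  else
    echo segments        # above the row threshold
  fi
}

copy_path 10000 10000 yes     # coordinators
copy_path 10000 10001 yes     # segments
copy_path 10000 5000000 no    # coordinators
copy_path -1 5000000 no       # segments
```

The third call shows the pitfall: a large table without statistics is copied through the coordinators. Setting the threshold to -1, or analyzing the source tables first, avoids it.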

About the --no-distribution-check parameter

gpcopy does not support table data distribution checking when copying a partitioned table that is defined with a leaf table that is an external table or if a leaf table is defined with a distribution policy that is different from the root partitioned table. You can copy those tables in a gpcopy operation and specify the --no-distribution-check option to deactivate checking of data distribution.


Author: 小麦苗
