1、创建Redshift集群
实验一创建的集群是2节点的dc2.large,这里调整为4节点dc2.large。
调整集群过程可以使用Resize功能,在提示选项时候,选择使用Elastic Resize,将不需要重新创建新集群,而是直接添加节点,这样的方式比Classic Resize要快很多。
调整后等待一段时间生效开始测试。
本实验推荐使用SQLWorkbenchJ等客户端进行,便于查看运行时间。在配置连接时候,请打开auto-commit,以确保实验中每个sql都自动commit而无需手工执行commit。
2、未进行优化场景下的测试
2.1 创建未压缩的表结构
生效之后,执行如下命令创建表结构。
CREATE TABLE part
(
p_partkey INTEGER NOT NULL ENCODE RAW,
p_name VARCHAR(22) NOT NULL ENCODE RAW,
p_mfgr VARCHAR(6) NOT NULL ENCODE RAW,
p_category VARCHAR(7) NOT NULL ENCODE RAW,
p_brand1 VARCHAR(9) NOT NULL ENCODE RAW,
p_color VARCHAR(11) NOT NULL ENCODE RAW,
p_type VARCHAR(25) NOT NULL ENCODE RAW,
p_size INTEGER NOT NULL ENCODE RAW,
p_container VARCHAR(10) NOT NULL ENCODE RAW
);
CREATE TABLE supplier
(
s_suppkey INTEGER NOT NULL ENCODE RAW,
s_name VARCHAR(25) NOT NULL ENCODE RAW,
s_address VARCHAR(25) NOT NULL ENCODE RAW,
s_city VARCHAR(10) NOT NULL ENCODE RAW,
s_nation VARCHAR(15) NOT NULL ENCODE RAW,
s_region VARCHAR(12) NOT NULL ENCODE RAW,
s_phone VARCHAR(15) NOT NULL ENCODE RAW
);
CREATE TABLE customer
(
c_custkey INTEGER NOT NULL ENCODE RAW,
c_name VARCHAR(25) NOT NULL ENCODE RAW,
c_address VARCHAR(25) NOT NULL ENCODE RAW,
c_city VARCHAR(10) NOT NULL ENCODE RAW,
c_nation VARCHAR(15) NOT NULL ENCODE RAW,
c_region VARCHAR(12) NOT NULL ENCODE RAW,
c_phone VARCHAR(15) NOT NULL ENCODE RAW,
c_mktsegment VARCHAR(10) NOT NULL ENCODE RAW
);
CREATE TABLE dwdate
(
d_datekey INTEGER NOT NULL ENCODE RAW,
d_date VARCHAR(19) NOT NULL ENCODE RAW,
d_dayofweek VARCHAR(10) NOT NULL ENCODE RAW,
d_month VARCHAR(10) NOT NULL ENCODE RAW,
d_year INTEGER NOT NULL ENCODE RAW,
d_yearmonthnum INTEGER NOT NULL ENCODE RAW,
d_yearmonth VARCHAR(8) NOT NULL ENCODE RAW,
d_daynuminweek INTEGER NOT NULL ENCODE RAW,
d_daynuminmonth INTEGER NOT NULL ENCODE RAW,
d_daynuminyear INTEGER NOT NULL ENCODE RAW,
d_monthnuminyear INTEGER NOT NULL ENCODE RAW,
d_weeknuminyear INTEGER NOT NULL ENCODE RAW,
d_sellingseason VARCHAR(13) NOT NULL ENCODE RAW,
d_lastdayinweekfl VARCHAR(1) NOT NULL ENCODE RAW,
d_lastdayinmonthfl VARCHAR(1) NOT NULL ENCODE RAW,
d_holidayfl VARCHAR(1) NOT NULL ENCODE RAW,
d_weekdayfl VARCHAR(1) NOT NULL ENCODE RAW
);
CREATE TABLE lineorder
(
lo_orderkey INTEGER NOT NULL ENCODE RAW,
lo_linenumber INTEGER NOT NULL ENCODE RAW,
lo_custkey INTEGER NOT NULL ENCODE RAW,
lo_partkey INTEGER NOT NULL ENCODE RAW,
lo_suppkey INTEGER NOT NULL ENCODE RAW,
lo_orderdate INTEGER NOT NULL ENCODE RAW,
lo_orderpriority VARCHAR(15) NOT NULL ENCODE RAW,
lo_shippriority VARCHAR(1) NOT NULL ENCODE RAW,
lo_quantity INTEGER NOT NULL ENCODE RAW,
lo_extendedprice INTEGER NOT NULL ENCODE RAW,
lo_ordertotalprice INTEGER NOT NULL ENCODE RAW,
lo_discount INTEGER NOT NULL ENCODE RAW,
lo_revenue INTEGER NOT NULL ENCODE RAW,
lo_supplycost INTEGER NOT NULL ENCODE RAW,
lo_tax INTEGER NOT NULL ENCODE RAW,
lo_commitdate INTEGER NOT NULL ENCODE RAW,
lo_shipmode VARCHAR(10) NOT NULL ENCODE RAW
);
2.2 以不压缩方式加载数据
执行以下命令导入数据。五个表可以分别导入,也可以一次整体导入。其中compupdate off命令是强制禁用压缩。
导入时候请记录每条SQL执行的时间。可以分别记录,也可以一次运行整体记录下来。记录下来的时间将与未来优化后的加载时间进行对比。
copy customer from 's3://redshift-demo-lxy/demo2/customer'
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift'
gzip compupdate off region 'cn-northwest-1';
copy dwdate from 's3://redshift-demo-lxy/demo2/dwdate'
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift'
gzip compupdate off region 'cn-northwest-1';
copy lineorder from 's3://redshift-demo-lxy/demo2/lineorder'
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift'
gzip compupdate off region 'cn-northwest-1';
copy part from 's3://redshift-demo-lxy/demo2/part'
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift'
gzip compupdate off region 'cn-northwest-1';
copy supplier from 's3://redshift-demo-lxy/demo2/supplier'
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift'
gzip compupdate off region 'cn-northwest-1';
此时可以通过SQLWorkbenchJ等GUI工具一次性执行所有命令,并分别记录下五个语句过程各自的时间,然后记录下总加载时间,后续将进行对比。
验证数据加载是否完整,执行如下命令。
select count(*) from LINEORDER;
select count(*) from PART;
select count(*) from CUSTOMER;
select count(*) from SUPPLIER;
select count(*) from DWDATE;
返回结果与如下数据对比,一致则表示数据加载正确。
表名 | 行数 |
---|---|
LINEORDER | 600037902 |
PART | 1400000 |
CUSTOMER | 3000000 |
SUPPLIER | 1000000 |
DWDATE | 2556 |
记录下这个结果,后续将进行对比。
2.3 查询占用的空间
执行如下命令。
select stv_tbl_perm.name as table, count(*) as mb
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
and stv_blocklist.slice = stv_tbl_perm.slice
and stv_tbl_perm.name in ('lineorder','part','customer','dwdate','supplier')
group by stv_tbl_perm.name
order by 1 asc;
记录下占用空间数字,后续将进行对比。
2.4 检查数据的切片和分布
查询切片数。
SELECT count(slice) from stv_slices;
查询结果为8。由于本次实验采用的dc2.large规格,一个节点有2个切片,4个节点共计8个切片。
执行如下检查数据的分布命令。
select trim(name) as table, slice, sum(num_values) as rows, min(minvalue), max(maxvalue)
from svv_diskusage
where name in ('customer', 'part', 'supplier', 'dwdate', 'lineorder')
and col =0
group by name, slice
order by name, slice;
返回结果可以看到数据是分布在各Slide上的。
2.5 查询测试
首先Redshift在第一次执行SQL时候需要编译,因此执行时间很长,第二次执行就会非常快。但是这里需要考虑缓存因素,如果执行相同的查询,会从缓存返回结果。所以后续测试:
- 以第二次或者多次查询的结果为准,忽略第一次查询
- 禁用查询缓存
运行如下三个query,分别记录测试时间。注意第一次运行时候需要变异,时间较长,先不要记录时间,第二次运行即可获得正确的运行时间。也可以反复运行个3-5次后,取最后一次的数值。由于关掉了缓存,每次运行都应该是相差较小的查询耗时。
-- Disable query cache
set enable_result_cache_for_session to off;
-- Query 1
-- Restrictions on only one dimension.
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, dwdate
where lo_orderdate = d_datekey
and d_year = 1997
and lo_discount between 1 and 3
and lo_quantity < 24;
-- Query 2
-- Restrictions on two dimensions
select sum(lo_revenue), d_year, p_brand1
from lineorder, dwdate, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;
-- Query 3
-- Drill down in time to just one month
select c_city, s_city, d_year, sum(lo_revenue) as revenue
from customer, lineorder, supplier, dwdate
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and (c_city='UNITED KI1' or
c_city='UNITED KI5')
and (s_city='UNITED KI1' or
s_city='UNITED KI5')
and d_yearmonth = 'Dec1997'
group by c_city, s_city, d_year
order by d_year asc, revenue desc;
记录下三个查询个字的时间,后续将进行对比。
三、了解不同的压缩算法
3.1 创建一个临时测试表并指定不同压缩算法
执行如下命令,新建一张表叫做encodingshipmode,其中各列分别使用raw、bytedict、lzo、runlength、text255、text32k压缩算法。
create table encodingshipmode (
moderaw varchar(22) encode raw,
modebytedict varchar(22) encode bytedict,
modelzo varchar(22) encode lzo,
moderunlength varchar(22) encode runlength,
modetext255 varchar(22) encode text255,
modetext32k varchar(22) encode text32k
);
3.2 插入测试数据
执行如下命令,生成测试数据。注意,请在保持上述数据基础上用lineorder表作为数据源。原始lineorder表有6千万行,本次插入数据测试,只插入其中的20000000行。
insert into encodingshipmode
select
lo_shipmode as moderaw,
lo_shipmode as modebytedict,
lo_shipmode as modelzo,
lo_shipmode as moderunlength,
lo_shipmode as modetext255,
lo_shipmode as modetext32k
from lineorder
where lo_orderkey < 200000000;
执行过程需要数分钟。执行如下命令确认数据加载完成。
select count(*) from encodingshipmode;
表行数应为 200000262 行。表示数据加载成功。
3.3 查看不同压缩比情况下的空间占用
执行如下命令显示每一列占用的空间。通过查找 STV_BLOCKLIST 系统视图,显示各列占用空间如下。
select col, max(blocknum)
from stv_blocklist b, stv_tbl_perm p
where (b.tbl=p.id) and name = 'encodingshipmode'
and col < 6
group by name, col
order by col;
输出结果如下。
col | max
----+----
0 | 442
1 | 53
2 | 122
3 | 385
4 | 108
5 | 210
由此可以看到,不同的列采用的压缩算法不一样,占用的的空间也不一样。其中0列是原始数据442MB,第1到5列是不同个压缩算法之后占用的空间,第二列在创建表时候,选择的压缩算法是“bytedict”,是相对高效节省空间的算法。压缩比大约8:1。
3.4 分析某一个数据表适合的压缩方式
执行如下命令对前文的lineorder表进行分析,显示各列最适合的压缩方式。注意,分析对象不是临时表encodingshipmode,而是实验开始时候加载了6千万条数据的原始表。
analyze compression customer;
analyze compression dwdate;
analyze compression lineorder;
analyze compression part;
analyze compression supplier;
返回结果包含五张表各自的分析结果,以lineorder表为例,压缩建议如下。
Table | Column | Encoding | Est_reduction_pct
----------+--------------------+----------+------------------
lineorder | lo_commitdate | az64 | 46.88
lineorder | lo_custkey | az64 | 28.13
lineorder | lo_discount | az64 | 84.38
lineorder | lo_extendedprice | zstd | 26.96
lineorder | lo_linenumber | az64 | 87.50
lineorder | lo_orderdate | az64 | 46.88
lineorder | lo_orderkey | zstd | 11.37
lineorder | lo_orderpriority | bytedict | 91.38
lineorder | lo_ordertotalprice | zstd | 22.61
lineorder | lo_partkey | az64 | 34.38
lineorder | lo_quantity | az64 | 78.12
lineorder | lo_revenue | zstd | 26.69
lineorder | lo_shipmode | bytedict | 87.93
lineorder | lo_shippriority | zstd | 99.96
lineorder | lo_suppkey | az64 | 34.38
lineorder | lo_supplycost | az64 | 45.98
lineorder | lo_tax | az64 | 84.38
以上结果可以看出,上文测试的lo_shipmode列,系统推荐的压缩方式也是bytedict,与实际测试效果最好的相符。此外,需要注意,表dwdate因为数据量不足,系统无法给出推荐压缩。
关于更多设计列的压缩方式,请参考这篇文档。
3.5 查看一个已经存在的表使用的压缩方式
执行如下命令。
select *
from pg_table_def where tablename = 'encodingshipmode';
返回结果如下,这是查询临时表encodingshipmode每一列配置的压缩属性。
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull
-----------+------------------+---------------+-----------------------+-----------+---------+---------+--------
public | encodingshipmode | modetext32k | character varying(22) | text32k | false | 0 | false
public | encodingshipmode | modetext255 | character varying(22) | text255 | false | 0 | false
public | encodingshipmode | moderunlength | character varying(22) | runlength | false | 0 | false
public | encodingshipmode | moderaw | character varying(22) | none | false | 0 | false
public | encodingshipmode | modelzo | character varying(22) | lzo | false | 0 | false
public | encodingshipmode | modebytedict | character varying(22) | bytedict | false | 0 | false
因此当遇到一些表数据量较少的时候,也可以用这种办法查询每列的类型,并手工选择对应的压缩算法。
四、优化表并重新测试
4.1 删库
这里将摧毁实验一的表,执行如下命令删库。其中增加参数cascade表示无视外键等关联,强制删库。
drop table part cascade;
drop table supplier cascade;
drop table customer cascade;
drop table dwdate cascade;
drop table lineorder cascade;
4.2 重新创建表Schema
创建时候,增加了sorting key排序键,且使用了压缩。
CREATE TABLE part (
p_partkey integer not null encode delta sortkey distkey,
p_name varchar(22) not null encode text255,
p_mfgr varchar(6) not null encode zstd,
p_category varchar(7) not null encode bytedict,
p_brand1 varchar(9) not null encode zstd,
p_color varchar(11) not null encode bytedict,
p_type varchar(25) not null encode bytedict,
p_size integer not null encode az64,
p_container varchar(10) not null encode bytedict
);
CREATE TABLE supplier (
s_suppkey integer not null encode az64 sortkey,
s_name varchar(25) not null encode zstd ,
s_address varchar(25) not null encode zstd,
s_city varchar(10) not null encode bytedict,
s_nation varchar(15) not null encode bytedict,
s_region varchar(12) not null encode bytedict,
s_phone varchar(15) not null encode zstd
) diststyle all;
CREATE TABLE customer (
c_custkey integer not null encode az64 sortkey,
c_name varchar(25) not null encode zstd,
c_address varchar(25) not null encode zstd,
c_city varchar(10) not null encode bytedict,
c_nation varchar(15) not null encode bytedict,
c_region varchar(12) not null encode bytedict,
c_phone varchar(15) not null encode zstd,
c_mktsegment varchar(10) not null encode bytedict
) diststyle all;
CREATE TABLE dwdate (
d_datekey integer not null sortkey,
d_date varchar(19) not null,
d_dayofweek varchar(10) not null,
d_month varchar(10) not null,
d_year integer not null,
d_yearmonthnum integer not null,
d_yearmonth varchar(8) not null,
d_daynuminweek integer not null,
d_daynuminmonth integer not null,
d_daynuminyear integer not null,
d_monthnuminyear integer not null,
d_weeknuminyear integer not null,
d_sellingseason varchar(13) not null,
d_lastdayinweekfl varchar(1) not null,
d_lastdayinmonthfl varchar(1) not null,
d_holidayfl varchar(1) not null,
d_weekdayfl varchar(1) not null
) diststyle all;
CREATE TABLE lineorder (
lo_orderkey integer not null encode zstd,
lo_linenumber integer not null encode az64,
lo_custkey integer not null encode az64,
lo_partkey integer not null encode az64 distkey,
lo_suppkey integer not null encode az64,
lo_orderdate integer not null encode az64 sortkey,
lo_orderpriority varchar(15) not null encode bytedict,
lo_shippriority varchar(1) not null encode zstd,
lo_quantity integer not null encode az64,
lo_extendedprice integer not null encode zstd,
lo_ordertotalprice integer not null encode zstd,
lo_discount integer not null encode az64,
lo_revenue integer not null encode zstd,
lo_supplycost integer not null encode az64,
lo_tax integer not null encode az64,
lo_commitdate integer not null encode az64,
lo_shipmode varchar(10) not null encode bytedict
);
4.3 以压缩方式加载数据
执行如下命令。
copy customer from 's3://redshift-demo-lxy/demo2/customer'
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift'
gzip region 'cn-northwest-1';
copy dwdate from 's3://redshift-demo-lxy/demo2/dwdate'
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift'
gzip region 'cn-northwest-1';
copy lineorder from 's3://redshift-demo-lxy/demo2/lineorder'
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift'
gzip region 'cn-northwest-1';
copy part from 's3://redshift-demo-lxy/demo2/part'
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift'
gzip region 'cn-northwest-1';
copy supplier from 's3://redshift-demo-lxy/demo2/supplier'
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift'
gzip region 'cn-northwest-1';
分别记录下五个加载过程各自的时间,然后记录下总加载时间,后续将进行对比。
4.4 占用空间对比
执行如下代码。
select stv_tbl_perm.name as table, count(*) as mb
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
and stv_blocklist.slice = stv_tbl_perm.slice
and stv_tbl_perm.name in ('lineorder','part','customer','dwdate','supplier')
group by stv_tbl_perm.name
order by 1 asc;
查询结果如下。
table. | MB
———-+——
customer | 764
dwdate | 160
lineorder | 25843
part | 140
supplier | 284
记录下结果用于对比。
4.5 查询速度对比
重复前文步骤2.5的查询。查询脚本如下。
-- Disable query cache
set enable_result_cache_for_session to off;
-- Query 1
-- Restrictions on only one dimension.
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, dwdate
where lo_orderdate = d_datekey
and d_year = 1997
and lo_discount between 1 and 3
and lo_quantity < 24;
-- Query 2
-- Restrictions on two dimensions
select sum(lo_revenue), d_year, p_brand1
from lineorder, dwdate, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;
-- Query 3
-- Drill down in time to just one month
select c_city, s_city, d_year, sum(lo_revenue) as revenue
from customer, lineorder, supplier, dwdate
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and (c_city='UNITED KI1' or
c_city='UNITED KI5')
and (s_city='UNITED KI1' or
s_city='UNITED KI5')
and d_yearmonth = 'Dec1997'
group by c_city, s_city, d_year
order by d_year asc, revenue desc;
记录下几个查询的时间,与前文未压缩的数据进行对比。
五、小结
经过一系列测试可以看出,数据量最大的lineorder有6亿条数据。经过一系列优化,压缩可以带来巨大的空间节约而不损失性能。
有关Redshift数据库设计的最佳时间,请参考这里。
https://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_Creating_tables.html
至此实验结束,请清除资源。