Redshift动手实验(二)优化表结构

1、创建Redshift集群

实验一创建的集群是2节点的dc2.large,这里调整为4节点dc2.large。

调整集群过程可以使用Resize功能,在提示选项时候,选择使用Elastic Resize,将不需要重新创建新集群,而是直接添加节点,这样的方式比Classic Resize要快很多。

调整后等待一段时间生效开始测试。

本实验推荐使用SQLWorkbenchJ等客户端进行,便于查看运行时间。在配置连接时候,请打开auto-commit,以确保实验中每个sql都自动commit而无需手工执行commit。

2、未进行优化场景下的测试

2.1 创建未压缩的表结构

生效之后,执行如下命令创建表结构。

CREATE TABLE part 
(
  p_partkey     INTEGER      NOT NULL ENCODE RAW,
  p_name        VARCHAR(22)  NOT NULL ENCODE RAW,
  p_mfgr        VARCHAR(6)   NOT NULL ENCODE RAW,
  p_category    VARCHAR(7)   NOT NULL ENCODE RAW,
  p_brand1      VARCHAR(9)   NOT NULL ENCODE RAW,
  p_color       VARCHAR(11)  NOT NULL ENCODE RAW,
  p_type        VARCHAR(25)  NOT NULL ENCODE RAW,
  p_size        INTEGER      NOT NULL ENCODE RAW,
  p_container   VARCHAR(10)  NOT NULL ENCODE RAW
);

CREATE TABLE supplier 
(
  s_suppkey   INTEGER     NOT NULL ENCODE RAW,
  s_name      VARCHAR(25) NOT NULL ENCODE RAW,
  s_address   VARCHAR(25) NOT NULL ENCODE RAW,
  s_city      VARCHAR(10) NOT NULL ENCODE RAW,
  s_nation    VARCHAR(15) NOT NULL ENCODE RAW,
  s_region    VARCHAR(12) NOT NULL ENCODE RAW,
  s_phone     VARCHAR(15) NOT NULL ENCODE RAW
);

CREATE TABLE customer 
(
  c_custkey      INTEGER     NOT NULL ENCODE RAW,
  c_name         VARCHAR(25) NOT NULL ENCODE RAW,
  c_address      VARCHAR(25) NOT NULL ENCODE RAW,
  c_city         VARCHAR(10) NOT NULL ENCODE RAW,
  c_nation       VARCHAR(15) NOT NULL ENCODE RAW,
  c_region       VARCHAR(12) NOT NULL ENCODE RAW,
  c_phone        VARCHAR(15) NOT NULL ENCODE RAW,
  c_mktsegment   VARCHAR(10) NOT NULL ENCODE RAW
);

CREATE TABLE dwdate 
(
  d_datekey            INTEGER     NOT NULL ENCODE RAW,
  d_date               VARCHAR(19) NOT NULL ENCODE RAW,
  d_dayofweek          VARCHAR(10) NOT NULL ENCODE RAW,
  d_month              VARCHAR(10) NOT NULL ENCODE RAW,
  d_year               INTEGER     NOT NULL ENCODE RAW,
  d_yearmonthnum       INTEGER     NOT NULL ENCODE RAW,
  d_yearmonth          VARCHAR(8)  NOT NULL ENCODE RAW,
  d_daynuminweek       INTEGER     NOT NULL ENCODE RAW,
  d_daynuminmonth      INTEGER     NOT NULL ENCODE RAW,
  d_daynuminyear       INTEGER     NOT NULL ENCODE RAW,
  d_monthnuminyear     INTEGER     NOT NULL ENCODE RAW,
  d_weeknuminyear      INTEGER     NOT NULL ENCODE RAW,
  d_sellingseason      VARCHAR(13) NOT NULL ENCODE RAW,
  d_lastdayinweekfl    VARCHAR(1)  NOT NULL ENCODE RAW,
  d_lastdayinmonthfl   VARCHAR(1)  NOT NULL ENCODE RAW,
  d_holidayfl          VARCHAR(1)  NOT NULL ENCODE RAW,
  d_weekdayfl          VARCHAR(1)  NOT NULL ENCODE RAW
);

CREATE TABLE lineorder 
(
  lo_orderkey          INTEGER       NOT NULL ENCODE RAW,
  lo_linenumber        INTEGER       NOT NULL ENCODE RAW,
  lo_custkey           INTEGER       NOT NULL ENCODE RAW,
  lo_partkey           INTEGER       NOT NULL ENCODE RAW,
  lo_suppkey           INTEGER       NOT NULL ENCODE RAW,
  lo_orderdate         INTEGER       NOT NULL ENCODE RAW,
  lo_orderpriority     VARCHAR(15)   NOT NULL ENCODE RAW,
  lo_shippriority      VARCHAR(1)    NOT NULL ENCODE RAW,
  lo_quantity          INTEGER       NOT NULL ENCODE RAW,
  lo_extendedprice     INTEGER       NOT NULL ENCODE RAW,
  lo_ordertotalprice   INTEGER       NOT NULL ENCODE RAW,
  lo_discount          INTEGER       NOT NULL ENCODE RAW,
  lo_revenue           INTEGER       NOT NULL ENCODE RAW,
  lo_supplycost        INTEGER       NOT NULL ENCODE RAW,
  lo_tax               INTEGER       NOT NULL ENCODE RAW,
  lo_commitdate        INTEGER       NOT NULL ENCODE RAW,
  lo_shipmode          VARCHAR(10)   NOT NULL ENCODE RAW
);

2.2 以不压缩方式加载数据

执行以下命令导入数据。五个表可以分别导入,也可以一次整体导入。其中compupdate off命令是强制禁用压缩。

导入时候请记录每条SQL执行的时间。可以分别记录,也可以一次运行整体记录下来。记录下来的时间将与未来优化后的加载时间进行对比。

copy customer from 's3://redshift-demo-lxy/demo2/customer' 
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift' 
gzip compupdate off region 'cn-northwest-1';

copy dwdate from 's3://redshift-demo-lxy/demo2/dwdate' 
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift' 
gzip compupdate off region 'cn-northwest-1';

copy lineorder from 's3://redshift-demo-lxy/demo2/lineorder' 
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift' 
gzip compupdate off region 'cn-northwest-1';

copy part from 's3://redshift-demo-lxy/demo2/part' 
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift' 
gzip compupdate off region 'cn-northwest-1';

copy supplier from 's3://redshift-demo-lxy/demo2/supplier' 
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift' 
gzip compupdate off region 'cn-northwest-1';

此时可以通过SQLWorkbenchJ等GUI工具一次性执行所有命令,并分别记录下五个语句过程各自的时间,然后记录下总加载时间,后续将进行对比。

验证数据加载是否完整,执行如下命令。

select count(*) from LINEORDER;
select count(*) from PART;
select count(*) from CUSTOMER;
select count(*) from SUPPLIER;
select count(*) from DWDATE;

返回结果与如下数据对比,一致则表示数据加载正确。

表名行数
LINEORDER600037902
PART1400000
CUSTOMER3000000
SUPPLIER1000000
DWDATE2556

记录下这个结果,后续将进行对比。

2.3 查询占用的空间

执行如下命令。

select stv_tbl_perm.name as table, count(*) as mb
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
and stv_blocklist.slice = stv_tbl_perm.slice
and stv_tbl_perm.name in ('lineorder','part','customer','dwdate','supplier')
group by stv_tbl_perm.name
order by 1 asc;

记录下占用空间数字,后续将进行对比。

2.4 检查数据的切片和分布

查询切片数。

SELECT count(slice) from stv_slices;

查询结果为8。由于本次实验采用的dc2.large规格,一个节点有2个切片,4个节点共计8个切片。

执行如下检查数据的分布命令。

select trim(name) as table, slice, sum(num_values) as rows, min(minvalue), max(maxvalue)
from svv_diskusage
where name in ('customer', 'part', 'supplier', 'dwdate', 'lineorder') 
and col =0
group by name, slice
order by name, slice;

返回结果可以看到数据是分布在各Slide上的。

2.5 查询测试

首先Redshift在第一次执行SQL时候需要编译,因此执行时间很长,第二次执行就会非常快。但是这里需要考虑缓存因素,如果执行相同的查询,会从缓存返回结果。所以后续测试:

  • 以第二次或者多次查询的结果为准,忽略第一次查询
  • 禁用查询缓存

运行如下三个query,分别记录测试时间。注意第一次运行时候需要变异,时间较长,先不要记录时间,第二次运行即可获得正确的运行时间。也可以反复运行个3-5次后,取最后一次的数值。由于关掉了缓存,每次运行都应该是相差较小的查询耗时。

-- Disable query cache
set enable_result_cache_for_session to off;

-- Query 1
-- Restrictions on only one dimension. 
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, dwdate
where lo_orderdate = d_datekey
and d_year = 1997 
and lo_discount between 1 and 3 
and lo_quantity < 24;

-- Query 2
-- Restrictions on two dimensions 
select sum(lo_revenue), d_year, p_brand1
from lineorder, dwdate, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;

-- Query 3
-- Drill down in time to just one month 
select c_city, s_city, d_year, sum(lo_revenue) as revenue 
from customer, lineorder, supplier, dwdate
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and (c_city='UNITED KI1' or
c_city='UNITED KI5')
and (s_city='UNITED KI1' or
s_city='UNITED KI5')
and d_yearmonth = 'Dec1997'
group by c_city, s_city, d_year
order by d_year asc, revenue desc;

记录下三个查询个字的时间,后续将进行对比。

三、了解不同的压缩算法

3.1 创建一个临时测试表并指定不同压缩算法

执行如下命令,新建一张表叫做encodingshipmode,其中各列分别使用raw、bytedict、lzo、runlength、text255、text32k压缩算法。

create table encodingshipmode (
  moderaw         varchar(22) encode raw,
  modebytedict    varchar(22) encode bytedict,
  modelzo         varchar(22) encode lzo,
  moderunlength   varchar(22) encode runlength,
  modetext255     varchar(22) encode text255,
  modetext32k     varchar(22) encode text32k
);

3.2 插入测试数据

执行如下命令,生成测试数据。注意,请在保持上述数据基础上用lineorder表作为数据源。原始lineorder表有6千万行,本次插入数据测试,只插入其中的20000000行。

insert into encodingshipmode
  select 
    lo_shipmode as moderaw, 
    lo_shipmode as modebytedict, 
    lo_shipmode as modelzo,
    lo_shipmode as moderunlength,
    lo_shipmode as modetext255,
    lo_shipmode as modetext32k
  from lineorder 
  where lo_orderkey < 200000000;

执行过程需要数分钟。执行如下命令确认数据加载完成。

select count(*) from encodingshipmode;

表行数应为 200000262 行。表示数据加载成功。

3.3 查看不同压缩比情况下的空间占用

执行如下命令显示每一列占用的空间。通过查找 STV_BLOCKLIST 系统视图,显示各列占用空间如下。

select col, max(blocknum)
from stv_blocklist b, stv_tbl_perm p
where (b.tbl=p.id) and name = 'encodingshipmode'
and col < 6
group by name, col
order by col;

输出结果如下。

col | max
----+----
  0 | 442
  1 |  53
  2 | 122
  3 | 385
  4 | 108
  5 | 210

由此可以看到,不同的列采用的压缩算法不一样,占用的的空间也不一样。其中0列是原始数据442MB,第1到5列是不同个压缩算法之后占用的空间,第二列在创建表时候,选择的压缩算法是“bytedict”,是相对高效节省空间的算法。压缩比大约8:1。

3.4 分析某一个数据表适合的压缩方式

执行如下命令对前文的lineorder表进行分析,显示各列最适合的压缩方式。注意,分析对象不是临时表encodingshipmode,而是实验开始时候加载了6千万条数据的原始表。

analyze compression customer;
analyze compression dwdate;
analyze compression lineorder;
analyze compression part;
analyze compression supplier;

返回结果包含五张表各自的分析结果,以lineorder表为例,压缩建议如下。

Table     | Column             | Encoding | Est_reduction_pct
----------+--------------------+----------+------------------
lineorder | lo_commitdate      | az64     | 46.88            
lineorder | lo_custkey         | az64     | 28.13            
lineorder | lo_discount        | az64     | 84.38            
lineorder | lo_extendedprice   | zstd     | 26.96            
lineorder | lo_linenumber      | az64     | 87.50            
lineorder | lo_orderdate       | az64     | 46.88            
lineorder | lo_orderkey        | zstd     | 11.37            
lineorder | lo_orderpriority   | bytedict | 91.38            
lineorder | lo_ordertotalprice | zstd     | 22.61            
lineorder | lo_partkey         | az64     | 34.38            
lineorder | lo_quantity        | az64     | 78.12            
lineorder | lo_revenue         | zstd     | 26.69            
lineorder | lo_shipmode        | bytedict | 87.93            
lineorder | lo_shippriority    | zstd     | 99.96            
lineorder | lo_suppkey         | az64     | 34.38            
lineorder | lo_supplycost      | az64     | 45.98            
lineorder | lo_tax             | az64     | 84.38            

以上结果可以看出,上文测试的lo_shipmode列,系统推荐的压缩方式也是bytedict,与实际测试效果最好的相符。此外,需要注意,表dwdate因为数据量不足,系统无法给出推荐压缩。

关于更多设计列的压缩方式,请参考这篇文档。

3.5 查看一个已经存在的表使用的压缩方式

执行如下命令。

select * 
from pg_table_def where tablename = 'encodingshipmode';

返回结果如下,这是查询临时表encodingshipmode每一列配置的压缩属性。

schemaname | tablename        | column        | type                  | encoding  | distkey | sortkey | notnull
-----------+------------------+---------------+-----------------------+-----------+---------+---------+--------
public     | encodingshipmode | modetext32k   | character varying(22) | text32k   | false   |       0 | false  
public     | encodingshipmode | modetext255   | character varying(22) | text255   | false   |       0 | false  
public     | encodingshipmode | moderunlength | character varying(22) | runlength | false   |       0 | false  
public     | encodingshipmode | moderaw       | character varying(22) | none      | false   |       0 | false  
public     | encodingshipmode | modelzo       | character varying(22) | lzo       | false   |       0 | false  
public     | encodingshipmode | modebytedict  | character varying(22) | bytedict  | false   |       0 | false  

因此当遇到一些表数据量较少的时候,也可以用这种办法查询每列的类型,并手工选择对应的压缩算法。

四、优化表并重新测试

4.1 删库

这里将摧毁实验一的表,执行如下命令删库。其中增加参数cascade表示无视外键等关联,强制删库。

drop table part cascade;
drop table supplier cascade;
drop table customer cascade;
drop table dwdate cascade;
drop table lineorder cascade;

4.2 重新创建表Schema

创建时候,增加了sorting key排序键,且使用了压缩。

CREATE TABLE part (
  p_partkey         integer         not null encode delta sortkey distkey,
  p_name            varchar(22)     not null encode text255,
  p_mfgr            varchar(6)      not null encode zstd,
  p_category        varchar(7)      not null encode bytedict,
  p_brand1          varchar(9)      not null encode zstd,
  p_color           varchar(11)     not null encode bytedict,
  p_type            varchar(25)     not null encode bytedict,
  p_size            integer         not null encode az64,
  p_container       varchar(10)     not null encode bytedict
);

CREATE TABLE supplier (
  s_suppkey         integer        not null encode az64 sortkey,
  s_name            varchar(25)    not null encode zstd ,
  s_address         varchar(25)    not null encode zstd,
  s_city            varchar(10)    not null encode bytedict,
  s_nation          varchar(15)    not null encode bytedict,
  s_region          varchar(12)    not null encode bytedict,
  s_phone           varchar(15)    not null encode zstd
) diststyle all;

CREATE TABLE customer (
  c_custkey         integer        not null encode az64 sortkey,
  c_name            varchar(25)    not null encode zstd,
  c_address         varchar(25)    not null encode zstd,
  c_city            varchar(10)    not null encode bytedict,
  c_nation          varchar(15)    not null encode bytedict,
  c_region          varchar(12)    not null encode bytedict,
  c_phone           varchar(15)    not null encode zstd,
  c_mktsegment      varchar(10)    not null encode bytedict
) diststyle all;

CREATE TABLE dwdate (
  d_datekey            integer       not null sortkey,
  d_date               varchar(19)   not null,
  d_dayofweek              varchar(10)   not null,
  d_month                 varchar(10)   not null,
  d_year               integer       not null,
  d_yearmonthnum       integer         not null,
  d_yearmonth          varchar(8)       not null,
  d_daynuminweek       integer       not null,
  d_daynuminmonth      integer       not null,
  d_daynuminyear       integer       not null,
  d_monthnuminyear     integer       not null,
  d_weeknuminyear      integer       not null,
  d_sellingseason      varchar(13)   not null,
  d_lastdayinweekfl    varchar(1)    not null,
  d_lastdayinmonthfl   varchar(1)    not null,
  d_holidayfl          varchar(1)    not null,
  d_weekdayfl          varchar(1)    not null
) diststyle all;

CREATE TABLE lineorder (
  lo_orderkey              integer         not null encode zstd,
  lo_linenumber            integer         not null encode az64,
  lo_custkey               integer         not null encode az64,
  lo_partkey               integer         not null encode az64 distkey,
  lo_suppkey               integer         not null encode az64,
  lo_orderdate             integer         not null encode az64 sortkey,
  lo_orderpriority         varchar(15)   not null encode bytedict,
  lo_shippriority          varchar(1)    not null encode zstd,
  lo_quantity              integer         not null encode az64,
  lo_extendedprice         integer         not null encode zstd,
  lo_ordertotalprice       integer         not null encode zstd,
  lo_discount              integer         not null encode az64,
  lo_revenue               integer         not null encode zstd,
  lo_supplycost            integer         not null encode az64,
  lo_tax                   integer         not null encode az64,
  lo_commitdate         integer       not null encode az64,
  lo_shipmode              varchar(10)   not null encode bytedict
);

4.3 以压缩方式加载数据

执行如下命令。

copy customer from 's3://redshift-demo-lxy/demo2/customer' 
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift' 
gzip region 'cn-northwest-1';

copy dwdate from 's3://redshift-demo-lxy/demo2/dwdate' 
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift' 
gzip region 'cn-northwest-1';

copy lineorder from 's3://redshift-demo-lxy/demo2/lineorder' 
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift' 
gzip region 'cn-northwest-1';

copy part from 's3://redshift-demo-lxy/demo2/part' 
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift' 
gzip region 'cn-northwest-1';

copy supplier from 's3://redshift-demo-lxy/demo2/supplier' 
credentials 'aws_iam_role=arn:aws-cn:iam::420029960748:role/lxy-redshift' 
gzip region 'cn-northwest-1';

分别记录下五个加载过程各自的时间,然后记录下总加载时间,后续将进行对比。

4.4 占用空间对比

执行如下代码。

select stv_tbl_perm.name as table, count(*) as mb
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
and stv_blocklist.slice = stv_tbl_perm.slice
and stv_tbl_perm.name in ('lineorder','part','customer','dwdate','supplier')
group by stv_tbl_perm.name
order by 1 asc;

查询结果如下。

table. | MB
———-+——
customer | 764
dwdate | 160
lineorder | 25843
part | 140
supplier | 284

记录下结果用于对比。

4.5 查询速度对比

重复前文步骤2.5的查询。查询脚本如下。

-- Disable query cache
set enable_result_cache_for_session to off;

-- Query 1
-- Restrictions on only one dimension. 
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, dwdate
where lo_orderdate = d_datekey
and d_year = 1997 
and lo_discount between 1 and 3 
and lo_quantity < 24;

-- Query 2
-- Restrictions on two dimensions 
select sum(lo_revenue), d_year, p_brand1
from lineorder, dwdate, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;

-- Query 3
-- Drill down in time to just one month 
select c_city, s_city, d_year, sum(lo_revenue) as revenue 
from customer, lineorder, supplier, dwdate
where lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and (c_city='UNITED KI1' or
c_city='UNITED KI5')
and (s_city='UNITED KI1' or
s_city='UNITED KI5')
and d_yearmonth = 'Dec1997'
group by c_city, s_city, d_year
order by d_year asc, revenue desc;

记录下几个查询的时间,与前文未压缩的数据进行对比。

五、小结

经过一系列测试可以看出,数据量最大的lineorder有6亿条数据。经过一系列优化,压缩可以带来巨大的空间节约而不损失性能。

有关Redshift数据库设计的最佳时间,请参考这里。

https://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_Creating_tables.html

至此实验结束,请清除资源。