AWS Redshift - Creating a Working Data Warehouse in AWS Redshift (7 Steps)
AWS Redshift cluster - Star Schema Benchmark
AWS Redshift has multiple advantages for OLAP analytics, so we have decided to create a small data warehouse replica in this tutorial.
The complete video with the explanation is given here.
The DWH architecture we are going to create is as follows; it can be understood as a star schema data warehouse with a single fact table.
Following are the steps anyone can use to create the star schema based data warehouse in AWS Redshift. We are going to utilize multiple AWS services to prepare this mini data warehouse.
Step 1. IAM role - Create a role for the AWS Redshift service with an S3 access policy attached. The role referenced in the COPY commands below is named s3redshiftreadonly, so read-only access to S3 is sufficient.
Step 2. S3 bucket - Create an S3 bucket and copy the files from the zip location below to your S3 location. Click here: S3 Data to Copy From. You will find data for the customer, part, and dwdate tables. Unzip the files and copy the data from this location to your S3 bucket, which must reside in the same AWS region where you will create the AWS Redshift cluster.
We will copy the supplier and lineorder table data from other AWS S3 buckets which were created by AWS for learning purposes (these buckets reside in other AWS regions).
Step 3. Redshift cluster - Create an AWS Redshift cluster, make it publicly accessible, and make sure the security group linked to it has an inbound rule which allows traffic from the 0.0.0.0/0 IP range. See the video above.
Step 4. SQL Workbench/J (use SQL Workbench/J, not MySQL Workbench) - We need a SQL client to connect, create the tables, and copy data into them. We could use the query editor supplied by the AWS Redshift console as well, but to automate we would need to connect to AWS Redshift via some API (Java or Python), so we need JDBC or ODBC connectivity anyway. For connectivity we will need the Redshift JDBC driver, the JDBC endpoint of the AWS Redshift cluster (of the form jdbc:redshift://<cluster-endpoint>:5439/<database>), and a username and password. See the YouTube video for the visual explanation.
The table creation scripts are given below. Run them in SQL Workbench/J:
CREATE TABLE part
(
  p_partkey     INTEGER NOT NULL,
  p_name        VARCHAR(22) NOT NULL,
  p_mfgr        VARCHAR(6),
  p_category    VARCHAR(7) NOT NULL,
  p_brand1      VARCHAR(9) NOT NULL,
  p_color       VARCHAR(11) NOT NULL,
  p_type        VARCHAR(25) NOT NULL,
  p_size        INTEGER NOT NULL,
  p_container   VARCHAR(10) NOT NULL
);

CREATE TABLE supplier
(
  s_suppkey     INTEGER NOT NULL,
  s_name        VARCHAR(25) NOT NULL,
  s_address     VARCHAR(25) NOT NULL,
  s_city        VARCHAR(10) NOT NULL,
  s_nation      VARCHAR(15) NOT NULL,
  s_region      VARCHAR(12) NOT NULL,
  s_phone       VARCHAR(15) NOT NULL
);

CREATE TABLE customer
(
  c_custkey     INTEGER NOT NULL,
  c_name        VARCHAR(25) NOT NULL,
  c_address     VARCHAR(25) NOT NULL,
  c_city        VARCHAR(10) NOT NULL,
  c_nation      VARCHAR(15) NOT NULL,
  c_region      VARCHAR(12) NOT NULL,
  c_phone       VARCHAR(15) NOT NULL,
  c_mktsegment  VARCHAR(10) NOT NULL
);

CREATE TABLE dwdate
(
  d_datekey          INTEGER NOT NULL,
  d_date             VARCHAR(19) NOT NULL,
  d_dayofweek        VARCHAR(10) NOT NULL,
  d_month            VARCHAR(10) NOT NULL,
  d_year             INTEGER NOT NULL,
  d_yearmonthnum     INTEGER NOT NULL,
  d_yearmonth        VARCHAR(8) NOT NULL,
  d_daynuminweek     INTEGER NOT NULL,
  d_daynuminmonth    INTEGER NOT NULL,
  d_daynuminyear     INTEGER NOT NULL,
  d_monthnuminyear   INTEGER NOT NULL,
  d_weeknuminyear    INTEGER NOT NULL,
  d_sellingseason    VARCHAR(13) NOT NULL,
  d_lastdayinweekfl  VARCHAR(1) NOT NULL,
  d_lastdayinmonthfl VARCHAR(1) NOT NULL,
  d_holidayfl        VARCHAR(1) NOT NULL,
  d_weekdayfl        VARCHAR(1) NOT NULL
);

CREATE TABLE lineorder
(
  lo_orderkey        INTEGER NOT NULL,
  lo_linenumber      INTEGER NOT NULL,
  lo_custkey         INTEGER NOT NULL,
  lo_partkey         INTEGER NOT NULL,
  lo_suppkey         INTEGER NOT NULL,
  lo_orderdate       INTEGER NOT NULL,
  lo_orderpriority   VARCHAR(15) NOT NULL,
  lo_shippriority    VARCHAR(1) NOT NULL,
  lo_quantity        INTEGER NOT NULL,
  lo_extendedprice   INTEGER NOT NULL,
  lo_ordertotalprice INTEGER NOT NULL,
  lo_discount        INTEGER NOT NULL,
  lo_revenue         INTEGER NOT NULL,
  lo_supplycost      INTEGER NOT NULL,
  lo_tax             INTEGER NOT NULL,
  lo_commitdate      INTEGER NOT NULL,
  lo_shipmode        VARCHAR(10) NOT NULL
);
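Once the statements have run, you can quickly confirm that the five tables exist by querying the catalog. A minimal check (note that pg_table_def only lists tables in schemas on your search_path, typically public):

-- List the tables created in the public schema
select distinct tablename
from pg_table_def
where schemaname = 'public';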
Step 5. Load the data - Run the COPY commands one by one in SQL Workbench/J as below. First, copy the data from your own S3 bucket to the Redshift tables:
1. Copy part table
copy part from 's3://myredshiftbucket111/load/part-csv.tbl'
credentials 'aws_iam_role=arn:aws:iam::598245567716:role/s3redshiftreadonly'
csv;
2. Copy dwdate table
copy dwdate from 's3://myredshiftbucket111/load/dwdate-tab.tbl'
credentials 'aws_iam_role=arn:aws:iam::598245567716:role/s3redshiftreadonly'
delimiter '\t'
dateformat 'auto';
3. Copy customer table
copy customer from 's3://myredshiftbucket111/load/customer-fw.tbl'
credentials 'aws_iam_role=arn:aws:iam::598245567716:role/s3redshiftreadonly'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region:12, c_phone:15, c_mktsegment:10'
ACCEPTINVCHARS
maxerror 10;
Now copy the two tables whose data is placed in AWS sample S3 buckets in other AWS regions:
4. Copy supplier table
copy supplier from 's3://awssampledbuswest2/ssbgz/supplier.tbl'
credentials 'aws_iam_role=arn:aws:iam::598245567716:role/s3redshiftreadonly'
delimiter '|'
gzip
region 'us-west-2';
5. Copy lineorder table
copy lineorder from 's3://awssampledb/load/lo/lineorder-multi.tbl'
credentials 'aws_iam_role=arn:aws:iam::598245567716:role/s3redshiftreadonly'
gzip
compupdate off
region 'us-east-1';
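With all five COPY commands run, it is worth sanity-checking that each table actually received rows. A quick check:

-- Row counts per table; a zero here means the corresponding COPY failed
select count(*) from part;
select count(*) from supplier;
select count(*) from customer;
select count(*) from dwdate;
select count(*) from lineorder;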
After running the counts above and confirming the loads, you can run different SQL analytics queries against the star schema; one example is sketched below.
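Here is a minimal analytics query in the spirit of the Star Schema Benchmark, joining the lineorder fact table to the dwdate dimension (the year, discount, and quantity filters are illustrative values, not from the original tutorial):

-- Revenue from discounted, low-quantity orders in one year
select sum(lo_extendedprice * lo_discount) as revenue
from lineorder
join dwdate on lo_orderdate = d_datekey
where d_year = 1997
  and lo_discount between 1 and 3
  and lo_quantity < 24;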
You will find the STL_LOAD_ERRORS system table useful for seeing errors raised while copying data into Redshift tables, as in the query below.
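A sketch of how to inspect the most recent load errors (the selected columns are a common subset of STL_LOAD_ERRORS):

-- Show the latest load errors: which file, line, and column failed, and why
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;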
Step 6. Vacuum and analyze the database - Run the commands below to reclaim space, re-sort the rows, and gather the table statistics in the AWS Redshift data warehouse tables.
vacuum;
analyze;
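If vacuuming the whole database takes too long, both commands also accept a single table name; for example, for the largest table here:

-- Re-sort rows and reclaim space in lineorder, then refresh its statistics
vacuum lineorder;
analyze lineorder;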
Step 7. Clean up your resources - If you are using the Free Tier, delete all the AWS resources created above, else you may incur unwanted charges. Delete the resources created in all four of the AWS services used above.
Please visit my YouTube channel, datapunditeducation, for other AWS, Ab Initio, and Data Engineering tutorials.
Thanks,
datapunditeducation