AWS Redshift - Creating a Working Data Warehouse in AWS Redshift (4 Major Steps)

                      AWS Redshift cluster - Star Schema Benchmark 

AWS Redshift has multiple advantages for OLAP analytics, so in this tutorial we will create a small data warehouse replica.

    The complete video explanation is given here.

    



    The DWH architecture we are going to create is shown below; it can be understood as a star-schema-based, single-fact-table data warehouse.

[Architecture diagram: the lineorder fact table surrounded by the part, supplier, customer, and dwdate dimension tables.]

Following are the steps to create the Star Schema based data warehouse in AWS Redshift; we are going to use multiple AWS services to prepare this mini data warehouse.

 

Step 1. IAM role - create a role with AWS Redshift as the trusted service and attach S3 full-access permissions to it.

Step 2. S3 bucket - create an S3 bucket and copy the files from the zip location below to your S3 bucket.

 Click here: S3 Data to Copy From

  There you will find data for the customer, part, and dwdate tables.

      Unzip and copy the data to an S3 bucket residing in the same AWS region where you will create the AWS Redshift cluster.

We will copy the supplier and lineorder table data from sample S3 buckets created by AWS for learning purposes (these reside in other AWS regions).

 

Step 3. Redshift cluster DB - create the AWS Redshift cluster and make it publicly accessible. Also make sure the security group linked to it has an inbound rule that allows traffic from the 0.0.0.0/0 IP range. See the video above.

 

Step 4. SQL Workbench/J (use SQL Workbench/J, not MySQL Workbench)

                SQL Workbench/J

      We need a SQL client to connect to the cluster, create the tables, and copy data into them. We could use the query editor supplied in the AWS Redshift dashboard as well, but to automate we would need to connect to Redshift through an API (Java or Python), so we need JDBC or ODBC connectivity anyway. For connectivity we will need the JDBC driver, the JDBC endpoint of the AWS Redshift cluster, and the user name and password.
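      The JDBC URL has the form jdbc:redshift://<cluster-endpoint>:<port>/<database>. For example (a hypothetical endpoint for illustration; copy the real value from the cluster's properties page in the Redshift console):

      jdbc:redshift://examplecluster.abc123xyz789.us-east-1.redshift.amazonaws.com:5439/dev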

         See the YouTube video for a visual explanation.

        The table creation scripts are as follows.

        Run the table creation statements below in SQL Workbench/J:

CREATE TABLE part
(
  p_partkey     INTEGER NOT NULL,
  p_name        VARCHAR(22) NOT NULL,
  p_mfgr        VARCHAR(6),
  p_category    VARCHAR(7) NOT NULL,
  p_brand1      VARCHAR(9) NOT NULL,
  p_color       VARCHAR(11) NOT NULL,
  p_type        VARCHAR(25) NOT NULL,
  p_size        INTEGER NOT NULL,
  p_container   VARCHAR(10) NOT NULL
);

CREATE TABLE supplier
(
  s_suppkey   INTEGER NOT NULL,
  s_name      VARCHAR(25) NOT NULL,
  s_address   VARCHAR(25) NOT NULL,
  s_city      VARCHAR(10) NOT NULL,
  s_nation    VARCHAR(15) NOT NULL,
  s_region    VARCHAR(12) NOT NULL,
  s_phone     VARCHAR(15) NOT NULL
);

CREATE TABLE customer
(
  c_custkey      INTEGER NOT NULL,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL
);

CREATE TABLE dwdate
(
  d_datekey            INTEGER NOT NULL,
  d_date               VARCHAR(19) NOT NULL,
  d_dayofweek          VARCHAR(10) NOT NULL,
  d_month              VARCHAR(10) NOT NULL,
  d_year               INTEGER NOT NULL,
  d_yearmonthnum       INTEGER NOT NULL,
  d_yearmonth          VARCHAR(8) NOT NULL,
  d_daynuminweek       INTEGER NOT NULL,
  d_daynuminmonth      INTEGER NOT NULL,
  d_daynuminyear       INTEGER NOT NULL,
  d_monthnuminyear     INTEGER NOT NULL,
  d_weeknuminyear      INTEGER NOT NULL,
  d_sellingseason      VARCHAR(13) NOT NULL,
  d_lastdayinweekfl    VARCHAR(1) NOT NULL,
  d_lastdayinmonthfl   VARCHAR(1) NOT NULL,
  d_holidayfl          VARCHAR(1) NOT NULL,
  d_weekdayfl          VARCHAR(1) NOT NULL
);

CREATE TABLE lineorder
(
  lo_orderkey          INTEGER NOT NULL,
  lo_linenumber        INTEGER NOT NULL,
  lo_custkey           INTEGER NOT NULL,
  lo_partkey           INTEGER NOT NULL,
  lo_suppkey           INTEGER NOT NULL,
  lo_orderdate         INTEGER NOT NULL,
  lo_orderpriority     VARCHAR(15) NOT NULL,
  lo_shippriority      VARCHAR(1) NOT NULL,
  lo_quantity          INTEGER NOT NULL,
  lo_extendedprice     INTEGER NOT NULL,
  lo_ordertotalprice   INTEGER NOT NULL,
  lo_discount          INTEGER NOT NULL,
  lo_revenue           INTEGER NOT NULL,
  lo_supplycost        INTEGER NOT NULL,
  lo_tax               INTEGER NOT NULL,
  lo_commitdate        INTEGER NOT NULL,
  lo_shipmode          VARCHAR(10) NOT NULL
);

 

Step 5. Load the data - run the copy commands one by one in SQL Workbench/J as below.

Copy the data from your own S3 bucket to the Redshift tables:

1. Copy the part table

copy part from 's3://myredshiftbucket111/load/part-csv.tbl'
credentials 'aws_iam_role=arn:aws:iam::598245567716:role/s3redshiftreadonly'
csv;

 

2. Copy the dwdate table

copy dwdate from 's3://myredshiftbucket111/load/dwdate-tab.tbl'
credentials 'aws_iam_role=arn:aws:iam::598245567716:role/s3redshiftreadonly'
delimiter '\t'
dateformat 'auto';

 

3. Copy the customer table

copy customer from 's3://myredshiftbucket111/load/customer-fw.tbl'
credentials 'aws_iam_role=arn:aws:iam::598245567716:role/s3redshiftreadonly'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12, c_phone:15,c_mktsegment:10'
ACCEPTINVCHARS
maxerror 10;

 

Now copy the two tables whose data is placed in AWS sample S3 buckets in other AWS regions:

 

4. Copy the supplier table

copy supplier from 's3://awssampledbuswest2/ssbgz/supplier.tbl'
credentials 'aws_iam_role=arn:aws:iam::598245567716:role/s3redshiftreadonly'
delimiter '|'
gzip
region 'us-west-2';

 

 

5. Copy the lineorder table

copy lineorder from 's3://awssampledb/load/lo/lineorder-multi.tbl'
credentials 'aws_iam_role=arn:aws:iam::598245567716:role/s3redshiftreadonly'
gzip
compupdate off
region 'us-east-1';
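Once all five copy commands finish, a quick row-count check (a minimal sketch; the table names match the DDL above) confirms that every table received data:

select 'part' as tbl, count(*) as row_count from part
union all select 'supplier', count(*) from supplier
union all select 'customer', count(*) from customer
union all select 'dwdate', count(*) from dwdate
union all select 'lineorder', count(*) from lineorder;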

 

 

After copying, you can see that all the data has been loaded, and you can run different SQL analytics queries against it.
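For example, a query in the spirit of the first Star Schema Benchmark query (a sketch; the 1997 filter assumes the sample data covers that year, which the SSB data set does) sums the revenue from discounted orders:

select sum(lo_extendedprice * lo_discount) as revenue
from lineorder
join dwdate on lo_orderdate = d_datekey
where d_year = 1997
  and lo_discount between 1 and 3
  and lo_quantity < 24;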

 

 

You will find the STL_LOAD_ERRORS system table useful for seeing errors encountered while copying data into the Redshift tables.
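For instance, the following query (a small sketch using standard STL_LOAD_ERRORS columns) lists the most recent load errors first:

select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;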

 

Step 6. Vacuum and analyze the database

 

Run the commands below to reclaim space, re-sort rows, and gather table statistics for the AWS Redshift data warehouse tables.

 

vacuum;

analyze;
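With no table name, these commands run against every table in the database. If that takes too long, you can target the largest table explicitly (a sketch using the lineorder table from this tutorial):

vacuum full lineorder to 100 percent;
analyze lineorder;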

 

Step 7. Clean up your resources

 If you are using the Free Tier, delete all the AWS resources you created, or you may incur unwanted charges. Remove the resources from all four AWS services used above.

 

Please visit my YouTube channel datapundit below for other AWS, Abinitio, and Data Engineering tutorials.

    datapundit

Thanks

datapunditeducation

 

 

 

 

 

 
