Monday, December 3, 2018

Bulk Load CSV, JSON, and Avro Files to AWS Redshift

Summary - In this blog I will load the same data (split into multiple files) from CSV, Avro, and JSON format into an AWS Redshift table and compare the load timings across Redshift cluster sizes.

Details - I have citibike trips data available in CSV, JSON, and Avro format in AWS S3. I will load this data into a Redshift table without any configuration changes, capture the timings, and then see how the load time changes with different Redshift cluster configurations.
Load timings recorded - *(no other load was running on the cluster; for some cases I also captured timings for the COPY command with COMPUPDATE OFF and STATUPDATE OFF. A query sketch for capturing these timings follows the table.)


File Type | Total Files | Avg File Size (gzipped for JSON, CSV) | Avg Row Count per File | Total Rows Loaded | Time Taken (sec) | Time Taken with COMPUPDATE OFF, STATUPDATE OFF (sec) | Redshift Cluster Config
CSV  | 57 | 15.6MB | 584544 | 33319019 | 130  | 115 | dc2.large 2 nodes
CSV  | 57 | 15.6MB | 584544 | 33319019 | 71   | -   | dc2.large 4 nodes (2 elastic nodes added)
CSV  | 57 | 15.6MB | 584544 | 33319019 | 15   | -   | dc2.8xlarge 2 nodes
CSV  | 57 | 15.6MB | 584544 | 33319019 | 11   | -   | dc2.8xlarge 4 nodes (2 elastic nodes added)
JSON | 57 | 18.8MB | 584544 | 33319019 | 220  | 215 | dc2.large 2 nodes
JSON | 57 | 18.8MB | 584544 | 33319019 | 140  | -   | dc2.large 4 nodes (2 elastic nodes added)
JSON | 57 | 18.8MB | 584544 | 33319019 | 22.3 | -   | dc2.8xlarge 2 nodes
JSON | 57 | 18.8MB | 584544 | 33319019 | 20   | -   | dc2.8xlarge 4 nodes (2 elastic nodes added)
AVRO | 57 | 72.7MB | 584544 | 33319019 | 380  | 330 | dc2.large 2 nodes
AVRO | 57 | 72.7MB | 584544 | 33319019 | 230  | -   | dc2.large 4 nodes (2 elastic nodes added)
AVRO | 57 | 72.7MB | 584544 | 33319019 | 33   | -   | dc2.8xlarge 2 nodes
AVRO | 57 | 72.7MB | 584544 | 33319019 | 27   | -   | dc2.8xlarge 4 nodes (2 elastic nodes added)
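To capture timings like the ones above, one option (a sketch, not necessarily how the numbers in this table were produced) is to read the elapsed time of each COPY back from Redshift's STL_QUERY system table:

-- Sketch: elapsed time of recent COPY statements against the citibike tables.
-- STL_QUERY only keeps a few days of history; adjust the filter as needed.
select query,
       trim(querytxt) as copy_statement,
       starttime,
       endtime,
       datediff(seconds, starttime, endtime) as elapsed_seconds
from stl_query
where querytxt ilike '%copy citibike_trips%'
order by starttime desc
limit 20;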




Bulk Load CSV files

drop table if exists citibike_trips_csv;

CREATE TABLE citibike_trips_csv
(
    tripduration INTEGER,
    starttime DATETIME,
    stoptime DATETIME,
    start_station_id INTEGER,
    start_station_name VARCHAR(100),
    start_station_latitude DOUBLE PRECISION,
    start_station_longitude DOUBLE PRECISION,
    end_station_id INTEGER,
    end_station_name VARCHAR(100),
    end_station_latitude DOUBLE PRECISION,
    end_station_longitude DOUBLE PRECISION,
    bikeid INTEGER,
    usertype VARCHAR(50),
    birth_year VARCHAR(10),
    gender VARCHAR(10)
);

copy citibike_trips_csv
( tripduration, starttime, stoptime, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bikeid, usertype, birth_year, gender )
from 's3://agbucket02/nyc_citibike_trip_csv'
iam_role 'arn:aws:iam::yourroleID:role/yourrolename'
csv gzip IGNOREHEADER 1 IGNOREBLANKLINES;


(to skip statistics computation and automatic compression analysis, add STATUPDATE OFF COMPUPDATE OFF to the COPY command)

Avg size of each file - 15.6MB (compressed)
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019
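After each COPY it is worth confirming the row count and checking whether any rows were rejected. A minimal check (STL_LOAD_ERRORS is Redshift's standard system table for load failures):

select count(*) from citibike_trips_csv;

-- most recent load errors, if any
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;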



Bulk Load JSON files - We need a jsonpaths file, shown below, that maps the JSON fields to the table columns.

jsonpaths.txt

{
    "jsonpaths": [
        "$.tripduration",
        "$.starttime",
        "$.stoptime",
        "$.start_station_id",
        "$.start_station_name",
        "$.start_station_latitude",
        "$.start_station_longitude",
        "$.end_station_id",
        "$.end_station_name",
        "$.end_station_latitude",
        "$.end_station_longitude",
        "$.bikeid",
        "$.usertype",
        "$.birth_year",
        "$.gender"
    ]
}
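For reference, COPY with a jsonpaths file expects the source files to contain JSON objects whose fields match the paths above (typically one object per line, gzipped in this case). The record below is illustrative only; the values are made up, but the shape matches the jsonpaths file:

{"tripduration": 532, "starttime": "2018-06-01 07:12:45", "stoptime": "2018-06-01 07:21:37", "start_station_id": 3255, "start_station_name": "8 Ave & W 31 St", "start_station_latitude": 40.75059, "start_station_longitude": -73.99468, "end_station_id": 446, "end_station_name": "W 24 St & 7 Ave", "end_station_latitude": 40.74487, "end_station_longitude": -73.99529, "bikeid": 31956, "usertype": "Subscriber", "birth_year": "1984", "gender": "1"}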

drop table if exists citibike_trips_json;

CREATE TABLE citibike_trips_json
(
    tripduration INTEGER encode zstd,
    starttime DATETIME encode zstd,
    stoptime DATETIME encode zstd,
    start_station_id INTEGER encode zstd,
    start_station_name VARCHAR(100) encode zstd,
    start_station_latitude DOUBLE PRECISION encode zstd,
    start_station_longitude DOUBLE PRECISION encode zstd,
    end_station_id INTEGER encode zstd,
    end_station_name VARCHAR(100) encode zstd,
    end_station_latitude DOUBLE PRECISION encode zstd,
    end_station_longitude DOUBLE PRECISION encode zstd,
    bikeid INTEGER encode zstd,
    usertype VARCHAR(50) encode zstd,
    birth_year VARCHAR(10) encode zstd,
    gender VARCHAR(10) encode zstd
);
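This table pre-declares ZSTD column encodings, which is also why running the COPY with COMPUPDATE OFF is reasonable here. If you want Redshift's own encoding suggestions after a load, a quick check is:

-- suggest column encodings based on a sample of the loaded data
analyze compression citibike_trips_json;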

Load json

copy citibike_trips_json
from 's3://agbucket02/nyc_citybike_json/'
iam_role 'arn:aws:iam::yourroleid:role/yourrolename'
FORMAT AS JSON 's3://agbucket02/jsonpaths.txt' GZIP;

-- same load, skipping statistics and compression analysis
copy citibike_trips_json
from 's3://agbucket02/nyc_citybike_json/'
iam_role 'arn:aws:iam::yourroleid:role/yourrolename'
FORMAT AS JSON 's3://agbucket02/jsonpaths.txt' GZIP
STATUPDATE OFF COMPUPDATE OFF;


Avg size of each file - 18.8MB (compressed)

Avg rowcount in each file - 584544

Total rowcount loaded - 33319019

Bulk Load AVRO files - We will use the same jsonpaths file that we used for the JSON load.

CREATE TABLE citibike_trips_avro
 (
    tripduration INTEGER ,
    starttime BIGINT,
    stoptime BIGINT ,
    start_station_id INTEGER ,
    start_station_name VARCHAR(100),
    start_station_latitude DOUBLE PRECISION,
    start_station_longitude DOUBLE PRECISION,
    end_station_id INTEGER ,
    end_station_name VARCHAR(100),
    end_station_latitude DOUBLE PRECISION,
    end_station_longitude DOUBLE PRECISION,
    bikeid INTEGER,
    usertype VARCHAR(50),
    birth_year VARCHAR(10),
    gender VARCHAR(10)
);

truncate table citibike_trips_avro;
copy citibike_trips_avro
from 's3://agbucket02/nyc_citibike_trip_avro/'
iam_role 'arn:aws:iam::yourroleid:role/yourrolename'
FORMAT AS AVRO 's3://agbucket02/jsonpaths.txt';

Avg size of each file - 72.7MB
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019
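Unlike the CSV and JSON tables, starttime and stoptime are BIGINT here because the Avro files carry them as long values rather than timestamp strings. Assuming those longs are epoch milliseconds (check the Avro schema; drop the /1000 if they are already seconds), they can be converted on read with a sketch like this:

-- Illustrative only: convert epoch-millisecond BIGINTs to TIMESTAMP.
select tripduration,
       timestamp 'epoch' + (starttime / 1000) * interval '1 second' as starttime_ts,
       timestamp 'epoch' + (stoptime / 1000) * interval '1 second' as stoptime_ts
from citibike_trips_avro
limit 10;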

