
Thursday, December 13, 2018

Bulk Load CSV, JSON, and Avro Files to Google BigQuery

Summary - In this post I will load the same data (split into multiple files) in CSV, Avro, and JSON formats into a Google BigQuery table and capture the load timings.

Details - I have Citi Bike trips data available in CSV, JSON, and Avro formats in Google Cloud Storage (GCS).

Load timings recorded - *(no other load was running on BigQuery)


File Type | Total Files | Avg File Size (gzipped for JSON, CSV) | Avg Rows per File | Total Rows Loaded | Time Taken (sec)
CSV       | 57          | 15.6MB                                | 584544            | 33319019          | 68
JSON      | 57          | 18.8MB                                | 584544            | 33319019          | 88
AVRO      | 57          | 72.7MB                                | 584544            | 33319019          | 88


Bulk Load CSV files
Launch Google Cloud Shell and run the command below:

bq load --source_format=CSV --autodetect pristine-nomad-222804:nyc_data.citibike_trips_csv gs://ag_nycdata/citibike_trips/citibike_trips*.csv.gzip

Avg size of each file - 15.6MB (compressed)
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019
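
To double-check the load, a quick row count is enough (a minimal sanity-check sketch; the table name comes from the load command above, run via the BigQuery console or bq query in standard SQL):

-- Row count should match the 33319019 rows reported above.
SELECT COUNT(*) AS row_cnt
FROM `pristine-nomad-222804.nyc_data.citibike_trips_csv`;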


Bulk Load JSON files -
bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect pristine-nomad-222804:nyc_data.citibike_trips_json gs://ag_nycdata/citibike_trips/citibike_trips*.json.gzip


Avg size of each file - 18.8MB (compressed)

Avg rowcount in each file - 584544

Total rowcount loaded - 33319019
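
Since --autodetect infers the schema from a sample of the rows, it is worth confirming what it actually picked. A small sketch using BigQuery's INFORMATION_SCHEMA views (assuming they are available in your BigQuery release and region):

-- Inspect the autodetected column types for the JSON-loaded table.
SELECT column_name, data_type
FROM `pristine-nomad-222804.nyc_data.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'citibike_trips_json';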



Bulk Load AVRO files -
bq load --source_format=AVRO --autodetect pristine-nomad-222804:nyc_data.citibike_trips_avro gs://ag_nycdata/citibike_trips/citibike_trips*.avro


Avg size of each file - 72.7MB
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019
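
With all three loads done, one query can confirm the targets hold identical row counts (a sketch over the three tables created above):

-- All three formats should report the same total.
SELECT 'csv' AS fmt, COUNT(*) AS row_cnt FROM `pristine-nomad-222804.nyc_data.citibike_trips_csv`
UNION ALL
SELECT 'json', COUNT(*) FROM `pristine-nomad-222804.nyc_data.citibike_trips_json`
UNION ALL
SELECT 'avro', COUNT(*) FROM `pristine-nomad-222804.nyc_data.citibike_trips_avro`;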

Wednesday, November 21, 2018

Bulk Load CSV, JSON, and Avro Files to Snowflake

Summary - In this post I will load the same data (split into multiple files) in CSV, Avro, and JSON formats into a Snowflake database table and compare load timings between a Medium and an XL warehouse.

Details - I have Citi Bike trips data available in CSV, JSON, and Avro formats in AWS S3. I will load this data into Snowflake tables without any configuration changes, capture the timings, and also observe the impact on load time when I change the warehouse from Medium to XL (all warehouses had max clusters = 1).
[I am using a Snowflake trial account with warehouse and database at default settings.]
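
To flip between the Medium and XL runs, resizing the warehouse is a single statement (a sketch; load_wh is a hypothetical warehouse name, since the post does not name the actual one):

-- load_wh is a placeholder; substitute your warehouse name.
ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'XLARGE';
-- and back down after the test:
ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'MEDIUM';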

Load timings recorded - (for JSON and Avro, the time shown is the sum of loading the variant table and the final table)

File Type | Total Files | Avg File Size (gzipped for JSON, CSV) | Avg Rows per File | Total Rows Loaded | Time Taken (sec) | Snowflake Warehouse
CSV       | 57          | 15.6MB                                | 584544            | 33319019          | 10               | Medium
CSV       | 57          | 15.6MB                                | 584544            | 33319019          | 9.2              | XL
JSON      | 57          | 18.8MB                                | 584544            | 33319019          | 46.84            | Medium
JSON      | 57          | 18.8MB                                | 584544            | 33319019          | 22.03            | XL
AVRO      | 57          | 72.7MB                                | 584544            | 33319019          | 34.94            | Medium
AVRO      | 57          | 72.7MB                                | 584544            | 33319019          | 21.2             | XL

Bulk Load CSV files

CREATE TABLE dwh01.trips.citibike_trips_csv
(
   tripduration INTEGER,
   starttime DATETIME,
   stoptime DATETIME,
   start_station_id INTEGER,
   start_station_name VARCHAR(100),
   start_station_latitude DOUBLE,
   start_station_longitude DOUBLE,
   end_station_id INTEGER,
   end_station_name VARCHAR(100),
   end_station_latitude DOUBLE,
   end_station_longitude DOUBLE,
   bikeid INTEGER,
   usertype VARCHAR(50),
   birth_year VARCHAR(10),
   gender VARCHAR(10)
);
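
Before running the copy, you can confirm the stage pattern actually matches the expected 57 files (a quick check using the same stage path and pattern as the copy below):

-- Lists the staged files the copy will pick up.
list @dwh01.public.ag_aws_bucket02/nyc_citibike_trip_csv/ pattern = '.*citibike_trips.*[.]csv[.]gzip';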

copy into dwh01.trips.citibike_trips_csv
from
(
   select
       t.$1,
       replace(t.$2, 'UTC', ''),  -- strip the 'UTC' suffix so starttime casts cleanly to DATETIME
       replace(t.$3, 'UTC', ''),  -- same for stoptime
       t.$4,
       t.$5,
       t.$6,
       t.$7,
       t.$8,
       t.$9,
       t.$10,
       t.$11,
       t.$12,
       t.$13,
       t.$14,
       t.$15
   from @dwh01.public.ag_aws_bucket02/nyc_citibike_trip_csv/ t  -- $1..$15 reference the CSV columns by position
)
pattern = '.*citibike_trips.*[.]csv[.]gzip'
file_format = (TYPE = CSV COMPRESSION = GZIP FIELD_DELIMITER = ',' SKIP_HEADER = 1 NULL_IF = ('') FIELD_OPTIONALLY_ENCLOSED_BY = '"');

Time taken - 10 seconds - Medium Size Warehouse
Time taken - 9.2 seconds - XL Size Warehouse
Avg size of each file - 15.6MB (compressed)
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019



Bulk Load JSON files - First load the data into a table with a single variant column, then move it from the variant table into the final table.

CREATE TABLE dwh01.trips.citibike_trips_json_variant
(raw_json VARIANT);

CREATE TABLE dwh01.trips.citibike_trips_json
(
   tripduration INTEGER,
   starttime DATETIME,
   stoptime DATETIME,
   start_station_id INTEGER,
   start_station_name VARCHAR(100),
   start_station_latitude DOUBLE,
   start_station_longitude DOUBLE,
   end_station_id INTEGER,
   end_station_name VARCHAR(100),
   end_station_latitude DOUBLE,
   end_station_longitude DOUBLE,
   bikeid INTEGER,
   usertype VARCHAR(50),
   birth_year VARCHAR(10),
   gender VARCHAR(10)
);

Load the JSON files:

copy into dwh01.trips.citibike_trips_json_variant
from '@dwh01.public.ag_aws_bucket02/nyc_citybike_json/'
pattern = '.*citibike_trips.*[.]json[.]gzip'
file_format = (TYPE = JSON COMPRESSION = GZIP);

Time taken - 34.5 seconds - Medium Size Warehouse
Time taken - 15.03 seconds - XL Size Warehouse
Avg size of each file - 18.8MB (compressed)
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019
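
Before flattening, it can help to peek at the variant rows; the colon operator extracts a field by key and :: casts the value (a small sketch against the variant table just loaded):

-- Preview a few rows to confirm the field names and values look right.
select raw_json:tripduration::integer as tripduration,
       raw_json:starttime::string as starttime
from dwh01.trips.citibike_trips_json_variant
limit 5;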

Loading from the JSON variant table into the final table:
insert into dwh01.trips.citibike_trips_json
(tripduration,
starttime,
stoptime,
start_station_id,
start_station_name,
start_station_latitude,
start_station_longitude,
end_station_id,
end_station_name,
end_station_latitude,
end_station_longitude,
bikeid,
usertype,
birth_year,
gender
)
select
   raw_json:tripduration,
   replace(raw_json:starttime, 'UTC', ''),
   replace(raw_json:stoptime, 'UTC', ''),
   raw_json:start_station_id,
   raw_json:start_station_name,
   raw_json:start_station_latitude,
   raw_json:start_station_longitude,
   raw_json:end_station_id,
   raw_json:end_station_name,
   raw_json:end_station_latitude,
   raw_json:end_station_longitude,
   raw_json:bikeid,
   raw_json:usertype,
   raw_json:birth_year,
   raw_json:gender
from dwh01.trips.citibike_trips_json_variant;

Time taken - 12.34 seconds - Medium Size Warehouse
Time taken - 7 seconds - XL Size Warehouse
Total rowcount loaded - 33319019


Bulk Load AVRO files - As with JSON, first load the data into a table with a single variant column, then move it from the variant table into the final table.

CREATE TABLE dwh01.trips.citibike_trips_avro_variant
(raw_avro VARIANT);

CREATE TABLE dwh01.trips.citibike_trips_avro
(
   tripduration INTEGER,
   starttime DATETIME,
   stoptime DATETIME,
   start_station_id INTEGER,
   start_station_name VARCHAR(100),
   start_station_latitude DOUBLE,
   start_station_longitude DOUBLE,
   end_station_id INTEGER,
   end_station_name VARCHAR(100),
   end_station_latitude DOUBLE,
   end_station_longitude DOUBLE,
   bikeid INTEGER,
   usertype VARCHAR(50),
   birth_year VARCHAR(10),
   gender VARCHAR(10)
);

Load the Avro files:

copy into dwh01.trips.citibike_trips_avro_variant
from '@dwh01.public.ag_aws_bucket02/nyc_citibike_trip_avro/'
pattern = '.*citibike_trips.*[.]avro'
file_format = (TYPE = AVRO);

Time taken - 26.67 seconds - Medium Size Warehouse
Time taken - 13 seconds - XL Size Warehouse
Avg size of each file - 72.7MB
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019
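
For Avro and other semi-structured formats, newer Snowflake releases can skip the variant staging table entirely with the MATCH_BY_COLUMN_NAME copy option, which maps fields to columns by name (a sketch only; the starttime/stoptime strings here carry a 'UTC' suffix, so they may not cast cleanly to DATETIME without the replace() used below):

-- Direct load, assuming field names match the target columns and values cast cleanly.
copy into dwh01.trips.citibike_trips_avro
from @dwh01.public.ag_aws_bucket02/nyc_citibike_trip_avro/
pattern = '.*citibike_trips.*[.]avro'
file_format = (TYPE = AVRO)
match_by_column_name = case_insensitive;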

Loading from the Avro variant table into the final table:
insert into dwh01.trips.citibike_trips_avro
(tripduration,
starttime,
stoptime,
start_station_id,
start_station_name,
start_station_latitude,
start_station_longitude,
end_station_id,
end_station_name,
end_station_latitude,
end_station_longitude,
bikeid,
usertype,
birth_year,
gender
)
select
   raw_avro:tripduration,
   replace(raw_avro:starttime, 'UTC', ''),
   replace(raw_avro:stoptime, 'UTC', ''),
   raw_avro:start_station_id,
   raw_avro:start_station_name,
   raw_avro:start_station_latitude,
   raw_avro:start_station_longitude,
   raw_avro:end_station_id,
   raw_avro:end_station_name,
   raw_avro:end_station_latitude,
   raw_avro:end_station_longitude,
   raw_avro:bikeid,
   raw_avro:usertype,
   raw_avro:birth_year,
   raw_avro:gender
from dwh01.trips.citibike_trips_avro_variant;

Time taken - 8.27 seconds - Medium Size Warehouse
Time taken - 8.2 seconds - XL Size Warehouse
Total rowcount loaded - 33319019
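
As a final cross-check, one query confirms the three Snowflake targets agree (a sketch over the final tables created above):

-- All three formats should report the same total.
select 'csv' as fmt, count(*) as row_cnt from dwh01.trips.citibike_trips_csv
union all
select 'json', count(*) from dwh01.trips.citibike_trips_json
union all
select 'avro', count(*) from dwh01.trips.citibike_trips_avro;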