Summary - In this blog i will load same data (which is split into multiple files) from csv, avro and json format into snowflake database table and compare load timings between Medium Vs XL Warehouse
Details - I have citibike trips data available in csv, json and avro format in aws S3. I will load this data without any configuration changes to snowflake table and capture timings and also see impact on load time when i change warehouse from Medium size to XL size (all warehouse had max cluster=1)
[I am using snowflake trial account and have used warehouse and database with default settings.]
Bulk Load CSV files
Details - I have citibike trips data available in csv, json and avro format in aws S3. I will load this data without any configuration changes to snowflake table and capture timings and also see impact on load time when i change warehouse from Medium size to XL size (all warehouse had max cluster=1)
[I am using snowflake trial account and have used warehouse and database with default settings.]
- Get the data - i have downloaded the data from google bigquery public datasets - refer to blog export-google-bigquery-public-dataset.html for steps to download the data.
- for creating the stage in snowflake to enable copy of data from aws S3 - refer to blog loading-data-from-multiple-csv-files-in.html
Load timings recorded - (for json and avro i have added time to load variant table and final table)
Bulk Load CSV files
CREATE TABLE dwh01.trips.citibike_trips_csv
(
tripduration INTEGER ,
starttime DATETIME ,
stoptime DATETIME ,
start_station_id INTEGER ,
start_station_name VARCHAR(100) ,
start_station_latitude DOUBLE,
start_station_longitude DOUBLE ,
end_station_id INTEGER ,
end_station_name VARCHAR(100) ,
end_station_latitude DOUBLE,
end_station_longitude DOUBLE ,
bikeid INTEGER ,
usertype VARCHAR(50) ,
birth_year VARCHAR(10) ,
gender VARCHAR(10)
);
copy into dwh01.trips.citibike_trips_csv
from
(
select
t.$1,
replace(t.$2, 'UTC', ''),
replace(t.$3, 'UTC', ''),
t.$4,
t.$5,
t.$6,
t.$7,
t.$8,
t.$9,
t.$10,
t.$11,
t.$12,
t.$13,
t.$14,
t.$15
from @dwh01.public.ag_aws_bucket02/nyc_citibike_trip_csv/ t
)
pattern = '.*citibike_trips.*[.]csv[.]gzip'
file_format=(TYPE = CSV COMPRESSION = GZIP FIELD_DELIMITER=',' skip_header=1 null_if=('') field_optionally_enclosed_by='"')
Time taken - 10 seconds Medium Size Warehouse
Time taken - 9.2 seconds XL Size Warehouse
Avg size of each file - 15.6MB (compressed)
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019
Bulk Load JSON files - You will first load data into table with variant column and later move the data from this variant column table to final table.
Create table dwh01.trips.citibike_trips_json_variant
(raw_json variant);
CREATE TABLE dwh01.trips.citibike_trips_json
(
tripduration INTEGER ,
starttime DATETIME ,
stoptime DATETIME ,
start_station_id INTEGER ,
start_station_name VARCHAR(100) ,
start_station_latitude DOUBLE,
start_station_longitude DOUBLE ,
end_station_id INTEGER ,
end_station_name VARCHAR(100) ,
end_station_latitude DOUBLE,
end_station_longitude DOUBLE ,
bikeid INTEGER ,
usertype VARCHAR(50) ,
birth_year VARCHAR(10) ,
gender VARCHAR(10)
);
Load json
copy into dwh01.trips.citibike_trips_json_variant
From '@dwh01.public.ag_aws_bucket02/nyc_citybike_json/'
pattern = '.*citibike_trips.*[.]json[.]gzip'
file_format=(TYPE = JSON COMPRESSION = GZIP)
Time taken - 34.5 seconds - Medium Size Warehouse
Time taken - 15.03 seconds - XL Size Warehouse
Avg size of each file - 18.8MB (compressed)
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019
Loading from Json variable table into final table…
insert into dwh01.trips.citibike_trips_json
(tripduration,
starttime,
stoptime,
start_station_id,
start_station_name,
start_station_latitude,
start_station_longitude,
end_station_id,
end_station_name,
end_station_latitude,
end_station_longitude,
bikeid,
usertype,
birth_year,
gender
)
select
raw_json:tripduration,
replace(raw_json:starttime, 'UTC', ''),
replace(raw_json:stoptime, 'UTC', ''),
raw_json:start_station_id,
raw_json:start_station_name,
raw_json:start_station_latitude,
raw_json:start_station_longitude,
raw_json:end_station_id,
raw_json:end_station_name,
raw_json:end_station_latitude,
raw_json:end_station_longitude,
raw_json:bikeid,
raw_json:usertype,
raw_json:birth_year,
raw_json:gender
from dwh01.trips.citibike_trips_json_variant;
Time taken - 12.34seconds Medium Size Warehouse
Time taken - 7 seconds - XL Size Warehouse
Total rowcount loaded - 33319019
Bulk Load AVRO files - You will first load data into table with variant column and later move the data from this variant column table to final table.
Create table dwh01.trips.citibike_trips_avro_variant
(raw_avro variant);
CREATE TABLE dwh01.trips.citibike_trips_avro
(
tripduration INTEGER ,
starttime DATETIME ,
stoptime DATETIME ,
start_station_id INTEGER ,
start_station_name VARCHAR(100) ,
start_station_latitude DOUBLE,
start_station_longitude DOUBLE ,
end_station_id INTEGER ,
end_station_name VARCHAR(100) ,
end_station_latitude DOUBLE,
end_station_longitude DOUBLE ,
bikeid INTEGER ,
usertype VARCHAR(50) ,
birth_year VARCHAR(10) ,
gender VARCHAR(10)
);
Load avro -
copy into dwh01.trips.citibike_trips_avro_variant
From '@dwh01.public.ag_aws_bucket02/nyc_citibike_trip_avro/'
pattern = '.*citibike_trips.*[.]avro'
file_format=(TYPE = AVRO)
Time taken - 26.67 seconds - Medium Size Warehouse
Time taken - 13 seconds - XL Size Warehouse
Avg size of each file - 72.7MB
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019
Loading from avro variant table into final table…
insert into dwh01.trips.citibike_trips_avro
(tripduration,
starttime,
stoptime,
start_station_id,
start_station_name,
start_station_latitude,
start_station_longitude,
end_station_id,
end_station_name,
end_station_latitude,
end_station_longitude,
bikeid,
usertype,
birth_year,
gender
)
select
raw_avro:tripduration,
replace(raw_avro:starttime, 'UTC', ''),
replace(raw_avro:stoptime, 'UTC', ''),
raw_avro:start_station_id,
raw_avro:start_station_name,
raw_avro:start_station_latitude,
raw_avro:start_station_longitude,
raw_avro:end_station_id,
raw_avro:end_station_name,
raw_avro:end_station_latitude,
raw_avro:end_station_longitude,
raw_avro:bikeid,
raw_avro:usertype,
raw_avro:birth_year,
raw_avro:gender
from dwh01.trips.citibike_trips_avro_variant;
Time taken - 8.27 seconds - Medium Size Warehouse
Time taken - 8.2 seconds - XL Size Warehouse
Total rowcount loaded - 33319019