
Monday, November 19, 2018

Bulk Load Data from AWS S3 to Snowflake Using COPY INTO

Summary - We can ingest data into a Snowflake database in multiple ways; one such way is from CSV files. In this blog we will see how to load data from multiple CSV files stored in an AWS S3 bucket into a Snowflake database, and also how to transform this data during the load.

Details - Snowflake makes it easy to load data from AWS S3 into a Snowflake database using CSV files. It even lets you transform the data at load time.

  1. Get the CSV files you want to load into AWS S3. In this example I am using CSV files generated from public datasets provided by Google BigQuery - see the blog export-google-bigquery-public-dataset.
  2. Snowflake needs either AWS access keys or an IAM role to access the data in S3. I am using AWS user access keys in this example (a sketch of the IAM-role route follows the stage example below).
  3. Log in to your Snowflake DB.
  4. Create a stage that connects to S3, as below:
create or replace stage dwh01.public.ag_aws_bucket02 url='s3://agbucket02/'
credentials=(aws_key_id='xxx' aws_secret_key='ccccc');
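
Once the stage is created, a quick sanity check is to confirm Snowflake can actually see the files through it (the path below matches the folder used in the copy command later in this post):

list @dwh01.public.ag_aws_bucket02/nyc_citibike_trip_csv/;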

In the stage definition above, agbucket02 is the AWS S3 bucket that holds my data/CSV files.
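
If you would rather use the IAM-role option mentioned in step 2, the usual route is a storage integration instead of embedding keys in the stage. A minimal sketch, assuming a hypothetical integration name s3_int and a placeholder role ARN (you also need to add Snowflake to the role's trust policy - desc integration s3_int shows the values to use):

create storage integration s3_int
  type = external_stage
  storage_provider = 'S3'
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::111122223333:role/my_snowflake_role' -- placeholder ARN
  storage_allowed_locations = ('s3://agbucket02/');

create or replace stage dwh01.public.ag_aws_bucket02 url='s3://agbucket02/'
storage_integration = s3_int;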
  • Once the stage is created, create the table using the DDL script below.
  • In the DDL below I am using a schema called trips to create the table citibike_trips.

CREATE TABLE dwh01.trips.citibike_trips
(
   tripduration INTEGER ,
   starttime DATETIME ,
   stoptime DATETIME ,
   start_station_id INTEGER ,
   start_station_name VARCHAR(100)  ,
   start_station_latitude DOUBLE,
   start_station_longitude DOUBLE ,
   end_station_id INTEGER ,
   end_station_name VARCHAR(100)  ,
   end_station_latitude DOUBLE,
   end_station_longitude DOUBLE ,
   bikeid INTEGER ,
   usertype VARCHAR(50)  ,
   birth_year VARCHAR(10)  ,
   gender VARCHAR(10)  
);
  • Once the table is created, you can use the copy command below to load data from multiple CSV files into the table.
  • You can also transform the data during the load - for example, I wanted to remove the 'UTC' text from all timestamps before loading them into citibike_trips, so I used the replace function.
  • Snowflake lets you specify a pattern against a stage location - I have used pattern = '.*citibike_trips.*[.]csv[.]gzip', which means take all .csv.gzip files that have citibike_trips in their names.
  • I have also used other properties in the copy command to specify the file compression, field delimiter and header row.


copy into dwh01.trips.citibike_trips
from
(
   select
       t.$1,
       replace(t.$2, 'UTC', ''),
       replace(t.$3, 'UTC', ''),
       t.$4,
       t.$5,
       t.$6,
       t.$7,
       t.$8,
       t.$9,
       t.$10,
       t.$11,
       t.$12,
       t.$13,
       t.$14,
       t.$15
    from @dwh01.public.ag_aws_bucket02/nyc_citibike_trip_csv/ t
)
pattern = '.*citibike_trips.*[.]csv[.]gzip'
file_format=(TYPE = CSV COMPRESSION = GZIP FIELD_DELIMITER=','
skip_header=1 null_if=('') field_optionally_enclosed_by='"');
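
After the copy completes, a couple of quick checks (my addition, not part of the original steps) confirm the load - a row count, and Snowflake's information_schema.copy_history table function to see per-file status and any load errors (assumes your session is using the dwh01 database and trips schema):

select count(*) from dwh01.trips.citibike_trips;

select file_name, status, row_count, first_error_message
from table(information_schema.copy_history(
   table_name => 'citibike_trips',
   start_time => dateadd(hours, -1, current_timestamp())));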