Monday, November 19, 2018

Bulk load data from AWS S3 to Snowflake database using COPY INTO

Summary - We can ingest data into a Snowflake database in multiple ways; one such way is from CSV files. In this blog we will see how to load data from multiple CSV files stored in an AWS S3 bucket into a Snowflake database, and how to transform that data during the load.

Details - Snowflake makes it easy to load data from AWS S3 into a Snowflake database using CSV files. It even allows you to transform the data at load time.

  1. Get the CSV files you want to load into AWS S3. In this example I am using CSV files generated from public datasets provided by Google BigQuery - see the blog export-google-bigquery-public-dataset.
  2. Snowflake needs either AWS access keys or an IAM role to access the data in S3. I am using AWS user access keys in this example; a role-based alternative is sketched after the stage example below.
  3. Log in to your Snowflake database.
  4. Create a stage to connect to S3, as below:
create or replace stage dwh01.public.ag_aws_bucket02 url='s3://agbucket02/'
credentials=(aws_key_id='xxx' aws_secret_key='ccccc');

In the above, agbucket02 is the AWS S3 bucket which holds my data/CSV files.
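
If you prefer not to embed access keys in the stage, a role-based setup can also be done with a Snowflake storage integration - a minimal sketch, assuming you have already created an IAM role in AWS (the account id and role name below are just placeholders):

create or replace storage integration ag_s3_int
  type = external_stage
  storage_provider = 'S3'
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::123456789012:role/my_snowflake_role'  -- placeholder role ARN
  storage_allowed_locations = ('s3://agbucket02/');

create or replace stage dwh01.public.ag_aws_bucket02
  url = 's3://agbucket02/'
  storage_integration = ag_s3_int;

After creating the integration, DESC INTEGRATION ag_s3_int shows the Snowflake IAM user and external ID that need to be added to the role's trust policy in AWS.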
  • Once the above stage is created, create the table using the DDL script below.
  • In the DDL below I am using a schema called trips to create the table citibike_trips.

CREATE TABLE dwh01.trips.citibike_trips
(
   tripduration INTEGER ,
   starttime DATETIME ,
   stoptime DATETIME ,
   start_station_id INTEGER ,
   start_station_name VARCHAR(100)  ,
   start_station_latitude DOUBLE,
   start_station_longitude DOUBLE ,
   end_station_id INTEGER ,
   end_station_name VARCHAR(100)  ,
   end_station_latitude DOUBLE,
   end_station_longitude DOUBLE ,
   bikeid INTEGER ,
   usertype VARCHAR(50)  ,
   birth_year VARCHAR(10)  ,
   gender VARCHAR(10)  
);
  • Once the table is created, you can use the COPY command below to load data from multiple CSV files into your table.
  • You can also transform the data during the load - for example, I wanted to remove the 'UTC' text from all timestamps before loading them into the citibike_trips table, hence I used the replace function.
  • Snowflake allows you to specify a pattern for files in a stage location - I have used pattern = '.*citibike_trips.*[.]csv[.]gzip', which means: take all .csv.gzip files which have citibike_trips in their name (a quick way to check what this pattern matches is shown right after this list).
  • I have also used other properties in the COPY command to specify the file compression, field delimiter and header row.
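
Before running the copy, it can help to confirm that the stage actually resolves to the expected files - a quick check with LIST, using the same stage path and pattern as the COPY command below:

list @dwh01.public.ag_aws_bucket02/nyc_citibike_trip_csv/
pattern = '.*citibike_trips.*[.]csv[.]gzip';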


copy into dwh01.trips.citibike_trips
from
(
   select
       t.$1,
       replace(t.$2, 'UTC', ''),
       replace(t.$3, 'UTC', ''),
       t.$4,
       t.$5,
       t.$6,
       t.$7,
       t.$8,
       t.$9,
       t.$10,
       t.$11,
       t.$12,
       t.$13,
       t.$14,
       t.$15
from @dwh01.public.ag_aws_bucket02/nyc_citibike_trip_csv/ t
)
pattern = '.*citibike_trips.*[.]csv[.]gzip'
file_format=(TYPE = CSV COMPRESSION = GZIP FIELD_DELIMITER=','
skip_header=1 null_if=('') field_optionally_enclosed_by='"');
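
After the load completes, a couple of quick checks confirm that the rows landed and that the UTC text was stripped from the timestamps:

select count(*) from dwh01.trips.citibike_trips;

select starttime, stoptime
from dwh01.trips.citibike_trips
limit 10;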

Sunday, November 18, 2018

Download data from Google Cloud Storage (GCS) using Talend

Summary - In this blog I will show how we can use Talend Open Studio to download files from GCS buckets/folders to a local machine. I have some extra logic built in, which first checks whether the file already exists on the local machine and only downloads it if it does not. You can adjust the logic as per your needs.

Details - As of now, Talend supports connections to GCS based on an access key and secret key.

  1. As a pre-requisite you should have access to GCP and Google Cloud Storage (GCS) buckets.
  2. You can start for free and get credits.
  3. Once you are set up on GCP and have your initial project created, you can create a bucket in GCS.
  4. Talend has tGS* components to help with integration with GCP.
  5. Enable Interoperability in the GCS settings panel.
  6. You then have to get your access key and secret key from GCP. See the screenshot below -


Next you should use the option "Create a new key" to generate the keys and make a note of them.
  • Now in Talend Open Studio for Data Integration, select the tGSConnection component from the palette and enter the above keys. This component establishes the connection to GCP.
  • The tGSList component can be used to get the list of items in a GCS bucket.
  • The logic I have used is to get a list of all items whose key/name starts with a prefix, iterate over each object, and check whether it already exists in my local folder. If it exists I skip the file, otherwise I use tGSGet to download it from GCS to the local machine.
  • Finally I use tGSClose to close the connection to GCP.
Talend Job
Components to connect to GCP and get bucket items from GCS

Specifying the key prefix so that relevant files are returned by tGSList


Java logic to get the file name from the key returned by GCS

tJava is used to derive the file name from the complete key name returned by tGSList.
Java Code
// Key returned by tGSList for the current iteration, e.g. "somefolder/somefile.csv"
String strTemp = ((String) globalMap.get("tGSList_1_CURRENT_KEY"));
int index_of_slash = strTemp.indexOf("/");

if (index_of_slash == -1) {
    // No folder prefix - the key itself is the file name
    globalMap.put("gs_filename", strTemp);
} else {
    // Strip the folder prefix and keep only the file name
    globalMap.put("gs_filename", strTemp.substring(index_of_slash + 1));
}


Path of the local file/folder to be checked for existence


Using NOT as we want to get the file only if it does not already exist


Giving the output directory and key name in the tGSGet component




Export Google BigQuery public datasets to JSON, Avro, CSV


Summary - In this blog we will see how we can export data from Google BigQuery public datasets to CSV, Avro and JSON format files.

Details - There are multiple ways to export public datasets to files from Google BigQuery. You can use tools like Talend to do it for you, or you can use the built-in export features of Google BigQuery itself.
I would like to demonstrate how easy it is to export these datasets into multiple files of different formats using Google BigQuery's export options.
  1. Get access to Google Cloud Platform - you can start for free and get credits.
  2. Once you are set up with your first project, go to BigQuery from the GCP console.
  3. Navigate to your desired public dataset in the left hand panel.
  4. Select the table you want to export and click on the preview option to see sample data in the bottom right panel.
  5. In the screenshot below I have used the "bigquery-public-data.new_york" dataset, which has tables that store data related to yellow taxi trips. I have selected tlc_yellow_trips_2016 in the left hand panel.


  • Once you are in the preview panel you will have an "EXPORT" option.
  • Click on this option and enter your Google Cloud Storage bucket and the folder where you want to export the data to.
  • You can also select the export file format and compression.
  • It is advisable to use * in the file name, which allows the BigQuery export job to split the export into multi-part files - this makes it easier to load these files into other databases or to process them in parallel.
  • In the example below I am exporting data into my bucket called ag_nycdata, into the folder tlc_yellow_trips_2016.
  • I have used multiple exports: CSV without compression, CSV with compression, JSON with and without compression, and Avro.
Once you start the export, the data gets exported to Google Cloud Storage and the files get generated as expected - see below.
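
If you prefer a scripted alternative to the console export, BigQuery also has an EXPORT DATA statement in standard SQL - a minimal sketch, assuming the same bucket (ag_nycdata) and folder (tlc_yellow_trips_2016) as above:

EXPORT DATA OPTIONS (
  uri = 'gs://ag_nycdata/tlc_yellow_trips_2016/trips-*.csv.gz',  -- the * lets BigQuery write multi-part files
  format = 'CSV',
  compression = 'GZIP',
  header = true,
  overwrite = true
) AS
SELECT *
FROM `bigquery-public-data.new_york.tlc_yellow_trips_2016`;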