Monday, April 26, 2021

Google Strategic Cloud Data Engineer (Cloud Data Architect) Interview Experience


In this blog I would like to share the experience I had while interviewing for the position of SCDE (Strategic Cloud Data Engineer).

Overall duration:

The process lasted around 4-5 months, from the date I applied to the date I received the offer. The long duration was owing to the December holiday season, the extra preparation time I had asked for, and some delay from Google's end.

Initial Discussions:

My initial interactions were with Google recruiters for the initial screening and interview-process sessions.

Rounds #1 and #2

Both rounds were focused on data solution/DWH design questions, data modelling, system design questions around data platforms, SQL problems and simple coding. In both rounds I found the interviewers had great patience to listen to my approaches, discuss the solutions and cover the pros and cons of what I suggested. I did well in both rounds.

Round #3/3.x

This was a standard coding and system design round. I did not do great in it and was asked to appear for this round once more, with around 4 weeks to prepare. I took an additional week and then appeared again. The interviewer I met was cooperative, had the patience to listen to the solutions I proposed, and discussed their pros and cons in detail.
I was asked one coding question and one system design question. I was able to clear this round and felt hugely relieved afterwards.

Round #4

This was the hiring manager and Googliness round. The hiring manager listened to me with patience, gave me different situations, and learned about my reactions and solutions to those situations. Yes, I was able to clear this round.

Finally

This was it. I was all set and received the offer from Google Cloud Professional Services. I was excited and felt great. It was like a dream come true - getting an offer from Google.

Summary

Overall I had a great experience interacting with Google recruiters and engineers. The interview process was definitely long and could have been shorter (though, admittedly, I myself asked for extra time to prepare for the coding round). Most of the recruiters I met were cooperative and patient.

p.s. - I did not join Google and turned down the offer later [that's another story :( ]

Thursday, December 13, 2018

Bulk Load CSV, JSON and Avro Files to Google BigQuery

Summary - In this blog I will load the same data (split into multiple files) in CSV, Avro and JSON formats into Google BigQuery tables and capture the load timings.

Details - I have Citi Bike trips data available in CSV, JSON and Avro formats in Google Cloud Storage (GCS).

Load timings recorded - *(no other load was running on BigQuery)


File Type | Total Files | Avg File Size (gzipped for JSON, CSV) | Avg Row Count per File | Total Rows Loaded | Time Taken (sec)
CSV       | 57          | 15.6 MB                               | 584544                 | 33319019          | 68
JSON      | 57          | 18.8 MB                               | 584544                 | 33319019          | 88
AVRO      | 57          | 72.7 MB                               | 584544                 | 33319019          | 88


Bulk Load CSV files
Launch Google Cloud Shell and run the command below:

bq  load --source_format=CSV --autodetect pristine-nomad-222804:nyc_data.citibike_trips_csv gs://ag_nycdata/citibike_trips/citibike_trips*.csv.gzip

Avg size of each file - 15.6MB (compressed)
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019


Bulk Load JSON files -
bq  load --source_format=NEWLINE_DELIMITED_JSON --autodetect pristine-nomad-222804:nyc_data.citibike_trips_json gs://ag_nycdata/citibike_trips/citibike_trips*.json.gzip


Avg size of each file - 18.8MB (compressed)

Avg rowcount in each file - 584544

Total rowcount loaded - 33319019



Bulk Load AVRO files - 
bq  load --source_format=AVRO --autodetect pristine-nomad-222804:nyc_data.citibike_trips_avro gs://ag_nycdata/citibike_trips/citibike_trips*.avro


Avg size of each file - 72.7MB
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019
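
As a quick sanity check on the row counts above, the three tables can be compared with one query. This is a minimal sketch in BigQuery standard SQL, assuming the project, dataset and table names used in the load commands above:

-- row counts per source format, using the tables created by the bq load commands above
select 'csv' as source, count(*) as row_count from `pristine-nomad-222804.nyc_data.citibike_trips_csv`
union all
select 'json', count(*) from `pristine-nomad-222804.nyc_data.citibike_trips_json`
union all
select 'avro', count(*) from `pristine-nomad-222804.nyc_data.citibike_trips_avro`;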

Monday, December 3, 2018

Bulk Load CSV, JSON and Avro Files to AWS Redshift

Summary - In this blog I will load the same data (split into multiple files) in CSV, Avro and JSON formats into AWS Redshift tables and see the difference in load timings based on Redshift cluster size.

Details - I have Citi Bike trips data available in CSV, JSON and Avro formats in AWS S3. I will load this data into Redshift tables without any configuration changes and capture the timings, and also see the impact on load time when I change the Redshift cluster configuration.

Load timings recorded - *(no other load was running on the cluster; I also captured timings for the COPY command with COMPUPDATE OFF and STATUPDATE OFF for some cases)


File Type | Total Files | Avg File Size (gzipped for JSON, CSV) | Avg Row Count per File | Total Rows Loaded | Time Taken (sec) | Time Taken with COMPUPDATE OFF STATUPDATE OFF (sec) | Redshift Cluster Config
CSV  | 57 | 15.6 MB | 584544 | 33319019 | 130  | 115 | dc2.large, 2 nodes
CSV  | 57 | 15.6 MB | 584544 | 33319019 | 71   | -   | dc2.large, 4 nodes (2 elastic nodes added)
CSV  | 57 | 15.6 MB | 584544 | 33319019 | 15   | -   | dc2.8xlarge, 2 nodes
CSV  | 57 | 15.6 MB | 584544 | 33319019 | 11   | -   | dc2.8xlarge, 4 nodes (2 elastic nodes added)
JSON | 57 | 18.8 MB | 584544 | 33319019 | 220  | 215 | dc2.large, 2 nodes
JSON | 57 | 18.8 MB | 584544 | 33319019 | 140  | -   | dc2.large, 4 nodes (2 elastic nodes added)
JSON | 57 | 18.8 MB | 584544 | 33319019 | 22.3 | -   | dc2.8xlarge, 2 nodes
JSON | 57 | 18.8 MB | 584544 | 33319019 | 20   | -   | dc2.8xlarge, 4 nodes (2 elastic nodes added)
AVRO | 57 | 72.7 MB | 584544 | 33319019 | 380  | 330 | dc2.large, 2 nodes
AVRO | 57 | 72.7 MB | 584544 | 33319019 | 230  | -   | dc2.large, 4 nodes (2 elastic nodes added)
AVRO | 57 | 72.7 MB | 584544 | 33319019 | 33   | -   | dc2.8xlarge, 2 nodes
AVRO | 57 | 72.7 MB | 584544 | 33319019 | 27   | -   | dc2.8xlarge, 4 nodes (2 elastic nodes added)




Bulk Load CSV files

drop table citibike_trips_csv;
CREATE TABLE citibike_trips_csv
(
    tripduration INTEGER,
    starttime DATETIME,
    stoptime DATETIME,
    start_station_id INTEGER,
    start_station_name VARCHAR(100),
    start_station_latitude DOUBLE PRECISION,
    start_station_longitude DOUBLE PRECISION,
    end_station_id INTEGER,
    end_station_name VARCHAR(100),
    end_station_latitude DOUBLE PRECISION,
    end_station_longitude DOUBLE PRECISION,
    bikeid INTEGER,
    usertype VARCHAR(50),
    birth_year VARCHAR(10),
    gender VARCHAR(10)
);

copy citibike_trips_csv ( tripduration, starttime, stoptime, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bikeid, usertype, birth_year, gender )
from 's3://agbucket02/nyc_citibike_trip_csv'
iam_role 'arn:aws:iam::yourroleID:role/yourrolename'
csv gzip IGNOREHEADER 1 IGNOREBLANKLINES;


(to turn off automatic compression analysis and statistics computation, add STATUPDATE OFF COMPUPDATE OFF, as in the example below)
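
For reference, the run timed in the "Time Taken with COMPUPDATE OFF STATUPDATE OFF" column of the table above is the same COPY statement with those two options appended:

-- same CSV COPY as above, but skipping automatic compression analysis and statistics update
copy citibike_trips_csv ( tripduration, starttime, stoptime, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bikeid, usertype, birth_year, gender )
from 's3://agbucket02/nyc_citibike_trip_csv'
iam_role 'arn:aws:iam::yourroleID:role/yourrolename'
csv gzip IGNOREHEADER 1 IGNOREBLANKLINES
STATUPDATE OFF COMPUPDATE OFF;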

Avg size of each file - 15.6MB (compressed)
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019



Bulk Load JSON files - We need a jsonpaths file as below.

jsonpaths.txt

{
    "jsonpaths": [
        "$.tripduration",
        "$.starttime",
        "$.stoptime",
        "$.start_station_id",
        "$.start_station_name",
        "$.start_station_latitude",
        "$.start_station_longitude",
        "$.end_station_id",
        "$.end_station_name",
        "$.end_station_latitude",
        "$.end_station_longitude",
        "$.bikeid",
        "$.usertype",
        "$.birth_year",
        "$.gender"
    ]
}

drop table if exists citibike_trips_json;

CREATE TABLE citibike_trips_json
(
    tripduration INTEGER encode zstd,
    starttime DATETIME encode zstd,
    stoptime DATETIME encode zstd,
    start_station_id INTEGER encode zstd,
    start_station_name VARCHAR(100) encode zstd,
    start_station_latitude DOUBLE PRECISION encode zstd,
    start_station_longitude DOUBLE PRECISION encode zstd,
    end_station_id INTEGER encode zstd,
    end_station_name VARCHAR(100) encode zstd,
    end_station_latitude DOUBLE PRECISION encode zstd,
    end_station_longitude DOUBLE PRECISION encode zstd,
    bikeid INTEGER encode zstd,
    usertype VARCHAR(50) encode zstd,
    birth_year VARCHAR(10) encode zstd,
    gender VARCHAR(10) encode zstd
);

Load json

copy citibike_trips_json
from 's3://agbucket02/nyc_citybike_json/'
iam_role 'arn:aws:iam::yourroleid:role/yourrolename'
FORMAT AS JSON 's3://agbucket02/jsonpaths.txt' GZIP;

copy citibike_trips_json
from 's3://agbucket02/nyc_citybike_json/'
iam_role 'arn:aws:iam::yourroleid:role/yourrolename'
FORMAT AS JSON 's3://agbucket02/jsonpaths.txt' GZIP
STATUPDATE OFF COMPUPDATE OFF;


Avg size of each file - 18.8MB (compressed)

Avg rowcount in each file - 584544

Total rowcount loaded - 33319019

Bulk Load AVRO files - We will use the same jsonpaths file as we used for the JSON load.

CREATE TABLE citibike_trips_avro
 (
    tripduration INTEGER ,
    starttime BIGINT,
    stoptime BIGINT ,
    start_station_id INTEGER ,
    start_station_name VARCHAR(100),
    start_station_latitude DOUBLE PRECISION,
    start_station_longitude DOUBLE PRECISION,
    end_station_id INTEGER ,
    end_station_name VARCHAR(100),
    end_station_latitude DOUBLE PRECISION,
    end_station_longitude DOUBLE PRECISION,
    bikeid INTEGER,
    usertype VARCHAR(50),
    birth_year VARCHAR(10),
    gender VARCHAR(10)
);

truncate table citibike_trips_avro;
copy citibike_trips_avro
from 's3://agbucket02/nyc_citibike_trip_avro/'
iam_role 'arn:aws:iam::yourroleid:role/yourrolename'
FORMAT AS AVRO 's3://agbucket02/jsonpaths.txt';

Avg size of each file - 72.7MB
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019
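
Note that starttime and stoptime are defined as BIGINT in the Avro table above, presumably because the Avro files carry them as numeric (epoch) values rather than timestamps. A minimal sketch of converting them when querying, assuming the values are epoch microseconds (the actual unit depends on how the Avro files were exported):

-- convert the epoch columns to timestamps on the fly (microsecond precision assumed)
select tripduration,
       timestamp 'epoch' + (starttime / 1000000) * interval '1 second' as starttime_ts,
       timestamp 'epoch' + (stoptime / 1000000) * interval '1 second' as stoptime_ts
from citibike_trips_avro
limit 10;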


Wednesday, November 21, 2018

Bulk Load CSV, JSON and Avro Files to Snowflake DB

Summary - In this blog I will load the same data (split into multiple files) in CSV, Avro and JSON formats into Snowflake database tables and compare load timings between a Medium and an XL warehouse.

Details - I have Citi Bike trips data available in CSV, JSON and Avro formats in AWS S3. I will load this data into Snowflake tables without any configuration changes and capture the timings, and also see the impact on load time when I change the warehouse from Medium to XL size (all warehouses had max clusters = 1).
[I am using a Snowflake trial account and have used a warehouse and database with default settings.]
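
The only thing changed between the Medium and XL runs was the warehouse size, which can be switched between loads; a minimal sketch (LOAD_WH is a placeholder for the trial account's warehouse name):

-- resize the warehouse before re-running the COPY commands below
alter warehouse LOAD_WH set warehouse_size = 'MEDIUM';
-- ... run the loads and capture timings, then repeat after ...
alter warehouse LOAD_WH set warehouse_size = 'XLARGE';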

Load timings recorded - (for JSON and Avro I have added together the time to load the variant table and the final table)

File Type | Total Files | Avg File Size (gzipped for JSON, CSV) | Avg Row Count per File | Total Rows Loaded | Time Taken (sec) | Snowflake Warehouse
CSV  | 57 | 15.6 MB | 584544 | 33319019 | 10    | Medium
CSV  | 57 | 15.6 MB | 584544 | 33319019 | 9.2   | XL
JSON | 57 | 18.8 MB | 584544 | 33319019 | 46.84 | Medium
JSON | 57 | 18.8 MB | 584544 | 33319019 | 22.03 | XL
AVRO | 57 | 72.7 MB | 584544 | 33319019 | 34.94 | Medium
AVRO | 57 | 72.7 MB | 584544 | 33319019 | 21.2  | XL

Bulk Load CSV files

CREATE TABLE dwh01.trips.citibike_trips_csv
(
    tripduration INTEGER,
    starttime DATETIME,
    stoptime DATETIME,
    start_station_id INTEGER,
    start_station_name VARCHAR(100),
    start_station_latitude DOUBLE,
    start_station_longitude DOUBLE,
    end_station_id INTEGER,
    end_station_name VARCHAR(100),
    end_station_latitude DOUBLE,
    end_station_longitude DOUBLE,
    bikeid INTEGER,
    usertype VARCHAR(50),
    birth_year VARCHAR(10),
    gender VARCHAR(10)
);

copy into dwh01.trips.citibike_trips_csv
from
(
   select
       t.$1,
       replace(t.$2, 'UTC', ''),
       replace(t.$3, 'UTC', ''),
       t.$4,
       t.$5,
       t.$6,
       t.$7,
       t.$8,
       t.$9,
       t.$10,
       t.$11,
       t.$12,
       t.$13,
       t.$14,
       t.$15
from @dwh01.public.ag_aws_bucket02/nyc_citibike_trip_csv/ t
)
pattern = '.*citibike_trips.*[.]csv[.]gzip'
file_format=(TYPE = CSV COMPRESSION = GZIP FIELD_DELIMITER=',' skip_header=1 null_if=('') field_optionally_enclosed_by='"')

Time taken - 10 seconds Medium Size Warehouse
Time taken - 9.2 seconds XL Size Warehouse
Avg size of each file - 15.6MB (compressed)
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019



Bulk Load JSON files - You will first load the data into a table with a variant column and later move the data from this variant table into the final table.

Create table dwh01.trips.citibike_trips_json_variant
(raw_json variant);

CREATE TABLE dwh01.trips.citibike_trips_json
(
    tripduration INTEGER,
    starttime DATETIME,
    stoptime DATETIME,
    start_station_id INTEGER,
    start_station_name VARCHAR(100),
    start_station_latitude DOUBLE,
    start_station_longitude DOUBLE,
    end_station_id INTEGER,
    end_station_name VARCHAR(100),
    end_station_latitude DOUBLE,
    end_station_longitude DOUBLE,
    bikeid INTEGER,
    usertype VARCHAR(50),
    birth_year VARCHAR(10),
    gender VARCHAR(10)
);

Load json

copy into dwh01.trips.citibike_trips_json_variant
From '@dwh01.public.ag_aws_bucket02/nyc_citybike_json/'
pattern = '.*citibike_trips.*[.]json[.]gzip'
file_format=(TYPE = JSON COMPRESSION = GZIP)

Time taken - 34.5 seconds - Medium Size Warehouse
Time taken - 15.03 seconds - XL Size Warehouse
Avg size of each file - 18.8MB (compressed)
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019

Loading from the JSON variant table into the final table…
insert into dwh01.trips.citibike_trips_json
(tripduration,
starttime,
stoptime,
start_station_id,
start_station_name,
start_station_latitude,
start_station_longitude,
end_station_id,
end_station_name,
end_station_latitude,
end_station_longitude,
bikeid,
usertype,
birth_year,
gender
)
select
   raw_json:tripduration,
   replace(raw_json:starttime, 'UTC', ''),
   replace(raw_json:stoptime, 'UTC', ''),
   raw_json:start_station_id,
   raw_json:start_station_name,
   raw_json:start_station_latitude,
   raw_json:start_station_longitude,
   raw_json:end_station_id,
   raw_json:end_station_name,
   raw_json:end_station_latitude,
   raw_json:end_station_longitude,
   raw_json:bikeid,
   raw_json:usertype,
   raw_json:birth_year,
   raw_json:gender
from dwh01.trips.citibike_trips_json_variant;

Time taken - 12.34seconds Medium Size Warehouse
Time taken - 7 seconds - XL Size Warehouse
Total rowcount loaded - 33319019


Bulk Load AVRO files - You will first load the data into a table with a variant column and later move the data from this variant table into the final table.

Create table dwh01.trips.citibike_trips_avro_variant
(raw_avro variant);

CREATE TABLE dwh01.trips.citibike_trips_avro
(
    tripduration INTEGER,
    starttime DATETIME,
    stoptime DATETIME,
    start_station_id INTEGER,
    start_station_name VARCHAR(100),
    start_station_latitude DOUBLE,
    start_station_longitude DOUBLE,
    end_station_id INTEGER,
    end_station_name VARCHAR(100),
    end_station_latitude DOUBLE,
    end_station_longitude DOUBLE,
    bikeid INTEGER,
    usertype VARCHAR(50),
    birth_year VARCHAR(10),
    gender VARCHAR(10)
);

Load avro -
copy into dwh01.trips.citibike_trips_avro_variant
From '@dwh01.public.ag_aws_bucket02/nyc_citibike_trip_avro/'
pattern = '.*citibike_trips.*[.]avro'
file_format=(TYPE = AVRO)

Time taken - 26.67 seconds - Medium Size Warehouse
Time taken - 13 seconds - XL Size Warehouse
Avg size of each file - 72.7MB
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019

Loading from the Avro variant table into the final table…
insert into dwh01.trips.citibike_trips_avro
(tripduration,
starttime,
stoptime,
start_station_id,
start_station_name,
start_station_latitude,
start_station_longitude,
end_station_id,
end_station_name,
end_station_latitude,
end_station_longitude,
bikeid,
usertype,
birth_year,
gender
)
select
   raw_avro:tripduration,
   replace(raw_avro:starttime, 'UTC', ''),
   replace(raw_avro:stoptime, 'UTC', ''),
   raw_avro:start_station_id,
   raw_avro:start_station_name,
   raw_avro:start_station_latitude,
   raw_avro:start_station_longitude,
   raw_avro:end_station_id,
   raw_avro:end_station_name,
   raw_avro:end_station_latitude,
   raw_avro:end_station_longitude,
   raw_avro:bikeid,
   raw_avro:usertype,
   raw_avro:birth_year,
   raw_avro:gender
from dwh01.trips.citibike_trips_avro_variant;

Time taken - 8.27 seconds  - Medium Size Warehouse
Time taken - 8.2 seconds  - XL Size Warehouse
Total rowcount loaded - 33319019