Showing posts with label bigquery load timings for avro. Show all posts
Showing posts with label bigquery load timings for avro. Show all posts

Thursday, December 13, 2018

Bulk Load csv json avro files to Google BigQuery

Summary - In this blog i will load same data (which is split into multiple files) from csv, avro and json format into google bigquery table and capture load timings.

Details - I have citibike trips data available in csv, json and avro format in google cloud storage GS.

Load timings recorded - *(no other load was running on bigquery)


File TypeTotal FilesAvg File Size
(gziped for json, csv)
Avg Row
count in
file
Total Rows
loaded
Time
Taken
(Sec)
CSV5715.6MB5845443331901968
JSON5718.8MB5845443331901988
AVRO5772.7MB5845443331901988


Bulk Load CSV files
launch google cloudshell and run below commands

bq  load --source_format=CSV --autodetect pristine-nomad-222804:nyc_data.citibike_trips_csv gs://ag_nycdata/citibike_trips/citibike_trips*.csv.gzip

Avg size of each file - 15.6MB (compressed)
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019


Bulk Load JSON files -
bq  load --source_format=NEWLINE_DELIMITED_JSON --autodetect pristine-nomad-222804:nyc_data.citibike_trips_json gs://ag_nycdata/citibike_trips/citibike_trips*.json.gzip


Avg size of each file - 18.8MB (compressed)

Avg rowcount in each file - 584544

Total rowcount loaded - 33319019



Bulk Load AVRO files - 
bq  load --source_format=AVRO --autodetect pristine-nomad-222804:nyc_data.citibike_trips_avro gs://ag_nycdata/citibike_trips/citibike_trips*.avro


Avg size of each file - 72.7MB
Avg rowcount in each file - 584544
Total rowcount loaded - 33319019