Comparing Different Approaches to Querying External Data in BigQuery
External table, BigQuery table and partitioned BigQuery table.
BigQuery provides multiple methods to query data stored in external tables (documentation: link). For the same data, these approaches have different billing implications. Here, I compare the following methods of querying external data with BigQuery:
External data (stored as csv.gz files in GCS) queried as an external table in BigQuery.
External data (from csv.gz files in GCS) imported as a non-partitioned BigQuery table and queried as a native table in BigQuery.
External data (from csv.gz files in GCS) imported as a partitioned BigQuery table and queried as a native table in BigQuery.
External data (stored as parquet files in GCS) queried as an external table in BigQuery.
Also, for the parquet table, there can be different compression methods. Here, I compare gzip, snappy, and brotli (the compression methods available in pandas' to_parquet method).
For this test, I used the NY taxi for-hire vehicle (FHV) rides data available here: link. I used all data from 2019. Data from the source was ingested, transformed (when applicable - csv-to-parquet), and loaded to a GCS bucket (EL pipeline: link). For the csv-to-parquet conversion step, snappy compression was the fastest. Order of compression speed (fastest to slowest):
snappy (~2.75 s) > gzip (~3.5 s) > brotli (~4.5 s)
(EL pipeline runtime measured on a Google Cloud e2-standard-4 VM instance)
BigQuery charges each query based on the data it scans (link), and depending on how the external data is read, the amount of data scanned differs. There are two major ways of querying external data:
Load external data in BigQuery as BigQuery table.
Query data directly from external table.
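The two methods above can be sketched with the google-cloud-bigquery client library. This is an illustrative sketch, not the pipeline's actual code; the bucket, project, dataset, and table names are hypothetical, and running it requires GCP credentials:

```python
from google.cloud import bigquery

client = bigquery.Client()
uri = "gs://my-bucket/fhv/fhv_tripdata_2019-*.csv.gz"  # hypothetical path

# Method 1: load the external data into a native BigQuery table.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
)
client.load_table_from_uri(
    uri, "my_dataset.fhv_native", job_config=load_config
).result()

# Method 2: define an external table that reads the GCS files in place.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [uri]
external_config.autodetect = True
table = bigquery.Table("my-project.my_dataset.fhv_external")
table.external_data_configuration = external_config
client.create_table(table)
```

With method 1, queries scan BigQuery's columnar storage; with method 2, every query re-reads the files in GCS, which is what drives the billing differences measured below.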
Each of these two methods has multiple variations. External data can be loaded into BigQuery as either an unpartitioned or a partitioned table, and external data can be queried directly whether it is stored in csv or parquet format. When I compared these options for the same data, billed bytes ranged from 24 MB to 2.52 GB for the following query:
SELECT
DISTINCT affiliated_base_number
FROM
source_table
WHERE
pickup_datetime BETWEEN '2019-03-01'
AND '2019-03-31';
24 MB to 2.52 GB billed bytes - quite a large dynamic range indeed!
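Billed bytes can be checked before running a query with a dry run. A sketch using the bq CLI (the dataset and table names are placeholders); dry runs validate the query and report the bytes that would be processed without incurring charges:

```shell
bq query --use_legacy_sql=false --dry_run \
  "SELECT DISTINCT affiliated_base_number
   FROM my_dataset.source_table
   WHERE pickup_datetime BETWEEN '2019-03-01' AND '2019-03-31'"
```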
Summary comparison of different methods to query external data using BigQuery.
*For the imported partitioned BigQuery table, data was partitioned by pickup_datetime.
Overall conclusions:
The amount of data scanned is significantly lower for partitioned BigQuery tables (~30x lower in our case).
Querying an external table is much more efficient for parquet files than for csv files (~50x fewer bytes scanned).
Recommendations:
BigQuery tables - partition data with the right partition keys.
External tables - store external data as parquet files.
Parquet compression - the tested compression methods are equivalent from BigQuery's perspective. However, the compressions themselves differ - snappy is the fastest (but the compressed file(s) are slightly larger than with the other compression methods)*.
*ORC gives similar results from BigQuery's perspective (42 MB billed bytes), but the generated ORC files are much larger (539.01 MB for parquet vs. 2.68 GB for ORC).
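The first two recommendations can be combined in a single statement. A minimal DDL sketch (table names hypothetical) that creates a partitioned native table directly from a parquet-backed external table:

```
CREATE TABLE my_dataset.fhv_partitioned
PARTITION BY DATE(pickup_datetime) AS
SELECT * FROM my_dataset.fhv_external;
```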
GitHub Repository