BigQuery Stored Procedures to Automate Snapshots at Dataset Level
By Rohan Paithankar & Bihag Kashikar | @intelia | November 6

Introduction
With the growing use of data warehouse automation tools, the focus of data engineers is shifting towards DataOps. This includes designing robust pipelines to deliver enhancements without the fear of breaking downstream analytics workflows. An important aspect of this process is the snapshot and restore process while pushing changes to datasets.
Google’s white-paper on Continuous Data Integration introduces snapshot/clone as one of the best practices in the deployment workflow.
With the above as the motivation, we began work on one such use-case in BigQuery but found that it only offers snapshots at Table-level. Our challenge was to create snapshots for a defined number of Datasets with a varying number of tables with the option to automate it as a part of the deployment workflow. To overcome this challenge, we leveraged stored procedures in BigQuery.
Solution
BigQuery provides system-defined metadata views called INFORMATION_SCHEMA for every dataset. We utilise it to obtain the list of tables in a particular dataset and assign it to a variable tables. Refer to the query below:
We then iterate through the array returned by the above query using a simple for-loop:
For every value in the list of tables, we
- Create a fully qualified snapshot table name snapshot_name with a target snapshot_dataset_id.
- Obtain the fully qualified table name table_name from the loop variable table.value.
- Run a dynamic SQL query using EXECUTE_IMMEDIATE to pass the above variables as parameters.
Note: The expiration timestamp has been set to 14 days from the snapshot creation timestamp but can be modified as per the use case.
Final stored procedure:
The stored procedures can also be deployed using Terraform and called using any CI/CD tool or script using bq CLI.
Watch out for another blog on embedding these steps in a CI/CD pipeline.
This is just one of the solutions to automate Dataset-level snapshots and it works well for:
- Limited number of datasets with increasing number of tables
- Teams that are SQL-oriented
- Scheduled execution of snapshots
The above solution only utilises a single service, i.e, BigQuery and hence has lesser solution overhead.
To conclude, it won’t be a surprise if Google soon releases Dataset-level snapshots out of the box.