Google BigQuery is a Cloudbased Data Warehouse. It is not only serverless, but also cost-effective as well as multi-cloud.
- Cost-effective
- Serverless
- Multicloud
Here are the benefits of Google BigQuery.
- Gain insights with real-time and predictive analytics
- Access data and share insights with ease
- Protect your data and operate with trust
Here are the key features of Google BigQuery.
- ML and predictive modeling with BigQuery ML.
- Multicloud data analysis with BigQuery Omni.
- Interactive data analysis with BigQuery BI Engine.
- Geospatial analysis with BigQuery GIS.
You can follow this page to get more details about Google BigQuery. As part of this course, we will see how to get started from the perspective of Interactive data analysis with BigQuery BI Engine as it is closest to the Data Engineering. We will also look into key integrations such as Python, Pandas, and Spark using Dataproc/Databricks.
Let us quickly review CRUD Operations using Google BigQuery. We can perform all standard CRUD Operations on Google BigQuery Tables.
- INSERT
- UPDATE
- DELETE
- MERGE (also known as UPSERT)
Let us go ahead and see how we can create Dataset and an empty table using Google BigQuery.
- Once we login into BigQuery UI, we can create data set under the project. Let us create data set by name retail.
- Make sure to choose Empty table under Create table from.
- Make sure to choose appropriate project id and data set (retail).
- Specify the table name as
orders
. - Enter one column at a time using wizard. You can use the below details as reference.
order_id:INTEGER,
order_date:TIMESTAMP,
order_customer_id: INTEGER,
order_status:STRING
You can also create table using BigQuery Editor in Google Cloud Console.
CREATE TABLE tidy-fort-361710.retail.orders (
order_id INTEGER,
order_date DATE,
order_cusotmer_id INTEGER,
order_status STRING
);
Let us go ahead and Create Empty tables for order_items
and products
in the retail
dataset.
Here are the schema details for order_items
.
order_item_id:INTEGER
order_item_order_id:INTEGER
order_item_product_id:INTEGER
order_item_quantity:INTEGER
order_item_subtotal:FLOAT
order_item_product_price:FLOAT
Here are the schema details for products
.
product_id:INTEGER
product_cateogry_id:INTEGER
product_name:STRING
product_description:STRING
product_price:FLOAT
product_image:STRING
Here is the Python based approach to load data from files in GCS into Google BigQuery Table.
- We need to make sure we instantiate the client.
- Create Job Config using Schema of the table that needs to be populated.
- Use
client.load_table_from_uri
to populate the table from the file under URI using the job config.
You can go through the notebook and make required changes to see if it works as expected or not. You can also validate by running queries against the table using Google BigQuery UI.
As we have gone through the details about loading orders data into table in Google BigQuery, take it as an exercise and load data into order_items
as well as products
.
- You should have been created these empty tables as part of the previous exercise.
Let us go ahead and drop the tables in Google BigQuery so that we can create them again.
Let us go ahead and create table for orders
using Web UI of Google BigQuery. The table will be created using GCS Files.
- After choosing Cloud Storage as source, we can specify column names like this as our data doesn't have header. We need to make sure the data type for
order_date
is defined asTIMESTAMP
.
order_id:INTEGER,
order_date:TIMESTAMP,
order_customer_id: INTEGER,
order_status:STRING
You can also use CREATE EXTERNAL TABLE
command to create table using location in GCS. Here is the link for the syntax to create external tables in Google BigQuery.
CREATE EXTERNAL TABLE retail.orders (
order_id INTEGER,
order_date TIMESTAMP,
order_customer_id INTEGER,
order_status STRING
) OPTIONS (
format = 'CSV',
uris = ['gs://airetail/retail_db/orders/*']
);
Let us go ahead and Create tables for order_items
and products
in the retail
dataset. This time we will be creating tables pointing to GCS Files.
Here are the schema details for order_items
.
order_item_id:INTEGER
order_item_order_id:INTEGER
order_item_product_id:INTEGER
order_item_quantity:INTEGER
order_item_subtotal:FLOAT
order_item_product_price:FLOAT
Here are the schema details for products
.
product_id:INTEGER
product_cateogry_id:INTEGER
product_name:STRING
product_description:STRING
product_price:FLOAT
product_image:STRING
Here are some of the SQL Capabilities of Google BigQuery.
- All Standard SQL Features.
- Robust set of pre-defined functions
- Filtering of the data
- Joins (Inner as well as Outer)
- Standard Aggregations
- Cumulative Aggregations
- Ranking as well as Window Functions
- Ability to process JSON Data.
Here is the query to compute daily product revenue using Google BigQuery. The query covers most of the basic abilities of SQL using Google BigQuery such as Filtering, Joins, Aggregations, etc.
SELECT o.order_date,
oi.order_item_product_id,
round(sum(oi.order_item_subtotal), 2) AS revenue
FROM retail.orders AS o
JOIN retail.order_items AS oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY 1, 2
ORDER BY 1, 3 DESC
Google BigQuery also supports cumulative aggreations and ranking.
- Cumulative Aggregations using daily revenue
WITH daily_revenue AS (
SELECT o.order_date,
round(sum(oi.order_item_subtotal), 2) AS revenue
FROM retail.orders AS o
JOIN retail.order_items AS oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY 1
) SELECT format_date('%Y%m', order_date) AS order_month,
order_date,
revenue,
round(sum(revenue) OVER (
PARTITION BY format_date('%Y%m', order_date)
ORDER BY order_date
), 2) AS revenue_cum
FROM daily_revenue
ORDER BY 2;
- Top 3 Monthly Products by revenue
WITH daily_product_revenue AS (
SELECT o.order_date,
oi.order_item_product_id,
round(sum(oi.order_item_subtotal), 2) AS revenue
FROM retail.orders AS o
JOIN retail.order_items AS oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY 1, 2
) SELECT * FROM (
SELECT format_date('%Y%m', order_date) AS order_month,
order_date,
order_item_product_id,
revenue,
dense_rank() OVER (
PARTITION BY order_date
ORDER BY revenue DESC
) AS drank
FROM daily_product_revenue
) WHERE drank <= 3
ORDER BY 2, 4 DESC
Let us get an overview of Key Integrations with Google BigQuery.
- Integration with GCS
- Integration with External Databases such as Cloud SQL
- Spark Connector on Dataproc (Covered as part of subsequent modules)
- Programming Languages such as Python
Let us get an overview of Python Pandas Integration with Google BigQuery.
- We need to install
pandas
,pandas-gbq
andgoogle-cloud-bigquery
to get data from Google BigQuery table into Pandas Dataframe. - Once all the required libraries installed, we can use
pd.read_gbq
by passing query as argument.
import pandas as pd
query = '''
SELECT order_status, count(*) AS order_count
FROM `itversity-rnd.retail.orders`
GROUP BY 1
ORDER BY 2 DESC
'''
project_id = 'itversity-rnd'
df = pd.read_gbq(query, project_id=project_id)
df
Let us get an overview of Integration of BigQuery with Cloud SQL (Postgresql).
- Click on ADD DATA and choose External data source.
- Use the instance connection name (from Cloud SQL) and credentials to add the external data source.
- Run queries using
EXTERNAL_QUERY
function inFROM
clause. We need to pass the external connection and valid postgres query as arguments toEXTERNAL_QUERY
.
SELECT *
FROM EXTERNAL_QUERY(
"tidy-fort-361710.us.retailpgexternal",
"SELECT * FROM information_schema.tables WHERE table_schema = 'public'"
);
SELECT *
FROM EXTERNAL_QUERY(
"tidy-fort-361710.us.retailpgexternal",
"SELECT order_date, count(*) AS order_count FROM orders GROUP BY 1 ORDER BY 2 DESC"
);
SELECT *
FROM EXTERNAL_QUERY(
"tidy-fort-361710.us.retailpgexternal",
"SELECT * FROM products"
) AS p;
SELECT o.order_date,
oi.order_item_product_id,
p.product_name,
round(sum(oi.order_item_subtotal)) AS revenue
FROM EXTERNAL_QUERY(
"tidy-fort-361710.us.retailpgexternal",
"SELECT * FROM products"
) AS p
JOIN retail.order_items AS oi
ON p.product_id = oi.order_item_product_id
JOIN retail.orders AS o
ON oi.order_item_order_id = o.order_id
GROUP BY 1, 2, 3
ORDER BY 1, 4 DESC;