- The `web/bigquery` directory contains the Snowplow BigQuery web model v1.0.3 structured as a Dataform project. As a first step, copy its contents to the root of your Dataform project.
- Configure the model.
- Set up your environment(s) in `environments.json` and set a schedule or start a new run!
The Snowplow data models are relevant to users who already have a full Snowplow pipeline running and a dataset of web events from Snowplow Trackers loaded in their database. At the moment the Snowplow web data model in Dataform only supports BigQuery.
This repository contains the first version of adapting the new generation of Snowplow's BigQuery web data model v1 to run with Dataform. To keep an exact correspondence with the parent model, v0.1.0 intentionally and explicitly enforces its incrementalization logic and modular structure.
That way the Snowplow Dataform web data model inherits the following key features:
- Extensible modular structure
- Metadata log tables
- Assertions to test your data quality
- Feature flags
You can read more about the structure, logic and technical architecture of the Snowplow web data model in its documentation.
As an overview, the standard web data model runs 4 core modules:
```
.
├── dataform.json
├── definitions
│   ├── assertions
│   ├── standard
│   │   ├── 00_setup
│   │   ├── 01_base
│   │   ├── 02_page_views
│   │   ├── 03_sessions
│   │   └── 04_users
│   └── utils
│       └── declarations.js
├── environments.json
├── includes
│   └── sp.js
├── package.json
└── package-lock.json
```
- `01_base`: The Base module applies incremental logic to the atomic data, and produces deduplicated tables for the Page Views module.
- `02_page_views`: The Page Views module runs the standard web page views model. It takes the `events_staged` table - produced by the Base module - as an input (and also queries atomic data). Its persistent output is the `derived.page_views` table.
- `03_sessions`: The Sessions module runs the standard web sessions model. It takes the `page_views_staged` table - produced by the Page Views module - as an input. Its persistent output is the `derived.sessions` table.
- `04_users`: The Users module runs the standard web users model. It takes the `sessions_userid_manifest_staged` table - produced by the Sessions module - as an input. Its persistent output is the `derived.users` table.
To configure your data model you only need to edit the `sp.js` file in your `includes` directory:
- Schema names
  - `input_schema`: name of the atomic dataset
  - `scratch_schema`: name of the scratch dataset
  - `output_schema`: name of the derived dataset

  ```javascript
  const input_schema = "rt_pipeline_prod1";
  const scratch_schema = "scratch";
  const output_schema = "derived";
  ```
- Dataform specific
  - `model_disabled`: (Default: `false`). Provides a way to disable the model. It affects the 01-main and the 99-complete steps.
  - `assertions_disabled`: (Default: `false`). Whether assertions are run along with the model.
  - `destroy_disabled`: (Default: `true`). Provides a way to drop all tables produced by the model for a full rerun. Note: this does not affect the `datamodel_metadata` table.

  ```javascript
  const model_disabled = false;
  const assertions_disabled = false;
  const destroy_disabled = false;
  ```
- Entropy
  - `entropy`: a string to append to all table names (default: `""`), so you can test without affecting prod tables. Note: the model does not make use of, and is not affected by, Dataform's `tablePrefix`.

  ```javascript
  const entropy = "_test";
  ```
- Start date
  - `start_date`: the start date, used to seed the manifest.
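  As with the other parameters, `start_date` is set as a constant in `sp.js`; the date value below is illustrative:

  ```javascript
  // Illustrative value - set this to the start of your own data
  const start_date = "2021-01-01";
  ```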
- Output
  - `stage_next`: specifies whether each module will update staging tables
  - `skip_derived`: specifies whether each module will skip inserting into production tables

  ```javascript
  const stage_next = { base: true, page_views: true, sessions: true };
  const skip_derived = { page_views: false, sessions: false, users: false };
  ```
- Cleanup
  - `ends_run`: set to `true` for modules with no subsequent modules in the run
  - `cleanup_mode`: one of `debug` (only keeps main tables), `trace` (keeps all tables) or `all` (cleans up everything)

  ```javascript
  const ends_run = { base: false, page_views: false, sessions: false, users: true };
  const cleanup_mode = { base: "all", page_views: "all", sessions: "all", users: "all" };
  ```
- Other parameters
  - `lookback_window_hours`: (default: 6). Period of time (in hours) to look back before the latest event in the manifest, to account for late-arriving data which comes out of order.
  - `days_late_allowed`: (default: 3). Period of time (in days) for which we should include late data. If the difference between the collector timestamps of the session start and a new event is greater than this value, data for that session will not be processed.
  - `session_lookback_days`: (default: 365). Period of time (in days) to limit the scan on the session manifest. Exists to improve performance of the model when there are a lot of sessions. Should be set to as large a number as practical.
  - `update_cadence_days`: (default: 7). Period of time (in days) in the future (from the latest event in the manifest) to look for new events.
  - `upsert_lookback_days`: (default: 30). Period of time (in days) to look back over the target table in order to find rows to delete when committing data to a table.
  - `derived_tstamp_partitioned`: (default: `false`). For use on legacy pipelines whose partition key is the `derived_tstamp`. Note that this affects the incremental logic somewhat, as `derived_tstamp` isn't closely tied to when data was received, so it is recommended to extend the `lookback_window_hours` if the events table is partitioned by `derived_tstamp`.
  - `minimumVisitLength`: (default: 5). The value of `minimumVisitLength` configured in the Snowplow JavaScript tracker.
  - `heartbeat`: (default: 10). The value of `heartbeat` configured in the JavaScript tracker.
  - `ua_bot_filter`: (default: `true`). Whether to filter out bots via a useragent string pattern match.
  - `iab`: (default: `false`). Whether to include data from the IAB enrichment.
  - `ua_parser`: (default: `false`). Whether to include data from the UA Parser enrichment.
  - `yauaa`: (default: `false`). Whether to include data from the YAUAA enrichment.
- Custom staged dependencies
  - `custom_staged_dependencies`: this array parameter allows you to ensure that custom steps which depend on the standard model's `staged` tables run before the complete steps that truncate them. You can read more in Customization below.
Now your model is configured and you can finish by setting up your Dataform environment(s).
Following the customization plugin principles of the parent v1 BigQuery web model, it is also possible to customize the Dataform model and leverage the incrementalization logic of the standard modules.
In the `definitions/custom` subdirectory you can find the same custom example featured in the v1 BigQuery web model, implemented for Dataform. In the directory structure you can see the custom modules parallel to the standard ones:
```
├── dataform.json
├── definitions
│   ├── assertions
│   ├── custom
│   ├── standard
│   └── utils
├── environments.json
├── includes
│   └── sp.js
├── package.json
└── package-lock.json
```
To complement the standard model with your custom incremental module, you only need to know which modules' `_staged` tables it is meant to consume. Those modules' commit actions will be the dependencies of your custom module.
For example, the featured custom module consumes both the `events_staged` and the `page_views_staged` tables, which are outputs of the `01_base` and `02_page_views` modules respectively. Since the `02_page_views` module already depends on the `01_base` module, it suffices to make the first action of the custom module depend only on the last step of the Page Views module (`07_commit_page_views`):
```js
config {
  type: "operations",
  name: "01_link_clicks",
  disabled: sp.model_disabled,
  dependencies: [
    "07_commit_page_views"
  ],
  hermetic: true,
  hasOutput: false,
  tags: ["sp_web", "sp_custom"]
}
```
Since the `99_complete` steps of the standard model complete the incremental logic by truncating the `staged` tables, you also need to ensure that the part of the custom module that depends on those tables runs before the `99_complete` steps. To do so, you only need to add the dependent step(s) to the `custom_staged_dependencies` array defined in the `sp.js` file:
```javascript
const custom_staged_dependencies = [
  "02_channel_engagement"
];
```
As you can see, for our custom example the last step depending on `staged` tables is `02_channel_engagement`, which gets added to `custom_staged_dependencies`.
Further notes:
- The custom module follows the incrementalization logic of the standard modules by having main, complete and destroy steps of its own:
  - The main steps of the custom module run along with the standard model, which is why in their config we set:
    `disabled: sp.model_disabled,`
  - The complete step of the custom module needs to also run for a complete destruction of the whole model, which is why in `custom/99_custom_complete.sqlx` we set:
    `disabled: sp.model_disabled && sp.destroy_disabled,`
  - Similarly, the destroy step of the custom module runs along with the destroy steps of the standard model:
    `disabled: sp.destroy_disabled,`
We generally suggest that any customization:
- Follows the same guidelines and best practices mentioned in Snowplow's v1 web model.
- Interacts with the standard model only through the `sp.js` file, avoiding changing the standard model's files directly.
We welcome all ideas, questions and contributions!
For support requests, please use our community support Discourse forum.
If you find a bug, please report an issue on GitHub.
The Snowplow Incubator dataform-data-models project is Copyright 2021 Snowplow Analytics Ltd.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.