
[Feature] Historical Backfilling and Rebasing of Snapshots #9892

Open · 3 tasks done
walker-philips opened this issue Apr 11, 2024 · 11 comments

Labels: enhancement (New feature or request), snapshots (Issues related to dbt's snapshot functionality), triage

Comments

@walker-philips

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

My understanding: Snapshots are intended to record historical data safely and avoid accidental deletion of data. They are driven by timestamps or by data-driven deltas specified by the user. Data-driven snapshots cannot "replay" history, so a snapshot query should return only a single row per desired unit of data. For example, a check-strategy snapshot query of bank account balances should show only the current day's balance per account, rather than several days' worth of balances per account.
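
For reference, such a check-strategy snapshot might be configured like this (a minimal sketch; the banking.accounts source and the column names are hypothetical):

{% snapshot bank_account_balances_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='account_id',
        strategy='check',
        check_cols=['balance']
    )
}}

-- The check strategy records a new row only when balance changes, so each
-- account carries one row per distinct balance period, not one per day.
select account_id, balance
from {{ source('banking', 'accounts') }}

{% endsnapshot %}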

Users of dbt may possess historical data of lower (or the same) quality from before a snapshot was formally initialized. Currently, there is no way to backfill a snapshot. Also, a user may change their approach to snapshotting data, causing mass, unnecessary snapshotting of rows; or business rules may change, reducing the depth of snapshots needed. There is no way to "condense" a snapshot by removing rows of now-duplicate data and updating the dbt_valid_from/dbt_valid_to columns so the history remains linear.

Describe alternatives you've considered

I have a rough working approach utilizing a macro, a template model, a controlling model that coordinates the backfill of several snapshots, and pre/post hooks to run the necessary update statements. It does not attempt to alter the existing snapshot; instead, you can feed the existing snapshot as a backfill to a new snapshot model. To promote data survivability, a user can choose to save the backfill data as a table to allow recovery.

Who will this benefit?

Any user with historical data they would like to merge into a snapshot, or anyone who wants to rebase/condense an existing snapshot due to a change in snapshot strategy/methodology.

Are you interested in contributing this feature?

Yes, I have an example of my imperfect approach

Anything else?

No response

walker-philips added the enhancement (New feature or request) and triage labels Apr 11, 2024
@walker-philips (Author)

@dbeatty10 Let me know if this is sufficient for lining up the feature. Happy to provide my scripts as a point of reference.

@dbeatty10 (Contributor)

Thanks for opening this @walker-philips 🤩

Yes, any additional points of reference (such as your scripts) would be great to see as well.

dbeatty10 added the snapshots (Issues related to dbt's snapshot functionality) label Apr 11, 2024
@walker-philips (Author)

@dbeatty10 Where would be the best place to add those? I didn't want to fork just for a few files, and I also cannot add my full project due to compliance reasons.

@dbeatty10 (Contributor)

Here are a few ideas:

@github-actions (bot)

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the stale (Issues that have gone stale) label Jul 11, 2024
@github-actions (bot)

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.


github-actions bot closed this as not planned (Won't fix, can't repro, duplicate, stale) Jul 19, 2024
@jemimamace

Voicing my support for this one; I recently had a similar use case where snapshots needed to be 'backfilled'.

@dbeatty10 (Contributor)

@walker-philips we're waiting on additional points of reference (such as your scripts) to assess this feature request further. Are you still planning on sharing those?

We've got an initiative right now related to snapshots, so it would be good timing for us to consider your ideas if you are able to provide more detail. You can read more here:

Without additional information on what this feature request would look like, this issue is likely to go stale again.

dbeatty10 reopened this Jul 30, 2024
dbeatty10 added the awaiting_response label and removed the triage and stale labels Jul 30, 2024
@walker-philips (Author)

@dbeatty10

So here are my assumptions and approach.
An existing snapshot implies there is:

  • a key of some kind
  • a dbt_valid_from/timestamp value that gives some level of ordering to the data
  • some rules regarding what will trigger a snapshot (a column or columns updating for a timestamp strategy, or a subset of columns changing for a check strategy)
  • a place (or places) where the snapshot originated from
  • a current structure of the snapshot (table schema/column types)

My assumptions:

  • The end user is going to have to transform the historical/deprecated snapshot data to some degree to adhere to the current/desired structure of the snapshot
  • The user may combine the existing snapshot with as many other data sources as needed to create a historical dataset
  • The user can use existing dbt_valid_from values and estimate validity start times for older data
  • The user will prefer to avoid over-snapshotting data, so they can use GROUP BYs to slim the data down to rows that accurately represent the level of snapshotting desired
  • dbt_valid_to is implicitly tied to the dbt_valid_from value of the next row for the same unique_key (see the SQL sketch after this list)
  • People would like to avoid the costly effort of backfilling on every dbt build, so we should save the transformed backfill dataset as an audit trail and as an optimization favoring speed over storage
  • dbt model names are unique, so we can use hacky dynamic model naming to handle creating these supplementary models
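
As a SQL sketch of that dbt_valid_to assumption (table and column names are illustrative, not from the attached scripts):

-- Each version of a key is valid until the next version begins;
-- a NULL dbt_valid_to marks the currently valid row.
select
    *,
    lead(dbt_valid_from) over (
        partition by unique_key
        order by dbt_valid_from
    ) as dbt_valid_to
from combined_backfill_and_snapshot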

The approach works as follows:
The render_backfills model takes a list of the existing snapshot model names along with their unique key(s), as a dictionary. It iterates through them to coordinate the creation and loading of backfills by calling the apply_snapshot_backfill() macro. This loads data into the existing snapshot table from the newly created backfill data source.
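
A sketch of that coordination (illustrative only; the real logic is in the attached render_backfills.txt, and the dictionary contents here are made up):

-- models/render_backfills.sql
{% set snapshots_to_backfill = {
    'my_table_name': 'id',
    'another_snapshot': 'account_id'
} %}

{% for snapshot_name, unique_key in snapshots_to_backfill.items() %}
    {% do apply_snapshot_backfill(snapshot_name, unique_key) %}
{% endfor %}

-- Materialized as a table so that the post hook is forced to run.
select 1 as backfill_coordinator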

For each snapshot to be backfilled, we have to create a backfill model that transforms all the historical data into a form that can be unioned with the current snapshot table. Some criteria always have to be met, namely the designation of a unique_key and a dbt_valid_from stand-in.
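
For example, a backfill model might reduce to something like this (the legacy source and its columns are hypothetical):

-- models/my_table_name_backfill.sql
select
    id as unique_key,               -- required: the snapshot's key
    recorded_at as dbt_valid_from,  -- required: a validity-start stand-in
    status,
    balance
from {{ source('legacy', 'account_history') }}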

I utilize pre/post hooks to execute INSERT statements that save the backfill data as its own table disconnected from the dbt framework. This helps avoid circular dependencies.
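
The hook wiring could look something like this (a sketch; the audit schema and target table are placeholders and assumed to already exist):

{{
    config(
        materialized='table',
        -- Persist the backfill outside dbt-managed relations so it survives
        -- future runs and can be used for recovery.
        post_hook="insert into audit.my_table_name_backfill_saved select * from {{ this }}"
    )
}}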

The dbt_backfill_applier_model() is simply boilerplate code that helps determine whether to reuse an instantiated backfill dataset or make it for the first time.
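
In spirit, that check can be done with adapter.get_relation (a sketch, not the attached dbt_backfill_applier_model.txt; names are illustrative):

-- Reuse the saved backfill table if it exists; otherwise build it fresh.
{% set backfill_relation = adapter.get_relation(
        database=target.database,
        schema='audit',
        identifier='my_table_name_backfill_saved') %}

{% if backfill_relation is not none %}
select * from {{ backfill_relation }}
{% else %}
select * from {{ ref('my_table_name_backfill') }}
{% endif %}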

Lastly, render_backfills deletes itself, as it's essentially useless from a database asset standpoint. It has to be a table in order to force the triggering of the post hook and the execution of the run_query command.

Admittedly, this feels super hacky. But the key takeaways I have are:

  1. The transformation of historical datasets into snapshot merge-worthy models seems to be a great use of dbt.
  2. Slimming/expanding datasets to new snapshot rules doesn't seem inherently complex.
  3. There should be no limitation on what datasets an end user may choose to merge into a snapshot. They do so at their own risk and should have proper backups to recreate their snapshots if all goes wrong.
  4. We should avoid repeatedly running what could be a costly operation to assemble these backfill datasets.
  5. The complexity of setting up my approach has much room for improvement.
  6. It should also be possible to do snapshot pruning by feeding the current snapshot into itself; however, I think the inherent circular dependency may require hardcoding model names rather than using ref().

apply_snapshot_backfill.txt
dbt_backfill_applier_model.txt
my_table_name_backfill_apply.txt
render_backfills.txt
column_intersect.txt
my_table_name.txt

@walker-philips (Author)

walker-philips commented Aug 9, 2024

@dbeatty10

Realized I was missing something like this to handle consecutive, repeating data. If you just group by the check_cols (for a check strategy, at least), you risk eliminating later re-occurrences of a value entirely, rather than only collapsing consecutive duplicates.

{% macro identify_consecutive_duplicates_cte(source_cte_name, check_cols, pk, date_col) %}
-- Concatenate the check columns into a single '-'-delimited fingerprint, then
-- compare each row's fingerprint with the previous row's (per key, in date
-- order). A match means the row is an unchanged, consecutive duplicate.
select *,
    (case
        when lag(concat_ws('-', {{ check_cols | join('::text, ') }}::text))
                over (partition by {{ pk }} order by {{ date_col }} asc)
            = concat_ws('-', {{ check_cols | join('::text, ') }}::text)
        then true
        else false
    end) as is_duplicate
from {{ source_cte_name }}
{%- endmacro %}

You can then filter away these consecutive rows to prune snapshots and historical data by adding where is_duplicate = false.
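
For example, wrapping the macro output in a CTE (model and column names are illustrative):

with raw_history as (
    select * from {{ ref('my_table_name_backfill') }}
),
flagged as (
    {{ identify_consecutive_duplicates_cte('raw_history', ['status', 'balance'], 'id', 'dbt_valid_from') }}
)
select *
from flagged
where is_duplicate = false  -- drop rows identical to their immediate predecessor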
