
[Feature] Historical Backfilling and Rebasing of Snapshots #9892

Open · 3 tasks done
walker-philips opened this issue Apr 11, 2024 · 11 comments

Labels: enhancement (New feature or request), snapshots (Issues related to dbt's snapshot functionality), triage

Comments

@walker-philips

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

My understanding: Snapshots are intended to record historical data safely and avoid accidental deletion of data. They are driven by timestamps or by data-driven deltas specified by the user. Data-driven snapshots cannot "replay" history, so a snapshot query should return only a single row per desired unit of data. For example, a check-strategy snapshot query of bank account balances should show only the current day's balance per account, rather than several days' worth of balances per account.
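
For reference, such a check-strategy snapshot might be configured like this (a minimal sketch; the banking.accounts source and the column names are hypothetical):

{% snapshot bank_account_balances_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='account_id',
        strategy='check',
        check_cols=['balance']
    )
}}

-- The check strategy records a new row only when balance changes, so each
-- account carries one row per distinct balance period, not one per day.
select account_id, balance
from {{ source('banking', 'accounts') }}

{% endsnapshot %}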

Users of dbt may possess historical data of lower (or the same) quality from before a snapshot was formally initialized. Currently, there is no way to backfill a snapshot. Also, a user may change their approach to snapshotting data, causing mass, unnecessary snapshotting of rows; or business rules may change, reducing the depth of snapshots needed. There is no way to "condense" a snapshot by removing rows of now-duplicate data and updating the dbt_valid_from/dbt_valid_to columns so the history remains linear.

Describe alternatives you've considered

I have a rough working approach utilizing a macro, a template model, a controlling model that coordinates the backfill of several snapshots, and pre/post hooks to run the necessary update statements. It does not attempt to alter the existing snapshot; instead, you can feed the existing snapshot as a backfill to a new snapshot model. To promote data survivability, a user can choose to save the backfill data as a table to allow recovery.

Who will this benefit?

Any user with historical data they would like to merge into a snapshot, or anyone who wants to rebase/condense an existing snapshot due to a change in snapshot strategy/methodology.

Are you interested in contributing this feature?

Yes, I have an example of my imperfect approach

Anything else?

No response

walker-philips added the enhancement (New feature or request) and triage labels Apr 11, 2024
@walker-philips (Author)

@dbeatty10 Let me know if this is sufficient for lining up the feature. Happy to provide my scripts as a point of reference.

@dbeatty10 (Contributor)

Thanks for opening this @walker-philips 🤩

Yes, any additional points of reference (such as your scripts) would be great to see as well.

dbeatty10 added the snapshots (Issues related to dbt's snapshot functionality) label Apr 11, 2024
@walker-philips (Author)

@dbeatty10 Where would be the best place to add those? I didn't want to fork just for a few files, and I also cannot add my full project due to compliance reasons.

@dbeatty10 (Contributor)

Here are a few ideas:

@github-actions (bot)

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions bot added the stale (Issues that have gone stale) label Jul 11, 2024
@github-actions (bot)

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.


github-actions bot closed this as not planned (Won't fix, can't repro, duplicate, stale) Jul 19, 2024
@jemimamace

Voicing my support for this one; I recently had a similar use case where snapshots needed to be 'backfilled'.

@dbeatty10 (Contributor)

@walker-philips we're waiting on additional points of reference (such as your scripts) to assess this feature request further. Are you still planning on sharing those?

We've got an initiative right now related to snapshots, so it would be good timing for us to consider your ideas if you are able to provide more detail. You can read more here:

Without additional information on what this feature request would look like, this issue is likely to go stale again.

dbeatty10 reopened this Jul 30, 2024
dbeatty10 added the awaiting_response label and removed the triage and stale labels Jul 30, 2024
@walker-philips (Author)

@dbeatty10

So here are my assumptions and approach.
An existing snapshot implies there is:

  • a key of some kind
  • a dbt_valid_from/timestamp value that gives some level of ordering to the data
  • some rules regarding what will trigger a snapshot (a column or columns updating for a timestamp strategy, or a subset of columns changing for a check strategy)
  • a place (or places) where the snapshot originated from
  • a current structure of the snapshot (table schema/column types)

My assumptions:

  • The end user is going to have to transform the historical/deprecated snapshot data to some degree to adhere to the current/desired structure of the snapshot
  • The user may combine the existing snapshot with as many other data sources as needed to create a historical dataset
  • The user can use existing dbt_valid_from values and estimate validity start times for older data
  • The user will prefer to avoid over-snapshotting data, so they can use GROUP BYs to slim the data down to rows that accurately represent the level of snapshotting desired
  • dbt_valid_to is implicitly tied to the dbt_valid_from value of the next row for the same unique_key (see the SQL sketch after this list)
  • People would like to avoid the costly effort of backfilling on every dbt build, so we should save the transformed backfill dataset as an audit trail and as an optimization favoring speed over storage
  • dbt model names are unique, so we can use hacky dynamic model naming to handle creating these supplementary models
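
As a SQL sketch of that dbt_valid_to assumption (table and column names are illustrative, not from the attached scripts):

-- Each version of a key is valid until the next version begins;
-- a NULL dbt_valid_to marks the currently valid row.
select
    *,
    lead(dbt_valid_from) over (
        partition by unique_key
        order by dbt_valid_from
    ) as dbt_valid_to
from combined_backfill_and_snapshot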

The approach works as follows:
The render_backfills model takes a list of the existing snapshot model names along with their unique key(s), as a dictionary. It iterates through them to coordinate the creation and loading of backfills by calling the apply_snapshot_backfill() macro. This loads data into the existing snapshot table from the newly created backfill data source.
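
A sketch of that coordination (illustrative only; the real logic is in the attached render_backfills.txt, and the dictionary contents here are made up):

-- models/render_backfills.sql
{% set snapshots_to_backfill = {
    'my_table_name': 'id',
    'another_snapshot': 'account_id'
} %}

{% for snapshot_name, unique_key in snapshots_to_backfill.items() %}
    {% do apply_snapshot_backfill(snapshot_name, unique_key) %}
{% endfor %}

-- Materialized as a table so that the post hook is forced to run.
select 1 as backfill_coordinator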

For each snapshot to be backfilled, we have to create a backfill model that transforms all the historical data into a form that can be unioned with the current snapshot table. Some criteria always have to be met, namely the designation of a unique_key and a dbt_valid_from stand-in.
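
For example, a backfill model might reduce to something like this (the legacy source and its columns are hypothetical):

-- models/my_table_name_backfill.sql
select
    id as unique_key,               -- required: the snapshot's key
    recorded_at as dbt_valid_from,  -- required: a validity-start stand-in
    status,
    balance
from {{ source('legacy', 'account_history') }}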

I utilize pre/post hooks to execute INSERT statements that save the backfill data as its own table disconnected from the dbt framework. This helps avoid circular dependencies.
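
The hook wiring could look something like this (a sketch; the audit schema and target table are placeholders and assumed to already exist):

{{
    config(
        materialized='table',
        -- Persist the backfill outside dbt-managed relations so it survives
        -- future runs and can be used for recovery.
        post_hook="insert into audit.my_table_name_backfill_saved select * from {{ this }}"
    )
}}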

The dbt_backfill_applier_model() is simply boilerplate code that helps determine whether to reuse an instantiated backfill dataset or make it for the first time.
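
In spirit, that check can be done with adapter.get_relation (a sketch, not the attached dbt_backfill_applier_model.txt; names are illustrative):

-- Reuse the saved backfill table if it exists; otherwise build it fresh.
{% set backfill_relation = adapter.get_relation(
        database=target.database,
        schema='audit',
        identifier='my_table_name_backfill_saved') %}

{% if backfill_relation is not none %}
select * from {{ backfill_relation }}
{% else %}
select * from {{ ref('my_table_name_backfill') }}
{% endif %}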

Lastly, render_backfills deletes itself, as it's essentially useless from a database asset standpoint. It has to be a table in order to force the triggering of the post hook and the execution of the run_query command.

Admittedly, this feels super hacky. But the key takeaways I have are:

  1. The transformation of historical datasets into snapshot merge-worthy models seems to be a great use of dbt.
  2. Slimming/expanding datasets to new snapshot rules doesn't seem inherently complex.
  3. There should be no limitation on what datasets an end user may choose to merge into a snapshot. They do so at their own risk and should have proper backups to recreate their snapshots if all goes wrong.
  4. We should avoid repeatedly running what could be a costly operation to assemble these backfill datasets.
  5. The complexity of setting up my approach has much room for improvement.
  6. It should also be possible to do snapshot pruning by feeding the current snapshot into itself; however, I think the inherent circular dependency may require hardcoding model names rather than using ref().

apply_snapshot_backfill.txt
dbt_backfill_applier_model.txt
my_table_name_backfill_apply.txt
render_backfills.txt
column_intersect.txt
my_table_name.txt

@walker-philips (Author)

walker-philips commented Aug 9, 2024

@dbeatty10

Realized I was missing something like this to handle consecutive, repeating data. If you just group by the check_cols (for a check strategy, at least), you risk eliminating later re-occurrences of a value entirely, rather than only collapsing consecutive duplicates.

{% macro identify_consecutive_duplicates_cte(source_cte_name, check_cols, pk, date_col) %}
-- Concatenate the check columns into a single '-'-delimited fingerprint, then
-- compare each row's fingerprint with the previous row's (per key, in date
-- order). A match means the row is an unchanged, consecutive duplicate.
select *,
    (case
        when lag(concat_ws('-', {{ check_cols | join('::text, ') }}::text))
                over (partition by {{ pk }} order by {{ date_col }} asc)
            = concat_ws('-', {{ check_cols | join('::text, ') }}::text)
        then true
        else false
    end) as is_duplicate
from {{ source_cte_name }}
{%- endmacro %}

You can then filter away these consecutive rows to prune snapshots and historical data by adding where is_duplicate = false.
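
For example, wrapping the macro output in a CTE (model and column names are illustrative):

with raw_history as (
    select * from {{ ref('my_table_name_backfill') }}
),
flagged as (
    {{ identify_consecutive_duplicates_cte('raw_history', ['status', 'balance'], 'id', 'dbt_valid_from') }}
)
select *
from flagged
where is_duplicate = false  -- drop rows identical to their immediate predecessor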
