
[ETL-659] Snowflake git integration #123

Merged · 8 commits · Jul 15, 2024

Conversation

@philerooski (Contributor) commented Jul 3, 2024

Deploys Snowflake objects to a specific environment as part of our CI/CD process. An environment is analogous to a database. We assume that account-level objects have already been created.

The following objects are created:

  • RECOVER_GIT_REPOSITORY git repository
  • RECOVER_{environment} database (I will refer to this simply as the RECOVER database from here on, for simplicity's sake)
  • PARQUET schema
  • PARQUET_FORMAT file format
  • PARQUET_S3 external stage
  • A table for each Parquet dataset in our external stage (manually defined)

Deployment logic and object DDL are organized as a hierarchy:

snowflake/objects
└── database
    └── recover
        └── schema
            └── parquet
                ├── file_format
                ├── stage
                └── table

Almost* every level in this hierarchy has a deploy.sql which will deploy all child objects with respect to the current directory.

* Due to technical limitations, there is no functioning deploy.sql file under snowflake/objects/database, which would have deployed every database (we only have one database in our deployment, currently).

snowflake/objects
├── database
│   ├── deploy.sql
│   └── recover
│       ├── deploy.sql
│       └── schema
│           ├── deploy.sql
│           └── parquet
│               ├── deploy.sql
│               ├── file_format
│               │   ├── deploy.sql
│               │   └── parquet_format.sql
│               ├── stage
│               │   ├── deploy.sql
│               │   └── parquet_s3.sql
│               └── table
│                   ├── deploy.sql
│                   ├── enrolledparticipants_customfields_symptoms_parquet.sql
│                   ├── enrolledparticipants_customfields_treatments_parquet.sql
│                   ├── enrolledparticipants_parquet.sql
│                   ├── fitbitactivitylogs_parquet.sql
│                   ├── fitbitdailydata_parquet.sql
│                   ├── ...
└── deploy.sql

For example, the file located at snowflake/objects/database/recover/deploy.sql will deploy all objects under the RECOVER database, and snowflake/objects/database/recover/schema/parquet/deploy.sql will deploy the file formats, stages, and tables under the PARQUET schema.

The child objects, that is, the schemas, file formats, stages, tables – anything that is not a database – are defined in such a way as to be agnostic to their parent context. For example, snowflake/objects/database/recover/schema/parquet/deploy.sql will deploy the PARQUET schema and all its child objects to whichever database your Snowflake user assumes. There is nothing in the SQL which restricts the PARQUET schema to be created within the RECOVER database. Likewise, the tables can be deployed to any schema, although their DDL is specific to the columns in our Parquet datasets.
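To make the cascade concrete, a schema-level deploy.sql looks roughly like the following. This is a sketch rather than the literal file contents, and it assumes EXECUTE IMMEDIATE FROM resolves relative paths against the location of the calling script:

/*
  Sketch of snowflake/objects/database/recover/schema/parquet/deploy.sql:
  create the PARQUET schema in whichever database is currently in use,
  then hand off to the child deploy.sql scripts.
*/
CREATE SCHEMA IF NOT EXISTS PARQUET;
USE SCHEMA PARQUET;
EXECUTE IMMEDIATE FROM './file_format/deploy.sql';
EXECUTE IMMEDIATE FROM './stage/deploy.sql';
EXECUTE IMMEDIATE FROM './table/deploy.sql';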

The entrypoint for deployment is snowflake/objects/deploy.sql. When a commit is pushed to a branch, our CI/CD process will instantiate all Snowflake objects in an environment-specific database named after your branch, e.g., RECOVER_ETL_659. Details on how to manually deploy are provided in the file itself.
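As a rough sketch of what such a manual deployment might look like from a Snowflake worksheet (the branch name and template variable below are hypothetical, and snowflake/objects/deploy.sql remains the authoritative reference; this assumes template variables are passed with an EXECUTE IMMEDIATE FROM ... USING clause):

-- Make sure the git repository object reflects the latest commits on the remote
ALTER GIT REPOSITORY recover_git_repository FETCH;

-- Deploy everything under snowflake/objects into an environment-specific database
EXECUTE IMMEDIATE FROM @recover_git_repository/branches/etl-659/snowflake/objects/deploy.sql
  USING (environment => 'ETL_659');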

Implementation details

  • This PR is an incomplete solution to Snowflake deployment. In a production-ready deployment configuration, we would already have a production database with tables containing data from which we could easily derive tables to deploy to a dev environment. Lacking that, we define our tables from scratch and don't yet load any data into them. Having the ability to create tables from scratch is a critical component for doing reproducible deployments, but it's not an effective production-level deployment strategy. For what work remains to be done to have a fully-featured deployment, see this comment.
  • Objects are created in a few different ways (see the sketch after this list):
    • CREATE OR REPLACE will destroy any existing object and create a new one in its place. This is a powerful, declarative way of enforcing object configurations, but it's not always appropriate to recreate objects, and sometimes existing objects ought to be reused.
    • CREATE ... IF NOT EXISTS will create an object only if it doesn't already exist. We use this statement to preserve the state of that object or its child objects. The prototypical use case is preserving tables containing data and their parent objects. I use this statement with DATABASE and SCHEMA objects because they are parents of table objects. However, if this is a dev deployment, I instead CREATE OR REPLACE the database object in anticipation of cloning preexisting production tables, while ensuring a completely reproducible dev environment with each new deployment.
    • CREATE OR ALTER is a mechanism for declaratively updating table objects. More information here.
  • Table data types are defined to be as large as possible without consuming extraneous storage or requiring additional processing. For more information, see String & binary data types and Numeric data types.
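A minimal sketch of the three mechanisms side by side (the table and its columns below are illustrative, not the actual DDL in this PR):

-- Declarative and destructive: always recreates the object
CREATE OR REPLACE FILE FORMAT PARQUET_FORMAT
  TYPE = PARQUET;

-- Stateful: leaves an existing object (and its children) untouched
CREATE SCHEMA IF NOT EXISTS PARQUET;

-- Declarative but non-destructive: updates table metadata in place
CREATE OR ALTER TABLE EXAMPLE_TABLE (
  "ParticipantIdentifier" VARCHAR(64),
  "StepCount"             NUMBER(10, 0)
);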

@philerooski requested a review from a team as a code owner on July 3, 2024 00:18
@thomasyu888 (Member) left a comment

🔥 LGTM - with some minor comments - but going to pre-approve!

I wanted to touch on this: "This PR is an incomplete solution to Snowflake deployment."

I am still a bit confused by this, but could you outline what you think a "production" deployment looks like via CI/CD? When you push into main here, I think it creates a recover_main database?

/*
Create the Parquet file format
*/
CREATE OR REPLACE FILE FORMAT {{ parquet_file_format_name }}
Member:

Should this be Create if not exists?

Contributor Author:

It's a tradeoff.

CREATE OR REPLACE is the best way to ensure we are deploying in a declarative way without having to worry about state and requires minimal overhead. FILE FORMAT is a lightweight object and can be replaced easily.

But given the caveat about FILE FORMAT and EXTERNAL TABLE, we either need to enforce top-down that external tables which use the FILE FORMAT are always recreated during deployment, or, if we were to instead use CREATE ... IF NOT EXISTS, we need to depend on the developer to be aware that changes they make to the FILE FORMAT object in an existing deployment won't be reflected in subsequent deployments unless they manually delete the object. And if they manually delete the object, they also have to manually deal with any fallout from the caveat I mentioned above! Alternatively, they could ALTER the file format, but at that point the developer has to be familiar with and remember this caveat, and they have to manage the state of those objects themselves; CI/CD won't do it for them. Of course, we'd have to deal with this complexity when deploying changes to a FILE FORMAT in staging or prod, too.

Fortunately, since EXTERNAL TABLE objects are merely metadata (schema and table properties), they are also lightweight objects and can be replaced easily. So my preference is to use CREATE OR REPLACE for both object types to ensure that we are deploying objects in a state that's reflected in their DDL.
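For context, the caveat arises because an external table references the file format by name, along these lines (the table name and stage path are hypothetical):

CREATE OR REPLACE EXTERNAL TABLE EXAMPLE_EXTERNAL_TABLE
  LOCATION = @PARQUET_S3/example_dataset/
  FILE_FORMAT = (FORMAT_NAME = 'PARQUET_FORMAT')
  AUTO_REFRESH = FALSE;
-- If PARQUET_FORMAT is replaced, this external table must also be recreated,
-- which is the dependency the caveat above is about.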

Member:

@philerooski Are we planning on using EXTERNAL TABLE? That's an important caveat you mentioned - but if we aren't, it's probably ok to do this

Contributor Author:

@thomasyu888 We are not, but if we were I would still recommend this approach because external tables are easy to replace.

.github/workflows/upload-and-deploy.yaml (conversation resolved)
@@ -0,0 +1,11 @@
CREATE OR ALTER TABLE ENROLLEDPARTICIPANTS_CUSTOMFIELDS_SYMPTOMS (
@thomasyu888 (Member) commented Jul 3, 2024:

This is neat. Does it automatically put you in the right schema? I did see the USE SCHEMA in one of the scripts!

Contributor Author:

If you are deploying from an entrypoint in database/recover/schema/parquet/deploy.sql (where you see the USE SCHEMA statement) or higher, then yes. But if you are using another schema and you EXECUTE IMMEDIATE FROM this file, then this table will be created in that other schema.
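A rough illustration of that scoping behavior (the database and second schema names are hypothetical):

-- Deploying via the schema-level entrypoint sets the schema before the table DDL runs
USE SCHEMA RECOVER_ETL_659.PARQUET;
EXECUTE IMMEDIATE FROM './table/deploy.sql';

-- Running the table script while a different schema is in use
-- creates the table in that schema instead
USE SCHEMA RECOVER_ETL_659.SOME_OTHER_SCHEMA;
EXECUTE IMMEDIATE FROM './table/enrolledparticipants_parquet.sql';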

@philerooski (Contributor Author) commented:

I wanted to touch on this "This PR is an incomplete solution to Snowflake deployment." I am still a bit confused by this, but could you outline what you think a "production" deployment looks like via CI/CD?

To summarize what the deployment does in this PR: it effectively creates all the objects necessary to create tables (plus a few extra – FILE FORMAT and STAGE objects – in anticipation of future work) and then creates tables for each of our Parquet datasets. It doesn't load any data; it merely creates the tables (or alters them, if they already exist).

This is less than ideal for a developer. They don't want to load any data and they definitely don't want to do it manually! So in the future, when we have Snowflake tables in production which already contain our data, we can modify the default behavior of this deployment so that we clone those tables, rather than create empty tables. That's what makes this an incomplete solution.

As for what a production-ready deployment system looks like, I believe we ought to have multiple ways of deploying tables:

  1. By cloning from existing tables in production
  2. By creating the table from scratch and loading data into the table via COPY INTO from an external stage.
  3. By updating the table if there are changes, and otherwise doing nothing

(1) is our ideal dev environment deployment scenario. It requires minimal overhead and we get a production database in an isolated environment that we can muck around in.
(2) is our bootstrap scenario. If – for whatever reason – we need to redeploy our database entirely from scratch, we should be able to do that.
(3) is our production scenario. We want to update the table metadata if there have been any changes, but data loading happens via a separate mechanism not related to deployment.
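Roughly, and with illustrative object names and columns rather than the real DDL, those three scenarios might look like:

-- (1) Dev: zero-copy clone of an existing production table
CREATE OR REPLACE TABLE FITBITACTIVITYLOGS
  CLONE RECOVER_MAIN.PARQUET.FITBITACTIVITYLOGS;

-- (2) Bootstrap: create the table from scratch, then load it from the external stage
CREATE OR ALTER TABLE FITBITACTIVITYLOGS (
  "ParticipantIdentifier" VARCHAR(64),
  "StartDate"             TIMESTAMP_NTZ
);
COPY INTO FITBITACTIVITYLOGS
  FROM @PARQUET_S3/dataset_fitbitactivitylogs/
  FILE_FORMAT = (FORMAT_NAME = 'PARQUET_FORMAT')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- (3) Prod: update table metadata if the DDL changed; data loading happens elsewhere
CREATE OR ALTER TABLE FITBITACTIVITYLOGS (
  "ParticipantIdentifier" VARCHAR(64),
  "StartDate"             TIMESTAMP_NTZ
);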

when you push into main here, I think it creates a recover_main database?

No, RECOVER_MAIN is one of the databases which we already assume to exist as detailed in snowflake/objects/deploy.sql. I use Snowflake scripting to specify that we only create or replace a database if this is not the main or staging branch/environment.
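Something along these lines (a sketch of that conditional, assuming a Jinja-templated environment value, not the exact contents of deploy.sql):

EXECUTE IMMEDIATE
$$
BEGIN
    -- Only non-main, non-staging (i.e., dev) deployments recreate the database
    IF ('{{ environment }}' NOT IN ('MAIN', 'STAGING')) THEN
        CREATE OR REPLACE DATABASE RECOVER_{{ environment }};
    ELSE
        CREATE DATABASE IF NOT EXISTS RECOVER_{{ environment }};
    END IF;
END;
$$;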

Create an external stage over the RECOVER Git repository so that we can
use EXECUTE IMMEDIATE FROM statements.
*/
CREATE OR REPLACE GIT REPOSITORY recover_git_repository
Member:

I think eventually we will need to ALTER this instead of CREATE OR REPLACE.

Contributor Author:

@thomasyu888 Could you explain why we might need to maintain an existing GIT REPOSITORY object? By using CREATE OR REPLACE we guarantee that our repository is up to date without needing to worry about whether the repository has already been created – which would require a FETCH statement.
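For contrast, the stateful alternative would look roughly like this (the origin URL and integration name below are placeholders):

-- Keep any existing repository object and refresh it instead of replacing it
CREATE GIT REPOSITORY IF NOT EXISTS recover_git_repository
  ORIGIN = 'https://github.com/Sage-Bionetworks/recover'
  API_INTEGRATION = recover_git_api_integration;
ALTER GIT REPOSITORY recover_git_repository FETCH;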

Member:

@philerooski So maybe I'm jumping ahead and just thinking of the dev and prod deployment. I guess it's not too much lift to create or replace these every time we push to a branch.

I think I'm struggling a bit conceptually to see the production CI/CD pipeline so I'll wait for you to get the first data flowing in and we can revisit.

@thomasyu888 (Member) commented Jul 3, 2024:

Also, I guess the issue is there are two parallel tracks here that I'm getting mixed up:

  1. Development workflow
  2. Dev and prod deployment

What we are currently tackling here is (2), in that every time you add a file (let's say a table) or add a scheduled task or dynamic table, the CI/CD is going to run through this entire stack to create the resource without creating all the other resources again.

For (1), like you've mentioned, it would be CLONE DATABASE, then create all the new resources for a new branch. We do NOT load all the data up again for existing data; we just load data for the new resources or tables we're adding.

Is that right?

Contributor Author:

I guess it's not too much lift to create or replace these every time we push to a branch.

No, it's not much overhead at all. And it's the only way to do stateless/declarative deployment in Snowflake – with the exception of objects which support the CREATE OR ALTER syntax.

the CI/CD is going to run through this entire stack to create the resource without creating all the other resources again.

I'm a little confused about how you think this works, because you said above that we create or replace most objects every time we push to a branch, but this statement implies that we'll be able to know when a change is independent of already deployed objects, which would require a dependency graph.

For (1), like you've mentioned, it would be CLONE DATABASE, then create all the new resource per a new branch. We do NOT load all the data up again for existing data., we just load data for the new resources or tables we're adding

That's right. Does it make things clearer if I put it like this: the deployment process for dev/staging/prod is exactly the same except for the hard-to-replace objects, like tables with data.

@thomasyu888 (Member) commented Jul 3, 2024:

we'll be able to know when a change is independent of already deployed objects, which would require a dependency graph.

Hmmm. Let's merge this and see how it plays out - I think we are actually on the same page, but I just need to see it in action.

Snowflake takes care of what has already been deployed (hence the CREATE... IF NOT EXISTS, or CREATE OR ALTER statements)

That does help!

Contributor Author:

Slight revision: the deployment process for dev/staging/prod is exactly the same except for the hard-to-replace objects and their parent objects, like tables with data, and their schema and database.

* An API INTEGRATION with an API_ALLOWED_PREFIXES value
containing 'https://github.com/Sage-Bionetworks/' and
API_PROVIDER = GIT_HTTPS_API.
- STORAGE INTEGRATION `RECOVER_PROD_S3`
Contributor:

Do we also want a recover_dev_s3? Not sure if we want non-releases to go through any of this.

Contributor Author:

We actually do already have a storage integration for our dev account. But it's not a dependency for this deployment, and since in the future we will clone prod tables (or maybe we want to sample them to keep things lightweight?) into their own isolated environment, I don't see why we would want to work with the pilot data.

@BryanFauble (Contributor) left a comment:

This is awesome. Great work getting this together.

sonarcloud bot commented Jul 15, 2024

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
5.6% Duplication on New Code

See analysis details on SonarCloud
