[ETL-659] Snowflake git integration #123
Conversation
🔥 LGTM - with some minor comments - but going to pre-approve!
I wanted to touch on this: "This PR is an incomplete solution to Snowflake deployment."
I am still a bit confused by this, but could you outline what you think a "production" deployment looks like via CI/CD? When you push into main here, I think it creates a `recover_main` database?
/*
Create the Parquet file format
*/
CREATE OR REPLACE FILE FORMAT {{ parquet_file_format_name }}
Should this be `CREATE ... IF NOT EXISTS`?
It's a tradeoff.
`CREATE OR REPLACE` is the best way to ensure we are deploying in a declarative way without having to worry about state, and it requires minimal overhead. `FILE FORMAT` is a lightweight object and can be replaced easily.
But with the caveat about `FILE FORMAT` and `EXTERNAL TABLE` (recreating a file format breaks its association with any external table that references it), we either need to enforce top-down that external tables which use a `FILE FORMAT` are always recreated during deployment, or – if we were to instead use `CREATE ... IF NOT EXISTS` – we need to depend on the developer to be aware that if they have an existing deployment and they modify the `FILE FORMAT` object, then those changes won't be reflected in subsequent deployments unless they manually delete the object. But if they manually delete the object, then they also have to manually deal with any fallout from the caveat I mentioned above! Alternatively, they could `ALTER` the file format, but already we're getting ourselves into a situation where the developer has to be familiar with and remember this caveat and they have to manage the state of those objects themselves; CI/CD won't do it for them. Of course, we'd have to deal with this complexity when deploying changes to a `FILE FORMAT` in staging or prod, too.
Fortunately, since `EXTERNAL TABLE` objects are merely metadata (schema and table properties), they are also lightweight objects and can be replaced easily. So my preference is to use `CREATE OR REPLACE` for both object types to ensure that we are deploying objects in a state that's reflected in their DDL.
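To make the dependency concrete, here is a minimal sketch of that caveat. The object names are illustrative (loosely based on the `PARQUET_FORMAT` and `PARQUET_S3` names used elsewhere in this PR), and the stage is assumed to already exist:

```sql
-- Recreate the file format declaratively.
CREATE OR REPLACE FILE FORMAT parquet_format
  TYPE = PARQUET;

-- Any external table that uses the file format must also be recreated,
-- because CREATE OR REPLACE FILE FORMAT breaks the association between
-- the file format and existing external tables.
CREATE OR REPLACE EXTERNAL TABLE parquet_external_table
  LOCATION = @parquet_s3
  FILE_FORMAT = (FORMAT_NAME = 'parquet_format')
  AUTO_REFRESH = FALSE;  -- keep the sketch free of event-notification setup
```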
@philerooski Are we planning on using `EXTERNAL TABLE`? That's an important caveat you mentioned - but if we aren't, it's probably ok to do this.
@thomasyu888 We are not, but if we were I would still recommend this approach because external tables are easy to replace.
@@ -0,0 +1,11 @@
CREATE OR ALTER TABLE ENROLLEDPARTICIPANTS_CUSTOMFIELDS_SYMPTOMS (
This is neat, does it automatically put you in the right schema? I did see the `use schema` in one of the scripts!
If you are deploying from an entrypoint in `database/recover/schema/parquet/deploy.sql` (where you see the `USE SCHEMA` statement) or higher, then yes. But if you are using another schema and you `EXECUTE IMMEDIATE FROM` this file, then this table will be created in that other schema.
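For illustration, a minimal sketch of that second case. The database/schema names, branch, and file path are assumptions, not the actual paths in this PR:

```sql
-- The table DDL is agnostic to its parent context: it lands in whatever
-- schema is active when the file is executed.
USE SCHEMA recover_etl_659.other_schema;  -- hypothetical schema

-- Hypothetical path into the Git repository stage; adjust branch and file as needed.
EXECUTE IMMEDIATE FROM
  @recover_git_repository/branches/etl-659/snowflake/objects/database/recover/schema/parquet/table/some_table.sql;
```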
To summarize what the deployment does in this PR: it effectively creates all the objects necessary to create tables (plus a few extra), but the tables themselves start out empty. This is less than ideal for a developer. They don't want to load any data and they definitely don't want to do it manually! So in the future, when we have Snowflake tables in production which already contain our data, we can modify the default behavior of this deployment so that we clone those tables, rather than create empty tables. That's what makes this an incomplete solution. As for what a production-ready deployment system looks like, I believe we ought to have multiple ways of deploying tables:
(1) is our ideal dev environment deployment scenario. It requires minimal overhead and we get a production database in an isolated environment that we can muck around in.
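As a hedged sketch of that dev scenario, assuming production data lives in a `RECOVER_MAIN` database (both database names here are illustrative), a branch deployment could start from a zero-copy clone rather than empty tables:

```sql
-- Zero-copy clone of the production database into an isolated,
-- branch-specific dev database (hypothetical names).
CREATE OR REPLACE DATABASE recover_etl_659 CLONE recover_main;
```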
Create an external stage over the RECOVER Git repository so that we can
use EXECUTE IMMEDIATE FROM statements.
*/
CREATE OR REPLACE GIT REPOSITORY recover_git_repository
I think eventually we will need to ALTER this instead of CREATE OR REPLACE.
@thomasyu888 Could you explain why we might need to maintain an existing `GIT REPOSITORY` object? By using `CREATE OR REPLACE` we guarantee that our repository is up to date without needing to worry about whether the repository has already been created – which would require a `FETCH` statement.
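For reference, a minimal sketch of what that alternative (keeping the existing object) might look like; the API integration name and origin URL are illustrative assumptions:

```sql
-- Keep the existing repository object if it's already there...
CREATE GIT REPOSITORY IF NOT EXISTS recover_git_repository
  API_INTEGRATION = recover_git_api_integration  -- hypothetical integration name
  ORIGIN = 'https://github.com/Sage-Bionetworks/recover.git';  -- hypothetical URL

-- ...but then we have to remember to refresh its cached refs ourselves.
ALTER GIT REPOSITORY recover_git_repository FETCH;
```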
@philerooski So, maybe I'm jumping ahead and just thinking of the dev and prod deployment. I guess it's not too much lift to create or replace these every time we push to a branch.
I think I'm struggling a bit conceptually to see the production CI/CD pipeline, so I'll wait for you to get the first data flowing in and we can revisit.
Also, I guess the issue is there are two parallel tracks here that I'm getting mixed up:
1. Development workflow
2. Dev and prod deployment
What we are currently tackling here is (2), in that every time you add a file (let's say a table) or add a scheduled task or dynamic table, the CI/CD is going to run through this entire stack to create the resource without creating all the other resources again.
For (1), like you've mentioned, it would be CLONE DATABASE, then create all the new resources per new branch. We do NOT load all the data up again for existing data; we just load data for the new resources or tables we're adding.
Is that right?
> I guess it's not too much lift to create or replace these every time we push to a branch.

No, it's not much overhead at all. And it's the only way to do stateless/declarative deployment in Snowflake – with the exception of objects which support the `CREATE OR ALTER` syntax.

> the CI/CD is going to run through this entire stack to create the resource without creating all the other resources again.

I'm a little confused about how you think this works, because you said above that we create or replace most objects every time we push to a branch, but this statement implies that we'll be able to know when a change is independent of already deployed objects, which would require a dependency graph.

> For (1), like you've mentioned, it would be CLONE DATABASE, then create all the new resources per new branch. We do NOT load all the data up again for existing data; we just load data for the new resources or tables we're adding.

That's right. Does it make things clearer if I put it like this: the deployment process for dev/staging/prod is exactly the same except for the hard-to-replace objects, like tables with data.
> we'll be able to know when a change is independent of already deployed objects, which would require a dependency graph.

Hmmm. Let's merge this and see how it plays out - I think we are actually on the same page, but I think I just need to see it in action. Snowflake takes care of what has already been deployed (hence the `CREATE ... IF NOT EXISTS` or `CREATE OR ALTER` statements).
That does help!
Slight revision: the deployment process for dev/staging/prod is exactly the same except for the hard-to-replace objects and their parent objects, like tables with data, and their schema and database.
* An API INTEGRATION with an API_ALLOWED_PREFIXES value
containing 'https://github.com/Sage-Bionetworks/' and
API_PROVIDER = GIT_HTTPS_API.
- STORAGE INTEGRATION `RECOVER_PROD_S3`
Do we also want a `recover_dev_s3`? Not sure if we want non-releases to go through any of this.
We actually do already have a storage integration for our dev account. But it's not a dependency for this deployment, and since in the future we will clone prod tables (or maybe we want to sample them to keep things lightweight?) into their own isolated environment, I don't see why we would want to work with the pilot data.
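For context, a hedged sketch of the account-level API integration prerequisite quoted above. Only the allowed prefix and provider come from that comment; the integration name is a hypothetical placeholder:

```sql
-- Account-level API integration required before the Git repository object
-- can be created (integration name is hypothetical).
CREATE API INTEGRATION IF NOT EXISTS recover_git_api_integration
  API_PROVIDER = GIT_HTTPS_API
  API_ALLOWED_PREFIXES = ('https://github.com/Sage-Bionetworks/')
  ENABLED = TRUE;
```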
This is awesome. Great work getting this together.
Deploys Snowflake objects to a specific environment as part of our CI/CD process. An environment is analogous to a database. We assume that account-level objects have already been created.

The following objects are created:

- `RECOVER_GIT_REPOSITORY` git repository
- `RECOVER_{environment}` database (I will refer to this merely as the `RECOVER` database from here on for simplicity's sake)
- `PARQUET` schema
- `PARQUET_FORMAT` file format
- `PARQUET_S3` external stage

Deployment logic and object DDL is organized as a hierarchy:
(*Almost) every level in this hierarchy has a `deploy.sql` which will deploy all child objects with respect to the current directory.

*Due to technical limitations, there is no functioning `deploy.sql` file under `snowflake/objects/database`, which would have deployed every database (we only have one database in our deployment, currently).

For example, the file located at `snowflake/objects/database/recover/deploy.sql` will deploy all objects under the `RECOVER` database, and `snowflake/objects/database/recover/schema/parquet/deploy.sql` will deploy the file formats, stages, and tables under the `PARQUET` schema.

The child objects, that is, the schemas, file formats, stages, tables – anything that is not a database – are defined in such a way as to be agnostic to their parent context. For example, `snowflake/objects/database/recover/schema/parquet/deploy.sql` will deploy the `PARQUET` schema and all its child objects to whichever database your Snowflake user assumes. There is nothing in the SQL which restricts the `PARQUET` schema to be created within the `RECOVER` database. Likewise, the tables can be deployed to any schema, although their DDL is specific to the columns in our Parquet datasets.

The entrypoint for deployment is `snowflake/objects/deploy.sql`. When a commit is pushed to a branch, our CI/CD process will instantiate all Snowflake objects to an environment-specific database named after your branch, e.g., `RECOVER_ETL_659`. Details on how to manually deploy are provided in the file itself.
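As a hedged illustration of a manual deployment from that entrypoint; the branch path and the template variable name are assumptions, so check the comments in `snowflake/objects/deploy.sql` for the actual invocation:

```sql
-- Run the entrypoint from the Git repository stage against a branch-specific
-- environment. The variable name and value here are hypothetical.
EXECUTE IMMEDIATE FROM
  @recover_git_repository/branches/etl-659/snowflake/objects/deploy.sql
  USING (environment => 'ETL_659');
```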
Implementation details

- `CREATE OR REPLACE` will destroy any existing objects and create a new object. This is a powerful, declarative way of enforcing object configurations, but it's not always effective to recreate objects and sometimes existing objects ought to be reused.
- `CREATE ... IF NOT EXISTS` will create an object if it doesn't yet already exist. We use this statement to preserve the state of that object or its child objects. The prototypical use case for this statement is to preserve tables containing data and their parent objects. I use this statement with `DATABASE` and `SCHEMA` objects because they are parents of table objects. Although, if this is a dev deployment, I instead `CREATE OR REPLACE` the database object in anticipation of cloning preexisting production tables while ensuring a completely reproducible dev environment with each new deployment.
- `CREATE OR ALTER` is a mechanism for declaratively updating table objects. More information here.
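For instance, a minimal sketch of the `CREATE OR ALTER` behavior; the table and column names are purely illustrative:

```sql
-- First deployment: creates the table.
CREATE OR ALTER TABLE example_table (
    id NUMBER,
    value VARCHAR
);

-- Later deployment with an updated definition: the existing table is altered
-- in place (the new column is added) instead of being dropped and recreated,
-- so any data already loaded is preserved.
CREATE OR ALTER TABLE example_table (
    id NUMBER,
    value VARCHAR,
    created_at TIMESTAMP_NTZ
);
```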