Surfline SQLFluff GitHub Workflow

Developed by Greg Clunies @ Surfline.

To add this to your repo, copy the contents of sqlfluff_lint_dbt_models.yml in this folder into a file named .github/workflows/sqlfluff_lint_dbt_models.yml.
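The workflow file in this folder is the source of truth; as a rough, hedged sketch of its overall shape (trigger, conda setup, lint step; step names and action versions below are illustrative, not copied from the actual file), it looks something like:

# Illustrative sketch only - see sqlfluff_lint_dbt_models.yml in this folder for the real workflow
name: sqlfluff_lint_dbt_models

on:
  pull_request:   # lint models added or modified in the PR

jobs:
  lint_dbt_models:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      # Build the conda environment (Python, dbt, sqlfluff) from environment.yml
      - uses: conda-incubator/setup-miniconda@v2
        with:
          environment-file: environment.yml

      # Lint the changed models under /models
      - name: Lint dbt models
        shell: bash -l {0}
        run: sqlfluff lint models/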

Project using this workflow

See the dbt_artifacts dbt package for an example of this workflow in action. An example of the annotations can be seen here.

This GitHub Workflow

  • Lints any added or modified models in /models
  • Uses conda to set up a virtual environment and manage Python, dbt, and sqlfluff dependencies. An example environment.yml can be found in this folder (a minimal sketch appears after this list). You can modify this workflow to handle your dependencies however you prefer (e.g., pip, virtualenv, poetry, etc.).
  • Uses templater = dbt - this requires a dummy profiles.yml and a connection to your data warehouse from the workflow.
  • Handles connecting to warehouse via VPN if your data warehouse requires it (optional). If your warehouse doesn't require connection via VPN, you can delete the Install OpenVPN and Connect to VPN steps from the workflow.
  • Annotates failures on the PR, on the line where they occur.
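A minimal sketch of what that environment.yml might contain (package names only; the file in this folder is authoritative and the version pin below is a placeholder):

# Minimal environment.yml sketch - the file in this folder is authoritative
name: dbt_sqlfluff
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - pip:
      - dbt        # or the dbt adapter package for your warehouse
      - sqlfluff   # the linter itself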

Example annotations screenshot

NOTE: This workflow has been tested on Redshift. Config for Snowflake is included in this repo, but has not been tested. For a very similar workflow that has been tested on Snowflake, see the tails.com workflow.

Setup

.sqlfluff

We use the .sqlfluff found here.
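The linked file is authoritative; at a minimum, a .sqlfluff for this workflow has to point sqlfluff at the dbt templater and your dbt project, roughly like:

# Minimal .sqlfluff sketch - the linked file is authoritative
[sqlfluff]
templater = dbt
dialect = redshift   # or snowflake, to match your warehouse

[sqlfluff:templater:dbt]
project_dir = ./     # directory containing dbt_project.yml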

templater = dbt & dummy profiles.yml

When sqlfluff uses templater = dbt, it is actually using the dbt compiler to compile your SQL before sqlfluff lints it. When the dbt compiler compiles the SQL in models containing macros like dbt_utils.star(), it needs to connect to and query the warehouse to get information about the tables and columns referenced in the macro.
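As a concrete (hypothetical) illustration, a model like the one below cannot be compiled without a live warehouse connection, because dbt_utils.star() has to look up the columns of the referenced table:

-- Hypothetical model, not part of this workflow: compiling dbt_utils.star()
-- requires querying the warehouse to discover the columns of stg_orders.
select
    {{ dbt_utils.star(ref('stg_orders')) }}
from {{ ref('stg_orders') }}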

But for dbt to connect to a data warehouse, it needs a profiles.yml. It's a bad idea to commit your profiles.yml to your repo since it contains sensitive DB credentials, so instead create a dummy profiles.yml in a ./ci_cd/ folder in your dbt repo. For Redshift (a Snowflake version is also available), this will look like:

# -- Redshift dummy - save this at `./ci_cd/profiles.yml` --
config:
    send_anonymous_usage_stats: False
    use_colors: True

<profile-from-dbt-project.yml>:  # <--- Replace with profile value in your dbt_project.yml!
  target: sqlfluff
  outputs:
    sqlfluff:
      type: redshift
      host: "{{ env_var('PROFILES_YML_HOST') }}"       # <-- Set these environment variables in GitHub Secrets
      port: "{{ env_var('PROFILES_YML_PORT') | int }}"
      user: "{{ env_var('PROFILES_YML_USER') }}"
      pass: "{{ env_var('PROFILES_YML_PASS') }}"
      dbname: "{{ env_var('PROFILES_YML_DBNAME') }}"
      schema: "{{ env_var('PROFILES_YML_SCHEMA') }}"
      threads: 1

The credentials set in the dummy profiles.yml use dbt's env_var function to read in environment variables that you need to set in your repo using GitHub Secrets. These environment variables are then injected into the virtual machine running the GitHub workflow like so:

    # Set environment variables used throughout GitHub workflow
    env:
      DBT_PROFILES_DIR: /home/runner/work/${{ github.event.repository.name }}/${{ github.event.repository.name }}/ci_cd

      # SPECIFY database connection credentials as env vars below.
      # Env var values are fetched from GitHub Secrets.
      # HIGHLY recommended you use a unique set of connection credentials for this workflow alone.

      # IF USING REDSHIFT, workflow will use these in dummy profiles.yml (else, ignored)
      PROFILES_YML_HOST: ${{ secrets.PROFILES_YML_HOST }}
      PROFILES_YML_PORT: ${{ secrets.PROFILES_YML_PORT }}
      PROFILES_YML_USER: ${{ secrets.PROFILES_YML_USER }}
      PROFILES_YML_PASS: ${{ secrets.PROFILES_YML_PASS }}
      PROFILES_YML_DBNAME: ${{ secrets.PROFILES_YML_DBNAME }}
      PROFILES_YML_SCHEMA: ${{ secrets.PROFILES_YML_SCHEMA }}

The exact secrets you need to configure in your GitHub repo are listed below; a sketch of a Snowflake dummy profiles.yml wired to these secrets follows the lists.

Redshift:

  • PROFILES_YML_HOST
  • PROFILES_YML_PORT
  • PROFILES_YML_USER
  • PROFILES_YML_PASS
  • PROFILES_YML_DBNAME
  • PROFILES_YML_SCHEMA

Snowflake:

  • PROFILES_YML_SNOWFLAKE_ACCOUNT
  • PROFILES_YML_SNOWFLAKE_USER
  • PROFILES_YML_SNOWFLAKE_PASSWORD
  • PROFILES_YML_SNOWFLAKE_ROLE
  • PROFILES_YML_SNOWFLAKE_DATABASE
  • PROFILES_YML_SNOWFLAKE_WAREHOUSE
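For reference, a Snowflake dummy profiles.yml wired to the secrets above would follow the same pattern as the Redshift example (this is a hedged sketch; the Snowflake version linked earlier is authoritative, and the schema value below is a placeholder):

# -- Snowflake dummy sketch - save at `./ci_cd/profiles.yml` --
<profile-from-dbt-project.yml>:  # <--- Replace with profile value in your dbt_project.yml!
  target: sqlfluff
  outputs:
    sqlfluff:
      type: snowflake
      account: "{{ env_var('PROFILES_YML_SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('PROFILES_YML_SNOWFLAKE_USER') }}"
      password: "{{ env_var('PROFILES_YML_SNOWFLAKE_PASSWORD') }}"
      role: "{{ env_var('PROFILES_YML_SNOWFLAKE_ROLE') }}"
      database: "{{ env_var('PROFILES_YML_SNOWFLAKE_DATABASE') }}"
      warehouse: "{{ env_var('PROFILES_YML_SNOWFLAKE_WAREHOUSE') }}"
      schema: sqlfluff_ci  # placeholder - set to a schema dbt can use
      threads: 1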

Now that the DB credentials are set in the dummy profiles.yml, when sqlfluff uses templater = dbt, it can connect to the warehouse!

Connecting to warehouse via VPN

If your warehouse requires connections to be made via a VPN, you will need the following credentials (ask your IT team about creating these):

  • username (strongly recommend creating a user just for this GitHub workflow)
  • password
  • vpn_config.ovpn

You will set these credentials as GitHub Secrets, which are used in the workflow:

  • OVPN_USERNAME
  • OVPN_PASSWORD
  • VPN_FILE (the contents of your .ovpn file).

Now the workflow can connect to your warehouse, which, again, is required when sqlfluff uses templater = dbt.
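As an illustration, the Install OpenVPN and Connect to VPN steps typically consume those secrets along these lines (a hedged sketch, not copied from the workflow file):

# Illustrative sketch of the VPN steps - the actual workflow file is authoritative
- name: Install OpenVPN
  run: sudo apt-get update && sudo apt-get install -y openvpn

- name: Connect to VPN
  run: |
    # Write the .ovpn config and credentials from GitHub Secrets, then connect in the background
    echo "${{ secrets.VPN_FILE }}" > vpn_config.ovpn
    printf '%s\n%s\n' "${{ secrets.OVPN_USERNAME }}" "${{ secrets.OVPN_PASSWORD }}" > vpn_creds.txt
    sudo openvpn --config vpn_config.ovpn --auth-user-pass vpn_creds.txt --daemon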

Support for other database types:

  • A similar approach for creating a dummy profiles.yml for other databases (e.g., BigQuery, Presto, etc.) should be possible (a hedged BigQuery sketch follows this list).
  • You will have to modify your dummy profiles.yml to match the format of your local profiles.yml (typically found at ~/.dbt/profiles.yml), or see the dbt docs for connection instructions for other database types.
  • Then set the connection credentials as environment variables in your sqlfluff_lint_dbt_models.yml.
  • Store the environment variable values as GitHub Secrets.
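For example, a BigQuery dummy profiles.yml could follow the same pattern as the Redshift one above; in this hedged sketch the PROFILES_YML_BIGQUERY_* secret names and the keyfile path are hypothetical, not part of this workflow:

# -- BigQuery dummy sketch - env var/secret names below are hypothetical --
<profile-from-dbt-project.yml>:  # <--- Replace with profile value in your dbt_project.yml!
  target: sqlfluff
  outputs:
    sqlfluff:
      type: bigquery
      method: service-account
      project: "{{ env_var('PROFILES_YML_BIGQUERY_PROJECT') }}"   # hypothetical secret name
      dataset: "{{ env_var('PROFILES_YML_BIGQUERY_DATASET') }}"   # hypothetical secret name
      keyfile: ./ci_cd/bq_keyfile.json  # write this file from a GitHub Secret in the workflow
      threads: 1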