aws-databricks-modular-privatelink

Deploy Multiple AWS Databricks Workspaces with CMK, Customer-managed VPC, Private Link, and IP Access Lists

In this example, we provide modules and a root-level template to deploy multiple (e.g. 10+) E2 Databricks workspaces at scale. At a minimum, users of this template should:

  1. Supply credentials (AWS + Databricks) and configuration variables for each workspace.
  2. Edit the locals block in main.tf to decide which and how many workspaces to deploy.
  3. Run terraform init and terraform apply to deploy one or more workspaces into your VPC.
  4. Optionally, take the output files in /artifacts and patch each workspace with an IP access list.

This modular design also allows customers to deploy, manage, and delete individual workspaces easily, with minimal configuration. This template draws heavily (e.g. the CMK module and Private Link setup) on https://github.com/andyweaves/databricks-terraform-e2e-examples by [email protected], and this repo has been adapted to meet specific customer requirements.

Architecture

To be added - LucidChart brewing...

Project Folder Structure

.
├── iam.tf
├── instance_profile.tf
├── main.tf
├── outputs.tf
├── privatelink.tf
├── providers.tf
├── variables.tf
├── vpc.tf
├── artifacts        # stores workspace URLs and other info for next-stage deployment
    ├── workspace_1_deployment.json       
    ├── ...
├── modules   
    ├── databricks_cmk
        ├── data.tf
        ├── main.tf         
        ├── outputs.tf      
        ├── providers.tf
        ├── variables.tf    
    ├── mws_workspace
        ├── main.tf         
        ├── variables.tf    
        ├── outputs.tf      
        ├── modules
            ├── mws_network
                ├── main.tf
                ├── variables.tf
                ├── outputs.tf
            ├── mws_storage
                ├── main.tf
                ├── variables.tf
                ├── outputs.tf

Get Started

Step 1: Clone this repo locally and set environment variables for AWS and Databricks provider authentication:

export TF_VAR_databricks_account_client_id=your_account_level_spn_application_id
export TF_VAR_databricks_account_client_secret=your_account_level_spn_secret
export TF_VAR_databricks_account_id=your_databricks_account_id

export AWS_ACCESS_KEY_ID=your_aws_role_access_key_id
export AWS_SECRET_ACCESS_KEY=your_aws_role_secret_access_key

Step 2: Modify variables.tf. For each workspace, you need to write a variable block like the one below; all attributes are required:

variable "workspace_1_config" {
  default = {
    private_subnet_pair = { subnet1_cidr = "10.109.6.0/23", subnet2_cidr = "10.109.8.0/23" }
    workspace_name      = "test-workspace-1"
    prefix              = "ws1" // prefix determines subnet names
    region              = "ap-southeast-1"
    root_bucket_name    = "test-workspace-1-rootbucket"
    block_list          = ["58.133.93.159"]
    allow_list          = [] // if allow_list is empty, all public IPs not in block_list are allowed
    tags = {
      "Name" = "test-workspace-1-tags",
      "Env"  = "test-ws-1" // add more tags if needed; tags are applied to the Databricks subnets and the root S3 bucket, but tags for workspace objects such as clusters must be defined in the workspace config elsewhere
    }
  }
}

Since we use a CMK (customer-managed key) to encrypt the root S3 bucket and Databricks-managed resources, you also need to provide an AWS IAM ARN for cmk_admin, in the format arn:aws:iam::123456:user/xxx. You need to create this user and grant it KMS administrator permissions.
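For reference, a minimal sketch of how the cmk_admin variable could be declared in variables.tf (the ARN below is a placeholder, not a real account):

variable "cmk_admin" {
  description = "IAM ARN of the user that administers the customer-managed key"
  type        = string
  default     = "arn:aws:iam::123456789012:user/your-kms-admin" // placeholder, replace with your own
}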

Step 3: Modify the locals block in main.tf; add or remove your workspace config variables inside locals, like this:

workspace_confs = {
    workspace_1 = var.workspace_1_config
    workspace_2 = var.workspace_2_config
    workspace_3 = var.workspace_3_config
}

Step 4: Check your VPC and subnet CIDRs, then run terraform init and terraform apply to deploy one or more E2 workspaces into your VPC.

We call the mws_workspace module to create workspaces in batches; treat each module instance as a group of workspaces that share the same VPC in a region. If you want to deploy workspaces into different VPCs, you need to create multiple mws_workspace instances. A sketch of such a module call is shown below.
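The exact interface is defined in modules/mws_workspace/variables.tf; the call in main.tf looks roughly like this sketch (argument names here are illustrative, not authoritative):

module "workspace_collection" {
  source = "./modules/mws_workspace"

  // All workspaces passed in here share the VPC and VPC endpoints created at the root level.
  workspace_confs = local.workspace_confs
  // ...plus the networking and credential inputs the module expects (see its variables.tf)
}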

In the default setting, this template creates one VPC (with one public subnet and one private subnet for hosting the VPC endpoints). Each additional workspace adds 2 private subnets to this VPC. If you need multiple VPCs, copy the VPC configs and adjust them accordingly, or wrap the VPC configs into a module; we leave this to you.

At this point, your workspaces and VPC networking infrastructure should be deployed, and you will have n config JSON files (one per workspace) under the /artifacts folder, to be used in another Terraform project to deploy workspace objects, including the IP access list.
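These per-workspace files are written by the local_file.deployment_information resource; a rough sketch of the idea, with an illustrative payload (the attributes actually written by this template may differ):

resource "local_file" "deployment_information" {
  for_each = local.workspace_confs

  // One JSON file per workspace under /artifacts, consumed by the workspace-management project.
  content = jsonencode({
    workspace_name = each.value.workspace_name
    workspace_url  = module.workspace_collection.databricks_hosts[each.key] // assumed output shape
    allow_list     = each.value.allow_list
    block_list     = each.value.block_list
  })
  filename = "${path.module}/artifacts/${each.key}_deployment.json"
}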

Private Links

In this example, we use one VPC for all workspaces, with back-end VPC endpoints (VPCEs) so that Databricks clusters can communicate with the control plane. All workspaces deployed into the same VPC share one pair of VPCEs (one for the secure cluster connectivity relay, one for the REST API). Since VPC endpoints provide considerable bandwidth, one such pair is typically all you need for the workspaces in each region. For an HA setup, you can also place VPC endpoints in multiple AZs.
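Based on the resource names used in this template, the wiring for the REST API endpoint looks roughly like the sketch below (the relay endpoint is analogous; exact arguments in privatelink.tf may differ):

resource "aws_vpc_endpoint" "backend_rest" {
  vpc_id             = aws_vpc.mainvpc.id
  service_name       = var.workspace_vpce_service
  vpc_endpoint_type  = "Interface"
  subnet_ids         = aws_subnet.privatelink[*].id
  security_group_ids = [aws_security_group.privatelink.id]
}

// Register the AWS endpoint with the Databricks account so workspaces can reference it.
resource "databricks_mws_vpc_endpoint" "backend_rest_vpce" {
  provider            = databricks.mws
  account_id          = var.databricks_account_id
  aws_vpc_endpoint_id = aws_vpc_endpoint.backend_rest.id
  vpc_endpoint_name   = "backend-rest-vpce" // illustrative name
  region              = var.region
}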

IP Access List

All workspaces in this template allow access from the public internet, but access is restricted with an IP access list. Each workspace can be customized with allow_list and block_list in its variable block.

The IP access list management process is separated from the Terraform workspace deployment process because we want:

  1. To keep a clean cut between workspace deployment and workspace management.
  2. To follow the general good practice of separating workspace deployment from workspace management.
  3. To keep workspace object deployment in a separate Terraform project, so as not to risk leaving orphaned resources or breaking your workspace deployment (e.g. after a provider change).

After you have deployed your workspaces using this template (aws_databricks_modular_privatelink), the workspace host URLs are saved as local files under /artifacts. These files are the input to the next Terraform workspace management process, which patches the workspace IP access lists.
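In that downstream project, patching the list typically boils down to a pair of resources like this sketch (a minimal example, not this repo's exact code; the IPs would come from the per-workspace allow_list/block_list written to /artifacts):

// Workspace-level provider, configured with the host URL read from the /artifacts JSON file.
resource "databricks_workspace_conf" "this" {
  custom_config = {
    "enableIpAccessLists" = true
  }
}

resource "databricks_ip_access_list" "block" {
  label        = "block-list"
  list_type    = "BLOCK"
  ip_addresses = ["58.133.93.159"]
  depends_on   = [databricks_workspace_conf.this]
}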

IP Access List Decision Flow

(Diagram: IP access list decision flow)

Example - blocked access to a workspace: my phone cannot access the workspace because its public IP is on the workspace's block list.

(Screenshot: workspace access blocked for an IP on the block list)

We recommend keeping IP access list management in a separate Terraform project to avoid orphaned resources (an example of the resulting error is shown below).

(Screenshot: error caused by orphaned IP access list resources)

Tagging

We added custom tagging options in variables.tf to tag your AWS resources: in each workspace's config variable map, you can supply any number of tags, and these tags propagate down to the resources related to that workspace, such as the root S3 bucket and the two private subnets. Note that the Databricks workspace resource itself does not support tagging, and neither do the storage_configuration and network_configuration abstraction layers. If you need to tag or enforce tags on clusters and pools, do it in the workspace management Terraform projects, not in this directory that deploys workspaces.
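Inside the modules, the propagation can be as simple as merging the workspace tags over any shared tags; an illustrative subnet sketch (the real subnets are created inside the mws_network module, so names and structure here are assumptions):

resource "aws_subnet" "workspace_private" {
  for_each = local.workspace_confs

  vpc_id     = aws_vpc.mainvpc.id
  cidr_block = each.value.private_subnet_pair.subnet1_cidr

  // Workspace-level tags from the config map, merged over globally shared tags.
  tags = merge(var.tags, each.value.tags)
}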

Terraform State Files Stored in Remote S3

We recommend using remote storage, such as S3, for state storage instead of the default local backend. If you have already applied and kept state files locally, you can configure the S3 backend and then apply again; this migrates the local state content into the S3 bucket, leaving the local state file empty. Whenever you switch backends, state is migrated from the old backend to the new one.

terraform {
  backend "s3" {
    # Replace this with your bucket name!
    bucket = "terraform-up-and-running-state-unique-hwang"
    key    = "global/s3/terraform.tfstate"
    region = "ap-southeast-1"
    # Replace this with your DynamoDB table name!
    dynamodb_table = "terraform-up-and-running-locks"
    encrypt        = true
  }
}

You should create the infrastructure for the remote backend in another Terraform project, such as the aws_remote_backend_infra project at this repo's root level - https://github.com/hwang-db/tf_aws_deployment/tree/main/aws_remote_backend_infra - since we want to keep the backend infra separate from any Databricks project infra. As shown below, you create a separate set of tf scripts that provision the S3 bucket and DynamoDB table; all other tf projects can then store their state files in this remote backend.

(Screenshot: remote backend project provisioning the S3 bucket and DynamoDB table)
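A minimal sketch of what that backend project could provision (bucket and table names match the backend block above; versioning and billing settings are illustrative):

resource "aws_s3_bucket" "terraform_state" {
  bucket = "terraform-up-and-running-state-unique-hwang"
}

// Versioning protects against accidental state corruption.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

// DynamoDB table used by Terraform for state locking; the hash key must be named LockID.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-up-and-running-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}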

Tips: If you want to destroy your backend infra (S3 + DynamoDB), note that the state for the backend infra itself is stored in that very S3 bucket. To avoid a chicken-and-egg problem, follow these steps:

  1. Comment out the remote backend configuration and migrate the state to the local backend.
  2. Comment out all backend resource configs and run apply to get rid of them, or run destroy.

Common Actions

To add specific workspace(s)

You just need to supply each workspace's configuration in the root-level variables.tf, similar to the examples given. Then add the workspaces you want to the locals block and run apply, for example:
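For example, to add a hypothetical workspace_4 (assuming you have defined var.workspace_4_config in variables.tf), extend the locals block in main.tf:

workspace_confs = {
    workspace_1 = var.workspace_1_config
    workspace_2 = var.workspace_2_config
    workspace_3 = var.workspace_3_config
    workspace_4 = var.workspace_4_config // newly added workspace
}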

To delete specific workspace(s)

Do not run terraform destroy or terraform destroy -target to delete resources. Instead, remove the resources from your .tf scripts and run terraform apply.

You just need to remove the workspace config from the locals block in main.tf, then run terraform apply to delete the workspace. For example, to delete workspace_3, remove the following line from the locals block in main.tf (removing the corresponding variable block in variables.tf is optional):

workspace_3 = var.workspace_3_config

Then run terraform apply, and workspace_3 will be deleted.

Configure IAM roles, S3 access policies and Instance Profile for clusters

This template illustrates the traditional method of creating an instance profile to grant clusters access to an S3 bucket; see the original official guide.

The sample script in instance_profile.tf creates the underlying IAM role and policies you need to register an instance profile at the workspace level. You will find the ARN in the tf output; you can then take that value and configure it manually on the workspace admin settings page, as shown below:

(Screenshot: adding the instance profile ARN in the workspace admin settings page)
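For reference, the IAM pieces created by instance_profile.tf look roughly like this sketch (role name and output name are illustrative; the S3 access policy attached to the role is omitted here):

resource "aws_iam_role" "cluster_role" {
  name = "databricks-cluster-s3-role"

  // Allow EC2 instances (Databricks cluster nodes) to assume this role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_instance_profile" "cluster_profile" {
  name = "databricks-cluster-s3-instance-profile"
  role = aws_iam_role.cluster_role.name
}

// The ARN you paste into the workspace admin settings page.
output "instance_profile_arn" {
  value = aws_iam_instance_profile.cluster_profile.arn
}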

Next, configure permissions so that users/groups can use this instance profile to spin up clusters; those clusters will then be able to access the S3 bucket specified in the instance profile's IAM role policy.

Grant other users access to this instance profile

Deploying an instance profile to a workspace is clearly a workspace configuration task, so we suggest writing the relevant tf scripts in the workspace management project (such as aws_workspace_config), not in this workspace deployment project. The screenshot in the step above shows the manual way of adding an instance profile to your workspace.

By default, the instance profile you created in the steps above is only accessible to its creator and the admins group. You therefore also need to configure access control (permissions) to specify who can use this instance profile to spin up clusters. See the sample tf script and tutorial here: Tutorial. A rough sketch is shown below.
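A minimal sketch of what that could look like in the workspace-management project (resource and group names are illustrative; this is not this repo's exact code):

// Register the instance profile with the workspace, using the ARN from this project's tf output.
resource "databricks_instance_profile" "shared" {
  instance_profile_arn = "arn:aws:iam::123456789012:instance-profile/your-profile" // placeholder
}

// Look up an existing workspace group and let it use the instance profile.
data "databricks_group" "data_engineers" {
  display_name = "data-engineers" // illustrative group name
}

resource "databricks_group_instance_profile" "this" {
  group_id            = data.databricks_group.data_engineers.id
  instance_profile_id = databricks_instance_profile.shared.id
}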

Requirements

Name Version
aws ~> 4.0

Providers

Name Version
aws 4.32.0
databricks 1.3.1
databricks.mws 1.3.1
http 3.1.0
local 2.2.3
random 3.4.3
time 0.8.0

Modules

Name Source Version
databricks_cmk ./modules/databricks_cmk n/a
workspace_collection ./modules/mws_workspace n/a

Resources

Name Type
aws_eip.nat_gateway_elastic_ips resource
aws_iam_role.cross_account_role resource
aws_iam_role_policy.this resource
aws_internet_gateway.igw resource
aws_nat_gateway.nat_gateways resource
aws_route_table.pl_subnet_rt resource
aws_route_table.public_route_table resource
aws_route_table_association.dataplane_vpce_rtb resource
aws_route_table_association.public_route_table_associations resource
aws_security_group.privatelink resource
aws_security_group.sg resource
aws_subnet.privatelink resource
aws_subnet.public_subnets resource
aws_vpc.mainvpc resource
aws_vpc_endpoint.backend_relay resource
aws_vpc_endpoint.backend_rest resource
databricks_mws_credentials.this resource
databricks_mws_vpc_endpoint.backend_rest_vpce resource
databricks_mws_vpc_endpoint.relay resource
local_file.deployment_information resource
random_string.naming resource
time_sleep.wait resource
aws_availability_zones.available data source
databricks_aws_assume_role_policy.this data source
databricks_aws_crossaccount_policy.this data source
http_http.my data source

Inputs

Name Description Type Default Required
cmk_admin cmk string "arn:aws:iam::026655378770:user/hao" no
databricks_account_id n/a string n/a yes
databricks_account_password n/a string n/a yes
databricks_account_client_id n/a string n/a yes
privatelink_subnets_cidr n/a list(string) ["10.109.4.0/23"] no
public_subnets_cidr n/a list(string) ["10.109.2.0/23"] no
region n/a string "ap-southeast-1" no
relay_vpce_service n/a string "com.amazonaws.vpce.ap-southeast-1.vpce-svc-0557367c6fc1a0c5c" no
tags n/a map {} no
vpc_cidr n/a string "10.109.0.0/17" no
workspace_1_config n/a map {"allow_list": ["65.184.145.97"], "block_list": ["58.133.93.159"], "prefix": "ws1", "private_subnet_pair": {"subnet1_cidr": "10.109.6.0/23", "subnet2_cidr": "10.109.8.0/23"}, "region": "ap-southeast-1", "root_bucket_name": "test-workspace-1-rootbucket", "tags": {"Env": "test-ws-1", "Name": "test-workspace-1-tags"}, "workspace_name": "test-workspace-1"} no
workspace_2_config n/a map {"allow_list": ["65.184.145.97"], "block_list": ["54.112.179.135", "195.78.164.130"], "prefix": "ws2", "private_subnet_pair": {"subnet1_cidr": "10.109.10.0/23", "subnet2_cidr": "10.109.12.0/23"}, "region": "ap-southeast-1", "root_bucket_name": "test-workspace-2-rootbucket", "tags": {"Name": "test-workspace-2-tags"}, "workspace_name": "test-workspace-2"} no
workspace_vpce_service n/a string "com.amazonaws.vpce.ap-southeast-1.vpce-svc-02535b257fc253ff4" no

Outputs

Name Description
arn n/a
databricks_hosts n/a