These instructions are part of expanding and deploying the Data Mesh architecture. Please follow each part in sequence.
| Step | Description |
|------|-------------|
| 0-vpc-sc | Runs a local terraform plan that outputs the necessary configurations for your Service Perimeter. |
| 1-tag-engine-oauth | Instructions on how to configure OAuth needed for Data Mesh's Tag Engine. |
| 2-artifacts-project | Sets up a repository structure and instructions on deploying the artifacts project. |
| 3-artifact-publish | A repository structure containing Dockerfiles and Python packages that will be used for building and publishing artifacts. |
| 4-data-governance | A repository structure containing instructions on deploying the data governance project. |
| 5-service-catalog-project | A repository structure containing instructions on deploying the Service Catalog project. |
| 6-service-catalog-solutions | Instructions on how to configure Service Catalog. |
| 7-data-domain-1-nonconfidential | A repository structure containing instructions on deploying the non-confidential data project. |
| 8-data-domain-1-ingest | A repository structure containing instructions on deploying the ingest project. |
| 9-data-domain-1-confidential | A repository structure containing instructions on deploying the confidential data project. |
| 10-run-cdmc-engines | Instructions on how to run the CDMC engines. |
| 11-consumer-1 | A repository structure containing instructions on deploying the Consumer project. |
| 12-adding-additional-data | Instructions on how to add additional data domains and/or datasets to an existing data domain. |
## Repository for domain-1-ingest

- Clone your repository that was created in 4-projects:

  ```bash
  git clone git@github.com:[git-owner-name]/gcp-dm-bu4-prj-domain-1-ingest.git bu4-prj-domain-1-ingest
  ```
- cd over to the `bu4-prj-domain-1-ingest` directory:

  ```bash
  cd bu4-prj-domain-1-ingest
  ```
- Seed the repository if it has not been initialized yet:

  ```bash
  git commit --allow-empty -m 'repository seed'
  git push --set-upstream origin main

  git checkout -b production
  git push --set-upstream origin production

  git checkout -b nonproduction
  git push --set-upstream origin nonproduction

  git checkout -b plan
  ```
- Copy contents of the foundation to the new repo:

  ```bash
  cp -RT ../gcp-data-mesh-foundations/policy-library/ ./policy-library
  cp ../gcp-data-mesh-foundations/build/cloudbuild-connection-tf-* .
  cp ../gcp-data-mesh-foundations/build/tf-wrapper.sh .
  chmod 755 ./tf-wrapper.sh
  ```
- Copy the contents of this step over to the new repo:

  ```bash
  cp -R ../gcp-data-mesh-foundations/5-app-infra/8-data-domain-1-ingest/* .
  ```
- Update the `backend.tf` files with the backend bucket from the Projects step:

  ```bash
  export backend_bucket=$(terraform -chdir="../gcp-projects/business_unit_4/shared" output -json state_buckets | jq -r '."domain-1-ingest"')

  for i in `find . -name 'backend.tf'`; do sed -i'' -e "s/UPDATE_APP_INFRA_BUCKET/${backend_bucket}/" $i; done
  ```
- Update `remote_state_bucket` in `common.auto.tfvars`:

  ```bash
  export remote_state_bucket=$(terraform -chdir="../gcp-bootstrap/envs/shared" output -raw projects_gcs_bucket_tfstate)

  sed -i'' -e "s/REMOTE_STATE_BUCKET/${remote_state_bucket}/" common.auto.tfvars
  ```
- Update `data_governance_state_bucket` in `common.auto.tfvars`:

  ```bash
  export data_governance_state_bucket=$(terraform -chdir="../gcp-projects/business_unit_4/shared" output -json state_buckets | jq -r '."data-governance"')

  sed -i'' -e "s/DATA_GOVERNANCE_STATE_BUCKET/${data_governance_state_bucket}/" common.auto.tfvars
  ```
- Update `dataflow_gcs_bucket_url` in `common.auto.tfvars`. This contains the templates created in `gcp-dm-bu4-prj-artifacts`:

  ```bash
  terraform -chdir="../bu4-prj-artifacts/envs/shared" init

  export gcs_bucket_url=$(terraform -chdir="../bu4-prj-artifacts/envs/shared" output -json gcs_template_bucket | jq -r .url)

  sed -i'' -e "s|UPDATE_TEMPLATE_BUCKET_URL|${gcs_bucket_url}|" common.auto.tfvars
  ```
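  If you want to confirm that the substitutions above took effect, a quick check such as the following can help. It only greps for the placeholder strings used by the `sed` commands, so finding no matches means the files were updated:

  ```bash
  # Expect no matches for any of the placeholders once the sed commands above have run.
  grep -r "UPDATE_APP_INFRA_BUCKET" . || echo "backend.tf files updated"
  grep -E "REMOTE_STATE_BUCKET|DATA_GOVERNANCE_STATE_BUCKET|UPDATE_TEMPLATE_BUCKET_URL" common.auto.tfvars || echo "common.auto.tfvars updated"
  ```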
Before proceeding with the data ingestion process, we need to encrypt the data from the CSV files and store it in GCS.
- The first step is to generate an encrypted version of each plaintext CSV file that you plan to ingest. Once encrypted, these files will be uploaded to GCS, allowing Dataflow to ingest the data and populate the relevant BigQuery tables.
- Run the following commands to export the necessary environment variables for encryption. These variables include the KMS crypto key and the wrapped key, which will be used to securely encrypt the files:

  ```bash
  export cryptokeyname=$(terraform -chdir="../gcp-projects/business_unit_4/shared" output -json deidentify_keys | jq -r '."deidenfication_key_common-us-central1"')
  export org_tf_sa=$(terraform -chdir="../gcp-bootstrap/envs/shared" output -raw organization_step_terraform_service_account_email)
  export common_secrets_project_id=$(terraform -chdir="../gcp-projects/business_unit_4/shared" output -raw common_secrets_project_id)
  export secret_name=$(terraform -chdir="../gcp-projects/business_unit_4/shared" output -raw kms_wrapper_secret_name)

  echo "Crypto Key Name: $cryptokeyname"
  echo "Organization Terraform Service Account: $org_tf_sa"
  echo "Common Secrets Project ID: $common_secrets_project_id"
  echo "Secret Name: $secret_name"

  wrappedkey=$(gcloud secrets versions list $secret_name --project=$common_secrets_project_id --uri | head -n 1 | sed 's|https://secretmanager.googleapis.com/v1/||')

  echo "Wrapped Key: $wrappedkey"
  ```
- Impersonate the organization Terraform service account to create the keyset:

  ```bash
  gcloud auth application-default login --impersonate-service-account=${org_tf_sa}
  ```
- Create an `encrypted_data` folder at the root of this repository:

  ```bash
  mkdir encrypted_data
  ```
- Navigate to the directory containing the encryption helper script and execute it for each CSV file you want to encrypt. The script will output an encrypted version of each file.

  ```bash
  cd helpers/csv-enc-emulator
  ```
- Read the README.md file to ensure your Python environment contains the requisite packages.
- Run the following commands to encrypt each CSV file:

  ```bash
  python3 simple-csv-raw-to-enc.py --cryptoKeyName $cryptokeyname --wrappedKey $wrappedkey --input_file_path sample-100-raw.csv --output_file_path ../../encrypted_data/sample-100-encrypted.csv

  python3 simple-csv-raw-to-enc.py --cryptoKeyName $cryptokeyname --wrappedKey $wrappedkey --input_file_path NewCust.csv --output_file_path ../../encrypted_data/NewCust.csv

  python3 simple-csv-raw-to-enc.py --cryptoKeyName $cryptokeyname --wrappedKey $wrappedkey --input_file_path UpdCust.csv --output_file_path ../../encrypted_data/UpdCust.csv
  ```
- Parameters:
  - `--cryptoKeyName`: Specifies the KMS crypto key for encryption.
  - `--wrappedKey`: Provides the wrapped encryption key.
  - `--input_file_path`: Path of the raw file to be encrypted.
  - `--output_file_path`: Path for saving the encrypted file.

  After running these commands, the encrypted files will be located in the `../../encrypted_data/` directory.
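  As an optional sanity check, you can compare the line counts of a raw file and its encrypted counterpart. This assumes the helper writes one output row per input row (header included), which may not hold for every helper version:

  ```bash
  # Both files should report the same number of lines if the helper preserves the
  # row count (assumption); repeat for the other CSV files as needed.
  wc -l sample-100-raw.csv ../../encrypted_data/sample-100-encrypted.csv
  ```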
- Remove service account impersonation from your session:

  ```bash
  gcloud auth application-default login
  ```
- Navigate back to the project's root directory:

  ```bash
  cd ../..
  ```
- Set the environment variable for the target GCS bucket by running the following commands, which retrieve the name of the domain-1 data ingestion bucket in the nonproduction environment from the Terraform output:

  ```bash
  terraform -chdir="../gcp-projects/business_unit_4/nonproduction" init

  export data_ingestion_bucket=$(terraform -chdir="../gcp-projects/business_unit_4/nonproduction" output -json data_ingestion_buckets | jq -r '."domain-1"')
  ```
- Use the `gsutil cp` command to upload each encrypted CSV file to the specified GCS bucket:

  ```bash
  gsutil cp encrypted_data/sample-100-encrypted.csv ${data_ingestion_bucket}/sample-100-encrypted.csv
  gsutil cp encrypted_data/NewCust.csv ${data_ingestion_bucket}/NewCust.csv
  gsutil cp encrypted_data/UpdCust.csv ${data_ingestion_bucket}/UpdCust.csv
  ```
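  To confirm the uploads landed, you can list the bucket's contents; the three encrypted files should appear:

  ```bash
  # List the objects in the data ingestion bucket.
  gsutil ls ${data_ingestion_bucket}/
  ```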
- Commit the changes:

  ```bash
  git add .
  git commit -m 'Initialize data domain non confidential repo'
  ```
- Push your plan branch:

  ```bash
  git push --set-upstream origin plan
  ```
- Create a pull request from `plan` to `nonproduction` in your GitHub repository.

- Observe the plan in Cloud Build by going to: https://console.cloud.google.com/cloud-build/builds;region=us-central1?hl=en&project=[prj-c-bu4-infra-gh-cb-ID-HERE]
- Once the plan has been successfully satisfied, merge your request and view the Terraform apply here: https://console.cloud.google.com/cloud-build/builds;region=us-central1?hl=en&project=[prj-c-bu4-infra-gh-cb-ID-HERE]
In this part of the data ingestion process, data from a JSON file will be sent as Pub/Sub messages for further processing.
- Go to the `helpers/pubsub-job-emulator` folder and execute the following commands to initialize Terraform and look up the ingest project ID:

  ```bash
  cd helpers/pubsub-job-emulator

  terraform -chdir="../gcp-projects/business_unit_4/nonproduction" init

  export ingest_project_id=$(terraform -chdir="../gcp-projects/business_unit_4/nonproduction" output -json data_domain_ingestion_projects | jq -r '."domain-1".project_id')
  ```
- Read the README.md file in the pubsub-job-emulator directory to ensure your Python environment contains the requisite packages.
- Run the Python code to send the data:

  ```bash
  python3 simple-pubsub-job.py --cryptoKeyName $cryptokeyname --wrappedKey $wrappedkey --messages_file ../sample-generator/sample-100-raw.json --project_id $ingest_project_id --topic_id data_ingestion
  ```
  - This will send data from `sample-100-raw.json` as messages to the Pub/Sub topic `data_ingestion` within the specified project.
  - The `cryptoKeyName` and `wrappedKey` are used to ensure that the data is encrypted during the ingestion process.
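  If you want to double-check the target topic before or after sending, a minimal check with gcloud (assuming your session is authenticated against the same project) looks like this:

  ```bash
  # Describe the Pub/Sub topic the emulator publishes to; an error here usually
  # means the topic name or the ingest project ID is wrong.
  gcloud pubsub topics describe data_ingestion --project=${ingest_project_id}
  ```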
- Once the data arrives on the landing service (Pub/Sub), the Dataflow pipelines ingest the data. You can monitor the Pub/Sub payload jobs in the Pub/Sub Console:
  - Go to the `Metrics` tab for the specific topic to view general message metrics.
  - Use the `Metrics` tab on the subscription for detailed metrics.
  - To view messages in transit, click on the `Messages` tab and then the `PULL` button. (Note: Viewing messages in transit may delay processing.)
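  If you prefer the command line, you can also peek at a subscription with gcloud. The subscription ID below is a placeholder, so list the subscriptions in the ingest project first:

  ```bash
  # List subscriptions, then pull a single message without acknowledging it
  # (YOUR_SUBSCRIPTION_ID is a placeholder for one of the listed subscriptions).
  gcloud pubsub subscriptions list --project=${ingest_project_id}
  gcloud pubsub subscriptions pull YOUR_SUBSCRIPTION_ID --project=${ingest_project_id} --limit=1
  ```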
- In the Dataflow Console, monitor the pipeline jobs that ingest, decrypt, transform, and re-encrypt the data.
  - Click on a specific Dataflow job to view step-by-step processing details.
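  The same jobs can also be listed from the command line. This sketch assumes the pipelines run in the ingest project and in the us-central1 region used elsewhere in this guide; adjust both if your deployment differs:

  ```bash
  # List recent Dataflow jobs (project and region are assumptions; adjust as needed).
  gcloud dataflow jobs list --project=${ingest_project_id} --region=us-central1
  ```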
- After processing, the data lands in the BigQuery table within the `non-confidential` project.
  - To preview the data, use the `Preview` tab in the BigQuery console for a quick data view.

  NOTE: Executing direct BigQuery console queries on the `non-confidential` project is limited to the `DEVELOPMENT` environment. Querying data in the `NON-PRODUCTION` and `PRODUCTION` environments is done via the `CONSUMER` project.
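  For the development environment, a quick look from the command line might resemble the following; the project, dataset, and table names are placeholders, so substitute the ones created in your non-confidential project:

  ```bash
  # Preview the first rows of a table in the non-confidential project
  # (all identifiers below are placeholders).
  bq --project_id=[non-confidential-project-id-HERE] head -n 10 YOUR_DATASET.YOUR_TABLE
  ```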
- Only authorized users can view the masked and encrypted data fields. A user therefore needs to be in the appropriate group to view these fields. See the `Defined User Groups` section for details.
- A separate Dataflow pipeline moves data from the `non-confidential` project to the `confidential` project, where the data is decrypted and unmasked, existing in plain text format.
  - This pipeline can also be monitored in the Dataflow Console within the `confidential` project for step-by-step processing details.
- In the `confidential` project, the data is accessible only to specific full-access users. These users can query the data using a SELECT statement in BigQuery or view it through the Preview tab within the BigQuery table.
- Once done, cd out of this folder:

  ```bash
  cd ../../..
  ```
The information presented below is a reference for operators managing containers in a production environment.
In an operational environment, it is recommended to use the short SHA tags of images instead of environment-specific tags (e.g., `nonproduction`, `production`). Short SHAs provide a unique reference to each image version, enabling reliable traceability back to the specific commit used to build the image. This approach is particularly beneficial for managing production workloads, where consistent and repeatable deployments are essential.
For example, instead of using:
```hcl
dataflow_template_jobs = {
  "bq_to_bq" = {
    "image_name"        = "samples/reidentify_bq_to_bq:nonproduction"
    "template_filename" = "reidentify_bq_to_bq-nonproduction.json"
    "additional_parameters" = {
      batch_size = 1000
    }
  },
}
```
A production configuration might specify the image and template filename with a short SHA tag, such as:
```hcl
dataflow_template_jobs = {
  "bq_to_bq" = {
    "image_name"        = "samples/reidentify_bq_to_bq:123abc"
    "template_filename" = "reidentify_bq_to_bq-123abc.json"
    "additional_parameters" = {
      batch_size = 1000
    }
  },
}
```
By using a short SHA (`123abc` in this example) in place of the environment name, the job configuration is aligned with a specific version of the image. This approach not only improves traceability but also allows developers and operators to distinguish between multiple image versions, supporting effective version control and rollback strategies in production.
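How you obtain a short SHA depends on your build setup. Assuming the Cloud Build pipeline tags images with the commit's short SHA, the following sketch shows one way to find a usable tag; the Artifact Registry path is a placeholder:

```bash
# Short SHA of the current commit in the artifacts repository (assumes the build
# pipeline tags images with this value).
git rev-parse --short HEAD

# List published images and their tags to confirm which SHA tags exist
# (the repository path below is a placeholder; use your artifacts project's path).
gcloud artifacts docker images list \
  us-central1-docker.pkg.dev/[artifacts-project-id]/[docker-repo]/samples/reidentify_bq_to_bq \
  --include-tags
```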
For the purposes of this example, and to simplify the deployment process for multiple environments, the configurations default to using environment names (e.g., `nonproduction`). This setup streamlines initial testing and allows for faster iterations when managing multiple environments.
The instructions provided here demonstrate the actions needed for a production deployment. They are marked as optional because the process is time consuming and requires additional manual work. In the previous steps, `nonproduction` was included as an example of what would be done in an operational environment; the `production` environment, although also part of an operational environment, will increase the time required to complete the deployment process for this Data Mesh example.
- Gather the necessary variables:

  ```bash
  terraform -chdir="../gcp-projects/business_unit_4/production" init

  export data_ingestion_bucket=$(terraform -chdir="../gcp-projects/business_unit_4/production" output -json data_ingestion_buckets | jq -r '."domain-1"')

  echo "data_ingestion_bucket: ${data_ingestion_bucket}"
  ```
- Use the `gsutil cp` command to upload each encrypted CSV file to the specified GCS bucket:

  ```bash
  gsutil cp encrypted_data/sample-100-encrypted.csv ${data_ingestion_bucket}/sample-100-encrypted.csv
  gsutil cp encrypted_data/NewCust.csv ${data_ingestion_bucket}/NewCust.csv
  gsutil cp encrypted_data/UpdCust.csv ${data_ingestion_bucket}/UpdCust.csv
  ```
- Create a pull request from `nonproduction` to `production` in your GitHub repository.

- Observe the plan in Cloud Build by going to: https://console.cloud.google.com/cloud-build/builds;region=us-central1?hl=en&project=[prj-c-bu4-infra-gh-cb-ID-HERE]
- Once your plan has been satisfied, merge your request and view the Terraform apply here: https://console.cloud.google.com/cloud-build/builds;region=us-central1?hl=en&project=[prj-c-bu4-infra-gh-cb-ID-HERE]
This process ingests data in the production environment. As stated above, this is an optional step: the process is time consuming and requires additional manual work, increasing the time required to complete the deployment of this Data Mesh example.
- Go to the `helpers/pubsub-job-emulator` folder and execute the following commands to initialize Terraform and look up the production ingest project ID:

  ```bash
  cd helpers/pubsub-job-emulator

  terraform -chdir="../gcp-projects/business_unit_4/production" init

  export ingest_project_id=$(terraform -chdir="../gcp-projects/business_unit_4/production" output -json data_domain_ingestion_projects | jq -r '."domain-1".project_id')

  echo "ingest_project_id: ${ingest_project_id}"
  ```
- Read the README.md file in the pubsub-job-emulator directory to ensure your Python environment contains the requisite packages.
- Run the Python code to send the data:

  ```bash
  python3 simple-pubsub-job.py --cryptoKeyName $cryptokeyname --wrappedKey $wrappedkey --messages_file ../sample-generator/sample-100-raw.json --project_id $ingest_project_id --topic_id data_ingestion
  ```
  - This will send data from `sample-100-raw.json` as messages to the Pub/Sub topic `data_ingestion` within the specified project.
  - The `cryptoKeyName` and `wrappedKey` are used to ensure that the data is encrypted during the ingestion process.
- Once the data arrives on the landing service (Pub/Sub), the Dataflow pipelines ingest the data. You can monitor the Pub/Sub payload jobs in the Pub/Sub Console:
  - Go to the `Metrics` tab for the specific topic to view general message metrics.
  - Use the `Metrics` tab on the subscription for detailed metrics.
  - To view messages in transit, click on the `Messages` tab and then the `PULL` button. (Note: Viewing messages in transit may delay processing.)
- In the Dataflow Console, monitor the pipeline jobs that ingest, decrypt, transform, and re-encrypt the data.
  - Click on a specific Dataflow job to view step-by-step processing details.
- After processing, the data lands in the BigQuery table within the `non-confidential` project.
  - To query and view the data, you can run a `SELECT` statement in the BigQuery console or use the `Preview` tab for a quick data view.
- Certain fields in the data are masked or encrypted. Only authorized users with specific access can view this sensitive data.
- A separate Dataflow pipeline moves data from the `non-confidential` project to the `confidential` project, where the data is decrypted and unmasked, existing in plain text format.
  - This pipeline can also be monitored in the Dataflow Console within the `confidential` project for step-by-step processing details.
- In the `confidential` project, the data is accessible only to specific full-access users. These users can query the data using a SELECT statement in BigQuery or view it through the Preview tab within the BigQuery table.
Once done, cd out of this folder:

```bash
cd ../../..
```