ODD Collector GCP is a lightweight service which gathers metadata from all your Google Cloud Platform data sources.
To learn more about collector types and ODD Platform's architecture, read the documentation.
By default, the collector uses the process described in https://google.aip.dev/auth/4110 to resolve credentials.
If not running on Google Cloud Platform (GCP), this generally requires the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to a JSON file containing credentials.
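For example, on a machine outside GCP you can export the variable before starting the collector; the key file path below is a placeholder:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json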
type: bigquery_storage
name: bigquery_storage
project: <any_project_name>
datasets_filter: # Optional; if not provided, all datasets from the project will be fetched
  include: [ <patterns_to_include> ] # List of dataset name patterns to include
  exclude: [ <patterns_to_exclude> ] # List of dataset name patterns to exclude
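For example, a bigquery_storage entry that ingests only datasets whose names start with prod_ could look like this in collector-config.yaml (the project name and patterns are placeholders):

plugins:
  - type: bigquery_storage
    name: bigquery_storage
    project: my-gcp-project
    datasets_filter:
      include: [ 'prod_.*' ]
      exclude: [ 'prod_tmp_.*' ]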
type: bigtable
name: bigtable
project: <any_project_name>
rows_limit: 10 # Number of rows to scan per table; column types are inferred from the combination of value types found in the first N rows.
type: gcs
name: gcs_adapter
project: <any_project_name>
filename_filter: # Optional. By default, every file is ingested into the platform.
  include: [ '.*.parquet' ]
  exclude: [ 'dev_.*' ]
parameters: # Optional set of parameters; default values are shown below.
  anonymous: bool = False, # Whether to connect anonymously. If True, credentials are not looked up via standard GCP configuration methods.
  access_token: str = None, # GCP access token. If provided, temporary credentials are fetched using this token; credential_token_expiration must also be specified.
  target_service_account: str = None, # Optional service account to impersonate when accessing GCS. The authenticated user or service account must have permission to impersonate it.
  credential_token_expiration: datetime = None, # Datetime in the format "2023-12-31 23:59:59". Expiration for credentials generated from an access token. Must be specified if access_token is specified.
  default_bucket_location: str = "US", # GCP region to create buckets in.
  scheme: str = "https", # GCS connection transport scheme.
  endpoint_override: str = None, # Override the endpoint with a connection string such as "localhost:9000".
  default_metadata: mapping or pyarrow.KeyValueMetadata = None, # Default metadata for open_output_stream. Ignored if non-empty metadata is passed to open_output_stream.
  retry_time_limit: timedelta = None, # Timedelta in seconds. Maximum amount of time the GCS client will spend retrying transient errors. Subsecond granularity is ignored.
datasets:
  # Recursive fetch of all objects in the bucket.
  - bucket: my_bucket
  # Explicitly specify the prefix to a file.
  - bucket: my_bucket
    prefix: folder/subfolder/file.csv
  # Use a folder as a dataset. Very useful for partitioned datasets,
  # e.g. a Hive-partitioned dataset with a structure like this:
  # gs://my_bucket/partitioned_data/year=2019/month=01/...
  - bucket: my_bucket
    prefix: partitioned_data/
    folder_as_dataset:
      file_format: parquet
      flavor: hive
  # field_names must be provided if a partition flavor is not used, e.g. for a structure like this:
  # gs://my_bucket/partitioned_data/year/...
  - bucket: my_bucket
    prefix: partitioned_data/
    folder_as_dataset:
      file_format: csv
      field_names: ['year']
type: gcs_delta
name: gcs_delta_adapter
project: <any_project_name>
parameters: # Optional set of parameters; default values are shown below.
  anonymous: bool = False, # Whether to connect anonymously. If True, credentials are not looked up via standard GCP configuration methods.
  access_token: str = None, # GCP access token. If provided, temporary credentials are fetched using this token; credential_token_expiration must also be specified.
  target_service_account: str = None, # Optional service account to impersonate when accessing GCS. The authenticated user or service account must have permission to impersonate it.
  credential_token_expiration: datetime = None, # Datetime in the format "2023-12-31 23:59:59". Expiration for credentials generated from an access token. Must be specified if access_token is specified.
  default_bucket_location: str = "US", # GCP region to create buckets in.
  scheme: str = "https", # GCS connection transport scheme.
  endpoint_override: str = None, # Override the endpoint with a connection string such as "localhost:9000".
  default_metadata: mapping or pyarrow.KeyValueMetadata = None, # Default metadata for open_output_stream. Ignored if non-empty metadata is passed to open_output_stream.
  retry_time_limit: timedelta = None, # Timedelta in seconds. Maximum amount of time the GCS client will spend retrying transient errors. Subsecond granularity is ignored.
delta_tables: # Explicitly specify the bucket and prefix to the file.
  - bucket: str = "bucket_name"
    prefix: str = "folder/subfolder/file.csv"
    filter: # Optional, default values below
      include: list[str] = [".*"] # List of patterns to include.
      exclude: list[str] = None # List of patterns to exclude.
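A filled-in gcs_delta entry might look like the following sketch; the project, bucket, and prefix are placeholders, and the prefix is assumed to point at the Delta table's location in the bucket:

plugins:
  - type: gcs_delta
    name: gcs_delta_adapter
    project: my-gcp-project
    delta_tables:
      - bucket: my_bucket
        prefix: delta/sales_table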
To build the collector image locally, run the build from the repository root:
docker build .
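If you want to reference the image from docker-compose.yaml below, you can also tag it during the build; the tag name here is arbitrary:

docker build -t odd-collector-gcp .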
Because Plugin inherits from pydantic.BaseSettings, any field missing from collector-config.yaml can be taken from environment variables.
Custom .env file for docker-compose.yaml
GOOGLE_APPLICATION_CREDENTIALS=
PLATFORM_HOST_URL=
Custom collector-config.yaml
default_pulling_interval: 10
token: "<CREATED_COLLECTOR_TOKEN>"
platform_host_url: "http://localhost:8080"
plugins:
  - type: bigquery_storage
    name: bigquery_storage
    project: opendatadiscovery
docker-compose.yaml
version: "3.8"
services:
  # --- ODD Platform ---
  database:
    ...
  odd-platform:
    ...
  odd-collector-gcp:
    image: 'image_name'
    restart: always
    volumes:
      - ./collector-config.yaml:/app/collector_config.yaml
    environment:
      - GOOGLE_APPLICATION_CREDENTIALS=${GOOGLE_APPLICATION_CREDENTIALS}
      - PLATFORM_HOST_URL=${PLATFORM_HOST_URL}
    depends_on:
      - odd-platform
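With collector-config.yaml, the .env file, and the image in place, the whole stack can then be started with Docker Compose, for example:

docker-compose up -d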