This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Add a multi_value_processor plugin to the tide_data_pipeline module to handle multiple values. #85

Merged
merged 4 commits into from
Aug 2, 2024

Conversation

vincent-gao

@vincent-gao vincent-gao commented Jul 18, 2024

Jira

https://digital-vic.atlassian.net/browse/SDPAP-9391

Change

  1. This PR adds a new DatasetTransform plugin called multi_value_processor. The plugin splits a delimited string into an array and optionally applies a callback function to each resulting value.
  2. This PR introduces a custom ElasticSearch destination plugin (TideElasticSearchDestination) that extends the base ElasticSearchDestination plugin. The main improvements are:
    • Flexible index_id generation:
      • Adds support for a hash prefix in addition to the existing prefix.
      • Implements a new getFullIndexId() method that generates the index_id based on the presence of hash prefix and regular prefix.
    • Custom configuration:
      • Adds a new 'hash_prefix' configuration option.
    • Improved cleanup process:
      • Updates the processCleanup() method to use the new index_id generation logic. For more details, see the Slack conversation.
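
As a rough illustration of the index_id generation described in item 2, the Python sketch below builds an id from whichever prefixes are set. It is not the plugin's PHP code: the function name `full_index_id`, the ordering of the parts, and the `--` separator are all assumptions made for illustration, since the PR description does not spell them out.

```python
def full_index_id(index_id, prefix="", hash_prefix="", separator="--"):
    """Hypothetical sketch: join the non-empty parts into one index id.

    Assumed order: hash prefix, regular prefix, then the base index id.
    The separator is also an assumption, not taken from the plugin.
    """
    parts = [hash_prefix, prefix, index_id]
    # Skip any prefix that is not configured (empty string).
    return separator.join(part for part in parts if part)


# No prefixes configured: the base id is returned unchanged.
print(full_index_id("suburbs"))
# Both prefixes configured: all three parts are joined.
print(full_index_id("suburbs", prefix="sdp", hash_prefix="abc123"))
```

The point of the sketch is only that the same function must be used both when writing documents and during cleanup, so that deletes target the index that was actually written to.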

Related

This change has already been applied in https://github.com/dpc-sdp/content-solar-vic-gov-au/pull/148, which can be used as a reference.

Note on implementation approach

We've implemented this as a custom plugin rather than submitting a patch to the data_pipelines module. The hash prefix functionality is specific to SDP requirements and is not general enough to be contributed upstream.

Example

pipeline_with_str_replace:
  label: 'Multiple Value Processor with str_replace'
  transforms:
    field:
      Suburbs:
        - plugin: multi_value_processor
          separator: ';'
          callback: str_replace
          parameters:
            - 'a'
            - 'A'
          value_position: 2

pipeline_with_substr:
  label: 'Multiple Value Processor with substr'
  transforms:
    field:
      Suburbs:
        - plugin: multi_value_processor
          separator: ';'
          callback: substr
          parameters:
            - '0'
            - 'true'
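
To make the configuration keys above more concrete, here is a minimal Python sketch of how such a transform might behave. It is not the plugin's PHP implementation. It assumes that `value_position` is the zero-based position at which each split value is inserted among `parameters` when the callback is called, and that it defaults to 0; the helper `str_replace` below is a stand-in that mimics the argument order of PHP's `str_replace(search, replace, subject)`.

```python
def multi_value_processor(value, separator, callback=None, parameters=(), value_position=0):
    """Split a delimited string and optionally apply a callback to each item.

    Assumption: value_position is the zero-based index at which the item is
    inserted into the callback's argument list, alongside `parameters`.
    """
    items = value.split(separator)
    if callback is None:
        return items
    results = []
    for item in items:
        args = list(parameters)
        args.insert(value_position, item)  # place the item among the fixed parameters
        results.append(callback(*args))
    return results


def str_replace(search, replace, subject):
    """Stand-in with the same argument order as PHP's str_replace."""
    return subject.replace(search, replace)


# Mirrors pipeline_with_str_replace: the value is the third argument (position 2).
print(multi_value_processor("carlton;parkville", ";", str_replace, ["a", "A"], 2))
# prints ['cArlton', 'pArkville']
```

Under these assumptions, a callback that takes the value as its first argument, such as `substr` in the second pipeline, would not need `value_position` at all.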

@vincent-gao vincent-gao force-pushed the feature/add-mutiple_values-plugin-for-pipeline branch 4 times, most recently from a8aba94 to 79ec01d Compare July 19, 2024 12:16
@vincent-gao vincent-gao changed the title adds mutiple_value_processor plugin for datapipe Add a multiple_value_processor plugin to the tide_data_pipeline module to handle multiple values. Jul 19, 2024
@vincent-gao vincent-gao changed the title Add a multiple_value_processor plugin to the tide_data_pipeline module to handle multiple values. Add a multiple_value_processor plugin to the tide_data_pipeline module to handle multiple values. Jul 19, 2024
@vincent-gao vincent-gao force-pushed the feature/add-mutiple_values-plugin-for-pipeline branch from 79ec01d to de3a112 Compare July 19, 2024 12:26
@vincent-gao vincent-gao force-pushed the feature/add-mutiple_values-plugin-for-pipeline branch from de3a112 to 2886a8c Compare July 19, 2024 13:10
@vincent-gao vincent-gao changed the title Add a multiple_value_processor plugin to the tide_data_pipeline module to handle multiple values. Add a multi_value_processor plugin to the tide_data_pipeline module to handle multiple values. Jul 21, 2024
@vincent-gao vincent-gao self-assigned this Jul 21, 2024
@vincent-gao vincent-gao force-pushed the feature/add-mutiple_values-plugin-for-pipeline branch 14 times, most recently from 76323cf to 446a32f Compare July 22, 2024 06:52
@vincent-gao vincent-gao force-pushed the feature/add-mutiple_values-plugin-for-pipeline branch 2 times, most recently from 40d96dc to 9cebc0b Compare July 22, 2024 09:28
@vincent-gao vincent-gao force-pushed the feature/add-mutiple_values-plugin-for-pipeline branch from 9468c3a to 5dfaf70 Compare July 22, 2024 22:57
@vincent-gao

vincent-gao commented Aug 1, 2024

All unit/kernel tests have passed. The failure is only because our DevOps team is still debugging why phpdbg doesn't work.


@anthony-malkoun anthony-malkoun left a comment

We've discussed the approach, so I'm happy to approve this. Good find with the delete issues.

@anthony-malkoun

@vincent-gao are we doing anything about the builds failing and removing the circle jobs?

@vincent-gao

vincent-gao commented Aug 1, 2024

> @vincent-gao are we doing anything about the builds failing and removing the circle jobs?

Yes. We are currently working on a mono-repo; should we remove the .circle directory in the mono-repo?
I will also disable the coverage test for now until Guy fixes the phpdbg issue.

@vincent-gao vincent-gao force-pushed the feature/add-mutiple_values-plugin-for-pipeline branch from 332a3ae to 49864df Compare August 1, 2024 05:07
@vincent-gao vincent-gao merged commit 74bd42f into develop Aug 2, 2024
2 of 6 checks passed