Add optional param "PathRejects" in appendWithoutDuplicates for ignored rows #49

Open
ilyasse05 opened this issue Feb 15, 2023 · 5 comments
Labels
good first issue Good for newcomers

Comments

@ilyasse05

I think it would be useful to add an optional parameter, "PathRejects", that writes the duplicate rows ignored by the append to a separate path, so that we can run data-quality analysis when the source contains duplicated rows.

It should also return the counts of rows inserted, updated, and rejected.
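
To illustrate the request, here is a minimal sketch in plain Scala collections, not Spark and not the library's actual implementation. The names `AppendResult` and `splitByKey` are hypothetical, and the sketch assumes a row is rejected when its key already exists in the target or appeared earlier in the same batch. Updates are left out because an append-only operation would not produce any.

```scala
// Hypothetical result type: how many rows were appended and how many rejected.
final case class AppendResult(inserted: Long, rejected: Long)

object AppendSketch {
  // Splits incoming rows into rows to append and rows to reject.
  // Assumption: a row is rejected when its key is already in the target
  // or was already seen earlier in the same batch.
  def splitByKey[K, R](
      existingKeys: Set[K],
      incoming: Seq[R]
  )(key: R => K): (Seq[R], Seq[R], AppendResult) = {
    val (_, accepted, rejected) =
      incoming.foldLeft((existingKeys, Vector.empty[R], Vector.empty[R])) {
        case ((seen, acc, rej), row) =>
          val k = key(row)
          if (seen.contains(k)) (seen, acc, rej :+ row) // duplicate: reject
          else (seen + k, acc :+ row, rej)              // new key: accept
      }
    (accepted, rejected, AppendResult(accepted.size.toLong, rejected.size.toLong))
  }
}
```

With something like this, the rejected rows would be what gets written to "PathRejects", and the `AppendResult` would be the value returned to the caller.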

Originally posted by @ilyasse05 in #47 (comment)

@brayanjuls
Collaborator

@ilyasse05 - Thanks for opening the issue. Would you like to contribute?

I think initially we could persist only the rows that were discarded. The rejection path could be a Delta table with generic columns, something like this:

columns:

  • functionname:String
  • origintable:String
  • data:Struct
  • timestamp:Timestamp

@MrPowers - Thoughts on this?

@ilyasse05
Author

@brayanjuls Why not, but I am not an expert in the Scala language.

For the columns, I think that for performance we should keep the same columns as the source table plus technical columns. That would be helpful if we need to recycle discarded rows. We should also partition the rejects table by the same column as the target Delta table.

@brayanjuls
Collaborator

@ilyasse05 - Sorry for taking so long to respond. My initial proposal was to keep the columns of the source table as a Struct and have a single dead letter table for all the tables. On rethinking it, though, it would be hard to maintain multiple schemas in a single table that is intended for analysis and reprocessing, so we could have one "dead letter table" per target table.

For example, if the target table has the following schema:

root
 |-- pkey: integer (nullable = true)
 |-- attr1: string (nullable = true)
 |-- attr2: string (nullable = true)
 |-- is_current: boolean (nullable = true)
 |-- effective_time: timestamp (nullable = true)
 |-- end_time: timestamp (nullable = true)

Then the dead letter table will have the following schema (the prefix dl in the column names stands for dead letter):

root
 |-- pkey: integer (nullable = true)
 |-- attr1: string (nullable = true)
 |-- attr2: string (nullable = true)
 |-- is_current: boolean (nullable = true)
 |-- effective_time: timestamp (nullable = true)
 |-- end_time: timestamp (nullable = true)
 |-- dl_function_name: string (nullable = false)
 |-- dl_origin_table: string (nullable = false)
 |-- dl_timestamp: timestamp (nullable = false)
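
The derivation of the second schema from the first can be sketched as follows. This is only an illustration in plain Scala: `Column` is a hypothetical stand-in for a Spark schema field, and `DeadLetterSchema`, `forTarget`, and `render` are made-up names, not part of the library. The name-clash check is also an assumption about how a conflict with an existing `dl_` column might be handled.

```scala
// Hypothetical, simplified stand-in for one field of a Spark schema.
final case class Column(name: String, dataType: String, nullable: Boolean)

object DeadLetterSchema {
  // The three technical columns proposed above, all non-nullable.
  val technicalColumns: Seq[Column] = Seq(
    Column("dl_function_name", "string", nullable = false),
    Column("dl_origin_table", "string", nullable = false),
    Column("dl_timestamp", "timestamp", nullable = false)
  )

  // Dead letter schema = target columns followed by the technical columns.
  // Fails fast if the target already uses one of the dl_ names.
  def forTarget(target: Seq[Column]): Seq[Column] = {
    val clashes = target.map(_.name).intersect(technicalColumns.map(_.name))
    require(clashes.isEmpty, s"Target already has columns: ${clashes.mkString(", ")}")
    target ++ technicalColumns
  }

  // Renders the schema in the same tree layout used in this comment.
  def render(schema: Seq[Column]): String =
    ("root" +: schema.map(c =>
      s" |-- ${c.name}: ${c.dataType} (nullable = ${c.nullable})"
    )).mkString("\n")
}
```

Appending the technical columns at the end keeps the target columns in their original order, which should make it straightforward to select them back out when reprocessing rejected rows.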

@ilyasse05
Author

ilyasse05 commented Mar 2, 2023

@brayanjuls Yes, that is exactly what we need to do. For the path, we create the same table, with the same columns plus the technical columns you suggested.

@brayanjuls
Collaborator

@ilyasse05 - Do you want to send a pull request?

@brayanjuls brayanjuls added the good first issue Good for newcomers label Nov 9, 2023