-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add optionnal param "PathRejects" in appendWithoutDuplicates for ignored rows #49
Comments
@ilyasse05 - Thanks for opening the issue, would you like to contribute? I think initially we could persist only the rows that were discarded, the rejection path could be a delta table with generic columns, something like this, columns:
@MrPowers - Thoughts on this? |
@brayanjuls why not but i am not expert in scala langage. For columns, i think for performance, we have to keep the same columns from table source + technical column, it will be helpful if we need to recycle discarded rows and also add partition column like the target delta table. |
@ilyasse05 - Sorry for taking too long to response. My inicial proposal was to keep the columns of the source table as a Struct and have a single dead letter table for all the tables, but rethinking it again It would be hard to mantain multiple schemas in a single table that will be intended for analysis and reprocessing, so we could have one "dead letter table" per target table. Example, if the target table have the following schema root
|-- pkey: integer (nullable = true)
|-- attr1: string (nullable = true)
|-- attr2: string (nullable = true)
|-- is_current: boolean (nullable = true)
|-- effective_time: timestamp (nullable = true)
|-- end_time: timestamp (nullable = true) Then the dead letter table will have the following schema(the prefix dl in the columns stands for dead letter), root
|-- pkey: integer (nullable = true)
|-- attr1: string (nullable = true)
|-- attr2: string (nullable = true)
|-- is_current: boolean (nullable = true)
|-- effective_time: timestamp (nullable = true)
|-- end_time: timestamp (nullable = true)
|-- dl_function_name: string (nullable = false)
|-- dl_origin_table: string (nullable = false)
|-- dl_timestamp: timestamp (nullable = false) |
@brayanjuls yes that exactly what we need to do, for the path we create the same table, with the same columns + technical columns like you suggested. |
@ilyasse05 - do you want to send a pull request ? |
I think it will be interesting to add optionnal parameter ["PathRejects"], to write deduplicated rows, if we need to do some analyse of DataQuality when we have DuplicatedRow from source.
And also return count of rows inserted, Updates, rejected.
Originally posted by @ilyasse05 in #47 (comment)
The text was updated successfully, but these errors were encountered: