Parallel scan in Full Table strategy #46
Open
Description of change
According to the AWS documentation (linked under Additional info), a DynamoDB table can be scanned in parallel. This is useful for large tables because, by default, the Scan operation processes data sequentially and returns it to the application in 1 MB increments.
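To make the 1 MB paging behaviour concrete, here is a minimal sketch of a sequential scan that keeps requesting pages until DynamoDB stops returning a `LastEvaluatedKey`. It assumes `client` is a boto3 DynamoDB client (for example `boto3.client("dynamodb")`); the helper name `scan_all_pages` is hypothetical and is not taken from the tap's code.

```python
from typing import Any, Dict, Iterator, Optional


def scan_all_pages(client: Any, table_name: str, **extra: Any) -> Iterator[Dict[str, Any]]:
    """Yield every item of a table, one Scan page at a time.

    `client` is assumed to expose boto3's DynamoDB `scan` call.
    `scan_all_pages` is an illustrative helper, not part of the tap.
    """
    start_key: Optional[Dict[str, Any]] = None
    while True:
        kwargs: Dict[str, Any] = {"TableName": table_name, **extra}
        if start_key is not None:
            # Resume where the previous page stopped.
            kwargs["ExclusiveStartKey"] = start_key
        page = client.scan(**kwargs)
        yield from page.get("Items", [])
        # A missing LastEvaluatedKey means the scan is complete.
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            break
```

On a large table this loop makes many round trips one after another, which is the cost a parallel scan is meant to reduce.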
Manual QA steps
To run the tap in parallel, set the following attributes as environment variables:
parallel_segment: the segment ID.
parallel_totalsegments: the total number of segments.
Risks
Rollback steps
Additional info
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.ParallelScan
From the AWS documentation: the Scan operation can logically divide a table or secondary index into multiple segments, with multiple application workers scanning the segments in parallel. Each worker can be a thread.
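The segment mechanism can be sketched as follows. This is an illustration under stated assumptions, not the tap's implementation: `client` is assumed to be a boto3 DynamoDB client, the helper names (`segment_params`, `scan_segment`, `parallel_scan`) are hypothetical, and reading `parallel_segment` and `parallel_totalsegments` straight from the environment is a guess at how the tap picks them up. `Segment` and `TotalSegments` are the actual Scan parameters, and `Segment` is zero-based.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List, Mapping


def segment_params(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Translate the two variables into Scan arguments.

    Variable names follow the PR description; how the tap really
    reads them is an assumption made for this sketch.
    """
    if "parallel_segment" not in env or "parallel_totalsegments" not in env:
        return {}  # nothing set: fall back to a plain sequential scan
    segment = int(env["parallel_segment"])
    total = int(env["parallel_totalsegments"])
    if not 0 <= segment < total:
        raise ValueError("parallel_segment must be >= 0 and < parallel_totalsegments")
    return {"Segment": segment, "TotalSegments": total}


def scan_segment(client: Any, table_name: str, segment: int,
                 total_segments: int) -> List[Dict[str, Any]]:
    """Scan one segment to completion, following LastEvaluatedKey."""
    items: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {
        "TableName": table_name,
        "Segment": segment,
        "TotalSegments": total_segments,
    }
    while True:
        page = client.scan(**kwargs)
        items.extend(page.get("Items", []))
        if not page.get("LastEvaluatedKey"):
            return items
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]


def parallel_scan(client: Any, table_name: str,
                  total_segments: int) -> List[Dict[str, Any]]:
    """Run one worker thread per segment and merge the results."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(scan_segment, client, table_name, s, total_segments)
            for s in range(total_segments)
        ]
        return [item for f in futures for item in f.result()]
```

`parallel_scan` shows the thread-per-segment variant described in the quoted documentation. Running the tap once per segment, as this PR proposes, corresponds to each process calling `scan_segment` with its own `Segment` value and a shared `TotalSegments`.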
To run a parallel scan, we need to run multiple executions of the tap, each with a different parallel_segment value but the same parallel_totalsegments value.
So, for example:
Execution 1:
Execution 2: