The pipeline [1] splits a text file on the delimiter "ABC". For the input text "ABABCD", the expected result is ["AB", "D"], but the actual pipeline result is ["ABABCD", "D"]. See [2] for the result with DirectRunner.
I suspect the root cause is the delimiter matching in TextSource. It processes the input text "ABABCD" as follows and therefore fails to match the delimiter "ABC":
"A": "A" == delimiter[0] (= "A"), set delPosn to 1
"B": "B" == delimiter[1] (= "B"), set delPosn to 2
"A": "A" != delimiter[2] (= "C"), set delPosn to 0 <-- This is wrong. delPosn should be 1 as "A" matches delimiter[0]
"B": "B" != delimiter[0] (= "A"), set delPosn to 0
"C": "C" != delimiter[0] (= "A"), set delPosn to 0
"D": "D" != delimiter[0] (= "A"), set delPosn to 0
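The trace above can be reproduced with a small stand-alone matcher. This is only a sketch of the behaviour described in the trace, not the actual TextSource code; the class name `NaiveDelimiterMatch` and the method `find` are hypothetical, and the delimiter is assumed to be non-empty.

```java
// Hypothetical stand-alone sketch of the matching described in the trace;
// it is NOT the actual TextSource implementation.
class NaiveDelimiterMatch {
    // Returns the index just past the first delimiter match, or -1 if none is found.
    // Assumes a non-empty delimiter.
    static int find(String text, String delimiter) {
        int delPosn = 0; // number of delimiter characters matched so far
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == delimiter.charAt(delPosn)) {
                delPosn++;
                if (delPosn == delimiter.length()) {
                    return i + 1;
                }
            } else {
                // Discards the whole partial match, even when text[i]
                // could itself be the start of a new match.
                delPosn = 0;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(find("ABABCD", "ABC")); // delimiter is missed
        System.out.println(find("ABCD", "ABC"));   // delimiter is found
    }
}
```

With this reset rule, the "A" at index 2 is thrown away, so the "ABC" that starts there is never recognized.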
I think this is similar to a regex matching problem (e.g. delimiter "ABCABCABD" and input text "...ABCABCABC..."). It may need to track multiple delPosns for partial matches.
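One common way to keep track of partial matches without storing several positions is a Knuth-Morris-Pratt style failure table. The following is a minimal sketch under that assumption, not a proposed patch for TextSource; `KmpSplit`, `failureTable` and `split` are hypothetical names, it works on `String` rather than the byte buffers a real source would read, and it assumes a non-empty delimiter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splitting on a delimiter with a KMP failure table.
class KmpSplit {
    // fail[i] = length of the longest proper prefix of d[0..i]
    // that is also a suffix of d[0..i].
    static int[] failureTable(String d) {
        int[] fail = new int[d.length()];
        int k = 0;
        for (int i = 1; i < d.length(); i++) {
            while (k > 0 && d.charAt(i) != d.charAt(k)) {
                k = fail[k - 1];
            }
            if (d.charAt(i) == d.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Splits text on delimiter; a trailing empty record is not emitted.
    static List<String> split(String text, String delimiter) {
        int[] fail = failureTable(delimiter);
        List<String> records = new ArrayList<>();
        int start = 0;   // start of the current record
        int delPosn = 0; // number of delimiter characters matched so far
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // On mismatch, fall back to the longest shorter partial match
            // instead of resetting to 0.
            while (delPosn > 0 && c != delimiter.charAt(delPosn)) {
                delPosn = fail[delPosn - 1];
            }
            if (c == delimiter.charAt(delPosn)) {
                delPosn++;
            }
            if (delPosn == delimiter.length()) {
                records.add(text.substring(start, i + 1 - delimiter.length()));
                start = i + 1;
                delPosn = 0;
            }
        }
        if (start < text.length()) {
            records.add(text.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(split("ABABCD", "ABC"));
        System.out.println(split("ABABABCD", "ABABC"));
    }
}
```

The fallback step is what lets a single `delPosn` stand in for all partial matches that are still alive at the current character.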
Aug 19, 2024 11:30:43 PM org.apache.beam.sdk.io.FileBasedSource getEstimatedSizeBytes
INFO: Filepattern short.csv matched 1 files with total size 6
Aug 19, 2024 11:30:43 PM org.apache.beam.sdk.io.FileBasedSource split
INFO: Splitting filepattern short.csv into bundles of size 0 took 1 ms and produced 1 files and 6 bundles
Aug 19, 2024 11:30:43 PM baeminbo.TextReadJob$1 processElement
INFO: input: <ABABCD>
Aug 19, 2024 11:30:43 PM baeminbo.TextReadJob$1 processElement
INFO: input: <D>
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components
Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam YAML
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Infrastructure
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner
I'm afraid that #32298 does not fix this issue for a delimiter that contains a "repeated pattern".
For example, take the delimiter "ABABC" and the text "ABABABCD". The expected result is ["AB", "D"], but the actual result with the fix in #32298 is ["ABABABCD", "D"].
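With the fix, the "A" at index 4 of the text is treated as a match for the "A" at index 0 of the delimiter.
In the fix:
01234567
ABABABCD
    ^
    ABABC
However, it must be matched against the "A" at index 2 of the delimiter, because the preceding "AB" at indices 2-3 is still a valid partial match.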
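The failure mode described in this comment can be illustrated with a sketch of a "restart at index 0 on mismatch" rule. This assumes the fix behaves as described above; it is not the actual code of #32298, and `ResetToFirstCharMatch` and `find` are hypothetical names.

```java
// Hypothetical sketch of the behaviour attributed to the fix in this comment;
// it is NOT the actual change made in #32298.
class ResetToFirstCharMatch {
    // Returns the index just past the first delimiter match, or -1 if none is found.
    // Assumes a non-empty delimiter.
    static int find(String text, String delimiter) {
        int delPosn = 0; // number of delimiter characters matched so far
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == delimiter.charAt(delPosn)) {
                delPosn++;
                if (delPosn == delimiter.length()) {
                    return i + 1;
                }
            } else {
                // On mismatch, only checks whether c restarts the delimiter
                // at index 0; longer surviving partial matches are lost.
                delPosn = (c == delimiter.charAt(0)) ? 1 : 0;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(find("ABABCD", "ABC"));     // original case is found
        System.out.println(find("ABABABCD", "ABABC")); // repeated pattern is missed
    }
}
```

Under this rule the original "ABC" case works, but for "ABABC" the matcher restarts with one character matched at index 4 when three characters ("ABA") are in fact already matched, so the delimiter is missed.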