[Bug]: TextIO.read() with non-default delimiter doesn't split at the right position #32251

baeminbo · 2024-08-20T06:53:46Z

What happened?

The pipeline [1] splits a text file on the delimiter "ABC". For the input text "ABABCD", the expected result is ["AB", "D"], but the actual pipeline result is ["ABABCD", "D"]. See [2] for the output with DirectRunner.

I suspect the root cause is the delimiter matching in TextSource. It processes the input text "ABABCD" as follows, and so fails to match the delimiter "ABC":

"A": "A" == delimiter[0] (= "A"), set delPosn to 1
"B": "B" == delimiter[1] (= "B"), set delPosn to 2
"A": "A" != delimiter[2] (= "C"), set delPosn to 0 <-- This is wrong. delPosn should be 1 because "A" matches delimiter[0]
"B": "B" != delimiter[0] (= "A"), set delPosn to 0
"C": "C" != delimiter[0] (= "A"), set delPosn to 0
"D": "D" != delimiter[0] (= "A"), set delPosn to 0
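
The trace above can be reproduced with a minimal sketch. This is a simplified model of the single-index scan described in the trace, not the actual TextSource code; the class name `NaiveDelimiterMatch` and the method `findEndOfDelimiter` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified model of the matching traced above; not Beam code.
final class NaiveDelimiterMatch {

  /** Returns the index just past the first delimiter match the naive scan finds, or -1. */
  static int findEndOfDelimiter(byte[] text, byte[] delimiter) {
    int delPosn = 0; // number of delimiter bytes matched so far
    for (int i = 0; i < text.length; i++) {
      if (text[i] == delimiter[delPosn]) {
        delPosn++;
        if (delPosn == delimiter.length) {
          return i + 1;
        }
      } else {
        // Problematic reset: the current byte is never re-checked against delimiter[0].
        delPosn = 0;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    byte[] text = "ABABCD".getBytes(StandardCharsets.UTF_8);
    byte[] delimiter = "ABC".getBytes(StandardCharsets.UTF_8);
    // The delimiter occupies indices 2..4 of the text, but the scan misses it.
    System.out.println(findEndOfDelimiter(text, delimiter)); // prints -1
  }
}
```

The same scan does find the delimiter when no partial match precedes it, for example in "XABCD", which is why the bug only shows up for inputs like "ABABCD".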

I think this is similar to a regex matching problem. For example, with the delimiter "ABCABCABD" and the input text "...ABCABCABC...", the matcher may need to track multiple delPosn values for partial matches.
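
One possible way to cover overlapping partial matches with a single index is the prefix table used by the Knuth-Morris-Pratt algorithm. The sketch below is an illustration under that assumption, not Beam code; `PrefixTable` and `build` are hypothetical names. For each delimiter position it records how many leading delimiter bytes are still matched after a mismatch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Beam.
final class PrefixTable {

  /** table[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix. */
  static int[] build(byte[] pattern) {
    int[] table = new int[pattern.length];
    int k = 0; // length of the currently matched prefix
    for (int i = 1; i < pattern.length; i++) {
      while (k > 0 && pattern[i] != pattern[k]) {
        k = table[k - 1]; // fall back to the next shorter candidate prefix
      }
      if (pattern[i] == pattern[k]) {
        k++;
      }
      table[i] = k;
    }
    return table;
  }

  public static void main(String[] args) {
    byte[] delimiter = "ABCABCABD".getBytes(StandardCharsets.UTF_8);
    System.out.println(Arrays.toString(build(delimiter)));
    // prints [0, 0, 0, 1, 2, 3, 4, 5, 0]
  }
}
```

With this table, after "ABCABCAB" has matched and the next input byte is "C" instead of "D", the matcher can fall back to position 5 (table[7]) rather than 0, so the overlapping "ABCAB" prefix is not lost.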

[1]

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TextReadJob {
  private static final Logger LOG = LoggerFactory.getLogger(TextReadJob.class);

  private static final String INPUT_PATH = "short.csv"; // content: "ABABCD"

  private static final byte[] DELIMITER = "ABC".getBytes(StandardCharsets.UTF_8);

  public static void main(String[] args) throws IOException {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    Pipeline pipeline = Pipeline.create(options);
    
    pipeline
        .apply(TextIO.read().from(INPUT_PATH).withDelimiter(DELIMITER))
        .apply(
            ParDo.of(
                new DoFn<String, Void>() {
                  @ProcessElement
                  public void processElement(@Element String input) {
                    LOG.info("input: <{}>", input);
                  }
                }));

    pipeline.run();
  }
}

[2]

Aug 19, 2024 11:30:43 PM org.apache.beam.sdk.io.FileBasedSource getEstimatedSizeBytes
INFO: Filepattern short.csv matched 1 files with total size 6
Aug 19, 2024 11:30:43 PM org.apache.beam.sdk.io.FileBasedSource split
INFO: Splitting filepattern short.csv into bundles of size 0 took 1 ms and produced 1 files and 6 bundles
Aug 19, 2024 11:30:43 PM baeminbo.TextReadJob$1 processElement
INFO: input: <ABABCD>
Aug 19, 2024 11:30:43 PM baeminbo.TextReadJob$1 processElement
INFO: input: <D>

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

Amar3tto · 2024-08-22T07:00:29Z

.take-issue

baeminbo · 2024-09-02T02:18:55Z

I'm afraid that #32298 cannot fix this issue with a delimiter including a "repeated pattern".

For example, the delimiter is "ABABC" and the text is "ABABABCD". The expected result is ["AB", "D"], but the actual result with the fix #32298 is ["ABABABCD", "D"].

The fix treats the "A" at index 4 of the text as matching the "A" at index 0 of the delimiter.

In the fix:
01234567
ABABABCD
    ^
    ABABC

But it must match the "A" at index 2 of the delimiter.

Right match:
01234567
ABABABCD
    ^
  ABABC
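
The "right match" above can be sketched as a split that falls back through a prefix table, in the spirit of Knuth-Morris-Pratt. This is only an illustration and not the code of any Beam PR; `KmpSplit`, `prefixTable` and `split` are hypothetical names. It assumes a non-empty delimiter, and emitting the trailing text as a final record is a simplification.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not Beam code.
final class KmpSplit {

  /** Longest proper prefix of p[0..i] that is also a suffix, for each i. */
  static int[] prefixTable(byte[] p) {
    int[] table = new int[p.length];
    for (int i = 1, k = 0; i < p.length; i++) {
      while (k > 0 && p[i] != p[k]) {
        k = table[k - 1];
      }
      if (p[i] == p[k]) {
        k++;
      }
      table[i] = k;
    }
    return table;
  }

  /** Splits text on a non-empty delimiter, keeping overlapping partial matches alive. */
  static List<String> split(String text, String delimiter) {
    byte[] t = text.getBytes(StandardCharsets.UTF_8);
    byte[] d = delimiter.getBytes(StandardCharsets.UTF_8);
    int[] table = prefixTable(d);
    List<String> records = new ArrayList<>();
    int recordStart = 0;
    int delPosn = 0; // number of delimiter bytes currently matched
    for (int i = 0; i < t.length; i++) {
      while (delPosn > 0 && t[i] != d[delPosn]) {
        delPosn = table[delPosn - 1]; // fall back instead of resetting to 0
      }
      if (t[i] == d[delPosn]) {
        delPosn++;
      }
      if (delPosn == d.length) {
        int recordEnd = i + 1 - d.length;
        records.add(new String(t, recordStart, recordEnd - recordStart, StandardCharsets.UTF_8));
        recordStart = i + 1;
        delPosn = 0;
      }
    }
    if (recordStart < t.length) {
      records.add(new String(t, recordStart, t.length - recordStart, StandardCharsets.UTF_8));
    }
    return records;
  }

  public static void main(String[] args) {
    System.out.println(split("ABABABCD", "ABABC")); // prints [AB, D]
    System.out.println(split("ABABCD", "ABC")); // prints [AB, D]
  }
}
```

At index 4 of "ABABABCD", four delimiter bytes are matched and "A" mismatches "C"; the table sends delPosn back to 2, where "A" matches, which corresponds to the alignment shown in the second diagram.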

Amar3tto · 2024-09-02T17:37:11Z

I will take a look

baeminbo added awaiting triage bug labels Aug 20, 2024

github-actions bot added java P1 labels Aug 20, 2024

github-actions bot removed the awaiting triage label Aug 22, 2024

github-actions bot assigned Amar3tto Aug 22, 2024

Amar3tto mentioned this issue Aug 23, 2024: Fix TextIO.read() split with non-default delimiter #32298 (Merged)

Abacn closed this as completed in #32298 Aug 23, 2024

github-actions bot added this to the 2.60.0 Release milestone Aug 23, 2024

baeminbo mentioned this issue Sep 5, 2024: Use Knuth–Morris–Pratt algorithm for delimiter search in TextIO #32398 (Merged)
