Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: TextIO.read() with non-default delimiter doesn't split at the right position #32251

Closed
1 of 17 tasks
baeminbo opened this issue Aug 20, 2024 · 3 comments · Fixed by #32298 or #32398
Closed
1 of 17 tasks

[Bug]: TextIO.read() with non-default delimiter doesn't split at the right position #32251

baeminbo opened this issue Aug 20, 2024 · 3 comments · Fixed by #32298 or #32398
Assignees

Comments

@baeminbo
Copy link
Contributor

What happened?

The pipeline [1] splits a text file with the delimiter "ABC". For the input text "ABABCD", the expected result is ["AB", "D"]. But, the actual pipeline result is ["ABABCD", "D"]. See [2] for the result with DirectRunner.

I guess the delimiter match at TextSource has the root cause. It processes the input text "ABABCD" as follows, so fails to match the delimiter "ABC" in the input text.

"A": "A" == delimiter[0] (= "A"), set delPosn to 1
"B": "B" == delimiter[1] (= "B"), set delPosn to 2
"A": "A" != delimiter[2] (= "C"), set delPosn to 0 <-- This is wrong. delPosn should be 1 as "A" matches delimiter[0] 
"B": "B" != delimiter[0] (= "A"), set delPosn to 0
"C": "C" != delimiter[0] (= "A"), set delPosn to 0
"D": "D" != delimiter[0] (= "A"), set delPosn to 0

I think this is something like a regex match problem (e.g. delimiter "ABCABCABD" and input text "...ABCABCABC...". It may need to have multiple delPosns for partial matches).

[1]

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TextReadJob {
  private static final Logger LOG = LoggerFactory.getLogger(TextReadJob.class);

  private static final String INPUT_PATH = "short.csv"; // content: "ABABCD"

  private static final byte[] DELIMITER = "ABC".getBytes(StandardCharsets.UTF_8);

  public static void main(String[] args) throws IOException {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    Pipeline pipeline = Pipeline.create(options);
    
    pipeline.apply(TextIO.read().from(INPUT_PATH).withDelimiter(DELIMITER)).apply(ParDo.of(new DoFn<String, Void>() {
      @ProcessElement
      public void processElement(@Element String input) {
        LOG.info("input: <{}>", input);
      }
    }));

    pipeline.run();
  }
}

[2]

ug 19, 2024 11:30:43 PM org.apache.beam.sdk.io.FileBasedSource getEstimatedSizeBytes
INFO: Filepattern short.csv matched 1 files with total size 6
Aug 19, 2024 11:30:43 PM org.apache.beam.sdk.io.FileBasedSource split
INFO: Splitting filepattern short.csv into bundles of size 0 took 1 ms and produced 1 files and 6 bundles
Aug 19, 2024 11:30:43 PM baeminbo.TextReadJob$1 processElement
INFO: input: <ABABCD>
Aug 19, 2024 11:30:43 PM baeminbo.TextReadJob$1 processElement
INFO: input: <D>

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@Amar3tto
Copy link
Contributor

.take-issue

@baeminbo
Copy link
Contributor Author

baeminbo commented Sep 2, 2024

I'm afraid that #32298 cannot fix this issue with a delimiter including a "repeated pattern".

For example, the delimiter is "ABABC" and the text is "ABABABCD". The expected result is ["AB", "D"]. But, the actual result with the fix #32298 is ["ABABABCD", "D"].

The fix considers that the index 5 char "A" matches the index 0 char "A" in the delimiter.

In the fix:
01234567
ABABABCD
    ^
    ABABC

But, it must match the index 2 char "A" in the delimiter.

Right match:
01234567
ABABABCD
    ^
  ABABC

@Amar3tto
Copy link
Contributor

Amar3tto commented Sep 2, 2024

I'm afraid that #32298 cannot fix this issue with a delimiter including a "repeated pattern".

For example, the delimiter is "ABABC" and the text is "ABABABCD". The expected result is ["AB", "D"]. But, the actual result with the fix #32298 is ["ABABABCD", "D"].

The fix considers that the index 5 char "A" matches the index 0 char "A" in the delimiter.

In the fix:
01234567
ABABABCD
    ^
    ABABC

But, it must match the index 2 char "A" in the delimiter.

Right match:
01234567
ABABABCD
    ^
  ABABC

I will take a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment