Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add skipLines in FileIO.ReadableFile.readFullyAsUTF8String #28728

Closed
wants to merge 3 commits into from

Conversation

gudfhr95
Copy link

@gudfhr95 gudfhr95 commented Sep 28, 2023

Fixes #17990

Changes

  • Add skipLines in FileIO.ReadableFile.readFullyAsUTF8String
  • Add skipLines in FileIO.ReadableFile.readFullyAsBytes
  • Refactored FileIOTest to Enclosed.class to add parameterized test

Related PR


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@github-actions github-actions bot added the java label Sep 28, 2023
@liferoad
Copy link
Collaborator

R: @shunping-google

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@gudfhr95
Copy link
Author

Hi, @damondouglas.
I finished implementing what you suggested.
I'm looking forward for your review.
Thank you.

@gudfhr95 gudfhr95 marked this pull request as ready for review September 29, 2023 13:27
@damccorm
Copy link
Contributor

@damondouglas or @shunping-google could you please take a look at this one?

@ghost
Copy link

ghost commented Oct 27, 2023

Hi gudfhr95, I am reviewing your change today and will get back to you shortly. Sorry about the delay.

@ghost
Copy link

ghost commented Oct 31, 2023

gudfhr95,

Thanks again for submitting your PRs for review! I spent some time discussing with the previous reviewer (@damondouglas) about this PR (#28728) offline and below is our opinion.

While there are merits in your changes, we also see certain problems that we will have to address:

  • It modifies the signature of the public function readFullyAsBytes(), which will inevitably break the backward compatibility of existing code. This is something we want to avoid.
  • It may not work on large files since it has to put all file content in the memory.
  • It only considers "\n" as the line delimiter, while there could be more out there.

In fact, I think FileIO is not the best place for this change. In the design of Beam IO, FileIO should be mainly focusing on getting a readable/writable channel ready. It should not have any assumption on what's inside the file. If you put the header skipping feature in FileIO, you are assuming it is a text-like file that has line breaks.

We prefer putting the change in a class derived from FileIO. Just like your original PR (#28502), TextIO will be the best choice as it is for text files and it has already handled different types of delimiters. However, the original PR also introduced some breaking changes to the public functions. Moreover, it will not scale well because it needs to disable isSplittable() when the header skipping feature is used.

In summary, we see problems in both PRs and it will be great if we can

  1. implement the changes in TextIO,

  2. keep the capability of parallel processing for a large file, and

  3. make sure there are no breaking changes (instead of changing the signature of a public function, try to overload it).

There is another PR (#29202) submitted recently on implementing this feature. You can refer to it for some ideas.

Thanks,

Shunping

@gudfhr95
Copy link
Author

gudfhr95 commented Nov 9, 2023

Hi, since #29202 was merged.
There are no reason to continue this PR.
Thanks for your reviews.

@gudfhr95 gudfhr95 closed this Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Skip header row from csv
3 participants