Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1312: [C++] Support setting the block buffer capacity of BufferedOutputStream #1394

Merged
merged 6 commits into from
Feb 7, 2023

Conversation

coderex2522
Copy link
Contributor

What changes were proposed in this pull request?

This pr provides the API to set the initial buffer capacity of BufferedOutputStream.

Why are the changes needed?

It's convenient for users to adjust the buffer size of the BufferedOutputStream according to the written data.

How was this patch tested?

Add the WriterTest.setOutputBufferCapacity to test different buffer capacity of BufferedOutputStream.
In addition, I did a simple test similar to this issue

PeakMem BufferCapacity = 1MB BufferCapacity = 64KB
10000 Fields 38.96GB 11.48GB
1000 Fields 3.92GB 1.17GB
100 Fields 0.41GB 0.13GB

@github-actions github-actions bot added the CPP label Feb 1, 2023
@@ -262,6 +262,17 @@ namespace orc {
* @return if not set, the default is false
*/
bool getUseTightNumericVector() const;

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Indentation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is detected by the CI format check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

}
}

TEST(WriterTest, setOutputBufferCapacity) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this.

TEST(WriterTest, setOutputBufferCapacity) {
testSetOutputBufferCapacity(1024);
testSetOutputBufferCapacity(1024 * 1024);
testSetOutputBufferCapacity(1024 * 1024 * 1024);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is 1GB, I'd remove this from unit test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 1GB case has been deleted.

@dongjoon-hyun
Copy link
Member

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check and fix Windows failure.

unknown file: error: C++ exception with description "bad allocation" thrown in the test body.
[  FAILED  ] WriterTest.setOutputBufferCapacity (44714 ms)

@@ -262,6 +262,17 @@ namespace orc {
* @return if not set, the default is false
*/
bool getUseTightNumericVector() const;

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is detected by the CI format check.

c++/include/orc/Writer.hh Show resolved Hide resolved
TEST(WriterTest, setOutputBufferCapacity) {
testSetOutputBufferCapacity(1024);
testSetOutputBufferCapacity(1024 * 1024);
testSetOutputBufferCapacity(1024 * 1024 * 1024);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Type::buildTypeFromString("struct<col1:int,col2:int>"));
WriterOptions options;
options.setStripeSize(1024 * 1024)
.setCompressionBlockSize(1024)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change setCompressionBlockSize accordingly? It seems that even you have changed the output buffer size but the page is very small and no resize will happen actually.

@@ -262,6 +262,19 @@ namespace orc {
* @return if not set, the default is false
*/
bool getUseTightNumericVector() const;

/**
* Set the initial buffer capacity of output stream.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Set the initial buffer capacity of output stream.
* Set the initial capacity to the output buffer of the compressed stream.


/**
* Set the initial buffer capacity of output stream.
* Each column contains multiple output streams with buffers,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Each column contains multiple output streams with buffers,
* Each column contains one or more compressed streams depending on its type,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class CompressionStream is a subclass of class BufferedOutputStream which contains the buffer. And BufferedOutputStream has no logic associated with the compression algorithm, so it may not be appropriate to use compressed stream to describe it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it makes sense to me to use BufferedOutputStream explicitly. We have defined several output streams (e.g. one from OrcFile.hh). It should be clear to user which one we are referring to.

/**
* Set the initial buffer capacity of output stream.
* Each column contains multiple output streams with buffers,
* and these buffers will automatically expand when memory is exhausted.
Copy link
Member

@wgtmac wgtmac Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* and these buffers will automatically expand when memory is exhausted.
* and these buffers will automatically expand when more memory is required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I am wrong here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the used size of the buffer reaches the initial capacity, the buffer will continue to expand in compression block size units.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the CompressionStream, will the output buffer exceed the compression block size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The outputBuffer(char* variable) of the class CompressionStreamBase doesn't exceed the compression block size, and the dataBuffer(BlockBuffer variable) of the class BufferedOutputStream will exceed the compression block size.

WriterOptions& setOutputBufferCapacity(uint64_t capacity);

/**
* Get the buffer capacity of output stream.
Copy link
Member

@wgtmac wgtmac Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Get the buffer capacity of output stream.
* Get the initial capacity of output buffer in the class BufferedOutputStream.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. My point is to explicitly describe which buffer applies the new options. It is kind of vague to me by simply saying output buffer and output stream.


/**
* Set the initial buffer capacity of output stream.
* Each column contains multiple output streams with buffers,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it makes sense to me to use BufferedOutputStream explicitly. We have defined several output streams (e.g. one from OrcFile.hh). It should be clear to user which one we are referring to.

/**
* Set the initial buffer capacity of output stream.
* Each column contains multiple output streams with buffers,
* and these buffers will automatically expand when memory is exhausted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the CompressionStream, will the output buffer exceed the compression block size?

@dongjoon-hyun
Copy link
Member

+1 for @wgtmac 's advice.

BTW, googletest failures in the CI is related to this PR?

CMake Error at googletest_ep-stamp/googletest_ep-download-RELWITHDEBINFO.cmake:49 (message):
  Command failed: 1
   '/usr/local/bin/cmake' '-Dmake=' '-Dconfig=' '-P' '/home/runner/work/orc/orc/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-download-RELWITHDEBINFO-impl.cmake'

@coderex2522
Copy link
Contributor Author

It seems to be caused by googletest package download failure. Compile successfully after retry.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @coderex2522 and @wgtmac .

@wgtmac
Copy link
Member

wgtmac commented Feb 4, 2023

@coderex2522 I will leave it to you to merge it to test your permission.

@coderex2522 coderex2522 merged commit 32dffd3 into apache:main Feb 7, 2023
@dongjoon-hyun
Copy link
Member

Congratulation, @coderex2522 . :)

@dongjoon-hyun
Copy link
Member

FYI, we provide merge script like Apache Spark community which preserves the committer identities when you merge someone else's PR.

@williamhyun
Copy link
Member

@coderex2522

Hey Xin, I've made an official announcement in the ORC dev and user mailing list welcoming you as our new committer.
If you have not yet subscribed to these mailing lists, please do so ASAP.

In addition, Could you make a PR to update the ORC website to add you as a committer?

@coderex2522
Copy link
Contributor Author

In addition, Could you make a PR to update the ORC website to add you as a committer?

Thanks, I've subscribed to these mailing lists before. And I will submit a PR to update the ORC website as soon as possible.

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
…OutputStream (apache#1394)

* ORC-1312: [C++] Support setting the capacity of output buffer in the class BufferedOutputStream
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants