Implement BigQuery Stress Test #30287

Merged
merged 6 commits into from
Feb 20, 2024

Amar3tto
Contributor

@Amar3tto Amar3tto commented Feb 12, 2024

This pull request introduces stress tests for BigQueryIO, designed to assess its performance under various conditions. The stress tests simulate dynamic load increases and evaluate BigQueryIO's behavior across different write formats and write methods.

Changes:

  • Added stress tests for BigQueryIO.
  • Implemented dynamic load increases over time to simulate varying workloads.
  • Introduced configurations for controlling the stress test parameters, such as the number of columns, write method, write format, etc.
  • Added support for exporting metrics to InfluxDB or BigQuery based on the configuration parameter.

Dynamic load increases over time example:
image (2)
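
A stepped load profile like the one pictured could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class `LoadScheduleSketch`, the `Phase` type, and every multiplier and duration used with it are hypothetical names and values chosen for the example.

```java
import java.util.List;

/** Hypothetical sketch of a stepped load schedule; not the implementation in this PR. */
final class LoadScheduleSketch {

  /** One phase of the test: a multiplier applied to the base rate for a fixed duration. */
  static final class Phase {
    final int multiplier;
    final int durationMinutes;

    Phase(int multiplier, int durationMinutes) {
      this.multiplier = multiplier;
      this.durationMinutes = durationMinutes;
    }
  }

  /**
   * Returns the target rows per second at the given elapsed minute.
   * Returns 0 when the elapsed time falls outside the schedule.
   */
  static long targetRate(long baseRowsPerSecond, List<Phase> phases, int elapsedMinutes) {
    int phaseStart = 0;
    for (Phase phase : phases) {
      int phaseEnd = phaseStart + phase.durationMinutes;
      // Each phase covers the half-open interval [phaseStart, phaseEnd).
      if (elapsedMinutes >= phaseStart && elapsedMinutes < phaseEnd) {
        return baseRowsPerSecond * phase.multiplier;
      }
      phaseStart = phaseEnd;
    }
    return 0L;
  }
}
```

With phases of (x1 for 10 minutes, x2 for 10 minutes, x4 for 5 minutes) and a base rate of 1000 rows per second, the sketch yields 1000 during the first ten minutes, 2000 during the next ten, 4000 for the final five, and 0 afterwards.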


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@Amar3tto Amar3tto marked this pull request as ready for review February 13, 2024 11:16
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @AnandInguva added as fallback since no labels match configuration

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Amar3tto
Contributor Author

R: @damccorm @Abacn

Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

/**
 * BigQueryIO stress tests. The test is designed to assess the performance of BigQueryIO under
 * various conditions.
 */
Contributor

Would you mind adding a comment about how we can trigger a specific test from the Gradle command line?

Contributor Author

Done

Contributor

@Abacn Abacn left a comment

This is awesome! Going to take a closer look. Here are a few initial comments.

Is the test currently writing metrics to InfluxDB? If so, we can set up a Grafana dashboard for it at http://metrics.beam.apache.org/ (in a separate PR), even if it is currently empty.

Contributor

@Abacn Abacn left a comment

Thanks, this looks pretty good. Just had comments about the ignored test and the metrics nomenclature (no action needed for this PR, just something to think about later).

}

@Test
@Ignore
Contributor

Could you please add a note explaining why each ignored test is ignored?

Contributor Author

Removed @Ignore

@JsonProperty public boolean exportMetricsToInfluxDB = false;

/** InfluxDB measurement to publish results to. * */
@JsonProperty public String influxMeasurement;
Contributor

(A note for a future dashboard) Would the measurement be unique for each stress test / each IO?

Pasted some example metrics, FYI:

{
  "TotalStreamingDataProcessed": 0.0,
  "BillableShuffleDataProcessed": 0.0,
  "EstimatedCost": 0.06965586137166667,
  "AvgInputThroughputBytesPerSec": 6.712440411666667E7,
  "ElapsedTime": 1811.0,
  "MaxCpuUtilization": 0.7093815630496711,
  "AvgCpuUtilization": 0.6400222870148885,
  "AvgInputThroughputElementsPerSec": 66132.418,
  "TotalPdUsage": 405457.0,
  "TotalGpuTime": 0.0,
  "TotalSsdUsage": 0.0,
  "MaxInputThroughputElementsPerSec": 71111.15,
  "TotalDcuUsage": 0.0,
  "TotalVcpuTime": 3243.0,
  "TotalShuffleDataProcessed": 0.0,
  "EstimatedDataProcessedGB": 101.5,
  "TotalMemoryUsage": 1.3286034E7,
  "MaxInputThroughputBytesPerSec": 7.217781435E7
}

As can be seen, the field names obtained from getMetrics are the same for all pipelines. If we start publishing to InfluxDB, we need to consider how the measurement is named so that different test settings can be distinguished.

Even better would be to use InfluxDB tags to distinguish them; however, that is currently not supported by IOITMetrics.

https://docs.influxdata.com/influxdb/v1/concepts/glossary/#measurement
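
To illustrate the tag idea, here is a minimal sketch of how one data point could be written in InfluxDB line protocol (`measurement,tag=value field=value timestamp`) with the test settings carried as tags. It is a standalone illustration and not part of IOITMetrics; the class `InfluxLineSketch` and the tag keys `write_method` and `write_format` are assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical helper that formats one point in InfluxDB line protocol. */
final class InfluxLineSketch {

  /** Escapes commas and spaces, which line protocol treats as delimiters in measurement names. */
  static String escapeMeasurement(String s) {
    return s.replace(",", "\\,").replace(" ", "\\ ");
  }

  /** Escapes commas, equals signs and spaces in tag keys, tag values and field keys. */
  static String escapeKey(String s) {
    return s.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ");
  }

  static String toLine(
      String measurement, Map<String, String> tags, Map<String, Double> fields, long timestampNanos) {
    if (fields.isEmpty()) {
      throw new IllegalArgumentException("line protocol requires at least one field");
    }
    // Tags are emitted sorted by key, which also makes the output deterministic.
    String tagPart =
        new TreeMap<>(tags)
            .entrySet().stream()
                .map(e -> "," + escapeKey(e.getKey()) + "=" + escapeKey(e.getValue()))
                .collect(Collectors.joining());
    // Double values are written as floats; no type suffix is needed.
    String fieldPart =
        new TreeMap<>(fields)
            .entrySet().stream()
                .map(e -> escapeKey(e.getKey()) + "=" + e.getValue())
                .collect(Collectors.joining(","));
    return escapeMeasurement(measurement) + tagPart + " " + fieldPart + " " + timestampNanos;
  }
}
```

With tags, every test case can share one measurement and the same field names (such as `ElapsedTime`), and a dashboard query can filter or group by the tag values instead of relying on the measurement name.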

Contributor Author

Fixed using suffixes, so the measurement should be unique for each stress test case.

It's a good idea to use tags, we'll look into that in the future.
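
As a rough illustration of the suffix approach, a per-case measurement name could be derived along these lines. The exact suffix scheme used in the PR is not shown in this thread, so the class `MeasurementNameSketch`, the method name, and the choice of write method and write format as suffix parts are all assumptions.

```java
import java.util.Locale;

/** Hypothetical sketch: derive a distinct InfluxDB measurement name per stress test case. */
final class MeasurementNameSketch {

  /** Appends normalized test settings to the base measurement, separated by underscores. */
  static String uniqueMeasurement(String base, String writeMethod, String writeFormat) {
    return base + "_" + normalize(writeMethod) + "_" + normalize(writeFormat);
  }

  /** Lower-cases the value and collapses any run of non-alphanumeric characters into one underscore. */
  private static String normalize(String part) {
    return part.trim().toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "_");
  }
}
```

For example, a base of `bigquery_stress` with write method `STORAGE_WRITE_API` and format `JSON` yields `bigquery_stress_storage_write_api_json`, so two test cases with different settings never publish to the same measurement.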

Contributor

@Abacn Abacn left a comment

Thanks!

@Abacn Abacn merged commit 4120041 into apache:master Feb 20, 2024
7 checks passed
@akashorabek akashorabek mentioned this pull request Mar 1, 2024
3 tasks
@akashorabek
Contributor

akashorabek commented Apr 25, 2024

3 participants