Fix Python Flink runner load tests & Stop publish Python SDK image in beam_portability #27595

Merged
merged 2 commits into from
Jul 22, 2023

Conversation

Abacn
Contributor

@Abacn Abacn commented Jul 21, 2023

Fixes #26921
Should fix #27601 (needs a job-server snapshot built from master to take effect)

  • Use beam-sdk images for load tests

* Use beam-sdk images for load tests
@Abacn
Contributor Author

Abacn commented Jul 21, 2023

run seed job

@github-actions github-actions bot added the infra label Jul 21, 2023
@Abacn
Contributor Author

Abacn commented Jul 21, 2023

Run Load Tests Python ParDo Flink Batch

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

Run Load Tests Go ParDo Flink Batch

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

The job at https://ci-beam.apache.org/job/beam_Publish_Docker_Snapshots/1062/ no longer builds Python containers; it now builds only the job servers (Flink and Spark).

@github-actions
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @AnandInguva added as fallback since no labels match configuration

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

run seed job

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

Run Load Tests Python ParDo Flink Batch

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

The Python Flink runner load tests have all been broken since Jul 20, e.g. https://ci-beam.apache.org/view/LoadTests/job/beam_LoadTests_Python_Combine_Flink_Batch/

01:40:48 ERROR:root:java.util.concurrent.TimeoutException: Timed out while waiting for command 'docker run -d --network=host --env=DOCKER_MAC_CONTAINER=null gcr.io/apache-beam-testing/beam_portability/beam_python3.8_sdk:latest --id=1-1 --provision_endpoint=localhost:43365'
01:40:52 INFO:apache_beam.runners.portability.portable_runner:Job state changed to FAILED
01:40:52 Traceback (most recent call last):
01:40:52   File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
01:40:52     return _run_code(code, main_globals, None,
01:40:52   File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
01:40:52     exec(code, run_globals)
01:40:52   File "/home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_Combine_Flink_Batch/src/sdks/python/apache_beam/testing/load_tests/combine_test.py", line 129, in <module>
01:40:52     CombineTest().run()
01:40:52   File "/home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_Combine_Flink_Batch/src/sdks/python/apache_beam/testing/load_tests/load_test.py", line 152, in run
01:40:52     state = self.result.wait_until_finish(duration=self.timeout_ms)
01:40:52   File "/home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_Combine_Flink_Batch/src/sdks/python/apache_beam/runners/portability/portable_runner.py", line 614, in wait_until_finish
01:40:52     raise self._runtime_exception
01:40:52 RuntimeError: Pipeline load-tests-python-flink-batch-combine-1-0720035125_c0b4d9c6-53f4-489b-90a5-f3885a9192b8 failed in state FAILED: java.util.concurrent.TimeoutException: Timed out while waiting for command 'docker run -d --network=host --env=DOCKER_MAC_CONTAINER=null gcr.io/apache-beam-testing/beam_portability/beam_python3.8_sdk:latest --id=1-1 --provision_endpoint=localhost:43365'

The first failing test ran on Jul 20, 2023, 5:28 AM UTC, using the snapshot containers from Jul 19, 2023, 8:45 PM UTC.

It appears that some change on July 18-19 broke the Python container.

@Abacn Abacn marked this pull request as draft July 21, 2023 14:06
@Abacn
Contributor Author

Abacn commented Jul 21, 2023

Run Load Tests Python ParDo Flink Batch

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

Run Load Tests Python ParDo Flink Batch

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

Run Load Tests Python ParDo Flink Batch

I manually reset the "latest" tag to the image from 3 days ago and the test still fails, so the failure is not caused by a recent Beam change.

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

Run Load Tests Python ParDo Flink Batch

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

Run Load Tests Python ParDo Flink Batch

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

The tests run successfully on a local Flink cluster. Steps to reproduce:

  1. Set up a Flink cluster (flink-1.15) locally: ./flink-1.15.0/bin/start-cluster.sh
  2. In the Beam repo, install the Beam Python SDK: cd ~/beam/sdks/python && pip install .[gcp]
  3. Start the job server: docker run --publish 8099:8099 --publish 8098:8098 --publish 8097:8097 gcr.io/apache-beam-testing/beam_portability/beam_flink1.15_job_server:latest --flink-master=host.docker.internal:8081
  4. Start the Beam load test: python -m apache_beam.testing.load_tests.combine_test --test-pipeline-options="--runner=PortableRunner --job_endpoint=localhost:8099 --environment_type=DOCKER --environment_config=gcr.io/apache-beam-testing/beam_portability/beam_python3.8_sdk:latest --publish_to_big_query=false --input_options='{\"num_records\":2500000,\"key_size\":10,\"value_size\":90}' --top_count=20"
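
The nested quoting in step 4 is easy to get wrong when typed by hand. Below is a minimal sketch that assembles the same argument list in Python. The helper name build_load_test_command is hypothetical, and the sketch assumes the --test-pipeline-options string is split shell-style on the receiving side, which is what the single-quoted JSON in the original command suggests.

```python
import json
import shlex

# Image and endpoint values are copied from the reproduce steps above.
IMAGE = "gcr.io/apache-beam-testing/beam_portability/beam_python3.8_sdk:latest"


def build_load_test_command(image=IMAGE, job_endpoint="localhost:8099"):
    """Hypothetical helper: build the argv list for the combine load test."""
    input_options = {"num_records": 2500000, "key_size": 10, "value_size": 90}
    pipeline_options = [
        "--runner=PortableRunner",
        f"--job_endpoint={job_endpoint}",
        "--environment_type=DOCKER",
        f"--environment_config={image}",
        "--publish_to_big_query=false",
        # shlex.quote wraps the JSON in single quotes so that its double
        # quotes survive a later shell-style split of the option string.
        "--input_options="
        + shlex.quote(json.dumps(input_options, separators=(",", ":"))),
        "--top_count=20",
    ]
    return [
        "python",
        "-m",
        "apache_beam.testing.load_tests.combine_test",
        "--test-pipeline-options=" + " ".join(pipeline_options),
    ]
```

Passing the returned list to subprocess.run avoids the outer shell layer, so the backslash-escaped quotes from the hand-typed command are not needed.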

Result:

INFO:apache_beam.testing.load_tests.load_test_metrics_utils:Missing InfluxDB options. Metrics will not be published to InfluxDB
INFO:root:Using provided Python SDK container image: gcr.io/apache-beam-testing/beam_portability/beam_python3.8_sdk:latest
INFO:root:Python SDK container image set to "gcr.io/apache-beam-testing/beam_portability/beam_python3.8_sdk:latest" for Docker environment
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function pack_combiners at 0x1193031f0> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function lift_combiners at 0x119303280> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function sort_stages at 0x1193039d0> ====================
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['--top_count=20']
INFO:apache_beam.runners.portability.portable_runner:Job state changed to STOPPED
INFO:apache_beam.runners.portability.portable_runner:Job state changed to STARTING
INFO:apache_beam.runners.portability.portable_runner:Job state changed to RUNNING
INFO:apache_beam.runners.portability.portable_runner:Job state changed to DONE
INFO:apache_beam.testing.load_tests.load_test_metrics_utils:Load test results for test: b9216a86d14440788a70f75ab9c92d58 and timestamp: 1689959067.058372:
INFO:apache_beam.testing.load_tests.load_test_metrics_utils:Metric: default_runtime Value: 264

(Running on a Mac M1, the amd64 Python SDK container image is slow. If the container image is built locally with ./gradlew :sdks:python:container:py38:docker and the environment_config pipeline option is omitted, the run takes 30 s.)

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

Pulling the Python image locally appears to take 1 min 40 s, just below the 2 min timeout:

return forExecutable(DEFAULT_DOCKER_COMMAND, Duration.ofMinutes(2));

whereas on the Dataproc master the pull was observed to take about 3 min to finish:

11:51:27 latest: Pulling from apache-beam-testing/beam_portability/beam_python3.8_sdk
...
11:54:34 a0eb10d78f36: Pull complete

I changed the command to increase the timeout to 10 min. However, a snapshot needs to be built from master for this change to take effect.
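
As a quick check of the arithmetic, here is a minimal sketch comparing the two pull durations against the 2 min limit. It assumes the two log timestamps quoted above bracket the whole pull (the lines in between are elided in the log), and the variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Limit taken from Duration.ofMinutes(2) in the line quoted above.
DOCKER_RUN_TIMEOUT = timedelta(minutes=2)


def elapsed(start, end, fmt="%H:%M:%S"):
    """Time between two same-day HH:MM:SS log timestamps."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# First and last timestamps of the pull as seen on the Dataproc master.
pull_on_dataproc = elapsed("11:51:27", "11:54:34")
# Local pull time as reported in the comment above.
pull_locally = timedelta(minutes=1, seconds=40)

print(pull_on_dataproc)                        # 0:03:07
print(pull_on_dataproc > DOCKER_RUN_TIMEOUT)   # True
print(pull_locally > DOCKER_RUN_TIMEOUT)       # False
```

The Dataproc pull overshoots the limit by more than a minute, while the local pull stays under it by only 20 s.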

@Abacn Abacn marked this pull request as ready for review July 21, 2023 18:15
@Abacn
Contributor Author

Abacn commented Jul 21, 2023

As a check, changing the timeout

return forExecutable(DEFAULT_DOCKER_COMMAND, Duration.ofMinutes(2));

to 10 s on master and running the test (#27595 (comment)) indeed produces the exact same error:

RuntimeError: Pipeline BeamApp-yathu-0721182603-43bbd017_08810687-bc69-441f-ae7e-a4a7f94c1855 failed in state FAILED: java.util.concurrent.TimeoutException: Timed out while waiting for command 'docker run -d --mount type=bind,src=/Users/yathu/.config/gcloud,dst=/root/.config/gcloud --network=host --env=DOCKER_MAC_CONTAINER=null gcr.io/apache-beam-testing/beam_portability/beam_python3.8_sdk:latest --id=1-1 --provision_endpoint=host.docker.internal:60233'
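
The failure mode can be illustrated outside Beam. The following is a minimal Python analogy, not Beam's Java implementation: a command that would succeed if given enough time is reported as failed once the wait limit is shorter than its run time. The helper name run_with_timeout and the stand-in slow command are assumptions for illustration.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Return True if the command finished in time, False if it timed out."""
    try:
        subprocess.run(
            command,
            timeout=timeout_seconds,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True
    except subprocess.TimeoutExpired:
        # The child is killed and the caller only sees a timeout, even
        # though the command itself was healthy and merely slow.
        return False


# Stand-in for a slow "docker run" that first has to pull a large image.
slow_command = [sys.executable, "-c", "import time; time.sleep(2)"]
```

Calling run_with_timeout(slow_command, 0.2) returns False, while the same command with a 30 s limit returns True, which mirrors how shortening the limit to 10 s reproduces the error locally.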

@Abacn Abacn changed the title Stop publish Python SDK image in beam_portability Fix Python Flink runner load tests & Stop publish Python SDK image in beam_portability Jul 21, 2023
@Abacn
Contributor Author

Abacn commented Jul 21, 2023

R: @tvalentyn @damccorm

@github-actions
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

A previously successful seed job run can be found in the test status of the first commit: ba7cace

@Abacn
Contributor Author

Abacn commented Jul 21, 2023

Run Java_PVR_Flink_Batch PreCommit

@Abacn Abacn merged commit 97cb32e into apache:master Jul 22, 2023
11 checks passed
@Abacn Abacn deleted the consolidateimage branch July 22, 2023 00:19
Successfully merging this pull request may close these issues.

[Failing Test]: Python flink runner load test failing [Task]: Consolidate dev container image
2 participants