The PostCommit Python job is flaky #30513

Open
github-actions bot opened this issue Mar 5, 2024 · 33 comments · Fixed by #32171 or #32382 · May be fixed by #32378

Comments

@github-actions
Contributor

github-actions bot commented Mar 5, 2024

The PostCommit Python is failing over 50% of the time
Please visit https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python.yml?query=is%3Afailure+branch%3Amaster to see the logs.

@shunping
Contributor

It first failed on https://github.com/apache/beam/actions/runs/8210266873.

The failed task is :sdks:python:test-suites:portable:py38:portableWordCountSparkRunnerBatch.

Traceback:

INFO:apache_beam.utils.subprocess_server:Starting service with ('java' '-jar' '/runner/_work/beam/beam/runners/spark/3/job-server/build/libs/beam-runners-spark-3-job-server-2.56.0-SNAPSHOT.jar' '--spark-master-url' 'local[4]' '--artifacts-dir' '/tmp/beam-temp8q8022zi/artifactsg6e8usou' '--job-port' '56313' '--artifact-port' '0' '--expansion-port' '0')
INFO:apache_beam.utils.subprocess_server:Error: A JNI error has occurred, please check your installation and try again
INFO:apache_beam.utils.subprocess_server:Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/beam/vendor/grpc/v1p60p1/io/grpc/BindableService
INFO:apache_beam.utils.subprocess_server:	at java.lang.ClassLoader.defineClass1(Native Method)
INFO:apache_beam.utils.subprocess_server:	at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
INFO:apache_beam.utils.subprocess_server:	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
INFO:apache_beam.utils.subprocess_server:	at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
INFO:apache_beam.utils.subprocess_server:	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
INFO:apache_beam.utils.subprocess_server:	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
INFO:apache_beam.utils.subprocess_server:	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
INFO:apache_beam.utils.subprocess_server:	at java.security.AccessController.doPrivileged(Native Method)
INFO:apache_beam.utils.subprocess_server:	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
INFO:apache_beam.utils.subprocess_server:	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
INFO:apache_beam.utils.subprocess_server:	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
INFO:apache_beam.utils.subprocess_server:	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
INFO:apache_beam.utils.subprocess_server:	at java.lang.Class.getDeclaredMethods0(Native Method)
INFO:apache_beam.utils.subprocess_server:	at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
INFO:apache_beam.utils.subprocess_server:	at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
INFO:apache_beam.utils.subprocess_server:	at java.lang.Class.getMethod0(Class.java:3018)
INFO:apache_beam.utils.subprocess_server:	at java.lang.Class.getMethod(Class.java:1784)
INFO:apache_beam.utils.subprocess_server:	at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:670)
INFO:apache_beam.utils.subprocess_server:	at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:652)
INFO:apache_beam.utils.subprocess_server:Caused by: java.lang.ClassNotFoundException: org.apache.beam.vendor.grpc.v1p60p1.io.grpc.BindableService
INFO:apache_beam.utils.subprocess_server:	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
INFO:apache_beam.utils.subprocess_server:	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
INFO:apache_beam.utils.subprocess_server:	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
INFO:apache_beam.utils.subprocess_server:	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
INFO:apache_beam.utils.subprocess_server:	... 19 more
ERROR:apache_beam.utils.subprocess_server:Started job service with ('java', '-jar', '/runner/_work/beam/beam/runners/spark/3/job-server/build/libs/beam-runners-spark-3-job-server-2.56.0-SNAPSHOT.jar', '--spark-master-url', 'local[4]', '--artifacts-dir', '/tmp/beam-temp8q8022zi/artifactsg6e8usou', '--job-port', '56313', '--artifact-port', '0', '--expansion-port', '0')
ERROR:apache_beam.utils.subprocess_server:Error bringing up service
Traceback (most recent call last):
  File "/runner/_work/beam/beam/sdks/python/apache_beam/utils/subprocess_server.py", line 175, in start
    raise RuntimeError(
RuntimeError: Service failed to start up with error 1
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/examples/wordcount.py", line 111, in <module>
    run()
  File "/runner/_work/beam/beam/sdks/python/apache_beam/examples/wordcount.py", line 106, in run
    output | 'Write' >> WriteToText(known_args.output)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/pipeline.py", line 612, in __exit__
    self.result = self.run()
  File "/runner/_work/beam/beam/sdks/python/apache_beam/pipeline.py", line 586, in run
    return self.runner.run_pipeline(self, self._options)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/runner.py", line 192, in run_pipeline
    return self.run_portable_pipeline(
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/portable_runner.py", line 381, in run_portable_pipeline
    job_service_handle = self.create_job_service(options)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/portable_runner.py", line 296, in create_job_service
    return self.create_job_service_handle(server.start(), options)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/job_server.py", line 81, in start
    self._endpoint = self._job_server.start()
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/job_server.py", line 110, in start
    return self._server.start()
  File "/runner/_work/beam/beam/sdks/python/apache_beam/utils/subprocess_server.py", line 175, in start
    raise RuntimeError(
RuntimeError: Service failed to start up with error 1
> Task :sdks:python:test-suites:portable:py38:portableWordCountSparkRunnerBatch FAILED

@shunping
Contributor

Tagging the author of the commit on which the post-commit job first failed.
@damccorm

@damccorm
Contributor

I think we can pretty comfortably rule out that change; it was to the YAML SDK, which is unrelated to portableWordCountSparkRunnerBatch. Note that this runs on a schedule, not on commits, though none of the commits in that scheduled window look particularly harmful.

@shunping
Contributor

I see. It was red for the last two weeks and flaky before that too.

@kennknowles
Member

Permared right now

@damccorm
Contributor

Only sorta - each component job is actually not permared - e.g. there are 2 successes here, https://github.com/apache/beam/actions/runs/8873798546

The whole workflow is permared just because our flake percentage is so high
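
To make the point above concrete, here is a rough sketch of why a workflow can look permared even when no single component job is. It assumes the component jobs fail independently, and the pass rates are hypothetical numbers, not measurements from this workflow.

```python
def workflow_pass_probability(job_pass_rates):
    """Probability that every component job passes in one run.

    Assumes the jobs fail independently of each other, which is a
    simplification (shared infrastructure issues would break it).
    """
    prob = 1.0
    for rate in job_pass_rates:
        prob *= rate
    return prob


# Hypothetical example: four component jobs that each pass 70% of the time.
rates = [0.7] * 4
print(f"workflow green rate: {workflow_pass_probability(rates):.2%}")
```

Under these assumed numbers the whole workflow is green in only about a quarter of runs, which is one reason a per-job signal would be more informative than the top-level one.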

@kennknowles
Member

Yea, let's work out how to get top-level signal.

@Abacn
Contributor

Abacn commented Apr 29, 2024

The lowest and highest Python versions (3.8, 3.11) run more tests than the middle versions (3.9, 3.10), so it could be those extra tests or tasks that are permared.

@kennknowles
Member

Could make sense to find a way to get separate top-level signal for Python versions, assuming we can use software engineering to share everything necessary so they don't get out of sync.

@Abacn
Contributor

Abacn commented Apr 29, 2024

Yeah, we used to have this for Jenkins where each Python PostCommit had its own task

@liferoad
Collaborator

liferoad commented May 27, 2024

A Vertex AI package version warning (we do not import this package directly, so it should be fine):


../../build/gradleenv/-1734967050/lib/python3.9/site-packages/vertexai/preview/developer/__init__.py:33
  /runner/_work/beam/beam/build/gradleenv/-1734967050/lib/python3.9/site-packages/vertexai/preview/developer/__init__.py:33: DeprecationWarning:
  After May 30, 2024, importing any code below will result in an error.
  Please verify that you are explicitly pinning to a version of `google-cloud-aiplatform`
  (e.g., google-cloud-aiplatform==[1.32.0, 1.49.0]) if you need to continue using this
  library.

    from vertexai.preview import (
        init,
        remote,
        VertexModel,
        register,
        from_pretrained,
        developer,
        hyperparameter_tuning,
        tabular_models,
    )


@liferoad
Collaborator

liferoad commented May 28, 2024

A new flaky test in py39, related to #29617:

https://ge.apache.org/s/hb7syztoolfhu/console-log?page=17


=================================== FAILURES ===================================
_______________ BigQueryQueryToTableIT.test_big_query_legacy_sql _______________
[gw3] linux -- Python 3.9.19 /runner/_work/beam/beam/build/gradleenv/1398941893/bin/python3.9

self = <apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT testMethod=test_big_query_legacy_sql>

    @pytest.mark.it_postcommit
    def test_big_query_legacy_sql(self):
      verify_query = DIALECT_OUTPUT_VERIFY_QUERY % self.output_table
      expected_checksum = test_utils.compute_hash(DIALECT_OUTPUT_EXPECTED)
      pipeline_verifiers = [
          PipelineStateMatcher(),
          BigqueryMatcher(
              project=self.project,
              query=verify_query,
              checksum=expected_checksum)
      ]

      extra_opts = {
          'query': LEGACY_QUERY,
          'output': self.output_table,
          'output_schema': DIALECT_OUTPUT_SCHEMA,
          'use_standard_sql': False,
          'wait_until_finish_duration': WAIT_UNTIL_FINISH_DURATION_MS,
          'on_success_matcher': all_of(*pipeline_verifiers),
      }
      options = self.test_pipeline.get_full_options_as_args(**extra_opts)
>     big_query_query_to_table_pipeline.run_bq_pipeline(options)

apache_beam/io/gcp/big_query_query_to_table_it_test.py:178:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
apache_beam/io/gcp/big_query_query_to_table_pipeline.py:103: in run_bq_pipeline
    result = p.run()
apache_beam/testing/test_pipeline.py:115: in run
    result = super().run(
apache_beam/pipeline.py:560: in run
    return Pipeline.from_runner_api(
apache_beam/pipeline.py:587: in run
    return self.runner.run_pipeline(self, self._options)
apache_beam/runners/direct/test_direct_runner.py:42: in run_pipeline
    self.result = super().run_pipeline(pipeline, options)
apache_beam/runners/direct/direct_runner.py:117: in run_pipeline
    from apache_beam.runners.portability.fn_api_runner import fn_runner
apache_beam/runners/portability/fn_api_runner/__init__.py:18: in <module>
    from apache_beam.runners.portability.fn_api_runner.fn_runner import FnApiRunner
apache_beam/runners/portability/fn_api_runner/fn_runner.py:68: in <module>
    from apache_beam.runners.portability.fn_api_runner import execution
apache_beam/runners/portability/fn_api_runner/execution.py:62: in <module>
    from apache_beam.runners.portability.fn_api_runner import translations
apache_beam/runners/portability/fn_api_runner/translations.py:55: in <module>
    from apache_beam.runners.worker import bundle_processor
apache_beam/runners/worker/bundle_processor.py:69: in <module>
    from apache_beam.runners.worker import operations
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   KeyError: '__pyx_vtable__'

apache_beam/runners/worker/operations.py:1: KeyError


@liferoad
Collaborator

Last three runs are green now.

Closing this for now.

@github-actions github-actions bot added this to the 2.57.0 Release milestone May 29, 2024
@shunping
Contributor

Great. Thanks @liferoad

@github-actions github-actions bot reopened this May 30, 2024
Contributor Author

Reopening since the workflow is still flaky

@liferoad
Collaborator

liferoad commented Jun 18, 2024

_______ ERROR collecting apache_beam/runners/worker/log_handler_test.py ________
apache_beam/runners/worker/log_handler_test.py:34: in <module>
    from apache_beam.runners.worker import bundle_processor
apache_beam/runners/worker/bundle_processor.py:69: in <module>
    from apache_beam.runners.worker import operations
apache_beam/runners/worker/operations.py:1: in init apache_beam.runners.worker.operations
    ???
E   KeyError: '__pyx_vtable__'
________ ERROR collecting apache_beam/runners/worker/opcounters_test.py ________
apache_beam/runners/worker/opcounters_test.py:27: in <module>
    from apache_beam.runners.worker import opcounters
apache_beam/runners/worker/opcounters.py:1: in init apache_beam.runners.worker.opcounters
    ???
E   ValueError: apache_beam.utils.counters.Counter size changed, may indicate binary incompatibility. Expected 56 from C header, got 32 from PyObject

https://ge.apache.org/s/w6kem3hrdnwii/console-log/task/:sdks:python:test-suites:direct:py38:tensorflowInferenceTest?anchor=1334&page=2

=========================== short test summary info ============================
ERROR apache_beam/dataframe/transforms_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/dataframe/transforms_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/render_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/render_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/trivial_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/trivial_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/dataflow/dataflow_job_service_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/dataflow/dataflow_job_service_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/interactive/interactive_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/interactive/interactive_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/interactive/utils_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/interactive/utils_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/flink_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/flink_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/flink_uber_jar_job_server_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/flink_uber_jar_job_server_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/local_job_service_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/local_job_service_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/portable_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/portable_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/samza_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/samza_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/spark_java_job_server_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/spark_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/spark_java_job_server_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/spark_uber_jar_job_server_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/spark_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/spark_uber_jar_job_server_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/fn_api_runner/fn_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/fn_api_runner/fn_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/fn_api_runner/translations_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/fn_api_runner/translations_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/fn_api_runner/trigger_manager_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/bundle_processor_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/log_handler_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/opcounters_test.py - ValueError: apache_beam.utils.counters.Counter size changed, may indicate binary incompatibility. Expected 56 from C header, got 32 from PyObject
ERROR apache_beam/runners/portability/fn_api_runner/trigger_manager_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/bundle_processor_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/log_handler_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/opcounters_test.py - ValueError: apache_beam.utils.counters.Counter size changed, may indicate binary incompatibility. Expected 56 from C header, got 32 from PyObject
ERROR apache_beam/runners/worker/sdk_worker_main_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/sdk_worker_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/sideinputs_test.py - ValueError: apache_beam.utils.counters.Counter size changed, may indicate binary incompatibility. Expected 56 from C header, got 32 from PyObject
ERROR apache_beam/runners/worker/sdk_worker_main_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/sdk_worker_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/sideinputs_test.py - ValueError: apache_beam.utils.counters.Counter size changed, may indicate binary incompatibility. Expected 56 from C header, got 32 from PyObject
ERROR apache_beam/testing/load_tests/microbenchmarks_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/transforms/combinefn_lifecycle_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/testing/load_tests/microbenchmarks_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/transforms/combinefn_lifecycle_test.py - KeyError: '__pyx_vtable__'
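
A summary like the one above is easier to triage when grouped by error type. The following is a minimal sketch, assuming the console log has been saved as text; `summarize_errors` is a hypothetical helper, not part of Beam or pytest. It strips ANSI color codes, parses the `ERROR <path> - <message>` lines, and counts distinct test files per exception type so that interleaved duplicate lines are not double counted.

```python
import re
from collections import Counter

# Matches ANSI color sequences such as ESC[31m that pytest emits.
ANSI = re.compile(r"\x1b\[[0-9;]*m")
# Matches pytest short-summary lines: "ERROR <path> - <message>".
ERROR_LINE = re.compile(r"^ERROR\s+(?P<path>\S+)\s+-\s+(?P<message>.+)$")


def summarize_errors(log_text):
    """Count distinct failing test files per exception type."""
    files_by_error = {}
    for raw in log_text.splitlines():
        line = ANSI.sub("", raw).strip()
        match = ERROR_LINE.match(line)
        if not match:
            continue
        # "KeyError: '__pyx_vtable__'" -> "KeyError"
        error_type = match.group("message").split(":", 1)[0]
        files_by_error.setdefault(error_type, set()).add(match.group("path"))
    return Counter({err: len(paths) for err, paths in files_by_error.items()})


if __name__ == "__main__":
    sample = (
        "ERROR apache_beam/runners/render_test.py - KeyError: '__pyx_vtable__'\n"
        "ERROR apache_beam/runners/render_test.py - KeyError: '__pyx_vtable__'\n"
        "ERROR apache_beam/runners/worker/opcounters_test.py - ValueError: size changed\n"
    )
    for error_type, count in summarize_errors(sample).most_common():
        print(error_type, count)
```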


@jrmccluskey
Contributor

No cython issues in recent runs, just a number of flakes for tests with external connections (GCSIO, RRIO) that aren't consistent across Python versions or different runs

@Abacn
Contributor

Abacn commented Aug 13, 2024

Currently the Python 3.12 Dataflow test suite has two tests failing consistently:

apache_beam/ml/inference/sklearn_inference_it_test.py::SklearnInference::test_sklearn_mnist_classification 

apache_beam/ml/inference/sklearn_inference_it_test.py::SklearnInference::test_sklearn_mnist_classification_large_model

Error:

 subprocess.CalledProcessError: Command '['/runner/_work/beam/beam/build/gradleenv/2050596100/bin/python3.12', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/tmp/tmpoq1ebvgy/tmp_requirements.txt', '--exists-action', 'i', '--no-deps', '--implementation', 'cp', '--abi', 'cp312', '--platform', 'manylinux2014_x86_64']' returned non-zero exit status 1.


Error compiling Cython file:

sklearn/utils/_vector_sentinel.pyx:31:9: Previous declaration is here

sklearn cannot be installed from source using Cython.

This happened as early as https://github.com/apache/beam/commits/5b2bfe96f83a5631c3a8d5c3b92a0f695ffe2d7d

@Abacn
Contributor

Abacn commented Aug 13, 2024

Contributor Author

Reopening since the workflow is still flaky

@github-actions github-actions bot reopened this Aug 30, 2024
Contributor Author

Reopening since the workflow is still flaky

@liferoad
Collaborator

2024-08-30T07:28:39.6571287Z if setup_options.setup_file is not None:
2024-08-30T07:28:39.6571763Z if not os.path.isfile(setup_options.setup_file):
2024-08-30T07:28:39.6572227Z > raise RuntimeError(
2024-08-30T07:28:39.6572923Z 'The file %s cannot be found. It was specified in the '
2024-08-30T07:28:39.6573578Z '--setup_file command line option.' % setup_options.setup_file)
2024-08-30T07:28:39.6574970Z E RuntimeError: The file /runner/_work/beam/beam/sdks/python/apache_beam/examples/complete/juliaset/src/setup.py cannot be found. It was specified in the --setup_file command line option.

https://productionresultssa6.blob.core.windows.net/actions-results/9f18d66f-dabf-46e8-8b29-ae50d075f3dd/workflow-job-run-912db29d-d57b-5850-6efb-b125ca814b95/logs/job/job-logs.txt?rsct=text%2Fplain&se=2024-08-30T14%3A06%3A43Z&sig=aqESnfP68oo0sF7TUtpq%2BNFgdgfCbq8Ey3q%2BFMLZtvI%3D&ske=2024-08-31T00%3A21%3A54Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2024-08-30T12%3A21%3A54Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2024-05-04&sp=r&spr=https&sr=b&st=2024-08-30T13%3A56%3A38Z&sv=2024-05-04

@tvalentyn tvalentyn linked a pull request Aug 30, 2024 that will close this issue
@tvalentyn
Contributor

Currently failing test:

gradlew :sdks:python:test-suites:portable:py312:portableLocalRunnerJuliaSetWithSetupPy

@damccorm
Contributor

damccorm commented Nov 1, 2024

This is red again - https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python.yml?query=branch%3Amaster

It looks like there are currently 2 issues:

  1. Python 3.9 job is failing, I think probably because of the mypy changes. example failure
  2. The TensorRT tests are failing. Originally, they were failing because of a mismatch between container/local python versions, but now they seem to be running into CUDA issues with the new container. example failure and corresponding failing Dataflow job

@damccorm damccorm reopened this Nov 1, 2024
@damccorm
Contributor

damccorm commented Nov 1, 2024

@jrmccluskey would you mind taking a look at these?

@damccorm damccorm assigned jrmccluskey and unassigned liferoad Nov 1, 2024
@jrmccluskey
Contributor

The failure in the 3.9 postcommit is apache_beam/examples/fastavro_it_test.py::FastavroIT::test_avro_it; I will dive deeper into that shortly.

@jrmccluskey
Contributor

The problem in the TensorRT container is that we seem to have two different versions of CUDA installed, one at version 11.8 and the other at 12.1 (we want everything at 12.1)

@damccorm
Contributor

damccorm commented Nov 4, 2024

Looks like after sickbaying TensorRT tests, there are still failures. https://ge.apache.org/s/27igat7sfmcsu/console-log/task/:sdks:python:test-suites:portable:py310:portableWordCountSparkRunnerBatch?anchor=60&page=1 is an example, it looks like we're failing because we're missing a class in the spark runner.

@Abacn would you mind taking a look? It's unclear why this is happening now, but I'm guessing it may be related to #32976 (and maybe some caching kept it from showing up?)

@Abacn
Contributor

Abacn commented Nov 4, 2024

> Looks like after sickbaying TensorRT tests, there are still failures. https://ge.apache.org/s/27igat7sfmcsu/console-log/task/:sdks:python:test-suites:portable:py310:portableWordCountSparkRunnerBatch?anchor=60&page=1 is an example, it looks like we're failing because we're missing a class in the spark runner.
>
> @Abacn would you mind taking a look? Its unclear why this is happening now, but I'm guessing it may be related to #32976 (and maybe some caching kept it from showing up?)

It's a bad Gradle cache. I cannot reproduce this locally on the master branch, and I also inspected the expansion jar.

For some reason, the Gradle cache for shadowJar has recently been breaking more frequently.
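
For anyone who wants to do a similar jar inspection, here is a minimal sketch using only the Python standard library. `jar_contains_class` is a hypothetical helper, not part of Beam; it relies on the fact that a jar is a zip archive in which each class is stored as a `.class` entry under its package path.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar has a .class entry for class_name.

    class_name may be written with dots or slashes, for example
    'org.apache.beam.vendor.grpc.v1p60p1.io.grpc.BindableService'.
    """
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

As a usage example, calling it on the job-server jar from the failing task with the class name from the `NoClassDefFoundError` earlier in this thread would show whether the shaded gRPC class actually made it into the built artifact. If it returns False on the CI artifact but True on a locally built jar, that would be consistent with a stale or broken build cache.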
