
[spark-rapids] generate spark-rapids/spark-rapids.sh from template #1284

Draft · wants to merge 37 commits into master
Conversation

@cjac (Contributor) commented Dec 25, 2024

This is a re-implementation of the script using templates created while refactoring common code under gpu/, dask/, rapids/, spark-rapids/, horovod/, mlvm/, and many of the other initialization actions.
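
For readers skimming the diff, the generation model is: the published spark-rapids.sh is no longer hand-edited; it is expanded from a per-action template plus the shared function libraries. As a rough sketch only (the actual generator, tag syntax, and file layout in this PR may differ), the expansion step could look like:

```python
# Hypothetical sketch of the template expansion step; the real generator,
# file names, and include-tag syntax in this PR may differ.
import pathlib
import re

def expand_template(template_path: str, library_dir: str) -> str:
    """Replace [% include foo.sh %]-style tags with shared library fragments."""
    text = pathlib.Path(template_path).read_text()

    def splice(match: re.Match) -> str:
        # Pull the named fragment from the shared function library.
        return pathlib.Path(library_dir, match.group(1)).read_text()

    return re.sub(r"\[%\s*include\s+(\S+)\s*%\]", splice, text)

if __name__ == "__main__":
    # e.g. regenerate spark-rapids.sh from its template plus common GPU code
    print(expand_template("templates/spark-rapids/spark-rapids.sh.in",
                          "templates/common"))
```

The payoff of this shape is that a fix to the common GPU code lands once in the shared library and propagates to every generated action on the next regeneration.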

cjac self-assigned this Dec 25, 2024
@cjac commented Dec 25, 2024

/gcbrun

@cjac commented Dec 25, 2024

/gcbrun

@cjac commented Dec 26, 2024

/gcbrun

@cjac commented Dec 26, 2024

/gcbrun

@cjac commented Dec 26, 2024

/gcbrun

@cjac commented Dec 26, 2024

/gcbrun

@cjac commented Dec 26, 2024

/gcbrun

@cjac commented Dec 27, 2024

/gcbrun

@cjac commented Dec 27, 2024

/gcbrun

@cjac commented Dec 27, 2024

/gcbrun

@cjac commented Dec 27, 2024

/gcbrun

@cjac commented Dec 27, 2024

okay so now we have some rocky8 and rocky9 coverage. I don't think we have to disable the 2.0 images. Let's verify...

cjac added 3 commits December 27, 2024 16:28
* defining is_debuntu with the other os comparison functions
* refactored gpu-related code out of common function library
* being more surgical about signing material usage
* skip installing dependencies if it has already been done
* refactored configure_gpu_exclusive_mode to fewer lines
* not relying on dkms certs being deployed to use modulus_md5sum
* removed dependency on pciutils
* less reactive to not having GPU
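
Two of the commit messages above are worth unpacking: the compact configure_gpu_exclusive_mode and the "less reactive to not having GPU" change. As a hedged sketch of the intent, in Python rather than the action's actual shell (the real implementation in this PR may differ):

```python
# Sketch only; assumes nvidia-smi is the mechanism, which the PR's shell
# code may or may not use in exactly this way.
import subprocess

def gpu_count() -> int:
    """Return the number of NVIDIA GPUs, or 0 if nvidia-smi is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return 0  # "less reactive to not having GPU": failure means no GPUs
    return len(out.split())

def configure_gpu_exclusive_mode() -> None:
    """Put all NVIDIA GPUs into EXCLUSIVE_PROCESS compute mode (needs root)."""
    if gpu_count() > 0:
        subprocess.run(["nvidia-smi", "-c", "EXCLUSIVE_PROCESS"], check=True)
```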
@cjac commented Dec 28, 2024

/gcbrun

@cjac commented Dec 28, 2024

okay, now it:

  • meets the expectations of the previous implementation
  • does not disable rocky tests

Let's try skipping fewer tests still
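
For context, "skipping tests" here means image-version gating inside the parameterized test case, so skipping fewer tests means dropping such guards. A hypothetical shape of one (the real test_spark_rapids.py and its DataprocTestCase base may decide this differently):

```python
# Hypothetical illustration only; the real harness derives the image version
# from the test environment rather than a class constant.
from absl.testing import absltest, parameterized

class SparkRapidsTestCase(parameterized.TestCase):
    IMAGE_VERSION = "2.1"  # placeholder for the Dataproc image under test

    @parameterized.parameters(
        ("SINGLE", ["m"], "type=nvidia-tesla-t4"),
    )
    def test_spark_rapids(self, configuration, machine_suffixes, accelerator):
        if float(self.IMAGE_VERSION) < 2.0:
            # Removing guards like this is what "skipping fewer tests" means.
            self.skipTest("not supported before Dataproc 2.0 images")
        # ...create the cluster and call self.verify_spark_job() here...

if __name__ == "__main__":
    absltest.main()
```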

@cjac commented Dec 28, 2024

/gcbrun

@cjac commented Dec 28, 2024

/gcbrun

* 40G is plenty; no need for 50G
@cjac commented Dec 28, 2024

I think we've got single-node clusters working fine on rocky again.

@cjac commented Dec 28, 2024

/gcbrun

@cjac commented Dec 28, 2024

can we get away with skipping no tests at all?

@cjac commented Dec 28, 2024

/gcbrun

@cjac commented Dec 28, 2024

/gcbrun

@cjac commented Dec 28, 2024

/gcbrun

@cjac commented Dec 28, 2024

/gcbrun

@cjac commented Dec 29, 2024

Failing on 2.1-rocky8:

gcloud compute ssh test-rapids-single-2-1-20241228-090957-ax78-m --zone=us-central1-f --command="echo :quit | spark-shell          --conf spark.executor.resource.gpu.amount=1          --conf spark.task.resource.gpu.amount=0.1          --conf spark.dynamicAllocation.enabled=false -i verify_xgboost_spark_rapids.scala"
...
2024-12-28T09:16:40.181635368Z 24/12/28 09:16:38 ERROR SparkContext: Error initializing SparkContext.
2024-12-28T09:16:40.181644485Z org.apache.spark.SparkException: Application application_1735377343997_0001 failed 2 times due to AM Container for appattempt_1735377343997_0001_000002 exited with  exitCode: -1
2024-12-28T09:16:40.181652729Z Failing this attempt.Diagnostics: [2024-12-28 09:16:37.138]ResourceHandlerChain.preStart() failed!
2024-12-28T09:16:40.181661528Z [2024-12-28 09:16:37.138]ResourceHandlerChain.preStart() failed!
2024-12-28T09:16:40.181669571Z For more detailed output, check the application tracking page: http://test-rapids-single-2-1-20241228-090957-ax78-m:8188/applicationhistory/app/application_1735377343997_0001 Then click on links to logs of each attempt.
...
2024-12-28T09:16:40.182246987Z 24/12/28 09:16:38 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to send shutdown message before the AM has registered!
2024-12-28T09:16:40.182254360Z 24/12/28 09:16:38 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
2024-12-28T09:16:40.182262098Z 24/12/28 09:16:38 WARN MetricsSystem: Stopping a MetricsSystem that is not running
2024-12-28T09:16:40.182269324Z 24/12/28 09:16:38 ERROR Main: Failed to initialize Spark session.

2024-12-28T09:16:40.172463304Z INFO: From Testing //:test_spark_rapids (shard 2 of 3):
2024-12-28T09:16:40.181205242Z ==================== Test output for //:test_spark_rapids (shard 2 of 3):
2024-12-28T09:16:40.181259588Z Running tests under Python 3.10.12: /usr/bin/python3
2024-12-28T09:16:40.181271199Z [ RUN      ] SparkRapidsTestCase.test_spark_rapids('SINGLE', ['m'], 'type=nvidia-tesla-t4')
2024-12-28T09:16:40.181279535Z [  FAILED  ] SparkRapidsTestCase.test_spark_rapids('SINGLE', ['m'], 'type=nvidia-tesla-t4')
2024-12-28T09:16:40.181289052Z ======================================================================
2024-12-28T09:16:40.181297700Z FAIL: test_spark_rapids('SINGLE', ['m'], 'type=nvidia-tesla-t4') (__main__.SparkRapidsTestCase)
2024-12-28T09:16:40.181305356Z test_spark_rapids('SINGLE', ['m'], 'type=nvidia-tesla-t4') (__main__.SparkRapidsTestCase)
2024-12-28T09:16:40.181313086Z test_spark_rapids('SINGLE', ['m'], 'type=nvidia-tesla-t4')
2024-12-28T09:16:40.181321220Z ----------------------------------------------------------------------
2024-12-28T09:16:40.181329465Z Traceback (most recent call last):
2024-12-28T09:16:40.181339430Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/test_spark_rapids.runfiles/io_abseil_py/absl/testing/parameterized.py", line 265, in bound_param_test
2024-12-28T09:16:40.181350608Z     test_method(self, *testcase_params)
2024-12-28T09:16:40.181358406Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/test_spark_rapids.runfiles/_main/spark-rapids/test_spark_rapids.py", line 80, in test_spark_rapids
2024-12-28T09:16:40.181367792Z     self.verify_spark_job()
2024-12-28T09:16:40.181375727Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/test_spark_rapids.runfiles/_main/spark-rapids/test_spark_rapids.py", line 34, in verify_spark_job
2024-12-28T09:16:40.181386151Z     self.assert_instance_command(
2024-12-28T09:16:40.181393983Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/test_spark_rapids.runfiles/_main/integration_tests/dataproc_test_case.py", line 290, in assert_instance_command
2024-12-28T09:16:40.181401367Z     ret_code, stdout, stderr = self.assert_command(
2024-12-28T09:16:40.181408995Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/test_spark_rapids.runfiles/_main/integration_tests/dataproc_test_case.py", line 342, in assert_command
2024-12-28T09:16:40.181417121Z     self.assertEqual(
2024-12-28T09:16:40.181424369Z AssertionError: 1 != 0 : Failed to execute command:
2024-12-28T09:16:40.181451032Z gcloud compute ssh test-rapids-single-2-1-20241228-090957-ax78-m --zone=us-central1-f --command="echo :quit | spark-shell          --conf spark.executor.resource.gpu.amount=1          --conf spark.task.resource.gpu.amount=0.1          --conf spark.dynamicAllocation.enabled=false -i verify_xgboost_spark_rapids.scala"
2024-12-28T09:16:40.181488519Z STDOUT:

2024-12-28T09:16:40.181495772Z 
2024-12-28T09:16:40.181503201Z STDERR:
2024-12-28T09:16:40.181511070Z Setting default log level to "WARN".
2024-12-28T09:16:40.181518588Z To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2024-12-28T09:16:40.181527998Z 24/12/28 09:16:26 WARN ResourceUtils: The configuration of cores (exec = 24 task = 2, runnable tasks = 12) will result in wasted resources due to resource gpu limiting the number of runnable tasks per executor to: 10. Please adjust your configuration.
2024-12-28T09:16:40.181535401Z 24/12/28 09:16:26 INFO SparkEnv: Registering MapOutputTracker
2024-12-28T09:16:40.181542402Z 24/12/28 09:16:26 INFO SparkEnv: Registering BlockManagerMaster
2024-12-28T09:16:40.181549576Z 24/12/28 09:16:27 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
2024-12-28T09:16:40.181556968Z 24/12/28 09:16:27 INFO SparkEnv: Registering OutputCommitCoordinator
2024-12-28T09:16:40.181564719Z 24/12/28 09:16:28 WARN RapidsPluginUtils: RAPIDS Accelerator 23.08.2 using cudf 23.08.0.
2024-12-28T09:16:40.181573748Z 24/12/28 09:16:28 WARN RapidsPluginUtils: spark.rapids.sql.multiThreadedRead.numThreads is set to 24.
2024-12-28T09:16:40.181583185Z 24/12/28 09:16:28 WARN RapidsPluginUtils: The current setting of spark.task.resource.gpu.amount (0.1) is not ideal to get the best performance from the RAPIDS Accelerator plugin. It's recommended to be 1/{executor core count} unless you have a special use case.
2024-12-28T09:16:40.181592461Z 24/12/28 09:16:28 WARN RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
2024-12-28T09:16:40.181602487Z 24/12/28 09:16:28 WARN RapidsPluginUtils: spark.rapids.sql.explain is set to `NOT_ON_GPU`. Set it to 'NONE' to suppress the diagnostics logging about the query placement on the GPU.
2024-12-28T09:16:40.181610414Z 24/12/28 09:16:38 ERROR YarnClientSchedulerBackend: The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.
2024-12-28T09:16:40.181635368Z 24/12/28 09:16:38 ERROR SparkContext: Error initializing SparkContext.
2024-12-28T09:16:40.181644485Z org.apache.spark.SparkException: Application application_1735377343997_0001 failed 2 times due to AM Container for appattempt_1735377343997_0001_000002 exited with  exitCode: -1
2024-12-28T09:16:40.181652729Z Failing this attempt.Diagnostics: [2024-12-28 09:16:37.138]ResourceHandlerChain.preStart() failed!
2024-12-28T09:16:40.181661528Z [2024-12-28 09:16:37.138]ResourceHandlerChain.preStart() failed!
2024-12-28T09:16:40.181669571Z For more detailed output, check the application tracking page: http://test-rapids-single-2-1-20241228-090957-ax78-m:8188/applicationhistory/app/application_1735377343997_0001 Then click on links to logs of each attempt.
2024-12-28T09:16:40.181677394Z . Failing the application.
2024-12-28T09:16:40.181685827Z 	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:98) ~[spark-yarn_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.181693770Z 	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:65) ~[spark-yarn_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.181701925Z 	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:234) ~[spark-core_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.181719600Z 	at org.apache.spark.SparkContext.<init>(SparkContext.scala:627) ~[spark-core_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.181727298Z 	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2786) ~[spark-core_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.181734874Z 	at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953) ~[spark-sql_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.181742350Z 	at scala.Option.getOrElse(Option.scala:189) ~[scala-library-2.12.18.jar:?]
2024-12-28T09:16:40.181749901Z 	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947) ~[spark-sql_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.181757102Z 	at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.181764381Z 	at $line3.$read$$iw$$iw.<init>(<console>:15) ~[scala-library-2.12.18.jar:?]
2024-12-28T09:16:40.181772357Z 	at $line3.$read$$iw.<init>(<console>:42) ~[scala-library-2.12.18.jar:?]
2024-12-28T09:16:40.181779738Z 	at $line3.$read.<init>(<console>:44) ~[scala-library-2.12.18.jar:?]
2024-12-28T09:16:40.181787200Z 	at $line3.$read$.<init>(<console>:48) ~[scala-library-2.12.18.jar:?]
2024-12-28T09:16:40.181794677Z 	at $line3.$read$.<clinit>(<console>) ~[scala-library-2.12.18.jar:?]
2024-12-28T09:16:40.181802224Z 	at $line3.$eval$.$print$lzycompute(<console>:7) ~[scala-library-2.12.18.jar:?]
2024-12-28T09:16:40.181809839Z 	at $line3.$eval$.$print(<console>:6) ~[scala-library-2.12.18.jar:?]
2024-12-28T09:16:40.181817260Z 	at $line3.$eval.$print(<console>) ~[scala-library-2.12.18.jar:?]
2024-12-28T09:16:40.181824582Z 	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
2024-12-28T09:16:40.181832141Z 	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
2024-12-28T09:16:40.181839610Z 	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
2024-12-28T09:16:40.181847607Z 	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
2024-12-28T09:16:40.181855210Z 	at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:747) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.181862897Z 	at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1020) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.181893089Z 	at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:568) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.181912057Z 	at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:36) ~[scala-reflect-2.12.18.jar:?]
2024-12-28T09:16:40.181920096Z 	at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:116) ~[scala-reflect-2.12.18.jar:?]
2024-12-28T09:16:40.181927892Z 	at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41) ~[scala-reflect-2.12.18.jar:?]
2024-12-28T09:16:40.181935752Z 	at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:567) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.181943725Z 	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:594) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.181951477Z 	at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:564) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.181958961Z 	at scala.tools.nsc.interpreter.IMain.$anonfun$quietRun$1(IMain.scala:216) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.181966424Z 	at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:206) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.181973890Z 	at scala.tools.nsc.interpreter.IMain.quietRun(IMain.scala:216) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.181981267Z 	at org.apache.spark.repl.SparkILoop.$anonfun$initializeSpark$2(SparkILoop.scala:83) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182010983Z 	at scala.collection.immutable.List.foreach(List.scala:431) ~[scala-library-2.12.18.jar:?]
2024-12-28T09:16:40.182019645Z 	at org.apache.spark.repl.SparkILoop.$anonfun$initializeSpark$1(SparkILoop.scala:83) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182027121Z 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.18.jar:?]
2024-12-28T09:16:40.182034037Z 	at scala.tools.nsc.interpreter.ILoop.savingReplayStack(ILoop.scala:97) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.182041417Z 	at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:83) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182048680Z 	at org.apache.spark.repl.SparkILoop.$anonfun$process$4(SparkILoop.scala:165) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182056018Z 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.18.jar:?]
2024-12-28T09:16:40.182063359Z 	at scala.tools.nsc.interpreter.ILoop.$anonfun$mumly$1(ILoop.scala:166) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.182071177Z 	at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:206) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.182078744Z 	at scala.tools.nsc.interpreter.ILoop.mumly(ILoop.scala:163) ~[scala-compiler-2.12.18.jar:?]
2024-12-28T09:16:40.182086459Z 	at org.apache.spark.repl.SparkILoop.loopPostInit$1(SparkILoop.scala:153) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182094268Z 	at org.apache.spark.repl.SparkILoop.$anonfun$process$10(SparkILoop.scala:221) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182101761Z 	at org.apache.spark.repl.SparkILoop.withSuppressedSettings$1(SparkILoop.scala:189) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182109137Z 	at org.apache.spark.repl.SparkILoop.startup$1(SparkILoop.scala:201) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182116475Z 	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:236) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182123995Z 	at org.apache.spark.repl.Main$.doMain(Main.scala:78) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182132121Z 	at org.apache.spark.repl.Main$.main(Main.scala:58) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182139564Z 	at org.apache.spark.repl.Main.main(Main.scala) ~[spark-repl_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182147126Z 	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
2024-12-28T09:16:40.182154501Z 	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
2024-12-28T09:16:40.182161666Z 	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
2024-12-28T09:16:40.182169382Z 	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
2024-12-28T09:16:40.182176797Z 	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) ~[spark-core_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182184379Z 	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:973) ~[spark-core_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182192149Z 	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) ~[spark-core_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182200566Z 	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) ~[spark-core_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182208136Z 	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) ~[spark-core_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182216103Z 	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1061) ~[spark-core_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182232195Z 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1070) ~[spark-core_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182239670Z 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ~[spark-core_2.12-3.3.2.jar:3.3.2]
2024-12-28T09:16:40.182246987Z 24/12/28 09:16:38 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to send shutdown message before the AM has registered!
2024-12-28T09:16:40.182254360Z 24/12/28 09:16:38 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
2024-12-28T09:16:40.182262098Z 24/12/28 09:16:38 WARN MetricsSystem: Stopping a MetricsSystem that is not running
2024-12-28T09:16:40.182269324Z 24/12/28 09:16:38 ERROR Main: Failed to initialize Spark session.
... (same SparkException and stack trace as the SparkContext error above) ...
2024-12-28T09:16:40.182862581Z 
2024-12-28T09:16:40.182884975Z 
2024-12-28T09:16:40.182895223Z ----------------------------------------------------------------------
2024-12-28T09:16:40.183006263Z Ran 1 test in 412.500s
2024-12-28T09:16:40.183020028Z 
2024-12-28T09:16:40.183028507Z FAILED (failures=1)

@cjac commented Dec 29, 2024

The failure on 2.0-rocky8 looks slightly different on the surface (this image's log4j prints fully-qualified class names), but it bottoms out the same way: ResourceHandlerChain.preStart() fails and the AM container exits with exitCode -1:

2024-12-28T09:09:27.479503282Z INFO: Analyzed target //:test_spark_rapids (97 packages loaded, 869 targets configured).
2024-12-28T09:16:19.390886487Z FAIL: //:test_spark_rapids (shard 2 of 3) (see /home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/testlogs/test_spark_rapids/shard_2_of_3/test.log)
2024-12-28T09:16:19.394234547Z INFO: From Testing //:test_spark_rapids (shard 2 of 3):
2024-12-28T09:16:19.395148228Z ==================== Test output for //:test_spark_rapids (shard 2 of 3):
2024-12-28T09:16:19.395188708Z Running tests under Python 3.10.12: /usr/bin/python3
2024-12-28T09:16:19.395201988Z [ RUN      ] SparkRapidsTestCase.test_spark_rapids('SINGLE', ['m'], 'type=nvidia-tesla-t4')
2024-12-28T09:16:19.395212757Z [  FAILED  ] SparkRapidsTestCase.test_spark_rapids('SINGLE', ['m'], 'type=nvidia-tesla-t4')
2024-12-28T09:16:19.395222048Z ======================================================================
2024-12-28T09:16:19.395231428Z FAIL: test_spark_rapids('SINGLE', ['m'], 'type=nvidia-tesla-t4') (__main__.SparkRapidsTestCase)
2024-12-28T09:16:19.395240868Z test_spark_rapids('SINGLE', ['m'], 'type=nvidia-tesla-t4') (__main__.SparkRapidsTestCase)
2024-12-28T09:16:19.395250208Z test_spark_rapids('SINGLE', ['m'], 'type=nvidia-tesla-t4')
2024-12-28T09:16:19.395264337Z ----------------------------------------------------------------------
2024-12-28T09:16:19.395280778Z Traceback (most recent call last):
2024-12-28T09:16:19.395293098Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/test_spark_rapids.runfiles/io_abseil_py/absl/testing/parameterized.py", line 265, in bound_param_test
2024-12-28T09:16:19.395305178Z     test_method(self, *testcase_params)
2024-12-28T09:16:19.395314587Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/test_spark_rapids.runfiles/_main/spark-rapids/test_spark_rapids.py", line 80, in test_spark_rapids
2024-12-28T09:16:19.395324747Z     self.verify_spark_job()
2024-12-28T09:16:19.395334558Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/test_spark_rapids.runfiles/_main/spark-rapids/test_spark_rapids.py", line 34, in verify_spark_job
2024-12-28T09:16:19.395343878Z     self.assert_instance_command(
2024-12-28T09:16:19.395353678Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/test_spark_rapids.runfiles/_main/integration_tests/dataproc_test_case.py", line 290, in assert_instance_command
2024-12-28T09:16:19.395362927Z     ret_code, stdout, stderr = self.assert_command(
2024-12-28T09:16:19.395372827Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/test_spark_rapids.runfiles/_main/integration_tests/dataproc_test_case.py", line 342, in assert_command
2024-12-28T09:16:19.395382827Z     self.assertEqual(
2024-12-28T09:16:19.395392407Z AssertionError: 1 != 0 : Failed to execute command:
2024-12-28T09:16:19.395425898Z gcloud compute ssh test-rapids-single-2-0-20241228-090935-x9u6-m --zone=us-central1-f --command="echo :quit | spark-shell          --conf spark.executor.resource.gpu.amount=1          --conf spark.task.resource.gpu.amount=0.1          --conf spark.dynamicAllocation.enabled=false -i verify_xgboost_spark_rapids.scala"
2024-12-28T09:16:19.395436278Z STDOUT:
2024-12-28T09:16:19.395445578Z 
2024-12-28T09:16:19.395454438Z STDERR:
2024-12-28T09:16:19.395464178Z Setting default log level to "WARN".
2024-12-28T09:16:19.395473818Z To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2024-12-28T09:16:19.395484487Z 24/12/28 09:16:06 WARN org.apache.spark.resource.ResourceUtils: The configuration of cores (exec = 24 task = 2, runnable tasks = 12) will result in wasted resources due to resource gpu limiting the number of runnable tasks per executor to: 10. Please adjust your configuration.
2024-12-28T09:16:19.395494558Z 24/12/28 09:16:06 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
2024-12-28T09:16:19.395504098Z 24/12/28 09:16:06 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
2024-12-28T09:16:19.395513327Z 24/12/28 09:16:06 INFO org.apache.spark.SparkEnv: Registering BlockManagerMasterHeartbeat
2024-12-28T09:16:19.395522558Z 24/12/28 09:16:06 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator
2024-12-28T09:16:19.395532667Z 24/12/28 09:16:08 WARN com.nvidia.spark.rapids.RapidsPluginUtils: RAPIDS Accelerator 23.08.2 using cudf 23.08.0.
2024-12-28T09:16:19.395542738Z 24/12/28 09:16:08 WARN com.nvidia.spark.rapids.RapidsPluginUtils: spark.rapids.sql.multiThreadedRead.numThreads is set to 24.
2024-12-28T09:16:19.395552338Z 24/12/28 09:16:08 WARN com.nvidia.spark.rapids.RapidsPluginUtils: The current setting of spark.task.resource.gpu.amount (0.1) is not ideal to get the best performance from the RAPIDS Accelerator plugin. It's recommended to be 1/{executor core count} unless you have a special use case.
2024-12-28T09:16:19.395562538Z 24/12/28 09:16:08 WARN com.nvidia.spark.rapids.RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
2024-12-28T09:16:19.395572298Z 24/12/28 09:16:08 WARN com.nvidia.spark.rapids.RapidsPluginUtils: spark.rapids.sql.explain is set to `NOT_ON_GPU`. Set it to 'NONE' to suppress the diagnostics logging about the query placement on the GPU.
2024-12-28T09:16:19.395588848Z 24/12/28 09:16:17 ERROR org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend: The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.
2024-12-28T09:16:19.395600008Z 24/12/28 09:16:17 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
2024-12-28T09:16:19.395611437Z org.apache.spark.SparkException: Application application_1735377327179_0001 failed 2 times due to AM Container for appattempt_1735377327179_0001_000002 exited with  exitCode: -1
2024-12-28T09:16:19.395621297Z Failing this attempt.Diagnostics: [2024-12-28 09:16:16.821]ResourceHandlerChain.preStart() failed!
2024-12-28T09:16:19.395630168Z [2024-12-28 09:16:16.821]ResourceHandlerChain.preStart() failed!
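
Since both images bottom out in ResourceHandlerChain.preStart() on the NodeManager, the next diagnostic step is on the YARN side rather than in Spark; for example, pulling the aggregated application logs for the failed attempt. A sketch, assuming log aggregation is enabled and one is SSH'd into the master:

```python
# Sketch only; the application id is the one from the 2.0-rocky8 log above
# and is specific to that CI run.
import subprocess

APP_ID = "application_1735377327179_0001"

# `yarn logs -applicationId <id>` dumps the aggregated container logs,
# including the NodeManager-side diagnostics for the failed AM attempt.
subprocess.run(["yarn", "logs", "-applicationId", APP_ID], check=False)
```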

@cjac commented Dec 29, 2024

/gcbrun
