[BUG] Rework RapidsShuffleManager initialization for Apache Spark 4.0.0 #11107

Closed
gerashegalov opened this issue Jun 28, 2024 · 3 comments · Fixed by #11904
Labels: bug (Something isn't working), P0 (Must have for release), Spark 4.0+ (Spark 4.0+ issues)

Comments

@gerashegalov
Collaborator

gerashegalov commented Jun 28, 2024

apache/spark#43627 eliminated the need to add the plugin jar via spark.executor.extraClassPath and paved the way for a simplified Boolean switch useRSM=true/false. Now would be a good time to do this work. At a minimum, we need to fix the NullPointerException caused by the initialization order change.

Steps/Code to reproduce bug

Start a local-cluster with RSM

JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 \
  ~/dist/spark-4.0.0-preview1-bin-hadoop3/bin/spark-shell \
  --jars scala2.13/dist/target/rapids-4-spark_2.13-24.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.rapids.memory.gpu.allocSize=1536m \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark400.RapidsShuffleManager  \
  --master local-cluster[2,2,1024]

Note: --conf spark.executor.extraClassPath=$PWD/scala2.13/dist/target/rapids-4-spark_2.13-24.08.0-SNAPSHOT-cuda11.jar

Run

scala> spark.range(100000).repartition(2).summary().collect()

Check the executor log

{
  "ts": "2024-06-28T21:39:27.924Z",
  "level": "ERROR",
  "msg": "Exception in the executor plugin, shutting down!",
  "exception": {
    "class": "java.lang.NullPointerException",
    "msg": "Cannot invoke \"Object.getClass()\" because \"shuffleManager\" is null",
    "stacktrace": [
      {
        "class": "org.apache.spark.sql.rapids.GpuShuffleEnv$",
        "method": "initShuffleManager",
        "file": "GpuShuffleEnv.scala",
        "line": 112
      },
      {
        "class": "com.nvidia.spark.rapids.RapidsExecutorPlugin",
        "method": "init",
        "file": "Plugin.scala",
        "line": 551
      },
      {
        "class": "org.apache.spark.internal.plugin.ExecutorPluginContainer",
        "method": "$anonfun$executorPlugins$1",
        "file": "PluginContainer.scala",
        "line": 125
      },
...
    ]
  },
  "logger": "RapidsExecutorPlugin"
}
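
The message in the log above says that getClass() was invoked on a null shuffleManager reference. As a minimal sketch of how such a dereference could be guarded, the helper below wraps a possibly-null reference in Option before reading its class. The object and method names are hypothetical and are not taken from GpuShuffleEnv.scala.

```scala
object NullSafeShuffleCheck {
  // Hypothetical helper, not part of the plugin: returns the runtime class
  // name when an instance exists, or None when the reference is still null
  // (for example because the shuffle manager has not been created yet).
  def shuffleManagerClassName(shuffleManager: AnyRef): Option[String] =
    Option(shuffleManager).map(_.getClass.getName)
}
```

A caller can then treat None as "not instantiated yet" and defer instance-based checks instead of failing the executor plugin.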

Additional context

[SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order
razajafri#3

gerashegalov added the bug, ? - Needs Triage, and Spark 4.0+ labels on Jun 28, 2024
abellina self-assigned this on Jun 28, 2024
@abellina
Collaborator

Thanks for filing this. I do not know why we got an NPE here. I didn't get one when I tested the apache issue, so I am worried now that there's a bug somewhere.

@gerashegalov
Collaborator Author

gerashegalov commented Jun 28, 2024

Our plugin init code currently assumes that the lazy shuffle manager instance SparkEnv.get.shuffleManager has already been created and set, so that it can run some validation and initialization. Now that the order of SM instantiation and plugin initialization is reversed in 4.0.0, we need to do the validation steps in the ExecutorDriver init without assuming an instance exists, and set some flag to force eager initialization at SM instantiation time. I think we can write this code without shimming, but worst case with shimming.
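
One possible shape for this, as a rough sketch only: the plugin validates the configured class name from the conf rather than from a live instance, and sets a flag that the shuffle manager reads when it is constructed later. Everything below is a hypothetical stand-in written without Spark dependencies. ShuffleInitSketch, SketchShuffleManager, eagerInitRequested, and pluginInit are illustrative names, not the plugin's actual code, and treating an unset spark.shuffle.manager as an error is an assumption made to keep the example small.

```scala
import java.util.concurrent.atomic.AtomicBoolean

object ShuffleInitSketch {
  // Hypothetical flag: set during plugin init, read by the shuffle manager
  // constructor once the manager is instantiated afterwards.
  val eagerInitRequested = new AtomicBoolean(false)

  // Hypothetical stand-in for a shuffle manager; not the real RapidsShuffleManager.
  final class SketchShuffleManager {
    // Decide at construction time whether to initialize eagerly.
    val initializedEagerly: Boolean = eagerInitRequested.get()
  }

  // Plugin-side validation that looks only at the configured class name,
  // so it does not depend on a shuffle manager instance already existing.
  def pluginInit(conf: Map[String, String]): Either[String, String] =
    conf.get("spark.shuffle.manager") match {
      case Some(cls) if cls.endsWith("RapidsShuffleManager") =>
        eagerInitRequested.set(true)
        Right(cls)
      case Some(cls) =>
        Left(s"unexpected shuffle manager: $cls")
      case None =>
        Left("spark.shuffle.manager is not set")
    }
}
```

In this arrangement the order matches 4.0.0 as described above: pluginInit runs first, and a SketchShuffleManager constructed afterwards sees the flag.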

mattahrens removed the ? - Needs Triage label on Jul 2, 2024
@gerashegalov
Collaborator Author

This issue affects Databricks 14.3 as well

24/12/20 06:59:04 ERROR RapidsExecutorPlugin: Exception in the executor plugin, shutting down!
java.lang.NullPointerException
	at org.apache.spark.sql.rapids.GpuShuffleEnv$.initShuffleManager(GpuShuffleEnv.scala:112)
	at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:544)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:127)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:115)
