[KYUUBI #5402] Introduce Spark JVM quake plugin

# 🔍 Description
## Issue References 🔗

This pull request fixes #5402

## Describe Your Solution 🔧

When memory management in the Spark engine gets out of control, we typically use jvmkill as a remedy: it kills the process and generates a heap dump for post-mortem analysis. However, even with jvmkill protection, we may still hit problems caused by the JVM running short of memory, such as Full GC running repeatedly without any useful work being done between pauses. Because the JVM does not exhaust 100% of its resources in this state, jvmkill is not triggered.

JVMQuake monitors GC behavior at a finer granularity, which enables early detection of memory management issues and fast failure.
Use the following configuration to enable the JVMQuake plugin:
```
spark.plugins=org.apache.spark.kyuubi.jvm.quake.SparkJVMQuakePlugin
```
|  configuration   | default  | comment  |
|  ----  | ----  | ----  |
| spark.driver.jvmQuake.enabled  | false | when true, enable driver jvmQuake   |
| spark.executor.jvmQuake.enabled  | false | when true, enable executor jvmQuake   |
| spark.driver.jvmQuake.heapDump.enabled  | false | when true, dump the driver JVM heap when jvmQuake reaches the threshold   |
| spark.executor.jvmQuake.heapDump.enabled  | false | when true, dump the executor JVM heap when jvmQuake reaches the threshold   |
| spark.jvmQuake.dumpThreshold  | 100 | The number of seconds to dump memory  |
| spark.jvmQuake.killThreshold  | 200 | The number of seconds to kill process  |
| spark.jvmQuake.exitCode  | 138 | The exit code used when the process is killed  |
| spark.jvmQuake.heapDumpPath  | /tmp/spark_jvm_quake/apps | The path of heap dump  |
| spark.jvmQuake.checkInterval  | 3 | The number of seconds to check jvmQuake  |
| spark.jvmQuake.runTimeWeight  | 1.0 | The weight of run time  |

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #6572 from yoock/features/kyuubi-jvm-quake.

Closes #5402

84361ce8f [王龙] add jvm quake

Authored-by: 王龙 <wanglong16@xiaomi.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
王龙 2024-09-02 12:29:41 +08:00 committed by Cheng Pan
parent ac7702c85d
commit e1e7772a9f
7 changed files with 463 additions and 0 deletions


@ -0,0 +1,66 @@
<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
-->
# JVM Quake Support
When memory management in the Spark engine gets out of control, we typically set `spark.driver/executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath={heapDumpPath} -XX:OnOutOfMemoryError="kill -9 %p"` as a remedy, which kills the process and generates a heap dump for post-mortem analysis. However, even with this kill-on-OOM protection, we may still hit problems caused by the JVM running short of memory, such as Full GC running repeatedly without any useful work being done between pauses. Because the JVM does not exhaust 100% of its resources in this state, the kill is never triggered.
JVM Quake monitors GC behavior at a finer granularity, which enables early detection of memory management issues and fast failure.
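As a rough illustration of the signal involved, the JVM exposes its cumulative GC time through the standard `java.lang.management` API. The following is a minimal sketch rather than the plugin's implementation; the object name `GcTimeProbe` and its helper `totalGcMillis` are hypothetical.

```scala
import java.lang.management.ManagementFactory

import scala.collection.JavaConverters._

// Hypothetical helper, for illustration only.
object GcTimeProbe {
  // Sum of the accumulated collection time (in milliseconds) across all collectors.
  // getCollectionTime returns -1 when the value is undefined, so clamp it to 0.
  def totalGcMillis(): Long =
    ManagementFactory.getGarbageCollectorMXBeans.asScala
      .map(bean => math.max(0L, bean.getCollectionTime))
      .sum

  def main(args: Array[String]): Unit = {
    val uptimeMillis = ManagementFactory.getRuntimeMXBean.getUptime
    val gcMillis = totalGcMillis()
    // A JVM stuck in back-to-back Full GCs shows gcMillis growing almost as fast as uptime.
    println(s"uptime=${uptimeMillis}ms gc=${gcMillis}ms")
  }
}
```

Sampling these two values periodically and comparing their deltas gives the split between GC time and run time over each interval.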
## Usage
JVM Quake is implemented as a Spark plugin. The plugin technically supports Spark 3.0 onwards, but was only verified with Spark 3.3 to 4.0 in CI.
### Build with Apache Maven
The Spark JVM Quake plugin is built using [Apache Maven](https://maven.apache.org).
To build it, `cd` to the root directory of the Kyuubi project and run:
```shell
build/mvn clean package -DskipTests -pl :kyuubi-spark-jvm-quake_2.12 -am
```
After a while, if everything goes well, you will find the plugin at `./extensions/spark/kyuubi-spark-jvm-quake/target/kyuubi-spark-jvm-quake_${scala.binary.version}-${project.version}.jar`.
### Installing
Make the `kyuubi-spark-jvm-quake_*.jar` and its transitive dependencies available on the Spark runtime classpath, for example by
- Copying them to `$SPARK_HOME/jars`, or
- Specifying them in the `spark.jars` configuration
### Settings for Spark Plugins
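Add `org.apache.spark.kyuubi.jvm.quake.SparkJVMQuakePlugin` to the Spark configuration `spark.plugins`. The plugin only starts monitoring when `spark.driver.jvmQuake.enabled` or `spark.executor.jvmQuake.enabled` is set to `true`; both default to `false`.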
```properties
spark.plugins=org.apache.spark.kyuubi.jvm.quake.SparkJVMQuakePlugin
```
## Additional Configurations
| Name | Default Value | Description |
|------------------------------------------|---------------------------|--------------------------------------------------------------------|
| spark.driver.jvmQuake.enabled | false | when true, enable driver jvmQuake |
| spark.executor.jvmQuake.enabled | false | when true, enable executor jvmQuake |
| spark.driver.jvmQuake.heapDump.enabled | false | when true, dump the driver JVM heap when jvmQuake reaches the threshold |
| spark.executor.jvmQuake.heapDump.enabled | false | when true, dump the executor JVM heap when jvmQuake reaches the threshold |
| spark.jvmQuake.killThreshold | 200 | The JVM Quake value, in seconds, at which the process is killed |
| spark.jvmQuake.exitCode | 138 | The exit code used when the process is killed |
| spark.jvmQuake.heapDumpPath | /tmp/spark_jvm_quake/apps | The local directory for heap dumps |
| spark.jvmQuake.checkInterval | 3 | The interval, in seconds, between JVM Quake checks |
| spark.jvmQuake.runTimeWeight | 1.0 | The weight of run time |
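
The JVM Quake value behaves like a leaky bucket: GC time fills it, run time multiplied by `spark.jvmQuake.runTimeWeight` drains it, and the process is killed once the level exceeds `spark.jvmQuake.killThreshold`. The following is a minimal sketch of that arithmetic under simplifying assumptions: it works in whole seconds (the monitor itself tracks nanoseconds), and the names `QuakeBucketSketch`, `Window`, `nextLevel`, and `firstBreach` are hypothetical, not part of the plugin.

```scala
// Hypothetical names, for illustration only.
object QuakeBucketSketch {
  // One observation window: seconds spent in GC and seconds spent running application code.
  final case class Window(gcSeconds: Long, runSeconds: Long)

  // GC time fills the bucket, weighted run time drains it, and the level never drops below 0.
  def nextLevel(level: Long, w: Window, runTimeWeight: Double): Long =
    math.max(0L, level + w.gcSeconds - (w.runSeconds * runTimeWeight).toLong)

  // Index of the first window after which the level exceeds killThreshold, if any.
  def firstBreach(windows: Seq[Window], killThreshold: Long, runTimeWeight: Double): Option[Int] = {
    val levels = windows.scanLeft(0L)((level, w) => nextLevel(level, w, runTimeWeight)).tail
    val idx = levels.indexWhere(_ > killThreshold)
    if (idx >= 0) Some(idx) else None
  }

  def main(args: Array[String]): Unit = {
    // Made-up workloads: a thrashing JVM (9s GC per 1s of work) and a healthy one (1s GC per 9s).
    val thrashing = Seq.fill(30)(Window(gcSeconds = 9, runSeconds = 1))
    val healthy = Seq.fill(30)(Window(gcSeconds = 1, runSeconds = 9))
    println(firstBreach(thrashing, killThreshold = 200, runTimeWeight = 1.0)) // Some(25)
    println(firstBreach(healthy, killThreshold = 200, runTimeWeight = 1.0)) // None
  }
}
```

In this sketch a larger `runTimeWeight` drains the bucket faster, so the same GC pressure takes longer to reach the threshold, while a smaller `killThreshold` makes the process fail sooner.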


@ -0,0 +1,59 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one or more
~ contributor license agreements. See the NOTICE file distributed with
~ this work for additional information regarding copyright ownership.
~ The ASF licenses this file to You under the Apache License, Version 2.0
~ (the "License"); you may not use this file except in compliance with
~ the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.apache.kyuubi</groupId>
<artifactId>kyuubi-parent</artifactId>
<version>1.10.0-SNAPSHOT</version>
<relativePath>../../../pom.xml</relativePath>
</parent>
<artifactId>kyuubi-spark-jvm-quake_${scala.binary.version}</artifactId>
<packaging>jar</packaging>
<name>Kyuubi Spark JVM Quake</name>
<url>https://kyuubi.apache.org/</url>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<type>test-jar</type>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<testResources>
<testResource>
<directory>${project.basedir}/src/test/resources</directory>
</testResource>
</testResources>
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
<testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
</build>
</project>


@ -0,0 +1,139 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.kyuubi.jvm.quake
import java.io.File
import java.lang.management.ManagementFactory
import java.util.concurrent.TimeUnit
import scala.collection.JavaConverters._
import SparkJVMQuake._
import com.sun.management.HotSpotDiagnosticMXBean
import org.apache.spark.SparkConf
import org.apache.spark.internal.Logging
import org.apache.spark.kyuubi.jvm.quake.SparkJVMQuakeConf._
import org.apache.spark.util.ThreadUtils
class SparkJVMQuake(conf: SparkConf, heapDumpEnabled: Boolean) extends Logging {
private val appId = conf.get("spark.app.id", "app-id")
private val killThreshold = TimeUnit.SECONDS.toNanos(conf.get(JVM_QUAKE_KILL_THRESHOLD))
private val exitCode = conf.get(JVM_QUAKE_EXIT_CODE)
private val checkInterval = TimeUnit.SECONDS.toMillis(conf.get(JVM_QUAKE_CHECK_INTERVAL))
private val runTimeWeight = conf.get(JVM_QUAKE_RUN_TIME_WEIGHT)
private[quake] val heapPath = s"${conf.get(JVM_QUAKE_HEAP_DUMP_PATH)}/$appId"
private[quake] var (lastExitTime, lastGCTime) = getLastGCInfo
private[quake] var bucket: Long = 0L
private var heapDumping: Boolean = false
private[quake] val heapDumpFileName = s"${getPid()}.hprof"
private val scheduler =
ThreadUtils.newDaemonSingleThreadScheduledExecutor("kyuubi-jvm-quake")
private[quake] def run(): Unit = {
val (currentExitTime, currentGcTime) = getLastGCInfo
if (currentExitTime != lastExitTime && !heapDumping) {
val gcTime = currentGcTime - lastGCTime
val runTime = currentExitTime - lastExitTime - gcTime
bucket = Math.max(0, bucket + gcTime - BigDecimal(runTime * runTimeWeight).toLong)
if (bucket > killThreshold) {
logError(s"JVM GC has reached the threshold!!!" +
s" (bucket: ${bucket / 1000000000}s, killThreshold: ${killThreshold / 1000000000}s)")
if (heapDumpEnabled) {
heapDumping = true
saveHeap()
}
System.exit(exitCode)
}
lastExitTime = currentExitTime
lastGCTime = currentGcTime
}
}
def start(): Unit = {
scheduler.scheduleAtFixedRate(
() => SparkJVMQuake.this.run(),
0,
checkInterval,
TimeUnit.MILLISECONDS)
}
def stop(): Unit = {
scheduler.shutdown()
}
private[quake] def saveHeap(): Unit = {
try {
val saveDir = new File(heapPath)
if (!saveDir.exists()) {
saveDir.mkdirs()
}
val heapDumpFile = new File(saveDir, heapDumpFileName)
if (heapDumpFile.exists()) {
logInfo(s"Heap dump already exists at $heapDumpFile")
} else {
logInfo(s"Starting heap dump at $heapDumpFile")
val server = ManagementFactory.getPlatformMBeanServer
val mxBean = ManagementFactory.newPlatformMXBeanProxy(
server,
"com.sun.management:type=HotSpotDiagnostic",
classOf[HotSpotDiagnosticMXBean])
mxBean.dumpHeap(heapDumpFile.getAbsolutePath, false)
}
} catch {
case e: Exception =>
logError(s"Failed to dump process(${getPid()}) heap to $heapPath", e)
}
}
}
object SparkJVMQuake {
private[this] var monitor: Option[SparkJVMQuake] = None
def start(sparkConf: SparkConf, heapDumpEnabled: Boolean): Unit = {
monitor = Some(new SparkJVMQuake(sparkConf, heapDumpEnabled))
monitor.foreach(_.start())
}
def stop(): Unit = {
monitor.foreach(_.stop())
}
def getLastGCInfo: (Long, Long) = {
val mxBeans = ManagementFactory.getGarbageCollectorMXBeans.asScala
val lastGCInfos = mxBeans
.filter(_.isInstanceOf[com.sun.management.GarbageCollectorMXBean])
.map(_.asInstanceOf[com.sun.management.GarbageCollectorMXBean])
.flatMap(bean => Option(bean.getLastGcInfo))
if (lastGCInfos.isEmpty) {
(0L, 0L)
} else {
(lastGCInfos.map(_.getEndTime).max * 1000000, mxBeans.map(_.getCollectionTime).sum * 1000000)
}
}
def getPid(): Long = {
ManagementFactory.getRuntimeMXBean.getName.split("@")(0).toLong
}
}


@ -0,0 +1,94 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.kyuubi.jvm.quake
import java.util.concurrent.TimeUnit
import org.apache.spark.internal.config.ConfigBuilder
object SparkJVMQuakeConf {
val DRIVER_JVM_QUAKE_ENABLED =
ConfigBuilder("spark.driver.jvmQuake.enabled")
.doc("Whether to enable JVM quake on the driver.")
.version("1.10.0")
.booleanConf
.createWithDefault(false)
val EXECUTOR_JVM_QUAKE_ENABLED =
ConfigBuilder("spark.executor.jvmQuake.enabled")
.doc("Whether to enable JVM quake on the executor.")
.version("1.10.0")
.booleanConf
.createWithDefault(false)
val JVM_QUAKE_RUN_TIME_WEIGHT =
ConfigBuilder("spark.jvmQuake.runTimeWeight")
.doc("Weight of run time. This value determines the rate of change of the JVM Quake value: " +
"the larger this weight, the more slowly the JVM Quake value accumulates")
.version("1.10.0")
.doubleConf
.createWithDefault(1.0)
val JVM_QUAKE_HEAP_DUMP_PATH =
ConfigBuilder("spark.jvmQuake.heapDumpPath")
.doc("The local path of the heap dump. If the directory does not exist, " +
"it is created automatically, which requires the corresponding permissions")
.version("1.10.0")
.stringConf
.createWithDefault("/tmp/spark_jvm_quake/apps")
val JVM_QUAKE_KILL_THRESHOLD =
ConfigBuilder("spark.jvmQuake.killThreshold")
.doc(s"The JVM Quake value is the cumulative value of " +
s"gcTime - runTime * ${JVM_QUAKE_RUN_TIME_WEIGHT.key}. " +
s"When it reaches this threshold, the process is killed, after a heap dump if enabled")
.version("1.10.0")
.timeConf(TimeUnit.SECONDS)
.createWithDefault(200)
val DRIVER_JVM_QUAKE_HEAP_DUMP_ENABLED =
ConfigBuilder("spark.driver.jvmQuake.heapDump.enabled")
.doc(s"When true, dump the heap to ${JVM_QUAKE_HEAP_DUMP_PATH.key} when the driver " +
s"JVM Quake value reaches the ${JVM_QUAKE_KILL_THRESHOLD.key} threshold")
.version("1.10.0")
.booleanConf
.createWithDefault(false)
val EXECUTOR_JVM_QUAKE_HEAP_DUMP_ENABLED =
ConfigBuilder("spark.executor.jvmQuake.heapDump.enabled")
.doc(s"When true, dump the heap to ${JVM_QUAKE_HEAP_DUMP_PATH.key} when the executor " +
s"JVM Quake value reaches the ${JVM_QUAKE_KILL_THRESHOLD.key} threshold")
.version("1.10.0")
.booleanConf
.createWithDefault(false)
val JVM_QUAKE_CHECK_INTERVAL =
ConfigBuilder("spark.jvmQuake.checkInterval")
.doc("The interval at which JVM Quake checks GC activity")
.version("1.10.0")
.timeConf(TimeUnit.SECONDS)
.createWithDefault(3)
val JVM_QUAKE_EXIT_CODE =
ConfigBuilder("spark.jvmQuake.exitCode")
.doc("Exit code for JVM Quake kill")
.version("1.10.0")
.intConf
.createWithDefault(138)
}


@ -0,0 +1,61 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.kyuubi.jvm.quake
import java.util.{Collections, Map => JMap}
import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}
import org.apache.spark.internal.Logging
import org.apache.spark.kyuubi.jvm.quake.SparkJVMQuakeConf._
class SparkJVMQuakePlugin extends SparkPlugin with Logging {
override def driverPlugin(): DriverPlugin = {
new DriverPlugin() {
override def init(sc: SparkContext, pluginContext: PluginContext): JMap[String, String] = {
val jvmQuakeEnabled = sc.conf.get(DRIVER_JVM_QUAKE_ENABLED)
val jvmQuakeHeapDumpEnabled = sc.conf.get(DRIVER_JVM_QUAKE_HEAP_DUMP_ENABLED)
if (jvmQuakeEnabled) {
SparkJVMQuake.start(sc.conf, jvmQuakeHeapDumpEnabled)
}
Collections.emptyMap()
}
override def shutdown(): Unit = {
SparkJVMQuake.stop()
}
}
}
override def executorPlugin(): ExecutorPlugin = {
new ExecutorPlugin {
override def init(context: PluginContext, extraConf: JMap[String, String]): Unit = {
val jvmQuakeEnabled = context.conf().get(EXECUTOR_JVM_QUAKE_ENABLED)
val jvmQuakeHeapDumpEnabled = context.conf().get(EXECUTOR_JVM_QUAKE_HEAP_DUMP_ENABLED)
if (jvmQuakeEnabled) {
SparkJVMQuake.start(context.conf(), jvmQuakeHeapDumpEnabled)
}
}
override def shutdown(): Unit = {
SparkJVMQuake.stop()
}
}
}
}


@ -0,0 +1,43 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.kyuubi.jvm.quake
import java.io.File
import java.util.UUID
import org.apache.spark.{SparkConf, SparkFunSuite}
import org.apache.spark.internal.Logging
import org.apache.spark.kyuubi.jvm.quake.SparkJVMQuakeConf._
import org.scalatest.BeforeAndAfterEach
class SparkJVMQuakeSuite extends SparkFunSuite with BeforeAndAfterEach with Logging {
test("check JVM Quake status") {
val conf = new SparkConf
conf.set("spark.app.id", UUID.randomUUID().toString)
conf.set(JVM_QUAKE_CHECK_INTERVAL.key, "1")
conf.set(JVM_QUAKE_HEAP_DUMP_PATH.key, "./extensions/spark/kyuubi-spark-jvm-quake/target")
var monitor = new SparkJVMQuake(conf, false)
assert(!new File(monitor.heapPath, monitor.heapDumpFileName).exists())
monitor = new SparkJVMQuake(conf, true)
monitor.saveHeap()
assert(new File(monitor.heapPath, monitor.heapDumpFileName).exists())
}
}


@ -61,6 +61,7 @@
<module>extensions/spark/kyuubi-spark-connector-tpcds</module>
<module>extensions/spark/kyuubi-spark-connector-tpch</module>
<module>extensions/spark/kyuubi-spark-lineage</module>
<module>extensions/spark/kyuubi-spark-jvm-quake</module>
<module>externals/kyuubi-chat-engine</module>
<module>externals/kyuubi-download</module>
<module>externals/kyuubi-flink-sql-engine</module>