### What changes were proposed in this pull request?

`RpcTimeoutException` now includes the RPC address in its message to help troubleshoot timeouts.

### Why are the changes needed?

The message of `RpcTimeoutException` does not currently contain the RPC address, which makes a timeout hard to troubleshoot because the address of the unresponsive RPC endpoint is unknown.

```
24/11/12 03:00:51 [Executor task launch worker for task 53432.0 in stage 0.0 (TID 53487)] ERROR Executor: Exception in task 53432.0 in stage 0.0 (TID 53487)
org.apache.celeborn.common.rpc.RpcTimeoutException: Futures timed out after [120000 milliseconds]. This timeout is controlled by celeborn.rpc.lookupTimeout
    at org.apache.celeborn.common.rpc.RpcTimeout.org$apache$celeborn$common$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:46)
    at org.apache.celeborn.common.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:61)
    at org.apache.celeborn.common.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:57)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at org.apache.celeborn.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.celeborn.common.rpc.RpcEnv.setupEndpointRefByAddr(RpcEnv.scala:106)
    at org.apache.celeborn.common.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:114)
    at org.apache.celeborn.client.ShuffleClientImpl.setupLifecycleManagerRef(ShuffleClientImpl.java:1759)
    at org.apache.celeborn.client.ShuffleClient.get(ShuffleClient.java:89)
    at org.apache.spark.shuffle.celeborn.SparkShuffleManager.getWriter(SparkShuffleManager.java:239)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:57)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:100)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:144)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:598)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1545)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:603)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
    at org.apache.celeborn.common.util.ThreadUtils$.awaitResult(ThreadUtils.scala:316)
    at org.apache.celeborn.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:74)
    ... 15 more
```

Therefore, `RpcTimeoutException` should include the RPC address in its message to help troubleshoot timeouts.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`NettyRpcEnvSuite#ask a message timeout on Future using RpcTimeout`

Closes #2907 from SteNicholas/CELEBORN-1713.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>

| | | |
|---|---|---|
| benchmarks | ||
| src | ||
| pom.xml | ||
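
The idea behind the change described above, attaching the RPC address to the timeout message, can be illustrated with a minimal Java sketch. This is not Celeborn's implementation: the class `RpcTimeoutSketch`, the exception `SketchRpcTimeoutException`, the helper names, the message wording, and the address `example-host:12345` are all hypothetical and chosen only for illustration. Only the configuration key `celeborn.rpc.lookupTimeout` and the general message shape come from the stack trace above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for illustration; these are not Celeborn's actual classes.
public class RpcTimeoutSketch {

  /** Hypothetical exception, loosely modeled on the RpcTimeoutException named above. */
  public static class SketchRpcTimeoutException extends RuntimeException {
    public SketchRpcTimeoutException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Builds a timeout message that names the remote address (assumed wording). */
  public static String timeoutMessage(String address, long timeoutMs, String timeoutProp) {
    return "Futures timed out after [" + timeoutMs + " milliseconds] waiting on RPC address "
        + address + ". This timeout is controlled by " + timeoutProp;
  }

  /** Waits for the future and rethrows a timeout with the address included in the message. */
  public static <T> T awaitResult(
      CompletableFuture<T> future, String address, long timeoutMs, String timeoutProp) {
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Keep the original exception as the cause so the underlying trace is not lost.
      throw new SketchRpcTimeoutException(timeoutMessage(address, timeoutMs, timeoutProp), e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    } catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    }
  }

  public static void main(String[] args) {
    // A future that never completes simulates an unresponsive endpoint.
    CompletableFuture<String> neverCompletes = new CompletableFuture<>();
    try {
      awaitResult(neverCompletes, "example-host:12345", 50, "celeborn.rpc.lookupTimeout");
    } catch (SketchRpcTimeoutException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

With a message of this shape, a log line like the one in the trace above would identify which endpoint failed to respond, not only which timeout setting applied.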