From e2eeafd4bfdbf892aea20237a9638ff3c9e16f58 Mon Sep 17 00:00:00 2001
From: "zky.zhoukeyong"
Date: Fri, 23 Jun 2023 21:06:43 +0800
Subject: [PATCH] [CELEBORN-709] Increase default fetch timeout

### What changes were proposed in this pull request?
30s for fetch timeout is too short and easy to exceed. This PR increases the default value to 600s.

### Why are the changes needed?
When I was testing 3T TPCDS with three workers, I encountered fetch timeout:
```
23/06/21 16:46:41,771 INFO [fetch-server-11-7] FetchHandler: Sending chunk 28856864163, 1, 0, 2147483647
...
23/06/21 16:47:16,870 INFO [fetch-server-11-7] FetchHandler: Sent chunk 28856864163, 1, 0, 2147483647
```
And I remember from some users' monitoring, the max fetch time can reach several minutes on heavy load without error.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1618 from waitinfuture/709.

Authored-by: zky.zhoukeyong
Signed-off-by: zky.zhoukeyong
---
 .../main/scala/org/apache/celeborn/common/CelebornConf.scala | 2 +-
 docs/configuration/client.md                                 | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala b/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
index b72523bd1..92a201018 100644
--- a/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
+++ b/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
@@ -2785,7 +2785,7 @@ object CelebornConf extends Logging {
       .version("0.3.0")
       .doc("Timeout for a task to open stream and fetch chunk.")
       .timeConf(TimeUnit.MILLISECONDS)
-      .createWithDefaultString("30s")
+      .createWithDefaultString("600s")
 
   val CLIENT_FETCH_MAX_REQS_IN_FLIGHT: ConfigEntry[Int] =
     buildConf("celeborn.client.fetch.maxReqsInFlight")
diff --git a/docs/configuration/client.md b/docs/configuration/client.md
index 165430db7..25112c87e 100644
--- a/docs/configuration/client.md
+++ b/docs/configuration/client.md
@@ -28,7 +28,7 @@ license: |
 | celeborn.client.fetch.excludedWorker.expireTimeout | <value of celeborn.client.excludedWorker.expireTimeout> | ShuffleClient is a static object, it will be used in the whole lifecycle of Executor,We give a expire time for blacklisted worker to avoid a transient worker issues. | 0.3.0 |
 | celeborn.client.fetch.maxReqsInFlight | 3 | Amount of in-flight chunk fetch request. | 0.3.0 |
 | celeborn.client.fetch.maxRetriesForEachReplica | 3 | Max retry times of fetch chunk on each replica | 0.3.0 |
-| celeborn.client.fetch.timeout | 30s | Timeout for a task to open stream and fetch chunk. | 0.3.0 |
+| celeborn.client.fetch.timeout | 600s | Timeout for a task to open stream and fetch chunk. | 0.3.0 |
 | celeborn.client.flink.compression.enabled | true | Whether to compress data in Flink plugin. | 0.3.0 |
 | celeborn.client.flink.inputGate.concurrentReadings | 2147483647 | Max concurrent reading channels for a input gate. | 0.3.0 |
 | celeborn.client.flink.inputGate.memory | 32m | Memory reserved for a input gate. | 0.3.0 |
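
As a rough check of the numbers in the description, the sketch below computes the time between the two `FetchHandler` log lines and compares it with the old and new defaults. It assumes the two lines bracket a single chunk transfer on the same day; `FetchTimeoutCheck` and `elapsedMs` are hypothetical names used only for illustration and are not part of Celeborn.

```scala
import java.time.{Duration, LocalTime}
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of Celeborn: measures the gap between
// two log timestamps taken from the commit description.
object FetchTimeoutCheck {
  // Time-of-day format used in the log excerpt, e.g. "16:46:41,771".
  private val logTime = DateTimeFormatter.ofPattern("HH:mm:ss,SSS")

  // Elapsed milliseconds between two same-day log timestamps.
  def elapsedMs(start: String, end: String): Long =
    Duration
      .between(LocalTime.parse(start, logTime), LocalTime.parse(end, logTime))
      .toMillis

  def main(args: Array[String]): Unit = {
    val ms = elapsedMs("16:46:41,771", "16:47:16,870")
    val oldDefaultMs = 30L * 1000  // previous default, "30s"
    val newDefaultMs = 600L * 1000 // default after this patch, "600s"
    println(s"elapsed=${ms}ms overOld=${ms > oldDefaultMs} overNew=${ms > newDefaultMs}")
  }
}
```

Under that assumption the transfer took about 35 seconds, which is over the old 30s default and well within the new 600s one.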