[CELEBORN-709] Increase default fetch timeout

### What changes were proposed in this pull request?
The default fetch timeout of 30s is too short and easily exceeded. This PR increases the default value to 600s.

### Why are the changes needed?
While testing 3T TPCDS with three workers, I encountered a fetch timeout:
```
23/06/21 16:46:41,771 INFO [fetch-server-11-7] FetchHandler: Sending chunk 28856864163, 1, 0, 2147483647
...
23/06/21 16:47:16,870 INFO [fetch-server-11-7] FetchHandler: Sent chunk 28856864163, 1, 0, 2147483647
```
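Sending this single chunk took about 35 seconds, which already exceeds the 30s default. I also recall from some users' monitoring that the maximum fetch time can reach several minutes under heavy load without any error.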
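
As a rough check of the gap between the two log timestamps above, here is a minimal sketch. The object and method names are hypothetical and not part of Celeborn, and the timestamp layout is assumed to be `yy/MM/dd HH:mm:ss,SSS`.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

// Hypothetical helper for illustration only; not Celeborn code.
object FetchElapsed {
  // Assumed layout of the log timestamps, e.g. "23/06/21 16:46:41,771".
  private val fmt = DateTimeFormatter.ofPattern("yy/MM/dd HH:mm:ss,SSS")

  /** Milliseconds elapsed between two log timestamps. */
  def elapsedMillis(start: String, end: String): Long =
    Duration.between(LocalDateTime.parse(start, fmt), LocalDateTime.parse(end, fmt)).toMillis

  def main(args: Array[String]): Unit = {
    val elapsed = elapsedMillis("23/06/21 16:46:41,771", "23/06/21 16:47:16,870")
    val oldTimeoutMs = 30L * 1000 // previous default of 30s
    println(s"elapsed=${elapsed}ms, exceedsOldDefault=${elapsed > oldTimeoutMs}")
  }
}
```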

### Does this PR introduce _any_ user-facing change?
No.
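
The key itself is unchanged, so a deployment that prefers the previous value can still set it explicitly. The fragment below is a sketch that assumes a Spark client, where Celeborn client options are usually passed with a `spark.` prefix (for example in `spark-defaults.conf`); the prefix and file location are assumptions, not something this PR defines.

```
# Assumption: Spark client, Celeborn option passed with the spark. prefix.
# Pins the fetch timeout to the previous default instead of the new 600s.
spark.celeborn.client.fetch.timeout 30s
```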

### How was this patch tested?
Manual test.

Closes #1618 from waitinfuture/709.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
zky.zhoukeyong 2023-06-23 21:06:43 +08:00
parent 679f9cbf58
commit e2eeafd4bf
2 changed files with 2 additions and 2 deletions

@@ -2785,7 +2785,7 @@ object CelebornConf extends Logging {
.version("0.3.0")
.doc("Timeout for a task to open stream and fetch chunk.")
.timeConf(TimeUnit.MILLISECONDS)
-.createWithDefaultString("30s")
+.createWithDefaultString("600s")
val CLIENT_FETCH_MAX_REQS_IN_FLIGHT: ConfigEntry[Int] =
buildConf("celeborn.client.fetch.maxReqsInFlight")

@@ -28,7 +28,7 @@ license: |
| celeborn.client.fetch.excludedWorker.expireTimeout | &lt;value of celeborn.client.excludedWorker.expireTimeout&gt; | ShuffleClient is a static object, it will be used in the whole lifecycle of Executor,We give a expire time for blacklisted worker to avoid a transient worker issues. | 0.3.0 |
| celeborn.client.fetch.maxReqsInFlight | 3 | Amount of in-flight chunk fetch request. | 0.3.0 |
| celeborn.client.fetch.maxRetriesForEachReplica | 3 | Max retry times of fetch chunk on each replica | 0.3.0 |
-| celeborn.client.fetch.timeout | 30s | Timeout for a task to open stream and fetch chunk. | 0.3.0 |
+| celeborn.client.fetch.timeout | 600s | Timeout for a task to open stream and fetch chunk. | 0.3.0 |
| celeborn.client.flink.compression.enabled | true | Whether to compress data in Flink plugin. | 0.3.0 |
| celeborn.client.flink.inputGate.concurrentReadings | 2147483647 | Max concurrent reading channels for a input gate. | 0.3.0 |
| celeborn.client.flink.inputGate.memory | 32m | Memory reserved for a input gate. | 0.3.0 |