[CELEBORN-709] Increase default fetch timeout
### What changes were proposed in this pull request?

30s for fetch timeout is too short and easy to exceed. This PR increases the default value to 600s.

### Why are the changes needed?

When I was testing 3T TPCDS with three workers, I encountered fetch timeout:

```
23/06/21 16:46:41,771 INFO [fetch-server-11-7] FetchHandler: Sending chunk 28856864163, 1, 0, 2147483647
...
23/06/21 16:47:16,870 INFO [fetch-server-11-7] FetchHandler: Sent chunk 28856864163, 1, 0, 2147483647
```

And I remember from some users' monitoring, the max fetch time can reach several minutes on heavy load without error.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #1618 from waitinfuture/709.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
parent 679f9cbf58
commit e2eeafd4bf
@@ -2785,7 +2785,7 @@ object CelebornConf extends Logging {
       .version("0.3.0")
       .doc("Timeout for a task to open stream and fetch chunk.")
       .timeConf(TimeUnit.MILLISECONDS)
-      .createWithDefaultString("30s")
+      .createWithDefaultString("600s")

   val CLIENT_FETCH_MAX_REQS_IN_FLIGHT: ConfigEntry[Int] =
     buildConf("celeborn.client.fetch.maxReqsInFlight")
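The motivation for the new default can be checked against the two timestamps quoted in the commit message. The following is a minimal, self-contained sketch and not part of Celeborn: the object `FetchTimeoutCheck` and its method `elapsedMillis` are hypothetical names introduced for illustration. It parses the two log timestamps and compares the elapsed time with the old (30s) and new (600s) defaults.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

// Hypothetical helper for illustration only; not part of CelebornConf.
object FetchTimeoutCheck {
  // Timestamp layout seen in the log excerpt: yy/MM/dd HH:mm:ss,SSS
  private val fmt = DateTimeFormatter.ofPattern("yy/MM/dd HH:mm:ss,SSS")

  // Milliseconds elapsed between two log timestamps.
  def elapsedMillis(start: String, end: String): Long =
    Duration.between(LocalDateTime.parse(start, fmt), LocalDateTime.parse(end, fmt)).toMillis

  def main(args: Array[String]): Unit = {
    val elapsed = elapsedMillis("23/06/21 16:46:41,771", "23/06/21 16:47:16,870")
    val oldTimeoutMs = 30 * 1000L  // previous default, "30s"
    val newTimeoutMs = 600 * 1000L // new default, "600s"
    println(s"elapsed=${elapsed}ms exceedsOld=${elapsed > oldTimeoutMs} exceedsNew=${elapsed > newTimeoutMs}")
  }
}
```

For the quoted log lines, sending the chunk took about 35 seconds, which is over the old 30s limit and well under the new 600s limit.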
@@ -28,7 +28,7 @@ license: |
 | celeborn.client.fetch.excludedWorker.expireTimeout | <value of celeborn.client.excludedWorker.expireTimeout> | ShuffleClient is a static object, it will be used in the whole lifecycle of Executor,We give a expire time for blacklisted worker to avoid a transient worker issues. | 0.3.0 |
 | celeborn.client.fetch.maxReqsInFlight | 3 | Amount of in-flight chunk fetch request. | 0.3.0 |
 | celeborn.client.fetch.maxRetriesForEachReplica | 3 | Max retry times of fetch chunk on each replica | 0.3.0 |
-| celeborn.client.fetch.timeout | 30s | Timeout for a task to open stream and fetch chunk. | 0.3.0 |
+| celeborn.client.fetch.timeout | 600s | Timeout for a task to open stream and fetch chunk. | 0.3.0 |
 | celeborn.client.flink.compression.enabled | true | Whether to compress data in Flink plugin. | 0.3.0 |
 | celeborn.client.flink.inputGate.concurrentReadings | 2147483647 | Max concurrent reading channels for a input gate. | 0.3.0 |
 | celeborn.client.flink.inputGate.memory | 32m | Memory reserved for a input gate. | 0.3.0 |
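Only the default changes here, so a deployment that prefers a different limit can still set `celeborn.client.fetch.timeout` explicitly. The fragment below is a sketch that assumes the Spark client and a `spark-defaults.conf` file, where Celeborn settings are conventionally passed with a `spark.` prefix. The value `120s` is an arbitrary illustration, not a recommendation from this commit.

```properties
# Assumed Spark-side override; the value is chosen for illustration only.
spark.celeborn.client.fetch.timeout 120s
```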