### What changes were proposed in this pull request?

Celeborn chunk fetch requests should also support timeout checking.

#### Test case

```
executor instance 20

SQL:
SELECT count(1) from (select /*+ REPARTITION(100) */ * from spark_auxiliary.t50g) tmp;

--conf spark.celeborn.client.spark.shuffle.writer=sort \
--conf spark.celeborn.client.fetch.excludeWorkerOnFailure.enabled=true \
--conf spark.celeborn.client.push.timeout=10s \
--conf spark.celeborn.client.push.replicate.enabled=true \
--conf spark.celeborn.client.push.revive.maxRetries=10 \
--conf spark.celeborn.client.reserveSlots.maxRetries=10 \
--conf spark.celeborn.client.registerShuffle.maxRetries=3 \
--conf spark.celeborn.client.push.blacklist.enabled=true \
--conf spark.celeborn.client.blacklistSlave.enabled=true \
--conf spark.celeborn.client.fetch.timeout=30s \
--conf spark.celeborn.client.push.data.timeout=30s \
--conf spark.celeborn.client.push.limit.inFlight.timeout=600s \
--conf spark.celeborn.client.push.maxReqsInFlight=32 \
--conf spark.celeborn.client.shuffle.compression.codec=ZSTD \
--conf spark.celeborn.rpc.askTimeout=30s \
--conf spark.celeborn.client.rpc.reserveSlots.askTimeout=30s \
--conf spark.celeborn.client.shuffle.batchHandleChangePartition.enabled=true \
--conf spark.celeborn.client.shuffle.batchHandleCommitPartition.enabled=true \
--conf spark.celeborn.client.shuffle.batchHandleReleasePartition.enabled=true
```

Tested with 3 workers, with a 100-second `Thread.sleep` added before the worker handles `ChunkFetchRequest`.

Before patch

<img width="1783" alt="Screenshot 2023-06-14 11:20:55 AM" src="https://github.com/apache/incubator-celeborn/assets/46485123/182dff7d-a057-4077-8368-d1552104d206">

After patch

<img width="1792" alt="image" src="https://github.com/apache/incubator-celeborn/assets/46485123/3c8b7933-8ace-426d-8e9f-04e0aabfac8e">

The log shows that the fetch timeout checker works:

```
23/06/14 11:14:54 ERROR WorkerPartitionReader: Fetch chunk 0 failed.
org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT
	at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147)
	at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
23/06/14 11:14:54 WARN RssInputStream: Fetch chunk failed 1/6 times for location PartitionLocation[
  id-epoch:35-0
  host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.203-9092-9094-9093-9095
  mode:MASTER
  peer:(host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.202-9092-9094-9093-9095)
  storage hint:StorageInfo{type=HDD, mountPoint='/mnt/ssd/0', finalResult=true, filePath=}
  mapIdBitMap:null], change to peer
org.apache.celeborn.common.exception.CelebornIOException: Fetch chunk 0 failed.
	at org.apache.celeborn.client.read.WorkerPartitionReader$1.onFailure(WorkerPartitionReader.java:98)
	at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:146)
	at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT
	at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147)
	... 8 more
23/06/14 11:14:54 INFO SortBasedShuffleWriter: Memory used 72.0 MB
```

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1587 from AngersZhuuuu/CELEBORN-676.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
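
For readers unfamiliar with the pattern, the stack trace in the description suggests a scheduled task that periodically fails fetch requests that have outlived their timeout. The following is a minimal sketch of that idea, not Celeborn's actual `TransportResponseHandler`: the class `FetchTimeoutSketch`, its methods, and the `PendingFetch` type are hypothetical names introduced for illustration, and a plain `IOException` stands in for `CelebornIOException`.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical illustration only; this is not Celeborn code.
final class FetchTimeoutSketch {

  // One in-flight chunk fetch: its deadline and the callback to fail it with.
  static final class PendingFetch {
    final long deadlineMs;
    final Consumer<Throwable> onFailure;

    PendingFetch(long deadlineMs, Consumer<Throwable> onFailure) {
      this.deadlineMs = deadlineMs;
      this.onFailure = onFailure;
    }
  }

  private final Map<Integer, PendingFetch> outstanding = new ConcurrentHashMap<>();
  private final long timeoutMs;

  FetchTimeoutSketch(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  // Record a fetch issued at nowMs; it expires at nowMs + timeoutMs.
  void addFetchRequest(int chunkIndex, long nowMs, Consumer<Throwable> onFailure) {
    outstanding.put(chunkIndex, new PendingFetch(nowMs + timeoutMs, onFailure));
  }

  // Called when a response arrives in time, so the checker ignores the request.
  void removeFetchRequest(int chunkIndex) {
    outstanding.remove(chunkIndex);
  }

  // Fail every request whose deadline has passed; returns how many were failed.
  int failExpiredFetchRequests(long nowMs) {
    int failed = 0;
    for (Map.Entry<Integer, PendingFetch> entry : outstanding.entrySet()) {
      PendingFetch pending = entry.getValue();
      // remove(key, value) ensures a request is failed at most once.
      if (pending.deadlineMs <= nowMs && outstanding.remove(entry.getKey(), pending)) {
        pending.onFailure.accept(new IOException("FETCH_DATA_TIMEOUT"));
        failed++;
      }
    }
    return failed;
  }

  // Run the check periodically on a single background thread.
  ScheduledExecutorService startChecker(long intervalMs) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(
        () -> failExpiredFetchRequests(System.currentTimeMillis()),
        intervalMs, intervalMs, TimeUnit.MILLISECONDS);
    return scheduler;
  }

  public static void main(String[] args) {
    // 30 s mirrors spark.celeborn.client.fetch.timeout=30s from the test case.
    FetchTimeoutSketch sketch = new FetchTimeoutSketch(30_000L);
    sketch.addFetchRequest(0, 0L, error -> System.out.println(error.getMessage()));
    sketch.failExpiredFetchRequests(31_000L);
  }
}
```

Passing the current time into `addFetchRequest` and `failExpiredFetchRequests` is a design choice of this sketch that keeps the expiry logic deterministic and easy to test; only `startChecker` reads the wall clock.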
| | | |
|---|---|---|
| client.md | ||
| columnar-shuffle.md | ||
| ha.md | ||
| index.md | ||
| master.md | ||
| metrics.md | ||
| network.md | ||
| quota.md | ||
| worker.md | ||