celeborn

History

Angerszhuuuu 0aa13832b5 [CELEBORN-676] Celeborn fetch chunk also should support check timeout ### What changes were proposed in this pull request? Celeborn fetch chunk also should support check timeout #### Test case ``` executor instance 20 SQL: SELECT count(1) from (select /+ REPARTITION(100) / * from spark_auxiliary.t50g) tmp; --conf spark.celeborn.client.spark.shuffle.writer=sort \ --conf spark.celeborn.client.fetch.excludeWorkerOnFailure.enabled=true \ --conf spark.celeborn.client.push.timeout=10s \ --conf spark.celeborn.client.push.replicate.enabled=true \ --conf spark.celeborn.client.push.revive.maxRetries=10 \ --conf spark.celeborn.client.reserveSlots.maxRetries=10 \ --conf spark.celeborn.client.registerShuffle.maxRetries=3 \ --conf spark.celeborn.client.push.blacklist.enabled=true \ --conf spark.celeborn.client.blacklistSlave.enabled=true \ --conf spark.celeborn.client.fetch.timeout=30s \ --conf spark.celeborn.client.push.data.timeout=30s \ --conf spark.celeborn.client.push.limit.inFlight.timeout=600s \ --conf spark.celeborn.client.push.maxReqsInFlight=32 \ --conf spark.celeborn.client.shuffle.compression.codec=ZSTD \ --conf spark.celeborn.rpc.askTimeout=30s \ --conf spark.celeborn.client.rpc.reserveSlots.askTimeout=30s \ --conf spark.celeborn.client.shuffle.batchHandleChangePartition.enabled=true \ --conf spark.celeborn.client.shuffle.batchHandleCommitPartition.enabled=true \ --conf spark.celeborn.client.shuffle.batchHandleReleasePartition.enabled=true ``` Test with 3 worker and add a `Thread.sleep(100s)` before worker handle `ChunkFetchRequest` Before patch <img width="1783" alt="截屏2023-06-14 上午11 20 55" src="https://github.com/apache/incubator-celeborn/assets/46485123/182dff7d-a057-4077-8368-d1552104d206"> After patch <img width="1792" alt="image" src="https://github.com/apache/incubator-celeborn/assets/46485123/3c8b7933-8ace-426d-8e9f-04e0aabfac8e"> The log shows the fetch timeout checker workers ``` 23/06/14 11:14:54 ERROR WorkerPartitionReader: Fetch chunk 0 failed. org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147) at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 23/06/14 11:14:54 WARN RssInputStream: Fetch chunk failed 1/6 times for location PartitionLocation[ id-epoch:35-0 host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.203-9092-9094-9093-9095 mode:MASTER peer:(host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.202-9092-9094-9093-9095) storage hint:StorageInfo{type=HDD, mountPoint='/mnt/ssd/0', finalResult=true, filePath=} mapIdBitMap:null], change to peer org.apache.celeborn.common.exception.CelebornIOException: Fetch chunk 0 failed. at org.apache.celeborn.client.read.WorkerPartitionReader$1.onFailure(WorkerPartitionReader.java:98) at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:146) at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147) ... 8 more 23/06/14 11:14:54 INFO SortBasedShuffleWriter: Memory used 72.0 MB ``` ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1587 from AngersZhuuuu/CELEBORN-676. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>		2023-06-15 13:54:09 +08:00
..
client.md	[CELEBORN-676] Celeborn fetch chunk also should support check timeout	2023-06-15 13:54:09 +08:00
columnar-shuffle.md	[CELEBORN-679] Optimize `Utils#bytesToString`	2023-06-14 17:42:16 +08:00
ha.md	[CELEBORN-595] Rename and refactor the configuration doc. (#1501 )	2023-05-30 15:14:12 +08:00
index.md	[CELEBORN-629][DOC] Add doc about enable rac-awareness	2023-06-05 10:28:26 +08:00
master.md	[CELEBORN-595][FOLLOWUP] Fix change version to 0.3.0. (#1522 )	2023-05-30 20:12:56 +08:00
metrics.md	[CELEBORN-681][DOC] Add celeborn.metrics.conf to conf entity	2023-06-14 18:06:03 +08:00
network.md	[CELEBORN-676] Celeborn fetch chunk also should support check timeout	2023-06-15 13:54:09 +08:00
quota.md	[CELEBORN-576] Add static identity provider and manually settable identity provider for non-hadoop environment. (#1480 )	2023-05-08 17:29:01 +08:00
worker.md	[CELEBORN-675] Fix decode heartbeat message	2023-06-14 14:37:13 +08:00