celeborn

Author	SHA1	Message	Date
zky.zhoukeyong	1109e2c8f4	[CELEBORN-803][FOLLOWUP] Make ```rpcAskTimeout``` default to 60s ### What changes were proposed in this pull request? As title. ### Why are the changes needed? Timeout of ```RpcEndpointRef.ask``` is controlled by ```celeborn.rpc.askTimeout```, so we also need to increase ```celeborn.rpc.askTimeout``` to extend the timeout of commit files. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1725 from waitinfuture/803-fu. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 23:53:52 +08:00
zky.zhoukeyong	a7bbbd05c4	[CELEBORN-797] Decrease writeTime metric sampling frequency to improve perf ### What changes were proposed in this pull request? 1. Decrease writeTime metric sampling frequency to improve perf 2. Set default value of ```celeborn.<module>.push.timeoutCheck.threads``` and ```celeborn.<module>.fetch.timeoutCheck.threads``` to 4 ### Why are the changes needed? Following are test cases case 1: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 15000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 1.1T data case 2: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 30000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 2.2T data Following are e2e time of shuffle write stage \|\|Sort pusher before\|Sort pusher after\|Hash pusher before\|Hash pusher after\| \|----\|----\|----\|----\|-----\| \|case1\|4.4min\|4.1min\|4.4min\|3.9min\| \|case2\|9.1min\|8.4min\|9.7min\|8.5min\| ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1718 from waitinfuture/797. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-14 20:51:50 +08:00
Cheng Pan	4f8e72f217	[CELEBORN-774] Pullout celeborn.rpc.dispatcher.threads to CelebornConf ### What changes were proposed in this pull request? Pullout hardcoded `celeborn.rpc.dispatcher.numThreads` to `CelebornConf` and rename it to `celeborn.rpc.dispatcher.threads` to align with existing configuration style ### Why are the changes needed? Pullout inline configuration to `CelebornConf`, and expose it in configuration docs ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #1684 from pan3793/CELEBORN-774. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-07-06 16:23:32 +08:00
zky.zhoukeyong	4300835363	[CELEBORN-768] Change default config values for batch rpcs and netty memory allocator ### What changes were proposed in this pull request? Changes the following configs' default values \| config \| previous value \| current value \| \| ------------- \| ------------- \| ------------- \| \| celeborn.network.memory.allocator.share \| false \| true \| \| celeborn.client.shuffle.batchHandleChangePartition.enabled \| false \| true \| \| celeborn.client.shuffle.batchHandleCommitPartition.enabled \| false \| true \| ### Why are the changes needed? In my test, when graceful shutdown is enabled but ```celeborn.client.shuffle.batchHandleChangePartition.enabled``` and ```celeborn.client.shuffle.batchHandleCommitPartition.enabled``` are disabled, the worker takes much longer to stop than with the two configs enabled. In another test where the worker is quite small (2 cores, 4 G) and replication is on, if the shared allocator is disabled, netty's onTrim fails to release memory, which in turn causes push data timeouts. ### Does this PR introduce _any_ user-facing change? No, these configs were introduced in 0.3.0. ### How was this patch tested? Passes GA. Closes #1682 from waitinfuture/768. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-05 18:16:41 +08:00
mingji	40760ede3a	[CELEBORN-568] Support storage type selection ### What changes were proposed in this pull request? 1. Celeborn supports storage type selection. HDD, SSD, and HDFS are available for now. 2. Add a new buffer size for HDFS file writers. 3. Workers support empty working dirs. ### Why are the changes needed? Support HDFS-only scenarios. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? UT and cluster. Closes #1619 from FMX/CELEBORN-568. Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-27 18:07:08 +08:00
Cheng Pan	1753556565	[CELEBORN-713] Local network binding support IP or FQDN ### What changes were proposed in this pull request? This PR aims to make network local address binding support both the IP and FQDN strategies. Additionally, it refactors `ShuffleClientImpl#genAddressPair` from `${hostAndPort}-${hostAndPort}` to `Pair<String, String>`; the old format works properly with IPs but may not with FQDNs, because an FQDN may contain `-` ### Why are the changes needed? Currently, when the bind hostname is not set explicitly, Celeborn finds the first non-loopback address and always uses the IP to bind. This is not suitable for K8s, as the STS has a stable FQDN but the Pod IP changes whenever the Pod restarts. `ShuffleClientImpl#genAddressPair` must be changed as well, otherwise it may cause ``` java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11657 in stage 0.0 failed 4 times, most recent failure: Lost task 11657.3 in stage 0.0 (TID 12747) (10.153.253.198 executor 157): java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.celeborn.client.ShuffleClientImpl.doPushMergedData(ShuffleClientImpl.java:874) at org.apache.celeborn.client.ShuffleClientImpl.pushOrMergeData(ShuffleClientImpl.java:735) at org.apache.celeborn.client.ShuffleClientImpl.mergeData(ShuffleClientImpl.java:827) at org.apache.spark.shuffle.celeborn.SortBasedPusher.pushData(SortBasedPusher.java:140) at org.apache.spark.shuffle.celeborn.SortBasedPusher.insertRecord(SortBasedPusher.java:192) at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.fastWrite0(SortBasedShuffleWriter.java:192) at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.write(SortBasedShuffleWriter.java:145) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at 
org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1508) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) ``` ### Does this PR introduce _any_ user-facing change? Yes, a new configuration `celeborn.network.bind.preferIpAddress` is introduced, and the default value is `true` to preserve the existing behavior. ### How was this patch tested? Manually testing with `celeborn.network.bind.preferIpAddress=false` ``` Server: 10.178.96.64 Address: 10.178.96.64#53 Name: celeborn-master-0.celeborn-master-svc.spark.svc.cluster.local Address: 10.153.143.252 Server: 10.178.96.64 Address: 10.178.96.64#53 Name: celeborn-master-1.celeborn-master-svc.spark.svc.cluster.local Address: 10.153.173.94 Server: 10.178.96.64 Address: 10.178.96.64#53 Name: celeborn-master-2.celeborn-master-svc.spark.svc.cluster.local Address: 10.153.149.42 starting org.apache.celeborn.service.deploy.worker.Worker, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.worker.Worker-1-celeborn-worker-4.out 2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.Dispatcher#51 - Dispatcher numThreads: 4 2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.network.client.TransportClientFactory#91 - mode NIO threads 64 2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.NettyRpcEnvFactory#51 - Starting RPC Server [WorkerSys] on celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 with advisor endpoint celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.util.Utils#51 - Successfully started service 
'WorkerSys' on port 38303. ``` Closes #1622 from pan3793/CELEBORN-713. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-27 09:42:11 +08:00
Angerszhuuuu	0aa13832b5	[CELEBORN-676] Celeborn fetch chunk also should support check timeout ### What changes were proposed in this pull request? Celeborn fetch chunk should also support timeout checking #### Test case ``` executor instance 20 SQL: SELECT count(1) from (select /*+ REPARTITION(100) */ * from spark_auxiliary.t50g) tmp; --conf spark.celeborn.client.spark.shuffle.writer=sort \ --conf spark.celeborn.client.fetch.excludeWorkerOnFailure.enabled=true \ --conf spark.celeborn.client.push.timeout=10s \ --conf spark.celeborn.client.push.replicate.enabled=true \ --conf spark.celeborn.client.push.revive.maxRetries=10 \ --conf spark.celeborn.client.reserveSlots.maxRetries=10 \ --conf spark.celeborn.client.registerShuffle.maxRetries=3 \ --conf spark.celeborn.client.push.blacklist.enabled=true \ --conf spark.celeborn.client.blacklistSlave.enabled=true \ --conf spark.celeborn.client.fetch.timeout=30s \ --conf spark.celeborn.client.push.data.timeout=30s \ --conf spark.celeborn.client.push.limit.inFlight.timeout=600s \ --conf spark.celeborn.client.push.maxReqsInFlight=32 \ --conf spark.celeborn.client.shuffle.compression.codec=ZSTD \ --conf spark.celeborn.rpc.askTimeout=30s \ --conf spark.celeborn.client.rpc.reserveSlots.askTimeout=30s \ --conf spark.celeborn.client.shuffle.batchHandleChangePartition.enabled=true \ --conf spark.celeborn.client.shuffle.batchHandleCommitPartition.enabled=true \ --conf spark.celeborn.client.shuffle.batchHandleReleasePartition.enabled=true ``` Tested with 3 workers, adding a `Thread.sleep(100s)` before the worker handles `ChunkFetchRequest` Before patch <img width="1783" alt="Screenshot 2023-06-14 11.20.55 AM" src="https://github.com/apache/incubator-celeborn/assets/46485123/182dff7d-a057-4077-8368-d1552104d206"> After patch <img width="1792" alt="image" src="https://github.com/apache/incubator-celeborn/assets/46485123/3c8b7933-8ace-426d-8e9f-04e0aabfac8e"> The log shows that the fetch timeout checker works ``` 23/06/14 11:14:54 ERROR WorkerPartitionReader: Fetch 
chunk 0 failed. org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147) at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 23/06/14 11:14:54 WARN RssInputStream: Fetch chunk failed 1/6 times for location PartitionLocation[ id-epoch:35-0 host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.203-9092-9094-9093-9095 mode:MASTER peer:(host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.202-9092-9094-9093-9095) storage hint:StorageInfo{type=HDD, mountPoint='/mnt/ssd/0', finalResult=true, filePath=} mapIdBitMap:null], change to peer org.apache.celeborn.common.exception.CelebornIOException: Fetch chunk 0 failed. 
at org.apache.celeborn.client.read.WorkerPartitionReader$1.onFailure(WorkerPartitionReader.java:98) at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:146) at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147) ... 8 more 23/06/14 11:14:54 INFO SortBasedShuffleWriter: Memory used 72.0 MB ``` ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1587 from AngersZhuuuu/CELEBORN-676. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-15 13:54:09 +08:00
Angerszhuuuu	cf308aa057	[CELEBORN-595] Refine code frame of CelebornConf (#1525)	2023-06-01 10:37:58 +08:00
Angerszhuuuu	62681ba85d	[CELEBORN-595] Rename and refactor the configuration doc. (#1501)	2023-05-30 15:14:12 +08:00
Angerszhuuuu	d244f44518	[CELEBORN-593] Refine some RPC related default configurations (#1498)	2023-05-19 18:23:12 +08:00
Angerszhuuuu	13ce04f8a1	[CELEBORN-557] HA_CLIENT_RPC_ASK_TIMEOUT should fallback to RPC_ASK_TIMEOUT (#1462) * [CELEBORN-557] HA_CLIENT_RPC_ASK_TIMEOUT should fallback to RPC_ASK_TIMEOUT	2023-04-26 15:19:34 +08:00
Angerszhuuuu	6830cb61ef	[CELEBORN-540][Refactor] Add config entity of celeborn.rpc.io.threads (#1443) * [CELEBORN-540][CONF] Add config entity of celeborn.rpc.io.threads	2023-04-21 11:21:41 +08:00
Ethan Feng	3aacede5f8	[CELEBORN-283] Derive network layer for flink plugin. (#1222)	2023-02-17 14:12:54 +08:00
Angerszhuuuu	de3ef0d694	[CELEBORN-102][REFACTOR] TIMEOUT default value should be changed with network timeout (#1047) * [CELEBORN-102][REFACTOR] TIMEOUT default value should be changed with network timeout	2022-12-06 14:41:23 +08:00
nafiy	8e384cda5a	[CELEBORN-88][REFACTOR] Revive/PartitionSplit should set separated timeout configuration (#1046)	2022-12-05 10:36:43 +08:00
nafiy	44d45c2a27	[CELEBORN-90][REFACTOR] GetReducerFileGroup should support separated timeout configuration (#1045)	2022-12-02 22:53:51 +08:00
nafiy	13e1e24035	[CELEBORN-86][REFACTOR] Register shuffle should have separated timeout configuration (#1031) * [CELEBORN-86][REFACTOR] Register shuffle should have separated timeout configuration	2022-12-01 18:39:56 +08:00
nafiy	d584211a75	[CELEBORN-95][REFACTOR]Rename CLIENT_RPC_ASK_TIMEOUT to HA_CLIENT_RPC_ASK_TIMEOUT (#1037)	2022-12-01 11:57:02 +08:00
Gabriel	5ecb09d62a	[ISSUE-911] Decrease numConnectionsPerPeer to achieve better performance (#983)	2022-11-20 11:46:17 +08:00
Cheng Pan	d7be6006e7	Migrate network related conf to structured conf system (#875) * Migrate network related conf to structured conf system * migrate * fix * fix * worker * fix * nit * review * nit	2022-10-28 10:45:52 +08:00

20 Commits
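
Note on CELEBORN-713: the commit message says the `${hostAndPort}-${hostAndPort}` key works with IPs but may break with FQDNs because an FQDN can contain `-`. The sketch below illustrates that general problem only. It is not the code from `ShuffleClientImpl`; the class name, method names, and host names are hypothetical, and `Map.Entry` stands in for whatever `Pair` type the project uses.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical illustration; not taken from the Celeborn code base.
class AddressPairSketch {
    // Old-style key (assumed shape): two "host:port" strings joined by '-'.
    static String joinedKey(String a, String b) {
        return a + "-" + b;
    }

    // Naive recovery of the two addresses by splitting on '-'.
    static String[] splitJoinedKey(String key) {
        return key.split("-");
    }

    // Pair-style key: each address is stored whole, so no parsing is needed.
    static Map.Entry<String, String> pairKey(String a, String b) {
        return new SimpleImmutableEntry<>(a, b);
    }

    public static void main(String[] args) {
        // With IPs there is no '-' inside an address, so the split yields exactly 2 parts.
        String[] ipParts = splitJoinedKey(joinedKey("10.0.0.1:9092", "10.0.0.2:9092"));
        System.out.println("IP parts: " + ipParts.length);

        // With FQDNs that contain '-', the split yields more than 2 fragments,
        // and the first fragment ("celeborn") no longer contains a port.
        String a = "celeborn-worker-0.celeborn-worker-svc:9092";
        String b = "celeborn-worker-1.celeborn-worker-svc:9092";
        String[] fqdnParts = splitJoinedKey(joinedKey(a, b));
        System.out.println("FQDN parts: " + fqdnParts.length);
        System.out.println("first fragment has port: " + fqdnParts[0].contains(":"));

        // The pair keeps both addresses intact.
        Map.Entry<String, String> pair = pairKey(a, b);
        System.out.println(pair.getKey() + " / " + pair.getValue());
    }
}
```

Any code that then indexes into the result of splitting a fragment on `:` would fail on a fragment without a port, which is one plausible way to arrive at an `ArrayIndexOutOfBoundsException: 1` like the one quoted in the commit message; the actual failing code path is not shown in this log.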