celeborn

Author	SHA1	Message	Date
Fu Chen	17c1e01874	[CELEBORN-726] Update data replication terminology from `master/slave` to `primary/replica` for configurations and metrics ### What changes were proposed in this pull request? This PR is an integral component of #1639. It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC. ### Why are the changes needed? To distinguish it from the existing master/worker terminology, data replication terminology is refactored to 'primary/replica' for improved clarity and inclusivity in the codebase. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #1650 from cfmcgrady/primary-replica-metrics. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 09:47:02 +08:00
Cheng Pan	c2352a2f9f	[CELEBORN-736][BUILD] Bump commons-lang3 3.12.0 ### What changes were proposed in this pull request? Bump commons-lang3 to latest version ### Why are the changes needed? - https://commons.apache.org/proper/commons-lang/changes-report.html#a3.11 - https://commons.apache.org/proper/commons-lang/changes-report.html#a3.12.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #1648 from pan3793/CELEBORN-736. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 21:15:44 +08:00
Angerszhuuuu	4c4e18b0d6	[CELEBORN-735] Remove unused RPC GetWorkerInfo & GetWorkerInfosResponse ### What changes were proposed in this pull request? Remove unused RPC GetWorkerInfo & GetWorkerInfosResponse ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1647 from AngersZhuuuu/CELEBORN-735. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-28 20:17:56 +08:00
Angerszhuuuu	a672db719a	[CELEBORN-734] Remove unused RPC ReregisterWorkerResonse ### What changes were proposed in this pull request? Remove unused RPC ReregisterWorkerResonse ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1646 from AngersZhuuuu/CELEBORN-734. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-28 19:59:53 +08:00
Angerszhuuuu	590198ecea	[CELEBORN-666][FOLLOWUP] Rename all RPC blacklist fields ### What changes were proposed in this pull request? In this PR, we rename all RPC blacklist fields; this does not cause compatibility issues. The RPCs `GetBlacklist` and `GetBlacklistResponse` are left unchanged: they won't be used in the next release, so these two RPCs can be removed then. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1643 from AngersZhuuuu/CELEBORN-666-RPC. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-28 19:49:44 +08:00
Angerszhuuuu	ad13b04f2e	[CELEBORN-732] Remove unused RPC ThreadDump & ThreadDumpResponse ### What changes were proposed in this pull request? Remove unused RPC ThreadDump & ThreadDumpResponse ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1645 from AngersZhuuuu/CELEBORN-732. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-28 19:43:39 +08:00
Angerszhuuuu	63f22342e9	[CELEBORN-730] Remove unused SlaveLostResponse ### What changes were proposed in this pull request? Remove unused SlaveLostResponse ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1644 from AngersZhuuuu/CELEBORN-730. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-28 19:35:23 +08:00
onebox-li	1b74d85fb1	[CELEBORN-725][MINOR] Refine congestion code ### What changes were proposed in this pull request? Refine the congestion relevant code/log/comments ### Why are the changes needed? ditto ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? manually test Closes #1637 from onebox-li/improve-congestion. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 18:31:40 +08:00
Cheng Pan	3d7c1fa0ae	[CELEBORN-729] Fix typo PbRegisterShuffle#numMappers ### What changes were proposed in this pull request? Fix typo `numMapppers`, should be `numMappers` ### Why are the changes needed? Fix typo ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Protobuf serde depends on message field seq no, not name. Closes #1642 from pan3793/CELEBORN-729. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 18:28:34 +08:00
Cheng Pan	b821349c4a	[CELEBORN-727][TEST] Fix flaky test RssHashCheckDiskSuite ### What changes were proposed in this pull request? Fix the flaky test by enlarging `celeborn.client.shuffle.expired.checkInterval` ### Why are the changes needed? ``` RssHashCheckDiskSuite: - celeborn spark integration test - hash-checkDiskFull * FAILED * 868 was not less than 0 (RssHashCheckDiskSuite.scala:83) ``` https://github.com/apache/incubator-celeborn/actions/runs/5396767745/jobs/9800766633 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA; CI should be observed afterwards. Closes #1640 from pan3793/CELEBORN-727. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-28 17:59:54 +08:00
Demon Liang	a1199a9895	[CELEBORN-728] Celeborn won't clean remnant application directory on HDFS if worker is restarted ### What changes were proposed in this pull request? To clean the remnant application directory after Celeborn Worker is restarted. ### Why are the changes needed? Remnant application directories will not be deleted, because `hadoopFs.listFiles(path,false)` will not list directories. ### Does this PR introduce _any_ user-facing change? No. Closes #1641 from Demon-Liang/0.3-dev. Authored-by: Demon Liang <liangdingwen.ldw@alipay.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> (cherry picked from commit 42a9160c8ceaf79bae514c54dafcb5b8e12d5251) Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 17:54:08 +08:00
Angerszhuuuu	afab4a0a3b	[CELEBORN-696][FOLLOWUP] Remove new allocated peer workers from pushExcludedWorkers ### What changes were proposed in this pull request? When the workers of a newly allocated location are removed from pushExcludedWorkers, their peers should be removed as well. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1636 from AngersZhuuuu/CELEBORN-696-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 17:38:36 +08:00
Angerszhuuuu	3985a5cbd7	[CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment ### What changes were proposed in this pull request? Unify all blacklist related code and comment ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 16:28:03 +08:00
zhongqiang.czq	374d735ae5	[CELEBORN-724] Fix the compatibility of HeartbeatFromApplicationResponse with lower versions ### What changes were proposed in this pull request? The master side checks the `reply` field of HeartbeatFromApplication. If `reply` is true, it replies with HeartbeatFromApplicationResponse; otherwise it replies with OneWayMessageResponse. The `reply` field defaults to false before version 0.2.1, so the master stays compatible with older client versions. ### Why are the changes needed? Before version `0.2.1`, the response to HeartbeatFromApplication is `OneWayMessageResponse`, but from `0.3.0` it is `HeartbeatFromApplicationResponse`. If the client side is `0.2.1` and the server side is `0.3.0`, a compatibility issue occurs and the following compatibility errors are printed. ``` java java.io.InvalidObjectException: enum constant HEARTBEAT_FROM_APPLICATION_RESPONSE does not exist in class org.apache.celeborn.common.protocol.MessageType at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:2157) ~[?:1.8.0_362] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1662) ~[?:1.8.0_362] at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2430) ~[?:1.8.0_362] at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2354) ~[?:1.8.0_362] at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2212) ~[?:1.8.0_362] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1668) ~[?:1.8.0_362] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502) ~[?:1.8.0_362] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460) ~[?:1.8.0_362] at org.apache.celeborn.common.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?] 
``` ``` java Caused by: java.lang.ClassCastException: Cannot cast org.apache.celeborn.common.protocol.message.ControlMessages$HeartbeatFromApplicationResponse to org.apache.celeborn.common.protocol.message.ControlMessages$OneWayMessageResponse$ at java.lang.Class.cast(Class.java:3369) ~[?:1.8.0_362] at scala.concurrent.Future.$anonfun$mapTo$1(Future.scala:500) ~[scala-library-2.12.15.jar:?] at scala.util.Success.$anonfun$map$1(Try.scala:255) ~[scala-library-2.12.15.jar:?] at scala.util.Success.map(Try.scala:213) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:67) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:82) ~[scala-library-2.12.15.jar:?] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:59) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:875) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:110) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873) ~[scala-library-2.12.15.jar:?] 
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Promise.trySuccess(Promise.scala:94) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Promise.trySuccess$(Promise.scala:94) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187) ~[scala-library-2.12.15.jar:?] at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:218) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The PR was tested manually, and the testing process is as follows: 1. The server side is deployed using the code of the latest branch-0.3. 2. The Spark client is deployed with version 0.2.1, then spark-sql runs 3 TPC-DS queries (query1.sql/query2.sql/query3.sql, with a data size of 1T); finally, verify that the queries execute successfully and none of the compatibility errors above are printed. 3. The Spark client is deployed with version 0.3.0, then spark-sql runs the same 3 TPC-DS queries (query1.sql/query2.sql/query3.sql, with a data size of 1T); finally, verify that the queries execute successfully and none of the compatibility errors above are printed. This patch had conflicts when merged, resolved by Committer: Cheng Pan <chengpan@apache.org> Closes #1635 from zhongqiangczq/heartbeat2. Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-28 16:04:18 +08:00
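The reply-flag check described in CELEBORN-724 can be sketched as follows. This is a minimal illustration under stated assumptions: `Heartbeat`, `ResponseKind`, and `respond` are hypothetical stand-ins, not Celeborn's actual message classes or handler.

```java
final class HeartbeatCompatSketch {
  /** Hypothetical stand-in for the heartbeat request. Old clients never set reply, so it stays false. */
  static final class Heartbeat {
    final String appId;
    final boolean reply;

    Heartbeat(String appId, boolean reply) {
      this.appId = appId;
      this.reply = reply;
    }
  }

  /** Hypothetical labels for the two response shapes named in the commit message. */
  enum ResponseKind {
    ONE_WAY_MESSAGE_RESPONSE,
    HEARTBEAT_FROM_APPLICATION_RESPONSE
  }

  /** The master only sends the newer response type when the client asked for it. */
  static ResponseKind respond(Heartbeat heartbeat) {
    return heartbeat.reply
        ? ResponseKind.HEARTBEAT_FROM_APPLICATION_RESPONSE
        : ResponseKind.ONE_WAY_MESSAGE_RESPONSE;
  }

  public static void main(String[] args) {
    System.out.println(respond(new Heartbeat("old-client-app", false)));
    System.out.println(respond(new Heartbeat("new-client-app", true)));
  }
}
```

Because the flag defaults to false, a client that predates the new response type keeps receiving the message it can deserialize.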
Angerszhuuuu	33cf343d20	[CELEBORN-666][REFACTOR] Unify exclude and blacklist related configuration ### What changes were proposed in this pull request? Unify exclude and blacklist related configuration ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1633 from AngersZhuuuu/CELEBORN-666-NEW. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-28 10:59:58 +08:00
zky.zhoukeyong	57b0e815cf	[CELEBORN-656] Batch revive RPCs in client to avoid too many requests ### What changes were proposed in this pull request? This PR batches revive requests and periodically sends them to LifecycleManager to reduce the number of RPC requests. In more detail, this PR changes the Revive message to support multiple unique partitions, and also passes a set of unique mapIds for checking MapEnd. Each time ShuffleClientImpl wants to revive, it adds a ReviveRequest to ReviveManager and waits for the result. ReviveManager batches revive requests and periodically sends them to LifecycleManager (deduplicated by partitionId). LifecycleManager constructs a ChangeLocationsCallContext and, after all locations are notified, replies to ShuffleClientImpl. ### Why are the changes needed? In my test of 3T TPCDS q23a with 3 Celeborn workers, when a worker is killed, the LifecycleManager receives about 48,000 Revive requests: ``` [emr-usermaster-1-1 logs]$ cat spark-emr-user-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master-1-1.c-fa08904e94c028d1.out.1 | grep -i revive | wc -l 64364 ``` After this PR, the number of ReviveBatch requests drops to 708: ``` [emr-usermaster-1-1 logs]$ cat spark-emr-user-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master-1-1.c-fa08904e94c028d1.out | grep -i revive | wc -l 2573 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. I have tested: 1. Disable graceful shutdown, kill one worker, job succeeds 2. Disable graceful shutdown, kill two workers successively, job fails as expected 3. Enable graceful shutdown, restart two workers successively, job succeeds 4. Enable graceful shutdown, restart two workers successively, then kill the third one, job succeeds Closes #1588 from waitinfuture/656-2. 
Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Keyong Zhou <zhouky@apache.org> Co-authored-by: Keyong Zhou <waitinfuture@gmail.com> Signed-off-by: Shuang <lvshuang.tb@gmail.com>	2023-06-27 22:11:04 +08:00
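The batching idea in CELEBORN-656 (queue revive requests, then send one message per period with each partition listed once) can be sketched roughly as below. All names are hypothetical, and the periodic scheduling and the RPC itself are left out; this is not the actual ReviveManager code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ReviveBatcherSketch {
  /** Hypothetical stand-in for one revive request raised by a map task. */
  static final class ReviveRequest {
    final int mapId;
    final int partitionId;

    ReviveRequest(int mapId, int partitionId) {
      this.mapId = mapId;
      this.partitionId = partitionId;
    }
  }

  private final List<ReviveRequest> pending = new ArrayList<>();

  /** Called by writers; it only queues the request instead of sending an RPC. */
  synchronized void add(ReviveRequest request) {
    pending.add(request);
  }

  /**
   * Meant to be called periodically. Drains the queue and groups it by
   * partitionId, so each partition appears once per batch together with the
   * set of mapIds that asked for it.
   */
  synchronized Map<Integer, Set<Integer>> drainBatch() {
    Map<Integer, Set<Integer>> mapIdsByPartition = new LinkedHashMap<>();
    for (ReviveRequest request : pending) {
      mapIdsByPartition
          .computeIfAbsent(request.partitionId, k -> new TreeSet<>())
          .add(request.mapId);
    }
    pending.clear();
    return mapIdsByPartition;
  }

  public static void main(String[] args) {
    ReviveBatcherSketch batcher = new ReviveBatcherSketch();
    batcher.add(new ReviveRequest(1, 666));
    batcher.add(new ReviveRequest(2, 666));
    batcher.add(new ReviveRequest(3, 7));
    System.out.println(batcher.drainBatch());
  }
}
```

In this sketch, three requests touching two partitions collapse into a single batch with two entries, which is the effect the request counts in the commit message are measuring.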
Shuang	fe2f76dba6	[CELEBORN-717][FLINK][FOLLOWUP] Fix ResultPartition lost numBytesOut/numBuffersOut metrics ### What changes were proposed in this pull request? The metrics update logic needs to align with Flink 1.17/1.15. ### Why are the changes needed? See [1626](https://github.com/apache/incubator-celeborn/pull/1626). The metrics update logic needs to align with Flink 1.17/1.15. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual TPC-DS test. Closes #1631 from RexXiong/CELEBORN-717-FOLLOWUP. Authored-by: Shuang <lvshuang.tb@gmail.com> Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>	2023-06-27 21:47:41 +08:00
zky.zhoukeyong	ebff17ec3c	[CELEBORN-721] Fix concurrent bug in ChangePartitionManager ### What changes were proposed in this pull request? Fixes concurrent bug in ChangePartitionManager. ### Why are the changes needed? Before this PR, ```ChangePartitionManager.start``` tries to synchronize on ```requests``` in the body of ```run()```, but the synchronized keyword was put outside of the ```batchHandleChangePartitionExecutors.submit```, which has no effect. When I was testing https://github.com/apache/incubator-celeborn/pull/1588 , I encountered unexpected situations that when all ```rss-lifecycle-manager-change-partition-executor``` threads are idle, the ```inBatchPartitions``` is still not empty: ``` 23/06/27 20:35:55 INFO ChangePartitionManager: Inside run, shuffleId 0 inBatchPartitions size 834 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1634 from waitinfuture/721. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-27 21:30:47 +08:00
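The locking mistake described in CELEBORN-721 can be reproduced in isolation. The sketch below uses hypothetical names and a plain `ExecutorService` rather than the actual ChangePartitionManager code; it only shows why a `synchronized` block wrapped around `submit` does not protect the task body.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

final class SynchronizedPlacementSketch {
  /** Broken shape: the lock is held only while the task is handed to the pool. */
  static Future<?> submitLockedOutside(ExecutorService pool, Object lock, Runnable body) {
    synchronized (lock) {
      return pool.submit(body);
    }
  }

  /** Fixed shape: the worker thread takes the lock around the task body. */
  static Future<?> submitLockedInside(ExecutorService pool, Object lock, Runnable body) {
    return pool.submit(() -> {
      synchronized (lock) {
        body.run();
      }
    });
  }

  /** Reports whether the task body ran while its own thread held the lock. */
  static boolean bodyHoldsLock(boolean lockInside) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Object lock = new Object();
    AtomicBoolean held = new AtomicBoolean(false);
    Runnable body = () -> held.set(Thread.holdsLock(lock));
    try {
      Future<?> done = lockInside
          ? submitLockedInside(pool, lock, body)
          : submitLockedOutside(pool, lock, body);
      done.get(); // wait until the worker thread has run the body
      return held.get();
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("lock outside submit: " + bodyHoldsLock(false));
    System.out.println("lock inside task:    " + bodyHoldsLock(true));
  }
}
```

With the lock outside `submit`, the body runs on a pool thread that never acquired the monitor, so `bodyHoldsLock(false)` reports false; moving the block into the task makes it report true.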
Angerszhuuuu	4c67325a3d	[CELEBORN-720][SPARK] Correct metric peakExecutionMemory of SortBasedShuffleWriter ### What changes were proposed in this pull request? Currently SortBasedShuffleWriter does not update peakMemoryUsedBytes; this PR adds that update. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1632 from AngersZhuuuu/CELEBORN-720. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-27 18:40:06 +08:00
mingji	40760ede3a	[CELEBORN-568] Support storage type selection ### What changes were proposed in this pull request? 1. Celeborn supports storage type selection. HDD, SSD, and HDFS are available for now. 2. Add new buffer size for HDFS file writers. 3. Worker support empty working dirs. ### Why are the changes needed? Support HDFS only scenario. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? UT and cluster. Closes #1619 from FMX/CELEBORN-568. Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-27 18:07:08 +08:00
Angerszhuuuu	a2b215bd47	[CELEBORN-718] Support override Hadoop Conf by Celeborn Conf with `celeborn.hadoop.` prefix ### What changes were proposed in this pull request? The Hadoop configuration that Celeborn generates should respect the Celeborn conf. ### Why are the changes needed? On the Spark client side, settings are written as `spark.celeborn.hadoop.xxx.xx`. On the server side, they are written as `celeborn.hadoop.xxx.xxx`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1629 from AngersZhuuuu/CELEBORN-719. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-27 17:00:47 +08:00
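One plausible reading of the `celeborn.hadoop.` prefix is sketched below: entries under the prefix are copied into the Hadoop configuration with the prefix removed. The stripping behavior is an assumption (by analogy with similar prefix conventions), and `hadoopOverrides` is a hypothetical helper that works on plain maps instead of the real configuration classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HadoopConfOverrideSketch {
  static final String PREFIX = "celeborn.hadoop.";

  /** Returns the entries under the prefix, keyed by the name with the prefix stripped (assumed behavior). */
  static Map<String, String> hadoopOverrides(Map<String, String> celebornConf) {
    Map<String, String> overrides = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : celebornConf.entrySet()) {
      if (entry.getKey().startsWith(PREFIX)) {
        overrides.put(entry.getKey().substring(PREFIX.length()), entry.getValue());
      }
    }
    return overrides;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new LinkedHashMap<>();
    conf.put("celeborn.hadoop.fs.defaultFS", "hdfs://example-nameservice"); // example value only
    conf.put("celeborn.master.host", "clb-1");
    System.out.println(hadoopOverrides(conf));
  }
}
```

Keys outside the prefix, such as `celeborn.master.host`, are ignored by this helper.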
zky.zhoukeyong	809c76a2e4	[CELEBORN-718] Decrease RemainingReviveTimes regardless worker is excluded or not ### What changes were proposed in this pull request? This PR makes ReviveTimes decrease regardless of whether the partition location is excluded. ### Why are the changes needed? In the following testing setup: - 3 Celeborn workers - Client side blacklist enabled ```spark.celeborn.client.push.blacklist.enabled=true``` - Replication is on ```spark.celeborn.client.push.replicate.enabled=true``` - Successively kill 2 workers I expect the task to fail because of revive failure (when replication is on, we need at least 2 workers), but instead the tasks hang forever. When digging into the logs I found that the ```remain revive times``` does not decrease, leading to an infinite revive loop. ``` 23/06/27 14:00:57 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:01 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:05 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:09 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:13 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:17 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:21 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:25 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 
23/06/27 14:01:29 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:33 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:37 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:41 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:45 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:49 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:53 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:01:57 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:02:01 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:02:05 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:02:09 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. 23/06/27 14:02:13 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5. ``` The reason is before this PR, the revive times will not decrease if the partition location is excluded, which I don't see a reason for that. ### Does this PR introduce _any_ user-facing change? No. 
### How was this patch tested? Manually test. Closes #1628 from waitinfuture/718. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-27 15:21:09 +08:00
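The hang described in CELEBORN-718 comes down to a retry budget that is never spent. The simulation below is a sketch with hypothetical names, not the ShuffleClientImpl logic: the push always fails, and `maxAttempts` exists only so the simulated "hang" terminates.

```java
final class ReviveBudgetSketch {
  /**
   * Simulates a push that always fails and returns how many attempts are made
   * before the loop stops. maxAttempts is a safety cap for the simulation only.
   */
  static int attempts(
      int remainingReviveTimes,
      boolean locationExcluded,
      boolean alwaysDecrement,
      int maxAttempts) {
    int attempts = 0;
    while (remainingReviveTimes > 0 && attempts < maxAttempts) {
      attempts++;
      // Old behavior: skip the decrement when the location is excluded.
      // New behavior: decrement on every failed attempt.
      if (alwaysDecrement || !locationExcluded) {
        remainingReviveTimes--;
      }
    }
    return attempts;
  }

  public static void main(String[] args) {
    System.out.println("old behavior, excluded worker: " + attempts(5, true, false, 1000));
    System.out.println("new behavior, excluded worker: " + attempts(5, true, true, 1000));
  }
}
```

With the old rule and an excluded location, the loop only stops at the artificial cap; with the decrement applied unconditionally, it stops after the five revive attempts shown in the log.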
Shuang	22b21295e8	[CELEBORN-717][FLINK] Fix ResultPartition lost numBytesOut/numBuffersOut metrics ### What changes were proposed in this pull request? Reset numBytesOut/numBuffersOut metrics for RemoteShuffleResultPartition ### Why are the changes needed? Currently ResultPartition lost numBytesOut/numBuffersOut metrics, this will cause Flink AdaptiveScheduler can not dynamically adjust the task parallelism based on the input amount of data ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test. Closes #1626 from RexXiong/CELEBORN-717. Authored-by: Shuang <lvshuang.tb@gmail.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-06-27 11:49:00 +08:00
Cheng Pan	1753556565	[CELEBORN-713] Local network binding support IP or FQDN ### What changes were proposed in this pull request? This PR aims to make network local address binding support both the IP and the FQDN strategy. Additionally, it refactors `ShuffleClientImpl#genAddressPair` from `${hostAndPort}-${hostAndPort}` to `Pair<String, String>`; the old form works properly with IPs but may not with FQDNs, because an FQDN may contain `-`. ### Why are the changes needed? Currently, when the bind hostname is not set explicitly, Celeborn finds the first non-loopback address and always binds using the IP. This is not suitable for K8s cases, as the STS has a stable FQDN but the Pod IP changes once the Pod restarts. `ShuffleClientImpl#genAddressPair` must be changed, otherwise it may cause ``` java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11657 in stage 0.0 failed 4 times, most recent failure: Lost task 11657.3 in stage 0.0 (TID 12747) (10.153.253.198 executor 157): java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.celeborn.client.ShuffleClientImpl.doPushMergedData(ShuffleClientImpl.java:874) at org.apache.celeborn.client.ShuffleClientImpl.pushOrMergeData(ShuffleClientImpl.java:735) at org.apache.celeborn.client.ShuffleClientImpl.mergeData(ShuffleClientImpl.java:827) at org.apache.spark.shuffle.celeborn.SortBasedPusher.pushData(SortBasedPusher.java:140) at org.apache.spark.shuffle.celeborn.SortBasedPusher.insertRecord(SortBasedPusher.java:192) at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.fastWrite0(SortBasedShuffleWriter.java:192) at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.write(SortBasedShuffleWriter.java:145) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at 
org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1508) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) ``` ### Does this PR introduce _any_ user-facing change? Yes, a new configuration `celeborn.network.bind.preferIpAddress` is introduced, and the default value is `true` to preserve the existing behavior. ### How was this patch tested? Manually testing with `celeborn.network.bind.preferIpAddress=false` ``` Server: 10.178.96.64 Address: 10.178.96.64#53 Name: celeborn-master-0.celeborn-master-svc.spark.svc.cluster.local Address: 10.153.143.252 Server: 10.178.96.64 Address: 10.178.96.64#53 Name: celeborn-master-1.celeborn-master-svc.spark.svc.cluster.local Address: 10.153.173.94 Server: 10.178.96.64 Address: 10.178.96.64#53 Name: celeborn-master-2.celeborn-master-svc.spark.svc.cluster.local Address: 10.153.149.42 starting org.apache.celeborn.service.deploy.worker.Worker, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.worker.Worker-1-celeborn-worker-4.out 2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.Dispatcher#51 - Dispatcher numThreads: 4 2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.network.client.TransportClientFactory#91 - mode NIO threads 64 2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.NettyRpcEnvFactory#51 - Starting RPC Server [WorkerSys] on celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 with advisor endpoint celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.util.Utils#51 - Successfully started service 
'WorkerSys' on port 38303. ``` Closes #1622 from pan3793/CELEBORN-713. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-27 09:42:11 +08:00
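The address-pair problem in CELEBORN-713 can be illustrated with plain strings. The sketch below assumes the old form was parsed by splitting on `-`; the method names are hypothetical, and `SimpleImmutableEntry` is used here only as a standard-library stand-in for the `Pair<String, String>` mentioned in the commit message.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

final class AddressPairSketch {
  /** Old shape: two host:port strings joined with "-" and split apart again later. */
  static String[] splitJoined(String first, String second) {
    return (first + "-" + second).split("-");
  }

  /** New shape: both addresses stay in separate fields, so nothing has to be parsed. */
  static Map.Entry<String, String> pair(String first, String second) {
    return new SimpleImmutableEntry<>(first, second);
  }

  public static void main(String[] args) {
    // IP addresses contain no "-", so the split yields exactly two parts.
    System.out.println(splitJoined("10.0.0.1:9097", "10.0.0.2:9097").length);
    // FQDNs such as StatefulSet pod names contain "-", so the split yields more than two parts.
    System.out.println(
        splitJoined("celeborn-worker-4.example:9097", "celeborn-worker-5.example:9097").length);
    System.out.println(pair("celeborn-worker-4.example:9097", "celeborn-worker-5.example:9097"));
  }
}
```

Once the split no longer returns exactly two elements, any code that indexes into the result reads the wrong address, which is the class of failure the structured pair avoids.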
Cheng Pan	2b82194ce0	[CELEBORN-715] Change master URL schema from rss to celeborn ### What changes were proposed in this pull request? Change Celeborn Master URL from `rss://<host>:<port>` to `celeborn://<host>:<port>` ### Why are the changes needed? Respect the project name. ### Does this PR introduce _any_ user-facing change? Yes, migration guide is updated accordingly. ### How was this patch tested? Pass GA. Closes #1624 from pan3793/CELEBORN-715. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-26 22:30:20 +08:00
Fu Chen	4b8f126d54	[CELEBORN-716][BUILD] Correct the `to` name when renaming the Netty native library ### What changes were proposed in this pull request? As title ### Why are the changes needed? before this PR the `liborg_apache_celeborn_shaded_netty_transport_native_epoll_aarch_64.so` can't correctly be loaded. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested ```shell > tar zxf celeborn-client-spark-3-shaded_2.12-0.4.0-SNAPSHOT.jar > find * -name "*.so" META-INF/native/liborg_apache_celeborn_shaded_netty_transport_native_epoll_aarch_64.so META-INF/native/liborg_apache_celeborn_shaded_netty_transport_native_epoll_x86_64.so ``` Closes #1625 from cfmcgrady/typo. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-26 21:57:06 +08:00
Fu Chen	1b3ec61690	[CELEBORN-711][TEST] Rework PushDataTimeoutTest ### What changes were proposed in this pull request? 1. separated push data timeout tests and push merge data timeout tests in `PushDataTimeoutTest` 2. updated the test results assertion 3. rework `pushdata timeout will add to blacklist` ### Why are the changes needed? ensure that the timeout behavior is correctly implemented https://github.com/apache/incubator-celeborn/pull/1613#discussion_r1236423721 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #1620 from cfmcgrady/push-timeout-test. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-26 13:45:27 +08:00
zwangsheng	1ae92b56e0	[CELEBORN-714][HELM] Improved the local disk binding mechanism of Kubernetes HostPath ### What changes were proposed in this pull request? Add `diskType` in `charts/celeborn/values.yml` to help configure `celeborn.worker.storage.dirs`. The result looks like: ```properties celeborn.worker.storage.dirs=/mnt/disk1:disktype=HDD,/mnt/disk2:disktype=HDD,/mnt/disk3:disktype=HDD,/mnt/disk4:disktype=SSD ``` ### Why are the changes needed? Help users specify the local disk type. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Local dry-run ```shell helm install celeborn charts/celeborn --dry-run ``` Closes #1623 from zwangsheng/CELEBORN-714. Authored-by: zwangsheng <2213335496@qq.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-06-26 10:52:37 +08:00
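The `celeborn.worker.storage.dirs` value shown in CELEBORN-714 follows a simple `path:disktype=TYPE` pattern joined by commas. The sketch below rebuilds that string from a map of mount paths to disk types; `storageDirs` is a hypothetical helper for illustration, since the chart itself renders this value in its templates.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class StorageDirsSketch {
  /** Renders mount paths and disk types in the path:disktype=TYPE form shown in the commit message. */
  static String storageDirs(Map<String, String> diskTypeByMountPath) {
    StringJoiner joined = new StringJoiner(",");
    for (Map.Entry<String, String> entry : diskTypeByMountPath.entrySet()) {
      joined.add(entry.getKey() + ":disktype=" + entry.getValue());
    }
    return "celeborn.worker.storage.dirs=" + joined;
  }

  public static void main(String[] args) {
    Map<String, String> disks = new LinkedHashMap<>(); // insertion order is preserved
    disks.put("/mnt/disk1", "HDD");
    disks.put("/mnt/disk4", "SSD");
    System.out.println(storageDirs(disks));
  }
}
```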
zky.zhoukeyong	6b82ecdfa0	[CELEBORN-712] Make appUniqueId a member of ShuffleClientImpl and refactor code ### What changes were proposed in this pull request? Make appUniqueId a member of ShuffleClientImpl and remove applicationId from RPC messages across client side, so it won't cause compatibility issues. ### Why are the changes needed? Currently Celeborn Client is bound to a single application id, so there's no need to pass applicationId around in many RPC messages in client side. ### Does this PR introduce _any_ user-facing change? In some logs the application id will not be printed, which should not be a problem. ### How was this patch tested? UTs. Closes #1621 from waitinfuture/appid. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-25 21:37:16 +08:00
Cheng Pan	ac84d64d51	[CELEBORN-707][MASTER] Remove env CELEBORN_MASTER_HOST and CELEBORN_MASTER_PORT ### What changes were proposed in this pull request? Remove environment variables `CELEBORN_MASTER_HOST` and `CELEBORN_MASTER_PORT`, and make `CELEBORN_LOCAL_HOSTNAME` take effect on both master and worker. ### Why are the changes needed? There are many different ways to configure the master/worker host and port, which makes things complex and inconsistent. After this change, #### master 1. cli args `--host` `--port` take the highest priority 2. then look up env `CELEBORN_LOCAL_HOSTNAME` 3. things differ depending on whether HA is enabled 3.1. when HA is disabled, look up configurations `celeborn.master.host` and `celeborn.master.port` 3.2. when HA is enabled, each node needs to know the whole cluster info, ``` celeborn.master.ha.node.1.host clb-1 celeborn.master.ha.node.1.port 9097 celeborn.master.ha.node.2.host clb-2 celeborn.master.ha.node.2.port 9097 celeborn.master.ha.node.3.host clb-3 celeborn.master.ha.node.3.port 9097 ``` in addition, `celeborn.master.ha.node.id=1` can be used to indicate the node id, otherwise, the master will try to bind each host to match the node id. #### worker 1. cli args `--host` `--port` take the highest priority 2. then look up env `CELEBORN_LOCAL_HOSTNAME` things are simpler than the master case because each worker is not required to know the others. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? UT. Closes #1616 from pan3793/CELEBORN-707. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-25 16:00:59 +08:00
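The lookup order described above for a master with HA disabled (CLI argument, then `CELEBORN_LOCAL_HOSTNAME`, then `celeborn.master.host`) can be sketched as a simple fallback chain. This is an illustration under stated assumptions, not the code from the PR: `HostResolutionSketch`, `resolveHost`, and the final `fallback` argument are hypothetical, and the environment and configuration are passed in as plain maps so the sketch stays self-contained.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the priority order from the commit message
// (HA-disabled master case); not Celeborn's actual implementation.
final class HostResolutionSketch {

  static String resolveHost(
      Optional<String> cliHost,   // value of --host, if given
      Map<String, String> env,    // process environment
      Map<String, String> conf,   // loaded configuration
      String fallback) {          // assumption: used when nothing is configured
    // 1. CLI argument takes the highest priority.
    if (cliHost.isPresent()) {
      return cliHost.get();
    }
    // 2. Then the CELEBORN_LOCAL_HOSTNAME environment variable.
    String fromEnv = env.get("CELEBORN_LOCAL_HOSTNAME");
    if (fromEnv != null && !fromEnv.isEmpty()) {
      return fromEnv;
    }
    // 3. Then the configuration key (HA disabled).
    return conf.getOrDefault("celeborn.master.host", fallback);
  }

  public static void main(String[] args) {
    Map<String, String> env = Collections.singletonMap("CELEBORN_LOCAL_HOSTNAME", "clb-env");
    Map<String, String> conf = Collections.singletonMap("celeborn.master.host", "clb-conf");
    System.out.println(resolveHost(Optional.of("clb-cli"), env, conf, "localhost"));
    System.out.println(resolveHost(Optional.empty(), env, conf, "localhost"));
    System.out.println(resolveHost(Optional.empty(), Collections.emptyMap(), conf, "localhost"));
  }
}
```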
zky.zhoukeyong	e2eeafd4bf	[CELEBORN-709] Increase default fetch timeout ### What changes were proposed in this pull request? 30s for fetch timeout is too short and easy to exceed. This PR increases the default value to 600s. ### Why are the changes needed? When I was testing 3T TPCDS with three workers, I encountered fetch timeout: ``` 23/06/21 16:46:41,771 INFO [fetch-server-11-7] FetchHandler: Sending chunk 28856864163, 1, 0, 2147483647 ... 23/06/21 16:47:16,870 INFO [fetch-server-11-7] FetchHandler: Sent chunk 28856864163, 1, 0, 2147483647 ``` And I remember from some users' monitoring, the max fetch time can reach several minutes on heavy load without error. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1618 from waitinfuture/709. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-23 21:06:43 +08:00
Cheng Pan	679f9cbf58	[CELEBORN-708] Fix commit metrics in application heartbeat ### What changes were proposed in this pull request? - Fix commit metrics in application heartbeat - Change client side application heartbeat message log level to info - Improve heartbeat log by unifying the word "heartbeat" ### Why are the changes needed? `commitHandler.commitMetrics()` has side effects, so calling it multiple times to get the values independently is incorrect. ``` def commitMetrics(): (Long, Long) = (totalWritten.sumThenReset(), fileCount.sumThenReset()) ``` ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Review. Closes #1617 from pan3793/CELEBORN-708. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-21 22:34:24 +08:00
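The side effect mentioned above comes from `sumThenReset()`, which returns the current sum and zeroes the counter in the same call. The Java sketch below mirrors the Scala snippet with `java.util.concurrent.atomic.LongAdder`; the class `CommitMetricsSketch` and its `record` method are hypothetical stand-ins, not Celeborn code.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the commit handler; only the sumThenReset()
// behaviour is taken from the commit message.
final class CommitMetricsSketch {
  private final LongAdder totalWritten = new LongAdder();
  private final LongAdder fileCount = new LongAdder();

  // Records one committed file of the given size.
  void record(long bytes) {
    totalWritten.add(bytes);
    fileCount.increment();
  }

  // Returns {totalWritten, fileCount} and resets both counters.
  long[] commitMetrics() {
    return new long[] {totalWritten.sumThenReset(), fileCount.sumThenReset()};
  }

  public static void main(String[] args) {
    CommitMetricsSketch handler = new CommitMetricsSketch();
    handler.record(100);
    handler.record(200);

    // Call once and reuse the result for both values.
    long[] metrics = handler.commitMetrics();
    System.out.println(Arrays.toString(metrics));

    // A second call sees the already-reset counters.
    System.out.println(Arrays.toString(handler.commitMetrics()));
  }
}
```

Calling `commitMetrics()` once per heartbeat and destructuring the result avoids reading one value from a fresh counter and the other from an already-reset one.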
Cheng Pan	98744fb8ca	[CELEBORN-705][BUILD] Upgrade Maven from 3.6.3 to 3.8.8 ### What changes were proposed in this pull request? Upgrade Maven from 3.6.3 to 3.8.8. ### Why are the changes needed? Maven 3.6.3 is EOL. It was removed from the Apache Mirror site, so users can not benefit from download speedup from the mirror even with ``` export APACHE_MIRROR=https://mirrors.cloud.tencent.com/apache ``` https://mirrors.cloud.tencent.com/apache/maven/maven-3/ <img width="752" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/80e9e472-15c6-419e-a29b-69661615a16f"> There are logs from our CI server, it can not download from the mirror site and have to fallback to the Apache archive server, the latter is extremely slow. ``` $ ./build/mvn $MVN_OPTS $BUILD_PROFILES -version Falling back to archive.apache.org to download Maven ... ``` Why not 3.9.2? Maven 3.9 uses native transport-http as default and the default timeout is 10000ms, which is shorter than Wagon's default timeout 60000ms, which causes a lot of network timeout issues See details at https://github.com/apache/spark/pull/40738 ### Does this PR introduce _any_ user-facing change? Maybe, if the user uses insecure http private repo in their `pom.xml`. Because [Maven 3.8 enforces the https in default](https://maven.apache.org/docs/3.8.1/release-notes.html#cve-2021-26291). As a workaround, you can leverage `sed` to remove such restrictions. ``` $ build/mvn -version $ sed -i "s/<mirrorOf>external:http:\/<mirrorOf>dummy/g" build/apache-maven-/conf/settings.xml ... ``` ### How was this patch tested? Pass GA. Closes #1615 from pan3793/CELEBORN-705. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-21 21:54:17 +08:00
Fu Chen	c6113c10e5	[CELEBORN-703][WORKER][PERF] Avoid calling `CelebornConf#get` multi-time when `PushDataHandler` handle `PushData`/`PushMergedData` ### What changes were proposed in this pull request? As title. the worker's frame graph before: ![image](https://github.com/apache/incubator-celeborn/assets/8537877/68a0a1fd-34c2-4618-9146-a2d66c951645) the worker's frame graph after: ![image](https://github.com/apache/incubator-celeborn/assets/8537877/268e8109-737d-4440-b2b8-60687ef090cb) ### Why are the changes needed? improve the worker's perf. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? existing UT and manually tested. Closes #1613 from cfmcgrady/push-data-handler-conf. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-21 20:02:44 +08:00
Cheng Pan	8194407558	[CELEBORN-704] Print host and port on starting Netty RPC Server ### What changes were proposed in this pull request? Print host and port on starting Netty RPC Server ### Why are the changes needed? Sometimes, the Master/Worker may fail on bootstrap because `BindException: Cannot assign requested address`, but there is no clue which addresses it tried. ``` 2023-06-21 14:28:12 [INFO] [main] org.apache.celeborn.service.deploy.worker.Worker#51 - Metrics system enabled. 2023-06-21 14:28:12 [ERROR] [main] org.apache.celeborn.service.deploy.worker.Worker#80 - Initialize worker failed. java.net.BindException: Cannot assign requested address at sun.nio.ch.Net.bind0(Native Method) ~[?:1.8.0_372] at sun.nio.ch.Net.bind(Net.java:461) ~[?:1.8.0_372] at sun.nio.ch.Net.bind(Net.java:453) ~[?:1.8.0_372] at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222) ~[?:1.8.0_372] at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:141) ~[netty-transport-4.1.93.Final.jar:4.1.93.Final] at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:562) ~[netty-transport-4.1.93.Final.jar:4.1.93.Final] at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334) ~[netty-transport-4.1.93.Final.jar:4.1.93.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:600) ~[netty-transport-4.1.93.Final.jar:4.1.93.Final] at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:579) ~[netty-transport-4.1.93.Final.jar:4.1.93.Final] at io.netty.handler.logging.LoggingHandler.bind(LoggingHandler.java:230) ~[netty-handler-4.1.93.Final.jar:4.1.93.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:602) ~[netty-transport-4.1.93.Final.jar:4.1.93.Final] at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:579) 
~[netty-transport-4.1.93.Final.jar:4.1.93.Final] at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973) ~[netty-transport-4.1.93.Final.jar:4.1.93.Final] at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:260) ~[netty-transport-4.1.93.Final.jar:4.1.93.Final] at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356) ~[netty-transport-4.1.93.Final.jar:4.1.93.Final] at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[netty-common-4.1.93.Final.jar:4.1.93.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) ~[netty-common-4.1.93.Final.jar:4.1.93.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[netty-common-4.1.93.Final.jar:4.1.93.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[netty-transport-4.1.93.Final.jar:4.1.93.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.93.Final.jar:4.1.93.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.93.Final.jar:4.1.93.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.93.Final.jar:4.1.93.Final] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually review. Closes #1614 from pan3793/CELEBORN-704. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-21 15:16:30 +08:00
zky.zhoukeyong	5f4f6d953f	[CELEBORN-702][DOC] Extend doc about migration from 0.2.1 to 0.3.0 ### What changes were proposed in this pull request? Extend doc about migration from 0.2.1 to 0.3.0. Added the following contents: <img width="1084" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/7a9d172c-09ba-48b6-9f5c-73a8b13d035f"> ### Why are the changes needed? ditto ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #1612 from waitinfuture/702. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-20 20:45:58 +08:00
zky.zhoukeyong	fdb126112e	[CELEBORN-700][COMPATIBILITY] Fix compatibility issue caused by WorkerInfo ### What changes were proposed in this pull request? Fixes compatibility issue introduced by change of WorkerInfo. ### Why are the changes needed? When testing with branch-0.2 client and main server, I got the following error: ``` Caused by: scala.MatchError: [Ljava.lang.String;414ca35a (of class [Ljava.lang.String;) at org.apache.celeborn.common.util.PbSerDeUtils$.$anonfun$fromPbWorkerResource$1(PbSerDeUtils.scala:298) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.celeborn.common.util.PbSerDeUtils$.fromPbWorkerResource(PbSerDeUtils.scala:297) at org.apache.celeborn.common.protocol.message.ControlMessages$.fromTransportMessage(ControlMessages.scala:863) at org.apache.celeborn.common.util.Utils$.fromTransportMessage(Utils.scala:828) at org.apache.celeborn.common.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:110) at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.$anonfun$deserialize$2(NettyRpcEnv.scala:276) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:323) at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.$anonfun$deserialize$1(NettyRpcEnv.scala:275) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:275) at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.$anonfun$ask$6(NettyRpcEnv.scala:235) at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.$anonfun$ask$6$adapted(NettyRpcEnv.scala:235) at 
org.apache.celeborn.common.rpc.netty.RpcOutboxMessage.onSuccess(Outbox.scala:82) at org.apache.celeborn.common.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:180) at org.apache.celeborn.common.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:119) at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at org.apache.celeborn.shaded.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at org.apache.celeborn.common.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:74) ``` And this is introduced by `811e192bbd (diff-b61712c3683306f65cd2ca051b54075952897a899951f3e37ec3968e7ba75710)` ### Does this PR introduce _any_ user-facing change? Yes, it fixes compatibility error when using branch-0.2 client and main server. ### How was this patch tested? Manual test. Closes #1610 from waitinfuture/700. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-20 20:17:00 +08:00
zky.zhoukeyong	6ca2bb2c6f	[CELEBORN-701][COMPATIBILITY] Fix compatibility issue caused by pushdata timeout ### What changes were proposed in this pull request? Fixes compatibility issue caused by change of push data timeout. ### Why are the changes needed? When I test with branch-0.2 client with main server side, I got the following error: ``` 23/06/20 17:42:34,538 ERROR [push-timeout-checker-12] PushDataHandler: PushData replication failed for partitionLocation: PartitionLocation[ id-epoch:767-0 host-rpcPort-pushPort-fetchPort-replicatePort:192.168.1.17-37687-42605-42891-41319 mode:MASTER peer:(host-rpcPort-pushPort-fetchPort-replicatePort:192.168.1.18-45201-35749-41831-46853) storage hint:StorageInfo{type=MEMORY, mountPoint='/mnt/disk4', finalResult=false, filePath=} mapIdBitMap:null] org.apache.celeborn.common.exception.CelebornIOException: PUSH_DATA_TIMEOUT_SLAVE at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredPushRequest(TransportResponseHandler.java:125) at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$0(TransportResponseHandler.java:96) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) ``` The error is because in main branch ReserveSlots added a new field ```pushDataTimeout```, so when client is branch-0.2, the default value will be 0, and always trigger timeout. ### Does this PR introduce _any_ user-facing change? 
Yes, this PR fixes a bug when client is branch-0.2 and server is main. ### How was this patch tested? Manual test. Closes #1611 from waitinfuture/701. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-06-20 18:43:02 +08:00
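The failure mode above is that a numeric field an older client never sets arrives as 0, and a timeout of 0 expires immediately. One common way to guard against this, shown here as an assumption rather than as the actual patch, is to treat a non-positive value as "not provided" and fall back to a server-side default. The names `PushTimeoutSketch` and `effectiveTimeoutMs` are hypothetical.

```java
// Hypothetical sketch: treats a missing (zero) timeout from an older client
// as "not set" instead of "expire immediately". Not the code from the PR.
final class PushTimeoutSketch {

  static long effectiveTimeoutMs(long timeoutFromRequestMs, long serverDefaultMs) {
    // Older clients do not send the field, so it deserializes as 0.
    return timeoutFromRequestMs > 0 ? timeoutFromRequestMs : serverDefaultMs;
  }

  public static void main(String[] args) {
    // New client: explicit value is honoured.
    System.out.println(effectiveTimeoutMs(30_000L, 120_000L));
    // Old client: field absent, server default is used.
    System.out.println(effectiveTimeoutMs(0L, 120_000L));
  }
}
```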
zky.zhoukeyong	255661bbb7	[CELEBORN-696] Fix bugs related with shutting down and excluded workers ### What changes were proposed in this pull request? 1. For each PartitionLocation returned from Register Shuffle, remove its worker from the local excluded list to refresh the local information. 2. ChangeLocationResponse will also return whether oldPartition is excluded in LifecycleManager. If so, remove it from Executor's excluded list. 3. Always trigger commit files for shutting down workers returned from HeartbeatFromApplicationResponse. 4. HeartbeatFromWorker sends the correct disk infos regardless of whether the worker is shutting down. After this PR, the priority of the excluded list is Master > LifecycleManager > Executor. ### Why are the changes needed? While testing with graceful shutdown turned on (workers have static rpc/push/fetch/replicate ports) and repeatedly restarting one out of three workers, I encountered several bugs. 1. First I killed worker A, so the Executor client's local excluded list contained A. After A stopped, I started it again, and the master offered slots on A, so A should be removed from the executor's excluded list at that point. 2. When I kill and start a worker twice within a period shorter than the app heartbeat interval, the second time WorkerStatusTracker will not trigger commit files because the local cache for the worker has not been refreshed. 3. When a worker is shutting down, it passes empty diskInfos in its heartbeat, and the master blindly added it to the excluded list. We want a worker to be either in the excluded list or in the shutting down list, exclusively. If a worker is in the excluded list, LifecycleManager will not trigger commit files when handling the heartbeat response; on the other hand, if a worker is in the shutting down list, LifecycleManager will trigger commit files on it. So we must ensure that a shutting down worker is in the shutting down list. ### Does this PR introduce _any_ user-facing change? Yes, it fixes several bugs described above. 
### How was this patch tested? Manual test. Closes #1606 from waitinfuture/696. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: Shuang <lvshuang.tb@gmail.com>	2023-06-20 16:47:07 +08:00
onebox-li	88586d6c15	[CELEBORN-697] Fix assignment of DeviceInfo deviceStatAvailable ### What changes were proposed in this pull request? The variable name and the assignment of DeviceInfo `deviceStatAvailable` do not match ### Why are the changes needed? ditto ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Cluster test Closes #1607 from onebox-li/fix-deviceStatAvailable. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-20 15:53:15 +08:00
onebox-li	3af81057e9	[CELEBORN-698] Fix LocalDeviceMonitor::readWriteError judge ### What changes were proposed in this pull request? If dataDir does not exist, the check in DeviceMonitor::readWriteError is skipped. As a result, a disk marked READ_OR_WRITE_FAILURE after a FileWriter creation error may be switched back to healthy by the subsequent disk checker. ``` 23/06/19 19:09:52,718 ERROR [dispatcher-event-loop-16] StorageManager: Create FileWriter for /data7/rss_storage/rss-worker/shuffle_data/application_1684305681931_1943199/180/79-0-0 of mount /data7 failed, report to DeviceMonitor java.io.IOException: No such file or directory at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.createNewFile(File.java:1012) at org.apache.celeborn.service.deploy.worker.storage.StorageManager.createWriter(StorageManager.scala:330) ... at java.lang.Thread.run(Thread.java:745) 23/06/19 19:09:52,718 ERROR [dispatcher-event-loop-16] LocalDeviceMonitor: Receive non-critical exception, disk: /data7, java.io.IOException: java.io.IOException: No such file or directory updated state DiskInfo(maxSlots: 0, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /data7, usableSpace: 536870912000, avgFlushTime: 0, avgFetchTime: 0, activeSlots: 0) status: READ_OR_WRITE_FAILURE dirs /data7/rss_storage/rss-worker/shuffle_data after disk checker, updated stated DiskInfo(maxSlots: 0, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /data7, usableSpace: 536870912000, avgFlushTime: 0, avgFetchTime: 0, activeSlots: 0) status: HEALTHY dirs /data7/rss_storage/rss-worker/shuffle_data ``` ### Why are the changes needed? ditto ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Cluster test Closes #1608 from onebox-li/fix-disk-check. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-20 15:51:53 +08:00
Cheng Pan	85be99548a	[CELEBORN-685][WORKER][HDFS] Fix permission on creating shuffle dir on HDFS ### What changes were proposed in this pull request? Correct the FsPermission 755. ### Why are the changes needed? We should use octal 0755 or "755" instead of decimal to represent UNIX permission. String is chosen because octal is deprecated in Scala. ### Does this PR introduce _any_ user-facing change? Yes, it's a bug fix. ### How was this patch tested? Manually review. Closes #1597 from pan3793/CELEBORN-685. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-06-20 15:34:24 +08:00
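To see why a decimal 755 is the wrong permission value, the following plain-Java sketch (no Hadoop API involved; the class `PermissionSketch` and its methods are hypothetical) renders the low nine permission bits of a number as an `rwx` string. Decimal 755 is octal 1363, so its permission bits differ from those of octal 0755.

```java
// Hypothetical sketch using only the JDK: shows how the same digits "755"
// yield different permission bits when read as decimal versus octal.
final class PermissionSketch {

  // Parses a UNIX mode string such as "755" as an octal number.
  static int parseOctal(String mode) {
    return Integer.parseInt(mode, 8);
  }

  // Renders the low nine bits of a mode as an "rwxrwxrwx"-style string.
  static String toRwx(int mode) {
    String flags = "rwxrwxrwx";
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 9; i++) {
      boolean set = (mode & (1 << (8 - i))) != 0;
      sb.append(set ? flags.charAt(i) : '-');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(toRwx(parseOctal("755"))); // rwxr-xr-x
    System.out.println(toRwx(0755));              // rwxr-xr-x (Java octal literal)
    System.out.println(toRwx(755));               // -wxrw--wx (decimal, not intended)
  }
}
```

Passing the mode as a string, as the commit does, sidesteps the question of how the language treats a leading zero in numeric literals.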
zky.zhoukeyong	7d634db547	[CELEBORN-695] Fix UnsupportedOperationException by refactoring WorkerInfo ### What changes were proposed in this pull request? Refactor WorkerInfo 1. make ```diskInfos```, ```userResourceConsumption``` new maps instead of using the passed in reference 2. remove ```endpoint``` from the constructor ### Why are the changes needed? When manually test stop-worker.sh with graceful turned on, I got the following Exception ``` 23/06/19 11:04:25,665 INFO [worker-forward-message-scheduler] RssHARetryClient: connect to master master-1-1:9097. 23/06/19 11:04:27,168 ERROR [worker-forward-message-scheduler] RssHARetryClient: Send rpc with failure, has tried 15, max try 15! org.apache.celeborn.common.exception.CelebornException: Exception thrown in awaitResult: at org.apache.celeborn.common.util.ThreadUtils$.awaitResult(ThreadUtils.scala:231) at org.apache.celeborn.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:74) at org.apache.celeborn.common.haclient.RssHARetryClient.sendMessageInner(RssHARetryClient.java:150) at org.apache.celeborn.common.haclient.RssHARetryClient.askSync(RssHARetryClient.java:118) at org.apache.celeborn.service.deploy.worker.Worker.org$apache$celeborn$service$deploy$worker$Worker$$heartBeatToMaster(Worker.scala:306) at org.apache.celeborn.service.deploy.worker.Worker$$anon$1.$anonfun$run$1(Worker.scala:332) at org.apache.celeborn.common.util.Utils$.tryLogNonFatalError(Utils.scala:186) at org.apache.celeborn.service.deploy.worker.Worker$$anon$1.run(Worker.scala:332) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.celeborn.common.exception.CelebornIOException: remove at org.apache.celeborn.service.deploy.master.clustermeta.ha.HAHelper.sendFailure(HAHelper.java:65) at org.apache.celeborn.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:210) at org.apache.celeborn.service.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:315) at org.apache.celeborn.common.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.celeborn.common.rpc.netty.Inbox.safelyCall(Inbox.scala:222) at org.apache.celeborn.common.rpc.netty.Inbox.process(Inbox.scala:110) at org.apache.celeborn.common.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:229) ... 3 more Caused by: java.lang.UnsupportedOperationException: remove at scala.collection.convert.Wrappers$MapWrapper$$anon$2$$anon$3.remove(Wrappers.scala:236) at java.util.AbstractMap.remove(AbstractMap.java:254) at org.apache.celeborn.common.meta.WorkerInfo.$anonfun$updateThenGetDiskInfos$2(WorkerInfo.scala:225) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.celeborn.common.meta.WorkerInfo.updateThenGetDiskInfos(WorkerInfo.scala:224) at org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager.lambda$updateWorkerHeartbeatMeta$5(AbstractMetaManager.java:205) at java.util.Optional.ifPresent(Optional.java:159) at org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager.updateWorkerHeartbeatMeta(AbstractMetaManager.java:203) at 
org.apache.celeborn.service.deploy.master.clustermeta.SingleMasterMetaManager.handleWorkerHeartbeat(SingleMasterMetaManager.java:105) at org.apache.celeborn.service.deploy.master.Master.org$apache$celeborn$service$deploy$master$Master$$handleHeartbeatFromWorker(Master.scala:428) at org.apache.celeborn.service.deploy.master.Master$$anonfun$receiveAndReply$1.$anonfun$applyOrElse$20(Master.scala:326) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.celeborn.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:207) ... 8 more ``` According to the suggestion from https://github.com/apache/incubator-celeborn/pull/1602#issuecomment-1596722991 ### Does this PR introduce _any_ user-facing change? Yes, it fixes bug described in https://github.com/apache/incubator-celeborn/pull/1602 ### How was this patch tested? UTs and manual test. Closes #1605 from waitinfuture/695. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-19 19:38:55 +08:00
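The `UnsupportedOperationException: remove` in the trace above is raised when `remove` is called on a map wrapper that does not support mutation. The sketch below reproduces the same class of failure in plain Java with `Collections.unmodifiableMap` (an analogue chosen for illustration, not the Scala wrapper from the trace) and shows the defensive-copy approach the PR describes as making "new maps instead of using the passed in reference". `DefensiveCopySketch` and its members are hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: copies the passed-in map so later removals do not
// depend on whether the caller's map supports mutation.
final class DefensiveCopySketch {
  private final Map<String, Long> diskInfos;

  DefensiveCopySketch(Map<String, Long> passedIn) {
    // New map instead of keeping the passed-in reference.
    this.diskInfos = new HashMap<>(passedIn);
  }

  void removeDisk(String mountPoint) {
    diskInfos.remove(mountPoint);
  }

  int diskCount() {
    return diskInfos.size();
  }

  public static void main(String[] args) {
    Map<String, Long> backing = new HashMap<>();
    backing.put("/mnt/disk1", 1L);
    Map<String, Long> readOnly = Collections.unmodifiableMap(backing);

    try {
      readOnly.remove("/mnt/disk1"); // mutation through the read-only view
    } catch (UnsupportedOperationException e) {
      System.out.println("read-only view rejected remove");
    }

    DefensiveCopySketch info = new DefensiveCopySketch(readOnly);
    info.removeDisk("/mnt/disk1"); // works on the private copy
    System.out.println(info.diskCount());
  }
}
```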
zky.zhoukeyong	222ed267b0	[CELEBORN-692] WorkerStatusTracker should handle WORKER_SHUTDOWN properly ### What changes were proposed in this pull request? This PR puts workers with WORKER_SHUTDOWN status into shuttingWorkers instead of the blacklist. ### Why are the changes needed? If WORKER_SHUTDOWN workers are put into the blacklist, commit files will not be triggered for them, see ```CommitHandler::commitFiles``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1603 from waitinfuture/692. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-19 15:54:45 +08:00
Fu Chen	18f2be0fbe	[CELEBORN-693][SPARK] Align the `incWriterTime` in the hash-based shuffle writer with the sort-based shuffle ### What changes were proposed in this pull request? As title. ### Why are the changes needed? https://github.com/apache/incubator-celeborn/pull/1585#issuecomment-1589164128 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? tested locally. Closes #1604 from cfmcgrady/hash-based-writer-metrics. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-19 15:42:01 +08:00
sychen	e734ceb558	[MINOR] Cleanup code ### What changes were proposed in this pull request? 1. Use `<arg>-Ywarn-unused-import</arg>` to remove some unused imports. There is no way to keep `<arg>-Ywarn-unused-import</arg>` enabled at this stage because we have the following code: ``` // Can Remove this if celeborn don't support scala211 in future import org.apache.celeborn.common.util.FunctionConverter._ ``` 2. Fix Scala case matches that were not fully covered, to avoid `scala.MatchError` 3. Fix some Scala compilation warnings ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1600 from cxzl25/cleanup_code. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-19 11:31:51 +08:00
sychen	4cb4701ede	[CELEBORN-689] Fix the incorrect part of PushDataHandler message type converted to status code ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1601 from cxzl25/CELEBORN-689. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-19 11:25:08 +08:00
zwangsheng	7d7107d607	[CELEBORN-684] Upgrade Netty from 4.1.92.Final to 4.1.93.Final ### What changes were proposed in this pull request? `Netty` released `4.1.93.Final` three weeks ago, so we should update the Netty version. [Change List](https://github.com/netty/netty/compare/netty-4.1.92.Final...netty-4.1.93.Final) ### Why are the changes needed? Catch up with the Netty version ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI Closes #1596 from zwangsheng/CELEBORN-684. Authored-by: zwangsheng <2213335496@qq.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-16 20:05:25 +08:00
Shuang	28a99ded8d	[CELEBORN-687] Fix shuffleResourceExists, reduce unexpected slot release request ### What changes were proposed in this pull request? Check whether ShufflePartitionLocationInfo is empty for every worker ### Why are the changes needed? shuffleResources only removes the related partitionLocations after stageEnd, so workers with empty partitionLocations are left behind (for speculative tasks). shuffleResourceExists therefore needs to check ShufflePartitionLocationInfo for every worker; otherwise it prints a wrong message and sends release slot requests twice. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test ### Before this pr <img width="1252" alt="image" src="https://github.com/apache/incubator-celeborn/assets/28799061/06b71162-e78b-4163-8f52-24b50bc6c540"> ![image](https://github.com/apache/incubator-celeborn/assets/28799061/fec263e0-9641-4d17-a837-ab03c36c5e6d) ### After this pr ![image](https://github.com/apache/incubator-celeborn/assets/28799061/8f2f9653-ff58-4a6e-ae06-1023922ca5bf) Closes #1599 from RexXiong/CELEBORN-687. Authored-by: Shuang <lvshuang.tb@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-16 18:37:48 +08:00
Angerszhuuuu	c1c46398d5	[CELEBORN-682] Master and client should handle blacklist worker and shutting down worker separately ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1594 from AngersZhuuuu/CELEBORN-682. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-16 18:29:03 +08:00


1423 Commits