celeborn

Author	SHA1	Message	Date
Wang, Fei	330b2a094e	[CELEBORN-1708] Bump protobuf version from 3.21.7 to 3.25.5 ### What changes were proposed in this pull request? Bump protobuf from 3.21.7 to 3.25.5. ### Why are the changes needed? To fix CVE: https://github.com/advisories/GHSA-735f-pc8j-v9w8 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA. Closes #2898 from turboFei/bump_protobuf. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-11 17:02:23 +08:00
Wang, Fei	09ffee0365	[CELEBORN-1709] Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826 ### What changes were proposed in this pull request? Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826 ### Why are the changes needed? To fix CVE: https://github.com/advisories/GHSA-g8m5-722r-8whq ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA Closes #2899 from turboFei/bump_jetty. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-11 16:58:44 +08:00
Wang, Fei	6d2b9f6d92	[CELEBORN-1710] Bump commons-io version from 2.13.0 to 2.17.0 ### What changes were proposed in this pull request? Bump commons-io from 2.13.0 to 2.17.0 ### Why are the changes needed? To fix CVE: https://github.com/advisories/GHSA-78wr-2p64-hpwj ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA. Closes #2900 from turboFei/bump_commons_io. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-11 16:57:29 +08:00
SteNicholas	8b54ed8c34	[CELEBORN-1504][FOLLOWUP] Start test nodes in random ports to allow multiple builds run in the same ci server ### What changes were proposed in this pull request? Start test nodes in random ports to allow multiple builds run in the same ci server, which improves the test implementations so that it starts test nodes in random ports instead of using the hardcoded ones. Follow up #2619. ### Why are the changes needed? The test nodes are started in the hard coded ports, this prevents to run multiple builds in the same CI/CD server at present. Bump the changes of #2237. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `RemoteShuffleMasterSuiteJ` Closes #2902 from SteNicholas/CELEBORN-1504. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-11 16:55:14 +08:00
SteNicholas	169b6f6973	[CELEBORN-1685] ShuffleFallbackPolicy supports ShuffleFallbackCount metric ### What changes were proposed in this pull request? 1. `ShuffleFallbackPolicy` supports `ShuffleFallbackCount` metric to provide the shuffle fallback count of each fallback policy. 2. Introduce `ShuffleTotalCount` metric to record the total count of shuffle. 3. Fix Spark 2 does not increment shuffle count via `LifecycleManager`. ### Why are the changes needed? The implementations of `ShuffleFallbackPolicy` does not support `ShuffleFallbackCount` metric at present. Meanwhile, Bilibili production practice needs `ShuffleFallbackCount` of different `ShuffleFallbackPolicy`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Cluster test. Closes #2891 from SteNicholas/CELEBORN-1685. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-11-11 10:37:25 +08:00
sychen	8f34d1555b	[CELEBORN-1686] Avoid return the same pushTaskQueue ### What changes were proposed in this pull request? ### Why are the changes needed? The close method of `SortBasedShuffleWriter#write` will call `sendBufferPool.returnPushTaskQueue(dataPusher.getIdleQueue());`, but the close method may be interrupted. After the interruption, `SortBasedShuffleWriter#cleanupPusher` will be called, and `sendBufferPool.returnPushTaskQueue(dataPusher.getIdleQueue());` will also be called. Since `SendBufferPool#pushTaskQueues` is a `LinkedList`, repeated add will store two identical `idleQueue`, which may cause multiple tasks running in parallel to share the same `idleQueue`, resulting in inaccurate data. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Production environment verification Closes #2878 from cxzl25/CELEBORN-1686. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-11-11 10:35:02 +08:00
Wang, Fei	4545cdc401	[CELEBORN-1699] Add GA to check the openapi codegen ### What changes were proposed in this pull request? Check the consistency between openapi spec and generated openapi-client code in GA. Fix the ThreadStack model. ### Why are the changes needed? Address comments: https://github.com/apache/celeborn/pull/2892#issuecomment-2463170494 Followup for https://github.com/apache/celeborn/pull/2888 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA. It works: https://github.com/apache/celeborn/actions/runs/11733693060/job/32688436233?pr=2893 <img width="1059" alt="image" src="https://github.com/user-attachments/assets/84682976-1b7d-42e0-9b62-2966f3e952d7"> After ThreadStack fixed: <img width="1368" alt="image" src="https://github.com/user-attachments/assets/14a2a08f-dbc2-409c-a4ed-fbfee82e50b5"> Closes #2893 from turboFei/diff_openapi. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-11-08 10:57:46 +08:00
Weijie Guo	164c663492	[CELEBORN-1696] StorageManager#cleanFile should remove file info ### What changes were proposed in this pull request? Remove file ino in `StorageManager#cleanFile` ### Why are the changes needed? StorageManager#cleanFile should remove file info ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No need. Closes #2887 from reswqa/fix_clean_file. Authored-by: Weijie Guo <reswqa@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-11-07 22:50:02 +08:00
Weijie Guo	855de101c2	[CELEBORN-1691] Fix the issue that upstream tasks don't rerun and the current task still retry when failed to decompress in flink ### What changes were proposed in this pull request? Fix the issue that upstream tasks don't rerun and the current task still retry when failed to decompress in flink ### Why are the changes needed? Decompress error should retry upstream otherwise this is un-recoverable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually Closes #2884 from reswqa/fix_decompress. Authored-by: Weijie Guo <reswqa@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-11-07 22:43:26 +08:00
SteNicholas	54bbd72bd2	[CELEBORN-1697] Improve ThreadStackTrace for thread dump ### What changes were proposed in this pull request? Improve `ThreadStackTrace` with `synchronizers`, `monitors`, `lockName`, `lockOwnerName`, `suspended`, `inNative` for thread dump. ### Why are the changes needed? ThreadStackTrace does not support stack trace including `synchronizers`, `monitors`, `lockName`, `lockOwnerName`, `suspended`, `inNative` at present. It's recommend to improve `ThreadStackTrace` of thread dump for more details of thread stack trace. ### Does this PR introduce _any_ user-facing change? The response of `ThreadStack` in `/api/v1/thread_dump` adds `synchronizers`, `monitors`, `lockName`, `lockOwnerName`, `suspended`, `inNative` fields. Cherry pick: - https://github.com/apache/spark/pull/42575 - https://github.com/apache/spark/pull/43095 ### How was this patch tested? `ApiV1BaseResourceSuite#thread_dump` Closes #2888 from SteNicholas/CELEBORN-1697. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-07 20:36:58 +08:00
Wang, Fei	71e3c03a13	[MINOR] Fix docs typo ### What changes were proposed in this pull request? Fix docs typo. ### Why are the changes needed? Fix docs typo. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA. Closes #2890 from turboFei/nit. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-07 20:34:14 +08:00
Weijie Guo	4b8b125a1e	[CELEBORN-1692] Set mount point in fromPbFileInfoMap ### What changes were proposed in this pull request? Set mount point in fromPbFileInfoMap ### Why are the changes needed? Fix wrong fileInfo mountpint ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #2885 from reswqa/set-mount-point. Authored-by: Weijie Guo <reswqa@163.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-11-07 14:12:42 +08:00
Weijie Guo	4e7df13af7	[CELEBORN-1693] Fix storageFetcherPool concurrent problem ### What changes were proposed in this pull request? Fix storageFetcherPool concurrent problem. There may be duplicate thread pools created as multi-thread race condition. ![image](https://github.com/user-attachments/assets/ba4b0964-700e-4502-933a-b6c7cb93f32d) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No need. Closes #2886 from reswqa/storageFetcherPool. Authored-by: Weijie Guo <reswqa@163.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-11-07 13:51:39 +08:00
Wang, Fei	f1bda46de4	[CELEBORN-1680] Introduce ShuffleFallbackCount metrics ### What changes were proposed in this pull request? As title, introduce metrics_ShuffleFallbackCount_Value. ### Why are the changes needed? To provide the insights that how many shuffles fallback to spark built-in shuffle service. It is helpful for us to deprecate the ESS progressively. Currently, we plan to set the `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` to fallback the shuffle with too large shuffle partitions number, for example: 50k. In the future, we plan to limit the acceptable maximum shuffle partition number so that the bad job would be rejected and not impact the celeborn master health. ### Does this PR introduce _any_ user-facing change? Yes, new metrics. ### How was this patch tested? UT. <img width="1188" alt="image" src="https://github.com/user-attachments/assets/8193c12c-5dc9-4783-b64b-6a8449a1bea4"> Closes #2866 from turboFei/record_fallback. Lead-authored-by: Wang, Fei <fwang12@ebay.com> Co-authored-by: Fei Wang <cn.feiwang@gmail.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-07 11:42:17 +08:00
Weijie Guo	d44b23c852	[MINOR] Remove unused TODO comments in CelebornTierProducerAgent#processBuffer ### What changes were proposed in this pull request? Remove unused TODO comments in CelebornTierProducerAgent#processBuffer ### Why are the changes needed? In order for buffers to be packed together, we are going to modify the Flink side implementation to delegate buffer compression to tiers. But after discussion, we have been able to handle the case of receiving the compressed buffer on the Celeborn side, so this TODO is no longer needed. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No need. Closes #2883 from reswqa/remove_unused_todo. Authored-by: Weijie Guo <reswqa@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-11-07 10:58:58 +08:00
mingji	7dcd25925f	[CELEBORN-1671] CelebornShuffleReader will try replica if create client failed ### What changes were proposed in this pull request? 1. To bypass exceptions when creating clients failed in CelebornShuffleReader in spark 3. 2. Client will try the location's replicas in reading locations. ### Why are the changes needed? Allow clients to retry locations when creating clients failed. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? Pass GA. Closes #2854 from FMX/b1671. Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-11-06 11:14:11 +08:00
Weijie Guo	f2e9043028	[CELEBORN-1687] Highlight flink session cluster issue in doc ### What changes were proposed in this pull request? If we use celeborn shuffle service, we can't submit both batch and streaming to the same flink session cluster. This should be highlight in doc. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No need. Closes #2879 from reswqa/session-doc. Authored-by: Weijie Guo <reswqa@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-06 10:52:34 +08:00
szt	64f201dd83	[CELEBORN-1636][FOLLOWUP] Dynamic resources will only be utilized in case of candidates shortages ### What changes were proposed in this pull request? Follow up of [https://github.com/apache/celeborn/pull/2835] Only use dynamic resources when candidates are not enough. And change the way geting availableWorkers form heartbeat to requestSlots RPC to avoid the burden of heartbeat. ### Why are the changes needed? No ### Does this PR introduce _any_ user-facing change? Add another configuration. ### How was this patch tested? UT Closes #2852 from zaynt4606/clb1636-flu2. Authored-by: szt <zaynt4606@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-11-05 18:10:01 +08:00
SteNicholas	38156414a1	[CELEBORN-1678][FOLLOWUP] Improve Celeborn CLI user guide ### What changes were proposed in this pull request? Improve Celeborn CLI user guide including: - Add license of Celeborn CLI user guide. - Optimize the introduction of setup and usage for Celeborn CLI. - Optimize the navigation of Celeborn CLI to combine Celeborn Ratis Shell. ### Why are the changes needed? There is no license in Celeborn CLI user guide. Meanwhile, there are certain improvement in user guide including the license, navigation, and the introduction of setup and usage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #2875 from SteNicholas/CELEBORN-1678. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-05 15:53:28 +08:00
szt	ec67366b7a	[CELEBORN-1684] Fix ambiguous client jar expression of document ### What changes were proposed in this pull request? When users deploy using the release binary as outlined in the documentation, the instructions for copying the client JAR can be unclear. ### Why are the changes needed? No ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ![image](https://github.com/user-attachments/assets/a4e7c415-8f0e-44bd-8d18-18462896e27c) Closes #2877 from zaynt4606/md. Authored-by: szt <zaynt4606@163.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-11-05 13:48:22 +08:00
Wang, Fei	b5201df04c	[CELEBORN-1531][FOLLOWUP] Assign the scheduled check task ### What changes were proposed in this pull request? As title. ### Why are the changes needed? Followup for https://github.com/apache/celeborn/pull/2653 `checkForUnavailableWorkerTimeOutTask` and `checkForS3RemnantDirsTimeOutTask` are not assigned and always null. <img width="834" alt="image" src="https://github.com/user-attachments/assets/747a3054-87db-458f-acf8-876926bd1883"> Combine the `checkForHDFSRemnantDirsTimeOutTask` and `checkForS3RemnantDirsTimeOutTask` with `checkForDFSRemnantDirsTimeOutTask`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA. Closes #2871 from turboFei/1531_followup. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-11-05 11:18:33 +08:00
Yuxin Tan	7ebd168f80	[CELEBORN-1490][CIP-6] Support process large buffer in flink hybrid shuffle ### What changes were proposed in this pull request? This is the last PR in the CIP-6 series. Fix the bug when hybrid shuffle face the buffer which large then 32K. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #2873 from reswqa/11-large-buffer-10month. Lead-authored-by: Yuxin Tan <tanyuxinwork@gmail.com> Co-authored-by: Weijie Guo <reswqa@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-11-04 16:57:43 +08:00
Wang, Fei	d4044c5152	[CELEBORN-1682] Add java tools.jar into classpath for JVM quake ### What changes were proposed in this pull request? Add java tools.jar into classpath for JVM quake. ### Why are the changes needed? Meet below issue with `celeborn.worker.jvmQuake.enabled=true`, see https://github.com/apache/celeborn/pull/2061 ``` 24/11/03 15:51:08,453 ERROR [main] Worker: Initialize worker failed. java.lang.NoClassDefFoundError: sun/jvmstat/monitor/HostIdentifier at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.monitoredVm$lzycompute(JVMQuake.scala:180) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.monitoredVm(JVMQuake.scala:179) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.ygcExitTimeMonitor$lzycompute(JVMQuake.scala:185) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.ygcExitTimeMonitor(JVMQuake.scala:184) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.org$apache$celeborn$service$deploy$worker$monitor$JVMQuake$$getLastExitTime(JVMQuake.scala:192) at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake.start(JVMQuake.scala:66) at org.apache.celeborn.service.deploy.worker.Worker.<init>(Worker.scala:360) at org.apache.celeborn.service.deploy.worker.Worker$.main(Worker.scala:1041) at org.apache.celeborn.service.deploy.worker.Worker.main(Worker.scala) Caused by: java.lang.ClassNotFoundException: sun.jvmstat.monitor.HostIdentifier at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 9 more ``` Related code: `c12e8881ab/project/JDKTools.scala (L58-L75)` Similar issue: https://github.com/vladimirvivien/jmx-cli/issues/4 After copy the `tools.jar` into worker-jars, the issue got resolved. It is better that to involve the `tools.jar` automatically without copy. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? <img width="1202" alt="image" src="https://github.com/user-attachments/assets/af8f6c0d-9123-4a73-93b5-69836c5f826d"> Closes #2874 from turboFei/jdk_tools. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-11-04 11:05:10 +08:00
Weijie Guo	c12e8881ab	[CELEBORN-1490][CIP-6] Add Flink Hybrid Shuffle IT test cases ### What changes were proposed in this pull request? 1. Add Flink Hybrid Shuffle IT test cases 2. Fix bug in open stream. ### Why are the changes needed? Test coverage for celeborn + hybrid shuffle ### Does this PR introduce _any_ user-facing change? No Closes #2859 from reswqa/10-itcase-10month. Authored-by: Weijie Guo <reswqa@163.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-11-01 17:27:24 +08:00
Wang, Fei	e2f640ce3b	[CELEBORN-1660] Using map for workers to find worker fast ### What changes were proposed in this pull request? Using map for workers so that we can find a worker by uniqueId fast. ### Why are the changes needed? For large celeborn cluster, it might be slow. - updateWorkerHeartbeatMeta `1e77f01cd3/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java (L222)` - handleWorkerLost `1e77f01cd3/master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala (L762-L765)` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UT. Closes #2870 from turboFei/worksMap. Lead-authored-by: Wang, Fei <fwang12@ebay.com> Co-authored-by: Fei Wang <cn.feiwang@gmail.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-01 15:58:53 +08:00
Wang, Fei	2b026a35fc	[CELEBORN-1564][FOLLOWUP] Remove unused variables ### What changes were proposed in this pull request? Remove unused variables. ### Why are the changes needed? Followup for https://github.com/apache/celeborn/pull/2688 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA. Closes #2872 from turboFei/1564_followup. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-01 15:43:02 +08:00
Wang, Fei	7dc72b35e7	[CELEBORN-1477][FOLLOWUP] Remove scala binary version from openapi-client artifactId ### What changes were proposed in this pull request? 1. remove scala binary version from the openapi-client artifactId. 2. skip openapi-client doc compile, it was missed in https://github.com/apache/celeborn/pull/2641 ### Why are the changes needed? Because the openapi-client is a pure java module. ### Does this PR introduce _any_ user-facing change? No, it has not been released. ### How was this patch tested? GA. Closes #2861 from turboFei/remove_Scala. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-01 15:08:00 +08:00
Weijie Guo	41fdb8ade1	[CELEBORN-1490][CIP-6] Add Flink hybrid shuffle doc ### What changes were proposed in this pull request? Add Flink hybrid shuffle doc ### Why are the changes needed? We need the doc for the new hybrid shuffle mode. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? no neeed. Closes #2867 from reswqa/add-hs-doc. Authored-by: Weijie Guo <reswqa@163.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-11-01 13:37:14 +08:00
SteNicholas	165e914b9b	[CELEBORN-1672] Bump Spark from 3.4.3 to 3.4.4 ### What changes were proposed in this pull request? Bump Spark from 3.4.3 to 3.4.4. ### Why are the changes needed? Spark 3.4.4 has been announced to release: [Spark 3.4.4 released](https://spark.apache.org/news/spark-3-4-4-released.html). The profile spark-3.4 could bump Spark from 3.4.3 to 3.4.4. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #2851 from SteNicholas/CELEBORN-1672. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-11-01 11:05:00 +08:00
Shuang	14baec8388	[CELEBORN-1673] Support retry create client ### What changes were proposed in this pull request? As title ### Why are the changes needed? Currently, only Flink retries establishing a client when a connection problem occurs. This would be beneficial for all other engines to implement as well. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #2855 from RexXiong/CELEBORN-1673. Lead-authored-by: Shuang <lvshuang.xjs@alibaba-inc.com> Co-authored-by: lvshuang.xjs <lvshuang.xjs@taobao.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-10-31 14:45:48 +08:00
Aravind Patnam	12f25d3d0f	[CELEBORN-1678] Add Celeborn CLI User guide in README ### What changes were proposed in this pull request? adding user guide to README for cli ### Why are the changes needed? better user experience when using CLI. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #2862 from akpatnam25/CELEBORN-1678. Authored-by: Aravind Patnam <akpatnam25@gmail.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-10-30 19:58:34 +08:00
SteNicholas	2dc8077cea	[CELEBORN-1674] Fix reader thread name of MapPartitionData ### What changes were proposed in this pull request? Fix reader thread name of `MapPartitionData` which contains `null`. ### Why are the changes needed? The reader thread name of `MapPartitionData` has null at present, which is caused by `MapFileMeta#getMountPoint` that returns null. The reader thread name of `MapPartitionData` is as follows: ``` celebornjscs-bigdata-rss-worker:/data/service/celeborn$ jstack 65\|grep reader-thread "null-reader-thread-7" #798 prio=5 os_prio=0 tid=0x00007ef03bca8000 nid=0x47f waiting on condition [0x00007eef068cb000] "null-reader-thread-7" #799 prio=5 os_prio=0 tid=0x00007ef03a097000 nid=0x47e waiting on condition [0x00007eef069cc000] "null-reader-thread-5" #796 prio=5 os_prio=0 tid=0x00007ef03a818000 nid=0x47d waiting on condition [0x00007eef06acd000] "null-reader-thread-6" #797 prio=5 os_prio=0 tid=0x00007ef03b896800 nid=0x47c waiting on condition [0x00007eef06bce000] "null-reader-thread-4" #793 prio=5 os_prio=0 tid=0x00007ef03ac6b000 nid=0x47b waiting on condition [0x00007eef06ccf000] "null-reader-thread-6" #794 prio=5 os_prio=0 tid=0x00007ef05829e800 nid=0x47a waiting on condition [0x00007eef06dd0000] "null-reader-thread-7" #795 prio=5 os_prio=0 tid=0x00007ef03b06b800 nid=0x479 waiting on condition [0x00007eef06ed1000] "null-reader-thread-3" #789 prio=5 os_prio=0 tid=0x00007ef03a095000 nid=0x478 waiting on condition [0x00007eef06fd2000] "null-reader-thread-3" #790 prio=5 os_prio=0 tid=0x00007ef03a817000 nid=0x477 waiting on condition [0x00007eef070d3000] "null-reader-thread-4" #791 prio=5 os_prio=0 tid=0x00007ef03b895000 nid=0x476 waiting on condition [0x00007eef071d4000] "null-reader-thread-5" #792 prio=5 os_prio=0 tid=0x00007ef03b06a800 nid=0x475 waiting on condition [0x00007eef072d5000] "null-reader-thread-4" #786 prio=5 os_prio=0 tid=0x00007ef03d06b800 nid=0x474 waiting on condition [0x00007eef073d6000] "null-reader-thread-5" #787 prio=5 os_prio=0 tid=0x00007ef03bca8800 nid=0x473 waiting on condition [0x00007eef074d7000] "null-reader-thread-3" #785 prio=5 os_prio=0 tid=0x00007ef03c884800 nid=0x472 waiting on condition [0x00007eef075d8000] "null-reader-thread-6" #788 prio=5 os_prio=0 tid=0x00007ef03cc6b800 nid=0x471 waiting on condition [0x00007eef076d9000] "null-reader-thread-2" #783 prio=5 os_prio=0 tid=0x00007ef03c06a000 nid=0x470 waiting on condition [0x00007eef077da000] "null-reader-thread-2" #784 prio=5 os_prio=0 tid=0x00007ef05829d000 nid=0x46f waiting on condition [0x00007eef078db000] "null-reader-thread-2" #782 prio=5 os_prio=0 tid=0x00007ef03a815800 nid=0x46e waiting on condition [0x00007eef079dc000] "null-reader-thread-1" #781 prio=5 os_prio=0 tid=0x00007ef01d852000 nid=0x46d waiting on condition [0x00007eef07add000] "null-reader-thread-1" #780 prio=5 os_prio=0 tid=0x00007ef03a815000 nid=0x46c waiting on condition [0x00007eef07bde000] "null-reader-thread-1" #779 prio=5 os_prio=0 tid=0x00007ef03ac6c800 nid=0x46b waiting on condition [0x00007eef07cdf000] "null-reader-thread-0" #777 prio=5 os_prio=0 tid=0x00007ef03d06a800 nid=0x46a waiting on condition [0x00007eef07de0000] "null-reader-thread-0" #778 prio=5 os_prio=0 tid=0x00007ef03ac6b800 nid=0x469 waiting on condition [0x00007eef07ee1000] "null-reader-thread-0" #776 prio=5 os_prio=0 tid=0x00007ef03a095800 nid=0x468 waiting on condition [0x00007eef07fe2000] ``` ``` [ERROR][null-reader-thread-6] - org.apache.celeborn.service.deploy.worker.storage.MapPartitionData -MapPartitionData.java(205) -reader exception, reader: DataPartitionReader {startPartitionIndex=834, endPartitionIndex=834, streamId=1774189696911} , message: Partition reader has been failed or finished. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA. Closes #2853 from SteNicholas/CELEBORN-1674. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-10-30 18:18:06 +08:00
Sanskar Modi	752a0d9459	[CELEBORN-1516][FOLLOWUP] Support reset method for DynamicConfigServiceFactory ### What changes were proposed in this pull request? - Added a reset method for DynamicConfigServiceFactory - Cleaned up QuotaManagerSuite ### Why are the changes needed? Without this change we can not initialize new configService in any other tests. Ex: test for this PR https://github.com/apache/celeborn/pull/2844 are failing because of this issue. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? NA Closes #2848 from s0nskar/fix_quotatest. Authored-by: Sanskar Modi <sanskarmodi97@gmail.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-10-30 17:18:31 +08:00
Xianming Lei	6ad02f14a9	[CELEBORN-1577][PHASE1] Storage quota should support interrupt shuffle ### What changes were proposed in this pull request? Support interrupt shuffle on client side. I will develop the following functions in order 1. Client supports interrupt shuffle 2. Master supports calculating app-level shuffle usage ### Why are the changes needed? The current storage quota logic can only limit new shuffles, and cannot limit the writing of existing shuffles. In our production environment, there is such an scenario: the cluster is small, but the user's app single shuffle is large which occupied disk resources, we want to interrupt those shuffle. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unable to test this part independently, Additional tests will be added after completing the second part. Closes #2801 from leixm/CELEBORN-1577-1. Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-10-30 16:28:09 +08:00
Sanskar Modi	2c996133b9	[CELEBORN-1444][FOLLOWUP] Add IsDecommissioningWorker to celeborn dashboard ### What changes were proposed in this pull request? Adding IsDecommissioningWorker metric to celeborn dashboard ### Why are the changes needed? Metric was missing from dashboard ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? Tested in local grafana setup <img width="755" alt="Screenshot 2024-10-21 at 5 19 55 PM" src="https://github.com/user-attachments/assets/7c0a2517-32a8-4565-81d8-a056d3708ac8"> Closes #2836 from s0nskar/decommision_metric. Authored-by: Sanskar Modi <sanskarmodi97@gmail.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-10-30 09:55:43 +08:00
Weijie Guo	f4dc7a839b	[CELEBORN-1490][CIP-6] Impl worker read process in Flink Hybrid Shuffle ### What changes were proposed in this pull request? Impl worker read process in Flink Hybrid Shuffle ### Does this PR introduce _any_ user-facing change? No Closes #2820 from reswqa/cip6-8-pr. Lead-authored-by: Weijie Guo <reswqa@163.com> Co-authored-by: codenohup <huangxu.walker@gmail.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-10-29 16:29:16 +08:00
Fu Chen	39c185afd1	[CELEBORN-1677][BUILD] Update SCM information for SBT build configuration ### What changes were proposed in this pull request? As title ### Why are the changes needed? This PR addresses a conflict in the sbt generated POM by replacing `pomExtra` with `scmInfo` ```diff <name>org.apache.celeborn</name> </organization> <scm> - <url>https://github.com/cfmcgrady/incubator-celeborn</url> - <connection>scmhttps://github.com/cfmcgrady/incubator-celeborn.git</connection> - <developerConnection>scmgitgithub.com:cfmcgrady/incubator-celeborn.git</developerConnection> - </scm> - <url>https://celeborn.apache.org/</url> - <scm> - <url>gitgithub.com:apache/celeborn.git</url> - <connection>scmgitgithub.com:apache/celeborn.git</connection> + <url>https://celeborn.apache.org/</url> + <connection>scmhttps://github.com/apache/celeborn.git</connection> + <developerConnection>scmgitgithub.com:apache/celeborn.git</developerConnection> </scm> ``` The conflicting POM might block publishing to a private Maven repository. ``` [error] Caused by: java.io.IOException: Server returned HTTP response code: 409 for URL: https://artifactory.devops.xxx.com/artifactory/maven-snapshots/org/apache/celeborn/celeborn-client-spark-3-shaded_2.12/0.6.0-SNAPSHOT/celeborn-client-spark-3-shaded_2.12-0.6.0-SNAPSHOT.pom [error] at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:2000) [error] at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589) [error] at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529) [error] at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:308) [error] at org.apache.ivy.util.url.BasicURLHandler.upload(BasicURLHandler.java:284) [error] at org.apache.ivy.util.url.URLHandlerDispatcher.upload(URLHandlerDispatcher.java:82) [error] at org.apache.ivy.util.FileUtil.copy(FileUtil.java:150) [error] at org.apache.ivy.plugins.repository.url.URLRepository.put(URLRepository.java:84) [error] at sbt.internal.librarymanagement.ConvertResolver$LocalIfFileRepo.put(ConvertResolver.scala:407) [error] at org.apache.ivy.plugins.repository.AbstractRepository.put(AbstractRepository.java:130) [error] at sbt.internal.librarymanagement.ConvertResolver$ChecksumFriendlyURLResolver.put(ConvertResolver.scala:124) [error] at sbt.internal.librarymanagement.ConvertResolver$ChecksumFriendlyURLResolver.put$(ConvertResolver.scala:111) [error] at sbt.internal.librarymanagement.ConvertResolver$$anonfun$defaultConvert$lzycompute$1$PluginCapableResolver$1.put(ConvertResolver.scala:170) [error] at org.apache.ivy.plugins.resolver.RepositoryResolver.publish(RepositoryResolver.java:216) [error] at sbt.internal.librarymanagement.IvyActions$.$anonfun$publish$5(IvyActions.scala:501) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? local Closes #2858 from cfmcgrady/sbt-scm. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Fu Chen <cfmcgrady@gmail.com>	2024-10-29 14:04:58 +08:00
Sanskar Modi	e51b0c4f86	[CELEBORN-1642][CIP-11] Support multiple worker tags ### What changes were proposed in this pull request? Current TagsManager code only supported one tags for selecting tagged workers. This change will enable support of passing multiple tags to TagsManager. Multiple tags will be evaluated as "AND" expression i.e only workers tagged with all the passed tags will be selected. Support for more schemes will be added in follow up PRs. ### Why are the changes needed? https://cwiki.apache.org/confluence/display/CELEBORN/CIP-11+Supporting+Tags+in+Celeborn ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UTs Closes #2850 from s0nskar/CELEBORN-1642. Authored-by: Sanskar Modi <sanskarmodi97@gmail.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-10-28 10:18:43 +08:00
szt	7685fa7db2	[CELEBORN-1636] Client supports dynamic update of Worker resources on the server ### What changes were proposed in this pull request? Currently, the ChangePartitionManager retrieves workers from the LifeCycleManager's workerSnapshot. However, during the revival process in reallocateChangePartitionRequestSlotsFromCandidates, it does not account for newly added available workers resulting from elastic contraction and expansion. This PR addresses this issue by updating the candidate workers in the ChangePartitionManager to use the available workers reported in the heartbeat from the master. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #2835 from zaynt4606/clbdev. Authored-by: szt <zaynt4606@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-10-28 09:49:31 +08:00
avishnus	59029a0967	[CELEBORN-1649] Bumping up maven to 3.9.9 ### What changes were proposed in this pull request? Bumping up maven version to 3.9.9 ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #2834 from avishnus/maven. Authored-by: avishnus <avishnus@visa.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-10-25 16:20:32 +08:00
mingji	e1bebb9e5b	[CELEBORN-1668] Fix NPE when handle closed file writers ### What changes were proposed in this pull request? To fix an NPE when handling the closed file writers. ### Why are the changes needed? If a file writer stores its shuffle data in memory, the disk file info object will be null, causing NPE. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? GA. Closes #2846 from FMX/b1688. Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-10-24 17:50:16 +08:00
SteNicholas	4b150be0c8	[CELEBORN-1669] Fix NullPointerException for PartitionFilesSorter#updateSortedShuffleFiles after cleaning up expired shuffle key ### What changes were proposed in this pull request? Fix `NullPointerException` for `PartitionFilesSorter#updateSortedShuffleFiles` after cleaning up expired shuffle key. ### Why are the changes needed? `PartitionFilesSorter` sorts shuffle files in `worker-file-sorter-executor` thread and cleans up expired key in `worker-expired-shuffle-cleaner` thread. There is a case that after `worker-expired-shuffle-cleaner` cleaning up expired shuffle key, `worker-file-sorter-executor` updates sorted shuffle files, which causes `NullPointerException` at present. ``` 2024-10-23 17:26:17,162 [INFO] [worker-expired-shuffle-cleaner] - org.apache.celeborn.service.deploy.worker.Worker -Logging.scala(51) -Cleaned up expired shuffle application_1724141892576_3843182_1-0 2024-10-23 17:26:17,392 [ERROR] [worker-file-sorter-executor-237572] - org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter -PartitionFilesSorter.java(752) -Sorting shuffle file for application_1724141892576_3843182_1-0-1875-0-0 /mnt/storage02/celeborn-worker/shuffle_data/application_1724141892576_3843182_1/0/1875-0-0 failed, detail: java.lang.NullPointerException: null at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter.updateSortedShuffleFiles(PartitionFilesSorter.java:455) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT] at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter$FileSorter.sort(PartitionFilesSorter.java:747) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT] at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter.lambda$new$1(PartitionFilesSorter.java:164) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_162] at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_162] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_162] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_162] at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_162] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA. Closes #2847 from SteNicholas/CELEBORN-1669. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-10-24 17:47:25 +08:00
Wang, Fei	216152d038	[CELEBORN-1632] Support to apply ratis local raft_meta_conf command with RESTful api ### What changes were proposed in this pull request? Sub-task of CELEBORN-1628. Support to apply ratis local raft_meta_conf with RESTful api. See https://celeborn.apache.org/docs/latest/celeborn_ratis_shell/#local-raftmetaconf ``` $ celeborn-ratis sh local raftMetaConf -peers <[P0_ID\|]P0_HOST:P0_PORT,[P1_ID\|]P1_HOST:P1_PORT,[P2_ID\|]P2_HOST:P2_PORT> -path <PARENT_PATH_OF_RAFT_META_CONF> ``` The implementation is same with `e96ed1a338/ratis-shell/src/main/java/org/apache/ratis/shell/cli/sh/local/RaftMetaConfCommand.java (L122-L133)` ### Why are the changes needed? We have implemented the RESTful implementation for all the others ratis-shell command. <img width="1219" alt="image" src="https://github.com/user-attachments/assets/4367ddbd-3c55-449a-a1bc-75d6c18e8918"> \| Ratis Shell \| RESTful api \| \|----------------------\|---------------------------------\| \| election transfer \| `/ratis/election/transfer` \| \| election stepDown \| `/ratis/election/step_down` \| \| election pause \| `/ratis/election/pause` \| \| election resume \| `/ratis/election/resume` \| \| group info \| `/masters` \| \| peer add \| `/ratis/peer/add` \| \| peer remove \| `/ratis/peer/remove` \| \| peer setPriority \| `/ratis/peer/set_priority` \| \| snapshot create \| `/ratis/snapshot/create` \| And the local raftMetaConf command is the last one. I closed the ticket CELEBORN-1632 before, I thought it is a local command and wonder whether it is necessary to implement it with RESTful api. But we have implemented all the others, so I decide to implement it as well. ### Does this PR introduce _any_ user-facing change? A new API. The implementation is same with `e96ed1a338/ratis-shell/src/main/java/org/apache/ratis/shell/cli/sh/local/RaftMetaConfCommand.java (L122-L133)` ### How was this patch tested? ![image](https://github.com/user-attachments/assets/088d8523-e5f5-4546-9159-e12191fd8a29) ![image](https://github.com/user-attachments/assets/ce9c4284-fd61-45de-93e7-d38e3b6afac9) <img width="960" alt="image" src="https://github.com/user-attachments/assets/b302a680-baea-4709-b77f-a2b1946b8dff"> <img width="1471" alt="image" src="https://github.com/user-attachments/assets/4bf090ba-c6f4-4f49-aa57-8dd2c897ff30"> <img width="871" alt="image" src="https://github.com/user-attachments/assets/9959072c-5e96-48f5-911e-546c05a0c443"> Closes #2829 from turboFei/local_raft_conf. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-10-24 16:09:18 +08:00
zhangzhao.08	23113898f6	[CELEBORN-1667] Fix NPE & LEAK occurring prior to worker registration ### What changes were proposed in this pull request? As title ### Why are the changes needed? This PR addressed the memory leak problem of worker nodes before registration and the NPE issue that occurs when PushDataHandler is accessed during the initialization process. ![image](https://github.com/user-attachments/assets/993f9e9c-fb84-4b71-a77f-6c043cda4864) ![image](https://github.com/user-attachments/assets/25545bbf-e838-44b2-88fe-3fe2dada0524) ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Pass GA Closes #2843 from zhaostu4/zhao/worker_npe. Authored-by: zhangzhao.08 <zhangzhao.08@bytedance.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-10-24 15:51:29 +08:00
SteNicholas	464a7c71a9	[CELEBORN-1651] Support ratio threshold of unhealthy disks for excluding worker ### What changes were proposed in this pull request? Support ratio threshold of unhealthy disks for excluding worker with `celeborn.master.excludeWorker.unhealthyDiskRatioThreshold`. ### Why are the changes needed? We often encounter issues such as disk input/output errors in production practice. When a bad disk occurs, the worker will be maintained to decommission for repairing the machine disk. The reason is that generally the fault will be repaired in time after it is discovered. It is possible that the machine will not trigger all disk failures if it is out of warranty. It can be replaced directly when it is under warranty. If the disk fails after it is out of warranty, you need to purchase the disk yourself for replacement. At the same time, submitting the disk for repair at one time will affect the failure rate judgment of the system group and scenario. In addition, the occurrence of bad disks will bring about some management problems, such as continuous alarms, and the handling of disk failures is relatively customized. Therefore, it's recommended to configure ratio threshold of unhealthy disks for excluding worker, which does not need to wait for all unhealthy disks to exclude corresponding worker. ### Does this PR introduce _any_ user-facing change? Introduce `celeborn.master.excludeWorker.unhealthyDiskRatioThreshold` to configure max ratio of unhealthy disks for excluding worker. ### How was this patch tested? Cluster test. Closes #2812 from SteNicholas/CELEBORN-1651. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-10-24 11:43:22 +08:00
Fu Chen	3b9c2f04e7	[CELEBORN-1666] Bump scala-protoc from 1.0.6 to 1.0.7 ### What changes were proposed in this pull request? As title ### Why are the changes needed? The version 1.0.6 is outdated and not available on Maven Central. https://mvnrepository.com/artifact/com.thesamet/sbt-protoc_2.12_1.0 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass CI Closes #2842 from cfmcgrady/sbt-protoc. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Fu Chen <cfmcgrady@gmail.com>	2024-10-24 11:16:37 +08:00
jiang13021	7018996e24	[MINOR] Fix typo in ExceptionUtils ### What changes were proposed in this pull request? Fix typo. ### Why are the changes needed? The error message was changed in [this pull request](https://github.com/apache/celeborn/pull/1097), but the connectFail method in org.apache.celeborn.common.util.ExceptionUtils has not been updated. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No need. Closes #2841 from jiang13021/minor-cause-typo. Authored-by: jiang13021 <jiangyanze.jyz@antgroup.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-10-23 16:08:16 +08:00
YutingWang98	5b2030037a	[CELEBORN-1664] Fix secret fetch failures after LEADER master failover ### What changes were proposed in this pull request? Fix a bug related to auth under master HA mode which would cause app failures when leader master restarts. Also, remove the secrets from memory after app lost. Previous implementation add the registration & secret info in leader Master's memory, and push to other masters though https://github.com/apache/celeborn/pull/2346. After leader restarts, the info will only be in Ratis (AbstractMetaManager), however app still fetch it from new leader's memory, and would fail to get it. Fix this by checking AbstractMetaManager's registration info if not found in memory, and properly authorize the app. ### Why are the changes needed? When auth enabled, and leader master restart, there will be "Registration information not found" error on app side, and failed to send heartbeat to master. It will cause app to be removed on server side after heartbeat timeout, causing job to fail. ``` 24/10/14 01:56:55 ERROR [celeborn-netty-rpc-connection-executor-3] client.TransportClientFactory: Exception while bootstrapping client after 71.4 ms java.lang.RuntimeException: java.io.IOException: Exception in sendRpcSync to: celeborn-moka-test-manager-3/{ip}:9097 at org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:110) at org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doSaslBootstrap(RegistrationClientBootstrap.java:228) at org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doBootstrap(RegistrationClientBootstrap.java:103) at org.apache.celeborn.common.network.client.TransportClientFactory.internalCreateClient(TransportClientFactory.java:307) at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:205) at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:133) at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:212) at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:232) at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194) at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: java.io.IOException: Exception in sendRpcSync to: celeborn-moka-test-manager-3/{ip}:9097 at org.apache.celeborn.common.network.client.TransportClient.sendRpcSync(TransportClient.java:324) at org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:95) ... 13 more Caused by: java.util.concurrent.ExecutionException: java.io.IOException: java.lang.RuntimeException: Registration information not found for spark-402a80be70f74455b01 at org.apache.celeborn.common.network.sasl.CelebornSaslServer$DigestCallbackHandler.handle(CelebornSaslServer.java:142) at java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589) at java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.celeborn.common.network.sasl.CelebornSaslServer.response(CelebornSaslServer.java:84) at org.apache.celeborn.common.network.sasl.SaslRpcHandler.doAuthChallenge(SaslRpcHandler.java:99) at org.apache.celeborn.common.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:58) at org.apache.celeborn.common.network.sasl.registration.RegistrationRpcHandler.processRpcMessage(RegistrationRpcHandler.java:175) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested on dev cluster and job can properly get the secrets after master failover Closes #2826 from YutingWang98/fix_auth_master_ha. Authored-by: YutingWang98 <69848459+YutingWang98@users.noreply.github.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>	2024-10-22 14:10:08 -05:00
SteNicholas	06bd39b768	[CELEBORN-1665] CommitHandler should process CommitFilesResponse with COMMIT_FILE_EXCEPTION status ### What changes were proposed in this pull request? `CommitHandler` should process `CommitFilesResponse` with `COMMIT_FILE_EXCEPTION` status. ### Why are the changes needed? `CommitHandler` processes `CommitFilesResponse` with statuses including `SUCCESS`, `PARTIAL_SUCCESS`, `SHUFFLE_NOT_REGISTERED`, `REQUEST_FAILED` and `WORKER_EXCLUDED` at present. Meanwhile, Controller replies `CommitFilesResponse` with `COMMIT_FILE_EXCEPTION` status for throwable. Therefore, `CommitHandler` should process `COMMIT_FILE_EXCEPTION` status. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #2838 from SteNicholas/CELEBORN-1665. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-10-22 17:49:52 +08:00
Sanskar Modi	1e77f01cd3	[CELEBORN-1663][FOLLOWUP] Only register appShuffleDeterminate if stage using celeborn for shuffle ### What changes were proposed in this pull request? Making the same changes for Spark2 codebase ### Why are the changes needed? Followup for https://github.com/apache/celeborn/pull/2832 ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? Existing UTs Closes #2837 from s0nskar/fix_register_spark2. Authored-by: Sanskar Modi <sanskarmodi97@gmail.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-10-22 14:28:12 +08:00

1 2 3 4 5 ...

2032 Commits