Commit Graph

2020 Commits

Author SHA1 Message Date
Weijie Guo
4e7df13af7
[CELEBORN-1693] Fix storageFetcherPool concurrent problem
### What changes were proposed in this pull request?

Fix a concurrency problem in storageFetcherPool.

A race between multiple threads could create duplicate thread pools.
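
One common way to rule out this kind of duplicate creation is to guard lazy initialization with double-checked locking. The sketch below is illustrative only: `FetcherPoolHolder`, `getOrCreate`, and the pool size are hypothetical names and values, not the code changed by this PR.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical holder class; names and sizes are assumptions for illustration.
final class FetcherPoolHolder {
  // volatile so that a fully constructed pool is visible to all threads.
  private static volatile ExecutorService pool;

  private FetcherPoolHolder() {}

  // Double-checked locking: at most one thread ever creates the pool.
  static ExecutorService getOrCreate(int threads) {
    ExecutorService local = pool;
    if (local == null) {
      synchronized (FetcherPoolHolder.class) {
        local = pool;
        if (local == null) {
          local = Executors.newFixedThreadPool(threads);
          pool = local;
        }
      }
    }
    return local;
  }

  public static void main(String[] args) {
    ExecutorService first = getOrCreate(4);
    ExecutorService second = getOrCreate(4);
    // Both calls return the same instance.
    System.out.println(first == second);
    first.shutdown();
  }
}
```

Without the second null check inside the `synchronized` block, two threads that both observed `null` could each create a pool, which is the duplicate-pool symptom described above.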

![image](https://github.com/user-attachments/assets/ba4b0964-700e-4502-933a-b6c7cb93f32d)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No need.

Closes #2886 from reswqa/storageFetcherPool.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-07 13:51:39 +08:00
Wang, Fei
f1bda46de4 [CELEBORN-1680] Introduce ShuffleFallbackCount metrics
### What changes were proposed in this pull request?

As title, introduce metrics_ShuffleFallbackCount_Value.

### Why are the changes needed?
To provide insight into how many shuffles fall back to Spark's built-in shuffle service. This helps us deprecate the ESS progressively.

Currently, we plan to set `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` so that shuffles with too many partitions, for example 50k, fall back.

In the future, we plan to limit the maximum acceptable number of shuffle partitions so that bad jobs are rejected and do not impact the health of the Celeborn master.
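
As a rough sketch of the idea, a threshold check can increment a counter each time a shuffle falls back. Everything here is hypothetical: the class and method names are invented, and the strict `>` comparison is an assumption, not a statement about how Celeborn evaluates the threshold.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; not Celeborn's fallback policy implementation.
final class FallbackPolicySketch {
  private final int numPartitionsThreshold;
  private final AtomicLong shuffleFallbackCount = new AtomicLong();

  FallbackPolicySketch(int numPartitionsThreshold) {
    this.numPartitionsThreshold = numPartitionsThreshold;
  }

  // Returns true when the shuffle should use the engine's built-in shuffle.
  // Assumption: fall back only when the partition count exceeds the threshold.
  boolean shouldFallback(int numPartitions) {
    boolean fallback = numPartitions > numPartitionsThreshold;
    if (fallback) {
      shuffleFallbackCount.incrementAndGet();
    }
    return fallback;
  }

  // Value that a metrics gauge could expose.
  long fallbackCount() {
    return shuffleFallbackCount.get();
  }

  public static void main(String[] args) {
    FallbackPolicySketch policy = new FallbackPolicySketch(50_000);
    policy.shouldFallback(1_000);
    policy.shouldFallback(60_000);
    System.out.println("fallbacks: " + policy.fallbackCount());
  }
}
```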

### Does this PR introduce _any_ user-facing change?
Yes, new metrics.

### How was this patch tested?
UT.
<img width="1188" alt="image" src="https://github.com/user-attachments/assets/8193c12c-5dc9-4783-b64b-6a8449a1bea4">

Closes #2866 from turboFei/record_fallback.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-07 11:42:17 +08:00
Weijie Guo
d44b23c852 [MINOR] Remove unused TODO comments in CelebornTierProducerAgent#processBuffer
### What changes were proposed in this pull request?
Remove unused TODO comments in CelebornTierProducerAgent#processBuffer

### Why are the changes needed?
To allow buffers to be packed together, we were going to modify the Flink-side implementation to delegate buffer compression to tiers. After discussion, however, the Celeborn side can already handle receiving compressed buffers, so this TODO is no longer needed.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No need.

Closes #2883 from reswqa/remove_unused_todo.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-07 10:58:58 +08:00
mingji
7dcd25925f [CELEBORN-1671] CelebornShuffleReader will try replica if create client failed
### What changes were proposed in this pull request?
1. Bypass exceptions when client creation fails in CelebornShuffleReader for Spark 3.
2. The client will try the location's replicas when reading locations.

### Why are the changes needed?
Allow clients to retry with other locations when client creation fails.
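
The fallback behaviour can be sketched as trying each candidate location in order and keeping the first client that can be created. This is a generic illustration under stated assumptions: `ReplicaFallbackSketch`, `firstReachable`, and the string-based "client" are hypothetical and do not mirror the CelebornShuffleReader code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper; names and types are assumptions for illustration.
final class ReplicaFallbackSketch {
  // Tries each candidate in order and returns the first client that can be created.
  static <L, C> Optional<C> firstReachable(List<L> candidates, Function<L, C> createClient) {
    for (L location : candidates) {
      try {
        return Optional.of(createClient.apply(location));
      } catch (RuntimeException e) {
        // Creation failed for this location; move on to the next candidate.
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<String> candidates = Arrays.asList("primary:9097", "replica:9097");
    Optional<String> client = firstReachable(candidates, loc -> {
      if (loc.startsWith("primary")) {
        throw new IllegalStateException("cannot connect to " + loc);
      }
      return "client->" + loc;
    });
    System.out.println(client.orElse("none"));
  }
}
```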

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Pass GA.

Closes #2854 from FMX/b1671.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-06 11:14:11 +08:00
Weijie Guo
f2e9043028 [CELEBORN-1687] Highlight flink session cluster issue in doc
### What changes were proposed in this pull request?

If we use the Celeborn shuffle service, we can't submit both batch and streaming jobs to the same Flink session cluster. This should be highlighted in the doc.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No need.

Closes #2879 from reswqa/session-doc.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-06 10:52:34 +08:00
szt
64f201dd83 [CELEBORN-1636][FOLLOWUP] Dynamic resources will only be utilized in case of candidates shortages
### What changes were proposed in this pull request?
Follow up of [https://github.com/apache/celeborn/pull/2835]
Only use dynamic resources when there are not enough candidates.
Also change how availableWorkers are obtained, from the heartbeat to the requestSlots RPC, to avoid burdening the heartbeat.
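
A minimal sketch of the "only on shortage" rule, assuming workers are identified by plain strings: the snapshot candidates are used as-is when there are enough of them, and the dynamically reported workers are merged in only otherwise. `CandidateSelectionSketch` and `selectCandidates` are hypothetical names, not the ChangePartitionManager API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the actual candidate selection code.
final class CandidateSelectionSketch {
  static List<String> selectCandidates(
      List<String> snapshotWorkers, List<String> dynamicWorkers, int required) {
    if (snapshotWorkers.size() >= required) {
      // Enough candidates already: dynamic resources are not consulted.
      return new ArrayList<>(snapshotWorkers);
    }
    // Shortage: extend the candidates with dynamic workers, without duplicates.
    Set<String> merged = new LinkedHashSet<>(snapshotWorkers);
    merged.addAll(dynamicWorkers);
    return new ArrayList<>(merged);
  }

  public static void main(String[] args) {
    List<String> snapshot = Arrays.asList("w1", "w2");
    List<String> dynamic = Arrays.asList("w2", "w3");
    System.out.println(selectCandidates(snapshot, dynamic, 2));
    System.out.println(selectCandidates(snapshot, dynamic, 3));
  }
}
```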

### Why are the changes needed?
No

### Does this PR introduce _any_ user-facing change?
Add another configuration.

### How was this patch tested?
UT

Closes #2852 from zaynt4606/clb1636-flu2.

Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-05 18:10:01 +08:00
SteNicholas
38156414a1 [CELEBORN-1678][FOLLOWUP] Improve Celeborn CLI user guide
### What changes were proposed in this pull request?

Improve Celeborn CLI user guide including:

- Add license of Celeborn CLI user guide.
- Optimize the introduction of setup and usage for Celeborn CLI.
- Optimize the navigation of Celeborn CLI to combine Celeborn Ratis Shell.

### Why are the changes needed?

The Celeborn CLI user guide has no license header. In addition, the user guide can be improved in its navigation and in its introduction of setup and usage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2875 from SteNicholas/CELEBORN-1678.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-05 15:53:28 +08:00
szt
ec67366b7a
[CELEBORN-1684] Fix ambiguous client jar expression of document
### What changes were proposed in this pull request?
When users deploy using the release binary as outlined in the documentation, the instructions for copying the client JAR can be unclear.

### Why are the changes needed?
No

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
![image](https://github.com/user-attachments/assets/a4e7c415-8f0e-44bd-8d18-18462896e27c)

Closes #2877 from zaynt4606/md.

Authored-by: szt <zaynt4606@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-05 13:48:22 +08:00
Wang, Fei
b5201df04c [CELEBORN-1531][FOLLOWUP] Assign the scheduled check task
### What changes were proposed in this pull request?

As title.

### Why are the changes needed?
Followup for https://github.com/apache/celeborn/pull/2653

`checkForUnavailableWorkerTimeOutTask` and `checkForS3RemnantDirsTimeOutTask` are not assigned and always null.
<img width="834" alt="image" src="https://github.com/user-attachments/assets/747a3054-87db-458f-acf8-876926bd1883">

Combine the `checkForHDFSRemnantDirsTimeOutTask` and `checkForS3RemnantDirsTimeOutTask` with `checkForDFSRemnantDirsTimeOutTask`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
GA.

Closes #2871 from turboFei/1531_followup.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-05 11:18:33 +08:00
Yuxin Tan
7ebd168f80 [CELEBORN-1490][CIP-6] Support process large buffer in flink hybrid shuffle
### What changes were proposed in this pull request?

This is the last PR in the CIP-6 series.

Fix the bug that occurs when hybrid shuffle encounters a buffer larger than 32K.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #2873 from reswqa/11-large-buffer-10month.

Lead-authored-by: Yuxin Tan <tanyuxinwork@gmail.com>
Co-authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-04 16:57:43 +08:00
Wang, Fei
d4044c5152
[CELEBORN-1682] Add java tools.jar into classpath for JVM quake
### What changes were proposed in this pull request?

Add java tools.jar into classpath for JVM quake.

### Why are the changes needed?
The following issue occurs with `celeborn.worker.jvmQuake.enabled=true`, see https://github.com/apache/celeborn/pull/2061
```
24/11/03 15:51:08,453 ERROR [main] Worker: Initialize worker failed.
java.lang.NoClassDefFoundError: sun/jvmstat/monitor/HostIdentifier
    at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.monitoredVm$lzycompute(JVMQuake.scala:180)
    at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.monitoredVm(JVMQuake.scala:179)
    at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.ygcExitTimeMonitor$lzycompute(JVMQuake.scala:185)
    at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.ygcExitTimeMonitor(JVMQuake.scala:184)
    at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.org$apache$celeborn$service$deploy$worker$monitor$JVMQuake$$getLastExitTime(JVMQuake.scala:192)
    at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake.start(JVMQuake.scala:66)
    at org.apache.celeborn.service.deploy.worker.Worker.<init>(Worker.scala:360)
    at org.apache.celeborn.service.deploy.worker.Worker$.main(Worker.scala:1041)
    at org.apache.celeborn.service.deploy.worker.Worker.main(Worker.scala)
Caused by: java.lang.ClassNotFoundException: sun.jvmstat.monitor.HostIdentifier
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 9 more
```

Related code:
c12e8881ab/project/JDKTools.scala (L58-L75)

Similar issue: https://github.com/vladimirvivien/jmx-cli/issues/4

After copying `tools.jar` into worker-jars, the issue was resolved.

It is better to include `tools.jar` automatically, without a manual copy.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
<img width="1202" alt="image" src="https://github.com/user-attachments/assets/af8f6c0d-9123-4a73-93b5-69836c5f826d">

Closes #2874 from turboFei/jdk_tools.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-04 11:05:10 +08:00
Weijie Guo
c12e8881ab
[CELEBORN-1490][CIP-6] Add Flink Hybrid Shuffle IT test cases
### What changes were proposed in this pull request?
1. Add Flink Hybrid Shuffle IT test cases
2. Fix bug in open stream.

### Why are the changes needed?

Test coverage for celeborn + hybrid shuffle

### Does this PR introduce _any_ user-facing change?
No

Closes #2859 from reswqa/10-itcase-10month.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-01 17:27:24 +08:00
Wang, Fei
e2f640ce3b [CELEBORN-1660] Using map for workers to find worker fast
### What changes were proposed in this pull request?

Use a map for workers so that a worker can be found quickly by its uniqueId.
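
To illustrate the lookup pattern, the sketch below keys workers by a unique id so that a heartbeat update is a single map lookup rather than a scan. The class names, fields, and the `host:port` id format are assumptions made for the example, not the AbstractMetaManager data model.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry; names and the id format are assumptions.
final class WorkerRegistrySketch {
  static final class WorkerInfo {
    final String host;
    final int port;
    volatile long lastHeartbeatMillis;

    WorkerInfo(String host, int port) {
      this.host = host;
      this.port = port;
    }

    String uniqueId() {
      return host + ":" + port;
    }
  }

  private final Map<String, WorkerInfo> workersById = new ConcurrentHashMap<>();

  void register(WorkerInfo worker) {
    workersById.put(worker.uniqueId(), worker);
  }

  // Expected constant-time lookup by id; returns false for unknown workers.
  boolean updateHeartbeat(String uniqueId, long nowMillis) {
    WorkerInfo worker = workersById.get(uniqueId);
    if (worker == null) {
      return false;
    }
    worker.lastHeartbeatMillis = nowMillis;
    return true;
  }

  public static void main(String[] args) {
    WorkerRegistrySketch registry = new WorkerRegistrySketch();
    registry.register(new WorkerInfo("host-a", 9097));
    System.out.println(registry.updateHeartbeat("host-a:9097", System.currentTimeMillis()));
    System.out.println(registry.updateHeartbeat("host-b:9097", System.currentTimeMillis()));
  }
}
```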

### Why are the changes needed?

In a large Celeborn cluster, looking up a worker might be slow, for example in:

- updateWorkerHeartbeatMeta
1e77f01cd3/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java (L222)

- handleWorkerLost
1e77f01cd3/master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala (L762-L765)
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #2870 from turboFei/worksMap.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-01 15:58:53 +08:00
Wang, Fei
2b026a35fc [CELEBORN-1564][FOLLOWUP] Remove unused variables
### What changes were proposed in this pull request?

Remove unused variables.

### Why are the changes needed?

Followup for https://github.com/apache/celeborn/pull/2688

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2872 from turboFei/1564_followup.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-01 15:43:02 +08:00
Wang, Fei
7dc72b35e7 [CELEBORN-1477][FOLLOWUP] Remove scala binary version from openapi-client artifactId
### What changes were proposed in this pull request?

1. remove scala binary version from the openapi-client artifactId.
2. skip openapi-client doc compile, it was missed in https://github.com/apache/celeborn/pull/2641

### Why are the changes needed?

Because the openapi-client is a pure java module.

### Does this PR introduce _any_ user-facing change?

No, it has not been released.

### How was this patch tested?
GA.

Closes #2861 from turboFei/remove_Scala.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-01 15:08:00 +08:00
Weijie Guo
41fdb8ade1
[CELEBORN-1490][CIP-6] Add Flink hybrid shuffle doc
### What changes were proposed in this pull request?

Add Flink hybrid shuffle doc

### Why are the changes needed?
We need the doc for the new hybrid shuffle mode.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

No need.

Closes #2867 from reswqa/add-hs-doc.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-01 13:37:14 +08:00
SteNicholas
165e914b9b [CELEBORN-1672] Bump Spark from 3.4.3 to 3.4.4
### What changes were proposed in this pull request?

Bump Spark from 3.4.3 to 3.4.4.

### Why are the changes needed?

Spark 3.4.4 has been announced to release: [Spark 3.4.4 released](https://spark.apache.org/news/spark-3-4-4-released.html). The profile spark-3.4 could bump Spark from 3.4.3 to 3.4.4.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2851 from SteNicholas/CELEBORN-1672.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-01 11:05:00 +08:00
Shuang
14baec8388
[CELEBORN-1673] Support retry create client
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
Currently, only Flink retries establishing a client when a connection problem occurs. This would be beneficial for all other engines to implement as well.
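
A bounded retry loop is one generic way to express this. The sketch below is not the Celeborn client factory: `ClientRetrySketch`, `createWithRetry`, the attempt count, and the wait time are all hypothetical, and failures are modelled as unchecked exceptions for brevity.

```java
import java.util.function.Supplier;

// Hypothetical retry helper; names and parameters are assumptions.
final class ClientRetrySketch {
  // Calls the factory up to maxAttempts times, waiting between failed attempts.
  static <T> T createWithRetry(int maxAttempts, long waitMillis, Supplier<T> factory) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    RuntimeException lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return factory.get();
      } catch (RuntimeException e) {
        lastFailure = e;
        if (attempt < maxAttempts) {
          pause(waitMillis);
        }
      }
    }
    // All attempts failed: surface the last failure to the caller.
    throw lastFailure;
  }

  private static void pause(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException ie) {
      // Preserve the interrupt status and stop retrying.
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting to retry", ie);
    }
  }

  public static void main(String[] args) {
    int[] attempts = {0};
    String client = createWithRetry(3, 10L, () -> {
      if (++attempts[0] < 3) {
        throw new IllegalStateException("connection refused");
      }
      return "client";
    });
    System.out.println(client + " after " + attempts[0] + " attempts");
  }
}
```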

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #2855 from RexXiong/CELEBORN-1673.

Lead-authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Co-authored-by: lvshuang.xjs <lvshuang.xjs@taobao.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-10-31 14:45:48 +08:00
Aravind Patnam
12f25d3d0f [CELEBORN-1678] Add Celeborn CLI User guide in README
### What changes were proposed in this pull request?
Add a user guide for the CLI to the README.

### Why are the changes needed?
better user experience when using CLI.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
N/A

Closes #2862 from akpatnam25/CELEBORN-1678.

Authored-by: Aravind Patnam <akpatnam25@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-30 19:58:34 +08:00
SteNicholas
2dc8077cea [CELEBORN-1674] Fix reader thread name of MapPartitionData
### What changes were proposed in this pull request?

Fix reader thread name of `MapPartitionData` which contains `null`.

### Why are the changes needed?

The reader thread name of `MapPartitionData` currently contains `null` because `MapFileMeta#getMountPoint` returns null. The reader thread names of `MapPartitionData` look as follows:
```
celebornjscs-bigdata-rss-worker:/data/service/celeborn$ jstack 65|grep reader-thread
"null-reader-thread-7" #798 prio=5 os_prio=0 tid=0x00007ef03bca8000 nid=0x47f waiting on condition [0x00007eef068cb000]
"null-reader-thread-7" #799 prio=5 os_prio=0 tid=0x00007ef03a097000 nid=0x47e waiting on condition [0x00007eef069cc000]
"null-reader-thread-5" #796 prio=5 os_prio=0 tid=0x00007ef03a818000 nid=0x47d waiting on condition [0x00007eef06acd000]
"null-reader-thread-6" #797 prio=5 os_prio=0 tid=0x00007ef03b896800 nid=0x47c waiting on condition [0x00007eef06bce000]
"null-reader-thread-4" #793 prio=5 os_prio=0 tid=0x00007ef03ac6b000 nid=0x47b waiting on condition [0x00007eef06ccf000]
"null-reader-thread-6" #794 prio=5 os_prio=0 tid=0x00007ef05829e800 nid=0x47a waiting on condition [0x00007eef06dd0000]
"null-reader-thread-7" #795 prio=5 os_prio=0 tid=0x00007ef03b06b800 nid=0x479 waiting on condition [0x00007eef06ed1000]
"null-reader-thread-3" #789 prio=5 os_prio=0 tid=0x00007ef03a095000 nid=0x478 waiting on condition [0x00007eef06fd2000]
"null-reader-thread-3" #790 prio=5 os_prio=0 tid=0x00007ef03a817000 nid=0x477 waiting on condition [0x00007eef070d3000]
"null-reader-thread-4" #791 prio=5 os_prio=0 tid=0x00007ef03b895000 nid=0x476 waiting on condition [0x00007eef071d4000]
"null-reader-thread-5" #792 prio=5 os_prio=0 tid=0x00007ef03b06a800 nid=0x475 waiting on condition [0x00007eef072d5000]
"null-reader-thread-4" #786 prio=5 os_prio=0 tid=0x00007ef03d06b800 nid=0x474 waiting on condition [0x00007eef073d6000]
"null-reader-thread-5" #787 prio=5 os_prio=0 tid=0x00007ef03bca8800 nid=0x473 waiting on condition [0x00007eef074d7000]
"null-reader-thread-3" #785 prio=5 os_prio=0 tid=0x00007ef03c884800 nid=0x472 waiting on condition [0x00007eef075d8000]
"null-reader-thread-6" #788 prio=5 os_prio=0 tid=0x00007ef03cc6b800 nid=0x471 waiting on condition [0x00007eef076d9000]
"null-reader-thread-2" #783 prio=5 os_prio=0 tid=0x00007ef03c06a000 nid=0x470 waiting on condition [0x00007eef077da000]
"null-reader-thread-2" #784 prio=5 os_prio=0 tid=0x00007ef05829d000 nid=0x46f waiting on condition [0x00007eef078db000]
"null-reader-thread-2" #782 prio=5 os_prio=0 tid=0x00007ef03a815800 nid=0x46e waiting on condition [0x00007eef079dc000]
"null-reader-thread-1" #781 prio=5 os_prio=0 tid=0x00007ef01d852000 nid=0x46d waiting on condition [0x00007eef07add000]
"null-reader-thread-1" #780 prio=5 os_prio=0 tid=0x00007ef03a815000 nid=0x46c waiting on condition [0x00007eef07bde000]
"null-reader-thread-1" #779 prio=5 os_prio=0 tid=0x00007ef03ac6c800 nid=0x46b waiting on condition [0x00007eef07cdf000]
"null-reader-thread-0" #777 prio=5 os_prio=0 tid=0x00007ef03d06a800 nid=0x46a waiting on condition [0x00007eef07de0000]
"null-reader-thread-0" #778 prio=5 os_prio=0 tid=0x00007ef03ac6b800 nid=0x469 waiting on condition [0x00007eef07ee1000]
"null-reader-thread-0" #776 prio=5 os_prio=0 tid=0x00007ef03a095800 nid=0x468 waiting on condition [0x00007eef07fe2000]
```
```
[ERROR][null-reader-thread-6] - org.apache.celeborn.service.deploy.worker.storage.MapPartitionData -MapPartitionData.java(205) -reader exception, reader: DataPartitionReader

{startPartitionIndex=834, endPartitionIndex=834, streamId=1774189696911}
, message: Partition reader has been failed or finished.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2853 from SteNicholas/CELEBORN-1674.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-30 18:18:06 +08:00
Sanskar Modi
752a0d9459 [CELEBORN-1516][FOLLOWUP] Support reset method for DynamicConfigServiceFactory
### What changes were proposed in this pull request?
- Added a reset method for DynamicConfigServiceFactory
- Cleaned up QuotaManagerSuite

### Why are the changes needed?
Without this change, we cannot initialize a new configService in any other tests.
For example, the tests for PR https://github.com/apache/celeborn/pull/2844 are failing because of this issue.
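
The underlying pattern is a cached singleton with a reset hook for tests. The following sketch uses invented names (`ConfigServiceFactorySketch`, `getOrCreate`, `reset`) and a trivial `ConfigService` interface; it only illustrates why a reset is needed and is not the DynamicConfigServiceFactory code.

```java
// Hypothetical factory; all names are assumptions for illustration.
final class ConfigServiceFactorySketch {
  interface ConfigService {
    String name();
  }

  private static ConfigService instance;

  private ConfigServiceFactorySketch() {}

  // Creates the service once and returns the cached instance afterwards.
  static synchronized ConfigService getOrCreate(String name) {
    if (instance == null) {
      instance = () -> name;
    }
    return instance;
  }

  // Clears the cached instance so that another test can build its own service.
  static synchronized void reset() {
    instance = null;
  }

  public static void main(String[] args) {
    System.out.println(getOrCreate("first").name());
    // Still the cached instance, the new name is ignored.
    System.out.println(getOrCreate("second").name());
    reset();
    System.out.println(getOrCreate("second").name());
  }
}
```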

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

Closes #2848 from s0nskar/fix_quotatest.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-30 17:18:31 +08:00
Xianming Lei
6ad02f14a9 [CELEBORN-1577][PHASE1] Storage quota should support interrupt shuffle
### What changes were proposed in this pull request?
Support interrupting shuffle on the client side.

I will develop the following features in order:
1. Client supports interrupting shuffle
2. Master supports calculating app-level shuffle usage

### Why are the changes needed?
The current storage quota logic can only limit new shuffles; it cannot limit writes to existing shuffles. Our production environment has the following scenario: the cluster is small, but a single shuffle of a user's app is large and occupies disk resources, and we want to interrupt such shuffles.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This part cannot be tested independently. Additional tests will be added after completing the second part.

Closes #2801 from leixm/CELEBORN-1577-1.

Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-30 16:28:09 +08:00
Sanskar Modi
2c996133b9 [CELEBORN-1444][FOLLOWUP] Add IsDecommissioningWorker to celeborn dashboard
### What changes were proposed in this pull request?

Adding IsDecommissioningWorker metric to celeborn dashboard

### Why are the changes needed?

Metric was missing from dashboard

### Does this PR introduce _any_ user-facing change?

NA

### How was this patch tested?

Tested in local grafana setup

<img width="755" alt="Screenshot 2024-10-21 at 5 19 55 PM" src="https://github.com/user-attachments/assets/7c0a2517-32a8-4565-81d8-a056d3708ac8">

Closes #2836 from s0nskar/decommision_metric.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-30 09:55:43 +08:00
Weijie Guo
f4dc7a839b [CELEBORN-1490][CIP-6] Impl worker read process in Flink Hybrid Shuffle
### What changes were proposed in this pull request?

Impl worker read process in Flink Hybrid Shuffle

### Does this PR introduce _any_ user-facing change?

No

Closes #2820 from reswqa/cip6-8-pr.

Lead-authored-by: Weijie Guo <reswqa@163.com>
Co-authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-29 16:29:16 +08:00
Fu Chen
39c185afd1 [CELEBORN-1677][BUILD] Update SCM information for SBT build configuration
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

This PR addresses a conflict in the sbt generated POM by replacing `pomExtra` with `scmInfo`

```diff
         <name>org.apache.celeborn</name>
     </organization>
     <scm>
-        <url>https://github.com/cfmcgrady/incubator-celeborn</url>
-        <connection>scm:git:https://github.com/cfmcgrady/incubator-celeborn.git</connection>
-        <developerConnection>scm:git:git@github.com:cfmcgrady/incubator-celeborn.git</developerConnection>
-    </scm>
-    <url>https://celeborn.apache.org/</url>
-    <scm>
-        <url>git@github.com:apache/celeborn.git</url>
-        <connection>scm:git:git@github.com:apache/celeborn.git</connection>
+        <url>https://celeborn.apache.org/</url>
+        <connection>scm:git:https://github.com/apache/celeborn.git</connection>
+        <developerConnection>scm:git:git@github.com:apache/celeborn.git</developerConnection>
     </scm>

```

The conflicting POM might block publishing to a private Maven repository.

```
[error] Caused by: java.io.IOException: Server returned HTTP response code: 409 for URL: https://artifactory.devops.xxx.com/artifactory/maven-snapshots/org/apache/celeborn/celeborn-client-spark-3-shaded_2.12/0.6.0-SNAPSHOT/celeborn-client-spark-3-shaded_2.12-0.6.0-SNAPSHOT.pom
[error]         at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:2000)
[error]         at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589)
[error]         at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
[error]         at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:308)
[error]         at org.apache.ivy.util.url.BasicURLHandler.upload(BasicURLHandler.java:284)
[error]         at org.apache.ivy.util.url.URLHandlerDispatcher.upload(URLHandlerDispatcher.java:82)
[error]         at org.apache.ivy.util.FileUtil.copy(FileUtil.java:150)
[error]         at org.apache.ivy.plugins.repository.url.URLRepository.put(URLRepository.java:84)
[error]         at sbt.internal.librarymanagement.ConvertResolver$LocalIfFileRepo.put(ConvertResolver.scala:407)
[error]         at org.apache.ivy.plugins.repository.AbstractRepository.put(AbstractRepository.java:130)
[error]         at sbt.internal.librarymanagement.ConvertResolver$ChecksumFriendlyURLResolver.put(ConvertResolver.scala:124)
[error]         at sbt.internal.librarymanagement.ConvertResolver$ChecksumFriendlyURLResolver.put$(ConvertResolver.scala:111)
[error]         at sbt.internal.librarymanagement.ConvertResolver$$anonfun$defaultConvert$lzycompute$1$PluginCapableResolver$1.put(ConvertResolver.scala:170)
[error]         at org.apache.ivy.plugins.resolver.RepositoryResolver.publish(RepositoryResolver.java:216)
[error]         at sbt.internal.librarymanagement.IvyActions$.$anonfun$publish$5(IvyActions.scala:501)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

local

Closes #2858 from cfmcgrady/sbt-scm.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2024-10-29 14:04:58 +08:00
Sanskar Modi
e51b0c4f86 [CELEBORN-1642][CIP-11] Support multiple worker tags
### What changes were proposed in this pull request?
The current TagsManager code supports only one tag for selecting tagged workers. This change enables passing multiple tags to TagsManager. Multiple tags are evaluated as an "AND" expression, i.e. only workers tagged with all the passed tags are selected.

Support for more schemes will be added in follow up PRs.
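
The "AND" semantics amount to intersecting the worker sets of each tag. The sketch below assumes a simple tag-to-worker-ids map; `TagSelectionSketch` and `workersWithAllTags` are hypothetical names, and returning an empty set for an empty tag list is an assumption rather than TagsManager behaviour.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the TagsManager implementation.
final class TagSelectionSketch {
  // tagToWorkers maps a tag to the ids of the workers carrying that tag.
  static Set<String> workersWithAllTags(Map<String, Set<String>> tagToWorkers, List<String> tags) {
    if (tags.isEmpty()) {
      // Assumption: no tags selects nothing in this sketch.
      return Collections.<String>emptySet();
    }
    Set<String> result = null;
    for (String tag : tags) {
      Set<String> tagged = tagToWorkers.getOrDefault(tag, Collections.<String>emptySet());
      if (result == null) {
        result = new HashSet<>(tagged);
      } else {
        // Keep only workers that also carry this tag.
        result.retainAll(tagged);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Set<String>> tagToWorkers = new HashMap<>();
    tagToWorkers.put("ssd", new HashSet<>(Arrays.asList("w1", "w2")));
    tagToWorkers.put("high-io", new HashSet<>(Arrays.asList("w2", "w3")));
    System.out.println(workersWithAllTags(tagToWorkers, Arrays.asList("ssd", "high-io")));
  }
}
```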

### Why are the changes needed?
https://cwiki.apache.org/confluence/display/CELEBORN/CIP-11+Supporting+Tags+in+Celeborn

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs

Closes #2850 from s0nskar/CELEBORN-1642.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-28 10:18:43 +08:00
szt
7685fa7db2 [CELEBORN-1636] Client supports dynamic update of Worker resources on the server
### What changes were proposed in this pull request?
Currently, the ChangePartitionManager retrieves workers from the LifeCycleManager's workerSnapshot. However, during the revival process in reallocateChangePartitionRequestSlotsFromCandidates, it does not account for newly added available workers resulting from elastic contraction and expansion. This PR addresses this issue by updating the candidate workers in the ChangePartitionManager to use the available workers reported in the heartbeat from the master.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #2835 from zaynt4606/clbdev.

Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-28 09:49:31 +08:00
avishnus
59029a0967 [CELEBORN-1649] Bumping up maven to 3.9.9
### What changes were proposed in this pull request?
Bumping up maven version to 3.9.9

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2834 from avishnus/maven.

Authored-by: avishnus <avishnus@visa.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-25 16:20:32 +08:00
mingji
e1bebb9e5b [CELEBORN-1668] Fix NPE when handle closed file writers
### What changes were proposed in this pull request?
To fix an NPE when handling the closed file writers.

### Why are the changes needed?
If a file writer stores its shuffle data in memory, the disk file info object will be null, causing NPE.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA.

Closes #2846 from FMX/b1688.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-24 17:50:16 +08:00
SteNicholas
4b150be0c8 [CELEBORN-1669] Fix NullPointerException for PartitionFilesSorter#updateSortedShuffleFiles after cleaning up expired shuffle key
### What changes were proposed in this pull request?

Fix `NullPointerException` for `PartitionFilesSorter#updateSortedShuffleFiles` after cleaning up expired shuffle key.

### Why are the changes needed?

`PartitionFilesSorter` sorts shuffle files in the `worker-file-sorter-executor` thread and cleans up expired keys in the `worker-expired-shuffle-cleaner` thread. If `worker-file-sorter-executor` updates the sorted shuffle files after `worker-expired-shuffle-cleaner` has cleaned up the expired shuffle key, a `NullPointerException` is currently thrown.

```
2024-10-23 17:26:17,162 [INFO] [worker-expired-shuffle-cleaner] - org.apache.celeborn.service.deploy.worker.Worker -Logging.scala(51) -Cleaned up expired shuffle application_1724141892576_3843182_1-0
2024-10-23 17:26:17,392 [ERROR] [worker-file-sorter-executor-237572] - org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter -PartitionFilesSorter.java(752) -Sorting shuffle file for application_1724141892576_3843182_1-0-1875-0-0 /mnt/storage02/celeborn-worker/shuffle_data/application_1724141892576_3843182_1/0/1875-0-0 failed, detail:
java.lang.NullPointerException: null
    at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter.updateSortedShuffleFiles(PartitionFilesSorter.java:455) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
    at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter$FileSorter.sort(PartitionFilesSorter.java:747) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
    at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter.lambda$new$1(PartitionFilesSorter.java:164) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_162]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_162]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_162]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_162]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_162]
```
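
One possible way to tolerate this interleaving is to treat a missing entry as "already cleaned up" instead of dereferencing it. The sketch below is a simplified model with invented names (`SortedFilesSketch`, `recordSorted`, `cleanup`); it is not the actual change made to `PartitionFilesSorter`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the sorter's bookkeeping; names are assumptions.
final class SortedFilesSketch {
  private final ConcurrentHashMap<String, Set<String>> sortedFilesByShuffle =
      new ConcurrentHashMap<>();

  void registerShuffle(String shuffleKey) {
    sortedFilesByShuffle.putIfAbsent(shuffleKey, ConcurrentHashMap.<String>newKeySet());
  }

  // Called by the cleaner thread when a shuffle expires.
  void cleanup(String shuffleKey) {
    sortedFilesByShuffle.remove(shuffleKey);
  }

  // Called by the sorter thread; returns false when the shuffle has already
  // been cleaned up instead of dereferencing a null entry.
  boolean recordSorted(String shuffleKey, String fileId) {
    Set<String> files = sortedFilesByShuffle.get(shuffleKey);
    if (files == null) {
      return false;
    }
    // If cleanup runs right here, the add lands in a detached set, which is harmless.
    files.add(fileId);
    return true;
  }

  public static void main(String[] args) {
    SortedFilesSketch sorter = new SortedFilesSketch();
    sorter.registerShuffle("app-1-0");
    System.out.println(sorter.recordSorted("app-1-0", "1875-0-0"));
    sorter.cleanup("app-1-0");
    System.out.println(sorter.recordSorted("app-1-0", "1876-0-0"));
  }
}
```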

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2847 from SteNicholas/CELEBORN-1669.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-24 17:47:25 +08:00
Wang, Fei
216152d038 [CELEBORN-1632] Support to apply ratis local raft_meta_conf command with RESTful api
### What changes were proposed in this pull request?
Sub-task of CELEBORN-1628.

Support applying the ratis local raft_meta_conf command through the RESTful API.

See https://celeborn.apache.org/docs/latest/celeborn_ratis_shell/#local-raftmetaconf
```
$ celeborn-ratis sh local raftMetaConf -peers <[P0_ID|]P0_HOST:P0_PORT,[P1_ID|]P1_HOST:P1_PORT,[P2_ID|]P2_HOST:P2_PORT> -path <PARENT_PATH_OF_RAFT_META_CONF>
```

The implementation is the same as e96ed1a338/ratis-shell/src/main/java/org/apache/ratis/shell/cli/sh/local/RaftMetaConfCommand.java (L122-L133)

### Why are the changes needed?

We have implemented RESTful APIs for all the other ratis-shell commands.

<img width="1219" alt="image" src="https://github.com/user-attachments/assets/4367ddbd-3c55-449a-a1bc-75d6c18e8918">

| Ratis Shell               | RESTful api                        |
|----------------------|---------------------------------|
| election transfer    | `/ratis/election/transfer`      |
| election stepDown    | `/ratis/election/step_down`     |
| election pause       | `/ratis/election/pause`         |
| election resume      | `/ratis/election/resume`        |
| group info           | `/masters`                      |
| peer add             | `/ratis/peer/add`               |
| peer remove          | `/ratis/peer/remove`            |
| peer setPriority     | `/ratis/peer/set_priority`      |
| snapshot create      | `/ratis/snapshot/create`        |

The local raftMetaConf command is the last one.

I closed the ticket CELEBORN-1632 earlier because it is a local command and I wondered whether it was necessary to implement it with a RESTful API.

But we have implemented all the others, so I decided to implement it as well.

### Does this PR introduce _any_ user-facing change?

A new API.

The implementation is the same as e96ed1a338/ratis-shell/src/main/java/org/apache/ratis/shell/cli/sh/local/RaftMetaConfCommand.java (L122-L133)

### How was this patch tested?
![image](https://github.com/user-attachments/assets/088d8523-e5f5-4546-9159-e12191fd8a29)
![image](https://github.com/user-attachments/assets/ce9c4284-fd61-45de-93e7-d38e3b6afac9)
<img width="960" alt="image" src="https://github.com/user-attachments/assets/b302a680-baea-4709-b77f-a2b1946b8dff">

<img width="1471" alt="image" src="https://github.com/user-attachments/assets/4bf090ba-c6f4-4f49-aa57-8dd2c897ff30">
<img width="871" alt="image" src="https://github.com/user-attachments/assets/9959072c-5e96-48f5-911e-546c05a0c443">

Closes #2829 from turboFei/local_raft_conf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-24 16:09:18 +08:00
zhangzhao.08
23113898f6 [CELEBORN-1667] Fix NPE & LEAK occurring prior to worker registration
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

This PR addresses a memory leak on worker nodes before registration and an NPE that occurs when PushDataHandler is accessed during initialization.

![image](https://github.com/user-attachments/assets/993f9e9c-fb84-4b71-a77f-6c043cda4864)
![image](https://github.com/user-attachments/assets/25545bbf-e838-44b2-88fe-3fe2dada0524)

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Pass GA

Closes #2843 from zhaostu4/zhao/worker_npe.

Authored-by: zhangzhao.08 <zhangzhao.08@bytedance.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-24 15:51:29 +08:00
SteNicholas
464a7c71a9 [CELEBORN-1651] Support ratio threshold of unhealthy disks for excluding worker
### What changes were proposed in this pull request?

Support a ratio threshold of unhealthy disks for excluding a worker, configured with `celeborn.master.excludeWorker.unhealthyDiskRatioThreshold`.

### Why are the changes needed?

In production we often encounter problems such as disk I/O errors. When a disk goes bad, the worker is normally decommissioned so that the machine's disk can be repaired, because a fault is generally repaired soon after it is discovered. A disk that is still under warranty can be replaced directly. Once the warranty has expired, a replacement disk has to be purchased, so an out-of-warranty machine may not have every disk failure submitted for repair. Submitting the disks for repair all at once also affects the failure-rate judgment of the system group and scenario. In addition, bad disks bring management problems such as continuous alarms, and the handling of disk failures is relatively customized.

Therefore, it is recommended to configure a ratio threshold of unhealthy disks for excluding a worker, so that the worker can be excluded without waiting for all of its disks to become unhealthy.
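A minimal sketch of the idea, not the code from this PR: the class name, method name, and the `>=` comparison are assumptions made for illustration. The source only states that a ratio threshold of unhealthy disks decides whether a worker is excluded.

```java
// Hypothetical helper; Celeborn's real check may differ in naming and comparison.
final class UnhealthyDiskRatioSketch {

  /**
   * Returns true when the fraction of unhealthy disks on a worker
   * reaches the configured threshold (assumed inclusive here).
   */
  static boolean shouldExclude(int unhealthyDisks, int totalDisks, double threshold) {
    if (totalDisks <= 0) {
      // No disks reported: this sketch chooses not to exclude on ratio alone.
      return false;
    }
    double ratio = (double) unhealthyDisks / totalDisks;
    return ratio >= threshold;
  }
}
```

With a threshold of 0.5, a worker with 2 of 4 disks unhealthy would be excluded in this sketch, while a worker with 1 of 4 would not.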

### Does this PR introduce _any_ user-facing change?

Introduce `celeborn.master.excludeWorker.unhealthyDiskRatioThreshold` to configure the maximum ratio of unhealthy disks for excluding a worker.

### How was this patch tested?

Cluster test.

Closes #2812 from SteNicholas/CELEBORN-1651.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-24 11:43:22 +08:00
Fu Chen
3b9c2f04e7 [CELEBORN-1666] Bump scala-protoc from 1.0.6 to 1.0.7
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

The version 1.0.6 is outdated and not available on Maven Central.

https://mvnrepository.com/artifact/com.thesamet/sbt-protoc_2.12_1.0

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass CI

Closes #2842 from cfmcgrady/sbt-protoc.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2024-10-24 11:16:37 +08:00
jiang13021
7018996e24 [MINOR] Fix typo in ExceptionUtils
### What changes were proposed in this pull request?
Fix typo.

### Why are the changes needed?
The error message was changed in [this pull request](https://github.com/apache/celeborn/pull/1097), but the connectFail method in org.apache.celeborn.common.util.ExceptionUtils has not been updated.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No need.

Closes #2841 from jiang13021/minor-cause-typo.

Authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-23 16:08:16 +08:00
YutingWang98
5b2030037a [CELEBORN-1664] Fix secret fetch failures after LEADER master failover
### What changes were proposed in this pull request?
Fix an auth bug in master HA mode that causes app failures when the leader master restarts. Also remove the secrets from memory after an app is lost.

The previous implementation added the registration and secret info to the leader master's memory and pushed it to the other masters through https://github.com/apache/celeborn/pull/2346. After the leader restarts, the info exists only in Ratis (AbstractMetaManager), but the app still tries to fetch it from the new leader's memory and fails to get it.

Fix this by checking AbstractMetaManager's registration info when it is not found in memory, and then properly authorizing the app.
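The fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the class, the two maps, and every method name are hypothetical stand-ins for the leader's in-memory registry and the state replicated through Ratis.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of "check memory first, then fall back to replicated state".
final class SecretLookupSketch {
  // Stand-in for the leader's in-memory registrations; lost on restart.
  private final Map<String, String> inMemory = new ConcurrentHashMap<>();
  // Stand-in for registration info that survives a restart via replication.
  private final Map<String, String> replicated = new ConcurrentHashMap<>();

  void register(String appId, String secret) {
    inMemory.put(appId, secret);
    replicated.put(appId, secret);
  }

  // Simulates a new leader that starts with empty memory.
  void simulateLeaderRestart() {
    inMemory.clear();
  }

  Optional<String> secretFor(String appId) {
    String secret = inMemory.get(appId);
    if (secret == null) {
      // Not in memory: consult the replicated state before failing.
      secret = replicated.get(appId);
    }
    return Optional.ofNullable(secret);
  }
}
```

In this model, a lookup after `simulateLeaderRestart()` still succeeds because the replicated copy is consulted, which is the behavior the fix aims for.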

### Why are the changes needed?
When auth is enabled and the leader master restarts, the app side reports a "Registration information not found" error and fails to send heartbeats to the master. The app is then removed on the server side after the heartbeat timeout, which causes the job to fail.
```
24/10/14 01:56:55 ERROR [celeborn-netty-rpc-connection-executor-3] client.TransportClientFactory: Exception while bootstrapping client after 71.4 ms
java.lang.RuntimeException: java.io.IOException: Exception in sendRpcSync to: celeborn-moka-test-manager-3/{ip}:9097
    at org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:110)
    at org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doSaslBootstrap(RegistrationClientBootstrap.java:228)
    at org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doBootstrap(RegistrationClientBootstrap.java:103)
    at org.apache.celeborn.common.network.client.TransportClientFactory.internalCreateClient(TransportClientFactory.java:307)
    at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:205)
    at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:133)
    at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:212)
    at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:232)
    at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Exception in sendRpcSync to: celeborn-moka-test-manager-3/{ip}:9097
    at org.apache.celeborn.common.network.client.TransportClient.sendRpcSync(TransportClient.java:324)
    at org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:95)
    ... 13 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: java.lang.RuntimeException: Registration information not found for spark-402a80be70f74455b01
    at org.apache.celeborn.common.network.sasl.CelebornSaslServer$DigestCallbackHandler.handle(CelebornSaslServer.java:142)
    at java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
    at java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
    at org.apache.celeborn.common.network.sasl.CelebornSaslServer.response(CelebornSaslServer.java:84)
    at org.apache.celeborn.common.network.sasl.SaslRpcHandler.doAuthChallenge(SaslRpcHandler.java:99)
    at org.apache.celeborn.common.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:58)
    at org.apache.celeborn.common.network.sasl.registration.RegistrationRpcHandler.processRpcMessage(RegistrationRpcHandler.java:175)
```
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested on dev cluster and job can properly get the secrets after master failover

Closes #2826 from YutingWang98/fix_auth_master_ha.

Authored-by: YutingWang98 <69848459+YutingWang98@users.noreply.github.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2024-10-22 14:10:08 -05:00
SteNicholas
06bd39b768 [CELEBORN-1665] CommitHandler should process CommitFilesResponse with COMMIT_FILE_EXCEPTION status
### What changes were proposed in this pull request?

`CommitHandler` should process `CommitFilesResponse` with `COMMIT_FILE_EXCEPTION` status.

### Why are the changes needed?

`CommitHandler` processes `CommitFilesResponse` with statuses including `SUCCESS`, `PARTIAL_SUCCESS`, `SHUFFLE_NOT_REGISTERED`, `REQUEST_FAILED` and `WORKER_EXCLUDED` at present. Meanwhile, Controller replies `CommitFilesResponse` with `COMMIT_FILE_EXCEPTION` status for throwable. Therefore, `CommitHandler` should process `COMMIT_FILE_EXCEPTION` status.
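A minimal sketch of handling the additional status, assuming a simplified model: the `Status` names come from the description above, while the `Outcome` enum, the `handle` method, and the mapping of each status to an outcome are hypothetical and not taken from `CommitHandler`.

```java
// Hypothetical dispatcher over the statuses listed in the description.
final class CommitStatusSketch {
  enum Status {
    SUCCESS,
    PARTIAL_SUCCESS,
    SHUFFLE_NOT_REGISTERED,
    REQUEST_FAILED,
    WORKER_EXCLUDED,
    COMMIT_FILE_EXCEPTION
  }

  // Illustrative outcomes only; the real handler's reactions are not shown in the source.
  enum Outcome { COMMITTED, PARTIALLY_COMMITTED, FAILED }

  static Outcome handle(Status status) {
    switch (status) {
      case SUCCESS:
        return Outcome.COMMITTED;
      case PARTIAL_SUCCESS:
        return Outcome.PARTIALLY_COMMITTED;
      case SHUFFLE_NOT_REGISTERED:
      case REQUEST_FAILED:
      case WORKER_EXCLUDED:
      case COMMIT_FILE_EXCEPTION: // the status this change starts to handle
        return Outcome.FAILED;
      default:
        throw new IllegalStateException("Unhandled status: " + status);
    }
  }
}
```

Listing every status explicitly and throwing in `default` makes a newly added status fail loudly instead of being silently ignored.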

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2838 from SteNicholas/CELEBORN-1665.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-22 17:49:52 +08:00
Sanskar Modi
1e77f01cd3
[CELEBORN-1663][FOLLOWUP] Only register appShuffleDeterminate if stage using celeborn for shuffle
### What changes were proposed in this pull request?

Make the same changes for the Spark 2 codebase.

### Why are the changes needed?

Followup for https://github.com/apache/celeborn/pull/2832

### Does this PR introduce _any_ user-facing change?

NA

### How was this patch tested?

Existing UTs

Closes #2837 from s0nskar/fix_register_spark2.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-10-22 14:28:12 +08:00
Sanskar Modi
813b45f284 [CELEBORN-1663] Only register appShuffleDeterminate if stage using celeborn for shuffle
### What changes were proposed in this pull request?
Only register appShuffleDeterminate if stage using celeborn for shuffle

### Why are the changes needed?

Currently we pass stage info to LifecycleManager even though it is not required.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
Existing UTs

Closes #2832 from s0nskar/fix_register.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-22 11:40:06 +08:00
sychen
15d1463be8 [CELEBORN-1661] Make sure that the sortedFilesDb is initialized successfully when worker enable graceful shutdown
### What changes were proposed in this pull request?

### Why are the changes needed?
Similar to CELEBORN-1457, `sortedFilesDb` may also fail to initialize.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2831 from cxzl25/CELEBORN-1661.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-22 10:19:31 +08:00
jiang13021
dc5f3fb96b
[CELEBORN-1662] Handle PUSH_DATA_FAIL_PARTITION_NOT_FOUND in getPushDataFailCause
### What changes were proposed in this pull request?
Add a condition at the start of the failure cause logic to check for PUSH_DATA_FAIL_PARTITION_NOT_FOUND.

### Why are the changes needed?
Currently, the getPushDataFailCause method does not identify and handle the PUSH_DATA_FAIL_PARTITION_NOT_FOUND error type. All other failure causes are explicitly checked and managed, but this specific error type is overlooked.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test

Closes #2833 from jiang13021/celeborn-1662.

Authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-10-21 21:06:06 +08:00
mingji
df01fadc9f
[CELEBORN-1601] Support revise lost shuffles
### What changes were proposed in this pull request?
Support revising lost shuffle IDs in long-running jobs such as Flink batch jobs.

### Why are the changes needed?
1. To support revising lost shuffles.
2. To add an HTTP endpoint to revise lost shuffles manually.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster tests.

Closes #2746 from FMX/b1600.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-10-21 16:44:37 +08:00
Wang, Fei
bcb43183af [CELEBORN-1629][FOLLOWUP] Fix broken RESTful api link
### What changes were proposed in this pull request?
Fix the broken link.

### Why are the changes needed?
Followup for https://github.com/apache/celeborn/pull/2779.
The RESTful API doc was renamed from webapi.md to restapi.md in https://github.com/apache/celeborn/pull/2775.

Because these two PRs were merged almost back to back, I was not aware of this change.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

<img width="1255" alt="image" src="https://github.com/user-attachments/assets/a09aecf8-6e7e-458b-871d-f8dd5a0ac6b2">
<img width="937" alt="image" src="https://github.com/user-attachments/assets/bcefeecf-7a24-4616-9f5e-f2a11f464769">

Closes #2828 from turboFei/ratis_docs_link.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-21 11:42:52 +08:00
SteNicholas
3726aefecf [CELEBORN-1659][FOLLOWUP] Dockerfile should support copying CLI jars
### What changes were proposed in this pull request?

Dockerfile should support copying CLI jars.

### Why are the changes needed?

CLI jars are generated from `make-distribution.sh`. Therefore, Dockerfile could copy CLI jars to `/opt/celeborn/` directory.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2823 from SteNicholas/CELEBORN-1659.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-21 11:37:41 +08:00
Wang, Fei
ffc4980847 [CELEBORN-1627][FOLLOWUP] Fix typo for metrics_SlotsAllocated_increas_1h
### What changes were proposed in this pull request?
Fix typo in prometheus expr.

### Why are the changes needed?

Fix typo.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
<img width="1220" alt="image" src="https://github.com/user-attachments/assets/0b8649b6-163a-4868-9eb4-31a25a225d0e">

Closes #2825 from turboFei/fix_typo.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-21 11:33:54 +08:00
Fu Chen
24b9b24712 [CELEBORN-1658] Add Git Commit Info and Build JDK Spec to sbt Manifest
### What changes were proposed in this pull request?

This PR adds Git commit information and the JVM build specification to the package manifest.

The `META-INF/MANIFEST.MF` before this PR:

```
Manifest-Version: 1.0
Specification-Title: celeborn-client-spark-3-shaded
Specification-Version: 0.6.0-SNAPSHOT
Specification-Vendor: org.apache.celeborn
Implementation-Title: celeborn-client-spark-3-shaded
Implementation-Version: 0.6.0-SNAPSHOT
Implementation-Vendor: org.apache.celeborn
Implementation-Vendor-Id: org.apache.celeborn
```

After this PR:

```
Manifest-Version: 1.0
Specification-Title: celeborn-client-spark-3-shaded
Specification-Version: 0.6.0-SNAPSHOT
Specification-Vendor: org.apache.celeborn
Implementation-Title: celeborn-client-spark-3-shaded
Implementation-Version: 0.6.0-SNAPSHOT
Implementation-Vendor: org.apache.celeborn
Implementation-Vendor-Id: org.apache.celeborn
Build-Jdk-Spec: 17.0.9
Build-Revision: 03247c19f1b38096a4080fe97e94dbeb20ebcbe9
Build-Branch: jdk-git-spec
Build-Time: 2024-10-18T17:53:02.723124+08:00[Asia/Shanghai]
```

```
Manifest-Version: 1.0
Specification-Title: celeborn-client-spark-3-shaded
Specification-Version: 0.6.0-SNAPSHOT
Specification-Vendor: org.apache.celeborn
Implementation-Title: celeborn-client-spark-3-shaded
Implementation-Version: 0.6.0-SNAPSHOT
Implementation-Vendor: org.apache.celeborn
Implementation-Vendor-Id: org.apache.celeborn
Build-Jdk-Spec: 17.0.9
Build-Revision: N/A
Build-Branch:
Build-Time: 2024-10-18T17:54:16.932121+08:00[Asia/Shanghai]
```
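As a rough illustration of how such attributes can be read back, the sketch below parses manifest text with the JDK's `java.util.jar.Manifest`. The class and method names are hypothetical; only the attribute names (`Build-Revision`, `Build-Jdk-Spec`, and so on) come from the manifests shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Manifest;

// Hypothetical helper for reading a main attribute from manifest text.
final class ManifestBuildInfoSketch {

  /**
   * Returns the value of a main attribute, or null if it is absent.
   * The manifest text must end with a newline, as the parser requires.
   */
  static String attributeOf(String manifestText, String name) {
    try {
      Manifest manifest =
          new Manifest(new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
      return manifest.getMainAttributes().getValue(name);
    } catch (IOException e) {
      // Wrap so callers do not need to handle a checked exception.
      throw new UncheckedIOException(e);
    }
  }
}
```

For a packaged jar, the same lookup would typically start from `JarFile#getManifest()` instead of raw text.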

### Why are the changes needed?

As title.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

local

Closes #2821 from cfmcgrady/jdk-git-spec.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-21 11:05:38 +08:00
mingji
a94147cd9d [CELEBORN-1655] Fix read buffer dispatcher thread terminate unexpectedly
### What changes were proposed in this pull request?
The read buffer dispatcher may lose its dispatcher thread, which is not acceptable.

### Why are the changes needed?
1. Add a scheduler pool to ensure the dispatcher thread is alive.
2. Add an unhandled exception handler to record possible exceptions that cause the thread to be lost.
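The two measures above can be sketched roughly as follows. This is not the PR's implementation: the class, its methods, the thread names, and the check period are all assumptions chosen to show a liveness check combined with an uncaught-exception handler.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical watchdog that restarts a dispatcher thread and records why it died.
final class DispatcherWatchdogSketch {
  private final Runnable loop;
  private final AtomicReference<Throwable> lastFailure = new AtomicReference<>();
  private volatile Thread dispatcher;

  DispatcherWatchdogSketch(Runnable loop) {
    this.loop = loop;
  }

  /** Starts the dispatcher if it is missing or dead; returns true if a thread was started. */
  synchronized boolean ensureAlive() {
    if (dispatcher != null && dispatcher.isAlive()) {
      return false;
    }
    Thread t = new Thread(loop, "read-buffer-dispatcher-sketch");
    t.setDaemon(true);
    // Record the exception that killed the thread instead of losing it.
    t.setUncaughtExceptionHandler((thread, error) -> lastFailure.set(error));
    t.start();
    dispatcher = t;
    return true;
  }

  /** Periodically re-checks liveness on a single scheduler thread. */
  ScheduledExecutorService watch(long periodMillis) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "dispatcher-watchdog-sketch");
      t.setDaemon(true);
      return t;
    });
    scheduler.scheduleWithFixedDelay(
        this::ensureAlive, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    return scheduler;
  }

  /** Waits up to the given time for the current dispatcher thread to exit. */
  boolean awaitTermination(long millis) {
    Thread t = dispatcher;
    if (t == null) {
      return true;
    }
    try {
      t.join(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return !t.isAlive();
  }

  Throwable lastFailure() {
    return lastFailure.get();
  }
}
```

A caller would invoke `ensureAlive()` once at startup and then `watch(...)` with a period of its choosing, and shut the returned scheduler down on exit.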

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster test.

Closes #2815 from FMX/b1655.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-18 15:53:23 +08:00
Aravind Patnam
9620415ae9 [CELEBORN-1659] Fix sbt make-distribution for cli
### What changes were proposed in this pull request?
Fix make-distribution for SBT.

### Why are the changes needed?
same as above

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
ran `./build/make-distribution.sh --sbt-enabled -Pspark-3.5` to ensure it works

Closes #2822 from akpatnam25/CELEBORN-1659.

Authored-by: Aravind Patnam <akpatnam25@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-18 15:47:01 +08:00
Xianming Lei
7c9a008a14 [CELEBORN-1487][PHASE2] CongestionController support dynamic config
### What changes were proposed in this pull request?
CongestionController supports dynamic config.

### Why are the changes needed?
Currently, Celeborn only supports quota management based on disk file bytes/count. This quota management cannot cope with sudden increases in traffic, which can harm the cluster.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

Closes #2817 from leixm/CELEBORN-1487-2.

Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-18 15:41:51 +08:00
SteNicholas
497bfdf5d7 [CELEBORN-1640] NettyMemoryMetrics supports numHeapArenas, numDirectArenas, tinyCacheSize, smallCacheSize, normalCacheSize, numThreadLocalCaches and chunkSize
### What changes were proposed in this pull request?

`NettyMemoryMetrics` supports `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`. Meanwhile, remove `server_` prefix from metric name of netty memory metric in `monitoring.md`.

### Why are the changes needed?

`PooledByteBufAllocatorMetric` provides the following API to support netty memory metrics:

```
public int numHeapArenas() {
  return this.allocator.numHeapArenas();
}

public int numDirectArenas() {
  return this.allocator.numDirectArenas();
}

public List<PoolArenaMetric> heapArenas() {
  return this.allocator.heapArenas();
}

public List<PoolArenaMetric> directArenas() {
  return this.allocator.directArenas();
}

public int numThreadLocalCaches() {
  return this.allocator.numThreadLocalCaches();
}

public int tinyCacheSize() {
  return this.allocator.tinyCacheSize();
}

public int smallCacheSize() {
  return this.allocator.smallCacheSize();
}

public int normalCacheSize() {
  return this.allocator.normalCacheSize();
}

public int chunkSize() {
  return this.allocator.chunkSize();
}

public long usedHeapMemory() {
  return this.allocator.usedHeapMemory();
}

public long usedDirectMemory() {
  return this.allocator.usedDirectMemory();
}
```

`NettyMemoryMetrics` currently supports only `usedHeapMemory` and `usedDirectMemory`, but it could also support `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/a520ca36a33843a38bbde28387023f97)

Closes #2802 from SteNicholas/CELEBORN-1640.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-17 18:12:08 +08:00