### What changes were proposed in this pull request?
Fix storageFetcherPool concurrent problem.
There may be duplicate thread pools created as multi-thread race condition.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need.
Closes#2886 from reswqa/storageFetcherPool.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
As title, introduce metrics_ShuffleFallbackCount_Value.
### Why are the changes needed?
To provide the insights that how many shuffles fallback to spark built-in shuffle service. It is helpful for us to deprecate the ESS progressively.
Currently, we plan to set the `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` to fallback the shuffle with too large shuffle partitions number, for example: 50k.
In the future, we plan to limit the acceptable maximum shuffle partition number so that the bad job would be rejected and not impact the celeborn master health.
### Does this PR introduce _any_ user-facing change?
Yes, new metrics.
### How was this patch tested?
UT.
<img width="1188" alt="image" src="https://github.com/user-attachments/assets/8193c12c-5dc9-4783-b64b-6a8449a1bea4">
Closes#2866 from turboFei/record_fallback.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove unused TODO comments in CelebornTierProducerAgent#processBuffer
### Why are the changes needed?
In order for buffers to be packed together, we are going to modify the Flink side implementation to delegate buffer compression to tiers. But after discussion, we have been able to handle the case of receiving the compressed buffer on the Celeborn side, so this TODO is no longer needed.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need.
Closes#2883 from reswqa/remove_unused_todo.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. To bypass exceptions when creating clients failed in CelebornShuffleReader in spark 3.
2. Client will try the location's replicas in reading locations.
### Why are the changes needed?
Allow clients to retry locations when creating clients failed.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
Pass GA.
Closes#2854 from FMX/b1671.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
If we use celeborn shuffle service, we can't submit both batch and streaming to the same flink session cluster. This should be highlight in doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need.
Closes#2879 from reswqa/session-doc.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Follow up of [https://github.com/apache/celeborn/pull/2835]
Only use dynamic resources when candidates are not enough.
And change the way geting availableWorkers form heartbeat to requestSlots RPC to avoid the burden of heartbeat.
### Why are the changes needed?
No
### Does this PR introduce _any_ user-facing change?
Add another configuration.
### How was this patch tested?
UT
Closes#2852 from zaynt4606/clb1636-flu2.
Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Improve Celeborn CLI user guide including:
- Add license of Celeborn CLI user guide.
- Optimize the introduction of setup and usage for Celeborn CLI.
- Optimize the navigation of Celeborn CLI to combine Celeborn Ratis Shell.
### Why are the changes needed?
There is no license in Celeborn CLI user guide. Meanwhile, there are certain improvement in user guide including the license, navigation, and the introduction of setup and usage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#2875 from SteNicholas/CELEBORN-1678.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
When users deploy using the release binary as outlined in the documentation, the instructions for copying the client JAR can be unclear.
### Why are the changes needed?
No
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Closes#2877 from zaynt4606/md.
Authored-by: szt <zaynt4606@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
Followup for https://github.com/apache/celeborn/pull/2653
`checkForUnavailableWorkerTimeOutTask` and `checkForS3RemnantDirsTimeOutTask` are not assigned and always null.
<img width="834" alt="image" src="https://github.com/user-attachments/assets/747a3054-87db-458f-acf8-876926bd1883">
Combine the `checkForHDFSRemnantDirsTimeOutTask` and `checkForS3RemnantDirsTimeOutTask` with `checkForDFSRemnantDirsTimeOutTask`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2871 from turboFei/1531_followup.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
This is the last PR in the CIP-6 series.
Fix the bug when hybrid shuffle face the buffer which large then 32K.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#2873 from reswqa/11-large-buffer-10month.
Lead-authored-by: Yuxin Tan <tanyuxinwork@gmail.com>
Co-authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Add java tools.jar into classpath for JVM quake.
### Why are the changes needed?
Meet below issue with `celeborn.worker.jvmQuake.enabled=true`, see https://github.com/apache/celeborn/pull/2061
```
24/11/03 15:51:08,453 ERROR [main] Worker: Initialize worker failed.
java.lang.NoClassDefFoundError: sun/jvmstat/monitor/HostIdentifier
at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.monitoredVm$lzycompute(JVMQuake.scala:180)
at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.monitoredVm(JVMQuake.scala:179)
at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.ygcExitTimeMonitor$lzycompute(JVMQuake.scala:185)
at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.ygcExitTimeMonitor(JVMQuake.scala:184)
at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.org$apache$celeborn$service$deploy$worker$monitor$JVMQuake$$getLastExitTime(JVMQuake.scala:192)
at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake.start(JVMQuake.scala:66)
at org.apache.celeborn.service.deploy.worker.Worker.<init>(Worker.scala:360)
at org.apache.celeborn.service.deploy.worker.Worker$.main(Worker.scala:1041)
at org.apache.celeborn.service.deploy.worker.Worker.main(Worker.scala)
Caused by: java.lang.ClassNotFoundException: sun.jvmstat.monitor.HostIdentifier
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more
```
Related code:
c12e8881ab/project/JDKTools.scala (L58-L75)
Similar issue: https://github.com/vladimirvivien/jmx-cli/issues/4
After copy the `tools.jar` into worker-jars, the issue got resolved.
It is better that to involve the `tools.jar` automatically without copy.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
<img width="1202" alt="image" src="https://github.com/user-attachments/assets/af8f6c0d-9123-4a73-93b5-69836c5f826d">
Closes#2874 from turboFei/jdk_tools.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
1. Add Flink Hybrid Shuffle IT test cases
2. Fix bug in open stream.
### Why are the changes needed?
Test coverage for celeborn + hybrid shuffle
### Does this PR introduce _any_ user-facing change?
No
Closes#2859 from reswqa/10-itcase-10month.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Remove unused variables.
### Why are the changes needed?
Followup for https://github.com/apache/celeborn/pull/2688
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2872 from turboFei/1564_followup.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
1. remove scala binary version from the openapi-client artifactId.
2. skip openapi-client doc compile, it was missed in https://github.com/apache/celeborn/pull/2641
### Why are the changes needed?
Because the openapi-client is a pure java module.
### Does this PR introduce _any_ user-facing change?
No, it has not been released.
### How was this patch tested?
GA.
Closes#2861 from turboFei/remove_Scala.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Add Flink hybrid shuffle doc
### Why are the changes needed?
We need the doc for the new hybrid shuffle mode.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
no neeed.
Closes#2867 from reswqa/add-hs-doc.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Bump Spark from 3.4.3 to 3.4.4.
### Why are the changes needed?
Spark 3.4.4 has been announced to release: [Spark 3.4.4 released](https://spark.apache.org/news/spark-3-4-4-released.html). The profile spark-3.4 could bump Spark from 3.4.3 to 3.4.4.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#2851 from SteNicholas/CELEBORN-1672.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
Currently, only Flink retries establishing a client when a connection problem occurs. This would be beneficial for all other engines to implement as well.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#2855 from RexXiong/CELEBORN-1673.
Lead-authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Co-authored-by: lvshuang.xjs <lvshuang.xjs@taobao.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
adding user guide to README for cli
### Why are the changes needed?
better user experience when using CLI.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes#2862 from akpatnam25/CELEBORN-1678.
Authored-by: Aravind Patnam <akpatnam25@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
- Added a reset method for DynamicConfigServiceFactory
- Cleaned up QuotaManagerSuite
### Why are the changes needed?
Without this change we can not initialize new configService in any other tests.
Ex: test for this PR https://github.com/apache/celeborn/pull/2844 are failing because of this issue.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
Closes#2848 from s0nskar/fix_quotatest.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Support interrupt shuffle on client side.
I will develop the following functions in order
1. Client supports interrupt shuffle
2. Master supports calculating app-level shuffle usage
### Why are the changes needed?
The current storage quota logic can only limit new shuffles, and cannot limit the writing of existing shuffles. In our production environment, there is such an scenario: the cluster is small, but the user's app single shuffle is large which occupied disk resources, we want to interrupt those shuffle.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unable to test this part independently, Additional tests will be added after completing the second part.
Closes#2801 from leixm/CELEBORN-1577-1.
Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Adding IsDecommissioningWorker metric to celeborn dashboard
### Why are the changes needed?
Metric was missing from dashboard
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Tested in local grafana setup
<img width="755" alt="Screenshot 2024-10-21 at 5 19 55 PM" src="https://github.com/user-attachments/assets/7c0a2517-32a8-4565-81d8-a056d3708ac8">
Closes#2836 from s0nskar/decommision_metric.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Impl worker read process in Flink Hybrid Shuffle
### Does this PR introduce _any_ user-facing change?
No
Closes#2820 from reswqa/cip6-8-pr.
Lead-authored-by: Weijie Guo <reswqa@163.com>
Co-authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
This PR addresses a conflict in the sbt generated POM by replacing `pomExtra` with `scmInfo`
```diff
<name>org.apache.celeborn</name>
</organization>
<scm>
- <url>https://github.com/cfmcgrady/incubator-celeborn</url>
- <connection>scmhttps://github.com/cfmcgrady/incubator-celeborn.git</connection>
- <developerConnection>scmgitgithub.com:cfmcgrady/incubator-celeborn.git</developerConnection>
- </scm>
- <url>https://celeborn.apache.org/</url>
- <scm>
- <url>gitgithub.com:apache/celeborn.git</url>
- <connection>scmgitgithub.com:apache/celeborn.git</connection>
+ <url>https://celeborn.apache.org/</url>
+ <connection>scmhttps://github.com/apache/celeborn.git</connection>
+ <developerConnection>scmgitgithub.com:apache/celeborn.git</developerConnection>
</scm>
```
The conflicting POM might block publishing to a private Maven repository.
```
[error] Caused by: java.io.IOException: Server returned HTTP response code: 409 for URL: https://artifactory.devops.xxx.com/artifactory/maven-snapshots/org/apache/celeborn/celeborn-client-spark-3-shaded_2.12/0.6.0-SNAPSHOT/celeborn-client-spark-3-shaded_2.12-0.6.0-SNAPSHOT.pom
[error] at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:2000)
[error] at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589)
[error] at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
[error] at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:308)
[error] at org.apache.ivy.util.url.BasicURLHandler.upload(BasicURLHandler.java:284)
[error] at org.apache.ivy.util.url.URLHandlerDispatcher.upload(URLHandlerDispatcher.java:82)
[error] at org.apache.ivy.util.FileUtil.copy(FileUtil.java:150)
[error] at org.apache.ivy.plugins.repository.url.URLRepository.put(URLRepository.java:84)
[error] at sbt.internal.librarymanagement.ConvertResolver$LocalIfFileRepo.put(ConvertResolver.scala:407)
[error] at org.apache.ivy.plugins.repository.AbstractRepository.put(AbstractRepository.java:130)
[error] at sbt.internal.librarymanagement.ConvertResolver$ChecksumFriendlyURLResolver.put(ConvertResolver.scala:124)
[error] at sbt.internal.librarymanagement.ConvertResolver$ChecksumFriendlyURLResolver.put$(ConvertResolver.scala:111)
[error] at sbt.internal.librarymanagement.ConvertResolver$$anonfun$defaultConvert$lzycompute$1$PluginCapableResolver$1.put(ConvertResolver.scala:170)
[error] at org.apache.ivy.plugins.resolver.RepositoryResolver.publish(RepositoryResolver.java:216)
[error] at sbt.internal.librarymanagement.IvyActions$.$anonfun$publish$5(IvyActions.scala:501)
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
local
Closes#2858 from cfmcgrady/sbt-scm.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
Current TagsManager code only supported one tags for selecting tagged workers. This change will enable support of passing multiple tags to TagsManager. Multiple tags will be evaluated as "AND" expression i.e only workers tagged with all the passed tags will be selected.
Support for more schemes will be added in follow up PRs.
### Why are the changes needed?
https://cwiki.apache.org/confluence/display/CELEBORN/CIP-11+Supporting+Tags+in+Celeborn
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UTs
Closes#2850 from s0nskar/CELEBORN-1642.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Currently, the ChangePartitionManager retrieves workers from the LifeCycleManager's workerSnapshot. However, during the revival process in reallocateChangePartitionRequestSlotsFromCandidates, it does not account for newly added available workers resulting from elastic contraction and expansion. This PR addresses this issue by updating the candidate workers in the ChangePartitionManager to use the available workers reported in the heartbeat from the master.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#2835 from zaynt4606/clbdev.
Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Bumping up maven version to 3.9.9
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#2834 from avishnus/maven.
Authored-by: avishnus <avishnus@visa.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
To fix an NPE when handling the closed file writers.
### Why are the changes needed?
If a file writer stores its shuffle data in memory, the disk file info object will be null, causing NPE.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA.
Closes#2846 from FMX/b1688.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix `NullPointerException` for `PartitionFilesSorter#updateSortedShuffleFiles` after cleaning up expired shuffle key.
### Why are the changes needed?
`PartitionFilesSorter` sorts shuffle files in `worker-file-sorter-executor` thread and cleans up expired key in `worker-expired-shuffle-cleaner` thread. There is a case that after `worker-expired-shuffle-cleaner` cleaning up expired shuffle key, `worker-file-sorter-executor` updates sorted shuffle files, which causes `NullPointerException` at present.
```
2024-10-23 17:26:17,162 [INFO] [worker-expired-shuffle-cleaner] - org.apache.celeborn.service.deploy.worker.Worker -Logging.scala(51) -Cleaned up expired shuffle application_1724141892576_3843182_1-0
2024-10-23 17:26:17,392 [ERROR] [worker-file-sorter-executor-237572] - org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter -PartitionFilesSorter.java(752) -Sorting shuffle file for application_1724141892576_3843182_1-0-1875-0-0 /mnt/storage02/celeborn-worker/shuffle_data/application_1724141892576_3843182_1/0/1875-0-0 failed, detail:
java.lang.NullPointerException: null
at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter.updateSortedShuffleFiles(PartitionFilesSorter.java:455) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter$FileSorter.sort(PartitionFilesSorter.java:747) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter.lambda$new$1(PartitionFilesSorter.java:164) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_162]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_162]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_162]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_162]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_162]
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2847 from SteNicholas/CELEBORN-1669.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
This PR addressed the memory leak problem of worker nodes before registration and the NPE issue that occurs when PushDataHandler is accessed during the initialization process.


### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Pass GA
Closes#2843 from zhaostu4/zhao/worker_npe.
Authored-by: zhangzhao.08 <zhangzhao.08@bytedance.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Support ratio threshold of unhealthy disks for excluding worker with `celeborn.master.excludeWorker.unhealthyDiskRatioThreshold`.
### Why are the changes needed?
We often encounter issues such as disk input/output errors in production practice. When a bad disk occurs, the worker will be maintained to decommission for repairing the machine disk. The reason is that generally the fault will be repaired in time after it is discovered. It is possible that the machine will not trigger all disk failures if it is out of warranty. It can be replaced directly when it is under warranty. If the disk fails after it is out of warranty, you need to purchase the disk yourself for replacement. At the same time, submitting the disk for repair at one time will affect the failure rate judgment of the system group and scenario. In addition, the occurrence of bad disks will bring about some management problems, such as continuous alarms, and the handling of disk failures is relatively customized.
Therefore, it's recommended to configure ratio threshold of unhealthy disks for excluding worker, which does not need to wait for all unhealthy disks to exclude corresponding worker.
### Does this PR introduce _any_ user-facing change?
Introduce `celeborn.master.excludeWorker.unhealthyDiskRatioThreshold` to configure max ratio of unhealthy disks for excluding worker.
### How was this patch tested?
Cluster test.
Closes#2812 from SteNicholas/CELEBORN-1651.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
The version 1.0.6 is outdated and not available on Maven Central.
https://mvnrepository.com/artifact/com.thesamet/sbt-protoc_2.12_1.0
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass CI
Closes#2842 from cfmcgrady/sbt-protoc.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
Fix typo.
### Why are the changes needed?
The error message was changed in [this pull request](https://github.com/apache/celeborn/pull/1097), but the connectFail method in org.apache.celeborn.common.util.ExceptionUtils has not been updated.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No need.
Closes#2841 from jiang13021/minor-cause-typo.
Authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix a bug related to auth under master HA mode which would cause app failures when leader master restarts. Also, remove the secrets from memory after app lost.
Previous implementation add the registration & secret info in leader Master's memory, and push to other masters though https://github.com/apache/celeborn/pull/2346. After leader restarts, the info will only be in Ratis (AbstractMetaManager), however app still fetch it from new leader's memory, and would fail to get it.
Fix this by checking AbstractMetaManager's registration info if not found in memory, and properly authorize the app.
### Why are the changes needed?
When auth enabled, and leader master restart, there will be "Registration information not found" error on app side, and failed to send heartbeat to master. It will cause app to be removed on server side after heartbeat timeout, causing job to fail.
```
24/10/14 01:56:55 ERROR [celeborn-netty-rpc-connection-executor-3] client.TransportClientFactory: Exception while bootstrapping client after 71.4 ms
java.lang.RuntimeException: java.io.IOException: Exception in sendRpcSync to: celeborn-moka-test-manager-3/{ip}:9097
at org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:110)
at org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doSaslBootstrap(RegistrationClientBootstrap.java:228)
at org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doBootstrap(RegistrationClientBootstrap.java:103)
at org.apache.celeborn.common.network.client.TransportClientFactory.internalCreateClient(TransportClientFactory.java:307)
at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:205)
at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:133)
at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:212)
at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:232)
at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Exception in sendRpcSync to: celeborn-moka-test-manager-3/{ip}:9097
at org.apache.celeborn.common.network.client.TransportClient.sendRpcSync(TransportClient.java:324)
at org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:95)
... 13 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: java.lang.RuntimeException: Registration information not found for spark-402a80be70f74455b01
at org.apache.celeborn.common.network.sasl.CelebornSaslServer$DigestCallbackHandler.handle(CelebornSaslServer.java:142)
at java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
at java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
at org.apache.celeborn.common.network.sasl.CelebornSaslServer.response(CelebornSaslServer.java:84)
at org.apache.celeborn.common.network.sasl.SaslRpcHandler.doAuthChallenge(SaslRpcHandler.java:99)
at org.apache.celeborn.common.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:58)
at org.apache.celeborn.common.network.sasl.registration.RegistrationRpcHandler.processRpcMessage(RegistrationRpcHandler.java:175)
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested on dev cluster and job can properly get the secrets after master failover
Closes#2826 from YutingWang98/fix_auth_master_ha.
Authored-by: YutingWang98 <69848459+YutingWang98@users.noreply.github.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
`CommitHandler` should process `CommitFilesResponse` with `COMMIT_FILE_EXCEPTION` status.
### Why are the changes needed?
`CommitHandler` processes `CommitFilesResponse` with statuses including `SUCCESS`, `PARTIAL_SUCCESS`, `SHUFFLE_NOT_REGISTERED`, `REQUEST_FAILED` and `WORKER_EXCLUDED` at present. Meanwhile, Controller replies `CommitFilesResponse` with `COMMIT_FILE_EXCEPTION` status for throwable. Therefore, `CommitHandler` should process `COMMIT_FILE_EXCEPTION` status.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#2838 from SteNicholas/CELEBORN-1665.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Making the same changes for Spark2 codebase
### Why are the changes needed?
Followup for https://github.com/apache/celeborn/pull/2832
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Existing UTs
Closes#2837 from s0nskar/fix_register_spark2.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Only register appShuffleDeterminate if stage using celeborn for shuffle
### Why are the changes needed?
Currently we are passing stage info to lifecyclemanager, eventhough it is not required.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Existing UTs
Closes#2832 from s0nskar/fix_register.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
### Why are the changes needed?
Similar to CELEBORN-1457, `sortedFilesDb` may also fail to initialize.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes#2831 from cxzl25/CELEBORN-1661.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Add a condition at the start of the failure cause logic to check for PUSH_DATA_FAIL_PARTITION_NOT_FOUND.
### Why are the changes needed?
Currently, the getPushDataFailCause method does not identify and handle the PUSH_DATA_FAIL_PARTITION_NOT_FOUND error type. All other failure causes are explicitly checked and managed, but this specific error type is overlooked.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test
Closes#2833 from jiang13021/celeborn-1662.
Authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
To support revising lost shuffle IDs in a long-running job such as flink batch jobs.
### Why are the changes needed?
1. To support revise lost shuffles.
2. To add an HTTP endpoint to revise lost shuffles manually.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
Cluster tests.
Closes#2746 from FMX/b1600.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Dockerfile should support copying CLI jars.
### Why are the changes needed?
CLI jars are generated from `make-distribution.sh`. Therefore, Dockerfile could copy CLI jars to `/opt/celeborn/` directory.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#2823 from SteNicholas/CELEBORN-1659.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix typo in prometheus expr.
### Why are the changes needed?
Fix typo.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
<img width="1220" alt="image" src="https://github.com/user-attachments/assets/0b8649b6-163a-4868-9eb4-31a25a225d0e">
Closes#2825 from turboFei/fix_typo.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
The read buffer dispatcher may lose its dispatcher thread which is not acceptable.
### Why are the changes needed?
1. Add a scheduler pool to ensure the dispatcher thread is alive.
2. Add an unhandled exception handler to record possible exceptions that cause the thread to be lost.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
Cluster test.
Closes#2815 from FMX/b1655.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix make-distribution for SBT.
### Why are the changes needed?
same as above
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
ran `./build/make-distribution.sh --sbt-enabled -Pspark-3.5` to ensure it works
Closes#2822 from akpatnam25/CELEBORN-1659.
Authored-by: Aravind Patnam <akpatnam25@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
CongestionController support dynamic config
### Why are the changes needed?
Currently, Celeborn only supports quota management based on disk file bytes/count, and this quota management cannot cope with sudden increases in traffic, which will cause corrupt to the cluster.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
UT.
Closes#2817 from leixm/CELEBORN-1487-2.
Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
`NettyMemoryMetrics` supports `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`. Meanwhile, remove `server_` prefix from metric name of netty memory metric in `monitoring.md`.
### Why are the changes needed?
`PooledByteBufAllocatorMetric` provides the following API to support netty memory metrics:
```
public int numHeapArenas() {
return this.allocator.numHeapArenas();
}
public int numDirectArenas() {
return this.allocator.numDirectArenas();
}
public List<PoolArenaMetric> heapArenas() {
return this.allocator.heapArenas();
}
public List<PoolArenaMetric> directArenas() {
return this.allocator.directArenas();
}
public int numThreadLocalCaches() {
return this.allocator.numThreadLocalCaches();
}
public int tinyCacheSize() {
return this.allocator.tinyCacheSize();
}
public int smallCacheSize() {
return this.allocator.smallCacheSize();
}
public int normalCacheSize() {
return this.allocator.normalCacheSize();
}
public int chunkSize() {
return this.allocator.chunkSize();
}
public long usedHeapMemory() {
return this.allocator.usedHeapMemory();
}
public long usedDirectMemory() {
return this.allocator.usedDirectMemory();
}
```
`NettyMemoryMetrics` only supports `usedHeapMemory` and `usedDirectMemory`, which could support `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/a520ca36a33843a38bbde28387023f97)
Closes#2802 from SteNicholas/CELEBORN-1640.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>