### What changes were proposed in this pull request?
Bump protobuf from 3.21.7 to 3.25.5.
### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-735f-pc8j-v9w8
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2898 from turboFei/bump_protobuf.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826
### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-g8m5-722r-8whq
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA
Closes#2899 from turboFei/bump_jetty.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump commons-io from 2.13.0 to 2.17.0
### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-78wr-2p64-hpwj
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2900 from turboFei/bump_commons_io.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Start test nodes in random ports to allow multiple builds run in the same ci server, which improves the test implementations so that it starts test nodes in random ports instead of using the hardcoded ones.
Follow up #2619.
### Why are the changes needed?
The test nodes are started in the hard coded ports, this prevents to run multiple builds in the same CI/CD server at present.
Bump the changes of #2237.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`RemoteShuffleMasterSuiteJ`
Closes#2902 from SteNicholas/CELEBORN-1504.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
1. `ShuffleFallbackPolicy` supports `ShuffleFallbackCount` metric to provide the shuffle fallback count of each fallback policy.
2. Introduce `ShuffleTotalCount` metric to record the total count of shuffle.
3. Fix Spark 2 does not increment shuffle count via `LifecycleManager`.
### Why are the changes needed?
The implementations of `ShuffleFallbackPolicy` does not support `ShuffleFallbackCount` metric at present. Meanwhile, Bilibili production practice needs `ShuffleFallbackCount` of different `ShuffleFallbackPolicy`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.
Closes#2891 from SteNicholas/CELEBORN-1685.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
### Why are the changes needed?
The close method of `SortBasedShuffleWriter#write` will call `sendBufferPool.returnPushTaskQueue(dataPusher.getIdleQueue());`, but the close method may be interrupted.
After the interruption, `SortBasedShuffleWriter#cleanupPusher` will be called, and `sendBufferPool.returnPushTaskQueue(dataPusher.getIdleQueue());` will also be called.
Since `SendBufferPool#pushTaskQueues` is a `LinkedList`, repeated add will store two identical `idleQueue`, which may cause multiple tasks running in parallel to share the same `idleQueue`, resulting in inaccurate data.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Production environment verification
Closes#2878 from cxzl25/CELEBORN-1686.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove file ino in `StorageManager#cleanFile`
### Why are the changes needed?
StorageManager#cleanFile should remove file info
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need.
Closes#2887 from reswqa/fix_clean_file.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix the issue that upstream tasks don't rerun and the current task still retry when failed to decompress in flink
### Why are the changes needed?
Decompress error should retry upstream otherwise this is un-recoverable.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually
Closes#2884 from reswqa/fix_decompress.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Improve `ThreadStackTrace` with `synchronizers`, `monitors`, `lockName`, `lockOwnerName`, `suspended`, `inNative` for thread dump.
### Why are the changes needed?
ThreadStackTrace does not support stack trace including `synchronizers`, `monitors`, `lockName`, `lockOwnerName`, `suspended`, `inNative` at present. It's recommend to improve `ThreadStackTrace` of thread dump for more details of thread stack trace.
### Does this PR introduce _any_ user-facing change?
The response of `ThreadStack` in `/api/v1/thread_dump` adds `synchronizers`, `monitors`, `lockName`, `lockOwnerName`, `suspended`, `inNative` fields.
Cherry pick:
- https://github.com/apache/spark/pull/42575
- https://github.com/apache/spark/pull/43095
### How was this patch tested?
`ApiV1BaseResourceSuite#thread_dump`
Closes#2888 from SteNicholas/CELEBORN-1697.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix docs typo.
### Why are the changes needed?
Fix docs typo.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2890 from turboFei/nit.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Set mount point in fromPbFileInfoMap
### Why are the changes needed?
Fix wrong fileInfo mountpint
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#2885 from reswqa/set-mount-point.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Fix storageFetcherPool concurrent problem.
There may be duplicate thread pools created as multi-thread race condition.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need.
Closes#2886 from reswqa/storageFetcherPool.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
As title, introduce metrics_ShuffleFallbackCount_Value.
### Why are the changes needed?
To provide the insights that how many shuffles fallback to spark built-in shuffle service. It is helpful for us to deprecate the ESS progressively.
Currently, we plan to set the `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` to fallback the shuffle with too large shuffle partitions number, for example: 50k.
In the future, we plan to limit the acceptable maximum shuffle partition number so that the bad job would be rejected and not impact the celeborn master health.
### Does this PR introduce _any_ user-facing change?
Yes, new metrics.
### How was this patch tested?
UT.
<img width="1188" alt="image" src="https://github.com/user-attachments/assets/8193c12c-5dc9-4783-b64b-6a8449a1bea4">
Closes#2866 from turboFei/record_fallback.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove unused TODO comments in CelebornTierProducerAgent#processBuffer
### Why are the changes needed?
In order for buffers to be packed together, we are going to modify the Flink side implementation to delegate buffer compression to tiers. But after discussion, we have been able to handle the case of receiving the compressed buffer on the Celeborn side, so this TODO is no longer needed.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need.
Closes#2883 from reswqa/remove_unused_todo.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. To bypass exceptions when creating clients failed in CelebornShuffleReader in spark 3.
2. Client will try the location's replicas in reading locations.
### Why are the changes needed?
Allow clients to retry locations when creating clients failed.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
Pass GA.
Closes#2854 from FMX/b1671.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
If we use celeborn shuffle service, we can't submit both batch and streaming to the same flink session cluster. This should be highlight in doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need.
Closes#2879 from reswqa/session-doc.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Follow up of [https://github.com/apache/celeborn/pull/2835]
Only use dynamic resources when candidates are not enough.
And change the way geting availableWorkers form heartbeat to requestSlots RPC to avoid the burden of heartbeat.
### Why are the changes needed?
No
### Does this PR introduce _any_ user-facing change?
Add another configuration.
### How was this patch tested?
UT
Closes#2852 from zaynt4606/clb1636-flu2.
Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Improve Celeborn CLI user guide including:
- Add license of Celeborn CLI user guide.
- Optimize the introduction of setup and usage for Celeborn CLI.
- Optimize the navigation of Celeborn CLI to combine Celeborn Ratis Shell.
### Why are the changes needed?
There is no license in Celeborn CLI user guide. Meanwhile, there are certain improvement in user guide including the license, navigation, and the introduction of setup and usage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#2875 from SteNicholas/CELEBORN-1678.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
When users deploy using the release binary as outlined in the documentation, the instructions for copying the client JAR can be unclear.
### Why are the changes needed?
No
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Closes#2877 from zaynt4606/md.
Authored-by: szt <zaynt4606@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
Followup for https://github.com/apache/celeborn/pull/2653
`checkForUnavailableWorkerTimeOutTask` and `checkForS3RemnantDirsTimeOutTask` are not assigned and always null.
<img width="834" alt="image" src="https://github.com/user-attachments/assets/747a3054-87db-458f-acf8-876926bd1883">
Combine the `checkForHDFSRemnantDirsTimeOutTask` and `checkForS3RemnantDirsTimeOutTask` with `checkForDFSRemnantDirsTimeOutTask`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2871 from turboFei/1531_followup.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
This is the last PR in the CIP-6 series.
Fix the bug when hybrid shuffle face the buffer which large then 32K.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#2873 from reswqa/11-large-buffer-10month.
Lead-authored-by: Yuxin Tan <tanyuxinwork@gmail.com>
Co-authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Add java tools.jar into classpath for JVM quake.
### Why are the changes needed?
Meet below issue with `celeborn.worker.jvmQuake.enabled=true`, see https://github.com/apache/celeborn/pull/2061
```
24/11/03 15:51:08,453 ERROR [main] Worker: Initialize worker failed.
java.lang.NoClassDefFoundError: sun/jvmstat/monitor/HostIdentifier
at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.monitoredVm$lzycompute(JVMQuake.scala:180)
at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.monitoredVm(JVMQuake.scala:179)
at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.ygcExitTimeMonitor$lzycompute(JVMQuake.scala:185)
at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.ygcExitTimeMonitor(JVMQuake.scala:184)
at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake$.org$apache$celeborn$service$deploy$worker$monitor$JVMQuake$$getLastExitTime(JVMQuake.scala:192)
at org.apache.celeborn.service.deploy.worker.monitor.JVMQuake.start(JVMQuake.scala:66)
at org.apache.celeborn.service.deploy.worker.Worker.<init>(Worker.scala:360)
at org.apache.celeborn.service.deploy.worker.Worker$.main(Worker.scala:1041)
at org.apache.celeborn.service.deploy.worker.Worker.main(Worker.scala)
Caused by: java.lang.ClassNotFoundException: sun.jvmstat.monitor.HostIdentifier
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more
```
Related code:
c12e8881ab/project/JDKTools.scala (L58-L75)
Similar issue: https://github.com/vladimirvivien/jmx-cli/issues/4
After copy the `tools.jar` into worker-jars, the issue got resolved.
It is better that to involve the `tools.jar` automatically without copy.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
<img width="1202" alt="image" src="https://github.com/user-attachments/assets/af8f6c0d-9123-4a73-93b5-69836c5f826d">
Closes#2874 from turboFei/jdk_tools.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
1. Add Flink Hybrid Shuffle IT test cases
2. Fix bug in open stream.
### Why are the changes needed?
Test coverage for celeborn + hybrid shuffle
### Does this PR introduce _any_ user-facing change?
No
Closes#2859 from reswqa/10-itcase-10month.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Remove unused variables.
### Why are the changes needed?
Followup for https://github.com/apache/celeborn/pull/2688
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2872 from turboFei/1564_followup.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
1. remove scala binary version from the openapi-client artifactId.
2. skip openapi-client doc compile, it was missed in https://github.com/apache/celeborn/pull/2641
### Why are the changes needed?
Because the openapi-client is a pure java module.
### Does this PR introduce _any_ user-facing change?
No, it has not been released.
### How was this patch tested?
GA.
Closes#2861 from turboFei/remove_Scala.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Add Flink hybrid shuffle doc
### Why are the changes needed?
We need the doc for the new hybrid shuffle mode.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
no neeed.
Closes#2867 from reswqa/add-hs-doc.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Bump Spark from 3.4.3 to 3.4.4.
### Why are the changes needed?
Spark 3.4.4 has been announced to release: [Spark 3.4.4 released](https://spark.apache.org/news/spark-3-4-4-released.html). The profile spark-3.4 could bump Spark from 3.4.3 to 3.4.4.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#2851 from SteNicholas/CELEBORN-1672.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
Currently, only Flink retries establishing a client when a connection problem occurs. This would be beneficial for all other engines to implement as well.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#2855 from RexXiong/CELEBORN-1673.
Lead-authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Co-authored-by: lvshuang.xjs <lvshuang.xjs@taobao.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
adding user guide to README for cli
### Why are the changes needed?
better user experience when using CLI.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes#2862 from akpatnam25/CELEBORN-1678.
Authored-by: Aravind Patnam <akpatnam25@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
- Added a reset method for DynamicConfigServiceFactory
- Cleaned up QuotaManagerSuite
### Why are the changes needed?
Without this change we can not initialize new configService in any other tests.
Ex: test for this PR https://github.com/apache/celeborn/pull/2844 are failing because of this issue.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
Closes#2848 from s0nskar/fix_quotatest.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Support interrupt shuffle on client side.
I will develop the following functions in order
1. Client supports interrupt shuffle
2. Master supports calculating app-level shuffle usage
### Why are the changes needed?
The current storage quota logic can only limit new shuffles, and cannot limit the writing of existing shuffles. In our production environment, there is such an scenario: the cluster is small, but the user's app single shuffle is large which occupied disk resources, we want to interrupt those shuffle.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unable to test this part independently, Additional tests will be added after completing the second part.
Closes#2801 from leixm/CELEBORN-1577-1.
Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Adding IsDecommissioningWorker metric to celeborn dashboard
### Why are the changes needed?
Metric was missing from dashboard
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Tested in local grafana setup
<img width="755" alt="Screenshot 2024-10-21 at 5 19 55 PM" src="https://github.com/user-attachments/assets/7c0a2517-32a8-4565-81d8-a056d3708ac8">
Closes#2836 from s0nskar/decommision_metric.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Impl worker read process in Flink Hybrid Shuffle
### Does this PR introduce _any_ user-facing change?
No
Closes#2820 from reswqa/cip6-8-pr.
Lead-authored-by: Weijie Guo <reswqa@163.com>
Co-authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
This PR addresses a conflict in the sbt generated POM by replacing `pomExtra` with `scmInfo`
```diff
<name>org.apache.celeborn</name>
</organization>
<scm>
- <url>https://github.com/cfmcgrady/incubator-celeborn</url>
- <connection>scmhttps://github.com/cfmcgrady/incubator-celeborn.git</connection>
- <developerConnection>scmgitgithub.com:cfmcgrady/incubator-celeborn.git</developerConnection>
- </scm>
- <url>https://celeborn.apache.org/</url>
- <scm>
- <url>gitgithub.com:apache/celeborn.git</url>
- <connection>scmgitgithub.com:apache/celeborn.git</connection>
+ <url>https://celeborn.apache.org/</url>
+ <connection>scmhttps://github.com/apache/celeborn.git</connection>
+ <developerConnection>scmgitgithub.com:apache/celeborn.git</developerConnection>
</scm>
```
The conflicting POM might block publishing to a private Maven repository.
```
[error] Caused by: java.io.IOException: Server returned HTTP response code: 409 for URL: https://artifactory.devops.xxx.com/artifactory/maven-snapshots/org/apache/celeborn/celeborn-client-spark-3-shaded_2.12/0.6.0-SNAPSHOT/celeborn-client-spark-3-shaded_2.12-0.6.0-SNAPSHOT.pom
[error] at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:2000)
[error] at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589)
[error] at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
[error] at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:308)
[error] at org.apache.ivy.util.url.BasicURLHandler.upload(BasicURLHandler.java:284)
[error] at org.apache.ivy.util.url.URLHandlerDispatcher.upload(URLHandlerDispatcher.java:82)
[error] at org.apache.ivy.util.FileUtil.copy(FileUtil.java:150)
[error] at org.apache.ivy.plugins.repository.url.URLRepository.put(URLRepository.java:84)
[error] at sbt.internal.librarymanagement.ConvertResolver$LocalIfFileRepo.put(ConvertResolver.scala:407)
[error] at org.apache.ivy.plugins.repository.AbstractRepository.put(AbstractRepository.java:130)
[error] at sbt.internal.librarymanagement.ConvertResolver$ChecksumFriendlyURLResolver.put(ConvertResolver.scala:124)
[error] at sbt.internal.librarymanagement.ConvertResolver$ChecksumFriendlyURLResolver.put$(ConvertResolver.scala:111)
[error] at sbt.internal.librarymanagement.ConvertResolver$$anonfun$defaultConvert$lzycompute$1$PluginCapableResolver$1.put(ConvertResolver.scala:170)
[error] at org.apache.ivy.plugins.resolver.RepositoryResolver.publish(RepositoryResolver.java:216)
[error] at sbt.internal.librarymanagement.IvyActions$.$anonfun$publish$5(IvyActions.scala:501)
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
local
Closes#2858 from cfmcgrady/sbt-scm.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
Current TagsManager code only supported one tags for selecting tagged workers. This change will enable support of passing multiple tags to TagsManager. Multiple tags will be evaluated as "AND" expression i.e only workers tagged with all the passed tags will be selected.
Support for more schemes will be added in follow up PRs.
### Why are the changes needed?
https://cwiki.apache.org/confluence/display/CELEBORN/CIP-11+Supporting+Tags+in+Celeborn
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UTs
Closes#2850 from s0nskar/CELEBORN-1642.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Currently, the ChangePartitionManager retrieves workers from the LifeCycleManager's workerSnapshot. However, during the revival process in reallocateChangePartitionRequestSlotsFromCandidates, it does not account for newly added available workers resulting from elastic contraction and expansion. This PR addresses this issue by updating the candidate workers in the ChangePartitionManager to use the available workers reported in the heartbeat from the master.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#2835 from zaynt4606/clbdev.
Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Bumping up maven version to 3.9.9
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#2834 from avishnus/maven.
Authored-by: avishnus <avishnus@visa.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
To fix an NPE when handling the closed file writers.
### Why are the changes needed?
If a file writer stores its shuffle data in memory, the disk file info object will be null, causing NPE.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA.
Closes#2846 from FMX/b1688.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix `NullPointerException` for `PartitionFilesSorter#updateSortedShuffleFiles` after cleaning up expired shuffle key.
### Why are the changes needed?
`PartitionFilesSorter` sorts shuffle files in `worker-file-sorter-executor` thread and cleans up expired key in `worker-expired-shuffle-cleaner` thread. There is a case that after `worker-expired-shuffle-cleaner` cleaning up expired shuffle key, `worker-file-sorter-executor` updates sorted shuffle files, which causes `NullPointerException` at present.
```
2024-10-23 17:26:17,162 [INFO] [worker-expired-shuffle-cleaner] - org.apache.celeborn.service.deploy.worker.Worker -Logging.scala(51) -Cleaned up expired shuffle application_1724141892576_3843182_1-0
2024-10-23 17:26:17,392 [ERROR] [worker-file-sorter-executor-237572] - org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter -PartitionFilesSorter.java(752) -Sorting shuffle file for application_1724141892576_3843182_1-0-1875-0-0 /mnt/storage02/celeborn-worker/shuffle_data/application_1724141892576_3843182_1/0/1875-0-0 failed, detail:
java.lang.NullPointerException: null
at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter.updateSortedShuffleFiles(PartitionFilesSorter.java:455) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter$FileSorter.sort(PartitionFilesSorter.java:747) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter.lambda$new$1(PartitionFilesSorter.java:164) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_162]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_162]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_162]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_162]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_162]
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2847 from SteNicholas/CELEBORN-1669.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
This PR addressed the memory leak problem of worker nodes before registration and the NPE issue that occurs when PushDataHandler is accessed during the initialization process.


### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Pass GA
Closes#2843 from zhaostu4/zhao/worker_npe.
Authored-by: zhangzhao.08 <zhangzhao.08@bytedance.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Support ratio threshold of unhealthy disks for excluding worker with `celeborn.master.excludeWorker.unhealthyDiskRatioThreshold`.
### Why are the changes needed?
We often encounter issues such as disk input/output errors in production practice. When a bad disk occurs, the worker will be maintained to decommission for repairing the machine disk. The reason is that generally the fault will be repaired in time after it is discovered. It is possible that the machine will not trigger all disk failures if it is out of warranty. It can be replaced directly when it is under warranty. If the disk fails after it is out of warranty, you need to purchase the disk yourself for replacement. At the same time, submitting the disk for repair at one time will affect the failure rate judgment of the system group and scenario. In addition, the occurrence of bad disks will bring about some management problems, such as continuous alarms, and the handling of disk failures is relatively customized.
Therefore, it's recommended to configure ratio threshold of unhealthy disks for excluding worker, which does not need to wait for all unhealthy disks to exclude corresponding worker.
### Does this PR introduce _any_ user-facing change?
Introduce `celeborn.master.excludeWorker.unhealthyDiskRatioThreshold` to configure max ratio of unhealthy disks for excluding worker.
### How was this patch tested?
Cluster test.
Closes#2812 from SteNicholas/CELEBORN-1651.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
The version 1.0.6 is outdated and not available on Maven Central.
https://mvnrepository.com/artifact/com.thesamet/sbt-protoc_2.12_1.0
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass CI
Closes#2842 from cfmcgrady/sbt-protoc.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
Fix typo.
### Why are the changes needed?
The error message was changed in [this pull request](https://github.com/apache/celeborn/pull/1097), but the connectFail method in org.apache.celeborn.common.util.ExceptionUtils has not been updated.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No need.
Closes#2841 from jiang13021/minor-cause-typo.
Authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix a bug related to auth under master HA mode which would cause app failures when leader master restarts. Also, remove the secrets from memory after app lost.
Previous implementation add the registration & secret info in leader Master's memory, and push to other masters though https://github.com/apache/celeborn/pull/2346. After leader restarts, the info will only be in Ratis (AbstractMetaManager), however app still fetch it from new leader's memory, and would fail to get it.
Fix this by checking AbstractMetaManager's registration info if not found in memory, and properly authorize the app.
### Why are the changes needed?
When auth enabled, and leader master restart, there will be "Registration information not found" error on app side, and failed to send heartbeat to master. It will cause app to be removed on server side after heartbeat timeout, causing job to fail.
```
24/10/14 01:56:55 ERROR [celeborn-netty-rpc-connection-executor-3] client.TransportClientFactory: Exception while bootstrapping client after 71.4 ms
java.lang.RuntimeException: java.io.IOException: Exception in sendRpcSync to: celeborn-moka-test-manager-3/{ip}:9097
at org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:110)
at org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doSaslBootstrap(RegistrationClientBootstrap.java:228)
at org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doBootstrap(RegistrationClientBootstrap.java:103)
at org.apache.celeborn.common.network.client.TransportClientFactory.internalCreateClient(TransportClientFactory.java:307)
at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:205)
at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:133)
at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:212)
at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:232)
at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Exception in sendRpcSync to: celeborn-moka-test-manager-3/{ip}:9097
at org.apache.celeborn.common.network.client.TransportClient.sendRpcSync(TransportClient.java:324)
at org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:95)
... 13 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: java.lang.RuntimeException: Registration information not found for spark-402a80be70f74455b01
at org.apache.celeborn.common.network.sasl.CelebornSaslServer$DigestCallbackHandler.handle(CelebornSaslServer.java:142)
at java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
at java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
at org.apache.celeborn.common.network.sasl.CelebornSaslServer.response(CelebornSaslServer.java:84)
at org.apache.celeborn.common.network.sasl.SaslRpcHandler.doAuthChallenge(SaslRpcHandler.java:99)
at org.apache.celeborn.common.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:58)
at org.apache.celeborn.common.network.sasl.registration.RegistrationRpcHandler.processRpcMessage(RegistrationRpcHandler.java:175)
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested on dev cluster and job can properly get the secrets after master failover
Closes#2826 from YutingWang98/fix_auth_master_ha.
Authored-by: YutingWang98 <69848459+YutingWang98@users.noreply.github.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
`CommitHandler` should process `CommitFilesResponse` with `COMMIT_FILE_EXCEPTION` status.
### Why are the changes needed?
`CommitHandler` processes `CommitFilesResponse` with statuses including `SUCCESS`, `PARTIAL_SUCCESS`, `SHUFFLE_NOT_REGISTERED`, `REQUEST_FAILED` and `WORKER_EXCLUDED` at present. Meanwhile, Controller replies `CommitFilesResponse` with `COMMIT_FILE_EXCEPTION` status for throwable. Therefore, `CommitHandler` should process `COMMIT_FILE_EXCEPTION` status.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#2838 from SteNicholas/CELEBORN-1665.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Making the same changes for Spark2 codebase
### Why are the changes needed?
Followup for https://github.com/apache/celeborn/pull/2832
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Existing UTs
Closes#2837 from s0nskar/fix_register_spark2.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>