### What changes were proposed in this pull request?
Support Flink 2.1.
### Why are the changes needed?
Flink 2.1 has already been released; the release notes are available at [Release notes - Flink 2.1](https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-2.1).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes#3404 from SteNicholas/CELEBORN-2093.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Support `MapPartitionData` on DFS.
### Why are the changes needed?
`MapPartitionData` is currently supported only on local storage, not on DFS. It's recommended to support `MapPartitionData` on DFS so that MapPartition mode aligns with the capabilities of ReducePartition mode.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`WordCountTestWithHDFS`.
Closes#3349 from SteNicholas/CELEBORN-2047.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Flink from 1.19.2 to 1.19.3 and from 1.20.1 to 1.20.2.
### Why are the changes needed?
Flink has released v1.19.3 and v1.20.2; the release announcements are:
- [Apache Flink 1.19.3 Release Announcement](https://flink.apache.org/2025/07/10/apache-flink-1.19.3-release-announcement/)
- [Apache Flink 1.20.2 Release Announcement](https://flink.apache.org/2025/07/10/apache-flink-1.20.2-release-announcement/)
Flink v1.19.3 adds the `getConsumedPartitionType()` interface to `IndexedInputGate`; see https://github.com/apache/flink/pull/26548.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes#3385 from SteNicholas/CELEBORN-2080.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Bump ap-loader version from 3.0-9 to 4.0-10.
### Why are the changes needed?
`ap-loader` has already released v4.0-10; see the release note [Loader for 4.0 (v10): Heatmaps and Native memory profiling](https://github.com/jvm-profiling-tools/ap-loader/releases/tag/4.0-10). The version should be bumped from 3.0-9 to 4.0-10 for `JVMProfiler`.
Backport of https://github.com/apache/spark/pull/51257.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes#3359 from SteNicholas/CELEBORN-2057.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Spark from 3.5.5 to 3.5.6.
### Why are the changes needed?
Spark 3.5.6 has been released: [Spark 3.5.6 released](https://spark.apache.org/news/spark-3-5-6-released.html). The `spark-3.5` profile can bump Spark from 3.5.5 to 3.5.6.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes#3319 from SteNicholas/CELEBORN-2030.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Bump Spark 4.0 version to 4.0.0.
### Why are the changes needed?
Spark 4.0.0 has been released.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#3282 from turboFei/spark_4.0.
Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Add release guide and fix several issues during 0.6.0 release.
### Why are the changes needed?
Add docs.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested locally.
Closes#3271 from turboFei/release_guide.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Fei Wang <fwang12@ebay.com>
### What changes were proposed in this pull request?
Introduce disruptor dependency to support asynchronous logging of log4j2.
### Why are the changes needed?
We add `-Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector` to `CELEBORN_MASTER_JAVA_OPTS` and `CELEBORN_WORKER_JAVA_OPTS` in production environments. `AsyncLoggerContextSelector` depends on the disruptor library. Therefore, it's recommended to introduce the disruptor dependency to support log4j2 asynchronous loggers.
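For reference, the setup described above can be sketched as the following env fragment (the variable names are the ones quoted in this description; the exact file layout of `celeborn-env.sh` is an assumption):

```shell
# celeborn-env.sh (sketch): route log4j2 through the all-async logger context.
# AsyncLoggerContextSelector requires the LMAX disruptor jar on the classpath,
# which is what this PR adds as a dependency.
CELEBORN_MASTER_JAVA_OPTS="-Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector"
CELEBORN_WORKER_JAVA_OPTS="-Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector"
```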
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.
Closes#3246 from SteNicholas/CELEBORN-1994.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
After profiling to see where the hotspots are in slot selection, we identified 2 main areas:
- `iter.remove` ([link](https://github.com/apache/celeborn/blob/main/master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java#L447)) is a major hotspot, especially when `partitionIdList` is massive: since it is an `ArrayList` and we remove from the beginning, each deletion costs O(n).
- `haveDisk` is computed per partition ID, iterating across all workers. We precompute this and store it as a field in `WorkerInfo`.
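The first hotspot can be illustrated in isolation. This is a minimal sketch (not the actual `SlotsAllocator` code): consuming an `ArrayList` via `iterator().remove()` shifts the remaining elements on every call, so the whole pass is O(n^2), while draining an `ArrayDeque` is O(1) per element.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class HeadRemovalSketch {
    // Pattern from the hotspot: consume an ArrayList while removing from the
    // front through its iterator. Each remove() triggers an arraycopy of all
    // remaining elements (the oop_disjoint_arraycopy seen in the flamegraph).
    static long consumeWithIteratorRemove(List<Integer> ids) {
        long sum = 0;
        Iterator<Integer> iter = ids.iterator();
        while (iter.hasNext()) {
            sum += iter.next();
            iter.remove(); // O(n) per removal on an ArrayList
        }
        return sum;
    }

    // Alternative: drain an ArrayDeque, where pollFirst() is O(1).
    static long consumeWithDeque(List<Integer> ids) {
        Deque<Integer> queue = new ArrayDeque<>(ids);
        long sum = 0;
        Integer id;
        while ((id = queue.pollFirst()) != null) {
            sum += id;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> a = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) a.add(i);
        List<Integer> b = new ArrayList<>(a);
        long s1 = consumeWithIteratorRemove(a);
        long s2 = consumeWithDeque(b);
        // Both strategies consume the same elements; only the cost differs.
        System.out.println(s1 == s2 && a.isEmpty());
    }
}
```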
See the below flamegraph for the hotspot of `iter.remove` (`oop_disjoint_arraycopy`) after running a benchmark.

Below is what we actually observed in production which matches with the above observation from the benchmark:

### Why are the changes needed?
Speed up slot selection in the case of large partition ID lists.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
After applying the above changes, we can see the hotspot is removed in the flamegraph:

Benchmarks:
Without changes:
```
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection
# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration 1: 2060198.745 ±(99.9%) 306976.270 us/op
# Warmup Iteration 2: 1137534.950 ±(99.9%) 72065.776 us/op
# Warmup Iteration 3: 1032434.221 ±(99.9%) 59585.256 us/op
# Warmup Iteration 4: 903621.382 ±(99.9%) 41542.172 us/op
# Warmup Iteration 5: 921816.398 ±(99.9%) 44025.884 us/op
Iteration 1: 853276.360 ±(99.9%) 13285.688 us/op
Iteration 2: 865183.111 ±(99.9%) 9691.856 us/op
Iteration 3: 909971.254 ±(99.9%) 10201.037 us/op
Iteration 4: 874154.240 ±(99.9%) 11287.538 us/op
Iteration 5: 907655.363 ±(99.9%) 11893.789 us/op
Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
882048.066 ±(99.9%) 98360.936 us/op [Average]
(min, avg, max) = (853276.360, 882048.066, 909971.254), stdev = 25544.023
CI (99.9%): [783687.130, 980409.001] (assumes normal distribution)
# Run complete. Total time: 00:05:43
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
SlotsAllocatorBenchmark.benchmarkSlotSelection avgt 5 882048.066 ± 98360.936 us/op
Process finished with exit code 0
```
With changes:
```
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection
# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration 1: 305437.719 ±(99.9%) 81860.733 us/op
# Warmup Iteration 2: 137498.811 ±(99.9%) 7669.102 us/op
# Warmup Iteration 3: 129355.869 ±(99.9%) 5030.972 us/op
# Warmup Iteration 4: 135311.734 ±(99.9%) 6964.080 us/op
# Warmup Iteration 5: 131013.323 ±(99.9%) 8560.232 us/op
Iteration 1: 133695.396 ±(99.9%) 3713.684 us/op
Iteration 2: 143735.961 ±(99.9%) 5858.078 us/op
Iteration 3: 135619.704 ±(99.9%) 5257.352 us/op
Iteration 4: 128806.160 ±(99.9%) 4541.790 us/op
Iteration 5: 134179.546 ±(99.9%) 5137.425 us/op
Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
135207.354 ±(99.9%) 20845.544 us/op [Average]
(min, avg, max) = (128806.160, 135207.354, 143735.961), stdev = 5413.522
CI (99.9%): [114361.809, 156052.898] (assumes normal distribution)
# Run complete. Total time: 00:05:29
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
SlotsAllocatorBenchmark.benchmarkSlotSelection avgt 5 135207.354 ± 20845.544 us/op
Process finished with exit code 0
```
882048.066 us/op without changes vs 135207.354 us/op with changes: about a 6.5x improvement.
Closes#3228 from akpatnam25/CELEBORN-1982.
Lead-authored-by: Aravind Patnam <akpatnam25@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
- close [CELEBORN-1916](https://issues.apache.org/jira/browse/CELEBORN-1916)
- This PR extends the Multipart Uploader (MPU) interface to support Aliyun OSS.
### Why are the changes needed?
- Implemented multipart-uploader-oss module based on the existing MPU extension interface.
- Added necessary configurations and dependencies for Aliyun OSS integration.
- Ensured compatibility with the existing multipart-uploader framework.
- This enhancement allows seamless multipart upload functionality for Aliyun OSS, similar to the existing AWS S3 support.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Deployment integration testing has been completed in the local environment.
Closes#3157 from shouwangyw/optimize/mpu-oss.
Lead-authored-by: veli.yang <897900564@qq.com>
Co-authored-by: yangwei <897900564@qq.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Support Flink 2.0. The major changes of Flink 2.0 include:
- https://github.com/apache/flink/pull/25406: Bump target Java version to 11 and drop support for Java 8.
- https://github.com/apache/flink/pull/25551: Replace `InputGateDeploymentDescriptor#getConsumedSubpartitionIndexRange` with `InputGateDeploymentDescriptor#getConsumedSubpartitionRange(index)`.
- https://github.com/apache/flink/pull/25314: Replace `NettyShuffleEnvironmentOptions#NETWORK_EXCLUSIVE_BUFFERS_REQUEST_TIMEOUT_MILLISECONDS` with `NettyShuffleEnvironmentOptions#NETWORK_BUFFERS_REQUEST_TIMEOUT`.
- https://github.com/apache/flink/pull/25731: Introduce `InputGate#resumeGateConsumption`.
### Why are the changes needed?
Flink 2.0 has been released; see [Release notes - Flink 2.0](https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-2.0).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes#3179 from SteNicholas/CELEBORN-1925.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
### What changes were proposed in this pull request?
The `celeborn-flink-it` project should set the `FLINK_VERSION` environment variable for `HybridShuffleWordCountTest`.
Follow-up of #2662.
### Why are the changes needed?
Because the `FLINK_VERSION` environment variable is missing, the sbt Flink CI job cancels the `HybridShuffleWordCountTest` tests as follows:
```
[info] - Celeborn Flink Hybrid Shuffle Integration test(Local) - word count !!! CANCELED !!!
[info] "" was empty (HybridShuffleWordCountTest.scala:75)
[info] org.scalatest.exceptions.TestCanceledException:
[info] at org.scalatest.Assertions.newTestCanceledException(Assertions.scala:475)
[info] at org.scalatest.Assertions.newTestCanceledException$(Assertions.scala:474)
[info] at org.scalatest.Assertions$.newTestCanceledException(Assertions.scala:1231)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssume(Assertions.scala:1310)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.assumeFlinkVersion(HybridShuffleWordCountTest.scala:75)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.$anonfun$new$1(HybridShuffleWordCountTest.scala:62)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info] at scala.collection.immutable.List.foreach(List.scala:431)
[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info] at org.scalatest.Suite.run(Suite.scala:1114)
[info] at org.scalatest.Suite.run$(Suite.scala:1096)
[info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.org$scalatest$BeforeAndAfterAll$$super$run(HybridShuffleWordCountTest.scala:37)
[info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.run(HybridShuffleWordCountTest.scala:37)
[info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[info] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[info] at java.base/java.lang.Thread.run(Thread.java:829)
[info] - Celeborn Flink Hybrid Shuffle Integration test(Flink mini cluster) single tier - word count !!! CANCELED !!!
[info] "" was empty (HybridShuffleWordCountTest.scala:75)
[info] org.scalatest.exceptions.TestCanceledException:
[info] at org.scalatest.Assertions.newTestCanceledException(Assertions.scala:475)
[info] at org.scalatest.Assertions.newTestCanceledException$(Assertions.scala:474)
[info] at org.scalatest.Assertions$.newTestCanceledException(Assertions.scala:1231)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssume(Assertions.scala:1310)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.assumeFlinkVersion(HybridShuffleWordCountTest.scala:75)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.$anonfun$new$2(HybridShuffleWordCountTest.scala:68)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info] at scala.collection.immutable.List.foreach(List.scala:431)
[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info] at org.scalatest.Suite.run(Suite.scala:1114)
[info] at org.scalatest.Suite.run$(Suite.scala:1096)
[info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.org$scalatest$BeforeAndAfterAll$$super$run(HybridShuffleWordCountTest.scala:37)
[info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.run(HybridShuffleWordCountTest.scala:37)
[info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[info] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[info] at java.base/java.lang.Thread.run(Thread.java:829)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`HybridShuffleWordCountTest`
- Flink 1.16
```
[info] - Celeborn Flink Hybrid Shuffle Integration test(Local) - word count !!! CANCELED !!!
[info] "1.16.3" was not empty, but FlinkVersion.fromVersionStr(scala.Predef.refArrayOps[String](scala.Predef.refArrayOps[String](flinkVersion.split("\\.")).take(2)).mkString(".")).isNewerOrEqualVersionThan(v1_20) was false (HybridShuffleWordCountTest.scala:75)
[info] org.scalatest.exceptions.TestCanceledException:
[info] at org.scalatest.Assertions.newTestCanceledException(Assertions.scala:475)
[info] at org.scalatest.Assertions.newTestCanceledException$(Assertions.scala:474)
[info] at org.scalatest.Assertions$.newTestCanceledException(Assertions.scala:1231)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssume(Assertions.scala:1310)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.assumeFlinkVersion(HybridShuffleWordCountTest.scala:75)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.$anonfun$new$1(HybridShuffleWordCountTest.scala:62)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info] at scala.collection.immutable.List.foreach(List.scala:431)
[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info] at org.scalatest.Suite.run(Suite.scala:1114)
[info] at org.scalatest.Suite.run$(Suite.scala:1096)
[info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.org$scalatest$BeforeAndAfterAll$$super$run(HybridShuffleWordCountTest.scala:37)
[info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.run(HybridShuffleWordCountTest.scala:37)
[info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[info] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[info] at java.base/java.lang.Thread.run(Thread.java:829)
[info] - Celeborn Flink Hybrid Shuffle Integration test(Flink mini cluster) single tier - word count !!! CANCELED !!!
[info] "1.16.3" was not empty, but FlinkVersion.fromVersionStr(scala.Predef.refArrayOps[String](scala.Predef.refArrayOps[String](flinkVersion.split("\\.")).take(2)).mkString(".")).isNewerOrEqualVersionThan(v1_20) was false (HybridShuffleWordCountTest.scala:75)
[info] org.scalatest.exceptions.TestCanceledException:
[info] at org.scalatest.Assertions.newTestCanceledException(Assertions.scala:475)
[info] at org.scalatest.Assertions.newTestCanceledException$(Assertions.scala:474)
[info] at org.scalatest.Assertions$.newTestCanceledException(Assertions.scala:1231)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssume(Assertions.scala:1310)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.assumeFlinkVersion(HybridShuffleWordCountTest.scala:75)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.$anonfun$new$2(HybridShuffleWordCountTest.scala:68)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info] at scala.collection.immutable.List.foreach(List.scala:431)
[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info] at org.scalatest.Suite.run(Suite.scala:1114)
[info] at org.scalatest.Suite.run$(Suite.scala:1096)
[info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.org$scalatest$BeforeAndAfterAll$$super$run(HybridShuffleWordCountTest.scala:37)
[info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info] at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.run(HybridShuffleWordCountTest.scala:37)
[info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[info] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[info] at java.base/java.lang.Thread.run(Thread.java:829)
```
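The cancellations above come from a version gate that keeps only the `major.minor` prefix of `FLINK_VERSION` and requires at least 1.20. A minimal sketch of that check (class and method names here are hypothetical stand-ins for `FlinkVersion#isNewerOrEqualVersionThan`):

```java
public class FlinkVersionGateSketch {
    // Keep only "major.minor" from a full version string, e.g. "1.16.3" -> "1.16".
    static String majorMinor(String version) {
        String[] parts = version.split("\\.");
        return parts[0] + "." + parts[1];
    }

    // True when the version is at least major.minor, mirroring the
    // isNewerOrEqualVersionThan(v1_20) assumption in the test.
    static boolean atLeast(String version, int major, int minor) {
        String[] parts = majorMinor(version).split("\\.");
        int ma = Integer.parseInt(parts[0]);
        int mi = Integer.parseInt(parts[1]);
        return ma > major || (ma == major && mi >= minor);
    }

    public static void main(String[] args) {
        System.out.println(atLeast("1.16.3", 1, 20)); // gate fails -> test canceled
        System.out.println(atLeast("1.20.1", 1, 20)); // gate passes -> test runs
        System.out.println(atLeast("2.0.0", 1, 20));  // newer major also passes
    }
}
```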
- Flink 1.20
```
[info] - Celeborn Flink Hybrid Shuffle Integration test(Local) - word count
2025-03-28T08:15:25.149779Z flink-pekko.actor.default-dispatcher-10 WARN found 2 argument placeholders, but provided 3 for pattern `Disconnect TaskExecutor {} because: {}`
25/03/28 08:15:25,199 ERROR [celeborn-client-lifecycle-manager-release-partition-executor-1] LifecycleManager: Request DestroyWorkerSlots(1743149720044-d5114c42fdccf6220ec7a9e1fb856ac3-1,[1-0, 4-0, 7-0],[],false) to NettyRpcEndpointRef(celeborn://WorkerEndpoint@10.1.0.100:37683) failed 1/3 for 1743149720044-d5114c42fdccf6220ec7a9e1fb856ac3-1, reason: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@351c2453[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@a979257[Wrapped task = org.apache.celeborn.common.rpc.netty.NettyRpcEnv$$anon$1@6312b56f]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@23161ba2[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], will retry.
...
[info] - Celeborn Flink Hybrid Shuffle Integration test(Flink mini cluster) single tier - word count
25/03/28 08:15:30,980 ERROR [worker 1 starter thread] MasterClient: Send rpc with failure, has tried 0, max try 15!
```
Closes #3184 from SteNicholas/CELEBORN-1543.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
- close [CELEBORN-1911](https://issues.apache.org/jira/browse/CELEBORN-1911)
This PR refactors the project structure by moving the multipart-uploader module into multipart-uploader/multipart-uploader-s3.
### Why are the changes needed?
This change improves modularity and enables future extensions, such as multipart-uploader/multipart-uploader-oss, allowing better support for multiple object storage backends.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Deployment integration testing has been completed in the local environment.
Closes #3153 from shouwangyw/optimize/mpu-s3.
Authored-by: veli.yang <897900564@qq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump log4j2 version to 2.24.3
https://github.com/apache/logging-log4j2/releases/tag/rel%2F2.24.3
### Why are the changes needed?
Bump to latest log4j2 bug fix release.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #3134 from turboFei/log4j2.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
This PR partially reverts the change of https://github.com/apache/celeborn/pull/2813; namely, it restores the original `celeborn-client-spark-3` module naming.
### Why are the changes needed?
The renaming is unnecessary and might cause confusion; for example, I wrongly interpreted `spark-3-4` as Spark 3.4. It also increases the backport effort for branch-0.5.
### Does this PR introduce _any_ user-facing change?
No, it's dev only; before and after this change, end users always use the shaded clients:
```
celeborn-client-spark-2-shaded_2.11-0.6.0-SNAPSHOT.jar
celeborn-client-spark-3-shaded_2.12-0.6.0-SNAPSHOT.jar
celeborn-client-spark-4-shaded_2.13-0.6.0-SNAPSHOT.jar
```
### How was this patch tested?
Pass GA.
Closes #3133 from pan3793/CELEBORN-1413-followup.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Spark from 3.5.4 to 3.5.5.
### Why are the changes needed?
Spark 3.5.5 has been released: [Spark 3.5.5 released](https://spark.apache.org/news/spark-3-5-5-released.html). The spark-3.5 profile can bump Spark from 3.5.4 to 3.5.5.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3129 from SteNicholas/CELEBORN-1890.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Bump zstd-jni version from 1.5.2-1 to 1.5.7-1.
### Why are the changes needed?
Bump zstd-jni to the latest version.
Backport https://github.com/apache/spark/pull/50057.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #3114 from SteNicholas/CELEBORN-1877.
Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Flink from 1.19.1, 1.20.0 to 1.19.2, 1.20.1.
### Why are the changes needed?
Flink 1.19.2 and 1.20.1 have already been released.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3107 from SteNicholas/CELEBORN-1872.
Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
### What changes were proposed in this pull request?
Bump Netty version from 4.1.115.Final to 4.1.118.Final.
### Why are the changes needed?
Netty 4.1.118.Final has been released, while the version currently in use is 4.1.115.Final. The changes between 4.1.115.Final and 4.1.118.Final are as follows:
- 4.1.116.Final: https://netty.io/news/2024/12/17/4-1-116-Final.html
- 4.1.117.Final: https://netty.io/news/2025/01/14/4-1-117-Final.html
- 4.1.118.Final: https://netty.io/news/2025/02/10/4-1-118-Final.html
  - **SslHandler doesn't correctly validate packets which can lead to native crash when using native SSLEngine.**
  - **Denial of Service attack on windows app using Netty, again.**
Backport:
- https://github.com/apache/spark/pull/49756
- https://github.com/apache/spark/pull/49923
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3098 from SteNicholas/CELEBORN-1864.
Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Upgrade Ratis version to 3.1.3.
### Why are the changes needed?
To fix CVE-2024-7254 and sonatype-2020-0026, which come from the transitive dependency ratis-thirdparty-misc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Locally and CI tests
Closes #3095 from Madhukar525722/main.
Authored-by: madlnu <madlnu@visa.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Backport https://github.com/apache/celeborn/pull/3070 to main branch.
## What changes were proposed in this pull request?
Do not trigger fetch failure if a Spark task attempt is interrupted (speculation enabled), and do not trigger fetch failure if the getReducerFileGroup RPC times out. This PR is intended for the celeborn-0.5 branch.
## Why are the changes needed?
Avoid unnecessary fetch failures and stage re-runs.
## Does this PR introduce any user-facing change?
NO.
## How was this patch tested?
1. GA.
2. Manually tested on cluster with spark speculation tasks.
Here is the test case:
```scala
sc.parallelize(1 to 100, 100).flatMap(i => {
  (1 to 150000).iterator.map(num => num)
}).groupBy(i => i, 100)
  .map(i => {
    if (i._1 < 5) {
      Thread.sleep(15000)
    }
    i
  })
  .repartition(400).count
```
<img width="1384" alt="Screenshot 2025-01-18 16 16 16" src="https://github.com/user-attachments/assets/adf64857-5773-4081-a7d0-fa3439e751eb" />
<img width="1393" alt="Screenshot 2025-01-18 16 16 22" src="https://github.com/user-attachments/assets/ac9bf172-1ab4-4669-a930-872d009f2530" />
<img width="1258" alt="Screenshot 2025-01-18 16 19 15" src="https://github.com/user-attachments/assets/6a8ff3e1-c1fb-4ef2-84d8-b1fc6eb56fa6" />
<img width="892" alt="Screenshot 2025-01-18 16 17 27" src="https://github.com/user-attachments/assets/f9de3841-f7d4-4445-99a3-873235d4abd0" />
Closes #3070 from FMX/branch-0.5-b1838.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Closes #3080 from turboFei/b1838.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Bump ap-loader version from 3.0-8 to 3.0-9.
### Why are the changes needed?
ap-loader has released v3.0-9, so the version used by `JVMProfiler` should be bumped from 3.0-8.
Backport:
1. https://github.com/apache/spark/pull/46402
2. https://github.com/apache/spark/pull/49440
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3072 from SteNicholas/CELEBORN-1842.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Bump `zstd-jni` version to 1.5.6-5 for 4.0.0-preview2.
### Why are the changes needed?
Spark 4.0.0-preview2 uses `zstd-jni` 1.5.6-5; see [<version>1.5.6-5</version>](https://github.com/apache/spark/blob/v4.0.0-preview2/pom.xml#L838C18-L838C25).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3054 from SteNicholas/CELEBORN-1413.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove outdated Flink 1.14 and 1.15 support.
For more information, please see the discussion thread: https://lists.apache.org/thread/njho00zmkjx5qspcrbrkogy8s4zzmwv9
### Why are the changes needed?
Reduce maintenance burden.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Changes can be covered by existing tests.
Closes #3029 from codenohup/remove-flink14and15.
Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Add Tez packaging script.
### Why are the changes needed?
To support building the Tez client.
### Does this PR introduce _any_ user-facing change?
Yes, this enables Celeborn with Tez support.
### How was this patch tested?
Cluster test.
Closes #3028 from GH-Gloway/1737.
Lead-authored-by: hongguangwei <hongguangwei@bytedance.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Spark from 3.5.3 to 3.5.4.
### Why are the changes needed?
Spark 3.5.4 has been released: [Spark 3.5.4 released](https://spark.apache.org/news/spark-3-5-4-released.html). The spark-3.5 profile can bump Spark from 3.5.3 to 3.5.4.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3034 from SteNicholas/CELEBORN-1806.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
To support Spark 4.0.0 preview.
### Why are the changes needed?
1. Change Scala to 2.13.
2. Introduce a columnar shuffle module for Spark 4.0.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.
Closes #2813 from FMX/b1413.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
As title, add `--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED` into default java options.
### Why are the changes needed?
It is necessary for JDK17 + HDFS Storage + Kerberos enabled, see details in https://github.com/apache/spark/pull/34615
The exception stack looks like:
```
Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm
at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:306)
at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
....
Caused by: java.lang.IllegalAccessException: class org.apache.hadoop.security.authentication.util.KerberosUtil cannot access class sun.security.krb5.Config (in module java.security.jgss) because module java.security.jgss does not export sun.security.krb5 to unnamed module @3a0baae5
at java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392)
at java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674)
at java.base/java.lang.reflect.Method.invoke(Method.java:560)
at org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:85)
at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
... 9 more
```
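For reference, the option could be wired into a deployment through the JVM options in `conf/celeborn-env.sh`. This is only a hypothetical sketch: the variable names `CELEBORN_MASTER_JAVA_OPTS` and `CELEBORN_WORKER_JAVA_OPTS` are assumptions for illustration, not taken from this PR.

```shell
# Hypothetical conf/celeborn-env.sh fragment (variable names are assumptions):
# open sun.security.krb5 to unnamed modules so that Hadoop's KerberosUtil can
# reflectively read the Kerberos realm configuration on JDK 17.
CELEBORN_MASTER_JAVA_OPTS="$CELEBORN_MASTER_JAVA_OPTS --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
CELEBORN_WORKER_JAVA_OPTS="$CELEBORN_WORKER_JAVA_OPTS --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
```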
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #2999 from turboFei/jdk_opt_krb5.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Due to the large size of the AWS cloud vendor's client JARs, this PR aims to keep only the AWS S3 module, reducing the AWS dependency size from over 296MB to around 2.3MB.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
<img width="2560" alt="Screenshot 2024-11-25 at 16 17 52" src="https://github.com/user-attachments/assets/efebbe7d-73cb-47fb-b7fa-9aae052f744b">
Tested in the lab environment as shown in the screenshot above.
Closes #2944 from zhaohehuhu/dev-1125.
Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
as title
### Why are the changes needed?
AWS S3 doesn't support append, so Celeborn had to copy the historical data from S3 to the worker and write it back to S3 again, which heavily amplifies the write traffic. This PR implements a better solution via MPU (multipart upload) to avoid copy-and-write.
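The difference between copy-and-write and append-via-MPU can be sketched with a small in-memory model. This is a conceptual illustration only, not Celeborn's actual uploader; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of append-via-MPU: instead of downloading the existing
// object and rewriting it with the new bytes appended (copy-and-write), each
// append becomes one part of a single multipart upload, and the storage
// backend concatenates the parts when the upload is completed.
public class MultipartAppendSketch {
    private final List<byte[]> parts = new ArrayList<>();

    // Each append only transfers the new bytes, as one additional part.
    public void append(byte[] chunk) {
        parts.add(chunk);
    }

    // Completing the upload concatenates all parts in order; the writer
    // never re-transfers the historical data.
    public byte[] complete() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        MultipartAppendSketch upload = new MultipartAppendSketch();
        upload.append("hello ".getBytes());
        upload.append("world".getBytes());
        System.out.println(new String(upload.complete())); // prints "hello world"
    }
}
```

With a real S3 client, each `append` would roughly map to an `UploadPart` call and `complete` to `CompleteMultipartUpload`, which is what removes the copy-and-write amplification.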
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

I conducted an experiment with a 1GB input dataset to compare the performance of Celeborn using only S3 storage versus using SSD storage. The results showed that Celeborn with SSD storage was approximately three times faster than with only S3 storage.
<img width="1728" alt="Screenshot 2024-11-16 at 13 02 10" src="https://github.com/user-attachments/assets/8f879c47-c01a-4004-9eae-1c266c1f3ef2">
The above screenshot is from the second test I did, with 5000 mappers and reducers.
Closes #2830 from zhaohehuhu/dev-1021.
Lead-authored-by: zhaohehuhu <luoyedeyi@163.com>
Co-authored-by: He Zhao <luoyedeyi459@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Ratis version from 3.1.1 to 3.1.2 including:
- Fix NPE in `RaftServerImpl.getLogInfo`: https://github.com/apache/ratis/pull/1171
### Why are the changes needed?
Bump Ratis version from 3.1.1 to 3.1.2. Ratis has released v3.1.2, whose release notes are available at [3.1.2](https://ratis.apache.org/post/3.1.2.html). The 3.1.2 version is a minor release with multiple improvements and bugfixes, including [[RATIS-2179] Fix NPE in `RaftServerImpl.getLogInfo`](https://issues.apache.org/jira/browse/RATIS-2179). See the [changes between 3.1.1 and 3.1.2](https://github.com/apache/ratis/compare/ratis-3.1.1...ratis-3.1.2) releases.
The 3.1.2 version fixed the following `NullPointerException` in CI log:
```
[info] Test org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterRatisServerSuiteJ.testIsLeader started
24/10/24 08:16:30,295 ERROR [pool-1-thread-1] HARaftServer: Failed to retrieve RaftPeerRole. Setting cached role to UNRECOGNIZED and resetting leader info.
java.io.IOException: java.lang.NullPointerException
at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
at org.apache.ratis.server.impl.RaftServerImpl.waitForReply(RaftServerImpl.java:1148)
at org.apache.ratis.server.impl.RaftServerProxy.getGroupInfo(RaftServerProxy.java:607)
at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.getGroupInfo(HARaftServer.java:599)
at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.updateServerRole(HARaftServer.java:514)
at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.isLeader(HARaftServer.java:489)
at org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterRatisServerSuiteJ.testIsLeader(MasterRatisServerSuiteJ.java:47)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
at com.novocode.junit.JUnitTask.execute(JUnitTask.java:64)
at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
at org.apache.ratis.server.impl.RaftServerImpl.getLogInfo(RaftServerImpl.java:665)
at org.apache.ratis.server.impl.RaftServerImpl.getGroupInfo(RaftServerImpl.java:658)
at org.apache.ratis.server.impl.RaftServerProxy.lambda$getGroupInfoAsync$23(RaftServerProxy.java:613)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #2897 from SteNicholas/CELEBORN-1702.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump protobuf from 3.21.7 to 3.25.5.
### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-735f-pc8j-v9w8
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #2898 from turboFei/bump_protobuf.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826
### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-g8m5-722r-8whq
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA
Closes #2899 from turboFei/bump_jetty.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump commons-io from 2.13.0 to 2.17.0
### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-78wr-2p64-hpwj
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #2900 from turboFei/bump_commons_io.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
1. remove scala binary version from the openapi-client artifactId.
2. skip openapi-client doc compile, it was missed in https://github.com/apache/celeborn/pull/2641
### Why are the changes needed?
Because openapi-client is a pure Java module.
### Does this PR introduce _any_ user-facing change?
No, it has not been released.
### How was this patch tested?
GA.
Closes #2861 from turboFei/remove_Scala.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Spark from 3.4.3 to 3.4.4.
### Why are the changes needed?
Spark 3.4.4 has been released: [Spark 3.4.4 released](https://spark.apache.org/news/spark-3-4-4-released.html). The spark-3.4 profile can bump Spark from 3.4.3 to 3.4.4.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2851 from SteNicholas/CELEBORN-1672.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
This PR addresses a conflict in the sbt-generated POM by replacing `pomExtra` with `scmInfo`:
```diff
<name>org.apache.celeborn</name>
</organization>
<scm>
- <url>https://github.com/cfmcgrady/incubator-celeborn</url>
- <connection>scm:https://github.com/cfmcgrady/incubator-celeborn.git</connection>
- <developerConnection>scm:git@github.com:cfmcgrady/incubator-celeborn.git</developerConnection>
- </scm>
- <url>https://celeborn.apache.org/</url>
- <scm>
- <url>git@github.com:apache/celeborn.git</url>
- <connection>scm:git@github.com:apache/celeborn.git</connection>
+ <url>https://celeborn.apache.org/</url>
+ <connection>scm:https://github.com/apache/celeborn.git</connection>
+ <developerConnection>scm:git@github.com:apache/celeborn.git</developerConnection>
</scm>
```
The conflicting POM might block publishing to a private Maven repository.
```
[error] Caused by: java.io.IOException: Server returned HTTP response code: 409 for URL: https://artifactory.devops.xxx.com/artifactory/maven-snapshots/org/apache/celeborn/celeborn-client-spark-3-shaded_2.12/0.6.0-SNAPSHOT/celeborn-client-spark-3-shaded_2.12-0.6.0-SNAPSHOT.pom
[error] at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:2000)
[error] at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589)
[error] at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
[error] at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:308)
[error] at org.apache.ivy.util.url.BasicURLHandler.upload(BasicURLHandler.java:284)
[error] at org.apache.ivy.util.url.URLHandlerDispatcher.upload(URLHandlerDispatcher.java:82)
[error] at org.apache.ivy.util.FileUtil.copy(FileUtil.java:150)
[error] at org.apache.ivy.plugins.repository.url.URLRepository.put(URLRepository.java:84)
[error] at sbt.internal.librarymanagement.ConvertResolver$LocalIfFileRepo.put(ConvertResolver.scala:407)
[error] at org.apache.ivy.plugins.repository.AbstractRepository.put(AbstractRepository.java:130)
[error] at sbt.internal.librarymanagement.ConvertResolver$ChecksumFriendlyURLResolver.put(ConvertResolver.scala:124)
[error] at sbt.internal.librarymanagement.ConvertResolver$ChecksumFriendlyURLResolver.put$(ConvertResolver.scala:111)
[error] at sbt.internal.librarymanagement.ConvertResolver$$anonfun$defaultConvert$lzycompute$1$PluginCapableResolver$1.put(ConvertResolver.scala:170)
[error] at org.apache.ivy.plugins.resolver.RepositoryResolver.publish(RepositoryResolver.java:216)
[error] at sbt.internal.librarymanagement.IvyActions$.$anonfun$publish$5(IvyActions.scala:501)
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
local
Closes #2858 from cfmcgrady/sbt-scm.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
The version 1.0.6 is outdated and not available on Maven Central.
https://mvnrepository.com/artifact/com.thesamet/sbt-protoc_2.12_1.0
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass CI
Closes #2842 from cfmcgrady/sbt-protoc.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
Bump Ratis version from 3.1.0 to 3.1.1 including:
- Remove `address2String` and use `setAddress(ratisAddr)` with the release of https://github.com/apache/ratis/pull/1125.
- Handle the requirement that `raft.grpc.message.size.max` must be 1m larger than `raft.server.log.appender.buffer.byte-limit`, introduced by https://github.com/apache/ratis/pull/1132.
### Why are the changes needed?
Bump Ratis version from 3.1.0 to 3.1.1. Ratis has released v3.1.1, whose release notes are available at [3.1.1](https://ratis.apache.org/post/3.1.1.html). The 3.1.1 version is a minor release with multiple improvements and bugfixes, including [[RATIS-2116] Fix the issue where RaftServerImpl.appendEntries may be blocked indefinitely](https://issues.apache.org/jira/browse/RATIS-2116) and [[RATIS-2131] Configuring Ratis fails when hostname is used, and is an IPv6 host](https://issues.apache.org/jira/browse/RATIS-2131). See the [changes between 3.1.0 and 3.1.1](https://github.com/apache/ratis/compare/ratis-3.1.0...ratis-3.1.1) releases.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #2759 from SteNicholas/CELEBORN-1525.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
Bump Spark from 3.5.2 to 3.5.3.
### Why are the changes needed?
Spark 3.5.3 has been released: [Spark 3.5.3 released](https://spark.apache.org/news/spark-3-5-3-released.html). The spark-3.5 profile can bump Spark from 3.5.2 to 3.5.3.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #2760 from SteNicholas/CELEBORN-1613.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
This PR is a follow-up for https://github.com/apache/celeborn/pull/2641.
In the above PR, I upgraded the version to 7.7.0, but two generated Java files were missing Apache license headers.
I then raised https://github.com/OpenAPITools/openapi-generator/pull/19273 to follow up, and the fix was released in https://github.com/OpenAPITools/openapi-generator/releases/tag/v7.8.0.
### Why are the changes needed?
Upgrade to the latest openapi-generator version to resolve the unlicensed java files.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing GA.
Closes #2695 from turboFei/openapi_upgrade.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Enable `useEnumCaseInsensitive` for openapi-generator.
On the Celeborn server end, the enum is then mapped to the internal WorkerEventType.
### Why are the changes needed?
I hit an exception when sending a worker event with the OpenAPI SDK.
```
Exception in thread "main" ApiException{code=400, responseHeaders={Server=[Jetty(9.4.52.v20230823)], Content-Length=[491], Date=[Fri, 20 Sep 2024 23:50:27 GMT], Content-Type=[text/plain]}, responseBody='Cannot deserialize value of type `org.apache.celeborn.rest.v1.model.SendWorkerEventRequest$EventTypeEnum` from String "DecommissionThenIdle": not one of the values accepted for Enum class: [DECOMMISSION_THEN_IDLE, GRACEFUL, NONE, DECOMMISSION, IMMEDIATELY, RECOMMISSION]
at [Source: (org.glassfish.jersey.message.internal.ReaderInterceptorExecutor$UnCloseableInputStream); line: 1, column: 14] (through reference chain: org.apache.celeborn.rest.v1.model.SendWorkerEventRequest["eventType"])'}
at org.apache.celeborn.rest.v1.master.invoker.ApiClient.processResponse(ApiClient.java:913)
at org.apache.celeborn.rest.v1.master.invoker.ApiClient.invokeAPI(ApiClient.java:1000)
at org.apache.celeborn.rest.v1.master.WorkerApi.sendWorkerEvent(WorkerApi.java:378)
at org.apache.celeborn.rest.v1.master.WorkerApi.sendWorkerEvent(WorkerApi.java:334)
at org.example.Main.main(Main.java:22)
```
The testing code to reproduce:
```java
package org.example;

import org.apache.celeborn.rest.v1.master.WorkerApi;
import org.apache.celeborn.rest.v1.master.invoker.ApiClient;
import org.apache.celeborn.rest.v1.model.ExcludeWorkerRequest;
import org.apache.celeborn.rest.v1.model.SendWorkerEventRequest;
import org.apache.celeborn.rest.v1.model.WorkerId;

public class Main {
  public static void main(String[] args) throws Exception {
    String cmUrl = "http://localhost:9098";
    WorkerApi workerApi = new WorkerApi(new ApiClient().setBasePath(cmUrl));
    workerApi.excludeWorker(new ExcludeWorkerRequest()
        .addAddItem(new WorkerId()
            .host("localhost")
            .rpcPort(1)
            .pushPort(2)
            .fetchPort(3)
            .replicatePort(4)));
    workerApi.sendWorkerEvent(new SendWorkerEventRequest()
        .addWorkersItem(new WorkerId()
            .host("127.0.0.1")
            .rpcPort(56116)
            .pushPort(56117)
            .fetchPort(56119)
            .replicatePort(56118))
        .eventType(SendWorkerEventRequest.EventTypeEnum.DECOMMISSION_THEN_IDLE));
  }
}
```
This seems to be because, for EventTypeEnum, the enum name and its value are not the same, which causes the issue. I'm not sure why the UT passed while the integration testing failed. Since the EventTypeEnum value lookup is case sensitive, we hit this issue.
8734d16638/openapi/openapi-client/src/main/java/org/apache/celeborn/rest/v1/model/SendWorkerEventRequest.java (L47-L83)
I think the related issue on the Jersey end is https://github.com/eclipse-ee4j/jersey/issues/5288.
In this PR, `useEnumCaseInsensitive` is enabled for openapi-generator.
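The effect of the option can be illustrated with a simplified stand-in for the generated enum. The enum values below are hypothetical; the real generated class is `SendWorkerEventRequest.EventTypeEnum`.

```java
// Simplified stand-in for a generated OpenAPI enum: the constant name
// (DECOMMISSION_THEN_IDLE) differs from its serialized value
// ("DecommissionThenIdle"), so an exact-match value lookup can reject what
// the client actually sends. With useEnumCaseInsensitive, the generated
// fromValue compares values with equalsIgnoreCase instead of equals.
public enum EventTypeEnum {
    DECOMMISSION_THEN_IDLE("DecommissionThenIdle"),
    GRACEFUL("Graceful"),
    NONE("None");

    private final String value;

    EventTypeEnum(String value) {
        this.value = value;
    }

    public String getValue() {
        return value;
    }

    // Case-insensitive lookup, as generated when useEnumCaseInsensitive=true.
    public static EventTypeEnum fromValue(String value) {
        for (EventTypeEnum e : values()) {
            if (e.value.equalsIgnoreCase(value)) {
                return e;
            }
        }
        throw new IllegalArgumentException("Unexpected value '" + value + "'");
    }

    public static void main(String[] args) {
        System.out.println(fromValue("DecommissionThenIdle")); // prints DECOMMISSION_THEN_IDLE
    }
}
```

Without the option, the generated lookup uses an exact `equals` comparison, which is why the mismatched string in the exception above was rejected.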
### Does this PR introduce _any_ user-facing change?
No, there is no user-facing change, and this SDK has not been released yet.
### How was this patch tested?
Existing UT and Integration testing.
<img width="1265" alt="image" src="https://github.com/user-attachments/assets/6a34a0dd-c474-4e8d-b372-19b0fda94972">
Closes #2754 from turboFei/eventTypeEnumMapping.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump RocksDB from 8.11.3 to 9.5.2.
### Why are the changes needed?
https://github.com/facebook/rocksdb/compare/v8.11.3...v9.5.2
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes #2749 from cxzl25/CELEBORN-1604.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>