Commit Graph

139 Commits

Author SHA1 Message Date
SteNicholas
75446a05d3 [CELEBORN-2093] Support Flink 2.1
### What changes were proposed in this pull request?

Support Flink 2.1.

### Why are the changes needed?

Flink 2.1 has already released, which release notes refer to [Release notes - Flink 2.1](https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-2.1).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3404 from SteNicholas/CELEBORN-2093.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-04 14:12:55 +08:00
SteNicholas
ae40222351 [CELEBORN-2047] Support MapPartitionData on DFS
### What changes were proposed in this pull request?

Support `MapPartitionData` on DFS.

### Why are the changes needed?

`MapPartitionData` only supports on local, which does not support on DFS. It's recommended to support `MapPartitionData` on DFS for MapPartition mode to align the ability of ReducePartition mode.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`WordCountTestWithHDFS`.

Closes #3349 from SteNicholas/CELEBORN-2047.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-07-26 22:11:32 +08:00
SteNicholas
4ced621534 [CELEBORN-2080] Bump Flink from 1.19.2, 1.20.1 to 1.19.3, 1.20.2
### What changes were proposed in this pull request?

Bump Flink from 1.19.2, 1.20.1 to 1.19.3, 1.20.2.

### Why are the changes needed?

Flink has released v1.19.3 and v1.20.2, which release notes refer to:

- [Apache Flink 1.19.3 Release Announcement](https://flink.apache.org/2025/07/10/apache-flink-1.19.3-release-announcement/)
- [Apache Flink 1.20.2 Release Announcement](https://flink.apache.org/2025/07/10/apache-flink-1.20.2-release-announcement/)

Flink v1.19.3 adds the `getConsumedPartitionType()` interface into `IndexedInputGate`, which refers to https://github.com/apache/flink/pull/26548.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3385 from SteNicholas/CELEBORN-2080.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-07-24 14:13:12 +08:00
SteNicholas
cfb4438ade [CELEBORN-2057] Bump ap-loader version from 3.0-9 to 4.0-10
### What changes were proposed in this pull request?

Bump ap-loader version from 3.0-9 to 4.0-10.

### Why are the changes needed?

`ap-loader` has already released v4.0-10, which release note refers to [Loader for 4.0 (v10): Heatmaps and Native memory profiling](https://github.com/jvm-profiling-tools/ap-loader/releases/tag/4.0-10). It should bump version from 3.0-9 to 4.0-10 for `JVMProfiler`.

Backport https://github.com/apache/spark/pull/51257.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #3359 from SteNicholas/CELEBORN-2057.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-07-10 16:18:28 +08:00
SteNicholas
8be7d928a3 [CELEBORN-2030] Bump Spark from 3.5.5 to 3.5.6
### What changes were proposed in this pull request?

Bump Spark from 3.5.5 to 3.5.6.

### Why are the changes needed?

Spark 3.5.6 has been announced to release: [Spark 3.5.6 released](https://spark.apache.org/news/spark-3-5-6-released.html). The profile spark-3.5 could bump Spark from 3.5.5 to 3.5.6.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3319 from SteNicholas/CELEBORN-2030.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-06-09 20:40:33 +08:00
Fei Wang
b44730771d [CELEBORN-1413][FOLLOWUP] Bump spark 4.0 version to 4.0.0
### What changes were proposed in this pull request?
Bump spark 4.0 version to 4.0.0.

### Why are the changes needed?
Spark 4.0.0 is ready.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
GA.

Closes #3282 from turboFei/spark_4.0.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-05-28 17:56:08 +08:00
SteNicholas
0dffcf6c9d [CELEBORN-2013] Upgrade scala binary version of spark-3.3, spark-3.4, spark-3.5 profile to 2.13.8
Upgrade scala binary version of `spark-3.3`, `spark-3.4`, `spark-3.5` profile to 2.13.8.

The scala binary version of `spark-3.3`, `spark-3.4`, `spark-3.5` profile is 2.13.8 for scala 2.13.

- https://github.com/apache/spark/blob/v3.3.4/pom.xml#L3548
- https://github.com/apache/spark/blob/v3.4.4/pom.xml#L3646
- https://github.com/apache/spark/blob/v3.5.5/pom.xml#L3681

No.

CI.

Closes #3284 from SteNicholas/CELEBORN-2013.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-26 17:57:38 +08:00
Fei Wang
81c3d91f75 [CELEBORN-2010][INFRA] Add release guide
### What changes were proposed in this pull request?

Add release guide and fix several issues during 0.6.0 release.

### Why are the changes needed?
Add docs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested locally.

Closes #3271 from turboFei/release_guide.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Fei Wang <fwang12@ebay.com>
2025-05-24 17:55:10 -07:00
SteNicholas
8e66ac833a [CELEBORN-1994] Introduce disruptor dependency to support asynchronous logging of log4j2
### What changes were proposed in this pull request?

Introduce disruptor dependency to support asynchronous logging of log4j2.

### Why are the changes needed?

We add `-Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector` in `CELEBORN_MASTER_JAVA_OPTS` and `CELEBORN_WOKRER_JAVA_OPTS` for production environment. `AsyncLoggerContextSelector` depends on disruptor dependency. Therefore, it's recommend to introduce disruptor dependency to support log4j2 asynchronous loggers.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster test.

Closes #3246 from SteNicholas/CELEBORN-1994.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-13 19:45:51 +08:00
Aravind Patnam
714722b5d3 [CELEBORN-1982] Slot Selection Perf Improvements
### What changes were proposed in this pull request?
After profiling to see where the hotspots are for slot selection, we identified 2 main areas:
- iter.remove ([link](https://github.com/apache/celeborn/blob/main/master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java#L447)) is a major hotspot, especially if partitionIdList is massive - since it is an ArrayList and we are removing from the begining - resulting in O(n) deletion costs.
- `haveDisk` is computed per partitionId, iterated across all workers.  We precompute this and store it as a field in `WorkerInfo`.

See the below flamegraph for the hotspot of `iter.remove` (`oop_disjoint_arraycopy`) after running a benchmark.

![Screenshot 2025-04-24 at 12 58 34 AM](https://github.com/user-attachments/assets/30bb38f7-9a92-4b52-8480-5e7f26b0d48b)

Below is what we actually observed in production which matches with the above observation from the benchmark:
![realprodflamegraph](https://github.com/user-attachments/assets/d06e095c-2d6d-4892-982a-1c2e828eb71e)

### Why are the changes needed?
speed up slot selection performance in the case of large partitionIds

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
After applying the above changes, we can see the hotspot is removed in the flamegraph:
![Screenshot 2025-04-24 at 12 53 24 AM](https://github.com/user-attachments/assets/99372140-5746-4a34-9918-642c81fb52e8)

Benchmarks:
Without changes:
```
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection

# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration   1: 2060198.745 ±(99.9%) 306976.270 us/op
# Warmup Iteration   2: 1137534.950 ±(99.9%) 72065.776 us/op
# Warmup Iteration   3: 1032434.221 ±(99.9%) 59585.256 us/op
# Warmup Iteration   4: 903621.382 ±(99.9%) 41542.172 us/op
# Warmup Iteration   5: 921816.398 ±(99.9%) 44025.884 us/op
Iteration   1: 853276.360 ±(99.9%) 13285.688 us/op
Iteration   2: 865183.111 ±(99.9%) 9691.856 us/op
Iteration   3: 909971.254 ±(99.9%) 10201.037 us/op
Iteration   4: 874154.240 ±(99.9%) 11287.538 us/op
Iteration   5: 907655.363 ±(99.9%) 11893.789 us/op

Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
  882048.066 ±(99.9%) 98360.936 us/op [Average]
  (min, avg, max) = (853276.360, 882048.066, 909971.254), stdev = 25544.023
  CI (99.9%): [783687.130, 980409.001] (assumes normal distribution)

# Run complete. Total time: 00:05:43

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                       Mode  Cnt       Score       Error  Units
SlotsAllocatorBenchmark.benchmarkSlotSelection  avgt    5  882048.066 ± 98360.936  us/op

Process finished with exit code 0
```
With changes:
```
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection

# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration   1: 305437.719 ±(99.9%) 81860.733 us/op
# Warmup Iteration   2: 137498.811 ±(99.9%) 7669.102 us/op
# Warmup Iteration   3: 129355.869 ±(99.9%) 5030.972 us/op
# Warmup Iteration   4: 135311.734 ±(99.9%) 6964.080 us/op
# Warmup Iteration   5: 131013.323 ±(99.9%) 8560.232 us/op
Iteration   1: 133695.396 ±(99.9%) 3713.684 us/op
Iteration   2: 143735.961 ±(99.9%) 5858.078 us/op
Iteration   3: 135619.704 ±(99.9%) 5257.352 us/op
Iteration   4: 128806.160 ±(99.9%) 4541.790 us/op
Iteration   5: 134179.546 ±(99.9%) 5137.425 us/op

Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
  135207.354 ±(99.9%) 20845.544 us/op [Average]
  (min, avg, max) = (128806.160, 135207.354, 143735.961), stdev = 5413.522
  CI (99.9%): [114361.809, 156052.898] (assumes normal distribution)

# Run complete. Total time: 00:05:29

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                       Mode  Cnt       Score       Error  Units
SlotsAllocatorBenchmark.benchmarkSlotSelection  avgt    5  135207.354 ± 20845.544  us/op

Process finished with exit code 0
```

882048.066 us/ops without changes vs 135207.354 us/op with changes. That is about 6.5x improvement.

Closes #3228 from akpatnam25/CELEBORN-1982.

Lead-authored-by: Aravind Patnam <akpatnam25@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-04-27 11:13:21 +08:00
veli.yang
7d0ba7f9b8 [CELEBORN-1916] Support Aliyun OSS Based on MPU Extension Interface
### What changes were proposed in this pull request?

- close [CELEBORN-1916](https://issues.apache.org/jira/browse/CELEBORN-1916)
- This PR extends the Multipart Uploader (MPU) interface to support Aliyun OSS.

### Why are the changes needed?

- Implemented multipart-uploader-oss module based on the existing MPU extension interface.
- Added necessary configurations and dependencies for Aliyun OSS integration.
- Ensured compatibility with the existing multipart-uploader framework.
- This enhancement allows seamless multipart upload functionality for Aliyun OSS, similar to the existing AWS S3 support.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Deployment integration testing has been completed in the local environment.

Closes #3157 from shouwangyw/optimize/mpu-oss.

Lead-authored-by: veli.yang <897900564@qq.com>
Co-authored-by: yangwei <897900564@qq.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-04-08 15:10:33 +08:00
SteNicholas
2b8f3520f9 [CELEBORN-1925] Support Flink 2.0
### What changes were proposed in this pull request?

Support Flink 2.0. The major changes of Flink 2.0 include:

- https://github.com/apache/flink/pull/25406: Bump target Java version to 11 and drop support for Java 8.
- https://github.com/apache/flink/pull/25551: Replace `InputGateDeploymentDescriptor#getConsumedSubpartitionIndexRange` with `InputGateDeploymentDescriptor#getConsumedSubpartitionRange(index)`.
- https://github.com/apache/flink/pull/25314: Replace `NettyShuffleEnvironmentOptions#NETWORK_EXCLUSIVE_BUFFERS_REQUEST_TIMEOUT_MILLISECONDS` with `NettyShuffleEnvironmentOptions#NETWORK_BUFFERS_REQUEST_TIMEOUT`.
- https://github.com/apache/flink/pull/25731: Introduce `InputGate#resumeGateConsumption`.

### Why are the changes needed?

Flink 2.0 is released which refers to [Release notes - Flink 2.0](https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-2.0).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3179 from SteNicholas/CELEBORN-1925.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
2025-04-07 15:23:20 +08:00
SteNicholas
56bf87d3c1 [CELEBORN-1543][FOLLOWUP] celeborn-flink-it project should set FLINK_VERSION environment variable for HybridShuffleWordCountTest
### What changes were proposed in this pull request?

`celeborn-flink-it` project should set `FLINK_VERSION` environment variable for `HybridShuffleWordCountTest`.

Follow up #2662.

### Why are the changes needed?

Because of the lack of `FLINK_VERSION` environment variable, the sbt flink project of CI cancels the tests of `HybridShuffleWordCountTest` as follows:

```
[info] - Celeborn Flink Hybrid Shuffle Integration test(Local) - word count !!! CANCELED !!!
[info]   "" was empty (HybridShuffleWordCountTest.scala:75)
[info]   org.scalatest.exceptions.TestCanceledException:
[info]   at org.scalatest.Assertions.newTestCanceledException(Assertions.scala:475)
[info]   at org.scalatest.Assertions.newTestCanceledException$(Assertions.scala:474)
[info]   at org.scalatest.Assertions$.newTestCanceledException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssume(Assertions.scala:1310)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.assumeFlinkVersion(HybridShuffleWordCountTest.scala:75)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.$anonfun$new$1(HybridShuffleWordCountTest.scala:62)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info]   at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Suite.run(Suite.scala:1114)
[info]   at org.scalatest.Suite.run$(Suite.scala:1096)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.org$scalatest$BeforeAndAfterAll$$super$run(HybridShuffleWordCountTest.scala:37)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.run(HybridShuffleWordCountTest.scala:37)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info]   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info]   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[info]   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[info]   at java.base/java.lang.Thread.run(Thread.java:829)
[info] - Celeborn Flink Hybrid Shuffle Integration test(Flink mini cluster) single tier - word count !!! CANCELED !!!
[info]   "" was empty (HybridShuffleWordCountTest.scala:75)
[info]   org.scalatest.exceptions.TestCanceledException:
[info]   at org.scalatest.Assertions.newTestCanceledException(Assertions.scala:475)
[info]   at org.scalatest.Assertions.newTestCanceledException$(Assertions.scala:474)
[info]   at org.scalatest.Assertions$.newTestCanceledException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssume(Assertions.scala:1310)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.assumeFlinkVersion(HybridShuffleWordCountTest.scala:75)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.$anonfun$new$2(HybridShuffleWordCountTest.scala:68)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info]   at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Suite.run(Suite.scala:1114)
[info]   at org.scalatest.Suite.run$(Suite.scala:1096)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.org$scalatest$BeforeAndAfterAll$$super$run(HybridShuffleWordCountTest.scala:37)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.run(HybridShuffleWordCountTest.scala:37)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info]   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info]   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[info]   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[info]   at java.base/java.lang.Thread.run(Thread.java:829)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`HybridShuffleWordCountTest`

- Flink 1.16
```
[info] - Celeborn Flink Hybrid Shuffle Integration test(Local) - word count !!! CANCELED !!!
[info]   "1.16.3" was not empty, but FlinkVersion.fromVersionStr(scala.Predef.refArrayOps[String](scala.Predef.refArrayOps[String](flinkVersion.split("\\.")).take(2)).mkString(".")).isNewerOrEqualVersionThan(v1_20) was false (HybridShuffleWordCountTest.scala:75)
[info]   org.scalatest.exceptions.TestCanceledException:
[info]   at org.scalatest.Assertions.newTestCanceledException(Assertions.scala:475)
[info]   at org.scalatest.Assertions.newTestCanceledException$(Assertions.scala:474)
[info]   at org.scalatest.Assertions$.newTestCanceledException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssume(Assertions.scala:1310)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.assumeFlinkVersion(HybridShuffleWordCountTest.scala:75)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.$anonfun$new$1(HybridShuffleWordCountTest.scala:62)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info]   at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:[1564](https://github.com/apache/celeborn/actions/runs/14124619270/job/39570964025?pr=3184#step:4:1565))
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Suite.run(Suite.scala:1114)
[info]   at org.scalatest.Suite.run$(Suite.scala:1096)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.org$scalatest$BeforeAndAfterAll$$super$run(HybridShuffleWordCountTest.scala:37)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.run(HybridShuffleWordCountTest.scala:37)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info]   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info]   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[info]   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[info]   at java.base/java.lang.Thread.run(Thread.java:829)
[info] - Celeborn Flink Hybrid Shuffle Integration test(Flink mini cluster) single tier - word count !!! CANCELED !!!
[info]   "1.16.3" was not empty, but FlinkVersion.fromVersionStr(scala.Predef.refArrayOps[String](scala.Predef.refArrayOps[String](flinkVersion.split("\\.")).take(2)).mkString(".")).isNewerOrEqualVersionThan(v1_20) was false (HybridShuffleWordCountTest.scala:75)
[info]   org.scalatest.exceptions.TestCanceledException:
[info]   at org.scalatest.Assertions.newTestCanceledException(Assertions.scala:475)
[info]   at org.scalatest.Assertions.newTestCanceledException$(Assertions.scala:474)
[info]   at org.scalatest.Assertions$.newTestCanceledException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssume(Assertions.scala:1310)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.assumeFlinkVersion(HybridShuffleWordCountTest.scala:75)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.$anonfun$new$2(HybridShuffleWordCountTest.scala:68)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info]   at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Suite.run(Suite.scala:1114)
[info]   at org.scalatest.Suite.run$(Suite.scala:1096)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.org$scalatest$BeforeAndAfterAll$$super$run(HybridShuffleWordCountTest.scala:37)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.celeborn.tests.flink.HybridShuffleWordCountTest.run(HybridShuffleWordCountTest.scala:37)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info]   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info]   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[info]   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[info]   at java.base/java.lang.Thread.run(Thread.java:829)
```
- Flink 1.20
```
[info] - Celeborn Flink Hybrid Shuffle Integration test(Local) - word count
2025-03-28T08:15:25.149779Z flink-pekko.actor.default-dispatcher-10 WARN found 2 argument placeholders, but provided 3 for pattern `Disconnect TaskExecutor {} because: {}`
25/03/28 08:15:25,199 ERROR [celeborn-client-lifecycle-manager-release-partition-executor-1] LifecycleManager: Request DestroyWorkerSlots(1743149720044-d5114c42fdccf6220ec7a9e1fb856ac3-1,[1-0, 4-0, 7-0],[],false) to NettyRpcEndpointRef(celeborn://WorkerEndpoint10.1.0.100:37683) failed 1/3 for 1743149720044-d5114c42fdccf6220ec7a9e1fb856ac3-1, reason: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask351c2453[Not completed, task = java.util.concurrent.Executors$RunnableAdaptera979257[Wrapped task = org.apache.celeborn.common.rpc.netty.NettyRpcEnv$$anon$16312b56f]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor23161ba2[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0], will retry.
...
[info] - Celeborn Flink Hybrid Shuffle Integration test(Flink mini cluster) single tier - word count
25/03/28 08:15:30,980 ERROR [worker 1 starter thread] MasterClient: Send rpc with failure, has tried 0, max try 15!
```

Closes #3184 from SteNicholas/CELEBORN-1543.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-03-31 10:51:12 +08:00
veli.yang
d96457909d [CELEBORN-1911] Move multipart-uploader to multipart-uploader/multipart-uploader-s3 for extensibility
### What changes were proposed in this pull request?
- close [CELEBORN-1911](https://issues.apache.org/jira/browse/CELEBORN-1911)

This PR refactors the project structure by moving the multipart-uploader module into multipart-uploader/multipart-uploader-s3.

### Why are the changes needed?
This change improves modularity and enables future extensions, such as multipart-uploader/multipart-uploader-oss, allowing better support for multiple object storage backends.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Deployment integration testing has been completed in the local environment.

Closes #3153 from shouwangyw/optimize/mpu-s3.

Authored-by: veli.yang <897900564@qq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-03-14 22:34:32 +08:00
Wang, Fei
3d05c8998f [CELEBORN-1895] Bump log4j2 version to 2.24.3
### What changes were proposed in this pull request?

Bump log4j2 version to 2.24.3
https://github.com/apache/logging-log4j2/releases/tag/rel%2F2.24.3

### Why are the changes needed?
Bump to latest log4j2 bug fix release.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA.

Closes #3134 from turboFei/log4j2.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-03-10 11:30:52 +08:00
Cheng Pan
e85207e2c7 [CELEBORN-1413][FOLLOWUP] Rename celeborn-client-spark-3-4 back to celeborn-client-spark-3
### What changes were proposed in this pull request?

This PR partially reverts the change of https://github.com/apache/celeborn/pull/2813, namely, restores the renaming of `celeborn-client-spark-3`

### Why are the changes needed?

The renaming is not necessary, and might cause some confusion, for example, I wrongly interpreted the `spark-3-4` as Spark 3.4, it also increases the backport efforts for branch-0.5

### Does this PR introduce _any_ user-facing change?

No, it's dev only, before/after this change, the end users always use the shaded client

```
celeborn-client-spark-2-shaded_2.11-0.6.0-SNAPSHOT.jar
celeborn-client-spark-3-shaded_2.12-0.6.0-SNAPSHOT.jar
celeborn-client-spark-4-shaded_2.13-0.6.0-SNAPSHOT.jar
```

### How was this patch tested?

Pass GA.

Closes #3133 from pan3793/CELEBORN-1413-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-03-04 22:25:10 +08:00
SteNicholas
15e34eca6e [CELEBORN-1890] Bump Spark from 3.5.4 to 3.5.5
### What changes were proposed in this pull request?

Bump Spark from 3.5.4 to 3.5.5.

### Why are the changes needed?

Spark 3.5.5 has been announced to release: [Spark 3.5.5 released](https://spark.apache.org/news/spark-3-5-5-released.html). The profile spark-3.5 could bump Spark from 3.5.4 to 3.5.5.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3129 from SteNicholas/CELEBORN-1890.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-03-04 14:15:04 +08:00
SteNicholas
d90cf0d427 [CELEBORN-1884] Bump rocksdbjni version from 9.5.2 to 9.10.0
### What changes were proposed in this pull request?

Bump rocksdbjni version from 9.5.2 to 9.10.0.

### Why are the changes needed?

There are some bug fixes and performance Improvements. The full release notes:

- https://github.com/facebook/rocksdb/releases/tag/v9.6.1
- https://github.com/facebook/rocksdb/releases/tag/v9.7.3
- https://github.com/facebook/rocksdb/releases/tag/v9.7.4
- https://github.com/facebook/rocksdb/releases/tag/v9.8.4
- https://github.com/facebook/rocksdb/releases/tag/v9.9.3
- https://github.com/facebook/rocksdb/releases/tag/v9.10.0

Backport:

- https://github.com/apache/spark/pull/48155
- https://github.com/apache/spark/pull/49538
- https://github.com/apache/spark/pull/50076

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3121 from SteNicholas/CELEBORN-1884.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-28 11:29:42 +08:00
Nicholas Jiang
79b49805e8 [CELEBORN-1877] Bump zstd-jni version from 1.5.2-1 to 1.5.7-1
### What changes were proposed in this pull request?

Bump zstd-jni version from 1.5.2-1 to 1.5.7-1.

### Why are the changes needed?

Bump zstd-jni to the latest version.

Backport https://github.com/apache/spark/pull/50057.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #3114 from SteNicholas/CELEBORN-1877.

Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-25 11:08:44 +08:00
Nicholas Jiang
5b507aed72 [CELEBORN-1872] Bump Flink from 1.19.1, 1.20.0 to 1.19.2, 1.20.1
### What changes were proposed in this pull request?

Bump Flink from 1.19.1, 1.20.0 to 1.19.2, 1.20.1.

### Why are the changes needed?

Flink 1.19.2 and 1.20.1 have already released.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3107 from SteNicholas/CELEBORN-1872.

Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
2025-02-19 10:49:44 +08:00
Nicholas Jiang
2dd26936e8 [CELEBORN-1864] Bump Netty version from 4.1.115.Final to 4.1.118.Final
### What changes were proposed in this pull request?

Bump Netty version from 4.1.115.Final to 4.1.118.Final.

### Why are the changes needed?

The Netty 4.1.118.Final version has been released, which netty version is 4.1.115.Final at present. The changes between 4.1.115.Final and 4.1.118.Final is as follows:

- 4.1.116.Final: https://netty.io/news/2024/12/17/4-1-116-Final.html
- 4.1.117.Final: https://netty.io/news/2025/01/14/4-1-117-Final.html
- 4.1.118.Final: https://netty.io/news/2025/02/10/4-1-118-Final.html
   - **SslHandler doesn't correctly validate packets which can lead to native crash when using native SSLEngine.**
   - **Denial of Service attack on windows app using Netty, again.**

Backport:

- https://github.com/apache/spark/pull/49756
- https://github.com/apache/spark/pull/49923

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3098 from SteNicholas/CELEBORN-1864.

Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-15 11:46:28 +08:00
madlnu
b5c00ea645 [CELEBORN-1862] Bump Ratis version from 3.1.2 to 3.1.3
### What changes were proposed in this pull request?
Upgrading ratis version to 3.1.3

### Why are the changes needed?
For fixing the CVE-2024-7254 and sonatype-2020-0026 coming from its transitive dependency - ratis-thirdparty-misc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Locally and CI tests

Closes #3095 from Madhukar525722/main.

Authored-by: madlnu <madlnu@visa.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-12 17:46:58 +08:00
mingji
75b697d815 [CELEBORN-1838] Interrupt spark task should not report fetch failure
Backport https://github.com/apache/celeborn/pull/3070 to main branch.
## What changes were proposed in this pull request?
Do not trigger fetch failure if a spark task attempt is interrupted(speculation enabled). Do not trigger fetch failure if the RPC of getReducerFileGroup is timeout. This PR is intended for celeborn-0.5 branch.

## Why are the changes needed?
Avoid unnecessary fetch failures and stage re-runs.

## Does this PR introduce any user-facing change?
NO.

## How was this patch tested?
1. GA.
2. Manually tested on cluster with spark speculation tasks.

Here is the test case
```scala
sc.parallelize(1 to 100, 100).flatMap(i => {
        (1 to 150000).iterator.map(num => num)
      }).groupBy(i => i, 100)
      .map(i => {
        if (i._1 < 5) {
          Thread.sleep(15000)
        }
        i
      })
      .repartition(400).count
```

<img width="1384" alt="截屏2025-01-18 16 16 16" src="https://github.com/user-attachments/assets/adf64857-5773-4081-a7d0-fa3439e751eb" /> <img width="1393" alt="截屏2025-01-18 16 16 22" src="https://github.com/user-attachments/assets/ac9bf172-1ab4-4669-a930-872d009f2530" /> <img width="1258" alt="截屏2025-01-18 16 19 15" src="https://github.com/user-attachments/assets/6a8ff3e1-c1fb-4ef2-84d8-b1fc6eb56fa6" /> <img width="892" alt="截屏2025-01-18 16 17 27" src="https://github.com/user-attachments/assets/f9de3841-f7d4-4445-99a3-873235d4abd0" />

Closes #3070 from FMX/branch-0.5-b1838.

Authored-by: mingji <fengmingxiao.fmxalibaba-inc.com>

Closes #3080 from turboFei/b1838.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-01-23 14:46:36 +08:00
SteNicholas
30e46eee28
[CELEBORN-1842] Bump ap-loader version from 3.0-8 to 3.0-9
### What changes were proposed in this pull request?

Bump ap-loader version from 3.0-8 to 3.0-9.

### Why are the changes needed?

ap-loader has already released v3.0-9, which should bump version from 3.0-8 for `JVMProfiler`.

Backport:

1. https://github.com/apache/spark/pull/46402
2. https://github.com/apache/spark/pull/49440

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3072 from SteNicholas/CELEBORN-1842.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-01-21 12:22:00 +08:00
SteNicholas
19fecadcd7 [CELEBORN-1413][FOLLOWUP] Bump zstd-jni version to 1.5.6-5 for 4.0.0-preview2
### What changes were proposed in this pull request?

Bump `zstd-jni` version to 1.5.6-5 for 4.0.0-preview2.

### Why are the changes needed?

`zstd-jni` version is 1.5.6-5 for 4.0.0-preview2 for [<version>1.5.6-5</version>](https://github.com/apache/spark/blob/v4.0.0-preview2/pom.xml#L838C18-L838C25).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3054 from SteNicholas/CELEBORN-1413.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-01-07 17:37:22 +08:00
codenohup
a57238024e
[CELEBORN-1801] Remove out-of-dated flink 1.14 and 1.15
### What changes were proposed in this pull request?
Remove out-of-dated flink 1.14 and 1.15.

For more information, please see the discussion thread: https://lists.apache.org/thread/njho00zmkjx5qspcrbrkogy8s4zzmwv9

### Why are the changes needed?
Reduce maintenance burden.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Changes can be covered by existing tests.

Closes #3029 from codenohup/remove-flink14and15.

Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-12-30 15:33:44 +08:00
hongguangwei
d0d8edfe22 [CELEBORN-1737] Support build tez client package
### What changes were proposed in this pull request?
Add Tez packaging script.

### Why are the changes needed?
To support build tez client.

### Does this PR introduce _any_ user-facing change?
Yes, enable Celeborn with tez support.

### How was this patch tested?
Cluster test.

Closes #3028 from GH-Gloway/1737.

Lead-authored-by: hongguangwei <hongguangwei@bytedance.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-30 11:01:19 +08:00
SteNicholas
eb59c17638
[CELEBORN-1806] Bump Spark from 3.5.3 to 3.5.4
### What changes were proposed in this pull request?

Bump Spark from 3.5.3 to 3.5.4.

### Why are the changes needed?

Spark 3.5.4 has been announced to release: [Spark 3.5.4 released](https://spark.apache.org/news/spark-3-5-4-released.html). The profile spark-3.5 could bump Spark from 3.5.3 to 3.5.4.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3034 from SteNicholas/CELEBORN-1806.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-12-27 16:29:35 +08:00
mingji
fde6365f68 [CELEBORN-1413] Support Spark 4.0
### What changes were proposed in this pull request?
To support Spark 4.0.0 preview.

### Why are the changes needed?
1. Changed Scala to 2.13.
2. Introduce columnar shuffle module for spark 4.0.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Cluster test.

Closes #2813 from FMX/b1413.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-24 18:12:27 +08:00
Wang, Fei
27e34ecad0 [CELEBORN-1797] Support to adjust the logger level with RESTful API during runtime
### What changes were proposed in this pull request?

Support to adjust the logger level during runtime without restarting the server.

### Why are the changes needed?
It is useful for debug, likes hadoop daemonlog command: https://hadoop.apache.org/docs/r3.4.1/hadoop-project-dist/hadoop-common/CommandsManual.html#daemonlog

### Does this PR introduce _any_ user-facing change?
Yes, new RESTful api.

### How was this patch tested?

GA.
<img width="1430" alt="image" src="https://github.com/user-attachments/assets/9d974fd9-21f3-429a-a35f-e15662aa75ac" />
<img width="1428" alt="image" src="https://github.com/user-attachments/assets/ca32b596-12a1-4038-9e1b-4fdc6a377b54" />

<img width="1255" alt="image" src="https://github.com/user-attachments/assets/5c399a73-9f53-43a8-b337-5a79621abea4" />
<img width="1244" alt="image" src="https://github.com/user-attachments/assets/16dc9ede-01bb-4e38-80fe-acb044ae6cc7" />

Closes #3022 from turboFei/log_level.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-24 11:24:30 +08:00
Fei Wang
6b884dee66 [CELEBORN-1777] Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS
### What changes were proposed in this pull request?
As title, add `--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED` into default java options.

### Why are the changes needed?

It is necessary for JDK17 + HDFS Storage + Kerberos enabled, see details in https://github.com/apache/spark/pull/34615

The exception stack likes:
```
Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm
	at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:306)
	at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
....
Caused by: java.lang.IllegalAccessException: class org.apache.hadoop.security.authentication.util.KerberosUtil cannot access class sun.security.krb5.Config (in module java.security.jgss) because module java.security.jgss does not export sun.security.krb5 to unnamed module 3a0baae5
	at java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392)
	at java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674)
	at java.base/java.lang.reflect.Method.invoke(Method.java:560)
	at org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:85)
	at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
	... 9 more
```
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
GA.

Closes #2999 from turboFei/jdk_opt_krb5.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-19 14:54:21 +08:00
SteNicholas
f3dac7e879 [CELEBORN-1712] Bump Netty version from 4.1.109.Final to 4.1.115.Final
### What changes were proposed in this pull request?

Bump Netty version from 4.1.109.Final to 4.1.115.Final.

### Why are the changes needed?

The Netty 4.1.115.Final version has been released, which netty version is 4.1.109.Final at present. The changes between 4.1.110.Final and 4.1.115.Final is as follows:

- [4.1.110.Final](https://netty.io/news/2024/05/22/4-1-110-Final.html)
- [4.1.111.Final](https://netty.io/news/2024/06/11/4-1-111-Final.html)
- [4.1.112.Final](https://netty.io/news/2024/07/19/4-1-112-Final.html)
- [4.1.113.Final](https://netty.io/news/2024/09/04/4-1-113-Final.html)
- [4.1.114.Final](https://netty.io/news/2024/10/01/4-1-114-Final.html)
- [4.1.115.Final](https://netty.io/news/2024/11/12/4-1-115-Final.html)

Bump https://github.com/apache/spark/pull/46945.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2903 from SteNicholas/CELEBORN-1712.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-17 17:29:07 +08:00
zhaohehuhu
3bf91929b6 [CELEBORN-1746] Reduce the size of aws dependencies
### What changes were proposed in this pull request?
Due to the large size of the AWS cloud vendor's client JARs, this PR aims to keep AWS s3 module only to reduce the AWS dependency size from over 296MB to around 2.3MB

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

<img width="2560" alt="Screenshot 2024-11-25 at 16 17 52" src="https://github.com/user-attachments/assets/efebbe7d-73cb-47fb-b7fa-9aae052f744b">
tested on lab shown as above picture

Closes #2944 from zhaohehuhu/dev-1125.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-28 19:45:01 +08:00
zhaohehuhu
a2d3972318 [CELEBORN-1530] support MPU for S3
### What changes were proposed in this pull request?

as title

### Why are the changes needed?
AWS S3 doesn't support append, so Celeborn had to copy the historical data from s3 to worker and write to s3 again, which heavily scales out the write. This PR implements a better solution via MPU to avoid copy-and-write.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

![WechatIMG257](https://github.com/user-attachments/assets/968d9162-e690-4767-8bed-e490e3055753)

I conducted an experiment with a 1GB input dataset to compare the performance of Celeborn using only S3 storage versus using SSD storage. The results showed that Celeborn with SSD storage was approximately three times faster than with only S3 storage.

<img width="1728" alt="Screenshot 2024-11-16 at 13 02 10" src="https://github.com/user-attachments/assets/8f879c47-c01a-4004-9eae-1c266c1f3ef2">

The above screenshot is the second test with 5000 mapper and reducer that I did.

Closes #2830 from zhaohehuhu/dev-1021.

Lead-authored-by: zhaohehuhu <luoyedeyi@163.com>
Co-authored-by: He Zhao <luoyedeyi459@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-22 15:03:53 +08:00
SteNicholas
7d1da5e915 [CELEBORN-1702] Bump Ratis version from 3.1.1 to 3.1.2
### What changes were proposed in this pull request?

Bump Ratis version from 3.1.1 to 3.1.2 including:

- Fix NPE in `RaftServerImpl.getLogInfo`: https://github.com/apache/ratis/pull/1171

### Why are the changes needed?

Bump Ratis version from 3.1.1 to 3.1.2. Ratis has released v3.1.2, of which release note refers to [3.1.2](https://ratis.apache.org/post/3.1.2.html). The 3.1.2 version is a minor release with multiple improvements and bugfixes including [[RATIS-2179] Fix NPE in `RaftServerImpl.getLogInfo`](https://issues.apache.org/jira/browse/RATIS-2179). See the [changes between 3.1.1 and 3.1.2](https://github.com/apache/ratis/compare/ratis-3.1.1...ratis-3.1.2) releases.

The 3.1.2 version fixed the following `NullPointerException` in CI log:

```
[info] Test org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterRatisServerSuiteJ.testIsLeader started
24/10/24 08:16:30,295 ERROR [pool-1-thread-1] HARaftServer: Failed to retrieve RaftPeerRole. Setting cached role to UNRECOGNIZED and resetting leader info.
java.io.IOException: java.lang.NullPointerException
    at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
    at org.apache.ratis.server.impl.RaftServerImpl.waitForReply(RaftServerImpl.java:1148)
    at org.apache.ratis.server.impl.RaftServerProxy.getGroupInfo(RaftServerProxy.java:607)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.getGroupInfo(HARaftServer.java:599)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.updateServerRole(HARaftServer.java:514)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.isLeader(HARaftServer.java:489)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterRatisServerSuiteJ.testIsLeader(MasterRatisServerSuiteJ.java:47)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runners.Suite.runChild(Suite.java:128)
    at org.junit.runners.Suite.runChild(Suite.java:27)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
    at com.novocode.junit.JUnitTask.execute(JUnitTask.java:64)
    at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
    at org.apache.ratis.server.impl.RaftServerImpl.getLogInfo(RaftServerImpl.java:665)
    at org.apache.ratis.server.impl.RaftServerImpl.getGroupInfo(RaftServerImpl.java:658)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$getGroupInfoAsync$23(RaftServerProxy.java:613)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
    at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2897 from SteNicholas/CELEBORN-1702.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 17:15:20 +08:00
Wang, Fei
330b2a094e [CELEBORN-1708] Bump protobuf version from 3.21.7 to 3.25.5
### What changes were proposed in this pull request?

Bump protobuf from 3.21.7 to 3.25.5.

### Why are the changes needed?

To fix CVE: https://github.com/advisories/GHSA-735f-pc8j-v9w8

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

GA.

Closes #2898 from turboFei/bump_protobuf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 17:02:23 +08:00
Wang, Fei
09ffee0365 [CELEBORN-1709] Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826
### What changes were proposed in this pull request?

 Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826

### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-g8m5-722r-8whq

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

Closes #2899 from turboFei/bump_jetty.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 16:58:44 +08:00
Wang, Fei
6d2b9f6d92 [CELEBORN-1710] Bump commons-io version from 2.13.0 to 2.17.0
### What changes were proposed in this pull request?
 Bump commons-io from 2.13.0 to 2.17.0

### Why are the changes needed?

To fix CVE: https://github.com/advisories/GHSA-78wr-2p64-hpwj

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA.

Closes #2900 from turboFei/bump_commons_io.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 16:57:29 +08:00
Wang, Fei
4545cdc401
[CELEBORN-1699] Add GA to check the openapi codegen
### What changes were proposed in this pull request?
Check the consistency between openapi spec and generated openapi-client code in GA.
Fix the ThreadStack model.

### Why are the changes needed?
Address comments: https://github.com/apache/celeborn/pull/2892#issuecomment-2463170494

Followup for https://github.com/apache/celeborn/pull/2888
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

GA.

It works:
https://github.com/apache/celeborn/actions/runs/11733693060/job/32688436233?pr=2893

<img width="1059" alt="image" src="https://github.com/user-attachments/assets/84682976-1b7d-42e0-9b62-2966f3e952d7">

After ThreadStack fixed:
<img width="1368" alt="image" src="https://github.com/user-attachments/assets/14a2a08f-dbc2-409c-a4ed-fbfee82e50b5">

Closes #2893 from turboFei/diff_openapi.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-08 10:57:46 +08:00
Wang, Fei
7dc72b35e7 [CELEBORN-1477][FOLLOWUP] Remove scala binary version from openapi-client artifactId
### What changes were proposed in this pull request?

1. remove scala binary version from the openapi-client artifactId.
2. skip openapi-client doc compile, it was missed in https://github.com/apache/celeborn/pull/2641

### Why are the changes needed?

Because the openapi-client is a pure java module.

### Does this PR introduce _any_ user-facing change?

No, it has not been released.

### How was this patch tested?
GA.

Closes #2861 from turboFei/remove_Scala.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-01 15:08:00 +08:00
SteNicholas
165e914b9b [CELEBORN-1672] Bump Spark from 3.4.3 to 3.4.4
### What changes were proposed in this pull request?

Bump Spark from 3.4.3 to 3.4.4.

### Why are the changes needed?

Spark 3.4.4 has been announced to release: [Spark 3.4.4 released](https://spark.apache.org/news/spark-3-4-4-released.html). The profile spark-3.4 could bump Spark from 3.4.3 to 3.4.4.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2851 from SteNicholas/CELEBORN-1672.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-01 11:05:00 +08:00
Fu Chen
39c185afd1 [CELEBORN-1677][BUILD] Update SCM information for SBT build configuration
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

This PR addresses a conflict in the sbt generated POM by replacing `pomExtra` with `scmInfo`

```diff
         <name>org.apache.celeborn</name>
     </organization>
     <scm>
-        <url>https://github.com/cfmcgrady/incubator-celeborn</url>
-        <connection>scm:git:https://github.com/cfmcgrady/incubator-celeborn.git</connection>
-        <developerConnection>scm:git:gitgithub.com:cfmcgrady/incubator-celeborn.git</developerConnection>
-    </scm>
-    <url>https://celeborn.apache.org/</url>
-    <scm>
-        <url>gitgithub.com:apache/celeborn.git</url>
-        <connection>scm:git:gitgithub.com:apache/celeborn.git</connection>
+        <url>https://celeborn.apache.org/</url>
+        <connection>scm:git:https://github.com/apache/celeborn.git</connection>
+        <developerConnection>scm:git:gitgithub.com:apache/celeborn.git</developerConnection>
     </scm>

```

The conflicting POM might block publishing to a private Maven repository.

```
[error] Caused by: java.io.IOException: Server returned HTTP response code: 409 for URL: https://artifactory.devops.xxx.com/artifactory/maven-snapshots/org/apache/celeborn/celeborn-client-spark-3-shaded_2.12/0.6.0-SNAPSHOT/celeborn-client-spark-3-shaded_2.12-0.6.0-SNAPSHOT.pom
[error]         at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:2000)
[error]         at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589)
[error]         at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
[error]         at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:308)
[error]         at org.apache.ivy.util.url.BasicURLHandler.upload(BasicURLHandler.java:284)
[error]         at org.apache.ivy.util.url.URLHandlerDispatcher.upload(URLHandlerDispatcher.java:82)
[error]         at org.apache.ivy.util.FileUtil.copy(FileUtil.java:150)
[error]         at org.apache.ivy.plugins.repository.url.URLRepository.put(URLRepository.java:84)
[error]         at sbt.internal.librarymanagement.ConvertResolver$LocalIfFileRepo.put(ConvertResolver.scala:407)
[error]         at org.apache.ivy.plugins.repository.AbstractRepository.put(AbstractRepository.java:130)
[error]         at sbt.internal.librarymanagement.ConvertResolver$ChecksumFriendlyURLResolver.put(ConvertResolver.scala:124)
[error]         at sbt.internal.librarymanagement.ConvertResolver$ChecksumFriendlyURLResolver.put$(ConvertResolver.scala:111)
[error]         at sbt.internal.librarymanagement.ConvertResolver$$anonfun$defaultConvert$lzycompute$1$PluginCapableResolver$1.put(ConvertResolver.scala:170)
[error]         at org.apache.ivy.plugins.resolver.RepositoryResolver.publish(RepositoryResolver.java:216)
[error]         at sbt.internal.librarymanagement.IvyActions$.$anonfun$publish$5(IvyActions.scala:501)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

local

Closes #2858 from cfmcgrady/sbt-scm.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2024-10-29 14:04:58 +08:00
Fu Chen
3b9c2f04e7 [CELEBORN-1666] Bump scala-protoc from 1.0.6 to 1.0.7
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

The version 1.0.6 is outdated and not available on Maven Central.

https://mvnrepository.com/artifact/com.thesamet/sbt-protoc_2.12_1.0

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass CI

Closes #2842 from cfmcgrady/sbt-protoc.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2024-10-24 11:16:37 +08:00
Fu Chen
24b9b24712 [CELEBORN-1658] Add Git Commit Info and Build JDK Spec to sbt Manifest
### What changes were proposed in this pull request?

This PR  adding Git commit information and JVM build specifications to package manifest.

the `META-INF/MANIFEST.MF` before this PR:

```
Manifest-Version: 1.0
Specification-Title: celeborn-client-spark-3-shaded
Specification-Version: 0.6.0-SNAPSHOT
Specification-Vendor: org.apache.celeborn
Implementation-Title: celeborn-client-spark-3-shaded
Implementation-Version: 0.6.0-SNAPSHOT
Implementation-Vendor: org.apache.celeborn
Implementation-Vendor-Id: org.apache.celeborn
```

after this PR:

```
Manifest-Version: 1.0
Specification-Title: celeborn-client-spark-3-shaded
Specification-Version: 0.6.0-SNAPSHOT
Specification-Vendor: org.apache.celeborn
Implementation-Title: celeborn-client-spark-3-shaded
Implementation-Version: 0.6.0-SNAPSHOT
Implementation-Vendor: org.apache.celeborn
Implementation-Vendor-Id: org.apache.celeborn
Build-Jdk-Spec: 17.0.9
Build-Revision: 03247c19f1b38096a4080fe97e94dbeb20ebcbe9
Build-Branch: jdk-git-spec
Build-Time: 2024-10-18T17:53:02.723124+08:00[Asia/Shanghai]
```

```
Manifest-Version: 1.0
Specification-Title: celeborn-client-spark-3-shaded
Specification-Version: 0.6.0-SNAPSHOT
Specification-Vendor: org.apache.celeborn
Implementation-Title: celeborn-client-spark-3-shaded
Implementation-Version: 0.6.0-SNAPSHOT
Implementation-Vendor: org.apache.celeborn
Implementation-Vendor-Id: org.apache.celeborn
Build-Jdk-Spec: 17.0.9
Build-Revision: N/A
Build-Branch:
Build-Time: 2024-10-18T17:54:16.932121+08:00[Asia/Shanghai]
```

### Why are the changes needed?

As title.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

local

Closes #2821 from cfmcgrady/jdk-git-spec.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-21 11:05:38 +08:00
SteNicholas
651cbebc1a [CELEBORN-1525] Bump Ratis version from 3.1.0 to 3.1.1
### What changes were proposed in this pull request?

Bump Ratis version from 3.1.0 to 3.1.1 including:

- Remove `address2String` and use `setAddress(ratisAddr)` with the release of https://github.com/apache/ratis/pull/1125.
- Support `raft.grpc.message.size.max` must be 1m larger than `raft.server.log.appender.buffer.byte-limit` for https://github.com/apache/ratis/pull/1132.

### Why are the changes needed?

Bump Ratis version from 3.1.0 to 3.1.1. Ratis has released v3.1.1, of which release note refers to [3.1.1](https://ratis.apache.org/post/3.1.1.html). The 3.1.1 version is a minor release with multiple improvements and bugfixes including [[RATIS-2116] Fix the issue where RaftServerImpl.appendEntries may be blocked indefinitely](https://issues.apache.org/jira/browse/RATIS-2116), [[RATIS-2131] Configuring Ratis fails when hostname is used, and is an IPv6 host](https://issues.apache.org/jira/browse/RATIS-2131). See the [changes between 3.1.0 and 3.1.1](https://github.com/apache/ratis/compare/ratis-3.1.0...ratis-3.1.1) releases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2759 from SteNicholas/CELEBORN-1525.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2024-09-26 10:45:38 -05:00
SteNicholas
416c84acce [CELEBORN-1613] Bump Spark from 3.5.2 to 3.5.3
### What changes were proposed in this pull request?

Bump Spark from 3.5.2 to 3.5.3.

### Why are the changes needed?

Spark 3.5.3 has been announced to release: [Spark 3.5.3 released](https://spark.apache.org/news/spark-3-5-3-released.html). The profile spark-3.5 could bump Spark from 3.5.2 to 3.5.3.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2760 from SteNicholas/CELEBORN-1613.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-25 15:32:04 +08:00
Wang, Fei
909d6c3b9c [CELEBORN-1477][FOLLOWUP] Upgrade openapi-generator to 7.8.0
### What changes were proposed in this pull request?
This pr is a followup for https://github.com/apache/celeborn/pull/2641

In above PR, I upgrade the version to 7.7.0, and there were two generated java files not with apache licenses.

And then I raised a PR in https://github.com/OpenAPITools/openapi-generator/pull/19273 to followup it, and it is released in https://github.com/OpenAPITools/openapi-generator/releases/tag/v7.8.0.

### Why are the changes needed?

Upgrade to the latest openapi-generator version to resolve the unlicensed java files.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing GA.

Closes #2695 from turboFei/openapi_upgrade.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-24 16:02:09 +08:00
Wang, Fei
1596cffb39 [CELEBORN-1607] Enable useEnumCaseInsensitive for openapi-generator
### What changes were proposed in this pull request?
Enable `useEnumCaseInsensitive` for openapi-generator.
And then in celeborn server end, the enum will be mapped to celeborn internal WorkerEventType.

### Why are the changes needed?

I met exception when sending worker event with openapi sdk.
```
Exception in thread "main" ApiException{code=400, responseHeaders={Server=[Jetty(9.4.52.v20230823)], Content-Length=[491], Date=[Fri, 20 Sep 2024 23:50:27 GMT], Content-Type=[text/plain]}, responseBody='Cannot deserialize value of type `org.apache.celeborn.rest.v1.model.SendWorkerEventRequest$EventTypeEnum` from String "DecommissionThenIdle": not one of the values accepted for Enum class: [DECOMMISSION_THEN_IDLE, GRACEFUL, NONE, DECOMMISSION, IMMEDIATELY, RECOMMISSION]
 at [Source: (org.glassfish.jersey.message.internal.ReaderInterceptorExecutor$UnCloseableInputStream); line: 1, column: 14] (through reference chain: org.apache.celeborn.rest.v1.model.SendWorkerEventRequest["eventType"])'}
    at org.apache.celeborn.rest.v1.master.invoker.ApiClient.processResponse(ApiClient.java:913)
    at org.apache.celeborn.rest.v1.master.invoker.ApiClient.invokeAPI(ApiClient.java:1000)
    at org.apache.celeborn.rest.v1.master.WorkerApi.sendWorkerEvent(WorkerApi.java:378)
    at org.apache.celeborn.rest.v1.master.WorkerApi.sendWorkerEvent(WorkerApi.java:334)
    at org.example.Main.main(Main.java:22)

```

The testing code to re-produce:
```
package org.example;

import org.apache.celeborn.rest.v1.master.WorkerApi;
import org.apache.celeborn.rest.v1.master.invoker.ApiClient;
import org.apache.celeborn.rest.v1.model.ExcludeWorkerRequest;
import org.apache.celeborn.rest.v1.model.SendWorkerEventRequest;
import org.apache.celeborn.rest.v1.model.WorkerId;

public class Main {
    public static void main(String[] args) throws Exception {

        String cmUrl = "http://localhost:9098";
        WorkerApi workerApi = new WorkerApi(new ApiClient().setBasePath(cmUrl));
        workerApi.excludeWorker(new ExcludeWorkerRequest()
                .addAddItem(new WorkerId()
                        .host("localhost")
                        .rpcPort(1)
                        .pushPort(2)
                        .fetchPort(3)
                        .replicatePort(4)));
        workerApi.sendWorkerEvent(new SendWorkerEventRequest()
                        .addWorkersItem(new WorkerId()
                                .host("127.0.0.1")
                                .rpcPort(56116)
                                .pushPort(56117)
                                .fetchPort(56119)
                                .replicatePort(56118))
                .eventType(SendWorkerEventRequest.EventTypeEnum.DECOMMISSION_THEN_IDLE));
    }
}
```

Seems because for the EventTypeEnum, the name and value not the same and then cause this issue.

Not sure why the UT passed, but the integration testing failed.

For EventTypeEnum, because its value is case sensitive, so we meet this issue.

8734d16638/openapi/openapi-client/src/main/java/org/apache/celeborn/rest/v1/model/SendWorkerEventRequest.java (L47-L83)

Related issue in jersey end I think, https://github.com/eclipse-ee4j/jersey/issues/5288

In this PR, `useEnumCaseInsensitive` is enabled for openapi-generator.

### Does this PR introduce _any_ user-facing change?
No, there is not user facing change and this SDK has not been released yet.

### How was this patch tested?
Existing UT and Integration testing.
<img width="1265" alt="image" src="https://github.com/user-attachments/assets/6a34a0dd-c474-4e8d-b372-19b0fda94972">

Closes #2754 from turboFei/eventTypeEnumMapping.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-23 20:43:04 +08:00
sychen
8734d16638 [CELEBORN-1605] Bump commons-lang3 version from 3.13.0 to 3.17.0
### What changes were proposed in this pull request?

### Why are the changes needed?
https://commons.apache.org/proper/commons-lang/changes-report.html

https://github.com/apache/celeborn/pull/2544#issuecomment-2349065779

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2750 from cxzl25/CELEBORN-1605.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-20 17:37:31 +08:00
sychen
40f8eccecd [CELEBORN-1604] Bump rocksdbjni version from 8.11.3 to 9.5.2
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/facebook/rocksdb/compare/v8.11.3...v9.5.2

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2749 from cxzl25/CELEBORN-1604.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-20 17:35:42 +08:00