Commit Graph

206 Commits

Author SHA1 Message Date
SteNicholas
8be7d928a3 [CELEBORN-2030] Bump Spark from 3.5.5 to 3.5.6
### What changes were proposed in this pull request?

Bump Spark from 3.5.5 to 3.5.6.

### Why are the changes needed?

Spark 3.5.6 has been released: [Spark 3.5.6 released](https://spark.apache.org/news/spark-3-5-6-released.html). The `spark-3.5` profile can therefore bump Spark from 3.5.5 to 3.5.6.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3319 from SteNicholas/CELEBORN-2030.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-06-09 20:40:33 +08:00
Fei Wang
b44730771d [CELEBORN-1413][FOLLOWUP] Bump spark 4.0 version to 4.0.0
### What changes were proposed in this pull request?
Bump spark 4.0 version to 4.0.0.

### Why are the changes needed?
Spark 4.0.0 is ready.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
GA.

Closes #3282 from turboFei/spark_4.0.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-05-28 17:56:08 +08:00
Wang, Fei
cbf4a145c5 Bump 0.7.0-SNAPSHOT 2025-05-21 17:20:12 -07:00
CodingCat
0b5a09a9f7 [CELEBORN-1896] delete data from failed to fetch shuffles
### What changes were proposed in this pull request?

This is joint work with YutingWang98.

Currently we have to wait for Spark's shuffle object GC to clean up the disk space occupied by Celeborn shuffles.

As a result, if a shuffle fails to fetch and is retried, the disk space occupied by the failed attempt cannot be cleaned up. We hit this issue internally when dealing with shuffles at the level of hundreds of TB in a single Spark application: any hiccup in the servers can double or even triple the disk usage.

This PR implements a mechanism to delete files from failed-to-fetch shuffles.

The main idea is simple: cleanup is triggered in LifecycleManager when it applies for a new Celeborn shuffle id for a retried shuffle write stage.

The tricky part is to avoid deleting shuffle files that are referenced by multiple downstream stages; the PR introduces RunningStageManager to track the dependencies between stages.
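
The dependency tracking described above can be pictured as bookkeeping of which running stages still read a shuffle. The Java sketch below is only an illustration under assumptions: the class `ShuffleRefTracker` and all of its methods are hypothetical and are not the `RunningStageManager` API introduced by this PR.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not Celeborn's RunningStageManager.
final class ShuffleRefTracker {
  // shuffle id -> ids of running downstream stages that still read it
  private final Map<Integer, Set<Integer>> readers = new HashMap<>();

  // Record that a running stage reads the given shuffle.
  void stageStarted(int stageId, int shuffleId) {
    readers.computeIfAbsent(shuffleId, k -> new HashSet<>()).add(stageId);
  }

  // Forget a stage once it no longer reads the given shuffle.
  void stageFinished(int stageId, int shuffleId) {
    Set<Integer> stages = readers.get(shuffleId);
    if (stages != null) {
      stages.remove(stageId);
      if (stages.isEmpty()) {
        readers.remove(shuffleId);
      }
    }
  }

  // Assumed rule: the files of a shuffle may be deleted only if no stage
  // other than the one that reported the fetch failure still reads it.
  boolean canDelete(int shuffleId, int failedStageId) {
    Set<Integer> stages = readers.get(shuffleId);
    if (stages == null) {
      return true;
    }
    for (int stageId : stages) {
      if (stageId != failedStageId) {
        return false;
      }
    }
    return true;
  }
}
```

With two stages reading shuffle 0, `canDelete(0, 1)` stays false until the other stage has finished, which is the situation the PR has to guard against.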

### Why are the changes needed?

saving disk space

### Does this PR introduce _any_ user-facing change?

a new config

### How was this patch tested?

We manually deleted some files.

![image](https://github.com/user-attachments/assets/4136cd52-78b2-44e7-8244-db3c5bf9d9c4)

From the above screenshot we can see that originally we have shuffles 0 and 1. After shuffle 1 hits a chunk fetch failure, it triggers a retry of shuffle 0 (as shuffle 2), but at this moment shuffle 0 has already been deleted from the workers.

![image](https://github.com/user-attachments/assets/7d3b4d90-ae5a-4a54-8dec-a5005850ef0a)

In the logs, we can see that in the middle of the application, the unregister shuffle request was sent for shuffle 0.

Closes #3109 from CodingCat/delete_fi.

Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-21 11:23:11 +08:00
SteNicholas
8e66ac833a [CELEBORN-1994] Introduce disruptor dependency to support asynchronous logging of log4j2
### What changes were proposed in this pull request?

Introduce disruptor dependency to support asynchronous logging of log4j2.

### Why are the changes needed?

We add `-Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector` to `CELEBORN_MASTER_JAVA_OPTS` and `CELEBORN_WORKER_JAVA_OPTS` in production environments. `AsyncLoggerContextSelector` depends on the disruptor library, so it is recommended to introduce the disruptor dependency to support log4j2 asynchronous loggers.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster test.

Closes #3246 from SteNicholas/CELEBORN-1994.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-13 19:45:51 +08:00
Aravind Patnam
714722b5d3 [CELEBORN-1982] Slot Selection Perf Improvements
### What changes were proposed in this pull request?
After profiling to see where the hotspots are for slot selection, we identified 2 main areas:
- `iter.remove` ([link](https://github.com/apache/celeborn/blob/main/master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java#L447)) is a major hotspot, especially if `partitionIdList` is massive: it is an ArrayList and we remove from the beginning, so each deletion costs O(n).
- `haveDisk` is computed per partitionId, iterated across all workers.  We precompute this and store it as a field in `WorkerInfo`.
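
The first hotspot is easy to reproduce in isolation. The Java sketch below is an illustration only: `FrontRemovalDemo` and its methods are hypothetical names, not the code in `SlotsAllocator`, and taking from the tail is just one possible way to avoid the cost, not necessarily what this PR does. Removing through the iterator at the head of an `ArrayList` shifts every remaining element on each call, while removing the last element does not shift anything.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not the SlotsAllocator code.
final class FrontRemovalDemo {
  // Build a mutable list [0, 1, ..., n - 1].
  static List<Integer> range(int n) {
    List<Integer> ids = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      ids.add(i);
    }
    return ids;
  }

  // Takes up to `count` ids from the head. Each Iterator.remove() on an
  // ArrayList shifts all remaining elements left, so one removal is O(n).
  static List<Integer> takeFromHead(List<Integer> ids, int count) {
    List<Integer> taken = new ArrayList<>(count);
    Iterator<Integer> it = ids.iterator();
    while (it.hasNext() && taken.size() < count) {
      taken.add(it.next());
      it.remove();
    }
    return taken;
  }

  // Takes up to `count` ids from the tail. Removing the last element of an
  // ArrayList needs no shifting, so one removal is O(1).
  static List<Integer> takeFromTail(List<Integer> ids, int count) {
    List<Integer> taken = new ArrayList<>(count);
    while (!ids.isEmpty() && taken.size() < count) {
      taken.add(ids.remove(ids.size() - 1));
    }
    return taken;
  }

  public static void main(String[] args) {
    System.out.println(takeFromHead(range(5), 2));
    System.out.println(takeFromTail(range(5), 2));
  }
}
```

Both methods hand out the same number of ids; they differ only in which end of the list they consume, which is what determines the per-removal cost.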

See the below flamegraph for the hotspot of `iter.remove` (`oop_disjoint_arraycopy`) after running a benchmark.

![Screenshot 2025-04-24 at 12 58 34 AM](https://github.com/user-attachments/assets/30bb38f7-9a92-4b52-8480-5e7f26b0d48b)

Below is what we actually observed in production which matches with the above observation from the benchmark:
![realprodflamegraph](https://github.com/user-attachments/assets/d06e095c-2d6d-4892-982a-1c2e828eb71e)

### Why are the changes needed?
speed up slot selection performance in the case of large partitionIds

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
After applying the above changes, we can see the hotspot is removed in the flamegraph:
![Screenshot 2025-04-24 at 12 53 24 AM](https://github.com/user-attachments/assets/99372140-5746-4a34-9918-642c81fb52e8)

Benchmarks:
Without changes:
```
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection

# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration   1: 2060198.745 ±(99.9%) 306976.270 us/op
# Warmup Iteration   2: 1137534.950 ±(99.9%) 72065.776 us/op
# Warmup Iteration   3: 1032434.221 ±(99.9%) 59585.256 us/op
# Warmup Iteration   4: 903621.382 ±(99.9%) 41542.172 us/op
# Warmup Iteration   5: 921816.398 ±(99.9%) 44025.884 us/op
Iteration   1: 853276.360 ±(99.9%) 13285.688 us/op
Iteration   2: 865183.111 ±(99.9%) 9691.856 us/op
Iteration   3: 909971.254 ±(99.9%) 10201.037 us/op
Iteration   4: 874154.240 ±(99.9%) 11287.538 us/op
Iteration   5: 907655.363 ±(99.9%) 11893.789 us/op

Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
  882048.066 ±(99.9%) 98360.936 us/op [Average]
  (min, avg, max) = (853276.360, 882048.066, 909971.254), stdev = 25544.023
  CI (99.9%): [783687.130, 980409.001] (assumes normal distribution)

# Run complete. Total time: 00:05:43

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                       Mode  Cnt       Score       Error  Units
SlotsAllocatorBenchmark.benchmarkSlotSelection  avgt    5  882048.066 ± 98360.936  us/op

Process finished with exit code 0
```
With changes:
```
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection

# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration   1: 305437.719 ±(99.9%) 81860.733 us/op
# Warmup Iteration   2: 137498.811 ±(99.9%) 7669.102 us/op
# Warmup Iteration   3: 129355.869 ±(99.9%) 5030.972 us/op
# Warmup Iteration   4: 135311.734 ±(99.9%) 6964.080 us/op
# Warmup Iteration   5: 131013.323 ±(99.9%) 8560.232 us/op
Iteration   1: 133695.396 ±(99.9%) 3713.684 us/op
Iteration   2: 143735.961 ±(99.9%) 5858.078 us/op
Iteration   3: 135619.704 ±(99.9%) 5257.352 us/op
Iteration   4: 128806.160 ±(99.9%) 4541.790 us/op
Iteration   5: 134179.546 ±(99.9%) 5137.425 us/op

Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
  135207.354 ±(99.9%) 20845.544 us/op [Average]
  (min, avg, max) = (128806.160, 135207.354, 143735.961), stdev = 5413.522
  CI (99.9%): [114361.809, 156052.898] (assumes normal distribution)

# Run complete. Total time: 00:05:29

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                       Mode  Cnt       Score       Error  Units
SlotsAllocatorBenchmark.benchmarkSlotSelection  avgt    5  135207.354 ± 20845.544  us/op

Process finished with exit code 0
```

882048.066 us/op without changes vs 135207.354 us/op with changes. That is about a 6.5x improvement.

Closes #3228 from akpatnam25/CELEBORN-1982.

Lead-authored-by: Aravind Patnam <akpatnam25@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-04-27 11:13:21 +08:00
gaoyajun02
e5ccc9b623 [CELEBORN-81][FOLLOWUP] Correct scala test plugin args
### What changes were proposed in this pull request?
Modifies the maven configuration to properly pass Jacoco's argLine to ScalaTest plugin, enabling code coverage measurement for Scala tests.

### Why are the changes needed?
Previously Scala tests were not properly included in code coverage reports

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

Closes #3205 from gaoyajun02/ci.

Authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-04-09 10:23:33 +08:00
veli.yang
7d0ba7f9b8 [CELEBORN-1916] Support Aliyun OSS Based on MPU Extension Interface
### What changes were proposed in this pull request?

- close [CELEBORN-1916](https://issues.apache.org/jira/browse/CELEBORN-1916)
- This PR extends the Multipart Uploader (MPU) interface to support Aliyun OSS.

### Why are the changes needed?

- Implemented multipart-uploader-oss module based on the existing MPU extension interface.
- Added necessary configurations and dependencies for Aliyun OSS integration.
- Ensured compatibility with the existing multipart-uploader framework.
- This enhancement allows seamless multipart upload functionality for Aliyun OSS, similar to the existing AWS S3 support.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Deployment integration testing has been completed in the local environment.

Closes #3157 from shouwangyw/optimize/mpu-oss.

Lead-authored-by: veli.yang <897900564@qq.com>
Co-authored-by: yangwei <897900564@qq.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-04-08 15:10:33 +08:00
SteNicholas
2b8f3520f9 [CELEBORN-1925] Support Flink 2.0
### What changes were proposed in this pull request?

Support Flink 2.0. The major changes of Flink 2.0 include:

- https://github.com/apache/flink/pull/25406: Bump target Java version to 11 and drop support for Java 8.
- https://github.com/apache/flink/pull/25551: Replace `InputGateDeploymentDescriptor#getConsumedSubpartitionIndexRange` with `InputGateDeploymentDescriptor#getConsumedSubpartitionRange(index)`.
- https://github.com/apache/flink/pull/25314: Replace `NettyShuffleEnvironmentOptions#NETWORK_EXCLUSIVE_BUFFERS_REQUEST_TIMEOUT_MILLISECONDS` with `NettyShuffleEnvironmentOptions#NETWORK_BUFFERS_REQUEST_TIMEOUT`.
- https://github.com/apache/flink/pull/25731: Introduce `InputGate#resumeGateConsumption`.

### Why are the changes needed?

Flink 2.0 has been released; see [Release notes - Flink 2.0](https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-2.0).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3179 from SteNicholas/CELEBORN-1925.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
2025-04-07 15:23:20 +08:00
SteNicholas
5f298f5ce2 [CELEBORN-1190][FOLLOWUP] Use -XepDisableWarningsInGeneratedCode to disable warnings for openapi-client module
### What changes were proposed in this pull request?

Use `-XepDisableWarningsInGeneratedCode` to disable warnings for `openapi-client` module.

### Why are the changes needed?

There are some warnings in compilation of `openapi-client` module as follows:

```
$ mvn clean install -pl openapi/openapi-client -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep "[WARNING].*java.*"
[WARNING] /Users/nicholasjiang/Github/celeborn/openapi/openapi-client/src/main/java/org/apache/celeborn/rest/v1/worker/invoker/ApiException.java:[100,18] [OverrideThrowableToString] To return a custom message with a Throwable class, one should override getMessage() instead of toString().
[WARNING] /Users/nicholasjiang/Github/celeborn/openapi/openapi-client/src/main/java/org/apache/celeborn/rest/v1/model/DynamicConfig.java:[55,19] [ImmutableEnumChecker] enums should be immutable: 'LevelEnum' has non-final field 'value'
[WARNING] /Users/nicholasjiang/Github/celeborn/openapi/openapi-client/src/main/java/org/apache/celeborn/rest/v1/master/invoker/ApiException.java:[100,18] [OverrideThrowableToString] To return a custom message with a Throwable class, one should override getMessage() instead of toString().
[WARNING] /Users/nicholasjiang/Github/celeborn/openapi/openapi-client/src/main/java/org/apache/celeborn/rest/v1/model/WorkerExitRequest.java:[51,19] [ImmutableEnumChecker] enums should be immutable: 'TypeEnum' has non-final field 'value'
[WARNING] /Users/nicholasjiang/Github/celeborn/openapi/openapi-client/src/main/java/org/apache/celeborn/rest/v1/model/PartitionLocationData.java:[58,19] [ImmutableEnumChecker] enums should be immutable: 'ModeEnum' has non-final field 'value'
[WARNING] /Users/nicholasjiang/Github/celeborn/openapi/openapi-client/src/main/java/org/apache/celeborn/rest/v1/model/PartitionLocationData.java:[107,19] [ImmutableEnumChecker] enums should be immutable: 'StorageEnum' has non-final field 'value'
[WARNING] /Users/nicholasjiang/Github/celeborn/openapi/openapi-client/src/main/java/org/apache/celeborn/rest/v1/model/SendWorkerEventRequest.java:[60,19] [ImmutableEnumChecker] enums should be immutable: 'EventTypeEnum' has non-final field 'value'
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test:

```
$ mvn clean install -pl openapi/openapi-client -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep "[WARNING].*java.*"
```

Closes #3169 from SteNicholas/CELEBORN-1190.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-03-26 12:04:07 +08:00
veli.yang
d96457909d [CELEBORN-1911] Move multipart-uploader to multipart-uploader/multipart-uploader-s3 for extensibility
### What changes were proposed in this pull request?
- close [CELEBORN-1911](https://issues.apache.org/jira/browse/CELEBORN-1911)

This PR refactors the project structure by moving the multipart-uploader module into multipart-uploader/multipart-uploader-s3.

### Why are the changes needed?
This change improves modularity and enables future extensions, such as multipart-uploader/multipart-uploader-oss, allowing better support for multiple object storage backends.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Deployment integration testing has been completed in the local environment.

Closes #3153 from shouwangyw/optimize/mpu-s3.

Authored-by: veli.yang <897900564@qq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-03-14 22:34:32 +08:00
Wang, Fei
3d05c8998f [CELEBORN-1895] Bump log4j2 version to 2.24.3
### What changes were proposed in this pull request?

Bump log4j2 version to 2.24.3
https://github.com/apache/logging-log4j2/releases/tag/rel%2F2.24.3

### Why are the changes needed?
Bump to latest log4j2 bug fix release.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA.

Closes #3134 from turboFei/log4j2.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-03-10 11:30:52 +08:00
Cheng Pan
e85207e2c7 [CELEBORN-1413][FOLLOWUP] Rename celeborn-client-spark-3-4 back to celeborn-client-spark-3
### What changes were proposed in this pull request?

This PR partially reverts the change of https://github.com/apache/celeborn/pull/2813, namely, it restores the module name `celeborn-client-spark-3`.

### Why are the changes needed?

The renaming is not necessary and might cause confusion; for example, I wrongly interpreted `spark-3-4` as Spark 3.4. It also increases the backport effort for branch-0.5.

### Does this PR introduce _any_ user-facing change?

No, it's dev only. Before and after this change, end users always use the shaded client:

```
celeborn-client-spark-2-shaded_2.11-0.6.0-SNAPSHOT.jar
celeborn-client-spark-3-shaded_2.12-0.6.0-SNAPSHOT.jar
celeborn-client-spark-4-shaded_2.13-0.6.0-SNAPSHOT.jar
```

### How was this patch tested?

Pass GA.

Closes #3133 from pan3793/CELEBORN-1413-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-03-04 22:25:10 +08:00
SteNicholas
15e34eca6e [CELEBORN-1890] Bump Spark from 3.5.4 to 3.5.5
### What changes were proposed in this pull request?

Bump Spark from 3.5.4 to 3.5.5.

### Why are the changes needed?

Spark 3.5.5 has been released: [Spark 3.5.5 released](https://spark.apache.org/news/spark-3-5-5-released.html). The `spark-3.5` profile can therefore bump Spark from 3.5.4 to 3.5.5.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3129 from SteNicholas/CELEBORN-1890.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-03-04 14:15:04 +08:00
SteNicholas
d90cf0d427 [CELEBORN-1884] Bump rocksdbjni version from 9.5.2 to 9.10.0
### What changes were proposed in this pull request?

Bump rocksdbjni version from 9.5.2 to 9.10.0.

### Why are the changes needed?

There are some bug fixes and performance improvements. The full release notes:

- https://github.com/facebook/rocksdb/releases/tag/v9.6.1
- https://github.com/facebook/rocksdb/releases/tag/v9.7.3
- https://github.com/facebook/rocksdb/releases/tag/v9.7.4
- https://github.com/facebook/rocksdb/releases/tag/v9.8.4
- https://github.com/facebook/rocksdb/releases/tag/v9.9.3
- https://github.com/facebook/rocksdb/releases/tag/v9.10.0

Backport:

- https://github.com/apache/spark/pull/48155
- https://github.com/apache/spark/pull/49538
- https://github.com/apache/spark/pull/50076

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3121 from SteNicholas/CELEBORN-1884.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-28 11:29:42 +08:00
Nicholas Jiang
79b49805e8 [CELEBORN-1877] Bump zstd-jni version from 1.5.2-1 to 1.5.7-1
### What changes were proposed in this pull request?

Bump zstd-jni version from 1.5.2-1 to 1.5.7-1.

### Why are the changes needed?

Bump zstd-jni to the latest version.

Backport https://github.com/apache/spark/pull/50057.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #3114 from SteNicholas/CELEBORN-1877.

Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-25 11:08:44 +08:00
Nicholas Jiang
5b507aed72 [CELEBORN-1872] Bump Flink from 1.19.1, 1.20.0 to 1.19.2, 1.20.1
### What changes were proposed in this pull request?

Bump Flink from 1.19.1, 1.20.0 to 1.19.2, 1.20.1.

### Why are the changes needed?

Flink 1.19.2 and 1.20.1 have already been released.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3107 from SteNicholas/CELEBORN-1872.

Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
2025-02-19 10:49:44 +08:00
Nicholas Jiang
2dd26936e8 [CELEBORN-1864] Bump Netty version from 4.1.115.Final to 4.1.118.Final
### What changes were proposed in this pull request?

Bump Netty version from 4.1.115.Final to 4.1.118.Final.

### Why are the changes needed?

Netty 4.1.118.Final has been released; the current Netty version is 4.1.115.Final. The changes between 4.1.115.Final and 4.1.118.Final are as follows:

- 4.1.116.Final: https://netty.io/news/2024/12/17/4-1-116-Final.html
- 4.1.117.Final: https://netty.io/news/2025/01/14/4-1-117-Final.html
- 4.1.118.Final: https://netty.io/news/2025/02/10/4-1-118-Final.html
   - **SslHandler doesn't correctly validate packets which can lead to native crash when using native SSLEngine.**
   - **Denial of Service attack on windows app using Netty, again.**

Backport:

- https://github.com/apache/spark/pull/49756
- https://github.com/apache/spark/pull/49923

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3098 from SteNicholas/CELEBORN-1864.

Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-15 11:46:28 +08:00
madlnu
b5c00ea645 [CELEBORN-1862] Bump Ratis version from 3.1.2 to 3.1.3
### What changes were proposed in this pull request?
Upgrading ratis version to 3.1.3

### Why are the changes needed?
To fix CVE-2024-7254 and sonatype-2020-0026, which come from the transitive dependency ratis-thirdparty-misc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Locally and CI tests

Closes #3095 from Madhukar525722/main.

Authored-by: madlnu <madlnu@visa.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-12 17:46:58 +08:00
SteNicholas
30e46eee28 [CELEBORN-1842] Bump ap-loader version from 3.0-8 to 3.0-9
### What changes were proposed in this pull request?

Bump ap-loader version from 3.0-8 to 3.0-9.

### Why are the changes needed?

ap-loader has released v3.0-9, so the version used for `JVMProfiler` should be bumped from 3.0-8.

Backport:

1. https://github.com/apache/spark/pull/46402
2. https://github.com/apache/spark/pull/49440

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3072 from SteNicholas/CELEBORN-1842.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-01-21 12:22:00 +08:00
SteNicholas
19fecadcd7 [CELEBORN-1413][FOLLOWUP] Bump zstd-jni version to 1.5.6-5 for 4.0.0-preview2
### What changes were proposed in this pull request?

Bump `zstd-jni` version to 1.5.6-5 for 4.0.0-preview2.

### Why are the changes needed?

Spark 4.0.0-preview2 uses `zstd-jni` version 1.5.6-5; see [<version>1.5.6-5</version>](https://github.com/apache/spark/blob/v4.0.0-preview2/pom.xml#L838C18-L838C25).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3054 from SteNicholas/CELEBORN-1413.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-01-07 17:37:22 +08:00
mingji
5d2831bbad [CELEBORN-1816] Bump scala-maven-plugin to avoid compilation loop
### What changes were proposed in this pull request?
Update zinc to fix an issue that may cause the build to keep recompiling the project in a loop.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and manual tests on Mac, and Ubuntu nodes.

Closes #3045 from FMX/b1816.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-03 09:27:47 +08:00
codenohup
a57238024e [CELEBORN-1801] Remove out-of-dated flink 1.14 and 1.15
### What changes were proposed in this pull request?
Remove outdated Flink 1.14 and 1.15.

For more information, please see the discussion thread: https://lists.apache.org/thread/njho00zmkjx5qspcrbrkogy8s4zzmwv9

### Why are the changes needed?
Reduce maintenance burden.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Changes can be covered by existing tests.

Closes #3029 from codenohup/remove-flink14and15.

Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-12-30 15:33:44 +08:00
hongguangwei
d0d8edfe22 [CELEBORN-1737] Support build tez client package
### What changes were proposed in this pull request?
Add Tez packaging script.

### Why are the changes needed?
To support building the Tez client.

### Does this PR introduce _any_ user-facing change?
Yes, it enables Celeborn with Tez support.

### How was this patch tested?
Cluster test.

Closes #3028 from GH-Gloway/1737.

Lead-authored-by: hongguangwei <hongguangwei@bytedance.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-30 11:01:19 +08:00
SteNicholas
eb59c17638 [CELEBORN-1806] Bump Spark from 3.5.3 to 3.5.4
### What changes were proposed in this pull request?

Bump Spark from 3.5.3 to 3.5.4.

### Why are the changes needed?

Spark 3.5.4 has been released: [Spark 3.5.4 released](https://spark.apache.org/news/spark-3-5-4-released.html). The `spark-3.5` profile can therefore bump Spark from 3.5.3 to 3.5.4.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3034 from SteNicholas/CELEBORN-1806.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-12-27 16:29:35 +08:00
mingji
fde6365f68 [CELEBORN-1413] Support Spark 4.0
### What changes were proposed in this pull request?
To support Spark 4.0.0 preview.

### Why are the changes needed?
1. Changed Scala to 2.13.
2. Introduce columnar shuffle module for spark 4.0.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Cluster test.

Closes #2813 from FMX/b1413.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-24 18:12:27 +08:00
Fei Wang
6b884dee66 [CELEBORN-1777] Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS
### What changes were proposed in this pull request?
As title, add `--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED` into default java options.

### Why are the changes needed?

This is necessary when running on JDK 17 with HDFS storage and Kerberos enabled; see details in https://github.com/apache/spark/pull/34615

The exception stack looks like:
```
Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm
	at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:306)
	at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
....
Caused by: java.lang.IllegalAccessException: class org.apache.hadoop.security.authentication.util.KerberosUtil cannot access class sun.security.krb5.Config (in module java.security.jgss) because module java.security.jgss does not export sun.security.krb5 to unnamed module 3a0baae5
	at java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392)
	at java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674)
	at java.base/java.lang.reflect.Method.invoke(Method.java:560)
	at org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:85)
	at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
	... 9 more
```
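
Whether the option has taken effect can be checked from inside a JVM. The Java sketch below is an illustration under assumptions: `Krb5OpenCheck` is a hypothetical helper that is not part of Celeborn, it needs Java 9 or later for the `java.lang.Module` API, and it assumes the class is loaded from the classpath so that its module is the unnamed module.

```java
import java.util.Optional;

// Hypothetical helper; not part of Celeborn. Requires Java 9+.
final class Krb5OpenCheck {
  // Reports whether sun.security.krb5 is open to this class's module.
  static String describe() {
    Optional<Module> jgss = ModuleLayer.boot().findModule("java.security.jgss");
    if (!jgss.isPresent()) {
      return "java.security.jgss is not in the boot layer";
    }
    // For a class loaded from the classpath this is the unnamed module.
    Module caller = Krb5OpenCheck.class.getModule();
    boolean open = jgss.get().isOpen("sun.security.krb5", caller);
    return open
        ? "sun.security.krb5 is open for reflective access"
        : "sun.security.krb5 is NOT open; reflective access will fail";
  }

  public static void main(String[] args) {
    System.out.println(describe());
  }
}
```

Running it with and without `--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED` on the command line should show the difference the new default option makes.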
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
GA.

Closes #2999 from turboFei/jdk_opt_krb5.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-19 14:54:21 +08:00
SteNicholas
f3dac7e879 [CELEBORN-1712] Bump Netty version from 4.1.109.Final to 4.1.115.Final
### What changes were proposed in this pull request?

Bump Netty version from 4.1.109.Final to 4.1.115.Final.

### Why are the changes needed?

Netty 4.1.115.Final has been released; the current Netty version is 4.1.109.Final. The changes between 4.1.110.Final and 4.1.115.Final are as follows:

- [4.1.110.Final](https://netty.io/news/2024/05/22/4-1-110-Final.html)
- [4.1.111.Final](https://netty.io/news/2024/06/11/4-1-111-Final.html)
- [4.1.112.Final](https://netty.io/news/2024/07/19/4-1-112-Final.html)
- [4.1.113.Final](https://netty.io/news/2024/09/04/4-1-113-Final.html)
- [4.1.114.Final](https://netty.io/news/2024/10/01/4-1-114-Final.html)
- [4.1.115.Final](https://netty.io/news/2024/11/12/4-1-115-Final.html)

Bump https://github.com/apache/spark/pull/46945.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2903 from SteNicholas/CELEBORN-1712.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-17 17:29:07 +08:00
hongguangwei
ca8831e55f [CELEBORN-1736] Add tez integration tests
### What changes were proposed in this pull request?
Add tez integration tests

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2991 from GH-Gloway/1736.

Authored-by: hongguangwei <hongguangwei@bytedance.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-13 14:06:08 +08:00
mingji
34d70ca7a4 [CELEBORN-1530][FOLLOWUP] Exclude web modules by default
### What changes were proposed in this pull request?
Exclude web modules by default.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #2961 from FMX/b1530-1.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-28 16:28:08 +08:00
mingji
3590fa778e [CELEBORN-1545] Add Tez plugin skeleton and dag app master
### What changes were proposed in this pull request?
1. Add directories for Apache Tez framework
2. Add a CelebornDagAppMaster with LifecycleManager

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2939 from GH-Gloway/b1545-1.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-22 18:38:25 +08:00
zhaohehuhu
a2d3972318 [CELEBORN-1530] support MPU for S3
### What changes were proposed in this pull request?

As the title says.

### Why are the changes needed?
AWS S3 doesn't support append, so Celeborn had to copy the historical data from S3 to the worker and write it to S3 again, which heavily amplifies writes. This PR implements a better solution via multipart upload (MPU) to avoid the copy-and-write.
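
The write amplification described above can be made concrete with a rough back-of-the-envelope sketch. It assumes a hypothetical model in which a partition file receives a series of flushes; the function names and flush sizes are illustrative only and are not taken from Celeborn's code.

```python
def bytes_written_copy_and_rewrite(flush_sizes):
    """Total bytes sent to S3 when every flush re-uploads the whole object.

    Hypothetical model: S3 has no append, so each flush writes the
    previously stored data plus the new chunk as a fresh object.
    """
    total = 0
    stored = 0
    for size in flush_sizes:
        stored += size   # the object now holds the old data plus the new chunk
        total += stored  # the whole object is uploaded again
    return total


def bytes_written_multipart(flush_sizes):
    """Total bytes sent to S3 when each flush is uploaded once as its own part."""
    return sum(flush_sizes)


flushes = [64] * 10  # ten flushes of 64 MiB each (illustrative numbers)
print(bytes_written_copy_and_rewrite(flushes))  # 3520
print(bytes_written_multipart(flushes))         # 640
```

Under this simplified model the copy-and-rewrite path grows quadratically with the number of flushes, while the multipart path stays linear.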

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

![WechatIMG257](https://github.com/user-attachments/assets/968d9162-e690-4767-8bed-e490e3055753)

I conducted an experiment with a 1GB input dataset to compare the performance of Celeborn using only S3 storage versus using SSD storage. The results showed that Celeborn with SSD storage was approximately three times faster than with only S3 storage.

<img width="1728" alt="Screenshot 2024-11-16 at 13 02 10" src="https://github.com/user-attachments/assets/8f879c47-c01a-4004-9eae-1c266c1f3ef2">

The screenshot above is from my second test, with 5000 mappers and reducers.

Closes #2830 from zhaohehuhu/dev-1021.

Lead-authored-by: zhaohehuhu <luoyedeyi@163.com>
Co-authored-by: He Zhao <luoyedeyi459@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-22 15:03:53 +08:00
SteNicholas
7d1da5e915 [CELEBORN-1702] Bump Ratis version from 3.1.1 to 3.1.2
### What changes were proposed in this pull request?

Bump Ratis version from 3.1.1 to 3.1.2 including:

- Fix NPE in `RaftServerImpl.getLogInfo`: https://github.com/apache/ratis/pull/1171

### Why are the changes needed?

Bump Ratis version from 3.1.1 to 3.1.2. Ratis has released v3.1.2; see the [3.1.2 release notes](https://ratis.apache.org/post/3.1.2.html). Version 3.1.2 is a minor release with multiple improvements and bugfixes, including [[RATIS-2179] Fix NPE in `RaftServerImpl.getLogInfo`](https://issues.apache.org/jira/browse/RATIS-2179). See the [changes between the 3.1.1 and 3.1.2 releases](https://github.com/apache/ratis/compare/ratis-3.1.1...ratis-3.1.2).

Version 3.1.2 fixes the following `NullPointerException` seen in the CI log:

```
[info] Test org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterRatisServerSuiteJ.testIsLeader started
24/10/24 08:16:30,295 ERROR [pool-1-thread-1] HARaftServer: Failed to retrieve RaftPeerRole. Setting cached role to UNRECOGNIZED and resetting leader info.
java.io.IOException: java.lang.NullPointerException
    at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
    at org.apache.ratis.server.impl.RaftServerImpl.waitForReply(RaftServerImpl.java:1148)
    at org.apache.ratis.server.impl.RaftServerProxy.getGroupInfo(RaftServerProxy.java:607)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.getGroupInfo(HARaftServer.java:599)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.updateServerRole(HARaftServer.java:514)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.isLeader(HARaftServer.java:489)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterRatisServerSuiteJ.testIsLeader(MasterRatisServerSuiteJ.java:47)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runners.Suite.runChild(Suite.java:128)
    at org.junit.runners.Suite.runChild(Suite.java:27)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
    at com.novocode.junit.JUnitTask.execute(JUnitTask.java:64)
    at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
    at org.apache.ratis.server.impl.RaftServerImpl.getLogInfo(RaftServerImpl.java:665)
    at org.apache.ratis.server.impl.RaftServerImpl.getGroupInfo(RaftServerImpl.java:658)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$getGroupInfoAsync$23(RaftServerProxy.java:613)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
    at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2897 from SteNicholas/CELEBORN-1702.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 17:15:20 +08:00
Wang, Fei
330b2a094e [CELEBORN-1708] Bump protobuf version from 3.21.7 to 3.25.5
### What changes were proposed in this pull request?

Bump protobuf from 3.21.7 to 3.25.5.

### Why are the changes needed?

To fix CVE: https://github.com/advisories/GHSA-735f-pc8j-v9w8

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

GA.

Closes #2898 from turboFei/bump_protobuf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 17:02:23 +08:00
Wang, Fei
09ffee0365 [CELEBORN-1709] Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826
### What changes were proposed in this pull request?

Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826.

### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-g8m5-722r-8whq

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

Closes #2899 from turboFei/bump_jetty.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 16:58:44 +08:00
Wang, Fei
6d2b9f6d92 [CELEBORN-1710] Bump commons-io version from 2.13.0 to 2.17.0
### What changes were proposed in this pull request?
Bump commons-io from 2.13.0 to 2.17.0.

### Why are the changes needed?

To fix CVE: https://github.com/advisories/GHSA-78wr-2p64-hpwj

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA.

Closes #2900 from turboFei/bump_commons_io.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 16:57:29 +08:00
Weijie Guo
c12e8881ab [CELEBORN-1490][CIP-6] Add Flink Hybrid Shuffle IT test cases
### What changes were proposed in this pull request?
1. Add Flink Hybrid Shuffle IT test cases
2. Fix bug in open stream.

### Why are the changes needed?

Test coverage for celeborn + hybrid shuffle

### Does this PR introduce _any_ user-facing change?
No

Closes #2859 from reswqa/10-itcase-10month.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-01 17:27:24 +08:00
SteNicholas
165e914b9b [CELEBORN-1672] Bump Spark from 3.4.3 to 3.4.4
### What changes were proposed in this pull request?

Bump Spark from 3.4.3 to 3.4.4.

### Why are the changes needed?

Spark 3.4.4 has been released: [Spark 3.4.4 released](https://spark.apache.org/news/spark-3-4-4-released.html). The spark-3.4 profile can therefore bump Spark from 3.4.3 to 3.4.4.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2851 from SteNicholas/CELEBORN-1672.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-01 11:05:00 +08:00
avishnus
59029a0967 [CELEBORN-1649] Bumping up maven to 3.9.9
### What changes were proposed in this pull request?
Bump the Maven version to 3.9.9.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2834 from avishnus/maven.

Authored-by: avishnus <avishnus@visa.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-25 16:20:32 +08:00
SteNicholas
651cbebc1a [CELEBORN-1525] Bump Ratis version from 3.1.0 to 3.1.1
### What changes were proposed in this pull request?

Bump Ratis version from 3.1.0 to 3.1.1 including:

- Remove `address2String` and use `setAddress(ratisAddr)` with the release of https://github.com/apache/ratis/pull/1125.
- Support the requirement that `raft.grpc.message.size.max` must be 1m larger than `raft.server.log.appender.buffer.byte-limit`, introduced by https://github.com/apache/ratis/pull/1132.
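
The size constraint in the second item can be sketched as a simple check. This is a minimal illustration, not the actual validation code in Ratis or Celeborn; the function name is hypothetical, and reading "1m larger" as "at least 1 MiB of headroom" is an assumption.

```python
ONE_MIB = 1024 * 1024


def grpc_message_size_ok(message_size_max, appender_buffer_byte_limit):
    # Assumed reading of the rule: raft.grpc.message.size.max must exceed
    # raft.server.log.appender.buffer.byte-limit by at least 1 MiB.
    return message_size_max - appender_buffer_byte_limit >= ONE_MIB


print(grpc_message_size_ok(64 * ONE_MIB, 4 * ONE_MIB))  # True
print(grpc_message_size_ok(4 * ONE_MIB, 4 * ONE_MIB))   # False
```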

### Why are the changes needed?

Bump Ratis version from 3.1.0 to 3.1.1. Ratis has released v3.1.1; see the [3.1.1 release notes](https://ratis.apache.org/post/3.1.1.html). Version 3.1.1 is a minor release with multiple improvements and bugfixes, including [[RATIS-2116] Fix the issue where RaftServerImpl.appendEntries may be blocked indefinitely](https://issues.apache.org/jira/browse/RATIS-2116) and [[RATIS-2131] Configuring Ratis fails when hostname is used, and is an IPv6 host](https://issues.apache.org/jira/browse/RATIS-2131). See the [changes between the 3.1.0 and 3.1.1 releases](https://github.com/apache/ratis/compare/ratis-3.1.0...ratis-3.1.1).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2759 from SteNicholas/CELEBORN-1525.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2024-09-26 10:45:38 -05:00
SteNicholas
416c84acce [CELEBORN-1613] Bump Spark from 3.5.2 to 3.5.3
### What changes were proposed in this pull request?

Bump Spark from 3.5.2 to 3.5.3.

### Why are the changes needed?

Spark 3.5.3 has been released: [Spark 3.5.3 released](https://spark.apache.org/news/spark-3-5-3-released.html). The spark-3.5 profile can therefore bump Spark from 3.5.2 to 3.5.3.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2760 from SteNicholas/CELEBORN-1613.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-25 15:32:04 +08:00
Wang, Fei
909d6c3b9c [CELEBORN-1477][FOLLOWUP] Upgrade openapi-generator to 7.8.0
### What changes were proposed in this pull request?
This PR is a follow-up to https://github.com/apache/celeborn/pull/2641.

In that PR, I upgraded the version to 7.7.0, but two of the generated Java files lacked Apache license headers.

I then raised a PR at https://github.com/OpenAPITools/openapi-generator/pull/19273 to fix this, and the fix was released in https://github.com/OpenAPITools/openapi-generator/releases/tag/v7.8.0.

### Why are the changes needed?

Upgrade to the latest openapi-generator version to resolve the unlicensed java files.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing GA.

Closes #2695 from turboFei/openapi_upgrade.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-24 16:02:09 +08:00
sychen
8734d16638 [CELEBORN-1605] Bump commons-lang3 version from 3.13.0 to 3.17.0
### What changes were proposed in this pull request?

### Why are the changes needed?
https://commons.apache.org/proper/commons-lang/changes-report.html

https://github.com/apache/celeborn/pull/2544#issuecomment-2349065779

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2750 from cxzl25/CELEBORN-1605.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-20 17:37:31 +08:00
sychen
40f8eccecd [CELEBORN-1604] Bump rocksdbjni version from 8.11.3 to 9.5.2
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/facebook/rocksdb/compare/v8.11.3...v9.5.2

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2749 from cxzl25/CELEBORN-1604.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-20 17:35:42 +08:00
sychen
589100ea91 [CELEBORN-1600] Enable check server dependencies
### What changes were proposed in this pull request?

### Why are the changes needed?
The server module is missing dependency checks.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2742 from cxzl25/check_server_deps.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-20 15:14:56 +08:00
Aravind Patnam
cc26131f88 [CELEBORN-1572] Celeborn CLI initial REST API support
### What changes were proposed in this pull request?
Introduce the Celeborn CLI (based on this [CIP](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-7+Celeborn+CLI)). The first iteration adds support for querying the existing REST API endpoints.
A layer for external cluster manager support will be added afterwards. Further improvements, such as pretty-printing, can be added in subsequent PRs.

### Why are the changes needed?
See the [CIP](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-7+Celeborn+CLI).

### Does this PR introduce _any_ user-facing change?
Yes, a new CLI tool.

### How was this patch tested?
Added UTs and also tested internally.

Closes #2699 from akpatnam25/cli-CELEBORN-1572.

Lead-authored-by: Aravind Patnam <apatnam@linkedin.com>
Co-authored-by: Aravind Patnam <akpatnam25@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2024-09-05 11:15:16 -05:00
SteNicholas
cd916040da [CELEBORN-912][FOLLOWUP] Support columnar shuffle for Spark 3.5
### What changes were proposed in this pull request?

Introduce `spark-3.5-columnar-shuffle` module to support columnar shuffle for Spark 3.5.

### Why are the changes needed?

#1850 does not support columnar shuffle for Spark 3.5: building the `spark-3-columnar-shuffle` module against that version fails to compile. The compilation error is caused by https://github.com/apache/spark/pull/40784, an incompatible change that moves `InternalType` from `AtomicType` to `PhysicalDataType`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2710 from SteNicholas/CELEBORN-912.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
2024-09-05 14:26:54 +08:00
sychen
3ee672e15d [CELEBORN-1565] Introduce warn-unused-import in Scala
### What changes were proposed in this pull request?
This PR aims to introduce `warn-unused-import` in Scala.

### Why are the changes needed?
There are currently many unused imports, which can be detected with `-Ywarn-unused-import`.
With the `silencer` plugin we can suppress the warning for some imports that are still required in Scala 2.11, for example:

```scala
import org.apache.celeborn.common.util.FunctionConverter._
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2689 from cxzl25/CELEBORN-1565.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shaoyun Chen <csy@apache.org>
2024-08-29 13:43:17 +08:00
SteNicholas
c0dda4a15a [CELEBORN-1240][FOLLOWUP] Introduce web profile for web module
### What changes were proposed in this pull request?

Introduce web profile for web module.

### Why are the changes needed?

Compiling the web module is sometimes very slow because of network conditions, which hinders development.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2679 from SteNicholas/CELEBORN-1240.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-08-27 15:00:52 +08:00
SteNicholas
15a29d5e9d [CELEBORN-1562] Bump Spark from 3.5.1 to 3.5.2
### What changes were proposed in this pull request?

Bump Spark from 3.5.1 to 3.5.2. Meanwhile, bump the default `spark.version` from 3.5.1 to 3.5.2.

### Why are the changes needed?

Spark 3.5.2 has been released: [Spark 3.5.2 released](https://spark.apache.org/news/spark-3-5-2-released.html). The spark-3.5 profile can therefore bump Spark from 3.5.1 to 3.5.2.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2684 from SteNicholas/CELEBORN-1562.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-08-15 19:44:16 +08:00