### What changes were proposed in this pull request?
Support dependencies of `spark-4.0` profile.
Follow up #3282.
### Why are the changes needed?
#3282 lacks dependency support for the `spark-4.0` profile.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Dependencies check: maven-jdk17 (spark-4.0).
Closes #3298 from SteNicholas/CELEBORN-1413.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Bump spark 4.0 version to 4.0.0.
### Why are the changes needed?
Spark 4.0.0 is ready.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #3282 from turboFei/spark_4.0.
Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Introduce disruptor dependency to support asynchronous logging of log4j2.
### Why are the changes needed?
We add `-Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector` to `CELEBORN_MASTER_JAVA_OPTS` and `CELEBORN_WORKER_JAVA_OPTS` in production environments. `AsyncLoggerContextSelector` depends on the disruptor library. Therefore, it is recommended to introduce the disruptor dependency to support log4j2 asynchronous loggers.
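As a rough illustration of the dependency described above, the sketch below checks whether a disruptor class can be loaded before the async context selector option is added. `AsyncLoggingCheck` and its methods are hypothetical helpers written for this description, not Celeborn code, and `com.lmax.disruptor.RingBuffer` is assumed to be a representative class from the disruptor jar.

```java
/** Hypothetical helper, not part of Celeborn: checks the classpath before enabling async loggers. */
final class AsyncLoggingCheck {
  // System property and selector class quoted in the description above.
  static final String SELECTOR_PROPERTY = "log4j2.contextSelector";
  static final String ASYNC_SELECTOR =
      "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector";
  // Assumption: RingBuffer is a representative class shipped in the disruptor jar.
  static final String DISRUPTOR_CLASS = "com.lmax.disruptor.RingBuffer";

  /** Returns true if the named class can be located on the current classpath. */
  static boolean isOnClasspath(String className) {
    try {
      // initialize=false: only locate the class, do not run its static initializers.
      Class.forName(className, false, AsyncLoggingCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  /** Builds the JVM option that selects asynchronous loggers. */
  static String asyncLoggerJvmOption() {
    return "-D" + SELECTOR_PROPERTY + "=" + ASYNC_SELECTOR;
  }

  public static void main(String[] args) {
    if (isOnClasspath(DISRUPTOR_CLASS)) {
      System.out.println("disruptor found, option can be added: " + asyncLoggerJvmOption());
    } else {
      System.out.println("disruptor missing, async loggers are not expected to start");
    }
  }
}
```

A check like this could run in a startup script or a smoke test; it only probes the classpath and does not configure log4j2 itself.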
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.
Closes #3246 from SteNicholas/CELEBORN-1994.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
- close [CELEBORN-1916](https://issues.apache.org/jira/browse/CELEBORN-1916)
- This PR extends the Multipart Uploader (MPU) interface to support Aliyun OSS.
### Why are the changes needed?
- Implemented multipart-uploader-oss module based on the existing MPU extension interface.
- Added necessary configurations and dependencies for Aliyun OSS integration.
- Ensured compatibility with the existing multipart-uploader framework.
- This enhancement allows seamless multipart upload functionality for Aliyun OSS, similar to the existing AWS S3 support.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Deployment integration testing has been completed in the local environment.
Closes #3157 from shouwangyw/optimize/mpu-oss.
Lead-authored-by: veli.yang <897900564@qq.com>
Co-authored-by: yangwei <897900564@qq.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Support Flink 2.0. The major changes of Flink 2.0 include:
- https://github.com/apache/flink/pull/25406: Bump target Java version to 11 and drop support for Java 8.
- https://github.com/apache/flink/pull/25551: Replace `InputGateDeploymentDescriptor#getConsumedSubpartitionIndexRange` with `InputGateDeploymentDescriptor#getConsumedSubpartitionRange(index)`.
- https://github.com/apache/flink/pull/25314: Replace `NettyShuffleEnvironmentOptions#NETWORK_EXCLUSIVE_BUFFERS_REQUEST_TIMEOUT_MILLISECONDS` with `NettyShuffleEnvironmentOptions#NETWORK_BUFFERS_REQUEST_TIMEOUT`.
- https://github.com/apache/flink/pull/25731: Introduce `InputGate#resumeGateConsumption`.
### Why are the changes needed?
Flink 2.0 has been released; see [Release notes - Flink 2.0](https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-2.0).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3179 from SteNicholas/CELEBORN-1925.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
### What changes were proposed in this pull request?
Bump log4j2 version to 2.24.3
https://github.com/apache/logging-log4j2/releases/tag/rel%2F2.24.3
### Why are the changes needed?
Bump to latest log4j2 bug fix release.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #3134 from turboFei/log4j2.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
This PR partially reverts the change of https://github.com/apache/celeborn/pull/2813, namely, it reverts the renaming of `celeborn-client-spark-3`.
### Why are the changes needed?
The renaming is not necessary and might cause confusion. For example, I wrongly interpreted `spark-3-4` as Spark 3.4. It also increases the backport effort for branch-0.5.
### Does this PR introduce _any_ user-facing change?
No, it's dev only. Before and after this change, end users always use the shaded client:
```
celeborn-client-spark-2-shaded_2.11-0.6.0-SNAPSHOT.jar
celeborn-client-spark-3-shaded_2.12-0.6.0-SNAPSHOT.jar
celeborn-client-spark-4-shaded_2.13-0.6.0-SNAPSHOT.jar
```
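To make the naming above concrete, here is a minimal sketch that splits a shaded client jar name into its Spark major version, Scala binary version, and Celeborn version. `ShadedClientName` and the regular expression are hypothetical and inferred only from the three file names listed above; they are not part of the build.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a shaded client jar name into its version components. */
final class ShadedClientName {
  // Assumed layout: celeborn-client-spark-<sparkMajor>-shaded_<scalaBinary>-<celebornVersion>.jar
  private static final Pattern NAME = Pattern.compile(
      "celeborn-client-spark-(\\d+)-shaded_(\\d+\\.\\d+)-(.+)\\.jar");

  final int sparkMajor;
  final String scalaBinary;
  final String celebornVersion;

  private ShadedClientName(int sparkMajor, String scalaBinary, String celebornVersion) {
    this.sparkMajor = sparkMajor;
    this.scalaBinary = scalaBinary;
    this.celebornVersion = celebornVersion;
  }

  /** Parses a jar file name, rejecting names that do not follow the assumed layout. */
  static ShadedClientName parse(String jarName) {
    Matcher m = NAME.matcher(jarName);
    if (!m.matches()) {
      throw new IllegalArgumentException("Unexpected jar name: " + jarName);
    }
    return new ShadedClientName(Integer.parseInt(m.group(1)), m.group(2), m.group(3));
  }

  public static void main(String[] args) {
    String[] jars = {
      "celeborn-client-spark-2-shaded_2.11-0.6.0-SNAPSHOT.jar",
      "celeborn-client-spark-3-shaded_2.12-0.6.0-SNAPSHOT.jar",
      "celeborn-client-spark-4-shaded_2.13-0.6.0-SNAPSHOT.jar"
    };
    for (String jar : jars) {
      ShadedClientName n = parse(jar);
      System.out.println("Spark " + n.sparkMajor + ".x, Scala " + n.scalaBinary
          + ", Celeborn " + n.celebornVersion);
    }
  }
}
```

The single digit after `spark-` is the Spark major version, which is why a name like `spark-3-4` is easy to misread as Spark 3.4.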
### How was this patch tested?
Pass GA.
Closes #3133 from pan3793/CELEBORN-1413-followup.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump zstd-jni version from 1.5.2-1 to 1.5.7-1.
### Why are the changes needed?
Bump zstd-jni to the latest version.
Backport https://github.com/apache/spark/pull/50057.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #3114 from SteNicholas/CELEBORN-1877.
Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Netty version from 4.1.115.Final to 4.1.118.Final.
### Why are the changes needed?
Netty 4.1.118.Final has been released, while Celeborn currently uses 4.1.115.Final. The changes between 4.1.115.Final and 4.1.118.Final are as follows:
- 4.1.116.Final: https://netty.io/news/2024/12/17/4-1-116-Final.html
- 4.1.117.Final: https://netty.io/news/2025/01/14/4-1-117-Final.html
- 4.1.118.Final: https://netty.io/news/2025/02/10/4-1-118-Final.html
- **SslHandler doesn't correctly validate packets which can lead to native crash when using native SSLEngine.**
- **Denial of Service attack on windows app using Netty, again.**
Backport:
- https://github.com/apache/spark/pull/49756
- https://github.com/apache/spark/pull/49923
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3098 from SteNicholas/CELEBORN-1864.
Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Upgrading ratis version to 3.1.3
### Why are the changes needed?
To fix CVE-2024-7254 and sonatype-2020-0026, which come from its transitive dependency ratis-thirdparty-misc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Locally and CI tests
Closes #3095 from Madhukar525722/main.
Authored-by: madlnu <madlnu@visa.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump ap-loader version from 3.0-8 to 3.0-9.
### Why are the changes needed?
ap-loader has released v3.0-9, so the version used by `JVMProfiler` should be bumped from 3.0-8.
Backport:
1. https://github.com/apache/spark/pull/46402
2. https://github.com/apache/spark/pull/49440
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3072 from SteNicholas/CELEBORN-1842.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Remove outdated Flink 1.14 and 1.15 support.
For more information, please see the discussion thread: https://lists.apache.org/thread/njho00zmkjx5qspcrbrkogy8s4zzmwv9
### Why are the changes needed?
Reduce maintenance burden.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Changes can be covered by existing tests.
Closes #3029 from codenohup/remove-flink14and15.
Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Add Tez packaging script.
### Why are the changes needed?
To support building the Tez client.
### Does this PR introduce _any_ user-facing change?
Yes, enable Celeborn with tez support.
### How was this patch tested?
Cluster test.
Closes #3028 from GH-Gloway/1737.
Lead-authored-by: hongguangwei <hongguangwei@bytedance.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
To support Spark 4.0.0 preview.
### Why are the changes needed?
1. Changed Scala to 2.13.
2. Introduce columnar shuffle module for spark 4.0.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.
Closes #2813 from FMX/b1413.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Because the AWS cloud vendor's client JARs are large, this PR keeps only the AWS S3 module, reducing the AWS dependency size from over 296MB to around 2.3MB.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
<img width="2560" alt="Screenshot 2024-11-25 at 16 17 52" src="https://github.com/user-attachments/assets/efebbe7d-73cb-47fb-b7fa-9aae052f744b">
Tested in the lab, as shown in the screenshot above.
Closes #2944 from zhaohehuhu/dev-1125.
Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Add directories for Apache Tez framework
2. Add a CelebornDagAppMaster with LifecycleManager
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #2939 from GH-Gloway/b1545-1.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
as title
### Why are the changes needed?
AWS S3 doesn't support append, so Celeborn had to copy the historical data from S3 to the worker and write it to S3 again, which greatly amplifies the write volume. This PR implements a better solution via multipart upload (MPU) to avoid copy-and-write.
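The difference in write volume can be illustrated with a small model. This is a simplified sketch that makes no real S3 calls; `UploadCostModel` and its methods are hypothetical, and it assumes every flush adds one chunk of equal size. Real multipart uploads also impose part-size limits, which the model ignores.

```java
/** Simplified model (no real S3 calls) of bytes uploaded while a file grows flush by flush. */
final class UploadCostModel {
  /** Without append: every flush re-uploads all earlier data plus the new chunk. */
  static long copyAndWriteBytes(int flushes, long chunkBytes) {
    long total = 0;
    long objectSize = 0;
    for (int i = 0; i < flushes; i++) {
      objectSize += chunkBytes; // the object grows by one chunk
      total += objectSize;      // the whole object is written again
    }
    return total;
  }

  /** Multipart upload: each flush uploads only its own part; parts are combined on completion. */
  static long multipartBytes(int flushes, long chunkBytes) {
    return flushes * chunkBytes;
  }

  public static void main(String[] args) {
    int flushes = 100;
    long chunk = 8L * 1024 * 1024; // hypothetical 8 MiB flush size
    System.out.println("copy-and-write bytes: " + copyAndWriteBytes(flushes, chunk));
    System.out.println("multipart bytes:      " + multipartBytes(flushes, chunk));
  }
}
```

In this model copy-and-write grows quadratically with the number of flushes, while the multipart path stays linear.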
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

I conducted an experiment with a 1GB input dataset to compare the performance of Celeborn using only S3 storage versus using SSD storage. The results showed that Celeborn with SSD storage was approximately three times faster than with only S3 storage.
<img width="1728" alt="Screenshot 2024-11-16 at 13 02 10" src="https://github.com/user-attachments/assets/8f879c47-c01a-4004-9eae-1c266c1f3ef2">
The above screenshot is from the second test I ran, with 5000 mappers and reducers.
Closes #2830 from zhaohehuhu/dev-1021.
Lead-authored-by: zhaohehuhu <luoyedeyi@163.com>
Co-authored-by: He Zhao <luoyedeyi459@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Ratis version from 3.1.1 to 3.1.2 including:
- Fix NPE in `RaftServerImpl.getLogInfo`: https://github.com/apache/ratis/pull/1171
### Why are the changes needed?
Bump Ratis version from 3.1.1 to 3.1.2. Ratis has released v3.1.2; see the [3.1.2](https://ratis.apache.org/post/3.1.2.html) release notes. The 3.1.2 version is a minor release with multiple improvements and bugfixes including [[RATIS-2179] Fix NPE in `RaftServerImpl.getLogInfo`](https://issues.apache.org/jira/browse/RATIS-2179). See the [changes between 3.1.1 and 3.1.2](https://github.com/apache/ratis/compare/ratis-3.1.1...ratis-3.1.2) releases.
The 3.1.2 version fixes the following `NullPointerException` seen in the CI log:
```
[info] Test org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterRatisServerSuiteJ.testIsLeader started
24/10/24 08:16:30,295 ERROR [pool-1-thread-1] HARaftServer: Failed to retrieve RaftPeerRole. Setting cached role to UNRECOGNIZED and resetting leader info.
java.io.IOException: java.lang.NullPointerException
at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
at org.apache.ratis.server.impl.RaftServerImpl.waitForReply(RaftServerImpl.java:1148)
at org.apache.ratis.server.impl.RaftServerProxy.getGroupInfo(RaftServerProxy.java:607)
at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.getGroupInfo(HARaftServer.java:599)
at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.updateServerRole(HARaftServer.java:514)
at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.isLeader(HARaftServer.java:489)
at org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterRatisServerSuiteJ.testIsLeader(MasterRatisServerSuiteJ.java:47)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
at com.novocode.junit.JUnitTask.execute(JUnitTask.java:64)
at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
at org.apache.ratis.server.impl.RaftServerImpl.getLogInfo(RaftServerImpl.java:665)
at org.apache.ratis.server.impl.RaftServerImpl.getGroupInfo(RaftServerImpl.java:658)
at org.apache.ratis.server.impl.RaftServerProxy.lambda$getGroupInfoAsync$23(RaftServerProxy.java:613)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #2897 from SteNicholas/CELEBORN-1702.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump protobuf from 3.21.7 to 3.25.5.
### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-735f-pc8j-v9w8
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #2898 from turboFei/bump_protobuf.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826
### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-g8m5-722r-8whq
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA
Closes #2899 from turboFei/bump_jetty.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump commons-io from 2.13.0 to 2.17.0
### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-78wr-2p64-hpwj
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #2900 from turboFei/bump_commons_io.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Ratis version from 3.1.0 to 3.1.1 including:
- Remove `address2String` and use `setAddress(ratisAddr)` with the release of https://github.com/apache/ratis/pull/1125.
- Support the requirement that `raft.grpc.message.size.max` must be 1m larger than `raft.server.log.appender.buffer.byte-limit`, per https://github.com/apache/ratis/pull/1132.
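The size constraint in the second item can be sketched as a small check. `RatisSizeCheck` is a hypothetical helper, not Celeborn or Ratis code; it assumes that "1m" means 1 MiB and that the gRPC limit must be at least that much larger than the appender buffer limit.

```java
/** Hypothetical helper illustrating the size constraint between the two Ratis settings. */
final class RatisSizeCheck {
  static final String GRPC_MESSAGE_SIZE_MAX = "raft.grpc.message.size.max";
  static final String APPENDER_BUFFER_BYTE_LIMIT = "raft.server.log.appender.buffer.byte-limit";
  // Assumption: "1m" is read as 1 MiB.
  static final long ONE_MB = 1024L * 1024L;

  /** Smallest gRPC message size that satisfies the constraint for a given buffer limit. */
  static long minGrpcMessageSize(long appenderBufferByteLimit) {
    return appenderBufferByteLimit + ONE_MB;
  }

  /** Assumption: the constraint is "at least 1 MiB larger", not "exactly 1 MiB larger". */
  static boolean isValid(long grpcMessageSizeMax, long appenderBufferByteLimit) {
    return grpcMessageSizeMax >= minGrpcMessageSize(appenderBufferByteLimit);
  }

  public static void main(String[] args) {
    long bufferLimit = 4 * ONE_MB; // hypothetical value for illustration
    System.out.println(APPENDER_BUFFER_BYTE_LIMIT + " = " + bufferLimit);
    System.out.println(GRPC_MESSAGE_SIZE_MAX + " must be >= " + minGrpcMessageSize(bufferLimit));
  }
}
```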
### Why are the changes needed?
Bump Ratis version from 3.1.0 to 3.1.1. Ratis has released v3.1.1; see the [3.1.1](https://ratis.apache.org/post/3.1.1.html) release notes. The 3.1.1 version is a minor release with multiple improvements and bugfixes including [[RATIS-2116] Fix the issue where RaftServerImpl.appendEntries may be blocked indefinitely](https://issues.apache.org/jira/browse/RATIS-2116) and [[RATIS-2131] Configuring Ratis fails when hostname is used, and is an IPv6 host](https://issues.apache.org/jira/browse/RATIS-2131). See the [changes between 3.1.0 and 3.1.1](https://github.com/apache/ratis/compare/ratis-3.1.0...ratis-3.1.1) releases.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #2759 from SteNicholas/CELEBORN-1525.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
### Why are the changes needed?
CELEBORN-1504 supports Flink 1.16, but `dependencies-client-flink-1.16` is not generated, and `dependencies.sh` passes its check even though the file does not exist.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes #2751 from cxzl25/CELEBORN-1606.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
### Why are the changes needed?
https://github.com/facebook/rocksdb/compare/v8.11.3...v9.5.2
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes #2749 from cxzl25/CELEBORN-1604.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
### Why are the changes needed?
The server module is missing dependency checks.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes #2742 from cxzl25/check_server_deps.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
1.20 was the last non-bug-fix release before Flink 2.0; you can find all major upgrade features in this [release note](https://nightlies.apache.org/flink/flink-docs-release-1.20/release-notes/flink-1.20/). I think the most important feature related to Celeborn is that we expose some interfaces to support Flink hybrid shuffle integration with Celeborn ([FLIP-459](https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn)). Supporting hybrid shuffle on the Celeborn side is also a follow-up to this PR.
Incompatible changes in 1.20:
- 1.20 uses the enum `CompressionCodec` instead of `String` to construct `BufferDecompressor` and `BufferCompressor`.
- 1.20 introduces a new method (`notifyPartitionRecoveryStarted`) in `JobShuffleContext` in a non-compatible way.
I've already done the adaptation in this PR.
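As an illustration of the first incompatibility, the sketch below converts a codec name held as a `String` into an enum constant. The nested `CompressionCodec` enum is a local stand-in declared only for this example, its constants are assumptions, and `CodecAdapter` is a hypothetical name; the actual adaptation in this PR may differ.

```java
import java.util.Locale;

/** Hypothetical adapter from a String codec name to an enum-based API. */
final class CodecAdapter {
  /** Local stand-in for Flink's `CompressionCodec`; the constants listed here are assumptions. */
  enum CompressionCodec { NONE, LZ4, LZO, ZSTD }

  /** Maps a configured codec name, case-insensitively, to the enum constant. */
  static CompressionCodec fromName(String name) {
    // valueOf throws IllegalArgumentException for names without a matching constant.
    return CompressionCodec.valueOf(name.trim().toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(fromName("lz4"));
    System.out.println(fromName("zstd"));
  }
}
```

Converting once at the boundary keeps the rest of the code independent of whether the underlying constructor takes a `String` or an enum.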
Closes #2662 from reswqa/support-120.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
We previously used the `jersey2` library for celeborn-openapi-client, and I found an issue with missing dependencies in the shaded celeborn-openapi-client.
I tried to raise a [PR #2640] to fix it, but it seems difficult to maintain the transitive dependencies pulled in by jersey.
I then received a suggestion from pan to migrate the library from jersey2 to `apache-httpclient`.
FYI: https://openapi-generator.tech/docs/generators/java/
<img width="500" alt="image" src="https://github.com/user-attachments/assets/d102a7c9-46cd-4fd7-a2a0-7396a815776d">
To leverage the latest openapi-generator plugin, I upgraded openapi-generator to the latest version, 7.7.0, which requires JDK 11+.
Because Celeborn has not dropped Java 8 support so far, I included the generated code in the repo and added a user guide for regeneration.
### Why are the changes needed?
To fix the dependency leak issue and make the dependencies easier to maintain.
### Does this PR introduce _any_ user-facing change?
No, this SDK has not been released, so no user-facing change.
### How was this patch tested?
Tested with a sample Maven project.
pom.xml:
```
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>test_openapi</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.celeborn</groupId>
<artifactId>celeborn-openapi-client_2.12</artifactId>
<version>0.6.0-SNAPSHOT</version>
</dependency>
</dependencies>
</project>
```
Testing code:
```
package org.example;
import org.apache.celeborn.rest.v1.master.MasterApi;
import org.apache.celeborn.rest.v1.master.WorkerApi;
import org.apache.celeborn.rest.v1.master.invoker.ApiClient;
public class Main {
public static void main(String[] args) throws Exception {
String cmUrl = "http://***:9098";
MasterApi masterApi = new MasterApi(new ApiClient().setBasePath(cmUrl));
System.out.println(masterApi.getMasterGroupInfo().getLeader().getAddress().split(":")[0]);
WorkerApi workerApi = new WorkerApi(new ApiClient().setBasePath(cmUrl));
System.out.println(workerApi.getWorkers());
System.out.println(workerApi.getWorkerEvents());
}
}
```
```
java -Dfile.encoding=UTF-8 -classpath /Users/fwang12/todo/test_openapi/target/classes:/Users/fwang12/todo/celeborn/openapi/openapi-client/target/celeborn-openapi-client_2.12-0.6.0-SNAPSHOT.jar org.example.Main
```
<img width="1727" alt="image" src="https://github.com/user-attachments/assets/2da8b126-be96-4c37-9a33-ba196024f2ba">
Closes #2641 from turboFei/appache_httpclient.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
as title
### Why are the changes needed?
Now, Celeborn doesn't support sinking shuffle data directly to Amazon S3, which could be a limitation when we're trying to move on-premises servers to AWS and use S3 as a data sink for shuffled data.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Closes #2579 from zhaohehuhu/dev-0619.
Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Now there are three different jackson versions in the server dependency list.
It is better to align them.
### Why are the changes needed?
To align the dependency versions and reduce the conflicts in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GA.
Closes #2620 from turboFei/align_jackson.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Add support for Apache Flink 1.16 in Celeborn.
### Why are the changes needed?
User requests for Apache Flink 1.16.
This implementation is a synthesis of the 1.15 and 1.17 support that already exists in Apache Celeborn.
### Does this PR introduce _any_ user-facing change?
Yes, supports Apache Flink 1.16
### How was this patch tested?
Tests for 1.16 added, which are based on 1.15 and 1.17
Closes #2619 from mridulm/flink-1.16-support.
Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Ratis version from 3.0.1 to 3.1.0. Meanwhile, remove `CelebornStateMachineStorage` with the release of https://github.com/apache/ratis/pull/1111.
### Why are the changes needed?
Bump Ratis version from 3.0.1 to 3.1.0. Ratis has released v3.1.0; see the [3.1.0](https://ratis.apache.org/post/3.1.0.html) release notes. The 3.1.0 version is a minor release with multiple improvements and bugfixes including [[RATIS-2111] Reinitialize should load the latest snapshot](https://issues.apache.org/jira/browse/RATIS-2111). See the [changes between 3.0.1 and 3.1.0](https://github.com/apache/ratis/compare/ratis-3.0.1...ratis-3.1.0) releases.
Follow up #2547.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`MasterStateMachineSuiteJ#testInstallSnapshot`
Closes #2610 from SteNicholas/CELEBORN-1499.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
This PR is for [CIP-9 Refine the celeborn RESTful APIs](https://docs.google.com/document/d/1LV2vV-w3XtlbJj2Vi4J77mt4IYCr40-8A_JncZLsHqs/edit?usp=sharing).
We leverage [openapi-generator](https://github.com/OpenAPITools/openapi-generator) to generate the client and model code.
### Why are the changes needed?
Celeborn has implemented RESTful APIs for monitoring and administrative operations on both master and worker endpoints. These APIs enable tasks such as configuration checks, status viewing of master/worker nodes, worker decommissioning/recommissioning, and more. They provide crucial insights and support for DevOps.
The primary concern with the existing API is the response content type, which is `text/plain` rather than the more widely accepted `application/json`. This mismatch makes integration with DevOps tools challenging, as these tools typically require JSON-formatted responses for seamless parsing and automation.
I also saw the need for REST API evolution in the [Apache Celeborn CLI Proposal](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-7+Celeborn+CLI).
### Does this PR introduce _any_ user-facing change?
This PR introduces a new API namespace: `/api/v1`. This approach allows us to maintain the current API for compatibility while offering an improved version.
### How was this patch tested?
UT.
Closes #2599 from turboFei/cip_9_openapi.
Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Simplify `DirectByteBuffer` constructor lookup logic in `Platform`. Meanwhile, bump `commons-lang3` version from `3.12.0` to `3.13.0`.
### Why are the changes needed?
The `try-catch` statement is not needed because the version number is already known.
Backport:
- https://github.com/apache/spark/pull/41780
- https://github.com/apache/spark/pull/42269
- https://github.com/apache/spark/pull/44444
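A version-based lookup of the kind described above might look like the following sketch. `DirectBufferCtor` is a hypothetical name, and the Java 21 threshold with its `(long, long)` signature is an assumption drawn from the linked Spark changes rather than from this PR's diff. The sketch only selects parameter types and does not reflect on `DirectByteBuffer` itself.

```java
import java.util.Arrays;

/** Hypothetical sketch: choose constructor parameter types from the Java version, without try-catch. */
final class DirectBufferCtor {
  /** Parses values of the `java.specification.version` property such as "1.8", "11" or "21". */
  static int majorVersion(String specVersion) {
    String[] parts = specVersion.split("\\.");
    int first = Integer.parseInt(parts[0]);
    // Releases up to Java 8 report "1.x"; later releases report the major number directly.
    return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
  }

  /** Assumption from the linked Spark changes: the size parameter is a long from Java 21 on. */
  static Class<?>[] constructorParameterTypes(int majorVersion) {
    return majorVersion < 21
        ? new Class<?>[] {long.class, int.class}
        : new Class<?>[] {long.class, long.class};
  }

  public static void main(String[] args) {
    int v = majorVersion(System.getProperty("java.specification.version"));
    System.out.println("Java " + v + " -> " + Arrays.toString(constructorParameterTypes(v)));
  }
}
```

Branching on the known version replaces the older pattern of trying one signature and falling back to the other in a `catch` block.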
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #2544 from SteNicholas/CELEBORN-1327.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Dropwizard version from 3.2.6 to 4.2.25. Meanwhile, introduce `metrics_jvm_thread_peak_count_Value` and `metrics_jvm_thread_total_started_count_Value` in `celeborn-jvm-dashboard.json`.
### Why are the changes needed?
Dropwizard metrics has released v4.2.25, which includes bugfixes and improvements such as:
* [JVM] Fix maximum/total memory calculation: https://github.com/dropwizard/metrics/pull/3125
* [Thread] Add peak and total started thread count to `ThreadStatesGaugeSet`: https://github.com/dropwizard/metrics/pull/1601
Meanwhile, Ratis has been upgraded to 3.0.1, which has no compatibility problems with Dropwizard 4.2.25.
Backport:
- https://github.com/apache/spark/pull/26332
- https://github.com/apache/spark/pull/29426
- https://github.com/apache/spark/pull/37372
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #2540 from SteNicholas/CELEBORN-1389.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove ratis dependencies from common module.
### Why are the changes needed?
Ratis is only depended on by the master module. Removing ratis dependencies from the common module reduces the size of the Celeborn client package.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #2538 from SteNicholas/CELEBORN-1443.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Netty from 4.1.107.Final to 4.1.109.Final.
### Why are the changes needed?
Netty has released v4.1.108.Final and v4.1.109.Final; see the release notes for [4.1.108.Final](https://netty.io/news/2024/03/21/4-1-108-Final.html) and [4.1.109.Final](https://netty.io/news/2024/04/15/4-1-109-Final.html). These versions include bugfixes and improvements such as:
- 4.1.108.Final
  - Epoll: Correctly handle splice tasks when Channel is closed: https://github.com/netty/netty/issues/13848
- 4.1.109.Final
  - Don't send a RST frame when closing the stream in a write future while processing inbound frames: https://github.com/netty/netty/pull/13973
  - Fix DefaultChannelId#asLongText NPE: https://github.com/netty/netty/pull/13971
  - Rewrite ZstdDecoder to remove the need of allocate a huge byte[] internally: https://github.com/netty/netty/pull/13928
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2474 from SteNicholas/CELEBORN-1396.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump RoaringBitmap version from 1.0.5 to 1.0.6.
### Why are the changes needed?
RoaringBitmap has released v1.0.6; see the [1.0.6](https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/1.0.6) release notes. This version includes bug fixes and improvements, including:
- Implement BatchIterator's promise to fill the input buffer.
- RoaringBitmap to BitSet/long[]/byte[].
Backport https://github.com/apache/spark/pull/46152. The comment https://github.com/apache/spark/pull/46152#issuecomment-2068727268 mentions that the benchmark performance on JDK21 is quite good.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2473 from SteNicholas/CELEBORN-1395.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Bump RoaringBitmap version from 0.9.32 to 1.0.5.
### Why are the changes needed?
RoaringBitmap has released v1.0.5; see the [1.0.5](https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/1.0.5) release notes. This version includes bug fixes and improvements, including:
- Fix roaringbitmap - batchiterator's advanceIfNeeded to handle run lengths of zero.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2454 from SteNicholas/CELEBORN-1382.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Bump guava from 32.1.3-jre to 33.1.0-jre.
### Why are the changes needed?
Guava v33.1.0 has been released; see the [v33.1.0](https://github.com/google/guava/releases/tag/v33.1.0) release notes. v33.1.0 brings the following bug fixes and optimizations:
* cache: Fixed a bug that could cause the problem described in https://github.com/google/guava/pull/6851#issuecomment-1931276822 for `CacheLoader`/`CacheBuilder`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2439 from SteNicholas/CELEBORN-1366.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Previously, there was no HTTP request spec covering query parameters, HTTP method and response media type.
In addition, each API needed its own HttpEndpoint class.
In this PR, we refine the HTTP service code and provide a Swagger UI.
Note: this PR does not change the original API request and response behavior, including the metrics APIs.
TODO:
1. define DTO
2. http request authentication
<img width="1900" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/7f8c2363-170d-4bdf-b2c9-74260e31d3e5">
<img width="1138" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/3ae6ec8e-00a8-475b-bb37-0329536185f6">
### Why are the changes needed?
To close CELEBORN-1317
### Does this PR introduce _any_ user-facing change?
The API is aligned with the previous behavior.
### How was this patch tested?
UT.
Closes #2371 from turboFei/jetty.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Compile Spark-3.5 with
`./build/make-distribution.sh -Pspark-3.5 -Pjdk-21`
or
`./build/make-distribution.sh --sbt-enabled -Pspark-3.5 -Pjdk-21`
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
manual tests
Closes #2385 from waitinfuture/1327.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove incubator/incubating for graduation including:
- Remove `incubator`/`Incubating`.
- Remove `DISCLAIMER` and corresponding link.
- Update Release scripts and template.
Fix #2415.
### Why are the changes needed?
The ASF board has approved a resolution to graduate Celeborn into a full Top Level Project. To transition from the Apache Incubator to a new TLP, there are a few action items we need to complete.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2421 from SteNicholas/infra-graduation.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce JVM profiling via `JVMProfiler` in the Celeborn Worker, using async-profiler to capture CPU and memory profiles.
### Why are the changes needed?
[async-profiler](https://github.com/async-profiler) is a sampling profiler for any JDK based on the HotSpot JVM that does not suffer from the safepoint bias problem. It has low overhead and doesn’t rely on JVMTI. It avoids the safepoint bias problem by using the `AsyncGetCallTrace` API provided by the HotSpot JVM to profile the Java code paths, and Linux’s perf_events to profile the native code paths. It features HotSpot-specific APIs to collect stack traces and to track memory allocations.
The feature introduces a profiler plugin that does not add any overhead unless enabled and can be configured to accept profiler arguments as a configuration parameter. It should support turning profiling on and off, and it includes the jar/binaries needed for profiling.
Backport [[SPARK-46094] Support Executor JVM Profiling](https://github.com/apache/spark/pull/44021).
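The "no overhead unless enabled" behavior can be sketched as a factory that returns a shared no-op profiler when the feature is off. This is only a minimal illustration: `Profiler`, `RecordingProfiler`, `ProfilerPlugin` and the `profiler.enabled` / `profiler.options` keys are hypothetical names, not Celeborn's actual classes or configuration keys, and the recording class merely stands in for a real async-profiler binding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is not Celeborn's actual profiler API.
interface Profiler {
    void start();
    void stop();
}

// Stand-in for a real async-profiler binding: it only records the commands it would issue.
final class RecordingProfiler implements Profiler {
    final List<String> commands = new ArrayList<>();
    private final String options;

    RecordingProfiler(String options) {
        this.options = options;
    }

    @Override
    public void start() {
        commands.add("start," + options);
    }

    @Override
    public void stop() {
        commands.add("stop");
    }
}

final class ProfilerPlugin {
    // Shared no-op instance: when profiling is disabled, start/stop do nothing.
    static final Profiler NO_OP = new Profiler() {
        @Override
        public void start() {}

        @Override
        public void stop() {}
    };

    // Builds a profiler from configuration; both keys are illustrative assumptions.
    static Profiler fromConf(Map<String, String> conf) {
        boolean enabled = Boolean.parseBoolean(conf.getOrDefault("profiler.enabled", "false"));
        if (!enabled) {
            return NO_OP;
        }
        // Profiler arguments are passed through as a single configuration value.
        return new RecordingProfiler(conf.getOrDefault("profiler.options", "event=cpu"));
    }
}

final class ProfilerPluginDemo {
    public static void main(String[] args) {
        // Disabled by default: the no-op instance is returned.
        Profiler off = ProfilerPlugin.fromConf(Map.of());
        off.start();
        off.stop();
        System.out.println(off == ProfilerPlugin.NO_OP);

        // Enabled: the configured arguments are forwarded to the profiler.
        RecordingProfiler on = (RecordingProfiler) ProfilerPlugin.fromConf(
            Map.of("profiler.enabled", "true", "profiler.options", "event=cpu,interval=10ms"));
        on.start();
        on.stop();
        System.out.println(on.commands);
    }
}
```

Returning a singleton no-op keeps the call sites unconditional, so the worker code does not need to check the flag on every start or stop.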
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Worker cluster test.
Closes #2409 from SteNicholas/CELEBORN-1299.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Support Flink 1.19.
### Why are the changes needed?
Flink 1.19.0 has been released: [Announcing the Release of Apache Flink 1.19](https://flink.apache.org/2024/03/18/announcing-the-release-of-apache-flink-1.19).
The main changes include:
- `org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel` constructor parameter changes:
  - `consumedSubpartitionIndex` changes to `consumedSubpartitionIndexSet`: [[FLINK-33743][runtime] Support consuming multiple subpartitions on a single channel](https://github.com/apache/flink/pull/23927).
  - adds `partitionRequestListenerTimeout`: [[FLINK-25055][network] Support listen and notify mechanism for partition request](https://github.com/apache/flink/pull/23565).
- `org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor removes parameters `subpartitionIndexRange`, `tieredStorageConsumerClient`, `nettyService` and `tieredStorageConsumerSpecs`: [[FLINK-33743][runtime] Support consuming multiple subpartitions on a single channel](https://github.com/apache/flink/pull/23927).
- Change the default config file to `config.yaml` in `flink-dist`: [[FLINK-33577][dist] Change the default config file to config.yaml in flink-dist](https://github.com/apache/flink/pull/24177).
- `org.apache.flink.configuration.RestartStrategyOptions` uses `org.apache.commons.compress.utils.Sets` of `commons-compress` dependency: [[FLINK-33865][runtime] Adding an ITCase to ensure exponential-delay.attempts-before-reset-backoff works well](https://github.com/apache/flink/pull/23942).
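As a rough illustration of the `consumedSubpartitionIndex` to `consumedSubpartitionIndexSet` change, a caller that previously held a single index can wrap it as a one-element range. The `SubpartitionIndexSet` class below is a hypothetical stand-in written for this sketch; it is not the Flink type and only models an inclusive index range.

```java
// Hypothetical stand-in for the index-set type taken by the 1.19 constructor;
// it is not the Flink class and only models an inclusive index range.
final class SubpartitionIndexSet {
    private final int startIndex; // inclusive
    private final int endIndex;   // inclusive

    SubpartitionIndexSet(int startIndex, int endIndex) {
        if (startIndex < 0 || endIndex < startIndex) {
            throw new IllegalArgumentException("invalid range: " + startIndex + ".." + endIndex);
        }
        this.startIndex = startIndex;
        this.endIndex = endIndex;
    }

    // Callers written against the single-index form wrap it as a one-element range.
    static SubpartitionIndexSet ofSingle(int index) {
        return new SubpartitionIndexSet(index, index);
    }

    // Number of subpartition indexes covered by the range.
    int size() {
        return endIndex - startIndex + 1;
    }

    boolean contains(int index) {
        return index >= startIndex && index <= endIndex;
    }
}

final class SubpartitionIndexSetDemo {
    public static void main(String[] args) {
        SubpartitionIndexSet single = SubpartitionIndexSet.ofSingle(3);
        System.out.println(single.size() + " " + single.contains(3) + " " + single.contains(4));
    }
}
```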
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Local test:
- Flink batch job submission
```
$ ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID 2e9fb659991a9c29d376151783bdf6de
Program execution finished
Job with JobID 2e9fb659991a9c29d376151783bdf6de has finished.
Job Runtime: 1912 ms
```
- Flink batch job execution

- Celeborn master log
```
24/03/18 20:52:47,513 INFO [celeborn-dispatcher-42] Master: Offer slots successfully for 1 reducers of 1710766312631-2e9fb659991a9c29d376151783bdf6de-0 on 1 workers.
```
- Celeborn worker log
```
24/03/18 20:52:47,704 INFO [celeborn-dispatcher-1] StorageManager: created file at /Users/nicholas/Software/Celeborn/apache-celeborn-0.5.0-SNAPSHOT/shuffle/celeborn-worker/shuffle_data/1710766312631-2e9fb659991a9c29d376151783bdf6de/0/0-0-0
24/03/18 20:52:47,707 INFO [celeborn-dispatcher-1] Controller: Reserved 1 primary location and 0 replica location for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0
24/03/18 20:52:47,874 INFO [celeborn-dispatcher-2] Controller: Start commitFiles for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0
24/03/18 20:52:47,890 INFO [worker-rpc-async-replier] Controller: CommitFiles for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0 success with 1 committed primary partitions, 0 empty primary partitions, 0 failed primary partitions, 0 committed replica partitions, 0 empty replica partitions, 0 failed replica partitions.
```
Closes #2399 from SteNicholas/CELEBORN-1310.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump `rocksdbjni` version from 8.5.3 to 8.11.3.
### Why are the changes needed?
The new version brings some bug fixes:
- Fix a corner case with auto_readahead_size where Prev Operation returns NOT SUPPORTED error when scans direction is changed from forward to backward.
- Avoid destroying the periodic task scheduler's default timer in order to prevent static destruction order issues.
- Fix double counting of BYTES_WRITTEN ticker when doing writes with transactions.
- Fix a WRITE_STALL counter that was reporting wrong value in few cases.
- A lookup by MultiGet in a TieredCache that goes to the local flash cache and finishes with very low latency, i.e before the subsequent call to WaitAll, is ignored, resulting in a false negative and a memory leak.
- Fix bug in auto_readahead_size that combined with IndexType::kBinarySearchWithFirstKey + fails or iterator lands at a wrong key
- Fixed some cases in which DB file corruption was detected but ignored on creating a backup with BackupEngine.
- Fix bugs where rocksdb.blobdb.blob.file.synced includes blob files failed to get synced and rocksdb.blobdb.blob.file.bytes.written includes blob bytes failed to get written.
- Fixed a possible memory leak or crash on a failure (such as I/O error) in automatic atomic flush of multiple column families.
- Fixed some cases of in-memory data corruption using mmap reads with BackupEngine, sst_dump, or ldb.
- Fixed issues with experimental preclude_last_level_data_seconds option that could interfere with expected data tiering.
- Fixed the handling of the edge case when all existing blob files become unreferenced. Such files are now correctly deleted.
The full release notes are available at [rocksdbjni releases](https://github.com/facebook/rocksdb/releases).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #2389 from SteNicholas/CELEBORN-1330.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>