
104 Commits

Author SHA1 Message Date
SteNicholas
75446a05d3 [CELEBORN-2093] Support Flink 2.1
### What changes were proposed in this pull request?

Support Flink 2.1.

### Why are the changes needed?

Flink 2.1 has been released; see the [Release notes - Flink 2.1](https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-2.1).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3404 from SteNicholas/CELEBORN-2093.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-04 14:12:55 +08:00
SteNicholas
cfb4438ade [CELEBORN-2057] Bump ap-loader version from 3.0-9 to 4.0-10
### What changes were proposed in this pull request?

Bump ap-loader version from 3.0-9 to 4.0-10.

### Why are the changes needed?

`ap-loader` v4.0-10 has been released; see the release note [Loader for 4.0 (v10): Heatmaps and Native memory profiling](https://github.com/jvm-profiling-tools/ap-loader/releases/tag/4.0-10). The version used by `JVMProfiler` should be bumped from 3.0-9 to 4.0-10.

Backport https://github.com/apache/spark/pull/51257.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #3359 from SteNicholas/CELEBORN-2057.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-07-10 16:18:28 +08:00
Wang, Fei
9a689b7482 [CELEBORN-2028] Setup GA for grafana dashboard
### What changes were proposed in this pull request?

Set up GA for the Grafana dashboard.

1. Lint the dashboard with https://github.com/grafana/dashboard-linter
2. Check for duplicate ids in the dashboard JSON file
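
The duplicate-id check can be illustrated with a minimal sketch. This is not the check added by this PR: the class name `DashboardIdCheck` and the regex-based scan are assumptions made for illustration, and a real check would more likely parse the JSON properly.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the script used in the GA workflow.
final class DashboardIdCheck {
    // Matches numeric "id" fields such as "id": 12. Fields like "uid" are not matched
    // because the pattern requires the opening quote directly before "id".
    private static final Pattern ID_FIELD = Pattern.compile("\"id\"\\s*:\\s*(\\d+)");

    // Returns every id that occurs more than once, in the order the repetition is first seen.
    static List<Long> findDuplicateIds(String dashboardJson) {
        Set<Long> seen = new HashSet<>();
        Set<Long> reported = new HashSet<>();
        List<Long> duplicates = new ArrayList<>();
        Matcher m = ID_FIELD.matcher(dashboardJson);
        while (m.find()) {
            long id = Long.parseLong(m.group(1));
            // seen.add returns false when the id was already present.
            if (!seen.add(id) && reported.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String json = "{\"panels\":[{\"id\": 1},{\"id\": 2},{\"id\": 1}]}";
        System.out.println(findDuplicateIds(json)); // prints [1]
    }
}
```

The scan treats every numeric `"id"` field in the file alike, so it assumes ids are expected to be unique across the whole dashboard file.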

### Why are the changes needed?

This helps with reviewing Grafana-related PRs, for example: https://github.com/apache/celeborn/pull/3307#discussion_r2134799722

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA.

<img width="1407" alt="image" src="https://github.com/user-attachments/assets/35452633-ddff-4140-b929-3c44a943a2ab" />

Closes #3316 from turboFei/dashboard.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-06-10 16:14:49 +08:00
SteNicholas
3fb6d5b829 [CELEBORN-1413][FOLLOWUP] Support dependencies of spark-4.0 profile
### What changes were proposed in this pull request?

Support dependencies of `spark-4.0` profile.

Follow up #3282.

### Why are the changes needed?

#3282 lacks dependency support for the `spark-4.0` profile.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Dependencies check: maven-jdk17 (spark-4.0).

Closes #3298 from SteNicholas/CELEBORN-1413.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-05-30 10:14:30 +08:00
Fei Wang
b44730771d [CELEBORN-1413][FOLLOWUP] Bump spark 4.0 version to 4.0.0
### What changes were proposed in this pull request?
Bump spark 4.0 version to 4.0.0.

### Why are the changes needed?
Spark 4.0.0 is ready.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
GA.

Closes #3282 from turboFei/spark_4.0.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-05-28 17:56:08 +08:00
SteNicholas
8e66ac833a [CELEBORN-1994] Introduce disruptor dependency to support asynchronous logging of log4j2
### What changes were proposed in this pull request?

Introduce disruptor dependency to support asynchronous logging of log4j2.

### Why are the changes needed?

We add `-Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector` to `CELEBORN_MASTER_JAVA_OPTS` and `CELEBORN_WORKER_JAVA_OPTS` in production environments. `AsyncLoggerContextSelector` depends on the disruptor library. Therefore, it is recommended to introduce the disruptor dependency to support log4j2 asynchronous loggers.
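
As a rough sketch of how the flag from this description could be added to an existing options string: the class `AsyncLoggingOpts` and its method are hypothetical names for illustration, and only the system property and selector class come from the text above.

```java
// Hypothetical helper for illustration; Celeborn itself reads these options from its env scripts.
final class AsyncLoggingOpts {
    // System property and selector class quoted in the PR description.
    static final String SELECTOR_FLAG =
        "-Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector";

    // Appends the flag to a JAVA_OPTS-style string unless a context selector is already set.
    static String withAsyncLoggers(String javaOpts) {
        String opts = javaOpts == null ? "" : javaOpts.trim();
        if (opts.contains("-Dlog4j2.contextSelector=")) {
            return opts;
        }
        return opts.isEmpty() ? SELECTOR_FLAG : opts + " " + SELECTOR_FLAG;
    }

    public static void main(String[] args) {
        // "-Xmx4g" is just an example of a pre-existing option.
        System.out.println(withAsyncLoggers("-Xmx4g"));
    }
}
```

The flag only works if the disruptor JAR is on the classpath, which is what this PR provides.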

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster test.

Closes #3246 from SteNicholas/CELEBORN-1994.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-13 19:45:51 +08:00
veli.yang
7d0ba7f9b8 [CELEBORN-1916] Support Aliyun OSS Based on MPU Extension Interface
### What changes were proposed in this pull request?

- close [CELEBORN-1916](https://issues.apache.org/jira/browse/CELEBORN-1916)
- This PR extends the Multipart Uploader (MPU) interface to support Aliyun OSS.

### Why are the changes needed?

- Implemented multipart-uploader-oss module based on the existing MPU extension interface.
- Added necessary configurations and dependencies for Aliyun OSS integration.
- Ensured compatibility with the existing multipart-uploader framework.
- This enhancement allows seamless multipart upload functionality for Aliyun OSS, similar to the existing AWS S3 support.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Deployment integration testing has been completed in the local environment.

Closes #3157 from shouwangyw/optimize/mpu-oss.

Lead-authored-by: veli.yang <897900564@qq.com>
Co-authored-by: yangwei <897900564@qq.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-04-08 15:10:33 +08:00
SteNicholas
2b8f3520f9 [CELEBORN-1925] Support Flink 2.0
### What changes were proposed in this pull request?

Support Flink 2.0. The major changes of Flink 2.0 include:

- https://github.com/apache/flink/pull/25406: Bump target Java version to 11 and drop support for Java 8.
- https://github.com/apache/flink/pull/25551: Replace `InputGateDeploymentDescriptor#getConsumedSubpartitionIndexRange` with `InputGateDeploymentDescriptor#getConsumedSubpartitionRange(index)`.
- https://github.com/apache/flink/pull/25314: Replace `NettyShuffleEnvironmentOptions#NETWORK_EXCLUSIVE_BUFFERS_REQUEST_TIMEOUT_MILLISECONDS` with `NettyShuffleEnvironmentOptions#NETWORK_BUFFERS_REQUEST_TIMEOUT`.
- https://github.com/apache/flink/pull/25731: Introduce `InputGate#resumeGateConsumption`.

### Why are the changes needed?

Flink 2.0 has been released; see the [Release notes - Flink 2.0](https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-2.0).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3179 from SteNicholas/CELEBORN-1925.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
2025-04-07 15:23:20 +08:00
Wang, Fei
3d05c8998f [CELEBORN-1895] Bump log4j2 version to 2.24.3
### What changes were proposed in this pull request?

Bump log4j2 version to 2.24.3
https://github.com/apache/logging-log4j2/releases/tag/rel%2F2.24.3

### Why are the changes needed?
Bump to latest log4j2 bug fix release.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA.

Closes #3134 from turboFei/log4j2.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-03-10 11:30:52 +08:00
Cheng Pan
e85207e2c7 [CELEBORN-1413][FOLLOWUP] Rename celeborn-client-spark-3-4 back to celeborn-client-spark-3
### What changes were proposed in this pull request?

This PR partially reverts https://github.com/apache/celeborn/pull/2813; namely, it restores the name `celeborn-client-spark-3`.

### Why are the changes needed?

The renaming is not necessary and might cause confusion. For example, I wrongly interpreted `spark-3-4` as Spark 3.4. It also increases the backport effort for branch-0.5.

### Does this PR introduce _any_ user-facing change?

No, it's dev only. Before and after this change, end users always use the shaded client:

```
celeborn-client-spark-2-shaded_2.11-0.6.0-SNAPSHOT.jar
celeborn-client-spark-3-shaded_2.12-0.6.0-SNAPSHOT.jar
celeborn-client-spark-4-shaded_2.13-0.6.0-SNAPSHOT.jar
```

### How was this patch tested?

Pass GA.

Closes #3133 from pan3793/CELEBORN-1413-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-03-04 22:25:10 +08:00
SteNicholas
d90cf0d427 [CELEBORN-1884] Bump rocksdbjni version from 9.5.2 to 9.10.0
### What changes were proposed in this pull request?

Bump rocksdbjni version from 9.5.2 to 9.10.0.

### Why are the changes needed?

There are some bug fixes and performance improvements. The full release notes:

- https://github.com/facebook/rocksdb/releases/tag/v9.6.1
- https://github.com/facebook/rocksdb/releases/tag/v9.7.3
- https://github.com/facebook/rocksdb/releases/tag/v9.7.4
- https://github.com/facebook/rocksdb/releases/tag/v9.8.4
- https://github.com/facebook/rocksdb/releases/tag/v9.9.3
- https://github.com/facebook/rocksdb/releases/tag/v9.10.0

Backport:

- https://github.com/apache/spark/pull/48155
- https://github.com/apache/spark/pull/49538
- https://github.com/apache/spark/pull/50076

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3121 from SteNicholas/CELEBORN-1884.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-28 11:29:42 +08:00
Nicholas Jiang
79b49805e8 [CELEBORN-1877] Bump zstd-jni version from 1.5.2-1 to 1.5.7-1
### What changes were proposed in this pull request?

Bump zstd-jni version from 1.5.2-1 to 1.5.7-1.

### Why are the changes needed?

Bump zstd-jni to the latest version.

Backport https://github.com/apache/spark/pull/50057.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #3114 from SteNicholas/CELEBORN-1877.

Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-25 11:08:44 +08:00
Nicholas Jiang
2dd26936e8 [CELEBORN-1864] Bump Netty version from 4.1.115.Final to 4.1.118.Final
### What changes were proposed in this pull request?

Bump Netty version from 4.1.115.Final to 4.1.118.Final.

### Why are the changes needed?

Netty 4.1.118.Final has been released, while the current Netty version is 4.1.115.Final. The changes between 4.1.115.Final and 4.1.118.Final are as follows:

- 4.1.116.Final: https://netty.io/news/2024/12/17/4-1-116-Final.html
- 4.1.117.Final: https://netty.io/news/2025/01/14/4-1-117-Final.html
- 4.1.118.Final: https://netty.io/news/2025/02/10/4-1-118-Final.html
   - **SslHandler doesn't correctly validate packets which can lead to native crash when using native SSLEngine.**
   - **Denial of Service attack on windows app using Netty, again.**

Backport:

- https://github.com/apache/spark/pull/49756
- https://github.com/apache/spark/pull/49923

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3098 from SteNicholas/CELEBORN-1864.

Authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-15 11:46:28 +08:00
madlnu
b5c00ea645 [CELEBORN-1862] Bump Ratis version from 3.1.2 to 3.1.3
### What changes were proposed in this pull request?
Upgrading ratis version to 3.1.3

### Why are the changes needed?
For fixing the CVE-2024-7254 and sonatype-2020-0026 coming from its transitive dependency - ratis-thirdparty-misc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Locally and CI tests

Closes #3095 from Madhukar525722/main.

Authored-by: madlnu <madlnu@visa.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-12 17:46:58 +08:00
SteNicholas
30e46eee28 [CELEBORN-1842] Bump ap-loader version from 3.0-8 to 3.0-9
### What changes were proposed in this pull request?

Bump ap-loader version from 3.0-8 to 3.0-9.

### Why are the changes needed?

`ap-loader` v3.0-9 has been released, so the version used by `JVMProfiler` should be bumped from 3.0-8.

Backport:

1. https://github.com/apache/spark/pull/46402
2. https://github.com/apache/spark/pull/49440

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3072 from SteNicholas/CELEBORN-1842.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-01-21 12:22:00 +08:00
codenohup
a57238024e [CELEBORN-1801] Remove out-of-dated flink 1.14 and 1.15
### What changes were proposed in this pull request?
Remove outdated Flink 1.14 and 1.15.

For more information, please see the discussion thread: https://lists.apache.org/thread/njho00zmkjx5qspcrbrkogy8s4zzmwv9

### Why are the changes needed?
Reduce maintenance burden.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Changes can be covered by existing tests.

Closes #3029 from codenohup/remove-flink14and15.

Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-12-30 15:33:44 +08:00
hongguangwei
d0d8edfe22 [CELEBORN-1737] Support build tez client package
### What changes were proposed in this pull request?
Add Tez packaging script.

### Why are the changes needed?
To support building the Tez client package.

### Does this PR introduce _any_ user-facing change?
Yes, this enables Celeborn with Tez support.

### How was this patch tested?
Cluster test.

Closes #3028 from GH-Gloway/1737.

Lead-authored-by: hongguangwei <hongguangwei@bytedance.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-30 11:01:19 +08:00
mingji
fde6365f68 [CELEBORN-1413] Support Spark 4.0
### What changes were proposed in this pull request?
To support Spark 4.0.0 preview.

### Why are the changes needed?
1. Changed Scala to 2.13.
2. Introduce columnar shuffle module for spark 4.0.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Cluster test.

Closes #2813 from FMX/b1413.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-24 18:12:27 +08:00
SteNicholas
f3dac7e879 [CELEBORN-1712] Bump Netty version from 4.1.109.Final to 4.1.115.Final
### What changes were proposed in this pull request?

Bump Netty version from 4.1.109.Final to 4.1.115.Final.

### Why are the changes needed?

Netty 4.1.115.Final has been released, while the current Netty version is 4.1.109.Final. The changes between 4.1.110.Final and 4.1.115.Final are as follows:

- [4.1.110.Final](https://netty.io/news/2024/05/22/4-1-110-Final.html)
- [4.1.111.Final](https://netty.io/news/2024/06/11/4-1-111-Final.html)
- [4.1.112.Final](https://netty.io/news/2024/07/19/4-1-112-Final.html)
- [4.1.113.Final](https://netty.io/news/2024/09/04/4-1-113-Final.html)
- [4.1.114.Final](https://netty.io/news/2024/10/01/4-1-114-Final.html)
- [4.1.115.Final](https://netty.io/news/2024/11/12/4-1-115-Final.html)

Bump https://github.com/apache/spark/pull/46945.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2903 from SteNicholas/CELEBORN-1712.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-17 17:29:07 +08:00
zhaohehuhu
3bf91929b6 [CELEBORN-1746] Reduce the size of aws dependencies
### What changes were proposed in this pull request?
Due to the large size of the AWS cloud vendor's client JARs, this PR keeps only the AWS S3 module, reducing the AWS dependency size from over 296MB to around 2.3MB.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

<img width="2560" alt="Screenshot 2024-11-25 at 16 17 52" src="https://github.com/user-attachments/assets/efebbe7d-73cb-47fb-b7fa-9aae052f744b">
Tested in the lab, as shown in the screenshot above.

Closes #2944 from zhaohehuhu/dev-1125.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-28 19:45:01 +08:00
mingji
3590fa778e [CELEBORN-1545] Add Tez plugin skeleton and dag app master
### What changes were proposed in this pull request?
1. Add directories for Apache Tez framework
2. Add a CelebornDagAppMaster with LifecycleManager

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2939 from GH-Gloway/b1545-1.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-22 18:38:25 +08:00
zhaohehuhu
a2d3972318 [CELEBORN-1530] support MPU for S3
### What changes were proposed in this pull request?

as title

### Why are the changes needed?
AWS S3 doesn't support append, so Celeborn had to copy the historical data from S3 to the worker and write it to S3 again, which heavily amplifies writes. This PR implements a better solution via MPU to avoid copy-and-write.
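
To make the write-amplification argument concrete, here is a small back-of-the-envelope sketch. It only counts uploaded bytes under a simplified model; the class `S3WriteAmplification` and the chunk sizes are assumptions for illustration and are not taken from Celeborn's implementation.

```java
// Hypothetical model for illustration; it does not call S3 or any Celeborn API.
final class S3WriteAmplification {
    // Bytes uploaded when every flush re-uploads all previously stored data plus the new chunk.
    static long copyAndRewriteBytes(long[] chunkSizes) {
        long stored = 0;
        long written = 0;
        for (long chunk : chunkSizes) {
            stored += chunk;   // object grows by the new chunk
            written += stored; // the whole object is written again
        }
        return written;
    }

    // Bytes uploaded when each flush is sent exactly once as its own part.
    static long multipartBytes(long[] chunkSizes) {
        long written = 0;
        for (long chunk : chunkSizes) {
            written += chunk;
        }
        return written;
    }

    public static void main(String[] args) {
        long[] chunks = {64, 64, 64, 64}; // four flushes of 64 units each
        System.out.println("copy-and-rewrite: " + copyAndRewriteBytes(chunks)); // 640
        System.out.println("multipart: " + multipartBytes(chunks));             // 256
    }
}
```

With n equal-sized flushes the copy-and-rewrite total grows quadratically in n, while the multipart total stays linear. The model ignores the extra download from S3 to the worker that copy-and-write also incurs.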

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

![WechatIMG257](https://github.com/user-attachments/assets/968d9162-e690-4767-8bed-e490e3055753)

I conducted an experiment with a 1GB input dataset to compare the performance of Celeborn using only S3 storage versus using SSD storage. The results showed that Celeborn with SSD storage was approximately three times faster than with only S3 storage.

<img width="1728" alt="Screenshot 2024-11-16 at 13 02 10" src="https://github.com/user-attachments/assets/8f879c47-c01a-4004-9eae-1c266c1f3ef2">

The screenshot above shows my second test, with 5000 mappers and reducers.

Closes #2830 from zhaohehuhu/dev-1021.

Lead-authored-by: zhaohehuhu <luoyedeyi@163.com>
Co-authored-by: He Zhao <luoyedeyi459@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-22 15:03:53 +08:00
SteNicholas
7d1da5e915 [CELEBORN-1702] Bump Ratis version from 3.1.1 to 3.1.2
### What changes were proposed in this pull request?

Bump Ratis version from 3.1.1 to 3.1.2 including:

- Fix NPE in `RaftServerImpl.getLogInfo`: https://github.com/apache/ratis/pull/1171

### Why are the changes needed?

Ratis has released v3.1.2; see the release note [3.1.2](https://ratis.apache.org/post/3.1.2.html). The 3.1.2 version is a minor release with multiple improvements and bugfixes, including [[RATIS-2179] Fix NPE in `RaftServerImpl.getLogInfo`](https://issues.apache.org/jira/browse/RATIS-2179). See the [changes between 3.1.1 and 3.1.2](https://github.com/apache/ratis/compare/ratis-3.1.1...ratis-3.1.2) releases.

The 3.1.2 version fixes the following `NullPointerException` seen in the CI log:

```
[info] Test org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterRatisServerSuiteJ.testIsLeader started
24/10/24 08:16:30,295 ERROR [pool-1-thread-1] HARaftServer: Failed to retrieve RaftPeerRole. Setting cached role to UNRECOGNIZED and resetting leader info.
java.io.IOException: java.lang.NullPointerException
    at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
    at org.apache.ratis.server.impl.RaftServerImpl.waitForReply(RaftServerImpl.java:1148)
    at org.apache.ratis.server.impl.RaftServerProxy.getGroupInfo(RaftServerProxy.java:607)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.getGroupInfo(HARaftServer.java:599)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.updateServerRole(HARaftServer.java:514)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.isLeader(HARaftServer.java:489)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterRatisServerSuiteJ.testIsLeader(MasterRatisServerSuiteJ.java:47)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runners.Suite.runChild(Suite.java:128)
    at org.junit.runners.Suite.runChild(Suite.java:27)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
    at com.novocode.junit.JUnitTask.execute(JUnitTask.java:64)
    at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
    at org.apache.ratis.server.impl.RaftServerImpl.getLogInfo(RaftServerImpl.java:665)
    at org.apache.ratis.server.impl.RaftServerImpl.getGroupInfo(RaftServerImpl.java:658)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$getGroupInfoAsync$23(RaftServerProxy.java:613)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
    at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2897 from SteNicholas/CELEBORN-1702.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 17:15:20 +08:00
Wang, Fei
330b2a094e [CELEBORN-1708] Bump protobuf version from 3.21.7 to 3.25.5
### What changes were proposed in this pull request?

Bump protobuf from 3.21.7 to 3.25.5.

### Why are the changes needed?

To fix CVE: https://github.com/advisories/GHSA-735f-pc8j-v9w8

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

GA.

Closes #2898 from turboFei/bump_protobuf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 17:02:23 +08:00
Wang, Fei
09ffee0365 [CELEBORN-1709] Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826
### What changes were proposed in this pull request?

 Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826

### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-g8m5-722r-8whq

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

Closes #2899 from turboFei/bump_jetty.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 16:58:44 +08:00
Wang, Fei
6d2b9f6d92 [CELEBORN-1710] Bump commons-io version from 2.13.0 to 2.17.0
### What changes were proposed in this pull request?
 Bump commons-io from 2.13.0 to 2.17.0

### Why are the changes needed?

To fix CVE: https://github.com/advisories/GHSA-78wr-2p64-hpwj

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA.

Closes #2900 from turboFei/bump_commons_io.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 16:57:29 +08:00
SteNicholas
651cbebc1a [CELEBORN-1525] Bump Ratis version from 3.1.0 to 3.1.1
### What changes were proposed in this pull request?

Bump Ratis version from 3.1.0 to 3.1.1 including:

- Remove `address2String` and use `setAddress(ratisAddr)` with the release of https://github.com/apache/ratis/pull/1125.
- Support the requirement that `raft.grpc.message.size.max` be 1m larger than `raft.server.log.appender.buffer.byte-limit`, for https://github.com/apache/ratis/pull/1132.
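
The size constraint in the second item can be sketched as a simple check. The class `RatisSizeCheck` is a hypothetical name, and the sketch reads "1m larger" as "at least 1 MiB larger"; neither is taken from the Ratis or Celeborn code.

```java
// Hypothetical validation helper for illustration; not part of Ratis or Celeborn.
final class RatisSizeCheck {
    static final long ONE_MB = 1024L * 1024L;

    // Smallest raft.grpc.message.size.max accepted for a given appender buffer byte-limit,
    // assuming the rule means "at least 1 MiB larger".
    static long minGrpcMessageSizeMax(long appenderBufferByteLimit) {
        return appenderBufferByteLimit + ONE_MB;
    }

    // True when the configured gRPC message size leaves the required 1 MiB of headroom.
    static boolean isValid(long grpcMessageSizeMax, long appenderBufferByteLimit) {
        return grpcMessageSizeMax >= minGrpcMessageSizeMax(appenderBufferByteLimit);
    }

    public static void main(String[] args) {
        long bufferLimit = 4 * ONE_MB; // example value only
        System.out.println(isValid(5 * ONE_MB, bufferLimit)); // true
        System.out.println(isValid(4 * ONE_MB, bufferLimit)); // false
    }
}
```

A configuration layer could derive the gRPC limit from the buffer limit with `minGrpcMessageSizeMax` instead of asking users to keep the two values in sync by hand.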

### Why are the changes needed?

Ratis has released v3.1.1; see the release note [3.1.1](https://ratis.apache.org/post/3.1.1.html). The 3.1.1 version is a minor release with multiple improvements and bugfixes, including [[RATIS-2116] Fix the issue where RaftServerImpl.appendEntries may be blocked indefinitely](https://issues.apache.org/jira/browse/RATIS-2116) and [[RATIS-2131] Configuring Ratis fails when hostname is used, and is an IPv6 host](https://issues.apache.org/jira/browse/RATIS-2131). See the [changes between 3.1.0 and 3.1.1](https://github.com/apache/ratis/compare/ratis-3.1.0...ratis-3.1.1) releases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2759 from SteNicholas/CELEBORN-1525.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2024-09-26 10:45:38 -05:00
sychen
6e071344ba [CELEBORN-1606] Generate dependencies-client-flink-1.16
### What changes were proposed in this pull request?

### Why are the changes needed?
CELEBORN-1504 added support for Flink 1.16, but `dependencies-client-flink-1.16` was not generated. Because the file does not exist, `dependencies.sh` passes without checking it.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2751 from cxzl25/CELEBORN-1606.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-23 20:18:44 +08:00
sychen
8734d16638 [CELEBORN-1605] Bump commons-lang3 version from 3.13.0 to 3.17.0
### What changes were proposed in this pull request?

### Why are the changes needed?
https://commons.apache.org/proper/commons-lang/changes-report.html

https://github.com/apache/celeborn/pull/2544#issuecomment-2349065779

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2750 from cxzl25/CELEBORN-1605.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-20 17:37:31 +08:00
sychen
40f8eccecd [CELEBORN-1604] Bump rocksdbjni version from 8.11.3 to 9.5.2
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/facebook/rocksdb/compare/v8.11.3...v9.5.2

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2749 from cxzl25/CELEBORN-1604.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-20 17:35:42 +08:00
sychen
589100ea91 [CELEBORN-1600] Enable check server dependencies
### What changes were proposed in this pull request?

### Why are the changes needed?
The server module is missing dependency checks.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2742 from cxzl25/check_server_deps.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-20 15:14:56 +08:00
Weijie Guo
a759efb6dd [CELEBORN-1543] Support Flink 1.20
1.20 was the last non-bug-fix release before Flink 2.0; you can find all the main upgrade features in this [release note](https://nightlies.apache.org/flink/flink-docs-release-1.20/release-notes/flink-1.20/). I think the most important feature related to Celeborn is that we expose some interfaces to support Flink hybrid shuffle integration with Celeborn ([FLIP-459](https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn)). Supporting hybrid shuffle on the Celeborn side will be a follow-up to this PR.

Incompatible changes in 1.20:
- 1.20 uses the enum `CompressionCodec` instead of `String` to construct `BufferDecompressor` and `BufferCompressor`.
- 1.20 introduces a new method (`notifyPartitionRecoveryStarted`) to `JobShuffleContext` in a non-compatible way.

I've already done the adaptation in this PR.

Closes #2662 from reswqa/support-120.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-09 17:05:58 +08:00
Wang, Fei
1515ed38b2 [CELEBORN-1477] Using openapi-generator apache-httpclient library instead of jersey2
### What changes were proposed in this pull request?
We previously used the `jersey2` library for celeborn-openapi-client, and I found that the shaded celeborn-openapi-client lacks some dependencies.
I raised [PR #2640] to fix it, but the transitive dependencies brought in by jersey seem difficult to maintain.

I then received a suggestion from pan to migrate the library from `jersey2` to `apache-httpclient`.

FYI: see https://openapi-generator.tech/docs/generators/java/

<img width="500" alt="image" src="https://github.com/user-attachments/assets/d102a7c9-46cd-4fd7-a2a0-7396a815776d">

To leverage the latest openapi-generator plugin, I upgraded openapi-generator to the latest version, 7.7.0, which requires JDK11+.
Because Celeborn has not dropped Java 8 support so far, I included the generated code in the repo and added a user guide for re-generation.

### Why are the changes needed?

To fix the dependency leak issue and make the dependencies easier to maintain.

### Does this PR introduce _any_ user-facing change?

No, this SDK has not been released, so no user-facing change.

### How was this patch tested?

Testing with sample maven project.

pom.xml:
```
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>test_openapi</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.celeborn</groupId>
            <artifactId>celeborn-openapi-client_2.12</artifactId>
            <version>0.6.0-SNAPSHOT</version>
        </dependency>
    </dependencies>
</project>
```

Testing code:
```
package org.example;

import org.apache.celeborn.rest.v1.master.MasterApi;
import org.apache.celeborn.rest.v1.master.WorkerApi;
import org.apache.celeborn.rest.v1.master.invoker.ApiClient;

public class Main {
    public static void main(String[] args) throws Exception {

        String cmUrl = "http://***:9098";
        MasterApi masterApi  = new MasterApi(new ApiClient().setBasePath(cmUrl));
        System.out.println(masterApi.getMasterGroupInfo().getLeader().getAddress().split(":")[0]);
        WorkerApi workerApi = new WorkerApi(new ApiClient().setBasePath(cmUrl));
        System.out.println(workerApi.getWorkers());
        System.out.println(workerApi.getWorkerEvents());
    }
}
```

```
java -Dfile.encoding=UTF-8 -classpath /Users/fwang12/todo/test_openapi/target/classes:/Users/fwang12/todo/celeborn/openapi/openapi-client/target/celeborn-openapi-client_2.12-0.6.0-SNAPSHOT.jar org.example.Main
```

<img width="1727" alt="image" src="https://github.com/user-attachments/assets/2da8b126-be96-4c37-9a33-ba196024f2ba">

Closes #2641 from turboFei/appache_httpclient.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-31 15:02:41 +08:00
zhaohehuhu
7a596bbed1 [CELEBORN-1469] Support writing shuffle data to OSS(S3 only)
### What changes were proposed in this pull request?

As the title says.

### Why are the changes needed?

Now, Celeborn doesn't support sinking shuffle data directly to Amazon S3, which could be a limitation when we're trying to move on-premises servers to AWS and use S3 as a data sink for shuffled data.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Closes #2579 from zhaohehuhu/dev-0619.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-07-24 11:59:15 +08:00
Wang, Fei
0b8c9fdd4c [CELEBORN-1505] Align the celeborn server jackson dependency versions
### What changes were proposed in this pull request?

Now there are three different jackson versions in the server dependency list.

It is better to align them.

### Why are the changes needed?
To align the dependency versions and reduce the conflicts in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Pass the GA.

Closes #2620 from turboFei/align_jackson.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-15 11:00:23 +08:00
Mridul Muralidharan
17f89c553e [CELEBORN-1504] Support for Apache Flink 1.16
### What changes were proposed in this pull request?

Add support for Apache Flink 1.16 in Celeborn.

### Why are the changes needed?

User requests for Apache Flink 1.16.
This implementation is a synthesis of the 1.15 and 1.17 support that already exists in Apache Celeborn.

### Does this PR introduce _any_ user-facing change?

Yes, it adds support for Apache Flink 1.16.

### How was this patch tested?

Added tests for 1.16, based on the existing 1.15 and 1.17 tests.

Closes #2619 from mridulm/flink-1.16-support.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-15 10:44:16 +08:00
SteNicholas
adbef7b441 [CELEBORN-1499] Bump Ratis version from 3.0.1 to 3.1.0
### What changes were proposed in this pull request?

Bump Ratis version from 3.0.1 to 3.1.0. Also remove `CelebornStateMachineStorage` now that https://github.com/apache/ratis/pull/1111 has been released.

### Why are the changes needed?

Bump Ratis version from 3.0.1 to 3.1.0. Ratis has released v3.1.0; see the [3.1.0](https://ratis.apache.org/post/3.1.0.html) release notes. Version 3.1.0 is a minor release with multiple improvements and bug fixes, including [[RATIS-2111] Reinitialize should load the latest snapshot](https://issues.apache.org/jira/browse/RATIS-2111). See the [changes between 3.0.1 and 3.1.0](https://github.com/apache/ratis/compare/ratis-3.0.1...ratis-3.1.0) releases.

Follow up #2547.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`MasterStateMachineSuiteJ#testInstallSnapshot`

Closes #2610 from SteNicholas/CELEBORN-1499.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-11 16:29:58 +08:00
Fei Wang
d698a69edc [CELEBORN-1477][CIP-9] Refine the celeborn RESTful APIs
### What changes were proposed in this pull request?

This PR is for [CIP-9 Refine the celeborn RESTful APIs](https://docs.google.com/document/d/1LV2vV-w3XtlbJj2Vi4J77mt4IYCr40-8A_JncZLsHqs/edit?usp=sharing).

We leverage [openapi-generator](https://github.com/OpenAPITools/openapi-generator) to generate the client and model code.

### Why are the changes needed?

Celeborn has implemented RESTful APIs for monitoring and administrative operations on both master and worker endpoints. These APIs enable tasks such as configuration checks, status viewing of master/worker nodes, worker decommissioning/recommissioning, and more. They provide crucial insights and support for DevOps.
The primary concern with the existing API is the response content type, which is `text/plain` rather than the more widely accepted `application/json`. This mismatch makes integration with DevOps tools challenging, as these tools typically require JSON-formatted responses for seamless parsing and automation.
I also saw the need for REST API evolution in the [Apache Celeborn CLI Proposal](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-7+Celeborn+CLI).

### Does this PR introduce _any_ user-facing change?
This PR introduces a new API namespace: `/api/v1`. This approach allows us to maintain the current API for compatibility while offering an improved version.
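
As a rough illustration of how a client might address the new namespace, the sketch below builds (but does not send) a GET request that asks for `application/json`. The host `localhost`, the `workers` resource name, and the class and method names are assumptions made for this example only; the port is taken from the testing snippet in the related client PR. It uses the JDK 11+ `java.net.http` API rather than the generated Celeborn client.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class ApiV1RequestSketch {
    // Builds a GET request under the /api/v1 namespace that asks for a JSON response.
    // The resource name passed in is an assumption for illustration only.
    static HttpRequest jsonGet(String baseUrl, String resource) {
        // Drop a trailing slash so the joined URL does not contain "//".
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return HttpRequest.newBuilder(URI.create(base + "/api/v1/" + resource))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint; replace host, port, and resource with real values.
        HttpRequest request = jsonGet("http://localhost:9098", "workers");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().firstValue("Accept").orElse("<none>"));
    }
}
```

The request would be sent with `java.net.http.HttpClient`; the legacy endpoints outside `/api/v1` are described above as remaining available for compatibility.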

### How was this patch tested?
UT.

Closes #2599 from turboFei/cip_9_openapi.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-11 10:57:00 +08:00
SteNicholas
7188e845f7 [CELEBORN-1327][FOLLOWUP] Simplify DirectByteBuffer constructor lookup logic
### What changes were proposed in this pull request?

Simplify `DirectByteBuffer` constructor lookup logic in `Platform`. Meanwhile, bump `commons-lang3` version from `3.12.0` to `3.13.0`.

### Why are the changes needed?

The `try-catch` statement is not needed because the version number is already known.

Backport:

- https://github.com/apache/spark/pull/41780
- https://github.com/apache/spark/pull/42269
- https://github.com/apache/spark/pull/44444
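
A minimal sketch of the idea, not the actual `Platform` code: once the Java major version is known, the constructor signature can be chosen directly instead of trying one signature and catching the failure. It assumes that JDK 21 and later declare `DirectByteBuffer(long, long)` while earlier JDKs declare `DirectByteBuffer(long, int)`; the class and method names are hypothetical.

```java
import java.util.Arrays;

final class DirectBufferCtorSketch {
    // Parses a java.specification.version-style string: "1.8" -> 8, "17" -> 17.
    static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int first = Integer.parseInt(parts[0]);
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    // Assumption: JDK 21+ uses (long, long); earlier JDKs use (long, int).
    // Picking the signature from the version replaces a try-catch fallback.
    static Class<?>[] ctorParameterTypes(int majorVersion) {
        return majorVersion < 21
                ? new Class<?>[] {long.class, int.class}
                : new Class<?>[] {long.class, long.class};
    }

    public static void main(String[] args) {
        int major = majorVersion(System.getProperty("java.specification.version"));
        System.out.println("JDK " + major + " -> " + Arrays.toString(ctorParameterTypes(major)));
    }
}
```

The returned parameter types could then be passed to `Class#getDeclaredConstructor` on the buffer class; that reflective step is left out here because it depends on JVM access flags.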

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2544 from SteNicholas/CELEBORN-1327.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-07 16:23:32 +08:00
SteNicholas
4fc42d7fef [CELEBORN-1389] Bump Dropwizard version from 3.2.6 to 4.2.25
### What changes were proposed in this pull request?

Bump Dropwizard version from 3.2.6 to 4.2.25. Meanwhile, introduce `metrics_jvm_thread_peak_count_Value` and `metrics_jvm_thread_total_started_count_Value` in `celeborn-jvm-dashboard.json`.

### Why are the changes needed?

Dropwizard Metrics has released v4.2.25, which includes bug fixes and improvements such as:

* [JVM] Fix maximum/total memory calculation: https://github.com/dropwizard/metrics/pull/3125
* [Thread] Add peak and total started thread count to `ThreadStatesGaugeSet`: https://github.com/dropwizard/metrics/pull/1601
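
For context, the JDK already exposes peak and total started thread counts through `ThreadMXBean`, which is presumably where such gauges get their values. The sketch below reads them with the standard library only; the map keys and class name are illustrative assumptions, not Dropwizard's metric names.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

final class ThreadCountSketch {
    // Reads live, peak, and total-started thread counts from the JVM.
    // Key names are illustrative and not taken from Dropwizard.
    static Map<String, Long> snapshot() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<String, Long> values = new LinkedHashMap<>();
        values.put("count", (long) threads.getThreadCount());
        values.put("peak.count", (long) threads.getPeakThreadCount());
        values.put("total_started.count", threads.getTotalStartedThreadCount());
        return values;
    }

    public static void main(String[] args) {
        snapshot().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```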

Meanwhile, Ratis has been upgraded to 3.0.1, which has no compatibility problems with Dropwizard 4.2.25.

Backport:

- https://github.com/apache/spark/pull/26332
- https://github.com/apache/spark/pull/29426
- https://github.com/apache/spark/pull/37372

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2540 from SteNicholas/CELEBORN-1389.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-04 19:26:20 +08:00
SteNicholas
e5f09ce4e0 [CELEBORN-1443] Remove ratis dependencies from common module
### What changes were proposed in this pull request?

Remove ratis dependencies from common module.

### Why are the changes needed?

Ratis is only depended on by the master module. Removing ratis dependencies from the common module reduces the size of the Celeborn client package.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2538 from SteNicholas/CELEBORN-1443.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-06-03 10:15:51 +08:00
SteNicholas
2a57fab869 [CELEBORN-1400] Bump Ratis version from 2.5.1 to 3.0.1
### What changes were proposed in this pull request?

Bump Ratis version from 2.5.1 to 3.0.1. Address incompatible changes:

- RATIS-589. Eliminate buffer copying in SegmentedRaftLogOutputStream. (https://github.com/apache/ratis/pull/964)
- RATIS-1677. Do not auto format RaftStorage in RECOVER. (https://github.com/apache/ratis/pull/718)
- RATIS-1710. Refactor metrics api and implementation to separated modules. (https://github.com/apache/ratis/pull/749)

### Why are the changes needed?

Bump Ratis version from 2.5.1 to 3.0.1. Ratis has released v3.0.0 and v3.0.1; see the release notes for [3.0.0](https://ratis.apache.org/post/3.0.0.html) and [3.0.1](https://ratis.apache.org/post/3.0.1.html). The 3.0.x versions include new features such as pluggable metrics and lease read, along with improvements and bug fixes, including:

- 3.0.0: Change list of Ratis 3.0.0. In total, there are roughly 100 commits since 2.5.1, including:
   - Incompatible Changes
      - RaftStorage Auto-Format
         - RATIS-1677. Do not auto format RaftStorage in RECOVER. (https://github.com/apache/ratis/pull/718)
         - RATIS-1694. Fix the compatibility issue of RATIS-1677. (https://github.com/apache/ratis/pull/731)
         - RATIS-1871. Auto format RaftStorage when there is only one directory configured. (https://github.com/apache/ratis/pull/903)
      - Pluggable Ratis-Metrics (RATIS-1688)
         - RATIS-1689. Remove the use of the thirdparty Gauge. (https://github.com/apache/ratis/pull/728)
         - RATIS-1692. Remove the use of the thirdparty Counter. (https://github.com/apache/ratis/pull/732)
         - RATIS-1693. Remove the use of the thirdparty Timer. (https://github.com/apache/ratis/pull/734)
         - RATIS-1703. Move MetricsReporting and JvmMetrics to impl. (https://github.com/apache/ratis/pull/741)
         - RATIS-1704. Fix SuppressWarnings(“VisibilityModifier”) in RatisMetrics. (https://github.com/apache/ratis/pull/742)
         - RATIS-1710. Refactor metrics api and implementation to separated modules. (https://github.com/apache/ratis/pull/749)
         - RATIS-1712. Add a dropwizard 3 implementation of ratis-metrics-api. (https://github.com/apache/ratis/pull/751)
         - RATIS-1391. Update library dropwizard.metrics version to 4.x (https://github.com/apache/ratis/pull/632)
         - RATIS-1601. Use the shaded dropwizard metrics and remove the dependency (https://github.com/apache/ratis/pull/671)
      - Streaming Protocol Change
         - RATIS-1569. Move the asyncRpcApi.sendForward(..) call to the client side. (https://github.com/apache/ratis/pull/635)
   - New Features
      - Leader Lease (RATIS-1864)
         - RATIS-1865. Add leader lease bound ratio configuration (https://github.com/apache/ratis/pull/897)
         - RATIS-1866. Maintain leader lease after AppendEntries (https://github.com/apache/ratis/pull/898)
         - RATIS-1894. Implement ReadOnly based on leader lease (https://github.com/apache/ratis/pull/925)
         - RATIS-1882. Support read-after-write consistency (https://github.com/apache/ratis/pull/913)
      - StateMachine API
         - RATIS-1874. Add notifyLeaderReady function in IStateMachine (https://github.com/apache/ratis/pull/906)
         - RATIS-1897. Make TransactionContext available in DataApi.write(..). (https://github.com/apache/ratis/pull/930)
      - New Configuration Properties
         - RATIS-1862. Add the parameter whether to take Snapshot when stopping to adapt to different services (https://github.com/apache/ratis/pull/896)
         - RATIS-1930. Add a conf for enable/disable majority-add. (https://github.com/apache/ratis/pull/961)
         - RATIS-1918. Introduces parameters that separately control the shutdown of RaftServerProxy by JVMPauseMonitor. (https://github.com/apache/ratis/pull/950)
         - RATIS-1636. Support re-config ratis properties (https://github.com/apache/ratis/pull/800)
         - RATIS-1860. Add ratis-shell cmd to generate a new raft-meta.conf. (https://github.com/apache/ratis/pull/901)
   - Improvements & Bug Fixes
      - Netty
         - RATIS-1898. Netty should use EpollEventLoopGroup by default (https://github.com/apache/ratis/pull/931)
         - RATIS-1899. Use EpollEventLoopGroup for Netty Proxies (https://github.com/apache/ratis/pull/932)
         - RATIS-1921. Shared worker group in WorkerGroupGetter should be closed. (https://github.com/apache/ratis/pull/955)
         - RATIS-1923. Netty: atomic operations require side-effect-free functions. (https://github.com/apache/ratis/pull/956)
      - RaftServer
         - RATIS-1924. Increase the default of raft.server.log.segment.size.max. (https://github.com/apache/ratis/pull/957)
         - RATIS-1892. Unify the lifetime of the RaftServerProxy thread pool (https://github.com/apache/ratis/pull/923)
         - RATIS-1889. NoSuchMethodError: RaftServerMetricsImpl.addNumPendingRequestsGauge (https://github.com/apache/ratis/pull/922)
         - RATIS-761. Handle writeStateMachineData failure in leader. (https://github.com/apache/ratis/pull/927)
         - RATIS-1902. The snapshot index is set incorrectly in InstallSnapshotReplyProto. (https://github.com/apache/ratis/pull/933)
         - RATIS-1912. Fix infinity election when perform membership change. (https://github.com/apache/ratis/pull/954)
         - RATIS-1858. Follower keeps logging first election timeout. (https://github.com/apache/ratis/pull/894)

- 3.0.1: This is a bugfix release. See the [changes between 3.0.0 and 3.0.1](https://github.com/apache/ratis/compare/ratis-3.0.0...ratis-3.0.1) releases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster manual test.

Closes #2480 from SteNicholas/CELEBORN-1400.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-30 17:22:22 +08:00
SteNicholas
bd77f3e22d [CELEBORN-1396] Bump Netty from 4.1.107.Final to 4.1.109.Final
### What changes were proposed in this pull request?

Bump Netty from 4.1.107.Final to 4.1.109.Final.

### Why are the changes needed?

Netty has released v4.1.108.Final and v4.1.109.Final; see the release notes for [4.1.108.Final](https://netty.io/news/2024/03/21/4-1-108-Final.html) and [4.1.109.Final](https://netty.io/news/2024/04/15/4-1-109-Final.html). These versions include bug fixes and improvements such as:

- 4.1.108.Final
  - Epoll: Correctly handle splice tasks when Channel is closed: https://github.com/netty/netty/issues/13848
- 4.1.109.Final
  - Don't send a RST frame when closing the stream in a write future while processing inbound frames: https://github.com/netty/netty/pull/13973
  - Fix DefaultChannelId#asLongText NPE: https://github.com/netty/netty/pull/13971
  - Rewrite ZstdDecoder to remove the need of allocate a huge byte[] internally: https://github.com/netty/netty/pull/13928

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2474 from SteNicholas/CELEBORN-1396.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-22 20:31:29 +08:00
SteNicholas
e890f38656 [CELEBORN-1395] Bump RoaringBitmap version from 1.0.5 to 1.0.6
### What changes were proposed in this pull request?

Bump RoaringBitmap version from 1.0.5 to 1.0.6.

### Why are the changes needed?

RoaringBitmap has released v1.0.6; see the [1.0.6](https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/1.0.6) release notes. This version includes bug fixes and improvements such as:

- Implement BatchIterator's promise to fill the input buffer.
- RoaringBitmap to BitSet/long[]/byte[].

Backport https://github.com/apache/spark/pull/46152. https://github.com/apache/spark/pull/46152#issuecomment-2068727268 mentions that the benchmark performance on JDK 21 is quite good.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2473 from SteNicholas/CELEBORN-1395.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-04-22 20:31:14 +08:00
SteNicholas
3c11e70c37 [CELEBORN-1382] Bump RoaringBitmap version from 0.9.32 to 1.0.5
### What changes were proposed in this pull request?

Bump RoaringBitmap version from 0.9.32 to 1.0.5.

### Why are the changes needed?

RoaringBitmap has released v1.0.5; see the [1.0.5](https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/1.0.5) release notes. This version includes bug fixes and improvements such as:

- Fix roaringbitmap - batchiterator's advanceIfNeeded to handle run lengths of zero.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2454 from SteNicholas/CELEBORN-1382.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-04-12 14:22:57 +08:00
SteNicholas
fa25ba8e1c [CELEBORN-1366] Bump guava from 32.1.3-jre to 33.1.0-jre
### What changes were proposed in this pull request?

Bump guava from 32.1.3-jre to 33.1.0-jre.

### Why are the changes needed?

Guava v33.1.0 has been released; see the [v33.1.0](https://github.com/google/guava/releases/tag/v33.1.0) release notes. v33.1.0 brings bug fixes and optimizations such as:

* cache: Fixed a bug that could cause https://github.com/google/guava/pull/6851#issuecomment-1931276822 for `CacheLoader`/`CacheBuilder`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2439 from SteNicholas/CELEBORN-1366.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-02 16:46:03 +08:00
Fei Wang
adbc77cd4f [CELEBORN-1317] Refine celeborn http server and support swagger ui
### What changes were proposed in this pull request?

Previously, there was no HTTP request spec covering things like query params, HTTP method, and response media type.
In addition, each API needed its own HttpEndpoint class.

In this PR, we refine the code for http service and provide swagger ui.

Note: this PR does not change the original API request and response behavior, including the metrics APIs.

TODO:
1. define DTO
2. http request authentication

<img width="1900" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/7f8c2363-170d-4bdf-b2c9-74260e31d3e5">

<img width="1138" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/3ae6ec8e-00a8-475b-bb37-0329536185f6">

### Why are the changes needed?

To close CELEBORN-1317

### Does this PR introduce _any_ user-facing change?

The API is aligned with the previous behavior.

### How was this patch tested?
UT.

Closes #2371 from turboFei/jetty.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-27 23:18:18 +08:00
zky.zhoukeyong
7af3126c7e Support Spark3.5 with JDK21
### What changes were proposed in this pull request?
Compile Spark-3.5 with
`./build/make-distribution.sh -Pspark-3.5 -Pjdk-21`
or
`./build/make-distribution.sh --sbt-enabled -Pspark-3.5 -Pjdk-21`

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manual tests

Closes #2385 from waitinfuture/1327.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-27 18:42:16 +08:00
SteNicholas
c9b878a2f5 [INFRA] Remove incubator/incubating for graduation
### What changes were proposed in this pull request?

Remove incubator/incubating for graduation including:

- Remove `incubator`/`Incubating`.
- Remove `DISCLAIMER` and corresponding link.
- Update Release scripts and template.

Fix #2415.

### Why are the changes needed?

The ASF board has approved a resolution to graduate Celeborn into a full Top Level Project. To transition from the Apache Incubator to a new TLP, there are a few action items we need to complete.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2421 from SteNicholas/infra-graduation.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-27 13:54:47 +08:00
SteNicholas
73cf1562f7 [CELEBORN-1299] Introduce JVM profiling in Celeborn Worker using async-profiler
### What changes were proposed in this pull request?

Introduce JVM profiling `JVMProfiler` in Celeborn Worker using async-profiler to capture CPU and memory profiles.

### Why are the changes needed?

[async-profiler](https://github.com/async-profiler) is a sampling profiler for any JDK based on the HotSpot JVM that does not suffer from the safepoint bias problem. It has low overhead and doesn’t rely on JVMTI. It avoids the safepoint bias problem by using the `AsyncGetCallTrace` API provided by the HotSpot JVM to profile the Java code paths, and Linux’s perf_events to profile the native code paths. It features HotSpot-specific APIs to collect stack traces and to track memory allocations.
The feature introduces a profiler plugin that does not add any overhead unless enabled and can be configured to accept profiler arguments as a configuration parameter. It should support turning profiling on and off, and includes the jar/binaries needed for profiling.
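
The "no overhead unless enabled" behavior can be pictured with a small gate in front of the profiler. This is only a sketch under assumptions: the class name, the `Consumer<String>` backend standing in for the real profiler binding, and the `start,<options>` command format are all hypothetical and not taken from Celeborn or async-profiler.

```java
import java.util.function.Consumer;

final class GatedProfilerSketch {
    private final boolean enabled;          // would come from a configuration flag
    private final String options;           // profiler arguments passed through as-is
    private final Consumer<String> backend; // hypothetical stand-in for the profiler binding
    private boolean running;

    GatedProfilerSketch(boolean enabled, String options, Consumer<String> backend) {
        this.enabled = enabled;
        this.options = options;
        this.backend = backend;
    }

    // Does nothing and returns false when profiling is disabled or already running.
    boolean start() {
        if (!enabled || running) {
            return false;
        }
        backend.accept("start," + options);
        running = true;
        return true;
    }

    // Only reaches the backend if a profile was actually started.
    boolean stop() {
        if (!running) {
            return false;
        }
        backend.accept("stop," + options);
        running = false;
        return true;
    }

    public static void main(String[] args) {
        GatedProfilerSketch profiler =
                new GatedProfilerSketch(true, "event=cpu", System.out::println);
        profiler.start();
        profiler.stop();
    }
}
```

With `enabled` set to `false`, neither call touches the backend, so a disabled worker never touches the profiler binding.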

Backport [[SPARK-46094] Support Executor JVM Profiling](https://github.com/apache/spark/pull/44021).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Worker cluster test.

Closes #2409 from SteNicholas/CELEBORN-1299.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-25 14:05:50 +08:00