Commit Graph

89 Commits

Author SHA1 Message Date
codenohup
a57238024e [CELEBORN-1801] Remove outdated Flink 1.14 and 1.15
### What changes were proposed in this pull request?
Remove outdated Flink 1.14 and 1.15.

For more information, please see the discussion thread: https://lists.apache.org/thread/njho00zmkjx5qspcrbrkogy8s4zzmwv9

### Why are the changes needed?
Reduce maintenance burden.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Changes can be covered by existing tests.

Closes #3029 from codenohup/remove-flink14and15.

Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-12-30 15:33:44 +08:00
hongguangwei
d0d8edfe22 [CELEBORN-1737] Support build tez client package
### What changes were proposed in this pull request?
Add Tez packaging script.

### Why are the changes needed?
To support building the Tez client package.

### Does this PR introduce _any_ user-facing change?
Yes, this enables Celeborn with Tez support.

### How was this patch tested?
Cluster test.

Closes #3028 from GH-Gloway/1737.

Lead-authored-by: hongguangwei <hongguangwei@bytedance.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-30 11:01:19 +08:00
mingji
fde6365f68 [CELEBORN-1413] Support Spark 4.0
### What changes were proposed in this pull request?
To support Spark 4.0.0 preview.

### Why are the changes needed?
1. Change Scala to 2.13.
2. Introduce a columnar shuffle module for Spark 4.0.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Cluster test.

Closes #2813 from FMX/b1413.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-24 18:12:27 +08:00
SteNicholas
f3dac7e879 [CELEBORN-1712] Bump Netty version from 4.1.109.Final to 4.1.115.Final
### What changes were proposed in this pull request?

Bump Netty version from 4.1.109.Final to 4.1.115.Final.

### Why are the changes needed?

Netty 4.1.115.Final has been released, while Celeborn currently uses 4.1.109.Final. The changes between 4.1.110.Final and 4.1.115.Final are as follows:

- [4.1.110.Final](https://netty.io/news/2024/05/22/4-1-110-Final.html)
- [4.1.111.Final](https://netty.io/news/2024/06/11/4-1-111-Final.html)
- [4.1.112.Final](https://netty.io/news/2024/07/19/4-1-112-Final.html)
- [4.1.113.Final](https://netty.io/news/2024/09/04/4-1-113-Final.html)
- [4.1.114.Final](https://netty.io/news/2024/10/01/4-1-114-Final.html)
- [4.1.115.Final](https://netty.io/news/2024/11/12/4-1-115-Final.html)

Bump https://github.com/apache/spark/pull/46945.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2903 from SteNicholas/CELEBORN-1712.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-17 17:29:07 +08:00
zhaohehuhu
3bf91929b6 [CELEBORN-1746] Reduce the size of aws dependencies
### What changes were proposed in this pull request?
Due to the large size of the AWS cloud vendor's client JARs, this PR keeps only the AWS S3 module, reducing the AWS dependency size from over 296MB to around 2.3MB.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

<img width="2560" alt="Screenshot 2024-11-25 at 16 17 52" src="https://github.com/user-attachments/assets/efebbe7d-73cb-47fb-b7fa-9aae052f744b">
Tested in a lab environment, as shown in the screenshot above.

Closes #2944 from zhaohehuhu/dev-1125.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-28 19:45:01 +08:00
mingji
3590fa778e [CELEBORN-1545] Add Tez plugin skeleton and dag app master
### What changes were proposed in this pull request?
1. Add directories for the Apache Tez framework.
2. Add a CelebornDagAppMaster with a LifecycleManager.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2939 from GH-Gloway/b1545-1.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-22 18:38:25 +08:00
zhaohehuhu
a2d3972318 [CELEBORN-1530] support MPU for S3
### What changes were proposed in this pull request?

as title

### Why are the changes needed?
AWS S3 doesn't support append, so Celeborn had to copy the historical data from S3 to the worker and write it to S3 again, which heavily amplifies the write volume. This PR implements a better solution via multipart upload (MPU) to avoid the copy-and-write.
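
To make the write amplification concrete, here is a minimal sketch that only counts bytes under the two strategies. It is a toy model, not the Celeborn implementation and not the AWS SDK; the class name, method names, and chunk sizes are all hypothetical.

```java
/** Toy model: counts bytes sent to an object store without append under two strategies. */
final class S3WriteAmplificationSketch {

    /** Copy-and-write: every flush re-uploads everything written so far plus the new chunk. */
    static long bytesUploadedWithCopyAndWrite(long[] chunkSizes) {
        long objectSize = 0;
        long uploaded = 0;
        for (long chunk : chunkSizes) {
            objectSize += chunk;    // existing data is read back and merged with the new chunk
            uploaded += objectSize; // the whole object is written again
        }
        return uploaded;
    }

    /** Multipart upload: each flush uploads only its own chunk as one part. */
    static long bytesUploadedWithMultipart(long[] chunkSizes) {
        long uploaded = 0;
        for (long chunk : chunkSizes) {
            uploaded += chunk;      // stands in for uploading one part
        }
        // A real multipart upload ends with a completion call that assembles the parts.
        return uploaded;
    }

    public static void main(String[] args) {
        long[] chunks = {64, 64, 64, 64}; // four flushes of 64 units each (hypothetical sizes)
        System.out.println("copy-and-write: " + bytesUploadedWithCopyAndWrite(chunks));
        System.out.println("multipart:      " + bytesUploadedWithMultipart(chunks));
    }
}
```

In this model the copy-and-write total grows quadratically with the number of flushes, while the multipart total equals the size of the data.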

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

![WechatIMG257](https://github.com/user-attachments/assets/968d9162-e690-4767-8bed-e490e3055753)

I conducted an experiment with a 1GB input dataset to compare the performance of Celeborn using only S3 storage versus using SSD storage. The results showed that Celeborn with SSD storage was approximately three times faster than with only S3 storage.

<img width="1728" alt="Screenshot 2024-11-16 at 13 02 10" src="https://github.com/user-attachments/assets/8f879c47-c01a-4004-9eae-1c266c1f3ef2">

The screenshot above is from my second test, with 5000 mappers and reducers.

Closes #2830 from zhaohehuhu/dev-1021.

Lead-authored-by: zhaohehuhu <luoyedeyi@163.com>
Co-authored-by: He Zhao <luoyedeyi459@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-22 15:03:53 +08:00
SteNicholas
7d1da5e915 [CELEBORN-1702] Bump Ratis version from 3.1.1 to 3.1.2
### What changes were proposed in this pull request?

Bump Ratis version from 3.1.1 to 3.1.2 including:

- Fix NPE in `RaftServerImpl.getLogInfo`: https://github.com/apache/ratis/pull/1171

### Why are the changes needed?

Bump Ratis version from 3.1.1 to 3.1.2. Ratis has released v3.1.2; see the [3.1.2 release notes](https://ratis.apache.org/post/3.1.2.html). The 3.1.2 version is a minor release with multiple improvements and bugfixes, including [[RATIS-2179] Fix NPE in `RaftServerImpl.getLogInfo`](https://issues.apache.org/jira/browse/RATIS-2179). See the [changes between 3.1.1 and 3.1.2](https://github.com/apache/ratis/compare/ratis-3.1.1...ratis-3.1.2) releases.

The 3.1.2 version fixes the following `NullPointerException` seen in the CI log:

```
[info] Test org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterRatisServerSuiteJ.testIsLeader started
24/10/24 08:16:30,295 ERROR [pool-1-thread-1] HARaftServer: Failed to retrieve RaftPeerRole. Setting cached role to UNRECOGNIZED and resetting leader info.
java.io.IOException: java.lang.NullPointerException
    at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:56)
    at org.apache.ratis.server.impl.RaftServerImpl.waitForReply(RaftServerImpl.java:1148)
    at org.apache.ratis.server.impl.RaftServerProxy.getGroupInfo(RaftServerProxy.java:607)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.getGroupInfo(HARaftServer.java:599)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.updateServerRole(HARaftServer.java:514)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HARaftServer.isLeader(HARaftServer.java:489)
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterRatisServerSuiteJ.testIsLeader(MasterRatisServerSuiteJ.java:47)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runners.Suite.runChild(Suite.java:128)
    at org.junit.runners.Suite.runChild(Suite.java:27)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
    at com.novocode.junit.JUnitTask.execute(JUnitTask.java:64)
    at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
    at org.apache.ratis.server.impl.RaftServerImpl.getLogInfo(RaftServerImpl.java:665)
    at org.apache.ratis.server.impl.RaftServerImpl.getGroupInfo(RaftServerImpl.java:658)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$getGroupInfoAsync$23(RaftServerProxy.java:613)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
    at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2897 from SteNicholas/CELEBORN-1702.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 17:15:20 +08:00
Wang, Fei
330b2a094e [CELEBORN-1708] Bump protobuf version from 3.21.7 to 3.25.5
### What changes were proposed in this pull request?

Bump protobuf from 3.21.7 to 3.25.5.

### Why are the changes needed?

To fix CVE: https://github.com/advisories/GHSA-735f-pc8j-v9w8

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

GA.

Closes #2898 from turboFei/bump_protobuf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 17:02:23 +08:00
Wang, Fei
09ffee0365 [CELEBORN-1709] Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826
### What changes were proposed in this pull request?

 Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826

### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-g8m5-722r-8whq

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

Closes #2899 from turboFei/bump_jetty.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 16:58:44 +08:00
Wang, Fei
6d2b9f6d92 [CELEBORN-1710] Bump commons-io version from 2.13.0 to 2.17.0
### What changes were proposed in this pull request?
 Bump commons-io from 2.13.0 to 2.17.0

### Why are the changes needed?

To fix CVE: https://github.com/advisories/GHSA-78wr-2p64-hpwj

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA.

Closes #2900 from turboFei/bump_commons_io.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 16:57:29 +08:00
SteNicholas
651cbebc1a [CELEBORN-1525] Bump Ratis version from 3.1.0 to 3.1.1
### What changes were proposed in this pull request?

Bump Ratis version from 3.1.0 to 3.1.1 including:

- Remove `address2String` and use `setAddress(ratisAddr)` with the release of https://github.com/apache/ratis/pull/1125.
- Support the requirement that `raft.grpc.message.size.max` must be 1m larger than `raft.server.log.appender.buffer.byte-limit`, introduced by https://github.com/apache/ratis/pull/1132.
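
The size relationship between the two properties can be checked up front. The following is a rough sketch, assuming "1m larger" means at least 1 MiB larger; the class, the size parser, and the method names are hypothetical and are not part of Ratis or Celeborn.

```java
/** Hypothetical pre-flight check for the two Ratis size settings named above. */
final class RatisSizeConstraintSketch {
    static final long ONE_MB = 1024L * 1024L;

    /** Parses sizes such as "32m", "32mb" or "1048576" into bytes (illustrative helper). */
    static long parseSizeToBytes(String value) {
        String v = value.trim().toLowerCase();
        if (v.endsWith("b")) {
            v = v.substring(0, v.length() - 1); // accept "32mb" as well as "32m"
        }
        long multiplier = 1L;
        if (v.endsWith("k")) {
            multiplier = 1024L;
        } else if (v.endsWith("m")) {
            multiplier = ONE_MB;
        } else if (v.endsWith("g")) {
            multiplier = 1024L * ONE_MB;
        }
        if (multiplier != 1L) {
            v = v.substring(0, v.length() - 1); // drop the unit suffix
        }
        return Long.parseLong(v.trim()) * multiplier;
    }

    /** Assumption: the gRPC limit must exceed the appender buffer limit by at least 1 MiB. */
    static boolean satisfiesConstraint(String grpcMessageSizeMax, String appenderBufferByteLimit) {
        return parseSizeToBytes(grpcMessageSizeMax)
            >= parseSizeToBytes(appenderBufferByteLimit) + ONE_MB;
    }

    public static void main(String[] args) {
        System.out.println(satisfiesConstraint("33m", "32m")); // gRPC limit is 1 MiB above the buffer limit
        System.out.println(satisfiesConstraint("32m", "32m")); // equal limits leave no headroom
    }
}
```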

### Why are the changes needed?

Bump Ratis version from 3.1.0 to 3.1.1. Ratis has released v3.1.1; see the [3.1.1 release notes](https://ratis.apache.org/post/3.1.1.html). The 3.1.1 version is a minor release with multiple improvements and bugfixes, including [[RATIS-2116] Fix the issue where RaftServerImpl.appendEntries may be blocked indefinitely](https://issues.apache.org/jira/browse/RATIS-2116) and [[RATIS-2131] Configuring Ratis fails when hostname is used, and is an IPv6 host](https://issues.apache.org/jira/browse/RATIS-2131). See the [changes between 3.1.0 and 3.1.1](https://github.com/apache/ratis/compare/ratis-3.1.0...ratis-3.1.1) releases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2759 from SteNicholas/CELEBORN-1525.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2024-09-26 10:45:38 -05:00
sychen
6e071344ba [CELEBORN-1606] Generate dependencies-client-flink-1.16
### What changes were proposed in this pull request?

### Why are the changes needed?
CELEBORN-1504 added support for Flink 1.16, but `dependencies-client-flink-1.16` was not generated, and `dependencies.sh` passes its check when the file does not exist.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2751 from cxzl25/CELEBORN-1606.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-23 20:18:44 +08:00
sychen
8734d16638 [CELEBORN-1605] Bump commons-lang3 version from 3.13.0 to 3.17.0
### What changes were proposed in this pull request?

### Why are the changes needed?
https://commons.apache.org/proper/commons-lang/changes-report.html

https://github.com/apache/celeborn/pull/2544#issuecomment-2349065779

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2750 from cxzl25/CELEBORN-1605.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-20 17:37:31 +08:00
sychen
40f8eccecd [CELEBORN-1604] Bump rocksdbjni version from 8.11.3 to 9.5.2
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/facebook/rocksdb/compare/v8.11.3...v9.5.2

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2749 from cxzl25/CELEBORN-1604.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-20 17:35:42 +08:00
sychen
589100ea91 [CELEBORN-1600] Enable check server dependencies
### What changes were proposed in this pull request?

### Why are the changes needed?
The server module is missing dependency checks.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2742 from cxzl25/check_server_deps.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-20 15:14:56 +08:00
Weijie Guo
a759efb6dd [CELEBORN-1543] Support Flink 1.20
Flink 1.20 is the last non-bug-fix release before Flink 2.0; all the main new features are listed in the [release notes](https://nightlies.apache.org/flink/flink-docs-release-1.20/release-notes/flink-1.20/). I think the most important feature related to Celeborn is that we expose some interfaces to support Flink hybrid shuffle integration with Celeborn ([FLIP-459](https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn)). Supporting hybrid shuffle on the Celeborn side is a follow-up to this PR.

Incompatible changes in 1.20:
- 1.20 uses the enum `CompressionCodec` instead of `String` to construct `BufferDecompressor` and `BufferCompressor`.
- 1.20 introduces a new method (`notifyPartitionRecoveryStarted`) in `JobShuffleContext` in an incompatible way.

I've already done the adaptation in this PR.
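
As an illustration of the first incompatibility, a caller that still holds the codec as a string needs to map it to an enum value. The sketch below uses a local stand-in enum rather than the Flink class; its constants and the `toCodec` helper are assumptions for illustration, not the adaptation made in this PR.

```java
import java.util.Locale;

/** Hypothetical adapter from a configured codec name to an enum constant. */
final class CompressionCodecAdapterSketch {

    /** Local stand-in for Flink's codec enum; the constants listed here are assumptions. */
    enum CompressionCodec { LZ4, LZO, ZSTD }

    /** Maps a case-insensitive codec name to the enum; unknown names throw IllegalArgumentException. */
    static CompressionCodec toCodec(String codecName) {
        return CompressionCodec.valueOf(codecName.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Code written for older versions passed the plain string along;
        // with an enum-based constructor the conversion happens once, up front.
        System.out.println(toCodec("lz4"));
    }
}
```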

Closes #2662 from reswqa/support-120.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-09 17:05:58 +08:00
Wang, Fei
1515ed38b2 [CELEBORN-1477] Using openapi-generator apache-httpclient library instead of jersey2
### What changes were proposed in this pull request?
We previously used the `jersey2` library for celeborn-openapi-client, and I found a problem with missing dependencies in the shaded celeborn-openapi-client.
I raised a [PR #2640] to fix it, but it seems difficult to maintain the transitive dependencies pulled in by jersey.

I then received a suggestion from pan to migrate the library from jersey2 to `apache-httpclient`.

FYI, see https://openapi-generator.tech/docs/generators/java/

<img width="500" alt="image" src="https://github.com/user-attachments/assets/d102a7c9-46cd-4fd7-a2a0-7396a815776d">

To leverage the latest openapi-generator plugin, I upgraded openapi-generator to the latest version, 7.7.0, which requires JDK 11+.
Because Celeborn has not dropped Java 8 support so far, I included the generated code in the repo and added a user guide for regenerating it.

### Why are the changes needed?

To fix the dependency leak issue and make the dependencies easier to maintain.

### Does this PR introduce _any_ user-facing change?

No, this SDK has not been released, so no user-facing change.

### How was this patch tested?

Tested with a sample Maven project.

pom.xml:
```
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>test_openapi</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.celeborn</groupId>
            <artifactId>celeborn-openapi-client_2.12</artifactId>
            <version>0.6.0-SNAPSHOT</version>
        </dependency>
    </dependencies>
</project>
```

Testing code:
```
package org.example;

import org.apache.celeborn.rest.v1.master.MasterApi;
import org.apache.celeborn.rest.v1.master.WorkerApi;
import org.apache.celeborn.rest.v1.master.invoker.ApiClient;

public class Main {
    public static void main(String[] args) throws Exception {
        // Base URL of the Celeborn master REST endpoint (host elided in the original).
        String cmUrl = "http://***:9098";

        // Query the master group info and print the host part of the leader address.
        MasterApi masterApi = new MasterApi(new ApiClient().setBasePath(cmUrl));
        System.out.println(masterApi.getMasterGroupInfo().getLeader().getAddress().split(":")[0]);

        // Print the workers and the worker events reported by the master.
        WorkerApi workerApi = new WorkerApi(new ApiClient().setBasePath(cmUrl));
        System.out.println(workerApi.getWorkers());
        System.out.println(workerApi.getWorkerEvents());
    }
}
```

```
java -Dfile.encoding=UTF-8 -classpath /Users/fwang12/todo/test_openapi/target/classes:/Users/fwang12/todo/celeborn/openapi/openapi-client/target/celeborn-openapi-client_2.12-0.6.0-SNAPSHOT.jar org.example.Main
```

<img width="1727" alt="image" src="https://github.com/user-attachments/assets/2da8b126-be96-4c37-9a33-ba196024f2ba">

Closes #2641 from turboFei/appache_httpclient.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-31 15:02:41 +08:00
zhaohehuhu
7a596bbed1 [CELEBORN-1469] Support writing shuffle data to OSS(S3 only)
### What changes were proposed in this pull request?

as title

### Why are the changes needed?

Currently, Celeborn doesn't support sinking shuffle data directly to Amazon S3, which could be a limitation when moving on-premises servers to AWS and using S3 as a data sink for shuffled data.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Closes #2579 from zhaohehuhu/dev-0619.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-07-24 11:59:15 +08:00
Wang, Fei
0b8c9fdd4c [CELEBORN-1505] Align the celeborn server jackson dependency versions
### What changes were proposed in this pull request?

Now there are three different jackson versions in the server dependency list.

It is better to align them.

### Why are the changes needed?
To align the dependency versions and reduce the conflicts in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Pass the GA.

Closes #2620 from turboFei/align_jackson.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-15 11:00:23 +08:00
Mridul Muralidharan
17f89c553e [CELEBORN-1504] Support for Apache Flink 1.16
### What changes were proposed in this pull request?

Add support for Apache Flink 1.16 in Celeborn.

### Why are the changes needed?

Users have requested support for Apache Flink 1.16.
This implementation is a synthesis of the 1.15 and 1.17 support that already exists in Apache Celeborn.

### Does this PR introduce _any_ user-facing change?

Yes, supports Apache Flink 1.16

### How was this patch tested?

Tests for 1.16 added, which are based on 1.15 and 1.17

Closes #2619 from mridulm/flink-1.16-support.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-15 10:44:16 +08:00
SteNicholas
adbef7b441 [CELEBORN-1499] Bump Ratis version from 3.0.1 to 3.1.0
### What changes were proposed in this pull request?

Bump Ratis version from 3.0.1 to 3.1.0. Meanwhile, remove `CelebornStateMachineStorage` with the release of https://github.com/apache/ratis/pull/1111.

### Why are the changes needed?

Bump Ratis version from 3.0.1 to 3.1.0. Ratis has released v3.1.0; see the [3.1.0 release notes](https://ratis.apache.org/post/3.1.0.html). The 3.1.0 version is a minor release with multiple improvements and bugfixes, including [[RATIS-2111] Reinitialize should load the latest snapshot](https://issues.apache.org/jira/browse/RATIS-2111). See the [changes between 3.0.1 and 3.1.0](https://github.com/apache/ratis/compare/ratis-3.0.1...ratis-3.1.0) releases.

Follow up #2547.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`MasterStateMachineSuiteJ#testInstallSnapshot`

Closes #2610 from SteNicholas/CELEBORN-1499.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-11 16:29:58 +08:00
Fei Wang
d698a69edc [CELEBORN-1477][CIP-9] Refine the celeborn RESTful APIs
### What changes were proposed in this pull request?

This PR is for [CIP-9 Refine the celeborn RESTful APIs](https://docs.google.com/document/d/1LV2vV-w3XtlbJj2Vi4J77mt4IYCr40-8A_JncZLsHqs/edit?usp=sharing).

We leverage [openapi-generator](https://github.com/OpenAPITools/openapi-generator) to generate the client and model code.

### Why are the changes needed?

Celeborn has implemented RESTful APIs for monitoring and administrative operations on both master and worker endpoints. These APIs enable tasks such as configuration checks, status viewing of master/worker nodes, worker decommissioning/recommissioning, and more. They provide crucial insights and support for DevOps.
The primary concern with the existing API is the response content type, which is `text/plain` rather than the more widely accepted `application/json`. This mismatch makes integration with DevOps tools challenging, as these tools typically require JSON-formatted responses for seamless parsing and automation.
I also saw the need for REST API evolution in the [Apache Celeborn CLI Proposal](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-7+Celeborn+CLI).

### Does this PR introduce _any_ user-facing change?
This PR introduces a new API namespace: `/api/v1`. This approach allows us to maintain the current API for compatibility while offering an improved version.
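
For a client, the practical difference is that a request can ask for JSON under the new namespace. Below is a minimal sketch using the JDK's `java.net.http` API (Java 11+); only the `/api/v1` prefix comes from this PR, while the host, the `workers` resource name, and the helper are placeholders assumed for illustration. The request is built but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that builds a JSON GET request under the /api/v1 namespace. */
final class ApiV1RequestSketch {

    /** Builds (but does not send) a GET request that asks for a JSON response. */
    static HttpRequest jsonGet(String baseUrl, String resource) {
        URI uri = URI.create(baseUrl + "/api/v1/" + resource);
        return HttpRequest.newBuilder(uri)
            .header("Accept", "application/json")
            .GET()
            .build();
    }

    public static void main(String[] args) {
        // "localhost" and "workers" are placeholders, not confirmed endpoint details.
        HttpRequest request = jsonGet("http://localhost:9098", "workers");
        System.out.println(request.method() + " " + request.uri());
    }
}
```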

### How was this patch tested?
UT.

Closes #2599 from turboFei/cip_9_openapi.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-11 10:57:00 +08:00
SteNicholas
7188e845f7 [CELEBORN-1327][FOLLOWUP] Simplify DirectByteBuffer constructor lookup logic
### What changes were proposed in this pull request?

Simplify `DirectByteBuffer` constructor lookup logic in `Platform`. Meanwhile, bump `commons-lang3` version from `3.12.0` to `3.13.0`.

### Why are the changes needed?

The `try-catch` statement is not needed because the version number is already known.
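
The idea can be sketched as choosing the constructor signature from the Java major version instead of trying one signature and catching the failure. This is an illustration only: the class and method names are hypothetical, and the version threshold of 21 and the `(long, int)` versus `(long, long)` signatures are assumptions based on the linked Spark backports, not taken from this PR.

```java
/** Hypothetical sketch: pick a constructor signature by version rather than by try-catch. */
final class DirectBufferCtorLookupSketch {

    /** Parses "1.8"-style and "11"/"17"/"21"-style specification versions. */
    static int majorVersion(String specVersion) {
        if (specVersion.startsWith("1.")) {
            return Integer.parseInt(specVersion.substring(2)); // "1.8" -> 8
        }
        return Integer.parseInt(specVersion);
    }

    /** Assumed rule: (long, int) before Java 21, (long, long) from Java 21 on. */
    static Class<?>[] constructorParameterTypes(int major) {
        return major < 21
            ? new Class<?>[] {long.class, int.class}
            : new Class<?>[] {long.class, long.class};
    }

    public static void main(String[] args) {
        int major = majorVersion(System.getProperty("java.specification.version"));
        Class<?>[] types = constructorParameterTypes(major);
        System.out.println("Java " + major + ": (" + types[0] + ", " + types[1] + ")");
    }
}
```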

Backport:

- https://github.com/apache/spark/pull/41780
- https://github.com/apache/spark/pull/42269
- https://github.com/apache/spark/pull/44444

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2544 from SteNicholas/CELEBORN-1327.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-07 16:23:32 +08:00
SteNicholas
4fc42d7fef [CELEBORN-1389] Bump Dropwizard version from 3.2.6 to 4.2.25
### What changes were proposed in this pull request?

Bump Dropwizard version from 3.2.6 to 4.2.25. Meanwhile, introduce `metrics_jvm_thread_peak_count_Value` and `metrics_jvm_thread_total_started_count_Value` in `celeborn-jvm-dashboard.json`.

### Why are the changes needed?

Dropwizard Metrics has released v4.2.25, which contains bugfixes and improvements including:

* [JVM] Fix maximum/total memory calculation: https://github.com/dropwizard/metrics/pull/3125
* [Thread] Add peak and total started thread count to `ThreadStatesGaugeSet`: https://github.com/dropwizard/metrics/pull/1601
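
For orientation, the JVM exposes peak and total started thread counts through the standard `ThreadMXBean`. The sketch below reads those two values directly; that the new gauges are backed by these same calls is an assumption here, and the class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Reads the JVM's peak and total started thread counts via the standard management API. */
final class ThreadCountGaugesSketch {

    /** Peak number of live threads since the JVM started or the peak was reset. */
    static int peakThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getPeakThreadCount();
    }

    /** Total number of threads created and started since the JVM started. */
    static long totalStartedThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getTotalStartedThreadCount();
    }

    public static void main(String[] args) {
        System.out.println("peak thread count:          " + peakThreadCount());
        System.out.println("total started thread count: " + totalStartedThreadCount());
    }
}
```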

Meanwhile, Ratis has been upgraded to 3.0.1, which has no compatibility problems with Dropwizard 4.2.25.

Backport:

- https://github.com/apache/spark/pull/26332
- https://github.com/apache/spark/pull/29426
- https://github.com/apache/spark/pull/37372

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2540 from SteNicholas/CELEBORN-1389.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-04 19:26:20 +08:00
SteNicholas
e5f09ce4e0 [CELEBORN-1443] Remove ratis dependencies from common module
### What changes were proposed in this pull request?

Remove ratis dependencies from common module.

### Why are the changes needed?

Ratis is only depended on by the master module. Removing ratis dependencies from the common module reduces the size of the Celeborn client package.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2538 from SteNicholas/CELEBORN-1443.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-06-03 10:15:51 +08:00
SteNicholas
2a57fab869 [CELEBORN-1400] Bump Ratis version from 2.5.1 to 3.0.1
### What changes were proposed in this pull request?

Bump Ratis version from 2.5.1 to 3.0.1. Address incompatible changes:

- RATIS-589. Eliminate buffer copying in SegmentedRaftLogOutputStream. (https://github.com/apache/ratis/pull/964)
- RATIS-1677. Do not auto format RaftStorage in RECOVER. (https://github.com/apache/ratis/pull/718)
- RATIS-1710. Refactor metrics api and implementation to separated modules. (https://github.com/apache/ratis/pull/749)

### Why are the changes needed?

Bump Ratis version from 2.5.1 to 3.0.1. Ratis has released v3.0.0 and v3.0.1; see the release notes for [3.0.0](https://ratis.apache.org/post/3.0.0.html) and [3.0.1](https://ratis.apache.org/post/3.0.1.html). The 3.0.x versions include new features such as pluggable metrics and lease read, as well as improvements and bugfixes including:

- 3.0.0: Change list of Ratis 3.0.0. In total, there are roughly 100 commits compared with 2.5.1, including:
   - Incompatible Changes
      - RaftStorage Auto-Format
         - RATIS-1677. Do not auto format RaftStorage in RECOVER. (https://github.com/apache/ratis/pull/718)
         - RATIS-1694. Fix the compatibility issue of RATIS-1677. (https://github.com/apache/ratis/pull/731)
         - RATIS-1871. Auto format RaftStorage when there is only one directory configured. (https://github.com/apache/ratis/pull/903)
      - Pluggable Ratis-Metrics (RATIS-1688)
         - RATIS-1689. Remove the use of the thirdparty Gauge. (https://github.com/apache/ratis/pull/728)
         - RATIS-1692. Remove the use of the thirdparty Counter. (https://github.com/apache/ratis/pull/732)
         - RATIS-1693. Remove the use of the thirdparty Timer. (https://github.com/apache/ratis/pull/734)
         - RATIS-1703. Move MetricsReporting and JvmMetrics to impl. (https://github.com/apache/ratis/pull/741)
         - RATIS-1704. Fix SuppressWarnings("VisibilityModifier") in RatisMetrics. (https://github.com/apache/ratis/pull/742)
         - RATIS-1710. Refactor metrics api and implementation to separated modules. (https://github.com/apache/ratis/pull/749)
         - RATIS-1712. Add a dropwizard 3 implementation of ratis-metrics-api. (https://github.com/apache/ratis/pull/751)
         - RATIS-1391. Update library dropwizard.metrics version to 4.x (https://github.com/apache/ratis/pull/632)
         - RATIS-1601. Use the shaded dropwizard metrics and remove the dependency (https://github.com/apache/ratis/pull/671)
      - Streaming Protocol Change
         - RATIS-1569. Move the asyncRpcApi.sendForward(..) call to the client side. (https://github.com/apache/ratis/pull/635)
   - New Features
      - Leader Lease (RATIS-1864)
         - RATIS-1865. Add leader lease bound ratio configuration (https://github.com/apache/ratis/pull/897)
         - RATIS-1866. Maintain leader lease after AppendEntries (https://github.com/apache/ratis/pull/898)
         - RATIS-1894. Implement ReadOnly based on leader lease (https://github.com/apache/ratis/pull/925)
         - RATIS-1882. Support read-after-write consistency (https://github.com/apache/ratis/pull/913)
      - StateMachine API
         - RATIS-1874. Add notifyLeaderReady function in IStateMachine (https://github.com/apache/ratis/pull/906)
         - RATIS-1897. Make TransactionContext available in DataApi.write(..). (https://github.com/apache/ratis/pull/930)
      - New Configuration Properties
         - RATIS-1862. Add the parameter whether to take Snapshot when stopping to adapt to different services (https://github.com/apache/ratis/pull/896)
         - RATIS-1930. Add a conf for enable/disable majority-add. (https://github.com/apache/ratis/pull/961)
         - RATIS-1918. Introduces parameters that separately control the shutdown of RaftServerProxy by JVMPauseMonitor. (https://github.com/apache/ratis/pull/950)
         - RATIS-1636. Support re-config ratis properties (https://github.com/apache/ratis/pull/800)
         - RATIS-1860. Add ratis-shell cmd to generate a new raft-meta.conf. (https://github.com/apache/ratis/pull/901)
   - Improvements & Bug Fixes
      - Netty
         - RATIS-1898. Netty should use EpollEventLoopGroup by default (https://github.com/apache/ratis/pull/931)
         - RATIS-1899. Use EpollEventLoopGroup for Netty Proxies (https://github.com/apache/ratis/pull/932)
         - RATIS-1921. Shared worker group in WorkerGroupGetter should be closed. (https://github.com/apache/ratis/pull/955)
         - RATIS-1923. Netty: atomic operations require side-effect-free functions. (https://github.com/apache/ratis/pull/956)
      - RaftServer
         - RATIS-1924. Increase the default of raft.server.log.segment.size.max. (https://github.com/apache/ratis/pull/957)
         - RATIS-1892. Unify the lifetime of the RaftServerProxy thread pool (https://github.com/apache/ratis/pull/923)
         - RATIS-1889. NoSuchMethodError: RaftServerMetricsImpl.addNumPendingRequestsGauge (https://github.com/apache/ratis/pull/922)
         - RATIS-761. Handle writeStateMachineData failure in leader. (https://github.com/apache/ratis/pull/927)
         - RATIS-1902. The snapshot index is set incorrectly in InstallSnapshotReplyProto. (https://github.com/apache/ratis/pull/933)
         - RATIS-1912. Fix infinity election when perform membership change. (https://github.com/apache/ratis/pull/954)
         - RATIS-1858. Follower keeps logging first election timeout. (https://github.com/apache/ratis/pull/894)

- 3.0.1: This is a bugfix release. See the [changes between 3.0.0 and 3.0.1](https://github.com/apache/ratis/compare/ratis-3.0.0...ratis-3.0.1) releases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster manual test.

Closes #2480 from SteNicholas/CELEBORN-1400.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-30 17:22:22 +08:00
SteNicholas
bd77f3e22d
[CELEBORN-1396] Bump Netty from 4.1.107.Final to 4.1.109.Final
### What changes were proposed in this pull request?

Bump Netty from 4.1.107.Final to 4.1.109.Final.

### Why are the changes needed?

Netty has released v4.1.108.Final and v4.1.109.Final; see the release notes for [4.1.108.Final](https://netty.io/news/2024/03/21/4-1-108-Final.html) and [4.1.109.Final](https://netty.io/news/2024/04/15/4-1-109-Final.html). These versions include bug fixes and improvements such as:

- 4.1.108.Final
  - Epoll: Correctly handle splice tasks when Channel is closed: https://github.com/netty/netty/issues/13848
- 4.1.109.Final
  - Don't send a RST frame when closing the stream in a write future while processing inbound frames: https://github.com/netty/netty/pull/13973
  - Fix DefaultChannelId#asLongText NPE: https://github.com/netty/netty/pull/13971
  - Rewrite ZstdDecoder to remove the need of allocate a huge byte[] internally: https://github.com/netty/netty/pull/13928

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2474 from SteNicholas/CELEBORN-1396.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-22 20:31:29 +08:00
SteNicholas
e890f38656
[CELEBORN-1395] Bump RoaringBitmap version from 1.0.5 to 1.0.6
### What changes were proposed in this pull request?

Bump RoaringBitmap version from 1.0.5 to 1.0.6.

### Why are the changes needed?

RoaringBitmap has released v1.0.6; see the release notes for [1.0.6](https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/1.0.6). This version includes bug fixes and improvements such as:

- Implement BatchIterator's promise to fill the input buffer.
- RoaringBitmap to BitSet/long[]/byte[].

Backport of https://github.com/apache/spark/pull/46152. https://github.com/apache/spark/pull/46152#issuecomment-2068727268 notes that performance in the JDK 21 benchmark test is quite good.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2473 from SteNicholas/CELEBORN-1395.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-04-22 20:31:14 +08:00
SteNicholas
3c11e70c37 [CELEBORN-1382] Bump RoaringBitmap version from 0.9.32 to 1.0.5
### What changes were proposed in this pull request?

Bump RoaringBitmap version from 0.9.32 to 1.0.5.

### Why are the changes needed?

RoaringBitmap has released v1.0.5; see the release notes for [1.0.5](https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/1.0.5). This version includes bug fixes and improvements such as:

- Fix roaringbitmap - batchiterator's advanceIfNeeded to handle run lengths of zero.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2454 from SteNicholas/CELEBORN-1382.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-04-12 14:22:57 +08:00
SteNicholas
fa25ba8e1c
[CELEBORN-1366] Bump guava from 32.1.3-jre to 33.1.0-jre
### What changes were proposed in this pull request?

Bump guava from 32.1.3-jre to 33.1.0-jre.

### Why are the changes needed?

Guava v33.1.0 has been released; see the release notes for [v33.1.0](https://github.com/google/guava/releases/tag/v33.1.0). v33.1.0 brings bug fixes and optimizations such as:

* cache: Fixed a bug affecting `CacheLoader`/`CacheBuilder`; see https://github.com/google/guava/pull/6851#issuecomment-1931276822.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2439 from SteNicholas/CELEBORN-1366.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-02 16:46:03 +08:00
Fei Wang
adbc77cd4f [CELEBORN-1317] Refine celeborn http server and support swagger ui
### What changes were proposed in this pull request?

Previously, there was no HTTP request spec (query parameters, HTTP method, response media type),
and each API required its own HttpEndpoint class.

In this PR, we refine the code of the HTTP service and provide Swagger UI.

Note: this PR does not change the original API request and response behavior, including the metrics APIs.

TODO:
1. define DTO
2. http request authentication

<img width="1900" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/7f8c2363-170d-4bdf-b2c9-74260e31d3e5">

<img width="1138" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/3ae6ec8e-00a8-475b-bb37-0329536185f6">

### Why are the changes needed?

To close CELEBORN-1317

### Does this PR introduce _any_ user-facing change?

No. The API behavior is unchanged.

### How was this patch tested?
UT.

Closes #2371 from turboFei/jetty.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-27 23:18:18 +08:00
zky.zhoukeyong
7af3126c7e Support Spark3.5 with JDK21
### What changes were proposed in this pull request?
Compile Spark-3.5 with
`./build/make-distribution.sh -Pspark-3.5 -Pjdk-21`
or
`./build/make-distribution.sh --sbt-enabled -Pspark-3.5 -Pjdk-21`

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manual tests

Closes #2385 from waitinfuture/1327.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-27 18:42:16 +08:00
SteNicholas
c9b878a2f5
[INFRA] Remove incubator/incubating for graduation
### What changes were proposed in this pull request?

Remove incubator/incubating for graduation including:

- Remove `incubator`/`Incubating`.
- Remove `DISCLAIMER` and corresponding link.
- Update Release scripts and template.

Fix #2415.

### Why are the changes needed?

The ASF board has approved a resolution to graduate Celeborn into a full Top Level Project. To transition from the Apache Incubator to a new TLP, there are a few action items we need to complete.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2421 from SteNicholas/infra-graduation.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-27 13:54:47 +08:00
SteNicholas
73cf1562f7 [CELEBORN-1299] Introduce JVM profiling in Celeborn Worker using async-profiler
### What changes were proposed in this pull request?

Introduce JVM profiling `JVMProfiler` in Celeborn Worker using async-profiler to capture CPU and memory profiles.

### Why are the changes needed?

[async-profiler](https://github.com/async-profiler) is a sampling profiler for any JDK based on the HotSpot JVM that does not suffer from the safepoint bias problem. It has low overhead and doesn’t rely on JVMTI. It avoids the safepoint bias problem by using the `AsyncGetCallTrace` API provided by the HotSpot JVM to profile the Java code paths, and Linux’s perf_events to profile the native code paths. It features HotSpot-specific APIs to collect stack traces and to track memory allocations.
The feature introduces a profiler plugin that adds no overhead unless enabled and accepts profiler arguments as a configuration parameter. It supports turning profiling on and off, and includes the jar/binaries needed for profiling.

Backport [[SPARK-46094] Support Executor JVM Profiling](https://github.com/apache/spark/pull/44021).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Worker cluster test.

Closes #2409 from SteNicholas/CELEBORN-1299.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-25 14:05:50 +08:00
SteNicholas
adaa96fc60 [CELEBORN-1310][FLINK] Support Flink 1.19
### What changes were proposed in this pull request?

Support Flink 1.19.

### Why are the changes needed?

Flink 1.19.0 has been released: [Announcing the Release of Apache Flink 1.19](https://flink.apache.org/2024/03/18/announcing-the-release-of-apache-flink-1.19).

The main changes include:

- `org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel` constructor parameter changes:
   - `consumedSubpartitionIndex` changes to `consumedSubpartitionIndexSet`: [[FLINK-33743][runtime] Support consuming multiple subpartitions on a single channel](https://github.com/apache/flink/pull/23927).
   - adds `partitionRequestListenerTimeout`: [[FLINK-25055][network] Support listen and notify mechanism for partition request](https://github.com/apache/flink/pull/23565).
- `org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor removes parameters `subpartitionIndexRange`, `tieredStorageConsumerClient`, `nettyService` and `tieredStorageConsumerSpecs`: [[FLINK-33743][runtime] Support consuming multiple subpartitions on a single channel](https://github.com/apache/flink/pull/23927).
- Change the default config file to `config.yaml` in `flink-dist`: [[FLINK-33577][dist] Change the default config file to config.yaml in flink-dist](https://github.com/apache/flink/pull/24177).
- `org.apache.flink.configuration.RestartStrategyOptions` uses `org.apache.commons.compress.utils.Sets` of `commons-compress` dependency: [[FLINK-33865][runtime] Adding an ITCase to ensure exponential-delay.attempts-before-reset-backoff works well](https://github.com/apache/flink/pull/23942).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test:

- Flink batch job submission

```
$ ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID 2e9fb659991a9c29d376151783bdf6de
Program execution finished
Job with JobID 2e9fb659991a9c29d376151783bdf6de has finished.
Job Runtime: 1912 ms
```

- Flink batch job execution

![image](https://github.com/apache/incubator-celeborn/assets/10048174/18b60861-cafc-4df3-b94d-93307e728be2)

- Celeborn master log
```
24/03/18 20:52:47,513 INFO [celeborn-dispatcher-42] Master: Offer slots successfully for 1 reducers of 1710766312631-2e9fb659991a9c29d376151783bdf6de-0 on 1 workers.
```

- Celeborn worker log
```
24/03/18 20:52:47,704 INFO [celeborn-dispatcher-1] StorageManager: created file at /Users/nicholas/Software/Celeborn/apache-celeborn-0.5.0-SNAPSHOT/shuffle/celeborn-worker/shuffle_data/1710766312631-2e9fb659991a9c29d376151783bdf6de/0/0-0-0
24/03/18 20:52:47,707 INFO [celeborn-dispatcher-1] Controller: Reserved 1 primary location and 0 replica location for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0
24/03/18 20:52:47,874 INFO [celeborn-dispatcher-2] Controller: Start commitFiles for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0
24/03/18 20:52:47,890 INFO [worker-rpc-async-replier] Controller: CommitFiles for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0 success with 1 committed primary partitions, 0 empty primary partitions, 0 failed primary partitions, 0 committed replica partitions, 0 empty replica partitions, 0 failed replica partitions.
```

Closes #2399 from SteNicholas/CELEBORN-1310.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-20 11:51:23 +08:00
SteNicholas
12c3779805 [CELEBORN-1330] Bump rocksdbjni version from 8.5.3 to 8.11.3
### What changes were proposed in this pull request?

Bump `rocksdbjni` version from 8.5.3 to 8.11.3.

### Why are the changes needed?

The new version brings some bug fixes:

- Fix a corner case with auto_readahead_size where Prev Operation returns NOT SUPPORTED error when scans direction is changed from forward to backward.
- Avoid destroying the periodic task scheduler's default timer in order to prevent static destruction order issues.
- Fix double counting of BYTES_WRITTEN ticker when doing writes with transactions.
- Fix a WRITE_STALL counter that was reporting wrong value in few cases.
- A lookup by MultiGet in a TieredCache that goes to the local flash cache and finishes with very low latency, i.e before the subsequent call to WaitAll, is ignored, resulting in a false negative and a memory leak.
- Fix bug in auto_readahead_size that combined with IndexType::kBinarySearchWithFirstKey + fails or iterator lands at a wrong key
- Fixed some cases in which DB file corruption was detected but ignored on creating a backup with BackupEngine.
- Fix bugs where rocksdb.blobdb.blob.file.synced includes blob files failed to get synced and rocksdb.blobdb.blob.file.bytes.written includes blob bytes failed to get written.
- Fixed a possible memory leak or crash on a failure (such as I/O error) in automatic atomic flush of multiple column families.
- Fixed some cases of in-memory data corruption using mmap reads with BackupEngine, sst_dump, or ldb.
- Fixed issues with experimental preclude_last_level_data_seconds option that could interfere with expected data tiering.
- Fixed the handling of the edge case when all existing blob files become unreferenced. Such files are now correctly deleted.

The full release notes are available at [rocksdbjni releases](https://github.com/facebook/rocksdb/releases).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2389 from SteNicholas/CELEBORN-1330.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-14 18:01:03 +08:00
Fei Wang
1200e97b6c [BUILD] Bump netty version to latest 4.1.107.Final
### What changes were proposed in this pull request?
Update netty to latest version.

### Why are the changes needed?
[Netty 4.1.107.Final](https://netty.io/news/2024/02/13/4-1-107-Final.html) was released two weeks ago and appears to include many useful changes.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #2328 from turboFei/netty_bump.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
2024-02-25 21:55:13 +08:00
Shuang
d89dcf0e06 [CELEBORN-1054] Support db based dynamic config service
### What changes were proposed in this pull request?

Support database based store backend implementation for dynamic configuration management

### Why are the changes needed?

Currently Celeborn provides the `FsConfigServiceImpl` implementation of the dynamic config service, which is based on the file system. We could also support a database-based store backend.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- `ConfigServiceSuiteJ#testDbConfig`

Closes #2273 from RexXiong/CELEBORN-1054.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-02-05 13:23:25 +08:00
tiny-dust
d315ff5055 [CELEBORN-1240] Introduce Husky Configuration to Celeborn Web
![image](https://github.com/apache/incubator-celeborn/assets/49502875/4404770c-c46e-470b-8f5e-c244c6656339)

### What changes were proposed in this pull request?

- Added Husky to enforce code quality with automated tasks during Git events.
- Added lint-staged for optimized linting on staged files before each commit.

### Why are the changes needed?

Enhances code quality.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test.

Closes #2250 from tiny-dust/CELEBORN-1240.

Lead-authored-by: tiny-dust <idioticzhou@foxmail.com>
Co-authored-by: 周顺顺 <idioticzhou@foxmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-01-26 16:23:42 +08:00
pengqli
a808c252ba
[CELEBORN-1184] Update the snakeyaml version from 1.33 to 2.2
### What changes were proposed in this pull request?
Update snakeyaml from 1.33 to 2.2 to reduce direct CVE vulnerabilities.

### Why are the changes needed?
The current snakeyaml version has the following CVE vulnerability:
https://scout.docker.com/vulnerabilities/id/CVE-2022-1471

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Ran `./build/make-distribution.sh` to package, and ran tests locally.

Closes #2170 from dev-lpq/snakeyaml_version.

Authored-by: pengqli <pengqli@cisco.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-12-20 21:23:22 +08:00
pengqli
1037fbf921 [CELEBORN-1173] Upgrade netty version from 4.1.93.Final to 4.1.101.Final
### What changes were proposed in this pull request?
Upgrade netty-all from 4.1.93.Final to 4.1.101.Final to reduce direct CVE vulnerabilities.

### Why are the changes needed?
The current Netty version has the following CVE vulnerabilities:
https://scout.docker.com/vulnerabilities/id/CVE-2023-4586
https://scout.docker.com/vulnerabilities/id/CVE-2023-44487
https://scout.docker.com/vulnerabilities/id/GHSA-xpw8-rcwv-8f8p

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Ran `./build/make-distribution.sh` to package, and ran tests locally.

Closes #2150 from dev-lpq/update_netty_all_version.

Lead-authored-by: pengqli <pengqli@cisco.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-16 14:03:37 +08:00
pengqli
0860553e18 [CELEBORN-1163] Upgrade protobuf from 3.19.2 to 3.21.7
### What changes were proposed in this pull request?
Upgrade protobuf from 3.19.2 to 3.21.7 to reduce direct CVE vulnerabilities.

### Why are the changes needed?

The current protobuf version has the following CVE vulnerabilities:
https://scout.docker.com/vulnerabilities/id/CVE-2022-3510
https://scout.docker.com/vulnerabilities/id/CVE-2022-3509
https://scout.docker.com/vulnerabilities/id/CVE-2021-22570
https://scout.docker.com/vulnerabilities/id/CVE-2021-22569

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Ran `./build/make-distribution.sh` to package, and ran tests locally.

Closes #2142 from dev-lpq/upgrade_protobuf-java_version.

Authored-by: pengqli <pengqli@cisco.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-16 13:58:36 +08:00
sychen
2504b50dd2 [CELEBORN-1170] Upgrade snappy-java from 1.1.8.2 to 1.1.10.5
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/2143

snappy-java 1.1.8.2 has the following CVE vulnerabilities:
https://scout.docker.com/vulnerabilities/id/CVE-2023-43642
https://scout.docker.com/vulnerabilities/id/CVE-2023-34455

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2158 from cxzl25/CELEBORN-1170.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-14 22:28:32 +08:00
pengqli
80458d18fa upgrade snappy-java from 1.1.8.2 to 1.1.10.5
### What changes were proposed in this pull request?
Upgrade snappy-java from 1.1.8.2 to 1.1.10.5 to reduce direct CVE vulnerabilities.

### Why are the changes needed?
snappy-java 1.1.8.2 has the following CVE vulnerabilities:
https://scout.docker.com/vulnerabilities/id/CVE-2023-43642
https://scout.docker.com/vulnerabilities/id/CVE-2023-34455

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Ran `./build/make-distribution.sh` to package, and ran tests locally.

Closes #2143 from dev-lpq/update_snappy_java.

Authored-by: pengqli <pengqli@cisco.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-11 18:38:06 +08:00
qinrui
04a1e90207 [CELEBORN-1122] Metrics supports json format
### What changes were proposed in this pull request?
Some users collect monitoring metrics with tools other than Prometheus. For them, metrics in JSON format are more user-friendly. This PR adds support for JSON-formatted metrics.
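
As a rough illustration of what JSON-formatted metrics could look like, the sketch below renders a map of gauge values as a JSON array. The class `MetricsJsonSketch`, the field names `name` and `value`, and the metric name in the usage note are hypothetical; the actual JSON layout produced by Celeborn is not described here.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: not Celeborn's metrics code or output format.
final class MetricsJsonSketch {
  private MetricsJsonSketch() {}

  /** Renders each gauge as {"name":...,"value":...} inside a JSON array. */
  static String toJson(Map<String, Long> gauges) {
    StringJoiner items = new StringJoiner(",", "[", "]");
    for (Map.Entry<String, Long> e : gauges.entrySet()) {
      items.add("{\"name\":\"" + escape(e.getKey()) + "\",\"value\":" + e.getValue() + "}");
    }
    return items.toString();
  }

  /** Escapes backslashes and double quotes so the metric name stays a valid JSON string. */
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }
}
```

For example, a map holding the single (made-up) gauge `WorkerCount` with value 3 is rendered as one object in a one-element array, which any JSON-aware collector can parse without a Prometheus client.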

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Metrics supports JSON format

### How was this patch tested?
Cluster test.

Closes #2089 from suizhe007/CELEBORN-1122.

Authored-by: qinrui <qr7972@gmail.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-12-06 09:24:28 +08:00
sychen
89b6cac5ab
[CELEBORN-1113] Bump Hadoop client version from 3.2.4 to 3.3.6
### What changes were proposed in this pull request?

### Why are the changes needed?

[[HADOOP-17098](https://issues.apache.org/jira/browse/HADOOP-17098)] Reduce Guava dependency in Hadoop source code

The higher version of hadoop client removes many guava-related methods, which avoids some conflicts on guava.

`hadoop-client-api` 3.3.6
`hadoop-client-runtime` 3.3.6

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2077 from cxzl25/CELEBORN-1113.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-01 15:41:04 +08:00
SteNicholas
4dfcd9b56b [CELEBORN-1092] Introduce JVM monitoring in Celeborn Worker using JVMQuake
### What changes were proposed in this pull request?

Introduce JVM monitoring in Celeborn Worker using JVMQuake to enable early detection of memory management issues and facilitate fast failure.

### Why are the changes needed?

When memory management in a Celeborn Worker gets out of control, we typically use jvmkill as a remedy: it kills the process and generates a heap dump for post-analysis. However, even with jvmkill protection, we may still encounter issues caused by the JVM running low on memory, such as repeated Full GCs that perform no useful work during the pause time. Since the JVM does not exhaust 100% of its resources, jvmkill is not triggered. JVMQuake is therefore introduced to provide more granular monitoring of GC behavior, enabling early detection of memory management issues and facilitating fast failure. It follows the principle of [jvmquake](https://github.com/Netflix-Skunkworks/jvmquake), a JVMTI agent that attaches to your JVM and automatically signals and kills it when the program has become unstable.
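
The GC-monitoring idea above can be sketched as a simple debt counter: time spent in GC adds to the debt, time spent running application code pays it down, and the process is treated as unhealthy once the debt exceeds a threshold. This is a minimal illustration loosely following the approach described by the jvmquake project; the class `GcDebtBucket`, its parameters, and its units are hypothetical and are not Celeborn's `JVMQuake` implementation.

```java
// Hypothetical sketch of a GC "debt" counter; names and units are assumptions.
final class GcDebtBucket {
  private final long thresholdNanos;   // debt above this value signals an unhealthy JVM
  private final double runtimeWeight;  // how much debt one unit of run time pays back
  private long debtNanos = 0L;

  GcDebtBucket(long thresholdNanos, double runtimeWeight) {
    this.thresholdNanos = thresholdNanos;
    this.runtimeWeight = runtimeWeight;
  }

  /**
   * Records one observation window.
   *
   * @param gcNanos  time spent in GC during the window
   * @param runNanos time spent running application code during the window
   * @return true if the accumulated debt now exceeds the threshold
   */
  boolean record(long gcNanos, long runNanos) {
    long credit = (long) (runNanos * runtimeWeight);
    // Debt never drops below zero, so a long healthy period cannot mask a later GC storm.
    debtNanos = Math.max(0L, debtNanos + gcNanos - credit);
    return debtNanos > thresholdNanos;
  }

  long debtNanos() {
    return debtNanos;
  }
}
```

A monitor thread could feed this counter periodically, for instance with GC time derived from `java.lang.management.GarbageCollectorMXBean#getCollectionTime()`, and trigger a heap dump or process exit when `record` returns true. A JVM stuck in back-to-back Full GCs accumulates debt quickly even though it never reaches a hard out-of-memory condition.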

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`JVMQuakeSuite`

Closes #2061 from SteNicholas/CELEBORN-1092.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-11-28 20:45:08 +08:00
Fu Chen
aab073ab16
[CELEBORN-1125] Bump guava from 14.0.1 to 32.1.3-jre
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

- bump guava from 14.0.1 to 32.1.3-jre
- Following https://github.com/apache/spark/pull/26911, remove usages of Guava that no longer work in Guava 27/32 and replace them with workalikes. After this PR, Celeborn no longer relies on a specific version of Guava and is compatible with Guava 14/27/32, so Guava can be pinned to 27 when running MapReduce integration tests.
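
To illustrate what a "workalike" can look like, the sketch below replaces a Guava helper with plain JDK code. Guava's `MoreExecutors.sameThreadExecutor()` was removed in later Guava releases, while its replacement `MoreExecutors.directExecutor()` does not exist in Guava 14, so code that must run against both can avoid Guava here altogether. The class `GuavaFreeExecutors` is hypothetical, and which Guava calls this PR actually replaced is not stated above.

```java
import java.util.concurrent.Executor;

// Hypothetical JDK-only workalike; not taken from the Celeborn code base.
final class GuavaFreeExecutors {
  private GuavaFreeExecutors() {}

  /** Returns an executor that runs each task synchronously on the calling thread. */
  static Executor directExecutor() {
    return Runnable::run;
  }
}
```

Because the helper depends only on `java.util.concurrent.Executor`, it behaves the same regardless of which Guava version ends up on the classpath.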

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #2090 from cfmcgrady/guava-27.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-21 16:18:14 +08:00
sychen
efa22a4936 [CELEBORN-1105][FLINK] Support Flink 1.18
### What changes were proposed in this pull request?

### Why are the changes needed?

```bash
flink-1.18.0
./bin/start-cluster.sh
./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
```

```java
Caused by: java.lang.NoSuchMethodError: org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.<init>(Ljava/lang/String;ILorg/apache/flink/runtime/jobgraph/IntermediateDataSetID;Lorg/apache/flink/runtime/io/network/partition/ResultPartitionType;Lorg/apache/flink/runtime/executiongraph/IndexRange;ILorg/apache/flink/runtime/io/network/partition/PartitionProducerStateProvider;Lorg/apache/flink/util/function/SupplierWithException;Lorg/apache/flink/runtime/io/network/buffer/BufferDecompressor;Lorg/apache/flink/core/memory/MemorySegmentProvider;ILorg/apache/flink/runtime/throughput/ThroughputCalculator;Lorg/apache/flink/runtime/throughput/BufferDebloater;)V
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate$FakedRemoteInputChannel.<init>(RemoteShuffleInputGate.java:225)
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate.getChannel(RemoteShuffleInputGate.java:179)
	at org.apache.flink.runtime.io.network.partition.consumer.InputGate.setChannelStateWriter(InputGate.java:90)
	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setChannelStateWriter(InputGateWithMetrics.java:120)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.injectChannelStateWriterIntoChannels(StreamTask.java:524)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.<init>(StreamTask.java:496)
```

Flink 1.18.0 release
https://flink.apache.org/2023/10/24/announcing-the-release-of-apache-flink-1.18/

Interface `org.apache.flink.runtime.io.network.buffer.Buffer` adds `setRecycler` method.
[[FLINK-32549](https://issues.apache.org/jira/browse/FLINK-32549)][network] Tiered storage memory manager supports ownership transfer for buffers

`org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor adds parameters.
[[FLINK-31638](https://issues.apache.org/jira/browse/FLINK-31638)][network] Introduce the TieredStorageConsumerClient to SingleInputGate
[[FLINK-31642](https://issues.apache.org/jira/browse/FLINK-31642)][network] Introduce the MemoryTierConsumerAgent to TieredStorageConsumerClient

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```bash
flink-1.18.0 ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID d7fc5f0ca018a54e9453c4d35f7c598a
Program execution finished
Job with JobID d7fc5f0ca018a54e9453c4d35f7c598a has finished.
Job Runtime: 1635 ms
```

<img width="1297" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/6a5266bf-2386-4386-b98b-a60d2570fa99">

Closes #2063 from cxzl25/CELEBORN-1105.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-06 15:53:39 +08:00