Commit Graph

21 Commits

Author SHA1 Message Date
Jray
cfb490c938 [CELEBORN-2090] Support Lz4 Decompression in CppClient
### What changes were proposed in this pull request?
This PR adds support for lz4 decompression in CppClient.
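
For readers unfamiliar with the format, the core of LZ4 block decompression can be sketched as follows. This is a minimal illustration in Java, not the CppClient code from this PR: the class name `Lz4BlockSketch` is hypothetical, it handles only a raw LZ4 block (no Celeborn-specific header or checksum), and it assumes the caller already knows the decompressed size.

```java
public final class Lz4BlockSketch {
    private Lz4BlockSketch() {}

    /** Decompresses one raw LZ4 block into a buffer of the given, known size. */
    public static byte[] decompress(byte[] src, int decompressedSize) {
        byte[] dst = new byte[decompressedSize];
        int in = 0;
        int out = 0;
        while (in < src.length) {
            int token = src[in++] & 0xFF;

            // High nibble: literal length, extended by extra bytes while they equal 255.
            int literalLen = token >>> 4;
            if (literalLen == 15) {
                int extra;
                do {
                    extra = src[in++] & 0xFF;
                    literalLen += extra;
                } while (extra == 255);
            }
            System.arraycopy(src, in, dst, out, literalLen);
            in += literalLen;
            out += literalLen;

            // The final sequence of a block carries literals only.
            if (in >= src.length) {
                break;
            }

            // Two-byte little-endian offset pointing back into the output.
            int offset = (src[in] & 0xFF) | ((src[in + 1] & 0xFF) << 8);
            in += 2;
            if (offset == 0 || offset > out) {
                throw new IllegalArgumentException("invalid match offset: " + offset);
            }

            // Low nibble: match length minus 4, extended the same way as literals.
            int matchLen = token & 0x0F;
            if (matchLen == 15) {
                int extra;
                do {
                    extra = src[in++] & 0xFF;
                    matchLen += extra;
                } while (extra == 255);
            }
            matchLen += 4;

            // Copy byte by byte because the match may overlap the bytes being written.
            for (int i = 0; i < matchLen; i++) {
                dst[out] = dst[out - offset];
                out++;
            }
        }
        if (out != decompressedSize) {
            throw new IllegalArgumentException("unexpected decompressed size: " + out);
        }
        return dst;
    }
}
```

Malformed input surfaces here as an `IndexOutOfBoundsException` or `IllegalArgumentException`; a production decoder would validate bounds explicitly.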

### Why are the changes needed?
To support reading from Celeborn with CppClient.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By compilation and UTs.

Closes #3402 from Jraaay/feat/cpp_client_lz4_decompression.

Authored-by: Jray <1075860716@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-08 18:19:48 +08:00
Wang, Fei
5e12b7d607 [CELEBORN-1921] Broadcast large GetReducerFileGroupResponse to prevent Spark driver network exhausted
### What changes were proposed in this pull request?

For Spark applications using Celeborn, if the GetReducerFileGroupResponse is larger than the threshold, the Spark driver broadcasts it to the executors. This prevents the driver from becoming the bottleneck when sending out multiple copies of the GetReducerFileGroupResponse (one per executor).
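
The size-based decision described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the class, enum, and parameter names are hypothetical, and the estimate ignores the cost of seeding the broadcast itself.

```java
public final class BroadcastThresholdSketch {
    private BroadcastThresholdSketch() {}

    /** How the driver delivers a response to the executors (hypothetical names). */
    public enum Delivery { DIRECT, BROADCAST }

    /** Broadcast only when the feature is enabled and the payload exceeds the threshold. */
    public static Delivery chooseDelivery(
            boolean broadcastEnabled, long serializedSizeBytes, long thresholdBytes) {
        return broadcastEnabled && serializedSizeBytes > thresholdBytes
                ? Delivery.BROADCAST
                : Delivery.DIRECT;
    }

    /**
     * Rough estimate of the bytes the driver sends in replies to all executors.
     * With broadcast, each reply carries only a small reference instead of the payload.
     */
    public static long estimateReplyBytes(
            Delivery delivery, long serializedSizeBytes, long referenceSizeBytes, int numExecutors) {
        long perReply = delivery == Delivery.BROADCAST ? referenceSizeBytes : serializedSizeBytes;
        return perReply * numExecutors;
    }
}
```

For example, under these assumptions a 10 MiB response sent directly to 1,000 executors costs the driver about 10 GiB of replies, while 1 KiB references cost about 1 MiB.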

### Why are the changes needed?
To prevent the driver from being the bottleneck in sending out multiple copies of the GetReducerFileGroupResponse (one per executor).

### Does this PR introduce _any_ user-facing change?
No, the feature is not enabled by default.

### How was this patch tested?

UT.

Cluster testing with `spark.celeborn.client.spark.shuffle.getReducerFileGroup.broadcast.enabled=true`.

The broadcast response size should always be about 1 KB.
![image](https://github.com/user-attachments/assets/d5d1b751-762d-43c8-8a84-0674630a5638)
![image](https://github.com/user-attachments/assets/4841a29e-5d11-4932-9fa5-f6e78b7bc521)
Application succeeded.
![image](https://github.com/user-attachments/assets/9b570f70-1433-4457-90ae-b8292e5476ba)

Closes #3158 from turboFei/broadcast_rgf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-04-01 08:29:21 -07:00
HolyLow
e496a3cfae [CELEBORN-1785][CIP-14] Add baseConf to cppClient
### What changes were proposed in this pull request?
Add baseConf to cppClient, which is the building block of conf module.

### Why are the changes needed?
To support CelebornCpp configuration module in cppClient.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Compilation and UTs.

Closes #3013 from HolyLow/issue/celeborn-1785-add-base-conf-to-cppClient.

Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-23 16:45:02 +08:00
HolyLow
b2b9a0ab4b [CELEBORN-1754][CIP-14] Add exceptions and checking utils to cppClient
### What changes were proposed in this pull request?
This PR adds exceptions and checking utils code to CppClient.
Besides, the ctest framework is added to CppClient for UTs.

### Why are the changes needed?
To provide exception utils and a UT framework to CppClient.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Compilation and UTs.

Closes #2966 from HolyLow/issue/celeborn-1754-add-exceptions-utils-to-cppClient.

Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-04 14:05:38 +08:00
HolyLow
6a0f763e23 [CELEBORN-1751][CIP-14] Add celebornException utils to cppClient
### What changes were proposed in this pull request?
This PR adds CelebornException utils code to CppClient.

### Why are the changes needed?
To provide CelebornException utils.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Compilation.

Closes #2958 from HolyLow/issue/celeborn-1751-add-celeborn-exception-utils-to-cppClient.

Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-28 11:10:58 +08:00
HolyLow
1aefd8f42e [CELEBORN-1740][CIP-14] Add stackTrace utils to cppClient
### What changes were proposed in this pull request?
This PR adds StackTrace utils code to CppClient.

### Why are the changes needed?
To provide StackTrace utils.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Compilation.

Closes #2951 from HolyLow/issue/celeborn-1740-add-stacktrace-utils-to-cppClient.

Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-27 14:21:51 +08:00
HolyLow
77c7a8b91d [CELEBORN-1741][CIP-14] Add processBase utils to cppClient
### What changes were proposed in this pull request?
This PR adds CMakeList structure and ProcessBase utils code to CppClient.

### Why are the changes needed?
To organize the compiling structure and to provide ProcessBase utils.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Compilation.

Closes #2940 from HolyLow/issue/celeborn-1741-add-processbase-utils-to-cppClient.

Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-26 13:38:16 +08:00
SteNicholas
adbef7b441 [CELEBORN-1499] Bump Ratis version from 3.0.1 to 3.1.0
### What changes were proposed in this pull request?

Bump Ratis version from 3.0.1 to 3.1.0. Meanwhile, remove `CelebornStateMachineStorage` now that https://github.com/apache/ratis/pull/1111 has been released.

### Why are the changes needed?

Bump Ratis version from 3.0.1 to 3.1.0. Ratis has released v3.1.0; see the release notes for [3.1.0](https://ratis.apache.org/post/3.1.0.html). Version 3.1.0 is a minor release with multiple improvements and bug fixes, including [[RATIS-2111] Reinitialize should load the latest snapshot](https://issues.apache.org/jira/browse/RATIS-2111). See the [changes between 3.0.1 and 3.1.0](https://github.com/apache/ratis/compare/ratis-3.0.1...ratis-3.1.0) releases.

Follow up #2547.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`MasterStateMachineSuiteJ#testInstallSnapshot`

Closes #2610 from SteNicholas/CELEBORN-1499.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-11 16:29:58 +08:00
Xianming Lei
cb30e911e5 [CELEBORN-1452] Master follower node metadata is out of sync after installing snapshot
### What changes were proposed in this pull request?
Fix the issue where Master follower node metadata is out of sync after installing a snapshot.

### Why are the changes needed?
If follower node metadata is out of sync, a master-slave switchover poses a major risk to the stability of the cluster.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

Closes #2547 from leixm/issue_1452.

Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-06-13 17:09:12 +08:00
SteNicholas
7fa1d32a98 [CELEBORN-1374] Refactor SortBuffer and PartitionSortedBuffer
### What changes were proposed in this pull request?

Refactor `SortBuffer` and `PartitionSortedBuffer` with introduction of `DataBuffer` and `SortBasedDataBuffer`.

### Why are the changes needed?

`SortBuffer` and `PartitionSortedBuffer` were refactored in https://github.com/apache/flink/pull/18505. Celeborn Flink should also refactor `SortBuffer` and `PartitionSortedBuffer` to stay in sync with the interface changes in Flink. In addition, `SortBuffer` and `PartitionSortedBuffer` should distinguish between channel and subpartition, following https://github.com/apache/flink/pull/23927.

### Does this PR introduce _any_ user-facing change?

- `SortBuffer` is renamed to `DataBuffer`.
- `PartitionSortedBuffer` is renamed to `SortBasedDataBuffer`.
- `SortBuffer.BufferWithChannel` is renamed to `BufferWithSubpartition`.

### How was this patch tested?

UT and IT.

Closes #2448 from SteNicholas/CELEBORN-1374.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-04-09 15:47:57 +08:00
Mridul Muralidharan
b1f8ec8357 [CELEBORN-1351] Introduce SSLFactory and enable TLS support
### What changes were proposed in this pull request?

Add SSLFactory, and wire up TLS support with the rest of Celeborn to enable secure over-the-wire communication.
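
As a rough illustration of what a factory for TLS engines does, the sketch below builds an `SSLEngine` using only the JDK's `javax.net.ssl` API. It is not Celeborn's `SSLFactory`: the class name `TlsEngineSketch` is hypothetical, and it assumes the JVM default key and trust managers rather than configured key stores.

```java
import java.security.GeneralSecurityException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public final class TlsEngineSketch {
    private TlsEngineSketch() {}

    /** Creates a TLS engine in client or server mode, using JVM default key and trust material. */
    public static SSLEngine newEngine(boolean clientMode) {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            // Passing nulls selects the default key managers, trust managers, and random source.
            context.init(null, null, null);
            SSLEngine engine = context.createSSLEngine();
            engine.setUseClientMode(clientMode);
            return engine;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Unable to create TLS engine", e);
        }
    }
}
```

In a Netty-based transport, an engine like this would typically be wrapped in an SSL handler at the front of the channel pipeline; that wiring is omitted here.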

### Why are the changes needed?
Add support for TLS to secure wire communication.
This is the last PR to add basic support for TLS.
There will be a follow-up for CELEBORN-1356, and documentation of course!

### Does this PR introduce _any_ user-facing change?
Yes, completes basic support for TLS in Celeborn.

### How was this patch tested?
Existing tests, augmented with additional unit tests.

Closes #2438 from mridulm/add-sslfactory-and-related-changes.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-04-08 10:42:29 +08:00
Mridul Muralidharan
3ff8812cdd [CELEBORN-1348] Update infrastructure for SSL communication
### What changes were proposed in this pull request?

Update infrastructure for SSL support.
Please see #2416 for the consolidated PR with all the changes for reference.

### Why are the changes needed?

At a high level, the changes are:
* `ManagedBuffer.convertToNettyForSsl`, to support SSL encryption.
* Add `EncryptedMessageWithHeader`, which is used to wrap the message and body, for use with SSL.
* `SslMessageEncoder` is an encoder for SSL.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

The overall PR #2416 (and this PR as well) passes all tests, and this PR includes a relevant subset of tests.

Closes #2427 from mridulm/update-infra-for-ssl.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-04-01 19:59:44 +08:00
SteNicholas
6fdeced158 [CELEBORN-1359] Support Netty Logging at the network layer
### What changes were proposed in this pull request?

Support Netty level logging at the network layer for Celeborn. To configure Netty level logging a LogHandler must be added to the channel pipeline. `NettyLogger` is introduced as a new class which is able to construct a log handler depending on the log level:

- In case of `<Logger name="org.apache.celeborn.common.network.util.NettyLogger" level="DEBUG" additivity="false">`: a custom log handler is created which does not dump the message contents. This way the log is a bit more compact. Moreover when network level encryption is switched on this level might be sufficient.
- In case of `<Logger name="org.apache.celeborn.common.network.util.NettyLogger" level="TRACE" additivity="false">`: Netty's own log handler is used which dumps the message contents.
- Otherwise (when the logger is not `TRACE` or `DEBUG`) the pipeline does not contain a log handler (there is no runtime penalty for the default setting but a long running service must be restarted along with the new log level to have an effect).
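
The level-based selection in the list above can be sketched as follows. This is a simplified illustration without a Netty dependency, not the `NettyLogger` class itself: `LogHandlerSelectionSketch`, `Level`, and `HandlerKind` are hypothetical names standing in for the real logger level and handler types.

```java
import java.util.Optional;

public final class LogHandlerSelectionSketch {
    private LogHandlerSelectionSketch() {}

    /** Effective level of the logger (hypothetical stand-in for the logging framework's level). */
    public enum Level { TRACE, DEBUG, INFO, WARN, ERROR }

    /** Which kind of handler would be added to the channel pipeline. */
    public enum HandlerKind {
        /** Netty's own log handler, which dumps message contents. */
        FULL_DUMP,
        /** A custom handler that logs events without message contents. */
        NO_CONTENT
    }

    /** Returns the handler to install, or empty when no handler should be added at all. */
    public static Optional<HandlerKind> selectHandler(Level level) {
        switch (level) {
            case TRACE:
                return Optional.of(HandlerKind.FULL_DUMP);
            case DEBUG:
                return Optional.of(HandlerKind.NO_CONTENT);
            default:
                // No handler in the pipeline, so the default setting has no runtime cost.
                return Optional.empty();
        }
    }
}
```

Because the choice is made when the pipeline is built, changing the level later has no effect on existing pipelines, which matches the restart requirement noted above.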

Backport:

- [[SPARK-36719][CORE] Supporting Netty Logging at the network layer](https://github.com/apache/spark/pull/33962)
- [[SPARK-45377][CORE] Handle InputStream in NettyLogger](https://github.com/apache/spark/pull/43165)

### Why are the changes needed?

This level of logging proved sufficient when debugging some external shuffle related problems. Compared with tcpdump, these log lines can be more easily correlated with Celeborn internal calls. Moreover, the log layout can be configured to contain thread names, so that a busy thread can be identified when a timeout occurs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual local testing.

Closes #2423 from SteNicholas/CELEBORN-1359.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-28 16:11:37 +08:00
Mridul Muralidharan
b14254be9a
[CELEBORN-1349] Add SSL related configs and support for ReloadingX509TrustManager
### What changes were proposed in this pull request?
Add SSL related configs and support for `ReloadingX509TrustManager`, required for enabling SSL support.
Please see #2416 for the consolidated PR with all the changes for reference.

### Why are the changes needed?
Introduces SSL related configs for enabling and configuring use of TLS.

### Does this PR introduce _any_ user-facing change?
Yes, introduces configs to control the behavior of SSL.

### How was this patch tested?
The overall PR #2411 (and this PR as well) passes all tests; this PR specifically pulls out the `ReloadingX509TrustManager` and config related changes.

Closes #2419 from mridulm/config-for-ssl.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-27 18:21:14 +08:00
sychen
b3eed34b57
[CELEBORN-1293] Output received signals at master and worker
### What changes were proposed in this pull request?
When the master or worker is shut down, log the received signal as a record.
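
One way to log a received signal on the JVM is sketched below. It is an assumption-laden illustration rather than the code from this PR: it relies on the unsupported `sun.misc.Signal` API, the class name `SignalLoggingSketch` is hypothetical, and it writes to standard error instead of the project's logger.

```java
import java.util.concurrent.atomic.AtomicReference;
import sun.misc.Signal;
import sun.misc.SignalHandler;

public final class SignalLoggingSketch {
    private SignalLoggingSketch() {}

    /** Formats the record written when a signal arrives. */
    public static String message(String signalName) {
        return "RECEIVED SIGNAL " + signalName;
    }

    /** Logs the named signal (for example "TERM") and then defers to the previous handler. */
    public static void register(String signalName) {
        AtomicReference<SignalHandler> previous = new AtomicReference<>();
        SignalHandler logging = received -> {
            System.err.println(message(received.getName()));
            SignalHandler next = previous.get();
            // Chain to the handler that was installed before, so normal shutdown still runs.
            // The built-in default and ignore handlers cannot be invoked directly.
            if (next != null && next != SignalHandler.SIG_DFL && next != SignalHandler.SIG_IGN) {
                next.handle(received);
            }
        };
        previous.set(Signal.handle(new Signal(signalName), logging));
    }
}
```

Calling `SignalLoggingSketch.register("TERM")` early in startup would print a record when the process is stopped with SIGTERM.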

### Why are the changes needed?
Conveniently track the status of master and workers.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
local test

```bash
./sbin/stop-all.sh
```

```
12:20:59.932 [SIGTERM handler] ERROR org.apache.celeborn.service.deploy.master.Master - RECEIVED SIGNAL TERM
```

```
12:20:59.563 [SIGTERM handler] ERROR org.apache.celeborn.service.deploy.worker.Worker - RECEIVED SIGNAL TERM
```

Closes #2334 from cxzl25/CELEBORN-1293.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-08 15:48:57 +08:00
mingji
fd944b2509
[CELEBORN-1250][FOLLOWUP] Fix license issues
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Fix license issues for the main branch

cherry-pick https://github.com/apache/incubator-celeborn/pull/2259 and https://github.com/apache/incubator-celeborn/pull/2268 into the main branch.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #2271 from cfmcgrady/license-main.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-30 16:45:21 +08:00
Cheng Pan
bb86074163
[CELEBORN-1202][FOLLOWUP] Update LICENSE and NOTICE files
### What changes were proposed in this pull request?

Update LICENSE and NOTICE files according to the mailing list comments.

### Why are the changes needed?

https://lists.apache.org/thread/zw5cw621dqgbktdolx7qynho0zt451pk

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Review.

Closes #2213 from pan3793/CELEBORN-1202-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-01-10 19:26:54 +08:00
mingji
b4b86848e3
[CELEBORN-1202] LICENSE mentions third-party components under other open source licenses
### What changes were proposed in this pull request?

`LICENSE` mentions third-party components under other open source licenses like Apache Spark etc.

### Why are the changes needed?

`LICENSE` mentions one third-party file from Guava. However, the `NOTICE` lists both Apache Spark and Apache Flink. `LICENSE` should mention all third-party components under other open source licenses.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2193 from SteNicholas/CELEBORN-1202.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-29 11:35:50 +08:00
Cheng Pan
78553f1418
[CELEBORN-1003] Correct the LICENSE and NOTICE for shaded client jars
### What changes were proposed in this pull request?

Correct the `LICENSE` and `NOTICE` for the following shaded client jars

- `celeborn-client-flink-1.14-shaded_2.12-<version>.jar`
- `celeborn-client-flink-1.15-shaded_2.12-<version>.jar`
- `celeborn-client-flink-1.17-shaded_2.12-<version>.jar`
- `celeborn-client-mr-shaded_2.12-<version>.jar`
- `celeborn-client-spark-2-shaded_2.11-<version>.jar`
- `celeborn-client-spark-3-shaded_2.12-<version>.jar`

### Why are the changes needed?

The `LICENSE` and `NOTICE` shipped in a jar should match the content of the jar. For shaded jars, they should acknowledge all the third-party classes that are bundled.

See more discussion at https://lists.apache.org/thread/8v4wy5o132rpsjync6465zztgjlf6h5p

To determine which third-party jars are bundled, take `celeborn-client-spark-3-shaded_2.12-<version>.jar` as an example: the following command performs the packaging, and the bundled jars can be found by looking for log lines like `Including ... in the shaded jar`.

```
build/mvn clean package -DskipTests -pl :celeborn-client-spark-3-shaded_2.12 -am -Pspark-3.3
```

```
[INFO] --- maven-shade-plugin:3.4.0:shade (default) @ celeborn-client-spark-3-shaded_2.12 ---
[INFO] Including org.apache.celeborn:celeborn-client-spark-3_2.12:jar:0.4.0-SNAPSHOT in the shaded jar.
[INFO] Including org.apache.celeborn:celeborn-common_2.12:jar:0.4.0-SNAPSHOT in the shaded jar.
[INFO] Including org.apache.commons:commons-lang3:jar:3.12.0 in the shaded jar.
[INFO] Including io.netty:netty-all:jar:4.1.93.Final in the shaded jar.
[INFO] Including io.netty:netty-buffer:jar:4.1.93.Final in the shaded jar.
...
[INFO] Excluding org.apache.ratis:ratis-common:jar:2.5.1 from the shaded jar.
[INFO] Excluding org.apache.ratis:ratis-thirdparty-misc:jar:1.0.4 from the shaded jar.
[INFO] Excluding org.apache.ratis:ratis-proto:jar:2.5.1 from the shaded jar.
...
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.

Closes #1933 from pan3793/CELEBORN-1003.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 19:23:54 +08:00
Ethan Feng
a43e3141bc
[CELEBORN-224][FOLLOWUP] Correct license and notices. (#1189)
2023-02-02 10:52:11 +08:00
Alibaba OSS
0d29f88ada
Initial commit
2021-12-10 16:57:16 +08:00