Commit Graph

2151 Commits

Author SHA1 Message Date
SteNicholas
4ccb0c7fce [MINOR] Rename org.apache.celeborn.plugin.flink.readclient to org.apache.celeborn.plugin.flink.client
### What changes were proposed in this pull request?

Rename `org.apache.celeborn.plugin.flink.readclient` to `org.apache.celeborn.plugin.flink.client`.

### Why are the changes needed?

`FlinkShuffleClientImpl` is designed to write and read shuffle data including pushing and fetching shuffle data. Therefore, the package name of `FlinkShuffleClientImpl` should use `org.apache.celeborn.plugin.flink.client`

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3048 from SteNicholas/shuffle-client-package.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
2025-01-03 20:53:54 +08:00
HolyLow
8b096ea879 [CELEBORN-1814][CIP-14] Add transportMessage to cppClient
### What changes were proposed in this pull request?
Add transportMessage to cppClient.

### Why are the changes needed?
TransportMessage is the building block of controlMessages.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Compilation and UTs.

Closes #3042 from HolyLow/issue/celeborn-1814-add-transport-message-to-cppClient.

Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-01-03 16:18:50 +08:00
Xianming Lei
0c3ceeb0a5 [CELEBORN-1791] All NettyMemoryMetrics should register to source
### What changes were proposed in this pull request?
StorageManager and ReadBufferDispacther does not register netty metrics.

### Why are the changes needed?
All NettyMemoryMetrics should register to source

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #3016 from leixm/CELEBORN-1791.

Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-03 11:28:57 +08:00
mingji
5d2831bbad [CELEBORN-1816] Bump scala-maven-plugin to avoid compilation loop
### What changes were proposed in this pull request?
To update zinc to fix an issue that may cause the compilation process to keep compiling the project.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and manual tests on Mac, and Ubuntu nodes.

Closes #3045 from FMX/b1816.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-03 09:27:47 +08:00
Wang, Fei
b7e3eaa46d [CELEBORN-1477][FOLLOWUP] Minor fix for v1 RESTful apis before release
### What changes were proposed in this pull request?

Minor fix the v1 RESTful apis before 0.6.0 release.

1. update  the API description to use UPPER case worker EventType
2. `subResourceConsumption`  => `subResourceConsumptions`.

### Why are the changes needed?
1. With https://github.com/apache/celeborn/pull/2754, the openapi-sdk works well. but for the RESTful call without SDK, the worker eventType is still case sensitive, might be caused by the jersey issue mentioned in https://github.com/eclipse-ee4j/jersey/issues/5288. So, In this PR, I change the description in the swagger for user guidance.
<img width="1524" alt="image" src="https://github.com/user-attachments/assets/70e4f239-dc36-47bc-902e-5340986f014a" />

2. rename `subResourceConsumption`  to `subResourceConsumptions`.

### Does this PR introduce _any_ user-facing change?
No, the api has not been released.

### How was this patch tested?
GA.

Closes #3023 from turboFei/restful_minor_fix.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-02 23:00:15 +08:00
SteNicholas
16762c659c
[CELEBORN-1774][FOLLOWUP] Change celeborn.<module>.io.mode optional to explain default behavior in description
### What changes were proposed in this pull request?

Change `celeborn.<module>.io.mode` optional to explain default behavior in description.

### Why are the changes needed?

The default value of `celeborn.<module>.io.mode` in document could be changed by whether epoll mode is available for different os. Therefore, `celeborn.<module>.io.mode` should be changed to optional and explained the default behavior in description of option.

Follow up https://github.com/apache/celeborn/pull/3039#discussion_r1899340272.

### Does this PR introduce _any_ user-facing change?

`celeborn.<module>.io.mode` is optional and explains default behavior in description.

### How was this patch tested?

CI.

Closes #3044 from SteNicholas/CELEBORN-1774.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-01-02 21:15:19 +08:00
Cheng Pan
a318eb43ab
[CELEBORN-1815] Support UnpooledByteBufAllocator
### What changes were proposed in this pull request?

This PR introduces a configuration `celeborn.network.memory.allocator.pooled` to allow users to disable `PooledByteBufAllocator` globally and always use `UnpooledByteBufAllocator`.

### Why are the changes needed?

In some extreme cases, the Netty's `PooledByteBufAllocator` might have tons of 4MiB chunks but only a few sizes of the capacity are used by the real data(see https://github.com/apache/celeborn/pull/3018), for scenarios that stability is important than performance, it's desirable to allow users to disable the `PooledByteBufAllocator` globally.

### Does this PR introduce _any_ user-facing change?

Add a new feature, disabled by default.

### How was this patch tested?

Pass UT to ensure correctness. Performance and memory impact need to be verified in the production scale cluster.

Closes #3043 from pan3793/CELEBORN-1815.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2025-01-02 20:54:34 +08:00
SteNicholas
d6496ae183 [CELEBORN-1801][FOLLOWUP] Extract RemoteShuffleEnvironment, NettyShuffleEnvironmentWrapper, SimpleResultPartitionAdapter to flink common module
### What changes were proposed in this pull request?

Extract `RemoteShuffleEnvironment`, `NettyShuffleEnvironmentWrapper`, `SimpleResultPartitionAdapter` to flink common module. Meanwhile, `RemoteShuffleInputGate` and `RemoteShuffleResultPartition` are abstracted in flink common module.

### Why are the changes needed?

After removing out-of-dated flink 1.14 and 1.15 in #3029, `RemoteShuffleEnvironment`, `NettyShuffleEnvironmentWrapper`, `SimpleResultPartitionAdapter` could be extracted to flink common module. Meanwhile, `RemoteShuffleInputGate` and `RemoteShuffleResultPartition` could also be abstracted in flink common module.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3041 from SteNicholas/CELEBORN-1801.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
2024-12-31 17:39:51 +08:00
mingji
4ec02286e8
[CELEBORN-1811] Update default value for celeborn.master.slot.assign.extraSlots
### What changes were proposed in this pull request?
To avoid possible worker load skew for the stages with tiny reducer numbers.

### Why are the changes needed?
If a stage has tiny reducers and skewed partitions, The default value will lead to serious worker load imbalance cause some workers unable to handle shuffle data.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
GA and cluster test.

Closes #3039 from FMX/1811.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-12-31 15:37:28 +08:00
SteNicholas
56019c714a [CELEBORN-1804] Shuffle environment metrics of RemoteShuffleEnvironment should use Shuffle.Remote metric group
### What changes were proposed in this pull request?

Shuffle environment metrics of `RemoteShuffleEnvironment` should use `Shuffle.Remote` metric group.

### Why are the changes needed?

Shuffle environment metrics of `RemoteShuffleEnvironment` uses incorrect netty metric group defined as `Shuffle.Netty`. Therefore, `RemoteShuffleEnvironment` should use remote metric group like `Shuffle.Remote` for shuffle environment metrics.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3032 from SteNicholas/CELEBORN-1804.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
2024-12-31 14:39:43 +08:00
codenohup
a57238024e
[CELEBORN-1801] Remove out-of-dated flink 1.14 and 1.15
### What changes were proposed in this pull request?
Remove out-of-dated flink 1.14 and 1.15.

For more information, please see the discussion thread: https://lists.apache.org/thread/njho00zmkjx5qspcrbrkogy8s4zzmwv9

### Why are the changes needed?
Reduce maintenance burden.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Changes can be covered by existing tests.

Closes #3029 from codenohup/remove-flink14and15.

Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-12-30 15:33:44 +08:00
hongguangwei
d0d8edfe22 [CELEBORN-1737] Support build tez client package
### What changes were proposed in this pull request?
Add Tez packaging script.

### Why are the changes needed?
To support build tez client.

### Does this PR introduce _any_ user-facing change?
Yes, enable Celeborn with tez support.

### How was this patch tested?
Cluster test.

Closes #3028 from GH-Gloway/1737.

Lead-authored-by: hongguangwei <hongguangwei@bytedance.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-30 11:01:19 +08:00
HolyLow
4714e91420 [CELEBORN-1809][CIP-14] Add partitionLocation to cppClient
### What changes were proposed in this pull request?
This PR adds PartitionLocation to cppClient, which is the component of protocol module.

### Why are the changes needed?
To support communication message of PartitionLocation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Compilation and UTs.

Closes #3035 from HolyLow/issue/celeborn-1809-add-partition-location-to-cppClient.

Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-27 20:38:43 +08:00
SteNicholas
eb59c17638
[CELEBORN-1806] Bump Spark from 3.5.3 to 3.5.4
### What changes were proposed in this pull request?

Bump Spark from 3.5.3 to 3.5.4.

### Why are the changes needed?

Spark 3.5.4 has been announced to release: [Spark 3.5.4 released](https://spark.apache.org/news/spark-3-5-4-released.html). The profile spark-3.5 could bump Spark from 3.5.3 to 3.5.4.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3034 from SteNicholas/CELEBORN-1806.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-12-27 16:29:35 +08:00
Wang, Fei
1b3bd6eb38 [CELEBORN-1802] Fail the celeborn master/worker start if CELEBORN_CONF_DIR is not directory
### What changes were proposed in this pull request?
Fail the celeborn master/worker start if `CELEBORN_CONF_DIR` is not directory. Otherwise, the  process would run into unexpected status.

### Why are the changes needed?
In the  `celeborn-daemon.sh` , if we specify the `--config <conf-dir>` option. It would fail the master/worker start if the `conf-dir` is not a directory,  likes the systemctl `ConditionPathExists=$CELEBORN_CONF_DIR` requirement check.

fde6365f68/sbin/celeborn-daemon.sh (L35)

fde6365f68/sbin/celeborn-daemon.sh (L53-L62)

But before this PR, for the start master/worker scripts, it did not check if the `CELEBORN_CONF_DIR` is dirctory because the scripts did not leverage `--config <conf-dir>` option.

In this PR, we check the final `CELEBORN_CONF_DIR` before start celeborn, so that all the scripts would check if the `CELEBORN_CONF_DIR` is a directory before start.
### Does this PR introduce _any_ user-facing change?
Yes, it would fail the start if  `CELEBORN_CONF_DIR` is not a directory.

### How was this patch tested?

<img width="840" alt="image" src="https://github.com/user-attachments/assets/e670d21b-cb01-4fa6-8a2f-c94dc06cce4a" />

Closes #3030 from turboFei/check_config_dir.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-26 20:22:34 +08:00
mingji
52fa151aa4
[CELEBORN-1701][FOLLOWUP] Support stage rerun for shuffle data lost
### What changes were proposed in this pull request?
Fix an error that may cause the application master retry stage rerun infinitely.

### Why are the changes needed?
Correct the parameters passed.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA.

Closes #3033 from FMX/b1071-1.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-12-26 17:58:41 +08:00
HolyLow
7f030d424d [CELEBORN-1799][CIP-14] Add celebornConf to cppClient
### What changes were proposed in this pull request?
Add CelebornConf to cppClient.

### Why are the changes needed?
The CelebornConf will be used as configuration module in cppClient.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Compilation and UTs.

Closes #3027 from HolyLow/issue/celeborn-1799-add-celeborn-conf-to-cppClient.

Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-25 20:07:35 +08:00
mingji
fde6365f68 [CELEBORN-1413] Support Spark 4.0
### What changes were proposed in this pull request?
To support Spark 4.0.0 preview.

### Why are the changes needed?
1. Changed Scala to 2.13.
2. Introduce columnar shuffle module for spark 4.0.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Cluster test.

Closes #2813 from FMX/b1413.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-24 18:12:27 +08:00
xiyu.zk
4b60dae0f0 [CELEBORN-1789][DOC] Document on Java Columnar Shuffle
### What changes were proposed in this pull request?
Introduction to Celeborn's Java Columnar Shuffle

### Why are the changes needed?
Introduction to Celeborn's Java Columnar Shuffle

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

Closes #3010 from kerwin-zk/CELEBORN-1789.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-24 11:40:18 +08:00
Wang, Fei
6028a049df
[CELEBORN-1700][FOLLOWUP] Fix flaky test RemoteShuffleMasterSuiteJ - testRegisterPartitionWithProducer
### What changes were proposed in this pull request?
Increase `celeborn.client.application.heartbeatInterval` from default `10s` to `30s` to fix flaky test `RemoteShuffleMasterSuiteJ`.

### Why are the changes needed?
Many flaky test failure for `RemoteShuffleMasterSuiteJ` when assert the `lifecycleManager().shuffleCount() == 3`.

```
Error:  Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 52.186 s <<< FAILURE! - in org.apache.celeborn.plugin.flink.RemoteShuffleMasterSuiteJ
Error:  org.apache.celeborn.plugin.flink.RemoteShuffleMasterSuiteJ.testRegisterPartitionWithProducer  Time elapsed: 10.05 s  <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<0>
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:633)
	at org.apache.celeborn.plugin.flink.RemoteShuffleMasterSuiteJ.testRegisterPartitionWithProducer(RemoteShuffleMasterSuiteJ.java:146)
```

680b072b5b/client-flink/flink-1.15/src/test/java/org/apache/celeborn/plugin/flink/RemoteShuffleMasterSuiteJ.java (L146)

The `lifecycleManager().shuffleCount()` would reset when reporting application heartbeat, so the test would fail if its duration is more than default application heartbeat interval, 10s.

680b072b5b/client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala (L210-L220)

So, in this PR, we increase the application heartbeat interval from defaults `10s` to `30s` to reduce the flaky test.
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #3025 from turboFei/fix_RemoteShuffleMasterSuiteJ_failure.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-12-24 11:33:17 +08:00
Wang, Fei
27e34ecad0 [CELEBORN-1797] Support to adjust the logger level with RESTful API during runtime
### What changes were proposed in this pull request?

Support to adjust the logger level during runtime without restarting the server.

### Why are the changes needed?
It is useful for debug, likes hadoop daemonlog command: https://hadoop.apache.org/docs/r3.4.1/hadoop-project-dist/hadoop-common/CommandsManual.html#daemonlog

### Does this PR introduce _any_ user-facing change?
Yes, new RESTful api.

### How was this patch tested?

GA.
<img width="1430" alt="image" src="https://github.com/user-attachments/assets/9d974fd9-21f3-429a-a35f-e15662aa75ac" />
<img width="1428" alt="image" src="https://github.com/user-attachments/assets/ca32b596-12a1-4038-9e1b-4fdc6a377b54" />

<img width="1255" alt="image" src="https://github.com/user-attachments/assets/5c399a73-9f53-43a8-b337-5a79621abea4" />
<img width="1244" alt="image" src="https://github.com/user-attachments/assets/16dc9ede-01bb-4e38-80fe-acb044ae6cc7" />

Closes #3022 from turboFei/log_level.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-24 11:24:30 +08:00
Wang, Fei
03656b5b1c [CELEBORN-1634][FOLLOWUP] Add rpc metrics into grafana dashboard
### What changes were proposed in this pull request?

1. rename the RPC metrics name from `${name}_${metric}` to `Rpc${metric}{name=$name}` so that it is easy to add into grafana dashboard
2. Use MASTER/WORKER/CLIENT Role for rpc env.
3. add the rpc metrics into grafana dashboard.

### Why are the changes needed?

For monitoring

### Does this PR introduce _any_ user-facing change?
No, it has not been released

### How was this patch tested?
UT for  metrics source `instance`.

<img width="1456" alt="image" src="https://github.com/user-attachments/assets/90284390-54ad-49ef-a868-fa537d2301b8">

<img width="1880" alt="image" src="https://github.com/user-attachments/assets/e8101e47-d649-4c66-9978-1efb4faa047f">

Closes #2990 from turboFei/rpc_metrics.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-24 11:13:49 +08:00
zhengtao
2eb4c23eb8 [CELEBORN-1771] Bring forward PUSH_DATA_CREATE_CONNECTION_FAIL_REPLICA
### What changes were proposed in this pull request?
1.Move the peerWorker available judgement out of ThreadPool.
2.Move `retain` after the available worker judgment Which means we don't have to release if peerWorker is unavailable.
2. Add `fileWriter.decrementPendingWrites()` if peerWorker is unavailable since it will return and won't decrementPendingWreites in `writeLocalData`.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT & cluster testing.

Closes #2989 from zaynt4606/clb1771.

Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-24 11:09:19 +08:00
Wang, Fei
680b072b5b [CELEBORN-1753] Optimize the code for exists and find method
### What changes were proposed in this pull request?

Optimize the code for `exists` and `find`.

1.  Enhance the performance to lookup workerInfo by workerUniqueId instead of looping the collection:
 74c1ec0a7f/client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala (L65-L66)

Change the type to:
```
 type ShuffleAllocatedWorkers =
    ConcurrentHashMap[Int, ConcurrentHashMap[String, ShufflePartitionLocationInfo]]
```
And save the `WorkerInfo` into `ShufflePartitionLocationInfo`.
```
class ShufflePartitionLocationInfo(val workerInfo: WorkerInfo) {
...
}
```

So that, we can get the `WorkerInfo` by worker uniqueId fast.

2. Reduce the loop cost for below code: 33ba0e02f5/worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala (L455-L466)

### Why are the changes needed?

Enhance the performance.
Address comments:
https://github.com/apache/celeborn/pull/2959#pullrequestreview-2466200199
https://github.com/apache/celeborn/pull/2959#issuecomment-2505137166

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

GA

Closes #2962 from turboFei/CELEBORN_1753_exists.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-23 17:56:20 +08:00
zhengtao
406ceb64c1 [CELEBORN-1794] Fix TestCongestionController occasional failing
### Why are the changes needed?
There are a small probability of the TestCongestionController test failing.

![image](https://github.com/user-attachments/assets/bc5bfb91-0b40-4ee0-bd74-2b96715d7cd7)

That is because the `checkService` will excute once it was init, which can cause a multithreading conflict with the test code.

![image](https://github.com/user-attachments/assets/dff58b08-a340-4c29-a61d-fafaeb182c5d)

### What changes were proposed in this pull request?
Fix ut bug.
In fact, `shutDownCheckService` still wont prevent the `checkService` from excuting at once but can make the main testing thread waiting for it to shutDown.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
manual test.

Closes #3017 from zaynt4606/clb1794.

Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-23 17:47:28 +08:00
HolyLow
e496a3cfae [CELEBORN-1785][CIP-14] Add baseConf to cppClient
### What changes were proposed in this pull request?
Add baseConf to cppClient, which is the building block of conf module.

### Why are the changes needed?
To support CelebornCpp configuration module in cppClient.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Compilation and UTs.

Closes #3013 from HolyLow/issue/celeborn-1785-add-base-conf-to-cppClient.

Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-23 16:45:02 +08:00
zhangzhao.08
9e04ff4a9f [CELEBORN-1763] Fix DataPusher be blocked for a long time
### What changes were proposed in this pull request?
fix DataPusher be blocked for a long time

### Why are the changes needed?
The worker has been at a performance bottleneck for a long time, the slow start strategy adjusts its maxInFlight to 1, which may cause RequestInFlight to exceed maxInFlight. If the task’s main thread has been blocked in the waitIdleQueueFullWithLock call, then the main thread will not be able to detect the sending failure since this failure changes the exception in the push state, and the waitIdleQueueFullWithLock function does not check for it

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
GA

Closes #2978 from zhaostu4/fix_pusher_block.

Authored-by: zhangzhao.08 <zhangzhao.08@bytedance.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2024-12-22 23:08:36 -08:00
zhaohehuhu
eaa0726c5c [CELEBORN-1788] Add role and roleBinding helm charts
### What changes were proposed in this pull request?

as title

### Why are the changes needed?

 help service account control what permissions and resources a pod has access to.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

test the template rendering by helm template command line

Closes #3009 from zhaohehuhu/dev-1219.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-23 11:42:16 +08:00
Sanskar Modi
80523214e4 [MINOR] Add documentation for CELEBORN_NO_DAEMONIZE
### What changes were proposed in this pull request?

Add documentation for `CELEBORN_NO_DAEMONIZE`

### Why are the changes needed?

Currently the celeborn processes starts in background and it was difficult to figure out how to change that behaviour. Setting this flag to true, will allow Celeborn processes to run in foreground.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

Closes #3020 from s0nskar/no-daemonize-docs.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-23 10:31:37 +08:00
SteNicholas
a357df94f1 [CELEBORN-1700][FOLLOWUP] Support ShuffleFallbackCount metric for fallback to vanilla Flink built-in shuffle implementation
### What changes were proposed in this pull request?

 Support `ShuffleFallbackCount` metric for fallback to vanilla Flink built-in shuffle implementation.

### Why are the changes needed?

#2932 has already supported fallback to vanilla Flink built-in shuffle implementation, which is lack of `ShuffleFallbackCount` metric to feedback the situation of fallback.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`RemoteShuffleMasterSuiteJ#testRegisterPartitionWithProducerForForceFallbackPolicy`

Closes #3012 from SteNicholas/CELEBORN-1700.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-20 14:12:24 +08:00
Fei Wang
6b884dee66 [CELEBORN-1777] Add java.security.jgss/sun.security.krb5 to DEFAULT_MODULE_OPTIONS
### What changes were proposed in this pull request?
As title, add `--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED` into default java options.

### Why are the changes needed?

It is necessary for JDK17 + HDFS Storage + Kerberos enabled, see details in https://github.com/apache/spark/pull/34615

The exception stack likes:
```
Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm
	at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:306)
	at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
....
Caused by: java.lang.IllegalAccessException: class org.apache.hadoop.security.authentication.util.KerberosUtil cannot access class sun.security.krb5.Config (in module java.security.jgss) because module java.security.jgss does not export sun.security.krb5 to unnamed module 3a0baae5
	at java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392)
	at java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674)
	at java.base/java.lang.reflect.Method.invoke(Method.java:560)
	at org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:85)
	at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
	... 9 more
```
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
GA.

Closes #2999 from turboFei/jdk_opt_krb5.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-19 14:54:21 +08:00
zhengtao
67971df68f [CELEBORN-1783] Fix Pending task in commitThreadPool wont be canceled
### What changes were proposed in this pull request?
1. Cancel all commit file jobs when handleCommitFiles timeout.
2. Fix timeout commit jobs wont be set `CommitInfo.COMMIT_FINISHED`

### Why are the changes needed?
1. Pending task in commitThreadPool wont be canceled.
3. Timeout commit jobs should set `commitInfo.status`.
![image](https://github.com/user-attachments/assets/38528460-8114-4c42-8dc2-a47ec396f99e)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT test.
The `commitInfo.status` should be `COMMIT_FINISHED` when commitFile jobs timeout.
Cluster test.
![image](https://github.com/user-attachments/assets/01beb183-da7e-4e44-85e1-3836fcad3c79)

Closes #3004 from zaynt4606/clb1783.

Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-19 14:24:36 +08:00
Xianming Lei
cec88b2def [CELEBORN-1782] Worker in congestion control should be in blacklist to avoid impact new shuffle
### What changes were proposed in this pull request?
Set highWorkload=true when worker in congestion control.

### Why are the changes needed?
Worker in congestion control should be in blacklist to avoid impact new shuffle.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTS.

Closes #3003 from leixm/CELEBORN-1782.

Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-19 10:12:09 +08:00
zhaohehuhu
6cce51e597 [CELEBORN-1786] add serviceAccount helm chart
<!--
Thanks for sending a pull request!  Here are some tips for you:
  - Make sure the PR title start w/ a JIRA ticket, e.g. '[CELEBORN-XXXX] Your PR title ...'.
  - Be sure to keep the PR description updated to reflect all changes.
  - Please write your PR title to summarize what this PR proposes.
  - If possible, provide a concise example to reproduce the issue for a faster review.
-->

### What changes were proposed in this pull request?

as title

### Why are the changes needed?

Support custom serviceAccount to control what permissions and resources a pod has access to.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

 test the template rendering by helm template command line

Closes #3006 from zhaohehuhu/dev-1218.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-18 20:09:43 +08:00
HolyLow
e75d84fc19 [CELEBORN-1772][CIP-14] Add memory module to cppClient
### What changes were proposed in this pull request?
Add memory module to cppClient to provide ByteBuffer functionality.

### Why are the changes needed?
The memory module is added to provide ByteBuffer functionality, which would be used across the data parsing layers.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Compilation and UTs.

Closes #2996 from HolyLow/issue/celeborn-1772-add-memory-module-to-cppClient.

Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-17 17:52:38 +08:00
SteNicholas
f3dac7e879 [CELEBORN-1712] Bump Netty version from 4.1.109.Final to 4.1.115.Final
### What changes were proposed in this pull request?

Bump Netty version from 4.1.109.Final to 4.1.115.Final.

### Why are the changes needed?

The Netty 4.1.115.Final version has been released, which netty version is 4.1.109.Final at present. The changes between 4.1.110.Final and 4.1.115.Final is as follows:

- [4.1.110.Final](https://netty.io/news/2024/05/22/4-1-110-Final.html)
- [4.1.111.Final](https://netty.io/news/2024/06/11/4-1-111-Final.html)
- [4.1.112.Final](https://netty.io/news/2024/07/19/4-1-112-Final.html)
- [4.1.113.Final](https://netty.io/news/2024/09/04/4-1-113-Final.html)
- [4.1.114.Final](https://netty.io/news/2024/10/01/4-1-114-Final.html)
- [4.1.115.Final](https://netty.io/news/2024/11/12/4-1-115-Final.html)

Bump https://github.com/apache/spark/pull/46945.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2903 from SteNicholas/CELEBORN-1712.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-17 17:29:07 +08:00
SteNicholas
0eb8af98de [CELEBORN-1774] Update default value of celeborn.<module>.io.mode to whether epoll mode is available
### What changes were proposed in this pull request?

Update default value of `celeborn.<module>.io.mode` to whether epoll mode is available. Meanwhile, the io mode of transport is `NIO` for unavailable epoll mode.

### Why are the changes needed?

The JDK NIO bug produces the situation that empty polling of `Selector` could cause CPU 100%, which refers to

1. [JDK-2147719 : (se) Selector doesn't block on Selector.select(timeout) (lnx)](https://bugs.java.com/bugdatabase/view_bug.do?bug_id=2147719)
2. [JDK-6403933 : (se) Selector doesn't block on Selector.select(timeout) (lnx)](https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6403933)

When the epoll mode is available, the default IO mode should be `EPOLL`, which backports [NettyServer.java#L92](https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/NettyServer.java#L92). Meanwhile, the transport io mode should be `NIO` when the epoll mode is unavailable.

### Does this PR introduce _any_ user-facing change?

Change the default value of `celeborn.<module>.io.mode`.

### How was this patch tested?

CI.

Closes #2994 from SteNicholas/CELEBORN-1774.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-17 15:26:01 +08:00
zhangzhao.08
b24f867784 [CELEBORN-1510] Partial task unable to switch to the replica
### What changes were proposed in this pull request?

Attempt%2 = 1 unable to switch to the replica

### Why are the changes needed?

When Attempt equals 1, createReader will access location.peer. Special circumstances can lead to unexpected behaviors. For example, an exception occurs during the process of obtaining data and the peer needs to be used. However, this logic will switch to the abnormal node again.

<img width="1626" alt="image" src="https://github.com/user-attachments/assets/21c50953-db0f-4717-9b91-e3aeae16ece2">
<img width="1652" alt="image" src="https://github.com/user-attachments/assets/7430786c-26a4-4b3b-be68-f8bdf780c58c">

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Internal test.

Closes #2626 from zhaostu4/switch_replica.

Authored-by: zhangzhao.08 <zhangzhao.08@bytedance.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-17 15:13:14 +08:00
Wang, Fei
2efdf755cc [CELEBORN-1711][TEST] Fix flaky test caused by master/worker setup issue
### What changes were proposed in this pull request?

1. retry on BindException when starting master/worker http server
2. record the used ports and pre-check whether the selected port is used or bounded before binding

### Why are the changes needed?

To fix flaky test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
GA.

Closes #2906 from turboFei/retry_master_suite.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-17 10:45:40 +08:00
ShlomiTubul
17df678c77
[CELEBORN-1780] Add support for NodePort Service per Master replica
### What changes were proposed in this pull request?
This PR add support for NodePort svc per master replica, instead of only dealing on hostnet in master's when client is outside of k8s

### Why are the changes needed?
To better support external access

### Does this PR introduce _any_ user-facing change?
Added optional fields

### How was this patch tested?
locally on my cluster

Closes #2998 from shlomitubul/main.

Authored-by: ShlomiTubul <shlomi.tubul@placer.ai>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-16 16:58:26 +08:00
onebox-li
33ba0e02f5 [CELEBORN-1775] Improve some logs
### What changes were proposed in this pull request?
Improve some logs, mainly including checking commit result and waiting partition location empty when worker gracefully shutdown.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Some logs changed.

### How was this patch tested?
Manual test.

Closes #2995 from onebox-li/improve-logs.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-16 16:24:18 +08:00
mingji
c40f69b941 [CELEBORN-1766] Add detail metrics about fetch chunk
### What changes were proposed in this pull request?
1. Add histogram
2. Collect critical metrics about fetch chunk

### Why are the changes needed?
1. To find out IO pattern of fetch chunk
2. To have detail metrics about fetch chunk time

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

<img width="940" alt="截屏2024-12-09 15 42 50" src="https://github.com/user-attachments/assets/9f526103-c162-4607-a031-ba90f42ae83e">
<img width="962" alt="截屏2024-12-09 15 42 56" src="https://github.com/user-attachments/assets/c17822da-0433-4701-b0cc-0887ac970353">

Closes #2983 from FMX/b1766.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-16 16:17:14 +08:00
Wang, Fei
d85fb7826f [CELEBORN-1711][TEST] RetryReviveTest - shutdownMiniCluster after each test
### What changes were proposed in this pull request?
For RetryReviveTest, shutdownMiniCluster after each test

### Why are the changes needed?

Currently, the minicluster is not shutdown after each test.
ca8831e55f/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/RetryReviveTest.scala (L43-L80)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

Closes #3000 from turboFei/stop_miniCluster.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-16 15:43:41 +08:00
zhengtao
4aabe37765 [CELEBORN-1778] Fix commitInfo NPE and add assert in LifecycleManagerCommitFilesSuite
### What changes were proposed in this pull request?
1. Fix commitInfo NPE in LifecycleManagerCommitFilesSuite. Not all the workers are assigned slots.
2. Add `assert` in the logic of judgement.

### Why are the changes needed?
Errors in CI.
![image](https://github.com/user-attachments/assets/d4335c65-7a10-4db9-8446-694093bbde31)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing CI.

Closes #3001 from zaynt4606/clb1778.

Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-16 15:41:37 +08:00
jiang13021
74c1ec0a7f [CELEBORN-1670] Avoid swallowing InterruptedException in ShuffleClientImpl
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
In the ShuffleClientImpl, methods such as pushData and pushMergedData might encounter interruptions during message transmission via the TransportClient. However, the InterruptedException may be ignored, as it is handled as a standard exception. As a result, the ShuffleClientImpl continues its operation(retry or revive) even when an InterruptedException occurs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add UT: org.apache.celeborn.client.ShuffleClientSuiteJ#testPushDataAndInterrupted

Closes #2849 from jiang13021/interrupt_in_shuffle_client.

Authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-16 11:26:17 +08:00
hongguangwei
ca8831e55f [CELEBORN-1736] Add tez integration tests
### What changes were proposed in this pull request?
Add tez integration tests

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2991 from GH-Gloway/1736.

Authored-by: hongguangwei <hongguangwei@bytedance.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-13 14:06:08 +08:00
zhengtao
c316fdbdfb Revert "[CELEBORN-1376] Push data failed should always release request body"
This reverts commit b65b5433dc.

<!--
Thanks for sending a pull request!  Here are some tips for you:
  - Make sure the PR title start w/ a JIRA ticket, e.g. '[CELEBORN-XXXX] Your PR title ...'.
  - Be sure to keep the PR description updated to reflect all changes.
  - Please write your PR title to summarize what this PR proposes.
  - If possible, provide a concise example to reproduce the issue for a faster review.
-->

### What changes were proposed in this pull request?
Revert [CELEBORN-1376](https://github.com/apache/celeborn/pull/2449)
This pr will introduce reference count error when replica enable and workers randomly terminate

### Why are the changes needed?
When data replication is enabled and workers are randomly terminated there will be IllegalReferenceCountException `refCnt: 0, decrement: 1` which will fail the task.
![image](https://github.com/user-attachments/assets/bb4965e4-5fa2-44ad-bb88-36bd475a6b5f)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
cluster testing.

Closes #2992 from zaynt4606/clbr1376.

Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-13 11:20:11 +08:00
zhengtao
f7b036d4c7 [CELEBORN-1770] FlushNotifier should setException for all Throwables in Flusher
### What changes were proposed in this pull request?
Non-IOException (will throw illegalReferenceCountException If a netty's buffer reference count is incorrect)
should also be set in FlushNotifier.
Provides Utils to convert non-IOExceptions to IOExceptions.

### Why are the changes needed?
In some test scenarios where data replication is enabled and workers are randomly terminated, it will throw illegalReferenceCountException which won't be caught.
![image](https://github.com/user-attachments/assets/6503204a-9594-4a7e-850c-c411e0b07117)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT & cluster test.

Closes #2988 from zaynt4606/clb1770-m.

Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-12 14:23:04 +08:00
mingji
069e5b6c18 [CELEBORN-1769] Fix packed partition location cause GetReducerFileGroupResponse lose location
### What changes were proposed in this pull request?
Fix the issue of losing the primary location when parsing `GetReducerFileGroupResponse` from `LifecycleManager`.

### Why are the changes needed?
In previous optimizations, I introduced packed partition locations to reduce the size of RPC calls, based on the assumption that primary partition locations would always be available. However, in some test scenarios where data replication is enabled and workers are randomly terminated, the primary location may be lost while the replica location remains. This causes the replica locations to be ignored which will cause data loss.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #2986 from FMX/b1769.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-10 18:03:00 +08:00
zhengtao
11cbacb049 [CELEBORN-1767] Fix occasional errors in UT when creating workers
### What changes were proposed in this pull request?
Remove assert in setupWorker for UT which might make UT fail directly.
![image](https://github.com/user-attachments/assets/a7ddd01c-4069-4cb2-9f8d-5c81e9dfb389)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2984 from zaynt4606/clb1767.

Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-10 17:51:08 +08:00