### What changes were proposed in this pull request?
Rename `org.apache.celeborn.plugin.flink.readclient` to `org.apache.celeborn.plugin.flink.client`.
### Why are the changes needed?
`FlinkShuffleClientImpl` both writes and reads shuffle data, that is, it pushes as well as fetches it. Therefore, the package of `FlinkShuffleClientImpl` should be `org.apache.celeborn.plugin.flink.client` rather than `readclient`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3048 from SteNicholas/shuffle-client-package.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
### What changes were proposed in this pull request?
Add transportMessage to cppClient.
### Why are the changes needed?
TransportMessage is the building block of controlMessages.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Compilation and UTs.
Closes #3042 from HolyLow/issue/celeborn-1814-add-transport-message-to-cppClient.
Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
`StorageManager` and `ReadBufferDispatcher` do not register Netty metrics; this PR registers them.
### Why are the changes needed?
All `NettyMemoryMetrics` should be registered to the metrics source.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes #3016 from leixm/CELEBORN-1791.
Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Update zinc to fix an issue that may cause the compilation process to keep recompiling the project.
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA and manual tests on Mac and Ubuntu nodes.
Closes #3045 from FMX/b1816.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Minor fixes to the v1 RESTful APIs before the 0.6.0 release.
1. Update the API description to use upper-case worker EventType.
2. `subResourceConsumption` => `subResourceConsumptions`.
### Why are the changes needed?
1. With https://github.com/apache/celeborn/pull/2754, the openapi-sdk works well. However, for RESTful calls made without the SDK, the worker eventType is still case sensitive, which might be caused by the Jersey issue mentioned in https://github.com/eclipse-ee4j/jersey/issues/5288. So this PR changes the description in Swagger to guide users.
<img width="1524" alt="image" src="https://github.com/user-attachments/assets/70e4f239-dc36-47bc-902e-5340986f014a" />
2. Rename `subResourceConsumption` to `subResourceConsumptions`.
### Does this PR introduce _any_ user-facing change?
No, the API has not been released yet.
### How was this patch tested?
GA.
Closes #3023 from turboFei/restful_minor_fix.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Make `celeborn.<module>.io.mode` optional and explain its default behavior in the description.
### Why are the changes needed?
The default value of `celeborn.<module>.io.mode` shown in the documentation depends on whether epoll mode is available on the operating system. Therefore, `celeborn.<module>.io.mode` should be made optional, with the default behavior explained in the option's description.
Follow-up to https://github.com/apache/celeborn/pull/3039#discussion_r1899340272.
### Does this PR introduce _any_ user-facing change?
`celeborn.<module>.io.mode` is now optional, and its description explains the default behavior.
### How was this patch tested?
CI.
Closes #3044 from SteNicholas/CELEBORN-1774.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
This PR introduces a configuration `celeborn.network.memory.allocator.pooled` to allow users to disable `PooledByteBufAllocator` globally and always use `UnpooledByteBufAllocator`.
### Why are the changes needed?
In some extreme cases, Netty's `PooledByteBufAllocator` might hold a large number of 4 MiB chunks while only a small part of their capacity is used by real data (see https://github.com/apache/celeborn/pull/3018). For scenarios where stability is more important than performance, it is desirable to allow users to disable `PooledByteBufAllocator` globally.
### Does this PR introduce _any_ user-facing change?
Add a new feature, disabled by default.
### How was this patch tested?
UTs pass to ensure correctness. Performance and memory impact still need to be verified on a production-scale cluster.
Closes #3043 from pan3793/CELEBORN-1815.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Extract `RemoteShuffleEnvironment`, `NettyShuffleEnvironmentWrapper`, and `SimpleResultPartitionAdapter` to the Flink common module. Meanwhile, `RemoteShuffleInputGate` and `RemoteShuffleResultPartition` are abstracted in the Flink common module.
### Why are the changes needed?
After the removal of the outdated Flink 1.14 and 1.15 in #3029, `RemoteShuffleEnvironment`, `NettyShuffleEnvironmentWrapper`, and `SimpleResultPartitionAdapter` can be extracted to the Flink common module. Meanwhile, `RemoteShuffleInputGate` and `RemoteShuffleResultPartition` can also be abstracted in the Flink common module.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3041 from SteNicholas/CELEBORN-1801.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
### What changes were proposed in this pull request?
Avoid possible worker load skew for stages with very few reducers.
### Why are the changes needed?
If a stage has very few reducers and skewed partitions, the default value leads to serious worker load imbalance, which can leave some workers unable to handle shuffle data.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
GA and cluster test.
Closes #3039 from FMX/1811.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Shuffle environment metrics of `RemoteShuffleEnvironment` should use the `Shuffle.Remote` metric group.
### Why are the changes needed?
Shuffle environment metrics of `RemoteShuffleEnvironment` use the incorrect Netty metric group, defined as `Shuffle.Netty`. Therefore, `RemoteShuffleEnvironment` should use a remote metric group such as `Shuffle.Remote` for its shuffle environment metrics.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3032 from SteNicholas/CELEBORN-1804.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
### What changes were proposed in this pull request?
Remove the outdated Flink 1.14 and 1.15.
For more information, please see the discussion thread: https://lists.apache.org/thread/njho00zmkjx5qspcrbrkogy8s4zzmwv9
### Why are the changes needed?
Reduce maintenance burden.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Changes can be covered by existing tests.
Closes #3029 from codenohup/remove-flink14and15.
Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Add Tez packaging script.
### Why are the changes needed?
To support building the Tez client.
### Does this PR introduce _any_ user-facing change?
Yes, this enables Celeborn with Tez support.
### How was this patch tested?
Cluster test.
Closes #3028 from GH-Gloway/1737.
Lead-authored-by: hongguangwei <hongguangwei@bytedance.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
This PR adds PartitionLocation to cppClient, which is a component of the protocol module.
### Why are the changes needed?
To support PartitionLocation in communication messages.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Compilation and UTs.
Closes #3035 from HolyLow/issue/celeborn-1809-add-partition-location-to-cppClient.
Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Spark from 3.5.3 to 3.5.4.
### Why are the changes needed?
Spark 3.5.4 has been released: [Spark 3.5.4 released](https://spark.apache.org/news/spark-3-5-4-released.html). The `spark-3.5` profile can therefore bump Spark from 3.5.3 to 3.5.4.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes #3034 from SteNicholas/CELEBORN-1806.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Fail the Celeborn master/worker start if `CELEBORN_CONF_DIR` is not a directory. Otherwise, the process would run into an unexpected state.
### Why are the changes needed?
In `celeborn-daemon.sh`, if the `--config <conf-dir>` option is specified, the master/worker start fails when `conf-dir` is not a directory, similar to the systemctl `ConditionPathExists=$CELEBORN_CONF_DIR` requirement check.
fde6365f68/sbin/celeborn-daemon.sh (L35)
fde6365f68/sbin/celeborn-daemon.sh (L53-L62)
Before this PR, however, the start master/worker scripts did not check whether `CELEBORN_CONF_DIR` is a directory, because they did not use the `--config <conf-dir>` option.
This PR checks the final `CELEBORN_CONF_DIR` before starting Celeborn, so that all scripts verify that `CELEBORN_CONF_DIR` is a directory before start.
### Does this PR introduce _any_ user-facing change?
Yes, the start fails if `CELEBORN_CONF_DIR` is not a directory.
### How was this patch tested?
<img width="840" alt="image" src="https://github.com/user-attachments/assets/e670d21b-cb01-4fa6-8a2f-c94dc06cce4a" />
Closes #3030 from turboFei/check_config_dir.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix an error that may cause the application master to retry stage reruns infinitely.
### Why are the changes needed?
Correct the parameters passed.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA.
Closes #3033 from FMX/b1071-1.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Add CelebornConf to cppClient.
### Why are the changes needed?
CelebornConf will be used as the configuration module in cppClient.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Compilation and UTs.
Closes #3027 from HolyLow/issue/celeborn-1799-add-celeborn-conf-to-cppClient.
Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Support the Spark 4.0.0 preview.
### Why are the changes needed?
1. Change Scala to 2.13.
2. Introduce a columnar shuffle module for Spark 4.0.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.
Closes #2813 from FMX/b1413.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduction to Celeborn's Java Columnar Shuffle
### Why are the changes needed?
Introduction to Celeborn's Java Columnar Shuffle
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
Closes #3010 from kerwin-zk/CELEBORN-1789.
Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Increase `celeborn.client.application.heartbeatInterval` from the default `10s` to `30s` to fix the flaky test `RemoteShuffleMasterSuiteJ`.
### Why are the changes needed?
`RemoteShuffleMasterSuiteJ` has many flaky failures when asserting `lifecycleManager().shuffleCount() == 3`.
```
Error: Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 52.186 s <<< FAILURE! - in org.apache.celeborn.plugin.flink.RemoteShuffleMasterSuiteJ
Error: org.apache.celeborn.plugin.flink.RemoteShuffleMasterSuiteJ.testRegisterPartitionWithProducer Time elapsed: 10.05 s <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<0>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at org.apache.celeborn.plugin.flink.RemoteShuffleMasterSuiteJ.testRegisterPartitionWithProducer(RemoteShuffleMasterSuiteJ.java:146)
```
680b072b5b/client-flink/flink-1.15/src/test/java/org/apache/celeborn/plugin/flink/RemoteShuffleMasterSuiteJ.java (L146)
`lifecycleManager().shuffleCount()` is reset when the application heartbeat is reported, so the test fails if it runs longer than the default application heartbeat interval of 10s.
680b072b5b/client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala (L210-L220)
This PR therefore increases the application heartbeat interval from the default `10s` to `30s` to reduce the flakiness.
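The interaction between the counter and the heartbeat can be sketched as follows. This is only an illustration under the assumption that the count is kept in a resettable counter that the heartbeat drains; the class and method names are hypothetical and are not the `LifecycleManager` code.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: a shuffle counter that is drained on every heartbeat.
final class HeartbeatCounterSketch {
    private final LongAdder shuffleCount = new LongAdder();

    // Called whenever a shuffle is registered.
    void registerShuffle() {
        shuffleCount.increment();
    }

    // What a test would read when asserting on the count.
    long shuffleCount() {
        return shuffleCount.sum();
    }

    // Called by the heartbeat: returns the current count and resets it to zero.
    long reportHeartbeat() {
        return shuffleCount.sumThenReset();
    }
}
```

If a heartbeat fires between the three registrations and the assertion, the assertion sees 0 instead of 3, which matches the failure above. A longer heartbeat interval narrows that window but does not remove it.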
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #3025 from turboFei/fix_RemoteShuffleMasterSuiteJ_failure.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
1. Rename the RPC metrics from `${name}_${metric}` to `Rpc${metric}{name=$name}` so that they are easy to add to the Grafana dashboard.
2. Use the MASTER/WORKER/CLIENT role for the RPC env.
3. Add the RPC metrics to the Grafana dashboard.
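The renaming in item 1 can be illustrated with a small sketch. The helper below is hypothetical and only formats strings according to the two patterns quoted above; it is not the Celeborn metrics code.

```java
// Hypothetical helper that contrasts the two naming schemes from item 1.
final class RpcMetricNameSketch {
    // Old scheme: the endpoint name is part of the metric name itself.
    static String oldName(String name, String metric) {
        return name + "_" + metric;
    }

    // New scheme: a fixed "Rpc" + metric name, with the endpoint name as a label.
    static String newName(String name, String metric) {
        return "Rpc" + metric + "{name=" + name + "}";
    }
}
```

With the endpoint name carried as a label, one dashboard query per metric can cover every endpoint, instead of one query per generated metric name.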
### Why are the changes needed?
For monitoring
### Does this PR introduce _any_ user-facing change?
No, it has not been released
### How was this patch tested?
UT for metrics source `instance`.
<img width="1456" alt="image" src="https://github.com/user-attachments/assets/90284390-54ad-49ef-a868-fa537d2301b8">
<img width="1880" alt="image" src="https://github.com/user-attachments/assets/e8101e47-d649-4c66-9978-1efb4faa047f">
Closes #2990 from turboFei/rpc_metrics.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Move the peerWorker availability check out of the thread pool.
2. Move `retain` after the peerWorker availability check, which means nothing has to be released if the peerWorker is unavailable.
3. Add `fileWriter.decrementPendingWrites()` if the peerWorker is unavailable, since the handler returns early and `decrementPendingWrites` is not called in `writeLocalData`.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT & cluster testing.
Closes #2989 from zaynt4606/clb1771.
Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### Why are the changes needed?
There is a small probability that the TestCongestionController test fails.

This is because `checkService` executes as soon as it is initialized, which can cause a multithreading conflict with the test code.

### What changes were proposed in this pull request?
Fix the UT bug.
In fact, `shutDownCheckService` still won't prevent `checkService` from executing immediately, but it makes the main test thread wait for it to shut down.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #3017 from zaynt4606/clb1794.
Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Add baseConf to cppClient, which is the building block of the conf module.
### Why are the changes needed?
To support the CelebornCpp configuration module in cppClient.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Compilation and UTs.
Closes #3013 from HolyLow/issue/celeborn-1785-add-base-conf-to-cppClient.
Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix `DataPusher` being blocked for a long time.
### Why are the changes needed?
When a worker has been at a performance bottleneck for a long time, the slow start strategy adjusts its maxInFlight to 1, which may cause RequestInFlight to exceed maxInFlight. If the task's main thread is blocked in the `waitIdleQueueFullWithLock` call, it cannot detect the push failure, because the failure only sets the exception in the push state and `waitIdleQueueFullWithLock` does not check for it.
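The general pattern is a wait loop that also checks for a recorded failure. The sketch below illustrates it under simplifying assumptions: the class, fields and methods are hypothetical and do not mirror `DataPusher` or the push state.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a waiter that wakes up on completion OR on a recorded failure.
final class PushWaitSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition stateChanged = lock.newCondition();
    private IOException failure; // guarded by lock
    private int inFlight;        // guarded by lock

    void requestSent() {
        lock.lock();
        try { inFlight++; } finally { lock.unlock(); }
    }

    void requestSucceeded() {
        lock.lock();
        try { inFlight--; stateChanged.signalAll(); } finally { lock.unlock(); }
    }

    // A failed request records the exception; it may never decrement inFlight.
    void requestFailed(IOException e) {
        lock.lock();
        try { failure = e; stateChanged.signalAll(); } finally { lock.unlock(); }
    }

    // Waits until nothing is in flight, but rethrows a recorded failure first.
    void awaitAllDone() throws IOException, InterruptedException {
        lock.lock();
        try {
            while (true) {
                if (failure != null) throw failure; // the check the blocked loop lacked
                if (inFlight == 0) return;
                stateChanged.await(100, TimeUnit.MILLISECONDS);
            }
        } finally {
            lock.unlock();
        }
    }
}
```

Without the failure check inside the loop, a waiter whose in-flight count never drops would block forever even though the failure has already been recorded.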
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
GA
Closes #2978 from zhaostu4/fix_pusher_block.
Authored-by: zhangzhao.08 <zhangzhao.08@bytedance.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
Help the service account control what permissions and resources a pod has access to.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Tested the template rendering with the `helm template` command line.
Closes #3009 from zhaohehuhu/dev-1219.
Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Add documentation for `CELEBORN_NO_DAEMONIZE`.
### Why are the changes needed?
Currently, Celeborn processes start in the background, and it was difficult to figure out how to change that behaviour. Setting this flag to true allows Celeborn processes to run in the foreground.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
Closes #3020 from s0nskar/no-daemonize-docs.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Support the `ShuffleFallbackCount` metric for fallback to the vanilla Flink built-in shuffle implementation.
### Why are the changes needed?
#2932 already added support for falling back to the vanilla Flink built-in shuffle implementation, but it lacks a `ShuffleFallbackCount` metric to report when fallback happens.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`RemoteShuffleMasterSuiteJ#testRegisterPartitionWithProducerForForceFallbackPolicy`
Closes #3012 from SteNicholas/CELEBORN-1700.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
As title, add `--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED` to the default Java options.
### Why are the changes needed?
It is necessary when JDK 17, HDFS storage and Kerberos are all enabled; see details in https://github.com/apache/spark/pull/34615
The exception stack looks like:
```
Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm
at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:306)
at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
....
Caused by: java.lang.IllegalAccessException: class org.apache.hadoop.security.authentication.util.KerberosUtil cannot access class sun.security.krb5.Config (in module java.security.jgss) because module java.security.jgss does not export sun.security.krb5 to unnamed module 3a0baae5
at java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392)
at java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674)
at java.base/java.lang.reflect.Method.invoke(Method.java:560)
at org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:85)
at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
... 9 more
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #2999 from turboFei/jdk_opt_krb5.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Cancel all commit file jobs when `handleCommitFiles` times out.
2. Fix timed-out commit jobs not being set to `CommitInfo.COMMIT_FINISHED`.
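Item 1 follows a common executor pattern: wait for the submitted jobs up to a shared deadline, then cancel whatever has not finished. The sketch below is a generic illustration with hypothetical names; it does not show the worker's commit handling or the `CommitInfo` status update.

```java
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: wait for jobs up to a shared deadline, cancel the rest.
final class CommitTimeoutSketch {
    // Returns how many jobs had to be cancelled because the deadline passed.
    static int cancelUnfinished(List<? extends Future<?>> jobs, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        int cancelled = 0;
        for (Future<?> job : jobs) {
            long remaining = Math.max(deadline - System.nanoTime(), 0L);
            try {
                job.get(remaining, TimeUnit.NANOSECONDS);
            } catch (TimeoutException e) {
                // Covers running jobs (interrupted) and jobs still queued in the pool.
                job.cancel(true);
                cancelled++;
            } catch (ExecutionException | CancellationException e) {
                // The job already failed or was cancelled elsewhere; nothing to cancel.
            }
        }
        return cancelled;
    }
}
```

Once the deadline has passed, every remaining job is polled with a zero wait, so queued tasks are cancelled too rather than left pending in the pool.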
### Why are the changes needed?
1. Pending tasks in `commitThreadPool` are not canceled.
2. Timed-out commit jobs should set `commitInfo.status`.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
UT test.
The `commitInfo.status` should be `COMMIT_FINISHED` when commitFile jobs timeout.
Cluster test.

Closes #3004 from zaynt4606/clb1783.
Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Set `highWorkload=true` when a worker is under congestion control.
### Why are the changes needed?
A worker under congestion control should be put in the blacklist to avoid impacting new shuffles.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UTs.
Closes #3003 from leixm/CELEBORN-1782.
Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
Support custom serviceAccount to control what permissions and resources a pod has access to.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested the template rendering with the `helm template` command line.
Closes #3006 from zhaohehuhu/dev-1218.
Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Add memory module to cppClient to provide ByteBuffer functionality.
### Why are the changes needed?
The memory module is added to provide ByteBuffer functionality, which would be used across the data parsing layers.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Compilation and UTs.
Closes #2996 from HolyLow/issue/celeborn-1772-add-memory-module-to-cppClient.
Authored-by: HolyLow <jiaming.xie7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Make the default value of `celeborn.<module>.io.mode` depend on whether epoll mode is available. Meanwhile, the transport IO mode is `NIO` when epoll mode is unavailable.
### Why are the changes needed?
A JDK NIO bug can cause empty polling of `Selector`, which can drive CPU usage to 100%. See:
1. [JDK-2147719 : (se) Selector doesn't block on Selector.select(timeout) (lnx)](https://bugs.java.com/bugdatabase/view_bug.do?bug_id=2147719)
2. [JDK-6403933 : (se) Selector doesn't block on Selector.select(timeout) (lnx)](https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6403933)
When epoll mode is available, the default IO mode should be `EPOLL`, which follows Flink's [NettyServer.java#L92](https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/NettyServer.java#L92). Meanwhile, the transport IO mode should be `NIO` when epoll mode is unavailable.
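One way to read the rule above is as a small decision function. This is a sketch of that reading only: the class and method names are hypothetical, and the `epollAvailable` flag stands in for whatever availability probe the transport uses (Netty offers `Epoll.isAvailable()` for this purpose).

```java
import java.util.Locale;

// Hypothetical sketch of the default-selection rule described above.
final class IoModeSketch {
    enum IoMode { NIO, EPOLL }

    // configured == null means the option was left unset.
    static IoMode resolve(String configured, boolean epollAvailable) {
        if (configured == null) {
            // Unset: default to EPOLL only where epoll is available.
            return epollAvailable ? IoMode.EPOLL : IoMode.NIO;
        }
        IoMode requested = IoMode.valueOf(configured.toUpperCase(Locale.ROOT));
        // Assumption: an explicit EPOLL request degrades to NIO when epoll is missing.
        return (requested == IoMode.EPOLL && !epollAvailable) ? IoMode.NIO : requested;
    }
}
```

Under this reading, an explicit `NIO` setting is always honored, while `EPOLL` is used only where the platform supports it.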
### Does this PR introduce _any_ user-facing change?
Change the default value of `celeborn.<module>.io.mode`.
### How was this patch tested?
CI.
Closes #2994 from SteNicholas/CELEBORN-1774.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix being unable to switch to the replica when Attempt%2 = 1.
### Why are the changes needed?
When Attempt equals 1, `createReader` accesses `location.peer`. Under special circumstances this leads to unexpected behavior. For example, when an exception occurs while fetching data and the peer needs to be used, this logic switches back to the abnormal node again.
<img width="1626" alt="image" src="https://github.com/user-attachments/assets/21c50953-db0f-4717-9b91-e3aeae16ece2">
<img width="1652" alt="image" src="https://github.com/user-attachments/assets/7430786c-26a4-4b3b-be68-f8bdf780c58c">
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Internal test.
Closes #2626 from zhaostu4/switch_replica.
Authored-by: zhangzhao.08 <zhangzhao.08@bytedance.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Retry on `BindException` when starting the master/worker HTTP server.
2. Record the used ports and pre-check whether the selected port is already used or bound before binding.
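Both items can be sketched together with plain sockets. The helper below is a hypothetical illustration, not the test utility changed by this PR: it skips ports already recorded as used and moves on to the next port when binding throws `BindException`.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.util.Set;

// Hypothetical sketch: bind with retry, skipping ports already known to be taken.
final class BindRetrySketch {
    static ServerSocket bindWithRetry(int startPort, int maxRetries, Set<Integer> usedPorts)
            throws IOException {
        BindException last = new BindException("no port was tried");
        for (int i = 0; i <= maxRetries; i++) {
            int port = startPort + i;
            // Pre-check: a port recorded earlier is skipped without a bind attempt.
            if (!usedPorts.add(port)) {
                continue;
            }
            ServerSocket socket = new ServerSocket();
            try {
                socket.bind(new InetSocketAddress("127.0.0.1", port));
                return socket;
            } catch (BindException e) {
                socket.close(); // port is taken by someone else: try the next one
                last = e;
            } catch (IOException e) {
                socket.close();
                throw e;
            }
        }
        throw last;
    }
}
```

Every attempted port is recorded in `usedPorts`, whether the bind succeeds or fails, so a later caller sharing the same set does not pick it again.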
### Why are the changes needed?
To fix flaky tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #2906 from turboFei/retry_master_suite.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
This PR adds support for a NodePort service per master replica, instead of relying only on host networking on the masters when the client is outside of Kubernetes.
### Why are the changes needed?
To better support external access.
### Does this PR introduce _any_ user-facing change?
Added optional fields
### How was this patch tested?
Locally on my cluster.
Closes #2998 from shlomitubul/main.
Authored-by: ShlomiTubul <shlomi.tubul@placer.ai>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Improve some logs, mainly those for checking the commit result and for waiting until partition locations are empty during graceful worker shutdown.
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
Some logs changed.
### How was this patch tested?
Manual test.
Closes #2995 from onebox-li/improve-logs.
Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Add a histogram.
2. Collect critical metrics about fetch chunk.
### Why are the changes needed?
1. To find out the IO pattern of fetch chunk.
2. To have detailed metrics about fetch chunk time.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA and cluster.
<img width="940" alt="Screenshot 2024-12-09 15 42 50" src="https://github.com/user-attachments/assets/9f526103-c162-4607-a031-ba90f42ae83e">
<img width="962" alt="Screenshot 2024-12-09 15 42 56" src="https://github.com/user-attachments/assets/c17822da-0433-4701-b0cc-0887ac970353">
Closes #2983 from FMX/b1766.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
For `RetryReviveTest`, call `shutdownMiniCluster` after each test.
### Why are the changes needed?
Currently, the mini cluster is not shut down after each test.
ca8831e55f/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/RetryReviveTest.scala (L43-L80)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA
Closes #3000 from turboFei/stop_miniCluster.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Fix the `commitInfo` NPE in `LifecycleManagerCommitFilesSuite`, which occurs because not all workers are assigned slots.
2. Add `assert` to the check logic.
### Why are the changes needed?
Errors in CI.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing CI.
Closes #3001 from zaynt4606/clb1778.
Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
In `ShuffleClientImpl`, methods such as `pushData` and `pushMergedData` might be interrupted while sending messages via the `TransportClient`. However, the `InterruptedException` may be ignored, because it is handled like a standard exception. As a result, `ShuffleClientImpl` continues its operation (retry or revive) even when an `InterruptedException` occurs.
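The distinction can be sketched with a generic retry helper. This is an illustration with hypothetical names, not the `ShuffleClientImpl` code: ordinary failures are retried, while an `InterruptedException` restores the interrupt flag and stops immediately.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch: retry ordinary failures, but never retry after an interrupt.
final class InterruptAwarePushSketch {
    static <T> T pushWithRetry(Callable<T> push, int maxAttempts) throws IOException {
        IOException last = new IOException("no push attempt was made");
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return push.call();
            } catch (InterruptedException e) {
                // Restore the flag so callers can see the interrupt, then give up.
                Thread.currentThread().interrupt();
                throw new IOException("push interrupted, giving up", e);
            } catch (Exception e) {
                // Any other failure is treated as retryable in this sketch.
                last = new IOException("push failed on attempt " + attempt, e);
            }
        }
        throw last;
    }
}
```

Catching `InterruptedException` before the generic `Exception` branch is what keeps the interrupt from being swallowed by the retry path.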
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add UT: org.apache.celeborn.client.ShuffleClientSuiteJ#testPushDataAndInterrupted
Closes #2849 from jiang13021/interrupt_in_shuffle_client.
Authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Add Tez integration tests.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #2991 from GH-Gloway/1736.
Authored-by: hongguangwei <hongguangwei@bytedance.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
This reverts commit b65b5433dc.
### What changes were proposed in this pull request?
Revert [CELEBORN-1376](https://github.com/apache/celeborn/pull/2449).
That PR introduces a reference count error when replication is enabled and workers are randomly terminated.
### Why are the changes needed?
When data replication is enabled and workers are randomly terminated, an `IllegalReferenceCountException` (`refCnt: 0, decrement: 1`) is thrown, which fails the task.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Cluster testing.
Closes #2992 from zaynt4606/clbr1376.
Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Non-`IOException`s should also be set in `FlushNotifier`; for example, an `IllegalReferenceCountException` is thrown if a Netty buffer's reference count is incorrect.
Provide a utility to convert non-`IOException`s to `IOException`s.
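Such a conversion can be very small. The sketch below shows one possible shape under the assumption that the original exception should be kept as the cause; the class and method names are hypothetical and are not the utility added in this PR.

```java
import java.io.IOException;

// Hypothetical sketch: normalize any Throwable to an IOException.
final class ExceptionSketch {
    static IOException toIOException(Throwable t) {
        // Pass IOExceptions through unchanged; wrap everything else as the cause.
        return (t instanceof IOException) ? (IOException) t : new IOException(t);
    }
}
```

A notifier that only stores `IOException` can then record any failure, while the original exception stays reachable through `getCause()`.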
### Why are the changes needed?
In some test scenarios where data replication is enabled and workers are randomly terminated, an `IllegalReferenceCountException` is thrown and is not caught.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UT & cluster test.
Closes #2988 from zaynt4606/clb1770-m.
Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix the issue of losing the primary location when parsing `GetReducerFileGroupResponse` from `LifecycleManager`.
### Why are the changes needed?
In previous optimizations, I introduced packed partition locations to reduce the size of RPC calls, based on the assumption that primary partition locations would always be available. However, in some test scenarios where data replication is enabled and workers are randomly terminated, the primary location may be lost while the replica location remains. As a result, the replica locations are ignored, which causes data loss.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA and cluster.
Closes #2986 from FMX/b1769.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove the assert in `setupWorker` for UTs, which might make a UT fail directly.

### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #2984 from zaynt4606/clb1767.
Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>