### What changes were proposed in this pull request?
In this PR, the worker always reports the node as unavailable, regardless of whether graceful shutdown is turned on or off.
### Why are the changes needed?
To inform the master about the shutting-down worker as soon as possible.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1575 from waitinfuture/662.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
In production we have repeatedly seen the push queue get stuck because a PushState's status was not removed, which left DataPushQueue waiting indefinitely to take a task.
Although some of the underlying bugs have been resolved, we should also add a maximum wait time for taking tasks, since we already have the `PUSH_DATA_TIMEOUT` check. If the target worker is really stuck, the task can eventually be retried.
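The idea of a bounded wait can be sketched as follows. This is a minimal illustration, not Celeborn's actual `DataPushQueue`: the class `BoundedTakeSketch`, the method `takeWithTimeout`, and the 50 ms value are hypothetical names chosen for the example. It relies only on `LinkedBlockingQueue.poll(timeout, unit)` from the JDK.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a push queue; not Celeborn's DataPushQueue.
class BoundedTakeSketch {
  private final LinkedBlockingQueue<String> tasks = new LinkedBlockingQueue<>();

  void offer(String task) {
    tasks.offer(task);
  }

  // Waits at most maxWaitMs for a task. Returns null on timeout (or interrupt)
  // so the caller can re-check failure state or retry instead of blocking forever.
  String takeWithTimeout(long maxWaitMs) {
    try {
      return tasks.poll(maxWaitMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return null;
    }
  }

  public static void main(String[] args) {
    BoundedTakeSketch queue = new BoundedTakeSketch();
    queue.offer("push-task-1");
    System.out.println(queue.takeWithTimeout(50)); // a task is available
    System.out.println(queue.takeWithTimeout(50)); // nothing arrives within 50 ms
  }
}
```

With an unbounded `take()`, the second call would block until another task arrived; with a bounded wait, the caller regains control and can decide whether to retry.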
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1552 from AngersZhuuuu/CELEBORN-640.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
When DataPushQueue returns a task, it should always remove that task through the iterator.
Related to 251b923b5bcb19ed1c66.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1568 from AngersZhuuuu/CELEBORN-657.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
### What changes were proposed in this pull request?
Add the dot that is missing after `spark` in the configuration key.
### Why are the changes needed?
Correct the configuration key.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1563 from pan3793/CELEBORN-653.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
This PR upgrades
- `mockito` from 1.10.19 and 3.6.0 to 4.11.0
- `scalatest` from 3.2.3 to 3.2.16
- `mockito-scalatest` from 1.16.37 to 1.17.14
### Why are the changes needed?
Housekeeping, making test dependencies up-to-date and unified.
### Does this PR introduce _any_ user-facing change?
No, it only affects tests.
### How was this patch tested?
Pass GA.
Closes #1562 from pan3793/CELEBORN-650.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
- Replace index-based item access with an iterator for LinkedList.
- Always try to remove a buffer if SendBufferPool has no matching candidate; this reduces the worst-case total number of buffers from `capacity+N-1` to `capacity`.
- Polish some logs and code.
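The first two points can be sketched as follows. This is an illustrative sketch under stated assumptions, not the actual `SendBufferPool` code: the class `IteratorRemovalSketch`, the method `acquire`, the "same length" matching rule, and the choice to evict the oldest buffer are all hypothetical. Index-based `get(i)` on a `LinkedList` walks the list on every call, whereas an iterator visits each node once and `Iterator.remove()` unlinks the current node directly.

```java
import java.util.Iterator;
import java.util.LinkedList;

// Hypothetical buffer pool lookup; not Celeborn's SendBufferPool.
class IteratorRemovalSketch {

  // Returns a pooled buffer of exactly `size` bytes, removing it from the pool.
  // If none matches, evicts the oldest buffer (an assumption of this sketch) and
  // returns null, so a freshly allocated buffer returned later does not push the
  // pool past its capacity.
  static byte[] acquire(LinkedList<byte[]> pool, int size) {
    Iterator<byte[]> it = pool.iterator();
    while (it.hasNext()) {
      byte[] candidate = it.next();
      if (candidate.length == size) {
        it.remove(); // unlink at the iterator's current position
        return candidate;
      }
    }
    if (!pool.isEmpty()) {
      pool.removeFirst(); // no match: still free one slot
    }
    return null;
  }

  public static void main(String[] args) {
    LinkedList<byte[]> pool = new LinkedList<>();
    pool.add(new byte[4]);
    pool.add(new byte[8]);
    System.out.println(acquire(pool, 8).length); // matching buffer is reused
    System.out.println(acquire(pool, 16));       // no match, one buffer evicted
    System.out.println(pool.size());
  }
}
```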
### Why are the changes needed?
Improve performance and logs, reduce memory consumption.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1560 from pan3793/CELEBORN-648.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
`mapStatusRecords` is required in Spark 2 for constructing `MapStatus` when AQE is enabled, but not in Spark 3, so remove it to save memory and compute resources.
This PR also simplifies the `for loop` code.
### Why are the changes needed?
Remove unnecessary variables to save resources and clean up code.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1564 from pan3793/CELEBORN-654.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Rename variable `newAppId` to `appUniqueId` in Spark client.
### Why are the changes needed?
Make the variable name intuitive.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1565 from pan3793/CELEBORN-655.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix a potential NPE when removing push status.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1559 from AngersZhuuuu/CELEBORN-647.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
This PR aims to improve `build/make-distribution.sh` by:
- skipping the build of javadoc and source artifacts
- skipping the build of unnecessary modules
- allowing client modules to be skipped
### Why are the changes needed?
Speed up the packaging process.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with
```
build/make-distribution.sh
```
```
build/make-distribution.sh -Pspark-3.3
```
```
build/make-distribution.sh -Pflink-1.17
```
Closes #1561 from pan3793/CELEBORN-649.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
…ngth
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1519 from zhongqiangczq/mapfilelength.
Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove support for `rss.*` configuration alias
### Why are the changes needed?
The legacy `rss.*` configuration alias was added when Celeborn entered the Apache Incubator, to simplify users' migration from RSS to Celeborn.
Many configuration changes have happened since Celeborn 0.2, so the `rss.*` configuration alias has become less helpful; remove it to clean up the code.
### Does this PR introduce _any_ user-facing change?
Yes, but it's expected; the `rss.*` compatibility was never documented.
### How was this patch tested?
Pass GA.
Closes #1547 from pan3793/CELEBORN-637.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Use Scala 2.12.15 as the default Scala version for Flink.
### Why are the changes needed?
There is a serialization incompatibility between Scala 2.12.7 and Scala 2.12.15/Scala 2.11.12: when different Scala versions are used, the generated serialVersionUID differs, so we may hit deserialization problems in client/server RPC. See [this Scala Users thread](https://users.scala-lang.org/t/serialversionuid-change-between-scala-2-12-6-and-2-12-7/3478/3).
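As background, Java serialization rejects a stream when the sender's and receiver's `serialVersionUID` for a class differ. The sketch below only shows how to inspect the UID that the JVM associates with a class, using `java.io.ObjectStreamClass`; the classes `PinnedMessage` and `DefaultMessage` are hypothetical and are not Celeborn RPC messages. This PR aligns the Scala version rather than pinning UIDs, so the explicit field is shown purely for contrast.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

class SerialVersionUidSketch {

  // Hypothetical message with an explicitly pinned UID: stable across compilers.
  static class PinnedMessage implements Serializable {
    private static final long serialVersionUID = 1L;
    String payload = "hello";
  }

  // Hypothetical message relying on the default UID, which is computed from the
  // class structure and can change when the compiler emits different members.
  static class DefaultMessage implements Serializable {
    String payload = "hello";
  }

  // Returns the serialVersionUID the JVM will use for the given Serializable class.
  static long uidOf(Class<?> cls) {
    return ObjectStreamClass.lookup(cls).getSerialVersionUID();
  }

  public static void main(String[] args) {
    System.out.println(uidOf(PinnedMessage.class));  // the pinned value
    System.out.println(uidOf(DefaultMessage.class)); // computed from the class shape
  }
}
```

Printing the UID of the same class on the client and on the server is one way to confirm whether a mismatch like the one described above is present.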

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually tested using a Flink Scala 2.12.7 runtime with a Celeborn Flink client compiled with Scala 2.12.15.
Closes #1553 from RexXiong/CELEBORN-641.
Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
### What changes were proposed in this pull request?
Refine the logic here to make it easier to understand.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1555 from AngersZhuuuu/CELEBORN-645.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
If we hit an unexpected exception, `getPushDataFailCause` throws an NPE and breaks the process of reviving and removing push states. We should handle the NPE here.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1551 from AngersZhuuuu/CELEBORN-639.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
`SimpleDateFormat` is not thread-safe; replace it with the thread-safe `FastDateFormat`.
### Why are the changes needed?
`FastDateFormat` is a fast and thread-safe version of `java.text.SimpleDateFormat`.
### Does this PR introduce _any_ user-facing change?
Yes, it's a bug fix.
### How was this patch tested?
Manual review.
Closes #1545 from pan3793/CELEBORN-636.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
### What changes were proposed in this pull request?
Adapt the Spark DRA patch for Spark 3.4.
### Why are the changes needed?
To support enabling DRA w/ Celeborn on Spark 3.4
### Does this PR introduce _any_ user-facing change?
Yes, this PR provides a DRA patch for Spark 3.4
### How was this patch tested?
Compiled with Spark 3.4
Closes #1546 from FMX/CELEBORN-619.
Lead-authored-by: Ethan Feng <ethanfeng@apache.org>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
### What changes were proposed in this pull request?
The existence of Pluginconf makes it hard to understand why Celeborn needs that config class.
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
UT.
Closes #1524 from FMX/CELEBORN-610.
Authored-by: Ethan Feng <ethanfeng@apache.org>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
### What changes were proposed in this pull request?
Add Kubernetes Integration Test
- [x] test helm install deploy
- [ ] test shuffle
### Why are the changes needed?
Add integration test
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI test.
Closes #1484 from zwangsheng/CELEBORN-105.
Authored-by: zwangsheng <2213335496@qq.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Fix columnar shuffle codegen exception. This is a refactoring of #1523.
Closes #1543 from kerwin-zk/issue-620.
Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
### What changes were proposed in this pull request?
Add doc about enabling rack-awareness.
### Why are the changes needed?
Document new features.
### Does this PR introduce _any_ user-facing change?
Yes, the docs are updated.
### How was this patch tested?
<img width="1085" alt="Screenshot 2023-06-02 3.19.10 PM" src="https://github.com/apache/incubator-celeborn/assets/46485123/c8c51a4c-40be-40ea-befd-3c369b9f7600">
Closes #1536 from AngersZhuuuu/CELEBORN-629.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Adapt the Spark DRA patch for Spark 3.4.
### Why are the changes needed?
To support enabling DRA w/ Celeborn on Spark 3.4
### Does this PR introduce _any_ user-facing change?
Yes, this PR provides a DRA patch for Spark 3.4
### How was this patch tested?
Compiled with Spark 3.4
Closes #1529 from FMX/CELEBORN-619.
Authored-by: Ethan Feng <ethanfeng@apache.org>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
### What changes were proposed in this pull request?
Refine this doc because:
1. It didn't mention that our cluster's default RPC type is `NETTY`.
2. If a user runs the ratis shell with `GRPC` without knowing that the ratis cluster uses `NETTY`, the error is unclear and hard to debug.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1542 from AngersZhuuuu/CELEBORN-623-FOLLOWUP.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Ratis-shell uses GRPC by default, while Celeborn supports Netty for Ratis; if `raft.rpc.type` is not specified, commands may fail.
e.g.
```
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 14.947369960s. [closed=[], open=[[buffered_nanos=14962358255, waiting_for_connection]]]
```
So I think we should update the document to mention how to change the RPC type in `celeborn-ratis`.
### Why are the changes needed?
Improve user experience
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test.
Closes #1530 from onebox-li/ratis-shell-default-rpc.
Lead-authored-by: liyihe <liyihe@bigo.sg>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Introduce PR merge script `dev/merge_pr.py`, which is borrowed from Apache Spark
### Why are the changes needed?
This script simplifies the PR merge procedure
- auto backport to release branches
- auto close the JIRA ticket
- auto fill in the JIRA fixed version
- preserve the PR description in git log
- preserve the author and committer in git log
### Does this PR introduce _any_ user-facing change?
No, it's for committers.
### How was this patch tested?
a1de16a80f was merged by this tool
Closes #1539 from pan3793/CELEBORN-633.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Lock the flushBuffer field and the flush method to ensure thread-safe access.
### Why are the changes needed?
At stage end, the worker commits files and closes the file writers, but a speculative task may still push data to a file writer. If the push task has incremented numPendingWrites, the commit thread, which holds the file writer's object lock, must wait for the pending writes to drop to 0. However, the push thread needs the same file writer object lock to decrement numPendingWrites, which causes a deadlock.
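One way to avoid this kind of deadlock can be sketched as follows. This is a simplified illustration under assumptions, not the actual file writer code: the class `FlushLockSketch`, the `flushLock` object, and the use of `AtomicInteger` and `StringBuilder` are hypothetical. The point is that the lock guards only the flush buffer, while the pending-write counter is updated without it, so a thread waiting for the counter to reach 0 never blocks the decrement.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical writer; not Celeborn's FileWriter.
class FlushLockSketch {
  private final Object flushLock = new Object();            // guards the buffers only
  private final StringBuilder flushBuffer = new StringBuilder();
  private final StringBuilder flushed = new StringBuilder();
  private final AtomicInteger numPendingWrites = new AtomicInteger();

  void incrementPendingWrites() {
    numPendingWrites.incrementAndGet();
  }

  // Push thread: append under the narrow lock, then decrement lock-free.
  void write(String data) {
    synchronized (flushLock) {
      flushBuffer.append(data);
    }
    numPendingWrites.decrementAndGet();
  }

  // Commit thread: wait for pending writes WITHOUT holding flushLock, then flush.
  String close() {
    while (numPendingWrites.get() > 0) {
      Thread.yield();
    }
    synchronized (flushLock) {
      flushed.append(flushBuffer);
      flushBuffer.setLength(0);
      return flushed.toString();
    }
  }

  public static void main(String[] args) {
    FlushLockSketch writer = new FlushLockSketch();
    writer.incrementPendingWrites();              // a push is announced
    new Thread(() -> writer.write("batch-1")).start();
    System.out.println(writer.close());           // returns once the push has landed
  }
}
```

If `close()` instead held a writer-wide lock while waiting, and `write()` needed that same lock to decrement the counter, the two threads would wait on each other indefinitely, which is the situation described above.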
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes #1534 from RexXiong/CELEBORN-626.
Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
…link version
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1537 from zhongqiangczq/release-content.
Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>