### What changes were proposed in this pull request?
As Title
### Why are the changes needed?
As Title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1821 from jiaoqingbo/fixtypo-doc.
Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
After obtaining the results of reviveBatch, determine whether it contains the corresponding partitionId.
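A minimal sketch of this kind of check, assuming the revive results are held in a map keyed by partition id. The class `ReviveResultCheck`, the method `statusFor`, and the `MISSING` sentinel are hypothetical names for illustration and are not taken from the Celeborn code base.

```java
import java.util.Map;

// Hypothetical illustration; not the actual Celeborn client code.
final class ReviveResultCheck {
  // Sentinel returned when the results have no entry for the partition.
  static final int MISSING = -1;

  // Unboxing a null Integer (for example `int s = results.get(id)`) throws a
  // NullPointerException, so check for the key before reading the value.
  static int statusFor(Map<Integer, Integer> results, int partitionId) {
    if (results == null || !results.containsKey(partitionId)) {
      return MISSING;
    }
    Integer status = results.get(partitionId);
    return status == null ? MISSING : status;
  }
}
```

A caller would treat `MISSING` as "no result for this partition" instead of dereferencing the lookup directly.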
### Why are the changes needed?
Otherwise, it may cause an NPE in some versions of JDK 8. The decompilation result is as follows:

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Through existing UTs.
Closes #1819 from lyy-pineapple/fix-npe.
Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
Recently, I came across an issue in the SBT CI process that can result in failure due to the `NoClassDefFoundError` exception.
```
[error] Uncaught exception when running org.apache.celeborn.common.unsafe.PlatformUtilSuite: java.lang.NoClassDefFoundError: org/hamcrest/SelfDescribing
[error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: org/hamcrest/SelfDescribing
[error] at java.lang.ClassLoader.defineClass1(Native Method)
[error] at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
[error] at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
[error] at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
[error] at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
[error] at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
[error] at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
[error] at java.security.AccessController.doPrivileged(Native Method)
[error] at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
[error] at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
[error] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
[error] at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
[error] at org.junit.runner.Computer.getSuite(Computer.java:28)
[error] at org.junit.runner.Request.classes(Request.java:77)
[error] at org.junit.runner.Request.classes(Request.java:92)
[error] at com.novocode.junit.JUnitTask.execute(JUnitTask.java:52)
[error] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at java.lang.Thread.run(Thread.java:750)
[error] Caused by: sbt.ForkMain$ForkError: java.lang.ClassNotFoundException: org.hamcrest.SelfDescribing
[error] at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
[error] at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
[error] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
[error] at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
[error] at java.lang.ClassLoader.defineClass1(Native Method)
[error] at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
[error] at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
[error] at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
[error] at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
[error] at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
[error] at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
[error] at java.security.AccessController.doPrivileged(Native Method)
[error] at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
```
Upon further investigation, I found that the root cause is that SBT is sometimes unable to resolve Maven dependencies cached within GA.
```shell
./build/sbt "show celeborn-common/update"
```
```
[info] org.hamcrest:hamcrest-core:1.3:default: (MISSING) Artifact(hamcrest-core, jar, jar, None, Vector(), Some(file:/home/runner/.m2/repository/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar), Map(), None, false)
```
This PR addresses the random issue by disabling the Maven cache for SBT CI.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
https://github.com/apache/incubator-celeborn/pull/1797 passed GA after the Maven cache was disabled.
Closes #1818 from cfmcgrady/sbt-ci.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
Consolidate all SBT dependencies into a global object `Dependencies`, similar to Maven's dependencyManagement, to improve dependency management.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1802 from cfmcgrady/sbt-dependencies.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
I tested 1.1T and 3.3T shuffle, as well as 3T TPCDS, with the thread cache on and off in the shared `PooledByteBufAllocator`, and found no difference:

| Benchmark | Cache On | Cache Off |
| --- | --- | --- |
| 1.1T Shuffle | 3.7min/1.9min | 3.7min/1.9min |
| 3.3T Shuffle | 12min/6.7min | 12min/6.2min |
| 3T TPCDS | 2645s | 2644s |

Since this configuration has a large impact on direct memory usage (see https://github.com/apache/incubator-celeborn/pull/1716), it is necessary to set the default value to false.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1817 from waitinfuture/897.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1816 from jiaoqingbo/typo-conf-followup.
Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1768 from AngersZhuuuu/CELEBORN-847.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix typo in CelebornConf
### Why are the changes needed?
Fix typo in CelebornConf
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Passing GA
Closes #1813 from jiaoqingbo/typo-conf.
Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix statistics error of commitFiles method
res1 should be res2
### Why are the changes needed?
Fix statistics error of commitFiles method
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
passing GA
Closes #1809 from jiaoqingbo/892.
Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
This PR fixes a bug that may cause data loss in rare cases.
### Why are the changes needed?
I received a bug report from a user that, in an extreme case, a small amount of data is lost. I reproduced the bug under the following conditions:
1. Shuffle data size for one partition id is relatively large, for example 400GB
2. `celeborn.client.shuffle.partitionSplit.mode` is set to HARD
3. `celeborn.client.shuffle.batchHandleCommitPartition.enabled` is enabled
Meanwhile, the worker's log contains warning messages:
```
23/08/11 17:10:04,501 WARN [push-server-6-44] PushDataHandler: Append data failed for task(shuffle application_1691635581416_0021-0, map 746, attempt 0), caused by AlreadyClosedException, endedAttempt -1, error message: FileWriter has already closed!, fileName /mnt/disk1/celeborn-worker/shuffle_data/application_1691635581416_0021/0/0-107-0
23/08/11 17:12:04,445 WARN [push-server-6-35] PushDataHandler: Append data failed for task(shuffle application_1691635581416_0021-0, map 3016, attempt 0), caused by AlreadyClosedException, endedAttempt -1, error message: FileWriter has already closed!, fileName /mnt/disk3/celeborn-worker/shuffle_data/application_1691635581416_0021/0/0-356-0
```

After digging into it, I found that the data loss happens as follows:
1. For some partition id on some worker, the file size exceeds `celeborn.client.shuffle.partitionSplit.threshold`, so `CommitManager` in `LifecycleManager` triggers `CommitFiles` because `batchHandleCommitPartition` is enabled.
2. Before `CommitFiles` finishes, `PushDataHandler` receives `PushData` or `PushMergedData`, finds that the partition has not been committed yet, and prepares to call `fileWriter.incrementPendingWrites()` and `callback.onSuccess`.
3. Before `PushDataHandler` calls `fileWriter.incrementPendingWrites()`, `CommitFiles` finishes and the FileWriter closes successfully.
4. `PushDataHandler` then calls `fileWriter.incrementPendingWrites()` and `callback.onSuccess`. From this point on, `ShuffleClient` considers the `PushData` successful. However, when `PushDataHandler` calls `fileWriter.write()`, it finds the writer already closed and throws the exception above. Because the exception is ignored, the data is lost.

This PR fixes this by checking whether the FileWriter has been closed after calling `incrementPendingWrites`. If so, `PushDataHandler` calls `onFailure`.
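The ordering of the check can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual patch: `PushAfterCloseSketch`, `Writer`, `Callback`, and `handlePush` are hypothetical stand-ins, and the sketch leaves out the write itself and how closing interacts with pending writes.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-ins; not Celeborn's FileWriter or PushDataHandler.
final class PushAfterCloseSketch {
  interface Callback {
    void onSuccess();
    void onFailure(Throwable cause);
  }

  static final class Writer {
    private final AtomicInteger pendingWrites = new AtomicInteger();
    private volatile boolean closed;

    void incrementPendingWrites() { pendingWrites.incrementAndGet(); }
    void decrementPendingWrites() { pendingWrites.decrementAndGet(); }
    boolean isClosed() { return closed; }
    void close() { closed = true; }
    int pendingWrites() { return pendingWrites.get(); }
  }

  // Returns true if the push was acknowledged as successful.
  static boolean handlePush(Writer writer, Callback callback) {
    writer.incrementPendingWrites();
    // Re-check after registering the pending write: a commit may have closed
    // the writer between the earlier "not committed" check and this point.
    if (writer.isClosed()) {
      writer.decrementPendingWrites();
      callback.onFailure(new IllegalStateException("FileWriter has already closed!"));
      return false;
    }
    callback.onSuccess();
    // The actual write and the matching decrement would follow; omitted here.
    return true;
  }
}
```

The point of the sketch is only that the success callback is not invoked until the closed state has been re-read after the pending-write counter was incremented.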
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1808 from waitinfuture/890.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Keep the ReleaseSlots RPC to make sure that a 0.3 client can work with 0.3.1-SNAPSHOT and 0.4.0-SNAPSHOT.
This PR needs to be merged into main and branch-0.3.
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA and cluster.
Closes #1794 from FMX/CELEBORN-846-FOLLOWUP.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
I find it a little difficult to use `celeborn-daemon.sh` to get instance status, so I polished the usage and fixed `--config` loading.
### Why are the changes needed?
Ditto
### Does this PR introduce _any_ user-facing change?
Polish the `celeborn-daemon.sh` usage
### How was this patch tested?
Manual test.
Closes #1805 from onebox-li/improve-script.
Lead-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Leo Li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1806 from cfmcgrady/sbt-docs-followup.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
Yes, the thread local cache of shared `PooledByteBufAllocator` can be disabled by setting `celeborn.network.memory.allocator.allowCache=false`
### How was this patch tested?
Pass GA
Closes #1716 from cfmcgrady/allow-cache.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Shade RoaringBitmap to avoid dependency conflicts.
### Why are the changes needed?
Some users report that the Celeborn client introduces RoaringBitmap conflicts.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA and cluster.
Closes #1803 from FMX/CELEBORN-885.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test
Closes #1795 from cfmcgrady/sbt-docs.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
1. Expose the config check logic during `MemoryManager#initialization` in the user configuration doc.
2. Add Preconditions Error Message
3. Add unit test to make sure that part of the logic isn't altered by mistake
### Why are the changes needed?
User-friendly
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Add Unit Test
Closes #1801 from zwangsheng/CELEBORN-883.
Authored-by: zwangsheng <2213335496@qq.com>
Signed-off-by: zwangsheng <2213335496@qq.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
The plugin may generate unexpected source files in the project root directory. We need to refine this feature if we want to generate Javadoc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1799 from cfmcgrady/sbt-compiler-plugin.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
1. Log offer slots results from LifecycleManager.
2. Log change partition results from LifecycleManager.
3. Log reserve slots results.
4. Log fetch file group failure instead of data lost.
### Why are the changes needed?
If data loss happens, we need to find out which worker caused the failure, so we need to check the reserve slots results from LifecycleManager.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA.
Closes #1798 from FMX/CELEBORN-876.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Wrap IOException in PartitionUnRetryAbleException when fetching.
2. Improve log messages for open stream and read data errors.
### Why are the changes needed?
Opening a stream can encounter many different IOExceptions, such as NoSuchFileException, FileNotFoundException, and FileCorruptedException. These checked exceptions should be wrapped in PartitionUnRetryAbleException so that the client can choose to regenerate the data.
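A minimal sketch of the wrapping pattern. The nested `PartitionUnRetryAbleException` here is a local stand-in whose supertype and constructor are assumptions, and `OpenStreamWrapSketch`, `StreamOpener`, and `openStream` are hypothetical names rather than Celeborn APIs.

```java
import java.io.IOException;

// Hypothetical illustration of wrapping I/O failures into one exception type.
final class OpenStreamWrapSketch {
  // Local stand-in; the real class's supertype and constructors may differ.
  static final class PartitionUnRetryAbleException extends IOException {
    PartitionUnRetryAbleException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Hypothetical callback that performs the actual open.
  interface StreamOpener {
    void open(String fileName) throws IOException;
  }

  // Any IOException raised while opening is rethrown as the unretryable type,
  // keeping the original exception as the cause for logging.
  static void openStream(StreamOpener opener, String fileName)
      throws PartitionUnRetryAbleException {
    try {
      opener.open(fileName);
    } catch (IOException e) {
      throw new PartitionUnRetryAbleException(
          "Failed to open stream for " + fileName + ": " + e, e);
    }
  }
}
```

With a single exception type, the caller needs only one catch branch to decide whether to regenerate the data.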
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT & Manual test
Closes #1796 from RexXiong/CELEBORN-878-IO-Exception.
Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Add a config to limit the maximum number of workers when offering slots. The config can be set on both the server side and the client side. Celeborn chooses the smaller positive value of the client and master configs.
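One way to read "the smaller positive value" is sketched here. The helper `effectiveMaxWorkers` is hypothetical, and treating a non-positive value as "no limit on that side" is an assumption of this sketch, not something stated in the PR.

```java
// Hypothetical helper; the real option names and resolution code may differ.
final class MaxWorkersSketch {
  // Assumption: a non-positive value means "no limit set on that side".
  // Returns the smaller positive limit, capped by the workers available.
  static int effectiveMaxWorkers(int clientMax, int masterMax, int availableWorkers) {
    int limit = availableWorkers;
    if (clientMax > 0) {
      limit = Math.min(limit, clientMax);
    }
    if (masterMax > 0) {
      limit = Math.min(limit, masterMax);
    }
    return limit;
  }
}
```

For example, under these assumptions `effectiveMaxWorkers(100, 500, 1000)` yields 100, and `effectiveMaxWorkers(0, 500, 1000)` yields 500.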
### Why are the changes needed?
For large Celeborn clusters, users may want to limit the number of workers that a shuffle can spread across. The reasons are:
1. One worker failure will not affect all applications.
2. One huge shuffle will not affect all applications.
3. It is more efficient to limit a shuffle to a restricted number of workers, say 100, than to spread it across a large number of workers, say 1000, because the number of network connections for pushing data is `number of ShuffleClient` * `number of allocated Workers`.

The recommended number of Workers depends on the workload and Worker hardware, and it can be configured per application, so it is relatively flexible.
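To make the connection-count argument concrete, here is a small arithmetic sketch. The figure of 2,000 ShuffleClients used in the note after the code is a made-up example, and `PushConnectionsSketch` is a hypothetical name.

```java
// Hypothetical arithmetic helper for the connection-count formula.
final class PushConnectionsSketch {
  // connections = number of ShuffleClient * number of allocated Workers
  static long pushConnections(long shuffleClients, long allocatedWorkers) {
    // multiplyExact throws ArithmeticException on overflow instead of wrapping.
    return Math.multiplyExact(shuffleClients, allocatedWorkers);
  }
}
```

With an assumed 2,000 ShuffleClients, 100 allocated Workers give 200,000 push connections, while 1,000 allocated Workers give 2,000,000.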
### Does this PR introduce _any_ user-facing change?
No, added a new configuration.
### How was this patch tested?
Added ITs and passes GA.
Closes #1790 from waitinfuture/152.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Merge OpenStream and StreamHandler to transport messages to enhance celeborn's compatibility.
### Why are the changes needed?
1. Improve flexibility to change RPC.
2. Compatible with 0.2 client.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
UT and cluster.
Closes #1750 from FMX/CELEBORN-760.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1791 from cfmcgrady/enrich-fetch-log.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Support worker recovery after a crash when the worker has graceful shutdown enabled.
1. Persist committed file info to LevelDB.
2. Load LevelDB when the worker starts.
3. Clean expired file infos in LevelDB.
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA and cluster. After testing on a cluster, I found that 8k file infos consume about 2MB of disk space, and the disk space can be reclaimed soon after the shuffle expires.
Closes #1779 from FMX/CELEBORN-863.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA
Closes #1792 from waitinfuture/712-fu.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1784 from kerwin-zk/gluten_celeborn.
Lead-authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Co-authored-by: Kerwin Zhang <xiyu.zk@alibaba-inc.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
Reduce duplicate code segments and improve code readability and maintainability.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit Test
Closes #1786 from zwangsheng/CELEBORN-872.
Authored-by: zwangsheng <2213335496@qq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Replace `HashMap` with `ConcurrentHashMap` for `partitionBatchIdMap` to ensure thread safety during parallel data processing.
2. Put the partition id and batch id into `partitionBatchIdMap` before adding the task, to prevent a possible NPE.
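The two changes can be sketched together as follows. This is an illustration, not the test suite's code: `PartitionBatchIdSketch`, `addTask`, and `takeBatchId` are hypothetical names, and a `LinkedBlockingQueue` stands in for the real push queue.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of "record in the map first, then publish the task".
final class PartitionBatchIdSketch {
  // Thread-safe map shared between the producer and the pusher thread.
  final Map<Integer, Integer> partitionBatchIdMap = new ConcurrentHashMap<>();
  // Stand-in for the push queue; each element is a partition id.
  final BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();

  // Record the batch id first, then enqueue, so a consumer that takes the
  // task from the queue always finds the matching entry in the map.
  void addTask(int partitionId, int batchId) throws InterruptedException {
    partitionBatchIdMap.put(partitionId, batchId);
    queue.put(partitionId);
  }

  // Consumer side: the lookup is non-null for any task taken from the queue.
  int takeBatchId() throws InterruptedException {
    int partitionId = queue.take();
    return partitionBatchIdMap.get(partitionId);
  }
}
```

If the order in `addTask` were reversed, a consumer could take the task before the map entry exists, and unboxing the null lookup would throw a NullPointerException.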
### Why are the changes needed?
To fix an NPE:
https://github.com/apache/incubator-celeborn/actions/runs/5734532048/job/15540863715?pr=1785
```
Exception in thread "DataPusher-0" java.lang.NullPointerException
at org.apache.celeborn.client.write.DataPushQueueSuiteJ$1.pushData(DataPushQueueSuiteJ.java:121)
at org.apache.celeborn.client.write.DataPusher$1.run(DataPusher.java:125)
Error: The operation was canceled.
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1789 from cfmcgrady/celeborn-875-followup.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Add active connections count metrics to grafana dashboard.
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
Yes, new metric chart in the grafana dashboard.
### How was this patch tested?
Cluster.
Closes #1783 from FMX/CELEBORN-852.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1788 from waitinfuture/869-fu.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1787 from waitinfuture/869.
Lead-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1782 from waitinfuture/864.
Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Adding new metrics to record the number of registered connections
### Why are the changes needed?
Monitor the number of active connections on worker nodes
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
no
Closes #1773 from JQ-Cao/852.
Authored-by: caojiaqing <caojiaqing@bilibili.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Fall back in the following order:
1. `usableDisks` is empty (no need to call iter)
2. In the replicate case, fall back fast when `usableDisks` has only one element
3. Count distinct workers
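The order of the checks can be sketched as below. This is a hedged reading of the list above: `FallbackOrderSketch` and `shouldFallback` are hypothetical names, each usable disk is represented by the id of the worker that owns it, and the threshold of "fewer than two distinct workers" in the replicate case is an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration of ordering cheap checks before the expensive one.
final class FallbackOrderSketch {
  // workerOfUsableDisk holds, for each usable disk, the id of its worker.
  static boolean shouldFallback(List<String> workerOfUsableDisk, boolean shouldReplicate) {
    // 1. No usable disks: decide without iterating.
    if (workerOfUsableDisk.isEmpty()) {
      return true;
    }
    if (!shouldReplicate) {
      return false;
    }
    // 2. Replication with a single usable disk: fast fallback.
    if (workerOfUsableDisk.size() == 1) {
      return true;
    }
    // 3. Only now count distinct workers (assumed threshold: fewer than two).
    return new HashSet<>(workerOfUsableDisk).size() < 2;
  }
}
```

The distinct-worker count is the only step that walks the whole collection, which is why it comes last.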
### Why are the changes needed?
Make the logic here clearer.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit Test
Closes #1781 from zwangsheng/CELEBORN-868.
Authored-by: zwangsheng <2213335496@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
This PR adds new GitHub Actions workflows to enable Continuous Integration using SBT based on #1764
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1771 from cfmcgrady/sbt-ci.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
This PR adds packaging and testing support for Flink-related modules using SBT based on #1757
### Why are the changes needed?
Improve project build speed.
Running flink-it tests with `-Pflink-1.14`:
```shell
sbt:celeborn> project flink-it
sbt:flink-it> clean
sbt:flink-it> test
[success] Total time: 136 s (02:16), completed 2023-7-27 11:55:10
```
Running flink-it tests with `-Pflink-1.17`:
```shell
$ ./build/sbt -Pflink-1.17
sbt:celeborn> project flink-it
sbt:flink-it> clean
sbt:flink-it> test
[success] Total time: 168 s (02:48), completed 2023-7-27 11:28:35
```
Packaging and shading the Flink 1.14 client:
```shell
$ ./build/sbt -Pflink-1.14
sbt:celeborn> clean
sbt:celeborn> project celeborn-client-flink-1_14-shaded
sbt:celeborn-client-flink-1_14-shaded> assembly
[success] Total time: 35 s, completed 2023-7-27 11:51:54
```
Packaging and shading the Flink 1.17 client:
```shell
$ ./build/sbt -Pflink-1.17
sbt:celeborn> clean
sbt:celeborn> project celeborn-client-flink-1_17-shaded
sbt:celeborn-client-flink-1_17-shaded> assembly
[success] Total time: 39 s, completed 2023-7-27 11:49:20
```
### Does this PR introduce _any_ user-facing change?
yes
### How was this patch tested?
tested locally
Closes #1764 from cfmcgrady/sbt-flink.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Passes GA
Closes #1776 from waitinfuture/method.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1778 from waitinfuture/860-1.
Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1775 from waitinfuture/853.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
1. This PR proposes renaming the class `DataPushQueueSuitJ` to `DataPushQueueSuiteJ` so that it is picked up by the test suite. This change is required to comply with our maven-surefire-plugin rule:
5f0295e9f3/pom.xml (L543-L551)
2. Fix a potential logic bug in the test: tasks within `DataPushQueue` may inadvertently be consumed by the `DataPusher`'s built-in thread `DataPusher-${taskId}`, leading to test suite failures.


### Why are the changes needed?
fix DataPushQueueSuiteJ bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1774 from cfmcgrady/refine-data-push-queue-suite.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
This PR introduces an SBT build system implementation that operates independently of the current Maven build system. Unlike https://github.com/apache/incubator-celeborn/pull/1627, the current implementation does not depend on `pom.xml`.
The implementation enables packaging and testing for server-related modules and Spark-related modules using SBT.
Flink-related build/test, SBT build documentation, continuous integration, and plugins will be submitted in separate PRs.
### Why are the changes needed?
Improve project build speed.
Packaging the project:
```shell
$ ./build/sbt
sbt:celeborn> clean
[success] Total time: 1 s, completed 2023-7-25 16:36:12
sbt:celeborn> package
[success] Total time: 28 s, completed 2023-7-25 16:36:46
```
Packaging and shading the Spark 3.3 client:
```shell
$ ./build/sbt -Pspark-3.3
sbt:celeborn> clean
[success] Total time: 1 s, completed 2023-7-25 16:39:11
sbt:celeborn> project celeborn-client-spark-3-shaded
sbt:celeborn-client-spark-3-shaded> assembly
[success] Total time: 37 s, completed 2023-7-25 16:40:03
```
Packaging and shading the Spark 2.4 client:
```shell
$ ./build/sbt -Pspark-2.4
sbt:celeborn> clean
[success] Total time: 1 s, completed 2023-7-25 16:41:06
sbt:celeborn> project celeborn-client-spark-2-shaded
sbt:celeborn-client-spark-2-shaded> assembly
[success] Total time: 36 s, completed 2023-7-25 16:41:53
```
Running server-related tests:
```shell
$ ./build/sbt clean test
[success] Total time: 350 s (05:50), completed 2023-7-25 16:48:58
```
### Does this PR introduce _any_ user-facing change?
yes
### How was this patch tested?
tested locally
Closes #1757 from cfmcgrady/pure-sbt.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1769 from waitinfuture/834.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1772 from waitinfuture/849.
Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1770 from AngersZhuuuu/CELEBORN-851.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1731 from AngersZhuuuu/CELEBORN-808.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
CELEBORN-791 removed sending the ReleaseSlotsRequest from worker, so Master is not required to handle it.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1767 from AngersZhuuuu/CELEBORN-846.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1759 from AngersZhuuuu/CELEBORN-832.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>