### What changes were proposed in this pull request?
Decrease the sort memory counter after the sorting procedure completes.
### Why are the changes needed?
Fix incorrect sort memory counter.
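A minimal sketch of the accounting pattern described here, assuming the counter is reserved before the sort and released in a `finally` block. `SortMemoryTracker`, `sortMemoryCounter`, and `sortWithAccounting` are hypothetical names, not Celeborn's actual code.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; class and field names are not taken from Celeborn.
class SortMemoryTracker {
  // Bytes currently reserved by in-progress sorts.
  static final AtomicLong sortMemoryCounter = new AtomicLong();

  static int[] sortWithAccounting(int[] data) {
    long reserved = (long) data.length * Integer.BYTES;
    sortMemoryCounter.addAndGet(reserved);
    try {
      int[] copy = Arrays.copyOf(data, data.length);
      Arrays.sort(copy);
      return copy;
    } finally {
      // Release the reservation whether the sort succeeds or throws.
      sortMemoryCounter.addAndGet(-reserved);
    }
  }
}
```

With this shape the counter returns to its previous value once the sort finishes, so it cannot drift upward over time.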
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
UT.
Closes #1766 from FMX/CELEBORN-845.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1765 from AngersZhuuuu/CELEBORN-844.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
Improves clarity and readability.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass CI
Closes #1763 from cfmcgrady/celeborn-822-followup.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Add an option to pass in a custom maven installation, similar to how [Spark does it](https://github.com/apache/spark/blob/master/dev/make-distribution.sh#L65).
### Why are the changes needed?
We need this internally as some of our machines may not have access to external Maven.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
ran make-distribution.sh to make sure it worked.
Closes #1761 from akpatnam25/CELEBORN-838.
Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
To suppress all deprecation-related warnings during compilation, for example:
```
class OpenStream in package protocol is deprecated
val openStream = msg.asInstanceOf[OpenStream]
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
tested locally
Closes #1760 from cfmcgrady/silence-deprecated.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Passes GA.
Closes #1758 from jiaoqingbo/CELEBORN-835.
Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
Remove Unused code
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Passes GA.
Closes #1753 from jiaoqingbo/CELEBORN-833.
Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Pass the exit kind to each component. Depending on the exit kind:
- GRACEFUL_SHUTDOWN: behaves as the original code did when `graceful == true`.
- Others: clean up the LevelDB file.
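The dispatch described above can be sketched as follows. Only `GRACEFUL_SHUTDOWN` is named in this PR; `IMMEDIATE_EXIT`, `RecoverableComponent`, and the recorded action strings are invented for illustration, and the assumption that the graceful path keeps persisted state is mine.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: only GRACEFUL_SHUTDOWN comes from the PR text.
enum ExitKind { GRACEFUL_SHUTDOWN, IMMEDIATE_EXIT }

class RecoverableComponent {
  // Records what the component did, so the behavior is observable.
  final List<String> actions = new ArrayList<>();

  void close(ExitKind kind) {
    if (kind == ExitKind.GRACEFUL_SHUTDOWN) {
      // Assumed graceful behavior: keep persisted state for recovery.
      actions.add("persist-state");
    } else {
      // Any other exit kind: remove the on-disk state.
      actions.add("clean-db-file");
    }
  }
}
```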
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1748 from AngersZhuuuu/CELEBORN-819.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1756 from AngersZhuuuu/CELEBORN-656-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1755 from AngersZhuuuu/CELEBORN-656-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1754 from waitinfuture/831.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #1752 from waitinfuture/826.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
<img width="1610" alt="Screenshot 2023-07-24 11 34 43 AM" src="https://github.com/apache/incubator-celeborn/assets/46485123/ba1b040b-9ea4-4c93-b055-75a469365ff2">
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1751 from AngersZhuuuu/CELEBORN-828.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Eliminate `chunksBeingTransferred` calculation when `celeborn.shuffle.io.maxChunksBeingTransferred` is not configured
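A minimal sketch of the short-circuit, assuming the threshold is modeled as an `OptionalLong` that is empty when the config is not set. `TransferLimiter` and `exceedsLimit` are hypothetical names, not the actual Celeborn code.

```java
import java.util.OptionalLong;
import java.util.function.LongSupplier;

// Hypothetical sketch; the real check lives in Celeborn's fetch path.
class TransferLimiter {
  static boolean exceedsLimit(OptionalLong maxChunks, LongSupplier chunksBeingTransferred) {
    if (!maxChunks.isPresent()) {
      // No threshold configured: skip the expensive calculation entirely.
      return false;
    }
    return chunksBeingTransferred.getAsLong() > maxChunks.getAsLong();
  }
}
```

Passing the calculation as a supplier means it is only evaluated when a threshold exists.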
### Why are the changes needed?
I observed high CPU usage on `ChunkStreamManager#chunksBeingTransferred` calculation. We can eliminate the method call if no threshold is configured, and investigate how to improve the method itself in the future.
<img width="1947" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/412c6a41-c0ce-440c-ae99-4424cb8702d3">
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI and Review.
Closes #1749 from pan3793/CELEBORN-827.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #1747 from waitinfuture/824.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #1746 from waitinfuture/823.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #1745 from waitinfuture/822.
Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1742 from AngersZhuuuu/CELEBORN-820.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
As title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1744 from cfmcgrady/junit.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1741 from AngersZhuuuu/CELEBORN-788-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Inside `ShuffleClient.submitRetryPushData`, switch to the latest PartitionLocation before retrying the push.
### Why are the changes needed?
Before this PR, inside `ShuffleClient.submitRetryPushData`, push data will use the previous PartitionLocation,
which is incorrect, and may cause inefficiency in some cases.
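A minimal sketch of the idea, assuming the client keeps a map from partition id to its most recent location. `RetryPushSketch`, `latestLocations`, and `targetForRetry` are hypothetical, and locations are plain strings here for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; not the actual ShuffleClient implementation.
class RetryPushSketch {
  // Most recent known location per partition id.
  final Map<Integer, String> latestLocations = new ConcurrentHashMap<>();

  // Resolve the retry target at retry time instead of reusing the
  // location captured when the original push was submitted.
  String targetForRetry(int partitionId, String staleLocation) {
    return latestLocations.getOrDefault(partitionId, staleLocation);
  }
}
```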
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA.
Closes #1706 from JQ-Cao/788.
Authored-by: caojiaqing <caojiaqing@bilibili.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA.
Closes #1739 from AngersZhuuuu/CELEBORN-815.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1737 from AngersZhuuuu/CELEBORN-804-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
<img width="1643" alt="Screenshot 2023-07-20 12 01 06 PM" src="https://github.com/apache/incubator-celeborn/assets/46485123/d8822003-602f-4fe8-9634-ff25c0367cb1">
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1738 from AngersZhuuuu/CELEBORN-814.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Cleans up the pooled send buffers and push tasks if the SendBufferPool has been idle for more than
`celeborn.client.push.sendbufferpool.expireTimeout`.
### Why are the changes needed?
Before this PR the SendBufferPool will cache the send buffers and push tasks forever. If they are large
and will not be reused in the future, it wastes memory and causes GC.
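A minimal sketch of idle-based cleanup, assuming the pool records its last access time and a periodic task calls `cleanupIfIdle`. `ExpiringBufferPool` and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to check.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch; not Celeborn's SendBufferPool implementation.
class ExpiringBufferPool {
  private final Deque<byte[]> buffers = new ArrayDeque<>();
  private final long expireTimeoutMs;
  private long lastAccessMs;

  ExpiringBufferPool(long expireTimeoutMs, long nowMs) {
    this.expireTimeoutMs = expireTimeoutMs;
    this.lastAccessMs = nowMs;
  }

  // Reuse a pooled buffer when one is large enough, otherwise allocate.
  synchronized byte[] acquire(int size, long nowMs) {
    lastAccessMs = nowMs;
    byte[] pooled = buffers.pollFirst();
    return (pooled != null && pooled.length >= size) ? pooled : new byte[size];
  }

  synchronized void release(byte[] buffer, long nowMs) {
    lastAccessMs = nowMs;
    buffers.addFirst(buffer);
  }

  // Called periodically: drop everything once the pool has been idle too long.
  synchronized void cleanupIfIdle(long nowMs) {
    if (nowMs - lastAccessMs > expireTimeoutMs) {
      buffers.clear();
    }
  }

  synchronized int pooledCount() {
    return buffers.size();
  }
}
```

Clearing the deque drops the only references to the cached arrays, so the garbage collector can reclaim them.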
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA and manual tests.
Closes #1735 from waitinfuture/812-1.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
After discussion, we confirmed that Spark triggers `shuffleManager.unregisterShuffle()` on both the driver and the executors. In this PR:
1. Add a shuffle client on both the driver side and the executor side in ShuffleManager.
2. ShuffleClient calls `cleanupShuffle()` when `unregisterShuffle` is triggered.
This replaces https://github.com/apache/incubator-celeborn/pull/1719
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1726 from AngersZhuuuu/CELEBORN-804.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Fix some typos and grammar
### Why are the changes needed?
Ditto
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test.
Closes #1733 from onebox-li/fix-typo.
Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
<img width="1051" alt="Screenshot 2023-07-19 1 01 25 PM" src="https://github.com/apache/incubator-celeborn/assets/46485123/26d506b2-bab9-43f5-9bbe-58d22a761bab">
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1732 from AngersZhuuuu/CELEBORN-809.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
In a long-running Celeborn cluster there are usually some shut-down workers. For every task, new or old, the message below is logged even if the worker was never assigned to it, which is noisy.
ERROR CommitManager: Worker xx shutdown, commit all it's partition location.
### Why are the changes needed?
Ditto
### Does this PR introduce _any_ user-facing change?
Yes, the shutdown-worker log messages in LifecycleManager change.
### How was this patch tested?
Manual test.
Closes #1730 from onebox-li/adjust-log.
Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
The configuration key `celeborn.data.io.threads` underwent an inadvertent modification in https://github.com/apache/incubator-celeborn/pull/1077
### Does this PR introduce _any_ user-facing change?
Bug fix
### How was this patch tested?
Pass GA
Closes #1729 from cfmcgrady/fix-conf-key.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
Clean up the unnecessary TODO that was introduced in https://github.com/apache/incubator-celeborn/pull/1727
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Review
Closes #1728 from cfmcgrady/shutdown.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
Recently, while conducting the sbt build test, it came to my attention that certain resources such as ports and threads were not being released promptly.
This pull request introduces a new method, `shutdown(graceful: Boolean)`, to the `Service` trait. When invoked by `MiniClusterFeature.shutdownMiniCluster`, it calls `worker.shutdown(graceful = false)`. This implementation aims to prevent possible memory leaks during CI processes.
Before this PR the unit tests in the `client/common/master/service/worker` modules resulted in leaked ports.
```
$ jps
1138131 Jps
1130743 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1130743
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 127.0.0.1:12345 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:41563 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:42905 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:44419 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:45025 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:44799 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:39053 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:39029 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:39475 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:40153 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:33051 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:33449 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:34073 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:35347 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:35971 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:36799 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 192.168.1.151:40775 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 192.168.1.151:44457 0.0.0.0:* LISTEN 1130743/java
```
After this PR:
```
$ jps
1114423 Jps
1107544 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1107544
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1727 from cfmcgrady/shutdown.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
Timeout of ```RpcEndpointRef.ask``` is controlled by ```celeborn.rpc.askTimeout```,
so we also need to increase ```celeborn.rpc.askTimeout``` to extend the timeout of commit files.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA and manual test.
Closes #1725 from waitinfuture/803-fu.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
In 0.2.1-incubating, the default timeout for commit files is ```NETWORK_TIMEOUT```, which is 240s.
That is more reasonable because committing files takes a relatively long time. In my testing with slow disks,
a 30s timeout with 2 retries is not enough.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA and manual test.
Closes #1724 from waitinfuture/803.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
…up client
### What changes were proposed in this pull request?
Add a heartbeat from the client to the lifecycle manager. In this PR the heartbeat request contains the client's local shuffle ids;
the lifecycle manager checks them against its local set and returns the ids it doesn't know. Upon receiving the response,
the client calls ```unregisterShuffle``` for cleanup.
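The exchange can be sketched as a set difference, assuming shuffle ids are plain integers. `HeartbeatSketch`, `unknownShuffleIds`, and `handleResponse` are hypothetical names; the real cleanup goes through ```unregisterShuffle``` rather than a set removal.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the heartbeat exchange; not the actual RPC code.
class HeartbeatSketch {
  // Lifecycle manager side: ids in the request that are not in its local set.
  static Set<Integer> unknownShuffleIds(Set<Integer> clientIds, Set<Integer> managerIds) {
    Set<Integer> unknown = new HashSet<>(clientIds);
    unknown.removeAll(managerIds);
    return unknown;
  }

  // Client side: drop local state for every id the manager does not know.
  static void handleResponse(Set<Integer> clientIds, Set<Integer> unknown) {
    clientIds.removeAll(unknown);
  }
}
```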
### Why are the changes needed?
Before this PR, the client-side ```unregisterShuffle``` was never called. When running TPCDS 3T with Spark Thrift Server
without DRA, I found that the Executor's heap contained 1.6 million PartitionLocation objects (and StorageInfo).

After this PR, the number of PartitionLocation objects decreases to 275 thousand.

This heartbeat can be extended in the future for other purposes, e.g. reporting the client's metrics.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA and manual test.
Closes #1719 from waitinfuture/798.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
…Flight.total```
### What changes were proposed in this pull request?
Refer to https://github.com/apache/incubator-celeborn/pull/1720#discussion_r1265092164
### Why are the changes needed?
ditto
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA.
Closes #1723 from waitinfuture/799-fu.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Warn when local shuffle reader is enabled.
```
Detected spark.sql.adaptive.localShuffleReader.enabled (default is true) is enabled,
it's highly recommended to disable it when use Celeborn as Remote Shuffle Service to
avoid performance degradation.
```
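A minimal sketch of the check, assuming the Spark configuration is available as a string map and that an unset key counts as enabled (the default stated in the message). `LocalShuffleReaderCheck` and `warning` are hypothetical names.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; the real check reads the Spark conf in the shuffle manager.
class LocalShuffleReaderCheck {
  static final String KEY = "spark.sql.adaptive.localShuffleReader.enabled";

  // Returns the warning text when the local shuffle reader is enabled.
  static Optional<String> warning(Map<String, String> conf) {
    boolean enabled = Boolean.parseBoolean(conf.getOrDefault(KEY, "true"));
    if (!enabled) {
      return Optional.empty();
    }
    return Optional.of("Detected " + KEY + " (default is true) is enabled, "
        + "it's highly recommended to disable it when use Celeborn as Remote "
        + "Shuffle Service to avoid performance degradation.");
  }
}
```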
### Why are the changes needed?
When the local shuffle reader is enabled, a reduce task may read shuffle data by map id, which does not match Celeborn's shuffle data clustering model and causes extremely bad shuffle read performance.
### Does this PR introduce _any_ user-facing change?
Yes, users will see a warning message in the Driver log when `spark.sql.adaptive.localShuffleReader.enabled` is true.
### How was this patch tested?
Review.
Closes #1721 from pan3793/CELEBORN-801.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Reuse ```DataPusher#idleQueue``` by pooling in ```SendBufferPool``` to avoid too many ```byte[]```
objects in ```PushTask```.
### Why are the changes needed?
I tested with 3T TPCDS. Before this PR, containers were killed because of OOM and GC time was about 9.6h. For alive Executors, I dumped the memory and saw that the number of PushTask objects was about 20,000 and the number of ```64K``` byte[] objects was 23356, around 1.7G in total.

After this PR, no container is killed because of OOM and GC time is about 8.6h. I also dumped an Executor and found that the number
of PushTask objects is 3584 and the number of ```64K``` byte[] objects is 5783, around 361M in total.

Also, total execution time is ```3313.8s``` before this PR and ```3229.5s``` after it.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA and Manual test.
Closes #1722 from waitinfuture/802.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
When the number of worker instances is very large, say 1000, the total memory consumed
by inflight requests before this PR is 64K * 1000 * ```celeborn.client.push.maxReqsInFlight(16)``` = 1G. This PR
limits the total number of inflight push requests, as 0.2.1-incubating does.
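The arithmetic above can be checked with a short sketch. `InFlightMemory` and both method names are hypothetical, and the total cap of 256 used in the usage note is a made-up value for illustration, not a Celeborn default.

```java
// Hypothetical sketch of the worst-case memory estimate.
class InFlightMemory {
  // Per-worker limit: every worker may hold maxReqsPerWorker buffers at once.
  static long perWorkerLimitBytes(long bufferBytes, int workers, int maxReqsPerWorker) {
    return bufferBytes * workers * maxReqsPerWorker;
  }

  // Total limit: the bound no longer grows with the number of workers.
  static long totalLimitBytes(long bufferBytes, int maxTotalReqs) {
    return bufferBytes * maxTotalReqs;
  }
}
```

For example, `perWorkerLimitBytes(64 * 1024, 1000, 16)` gives 1,048,576,000 bytes (roughly 1G), while `totalLimitBytes(64 * 1024, 256)` stays at 16,777,216 bytes no matter how many workers exist.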
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA and manual test.
Closes #1720 from waitinfuture/799.
Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Decrease writeTime metric sampling frequency to improve perf
2. Set default value of ```celeborn.<module>.push.timeoutCheck.threads``` and ```celeborn.<module>.fetch.timeoutCheck.threads``` to 4
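One common way to lower a metric's sampling frequency is to time only every Nth operation. The sketch below shows that pattern under the assumption that it resembles what item 1 does; `SampledWriteTimer`, `sampleEvery`, and `shouldSample` are hypothetical and the actual sampling rule in the PR may differ.

```java
// Hypothetical sketch of count-based sampling; not Celeborn's metrics code.
class SampledWriteTimer {
  private final int sampleEvery;
  private long writes;

  SampledWriteTimer(int sampleEvery) {
    this.sampleEvery = sampleEvery;
  }

  // True for the 1st write and then every sampleEvery-th write after it.
  boolean shouldSample() {
    return (writes++ % sampleEvery) == 0;
  }
}
```

A caller would wrap the timing code in `if (timer.shouldSample()) { ... }`, so most writes skip the clock reads entirely.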
### Why are the changes needed?
The test cases are:
case 1: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 15000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 1.1T data
case 2: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 30000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 2.2T data
The e2e times of the shuffle write stage are:
||Sort pusher before|Sort pusher after|Hash pusher before|Hash pusher after|
|----|----|----|----|-----|
|case1|4.4min|4.1min|4.4min|3.9min|
|case2|9.1min|8.4min|9.7min|8.5min|
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA and manual test.
Closes #1718 from waitinfuture/797.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
The master no longer simulates slot allocation and instead uses the active slots reported by the worker.
### Why are the changes needed?
I have observed that a new worker might be allocated more slots than other workers when using the round-robin slot allocation algorithm.
There is a logic error in how the master processes worker heartbeats: it updates a disk info's active slots to max(current active slots, active slots reported by the worker). When a huge shuffle is registered, the master allocates more slots than a disk's maximum and marks the excess as unknown-disk slots, but the worker counts those unknown-disk slots as active slots and reports them to the master. The slot release logic then cannot tell the unknown slots apart from the plain number, so releasing does not decrease active slots properly.
Because of this gap between the worker and the master, I think it's OK to remove the slot allocation simulation and use the active slots reported by the worker.
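The effect of the `max(...)` update can be seen in a small sketch. `ActiveSlotsSketch` and both method names are hypothetical; only the max-based rule and the "use the worker's value" rule come from the description above.

```java
// Hypothetical sketch comparing the two heartbeat update rules.
class ActiveSlotsSketch {
  // Old rule: the master's value can only grow, even after slots are released.
  static long oldUpdate(long masterActiveSlots, long reportedByWorker) {
    return Math.max(masterActiveSlots, reportedByWorker);
  }

  // New rule: trust the value reported by the worker.
  static long newUpdate(long masterActiveSlots, long reportedByWorker) {
    return reportedByWorker;
  }
}
```

If the master holds 100 and the worker reports 40 after releasing slots, the old rule keeps 100 while the new rule drops to 40.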
Before this patch:
<img width="928" alt="Screenshot 2023-07-12 16 51 15" src="https://github.com/apache/incubator-celeborn/assets/4150993/9c8a46d9-26a8-42f5-a956-938273277c9b">
After this patch:
<img width="509" alt="Screenshot 2023-07-12 16 25 52" src="https://github.com/apache/incubator-celeborn/assets/4150993/c49b3d91-14ea-4eb8-9b71-9aab73541faf">
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
UT and cluster.
Closes #1710 from FMX/CELEBORN-791.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
…r Spark2
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA and manual test.
Closes #1717 from shujiewu/CELEBORN-792.
Authored-by: 无迹 <peter.wsj@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Change the parameter of getLogger to ReviveManager.class
### Why are the changes needed?
The parameter of getLogger in the ReviveManager class should be ReviveManager.class
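The same convention is sketched below with `java.util.logging` from the JDK so that the example is self-contained; this is a stand-in for whichever logging API `ReviveManager` actually uses, and `ReviveManagerSketch` is a hypothetical class.

```java
import java.util.logging.Logger;

// Hypothetical sketch using java.util.logging as a stand-in.
class ReviveManagerSketch {
  // Name the logger after the enclosing class so log lines point at the right source.
  static final Logger logger = Logger.getLogger(ReviveManagerSketch.class.getName());
}
```

Copying a logger declaration from another class without changing the class argument makes log output look as if it came from that other class.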
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
NO
Closes #1715 from jiaoqingbo/795.
Authored-by: e <1178404354@qq.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
According to https://github.com/apache/incubator-celeborn/pull/1709#discussion_r1260133078
### Why are the changes needed?
ditto
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA.
Closes #1711 from waitinfuture/790-fu.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Set default value of ```celeborn.worker.push.compositeBuffer.maxComponents``` to 256, to be aligned with 0.2.1-incubating version.
### Why are the changes needed?
The default of 16 is too small, and causes ~~severe GC~~ and high CPU load.
<img width="1719" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/9ab9675e-c19e-44f1-af46-90c29dc4df75">
### Does this PR introduce _any_ user-facing change?
No, it's internal config.
### How was this patch tested?
Passes GA.
Closes #1707 from waitinfuture/789.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Now that https://github.com/apache/incubator-celeborn/pull/1658 has been merged, we can format the message type.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1696 from AngersZhuuuu/CELEBORN-731.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/1699#discussion_r1259137323
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1704 from cfmcgrady/insert-record-followup.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>