1423 Commits

Author SHA1 Message Date
mingji
90959cbfd7
[CELEBORN-845][BUG] Sort memory counter won't decrease after sort failed
### What changes were proposed in this pull request?
Decrease sort memory counter after sorting procedure is complete.

### Why are the changes needed?
Fix incorrect sort memory counter.
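
The accounting pattern behind this fix can be sketched as follows. This is a minimal illustration and not the actual patch: `SortMemoryTracker` and `sortWithAccounting` are hypothetical names, and the sketch only assumes that memory is reserved before the sort and must be released whether or not the sort succeeds.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not a Celeborn class.
class SortMemoryTracker {
    private final AtomicLong sortMemoryCounter = new AtomicLong();

    long current() {
        return sortMemoryCounter.get();
    }

    // Reserve `bytes` for the duration of the sort and always release them.
    void sortWithAccounting(long bytes, Runnable sortTask) {
        sortMemoryCounter.addAndGet(bytes);
        try {
            sortTask.run();
        } finally {
            // Runs on success and on failure, so the counter cannot leak.
            sortMemoryCounter.addAndGet(-bytes);
        }
    }
}

class SortMemoryTrackerDemo {
    public static void main(String[] args) {
        SortMemoryTracker tracker = new SortMemoryTracker();
        try {
            tracker.sortWithAccounting(1024, () -> {
                throw new IllegalStateException("sort failed");
            });
        } catch (IllegalStateException expected) {
            // The failure still propagates, but the reservation was released.
        }
        System.out.println(tracker.current()); // prints 0
    }
}
```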

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT.

Closes #1766 from FMX/CELEBORN-845.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-07-27 15:16:04 +08:00
Angerszhuuuu
faba405ebc [CELEBORN-844] Fix incorrect config name in ConfigEntity checkvalue method and format message
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1765 from AngersZhuuuu/CELEBORN-844.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-27 14:46:11 +08:00
Fu Chen
c5ddf9b2ca [CELEBORN-822][FOLLOWUP] Format the example code in the docs/README.md
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Improve clarity and readability.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass CI

Closes #1763 from cfmcgrady/celeborn-822-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-26 20:02:13 +08:00
Aravind Patnam
e708c3cd25
[CELEBORN-838] Add custom mvn flag to celeborn
### What changes were proposed in this pull request?
Add an option to pass in a custom maven installation, similar to how [Spark does it](https://github.com/apache/spark/blob/master/dev/make-distribution.sh#L65).

### Why are the changes needed?
We need this internally as some of our machines may not have access to external Maven.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran `make-distribution.sh` to make sure it worked.

Closes #1761 from akpatnam25/CELEBORN-838.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-26 09:51:30 +08:00
Fu Chen
e16b26762b
[CELEBORN-837][BUILD] Add silencer plugin to suppress deprecated warnings
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Suppress all deprecation-related warnings during compilation, such as:
```
class OpenStream in package protocol is deprecated
        val openStream = msg.asInstanceOf[OpenStream]
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

tested locally

Closes #1760 from cfmcgrady/silence-deprecated.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-25 21:14:45 +08:00
e
e8dd4bbf45 [CELEBORN-835] Format specifiers should be used instead of string concatenation
### What changes were proposed in this pull request?

As title.

### Why are the changes needed?

As title.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Passes GA.

Closes #1758 from jiaoqingbo/CELEBORN-835.

Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-25 17:58:47 +08:00
e
d93c679ad3 [CELEBORN-833] Remove unused code
### What changes were proposed in this pull request?

As title.

### Why are the changes needed?

Remove Unused code

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Passes GA.

Closes #1753 from jiaoqingbo/CELEBORN-833.

Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-25 14:58:39 +08:00
Angerszhuuuu
2ab88f773a [CELEBORN-819] Worker close should pass close status to support handle graceful shutdown and decommission
### What changes were proposed in this pull request?
Pass the exit kind to each component. Depending on the exit kind:

- GRACEFUL_SHUTDOWN: behaves as the original code does when `graceful == true`.
- Others: the LevelDB file is cleaned up.
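
A minimal sketch of that dispatch, assuming a two-value enum. Only `GRACEFUL_SHUTDOWN` comes from the description above; `IMMEDIATE_EXIT` and the `WorkerCloser` helper are hypothetical placeholders for the other exit kinds.

```java
// Only GRACEFUL_SHUTDOWN is named in the PR text; IMMEDIATE_EXIT is a hypothetical stand-in.
enum ExitKind {
    GRACEFUL_SHUTDOWN,
    IMMEDIATE_EXIT
}

// Hypothetical helper, not a Celeborn class.
class WorkerCloser {
    // Every exit kind other than a graceful shutdown cleans the LevelDB file.
    static boolean shouldCleanLevelDb(ExitKind kind) {
        return kind != ExitKind.GRACEFUL_SHUTDOWN;
    }

    public static void main(String[] args) {
        for (ExitKind kind : ExitKind.values()) {
            System.out.println(kind + " -> clean LevelDB: " + shouldCleanLevelDb(kind));
        }
    }
}
```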

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1748 from AngersZhuuuu/CELEBORN-819.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-25 14:54:01 +08:00
Angerszhuuuu
6427ed35cd [CELEBORN-656] Should also refine log about return HARD_SPLIT in handlePushMergedData
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1756 from AngersZhuuuu/CELEBORN-656-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-24 20:43:56 +08:00
Angerszhuuuu
67c18e6607 [CELEBORN-656][FOLLOWUP] Fix wrong message call when revive return STAGE_END
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1755 from AngersZhuuuu/CELEBORN-656-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-24 20:20:22 +08:00
zky.zhoukeyong
b8cdf36b40 [CELEBORN-831][DOC] Add traffic control document
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1754 from waitinfuture/831.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-24 19:51:02 +08:00
zky.zhoukeyong
070d8bc0f8 [CELEBORN-826][DOC] Add storage document
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.

Closes #1752 from waitinfuture/826.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-24 16:12:42 +08:00
Angerszhuuuu
00c36fda99 [CELEBORN-828] Merge Monitoring to Development doc
### What changes were proposed in this pull request?
As title

<img width="1610" alt="Screenshot 2023-07-24 11:34:43 AM" src="https://github.com/apache/incubator-celeborn/assets/46485123/ba1b040b-9ea4-4c93-b055-75a469365ff2">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1751 from AngersZhuuuu/CELEBORN-828.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-24 15:37:32 +08:00
Cheng Pan
fa79b263a0
[CELEBORN-827] Eliminate unnecessary chunksBeingTransferred calculation
### What changes were proposed in this pull request?

Eliminate `chunksBeingTransferred` calculation when `celeborn.shuffle.io.maxChunksBeingTransferred` is not configured

### Why are the changes needed?

I observed high CPU usage on `ChunkStreamManager#chunksBeingTransferred` calculation. We can eliminate the method call if no threshold is configured, and investigate how to improve the method itself in the future.
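
The short-circuit described above might look like the sketch below. `ChunkLimitGuard` is a hypothetical class, and treating `Long.MAX_VALUE` as "not configured" is an assumption made for illustration; the real default and call site may differ.

```java
import java.util.function.LongSupplier;

// Hypothetical helper, not a Celeborn class.
class ChunkLimitGuard {
    // Assumption: an unset threshold is represented by Long.MAX_VALUE.
    private static final long NOT_CONFIGURED = Long.MAX_VALUE;

    private final long maxChunksBeingTransferred;
    private final LongSupplier chunksBeingTransferred; // the expensive calculation

    ChunkLimitGuard(long maxChunksBeingTransferred, LongSupplier chunksBeingTransferred) {
        this.maxChunksBeingTransferred = maxChunksBeingTransferred;
        this.chunksBeingTransferred = chunksBeingTransferred;
    }

    boolean exceedsLimit() {
        if (maxChunksBeingTransferred == NOT_CONFIGURED) {
            // No threshold configured: skip the costly count entirely.
            return false;
        }
        return chunksBeingTransferred.getAsLong() > maxChunksBeingTransferred;
    }
}
```

With this shape, the expensive supplier is only invoked when a threshold is actually set.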

<img width="1947" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/412c6a41-c0ce-440c-ae99-4424cb8702d3">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI and Review.

Closes #1749 from pan3793/CELEBORN-827.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-24 15:31:57 +08:00
zky.zhoukeyong
8e849645eb [CELEBORN-824][DOC] Add PushData document
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.

Closes #1747 from waitinfuture/824.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-24 10:38:46 +08:00
zky.zhoukeyong
27521547f0 [CELEBORN-823][DOC] Add Celeborn architecture document
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.

Closes #1746 from waitinfuture/823.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-22 23:57:22 +08:00
zky.zhoukeyong
fb2af146bf [CELEBORN-822][DOC] Add quick start guide
### What changes were proposed in this pull request?
As title.
![image](https://github.com/apache/incubator-celeborn/assets/948245/e2e96131-26be-497f-9f11-e8b5e215a15d)

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.

Closes #1745 from waitinfuture/822.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-22 21:39:41 +08:00
Angerszhuuuu
76201c92f8 [CELEBORN-820] Merge service shutdown and close method
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1742 from AngersZhuuuu/CELEBORN-820.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-22 21:04:29 +08:00
Fu Chen
0bb73ece3b [CELEBORN-821][BUILD] Bump junit from 4.12 to 4.13.2
### What changes were proposed in this pull request?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1744 from cfmcgrady/junit.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-22 10:00:25 +08:00
Angerszhuuuu
4af5114e17 [CELEBORN-788][FOLLOWUP] Update callback's location should also update the PushState to keep consistent
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1741 from AngersZhuuuu/CELEBORN-788-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-21 12:14:57 +08:00
caojiaqing
4669d1e31c [CELEBORN-788] Update latest PartitionLocation before retry PushData
### What changes were proposed in this pull request?

Inside `ShuffleClient.submitRetryPushData`, update to the latest PartitionLocation before retrying the push.

### Why are the changes needed?
Before this PR, `ShuffleClient.submitRetryPushData` retried the push with the previous PartitionLocation,
which is incorrect and may cause inefficiency in some cases.
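
A minimal sketch of the idea, assuming the client keeps a map from partition id to its most recent location. `RetryLocationLookup` and its methods are hypothetical, and a plain `String` stands in for `PartitionLocation`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not a Celeborn class.
class RetryLocationLookup {
    // partitionId -> most recent location (a String stands in for PartitionLocation)
    private final Map<Integer, String> latestLocations = new ConcurrentHashMap<>();

    void updateLocation(int partitionId, String location) {
        latestLocations.put(partitionId, location);
    }

    // Prefer the latest known location over the one captured when the push was first submitted.
    String locationForRetry(int partitionId, String locationAtSubmit) {
        return latestLocations.getOrDefault(partitionId, locationAtSubmit);
    }

    public static void main(String[] args) {
        RetryLocationLookup lookup = new RetryLocationLookup();
        lookup.updateLocation(7, "worker-b:9092");
        // The push was submitted against worker-a, but the retry goes to worker-b.
        System.out.println(lookup.locationForRetry(7, "worker-a:9092"));
    }
}
```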

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA.

Closes #1706 from JQ-Cao/788.

Authored-by: caojiaqing <caojiaqing@bilibili.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-20 21:36:37 +08:00
Angerszhuuuu
be05ae37fe [CELEBORN-815] Remove unused ShuffleClient.readPartition
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA.

Closes #1739 from AngersZhuuuu/CELEBORN-815.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-20 20:49:29 +08:00
Angerszhuuuu
5c7848d531 [CELEBORN-804][FOLLOWUP] ShuffleManager stop should set shuffleClient to null
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1737 from AngersZhuuuu/CELEBORN-804-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-20 20:35:23 +08:00
Angerszhuuuu
f15c2a7a68 [CELEBORN-814] Merge upgrade doc to Deployment tab and add TOC
### What changes were proposed in this pull request?
As title

<img width="1643" alt="Screenshot 2023-07-20 12:01:06 PM" src="https://github.com/apache/incubator-celeborn/assets/46485123/d8822003-602f-4fe8-9634-ff25c0367cb1">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1738 from AngersZhuuuu/CELEBORN-814.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-20 14:06:12 +08:00
zky.zhoukeyong
6a5e3ed794 [CELEBORN-812] Cleanup SendBufferPool if idle for long
### What changes were proposed in this pull request?
Cleans up the pooled send buffers and push tasks if the SendBufferPool has been idle for more than
`celeborn.client.push.sendbufferpool.expireTimeout`.

### Why are the changes needed?
Before this PR, the SendBufferPool cached the send buffers and push tasks forever. If they are large
and will not be reused, they waste memory and add GC pressure.
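
The idle-expiry behaviour can be sketched as below. `ExpiringBufferPool` is a hypothetical, simplified pool: it tracks only byte buffers, takes the current time as a parameter so the logic is easy to test, and assumes `cleanupIfIdle` is called periodically by some background task.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not the real SendBufferPool.
class ExpiringBufferPool {
    private final Deque<byte[]> idleBuffers = new ArrayDeque<>();
    private final int bufferSize;
    private final long expireTimeoutMs;
    private long lastAccessMs;

    ExpiringBufferPool(int bufferSize, long expireTimeoutMs, long nowMs) {
        this.bufferSize = bufferSize;
        this.expireTimeoutMs = expireTimeoutMs;
        this.lastAccessMs = nowMs;
    }

    // Reuse a pooled buffer if one is available, otherwise allocate a new one.
    synchronized byte[] acquire(long nowMs) {
        lastAccessMs = nowMs;
        byte[] pooled = idleBuffers.pollFirst();
        return pooled != null ? pooled : new byte[bufferSize];
    }

    synchronized void release(byte[] buffer, long nowMs) {
        lastAccessMs = nowMs;
        idleBuffers.addFirst(buffer);
    }

    // Drops all pooled buffers if the pool has been idle longer than the timeout.
    // Returns the number of buffers dropped.
    synchronized int cleanupIfIdle(long nowMs) {
        if (nowMs - lastAccessMs <= expireTimeoutMs) {
            return 0;
        }
        int dropped = idleBuffers.size();
        idleBuffers.clear();
        return dropped;
    }
}
```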

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual tests.

Closes #1735 from waitinfuture/812-1.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-20 00:34:55 +08:00
Angerszhuuuu
14c6e5719f
[CELEBORN-811] Refine monitoring doc
### What changes were proposed in this pull request?
Refine monitoring doc

1. Remove the unnecessary left-side navigator
2. Add a TOC on the right side
3. Fix list indentation

Before
![celeborn apache org_docs_latest_monitoring_](https://github.com/apache/incubator-celeborn/assets/46485123/885da0e5-f2f9-41ba-a9fe-257e46e76a78)

After
![127 0 0 1_8000_monitoring_](https://github.com/apache/incubator-celeborn/assets/46485123/8cb3fc60-0a2e-4134-8edb-dd0fe434be60)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1734 from AngersZhuuuu/CELEBORN-811.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-19 20:53:21 +08:00
Angerszhuuuu
5471a6afe5
[CELEBORN-804] ShuffleClient should cleanup shuffle infos when trigger unregisterShuffle
### What changes were proposed in this pull request?

After discussion, we confirmed that Spark triggers `shuffleManager.unregisterShuffle()` on both the driver and the executors. In this PR:

  1. Add a shuffle client on both the driver and executor side in ShuffleManager.
  2. ShuffleClient calls `cleanupShuffle()` when `unregisterShuffle` is triggered.

This replaced https://github.com/apache/incubator-celeborn/pull/1719

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1726 from AngersZhuuuu/CELEBORN-804.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-19 20:50:18 +08:00
onebox-li
405b2801fa [CELEBORN-810] Fix some typos and grammar
### What changes were proposed in this pull request?
Fix some typos and grammar

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested.

Closes #1733 from onebox-li/fix-typo.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-19 18:35:38 +08:00
Angerszhuuuu
c8ad39d9bd [CELEBORN-809] Directly use isDriver passed from SparkEnv
### What changes were proposed in this pull request?
As title
<img width="1051" alt="Screenshot 2023-07-19 1:01:25 PM" src="https://github.com/apache/incubator-celeborn/assets/46485123/26d506b2-bab9-43f5-9bbe-58d22a761bab">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1732 from AngersZhuuuu/CELEBORN-809.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-19 15:20:01 +08:00
Cheng Pan
0db919403e Revert "[CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…"
This reverts commit e56a8a8bed.
2023-07-19 15:08:45 +08:00
onebox-li
061febe46f [CELEBORN-807] Adjust shutdown worker logs in LifecycleManager
### What changes were proposed in this pull request?
In a long-running Celeborn cluster, some workers will have been shut down. For every task, new or old, and even when the worker was never assigned to it, the following message is always logged, which is noisy:
ERROR CommitManager: Worker xx shutdown, commit all it's partition location.

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
Yes, the logs for shutdown workers in LifecycleManager change.

### How was this patch tested?
Manually tested.

Closes #1730 from onebox-li/adjust-log.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-19 11:38:54 +08:00
Fu Chen
8b7a761859 [CELEBORN-806] Correct the conf key celeborn.data.io.threads within the class ShuffleClientImpl
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

The configuration key `celeborn.data.io.threads` underwent an inadvertent modification in https://github.com/apache/incubator-celeborn/pull/1077

### Does this PR introduce _any_ user-facing change?

Bug fix

### How was this patch tested?

Pass GA

Closes #1729 from cfmcgrady/fix-conf-key.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-18 17:54:49 +08:00
Fu Chen
16d9c657c2
[CELEBORN-805][FOLLOWUP] Remove unnecessary TODO
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Clean up the unnecessary TODO that was introduced in https://github.com/apache/incubator-celeborn/pull/1727

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Review

Closes #1728 from cfmcgrady/shutdown.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-18 13:37:21 +08:00
Fu Chen
7c6644b1a7
[CELEBORN-805] Immediate shutdown of server upon completion of unit test to prevent potential resource leakage
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Recently, while conducting the sbt build test, it came to my attention that certain resources such as ports and threads were not being released promptly.

This pull request introduces a new method, `shutdown(graceful: Boolean)`, to the `Service` trait. When invoked by `MiniClusterFeature.shutdownMiniCluster`, it calls `worker.shutdown(graceful = false)`. This implementation aims to prevent possible memory leaks during CI processes.

Before this PR the unit tests in the `client/common/master/service/worker` modules resulted in leaked ports.

```
$ jps
1138131 Jps
1130743 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1130743
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:12345         0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:41563           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:42905           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:44419           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:45025           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:44799           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39053           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39029           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39475           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:40153           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:33051           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:33449           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:34073           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:35347           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:35971           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:36799           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 192.168.1.151:40775     0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 192.168.1.151:44457     0.0.0.0:*               LISTEN      1130743/java
```

After this PR:

```
$ jps
1114423 Jps
1107544 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1107544
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1727 from cfmcgrady/shutdown.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-18 13:12:51 +08:00
zky.zhoukeyong
1109e2c8f4 [CELEBORN-803][FOLLOWUP] Make ``rpcAskTimeout`` default to 60s
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
Timeout of ```RpcEndpointRef.ask``` is controlled by ```celeborn.rpc.askTimeout```,
so we also need to increase ```celeborn.rpc.askTimeout``` to extend the timeout of commit files.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1725 from waitinfuture/803-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 23:53:52 +08:00
zky.zhoukeyong
9ec223edd7 [CELEBORN-803] Increase default timeout for commit files
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
In 0.2.1-incubating, the default timeout for commit files is ```NETWORK_TIMEOUT```, which is 240s.
That is more reasonable because committing files takes a relatively long time. In my testing with tough disks,
a 30s timeout with 2 retries is not enough.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1724 from waitinfuture/803.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 22:31:36 +08:00
zky.zhoukeyong
e56a8a8bed [CELEBORN-798] Add heartbeat from client to LifecycleManager to cleanup client

### What changes were proposed in this pull request?
Add a heartbeat from the client to the lifecycle manager. In this PR the heartbeat request contains the client's
local shuffle ids; the lifecycle manager checks them against its local set and returns the ids it doesn't know.
Upon receiving the response, the client calls ```unregisterShuffle``` for cleanup.
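
The exchange amounts to a set difference, sketched below. `ShuffleHeartbeat` and both method names are hypothetical, and removing an id from the client's set stands in for the real `unregisterShuffle` call.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a Celeborn class.
class ShuffleHeartbeat {
    // Manager side: ids the client still holds but the manager does not know.
    static Set<Integer> unknownShuffleIds(Set<Integer> clientIds, Set<Integer> managerIds) {
        Set<Integer> unknown = new HashSet<>(clientIds);
        unknown.removeAll(managerIds);
        return unknown;
    }

    // Client side: drop local state for every id returned by the manager.
    // Removing the id here stands in for calling unregisterShuffle on it.
    static void applyResponse(Set<Integer> clientIds, Set<Integer> unknownIds) {
        clientIds.removeAll(unknownIds);
    }

    public static void main(String[] args) {
        Set<Integer> client = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> manager = new HashSet<>(Arrays.asList(2, 3, 4));
        Set<Integer> unknown = unknownShuffleIds(client, manager);
        applyResponse(client, unknown);
        System.out.println("unknown=" + unknown + ", client now holds " + client.size() + " ids");
    }
}
```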

### Why are the changes needed?
Before this PR, client side ```unregisterShuffle``` is never called. When running TPCDS 3T with spark thriftserver
without DRA, I found the Executor's heap contains 1.6 million PartitionLocation objects (and StorageInfo):
![image](https://github.com/apache/incubator-celeborn/assets/948245/43658369-7763-4511-a5b0-9b3fbdf02005)

After this PR, the number of PartitionLocation objects decreases to 275 thousand:
![image](https://github.com/apache/incubator-celeborn/assets/948245/45f8f849-186d-4cad-83c8-64bd6d18debc)

This heartbeat can be extended in the future for other purposes, e.g. reporting the client's metrics.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1719 from waitinfuture/798.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 18:14:10 +08:00
zky.zhoukeyong
95119b1e4b [CELEBORN-799][FOLLOWUP] Fix doc of celeborn.client.push.maxReqsInFlight.total

### What changes were proposed in this pull request?
Refer to https://github.com/apache/incubator-celeborn/pull/1720#discussion_r1265092164

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA.

Closes #1723 from waitinfuture/799-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 18:01:03 +08:00
Cheng Pan
1ec4f4a9f5 [CELEBORN-801] Warn when local shuffle reader is enabled
### What changes were proposed in this pull request?

Warn when local shuffle reader is enabled.

```
Detected spark.sql.adaptive.localShuffleReader.enabled (default is true) is enabled,
it's highly recommended to disable it when use Celeborn as Remote Shuffle Service to
avoid performance degradation.
```

### Why are the changes needed?

When local shuffle reader is enabled, the reduce task may read shuffle data by map id, which does not match Celeborn's shuffle data clustering model and causes extremely poor shuffle read performance.
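
A check of this kind could be sketched as follows. `LocalShuffleReaderCheck` is a hypothetical helper operating on a plain `Map` rather than a real Spark configuration object, and the message text is abbreviated; only the config key and its default of `true` come from the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not a Celeborn or Spark class.
class LocalShuffleReaderCheck {
    static final String KEY = "spark.sql.adaptive.localShuffleReader.enabled";

    // Returns a warning when the setting is enabled (the default), otherwise empty.
    static Optional<String> warning(Map<String, String> conf) {
        boolean enabled = Boolean.parseBoolean(conf.getOrDefault(KEY, "true"));
        if (!enabled) {
            return Optional.empty();
        }
        return Optional.of("Detected " + KEY + " is enabled; consider disabling it when using Celeborn.");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Nothing set explicitly, so the default (true) applies and a warning is produced.
        warning(conf).ifPresent(System.out::println);
    }
}
```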

### Does this PR introduce _any_ user-facing change?

Yes, users will see a warning message in the Driver log when `spark.sql.adaptive.localShuffleReader.enabled` is true.

### How was this patch tested?

Review.

Closes #1721 from pan3793/CELEBORN-801.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 16:43:50 +08:00
zky.zhoukeyong
10a1def512 [CELEBORN-802] Reuse DataPusher#idleQueue by pooling to avoid too many byte[] objects
### What changes were proposed in this pull request?
Reuse ```DataPusher#idleQueue``` by pooling in ```SendBufferPool``` to avoid too many ```byte[]```
objects in ```PushTask```.

### Why are the changes needed?
I'm testing 3T TPCDS. Before this PR, containers were killed because of OOM, and GC time was about 9.6h. For alive Executors, I dumped the memory and saw about 20,000 PushTask objects and 23356 ```64k``` byte[] objects, around 1.7G in total:
![image](https://github.com/apache/incubator-celeborn/assets/948245/7b4ee4fa-7860-4ddb-b862-181a91748092)

After this PR, no container is killed because of OOM, and GC time is about 8.6h. I also dumped an Executor and found
3584 PushTask objects and 5783 ```64K``` byte[] objects, around 361M in total:
![image](https://github.com/apache/incubator-celeborn/assets/948245/981e8f70-52f8-4bb1-9f67-9a8b4f398392)

Also, total execution time dropped from ```3313.8s``` before this PR to ```3229.5s``` after it.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and Manual test.

Closes #1722 from waitinfuture/802.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 16:35:14 +08:00
zky.zhoukeyong
4b3a47c9db [CELEBORN-799] Limit total inflight push requests
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
When the number of worker instances is very large, say 1000, then before this PR the total memory consumed
by inflight requests is 64K * 1000 * ```celeborn.client.push.maxReqsInFlight(16)``` = 1G. This PR
limits total inflight push requests, as 0.2.1-incubating does.
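
The estimate above can be checked with a few lines of arithmetic. `InflightMemoryEstimate` is a hypothetical helper; the inputs (64K buffers, 1000 workers, 16 requests in flight per worker) are the figures quoted in the description.

```java
// Hypothetical helper, not a Celeborn class.
class InflightMemoryEstimate {
    // Worst case: every worker has its full quota of in-flight buffers outstanding.
    static long worstCaseBytes(long bufferBytes, int workers, int maxReqsInFlightPerWorker) {
        return bufferBytes * workers * maxReqsInFlightPerWorker;
    }

    public static void main(String[] args) {
        long bytes = worstCaseBytes(64 * 1024, 1000, 16);
        System.out.println(bytes); // prints 1048576000, just under 1 GiB
    }
}
```

Because the bound grows linearly with the number of workers, a per-worker limit alone does not cap client memory, which is why a total limit is useful.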

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1720 from waitinfuture/799.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 16:17:24 +08:00
zky.zhoukeyong
a7bbbd05c4 [CELEBORN-797] Decrease writeTime metric sampling frequency to improve perf
### What changes were proposed in this pull request?
1. Decrease writeTime metric sampling frequency to improve perf
2. Set default value of ```celeborn.<module>.push.timeoutCheck.threads``` and ```celeborn.<module>.fetch.timeoutCheck.threads``` to 4

### Why are the changes needed?
The test cases are as follows:
case 1: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 15000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 1.1T data
case 2: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 30000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 2.2T data
The e2e times of the shuffle write stage are as follows:
||Sort pusher before|Sort pusher after|Hash pusher before|Hash pusher after|
|----|----|----|----|-----|
|case1|4.4min|4.1min|4.4min|3.9min|
|case2|9.1min|8.4min|9.7min|8.5min|

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1718 from waitinfuture/797.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 20:51:50 +08:00
mingji
a4687716d2 [CELEBORN-791] Remove slots allocation simulation in master and use active slots sent from worker's heartbeat
### What changes were proposed in this pull request?
The master no longer simulates slot allocation and instead uses the active slots sent from the worker.

### Why are the changes needed?
I have observed that a new worker might be allocated more slots than other workers when using the round-robin slot allocation algorithm.
There is a logic error in processing the heartbeat from a worker: the master updates a disk's active slots to max(the master's current active slots for the disk, the active slots reported by the worker). If I register a huge shuffle, the master allocates more slots than a disk's max slots and marks the excess as unknown-disk slots, but the worker counts the unknown-disk slots as active slots and reports them to the master. The slot release logic then cannot distinguish unknown slots from a bare number, so the release does not decrease active slots properly.
Because of this gap between worker and master, I think it's OK to remove the slot allocation simulation from the master and use the active slots reported by the worker.
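
The effect of the `max` update can be seen in a small sketch. `ActiveSlotsUpdate` and its two methods are hypothetical names that contrast the old merge rule with simply trusting the worker's report.

```java
// Hypothetical helper, not a Celeborn class.
class ActiveSlotsUpdate {
    // Old rule: keep the larger value, so the master's view can never shrink on heartbeat.
    static long mergeWithMax(long masterView, long workerReported) {
        return Math.max(masterView, workerReported);
    }

    // New rule: the worker's report replaces the master's view.
    static long useWorkerReport(long masterView, long workerReported) {
        return workerReported;
    }

    public static void main(String[] args) {
        long masterView = 100; // inflated by simulated allocation
        long workerReported = 20; // what the worker actually holds
        System.out.println(mergeWithMax(masterView, workerReported)); // prints 100
        System.out.println(useWorkerReport(masterView, workerReported)); // prints 20
    }
}
```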

Before this patch:
<img width="928" alt="Screenshot 2023-07-12 16:51:15" src="https://github.com/apache/incubator-celeborn/assets/4150993/9c8a46d9-26a8-42f5-a956-938273277c9b">

After this patch:
<img width="509" alt="Screenshot 2023-07-12 16:25:52" src="https://github.com/apache/incubator-celeborn/assets/4150993/c49b3d91-14ea-4eb8-9b71-9aab73541faf">

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1710 from FMX/CELEBORN-791.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 20:40:55 +08:00
e
f78a7d349f [CELEBORN-794] Fix link of CONFIGURATIONS in README
### What changes were proposed in this pull request?

Modify CONFIGURATIONS to point to the correct address

### Why are the changes needed?

CONFIGURATIONS in README.md points to an invalid address

![image](https://github.com/apache/incubator-celeborn/assets/14961757/538294ee-3432-4e1e-a45e-4dc1983d50e8)
![image](https://github.com/apache/incubator-celeborn/assets/14961757/d4681603-5317-46ae-a2f5-e58fa72c706c)

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?
NO

Closes #1714 from jiaoqingbo/CELEBORN-794.

Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 18:08:09 +08:00
无迹
e1337972e8 [CELEBORN-792] SparkShuffleManager.getWriter use wrong appUniqueId for Spark2

### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA and manual test.

Closes #1717 from shujiewu/CELEBORN-792.

Authored-by: 无迹 <peter.wsj@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 17:17:48 +08:00
e
867469a201 [CELEBORN-795] Change the parameter of getLogger to ReviveManager.class
### What changes were proposed in this pull request?

Change the parameter of getLogger to ReviveManager.class

### Why are the changes needed?

The parameter of getLogger in the ReviveManager class should be ReviveManager.class

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

NO

Closes #1715 from jiaoqingbo/795.

Authored-by: e <1178404354@qq.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-14 15:52:25 +08:00
zky.zhoukeyong
03717375fc [CELEBORN-790][FOLLOWUP] Use allocator.compositeDirectBuffer to track memory leak
### What changes were proposed in this pull request?
According to https://github.com/apache/incubator-celeborn/pull/1709#discussion_r1260133078

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA.

Closes #1711 from waitinfuture/790-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-13 10:01:11 +08:00
zky.zhoukeyong
dcf6be29d8 [CELEBORN-789] Increase default value of flushBuffer's max components
### What changes were proposed in this pull request?
Set default value of ```celeborn.worker.push.compositeBuffer.maxComponents``` to 256, to be aligned with 0.2.1-incubating version.

### Why are the changes needed?

Default 16 is too small, and causes ~~severe GC~~ and CPU high load.

<img width="1719" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/9ab9675e-c19e-44f1-af46-90c29dc4df75">

### Does this PR introduce _any_ user-facing change?
No, it's internal config.

### How was this patch tested?
Passes GA.

Closes #1707 from waitinfuture/789.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-12 20:18:48 +08:00
Angerszhuuuu
1642090f9f [CELEBORN-781] Refactor RPC message type name
### What changes were proposed in this pull request?
Now that https://github.com/apache/incubator-celeborn/pull/1658 is merged, we can format the message type names.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1696 from AngersZhuuuu/CELEBORN-731.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-12 14:16:19 +08:00
Fu Chen
90ba9f3e87 [CELEBORN-783][FOLLOWUP] Private member updates and cleanup in SortBasedPusher
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1699#discussion_r1259137323

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1704 from cfmcgrady/insert-record-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 23:08:42 +08:00