Commit Graph

1099 Commits

Author SHA1 Message Date
zky.zhoukeyong
6a5e3ed794 [CELEBORN-812] Cleanup SendBufferPool if idle for long
### What changes were proposed in this pull request?
Cleans up the pooled send buffers and push tasks if the SendBufferPool has been idle for more than
`celeborn.client.push.sendbufferpool.expireTimeout`.

### Why are the changes needed?
Before this PR the SendBufferPool will cache the send buffers and push tasks forever. If they are large
and will not be reused in the future, it wastes memory and causes GC.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual tests.

Closes #1735 from waitinfuture/812-1.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-20 00:34:55 +08:00
Angerszhuuuu
14c6e5719f
[CELEBORN-811] Refine monitoring doc
### What changes were proposed in this pull request?
Refine monitoring doc

1. Remove unnecessary left side navigator
2. Add TOC in right side
3. fix list indentation

Before
![celeborn apache org_docs_latest_monitoring_](https://github.com/apache/incubator-celeborn/assets/46485123/885da0e5-f2f9-41ba-a9fe-257e46e76a78)

After
![127 0 0 1_8000_monitoring_](https://github.com/apache/incubator-celeborn/assets/46485123/8cb3fc60-0a2e-4134-8edb-dd0fe434be60)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1734 from AngersZhuuuu/CELEBORN-811.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-19 20:53:21 +08:00
Angerszhuuuu
5471a6afe5
[CELEBORN-804] ShuffleClient should cleanup shuffle infos when trigger unregisterShuffle
### What changes were proposed in this pull request?

After discussion, we make sure that `shuffleManager.unregisterShuffle()` will be triggered by Spark both in driver and executor. In this pr:

  1. Add shuffle client both in driver and executor side in ShuffleManager
  2. ShuffleClient call cleanupShuffle() when trigger `unregisterShuffle`.

This replaced https://github.com/apache/incubator-celeborn/pull/1719

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1726 from AngersZhuuuu/CELEBORN-804.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-19 20:50:18 +08:00
onebox-li
405b2801fa [CELEBORN-810] Fix some typos and grammar
### What changes were proposed in this pull request?
Fix some typos and grammar

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manually test

Closes #1733 from onebox-li/fix-typo.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-19 18:35:38 +08:00
Angerszhuuuu
c8ad39d9bd [CELEBORN-809] Directly use isDriver passed from SparkEnv
### What changes were proposed in this pull request?
As title
<img width="1051" alt="截屏2023-07-19 下午1 01 25" src="https://github.com/apache/incubator-celeborn/assets/46485123/26d506b2-bab9-43f5-9bbe-58d22a761bab">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1732 from AngersZhuuuu/CELEBORN-809.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-19 15:20:01 +08:00
Cheng Pan
0db919403e Revert "[CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…"
This reverts commit e56a8a8bed.
2023-07-19 15:08:45 +08:00
onebox-li
061febe46f [CELEBORN-807] Adjust shutdown worker logs in LifecycleManager
### What changes were proposed in this pull request?
In a long run celeborn cluster,  there are some shutdown workers. Whether it is a new task or an old task, even if the worker is not assigned , it will always log below, seems a little noisy.
ERROR CommitManager: Worker xx shutdown, commit all it's partition location.

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
shutdown worker logs in LifecycleManager changes

### How was this patch tested?
manually test

Closes #1730 from onebox-li/adjust-log.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-19 11:38:54 +08:00
Fu Chen
8b7a761859 [CELEBORN-806] Correct the conf key celeborn.data.io.threads within the class ShuffleClientImpl
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

The configuration key `celeborn.data.io.threads` underwent an inadvertent modification in https://github.com/apache/incubator-celeborn/pull/1077

### Does this PR introduce _any_ user-facing change?

Bug fix

### How was this patch tested?

Pass GA

Closes #1729 from cfmcgrady/fix-conf-key.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-18 17:54:49 +08:00
Fu Chen
16d9c657c2
[CELEBORN-805][FOLLOWUP] Remove unnecessary TODO
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

cleanup the unnecessary TODO which introduced in https://github.com/apache/incubator-celeborn/pull/1727

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Review

Closes #1728 from cfmcgrady/shutdown.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-18 13:37:21 +08:00
Fu Chen
7c6644b1a7
[CELEBORN-805] Immediate shutdown of server upon completion of unit test to prevent potential resource leakage
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Recently, while conducting the sbt build test, it came to my attention that certain resources such as ports and threads were not being released promptly.

This pull request introduces a new method, `shutdown(graceful: Boolean)`, to the `Service` trait. When invoked by `MiniClusterFeature.shutdownMiniCluster`, it calls `worker.shutdown(graceful = false)`. This implementation aims to prevent possible memory leaks during CI processes.

Before this PR the unit tests in the `client/common/master/service/worker` modules resulted in leaked ports.

```
$ jps
1138131 Jps
1130743 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1130743
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:12345         0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:41563           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:42905           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:44419           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:45025           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:44799           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39053           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39029           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39475           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:40153           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:33051           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:33449           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:34073           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:35347           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:35971           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:36799           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 192.168.1.151:40775     0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 192.168.1.151:44457     0.0.0.0:*               LISTEN      1130743/java
```

After this PR:

```
$ jps
1114423 Jps
1107544 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1107544
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1727 from cfmcgrady/shutdown.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-18 13:12:51 +08:00
zky.zhoukeyong
1109e2c8f4 [CELEBORN-803][FOLLOWUP] Make ``rpcAskTimeout`` default to 60s
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
Timeout of ```RpcEndpointRef.ask``` is controlled by ```celeborn.rpc.askTimeout```,
so we also need to increase ```celeborn.rpc.askTimeout``` to extend the timeout of commit files.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1725 from waitinfuture/803-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 23:53:52 +08:00
zky.zhoukeyong
9ec223edd7 [CELEBORN-803] Increase default timeout for commit files
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
In 0.2.1-incubating, commit files default timeout is ```NETWORK_TIMEOUT```, which is 240s.
It's more reasonable because commit files costs relatively long time. In my testing with tough disks,
30s timeout with 2 retires is not enough.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1724 from waitinfuture/803.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 22:31:36 +08:00
zky.zhoukeyong
e56a8a8bed [CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…
…up client

### What changes were proposed in this pull request?
Add heartbeat from client to lifecycle manager. In this PR heartbeat request contains local shuffle ids from
client, lifecycle manager checks with it's local set and returns ids it doesn't know. Upon receiving response,
client calls ```unregisterShuffle``` for cleanup.

### Why are the changes needed?
Before this PR, client side ```unregisterShuffle``` is never called. When running TPCDS 3T with spark thriftserver
without DRA, I found the Executor's heap contains 1.6 million PartitionLocation objects (and StorageInfo):
![image](https://github.com/apache/incubator-celeborn/assets/948245/43658369-7763-4511-a5b0-9b3fbdf02005)

After this PR, the number of PartitionLocation objects decreases to 275 thousands
![image](https://github.com/apache/incubator-celeborn/assets/948245/45f8f849-186d-4cad-83c8-64bd6d18debc)

This heartbeat can be extended in the future for other purposes, i.e. reporting client's metrics.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and  manual test.

Closes #1719 from waitinfuture/798.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 18:14:10 +08:00
zky.zhoukeyong
95119b1e4b [CELEBORN-799][FOLLOWUP] Fix doc of celeborn.client.push.maxReqsInFlight.total
…Flight.total```

### What changes were proposed in this pull request?
Refer to https://github.com/apache/incubator-celeborn/pull/1720#discussion_r1265092164

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA.

Closes #1723 from waitinfuture/799-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 18:01:03 +08:00
Cheng Pan
1ec4f4a9f5 [CELEBORN-801] Warn when local shuffle reader is enabled
### What changes were proposed in this pull request?

Warn when local shuffle reader is enabled.

```
Detected spark.sql.adaptive.localShuffleReader.enabled (default is true) is enabled,
it's highly recommended to disable it when use Celeborn as Remote Shuffle Service to
avoid performance degradation.
```

### Why are the changes needed?

When local shuffle reader is enabled, the reduce task may read shuffle data in by map id, which is not match the Celeborn shuffle data clustering model, then cause extremely bad shuffle read performance.

### Does this PR introduce _any_ user-facing change?

Yes, user would see warning message from Driver log when `spark.sql.adaptive.localShuffleReader.enabled` is true.

### How was this patch tested?

Review.

Closes #1721 from pan3793/CELEBORN-801.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 16:43:50 +08:00
zky.zhoukeyong
10a1def512 [CELEBORN-802] Reuse DataPusher#idleQueue by pooling to avoid too many byte[] objects
### What changes were proposed in this pull request?
Reuse ```DataPusher#idleQueue``` by pooling in ```SendBufferPool``` to avoid too many ```byte[]```
objects in ```PushTask```.

### Why are the changes needed?
I'm testing 3T TPCDS. Before this PR, I encountered Container killed because of OOM, GC is about 9.6h. For alive Executors, I dumped the memory and see number of PushTask object is 2w, and the number of ```64k``` byte[] is 23356, total around 1.7G:
![image](https://github.com/apache/incubator-celeborn/assets/948245/7b4ee4fa-7860-4ddb-b862-181a91748092)

After this PR, no container is killed because of OOM, GC is about 8.6h. I also dumped Executor and found number
of  PushTask object is 3584, and the number of ```64K``` byte[] objects is 5783, total around 361M:
![image](https://github.com/apache/incubator-celeborn/assets/948245/981e8f70-52f8-4bb1-9f67-9a8b4f398392)

Also, before this PR, total execution time is ```3313.8s```, after this PR, total execution time is ```3229.5s```.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and Manual test.

Closes #1722 from waitinfuture/802.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 16:35:14 +08:00
zky.zhoukeyong
4b3a47c9db [CELEBORN-799] Limit total inflight push requests
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
In case where worker instances is very large, say 1000, then before this PR total memory consumed
by inflight requests is 64K * 1000 * ```celeborn.client.push.maxReqsInFlight(16)``` = 1G. This PR
limits total inflight push requests, as 0.2.1-incubating does.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1720 from waitinfuture/799.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 16:17:24 +08:00
zky.zhoukeyong
a7bbbd05c4 [CELEBORN-797] Decrease writeTime metric sampling frequency to improve perf
### What changes were proposed in this pull request?
1. Decrease writeTime metric sampling frequency to improve perf
2. Set default value of ```celeborn.<module>.push.timeoutCheck.threads``` and ```celeborn.<module>.fetch.timeoutCheck.threads``` to 4

### Why are the changes needed?
Following are test cases
case 1: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 15000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 1.1T data
case 2: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 30000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 2.2T data
Following are e2e time of shuffle write stage
||Sort pusher before|Sort pusher after|Hash pusher before|Hash pusher after|
|----|----|----|----|-----|
|case1|4.4min|4.1min|4.4min|3.9min|
|case2|9.1min|8.4min|9.7min|8.5min|

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1718 from waitinfuture/797.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 20:51:50 +08:00
mingji
a4687716d2 [CELEBORN-791] Remove slots allocation simulation in master and use active slots sent from worker's heartbeat
### What changes were proposed in this pull request?
Master won't simulate slots allocations and use active slots sent from worker.

### Why are the changes needed?
I have observed that a new worker might allocate more slots than other workers when using the round-robin slot allocation algorithm.
There is a logic error in processing heartbeat from worker. It will update disk info's active slots to max(current disk info active slots, disk info sent from worker active slots). If I registered a huge shuffle, master will allocate more slots than a disk's max slots and mark them as unknown disk slots but worker will count the unknown disk slots as active slots and report it to the master. Then the slots release logic can not distinguish unknown slots from a number so the release will not decrease active slots properly.
Due to the gap between work and master, so I think it's OK to remove slots allocation simulation from worker and use active slots from worker.

Before this patch:
<img width="928" alt="截屏2023-07-12 16 51 15" src="https://github.com/apache/incubator-celeborn/assets/4150993/9c8a46d9-26a8-42f5-a956-938273277c9b">

After this patch:
<img width="509" alt="截屏2023-07-12 16 25 52" src="https://github.com/apache/incubator-celeborn/assets/4150993/c49b3d91-14ea-4eb8-9b71-9aab73541faf">

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1710 from FMX/CELEBORN-791.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 20:40:55 +08:00
e
f78a7d349f [CELEBORN-794] Fix link of CONFIGURATIONS in README
### What changes were proposed in this pull request?

Modify CONFIGURATIONS to point to the correct address

### Why are the changes needed?

CONFIGURATIONS in README.md points to an invalid address

![image](https://github.com/apache/incubator-celeborn/assets/14961757/538294ee-3432-4e1e-a45e-4dc1983d50e8)
![image](https://github.com/apache/incubator-celeborn/assets/14961757/d4681603-5317-46ae-a2f5-e58fa72c706c)

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?
NO

Closes #1714 from jiaoqingbo/CELEBORN-794.

Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 18:08:09 +08:00
无迹
e1337972e8 [CELEBORN-792] SparkShuffleManager.getWriter use wrong appUniqueId fo…
…r Spark2

### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA and manual test.

Closes #1717 from shujiewu/CELEBORN-792.

Authored-by: 无迹 <peter.wsj@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 17:17:48 +08:00
e
867469a201 [CELEBORN-795] Change the parameter of getLogger to ReviveManager.class
### What changes were proposed in this pull request?

Change the parameter of getLogger to ReviveManager.class

### Why are the changes needed?

The parameter of getLogger in the ReviveManager class should be ReviveManager.class

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

NO

Closes #1715 from jiaoqingbo/795.

Authored-by: e <1178404354@qq.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-14 15:52:25 +08:00
zky.zhoukeyong
03717375fc [CELEBORN-790][FOLLOWUP] Use allocator.compositeDirectBuffer to track memory leak
### What changes were proposed in this pull request?
According to https://github.com/apache/incubator-celeborn/pull/1709#discussion_r1260133078

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA.

Closes #1711 from waitinfuture/790-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-13 10:01:11 +08:00
zky.zhoukeyong
dcf6be29d8 [CELEBORN-789] Increase default value of flushBuffer's max components
### What changes were proposed in this pull request?
Set default value of ```celeborn.worker.push.compositeBuffer.maxComponents``` to 256, to be aligned with 0.2.1-incubating version.

### Why are the changes needed?

Default 16 is too small, and causes ~~severe GC~~ and CPU high load.

<img width="1719" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/9ab9675e-c19e-44f1-af46-90c29dc4df75">

### Does this PR introduce _any_ user-facing change?
No, it's internal config.

### How was this patch tested?
Passes GA.

Closes #1707 from waitinfuture/789.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-12 20:18:48 +08:00
Angerszhuuuu
1642090f9f [CELEBORN-781] Refactor RPC message type name
### What changes were proposed in this pull request?
After https://github.com/apache/incubator-celeborn/pull/1658 merged, we can format the message type now.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1696 from AngersZhuuuu/CELEBORN-731.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-12 14:16:19 +08:00
Fu Chen
90ba9f3e87 [CELEBORN-783][FOLLOWUP] Private member updates and cleanup in SortBasedPusher
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1699#discussion_r1259137323

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1704 from cfmcgrady/insert-record-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 23:08:42 +08:00
zky.zhoukeyong
9e74cf3547
[CELEBORN-790] Use pooled direct allocator for flusher's CompositeByteBuf
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
Before this PR, ```Flusher#takeBuffer``` returns a ```CompositeByteBuf``` which is unpooled and on heap:
```
      buffer = Unpooled.compositeBuffer(maxComponents)
```
```
    public static CompositeByteBuf compositeBuffer(int maxNumComponents) {
        return new CompositeByteBuf(ALLOC, /*direct*/ false, maxNumComponents);
    }
```
When consolidation happens, the data will be copied from direct memory to heap memory, causing OOM and
perf degration.
With this PR, in my test cases of shuffling 14G for three 1G/1G workers, I don't see disk buffer larger than direct memory,
nor do I encounter high GC.

### Does this PR introduce _any_ user-facing change?

This patch fixes some OOM issues.

### How was this patch tested?
Passes GA and manual test.

Closes #1709 from waitinfuture/790.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-11 22:44:57 +08:00
zky.zhoukeyong
3bc611bdd6 [CELEBORN-787] Add chunk related UTs for FileWriter
### What changes were proposed in this pull request?
Add chunk related UTs for FileWriter.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA.

Closes #1705 from waitinfuture/787.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 16:05:18 +08:00
Angerszhuuuu
b4dfb0352b [CELEBORN-733] Clean unused GetBlacklist & GetBlacklistResponse
### What changes were proposed in this pull request?
Clean unused GetBlacklist & GetBlacklistResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1656 from AngersZhuuuu/CELEBORN-733.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-11 15:58:20 +08:00
caojiaqing
d64e0091f1 [CELEBORN-785] Add worker side partition hard split threshold
### What changes were proposed in this pull request?
Add a configuration `celeborn.worker.shuffle.partitionSplit.max`  to ensure that, in soft mode, individual partition files are limited to a size smaller than the configured value

### Why are the changes needed?

In soft mode, there may be situations where individual partition files are exceptionally large, which can result in excessively long sort times in skewed scenarios.

### Does this PR introduce _any_ user-facing change?
`celeborn.worker.shuffle.partitionSplit.max` defalut value 2g

### How was this patch tested?
none

Closes #1701 from JQ-Cao/785.

Authored-by: caojiaqing <caojiaqing@bilibili.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 14:14:41 +08:00
zky.zhoukeyong
7a47fae230 [CELEBORN-786] Change default flush threads
### What changes were proposed in this pull request?
This PR changes default values of the following configs:

|config|previous default value|new default value|
|----|----|----|
|celeborn.worker.flusher.threads|2|16|
|celeborn.worker.flusher.ssd.threads|8|16|

### Why are the changes needed?
If disk type is not specified, ```celeborn.worker.flusher.threads``` will be used. Recently many users
use SSD for Celeborn workers without specifying disk type, and 2 flush threads is far from leveraging the power of SSD.

### Does this PR introduce _any_ user-facing change?
Yes, default configs are changed.

### How was this patch tested?
Passes GA.

Closes #1703 from waitinfuture/786.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 13:09:29 +08:00
Fu Chen
e47ec10cef [CELEBORN-783] Revise the conditions for the SortBasedPusher#insertRecord method
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

[comment](7adf1fca41 (r121138008))

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UT

Closes #1699 from cfmcgrady/insert-record.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 11:36:29 +08:00
Shuang
02492fe9f8 [CELEBORN-626][FOLLOWUP] Fix wrong update chunkOffsets when final flush with empty buffer
### What changes were proposed in this pull request?
Final flush with none empty buffer.

### Why are the changes needed?
This bug was introduced by [CELEBORN-626](https://github.com/apache/incubator-celeborn/pull/1534/files)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass UT

Closes #1702 from RexXiong/CELEBORN-626-dev.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 00:28:54 +08:00
Angerszhuuuu
9f09ac6ce9 [CELEBORN-780] Change SPARK_SHUFFLE_FORCE_FALLBACK_PARTITION_THRESHOLD default to Int.MaxValue since slot's is not a bottleneck
### What changes were proposed in this pull request?
Now slots is not a bottleneck, change SPARK_SHUFFLE_FORCE_FALLBACK_PARTITION_THRESHOLD default value to Int.MaxValue.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1695 from AngersZhuuuu/CELEBORN-780.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-10 18:50:10 +08:00
zky.zhoukeyong
334f0e5ba4 [CELEBORN-782] Make max components configurable for FileWriter#flushBuffer
### What changes were proposed in this pull request?
Make max components configurable for FileWriter#flushBuffer.

### Why are the changes needed?
When max components of ```CompositeByteBuf``` is too big (hard coded 256 before this PR), netty's offheap memory
usage will be several times bigger than true usage:
```
 Direct memory usage: 1044.0 MiB/4.0 GiB, disk buffer size: 255.9 MiB
```
When set to 1, netty's memory usage will be very close to disk buffer:
```
Direct memory usage: 376.0 MiB/4.0 GiB, disk buffer size: 353.0 MiB
```
but when it set too low, for example 1, performance might decrease, especially for sort pusher:
```
max components:   1 vs. 16
shuffle write time:   34s vs. 30s
```
For hash pusher, difference is not so big:
```
max components:   1 vs. 8
shuffle write time:   25s vs. 23s
```
This PR makes the parameter configurable.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA, and manual test.

Closes #1697 from waitinfuture/782.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-10 18:00:07 +08:00
mingji
a7ff28c26d [CELEBORN-779] Fix sorted file size summary overflow
### What changes were proposed in this pull request?
To fix sorted file size summary counter overflow. Sorted file size summary can be larger than the integer's max value. It should be applied to branch-0.3, branch-0.2, and main branch.

### Why are the changes needed?
This graph is incorrect.
<img width="1500" alt="lQLPJx4a6A76xo7NBOHNC7iwLBx50tXnv5IEoQTQxMAMAA_3000_1249" src="https://github.com/apache/incubator-celeborn/assets/4150993/8d288087-64d7-4569-ba04-05bb1915118a">

### Does this PR introduce _any_ user-facing change?
It will cause user-facing changes.  Celeborn worker's metric for sorted file summary is incorrect and this patch will correct it.

### How was this patch tested?
UT.

Closes #1694 from FMX/CELEBORN-779.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-10 11:29:47 +08:00
Cheng Pan
494c116818
[CELEBORN-778] Rename MemoryManagerStat to ServingState
### What changes were proposed in this pull request?

Rename

```
enum MemoryManagerStat { resumeAll, pausePushDataAndReplicate, pausePushDataAndResumeReplicate }
```

to

```
enum ServingState { NONE_PAUSED, PUSH_AND_REPLICATE_PAUSED, PUSH_PAUSED }
```

### Why are the changes needed?

`MemoryManagerStat` indicates the worker serving functionalities, and it's weird to represent the state using verbs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1691 from pan3793/CELEBORN-778.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-10 11:17:52 +08:00
Angerszhuuuu
fd82855e8b [CELEBORN-777][FOLLOWUP] CongestionControl getPotentialConsumeSpeed throw … …/zero error
### What changes were proposed in this pull request?
Fix compute bug

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1693 from AngersZhuuuu/CELEBORN-777.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-10 10:37:19 +08:00
Angerszhuuuu
52dcd3b5df [CELEBORN-777][BUG] CongestionControl getPotentialConsumeSpeed throw /zero error
### What changes were proposed in this pull request?
In `TimeSlidingHub.add()` `_deque` will clear then add the pair.

```
      if (nodesToAdd >= maxQueueSize) {
        // The new node exceed existing sliding list, need to clear all old nodes
        // and create a new sliding list
        _deque.clear();
        _deque.add(Pair.of(currentTimestamp, (N) newNode.clone()));
        sumNode = (N) newNode.clone();
        return;
      }

```

Then when call `BufferStatusHub.avgBytesPerSec()`,  `currentNumBytes` can be `> 0` but `getCurrentTimeWindowsInMills` may return 0. Cause the error.

```
  public long avgBytesPerSec() {
    long currentNumBytes = sum().numBytes();
    if (currentNumBytes > 0) {
      return currentNumBytes * 1000 / (long) getCurrentTimeWindowsInMills();
    }
    return 0L;
  }
```

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1690 from AngersZhuuuu/CELEBORN-777.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-08 21:46:37 +08:00
Cheng Pan
e314f85087
[CELEBORN-614][FOLLOWUP] Fix flushOnMemoryPressure condition
### What changes were proposed in this pull request?

Fix the refactor bug of CELEBORN-614 (https://github.com/apache/incubator-celeborn/pull/1517).

### Why are the changes needed?

This is a bug fix, the condition `writer.getException != null` was inverted accidentally during CELEBORN-614 (https://github.com/apache/incubator-celeborn/pull/1517), which causes the trim became no-op.

### Does this PR introduce _any_ user-facing change?

No. The bug was caused by an unreleased commit.

### How was this patch tested?

Set Worker off-heap memory to 2G, and run 1T tera sort.

Before: the trim does not trigger disk buffer flush, causing the worker can not to recover from the pause pushdata state, then Job failed.

After: the trim correctly triggers disk buffer flush, releases the worker memory, and the Job succeeded.

<img width="1653" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/9ef62c78-e6a9-497f-9dac-d3f712e830cc">

Closes #1689 from pan3793/CELEBORN-614-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-07-07 15:33:09 +08:00
Cheng Pan
a556b02bc1
[CELEBORN-747][FOLLOWUP] Fix nanoDurationToString and add UT
### What changes were proposed in this pull request?

Sorry, the last added `nanoDurationToString` is not correct, this PR fixes it and adds UT to verify.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No, unreleased change.

### How was this patch tested?

UT.

Closes #1688 from pan3793/CELEBORN-747-followup2.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-06 19:34:35 +08:00
Cheng Pan
ed035d7ab0
[CELEBORN-747][FOLLOWUP] avgFlushTime and avgFetchTime are in nano seconds
<!--
Thanks for sending a pull request!  Here are some tips for you:
  - Make sure the PR title start w/ a JIRA ticket, e.g. '[CELEBORN-XXXX] Your PR title ...'.
  - Be sure to keep the PR description updated to reflect all changes.
  - Please write your PR title to summarize what this PR proposes.
  - If possible, provide a concise example to reproduce the issue for a faster review.
-->

### What changes were proposed in this pull request?

`avgFlushTime` and `avgFetchTime` are in nano seconds, it was accidentally formatted by `msDurationToString` and caused unreasonable logs.

```
usableSpace: 1602.1 GiB, avgFlushTime: 1.35 h, avgFetchTime: 1187.66 h
```

### Why are the changes needed?

Fix logs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT is updated

Closes #1687 from pan3793/CELEBORN-747-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-06 17:59:49 +08:00
Fu Chen
2bd1d86d41
[CELEBORN-775] Fix executorCores calculation in SparkShuffleManager for Spark local mode
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

```shell
$ bin/spark-shell --master local[2]
23/07/06 16:11:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/06 16:11:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as 'sc' (master = local[2], app id = local-1688631101733).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sparkContext.getConf.get("spark.executor.cores")
java.util.NoSuchElementException: spark.executor.cores
  at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.SparkConf.get(SparkConf.scala:245)
  ... 47 elided

scala>
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CelebornPipelineSortSuite should cover this change

Closes #1685 from cfmcgrady/local-core-number.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-06 16:29:59 +08:00
Cheng Pan
4f8e72f217
[CELEBORN-774] Pullout celeborn.rpc.dispatcher.threads to CelebornConf
### What changes were proposed in this pull request?

Pullout hardcoded `celeborn.rpc.dispatcher.numThreads` to `CelebornConf` and rename it to `celeborn.rpc.dispatcher.threads` to align with existing configuration style

### Why are the changes needed?

Pullout inline configuration to `CelebornConf`, and expose it in configuration docs

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1684 from pan3793/CELEBORN-774.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-06 16:23:32 +08:00
zky.zhoukeyong
09881f5cff [CELEBORN-769] Change default value of celeborn.client.push.maxReqsInFlight to 16
…Flight to 16

### What changes were proposed in this pull request?
Change default value of celeborn.client.push.maxReqsInFlight to 16.

### Why are the changes needed?
Previous value 4 is too small, 16 is more reasonable.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #1683 from waitinfuture/769.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-06 10:22:06 +08:00
mingji
d0ecf83fec [CELEBORN-764] Fix celeborn on HDFS might clean using app directories
### What changes were proposed in this pull request?
Make Celeborn leader clean expired app dirs on HDFS when an application is Lost.

### Why are the changes needed?
If Celeborn is working on HDFS, the storage manager starts and cleans expired app directories, and the newly created worker will want to delete any unknown app directories.
This will cause using app directories to be deleted unexpectedly.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1678 from FMX/CELEBORN-764.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 23:11:50 +08:00
zky.zhoukeyong
4300835363 [CELEBORN-768] Change default config values for batch rpcs and netty …
…memory allocator

### What changes were proposed in this pull request?
Changes the following configs' default values
| config  | previous value | current value |
| ------------- | ------------- | ------------- |
| celeborn.network.memory.allocator.share  | false | true |
| celeborn.client.shuffle.batchHandleChangePartition.enabled  | false | true |
| celeborn.client.shuffle.batchHandleCommitPartition.enabled | false | true |

### Why are the changes needed?
In my test, when graceful shutdown is enabled but ```celeborn.client.shuffle.batchHandleChangePartition.enabled``` and ```celeborn.client.shuffle.batchHandleCommitPartition.enabled``` disabled, the worker takes much longer to stop than the two configs enabled.
In another test where worker size is quite small(2 cores 4 G) and replication is on, if shared allocator is disabled, the netty's onTrim fails to release memory, and further causes push data timeout.

### Does this PR introduce _any_ user-facing change?
No, these conifgs are introduces from 0.3.0.

### How was this patch tested?
Passes GA.

Closes #1682 from waitinfuture/768.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 18:16:41 +08:00
zhongqiang.czq
95f08300e5 [CELEBORN-766][LICENSE] Upgrade notice file responding to dependency …
…upgrading

### What changes were proposed in this pull request?

upgrade notice-binary for commons-lang and commons-io

### Why are the changes needed?
In pre prs, the following dependencies are upgraded and its' notice file are upgraded too.
Bump commons-io to 2.13.0 [[CELEBORN-743]](https://issues.apache.org/jira/projects/CELEBORN/issues/CELEBORN-743?filter=allissues)
Bump commons-lang3 to 3.12.0 [[CELEBORN-736]](https://issues.apache.org/jira/projects/CELEBORN/issues/CELEBORN-736?filter=allissues)
1. NOTICE.txt  in commons-lang3-3.12.0.jar
```
Apache Commons Lang
Copyright 2001-2021 The Apache Software Foundation
```
2. NOTICE.txt  in commons-io-2.13.0.jar
```
Apache Commons IO
Copyright 2002-2023 The Apache Software Foundation
```

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
manually

Closes #1681 from zhongqiangczq/notice-0705.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-07-05 18:13:21 +08:00
Fu Chen
3af5c231c7 [CELEBORN-767][DOC] Update the docs of celeborn.client.spark.push.sort.memory.threshold
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

To clarify the usage of conf `celeborn.client.spark.push.sort.memory.threshold`

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1680 from cfmcgrady/docs.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 18:07:09 +08:00
zhongqiang.czq
a0f4be67a9 [CELEBORN-765][DOC] Disable partitionSplit in Flink engine related co…
…nfigurations

### What changes were proposed in this pull request?
In Doc Readme, setting partitionSplit to false should be added in Flink engine related configurations.

### Why are the changes needed?
Currently, Mappartition split is not supported, but shuffle partition split is enabled by default, so error will be thrown when flink task's shuffle data size exceeds 1G(by Default).

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
manually

Closes #1679 from zhongqiangczq/readme.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-07-05 18:04:10 +08:00