Commit Graph

112 Commits

Author SHA1 Message Date
SteNicholas
82022a9427
[CELEBORN-1362] Remove unnecessary configuration celeborn.client.flink.inputGate.minMemory and celeborn.client.flink.resultPartition.minMemory
### What changes were proposed in this pull request?

Remove unnecessary configuration `celeborn.client.flink.inputGate.minMemory` and `celeborn.client.flink.resultPartition.minMemory`.

### Why are the changes needed?

`celeborn.client.flink.inputGate.minMemory` and `celeborn.client.flink.resultPartition.minMemory` are configured as min memory reserved at present. Meanwhile, `celeborn.client.flink.inputGate.memory` should be at least `networkBufferSize * MIN_BUFFERS_PER_GATE` bytes, and `celeborn.client.flink.resultPartition.memory` should be at least `networkBufferSize * MIN_BUFFERS_PER_PARTITION` bytes. Therefore, `celeborn.client.flink.inputGate.minMemory` and `celeborn.client.flink.resultPartition.minMemory` are unnecessary configuration for `celeborn.client.flink.inputGate.memory` and `celeborn.client.flink.resultPartition.memory`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`PluginSideConfSuiteJ#testCoalesce`

Closes #2433 from SteNicholas/CELEBORN-1362.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-01 11:15:14 +08:00
SteNicholas
e29f013e3a [CELEBORN-1357] AbstractRemoteShuffleResultPartitionFactory should remove the check of shuffle compression codec
### What changes were proposed in this pull request?

`AbstractRemoteShuffleResultPartitionFactory` removes the check of shuffle compression codec.

### Why are the changes needed?

`AbstractRemoteShuffleResultPartitionFactory` checks whether shuffle compression codec is LZ4 for Flink 1.14 and 1.15 version at present. Meanwhile, since Flink 1.17 version, ZSTD has already been supported. Therefore `AbstractRemoteShuffleResultPartitionFactory` should remove the check of shuffle compression codec for Flink 1.17 version and above, which is checked via the constructor of `BufferCompressor`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `RemoteShuffleResultPartitionFactorySuiteJ`

Closes #2414 from SteNicholas/CELEBORN-1357.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-25 15:44:45 +08:00
CodingCat
c6c319d865 [CELEBORN-1309][FOLLOWUP] Cap the max memory can be used for sort buffer
### What changes were proposed in this pull request?

add a new parameter to cap the max memory can be used for sort writer buffer

### Why are the changes needed?

with a huge number of partitions, the threshold based on buffer size * number of partitions without this cap can be too large, e.g. 64K * 100000 = 6G

### Does this PR introduce _any_ user-facing change?

a new parameter

### How was this patch tested?

ut

Closes #2388 from CodingCat/adaptive_followup.

Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-25 12:08:04 +08:00
sychen
91f6378682 [CELEBORN-1336] Remove client partition split pool
### What changes were proposed in this pull request?

### Why are the changes needed?
`CELEBORN-1320` uses `ReviveManager` to batch processing SOFT_SPLIT event RPC, so `partitionSplitPool` is no longer used, and the configuration item `celeborn.client.push.splitPartition.threads` is meaningless.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2396 from cxzl25/CELEBORN-1336.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-18 21:48:59 +08:00
CodingCat
2e251457f3 [CELEBORN-1309] Support adaptive management of memory threshold for SortBasedWriter
### What changes were proposed in this pull request?

while SortBasedWriter has less memory footprint than HashBasedWriter, it suffers from performance issue when we have  many partitions and the write buffer is filled with small chunks of data quickly

 for example, if sort buffer size is 32K, you have 4 partitions and 128K data in total, the data distribution is like partition A, B, C, D, each time it comes with 8K per partition.... in this case, you need to compress and send small 8K chunk 4 times per partition , the cost would become very high. If you use hashbasedwriter, it doesn't have this problem since the push only happens when the per-partition buffer is full. Of course , larger sort buffer size can mitigate the issue, but tuning sort buffer size per job is a tedious work

this PR introduces a new feature that we measure total size of pushed bytes and pushed count as well as the "should-pushed" bytes and counts (should-push means that , the data we pushed is larger than CLIENT_PUSH_BUFFER_MAX_SIZE (in another word, we will trigger a push even with hashbasedwriter in this case))

when actualPushedBytes/actualPushedCounts > (1 + Threshold) * (ShouldPushBytes/ShouldPushCounts), we will enlarge the sort buffer size by 1X to try to buffer more data before pushing  (the max size of sortBuffer would be capped at # of partitions * CLIENT_PUSH_BUFFER_MAX_SIZE)

### Why are the changes needed?

to reduce perf cost in sortbased writer

### Does this PR introduce _any_ user-facing change?

no, but have 2 extra configurations

### How was this patch tested?

in prod of our company and also unit test

Closes #2358 from CodingCat/adaptive_memory_threshold.

Authored-by: CodingCat <zhunansjtu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-13 13:54:12 +08:00
Angerszhuuuu
7b0211e345 [CELEBORN-1277] Add celeborn.quota.enabled at Master and Client side to enable checking quota
### What changes were proposed in this pull request?

Add `celeborn.quota.enabled` at Master and Client side to enable checking quota

### Why are the changes needed?

`celeborn.quota.enabled` should be added in Master and Client side to enable quota check for Celeborn Master and Client.

### Does this PR introduce _any_ user-facing change?

Add categories of `celeborn.quota,enabled` with `master` and `client`.

### How was this patch tested?

No.

Closes #2318 from AngersZhuuuu/CELEBORN-1277.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2024-02-26 11:33:14 +08:00
Chandni Singh
9185cae35a [CELEBORN-1257][FOLLOWUP] Removed the additional secured port from Celeborn Master
### What changes were proposed in this pull request?
https://github.com/apache/incubator-celeborn/pull/2292#discussion_r1497160753
Based on the above discussion, removing the additional secured port. The existing port will be used for secured communication when auth is enabled.

### Why are the changes needed?
These changes are for enabling authentication

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
This removed additional secured port.

Closes #2327 from otterc/CELEBORN-1257.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
2024-02-25 00:09:05 +08:00
Angerszhuuuu
92704c7d06 [CELEBORN-1051] Add isDynamic property for CelebornConf
### What changes were proposed in this pull request?
Since we support ConfigService, many configuration can be dynamic, add `isDynamic` property for CelebornConf in this pr.

### Why are the changes needed?
Make configuration doc more cleear

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #2308 from AngersZhuuuu/CELEBORN-1051.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2024-02-20 14:20:44 +08:00
xiyu.zk
c1837536a1 [CELEBORN-1267] Add config to control worker check in CelebornShuffleFallbackPolicyRunner
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
For some scenarios, if Celeborn cannot be used, users want to report an error directly instead of fallback.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI

Closes #2291 from kerwin-zk/add-config.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-02-07 19:01:41 +08:00
Chandni Singh
ab4c0bc85b [CELEBORN-1257] Adds a secured port in Celeborn Master for secure communication with LifecycleManager
### What changes were proposed in this pull request?
This adds a secured port to Celeborn Master which is used for secure communication with LifecycleManager.
This is part of adding authentication support in Celeborn (see CELEBORN-1011).

This change targets just adding the secured port to Master. The following items from the proposal are still pending:
1. Persisting the app secrets in Ratis.
2. Forwarding secrets to Workers and having ability for the workers to pull registration info from the Master.
3. Secured and internal port in Workers.
4. Secured communication between workers and clients.

In addition, since we are supporting both secured and unsecured communication for backward compatibility and seamless rolling upgrades, there is an additional change needed. An app which registers with the Master can try to talk to the workers on unsecured ports which is a security breach. So, the workers need to know whether an app registered with Master or not and for that Master has to propagate list of un-secured apps to Celeborn workers as well. We can discuss this more with https://issues.apache.org/jira/browse/CELEBORN-1261

### Why are the changes needed?
It is needed for adding authentication support to Celeborn (CELEBORN-1011)

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added a simple UT.

Closes #2281 from otterc/CELEBORN-1257.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-02-06 14:53:28 +08:00
mingji
0249698c5b
[CELEBORN-1247] Output config's alternatives to doc
### What changes were proposed in this pull request?
Add configs' alternatives to doc.

### Why are the changes needed?
To help users use correct configs.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA.

Closes #2253 from FMX/b1241.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-24 11:21:23 +08:00
mingji
a3c28d0b34 [CELEBORN-1150] Revert "[] support io encryption for spark"
### What changes were proposed in this pull request?
Revert "[CELEBORN-1150] support io encryption for spark".

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2208 from FMX/b1150-3.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-04 13:00:58 +08:00
Cheng Pan
77e468161d [CELEBORN-891] Remove pipeline feature for sort based writer
### What changes were proposed in this pull request?

Remove pipeline feature for sort based writer

### Why are the changes needed?

The pipeline feature is added as part of CELEBORN-295, for performance. Eventually, an unresolvable issue that would crash the JVM was identified in https://github.com/apache/incubator-celeborn/pull/1807, and after discussion, we decided to delete this feature.

### Does this PR introduce _any_ user-facing change?

No, the pipeline feature is disabled by default, there are no changes to users who use the default settings.

### How was this patch tested?

Pass GA.

Closes #2196 from pan3793/CELEBORN-891.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-01 10:42:17 +08:00
liangyongyuan
4304be1a60 [CELEBORN-1172][SPARK] Support dynamic switch shuffle push write mode based on partition number
### What changes were proposed in this pull request?
Dynamically determine the writing mode in Spark based on the number of partitions.

### Why are the changes needed?
Enhance the flexibility of shuffle writes to improve performance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add uts

Closes #2160 from lyy-pineapple/dynamic-write-mode.

Lead-authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-21 16:58:51 +08:00
mingji
4dacf72a6d
[CELEBORN-1150] support io encryption for spark
### What changes were proposed in this pull request?
1. To support io encryption for spark.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and manually test on a cluster.

Closes #2135 from FMX/B1150.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-19 11:44:05 +08:00
zky.zhoukeyong
01feb93abb [CELEBORN-1167] Avoid calling parmap when destroy slots
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create huge number of threads and causes OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager setup connection to workers
4. When LifecycleManager call destroy slots

This PR fixes the fourth one. To be more detail, this PR eliminates `parmap` when destroying slots, and also replaces `askSync` with `ask`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2156 from waitinfuture/1167.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-15 18:30:31 +08:00
Fu Chen
0f2a9a3a63 [CELEBORN-1160][FOLLOWUP] Update the version for celeborn.client.rpc.shared.threads to 0.3.2
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Since we are backporting #2145 to branch-0.3, and the configuration entry `celeborn.client.rpc.shared.threads` in #2145
 has a start version of 0.4.0, this update aligns the version accordingly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #2153 from cfmcgrady/celeborn-1160-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 15:12:50 +08:00
zky.zhoukeyong
92bebd305d [CELEBORN-1160] Avoid calling parmap when commit files
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create huge number of threads and causes OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager setup connection to workers
4. When StorageManager calls close

This PR fixes the first one. To be more detail, this PR eliminates `parmap` when doing committing files, and also replaces `askSync` with `ask`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2145 from waitinfuture/1160.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 14:36:48 +08:00
wangshengjie
8516df4beb [CELEBORN-1151] Request slots when register shuffle should filter the workers excluded by application
### What changes were proposed in this pull request?
When request slots, filter workers excluded by application

### Why are the changes needed?
If worker alive but can not service, register shuffle will remove the worker from application client exclude list and next shuffle may reserve slots on this worker,this will cause application revive unexpectly

### Does this PR introduce _any_ user-facing change?
Yes, request slots will filter workers excluded by application

### How was this patch tested?
UT,

Closes #2131 from wangshengjie123/fix-request-slots-blacklist.

Authored-by: wangshengjie <wangshengjie3@xiaomi.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-12 10:02:18 +08:00
exmy
8a15396cb6 [CELEBORN-1145] Separate clientPushBufferMaxSize from CelebornInputStreamImpl
### What changes were proposed in this pull request?
The `clientPushBufferMaxSize` config is also used by `CelebornInputStreamImpl`, it's a config about push side and should not be used by fetch side. This pr introduces a fetch config to replace it.

### Why are the changes needed?

As above

### Does this PR introduce _any_ user-facing change?

Yes, a new config `celeborn.client.fetch.buffer.size` is introduced.

### How was this patch tested?

Pass CI

Closes #2118 from exmy/celeborn-1145.

Authored-by: exmy <xumovens@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-11-30 18:56:03 +08:00
Erik.fang
aee41555c6 [CELEBORN-955] Re-run Spark Stage for Celeborn Shuffle Fetch Failure
### What changes were proposed in this pull request?
Currently, Celeborn uses replication to handle shuffle data lost for celeborn shuffle reader, this PR implements an alternative solution by Spark stage resubmission.

Design doc:
https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8/edit

### Why are the changes needed?
Spark stage resubmission uses less resources compared with replication, and some Celeborn users are also asking for it

### Does this PR introduce _any_ user-facing change?
a new config celeborn.client.fetch.throwsFetchFailure is introduced to enable this feature

### How was this patch tested?
two UTs are attached, and we also tested it in Ant Group's Dev spark cluster

Closes #1924 from ErikFang/Re-run-Spark-Stage-for-Celeborn-Shuffle-Fetch-Failure.

Lead-authored-by: Erik.fang <fmerik@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-26 16:47:58 +08:00
Shuang
931880a82d [CELEBORN-1112] Inform celeborn application is shutdown, then celeborn cluster can release resource immediately
### What changes were proposed in this pull request?
Unregister application to Celeborn master After Application stopped, then master will expire the related shuffle resource immediately, resulting in resource savings.

### Why are the changes needed?
Currently Celeborn master expires the related application shuffle resource only when application is being checked timeout,
this would greatly delay the release of resources, which is not conducive to saving resources.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
PASS GA

Closes #2075 from RexXiong/CELEBORN-1112.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-08 20:46:51 +08:00
mingji
5e77b851c9 [CELEBORN-1081] Client support celeborn.storage.activeTypes config
### What changes were proposed in this pull request?
1.To support `celeborn.storage.activeTypes` in Client.
2.Master will ignore slots for "UNKNOWN_DISK".

### Why are the changes needed?
Enable client application to select storage types to use.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
GA and cluster.

Closes #2045 from FMX/B1081.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-03 20:03:11 +08:00
SteNicholas
3092644168 [CELEBORN-1095] Support configuration of fastest available XXHashFactory instance for checksum of Lz4Decompressor
### What changes were proposed in this pull request?

`CelebornConf` adds `celeborn.client.shuffle.decompression.lz4.xxhash.instance` to configure fastest available `XXHashFactory` instance for checksum of `Lz4Decompressor`. Fix #2043.

### Why are the changes needed?

`Lz4Decompressor` creates the checksum with `XXHashFactory#fastestInstance`, which returns the fastest available `XXHashFactory` instance that uses nativeInstance at default. The fastest available `XXHashFactory` instance for checksum of `Lz4Decompressor` could be supported to configure instead of dependency on the class loader is the system class loader, which method is as follows:
```
/**
 * Returns the fastest available {link XXHashFactory} instance. If the class
 * loader is the system class loader and if the
 * {link #nativeInstance() native instance} loads successfully, then the
 * {link #nativeInstance() native instance} is returned, otherwise the
 * {link #fastestJavaInstance() fastest Java instance} is returned.
 * <p>
 * Please read {link #nativeInstance() javadocs of nativeInstance()} before
 * using this method.
 *
 * return the fastest available {link XXHashFactory} instance.
 */
public static XXHashFactory fastestInstance() {
  if (Native.isLoaded()
      || Native.class.getClassLoader() == ClassLoader.getSystemClassLoader()) {
    try {
      return nativeInstance();
    } catch (Throwable t) {
      return fastestJavaInstance();
    }
  } else {
    return fastestJavaInstance();
  }
}
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `CelebornConfSuite`
- `ConfigurationSuite`

Closes #2050 from SteNicholas/CELEBORN-1095.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
2023-10-31 14:57:31 +08:00
mingji
e0c00ecd38 [CELEBORN-839][MR] Support Hadoop MapReduce
### What changes were proposed in this pull request?
1. Map side merge and push.
2. Support hadoop2 & 3.
3. Reduce in-memory merge.
4. Integrate LifecycleManager to RmApplicationMaster.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

I tested this PR on a cluster with a 4x 16 CPU 64G Mem 4ESSD cluster.
Hadoop 2.8.5

1TB Terasort, 8400 mappers, 1000 reducers
Celeborn 81min vs MR shuffle 89min
![mr1](https://github.com/apache/incubator-celeborn/assets/4150993/a3cf6493-b6ff-4c03-9936-4558cf22761d)
![mr2](https://github.com/apache/incubator-celeborn/assets/4150993/9119ffb4-6996-4b77-bcdf-cbd6db5c096f)

1GB wordcount, 8 mappers, 8 reducers
Celeborn 35s VS MR shuffle 38s
![mr3](https://github.com/apache/incubator-celeborn/assets/4150993/907dce24-16b7-4788-ab5d-5b784fd07d47)
![mr4](https://github.com/apache/incubator-celeborn/assets/4150993/8e8065b9-6c46-4c8d-9e71-45eed8e63877)

Closes #1830 from FMX/CELEBORN-839.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-14 14:12:53 +08:00
zky.zhoukeyong
a42ec85a6e [CELEBORN-943][PERF] Pre-create CelebornInputStreams in CelebornShuffleReader
### What changes were proposed in this pull request?
This PR fixes performance degradation when Spark's coalescePartitions takes effect caused
by RPC latency.

### Why are the changes needed?
I encountered a performance degradation when testing  tpcds 10T q10:
||Time|
|---|---|
|ESS|14s|
|Celeborn| 24s|

After digging into it I found out that q10 triggers partition coalescence:
![image](https://github.com/apache/incubator-celeborn/assets/948245/0b4745da-8d57-4661-a35d-683d97f56e1d)

As I configured `spark.sql.adaptive.coalescePartitions.initialPartitionNum` to 1000, `CelebornShuffleReader`
will call `shuffleClient.readPartition` sequentially 1000 times, causing the delay.

This PR optimizes by calling `shuffleClient.readPartition` in parallel. After this PR q10 time becomes 14s.

### Does this PR introduce _any_ user-facing change?
No, but introduced a new client side configuration `celeborn.client.streamCreatorPool.threads`
which defaults to 32.

### How was this patch tested?
TPCDS 1T and passes GA.

Closes #1876 from waitinfuture/943.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-04 21:46:11 +08:00
zhongqiang.czq
b66eaff880 [CELEBORN-627][FLINK] Support split partitions
### What changes were proposed in this pull request?
In MapPartiitoin, datas are split into regions.

1. Unlike ReducePartition whose partition split can occur on pushing data
to keep MapPartition data ordering,  PartitionSplit only be done on the time of sending PushDataHandShake or RegionStart messages (As shown in the following image). That's to say that the partition split only appear at the beginnig of a region but not inner a region.
> Notice: if the client side think that it's failed to push HandShake or RegionStart messages. but the worker side can still receive normal HandShake/RegionStart message. After client revive succss, it don't push any messages to old partition, so the worker having the old partition will create a empty file. After committing files, the worker will return empty commitids. That's to say that empty file will be filterd after committing files and ReduceTask will not read any empty files.

![image](https://github.com/apache/incubator-celeborn/assets/96606293/468fd660-afbc-42c1-b111-6643f5c1e944)

2. PushData/RegioinFinish don't care the following cases:
 - Diskfull
 - ExceedPartitionSplitThreshold
 - Worker ShuttingDown
so if one of the above three conditions appears, PushData and RegionFinish cant still do as normal. Workers should consider the ShuttingDown case and  try best to wait all the regions finished before shutting down.

if PushData or RegionFinish failed like network timeout and so on, then MapTask will failed and start another attempte maptask.

![image](https://github.com/apache/incubator-celeborn/assets/96606293/db9f9166-2085-4be1-b09e-cf73b469c55b)

3. how shuffle read supports partition split?
ReduceTask should get split paritions by order and open the stream by partition epoc orderly

### Why are the changes needed?
PartiitonSplit is not supported by MapPartition from now.
There still a risk that  a partition file'size is too large to store the file on worker disk.
To avoid this risk, this pr introduces partition split in shuffle read and shuffle write.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and manual TPCDS test

Closes #1550 from FMX/CELEBORN-627.

Lead-authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <ethanfeng@apache.org>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-09-01 19:25:51 +08:00
mingji
505ba804c7 [CELEBORN-752] Support read local shuffle file for spark
### What changes were proposed in this pull request?
For spark clusters, support read local shuffle file if Celeborn is co-deployed with yarn node managers. This PR help to reduce the number of active connections.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.  The performance is identical whether you enable local reader, but the active connection number may vary according to your connections per peer.
<img width="951" alt="截屏2023-08-16 20 20 14" src="https://github.com/apache/incubator-celeborn/assets/4150993/9106e731-28fc-4e78-9c05-ae6a269d249a">
The active connection number changed from 3745 to 2894. This PR will help to improve cluster stability.

Closes #1812 from FMX/CELEBORN-752.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-30 18:52:18 +08:00
lishiyucn
57a35ca349 [CELEBORN-498] Add new config for DfsPartitionReader's chunk size
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
Make `celeborn.shuffle.chunk.size` worker side only config.
Add a new client side config `celeborn.client.fetch.dfsReadChunkSize` for DfsPartitionReader

### Does this PR introduce _any_ user-facing change?
Yes, the chunks size of DfsPartitionReader is changed from client side config `celeborn.shuffle.chunk.size`
to `celeborn.client.fetch.dfsReadChunkSize`

### How was this patch tested?
Passes GA

Closes #1834 from lishiyucn/main.

Lead-authored-by: lishiyucn <675590586@qq.com>
Co-authored-by: shiyu li <675590586@qq.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-24 21:31:34 +08:00
e
4a4a37ed17 [MINOR] Fix typo in CelebornConf
### What changes were proposed in this pull request?

Fix typo in CelebornConf

### Why are the changes needed?

Fix typo in CelebornConf

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Passing GA

Closes #1813 from jiaoqingbo/typo-conf.

Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-15 10:32:08 +08:00
Fu Chen
516bdc7e08
[CELEBORN-877][DOC] Document on SBT
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

Closes #1795 from cfmcgrady/sbt-docs.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-11 12:17:55 +08:00
zky.zhoukeyong
6ea1ee2ec4 [CELEBORN-152] Add config to limit max workers when offering slots
### What changes were proposed in this pull request?
Add config to limit max workers when offering slots, the config can be set both
in server side and client side. Celeborn will choose the smaller positive configs from client and master.

### Why are the changes needed?
For large Celeborn clusters, users may want to limit the number of workers that
a shuffle can spread, reasons are:

1. One worker failure will not affect all applications
2. One huge shuffle will not affect all applications
3. It's more efficient to limit a shuffle within a restricted number of workers, say 100, than
    spreading across a large number of workers, say 1000, because the network connections
   in pushing data is `number of ShuffleClient` * `number of allocated Workers`

The recommended number of Workers should depend on workload and Worker hardware,
and this can be configured per application, so it's relatively flexible.

### Does this PR introduce _any_ user-facing change?
No, added a new configuration.

### How was this patch tested?
Added ITs and passes GA.

Closes #1790 from waitinfuture/152.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-07 10:13:53 +08:00
zky.zhoukeyong
6a5e3ed794 [CELEBORN-812] Cleanup SendBufferPool if idle for long
### What changes were proposed in this pull request?
Cleans up the pooled send buffers and push tasks if the SendBufferPool has been idle for more than
`celeborn.client.push.sendbufferpool.expireTimeout`.

### Why are the changes needed?
Before this PR the SendBufferPool will cache the send buffers and push tasks forever. If they are large
and will not be reused in the future, it wastes memory and causes GC.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual tests.

Closes #1735 from waitinfuture/812-1.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-20 00:34:55 +08:00
onebox-li
405b2801fa [CELEBORN-810] Fix some typos and grammar
### What changes were proposed in this pull request?
Fix some typos and grammar

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manually test

Closes #1733 from onebox-li/fix-typo.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-19 18:35:38 +08:00
Cheng Pan
0db919403e Revert "[CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…"
This reverts commit e56a8a8bed.
2023-07-19 15:08:45 +08:00
zky.zhoukeyong
1109e2c8f4 [CELEBORN-803][FOLLOWUP] Make ``rpcAskTimeout`` default to 60s
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
Timeout of ```RpcEndpointRef.ask``` is controlled by ```celeborn.rpc.askTimeout```,
so we also need to increase ```celeborn.rpc.askTimeout``` to extend the timeout of commit files.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1725 from waitinfuture/803-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 23:53:52 +08:00
zky.zhoukeyong
9ec223edd7 [CELEBORN-803] Increase default timeout for commit files
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
In 0.2.1-incubating, commit files default timeout is ```NETWORK_TIMEOUT```, which is 240s.
It's more reasonable because commit files costs relatively long time. In my testing with tough disks,
30s timeout with 2 retires is not enough.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1724 from waitinfuture/803.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 22:31:36 +08:00
zky.zhoukeyong
e56a8a8bed [CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…
…up client

### What changes were proposed in this pull request?
Add heartbeat from client to lifecycle manager. In this PR heartbeat request contains local shuffle ids from
client, lifecycle manager checks with it's local set and returns ids it doesn't know. Upon receiving response,
client calls ```unregisterShuffle``` for cleanup.

### Why are the changes needed?
Before this PR, client side ```unregisterShuffle``` is never called. When running TPCDS 3T with spark thriftserver
without DRA, I found the Executor's heap contains 1.6 million PartitionLocation objects (and StorageInfo):
![image](https://github.com/apache/incubator-celeborn/assets/948245/43658369-7763-4511-a5b0-9b3fbdf02005)

After this PR, the number of PartitionLocation objects decreases to 275 thousands
![image](https://github.com/apache/incubator-celeborn/assets/948245/45f8f849-186d-4cad-83c8-64bd6d18debc)

This heartbeat can be extended in the future for other purposes, i.e. reporting client's metrics.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and  manual test.

Closes #1719 from waitinfuture/798.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 18:14:10 +08:00
zky.zhoukeyong
95119b1e4b [CELEBORN-799][FOLLOWUP] Fix doc of celeborn.client.push.maxReqsInFlight.total
…Flight.total```

### What changes were proposed in this pull request?
Refer to https://github.com/apache/incubator-celeborn/pull/1720#discussion_r1265092164

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA.

Closes #1723 from waitinfuture/799-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 18:01:03 +08:00
zky.zhoukeyong
4b3a47c9db [CELEBORN-799] Limit total inflight push requests
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
In case where worker instances is very large, say 1000, then before this PR total memory consumed
by inflight requests is 64K * 1000 * ```celeborn.client.push.maxReqsInFlight(16)``` = 1G. This PR
limits total inflight push requests, as 0.2.1-incubating does.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1720 from waitinfuture/799.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 16:17:24 +08:00
Angerszhuuuu
9f09ac6ce9 [CELEBORN-780] Change SPARK_SHUFFLE_FORCE_FALLBACK_PARTITION_THRESHOLD default to Int.MaxValue since slot's is not a bottleneck
### What changes were proposed in this pull request?
Now slots is not a bottleneck, change SPARK_SHUFFLE_FORCE_FALLBACK_PARTITION_THRESHOLD default value to Int.MaxValue.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1695 from AngersZhuuuu/CELEBORN-780.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-10 18:50:10 +08:00
zky.zhoukeyong
09881f5cff [CELEBORN-769] Change default value of celeborn.client.push.maxReqsInFlight to 16
…Flight to 16

### What changes were proposed in this pull request?
Change default value of celeborn.client.push.maxReqsInFlight to 16.

### Why are the changes needed?
Previous value 4 is too small, 16 is more reasonable.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #1683 from waitinfuture/769.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-06 10:22:06 +08:00
mingji
d0ecf83fec [CELEBORN-764] Fix celeborn on HDFS might clean using app directories
### What changes were proposed in this pull request?
Make Celeborn leader clean expired app dirs on HDFS when an application is Lost.

### Why are the changes needed?
If Celeborn is working on HDFS, the storage manager starts and cleans expired app directories, and the newly created worker will want to delete any unknown app directories.
This will cause using app directories to be deleted unexpectedly.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1678 from FMX/CELEBORN-764.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 23:11:50 +08:00
zky.zhoukeyong
4300835363 [CELEBORN-768] Change default config values for batch rpcs and netty …
…memory allocator

### What changes were proposed in this pull request?
Changes the following configs' default values
| config  | previous value | current value |
| ------------- | ------------- | ------------- |
| celeborn.network.memory.allocator.share  | false | true |
| celeborn.client.shuffle.batchHandleChangePartition.enabled  | false | true |
| celeborn.client.shuffle.batchHandleCommitPartition.enabled | false | true |

### Why are the changes needed?
In my test, when graceful shutdown is enabled but ```celeborn.client.shuffle.batchHandleChangePartition.enabled``` and ```celeborn.client.shuffle.batchHandleCommitPartition.enabled``` disabled, the worker takes much longer to stop than the two configs enabled.
In another test where worker size is quite small(2 cores 4 G) and replication is on, if shared allocator is disabled, the netty's onTrim fails to release memory, and further causes push data timeout.

### Does this PR introduce _any_ user-facing change?
No, these conifgs are introduces from 0.3.0.

### How was this patch tested?
Passes GA.

Closes #1682 from waitinfuture/768.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 18:16:41 +08:00
Fu Chen
3af5c231c7 [CELEBORN-767][DOC] Update the docs of celeborn.client.spark.push.sort.memory.threshold
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

To clarify the usage of conf `celeborn.client.spark.push.sort.memory.threshold`

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1680 from cfmcgrady/docs.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 18:07:09 +08:00
xiyu.zk
381165d4e7
[CELEBORN-755] Support disable shuffle compression
### What changes were proposed in this pull request?
Support to decide whether to compress shuffle data through configuration.

### Why are the changes needed?
Currently, Celeborn compresses all shuffle data, but for example, the shuffle data of Gluten has already been compressed. In this case, no additional compression is required. Therefore, configuration needs to be provided for users to decide whether to use Celeborn’s compression according to the actual situation.

### Does this PR introduce _any_ user-facing change?
no.

Closes #1669 from kerwin-zk/celeborn-755.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-01 00:03:50 +08:00
Fu Chen
adbd38a926
[CELEBORN-726][FOLLOWUP] Update data replication terminology from master/slave to primary/replica in the codebase
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #1639 from cfmcgrady/primary-replica.

Lead-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 17:07:26 +08:00
Fu Chen
17c1e01874
[CELEBORN-726] Update data replication terminology from master/slave to primary/replica for configurations and metrics
### What changes were proposed in this pull request?

This pull PR is an integral component of #1639 . It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC.

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests.

Closes #1650 from cfmcgrady/primary-replica-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 09:47:02 +08:00
onebox-li
1b74d85fb1 [CELEBORN-725][MINOR] Refine congestion code
### What changes were proposed in this pull request?
Refine the congestion relevant code/log/comments

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manually test

Closes #1637 from onebox-li/improve-congestion.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 18:31:40 +08:00
Angerszhuuuu
33cf343d20 [CELEBORN-666][REFACTOR] Unify exclude and blacklist related configuration
### What changes were proposed in this pull request?
Unify exclude and blacklist related configuration

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1633 from AngersZhuuuu/CELEBORN-666-NEW.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-28 10:59:58 +08:00