Commit Graph

257 Commits

Author SHA1 Message Date
zky.zhoukeyong
e56a8a8bed [CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…
…up client

### What changes were proposed in this pull request?
Add heartbeat from client to lifecycle manager. In this PR heartbeat request contains local shuffle ids from
client, lifecycle manager checks with it's local set and returns ids it doesn't know. Upon receiving response,
client calls ```unregisterShuffle``` for cleanup.

### Why are the changes needed?
Before this PR, client side ```unregisterShuffle``` is never called. When running TPCDS 3T with spark thriftserver
without DRA, I found the Executor's heap contains 1.6 million PartitionLocation objects (and StorageInfo):
![image](https://github.com/apache/incubator-celeborn/assets/948245/43658369-7763-4511-a5b0-9b3fbdf02005)

After this PR, the number of PartitionLocation objects decreases to 275 thousands
![image](https://github.com/apache/incubator-celeborn/assets/948245/45f8f849-186d-4cad-83c8-64bd6d18debc)

This heartbeat can be extended in the future for other purposes, i.e. reporting client's metrics.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and  manual test.

Closes #1719 from waitinfuture/798.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 18:14:10 +08:00
zky.zhoukeyong
95119b1e4b [CELEBORN-799][FOLLOWUP] Fix doc of celeborn.client.push.maxReqsInFlight.total
…Flight.total```

### What changes were proposed in this pull request?
Refer to https://github.com/apache/incubator-celeborn/pull/1720#discussion_r1265092164

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA.

Closes #1723 from waitinfuture/799-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 18:01:03 +08:00
zky.zhoukeyong
4b3a47c9db [CELEBORN-799] Limit total inflight push requests
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
In case where worker instances is very large, say 1000, then before this PR total memory consumed
by inflight requests is 64K * 1000 * ```celeborn.client.push.maxReqsInFlight(16)``` = 1G. This PR
limits total inflight push requests, as 0.2.1-incubating does.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1720 from waitinfuture/799.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 16:17:24 +08:00
zky.zhoukeyong
a7bbbd05c4 [CELEBORN-797] Decrease writeTime metric sampling frequency to improve perf
### What changes were proposed in this pull request?
1. Decrease writeTime metric sampling frequency to improve perf
2. Set default value of ```celeborn.<module>.push.timeoutCheck.threads``` and ```celeborn.<module>.fetch.timeoutCheck.threads``` to 4

### Why are the changes needed?
Following are test cases
case 1: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 15000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 1.1T data
case 2: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 30000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 2.2T data
Following are e2e time of shuffle write stage
||Sort pusher before|Sort pusher after|Hash pusher before|Hash pusher after|
|----|----|----|----|-----|
|case1|4.4min|4.1min|4.4min|3.9min|
|case2|9.1min|8.4min|9.7min|8.5min|

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1718 from waitinfuture/797.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 20:51:50 +08:00
caojiaqing
d64e0091f1 [CELEBORN-785] Add worker side partition hard split threshold
### What changes were proposed in this pull request?
Add a configuration `celeborn.worker.shuffle.partitionSplit.max`  to ensure that, in soft mode, individual partition files are limited to a size smaller than the configured value

### Why are the changes needed?

In soft mode, there may be situations where individual partition files are exceptionally large, which can result in excessively long sort times in skewed scenarios.

### Does this PR introduce _any_ user-facing change?
`celeborn.worker.shuffle.partitionSplit.max` defalut value 2g

### How was this patch tested?
none

Closes #1701 from JQ-Cao/785.

Authored-by: caojiaqing <caojiaqing@bilibili.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 14:14:41 +08:00
zky.zhoukeyong
7a47fae230 [CELEBORN-786] Change default flush threads
### What changes were proposed in this pull request?
This PR changes default values of the following configs:

|config|previous default value|new default value|
|----|----|----|
|celeborn.worker.flusher.threads|2|16|
|celeborn.worker.flusher.ssd.threads|8|16|

### Why are the changes needed?
If disk type is not specified, ```celeborn.worker.flusher.threads``` will be used. Recently many users
use SSD for Celeborn workers without specifying disk type, and 2 flush threads is far from leveraging the power of SSD.

### Does this PR introduce _any_ user-facing change?
Yes, default configs are changed.

### How was this patch tested?
Passes GA.

Closes #1703 from waitinfuture/786.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 13:09:29 +08:00
Angerszhuuuu
9f09ac6ce9 [CELEBORN-780] Change SPARK_SHUFFLE_FORCE_FALLBACK_PARTITION_THRESHOLD default to Int.MaxValue since slot's is not a bottleneck
### What changes were proposed in this pull request?
Now slots is not a bottleneck, change SPARK_SHUFFLE_FORCE_FALLBACK_PARTITION_THRESHOLD default value to Int.MaxValue.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1695 from AngersZhuuuu/CELEBORN-780.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-10 18:50:10 +08:00
Cheng Pan
4f8e72f217
[CELEBORN-774] Pullout celeborn.rpc.dispatcher.threads to CelebornConf
### What changes were proposed in this pull request?

Pullout hardcoded `celeborn.rpc.dispatcher.numThreads` to `CelebornConf` and rename it to `celeborn.rpc.dispatcher.threads` to align with existing configuration style

### Why are the changes needed?

Pullout inline configuration to `CelebornConf`, and expose it in configuration docs

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1684 from pan3793/CELEBORN-774.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-06 16:23:32 +08:00
zky.zhoukeyong
09881f5cff [CELEBORN-769] Change default value of celeborn.client.push.maxReqsInFlight to 16
…Flight to 16

### What changes were proposed in this pull request?
Change default value of celeborn.client.push.maxReqsInFlight to 16.

### Why are the changes needed?
Previous value 4 is too small, 16 is more reasonable.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #1683 from waitinfuture/769.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-06 10:22:06 +08:00
mingji
d0ecf83fec [CELEBORN-764] Fix celeborn on HDFS might clean using app directories
### What changes were proposed in this pull request?
Make Celeborn leader clean expired app dirs on HDFS when an application is Lost.

### Why are the changes needed?
If Celeborn is working on HDFS, the storage manager starts and cleans expired app directories, and the newly created worker will want to delete any unknown app directories.
This will cause using app directories to be deleted unexpectedly.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1678 from FMX/CELEBORN-764.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 23:11:50 +08:00
zky.zhoukeyong
4300835363 [CELEBORN-768] Change default config values for batch rpcs and netty …
…memory allocator

### What changes were proposed in this pull request?
Changes the following configs' default values
| config  | previous value | current value |
| ------------- | ------------- | ------------- |
| celeborn.network.memory.allocator.share  | false | true |
| celeborn.client.shuffle.batchHandleChangePartition.enabled  | false | true |
| celeborn.client.shuffle.batchHandleCommitPartition.enabled | false | true |

### Why are the changes needed?
In my test, when graceful shutdown is enabled but ```celeborn.client.shuffle.batchHandleChangePartition.enabled``` and ```celeborn.client.shuffle.batchHandleCommitPartition.enabled``` disabled, the worker takes much longer to stop than the two configs enabled.
In another test where worker size is quite small(2 cores 4 G) and replication is on, if shared allocator is disabled, the netty's onTrim fails to release memory, and further causes push data timeout.

### Does this PR introduce _any_ user-facing change?
No, these conifgs are introduces from 0.3.0.

### How was this patch tested?
Passes GA.

Closes #1682 from waitinfuture/768.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 18:16:41 +08:00
Fu Chen
3af5c231c7 [CELEBORN-767][DOC] Update the docs of celeborn.client.spark.push.sort.memory.threshold
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

To clarify the usage of conf `celeborn.client.spark.push.sort.memory.threshold`

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1680 from cfmcgrady/docs.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 18:07:09 +08:00
Angerszhuuuu
693172d0bd [CELEBORN-751] Rename remain rss related class name and filenames etc
### What changes were proposed in this pull request?
Rename remain rss related class name and filenames etc...

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1664 from AngersZhuuuu/CELEBORN-751.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-04 10:20:08 +08:00
Cheng Pan
26aaba14d4 [CELEBORN-637][FOLLOWUP] Mention configurations change in migration guide
### What changes were proposed in this pull request?

as title

### Why are the changes needed?

mention configuration behavior change in migration guide

### Does this PR introduce _any_ user-facing change?

Yes, the migration guide is updated

### How was this patch tested?

review

Closes #1673 from pan3793/CELEBORN-637-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-03 14:26:43 +08:00
xiyu.zk
381165d4e7
[CELEBORN-755] Support disable shuffle compression
### What changes were proposed in this pull request?
Support to decide whether to compress shuffle data through configuration.

### Why are the changes needed?
Currently, Celeborn compresses all shuffle data, but for example, the shuffle data of Gluten has already been compressed. In this case, no additional compression is required. Therefore, configuration needs to be provided for users to decide whether to use Celeborn’s compression according to the actual situation.

### Does this PR introduce _any_ user-facing change?
no.

Closes #1669 from kerwin-zk/celeborn-755.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-01 00:03:50 +08:00
Angerszhuuuu
5c7ecb8302
[CELEBORN-754][IMPORTANT] Provide a new SparkShuffleManager to replace RssShuffleManager in the future
### What changes were proposed in this pull request?
Provide a new SparkShuffleManager to replace RssShuffleManager in the future

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1667 from AngersZhuuuu/CELEBORN-754.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-30 17:27:33 +08:00
Fu Chen
adbd38a926
[CELEBORN-726][FOLLOWUP] Update data replication terminology from master/slave to primary/replica in the codebase
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #1639 from cfmcgrady/primary-replica.

Lead-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 17:07:26 +08:00
Fu Chen
17c1e01874
[CELEBORN-726] Update data replication terminology from master/slave to primary/replica for configurations and metrics
### What changes were proposed in this pull request?

This pull PR is an integral component of #1639 . It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC.

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests.

Closes #1650 from cfmcgrady/primary-replica-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 09:47:02 +08:00
onebox-li
1b74d85fb1 [CELEBORN-725][MINOR] Refine congestion code
### What changes were proposed in this pull request?
Refine the congestion relevant code/log/comments

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manually test

Closes #1637 from onebox-li/improve-congestion.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 18:31:40 +08:00
Angerszhuuuu
3985a5cbd7 [CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment
### What changes were proposed in this pull request?
Unify all blacklist related code and comment

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 16:28:03 +08:00
zhongqiang.czq
374d735ae5
[CELEBORN-724] Fix the compatibility of HeartbeatFromApplicationRespo…
…nse with lower versions

### What changes were proposed in this pull request?
The master side will check HeartbeatFromApplication's reply field. if reply is true then it replies HeartbeatFromApplicationResponse otherwise OneWayMessageResponse.

The reply field is default false before the version 0.2.1, so master can be compatible with older client version

### Why are the changes needed?
Before the version `0.2.1`, the response of HeartbeatFromApplication is` OneWayMessageResponse`, but from `0.3.0`, the response of HeartbeatFromApplication is modified to `HeartbeatFromApplicationResponse`.
if the version of `client side `is `0.2.1` and the version of `server side is 0.3.0`, the `compatiblity issue `will occur.
The following compatiblity error will be printted.

``` java
java.io.InvalidObjectException: enum constant HEARTBEAT_FROM_APPLICATION_RESPONSE does not exist in class org.apache.celeborn.common.protocol.MessageType
	at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:2157) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1662) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2430) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2354) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2212) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1668) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460) ~[?:1.8.0_362]
	at org.apache.celeborn.common.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?]
```
``` java
Caused by: java.lang.ClassCastException: Cannot cast org.apache.celeborn.common.protocol.message.ControlMessages$HeartbeatFromApplicationResponse to org.apache.celeborn.common.protocol.message.ControlMessages$OneWayMessageResponse$
	at java.lang.Class.cast(Class.java:3369) ~[?:1.8.0_362]
	at scala.concurrent.Future.$anonfun$mapTo$1(Future.scala:500) ~[scala-library-2.12.15.jar:?]
	at scala.util.Success.$anonfun$map$1(Try.scala:255) ~[scala-library-2.12.15.jar:?]
	at scala.util.Success.map(Try.scala:213) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:67) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:82) ~[scala-library-2.12.15.jar:?]
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:59) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:875) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:110) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Promise.trySuccess(Promise.scala:94) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Promise.trySuccess$(Promise.scala:94) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187) ~[scala-library-2.12.15.jar:?]
	at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:218) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?]
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
The pr is tested manually and the testing process is as follows:
1. server side is deploy using the code of latest branch-0.3.
2. spark client is deploy the version of 0.2.1, then run spark-sql to execute  3 tpcds queries( query1.sql/querey2/quere3.sql whose datasize is 1T), finnally verify that the queries are executed successfully and no above compatiblity error printted
3. spark client is deploy the version of 0.3.0,  then run spark-sql to execute 3 tpcds queries( query1.sql/querey2/quere3.sql whose datasize is 1T), finnally verify that the queries are executed successfully and no above compatiblity error printted

This patch had conflicts when merged, resolved by
Committer: Cheng Pan <chengpan@apache.org>

Closes #1635 from zhongqiangczq/heartbeat2.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 16:04:18 +08:00
Angerszhuuuu
33cf343d20 [CELEBORN-666][REFACTOR] Unify exclude and blacklist related configuration
### What changes were proposed in this pull request?
Unify exclude and blacklist related configuration

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1633 from AngersZhuuuu/CELEBORN-666-NEW.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-28 10:59:58 +08:00
zky.zhoukeyong
57b0e815cf [CELEBORN-656] Batch revive RPCs in client to avoid too many requests
### What changes were proposed in this pull request?
This PR batches revive requests and periodically send to LifecycleManager to reduce number or RPC requests.

To be more detailed. This PR changes Revive message to support multiple unique partitions, and also passes a set unique mapIds for checking MapEnd. Each time ShuffleClientImpl wants to revive, it adds a ReviveRquest to ReviveManager and wait for result. ReviveManager batches revive requests and periodically send to LifecycleManager (deduplicated by partitionId). LifecycleManager constructs ChangeLocationsCallContext and after all locations are notified, it replies to ShuffleClientImpl.

### Why are the changes needed?
In my test 3T TPCDS q23a with 3 Celeborn workers, when kill a worker, the LifecycleManger will receive 4.8w Revive requests:
```
[emr-usermaster-1-1 logs]$ cat spark-emr-user-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master-1-1.c-fa08904e94c028d1.out.1 |grep -i revive |wc -l
64364
```
After this PR, number of ReviveBatch requests reduces to 708:
```
[emr-usermaster-1-1 logs]$ cat spark-emr-user-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master-1-1.c-fa08904e94c028d1.out |grep -i revive |wc -l
2573
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test. I have tested:

1. Disable graceful shutdown, kill one worker, job succeeds
2. Disable graceful shutdown, kill two workers successively, job fails as expected
3. Enable graceful shutdown, restart two workers successively, job succeeds
4. Enable graceful shutdown, restart two workers successively, then kill the third one, job succeeds

Closes #1588 from waitinfuture/656-2.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-06-27 22:11:04 +08:00
mingji
40760ede3a [CELEBORN-568] Support storage type selection
### What changes were proposed in this pull request?
1. Celeborn supports storage type selection. HDD, SSD, and HDFS are available for now.
2. Add new buffer size for HDFS file writers.
3. Worker support empty working dirs.

### Why are the changes needed?
Support HDFS only scenario.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1619 from FMX/CELEBORN-568.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-27 18:07:08 +08:00
Angerszhuuuu
a2b215bd47 [CELEBORN-718] Support override Hadoop Conf by Celeborn Conf with celeborn.hadoop. prefix
### What changes were proposed in this pull request?
 Celeborn generate hadoop configuration should respect Celeborn conf

### Why are the changes needed?

In spark client side we should write like `spark.celeborn.hadoop.xxx.xx`
In server side we should write like `celeborn.hadoop.xxx.xxx`

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1629 from AngersZhuuuu/CELEBORN-719.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-27 17:00:47 +08:00
Cheng Pan
1753556565
[CELEBORN-713] Local network binding support IP or FQDN
### What changes were proposed in this pull request?

This PR aims to make network local address binding support both IP and FQDN strategy.

Additional, it refactors the `ShuffleClientImpl#genAddressPair`, from `${hostAndPort}-${hostAndPort}` to `Pair<String, String>`, which works properly when using IP but may not on FQDN because FQDN may contain `-`

### Why are the changes needed?

Currently, when the bind hostname is not set explicitly, Celeborn will find the first non-loopback address and always uses the IP to bind, this is not suitable for K8s cases, as the STS has a stable FQDN but Pod IP will be changed once Pod restarting.

For `ShuffleClientImpl#genAddressPair`, it must be changed otherwise may cause

```
java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11657 in stage 0.0 failed 4 times, most recent failure: Lost task 11657.3 in stage 0.0 (TID 12747) (10.153.253.198 executor 157): java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.celeborn.client.ShuffleClientImpl.doPushMergedData(ShuffleClientImpl.java:874)
	at org.apache.celeborn.client.ShuffleClientImpl.pushOrMergeData(ShuffleClientImpl.java:735)
	at org.apache.celeborn.client.ShuffleClientImpl.mergeData(ShuffleClientImpl.java:827)
	at org.apache.spark.shuffle.celeborn.SortBasedPusher.pushData(SortBasedPusher.java:140)
	at org.apache.spark.shuffle.celeborn.SortBasedPusher.insertRecord(SortBasedPusher.java:192)
	at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.fastWrite0(SortBasedShuffleWriter.java:192)
	at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.write(SortBasedShuffleWriter.java:145)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1508)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```

### Does this PR introduce _any_ user-facing change?

Yes, a new configuration `celeborn.network.bind.preferIpAddress` is introduced, and the default value is `true` to preserve the existing behavior.

### How was this patch tested?

Manually testing with `celeborn.network.bind.preferIpAddress=false`
```
Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-0.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.143.252

Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-1.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.173.94

Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-2.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.149.42

starting org.apache.celeborn.service.deploy.worker.Worker, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.worker.Worker-1-celeborn-worker-4.out
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.Dispatcher#51 - Dispatcher numThreads: 4
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.network.client.TransportClientFactory#91 - mode NIO threads 64
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.NettyRpcEnvFactory#51 - Starting RPC Server [WorkerSys] on celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 with advisor endpoint celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.util.Utils#51 - Successfully started service 'WorkerSys' on port 38303.
```

Closes #1622 from pan3793/CELEBORN-713.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-27 09:42:11 +08:00
Cheng Pan
2b82194ce0 [CELEBORN-715] Change master URL schema from rss to celeborn
### What changes were proposed in this pull request?

Change Celeborn Master URL from `rss://<host>:<port>` to `celeborn://<host>:<port>`

### Why are the changes needed?

Respect the project name.

### Does this PR introduce _any_ user-facing change?

Yes, migration guide is updated accordingly.

### How was this patch tested?

Pass GA.

Closes #1624 from pan3793/CELEBORN-715.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-26 22:30:20 +08:00
Cheng Pan
ac84d64d51 [CELEBORN-707][MASTER] Remove env CELEBORN_MASTER_HOST and CELEBORN_MASTER_PORT
### What changes were proposed in this pull request?

Remove environment variables `CELEBORN_MASTER_HOST` and `CELEBORN_MASTER_PORT`, and makes `CELEBORN_LOCAL_HOSTNAME` takes effect on both master and worker.

### Why are the changes needed?

There are many different ways to configure the master/worker host and port, which makes the thing complex and inconsistent.

After this change,

#### master

1. cli args `--host` `--port` takes the highest priority
2. then lookup env `CELEBORN_LOCAL_HOSTNAME`
3. things are different when HA enabled and disabled
  3.1. when HA is disabled, lookup configurations `celeborn.master.host` and `celeborn.master.port`
  3.2. when HA is enabled, each node needs to know the whole cluster info,
     ```
     celeborn.master.ha.node.1.host clb-1
     celeborn.master.ha.node.1.port 9097
     celeborn.master.ha.node.2.host clb-2
     celeborn.master.ha.node.2.port 9097
     celeborn.master.ha.node.3.host clb-3
     celeborn.master.ha.node.3.port 9097
     ```
     in addition, `celeborn.master.ha.node.id=1` can be used to indicate the node id, otherwise, the master will try to bind each host to match the node id.

#### worker

1. cli args `--host` `--port` takes the highest priority
2. then lookup env `CELEBORN_LOCAL_HOSTNAME`

things are simple than the master case because each worker is not required to know others.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

UT.

Closes #1616 from pan3793/CELEBORN-707.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-25 16:00:59 +08:00
zky.zhoukeyong
e2eeafd4bf [CELEBORN-709] Increase default fetch timeout
### What changes were proposed in this pull request?
30s for fetch timeout is too short and easy to exceed. This PR increases the default value to 600s.

### Why are the changes needed?
When I was testing 3T TPCDS with three workers, I encountered fetch timeout:
```
23/06/21 16:46:41,771 INFO [fetch-server-11-7] FetchHandler: Sending chunk 28856864163, 1, 0, 2147483647
...
23/06/21 16:47:16,870 INFO [fetch-server-11-7] FetchHandler: Sent chunk 28856864163, 1, 0, 2147483647
```
And I remember from some users' monitoring, the max fetch time can reach several minutes on heavy load without error.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1618 from waitinfuture/709.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-23 21:06:43 +08:00
zky.zhoukeyong
5f4f6d953f [CELEBORN-702][DOC] Extend doc about migration from 0.2.1 to 0.3.0
### What changes were proposed in this pull request?
Extend doc about migration from 0.2.1 to 0.3.0. Added the following contents:

<img width="1084" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/7a9d172c-09ba-48b6-9f5c-73a8b13d035f">

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.

Closes #1612 from waitinfuture/702.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-20 20:45:58 +08:00
Cheng Pan
e22379c3ab [CELEBORN-638] Migrate configurations celeborn.ha.master.* to celeborn.master.ha.*
### What changes were proposed in this pull request?

It was discussed during the last meeting, but abandoned due to the complication.

### Why are the changes needed?

Make the configuration unified.

### Does this PR introduce _any_ user-facing change?

Yes, but the legacy configurations still take effect.

### How was this patch tested?

New UTs.

Closes #1549 from pan3793/CELEBORN-638.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-16 18:18:26 +08:00
Angerszhuuuu
1ba6dee324 [CELEBORN-680][DOC] Refresh celeborn configurations in doc
### What changes were proposed in this pull request?
Refresh celeborn configurations in doc

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1592 from AngersZhuuuu/CELEBORN-680.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-15 13:59:38 +08:00
Angerszhuuuu
0aa13832b5 [CELEBORN-676] Celeborn fetch chunk also should support check timeout
### What changes were proposed in this pull request?
Celeborn fetch chunk also should support check timeout

#### Test case
```
executor instance 20

SQL:
SELECT count(1) from (select /*+ REPARTITION(100) */ * from spark_auxiliary.t50g) tmp;

--conf spark.celeborn.client.spark.shuffle.writer=sort \
--conf spark.celeborn.client.fetch.excludeWorkerOnFailure.enabled=true \
--conf spark.celeborn.client.push.timeout=10s \
--conf spark.celeborn.client.push.replicate.enabled=true \
--conf spark.celeborn.client.push.revive.maxRetries=10 \
--conf spark.celeborn.client.reserveSlots.maxRetries=10 \
--conf spark.celeborn.client.registerShuffle.maxRetries=3 \
--conf spark.celeborn.client.push.blacklist.enabled=true \
--conf spark.celeborn.client.blacklistSlave.enabled=true \
--conf spark.celeborn.client.fetch.timeout=30s \
--conf spark.celeborn.client.push.data.timeout=30s \
--conf spark.celeborn.client.push.limit.inFlight.timeout=600s \
--conf spark.celeborn.client.push.maxReqsInFlight=32 \
--conf spark.celeborn.client.shuffle.compression.codec=ZSTD \
--conf spark.celeborn.rpc.askTimeout=30s \
--conf spark.celeborn.client.rpc.reserveSlots.askTimeout=30s \
--conf spark.celeborn.client.shuffle.batchHandleChangePartition.enabled=true \
--conf spark.celeborn.client.shuffle.batchHandleCommitPartition.enabled=true \
--conf spark.celeborn.client.shuffle.batchHandleReleasePartition.enabled=true
```

Test with 3 worker and add a `Thread.sleep(100s)` before worker handle `ChunkFetchRequest`

Before patch
<img width="1783" alt="截屏2023-06-14 上午11 20 55" src="https://github.com/apache/incubator-celeborn/assets/46485123/182dff7d-a057-4077-8368-d1552104d206">

After patch
<img width="1792" alt="image" src="https://github.com/apache/incubator-celeborn/assets/46485123/3c8b7933-8ace-426d-8e9f-04e0aabfac8e">

The log shows the fetch timeout checker workers
```
23/06/14 11:14:54 ERROR WorkerPartitionReader: Fetch chunk 0 failed.
org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT
	at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147)
	at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
23/06/14 11:14:54 WARN RssInputStream: Fetch chunk failed 1/6 times for location PartitionLocation[
  id-epoch:35-0
  host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.203-9092-9094-9093-9095
  mode:MASTER
  peer:(host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.202-9092-9094-9093-9095)
  storage hint:StorageInfo{type=HDD, mountPoint='/mnt/ssd/0', finalResult=true, filePath=}
  mapIdBitMap:null], change to peer
org.apache.celeborn.common.exception.CelebornIOException: Fetch chunk 0 failed.
	at org.apache.celeborn.client.read.WorkerPartitionReader$1.onFailure(WorkerPartitionReader.java:98)
	at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:146)
	at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT
	at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147)
	... 8 more
23/06/14 11:14:54 INFO SortBasedShuffleWriter: Memory used 72.0 MB
```

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1587 from AngersZhuuuu/CELEBORN-676.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-15 13:54:09 +08:00
Angerszhuuuu
8a0b7d80d6 [CELEBORN-681][DOC] Add celeborn.metrics.conf to conf entity
### What changes were proposed in this pull request?
Add celeborn.metrics.conf to conf entity

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1593 from AngersZhuuuu/CELEBORN-681.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-14 18:06:03 +08:00
Fu Chen
aa3bb0ac3b
[CELEBORN-679] Optimize Utils#bytesToString
### What changes were proposed in this pull request?

refer to https://github.com/apache/spark/pull/40301

1. Optimize `Utils.bytesToString`. Arithmetic ops on BigInt and BigDecimal are order(s) of magnitude slower than the ops on primitive types. Division is an especially slow operation and it is used en masse here.

2. According to the information sourced from [Wikipedia](https://en.wikipedia.org/wiki/Kilobyte), it is established that 1000 is the appropriate factor for representing kilobytes (KB), while 1024 is the correct factor for kibibytes (KiB). In alignment with this understanding, changing the size unit from "KB" to "KiB".

### Why are the changes needed?

the Utils#bytesToString method is frequently employed in memory-related log messages.

### Does this PR introduce _any_ user-facing change?

No, only perf improvement.

### How was this patch tested?

existing UT and manually tested.

Closes #1590 from cfmcgrady/bytesToString.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-14 17:42:16 +08:00
Shuang
da85347330 [CELEBORN-675] Fix decode heartbeat message
### What changes were proposed in this pull request?
Give Heartbeat one byte message and skip this byte when decode.

### Why are the changes needed?
Heartbeat message may split in to two netty buffer,  then the `empty buffer` (which don't need actually, but need keep) be wrong removed, then decodeNext would throw NPE. see
```  java
while (headerBuf.readableBytes() < HEADER_SIZE) {
      ByteBuf next = buffers.getFirst();
      int toRead = Math.min(next.readableBytes(), HEADER_SIZE - headerBuf.readableBytes());
      headerBuf.writeBytes(next, toRead);
      if (!next.isReadable()) {
        buffers.removeFirst().release();
      }
    }
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT & MANUAL

Closes #1589 from RexXiong/CELEBORN-675.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-06-14 14:37:13 +08:00
zky.zhoukeyong
47cded835f [CELEBORN-669] Avoid commit files on excluded worker list
### What changes were proposed in this pull request?
CommitHandler will check whether the target worker is in WorkerStatusTracker's excluded list. If so, skip calling commit files on it.

### Why are the changes needed?
Avoid unnecessary commit files to excluded worker.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1581 from waitinfuture/669.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-06-13 22:31:02 +08:00
Angerszhuuuu
357add5b00 [CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch
### What changes were proposed in this pull request?
####Test case
```
executor instance 20

SQL:
SELECT count(1) from (select /*+ REPARTITION(100) */ * from spark_auxiliary.t50g) tmp;

create connection timeout 10s

Fetch chunk timeout 30s
```
In the graph, the shuffle read time of `before` and `after` is always the same delay time.

##### Worker can't connect
Before
![image](https://user-images.githubusercontent.com/46485123/229465520-9d751b40-2b8f-49d2-b350-a2278e3dd89e.png)

After
![image](https://user-images.githubusercontent.com/46485123/229465552-88ac1ca4-24ad-4c30-9a46-0cdcae6bbfd5.png)

##### OpenStream stuck
Before
![image](https://user-images.githubusercontent.com/46485123/229465629-68765a6a-2503-4018-8917-d49e47d5dccc.png)

After
![image](https://user-images.githubusercontent.com/46485123/229465683-2f57b374-1c66-4819-93dd-cabee7ccb788.png)

##### Fetch chunk stuck
Before
![image](https://user-images.githubusercontent.com/46485123/229465735-8d2f694b-1b4a-4984-b069-c4a308f41008.png)

After
![image](https://user-images.githubusercontent.com/46485123/229465754-c2237d5a-6fb6-4d5b-819e-b7d86a1e88d7.png)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1406 from AngersZhuuuu/CELEBORN-494.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-06-13 20:06:31 +08:00
Angerszhuuuu
6b725202a2 [CELEBORN-640][WORKER] DataPushQueue should not keep waiting take tasks
### What changes were proposed in this pull request?
In our prod meet many times of push queue stuck caused by PushState's status was not being removed.
Caused DataPushQueue to keep waiting for taking task.

Although have resolved some bugs, here we'd better add a max wait time for taking tasks since we already have the `PUSH_DATA_TIMEOUT` check method. If the target worker is really stuck, we can retry our task finally.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1552 from AngersZhuuuu/CELEBORN-640.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-09 14:06:47 +08:00
onebox-li
0c869ac9a0
[CELEBORN-642] Improve metrics and update grafana
### What changes were proposed in this pull request?
Change in grafana

(ALL)
add:
JVMCPUTime
LastMinuteSystemLoad
AvailableProcessors
(For Master)
add:
LostWorkers
IsActiveMaster
PartitionSize
(For Worker)
add:
PushDataFailCount -> WriteDataFailCount
ReplicateDataFailCount
ReplicateDataWriteFailCount
ReplicateDataCreateConnectionFailCount
ReplicateDataConnectionExceptionCount
ReplicateDataTimeoutCount
SortedFileSize
PushDataHandshakeFailCount
RegionStartFailCount
RegionFinishFailCount
MasterPushDataHandshakeTime
SlavePushDataHandshakeTime
MasterRegionStartTime
SlaveRegionStartTime
MasterRegionFinishTime
SlaveRegionFinishTime
PotentialConsumeSpeed
UserProduceSpeed
WorkerConsumeSpeed
DeviceOSFreeBytes
DeviceCelebornFreeBytes
push usedHeapMemory/usedDirectMemory
fetch usedHeapMemory/usedDirectMemory
replicate usedHeapMemory/usedDirectMemory
remove:
dup ReserveSlotsTime

Change dashboard layout.

Fix support for multiple labels.

Modify some metrics docs.

### Why are the changes needed?
For better use of metrics.

### Does this PR introduce _any_ user-facing change?
Below metrics change name, extract some value to the label.
DeviceOSFreeCapacity(B) -> DeviceOSFreeBytes
DeviceOSTotalCapacity(B) -> DeviceOSTotalBytes
DeviceCelebornFreeCapacity(B) -> DeviceCelebornFreeBytes
DeviceCelebornTotalCapacity(B) -> DeviceCelebornTotalBytes
push usedHeapMemory/usedDirectMemory
fetch usedHeapMemory/usedDirectMemory
replicate usedHeapMemory/usedDirectMemory

### How was this patch tested?
Cluster test.

Closes #1557 from onebox-li/improve-metrics.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-08 18:10:06 +08:00
Ethan Feng
76a42beab0
[CELEBORN-610][FLINK] Eliminate pluginconf and merge its content to CelebornConf
### What changes were proposed in this pull request?
Pluginconf might be hard to understand why Celeborn needs to config class.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT.

Closes #1524 from FMX/CELEBORN-610.

Authored-by: Ethan Feng <ethanfeng@apache.org>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
2023-06-05 14:08:53 +08:00
Angerszhuuuu
218bfc78a5
[CELEBORN-629][DOC] Add doc about enable rac-awareness
### What changes were proposed in this pull request?

Add doc about enabling rac-awareness

### Why are the changes needed?

Document new features.

### Does this PR introduce _any_ user-facing change?

Yes, the docs are updated.

### How was this patch tested?

<img width="1085" alt="截屏2023-06-02 下午3 19 10" src="https://github.com/apache/incubator-celeborn/assets/46485123/c8c51a4c-40be-40ea-befd-3c369b9f7600">

Closes #1536 from AngersZhuuuu/CELEBORN-629.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-05 10:28:26 +08:00
Angerszhuuuu
3883fe2c80
[CELEBORN-623][FOLLUPUP] Refine doc about use ratis shell with RSS cluster
### What changes were proposed in this pull request?
Refine this doc since:

1. It didn't mention our cluster default RPC type is  `NETTY`
2. If the user use the ratis shell with `GRPC` but didn't know the ratis cluster is `NETTY`, the error is not clear and hard to debug.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1542 from AngersZhuuuu/CELEBORN-623-FOLLOWUP.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-02 22:09:05 +08:00
Angerszhuuuu
4df4775524
[CELEBORN-632][DOC] Add spark name space to spark specify properties (#1538) 2023-06-02 21:48:56 +08:00
liyihe
188b069710
[CELEBORN-623][DOCS] Document how to change RPC type in celeborn-ratis
### What changes were proposed in this pull request?
Ratis-shell use GRPC by default. Celeborn support Netty for ratis, if `raft.rpc.type` is not specified, commands may fail.
e.g.
```
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 14.947369960s. [closed=[], open=[[buffered_nanos=14962358255, waiting_for_connection]]]
```
So I think we should update the document to mention how to change the RPC type to in `celeborn-ratis`.

### Why are the changes needed?

Improve user experience

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually test

Closes #1530 from onebox-li/ratis-shell-default-rpc.

Lead-authored-by: liyihe <liyihe@bigo.sg>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-02 20:23:09 +08:00
Angerszhuuuu
e18a5ea769
[CELEBORN-624] StorageManager should only remove expired app dirs (#1531) 2023-06-02 11:33:33 +08:00
Ethan Feng
d33916e571
[CELEBORN-625] Add a config to enable/disable UnsafeRow fast write. (#1532) 2023-06-01 20:55:45 +08:00
Angerszhuuuu
cf308aa057
[CLEBORN-595] Refine code frame of CelebornConf (#1525) 2023-06-01 10:37:58 +08:00
Angerszhuuuu
6d5dd50915
[CELEBORN-595][FOLLOWUP] Fix change version to 0.3.0. (#1522) 2023-05-30 20:12:56 +08:00
Angerszhuuuu
62681ba85d
[CELEBORN-595] Rename and refactor the configuration doc. (#1501) 2023-05-30 15:14:12 +08:00
zhongqiangchen
f117cff776
[CELEBORN-618] [FLINK] worker side adds partition split configuration options (#1520) 2023-05-30 14:13:31 +08:00
Binjie Yang
d30f45ad63
[CELEBORN-450][HELM] Configurable volumes in the values.yaml (#1508)
* [CELEBORN-450] Configure the mount & volume in the Values.yaml

* fix comments

* fix wrong name

* fix comments

* fix typo

* fix into array

* Wiht User Note Comments

* fix comments

* Update charts/celeborn/templates/worker-statefulset.yaml

---------

Co-authored-by: Cheng Pan <pan3793@gmail.com>
2023-05-29 13:48:23 +08:00
Angerszhuuuu
d244f44518
[CELEBORN-593] Refine some RPC related default configurations (#1498) 2023-05-19 18:23:12 +08:00
Angerszhuuuu
615d9a111f
[CELEBORN-487] Remove wrong space of config SHUFFLE_CLIENT_PUSH_BLACK (#1500) 2023-05-19 14:27:57 +08:00
Angerszhuuuu
811e192bbd
[CELEBORN-446] Support rack aware during assign slots for ROUNDROBIN (#1370) 2023-05-18 13:58:51 +08:00
Ethan Feng
7015d2463a
[CELEBORN-583] Merge pooled memory allocators. (#1490) 2023-05-18 10:37:30 +08:00
Angerszhuuuu
791d72d45f
[CELEBORN-590] Remove hadoop prefix of WORKER_WORKING_DIR (#1494) 2023-05-17 17:57:27 +08:00
Angerszhuuuu
7c6cb2f3bb
[CELEBORN-588] Remove test conf's category (#1491) 2023-05-17 17:37:28 +08:00
Angerszhuuuu
64a3534f71
[CELEBORN-584] Worker side should expose push/replicate/fetch Netty allocator metrics (#1489) 2023-05-16 17:51:33 +08:00
Angerszhuuuu
d657f8268a
[CELEBORN-586] Add SystemMiscSource to indicate system running status (#1488) 2023-05-16 14:03:07 +08:00
zhongqiangchen
5769c3fdc7
[CELEBORN-552] Add HeartBeat between the client and worker to keep alive (#1457) 2023-05-10 19:35:51 +08:00
Angerszhuuuu
778b5440bc
[CELEBORN-556][BUG] ReserveSlot should not use default RPC time out since register shuffle max timeout is network timeout (#1461) 2023-05-10 12:29:06 +08:00
Ethan Feng
3e0d779962
[CELEBORN-576] Add static identity provider and manually settable identity provider for non-hadoop environment. (#1480) 2023-05-08 17:29:01 +08:00
Ethan Feng
91b757555e
[CELEBORN-570] Update docs about monitor and deployment. (#1478) 2023-05-08 17:07:42 +08:00
Angerszhuuuu
ef4c12e0fe
[CELEBORN-565] FETCH_MAX_RETRIES should double when enable replicates (#1471) 2023-04-28 14:27:35 +08:00
Angerszhuuuu
13ce04f8a1
[CELEBORN-557] HA_CLIENT_RPC_ASK_TIMEOUT should fallback to RPC_ASK_TIMEOUT (#1462)
* [CELEBORN-557] HA_CLIENT_RPC_ASK_TIMEOUT should fallback to RPC_ASK_TIMEOUT
2023-04-26 15:19:34 +08:00
Shuang
0b2e4877bd
[CELEBORN-553] Improve IO (#1458) 2023-04-25 21:14:06 +08:00
Angerszhuuuu
0c2d3e647d
[CELEBORN-532][METRICS] Refine push-related failure metrics (#1442)
* [CELEBORN-532][METRICS] Refine push-related failure metrics
2023-04-21 17:05:43 +08:00
Angerszhuuuu
181c1bfcd6
[CELEBORN-524][PERF] CongestionControl call too much ChannelsLimiter onTrim cause CPU stuck or occupy too much CPU cause no cpu for handlePushData (#1428) 2023-04-21 15:44:56 +08:00
Angerszhuuuu
6830cb61ef
[CELEBORN-540][Refactor] Add config entity of celeborn.rpc.io.threads (#1443)
* [CELEBORN-540][CONF] Add config entity of celeborn.rpc.io.threads
2023-04-21 11:21:41 +08:00
Angerszhuuuu
e319b99a1c
[CELEBORN-527][DOC] Fix incorrect monitor the arrangement of documents (#1432) 2023-04-17 11:12:19 +08:00
Angerszhuuuu
ecafbf41fc
[CELEBORN-516][FOLLOWUP] Remove removed RPC metrics in metric doc (#1431) 2023-04-17 10:46:04 +08:00
cxzl25
13f772e0c0
[CELEBORN-525] Fix wrong parameter celeborn.push.buffer.size 2023-04-14 20:45:25 +08:00
Cheng Pan
fb7b311c89
[CELEBORN-499] Move version specific resource to main repo (#1429)
* [CELEBORN-499] Move version specific resource to main repo

* license
2023-04-14 16:20:51 +08:00
Ethan Feng
9cccfc9872
[CELEBORN-431][FLINK] Support dynamic buffer allocation in reading map partition. (#1407) 2023-04-13 10:37:47 +08:00
Angerszhuuuu
e5722126e9
[CELEBORN-502][REFACTOR] Merge GetBlacklistResponse to HeartbeatFromApplication (#1408)
* [CELEBORN-502][REFACTOR] Merge GetBlacklistResponse to HeartbeatFromApplication
2023-04-12 14:59:32 +08:00
Angerszhuuuu
cad2836e85
[CELEBORN-505] Fix typo of SHUFFLE_CHUCK_SIZE (#1411) 2023-04-04 19:15:30 +08:00
Keyong Zhou
2e1598c011
[CELEBORN-485] Make celeborn.push.replicate.enabled default to false (#1394) 2023-04-03 16:36:29 +08:00
Angerszhuuuu
bf46336d54
[CELEBORN-487][PERF] ShuffleClientSide support blacklist to avoid client side timeout in same worker multiple times (#1399) 2023-04-03 11:50:04 +08:00
zhongqiangchen
cd92c423cd
[CELEBORN-475] Support extra tags for prometheus metrics (#1385)
[CELEBORN-475] Support extra tags for prometheus metrics
2023-03-28 21:22:28 +08:00
Keyong Zhou
cb19ed1c66
[CELEBORN-479][PERF] Refactor DataPushQueue.takePushTask to avoid busy wait (#1386) 2023-03-27 16:18:55 +08:00
Shuang
89b3f3887d
[CELEBORN-356] [FLINK] Support release single partition resource (#1314) 2023-03-24 17:15:28 +08:00
Ethan Feng
0ebad677d7
[CELEBORN-434] Add constrain about memory manager's parameters. (#1356) 2023-03-17 15:14:03 +08:00
Angerszhuuuu
4b334df7a6
[CELEBORN-399] Make fileSorterExecutors thread num can be customized (#1325) 2023-03-10 21:10:43 +08:00
Keyong Zhou
dcedf7b0a9
[CELEBORN-348] Support fetchTime in load-aware slots assignment strategy (#1287) 2023-03-02 18:31:50 +08:00
zhongqiangchen
cb76c4de4c
[CELEBORN-350][FLINK] Add PluginConf to be compatible with old configurations 2023-02-28 20:36:11 +08:00
Keyong Zhou
7adf1fca41
[CELEBORN-295] Optimize data push (#1232)
* [CELEBORN-295] Add double buffer for sort pusher
2023-02-28 10:35:55 +08:00
Ethan Feng
0c8bb83114
[CELEBORN-234] Implement buffer stream. (#1221) 2023-02-17 17:38:36 +08:00
Ethan Feng
3aacede5f8
[CELEBORN-283] Derive network layer for flink plugin. (#1222) 2023-02-17 14:12:54 +08:00
jiaoqingbo
3a92b0d911
[CELEBORN-284] fix typo in CelebornConf (#1218)
Co-authored-by: jiaoqb <jiaoqb@asiainfo.com>
2023-02-10 14:59:36 +08:00
Rex(Hui) An
bff6e91e0b
[CELEBORN-227] Support different push strategies to control the push speed (#1167) 2023-02-07 14:24:30 +08:00
Rex(Hui) An
bb113ec9be
[CELEBORN-207] Support network congestion control (#1066) 2023-02-07 12:06:18 +08:00
Angerszhuuuu
4b6f7e4593
[CELEBORN-239][IMPROVEMENT] Worker replicate should enable push data timeout too (#1185) 2023-02-03 11:53:15 +08:00
Angerszhuuuu
04427f2b16
[CELEBORN-247] Add metrics for each user's quota usage (#1182) 2023-02-02 18:31:08 +08:00
Angerszhuuuu
122da47815
[CELEBORN-241][IMPROVEMENT] limit inflight push timeout should > push data timeout (#1179) 2023-01-30 11:57:07 +08:00
zy.jordan
c5be79ee3d
[CELEBORN-55][FEATURE] Split maxReqsInFlight limitation into level of target worker (#1102) 2023-01-20 10:18:45 +08:00
Ethan Feng
a239f9f284
[CELEBORN-228]Refactor PartitionFileSorter to avoid specific JDK dependency. (#1168) 2023-01-16 20:06:47 +08:00
zy.jordan
bb96700415
[CELEBORN-223] The default rpc thread num of pushServer/replicateServer/fetchServer should be the number of total of Flusher's thread (#1163) 2023-01-16 12:03:46 +08:00
Keyong Zhou
fa7ba43136
[CELEBORN-225] Add global default configuration for number of flusher… (#1165) 2023-01-14 13:20:44 +08:00
zhongqiangczq
411ab09ffb
[CELEBORN-158][Flink] Add ShuffleServiceFactory to Support MapPartition in … (#1105) 2023-01-13 16:38:46 +08:00
Shuang
1332362bff
[CELEBORN-213] Add configuration for whether to close idle connections in client side (#1157) 2023-01-10 19:13:33 +08:00
zy.jordan
19197b9190
[CELEBORN-214] Push/Replicate/Fetch io threads default value is 16 (#1158) 2023-01-10 17:46:56 +08:00
Angerszhuuuu
e155ec122a
[CELEBORN-190] doPushMergedData should also support revive multiple times, not only twice (#1136) 2023-01-10 11:39:40 +08:00
Angerszhuuuu
415452d9c4
[CELEBORN-189][IMPROVEMENT] PushDataFailedSlave should add slave worker to blacklist (#1135) 2023-01-05 20:12:07 +08:00
RexAn
6432a129be
[CELEBORN-61][CELEBORN-62][FOLLOW_UP] Fix some issues for slow start (#1119) 2022-12-29 12:07:20 +08:00
Ethan Feng
5aa959a335
[CELEBORN-157] Change prefix of configurations to celeborn. (#1104) 2022-12-21 15:17:28 +08:00
Keyong Zhou
2f0682265e
[CELEBORN-119] Add timeout for pushdata (#1097) 2022-12-20 20:40:42 +08:00
nafiy
c931663e5f
[CELEBORN-110][REFACTOR] Notify critical error after collecting a certain number of non-critical error (#1055) 2022-12-16 15:47:36 +08:00
nafiy
2e37830a0f
[CELEBORN-139][BUG] Fix read wrong yaml file format when loading config (#1083) 2022-12-14 20:56:04 +08:00
Angerszhuuuu
de3ef0d694
[CELEBORN-102][REFACTOR] TIMEOUT default value should be changed with network timeout (#1047)
* [CELEBORN-102][REFACTOR] TIMEOUT default value should be changed with network timeout
2022-12-06 14:41:23 +08:00
Ethan Feng
acfaf59ab3
[CELEBORN-91] Refactor memory tracker to support read buffer. (#1038)
* [CELEBORN-91] Refactor memory tracker to support read buffer.
2022-12-05 15:38:43 +08:00
nafiy
8e384cda5a
[CELEBORN-88][REFACTOR] Revive/PartitionSplit should set separated timeout configuration (#1046) 2022-12-05 10:36:43 +08:00
nafiy
44d45c2a27
[CELEBORN-90][REFACTOR] GetReducerFileGroup should support separated timeout configuration (#1045) 2022-12-02 22:53:51 +08:00
nafiy
13e1e24035
[CELEBORN-86][REFATCOR] Register shuffle should have separated timeout configuration (#1031)
* [CELEBORN-86][REFATCOR] Register shuffle should have separated timeout configuration
2022-12-01 18:39:56 +08:00
nafiy
d584211a75
[CELEBORN-95][REFACTOR]Rename CLIENT_RPC_ASK_TIMEOUT to HA_CLIENT_RPC_ASK_TIMEOUT (#1037) 2022-12-01 11:57:02 +08:00
zhongqiangczq
898d1126a6
[CELEBORN-11] ShuffleClient supports MapPartition shuffle write: send handshake/regionstart/regionfinish (#1035) 2022-12-01 11:20:55 +08:00
Angerszhuuuu
d26e73209b
[CELEBORN-76] Support batch commit hard split partition before stage end 2022-11-29 13:09:01 +08:00
Cheng Pan
9bf4c65357
[CELEBORN-72][DOCS] Remove unused website resources from main repo (#1014) 2022-11-28 09:47:30 +08:00
Keyong Zhou
f8bb2cd47d
[CELEBORN-12]Retry on CommitFile request (#1011) 2022-11-26 20:56:24 +08:00
Keyong Zhou
9214b82181
[CELEBORN-68] Client might fetch incorrect data chunk (#1010) 2022-11-26 18:06:06 +08:00
Ethan Feng
ee243f286d
[CELEBORN-4] Add metrics about top disk used apps. (#985) 2022-11-22 20:06:36 +08:00
Gabriel
5ecb09d62a
[ISSUE-911] Decrease numConnectionsPerPeer to achieve better performance (#983) 2022-11-20 11:46:17 +08:00
leesf
3699683a3b
Fix and migrate some configs (#927) 2022-11-07 09:41:38 +08:00
Kerwin Zhang
db08d49032
[FEATURE] Support columnar shuffle codegen (#915) 2022-11-04 20:54:13 +08:00
Angerszhuuuu
38e15d89e6
[ISSUE-902][IMPROVEMENT][FOLLOWUP] LifecycleManager should reserve blacklist with irrecoverable status (#914) 2022-11-04 15:54:45 +08:00
Angerszhuuuu
87fcfa767f
[ISSUE-887][REFACTOR] Configuration type convert to Enum (#888)
* [ISSUE-332][FOLLOWUP] Add deps in worker's pom

* [Refactor] Modify package name of utils to keep consistence

* [Refactor] Modify package name of utils to keep consistence

* [REFACTOR] Remove unused isRegistered in controller

* [ISSUE-887][REFACTOR] Configuration type convert to Enum

* update

* update

* Update RssShuffleManager.java
2022-10-29 13:41:06 +08:00
Cheng Pan
d7be6006e7
Migrate network related conf to structured conf system (#875)
* Migrate network related conf to structured conf system

* migrate

* fix

* fix

* worker

* fix

* nit

* review

* nit
2022-10-28 10:45:52 +08:00
Angerszhuuuu
d283cca4e1
[ISSUE-869][REFACTOR] Migrate partition size/sorter related conf to Celeborn ConfigEntity (#870) 2022-10-27 16:49:55 +08:00
Angerszhuuuu
26dcc118c6
[ISSUE-871][REFACTOR] Migrate Worker conf to Celeborn Configuration System (#873)
* [ISSUE-871][REFACTOR] Migrate Worker conf to Celeborn Configuration System
2022-10-27 15:35:29 +08:00
Angerszhuuuu
399236c880
[ISSUE-849][REFACTOR] Migrate master and common Celeborn Configuration System (#850) 2022-10-26 17:09:27 +08:00
Angerszhuuuu
89c3013122
[ISSUE-851][REFACTOR] Migrate quota configruation to Celeborn Configuration System (#852)
* [ISSUE-851][REFACTOR] Migrate quota configruation to Celeborn Configuration System
2022-10-26 14:09:44 +08:00
nafiy
e44e8c9610
[ISSUE-828][REFACTOR] Migrate memory tracker related configs to ConfigEntry (#831)
* [ISSUE-828][REFACTOR] Migrate memory tracker related configs to ConfigEntry

* Fix based on review

* update doc

* resolve review feedback

* fix

* Fix based on review

* fix based on review
2022-10-25 21:16:53 +08:00
Ethan Feng
8800fc4a8e
[Refactor] Refine rpc cache configs (#853)
* refine rpc cache configs.

* update.

* update.

* update.
2022-10-25 20:28:18 +08:00
Ethan Feng
45ef716737
[Feature] Cache GetReducerFileGroupResponse to avoid lifecycle manager oom. (#792) 2022-10-25 16:16:44 +08:00
Cheng Pan
e71c0228aa
Migrate columnar shuffle configurations to ConfigEntry (#844) 2022-10-25 14:26:11 +08:00
AngersZhuuuu
2ebf873b3c
[ISSUE-845][REFACTOR] Migrate partition split related conf to Celeborn Configuration System (#846)
[ISSUE-845][REFACTOR] Migrate partition split related conf to Celeborn Configuration System
2022-10-25 10:54:45 +08:00
AngersZhuuuu
0bd0a3e9f4
[ISSUE-847][REFACTOR] Migrate codec conf to Celeborn Configuration System (#848)
* [ISSUE-847][REFACTOR] Migrate codec conf to Celeborn Configuration System

* Update CelebornConf.scala

* follow comments

* update

* update

* update

* Update client.md
2022-10-25 09:16:46 +08:00
Cheng Pan
e3d649fff3
Change slot to slots for consistency (#843) 2022-10-24 20:49:28 +08:00
AngersZhuuuu
0fdb19065a
[ISSUE-841][REFACTOR] Migrate shuffle client side conf to Celeborn Configuration System (#842) 2022-10-24 20:48:48 +08:00
Cheng Pan
8d7d397e71
Fix Configuration page and polish naming (#838)
* Fix Configuration page and polish naming

* nit

* nit

* comment
2022-10-24 12:46:25 +08:00
Ethan Feng
392a252baa
[FOLLOWUP][ISSUE-813]Update doc and fix typo. (#825) 2022-10-22 23:02:22 +08:00
nafiy
1a8a36e8fe
[ISSUE-812][Refactor] Migrate metrics system related configs to ConfigEntry (#821) 2022-10-21 13:57:58 +08:00
Ethan Feng
5c761a8df3
[ISSUE-813][Refactor] Refactor flusher configurations. (#813)
* Refactor flusher configurations.

* Refactor flusher configurations.

* Update.

* remove brackets.

* update docs.

* rename.

* update.

* update docs.

* update.

* update.

* update.

* update.

* update.

* update.

* update.

* format.

* update.

* update.
2022-10-20 15:23:17 +08:00
AngersZhuuuu
23c65a27a9
[ISSUE-798][REFACTOR] Migrate worker-recover related conf to ConfigEntry (#799) 2022-10-19 16:42:00 +08:00
Cheng Pan
cb07cf62c0
Auto generate configuration docs (#794) 2022-10-19 10:50:22 +08:00
Cheng Pan
ea67f4e060
Introduce categories to ConfigEntry and migrate configurations (#775) 2022-10-17 16:56:54 +08:00
Cheng Pan
f01a696313
Migrate and refactor configuration for master endpoints (#752) 2022-10-11 21:33:21 +08:00
AngersZhuuuu
bbb4f8e225
[ISSUE-306][IMPROVEMENT] Handle change partition request in batch (#622) 2022-10-10 18:31:37 +08:00
AngersZhuuuu
db9ce36608
[ISSUE-690][DOC] Storage resource quota doc (#703) 2022-10-09 20:01:50 +08:00
Keyong Zhou
a2d2379153
[DOC] Replace RSS with Celeborn in docs (#715) 2022-10-06 10:37:46 +08:00