celeborn

Author	SHA1	Message	Date
zky.zhoukeyong	e56a8a8bed	[CELEBORN-798] Add heartbeat from client to LifecycleManager to cleanup client ### What changes were proposed in this pull request? Add heartbeat from client to lifecycle manager. In this PR, the heartbeat request contains the client's local shuffle ids; the lifecycle manager checks them against its local set and returns the ids it doesn't know. Upon receiving the response, the client calls ```unregisterShuffle``` for cleanup. ### Why are the changes needed? Before this PR, client side ```unregisterShuffle``` is never called. When running TPCDS 3T with spark thriftserver without DRA, I found the Executor's heap contains 1.6 million PartitionLocation objects (and StorageInfo): ![image](https://github.com/apache/incubator-celeborn/assets/948245/43658369-7763-4511-a5b0-9b3fbdf02005) After this PR, the number of PartitionLocation objects decreases to 275 thousand ![image](https://github.com/apache/incubator-celeborn/assets/948245/45f8f849-186d-4cad-83c8-64bd6d18debc) This heartbeat can be extended in the future for other purposes, e.g. reporting client's metrics. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1719 from waitinfuture/798. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 18:14:10 +08:00
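The heartbeat check described in CELEBORN-798 amounts to a set difference. The following is a minimal sketch of that idea, not the code from the PR: the class `HeartbeatSketch` and the method `unknownShuffleIds` are hypothetical names, and shuffle ids are assumed to be plain integers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class HeartbeatSketch {
    // Hypothetical helper: returns the shuffle ids reported by the client
    // that are absent from the lifecycle manager's local set.
    static Set<Integer> unknownShuffleIds(Set<Integer> clientIds, Set<Integer> managerIds) {
        Set<Integer> unknown = new HashSet<>(clientIds);
        unknown.removeAll(managerIds);
        return unknown;
    }

    public static void main(String[] args) {
        Set<Integer> clientIds = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> managerIds = new HashSet<>(Arrays.asList(3, 4));

        // The ids in the response are the ones the client would clean up,
        // for example by calling unregisterShuffle on each of them.
        Set<Integer> stale = unknownShuffleIds(clientIds, managerIds);
        clientIds.removeAll(stale);

        System.out.println("stale=" + stale + ", remaining=" + clientIds);
    }
}
```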
zky.zhoukeyong	95119b1e4b	[CELEBORN-799][FOLLOWUP] Fix doc of `celeborn.client.push.maxReqsInFlight.total` ### What changes were proposed in this pull request? Refer to https://github.com/apache/incubator-celeborn/pull/1720#discussion_r1265092164 ### Why are the changes needed? ditto ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA. Closes #1723 from waitinfuture/799-fu. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 18:01:03 +08:00
zky.zhoukeyong	4b3a47c9db	[CELEBORN-799] Limit total inflight push requests ### What changes were proposed in this pull request? As title. ### Why are the changes needed? When the number of worker instances is very large, say 1000, the total memory consumed by inflight requests before this PR is 64K * 1000 * ```celeborn.client.push.maxReqsInFlight(16)``` ≈ 1G. This PR limits total inflight push requests, as 0.2.1-incubating does. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1720 from waitinfuture/799. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 16:17:24 +08:00
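The memory estimate quoted in CELEBORN-799 can be checked with a few lines of arithmetic. This is a minimal sketch, not Celeborn code: the class and method names are hypothetical, and the inputs (64K per request, 1000 workers, 16 requests in flight per worker) are the figures from the commit message.

```java
class InflightMemoryEstimate {
    // Hypothetical helper: worst-case bytes held by inflight push requests
    // when the limit applies per worker rather than in total.
    static long inflightBytes(long bytesPerRequest, int workers, int maxReqsInFlightPerWorker) {
        return bytesPerRequest * workers * maxReqsInFlightPerWorker;
    }

    public static void main(String[] args) {
        long bytes = inflightBytes(64L * 1024, 1000, 16);
        double gib = bytes / (1024.0 * 1024.0 * 1024.0);
        // 65,536 * 1000 * 16 = 1,048,576,000 bytes, a little under 1 GiB.
        System.out.printf("%d bytes (about %.2f GiB)%n", bytes, gib);
    }
}
```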
zky.zhoukeyong	a7bbbd05c4	[CELEBORN-797] Decrease writeTime metric sampling frequency to improve perf ### What changes were proposed in this pull request? 1. Decrease writeTime metric sampling frequency to improve perf 2. Set default value of ```celeborn.<module>.push.timeoutCheck.threads``` and ```celeborn.<module>.fetch.timeoutCheck.threads``` to 4 ### Why are the changes needed? Following are test cases case 1: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 15000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 1.1T data case 2: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 30000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 2.2T data Following are e2e time of shuffle write stage \|\|Sort pusher before\|Sort pusher after\|Hash pusher before\|Hash pusher after\| \|----\|----\|----\|----\|-----\| \|case1\|4.4min\|4.1min\|4.4min\|3.9min\| \|case2\|9.1min\|8.4min\|9.7min\|8.5min\| ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1718 from waitinfuture/797. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-14 20:51:50 +08:00
caojiaqing	d64e0091f1	[CELEBORN-785] Add worker side partition hard split threshold ### What changes were proposed in this pull request? Add a configuration `celeborn.worker.shuffle.partitionSplit.max` to ensure that, in soft mode, individual partition files are limited to a size smaller than the configured value. ### Why are the changes needed? In soft mode, there may be situations where individual partition files are exceptionally large, which can result in excessively long sort times in skewed scenarios. ### Does this PR introduce _any_ user-facing change? `celeborn.worker.shuffle.partitionSplit.max`, default value 2g. ### How was this patch tested? None. Closes #1701 from JQ-Cao/785. Authored-by: caojiaqing <caojiaqing@bilibili.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-11 14:14:41 +08:00
zky.zhoukeyong	7a47fae230	[CELEBORN-786] Change default flush threads ### What changes were proposed in this pull request? This PR changes default values of the following configs: \|config\|previous default value\|new default value\| \|----\|----\|----\| \|celeborn.worker.flusher.threads\|2\|16\| \|celeborn.worker.flusher.ssd.threads\|8\|16\| ### Why are the changes needed? If disk type is not specified, ```celeborn.worker.flusher.threads``` will be used. Recently many users use SSD for Celeborn workers without specifying disk type, and 2 flush threads is far from leveraging the power of SSD. ### Does this PR introduce _any_ user-facing change? Yes, default configs are changed. ### How was this patch tested? Passes GA. Closes #1703 from waitinfuture/786. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-11 13:09:29 +08:00
Angerszhuuuu	9f09ac6ce9	[CELEBORN-780] Change SPARK_SHUFFLE_FORCE_FALLBACK_PARTITION_THRESHOLD default to Int.MaxValue since slot's is not a bottleneck ### What changes were proposed in this pull request? Slots are no longer a bottleneck, so change the default value of SPARK_SHUFFLE_FORCE_FALLBACK_PARTITION_THRESHOLD to Int.MaxValue. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1695 from AngersZhuuuu/CELEBORN-780. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-10 18:50:10 +08:00
Cheng Pan	4f8e72f217	[CELEBORN-774] Pullout celeborn.rpc.dispatcher.threads to CelebornConf ### What changes were proposed in this pull request? Pullout hardcoded `celeborn.rpc.dispatcher.numThreads` to `CelebornConf` and rename it to `celeborn.rpc.dispatcher.threads` to align with existing configuration style ### Why are the changes needed? Pullout inline configuration to `CelebornConf`, and expose it in configuration docs ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #1684 from pan3793/CELEBORN-774. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-07-06 16:23:32 +08:00
zky.zhoukeyong	09881f5cff	[CELEBORN-769] Change default value of celeborn.client.push.maxReqsInFlight to 16 ### What changes were proposed in this pull request? Change default value of celeborn.client.push.maxReqsInFlight to 16. ### Why are the changes needed? Previous value 4 is too small, 16 is more reasonable. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #1683 from waitinfuture/769. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-06 10:22:06 +08:00
mingji	d0ecf83fec	[CELEBORN-764] Fix celeborn on HDFS might clean using app directories ### What changes were proposed in this pull request? Make the Celeborn leader clean expired app dirs on HDFS when an application is Lost. ### Why are the changes needed? If Celeborn is working on HDFS, the storage manager cleans expired app directories when it starts, so a newly created worker will try to delete any app directories it does not know. This can cause app directories that are still in use to be deleted unexpectedly. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? UT and cluster. Closes #1678 from FMX/CELEBORN-764. Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-05 23:11:50 +08:00
zky.zhoukeyong	4300835363	[CELEBORN-768] Change default config values for batch rpcs and netty memory allocator ### What changes were proposed in this pull request? Changes the following configs' default values \| config \| previous value \| current value \| \| ------------- \| ------------- \| ------------- \| \| celeborn.network.memory.allocator.share \| false \| true \| \| celeborn.client.shuffle.batchHandleChangePartition.enabled \| false \| true \| \| celeborn.client.shuffle.batchHandleCommitPartition.enabled \| false \| true \| ### Why are the changes needed? In my test, when graceful shutdown is enabled but ```celeborn.client.shuffle.batchHandleChangePartition.enabled``` and ```celeborn.client.shuffle.batchHandleCommitPartition.enabled``` are disabled, the worker takes much longer to stop than with the two configs enabled. In another test where the worker is quite small (2 cores, 4 G) and replication is on, if the shared allocator is disabled, netty's onTrim fails to release memory, which further causes push data timeout. ### Does this PR introduce _any_ user-facing change? No, these configs are introduced in 0.3.0. ### How was this patch tested? Passes GA. Closes #1682 from waitinfuture/768. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-05 18:16:41 +08:00
Fu Chen	3af5c231c7	[CELEBORN-767][DOC] Update the docs of `celeborn.client.spark.push.sort.memory.threshold` ### What changes were proposed in this pull request? As title ### Why are the changes needed? To clarify the usage of conf `celeborn.client.spark.push.sort.memory.threshold` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA Closes #1680 from cfmcgrady/docs. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-05 18:07:09 +08:00
Angerszhuuuu	693172d0bd	[CELEBORN-751] Rename remain rss related class name and filenames etc ### What changes were proposed in this pull request? Rename remain rss related class name and filenames etc... ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1664 from AngersZhuuuu/CELEBORN-751. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-04 10:20:08 +08:00
Cheng Pan	26aaba14d4	[CELEBORN-637][FOLLOWUP] Mention configurations change in migration guide ### What changes were proposed in this pull request? as title ### Why are the changes needed? mention configuration behavior change in migration guide ### Does this PR introduce _any_ user-facing change? Yes, the migration guide is updated ### How was this patch tested? review Closes #1673 from pan3793/CELEBORN-637-followup. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-03 14:26:43 +08:00
xiyu.zk	381165d4e7	[CELEBORN-755] Support disable shuffle compression ### What changes were proposed in this pull request? Support to decide whether to compress shuffle data through configuration. ### Why are the changes needed? Currently, Celeborn compresses all shuffle data, but for example, the shuffle data of Gluten has already been compressed. In this case, no additional compression is required. Therefore, configuration needs to be provided for users to decide whether to use Celeborn’s compression according to the actual situation. ### Does this PR introduce _any_ user-facing change? no. Closes #1669 from kerwin-zk/celeborn-755. Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-07-01 00:03:50 +08:00
Angerszhuuuu	5c7ecb8302	[CELEBORN-754][IMPORTANT] Provide a new SparkShuffleManager to replace RssShuffleManager in the future ### What changes were proposed in this pull request? Provide a new SparkShuffleManager to replace RssShuffleManager in the future ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1667 from AngersZhuuuu/CELEBORN-754. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-30 17:27:33 +08:00
Fu Chen	adbd38a926	[CELEBORN-726][FOLLOWUP] Update data replication terminology from `master/slave` to `primary/replica` in the codebase ### What changes were proposed in this pull request? As title ### Why are the changes needed? In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #1639 from cfmcgrady/primary-replica. Lead-authored-by: Fu Chen <cfmcgrady@gmail.com> Co-authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 17:07:26 +08:00
Fu Chen	17c1e01874	[CELEBORN-726] Update data replication terminology from `master/slave` to `primary/replica` for configurations and metrics ### What changes were proposed in this pull request? This PR is an integral part of #1639. It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC. ### Why are the changes needed? In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests. Closes #1650 from cfmcgrady/primary-replica-metrics. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 09:47:02 +08:00
onebox-li	1b74d85fb1	[CELEBORN-725][MINOR] Refine congestion code ### What changes were proposed in this pull request? Refine the congestion relevant code/log/comments ### Why are the changes needed? ditto ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? manually test Closes #1637 from onebox-li/improve-congestion. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 18:31:40 +08:00
Angerszhuuuu	3985a5cbd7	[CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment ### What changes were proposed in this pull request? Unify all blacklist related code and comment ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 16:28:03 +08:00
zhongqiang.czq	374d735ae5	[CELEBORN-724] Fix the compatibility of HeartbeatFromApplicationResponse with lower versions ### What changes were proposed in this pull request? The master side checks the reply field of HeartbeatFromApplication. If reply is true, it replies with HeartbeatFromApplicationResponse, otherwise with OneWayMessageResponse. The reply field defaults to false for version 0.2.1 and earlier, so the master stays compatible with older client versions. ### Why are the changes needed? In version `0.2.1` and earlier, the response to HeartbeatFromApplication is `OneWayMessageResponse`, but from `0.3.0` the response is changed to `HeartbeatFromApplicationResponse`. If the client side is at version `0.2.1` and the server side is at version `0.3.0`, a compatibility issue occurs and the following compatibility errors are printed. ``` java java.io.InvalidObjectException: enum constant HEARTBEAT_FROM_APPLICATION_RESPONSE does not exist in class org.apache.celeborn.common.protocol.MessageType at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:2157) ~[?:1.8.0_362] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1662) ~[?:1.8.0_362] at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2430) ~[?:1.8.0_362] at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2354) ~[?:1.8.0_362] at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2212) ~[?:1.8.0_362] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1668) ~[?:1.8.0_362] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502) ~[?:1.8.0_362] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460) ~[?:1.8.0_362] at org.apache.celeborn.common.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?]
``` ``` java Caused by: java.lang.ClassCastException: Cannot cast org.apache.celeborn.common.protocol.message.ControlMessages$HeartbeatFromApplicationResponse to org.apache.celeborn.common.protocol.message.ControlMessages$OneWayMessageResponse$ at java.lang.Class.cast(Class.java:3369) ~[?:1.8.0_362] at scala.concurrent.Future.$anonfun$mapTo$1(Future.scala:500) ~[scala-library-2.12.15.jar:?] at scala.util.Success.$anonfun$map$1(Try.scala:255) ~[scala-library-2.12.15.jar:?] at scala.util.Success.map(Try.scala:213) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:67) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:82) ~[scala-library-2.12.15.jar:?] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:59) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:875) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:110) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873) ~[scala-library-2.12.15.jar:?] 
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Promise.trySuccess(Promise.scala:94) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Promise.trySuccess$(Promise.scala:94) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187) ~[scala-library-2.12.15.jar:?] at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:218) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The PR is tested manually and the testing process is as follows: 1. The server side is deployed using the code of the latest branch-0.3. 2. The Spark client is deployed with version 0.2.1, then spark-sql runs 3 TPCDS queries (query1.sql/query2.sql/query3.sql, data size 1T); finally, verify that the queries execute successfully and the compatibility errors above are not printed. 3. The Spark client is deployed with version 0.3.0, then spark-sql runs 3 TPCDS queries (query1.sql/query2.sql/query3.sql, data size 1T); finally, verify that the queries execute successfully and the compatibility errors above are not printed. This patch had conflicts when merged, resolved by Committer: Cheng Pan <chengpan@apache.org> Closes #1635 from zhongqiangczq/heartbeat2. Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-28 16:04:18 +08:00
Angerszhuuuu	33cf343d20	[CELEBORN-666][REFACTOR] Unify exclude and blacklist related configuration ### What changes were proposed in this pull request? Unify exclude and blacklist related configuration ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1633 from AngersZhuuuu/CELEBORN-666-NEW. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-28 10:59:58 +08:00
zky.zhoukeyong	57b0e815cf	[CELEBORN-656] Batch revive RPCs in client to avoid too many requests ### What changes were proposed in this pull request? This PR batches revive requests and periodically sends them to LifecycleManager to reduce the number of RPC requests. In more detail, this PR changes the Revive message to support multiple unique partitions, and also passes a set of unique mapIds for checking MapEnd. Each time ShuffleClientImpl wants to revive, it adds a ReviveRequest to ReviveManager and waits for the result. ReviveManager batches revive requests and periodically sends them to LifecycleManager (deduplicated by partitionId). LifecycleManager constructs ChangeLocationsCallContext and, after all locations are notified, replies to ShuffleClientImpl. ### Why are the changes needed? In my test of 3T TPCDS q23a with 3 Celeborn workers, when a worker is killed, the LifecycleManager receives 48,000 (4.8w) Revive requests: ``` [emr-usermaster-1-1 logs]$ cat spark-emr-user-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master-1-1.c-fa08904e94c028d1.out.1 \|grep -i revive \|wc -l 64364 ``` After this PR, the number of ReviveBatch requests reduces to 708: ``` [emr-usermaster-1-1 logs]$ cat spark-emr-user-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master-1-1.c-fa08904e94c028d1.out \|grep -i revive \|wc -l 2573 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. I have tested: 1. Disable graceful shutdown, kill one worker, job succeeds 2. Disable graceful shutdown, kill two workers successively, job fails as expected 3. Enable graceful shutdown, restart two workers successively, job succeeds 4. Enable graceful shutdown, restart two workers successively, then kill the third one, job succeeds Closes #1588 from waitinfuture/656-2.
Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Keyong Zhou <zhouky@apache.org> Co-authored-by: Keyong Zhou <waitinfuture@gmail.com> Signed-off-by: Shuang <lvshuang.tb@gmail.com>	2023-06-27 22:11:04 +08:00
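The batching described in CELEBORN-656 can be pictured as collapsing many queued requests into one message that carries unique partition ids and unique map ids. The following is a minimal sketch under that reading, not the ReviveManager implementation: `Request`, `Batch`, and `drain` are hypothetical names, and the periodic scheduling and the reply path are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ReviveBatchSketch {
    /** Hypothetical stand-in for one queued revive request. */
    static final class Request {
        final int mapId;
        final int partitionId;

        Request(int mapId, int partitionId) {
            this.mapId = mapId;
            this.partitionId = partitionId;
        }
    }

    /** Hypothetical batch message: each partition id and map id appears once. */
    static final class Batch {
        final Set<Integer> partitionIds = new LinkedHashSet<>();
        final Set<Integer> mapIds = new LinkedHashSet<>();
    }

    // Collapses all pending requests into one batch and empties the queue.
    // A real manager would call something like this on a timer.
    static Batch drain(List<Request> pending) {
        Batch batch = new Batch();
        for (Request r : pending) {
            batch.partitionIds.add(r.partitionId);
            batch.mapIds.add(r.mapId);
        }
        pending.clear();
        return batch;
    }

    public static void main(String[] args) {
        List<Request> pending = new ArrayList<>();
        pending.add(new Request(10, 1));
        pending.add(new Request(11, 1)); // same partition, different mapper
        pending.add(new Request(10, 2));

        Batch batch = drain(pending);
        System.out.println("partitions=" + batch.partitionIds + ", mapIds=" + batch.mapIds);
    }
}
```

Three queued requests become a single batch with two partitions, which is the effect the PR relies on to cut the RPC count when a worker is lost.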
mingji	40760ede3a	[CELEBORN-568] Support storage type selection ### What changes were proposed in this pull request? 1. Celeborn supports storage type selection. HDD, SSD, and HDFS are available for now. 2. Add new buffer size for HDFS file writers. 3. Worker support empty working dirs. ### Why are the changes needed? Support HDFS only scenario. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? UT and cluster. Closes #1619 from FMX/CELEBORN-568. Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-27 18:07:08 +08:00
Angerszhuuuu	a2b215bd47	[CELEBORN-718] Support override Hadoop Conf by Celeborn Conf with `celeborn.hadoop.` prefix ### What changes were proposed in this pull request? The Hadoop configuration generated by Celeborn should respect the Celeborn conf. ### Why are the changes needed? On the Spark client side, the keys are written as `spark.celeborn.hadoop.xxx.xx`. On the server side, the keys are written as `celeborn.hadoop.xxx.xxx`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1629 from AngersZhuuuu/CELEBORN-719. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-27 17:00:47 +08:00
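One way to picture the `celeborn.hadoop.` prefix from CELEBORN-718 is as a filter over the Celeborn conf. The sketch below is illustrative only: the class and method names are hypothetical, plain maps stand in for the real configuration classes, and it assumes the prefix is stripped before a key is applied to the Hadoop configuration, which the commit message does not spell out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HadoopConfOverrideSketch {
    static final String PREFIX = "celeborn.hadoop.";

    // Hypothetical helper: picks the entries that carry the prefix and
    // returns them keyed by the remainder of the name (assumed behavior).
    static Map<String, String> hadoopOverrides(Map<String, String> celebornConf) {
        Map<String, String> overrides = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : celebornConf.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                overrides.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
        return overrides;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("celeborn.hadoop.fs.defaultFS", "hdfs://example:8020"); // example value only
        conf.put("celeborn.client.push.maxReqsInFlight", "16");

        System.out.println(hadoopOverrides(conf));
    }
}
```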
Cheng Pan	1753556565	[CELEBORN-713] Local network binding support IP or FQDN ### What changes were proposed in this pull request? This PR aims to make network local address binding support both IP and FQDN strategy. Additionally, it refactors `ShuffleClientImpl#genAddressPair` from `${hostAndPort}-${hostAndPort}` to `Pair<String, String>`; the old string format works properly with IPs but may break with FQDNs, because an FQDN may contain `-`. ### Why are the changes needed? Currently, when the bind hostname is not set explicitly, Celeborn finds the first non-loopback address and always binds to the IP. This is not suitable for K8s, where the STS has a stable FQDN but the Pod IP changes when the Pod restarts. `ShuffleClientImpl#genAddressPair` must be changed as well, otherwise it may cause ``` java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11657 in stage 0.0 failed 4 times, most recent failure: Lost task 11657.3 in stage 0.0 (TID 12747) (10.153.253.198 executor 157): java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.celeborn.client.ShuffleClientImpl.doPushMergedData(ShuffleClientImpl.java:874) at org.apache.celeborn.client.ShuffleClientImpl.pushOrMergeData(ShuffleClientImpl.java:735) at org.apache.celeborn.client.ShuffleClientImpl.mergeData(ShuffleClientImpl.java:827) at org.apache.spark.shuffle.celeborn.SortBasedPusher.pushData(SortBasedPusher.java:140) at org.apache.spark.shuffle.celeborn.SortBasedPusher.insertRecord(SortBasedPusher.java:192) at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.fastWrite0(SortBasedShuffleWriter.java:192) at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.write(SortBasedShuffleWriter.java:145) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at
org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1508) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) ``` ### Does this PR introduce _any_ user-facing change? Yes, a new configuration `celeborn.network.bind.preferIpAddress` is introduced, and the default value is `true` to preserve the existing behavior. ### How was this patch tested? Manually testing with `celeborn.network.bind.preferIpAddress=false` ``` Server: 10.178.96.64 Address: 10.178.96.64#53 Name: celeborn-master-0.celeborn-master-svc.spark.svc.cluster.local Address: 10.153.143.252 Server: 10.178.96.64 Address: 10.178.96.64#53 Name: celeborn-master-1.celeborn-master-svc.spark.svc.cluster.local Address: 10.153.173.94 Server: 10.178.96.64 Address: 10.178.96.64#53 Name: celeborn-master-2.celeborn-master-svc.spark.svc.cluster.local Address: 10.153.149.42 starting org.apache.celeborn.service.deploy.worker.Worker, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.worker.Worker-1-celeborn-worker-4.out 2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.Dispatcher#51 - Dispatcher numThreads: 4 2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.network.client.TransportClientFactory#91 - mode NIO threads 64 2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.NettyRpcEnvFactory#51 - Starting RPC Server [WorkerSys] on celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 with advisor endpoint celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.util.Utils#51 - Successfully started service 
'WorkerSys' on port 38303. ``` Closes #1622 from pan3793/CELEBORN-713. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-27 09:42:11 +08:00
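The delimiter problem behind the `genAddressPair` change in CELEBORN-713 is easy to see in isolation. The sketch below is illustrative and does not reproduce how `ShuffleClientImpl` parsed the string: the class and method names are hypothetical, and the host names and port 9097 are made up. It only shows that a `-` joined string is unambiguous for IPs and ambiguous for FQDNs, while a typed pair keeps the two halves separate.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

class AddressPairSketch {
    // Old style: two host:port strings joined by '-' and split again later.
    static String[] splitJoined(String joined) {
        return joined.split("-");
    }

    // New style: keep both addresses in a typed pair, so no parsing is needed.
    static Map.Entry<String, String> pair(String first, String second) {
        return new SimpleImmutableEntry<>(first, second);
    }

    public static void main(String[] args) {
        String ipJoined = "10.0.0.1:9097" + "-" + "10.0.0.2:9097";
        String fqdnJoined = "celeborn-worker-0.svc:9097" + "-" + "celeborn-worker-1.svc:9097";

        // Prints 2 for the IP case and 6 for the FQDN case.
        System.out.println(splitJoined(ipJoined).length);
        System.out.println(splitJoined(fqdnJoined).length);

        Map.Entry<String, String> p = pair("celeborn-worker-0.svc:9097", "celeborn-worker-1.svc:9097");
        System.out.println(p.getKey() + " / " + p.getValue());
    }
}
```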
Cheng Pan	2b82194ce0	[CELEBORN-715] Change master URL schema from rss to celeborn ### What changes were proposed in this pull request? Change Celeborn Master URL from `rss://<host>:<port>` to `celeborn://<host>:<port>` ### Why are the changes needed? Respect the project name. ### Does this PR introduce _any_ user-facing change? Yes, migration guide is updated accordingly. ### How was this patch tested? Pass GA. Closes #1624 from pan3793/CELEBORN-715. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-26 22:30:20 +08:00
Cheng Pan	ac84d64d51	[CELEBORN-707][MASTER] Remove env CELEBORN_MASTER_HOST and CELEBORN_MASTER_PORT ### What changes were proposed in this pull request? Remove environment variables `CELEBORN_MASTER_HOST` and `CELEBORN_MASTER_PORT`, and make `CELEBORN_LOCAL_HOSTNAME` take effect on both master and worker. ### Why are the changes needed? There are many different ways to configure the master/worker host and port, which makes things complex and inconsistent. After this change, #### master 1. cli args `--host` `--port` take the highest priority 2. then look up env `CELEBORN_LOCAL_HOSTNAME` 3. things differ depending on whether HA is enabled 3.1. when HA is disabled, look up configurations `celeborn.master.host` and `celeborn.master.port` 3.2. when HA is enabled, each node needs to know the whole cluster info, ``` celeborn.master.ha.node.1.host clb-1 celeborn.master.ha.node.1.port 9097 celeborn.master.ha.node.2.host clb-2 celeborn.master.ha.node.2.port 9097 celeborn.master.ha.node.3.host clb-3 celeborn.master.ha.node.3.port 9097 ``` in addition, `celeborn.master.ha.node.id=1` can be used to indicate the node id, otherwise, the master will try to bind each host to match the node id. #### worker 1. cli args `--host` `--port` take the highest priority 2. then look up env `CELEBORN_LOCAL_HOSTNAME` Things are simpler than the master case because each worker is not required to know the others. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? UT. Closes #1616 from pan3793/CELEBORN-707. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-25 16:00:59 +08:00
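The lookup order listed in CELEBORN-707 for a master with HA disabled can be sketched as a simple fallback chain. This is an illustration, not the code from the PR: the class and method names are hypothetical, `null` is assumed to mean "not set", and the HA case (matching a node id against `celeborn.master.ha.node.*`) is left out.

```java
class MasterHostResolutionSketch {
    // Hypothetical helper for the non-HA master case:
    // 1. the --host CLI argument, 2. env CELEBORN_LOCAL_HOSTNAME,
    // 3. the celeborn.master.host configuration.
    static String resolveMasterHost(String cliHost, String envLocalHostname, String confMasterHost) {
        if (cliHost != null) {
            return cliHost;
        }
        if (envLocalHostname != null) {
            return envLocalHostname;
        }
        return confMasterHost;
    }

    public static void main(String[] args) {
        // The CLI argument wins when it is present.
        System.out.println(resolveMasterHost("cli-host", "env-host", "conf-host"));
        // Without it, the environment variable is used.
        System.out.println(resolveMasterHost(null, "env-host", "conf-host"));
        // With neither, the configuration value applies.
        System.out.println(resolveMasterHost(null, null, "conf-host"));
    }
}
```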
zky.zhoukeyong	e2eeafd4bf	[CELEBORN-709] Increase default fetch timeout ### What changes were proposed in this pull request? 30s for fetch timeout is too short and easy to exceed. This PR increases the default value to 600s. ### Why are the changes needed? When I was testing 3T TPCDS with three workers, I encountered fetch timeout: ``` 23/06/21 16:46:41,771 INFO [fetch-server-11-7] FetchHandler: Sending chunk 28856864163, 1, 0, 2147483647 ... 23/06/21 16:47:16,870 INFO [fetch-server-11-7] FetchHandler: Sent chunk 28856864163, 1, 0, 2147483647 ``` And I remember from some users' monitoring, the max fetch time can reach several minutes on heavy load without error. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1618 from waitinfuture/709. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-23 21:06:43 +08:00
zky.zhoukeyong	5f4f6d953f	[CELEBORN-702][DOC] Extend doc about migration from 0.2.1 to 0.3.0 ### What changes were proposed in this pull request? Extend doc about migration from 0.2.1 to 0.3.0. Added the following contents: <img width="1084" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/7a9d172c-09ba-48b6-9f5c-73a8b13d035f"> ### Why are the changes needed? ditto ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #1612 from waitinfuture/702. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-20 20:45:58 +08:00
Cheng Pan	e22379c3ab	[CELEBORN-638] Migrate configurations celeborn.ha.master.* to celeborn.master.ha.* ### What changes were proposed in this pull request? It was discussed during the last meeting, but abandoned due to the complication. ### Why are the changes needed? Make the configuration unified. ### Does this PR introduce _any_ user-facing change? Yes, but the legacy configurations still take effect. ### How was this patch tested? New UTs. Closes #1549 from pan3793/CELEBORN-638. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-16 18:18:26 +08:00
Angerszhuuuu	1ba6dee324	[CELEBORN-680][DOC] Refresh celeborn configurations in doc ### What changes were proposed in this pull request? Refresh celeborn configurations in doc ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1592 from AngersZhuuuu/CELEBORN-680. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-15 13:59:38 +08:00
Angerszhuuuu	0aa13832b5	[CELEBORN-676] Celeborn fetch chunk also should support check timeout ### What changes were proposed in this pull request? Celeborn fetch chunk also should support check timeout #### Test case ``` executor instance 20 SQL: SELECT count(1) from (select /+ REPARTITION(100) / * from spark_auxiliary.t50g) tmp; --conf spark.celeborn.client.spark.shuffle.writer=sort \ --conf spark.celeborn.client.fetch.excludeWorkerOnFailure.enabled=true \ --conf spark.celeborn.client.push.timeout=10s \ --conf spark.celeborn.client.push.replicate.enabled=true \ --conf spark.celeborn.client.push.revive.maxRetries=10 \ --conf spark.celeborn.client.reserveSlots.maxRetries=10 \ --conf spark.celeborn.client.registerShuffle.maxRetries=3 \ --conf spark.celeborn.client.push.blacklist.enabled=true \ --conf spark.celeborn.client.blacklistSlave.enabled=true \ --conf spark.celeborn.client.fetch.timeout=30s \ --conf spark.celeborn.client.push.data.timeout=30s \ --conf spark.celeborn.client.push.limit.inFlight.timeout=600s \ --conf spark.celeborn.client.push.maxReqsInFlight=32 \ --conf spark.celeborn.client.shuffle.compression.codec=ZSTD \ --conf spark.celeborn.rpc.askTimeout=30s \ --conf spark.celeborn.client.rpc.reserveSlots.askTimeout=30s \ --conf spark.celeborn.client.shuffle.batchHandleChangePartition.enabled=true \ --conf spark.celeborn.client.shuffle.batchHandleCommitPartition.enabled=true \ --conf spark.celeborn.client.shuffle.batchHandleReleasePartition.enabled=true ``` Test with 3 worker and add a `Thread.sleep(100s)` before worker handle `ChunkFetchRequest` Before patch <img width="1783" alt="截屏2023-06-14 上午11 20 55" src="https://github.com/apache/incubator-celeborn/assets/46485123/182dff7d-a057-4077-8368-d1552104d206"> After patch <img width="1792" alt="image" src="https://github.com/apache/incubator-celeborn/assets/46485123/3c8b7933-8ace-426d-8e9f-04e0aabfac8e"> The log shows the fetch timeout checker workers ``` 23/06/14 11:14:54 ERROR WorkerPartitionReader: Fetch 
chunk 0 failed. org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147) at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 23/06/14 11:14:54 WARN RssInputStream: Fetch chunk failed 1/6 times for location PartitionLocation[ id-epoch:35-0 host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.203-9092-9094-9093-9095 mode:MASTER peer:(host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.202-9092-9094-9093-9095) storage hint:StorageInfo{type=HDD, mountPoint='/mnt/ssd/0', finalResult=true, filePath=} mapIdBitMap:null], change to peer org.apache.celeborn.common.exception.CelebornIOException: Fetch chunk 0 failed. 
at org.apache.celeborn.client.read.WorkerPartitionReader$1.onFailure(WorkerPartitionReader.java:98) at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:146) at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147) ... 8 more 23/06/14 11:14:54 INFO SortBasedShuffleWriter: Memory used 72.0 MB ``` ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1587 from AngersZhuuuu/CELEBORN-676. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-15 13:54:09 +08:00
Angerszhuuuu	8a0b7d80d6	[CELEBORN-681][DOC] Add celeborn.metrics.conf to conf entity ### What changes were proposed in this pull request? Add celeborn.metrics.conf to conf entity ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1593 from AngersZhuuuu/CELEBORN-681. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-14 18:06:03 +08:00
Fu Chen	aa3bb0ac3b	[CELEBORN-679] Optimize `Utils#bytesToString` ### What changes were proposed in this pull request? refer to https://github.com/apache/spark/pull/40301 1. Optimize `Utils.bytesToString`. Arithmetic ops on BigInt and BigDecimal are order(s) of magnitude slower than the ops on primitive types. Division is an especially slow operation and it is used en masse here. 2. According to the information sourced from [Wikipedia](https://en.wikipedia.org/wiki/Kilobyte), it is established that 1000 is the appropriate factor for representing kilobytes (KB), while 1024 is the correct factor for kibibytes (KiB). In alignment with this understanding, changing the size unit from "KB" to "KiB". ### Why are the changes needed? the Utils#bytesToString method is frequently employed in memory-related log messages. ### Does this PR introduce _any_ user-facing change? No, only perf improvement. ### How was this patch tested? existing UT and manually tested. Closes #1590 from cfmcgrady/bytesToString. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-14 17:42:16 +08:00
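The idea behind CELEBORN-679, formatting byte counts with primitive arithmetic and binary (KiB-style) units instead of `BigInt`/`BigDecimal` division, can be sketched as below. This is not the actual `Utils#bytesToString` implementation: the class name `ByteSizeFormatter`, the one-decimal output format, and the unit thresholds are assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical sketch: format a byte count using only long/double arithmetic.
final class ByteSizeFormatter {
  private static final String[] UNITS = {"KiB", "MiB", "GiB", "TiB", "PiB", "EiB"};

  private ByteSizeFormatter() {}

  static String bytesToString(long size) {
    if (size < 1024L) {
      return size + " B";
    }
    // Pick the largest binary unit that does not exceed the size.
    int idx = 0;
    long unit = 1024L;
    while (idx < UNITS.length - 1 && size >= unit * 1024L) {
      unit *= 1024L;
      idx++;
    }
    // A single double division replaces the BigDecimal arithmetic.
    return String.format(Locale.US, "%.1f %s", (double) size / unit, UNITS[idx]);
  }
}
```

Under these assumptions, `ByteSizeFormatter.bytesToString(1536L)` yields `1.5 KiB`. The loop stops at EiB, so `unit * 1024L` never overflows a `long`.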
Shuang	da85347330	[CELEBORN-675] Fix decode heartbeat message ### What changes were proposed in this pull request? Give the Heartbeat message a one-byte body and skip this byte when decoding. ### Why are the changes needed? A Heartbeat message may be split into two Netty buffers; the `empty buffer` (which is not actually needed, but must be kept) is then wrongly removed, and decodeNext throws an NPE. See ``` java while (headerBuf.readableBytes() < HEADER_SIZE) { ByteBuf next = buffers.getFirst(); int toRead = Math.min(next.readableBytes(), HEADER_SIZE - headerBuf.readableBytes()); headerBuf.writeBytes(next, toRead); if (!next.isReadable()) { buffers.removeFirst().release(); } } ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT & MANUAL Closes #1589 from RexXiong/CELEBORN-675. Authored-by: Shuang <lvshuang.tb@gmail.com> Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>	2023-06-14 14:37:13 +08:00
zky.zhoukeyong	47cded835f	[CELEBORN-669] Avoid commit files on excluded worker list ### What changes were proposed in this pull request? CommitHandler will check whether the target worker is in WorkerStatusTracker's excluded list. If so, skip calling commit files on it. ### Why are the changes needed? Avoid unnecessary commit files to excluded worker. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1581 from waitinfuture/669. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: Keyong Zhou <zhouky@apache.org> Signed-off-by: Shuang <lvshuang.tb@gmail.com>	2023-06-13 22:31:02 +08:00
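The check described in CELEBORN-669, skipping commit-files requests for workers on the excluded list, amounts to a filter over the candidate workers. A minimal sketch follows, assuming workers are identified by plain strings; `CommitTargetFilterSketch` and `workersToCommit` are hypothetical names and do not reflect the real `CommitHandler` or `WorkerStatusTracker` APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: drop excluded workers before sending commit-files requests.
final class CommitTargetFilterSketch {
  private CommitTargetFilterSketch() {}

  static List<String> workersToCommit(List<String> candidates, Set<String> excluded) {
    List<String> result = new ArrayList<>();
    for (String worker : candidates) {
      // Only workers that are not on the excluded list receive a commit request.
      if (!excluded.contains(worker)) {
        result.add(worker);
      }
    }
    return result;
  }
}
```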
Angerszhuuuu	357add5b00	[CELEBORN-494][PERF] RssInputStream fetch side support blacklist to avoid client side timeout in same worker multiple times during fetch ### What changes were proposed in this pull request? ####Test case ``` executor instance 20 SQL: SELECT count(1) from (select /+ REPARTITION(100) / * from spark_auxiliary.t50g) tmp; create connection timeout 10s Fetch chunk timeout 30s ``` In the graph, the shuffle read time of `before` and `after` is always the same delay time. ##### Worker can't connect Before ![image](https://user-images.githubusercontent.com/46485123/229465520-9d751b40-2b8f-49d2-b350-a2278e3dd89e.png) After ![image](https://user-images.githubusercontent.com/46485123/229465552-88ac1ca4-24ad-4c30-9a46-0cdcae6bbfd5.png) ##### OpenStream stuck Before ![image](https://user-images.githubusercontent.com/46485123/229465629-68765a6a-2503-4018-8917-d49e47d5dccc.png) After ![image](https://user-images.githubusercontent.com/46485123/229465683-2f57b374-1c66-4819-93dd-cabee7ccb788.png) ##### Fetch chunk stuck Before ![image](https://user-images.githubusercontent.com/46485123/229465735-8d2f694b-1b4a-4984-b069-c4a308f41008.png) After ![image](https://user-images.githubusercontent.com/46485123/229465754-c2237d5a-6fb6-4d5b-819e-b7d86a1e88d7.png) ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1406 from AngersZhuuuu/CELEBORN-494. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Shuang <lvshuang.tb@gmail.com>	2023-06-13 20:06:31 +08:00
Angerszhuuuu	6b725202a2	[CELEBORN-640][WORKER] DataPushQueue should not keep waiting take tasks ### What changes were proposed in this pull request? In production we have repeatedly seen the push queue get stuck because PushState's status was not removed, which caused DataPushQueue to keep waiting to take a task. Although some of these bugs have been resolved, we'd better add a max wait time for taking tasks, since we already have the `PUSH_DATA_TIMEOUT` check method. If the target worker is really stuck, we can eventually retry the task. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1552 from AngersZhuuuu/CELEBORN-640. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-09 14:06:47 +08:00
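The general technique behind CELEBORN-640, bounding how long a consumer waits for a task instead of blocking indefinitely, can be sketched with the standard `BlockingQueue.poll(timeout, unit)` call. This is not the actual `DataPushQueue` code; `BoundedTakeSketch`, `takeWithTimeout`, and the `maxWaitMs` parameter are hypothetical names.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: wait for a task for at most maxWaitMs instead of blocking forever.
final class BoundedTakeSketch {
  private BoundedTakeSketch() {}

  // Returns the next task, or null if none arrived within maxWaitMs.
  static <T> T takeWithTimeout(BlockingQueue<T> queue, long maxWaitMs) {
    try {
      // poll(timeout, unit) gives up after the timeout, unlike take().
      return queue.poll(maxWaitMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      // Restore the interrupt flag so callers can observe the interruption.
      Thread.currentThread().interrupt();
      return null;
    }
  }
}
```

A caller that receives `null` can then re-check its own state (for example, whether a timeout check has already failed the pending pushes) rather than staying parked on the queue.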
onebox-li	0c869ac9a0	[CELEBORN-642] Improve metrics and update grafana ### What changes were proposed in this pull request? Change in grafana （ALL） add: JVMCPUTime LastMinuteSystemLoad AvailableProcessors （For Master） add: LostWorkers IsActiveMaster PartitionSize （For Worker） add: PushDataFailCount -> WriteDataFailCount ReplicateDataFailCount ReplicateDataWriteFailCount ReplicateDataCreateConnectionFailCount ReplicateDataConnectionExceptionCount ReplicateDataTimeoutCount SortedFileSize PushDataHandshakeFailCount RegionStartFailCount RegionFinishFailCount MasterPushDataHandshakeTime SlavePushDataHandshakeTime MasterRegionStartTime SlaveRegionStartTime MasterRegionFinishTime SlaveRegionFinishTime PotentialConsumeSpeed UserProduceSpeed WorkerConsumeSpeed DeviceOSFreeBytes DeviceCelebornFreeBytes push usedHeapMemory/usedDirectMemory fetch usedHeapMemory/usedDirectMemory replicate usedHeapMemory/usedDirectMemory remove: dup ReserveSlotsTime Change dashboard layout. Fix support for multiple labels. Modify some metrics docs. ### Why are the changes needed? For better use of metrics. ### Does this PR introduce _any_ user-facing change? Below metrics change name, extract some value to the label. DeviceOSFreeCapacity(B) -> DeviceOSFreeBytes DeviceOSTotalCapacity(B) -> DeviceOSTotalBytes DeviceCelebornFreeCapacity(B) -> DeviceCelebornFreeBytes DeviceCelebornTotalCapacity(B) -> DeviceCelebornTotalBytes push usedHeapMemory/usedDirectMemory fetch usedHeapMemory/usedDirectMemory replicate usedHeapMemory/usedDirectMemory ### How was this patch tested? Cluster test. Closes #1557 from onebox-li/improve-metrics. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-08 18:10:06 +08:00
Ethan Feng	76a42beab0	[CELEBORN-610][FLINK] Eliminate pluginconf and merge its content to CelebornConf ### What changes were proposed in this pull request? With PluginConf, it might be hard to understand why Celeborn needs two config classes. ### Why are the changes needed? Ditto. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? UT. Closes #1524 from FMX/CELEBORN-610. Authored-by: Ethan Feng <ethanfeng@apache.org> Signed-off-by: Ethan Feng <ethanfeng@apache.org>	2023-06-05 14:08:53 +08:00
Angerszhuuuu	218bfc78a5	[CELEBORN-629][DOC] Add doc about enable rac-awareness ### What changes were proposed in this pull request? Add doc about enabling rac-awareness ### Why are the changes needed? Document new features. ### Does this PR introduce _any_ user-facing change? Yes, the docs are updated. ### How was this patch tested? <img width="1085" alt="截屏2023-06-02 下午3 19 10" src="https://github.com/apache/incubator-celeborn/assets/46485123/c8c51a4c-40be-40ea-befd-3c369b9f7600"> Closes #1536 from AngersZhuuuu/CELEBORN-629. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-05 10:28:26 +08:00
Angerszhuuuu	3883fe2c80	[CELEBORN-623][FOLLUPUP] Refine doc about use ratis shell with RSS cluster ### What changes were proposed in this pull request? Refine this doc since: 1. It didn't mention our cluster default RPC type is `NETTY` 2. If the user use the ratis shell with `GRPC` but didn't know the ratis cluster is `NETTY`, the error is not clear and hard to debug. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1542 from AngersZhuuuu/CELEBORN-623-FOLLOWUP. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-02 22:09:05 +08:00
Angerszhuuuu	4df4775524	[CELEBORN-632][DOC] Add spark name space to spark specify properties (#1538 )	2023-06-02 21:48:56 +08:00
liyihe	188b069710	[CELEBORN-623][DOCS] Document how to change RPC type in `celeborn-ratis` ### What changes were proposed in this pull request? Ratis-shell uses GRPC by default. Celeborn supports Netty for Ratis; if `raft.rpc.type` is not specified, commands may fail, e.g. ``` org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 14.947369960s. [closed=[], open=[[buffered_nanos=14962358255, waiting_for_connection]]] ``` So I think we should update the document to mention how to change the RPC type in `celeborn-ratis`. ### Why are the changes needed? Improve user experience ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested. Closes #1530 from onebox-li/ratis-shell-default-rpc. Lead-authored-by: liyihe <liyihe@bigo.sg> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-02 20:23:09 +08:00
Angerszhuuuu	e18a5ea769	[CELEBORN-624] StorageManager should only remove expired app dirs (#1531 )	2023-06-02 11:33:33 +08:00
Ethan Feng	d33916e571	[CELEBORN-625] Add a config to enable/disable UnsafeRow fast write. (#1532 )	2023-06-01 20:55:45 +08:00
Angerszhuuuu	cf308aa057	[CLEBORN-595] Refine code frame of CelebornConf (#1525 )	2023-06-01 10:37:58 +08:00
Angerszhuuuu	6d5dd50915	[CELEBORN-595][FOLLOWUP] Fix change version to 0.3.0. (#1522 )	2023-05-30 20:12:56 +08:00
Angerszhuuuu	62681ba85d	[CELEBORN-595] Rename and refactor the configuration doc. (#1501 )	2023-05-30 15:14:12 +08:00
zhongqiangchen	f117cff776	[CELEBORN-618] [FLINK] worker side adds partition split configuration options (#1520 )	2023-05-30 14:13:31 +08:00
Binjie Yang	d30f45ad63	[CELEBORN-450][HELM] Configurable volumes in the values.yaml (#1508 ) * [CELEBORN-450] Configure the mount & volume in the Values.yaml * fix comments * fix wrong name * fix comments * fix typo * fix into array * Wiht User Note Comments * fix comments * Update charts/celeborn/templates/worker-statefulset.yaml --------- Co-authored-by: Cheng Pan <pan3793@gmail.com>	2023-05-29 13:48:23 +08:00
Angerszhuuuu	d244f44518	[CELEBORN-593] Refine some RPC related default configurations (#1498 )	2023-05-19 18:23:12 +08:00
Angerszhuuuu	615d9a111f	[CELEBORN-487] Remove wrong space of config SHUFFLE_CLIENT_PUSH_BLACK (#1500 )	2023-05-19 14:27:57 +08:00
Angerszhuuuu	811e192bbd	[CELEBORN-446] Support rack aware during assign slots for ROUNDROBIN (#1370 )	2023-05-18 13:58:51 +08:00
Ethan Feng	7015d2463a	[CELEBORN-583] Merge pooled memory allocators. (#1490 )	2023-05-18 10:37:30 +08:00
Angerszhuuuu	791d72d45f	[CELEBORN-590] Remove hadoop prefix of WORKER_WORKING_DIR (#1494 )	2023-05-17 17:57:27 +08:00
Angerszhuuuu	7c6cb2f3bb	[CELEBORN-588] Remove test conf's category (#1491 )	2023-05-17 17:37:28 +08:00
Angerszhuuuu	64a3534f71	[CELEBORN-584] Worker side should expose push/replicate/fetch Netty allocator metrics (#1489 )	2023-05-16 17:51:33 +08:00
Angerszhuuuu	d657f8268a	[CELEBORN-586] Add SystemMiscSource to indicate system running status (#1488 )	2023-05-16 14:03:07 +08:00
zhongqiangchen	5769c3fdc7	[CELEBORN-552] Add HeartBeat between the client and worker to keep alive (#1457 )	2023-05-10 19:35:51 +08:00
Angerszhuuuu	778b5440bc	[CELEBORN-556][BUG] ReserveSlot should not use default RPC time out since register shuffle max timeout is network timeout (#1461 )	2023-05-10 12:29:06 +08:00
Ethan Feng	3e0d779962	[CELEBORN-576] Add static identity provider and manually settable identity provider for non-hadoop environment. (#1480 )	2023-05-08 17:29:01 +08:00
Ethan Feng	91b757555e	[CELEBORN-570] Update docs about monitor and deployment. (#1478 )	2023-05-08 17:07:42 +08:00
Angerszhuuuu	ef4c12e0fe	[CELEBORN-565] FETCH_MAX_RETRIES should double when enable replicates (#1471 )	2023-04-28 14:27:35 +08:00
Angerszhuuuu	13ce04f8a1	[CELEBORN-557] HA_CLIENT_RPC_ASK_TIMEOUT should fallback to RPC_ASK_TIMEOUT (#1462 ) * [CELEBORN-557] HA_CLIENT_RPC_ASK_TIMEOUT should fallback to RPC_ASK_TIMEOUT	2023-04-26 15:19:34 +08:00
Shuang	0b2e4877bd	[CELEBORN-553] Improve IO (#1458 )	2023-04-25 21:14:06 +08:00
Angerszhuuuu	0c2d3e647d	[CELEBORN-532][METRICS] Refine push-related failure metrics (#1442 ) * [CELEBORN-532][METRICS] Refine push-related failure metrics	2023-04-21 17:05:43 +08:00
Angerszhuuuu	181c1bfcd6	[CELEBORN-524][PERF] CongestionControl call too much ChannelsLimiter onTrim cause CPU stuck or occupy too much CPU cause no cpu for handlePushData (#1428 )	2023-04-21 15:44:56 +08:00
Angerszhuuuu	6830cb61ef	[CELEBORN-540][Refactor] Add config entity of celeborn.rpc.io.threads (#1443 ) * [CELEBORN-540][CONF] Add config entity of celeborn.rpc.io.threads	2023-04-21 11:21:41 +08:00
Angerszhuuuu	e319b99a1c	[CELEBORN-527][DOC] Fix incorrect monitor the arrangement of documents (#1432 )	2023-04-17 11:12:19 +08:00
Angerszhuuuu	ecafbf41fc	[CELEBORN-516][FOLLOWUP] Remove removed RPC metrics in metric doc (#1431 )	2023-04-17 10:46:04 +08:00
cxzl25	13f772e0c0	[CELEBORN-525] Fix wrong parameter celeborn.push.buffer.size	2023-04-14 20:45:25 +08:00
Cheng Pan	fb7b311c89	[CELEBORN-499] Move version specific resource to main repo (#1429 ) * [CELEBORN-499] Move version specific resource to main repo * license	2023-04-14 16:20:51 +08:00
Ethan Feng	9cccfc9872	[CELEBORN-431][FLINK] Support dynamic buffer allocation in reading map partition. (#1407 )	2023-04-13 10:37:47 +08:00
Angerszhuuuu	e5722126e9	[CELEBORN-502][REFACTOR] Merge GetBlacklistResponse to HeartbeatFromApplication (#1408 ) * [CELEBORN-502][REFACTOR] Merge GetBlacklistResponse to HeartbeatFromApplication	2023-04-12 14:59:32 +08:00
Angerszhuuuu	cad2836e85	[CELEBORN-505] Fix typo of SHUFFLE_CHUCK_SIZE (#1411 )	2023-04-04 19:15:30 +08:00
Keyong Zhou	2e1598c011	[CELEBORN-485] Make celeborn.push.replicate.enabled default to false (#1394 )	2023-04-03 16:36:29 +08:00
Angerszhuuuu	bf46336d54	[CELEBORN-487][PERF] ShuffleClientSide support blacklist to avoid client side timeout in same worker multiple times (#1399 )	2023-04-03 11:50:04 +08:00
zhongqiangchen	cd92c423cd	[CELEBORN-475] Support extra tags for prometheus metrics (#1385 ) [CELEBORN-475] Support extra tags for prometheus metrics	2023-03-28 21:22:28 +08:00
Keyong Zhou	cb19ed1c66	[CELEBORN-479][PERF] Refactor DataPushQueue.takePushTask to avoid busy wait (#1386 )	2023-03-27 16:18:55 +08:00
Shuang	89b3f3887d	[CELEBORN-356] [FLINK] Support release single partition resource (#1314 )	2023-03-24 17:15:28 +08:00
Ethan Feng	0ebad677d7	[CELEBORN-434] Add constrain about memory manager's parameters. (#1356 )	2023-03-17 15:14:03 +08:00
Angerszhuuuu	4b334df7a6	[CELEBORN-399] Make fileSorterExecutors thread num can be customized (#1325 )	2023-03-10 21:10:43 +08:00
Keyong Zhou	dcedf7b0a9	[CELEBORN-348] Support fetchTime in load-aware slots assignment strategy (#1287 )	2023-03-02 18:31:50 +08:00
zhongqiangchen	cb76c4de4c	[CELEBORN-350][FLINK] Add PluginConf to be compatible with old configurations	2023-02-28 20:36:11 +08:00
Keyong Zhou	7adf1fca41	[CELEBORN-295] Optimize data push (#1232 ) * [CELEBORN-295] Add double buffer for sort pusher	2023-02-28 10:35:55 +08:00
Ethan Feng	0c8bb83114	[CELEBORN-234] Implement buffer stream. (#1221 )	2023-02-17 17:38:36 +08:00
Ethan Feng	3aacede5f8	[CELEBORN-283] Derive network layer for flink plugin. (#1222 )	2023-02-17 14:12:54 +08:00
jiaoqingbo	3a92b0d911	[CELEBORN-284] fix typo in CelebornConf (#1218 ) Co-authored-by: jiaoqb <jiaoqb@asiainfo.com>	2023-02-10 14:59:36 +08:00
Rex(Hui) An	bff6e91e0b	[CELEBORN-227] Support different push strategies to control the push speed (#1167 )	2023-02-07 14:24:30 +08:00
Rex(Hui) An	bb113ec9be	[CELEBORN-207] Support network congestion control (#1066 )	2023-02-07 12:06:18 +08:00
Angerszhuuuu	4b6f7e4593	[CELEBORN-239][IMPROVEMENT] Worker replicate should enable push data timeout too (#1185 )	2023-02-03 11:53:15 +08:00
Angerszhuuuu	04427f2b16	[CELEBORN-247] Add metrics for each user's quota usage (#1182 )	2023-02-02 18:31:08 +08:00
Angerszhuuuu	122da47815	[CELEBORN-241][IMPROVEMENT] limit inflight push timeout should > push data timeout (#1179 )	2023-01-30 11:57:07 +08:00
zy.jordan	c5be79ee3d	[CELEBORN-55][FEATURE] Split maxReqsInFlight limitation into level of target worker (#1102 )	2023-01-20 10:18:45 +08:00
Ethan Feng	a239f9f284	[CELEBORN-228]Refactor PartitionFileSorter to avoid specific JDK dependency. (#1168 )	2023-01-16 20:06:47 +08:00
zy.jordan	bb96700415	[CELEBORN-223] The default rpc thread num of pushServer/replicateServer/fetchServer should be the number of total of Flusher's thread (#1163 )	2023-01-16 12:03:46 +08:00
Keyong Zhou	fa7ba43136	[CELEBORN-225] Add global default configuration for number of flusher… (#1165 )	2023-01-14 13:20:44 +08:00
zhongqiangczq	411ab09ffb	[CELEBORN-158][Flink] Add ShuffleServiceFactory to Support MapPartition in … (#1105 )	2023-01-13 16:38:46 +08:00
Shuang	1332362bff	[CELEBORN-213] Add configuration for whether to close idle connections in client side (#1157 )	2023-01-10 19:13:33 +08:00
zy.jordan	19197b9190	[CELEBORN-214] Push/Replicate/Fetch io threads default value is 16 (#1158 )	2023-01-10 17:46:56 +08:00
Angerszhuuuu	e155ec122a	[CELEBORN-190] doPushMergedData should also support revive multiple times, not only twice (#1136 )	2023-01-10 11:39:40 +08:00
Angerszhuuuu	415452d9c4	[CELEBORN-189][IMPROVEMENT] PushDataFailedSlave should add slave worker to blacklist (#1135 )	2023-01-05 20:12:07 +08:00
RexAn	6432a129be	[CELEBORN-61][CELEBORN-62][FOLLOW_UP] Fix some issues for slow start (#1119 )	2022-12-29 12:07:20 +08:00
Ethan Feng	5aa959a335	[CELEBORN-157] Change prefix of configurations to celeborn. (#1104 )	2022-12-21 15:17:28 +08:00
Keyong Zhou	2f0682265e	[CELEBORN-119] Add timeout for pushdata (#1097 )	2022-12-20 20:40:42 +08:00
nafiy	c931663e5f	[CELEBORN-110][REFACTOR] Notify critical error after collecting a certain number of non-critical error (#1055 )	2022-12-16 15:47:36 +08:00
nafiy	2e37830a0f	[CELEBORN-139][BUG] Fix read wrong yaml file format when loading config (#1083 )	2022-12-14 20:56:04 +08:00
Angerszhuuuu	de3ef0d694	[CELEBORN-102][REFACTOR] TIMEOUT default value should be changed with network timeout (#1047 ) * [CELEBORN-102][REFACTOR] TIMEOUT default value should be changed with network timeout	2022-12-06 14:41:23 +08:00
Ethan Feng	acfaf59ab3	[CELEBORN-91] Refactor memory tracker to support read buffer. (#1038 ) * [CELEBORN-91] Refactor memory tracker to support read buffer.	2022-12-05 15:38:43 +08:00
nafiy	8e384cda5a	[CELEBORN-88][REFACTOR] Revive/PartitionSplit should set separated timeout configuration (#1046 )	2022-12-05 10:36:43 +08:00
nafiy	44d45c2a27	[CELEBORN-90][REFACTOR] GetReducerFileGroup should support separated timeout configuration (#1045 )	2022-12-02 22:53:51 +08:00
nafiy	13e1e24035	[CELEBORN-86][REFATCOR] Register shuffle should have separated timeout configuration (#1031 ) * [CELEBORN-86][REFATCOR] Register shuffle should have separated timeout configuration	2022-12-01 18:39:56 +08:00
nafiy	d584211a75	[CELEBORN-95][REFACTOR]Rename CLIENT_RPC_ASK_TIMEOUT to HA_CLIENT_RPC_ASK_TIMEOUT (#1037 )	2022-12-01 11:57:02 +08:00
zhongqiangczq	898d1126a6	[CELEBORN-11] ShuffleClient supports MapPartition shuffle write: send handshake/regionstart/regionfinish (#1035 )	2022-12-01 11:20:55 +08:00
Angerszhuuuu	d26e73209b	[CELEBORN-76] Support batch commit hard split partition before stage end	2022-11-29 13:09:01 +08:00
Cheng Pan	9bf4c65357	[CELEBORN-72][DOCS] Remove unused website resources from main repo (#1014 )	2022-11-28 09:47:30 +08:00
Keyong Zhou	f8bb2cd47d	[CELEBORN-12]Retry on CommitFile request (#1011 )	2022-11-26 20:56:24 +08:00
Keyong Zhou	9214b82181	[CELEBORN-68] Client might fetch incorrect data chunk (#1010 )	2022-11-26 18:06:06 +08:00
Ethan Feng	ee243f286d	[CELEBORN-4] Add metrics about top disk used apps. (#985 )	2022-11-22 20:06:36 +08:00
Gabriel	5ecb09d62a	[ISSUE-911] Decrease numConnectionsPerPeer to achieve better performance (#983 )	2022-11-20 11:46:17 +08:00
leesf	3699683a3b	Fix and migrate some configs (#927 )	2022-11-07 09:41:38 +08:00
Kerwin Zhang	db08d49032	[FEATURE] Support columnar shuffle codegen (#915 )	2022-11-04 20:54:13 +08:00
Angerszhuuuu	38e15d89e6	[ISSUE-902][IMPROVEMENT][FOLLOWUP] LifecycleManager should reserve blacklist with irrecoverable status (#914 )	2022-11-04 15:54:45 +08:00
Angerszhuuuu	87fcfa767f	[ISSUE-887][REFACTOR] Configuration type convert to Enum (#888 ) * [ISSUE-332][FOLLOWUP] Add deps in worker's pom * [Refactor] Modify package name of utils to keep consistence * [Refactor] Modify package name of utils to keep consistence * [REFACTOR] Remove unused isRegistered in controller * [ISSUE-887][REFACTOR] Configuration type convert to Enum * update * update * Update RssShuffleManager.java	2022-10-29 13:41:06 +08:00
Cheng Pan	d7be6006e7	Migrate network related conf to structured conf system (#875 ) * Migrate network related conf to structured conf system * migrate * fix * fix * worker * fix * nit * review * nit	2022-10-28 10:45:52 +08:00
Angerszhuuuu	d283cca4e1	[ISSUE-869][REFACTOR] Migrate partition size/sorter related conf to Celeborn ConfigEntity (#870 )	2022-10-27 16:49:55 +08:00
Angerszhuuuu	26dcc118c6	[ISSUE-871][REFACTOR] Migrate Worker conf to Celeborn Configuration System (#873 ) * [ISSUE-871][REFACTOR] Migrate Worker conf to Celeborn Configuration System	2022-10-27 15:35:29 +08:00
Angerszhuuuu	399236c880	[ISSUE-849][REFACTOR] Migrate master and common Celeborn Configuration System (#850 )	2022-10-26 17:09:27 +08:00
Angerszhuuuu	89c3013122	[ISSUE-851][REFACTOR] Migrate quota configruation to Celeborn Configuration System (#852 ) * [ISSUE-851][REFACTOR] Migrate quota configruation to Celeborn Configuration System	2022-10-26 14:09:44 +08:00
nafiy	e44e8c9610	[ISSUE-828][REFACTOR] Migrate memory tracker related configs to ConfigEntry (#831 ) * [ISSUE-828][REFACTOR] Migrate memory tracker related configs to ConfigEntry * Fix based on review * update doc * resolve review feedback * fix * Fix based on review * fix based on review	2022-10-25 21:16:53 +08:00
Ethan Feng	8800fc4a8e	[Refactor] Refine rpc cache configs (#853 ) * refine rpc cache configs. * update. * update. * update.	2022-10-25 20:28:18 +08:00
Ethan Feng	45ef716737	[Feature] Cache GetReducerFileGroupResponse to avoid lifecycle manager oom. (#792 )	2022-10-25 16:16:44 +08:00
Cheng Pan	e71c0228aa	Migrate columnar shuffle configurations to ConfigEntry (#844 )	2022-10-25 14:26:11 +08:00
AngersZhuuuu	2ebf873b3c	[ISSUE-845][REFACTOR] Migrate partition split related conf to Celeborn Configuration System (#846 ) [ISSUE-845][REFACTOR] Migrate partition split related conf to Celeborn Configuration System	2022-10-25 10:54:45 +08:00
AngersZhuuuu	0bd0a3e9f4	[ISSUE-847][REFACTOR] Migrate codec conf to Celeborn Configuration System (#848 ) * [ISSUE-847][REFACTOR] Migrate codec conf to Celeborn Configuration System * Update CelebornConf.scala * follow comments * update * update * update * Update client.md	2022-10-25 09:16:46 +08:00
Cheng Pan	e3d649fff3	Change slot to slots for consistency (#843 )	2022-10-24 20:49:28 +08:00
AngersZhuuuu	0fdb19065a	[ISSUE-841][REFACTOR] Migrate shuffle client side conf to Celeborn Configuration System (#842 )	2022-10-24 20:48:48 +08:00
Cheng Pan	8d7d397e71	Fix Configuration page and polish naming (#838 ) * Fix Configuration page and polish naming * nit * nit * comment	2022-10-24 12:46:25 +08:00
Ethan Feng	392a252baa	[FOLLOWUP][ISSUE-813]Update doc and fix typo. (#825 )	2022-10-22 23:02:22 +08:00
nafiy	1a8a36e8fe	[ISSUE-812][Refactor] Migrate metrics system related configs to ConfigEntry (#821 )	2022-10-21 13:57:58 +08:00
Ethan Feng	5c761a8df3	[ISSUE-813][Refactor] Refactor flusher configurations. (#813 ) * Refactor flusher configurations. * Refactor flusher configurations. * Update. * remove brackets. * update docs. * rename. * update. * update docs. * update. * update. * update. * update. * update. * update. * update. * format. * update. * update.	2022-10-20 15:23:17 +08:00
AngersZhuuuu	23c65a27a9	[ISSUE-798][REFACTOR] Migrate worker-recover related conf to ConfigEntry (#799 )	2022-10-19 16:42:00 +08:00
Cheng Pan	cb07cf62c0	Auto generate configuration docs (#794 )	2022-10-19 10:50:22 +08:00
Cheng Pan	ea67f4e060	Introduce categories to ConfigEntry and migrate configurations (#775 )	2022-10-17 16:56:54 +08:00
Cheng Pan	f01a696313	Migrate and refactor configuration for master endpoints (#752 )	2022-10-11 21:33:21 +08:00
AngersZhuuuu	bbb4f8e225	[ISSUE-306][IMPROVEMENT] Handle change partition request in batch (#622 )	2022-10-10 18:31:37 +08:00
AngersZhuuuu	db9ce36608	[ISSUE-690][DOC] Storage resource quota doc (#703 )	2022-10-09 20:01:50 +08:00
Keyong Zhou	a2d2379153	[DOC] Replace RSS with Celeborn in docs (#715 )	2022-10-06 10:37:46 +08:00

257 Commits