### What changes were proposed in this pull request?
1.To support `celeborn.storage.activeTypes` in Client.
2.Master will ignore slots for "UNKNOWN_DISK".
### Why are the changes needed?
Enable client application to select storage types to use.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
GA and cluster.
Closes#2045 from FMX/B1081.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
### What changes were proposed in this pull request?
1. To support arbitrary Ratis configs
2. To support Ratis client rpc timeout
### Why are the changes needed?
After some digs that I found out Celeborn never changed the default config of ratis client's timeout.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA and cluster.
Closes#1969 from FMX/CELEBORN-1021.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
### What changes were proposed in this pull request?
<img width="1460" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/ac3b29be-7c39-4c18-b71d-0e243797273e">
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
```
23/10/16 03:31:13,399 WARN [pool-1-thread-1-ScalaTest-running-CelebornConfSuite] CelebornConf: The configuration key 'celeborn.worker.storage.baseDir.prefix' has been deprecated in v0.4.0 and may be removed in the future. Please use celeborn.worker.storage.dirs
23/10/16 03:31:13,399 WARN [pool-1-thread-1-ScalaTest-running-CelebornConfSuite] CelebornConf: The configuration key 'celeborn.worker.storage.baseDir.number' has been deprecated in v0.4.0 and may be removed in the future. Please use celeborn.worker.storage.dirs
```
Closes#1993 from cxzl25/CELEBORN-299.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Replace
```properties
celeborn.metrics.master.prometheus.host
celeborn.metrics.master.prometheus.port
celeborn.metrics.worker.prometheus.host
celeborn.metrics.worker.prometheus.port
```
With
```properties
celeborn.master.http.host
celeborn.master.http.port
celeborn.worker.http.host
celeborn.worker.http.port
```
### Why are the changes needed?
The `celeborn.master.metrics.prometheus.port` and `celeborn.metrics.worker.prometheus.port` bind port not only serve prometheus metrics, but also provide some useful API services.
https://celeborn.apache.org/docs/latest/monitoring/#rest-api
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#1919 from cxzl25/CELEBORN-983.
Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Mention metrics name change in Migration Guide
### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/1939
### Does this PR introduce _any_ user-facing change?
Yes, docs updated.
### How was this patch tested?
Review.
Closes#1950 from pan3793/CELEBORN-1007-followup.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Restore titles in migration guide
### Why are the changes needed?
Make title in migration guide consistent.
### Does this PR introduce _any_ user-facing change?
Yes, docs changed.
### How was this patch tested?
Pass GA.
Closes#1949 from pan3793/CELEBORN-909-followup.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Change the `.version("0.3.2")` to `.version("0.3.1")`
### Why are the changes needed?
0.3.1 is not release yet.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes#1948 from pan3793/minor-version.
Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#1943 from cxzl25/CELEBORN-985.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Reduce default disk Check Interval
### Why are the changes needed?
since https://github.com/apache/incubator-celeborn/pull/1909 ,In PushDataHandler#checkDiskFull method,Added check logic for DiskInfo status, the default disk Check Interval should be reduced
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
PASS GA
Closes#1915 from jiaoqingbo/979.
Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
After #1829 we set `celeborn.worker.directMemoryRatioToResume` default value from `0.5` to `0.7`.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
No
Closes#1836 from zwangsheng/CELEBORN-909.
Lead-authored-by: zwangsheng <2213335496@qq.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test
Closes#1795 from cfmcgrady/sbt-docs.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#1770 from AngersZhuuuu/CELEBORN-851.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#1731 from AngersZhuuuu/CELEBORN-808.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Rename remain rss related class name and filenames etc...
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#1664 from AngersZhuuuu/CELEBORN-751.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
as title
### Why are the changes needed?
mention configuration behavior change in migration guide
### Does this PR introduce _any_ user-facing change?
Yes, the migration guide is updated
### How was this patch tested?
review
Closes#1673 from pan3793/CELEBORN-637-followup.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Provide a new SparkShuffleManager to replace RssShuffleManager in the future
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#1667 from AngersZhuuuu/CELEBORN-754.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
This pull PR is an integral component of #1639 . It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC.
### Why are the changes needed?
In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests.
Closes#1650 from cfmcgrady/primary-replica-metrics.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Unify all blacklist related code and comment
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
…nse with lower versions
### What changes were proposed in this pull request?
The master side will check HeartbeatFromApplication's reply field. if reply is true then it replies HeartbeatFromApplicationResponse otherwise OneWayMessageResponse.
The reply field is default false before the version 0.2.1, so master can be compatible with older client version
### Why are the changes needed?
Before the version `0.2.1`, the response of HeartbeatFromApplication is` OneWayMessageResponse`, but from `0.3.0`, the response of HeartbeatFromApplication is modified to `HeartbeatFromApplicationResponse`.
if the version of `client side `is `0.2.1` and the version of `server side is 0.3.0`, the `compatiblity issue `will occur.
The following compatiblity error will be printted.
``` java
java.io.InvalidObjectException: enum constant HEARTBEAT_FROM_APPLICATION_RESPONSE does not exist in class org.apache.celeborn.common.protocol.MessageType
at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:2157) ~[?:1.8.0_362]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1662) ~[?:1.8.0_362]
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2430) ~[?:1.8.0_362]
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2354) ~[?:1.8.0_362]
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2212) ~[?:1.8.0_362]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1668) ~[?:1.8.0_362]
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502) ~[?:1.8.0_362]
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460) ~[?:1.8.0_362]
at org.apache.celeborn.common.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?]
```
``` java
Caused by: java.lang.ClassCastException: Cannot cast org.apache.celeborn.common.protocol.message.ControlMessages$HeartbeatFromApplicationResponse to org.apache.celeborn.common.protocol.message.ControlMessages$OneWayMessageResponse$
at java.lang.Class.cast(Class.java:3369) ~[?:1.8.0_362]
at scala.concurrent.Future.$anonfun$mapTo$1(Future.scala:500) ~[scala-library-2.12.15.jar:?]
at scala.util.Success.$anonfun$map$1(Try.scala:255) ~[scala-library-2.12.15.jar:?]
at scala.util.Success.map(Try.scala:213) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:67) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:82) ~[scala-library-2.12.15.jar:?]
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:59) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:875) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:110) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.Promise.trySuccess(Promise.scala:94) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.Promise.trySuccess$(Promise.scala:94) ~[scala-library-2.12.15.jar:?]
at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187) ~[scala-library-2.12.15.jar:?]
at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:218) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?]
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
The pr is tested manually and the testing process is as follows:
1. server side is deploy using the code of latest branch-0.3.
2. spark client is deploy the version of 0.2.1, then run spark-sql to execute 3 tpcds queries( query1.sql/querey2/quere3.sql whose datasize is 1T), finnally verify that the queries are executed successfully and no above compatiblity error printted
3. spark client is deploy the version of 0.3.0, then run spark-sql to execute 3 tpcds queries( query1.sql/querey2/quere3.sql whose datasize is 1T), finnally verify that the queries are executed successfully and no above compatiblity error printted
This patch had conflicts when merged, resolved by
Committer: Cheng Pan <chengpan@apache.org>
Closes#1635 from zhongqiangczq/heartbeat2.
Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Celeborn generate hadoop configuration should respect Celeborn conf
### Why are the changes needed?
In spark client side we should write like `spark.celeborn.hadoop.xxx.xx`
In server side we should write like `celeborn.hadoop.xxx.xxx`
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#1629 from AngersZhuuuu/CELEBORN-719.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Change Celeborn Master URL from `rss://<host>:<port>` to `celeborn://<host>:<port>`
### Why are the changes needed?
Respect the project name.
### Does this PR introduce _any_ user-facing change?
Yes, migration guide is updated accordingly.
### How was this patch tested?
Pass GA.
Closes#1624 from pan3793/CELEBORN-715.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove environment variables `CELEBORN_MASTER_HOST` and `CELEBORN_MASTER_PORT`, and makes `CELEBORN_LOCAL_HOSTNAME` takes effect on both master and worker.
### Why are the changes needed?
There are many different ways to configure the master/worker host and port, which makes the thing complex and inconsistent.
After this change,
#### master
1. cli args `--host` `--port` takes the highest priority
2. then lookup env `CELEBORN_LOCAL_HOSTNAME`
3. things are different when HA enabled and disabled
3.1. when HA is disabled, lookup configurations `celeborn.master.host` and `celeborn.master.port`
3.2. when HA is enabled, each node needs to know the whole cluster info,
```
celeborn.master.ha.node.1.host clb-1
celeborn.master.ha.node.1.port 9097
celeborn.master.ha.node.2.host clb-2
celeborn.master.ha.node.2.port 9097
celeborn.master.ha.node.3.host clb-3
celeborn.master.ha.node.3.port 9097
```
in addition, `celeborn.master.ha.node.id=1` can be used to indicate the node id, otherwise, the master will try to bind each host to match the node id.
#### worker
1. cli args `--host` `--port` takes the highest priority
2. then lookup env `CELEBORN_LOCAL_HOSTNAME`
things are simple than the master case because each worker is not required to know others.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
UT.
Closes#1616 from pan3793/CELEBORN-707.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Extend doc about migration from 0.2.1 to 0.3.0. Added the following contents:
<img width="1084" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/7a9d172c-09ba-48b6-9f5c-73a8b13d035f">
### Why are the changes needed?
ditto
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#1612 from waitinfuture/702.
Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
It was discussed during the last meeting, but abandoned due to the complication.
### Why are the changes needed?
Make the configuration unified.
### Does this PR introduce _any_ user-facing change?
Yes, but the legacy configurations still take effect.
### How was this patch tested?
New UTs.
Closes#1549 from pan3793/CELEBORN-638.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Refresh celeborn configurations in doc
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#1592 from AngersZhuuuu/CELEBORN-680.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>