Commit Graph

274 Commits

Author SHA1 Message Date
sychen
b94fea8e17
[CELEBORN-1207] SBT http repository documentation
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2201 from cxzl25/CELEBORN-1207.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: cxzl25 <3898450+cxzl25@users.noreply.github.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-01-02 22:12:28 +08:00
Cheng Pan
77e468161d [CELEBORN-891] Remove pipeline feature for sort based writer
### What changes were proposed in this pull request?

Remove pipeline feature for sort based writer

### Why are the changes needed?

The pipeline feature was added as part of CELEBORN-295 for performance. Later, an unresolvable issue that would crash the JVM was identified in https://github.com/apache/incubator-celeborn/pull/1807, and after discussion, we decided to delete this feature.

### Does this PR introduce _any_ user-facing change?

No. The pipeline feature is disabled by default, so nothing changes for users who use the default settings.

### How was this patch tested?

Pass GA.

Closes #2196 from pan3793/CELEBORN-891.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-01 10:42:17 +08:00
SteNicholas
e7e39a51be
[CELEBORN-1189] Introduce RunningApplicationCount metric and /applications API to record running applications of worker
### What changes were proposed in this pull request?

Introduce `RunningApplicationCount` metric and `/applications` API to record running applications for Celeborn worker.

### Why are the changes needed?

The `RunningApplicationCount` metric currently only tracks the count of running applications in the cluster for the master. Meanwhile, the `/listTopDiskUsedApps` API lists the application ids with the top disk usage for both master and worker. Therefore, a `RunningApplicationCount` metric and an `/applications` API are introduced to record the running applications of a worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2172 from SteNicholas/CELEBORN-1189.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-27 09:51:16 +08:00
SteNicholas
277f7ced57
[CELEBORN-1187] Unify the size and file count of active shuffle metrics for master and worker
### What changes were proposed in this pull request?

Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.

### Why are the changes needed?

`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2171 from SteNicholas/CELEBORN-1187.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: 蒋晓峰 <jiangxiaofeng@bilibili.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-22 17:07:39 +08:00
liangyongyuan
4304be1a60 [CELEBORN-1172][SPARK] Support dynamic switch shuffle push write mode based on partition number
### What changes were proposed in this pull request?
Dynamically determine the writing mode in Spark based on the number of partitions.

### Why are the changes needed?
Enhance the flexibility of shuffle writes to improve performance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs.

Closes #2160 from lyy-pineapple/dynamic-write-mode.

Lead-authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-21 16:58:51 +08:00
mingji
4dacf72a6d
[CELEBORN-1150] support io encryption for spark
### What changes were proposed in this pull request?
1. Support IO encryption for Spark.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and manual tests on a cluster.

Closes #2135 from FMX/B1150.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-19 11:44:05 +08:00
Chandni Singh
600bd53616 [CELEBORN-1180] Changed the version of Sasl Auth related config to 0.5
### What changes were proposed in this pull request?
Changes the version of the config to 0.5 given that 0.4 will be released soon.

### Why are the changes needed?
See above.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
NA

Closes #2165 from otterc/CELEBORN-1180.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-16 13:45:46 +08:00
zky.zhoukeyong
01feb93abb [CELEBORN-1167] Avoid calling parmap when destroy slots
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When LifecycleManager calls destroy slots

This PR fixes the fourth one. In more detail, this PR eliminates `parmap` when destroying slots, and also replaces `askSync` with `ask`.
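
As a rough illustration of the idea (not the actual Celeborn code, which is Scala), the sketch below issues one non-blocking request per worker and waits for all replies together, instead of parking a thread per worker on a synchronous call. The names `ask`, `destroy_slots`, and `run_destroy` and the reply format are hypothetical.

```python
import asyncio


async def ask(worker: str, request: str) -> str:
    """Hypothetical non-blocking RPC that completes when the worker replies."""
    await asyncio.sleep(0)  # stand-in for network I/O; no thread is blocked here
    return f"{worker}:{request}:ok"


async def destroy_slots(workers: list[str]) -> dict[str, str]:
    """Send one request per worker, then collect all replies in one place."""
    replies = await asyncio.gather(*(ask(w, "destroy") for w in workers))
    return dict(zip(workers, replies))


def run_destroy(workers: list[str]) -> dict[str, str]:
    """Synchronous entry point for callers that are not async themselves."""
    return asyncio.run(destroy_slots(workers))
```

With this shape, the number of in-flight requests grows with the number of workers, but the number of threads does not.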

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2156 from waitinfuture/1167.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-15 18:30:31 +08:00
Fu Chen
41df4ebbea [CELEBORN-1156][BUILD] SBT publish support
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

Yes, the user can publish shade clients via SBT

### How was this patch tested?

```shell
docker run -d -p 8081:8081 sonatype/nexus3
```

```shell
export SONATYPE_SNAPSHOTS_URL=http://192.168.3.46:8081/repository/maven-snapshots/
export SONATYPE_RELEASES_URL=http://192.168.3.46:8081/repository/maven-releases/
export ASF_USERNAME=admin
export ASF_PASSWORD=123456
```

- Publish the shade client for Spark 3.4:
```shell
./build/sbt -Pspark-3.4 celeborn-client-spark-3-shaded/publish
```

<img width="1673" alt="Screenshot 2023-12-08 10.22.07 PM" src="https://github.com/apache/incubator-celeborn/assets/8537877/1e87e7e2-cf3b-4bc0-8272-0f5b03ee65bf">

- Publish the shade client for Flink 1.18:

```shell
$ ./build/sbt -Pflink-1.18 celeborn-client-flink-1_18-shaded/publish
```
<img width="1676" alt="Screenshot 2023-12-08 10.25.28 PM" src="https://github.com/apache/incubator-celeborn/assets/8537877/62d0c3c4-e105-4e8a-8d8d-e78650a2eb09">

- Publish the shade client for MapReduce:
```shell
$ ./build/sbt -Pmr celeborn-client-mr-shaded/publish
```
<img width="1672" alt="Screenshot 2023-12-08 10.25.47 PM" src="https://github.com/apache/incubator-celeborn/assets/8537877/563d5ad5-fa6d-46fc-9465-8279ef96385a">

Closes #2129 from cfmcgrady/sbt-publish.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-12-15 11:22:35 +08:00
Chandni Singh
a03ce6c165 [CELEBORN-1157] Add client-side support for Sasl Authentication in the transport layer
### What changes were proposed in this pull request?
This adds the client side Sasl authentication support in the transport layer. Most of this code is taken from Apache Spark.

### Why are the changes needed?
The changes are needed for adding authentication to Celeborn. See [CELEBORN-1011](https://issues.apache.org/jira/browse/CELEBORN-1011).

### Does this PR introduce _any_ user-facing change?
Added a configuration for Sasl request timeout

### How was this patch tested?
Will be adding `CelebornSaslSuiteJ.java` (https://github.com/apache/incubator-celeborn/pull/2105) that tests the end-to-end Sasl flow.

Closes #2139 from otterc/CELEBORN-1157.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-14 22:52:49 +08:00
SteNicholas
850d3199ef [CELEBORN-1164] Introduce FetchChunkFailCount metric to expose the count of fetching chunk failed in current worker
### What changes were proposed in this pull request?

Introduce the `FetchChunkFailCount` metric to expose the count of failed chunk fetches in the current worker.

### Why are the changes needed?

Metrics for the count of failed PushData and PushMergedData requests in the current worker are already supported. It is better to also support a `FetchChunkFailCount` metric that exposes the count of failed chunk fetches in the current worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal test.

Closes #2151 from SteNicholas/CELEBORN-1164.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 23:01:16 +08:00
Fu Chen
0f2a9a3a63 [CELEBORN-1160][FOLLOWUP] Update the version for celeborn.client.rpc.shared.threads to 0.3.2
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Since we are backporting #2145 to branch-0.3, and the configuration entry `celeborn.client.rpc.shared.threads` in #2145 has a start version of 0.4.0, this update aligns the version accordingly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #2153 from cfmcgrady/celeborn-1160-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 15:12:50 +08:00
zky.zhoukeyong
92bebd305d [CELEBORN-1160] Avoid calling parmap when commit files
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When StorageManager calls close

This PR fixes the first one. In more detail, this PR eliminates `parmap` when committing files, and also replaces `askSync` with `ask`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2145 from waitinfuture/1160.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 14:36:48 +08:00
wangshengjie
8516df4beb [CELEBORN-1151] Request slots when register shuffle should filter the workers excluded by application
### What changes were proposed in this pull request?
When requesting slots, filter out the workers excluded by the application.

### Why are the changes needed?
If a worker is alive but cannot serve requests, registering a shuffle removes the worker from the application client's exclude list, and the next shuffle may reserve slots on this worker. This causes the application to revive unexpectedly.

### Does this PR introduce _any_ user-facing change?
Yes, request slots will filter workers excluded by application

### How was this patch tested?
UT.

Closes #2131 from wangshengjie123/fix-request-slots-blacklist.

Authored-by: wangshengjie <wangshengjie3@xiaomi.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-12 10:02:18 +08:00
qinrui
04a1e90207 [CELEBORN-1122] Metrics supports json format
### What changes were proposed in this pull request?
If the user collects monitoring metrics with a tool other than Prometheus, metrics in JSON format are more user-friendly. This PR adds JSON format support for metrics.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Metrics supports JSON format

### How was this patch tested?
Cluster test.

Closes #2089 from suizhe007/CELEBORN-1122.

Authored-by: qinrui <qr7972@gmail.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-12-06 09:24:28 +08:00
SteNicholas
406cef8392 [CELEBORN-1052][FOLLOWUP] Introduce dynamic ConfigService at SystemLevel and TenantLevel
### What changes were proposed in this pull request?

Follow up #2100. This mainly moves the code from #2100 from the Scala package to the Java package. Meanwhile, `FsConfigServiceImpl#refresh` should return directly instead of refreshing configs.

### Why are the changes needed?

This PR follows up on the dynamic `ConfigService` at `SystemLevel` and `TenantLevel` introduced in #2100. Dynamic configuration is configuration that can be changed at runtime as needed. The implementation of `ConfigService` is written in Java but was placed in a Scala package, so the spotless plugin did not format it well. Moving the code to the Java package produces many code style changes in this pull request.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`ConfigServiceSuiteJ`.

Closes #2125 from SteNicholas/CELEBORN-1052.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-12-04 19:03:59 +08:00
exmy
8a15396cb6 [CELEBORN-1145] Separate clientPushBufferMaxSize from CelebornInputStreamImpl
### What changes were proposed in this pull request?
The `clientPushBufferMaxSize` config is also used by `CelebornInputStreamImpl`. It is a push-side config and should not be used by the fetch side. This PR introduces a fetch config to replace it.

### Why are the changes needed?

As above

### Does this PR introduce _any_ user-facing change?

Yes, a new config `celeborn.client.fetch.buffer.size` is introduced.

### How was this patch tested?

Pass CI

Closes #2118 from exmy/celeborn-1145.

Authored-by: exmy <xumovens@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-11-30 18:56:03 +08:00
SteNicholas
4dfcd9b56b [CELEBORN-1092] Introduce JVM monitoring in Celeborn Worker using JVMQuake
### What changes were proposed in this pull request?

Introduce JVM monitoring in Celeborn Worker using JVMQuake to enable early detection of memory management issues and facilitate fast failure.

### Why are the changes needed?

When memory management in the Celeborn Worker gets out of control, we typically use jvmkill as a remedy: it kills the process and generates a heap dump for post-analysis. However, even with jvmkill protection, we may still hit problems caused by the JVM running out of memory, such as repeated Full GCs that perform no useful work during the pause time. Since the JVM does not exhaust 100% of resources, jvmkill is not triggered. JVMQuake is therefore introduced to monitor GC behavior at a finer granularity, enabling early detection of memory management issues and fast failure. It follows the principle of [jvmquake](https://github.com/Netflix-Skunkworks/jvmquake), a JVMTI agent that attaches to your JVM and automatically signals and kills it when the program has become unstable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`JVMQuakeSuite`

Closes #2061 from SteNicholas/CELEBORN-1092.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-11-28 20:45:08 +08:00
mingji
113311df3e [CELEBORN-1081][FOLLOWUP] Remove UNKNOWN_DISK and allocate all slots to disk
### What changes were proposed in this pull request?
1. Remove UNKNOWN_DISK from StorageInfo.
2. Enable load-aware slots allocation when there is HDFS.

### Why are the changes needed?
To support the application's config about available storage types.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
GA and Cluster.

Closes #2098 from FMX/B1081-1.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-28 11:26:00 +08:00
Shuang
ad57c8b91e
[CELEBORN-1052] Introduce dynamic ConfigService at SystemLevel and TenantLevel
### What changes were proposed in this pull request?
This PR introduces a dynamic ConfigService at SystemLevel and TenantLevel. Dynamic configuration is configuration that can be changed at runtime as needed, and it can be used at the system level or the tenant level. When applying dynamic configuration, the priority order is as follows: tenant level overrides system level, which in turn overrides static configuration (CelebornConf). If the tenant-level configuration is missing, the system-level configuration is used. If the system-level configuration is also missing, CelebornConf is used as the default value.
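
The priority order described here can be sketched as a layered lookup. This is a minimal illustration only, not the `ConfigService` API; the function name `resolve`, the plain-dict layers, and the sample key are assumptions made for the example.

```python
from typing import Mapping, Optional


def resolve(
    key: str,
    tenant: Mapping[str, str],
    system: Mapping[str, str],
    static: Mapping[str, str],
) -> Optional[str]:
    """Return the first value found, checking tenant, then system, then static."""
    for layer in (tenant, system, static):
        if key in layer:
            return layer[key]
    return None  # the key is not defined at any level


# Hypothetical layers: the key name below is made up for illustration.
static_conf = {"example.quota": "10"}
system_conf = {"example.quota": "20"}
tenant_conf = {"example.quota": "30"}
```

For instance, `resolve("example.quota", tenant_conf, system_conf, static_conf)` picks the tenant value, and passing an empty tenant layer falls back to the system value.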

There are several other tasks related to this feature that will be implemented in the future.

- [ ]  [Add isDynamic property for CelebornConf](https://issues.apache.org/jira/browse/CELEBORN-1051)
- [ ]  [Support DB based Configserver](https://issues.apache.org/jira/browse/CELEBORN-1054)
- [ ]  [Add restAPI for configuration management](https://issues.apache.org/jira/browse/CELEBORN-1056)

### Why are the changes needed?
The current server configuration (CelebornConf) is static: when the configuration changes, the service needs to be restarted. This PR introduces a dynamic configuration solution that the server side can use as needed. Because tenant-level configuration (such as tenant-level dynamic quota control) is expected to be supported in the future, this PR also supports dynamic tenant-level configuration and provides a default implementation based on the file system.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #2100 from RexXiong/CELEBORN-1052.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-27 12:17:05 +08:00
Erik.fang
aee41555c6 [CELEBORN-955] Re-run Spark Stage for Celeborn Shuffle Fetch Failure
### What changes were proposed in this pull request?
Currently, Celeborn uses replication to handle lost shuffle data for the Celeborn shuffle reader. This PR implements an alternative solution based on Spark stage resubmission.

Design doc:
https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8/edit

### Why are the changes needed?
Spark stage resubmission uses less resources compared with replication, and some Celeborn users are also asking for it

### Does this PR introduce _any_ user-facing change?
A new config, `celeborn.client.fetch.throwsFetchFailure`, is introduced to enable this feature.

### How was this patch tested?
Two UTs are attached, and we also tested it in Ant Group's dev Spark cluster.

Closes #1924 from ErikFang/Re-run-Spark-Stage-for-Celeborn-Shuffle-Fetch-Failure.

Lead-authored-by: Erik.fang <fmerik@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-26 16:47:58 +08:00
SteNicholas
465b0938f7
[CELEBORN-1134] Celeborn Flink client should validate whether execution.batch-shuffle-mode is ALL_EXCHANGES_BLOCKING
### What changes were proposed in this pull request?

Celeborn Flink client validates whether `execution.batch-shuffle-mode` is `ALL_EXCHANGES_BLOCKING`.

### Why are the changes needed?

The Flink config option `execution.batch-shuffle-mode` is `ALL_EXCHANGES_BLOCKING` by default. The Celeborn Flink client should validate whether `execution.batch-shuffle-mode` is `ALL_EXCHANGES_BLOCKING`. If `execution.batch-shuffle-mode` is set to `ALL_EXCHANGES_PIPELINED`, `ReducePartitionCommitHandler#handleGetReducerFileGroup` throws a `NullPointerException`, shown below:

```
2023-11-16 14:40:55,984 ERROR org.apache.celeborn.common.rpc.netty.Inbox                    - Ignoring error
java.lang.NullPointerException: Cannot invoke "java.util.Set.add(Object)" because the return value of "java.util.concurrent.ConcurrentHashMap.get(Object)" is null
	at org.apache.celeborn.client.commit.ReducePartitionCommitHandler.handleGetReducerFileGroup(ReducePartitionCommitHandler.scala:307)
	at org.apache.celeborn.client.CommitManager.handleGetReducerFileGroup(CommitManager.scala:266)
	at org.apache.celeborn.client.LifecycleManager.org$apache$celeborn$client$LifecycleManager$$handleGetReducerFileGroup(LifecycleManager.scala:559)
	at org.apache.celeborn.client.LifecycleManager$$anonfun$receiveAndReply$1.applyOrElse(LifecycleManager.scala:297)
	at org.apache.celeborn.common.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.celeborn.common.rpc.netty.Inbox.safelyCall(Inbox.scala:222)
	at org.apache.celeborn.common.rpc.netty.Inbox.process(Inbox.scala:110)
	at org.apache.celeborn.common.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:227)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
```
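
A fail-fast check of this kind could look roughly like the sketch below. It is only an illustration of validating the option before the job runs, not the Celeborn Flink client code; the function name, the dict-based config, and the error message are assumptions.

```python
SHUFFLE_MODE_KEY = "execution.batch-shuffle-mode"
REQUIRED_MODE = "ALL_EXCHANGES_BLOCKING"


def validate_batch_shuffle_mode(conf: dict) -> None:
    """Raise early if the batch shuffle mode is not the blocking one."""
    # Per the description above, the option defaults to ALL_EXCHANGES_BLOCKING.
    mode = conf.get(SHUFFLE_MODE_KEY, REQUIRED_MODE)
    if mode != REQUIRED_MODE:
        raise ValueError(
            f"{SHUFFLE_MODE_KEY} must be {REQUIRED_MODE}, but was {mode}"
        )
```

Rejecting the configuration up front replaces the late `NullPointerException` with an error that names the offending option.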

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`RemoteShuffleServiceFactorySuitJ#testInvalidShuffleServiceConfig`.

Closes #2106 from SteNicholas/CELEBORN-1134.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-17 20:32:04 +08:00
mingji
02cea042a0 [CELEBORN-1116] Read authentication configs from HADOOP_CONF_DIR
### What changes were proposed in this pull request?
1. Make Celeborn read configs from HADOOP_CONF_DIR.
2. Remove unnecessary Kerberos configs.

### Why are the changes needed?
To support HDFS with Kerberos.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #2082 from FMX/B1116.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-09 11:07:13 +08:00
jiaoqingbo
39153c8c2d [MINOR] Updated sbt.md documentation to be consistent with description
### What changes were proposed in this pull request?

Add the --release parameter to create a Celeborn distribution like those distributed by the Celeborn Downloads page.

### Why are the changes needed?

Without --release parameter, the created Celeborn distribution is different from the Celeborn Downloads page and lacks client-related packages.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

PASS GA

Closes #2080 from jiaoqingbo/minor-sbt.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-08 21:07:43 +08:00
Shuang
931880a82d [CELEBORN-1112] Inform celeborn application is shutdown, then celeborn cluster can release resource immediately
### What changes were proposed in this pull request?
Unregister the application from the Celeborn master after the application stops, so that the master expires the related shuffle resources immediately, saving resources.

### Why are the changes needed?
Currently, the Celeborn master expires an application's shuffle resources only when the application is found to have timed out. This greatly delays the release of resources, which is not conducive to saving resources.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
PASS GA

Closes #2075 from RexXiong/CELEBORN-1112.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-08 20:46:51 +08:00
SteNicholas
d2582919ad
[CELEBORN-1110] Support celeborn.worker.storage.disk.reserve.ratio to configure worker reserved ratio for each disk
### What changes were proposed in this pull request?

Support `celeborn.worker.storage.disk.reserve.ratio` to configure worker reserved ratio for each disk.

### Why are the changes needed?

`CelebornConf` supports configuring the reserved space of a Celeborn worker for each disk as an absolute size. `CelebornConf` could also support `celeborn.worker.storage.disk.reserve.ratio` to configure the worker's reserved ratio for each disk. The minimum usable size for each disk should be the larger of the reserved space and the space calculated from the reserved ratio.
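
The rule in the last sentence amounts to taking a maximum. A minimal sketch, assuming the ratio is applied to the disk's total capacity (the description does not say which base is used) and using a hypothetical function name:

```python
def min_usable_size(total_bytes: int, reserve_bytes: int, reserve_ratio: float) -> int:
    """Larger of the absolute reserve and the ratio-based reserve.

    Assumption: the ratio is taken against the disk's total capacity.
    """
    ratio_bytes = int(total_bytes * reserve_ratio)
    return max(reserve_bytes, ratio_bytes)
```

On a small disk the absolute reserve usually dominates, while on a large disk the ratio-based value takes over.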

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`SlotsAllocatorSuiteJ`

Closes #2071 from SteNicholas/CELEBORN-1110.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-08 12:39:25 +08:00
SteNicholas
52eddc59f3
[CELEBORN-448] Support exclude worker manually
### What changes were proposed in this pull request?

Support excluding a worker manually by worker id. The worker is added to the manually excluded workers.

### Why are the changes needed?

At present, the Celeborn shuffle client excludes workers on fetch and push failures. Excluding a worker manually is also necessary for maintaining the Celeborn cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `HttpUtilsSuite`
- `DefaultMetaSystemSuiteJ#testHandleWorkerExclude`
- `RatisMasterStatusSystemSuiteJ#testHandleWorkerExclude`
- `MasterStateMachineSuiteJ#testObjSerde`

Closes #1997 from SteNicholas/CELEBORN-448.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-07 16:25:24 +08:00
sychen
efa22a4936 [CELEBORN-1105][FLINK] Support Flink 1.18
### What changes were proposed in this pull request?

### Why are the changes needed?

```bash
flink-1.18.0
./bin/start-cluster.sh
./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
```

```java
Caused by: java.lang.NoSuchMethodError: org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.<init>(Ljava/lang/String;ILorg/apache/flink/runtime/jobgraph/IntermediateDataSetID;Lorg/apache/flink/runtime/io/network/partition/ResultPartitionType;Lorg/apache/flink/runtime/executiongraph/IndexRange;ILorg/apache/flink/runtime/io/network/partition/PartitionProducerStateProvider;Lorg/apache/flink/util/function/SupplierWithException;Lorg/apache/flink/runtime/io/network/buffer/BufferDecompressor;Lorg/apache/flink/core/memory/MemorySegmentProvider;ILorg/apache/flink/runtime/throughput/ThroughputCalculator;Lorg/apache/flink/runtime/throughput/BufferDebloater;)V
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate$FakedRemoteInputChannel.<init>(RemoteShuffleInputGate.java:225)
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate.getChannel(RemoteShuffleInputGate.java:179)
	at org.apache.flink.runtime.io.network.partition.consumer.InputGate.setChannelStateWriter(InputGate.java:90)
	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setChannelStateWriter(InputGateWithMetrics.java:120)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.injectChannelStateWriterIntoChannels(StreamTask.java:524)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.<init>(StreamTask.java:496)
```

Flink 1.18.0 release
https://flink.apache.org/2023/10/24/announcing-the-release-of-apache-flink-1.18/

Interface `org.apache.flink.runtime.io.network.buffer.Buffer` adds `setRecycler` method.
[[FLINK-32549](https://issues.apache.org/jira/browse/FLINK-32549)][network] Tiered storage memory manager supports ownership transfer for buffers

`org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor adds parameters.
[[FLINK-31638](https://issues.apache.org/jira/browse/FLINK-31638)][network] Introduce the TieredStorageConsumerClient to SingleInputGate
[[FLINK-31642](https://issues.apache.org/jira/browse/FLINK-31642)][network] Introduce the MemoryTierConsumerAgent to TieredStorageConsumerClient

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```bash
flink-1.18.0 ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID d7fc5f0ca018a54e9453c4d35f7c598a
Program execution finished
Job with JobID d7fc5f0ca018a54e9453c4d35f7c598a has finished.
Job Runtime: 1635 ms
```

<img width="1297" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/6a5266bf-2386-4386-b98b-a60d2570fa99">

Closes #2063 from cxzl25/CELEBORN-1105.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-06 15:53:39 +08:00
joey.ljy
455cd40137 [CELEBORN-1111] Supporting connection to HDFS with Kerberos authentication enabled
### What changes were proposed in this pull request?
Adding Kerberos support for HDFS storage type.

The following five parameters need to be configured:

| key | value |
| :--: | :--: |
| celeborn.storage.hdfs.kerberos.enabled | true |
| celeborn.storage.hdfs.kerberos.principal | user@REALM |
| celeborn.storage.hdfs.kerberos.keytab | /path/test.keytab |
| celeborn.hadoop.hadoop.security.authentication | kerberos |
| celeborn.hadoop.dfs.namenode.kerberos.principal | hdfs/_HOST@REALM |

### Why are the changes needed?
Connecting to HDFS with Kerberos enabled requires support for keytab login.

### Does this PR introduce _any_ user-facing change?
Add 3 configurations.
celeborn.storage.hdfs.kerberos.enabled
celeborn.storage.hdfs.kerberos.principal
celeborn.storage.hdfs.kerberos.keytab

### How was this patch tested?
Test in Kerberos enabled HDFS cluster.

Closes #2072 from liujiayi771/hdfs-kerberos.

Authored-by: joey.ljy <joey.ljy@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-04 17:21:41 +08:00
mingji
5e77b851c9 [CELEBORN-1081] Client support celeborn.storage.activeTypes config
### What changes were proposed in this pull request?
1. Support `celeborn.storage.activeTypes` in the client.
2. The master ignores slots for "UNKNOWN_DISK".

### Why are the changes needed?
Enable client application to select storage types to use.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
GA and cluster.

Closes #2045 from FMX/B1081.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-03 20:03:11 +08:00
Chandni Singh
c8b5384baf [CELEBORN-1107] Make the max default number of netty threads configurable
### What changes were proposed in this pull request?
This change makes the maximum default number of Netty threads configurable. Previously, this value was hardcoded to 64, which could be small for certain environments. While it's possible to configure the number of Netty server and client threads individually for each module, providing an option to increase the default value offers greater convenience.
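
To illustrate what a configurable cap on the default means, the sketch below assumes the default thread count is the smaller of the available cores and the cap, and that an explicit per-module setting wins over the default. That selection logic, the function name, and the parameter names are assumptions and are not confirmed by this commit message.

```python
def default_netty_threads(available_cores: int, configured: int = 0, max_default: int = 64) -> int:
    """Pick a Netty thread count under an assumed min(cores, cap) rule."""
    if configured > 0:
        # An explicit per-module setting is used as-is.
        return configured
    # Otherwise fall back to the default, bounded by the (now configurable) cap.
    return min(available_cores, max_default)
```

Under these assumptions, a 128-core host gets 64 threads with the old cap and 96 threads when the cap is raised to 96.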

### Why are the changes needed?
The change offers convenience.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a UT

Closes #2065 from otterc/CELEBORN-1107.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-03 13:18:44 +08:00
onebox-li
7b185a2562 [CELEBORN-1058] Support specifying the number of dispatcher threads for each role
### What changes were proposed in this pull request?
Support specifying the number of dispatcher threads for each role, especially on the shuffle client side. The shuffle client has only the RpcEndpointVerifier endpoint, which handles few requests, so one thread is enough. The RPC env of the other roles has at most two endpoints, so using a shared event loop is reasonable. I am not sure whether RPC requests will need to be added to the shuffle client, so this PR adds role-specific parameters for the dispatcher threads.

Also rename the dispatcher thread pool to distinguish it from Spark's.

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
Yes, adds the parameters `celeborn.<role>.rpc.dispatcher.threads`.
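
To make the per-role key pattern concrete, here is a small sketch that builds the key for a role and reads its value with a caller-supplied fallback. The `DispatcherThreadConf` class, the role names used in the usage note, and the fallback behaviour are illustrative assumptions; only the key pattern itself comes from this PR.

```java
import java.util.Map;

// Hypothetical helper, not Celeborn code: resolves the dispatcher thread
// count for one role from a flat key/value configuration.
class DispatcherThreadConf {
  // Builds the per-role key named in this PR for the given role.
  static String roleKey(String role) {
    return "celeborn." + role + ".rpc.dispatcher.threads";
  }

  // Uses the role-specific value when present, otherwise the given fallback.
  static int dispatcherThreads(Map<String, String> conf, String role, int fallback) {
    String value = conf.get(roleKey(role));
    return value == null ? fallback : Integer.parseInt(value.trim());
  }
}
```

For example, assuming a role named `client`, setting `celeborn.client.rpc.dispatcher.threads` to `1` would give the shuffle client a single dispatcher thread while other roles keep the fallback.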

### How was this patch tested?
Manual test and UT

Closes #2003 from onebox-li/my_dev.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-03 10:35:54 +08:00
SteNicholas
4e8e8c2310
[CELEBORN-1094] Optimize mechanism of ChunkManager expired shuffle key cleanup to avoid memory leak
### What changes were proposed in this pull request?

The `cleaner` of `Worker` executes `StorageManager#cleanupExpiredShuffleKey` on a daemon cached thread pool to clean up expired shuffle keys. This optimization speeds up the cleanup, including that of the expired shuffle keys held by `ChunkManager`, to avoid a memory leak.

### Why are the changes needed?

`ChunkManager#streams` can leak memory when the worker cleans up expired shuffles more slowly than they expire. The way `ChunkStreamManager` cleans up expired shuffle keys should be optimized to avoid this leak, which drives the worker's VM thread to 100% utilization.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`WorkerSuite#clean up`.

Closes #2053 from SteNicholas/CELEBORN-1094.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-02 15:46:07 +08:00
sychen
e437228dc8 [CELEBORN-1104][DOC] Fix SBT documentation incorrect command
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2062 from cxzl25/CELEBORN-1104.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-11-01 17:00:09 +08:00
SteNicholas
b45b63f9a5
[CELEBORN-247][FOLLOWUP] Add metrics for each user's quota usage of Celeborn Worker
### What changes were proposed in this pull request?

Add the metric `ResourceConsumption` to monitor each user's quota usage of Celeborn Worker.

### Why are the changes needed?

The metric `ResourceConsumption` currently monitors each user's quota usage on the Celeborn Master only. The usage on the Celeborn Worker also needs to be monitored.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2059 from SteNicholas/CELEBORN-247.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-01 15:48:31 +08:00
onebox-li
320714bf24 [CELEBORN-1089] Separate overHighWatermark check to a dedicated thread
### What changes were proposed in this pull request?
Separate the `overHighWatermark` check into a dedicated thread so that the value can be shared more easily and the `CongestionController#isUserCongested` logic becomes lighter.
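
The general pattern can be sketched as follows: one background thread periodically recomputes a flag, and callers on the hot path only read it. This is an illustration under assumptions, not the actual `CongestionController` code; the `HighWatermarkChecker` class, the `pendingBytes` supplier, and the single-threshold comparison are all hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical sketch, not Celeborn code: a flag refreshed by a dedicated thread.
class HighWatermarkChecker {
  private final LongSupplier pendingBytes; // assumed source of the current load
  private final long highWatermark;
  // volatile so that readers on other threads see the latest refresh.
  private volatile boolean overHighWatermark;

  HighWatermarkChecker(LongSupplier pendingBytes, long highWatermark) {
    this.pendingBytes = pendingBytes;
    this.highWatermark = highWatermark;
  }

  // One evaluation of the check; the scheduler calls this periodically.
  void refresh() {
    overHighWatermark = pendingBytes.getAsLong() > highWatermark;
  }

  // Cheap read for request-handling threads: nothing is recomputed here.
  boolean isOverHighWatermark() {
    return overHighWatermark;
  }

  // Starts a single daemon thread that refreshes the flag at a fixed delay.
  ScheduledExecutorService start(long intervalMs) {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
          Thread t = new Thread(r, "high-watermark-checker");
          t.setDaemon(true);
          return t;
        });
    scheduler.scheduleWithFixedDelay(this::refresh, 0, intervalMs, TimeUnit.MILLISECONDS);
    return scheduler;
  }
}
```

The trade-off is that readers may see a value that is up to one interval old, in exchange for not repeating the computation on every call.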

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and UT.

Closes #2041 from onebox-li/congest-check.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-01 09:51:24 +08:00
SteNicholas
3092644168 [CELEBORN-1095] Support configuration of fastest available XXHashFactory instance for checksum of Lz4Decompressor
### What changes were proposed in this pull request?

`CelebornConf` adds `celeborn.client.shuffle.decompression.lz4.xxhash.instance` to configure fastest available `XXHashFactory` instance for checksum of `Lz4Decompressor`. Fix #2043.

### Why are the changes needed?

`Lz4Decompressor` creates the checksum with `XXHashFactory#fastestInstance`, which returns the fastest available `XXHashFactory` instance and uses the native instance by default. Which `XXHashFactory` instance is used for the `Lz4Decompressor` checksum should be configurable rather than depending on whether the class loader is the system class loader. The method is as follows:
```
/**
 * Returns the fastest available {@link XXHashFactory} instance. If the class
 * loader is the system class loader and if the
 * {@link #nativeInstance() native instance} loads successfully, then the
 * {@link #nativeInstance() native instance} is returned, otherwise the
 * {@link #fastestJavaInstance() fastest Java instance} is returned.
 * <p>
 * Please read {@link #nativeInstance() javadocs of nativeInstance()} before
 * using this method.
 *
 * @return the fastest available {@link XXHashFactory} instance.
 */
public static XXHashFactory fastestInstance() {
  if (Native.isLoaded()
      || Native.class.getClassLoader() == ClassLoader.getSystemClassLoader()) {
    try {
      return nativeInstance();
    } catch (Throwable t) {
      return fastestJavaInstance();
    }
  } else {
    return fastestJavaInstance();
  }
}
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `CelebornConfSuite`
- `ConfigurationSuite`

Closes #2050 from SteNicholas/CELEBORN-1095.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
2023-10-31 14:57:31 +08:00
Fu Chen
349ee8b1cb Revert "[CELEBORN-255] Add counter of outstandingFetches, outstanding…
…Rpcs and outstandingPushes to metrics"

This reverts commit bfa341c32f.

### What changes were proposed in this pull request?

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1992#issuecomment-1776760369

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2032 from cfmcgrady/revert-pr-1992.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-24 17:18:54 +08:00
SteNicholas
11c90d8e72
[CELEBORN-916] Add new metric about active shuffle file count in worker
### What changes were proposed in this pull request?

Add a new metric `ActiveShuffleFileCount` that reports the active shuffle file count of the Celeborn Worker.

### Why are the changes needed?

The `ActiveShuffleSize` metric currently reports the active shuffle size of the peer worker. It is therefore useful to introduce `ActiveShuffleFileCount` to report the active shuffle file count of the Celeborn Worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2009 from SteNicholas/CELEBORN-916.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-23 11:15:18 +08:00
SteNicholas
7276dd024c
[CELEBORN-1035] Expose RunningApplicationCount, PartitionWritten and PartitionFileCount metric by Celeborn master
### What changes were proposed in this pull request?

Meta manager records `appHeartbeatTime`, `partitionTotalWritten` and `partitionTotalFileCount`, which are useful to monitor the application heartbeat and shuffle partition. `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics are exposed by Celeborn master to monitor the application and shuffle partition.

### Why are the changes needed?

`Master` exposes `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

Internal tests.

Closes #1976 from SteNicholas/CELEBORN-1035.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-19 22:07:17 +08:00
mingji
69defcad7f [CELEBORN-1021] Celeborn support arbitrary Ratis configs and client rpc timeout
### What changes were proposed in this pull request?
1. To support arbitrary Ratis configs
2. To support Ratis client rpc timeout

### Why are the changes needed?
After some digging, I found that Celeborn never changed the default timeout configuration of the Ratis client.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #1969 from FMX/CELEBORN-1021.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-10-18 10:26:11 +08:00
sunjunjie
03498ce46b [CELEBORN-1046] Add an expiration time configuration for app directory to clean up
### What changes were proposed in this pull request?
Add a configuration "celeborn.worker.storage.expireDirs.timeout" with a default value of 6h in rsswork. This configuration is used to set the expiration time for app local directories.

https://issues.apache.org/jira/browse/CELEBORN-1046

### Why are the changes needed?
When Celeborn periodically deletes the directories of apps, it determines whether the app needs to be deleted based on the shuffleKeySet in memory. However, this method may not accurately indicate the completion of the app and could potentially lead to the unintentional deletion of shuffle data.
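
One plausible reading of the new check, sketched under assumptions: an app directory is removed only when the app is no longer tracked in memory and the directory has also been idle for at least the configured timeout. The `ExpiredAppDirs` class, its parameters, and the use of the last-modified time are hypothetical and not confirmed by this PR description.

```java
import java.time.Duration;
import java.util.Set;

// Hypothetical sketch, not Celeborn code: decides whether an app directory
// is old enough to be cleaned up.
class ExpiredAppDirs {
  // A directory is deleted only if its app is not tracked in memory AND it
  // has not been modified for at least the configured timeout.
  static boolean shouldDelete(
      String appId,
      Set<String> activeAppIds,
      long lastModifiedMs,
      long nowMs,
      Duration expireTimeout) {
    if (activeAppIds.contains(appId)) {
      return false; // still known to the worker, never delete
    }
    return nowMs - lastModifiedMs >= expireTimeout.toMillis();
  }
}
```

With the default of 6h, a directory of an untracked app that was last written two hours ago would be kept, which guards against deleting shuffle data of an app that is still running.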

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1998 from wilsonjie/CELEBORN-1046.

Authored-by: sunjunjie <sunjunjie@zto.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-17 19:23:49 +08:00
SteNicholas
bfa341c32f [CELEBORN-255] Add counter of outstandingFetches, outstandingRpcs and outstandingPushes to metrics
### What changes were proposed in this pull request?

Add counter of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` to metrics of Celeborn Worker.

### Why are the changes needed?

The counter of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` could be added to metrics to monitor `outstandingFetches`, `outstandingRpcs` and `outstandingPushes`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`TransportResponseHandlerSuiteJ`

Closes #1992 from SteNicholas/CELEBORN-255.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 21:16:57 +08:00
sychen
a8ac18f2e8 [CELEBORN-299] Deprecate celeborn.worker.storage.baseDir.prefix and celeborn.worker.storage.baseDir.number
### What changes were proposed in this pull request?

<img width="1460" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/ac3b29be-7c39-4c18-b71d-0e243797273e">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
23/10/16 03:31:13,399 WARN [pool-1-thread-1-ScalaTest-running-CelebornConfSuite] CelebornConf: The configuration key 'celeborn.worker.storage.baseDir.prefix' has been deprecated in v0.4.0 and may be removed in the future. Please use celeborn.worker.storage.dirs
23/10/16 03:31:13,399 WARN [pool-1-thread-1-ScalaTest-running-CelebornConfSuite] CelebornConf: The configuration key 'celeborn.worker.storage.baseDir.number' has been deprecated in v0.4.0 and may be removed in the future. Please use celeborn.worker.storage.dirs
```
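
The warnings above could be produced by a lookup table of deprecated keys, roughly as sketched here. The `DeprecatedKeys` class and its structure are hypothetical; the two keys, the version, the replacement key, and the message wording are copied from the log output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch, not Celeborn code: maps deprecated keys to a warning.
class DeprecatedKeys {
  // deprecated key -> {version in which it was deprecated, replacement key}
  private static final Map<String, String[]> DEPRECATED = new LinkedHashMap<>();

  static {
    DEPRECATED.put(
        "celeborn.worker.storage.baseDir.prefix",
        new String[] {"0.4.0", "celeborn.worker.storage.dirs"});
    DEPRECATED.put(
        "celeborn.worker.storage.baseDir.number",
        new String[] {"0.4.0", "celeborn.worker.storage.dirs"});
  }

  // Returns the warning text for a deprecated key, or empty for any other key.
  static Optional<String> warningFor(String key) {
    String[] info = DEPRECATED.get(key);
    if (info == null) {
      return Optional.empty();
    }
    return Optional.of(
        "The configuration key '" + key + "' has been deprecated in v" + info[0]
            + " and may be removed in the future. Please use " + info[1]);
  }
}
```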

Closes #1993 from cxzl25/CELEBORN-299.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 19:10:13 +08:00
SteNicholas
f2d6cc7525 [CELEBORN-829] Improve response message of invalid HTTP request
### What changes were proposed in this pull request?

Improve the response message for an invalid HTTP request so that it lists the available API providers, as shown below:

- master

```
Invalid uri of the master. Available API providers include:
/applications        List all running application's ids of the cluster.
/conf                List the conf setting of the master.
/excludedWorkers     List all excluded workers of the master.
/help                List the available API providers of the master.
/hostnames           List all running application's LifecycleManager's hostnames of the cluster.
/listTopDiskUsedApps List the top disk usage application ids. It will return the top disk usage application ids for the cluster.
/lostWorkers         List all lost workers of the master.
/masterGroupInfo     List master group information of the service. It will list all master's LEADER, FOLLOWER information.
/shuffles            List all running shuffle keys of the service. It will return all running shuffle's key of the cluster.
/shutdownWorkers     List all shutdown workers of the master.
/threadDump          List the current thread dump of the master.
/workerInfo          List worker information of the service. It will list all registered workers 's information.
```

- worker

```
Invalid uri of the worker. Available API providers include:
/conf                      List the conf setting of the worker.
/exit                      Trigger this worker to exit. Legal types are 'DECOMMISSION‘, 'GRACEFUL' and 'IMMEDIATELY'
/help                      List the available API providers of the worker.
/isRegistered              Show if the worker is registered to the master success.
/isShutdown                Show if the worker is during the process of shutdown.
/listPartitionLocationInfo List all the living PartitionLocation information in that worker.
/listTopDiskUsedApps       List the top disk usage application ids. It only return application ids running in that worker.
/shuffles                  List all the running shuffle keys of the worker. It only return keys of shuffles running in that worker.
/threadDump                List the current thread dump of the worker.
/unavailablePeers          List the unavailable peers of the worker, this always means the worker connect to the peer failed.
/workerInfo                List the worker information of the worker.
```
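
A message in this shape can be assembled by sorting the paths and padding them to the width of the longest one. The sketch below shows that idea only; the `ApiHelpMessage` class and its method are hypothetical and do not reflect how Celeborn's HTTP utilities are actually written.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Celeborn code: formats an aligned list of API paths.
class ApiHelpMessage {
  // Builds the "Invalid uri" message with paths sorted and descriptions aligned.
  static String invalidUriMessage(String role, Map<String, String> apis) {
    Map<String, String> sorted = new TreeMap<>(apis); // alphabetical by path
    int width = 0;
    for (String path : sorted.keySet()) {
      width = Math.max(width, path.length());
    }
    StringBuilder sb = new StringBuilder();
    sb.append("Invalid uri of the ").append(role)
        .append(". Available API providers include:\n");
    for (Map.Entry<String, String> e : sorted.entrySet()) {
      // Left-justify each path to the longest path, then one separating space.
      sb.append(String.format("%-" + width + "s %s\n", e.getKey(), e.getValue()));
    }
    return sb.toString();
  }
}
```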

### Why are the changes needed?

The response message for an invalid HTTP request did not help users find the correct HTTP path.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`HttpUtilsSuite#CELEBORN-829: Improve response message of invalid HTTP request`

Closes #1986 from SteNicholas/CELEBORN-829.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 16:37:51 +08:00
SteNicholas
f61fe17551 [CELEBORN-987][FOLLOWUP][DOC] README#Build and sbt#System Requirements should extend to Scala 2.13 and Spark 3.5
### What changes were proposed in this pull request?

`README#Build` and `sbt#System Requirements` extend to Scala 2.13.

### Why are the changes needed?

`README#Build` and `sbt#System Requirements` should extend to Scala 2.13 to align with the SBT CI test results.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

SBT CI tests.

Closes #1987 from SteNicholas/CELEBORN-987.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-14 09:54:22 +08:00
sychen
dd65e74f99 [CELEBORN-983] Rename PrometheusMetric configuration
### What changes were proposed in this pull request?
Replace
```properties
celeborn.metrics.master.prometheus.host
celeborn.metrics.master.prometheus.port
celeborn.metrics.worker.prometheus.host
celeborn.metrics.worker.prometheus.port
```
With
```properties
celeborn.master.http.host
celeborn.master.http.port
celeborn.worker.http.host
celeborn.worker.http.port
```
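
For anyone carrying old settings forward, the rename can be expressed as a simple key mapping. The sketch below is a hypothetical migration helper, not part of Celeborn; only the four old and four new key names come from this PR, and letting an explicitly set new key take precedence is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch, not Celeborn code: rewrites the renamed keys.
class HttpKeyMigration {
  private static final Map<String, String> RENAMED = new LinkedHashMap<>();

  static {
    RENAMED.put("celeborn.metrics.master.prometheus.host", "celeborn.master.http.host");
    RENAMED.put("celeborn.metrics.master.prometheus.port", "celeborn.master.http.port");
    RENAMED.put("celeborn.metrics.worker.prometheus.host", "celeborn.worker.http.host");
    RENAMED.put("celeborn.metrics.worker.prometheus.port", "celeborn.worker.http.port");
  }

  // Returns a copy in which each old key is moved to its new name.
  // Assumption: a new key that is already set wins over the old one.
  static Properties migrate(Properties old) {
    Properties result = new Properties();
    for (String name : old.stringPropertyNames()) {
      if (!RENAMED.containsKey(name)) {
        result.setProperty(name, old.getProperty(name));
      }
    }
    for (Map.Entry<String, String> e : RENAMED.entrySet()) {
      String value = old.getProperty(e.getKey());
      if (value != null && result.getProperty(e.getValue()) == null) {
        result.setProperty(e.getValue(), value);
      }
    }
    return result;
  }
}
```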

### Why are the changes needed?
The ports bound by `celeborn.metrics.master.prometheus.port` and `celeborn.metrics.worker.prometheus.port` not only serve Prometheus metrics but also provide some useful API services.

https://celeborn.apache.org/docs/latest/monitoring/#rest-api

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1919 from cxzl25/CELEBORN-983.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 13:28:58 +08:00
onebox-li
a47f6169d8 [MINOR] Fix some typos
### What changes were proposed in this pull request?
Fix some typos

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
-

Closes #1983 from onebox-li/fix-typo.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-12 20:34:07 +08:00
sychen
9c07ceddb0 [CELEBORN-1028][FOLLOWUP][DOCS] Make prometheus path configurable
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/1965#issuecomment-1755345813

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

<img width="1410" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/6454133a-040b-4dde-84b7-dbf08fb15b13">

<img width="1401" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/3cdfa9f2-9a7a-43cb-9006-77810a350669">

Closes #1974 from cxzl25/CELEBORN-1028-FOLLOWUP.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 22:59:22 +08:00
sychen
bcf89da7dd [MINOR] Fix typo in CelebornConf
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1971 from cxzl25/typo.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 20:04:16 +08:00