30 Commits

Author SHA1 Message Date
SteNicholas
52eddc59f3
[CELEBORN-448] Support exclude worker manually
### What changes were proposed in this pull request?

Support excluding a worker manually by its worker id. The worker is then added to the manually excluded workers.

### Why are the changes needed?

At present, Celeborn's shuffle client excludes workers when a fetch or push fails. Operators also need a way to exclude a worker manually when maintaining the Celeborn cluster.
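
The distinction between failure-driven and manual exclusion can be sketched as follows. This is a minimal illustration only: the `ExcludedWorkers` class and its method names are hypothetical and are not taken from the Celeborn code base.

```python
from typing import Set


class ExcludedWorkers:
    """Hypothetical registry that keeps automatic and manual exclusions apart."""

    def __init__(self) -> None:
        self.on_failure: Set[str] = set()  # excluded after fetch/push failures
        self.manual: Set[str] = set()      # excluded by an operator

    def exclude_on_failure(self, worker_id: str) -> None:
        self.on_failure.add(worker_id)

    def exclude_manually(self, worker_id: str) -> None:
        self.manual.add(worker_id)

    def is_excluded(self, worker_id: str) -> bool:
        # A worker is unavailable if either mechanism excluded it.
        return worker_id in self.on_failure or worker_id in self.manual
```

Keeping the two sets separate means that a worker recovering from a transient failure would not silently undo an operator's decision.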

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `HttpUtilsSuite`
- `DefaultMetaSystemSuiteJ#testHandleWorkerExclude`
- `RatisMasterStatusSystemSuiteJ#testHandleWorkerExclude`
- `MasterStateMachineSuiteJ#testObjSerde`

Closes #1997 from SteNicholas/CELEBORN-448.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-07 16:25:24 +08:00
SteNicholas
b45b63f9a5
[CELEBORN-247][FOLLOWUP] Add metrics for each user's quota usage of Celeborn Worker
### What changes were proposed in this pull request?

Add the metric `ResourceConsumption` to monitor each user's quota usage of Celeborn Worker.

### Why are the changes needed?

At present, the metric `ResourceConsumption` only monitors each user's quota usage on the Celeborn Master. Each user's usage on the Celeborn Worker also needs to be monitored.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2059 from SteNicholas/CELEBORN-247.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-01 15:48:31 +08:00
Fu Chen
349ee8b1cb Revert "[CELEBORN-255] Add counter of outstandingFetches, outstandingRpcs and outstandingPushes to metrics"

This reverts commit bfa341c32f.

### What changes were proposed in this pull request?

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1992#issuecomment-1776760369

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2032 from cfmcgrady/revert-pr-1992.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-24 17:18:54 +08:00
SteNicholas
11c90d8e72
[CELEBORN-916] Add new metric about active shuffle file count in worker
### What changes were proposed in this pull request?

Add a new metric `ActiveShuffleFileCount` that reports the active shuffle file count of the Celeborn Worker.

### Why are the changes needed?

At present, the `ActiveShuffleSize` metric reports the active shuffle size of the peer worker. It is therefore useful to introduce `ActiveShuffleFileCount` to report the active shuffle file count of the Celeborn Worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2009 from SteNicholas/CELEBORN-916.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-23 11:15:18 +08:00
SteNicholas
7276dd024c
[CELEBORN-1035] Expose RunningApplicationCount, PartitionWritten and PartitionFileCount metric by Celeborn master
### What changes were proposed in this pull request?

Meta manager records `appHeartbeatTime`, `partitionTotalWritten` and `partitionTotalFileCount`, which are useful to monitor the application heartbeat and shuffle partition. `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics are exposed by Celeborn master to monitor the application and shuffle partition.
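
How such gauges could be derived from the recorded state is sketched below. The `MetaState` class and its field layout are hypothetical stand-ins, not Celeborn's actual meta manager. Only the three recorded fields and the three metric names come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class MetaState:
    """Hypothetical stand-in for the state the meta manager records."""
    app_heartbeat_time: Dict[str, int] = field(default_factory=dict)  # app id -> last heartbeat
    partition_total_written: int = 0     # total bytes written to partitions
    partition_total_file_count: int = 0  # total number of partition files


def build_gauges(state: MetaState) -> Dict[str, Callable[[], int]]:
    """Return gauges that read the live state each time they are sampled."""
    return {
        "RunningApplicationCount": lambda: len(state.app_heartbeat_time),
        "PartitionWritten": lambda: state.partition_total_written,
        "PartitionFileCount": lambda: state.partition_total_file_count,
    }
```

Because each gauge is a callable, a metrics reporter sees the current value at sampling time without the state having to push updates.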

### Why are the changes needed?

`Master` exposes `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

Internal tests.

Closes #1976 from SteNicholas/CELEBORN-1035.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-19 22:07:17 +08:00
SteNicholas
bfa341c32f [CELEBORN-255] Add counter of outstandingFetches, outstandingRpcs and outstandingPushes to metrics
### What changes were proposed in this pull request?

Add counter of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` to metrics of Celeborn Worker.

### Why are the changes needed?

Exposing the `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` counters of `TransportResponseHandler` as metrics makes it possible to monitor the number of in-flight requests.
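
The idea of counting in-flight requests per kind can be sketched as follows. This is an illustration under assumptions: `OutstandingTracker` is a hypothetical class, not the `TransportResponseHandler` implementation. Only the three counter names are taken from this change.

```python
import threading
from typing import Dict


class OutstandingTracker:
    """Hypothetical tracker for in-flight requests, keyed by request kind."""

    KINDS = ("outstandingFetches", "outstandingRpcs", "outstandingPushes")

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Dict[str, int] = {kind: 0 for kind in self.KINDS}

    def started(self, kind: str) -> None:
        # Called when a request is sent and a response is still pending.
        with self._lock:
            self._counts[kind] += 1

    def finished(self, kind: str) -> None:
        # Called when the response (or a failure) for a request arrives.
        with self._lock:
            self._counts[kind] -= 1

    def snapshot(self) -> Dict[str, int]:
        # A copy, so a metrics reporter cannot mutate the live counters.
        with self._lock:
            return dict(self._counts)
```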

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`TransportResponseHandlerSuiteJ`

Closes #1992 from SteNicholas/CELEBORN-255.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 21:16:57 +08:00
SteNicholas
f2d6cc7525 [CELEBORN-829] Improve response message of invalid HTTP request
### What changes were proposed in this pull request?

Improve the response message for an invalid HTTP request so that it lists the available API providers, as shown below:

- master

```
Invalid uri of the master. Available API providers include:
/applications        List all running application's ids of the cluster.
/conf                List the conf setting of the master.
/excludedWorkers     List all excluded workers of the master.
/help                List the available API providers of the master.
/hostnames           List all running application's LifecycleManager's hostnames of the cluster.
/listTopDiskUsedApps List the top disk usage application ids. It will return the top disk usage application ids for the cluster.
/lostWorkers         List all lost workers of the master.
/masterGroupInfo     List master group information of the service. It will list all master's LEADER, FOLLOWER information.
/shuffles            List all running shuffle keys of the service. It will return all running shuffle's key of the cluster.
/shutdownWorkers     List all shutdown workers of the master.
/threadDump          List the current thread dump of the master.
/workerInfo          List worker information of the service. It will list all registered workers 's information.
```

- worker

```
Invalid uri of the worker. Available API providers include:
/conf                      List the conf setting of the worker.
/exit                      Trigger this worker to exit. Legal types are 'DECOMMISSION‘, 'GRACEFUL' and 'IMMEDIATELY'
/help                      List the available API providers of the worker.
/isRegistered              Show if the worker is registered to the master success.
/isShutdown                Show if the worker is during the process of shutdown.
/listPartitionLocationInfo List all the living PartitionLocation information in that worker.
/listTopDiskUsedApps       List the top disk usage application ids. It only return application ids running in that worker.
/shuffles                  List all the running shuffle keys of the worker. It only return keys of shuffles running in that worker.
/threadDump                List the current thread dump of the worker.
/unavailablePeers          List the unavailable peers of the worker, this always means the worker connect to the peer failed.
/workerInfo                List the worker information of the worker.
```
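
A message in this shape could be produced by a small formatter like the one below. This is a sketch, not the code in `HttpUtils`: the function name `invalid_uri_message` and the dictionary input are assumptions made for illustration.

```python
from typing import Dict


def invalid_uri_message(role: str, providers: Dict[str, str]) -> str:
    """Build the response body for an unknown path, listing every known path."""
    # Pad every path to the longest one so the descriptions line up in a column.
    width = max((len(path) for path in providers), default=0)
    lines = [f"Invalid uri of the {role}. Available API providers include:"]
    for path in sorted(providers):
        lines.append(f"{path.ljust(width)} {providers[path]}")
    return "\n".join(lines)
```

Sorting the paths keeps the listing stable across calls, which also makes the message easy to assert on in a test.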

### Why are the changes needed?

The response message for an invalid HTTP request did not help users find the correct HTTP path.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`HttpUtilsSuite#CELEBORN-829: Improve response message of invalid HTTP request`

Closes #1986 from SteNicholas/CELEBORN-829.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 16:37:51 +08:00
sychen
dd65e74f99 [CELEBORN-983] Rename PrometheusMetric configuration
### What changes were proposed in this pull request?
Replace
```properties
celeborn.metrics.master.prometheus.host
celeborn.metrics.master.prometheus.port
celeborn.metrics.worker.prometheus.host
celeborn.metrics.worker.prometheus.port
```
With
```properties
celeborn.master.http.host
celeborn.master.http.port
celeborn.worker.http.host
celeborn.worker.http.port
```

### Why are the changes needed?
The ports bound by `celeborn.master.metrics.prometheus.port` and `celeborn.metrics.worker.prometheus.port` not only serve Prometheus metrics, but also provide some useful API services.

https://celeborn.apache.org/docs/latest/monitoring/#rest-api
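
For existing deployments, the rename amounts to a key-for-key mapping. The helper below is a hypothetical offline migration sketch: `migrate_conf` is not part of Celeborn, and whether Celeborn itself still accepts the old keys is not stated here. The four key pairs are the ones listed above.

```python
from typing import Dict

# Old key -> new key, exactly as listed in this change.
RENAMED_KEYS: Dict[str, str] = {
    "celeborn.metrics.master.prometheus.host": "celeborn.master.http.host",
    "celeborn.metrics.master.prometheus.port": "celeborn.master.http.port",
    "celeborn.metrics.worker.prometheus.host": "celeborn.worker.http.host",
    "celeborn.metrics.worker.prometheus.port": "celeborn.worker.http.port",
}


def migrate_conf(conf: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of conf with old keys renamed; an explicit new key wins."""
    migrated = {k: v for k, v in conf.items() if k not in RENAMED_KEYS}
    for old, new in RENAMED_KEYS.items():
        if old in conf and new not in migrated:
            migrated[new] = conf[old]
    return migrated
```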

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1919 from cxzl25/CELEBORN-983.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 13:28:58 +08:00
sychen
9c07ceddb0 [CELEBORN-1028][FOLLOWUP][DOCS] Make prometheus path configurable
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/1965#issuecomment-1755345813

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

<img width="1410" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/6454133a-040b-4dde-84b7-dbf08fb15b13">

<img width="1401" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/3cdfa9f2-9a7a-43cb-9006-77810a350669">

Closes #1974 from cxzl25/CELEBORN-1028-FOLLOWUP.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 22:59:22 +08:00
sychen
5310bcaf6b
[CELEBORN-313] Add rest endpoint to show master group info
### What changes were proposed in this pull request?

<img width="1347" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/43d10bff-6878-4591-9461-889494d797f9">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

```bash
./bin/celeborn-ratis sh -Draft.rpc.type=NETTY  group info   -peers clb-1:9872,clb-2:9873,clb-3:9874
```

```
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 1(clb-1:9872)

[server {
  id: "1"
  address: "clb-1:9872"
  clientAddress: "clb-1:9097"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "2"
  address: "clb-2:9873"
  clientAddress: "clb-2:9098"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "3"
  address: "clb-3:9874"
  clientAddress: "clb-3:9099"
  startupRole: FOLLOWER
}
commitIndex: 316
]
```

```bash
curl http://clb-3:9983/masterGroupInfo
```

```
====================== Master Group INFO ==============================
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 1(clb-1:9872)

[server {
  id: "3"
  address: "clb-3:9874"
  clientAddress: "clb-3:9099"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "1"
  address: "clb-1:9872"
  clientAddress: "clb-1:9097"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "2"
  address: "clb-2:9873"
  clientAddress: "clb-2:9098"
  startupRole: FOLLOWER
}
commitIndex: 316
]
```

Closes #1946 from cxzl25/CELEBORN-313.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 20:08:31 +08:00
sychen
b2b7c4d359 [CELEBORN-991][DOC] Remove incorrect spark.metrics.conf
### What changes were proposed in this pull request?
1. Replace `spark.metrics.conf` with `celeborn.metrics.conf`.
2. Fix broken links.
https://celeborn.apache.org/docs/latest/monitoring/#metrics

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1925 from cxzl25/CELEBORN-991.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-20 09:03:27 +08:00
sychen
4d35e501a3 [CELEBORN-984][DOC] shutdownWorkers API documentation
### What changes were proposed in this pull request?
https://celeborn.apache.org/docs/latest/monitoring/#master_1

07c1dc2568/service/src/main/scala/org/apache/celeborn/server/common/http/HttpRequestHandler.scala (L74-L75)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1920 from cxzl25/CELEBORN-984.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-18 19:58:11 +08:00
SteNicholas
92777c3ff2 [CELEBORN-927][DOC] Correct celeborn.metrics.conf.*.sink.csv.class configuration example for a CSV sink
### What changes were proposed in this pull request?

Correct `celeborn.metrics.conf.*.sink.csv.class` configuration example for a CSV sink.

### Why are the changes needed?

The `celeborn.metrics.conf.*.sink.csv.class` configuration example for a CSV sink is wrong; its value should be `org.apache.celeborn.common.metrics.sink.CsvSink`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

None.

Closes #1865 from SteNicholas/CELEBORN-927.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-30 16:11:03 +08:00
Angerszhuuuu
17de30009b [CELEBORN-847] Support use RESTful API to trigger worker exit and exitImmediately
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1768 from AngersZhuuuu/CELEBORN-847.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-15 20:04:26 +08:00
Fu Chen
516bdc7e08
[CELEBORN-877][DOC] Document on SBT
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

Closes #1795 from cfmcgrady/sbt-docs.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-11 12:17:55 +08:00
Angerszhuuuu
bacfb54447 [CELEBORN-832] Support use RESTful API to trigger worker decommission
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1759 from AngersZhuuuu/CELEBORN-832.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-27 15:40:14 +08:00
Angerszhuuuu
00c36fda99 [CELEBORN-828] Merge Monitoring to Development doc
### What changes were proposed in this pull request?
As title

<img width="1610" alt="Screenshot 2023-07-24 11.34.43 AM" src="https://github.com/apache/incubator-celeborn/assets/46485123/ba1b040b-9ea4-4c93-b055-75a469365ff2">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1751 from AngersZhuuuu/CELEBORN-828.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-24 15:37:32 +08:00
Angerszhuuuu
14c6e5719f
[CELEBORN-811] Refine monitoring doc
### What changes were proposed in this pull request?
Refine monitoring doc

1. Remove the unnecessary left-side navigator
2. Add a TOC on the right side
3. Fix list indentation

Before
![celeborn apache org_docs_latest_monitoring_](https://github.com/apache/incubator-celeborn/assets/46485123/885da0e5-f2f9-41ba-a9fe-257e46e76a78)

After
![127 0 0 1_8000_monitoring_](https://github.com/apache/incubator-celeborn/assets/46485123/8cb3fc60-0a2e-4134-8edb-dd0fe434be60)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1734 from AngersZhuuuu/CELEBORN-811.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-19 20:53:21 +08:00
onebox-li
405b2801fa [CELEBORN-810] Fix some typos and grammar
### What changes were proposed in this pull request?
Fix some typos and grammar

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test.

Closes #1733 from onebox-li/fix-typo.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-19 18:35:38 +08:00
Fu Chen
17c1e01874
[CELEBORN-726] Update data replication terminology from master/slave to primary/replica for configurations and metrics
### What changes were proposed in this pull request?

This PR is an integral component of #1639. It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC.

### Why are the changes needed?

To distinguish data replication from the existing master/worker roles, refactor the data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests.

Closes #1650 from cfmcgrady/primary-replica-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 09:47:02 +08:00
Angerszhuuuu
3985a5cbd7 [CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment
### What changes were proposed in this pull request?
Unify all blacklist related code and comment

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 16:28:03 +08:00
Angerszhuuuu
1ba6dee324 [CELEBORN-680][DOC] Refresh celeborn configurations in doc
### What changes were proposed in this pull request?
Refresh celeborn configurations in doc

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1592 from AngersZhuuuu/CELEBORN-680.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-15 13:59:38 +08:00
onebox-li
0c869ac9a0
[CELEBORN-642] Improve metrics and update grafana
### What changes were proposed in this pull request?
Change in grafana

(ALL)
add:
JVMCPUTime
LastMinuteSystemLoad
AvailableProcessors
(For Master)
add:
LostWorkers
IsActiveMaster
PartitionSize
(For Worker)
add:
PushDataFailCount -> WriteDataFailCount
ReplicateDataFailCount
ReplicateDataWriteFailCount
ReplicateDataCreateConnectionFailCount
ReplicateDataConnectionExceptionCount
ReplicateDataTimeoutCount
SortedFileSize
PushDataHandshakeFailCount
RegionStartFailCount
RegionFinishFailCount
MasterPushDataHandshakeTime
SlavePushDataHandshakeTime
MasterRegionStartTime
SlaveRegionStartTime
MasterRegionFinishTime
SlaveRegionFinishTime
PotentialConsumeSpeed
UserProduceSpeed
WorkerConsumeSpeed
DeviceOSFreeBytes
DeviceCelebornFreeBytes
push usedHeapMemory/usedDirectMemory
fetch usedHeapMemory/usedDirectMemory
replicate usedHeapMemory/usedDirectMemory
remove:
dup ReserveSlotsTime

Change dashboard layout.

Fix support for multiple labels.

Modify some metrics docs.

### Why are the changes needed?
For better use of metrics.

### Does this PR introduce _any_ user-facing change?
The metrics below are renamed, and some values are extracted into labels.
DeviceOSFreeCapacity(B) -> DeviceOSFreeBytes
DeviceOSTotalCapacity(B) -> DeviceOSTotalBytes
DeviceCelebornFreeCapacity(B) -> DeviceCelebornFreeBytes
DeviceCelebornTotalCapacity(B) -> DeviceCelebornTotalBytes
push usedHeapMemory/usedDirectMemory
fetch usedHeapMemory/usedDirectMemory
replicate usedHeapMemory/usedDirectMemory
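
Dashboards or alert rules that reference the old names need the same rename. The lookup below is a hypothetical sketch (`rename_metric` is not part of Celeborn) that uses only the four name pairs listed above. How the names appear in a given metrics backend is not covered here.

```python
from typing import Dict

# Old metric name -> new metric name, as listed in this change.
RENAMED_METRICS: Dict[str, str] = {
    "DeviceOSFreeCapacity(B)": "DeviceOSFreeBytes",
    "DeviceOSTotalCapacity(B)": "DeviceOSTotalBytes",
    "DeviceCelebornFreeCapacity(B)": "DeviceCelebornFreeBytes",
    "DeviceCelebornTotalCapacity(B)": "DeviceCelebornTotalBytes",
}


def rename_metric(name: str) -> str:
    """Return the new name for a renamed metric, or the name unchanged."""
    return RENAMED_METRICS.get(name, name)
```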

### How was this patch tested?
Cluster test.

Closes #1557 from onebox-li/improve-metrics.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-08 18:10:06 +08:00
Angerszhuuuu
64a3534f71
[CELEBORN-584] Worker side should expose push/replicate/fetch Netty allocator metrics (#1489)
2023-05-16 17:51:33 +08:00
Angerszhuuuu
d657f8268a
[CELEBORN-586] Add SystemMiscSource to indicate system running status (#1488)
2023-05-16 14:03:07 +08:00
Ethan Feng
91b757555e
[CELEBORN-570] Update docs about monitor and deployment. (#1478)
2023-05-08 17:07:42 +08:00
Angerszhuuuu
0c2d3e647d
[CELEBORN-532][METRICS] Refine push-related failure metrics (#1442)
* [CELEBORN-532][METRICS] Refine push-related failure metrics
2023-04-21 17:05:43 +08:00
Angerszhuuuu
e319b99a1c
[CELEBORN-527][DOC] Fix incorrect monitor the arrangement of documents (#1432)
2023-04-17 11:12:19 +08:00
Angerszhuuuu
ecafbf41fc
[CELEBORN-516][FOLLOWUP] Remove removed RPC metrics in metric doc (#1431)
2023-04-17 10:46:04 +08:00
Cheng Pan
fb7b311c89
[CELEBORN-499] Move version specific resource to main repo (#1429)
* [CELEBORN-499] Move version specific resource to main repo

* license
2023-04-14 16:20:51 +08:00