Commit Graph

63 Commits

Author SHA1 Message Date
Wang, Fei
f1bda46de4 [CELEBORN-1680] Introduce ShuffleFallbackCount metrics
### What changes were proposed in this pull request?

As title, introduce metrics_ShuffleFallbackCount_Value.

### Why are the changes needed?
To provide the insights that how many shuffles fallback to spark built-in shuffle service. It is helpful for us  to deprecate the ESS progressively.

Currently, we plan to set the `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` to fallback the shuffle with too large shuffle partitions number, for example: 50k.

In the future, we plan to limit the acceptable maximum shuffle partition number so that the bad job would be rejected and not impact the celeborn master health.

### Does this PR introduce _any_ user-facing change?
Yes, new metrics.

### How was this patch tested?
UT.
<img width="1188" alt="image" src="https://github.com/user-attachments/assets/8193c12c-5dc9-4783-b64b-6a8449a1bea4">

Closes #2866 from turboFei/record_fallback.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-07 11:42:17 +08:00
SteNicholas
497bfdf5d7 [CELEBORN-1640] NettyMemoryMetrics supports numHeapArenas, numDirectArenas, tinyCacheSize, smallCacheSize, normalCacheSize, numThreadLocalCaches and chunkSize
### What changes were proposed in this pull request?

`NettyMemoryMetrics` supports `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`. Meanwhile, remove `server_` prefix from metric name of netty memory metric in `monitoring.md`.

### Why are the changes needed?

`PooledByteBufAllocatorMetric` provides the following API to support netty memory metrics:

```
public int numHeapArenas() {
  return this.allocator.numHeapArenas();
}

public int numDirectArenas() {
  return this.allocator.numDirectArenas();
}

public List<PoolArenaMetric> heapArenas() {
  return this.allocator.heapArenas();
}

public List<PoolArenaMetric> directArenas() {
  return this.allocator.directArenas();
}

public int numThreadLocalCaches() {
  return this.allocator.numThreadLocalCaches();
}

public int tinyCacheSize() {
  return this.allocator.tinyCacheSize();
}

public int smallCacheSize() {
  return this.allocator.smallCacheSize();
}

public int normalCacheSize() {
  return this.allocator.normalCacheSize();
}

public int chunkSize() {
  return this.allocator.chunkSize();
}

public long usedHeapMemory() {
  return this.allocator.usedHeapMemory();
}

public long usedDirectMemory() {
  return this.allocator.usedDirectMemory();
}
```

`NettyMemoryMetrics` only supports `usedHeapMemory` and `usedDirectMemory`, which could support `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/a520ca36a33843a38bbde28387023f97)

Closes #2802 from SteNicholas/CELEBORN-1640.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-17 18:12:08 +08:00
SteNicholas
8bd5ac0b99 [MINOR] Add navigation for REST API document
### What changes were proposed in this pull request?

Add navigation for `REST API` document.

### Why are the changes needed?

`REST API` document does not have any navigation, which is better to add navigation to guide REST API.

<img width="1438" alt="image" src="https://github.com/user-attachments/assets/b5b3a14a-38d4-4769-bffb-3acd571d5dbb">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2775 from SteNicholas/navigate-rest-api.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-08 20:20:37 +08:00
Sanskar Modi
961144fdbd [CELEBORN-1582] Publish metric for unreleased shuffle count when worker was decommissioned
### What changes were proposed in this pull request?

Adding a worker metrics for publish unreleased shuffle count when worker was decommissioned.

<img width="885" alt="Screenshot 2024-09-16 at 11 12 33 AM" src="https://github.com/user-attachments/assets/c81f36c1-cbed-44fe-814b-88f3ff29875d">

### Why are the changes needed?

Currently celeborn don't publish the count of unreleased shuffle key which gets lost when a worker is decommissioned. This can be useful for monitoring and configuring the `forceExitTimeout`.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?
NA

Closes #2711 from s0nskar/unrelease_shuffle_metric.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-08 17:02:25 +08:00
Weijie Guo
d8809793f3 [CELEBORN-1490][CIP-6] Impl worker write process for Flink Hybrid Shuffle
### What changes were proposed in this pull request?

Impl worker write process for Flink Hybrid Shuffle.

### Why are the changes needed?

We supports tiered producer write data from flink to worker. In this PR, we enable the worker to write this kind of data to storage.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
no need.

Closes #2741 from reswqa/cip6-6-pr.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-09-25 10:27:55 +08:00
szt
59a39952dd [CELEBORN-1586] Add available workers Metrics
### What changes were proposed in this pull request?
Currently metrics have workers and excludedWorkers and other metadata for master service but don't have metadata for available workers. This PR supplemented this part.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Local test
![image](https://github.com/user-attachments/assets/240c176c-4eef-4e3c-b34d-802291714702)

Closes #2723 from zaynt4606/availableWorker.

Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-09-05 13:34:52 +08:00
Sanskar Modi
b7027b6011 [CELEBORN-914][FOLLOWUP] Adding metrics for memory file storage in monitoring.md
### What changes were proposed in this pull request?

Adding documentation for missing memory file storage metrics.

### Why are the changes needed?

Few new metrics were added in https://github.com/apache/celeborn/pull/2300 but they were missing their documentation in monitoring.md

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

NA

Closes #2705 from s0nskar/memory_metrics.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-08-26 16:05:35 +08:00
Wang, Fei
0e05bc6cf9 [CELEBORN-1437][DOC] Merge METRICS.md into monitoring.md
### What changes were proposed in this pull request?

As title, merge these two similar user guides.

### Why are the changes needed?
To close CELEBORN-1437

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Preview https://github.com/turboFei/incubator-celeborn/blob/metrics_merge/docs/monitoring.md#setup-prometheus-dashboard

Closes #2623 from turboFei/metrics_merge.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-07-16 13:41:46 +08:00
Wang, Fei
6b03dcd5c2 [CELEBORN-1436][DOC] Move Rest API out from monitoring.md to webapi.md
### What changes were proposed in this pull request?

Move Rest API out from monitoring.md to webapi.md

### Why are the changes needed?
To close CELEBORN-1436

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
Review https://github.com/turboFei/incubator-celeborn/blob/webapi_md/docs/webapi.md

Closes #2624 from turboFei/webapi_md.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-15 10:54:03 +08:00
SteNicholas
c7b1b8d61e
[CELEBORN-1459] Introduce CleanTaskQueueSize and CleanExpiredShuffleKeysTime to record situation of cleaning up expired shuffle keys
### What changes were proposed in this pull request?

Introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record situation of cleaning up expired shuffle keys.

### Why are the changes needed?

There is a backlog of task queue for cleaning up shuffle data of expired shuffle keys in the production environment. It's recommended to introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record the progress of cleaning up expired shuffle keys.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/4b5a0b79a35e4ddbb18ddccfe2ec06d7)

Closes #2557 from SteNicholas/CELEBORN-1459.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-18 16:31:57 +08:00
Xianming Lei
999510b265 [CELEBORN-1444] Introduce worker decommission metrics and corresponding REST API
### What changes were proposed in this pull request?

Introduce worker decommission metrics and corresponding REST API.

### Why are the changes needed?

In a production environment, due to certain hardware or environmental reasons, our script will automatically decommission the node. At this time, we need to distinguish between graceful shutdown nodes and decommissioned nodes.

If we distinguish shutdown worker and decommission worker metrics, we can achieve better operation and maintenance.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

- `DefaultMetaSystemSuiteJ#testHandleReportWorkerDecommission`
- `RatisMasterStatusSystemSuiteJ#testHandleReportWorkerDecommission`
- `ApiMasterResourceSuite#decommissionWorkers`
- `ApiWorkerResourceSuite#isDecommissioning`

Closes #2535 from leixm/issue_1444.

Lead-authored-by: Xianming Lei <jerrylei@apache.org>
Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-06-08 11:10:31 +08:00
SteNicholas
4fc42d7fef
[CELEBORN-1389] Bump Dropwizard version from 3.2.6 to 4.2.25
### What changes were proposed in this pull request?

Bump Dropwizard version from 3.2.6 to 4.2.25. Meanwhile, introduce `metrics_jvm_thread_peak_count_Value` and `metrics_jvm_thread_total_started_count_Value` in `celeborn-jvm-dashboard.json`.

### Why are the changes needed?

Dropwizard metrics has released v4.2.25 including some bugfixes and improvements including:

* [JVM] Fix maximum/total memory calculation: https://github.com/dropwizard/metrics/pull/3125
* [Thread] Add peak and total started thread count to `ThreadStatesGaugeSet`: https://github.com/dropwizard/metrics/pull/1601

Meanwhile, Ratis version has upgraded to 3.0.1 which has no compatibility problem with Dropwizard 4.2.25.

Backport:

- https://github.com/apache/spark/pull/26332
- https://github.com/apache/spark/pull/29426
- https://github.com/apache/spark/pull/37372

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2540 from SteNicholas/CELEBORN-1389.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-04 19:26:20 +08:00
Shuang
308eed28c9 [CELEBORN-1427] Add Capacity metrics for Celeborn
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
The Celeborn cluster does not currently provide metrics for 'TotalCapacity' and 'TotalFreeCapacity

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

Closes #2521 from RexXiong/CELEBORN-1427.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-23 16:06:11 +08:00
SteNicholas
db163bd793 [CELEBORN-1317][FOLLOWUP] Improve parameters, description and document of REST API
### What changes were proposed in this pull request?

Improve parameters, description and document of Celeborn REST API, including:

1. The POST request uses `FormParam` instead of `QueryParam`.
2. The parameter name uses lowercase instead of uppercase.
3. The description of `/exclude` aligns with document in `monitoring.md`.
4. The document of `REST API` adds the `Method` and `Parameters` to document GET/POST method and corresponding interface.

### Why are the changes needed?

The parameters, description and document of REST API need to improve after http server refine.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2495 from SteNicholas/CELEBORN-1317.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-09 17:41:13 +08:00
Shuang
9a9abfe3bc [CELEBORN-1245][FOLLOWUP] Fix SendWorkerEvent in HA mode
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
Handle worker event use wrong request.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`RatisMasterStatusSystemSuiteJ#testHandleWorkerEvent`

Closes #2493 from RexXiong/CELEBORN-1245-FOLLOW-UP.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-07 15:16:47 +08:00
SteNicholas
3ac769e4fa
[CELEBORN-1236][FOLLOWUP] Gauge is_terminating, is_terminated and is_shutdown should represent a single numerical value
### What changes were proposed in this pull request?

Gauge `is_terminating`, `is_terminated` and `is_shutdown` should represent a single numerical value instead of boolean value.

### Why are the changes needed?

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. The value type of `is_terminating`, `is_terminated` and `is_shutdown` should be numerical, otherwise `AbstractSource#addGauge` would warn the failed log as follows:

```
2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_terminating failed, the value type class java.lang.Boolean is not a number
2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_terminated failed, the value type class java.lang.Boolean is not a number
2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_shutdown failed, the value type class java.lang.Boolean is not a number
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2457 from SteNicholas/CELEBORN-1236.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-15 11:34:34 +08:00
CodingCat
c788c38025
[CELEBORN-1328] Introduce ActiveSlotsCount metric to monitor the number of active slots
### What changes were proposed in this pull request?

Introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster.

### Why are the changes needed?

It's recommended to introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

In our test cluster (we can see the value of activeSlots increases and then back to 0 after the application finished, and slotsAllocated is increasing all the way).

![image](https://github.com/apache/incubator-celeborn/assets/678008/c05aa763-11ad-4bbd-9ae0-dd6a9cb01ac5)

Closes #2386 from CodingCat/slots_decrease.

Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-08 11:08:05 +08:00
SteNicholas
0054930ce7
[CELEBORN-1323] Introduce ShutdownWorkerCount metric to record the count of workers in shutdown list
### What changes were proposed in this pull request?

Introduce `ShutdownWorkerCount` metric to record the count of workers in shutdown list.

<img width="1432" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/bc84b281-30ca-40a1-92e4-fb9cf10b5aeb">

### Why are the changes needed?

`/shutdownWorkers` lists all shutdown workers of the master at present. Therefore it's recommended to introduce ShutdownWorkerCount metric to record the count of workers in shutdown list.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/c44822917403401690edb15617ec9f08)

Closes #2379 from SteNicholas/CELEBORN-1323.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-12 16:01:22 +08:00
SteNicholas
dee4afc580
[CELEBORN-1322] Rename LostWorkers metric to LostWorkerCount to align the naming style
### What changes were proposed in this pull request?

Rename `LostWorkers` metric to `LostWorkerCount` to align the naming style of other worker count metrics.

### Why are the changes needed?

The naming of `LostWorkers` metric is different from other metric of `MasterSource` like `WorkerCount`, `ExcludedWorkerCount` etc, which could be renamed to `LostWorkerCount` to align the naming style.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2378 from SteNicholas/CELEBORN-1322.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-11 20:41:22 +08:00
liangyongyuan
4ddc91afda [CELEBRON-1282] Optimize push data replica error message
### What changes were proposed in this pull request?
Optimize the handling of exceptions during the push of replica data, now only throwing PUSH_DATA_CONNECTION_EXCEPTION_REPLICA in specific scenarios.

### Why are the changes needed?
When handling exceptions related to pushing replica data in the worker, unmatched exceptions, such as 'file already closed,' are uniformly transformed into REPLICATE_DATA_CONNECTION_EXCEPTION_COUNT and returned to the client. The client then excludes the peer node based on this count, which may not be appropriate in certain scenarios. For instance, in the case of an exception like 'file already closed,' it typically occurs during multiple splits and commitFile operations. Excluding a large number of nodes under such circumstances is clearly not in line with expectations.
![image](https://github.com/apache/incubator-celeborn/assets/46274164/816d21ad-1f79-45f0-bbe7-e93e15389edd)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
through exist uts

Closes #2323 from lyy-pineapple/CELEBORN-1282.

Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
2024-02-26 12:55:26 +08:00
SteNicholas
a1c9d01739 [CELEBORN-1056] Introduce Rest API of listing dynamic configuration
### What changes were proposed in this pull request?

Introduce Rest API of listing dynamic configuration `/listDynamicConfigs` to list the dynamic configs. The result of `/listDynamicConfigs` is as follows:

```
=========================== Dynamic Configuration ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold          100000
celeborn.worker.flusher.buffer.size                                           64k
=========================== SYSTEM ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold          200000
celeborn.worker.flusher.buffer.size                                           128k
=========================== TENANT ============================
=========================== Tenant: tenantId1 ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold          300000
celeborn.worker.flusher.buffer.size                                           256k
=========================== TENANT_USER ============================
=========================== Tenant: tenantId1, Name: user1 ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold          400000
celeborn.worker.flusher.buffer.size                                           512k
```

### Why are the changes needed?

Celeborn supports dynamic configuration with `ConfigService` at present. It's recommend to introduce Rest API of dynamic configuration management.

### Does this PR introduce _any_ user-facing change?

- Introduce Rest API of listing dynamic configuration: `/listDynamicConfigs?level=[system|tenant|tenant_user]&tenant=tenantId1&name=user1`.

### How was this patch tested?

- `HttpUtilsSuite#CELEBORN-1056: Introduce Rest API of listing dynamic configuration`

Closes #2311 from SteNicholas/CELEBORN-1056.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-23 10:30:11 +08:00
SteNicholas
05fa11b3a0 [CELEBORN-1174] Introduce application dimension resource consumption metrics
### What changes were proposed in this pull request?

Introduce application dimension resource consumption metrics for `ResourceConsumptionSource`.

### Why are the changes needed?

`ResourceConsumption` namespace metrics are generated for each user and they are identified using a metric tag at present. It's recommended to introduce application dimension resource consumption metrics that expose application dimension resource consumption of Master and Worker. By monitoring resource consumption in the application dimension, you can obtain the actual situation of application resource consumption.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `WorkerInfoSuite#WorkerInfo toString output`
- `PbSerDeUtilsTest#fromAndToPbResourceConsumption`
- `MasterStateMachineSuitej#testObjSerde`

Closes #2161 from SteNicholas/CELEBORN-1174.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-01 15:24:29 +08:00
Shuang
e71d912d50 [CELEBORN-1245] Support Celeborn Master(Leader) to manage workers
### What changes were proposed in this pull request?
1. Support Celeborn Master(Leader) to manage workers by sending event when heartbeat
2. Add Worker Status to Worker then we can know the status of the workers(such as during decommission...)
3. Add Http interface for master to handleWorkerEvent/getWorkerEvent

### Why are the changes needed?
Currently, we only support managing the status of workers on the worker side. This pr supports the master to manage the status of all workers. By sending events such as (Decommission/Graceful/Exit) when heartbeat, workers can be asynchronously execute the command from master. MeanWhile we can't know what the worker status during worker decommission so this pr add worker status to tell the exactly status of the worker.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA

Closes #2255 from RexXiong/CELEBORN-1245.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-02-01 09:44:59 +08:00
Angerszhuuuu
3ffed66c40 [CELEBORN-1236][METRICS] Celeborn add metrics about thread pool
### What changes were proposed in this pull request?
Add metrics about worker's thread pool, help admin to observe the thread pool's work status.

ThreadPool list as below:

1. celeborn-dispatcher
2. celeborn-netty-rpc-connection-executor
3. worker-disk-{mount_point}-cleaner
4. worker-device-checker
5. flusher-{mount_point}
6. worker-file-sorter-executor
7. worker-data-replicator
8. worker-files-committer
9. worker-expired-shuffle-cleaner

```
metrics_active_thread_count_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484
metrics_pending_task_count_Value{role="Worker",threadPool="celeborn-dispatcher"} 0 1706237338484
metrics_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484
metrics_core_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484
metrics_maximum_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484
metrics_largest_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484
metrics_active_thread_count_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 0 1706237338484
metrics_pending_task_count_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 0 1706237338484
metrics_pool_size_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 0 1706237338484
metrics_core_pool_size_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 64 1706237338484
metrics_maximum_pool_size_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 64 1706237338484
metrics_largest_pool_size_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 1 1706237338484
metrics_active_thread_count_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 0 1706237338484
metrics_pending_task_count_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 0 1706237338484
metrics_pool_size_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 0 1706237338484
metrics_core_pool_size_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 4 1706237338484
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 4 1706237338484
metrics_largest_pool_size_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 0 1706237338485
metrics_active_thread_count_Value{role="Worker",threadPool="worker-device-checker"} 0 1706237338485
metrics_pending_task_count_Value{role="Worker",threadPool="worker-device-checker"} 0 1706237338485
metrics_pool_size_Value{role="Worker",threadPool="worker-device-checker"} 2 1706237338485
metrics_core_pool_size_Value{role="Worker",threadPool="worker-device-checker"} 5 1706237338485
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-device-checker"} 5 1706237338485
metrics_largest_pool_size_Value{role="Worker",threadPool="worker-device-checker"} 2 1706237338485
metrics_thread_count_Value{role="Worker",threadPool="LocalFlusher1441328175-/"} 2 1706237338485
metrics_thread_is_terminated_count_Value{role="Worker",threadPool="LocalFlusher1441328175-/"} 0 1706237338485
metrics_thread_is_shutdown_count_Value{role="Worker",threadPool="LocalFlusher1441328175-/"} 0 1706237338485
metrics_active_thread_count_Value{role="Worker",threadPool="worker-file-sorter-executor"} 0 1706237338485
metrics_pending_task_count_Value{role="Worker",threadPool="worker-file-sorter-executor"} 0 1706237338485
metrics_pool_size_Value{role="Worker",threadPool="worker-file-sorter-executor"} 0 1706237338485
metrics_core_pool_size_Value{role="Worker",threadPool="worker-file-sorter-executor"} 24 1706237338485
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-file-sorter-executor"} 24 1706237338485
metrics_largest_pool_size_Value{role="Worker",threadPool="worker-file-sorter-executor"} 0 1706237338485
metrics_active_thread_count_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485
metrics_pending_task_count_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485
metrics_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485
metrics_core_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 64 1706237338485
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 64 1706237338485
metrics_largest_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485
metrics_active_thread_count_Value{role="Worker",threadPool="worker-files-committer"} 0 1706237338485
metrics_pending_task_count_Value{role="Worker",threadPool="worker-files-committer"} 0 1706237338485
metrics_pool_size_Value{role="Worker",threadPool="worker-files-committer"} 0 1706237338485
metrics_core_pool_size_Value{role="Worker",threadPool="worker-files-committer"} 32 1706237338485
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-files-committer"} 32 1706237338485
metrics_largest_pool_size_Value{role="Worker",threadPool="worker-files-committer"} 0 1706237338485
metrics_active_thread_count_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 0 1706237338485
metrics_pending_task_count_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 0 1706237338485
metrics_pool_size_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 2 1706237338485
metrics_core_pool_size_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 64 1706237338485
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 64 1706237338485
metrics_largest_pool_size_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 2 1706237338485
```
### Why are the changes needed?
Help observe server status

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
MT

Closes #2239 from AngersZhuuuu/CLEBORN-1236.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2024-01-26 18:14:05 +08:00
Angerszhuuuu
4709251bb4
[CELEBORN-1246] Introduce OpenStreamSuccessCount, FetchChunkSuccessCount and WriteDataSuccessCount metric to expose the count of opening stream, fetching chunk and writing data successfully
### What changes were proposed in this pull request?

Introduce `OpenStreamSuccessCount`, `FetchChunkSuccessCount` and `WriteDataSuccessCount` metric to expose the count of opening stream, fetching chunk and writing data successfully in current worker.

### Why are the changes needed?

The ratio of opening stream, fetching chunk and writing data failed is important for Celeborn performance to balance the healty of cluster, which is lack of the count of opening stream, fetching chunk and writing data successfully.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2252 from AngersZhuuuu/CELEBORN-1246.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-24 10:44:28 +08:00
xianminglei
b90fb1fdb2 [CELEBORN-1237][METRICS] Refactor metrics name
### What changes were proposed in this pull request?
Refactor metrics name.

### Why are the changes needed?
Easier to understand the meaning of metrics

### Does this PR introduce _any_ user-facing change?
METRICS.md
migration.md
monitoring.md

### How was this patch tested?
Existing UTs.

Closes #2240 from leixm/metrics_name.

Authored-by: xianminglei <xianming.lei@shopee.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-18 18:15:43 +08:00
SteNicholas
402d23d0ea
[CELEBORN-1223] Align master and worker metrics of document with MasterSource and WorkerSource
### What changes were proposed in this pull request?

Align master and worker metrics of document with `MasterSource` and `WorkerSource` in `METRICS.md` and `monitoring.md`.

### Why are the changes needed?

Metrics of master and worker is inconsistent with `MasterSource` and `WorkerSource` at present. It is recommended to align master and worker metrics of document with `MasterSource` and `WorkerSource`:

- PushDataHandshakeFailCount
- RegionStartFailCount
- RegionFinishFailCount
- PrimaryPushDataHandshakeTime
- ReplicaPushDataHandshakeTime
- PrimaryRegionStartTime
- ReplicaRegionStartTime
- PrimaryRegionFinishTime
- ReplicaRegionFinishTime
- ActiveConnectionCount
- BufferStreamReadBuffer
- ReadBufferDispatcherRequestsLength
- ReadBufferAllocatedCount
- CreditStreamCount
- ActiveMapPartitionCount
- DeviceOSFreeBytes
- DeviceOSTotalBytes
- DeviceCelebornFreeBytes
- DeviceCelebornTotalBytes
- PotentialConsumeSpeed
- UserProduceSpeed
- WorkerConsumeSpeed

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2226 from SteNicholas/CELEBORN-1223.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-16 16:19:39 +08:00
SteNicholas
4b5e23db37
[CELEBORN-1215] Introduce PausePushDataAndReplicateTime metric to record time for a worker to stop receiving pushData from clients and other workers
### What changes were proposed in this pull request?

Introduce `PausePushDataAndReplicateTime` metric to record time for a worker to stop receiving pushData from clients and other workers.

### Why are the changes needed?

`PausePushData` means the count for a worker to stop receiving pushData from clients because of back pressure. Meanwhile, `PausePushDataAndReplicate` means the count for a worker to stop receiving pushData from clients and other workers because of back pressure. Therefore,`PausePushDataTime` records the time for a worker to stop receiving pushData from clients or other workers, of which definition is confusing for users. It's recommended that `PausePushDataAndReplicateTime` metric is introduced that means the time for a worker to stop receiving pushData from clients and other workers because of back pressure.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)
- `MemoryManagerSuite#[CELEBORN-882] Test MemoryManager check memory thread logic`

Closes #2221 from SteNicholas/CELEBORN-1215.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-10 19:55:04 +08:00
SteNicholas
0cd1291f6c
[CELEBORN-1214] Introduce WriteDataHardSplitCount metric to record HARD_SPLIT partitions of PushData and PushMergedData
### What changes were proposed in this pull request?

Introduce `WriteDataHardSplitCount` metric to record `HARD_SPLIT` partitions of PushData and PushMergedData.

### Why are the changes needed?

As the log level of `PushDataHandler#handlePushData` and `PushDataHandler#handlePushMergedData` use the DEBUG level, `WriteDataHardSplitCount` metric shoud be introduced to record HARD_SPLIT partitions of PushData and PushMergedData for `PushDataHandler`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)

Closes #2217 from SteNicholas/CELEBORN-1214.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-09 21:54:53 +08:00
SteNicholas
29e930488b
[CELEBORN-1100] Introduce ChunkStreamCount, OpenStreamFailCount metrics about opening stream of FetchHandler
### What changes were proposed in this pull request?

Introduces `ChunkStreamCount`, `OpenStreamFailCount` metrics about opening stream of `FetchHandler`:

- `WorkerSource` adds `ChunkStreamCount`, `OpenStreamFailCount` metrics.
- Corrects the grafana dashboard of `celeborn-dashboard.json`. `celeborn-dashboard.json` has been verified via [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s). For example:
  1. `"expr": "metrics_RunningApplicationCount_Value"`
  2. Moves the panel positition of `FetchChunkFailCount` to `FetchRelatives` instead of `PushRelatives`.
  3. Updates the `gridPos` of some panels.

### Why are the changes needed?

There are no any metrics about opening stream of `FetchHandler` for Celeborn Worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)

Closes #2212 from SteNicholas/CELEBORN-1100.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-05 17:05:35 +08:00
SteNicholas
e7e39a51be
[CELEBORN-1189] Introduce RunningApplicationCount metric and /applications API to record running applications of worker
### What changes were proposed in this pull request?

Introduce `RunningApplicationCount` metric and `/applications` API to record running applications for Celeborn worker.

### Why are the changes needed?

`RunningApplicationCount` metrics only monitors the count of running applications in the cluster for master. Meanwhile, `/listTopDiskUsedApps` API lists the top disk usage application ids for master and worker. Therefore `RunningApplicationCount` metric and `/applications` API could be introduced to record running applications of worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2172 from SteNicholas/CELEBORN-1189.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-27 09:51:16 +08:00
SteNicholas
277f7ced57
[CELEBORN-1187] Unify the size and file count of active shuffle metrics for master and worker
### What changes were proposed in this pull request?

Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.

### Why are the changes needed?

`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2171 from SteNicholas/CELEBORN-1187.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: 蒋晓峰 <jiangxiaofeng@bilibili.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-22 17:07:39 +08:00
SteNicholas
850d3199ef [CELEBORN-1164] Introduce FetchChunkFailCount metric to expose the count of fetching chunk failed in current worker
### What changes were proposed in this pull request?

Introduce `FetchChunkFailCount` metric to expose the count of fetching chunk failed in current worker.

### Why are the changes needed?

The metrics about the count of PushData or PushMergedData failed in current worker is supported at present. It's better to support `FetchChunkFailCount` metric to expose the count of fetching chunk failed in current worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal test.

Closes #2151 from SteNicholas/CELEBORN-1164.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 23:01:16 +08:00
SteNicholas
52eddc59f3
[CELEBORN-448] Support exclude worker manually
### What changes were proposed in this pull request?

Support exclude worker manually given worker id. This worker is added into excluded workers manually.

### Why are the changes needed?

Celeborn supports to shuffle client-side fetch and push exclude workers on failure at present. It's necessary to exclude worker manually for maintaining the Celeborn cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `HttpUtilsSuite`
- `DefaultMetaSystemSuiteJ#testHandleWorkerExclude`
- `RatisMasterStatusSystemSuiteJ#testHandleWorkerExclude`
- `MasterStateMachineSuiteJ#testObjSerde`

Closes #1997 from SteNicholas/CELEBORN-448.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-07 16:25:24 +08:00
SteNicholas
b45b63f9a5
[CELEBORN-247][FOLLOWUP] Add metrics for each user's quota usage of Celeborn Worker
### What changes were proposed in this pull request?

Add the metric `ResourceConsumption` to monitor each user's quota usage of Celeborn Worker.

### Why are the changes needed?

The metric `ResourceConsumption` supports to monitor each user's quota usage of Celeborn Master at present. The usage of Celeborn Worker also needs to monitor.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2059 from SteNicholas/CELEBORN-247.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-01 15:48:31 +08:00
Fu Chen
349ee8b1cb Revert "[CELEBORN-255] Add counter of outstandingFetches, outstanding…
…Rpcs and outstandingPushes to metrics"

This reverts commit bfa341c32f.

### What changes were proposed in this pull request?

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1992#issuecomment-1776760369

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2032 from cfmcgrady/revert-pr-1992.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-24 17:18:54 +08:00
SteNicholas
11c90d8e72
[CELEBORN-916] Add new metric about active shuffle file count in worker
### What changes were proposed in this pull request?

Adds new metric `ActiveShuffleFileCount` about active shuffle file count of Celeborn Worker.

### Why are the changes needed?

`ActiveShuffleSize` metric report the active shuffle size of peer worker at present. Therefore, it's better to introduce `ActiveShuffleFileCount` to report the active shuffle file count of Celeborn Worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2009 from SteNicholas/CELEBORN-916.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-23 11:15:18 +08:00
SteNicholas
7276dd024c
[CELEBORN-1035] Expose RunningApplicationCount, PartitionWritten and PartitionFileCount metric by Celeborn master
### What changes were proposed in this pull request?

Meta manager records `appHeartbeatTime`, `partitionTotalWritten` and `partitionTotalFileCount`, which are useful to monitor the application heartbeat and shuffle partition. `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics are exposed by Celeborn master to monitor the application and shuffle partition.

### Why are the changes needed?

`Master` exposes `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

Internal tests.

Closes #1976 from SteNicholas/CELEBORN-1035.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-19 22:07:17 +08:00
SteNicholas
bfa341c32f [CELEBORN-255] Add counter of outstandingFetches, outstandingRpcs and outstandingPushes to metrics
### What changes were proposed in this pull request?

Add counter of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` to metrics of Celeborn Worker.

### Why are the changes needed?

The counter of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` could be added to metrics to monitor `outstandingFetches`, `outstandingRpcs` and `outstandingPushes`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`TransportResponseHandlerSuiteJ`

Closes #1992 from SteNicholas/CELEBORN-255.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 21:16:57 +08:00
SteNicholas
f2d6cc7525 [CELEBORN-829] Improve response message of invalid HTTP request
### What changes were proposed in this pull request?

Improve response message of invalid HTTP request, which lists available API providers like as below:

- master

```
Invalid uri of the master. Available API providers include:
/applications        List all running application's ids of the cluster.
/conf                List the conf setting of the master.
/excludedWorkers     List all excluded workers of the master.
/help                List the available API providers of the master.
/hostnames           List all running application's LifecycleManager's hostnames of the cluster.
/listTopDiskUsedApps List the top disk usage application ids. It will return the top disk usage application ids for the cluster.
/lostWorkers         List all lost workers of the master.
/masterGroupInfo     List master group information of the service. It will list all master's LEADER, FOLLOWER information.
/shuffles            List all running shuffle keys of the service. It will return all running shuffle's key of the cluster.
/shutdownWorkers     List all shutdown workers of the master.
/threadDump          List the current thread dump of the master.
/workerInfo          List worker information of the service. It will list all registered workers 's information.
```

- worker

```
Invalid uri of the worker. Available API providers include:
/conf                      List the conf setting of the worker.
/exit                      Trigger this worker to exit. Legal types are 'DECOMMISSION‘, 'GRACEFUL' and 'IMMEDIATELY'
/help                      List the available API providers of the worker.
/isRegistered              Show if the worker is registered to the master success.
/isShutdown                Show if the worker is during the process of shutdown.
/listPartitionLocationInfo List all the living PartitionLocation information in that worker.
/listTopDiskUsedApps       List the top disk usage application ids. It only return application ids running in that worker.
/shuffles                  List all the running shuffle keys of the worker. It only return keys of shuffles running in that worker.
/threadDump                List the current thread dump of the worker.
/unavailablePeers          List the unavailable peers of the worker, this always means the worker connect to the peer failed.
/workerInfo                List the worker information of the worker.
```

### Why are the changes needed?

Response message of invalid HTTP request could not help users with correct HTTP path.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`HttpUtilsSuite#CELEBORN-829: Improve response message of invalid HTTP request`

Closes #1986 from SteNicholas/CELEBORN-829.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 16:37:51 +08:00
sychen
dd65e74f99 [CELEBORN-983] Rename PrometheusMetric configuration
### What changes were proposed in this pull request?
Replace
```properties
celeborn.metrics.master.prometheus.host
celeborn.metrics.master.prometheus.port
celeborn.metrics.worker.prometheus.host
celeborn.metrics.worker.prometheus.port
```
With
```properties
celeborn.master.http.host
celeborn.master.http.port
celeborn.worker.http.host
celeborn.worker.http.port
```

### Why are the changes needed?
The `celeborn.master.metrics.prometheus.port` and `celeborn.metrics.worker.prometheus.port` bind port not only serve prometheus metrics, but also provide some useful API services.

https://celeborn.apache.org/docs/latest/monitoring/#rest-api

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1919 from cxzl25/CELEBORN-983.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 13:28:58 +08:00
sychen
9c07ceddb0 [CELEBORN-1028][FOLLOWUP][DOCS] Make prometheus path configurable
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/1965#issuecomment-1755345813

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

<img width="1410" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/6454133a-040b-4dde-84b7-dbf08fb15b13">

<img width="1401" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/3cdfa9f2-9a7a-43cb-9006-77810a350669">

Closes #1974 from cxzl25/CELEBORN-1028-FOLLOWUP.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 22:59:22 +08:00
sychen
5310bcaf6b
[CELEBORN-313] Add rest endpoint to show master group info
### What changes were proposed in this pull request?

<img width="1347" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/43d10bff-6878-4591-9461-889494d797f9">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

```bash
./bin/celeborn-ratis sh -Draft.rpc.type=NETTY  group info   -peers clb-1:9872,clb-2:9873,clb-3:9874
```

```
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 1(clb-1:9872)

[server {
  id: "1"
  address: "clb-1:9872"
  clientAddress: "clb-1:9097"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "2"
  address: "clb-2:9873"
  clientAddress: "clb-2:9098"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "3"
  address: "clb-3:9874"
  clientAddress: "clb-3:9099"
  startupRole: FOLLOWER
}
commitIndex: 316
]
```

```bash
curl http://clb-3:9983/masterGroupInfo
```

```
====================== Master Group INFO ==============================
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 1(clb-1:9872)

[server {
  id: "3"
  address: "clb-3:9874"
  clientAddress: "clb-3:9099"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "1"
  address: "clb-1:9872"
  clientAddress: "clb-1:9097"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "2"
  address: "clb-2:9873"
  clientAddress: "clb-2:9098"
  startupRole: FOLLOWER
}
commitIndex: 316
]
```

Closes #1946 from cxzl25/CELEBORN-313.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 20:08:31 +08:00
sychen
b2b7c4d359 [CELEBORN-991][DOC] Remove incorrect spark.metrics.conf
### What changes were proposed in this pull request?
1. Replace `spark.metrics.conf` with `celeborn.metrics.conf`.
2. Fix broken links.
https://celeborn.apache.org/docs/latest/monitoring/#metrics

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1925 from cxzl25/CELEBORN-991.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-20 09:03:27 +08:00
sychen
4d35e501a3 [CELEBORN-984][DOC] shutdownWorkers API documentation
### What changes were proposed in this pull request?
https://celeborn.apache.org/docs/latest/monitoring/#master_1

07c1dc2568/service/src/main/scala/org/apache/celeborn/server/common/http/HttpRequestHandler.scala (L74-L75)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1920 from cxzl25/CELEBORN-984.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-18 19:58:11 +08:00
SteNicholas
92777c3ff2 [CELEBORN-927][DOC] Correct celeborn.metrics.conf.*.sink.csv.class configuration example for a CSV sink
### What changes were proposed in this pull request?

Correct `celeborn.metrics.conf.*.sink.csv.class` configuration example for a CSV sink.

### Why are the changes needed?

`celeborn.metrics.conf.*.sink.csv.class` configuration example for a CSV sink is wrong, which value should be `org.apache.celeborn.common.metrics.sink.CsvSink`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

None.

Closes #1865 from SteNicholas/CELEBORN-927.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-30 16:11:03 +08:00
Angerszhuuuu
17de30009b [CELEBORN-847] Support use RESTful API to trigger worker exit and exitImmediately
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1768 from AngersZhuuuu/CELEBORN-847.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-15 20:04:26 +08:00
Fu Chen
516bdc7e08
[CELEBORN-877][DOC] Document on SBT
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

Closes #1795 from cfmcgrady/sbt-docs.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-11 12:17:55 +08:00
Angerszhuuuu
bacfb54447 [CELEBORN-832] Support use RESTful API to trigger worker decommission
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1759 from AngersZhuuuu/CELEBORN-832.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-27 15:40:14 +08:00
Angerszhuuuu
00c36fda99 [CELEBORN-828] Merge Monitoring to Development doc
### What changes were proposed in this pull request?
As title

<img width="1610" alt="截屏2023-07-24 上午11 34 43" src="https://github.com/apache/incubator-celeborn/assets/46485123/ba1b040b-9ea4-4c93-b055-75a469365ff2">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1751 from AngersZhuuuu/CELEBORN-828.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-24 15:37:32 +08:00