Commit Graph

77 Commits

Author SHA1 Message Date
xxx
a9490d6e24 [CELEBORN-2118] Introduce IsHighWorkload metric to monitor worker overload status
### What changes were proposed in this pull request?

Introduce `IsHighWorkload` metric to monitor worker overload status.

### Why are the changes needed?

There is currently no metric to monitor worker overload status.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Grafana test](https://xy2953396112.grafana.net/public-dashboards/22ab1750ef874a1bb39b5879b81a24cf).

Closes #3435 from xy2953396112/CELEBORN-2118.

Authored-by: xxx <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-25 20:46:17 +08:00
xxx
661a096b77 [CELEBORN-2112] Introduce PausePushDataStatus and PausePushDataAndReplicateStatus metric to record status of pause push data
### What changes were proposed in this pull request?

Add `PausePushDataStatus` and `PausePushDataAndReplicateStatus` metrics.

### Why are the changes needed?

Introduce `PausePushDataStatus` and `PausePushDataAndReplicateStatus` metrics to record the pause status of push data.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test. [Grafana](https://xy2953396112.grafana.net/public-dashboards/21af8e2844234c438e74c741211f0032)

Closes #3426 from xy2953396112/CELEBORN-2112.

Authored-by: xxx <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-21 11:17:44 +08:00
dz
11b41f97ad [CELEBORN-2102] Introduce SorterCacheHitRate metric to monitor the hit rate of index cache for sorter
### What changes were proposed in this pull request?

Introduce `SorterCacheHitRate` metric to monitor the hit rate of the index cache for the sorter.

### Why are the changes needed?

Monitor the hit rate of `PartitionFilesSorter#indexCache`.
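
As a rough illustration of what a hit-rate gauge computes, here is a minimal sketch. The class `HitRateTracker` and its methods are hypothetical and are not taken from `PartitionFilesSorter`; the sketch only assumes that the hit rate is hits / (hits + misses) and that it reports 0 before any lookup has happened.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper, not Celeborn code: tracks cache lookups and derives a hit rate.
final class HitRateTracker {
  private final LongAdder hits = new LongAdder();
  private final LongAdder misses = new LongAdder();

  void recordHit() { hits.increment(); }

  void recordMiss() { misses.increment(); }

  // Returns hits / (hits + misses), or 0.0 before any lookup has been recorded.
  double hitRate() {
    long h = hits.sum();
    long total = h + misses.sum();
    return total == 0 ? 0.0 : (double) h / total;
  }
}
```

A gauge registered on such a tracker would simply return `hitRate()` each time it is scraped.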

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The verified grafana dashboard: https://xy2953396112.grafana.net/public-dashboards/5d1177ee0f784b53ad817fde919141b7

Closes #3416 from xy2953396112/CELEBORN_2102.

Authored-by: dz <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-20 10:47:38 +08:00
SteNicholas
4540b5772b [MINOR] Document introduced metrics into monitoring.md
### What changes were proposed in this pull request?

Document introduced metrics into `monitoring.md` including `FetchChunkTransferTime`, `FetchChunkTransferSize`, `FlushWorkingQueueSize`, `LocalFlushCount`, `LocalFlushSize`, `HdfsFlushCount`, `HdfsFlushSize`, `OssFlushCount`, `OssFlushSize`, `S3FlushCount`, `S3FlushSize`.

### Why are the changes needed?

The introduced metrics `FetchChunkTransferTime`, `FetchChunkTransferSize`, `FlushWorkingQueueSize`, `LocalFlushCount`, `LocalFlushSize`, `HdfsFlushCount`, `HdfsFlushSize`, `OssFlushCount`, `OssFlushSize`, `S3FlushCount`, `S3FlushSize` are not documented in `monitoring.md`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #3398 from SteNicholas/document-monitoring.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-07-29 14:33:46 +08:00
mingji
7a0eee332a [CELEBORN-2045] Add logger sinks to allow persist metrics data and avoid possible worker OOM
### What changes were proposed in this pull request?
1. Add a new sink and allow the user to store metrics to files.
2. Celeborn will scrape its metrics periodically to make sure that the metric data won't grow large enough to cause OOM.
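
The idea behind the second point can be sketched as follows. This is not the code from this PR: `TimeSamples` and its methods are hypothetical names, and the sketch only assumes that timing samples accumulate in a queue until something drains them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch, not Celeborn code: timing samples that are drained on every scrape.
final class TimeSamples {
  private final ConcurrentLinkedQueue<Long> samples = new ConcurrentLinkedQueue<>();

  void record(long nanos) { samples.add(nanos); }

  int pending() { return samples.size(); }

  // Removes and returns everything recorded so far, so memory use is bounded
  // by the number of samples recorded between two scrapes.
  List<Long> drain() {
    List<Long> out = new ArrayList<>();
    Long v;
    while ((v = samples.poll()) != null) {
      out.add(v);
    }
    return out;
  }
}
```

A `ScheduledExecutorService` could call `drain()` at a fixed interval and hand the result to a sink, so the queue is emptied even when no external system scrapes the worker.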

### Why are the changes needed?
A long-running worker ran out of memory, and the heap dump showed that the metrics data had grown huge.
As you can see below, the biggest object is the time metric queue, which held 1.6 million records.
<img width="1516" alt="Screenshot 2025-06-24 at 09 59 30" src="https://github.com/user-attachments/assets/691c7bc2-b974-4cc0-8d5a-bf626ab903c0" />
<img width="1239" alt="Screenshot 2025-06-24 at 14 45 10" src="https://github.com/user-attachments/assets/ebdf5a4d-c941-4f1e-911f-647aa156b37a" />

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

Closes #3346 from FMX/b2045.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <ethanfeng@apache.org>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-26 18:42:20 -07:00
Sanskar Modi
2a2c6e4687 [CELEBORN-2024] Publish commit files fail count metrics
### What changes were proposed in this pull request?
Added a commit files request fail count metric.

### Why are the changes needed?
To monitor and tune the configurations around the commit files workflow.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Local setup

<img width="739" alt="Screenshot 2025-06-04 at 10 51 06 AM" src="https://github.com/user-attachments/assets/d6256028-d8b7-4a81-90b1-3dcbf61adeba" />

Closes #3307 from s0nskar/commit_metric.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-17 11:52:45 -07:00
Sanskar Modi
80bdb46801 [CELEBORN-1892] Adding register with master fail count metric for worker
### What changes were proposed in this pull request?

Add a metric for the worker that counts failed registrations with the master.

### Why are the changes needed?

This helps monitor cases where workers are unable to register with the master, for example when wrong endpoints are passed or the master becomes unavailable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
Local setup

<img width="724" alt="Screenshot 2025-06-04 at 10 44 56 AM" src="https://github.com/user-attachments/assets/1f84557b-5df8-422f-b602-bb5316a72a0e" />

Closes #3308 from s0nskar/worker_register_metric.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-11 11:04:59 -07:00
SteNicholas
d9984c9e0e [CELEBORN-1800] Introduce ApplicationTotalCount and ApplicationFallbackCount metric to record the total and fallback count of application
### What changes were proposed in this pull request?

Introduce `ApplicationTotalCount` and `ApplicationFallbackCount` metric to record the total and fallback count of application.

### Why are the changes needed?

There is currently no metric to record the total count of applications running with Celeborn shuffle and engine built-in shuffle, or the fallback count of applications. Meanwhile, the fallback of Flink shuffle is based on job granularity rather than shuffle granularity.

Follow up https://github.com/apache/celeborn/pull/3012#issuecomment-2553488532.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `DefaultMetaSystemSuiteJ#testShuffleAndApplicationCountWithFallback`
- `RatisMasterStatusSystemSuiteJ#testShuffleAndApplicationCountWithFallback`

Closes #3026 from SteNicholas/CELEBORN-1800.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-19 07:20:00 -07:00
Sanskar Modi
9ba54b39e2 [CELEBORN-1968] Publish metric for unreleased partition location count when worker was gracefully shutdown
### What changes were proposed in this pull request?

Add a worker metric that publishes the unreleased partition location count when a worker is gracefully shut down.

<img width="742" alt="Screenshot 2025-04-16 at 1 19 18 AM" src="https://github.com/user-attachments/assets/159f744a-cd76-45a2-9387-930f27dd72be" />

### Why are the changes needed?

Similar to https://github.com/apache/celeborn/pull/2711, Celeborn currently doesn't publish the count of unreleased partition locations when a worker exits gracefully. This can be useful for monitoring and configuring the gracefulShutdownTimeout.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
NA

Closes #3213 from s0nskar/unrelease_partition_location.

Lead-authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-12 04:34:44 -07:00
zhengtao
ac0d335f40 [CELEBORN-1831] Add ratis commitIndex metrics
### What changes were proposed in this pull request?
Add two metrics (raft commitIndex of each master and maxCommitIndex - minCommitIndex value).

### Why are the changes needed?
To observe how well metadata is synchronized across the raft cluster.
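
The second metric is the difference between the largest and smallest commit index across masters. A minimal sketch of that arithmetic follows; `CommitIndexGap` and the master ids are hypothetical, and how Celeborn actually collects the per-master values is not shown in this commit message.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch, not Celeborn code: derives the commit index gap across masters.
final class CommitIndexGap {
  // commitIndexByMaster maps a master id to its last reported raft commit index.
  static long gap(Map<String, Long> commitIndexByMaster) {
    if (commitIndexByMaster.isEmpty()) {
      return 0L;
    }
    long max = Collections.max(commitIndexByMaster.values());
    long min = Collections.min(commitIndexByMaster.values());
    return max - min;
  }
}
```

A gap that stays near 0 suggests the masters are in sync, while a growing gap suggests that at least one master is falling behind.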

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Cluster test.
![image](https://github.com/user-attachments/assets/f354a3cd-e3b3-4af0-98c2-fc13330b2d81)

Closes #3063 from zaynt4606/clb1831.

Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-17 10:58:06 +08:00
Nan
ca60613f2f [CELEBORN-1817] add committed file size metrics
### What changes were proposed in this pull request?

This PR adds committed file size metrics for workers.

### Why are the changes needed?

We add this metric because we observed that, likely due to the delayed processing of split messages, some jobs write 40-50g files even though the split threshold is 10g (we use soft split).

We want this metric to monitor the severity of the issue.

### Does this PR introduce _any_ user-facing change?

Yes, one more metric.

### How was this patch tested?

(ignore the dashboard title, it's a dummy one)

![image](https://github.com/user-attachments/assets/d88c15e6-d740-4def-94d5-03666bbb38ca)

Closes #3047 from CodingCat/committed_file_size.

Authored-by: Nan <nzhu@pinterest.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-01-07 10:17:45 +08:00
wuziyi
f886751e80 [CELEBORN-1812] Distinguish sorting-file from sort-tasks waiting to be submitted
### What changes were proposed in this pull request?

The current implementation uses `shuffleSortTaskDeque.size()` as the current sorting file count. This value is more appropriately described as the number of sort tasks waiting to be submitted to `fileSorterExecutors`. The actual number of files currently being sorted (doing disk I/O operations, etc.) should be obtained from `sortingShuffleFiles`.
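
The distinction can be illustrated with a small sketch. `SortTaskTracker` and its fields are hypothetical stand-ins for `shuffleSortTaskDeque` and `sortingShuffleFiles`; this is not the `PartitionFilesSorter` implementation.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingDeque;

// Hypothetical sketch, not the PartitionFilesSorter implementation.
final class SortTaskTracker {
  private final LinkedBlockingDeque<String> waitingTasks = new LinkedBlockingDeque<>();
  private final Set<String> sortingFiles = ConcurrentHashMap.newKeySet();

  void submit(String fileId) { waitingTasks.add(fileId); }

  // Moves the next waiting task into the "sorting" state; returns null if none is waiting.
  String startNext() {
    String fileId = waitingTasks.poll();
    if (fileId != null) {
      sortingFiles.add(fileId);
    }
    return fileId;
  }

  void finish(String fileId) { sortingFiles.remove(fileId); }

  // Tasks queued but not yet handed to a sorter thread.
  int pendingSortTaskCount() { return waitingTasks.size(); }

  // Files that a sorter thread is working on right now.
  int sortingFileCount() { return sortingFiles.size(); }
}
```

With two separate gauges, a long waiting queue and a small sorting set point at too few sorter threads, while a large sorting set points at slow disk I/O.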

### Why are the changes needed?

Add a metric to monitor the files that are currently being sorted and performing disk I/O operations.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

![image](https://github.com/user-attachments/assets/6ffed37e-ad12-4d8d-a4aa-2b2695a92168)

Closes #3040 from Z1Wu/fix/sorting_file_metrics.

Authored-by: wuziyi <wuziyi02@corp.netease.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-04 10:27:53 +08:00
Wang, Fei
81a0d5113c [CELEBORN-1660] Cache available workers and only count the available workers device free capacity
### What changes were proposed in this pull request?
1. cache the available workers
2. Only count the available workers device free capacity.
3. place the metrics_AvailableWorkerCount_Value in overall and metrics_WorkerCount_Value in `Master` part

### Why are the changes needed?
Cache the available workers to avoid looping over all workers frequently.
To have an accurate device capacity overview that does not include the excluded workers, decommissioning workers, etc.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
UT.

<img width="1705" alt="image" src="https://github.com/user-attachments/assets/bee17b4e-785d-4112-8410-dbb684270ec0">

Closes #2827 from turboFei/device_free.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-14 11:10:45 +08:00
SteNicholas
169b6f6973 [CELEBORN-1685] ShuffleFallbackPolicy supports ShuffleFallbackCount metric
### What changes were proposed in this pull request?

1. `ShuffleFallbackPolicy` supports `ShuffleFallbackCount` metric to provide the shuffle fallback count of each fallback policy.
2. Introduce `ShuffleTotalCount` metric to record the total count of shuffle.
3. Fix Spark 2 does not increment shuffle count via `LifecycleManager`.
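
The first two points amount to keeping one counter per fallback policy next to a total counter. A minimal sketch follows; `ShuffleCounters` and the policy names are hypothetical and do not reflect how `LifecycleManager` or the master actually record these values.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch, not Celeborn code: counts shuffles in total and fallbacks per policy.
final class ShuffleCounters {
  private final LongAdder totalCount = new LongAdder();
  private final Map<String, LongAdder> fallbackCounts = new ConcurrentHashMap<>();

  // fallbackPolicy is null when the shuffle did not fall back.
  void recordShuffle(String fallbackPolicy) {
    totalCount.increment();
    if (fallbackPolicy != null) {
      fallbackCounts.computeIfAbsent(fallbackPolicy, k -> new LongAdder()).increment();
    }
  }

  long total() { return totalCount.sum(); }

  long fallbackCount(String policy) {
    LongAdder adder = fallbackCounts.get(policy);
    return adder == null ? 0L : adder.sum();
  }
}
```

Keeping the total alongside the per-policy counts makes it possible to chart a fallback ratio per policy rather than only absolute numbers.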

### Why are the changes needed?

The implementations of `ShuffleFallbackPolicy` do not support the `ShuffleFallbackCount` metric at present. Meanwhile, Bilibili production practice needs the `ShuffleFallbackCount` of different `ShuffleFallbackPolicy` implementations.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster test.

Closes #2891 from SteNicholas/CELEBORN-1685.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-11 10:37:25 +08:00
Wang, Fei
f1bda46de4 [CELEBORN-1680] Introduce ShuffleFallbackCount metrics
### What changes were proposed in this pull request?

As title, introduce metrics_ShuffleFallbackCount_Value.

### Why are the changes needed?
To provide insight into how many shuffles fall back to the Spark built-in shuffle service. This helps us deprecate the ESS progressively.

Currently, we plan to set `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` so that shuffles with too many partitions, for example 50k, fall back.

In the future, we plan to limit the acceptable maximum shuffle partition number so that the bad job would be rejected and not impact the celeborn master health.

### Does this PR introduce _any_ user-facing change?
Yes, new metrics.

### How was this patch tested?
UT.
<img width="1188" alt="image" src="https://github.com/user-attachments/assets/8193c12c-5dc9-4783-b64b-6a8449a1bea4">

Closes #2866 from turboFei/record_fallback.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-07 11:42:17 +08:00
SteNicholas
497bfdf5d7 [CELEBORN-1640] NettyMemoryMetrics supports numHeapArenas, numDirectArenas, tinyCacheSize, smallCacheSize, normalCacheSize, numThreadLocalCaches and chunkSize
### What changes were proposed in this pull request?

`NettyMemoryMetrics` supports `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`. Meanwhile, remove `server_` prefix from metric name of netty memory metric in `monitoring.md`.

### Why are the changes needed?

`PooledByteBufAllocatorMetric` provides the following API to support netty memory metrics:

```
public int numHeapArenas() {
  return this.allocator.numHeapArenas();
}

public int numDirectArenas() {
  return this.allocator.numDirectArenas();
}

public List<PoolArenaMetric> heapArenas() {
  return this.allocator.heapArenas();
}

public List<PoolArenaMetric> directArenas() {
  return this.allocator.directArenas();
}

public int numThreadLocalCaches() {
  return this.allocator.numThreadLocalCaches();
}

public int tinyCacheSize() {
  return this.allocator.tinyCacheSize();
}

public int smallCacheSize() {
  return this.allocator.smallCacheSize();
}

public int normalCacheSize() {
  return this.allocator.normalCacheSize();
}

public int chunkSize() {
  return this.allocator.chunkSize();
}

public long usedHeapMemory() {
  return this.allocator.usedHeapMemory();
}

public long usedDirectMemory() {
  return this.allocator.usedDirectMemory();
}
```

`NettyMemoryMetrics` currently only supports `usedHeapMemory` and `usedDirectMemory`. It could also support `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/a520ca36a33843a38bbde28387023f97)

Closes #2802 from SteNicholas/CELEBORN-1640.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-17 18:12:08 +08:00
SteNicholas
8bd5ac0b99 [MINOR] Add navigation for REST API document
### What changes were proposed in this pull request?

Add navigation for `REST API` document.

### Why are the changes needed?

The `REST API` document does not have any navigation, so it is better to add navigation to guide readers through the REST API.

<img width="1438" alt="image" src="https://github.com/user-attachments/assets/b5b3a14a-38d4-4769-bffb-3acd571d5dbb">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2775 from SteNicholas/navigate-rest-api.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-08 20:20:37 +08:00
Sanskar Modi
961144fdbd [CELEBORN-1582] Publish metric for unreleased shuffle count when worker was decommissioned
### What changes were proposed in this pull request?

Add a worker metric that publishes the unreleased shuffle count when a worker is decommissioned.

<img width="885" alt="Screenshot 2024-09-16 at 11 12 33 AM" src="https://github.com/user-attachments/assets/c81f36c1-cbed-44fe-814b-88f3ff29875d">

### Why are the changes needed?

Currently Celeborn doesn't publish the count of unreleased shuffle keys that are lost when a worker is decommissioned. This can be useful for monitoring and configuring the `forceExitTimeout`.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?
NA

Closes #2711 from s0nskar/unrelease_shuffle_metric.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-08 17:02:25 +08:00
Weijie Guo
d8809793f3 [CELEBORN-1490][CIP-6] Impl worker write process for Flink Hybrid Shuffle
### What changes were proposed in this pull request?

Impl worker write process for Flink Hybrid Shuffle.

### Why are the changes needed?

We support the tiered producer writing data from Flink to the worker. In this PR, we enable the worker to write this kind of data to storage.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
no need.

Closes #2741 from reswqa/cip6-6-pr.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-09-25 10:27:55 +08:00
szt
59a39952dd [CELEBORN-1586] Add available workers Metrics
### What changes were proposed in this pull request?
Currently the metrics cover workers, excludedWorkers and other metadata for the master service, but not available workers. This PR adds the missing metric.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Local test
![image](https://github.com/user-attachments/assets/240c176c-4eef-4e3c-b34d-802291714702)

Closes #2723 from zaynt4606/availableWorker.

Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-09-05 13:34:52 +08:00
Sanskar Modi
b7027b6011 [CELEBORN-914][FOLLOWUP] Adding metrics for memory file storage in monitoring.md
### What changes were proposed in this pull request?

Adding documentation for missing memory file storage metrics.

### Why are the changes needed?

A few new metrics were added in https://github.com/apache/celeborn/pull/2300, but their documentation was missing from monitoring.md.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

NA

Closes #2705 from s0nskar/memory_metrics.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-08-26 16:05:35 +08:00
Wang, Fei
0e05bc6cf9 [CELEBORN-1437][DOC] Merge METRICS.md into monitoring.md
### What changes were proposed in this pull request?

As title, merge these two similar user guides.

### Why are the changes needed?
To close CELEBORN-1437

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Preview https://github.com/turboFei/incubator-celeborn/blob/metrics_merge/docs/monitoring.md#setup-prometheus-dashboard

Closes #2623 from turboFei/metrics_merge.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-07-16 13:41:46 +08:00
Wang, Fei
6b03dcd5c2 [CELEBORN-1436][DOC] Move Rest API out from monitoring.md to webapi.md
### What changes were proposed in this pull request?

Move Rest API out from monitoring.md to webapi.md

### Why are the changes needed?
To close CELEBORN-1436

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
Review https://github.com/turboFei/incubator-celeborn/blob/webapi_md/docs/webapi.md

Closes #2624 from turboFei/webapi_md.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-15 10:54:03 +08:00
SteNicholas
c7b1b8d61e [CELEBORN-1459] Introduce CleanTaskQueueSize and CleanExpiredShuffleKeysTime to record situation of cleaning up expired shuffle keys
### What changes were proposed in this pull request?

Introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record situation of cleaning up expired shuffle keys.

### Why are the changes needed?

There is a backlog of task queue for cleaning up shuffle data of expired shuffle keys in the production environment. It's recommended to introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record the progress of cleaning up expired shuffle keys.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/4b5a0b79a35e4ddbb18ddccfe2ec06d7)

Closes #2557 from SteNicholas/CELEBORN-1459.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-18 16:31:57 +08:00
Xianming Lei
999510b265 [CELEBORN-1444] Introduce worker decommission metrics and corresponding REST API
### What changes were proposed in this pull request?

Introduce worker decommission metrics and corresponding REST API.

### Why are the changes needed?

In a production environment, due to certain hardware or environmental reasons, our script will automatically decommission the node. At this time, we need to distinguish between graceful shutdown nodes and decommissioned nodes.

If we distinguish shutdown worker and decommission worker metrics, we can achieve better operation and maintenance.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

- `DefaultMetaSystemSuiteJ#testHandleReportWorkerDecommission`
- `RatisMasterStatusSystemSuiteJ#testHandleReportWorkerDecommission`
- `ApiMasterResourceSuite#decommissionWorkers`
- `ApiWorkerResourceSuite#isDecommissioning`

Closes #2535 from leixm/issue_1444.

Lead-authored-by: Xianming Lei <jerrylei@apache.org>
Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-06-08 11:10:31 +08:00
SteNicholas
4fc42d7fef [CELEBORN-1389] Bump Dropwizard version from 3.2.6 to 4.2.25
### What changes were proposed in this pull request?

Bump Dropwizard version from 3.2.6 to 4.2.25. Meanwhile, introduce `metrics_jvm_thread_peak_count_Value` and `metrics_jvm_thread_total_started_count_Value` in `celeborn-jvm-dashboard.json`.

### Why are the changes needed?

Dropwizard Metrics has released v4.2.25, which contains bug fixes and improvements including:

* [JVM] Fix maximum/total memory calculation: https://github.com/dropwizard/metrics/pull/3125
* [Thread] Add peak and total started thread count to `ThreadStatesGaugeSet`: https://github.com/dropwizard/metrics/pull/1601

Meanwhile, the Ratis version has been upgraded to 3.0.1, which has no compatibility problem with Dropwizard 4.2.25.

Backport:

- https://github.com/apache/spark/pull/26332
- https://github.com/apache/spark/pull/29426
- https://github.com/apache/spark/pull/37372

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2540 from SteNicholas/CELEBORN-1389.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-04 19:26:20 +08:00
Shuang
308eed28c9 [CELEBORN-1427] Add Capacity metrics for Celeborn
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
The Celeborn cluster does not currently provide metrics for 'TotalCapacity' and 'TotalFreeCapacity'.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

Closes #2521 from RexXiong/CELEBORN-1427.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-23 16:06:11 +08:00
SteNicholas
db163bd793 [CELEBORN-1317][FOLLOWUP] Improve parameters, description and document of REST API
### What changes were proposed in this pull request?

Improve parameters, description and document of Celeborn REST API, including:

1. The POST request uses `FormParam` instead of `QueryParam`.
2. The parameter name uses lowercase instead of uppercase.
3. The description of `/exclude` aligns with document in `monitoring.md`.
4. The document of `REST API` adds the `Method` and `Parameters` to document GET/POST method and corresponding interface.

### Why are the changes needed?

The parameters, description and documentation of the REST API need to be improved after the HTTP server refinement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2495 from SteNicholas/CELEBORN-1317.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-09 17:41:13 +08:00
Shuang
9a9abfe3bc [CELEBORN-1245][FOLLOWUP] Fix SendWorkerEvent in HA mode
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
Handling a worker event used the wrong request.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`RatisMasterStatusSystemSuiteJ#testHandleWorkerEvent`

Closes #2493 from RexXiong/CELEBORN-1245-FOLLOW-UP.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-07 15:16:47 +08:00
SteNicholas
3ac769e4fa [CELEBORN-1236][FOLLOWUP] Gauge is_terminating, is_terminated and is_shutdown should represent a single numerical value
### What changes were proposed in this pull request?

Gauge `is_terminating`, `is_terminated` and `is_shutdown` should represent a single numerical value instead of boolean value.

### Why are the changes needed?

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. The value type of `is_terminating`, `is_terminated` and `is_shutdown` should be numerical; otherwise `AbstractSource#addGauge` logs a warning as follows:

```
2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_terminating failed, the value type class java.lang.Boolean is not a number
2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_terminated failed, the value type class java.lang.Boolean is not a number
2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_shutdown failed, the value type class java.lang.Boolean is not a number
```
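
One way to expose a boolean state as a numeric gauge is to map it to 1 or 0. The sketch below is only an illustration of that mapping: `NumericGauges` and `fromBoolean` are hypothetical names, and this is not the `ThreadPoolSource` code changed by this commit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical sketch, not the ThreadPoolSource implementation.
final class NumericGauges {
  // Wraps a boolean state as a numeric gauge value: 1 for true, 0 for false.
  static Supplier<Integer> fromBoolean(Supplier<Boolean> state) {
    return () -> state.get() ? 1 : 0;
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Supplier<Integer> isShutdown = fromBoolean(pool::isShutdown);
    System.out.println(isShutdown.get()); // pool is still running
    pool.shutdown();
    System.out.println(isShutdown.get()); // pool has been shut down
  }
}
```

Because the supplier is evaluated on every read, the gauge follows the thread pool state without any extra bookkeeping.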

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2457 from SteNicholas/CELEBORN-1236.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-15 11:34:34 +08:00
CodingCat
c788c38025 [CELEBORN-1328] Introduce ActiveSlotsCount metric to monitor the number of active slots
### What changes were proposed in this pull request?

Introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster.

### Why are the changes needed?

It's recommended to introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

In our test cluster (we can see the value of activeSlots increases and then back to 0 after the application finished, and slotsAllocated is increasing all the way).

![image](https://github.com/apache/incubator-celeborn/assets/678008/c05aa763-11ad-4bbd-9ae0-dd6a9cb01ac5)

Closes #2386 from CodingCat/slots_decrease.

Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-08 11:08:05 +08:00
SteNicholas
0054930ce7 [CELEBORN-1323] Introduce ShutdownWorkerCount metric to record the count of workers in shutdown list
### What changes were proposed in this pull request?

Introduce `ShutdownWorkerCount` metric to record the count of workers in shutdown list.

<img width="1432" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/bc84b281-30ca-40a1-92e4-fb9cf10b5aeb">

### Why are the changes needed?

`/shutdownWorkers` lists all shutdown workers of the master at present. Therefore it's recommended to introduce ShutdownWorkerCount metric to record the count of workers in shutdown list.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/c44822917403401690edb15617ec9f08)

Closes #2379 from SteNicholas/CELEBORN-1323.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-12 16:01:22 +08:00
SteNicholas
dee4afc580 [CELEBORN-1322] Rename LostWorkers metric to LostWorkerCount to align the naming style
### What changes were proposed in this pull request?

Rename `LostWorkers` metric to `LostWorkerCount` to align the naming style of other worker count metrics.

### Why are the changes needed?

The naming of the `LostWorkers` metric differs from other metrics of `MasterSource` such as `WorkerCount` and `ExcludedWorkerCount`, so it is renamed to `LostWorkerCount` to align the naming style.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2378 from SteNicholas/CELEBORN-1322.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-11 20:41:22 +08:00
liangyongyuan
4ddc91afda [CELEBRON-1282] Optimize push data replica error message
### What changes were proposed in this pull request?
Optimize the handling of exceptions during the push of replica data, now only throwing PUSH_DATA_CONNECTION_EXCEPTION_REPLICA in specific scenarios.

### Why are the changes needed?
When handling exceptions related to pushing replica data in the worker, unmatched exceptions, such as 'file already closed,' are uniformly transformed into REPLICATE_DATA_CONNECTION_EXCEPTION_COUNT and returned to the client. The client then excludes the peer node based on this count, which may not be appropriate in certain scenarios. For instance, in the case of an exception like 'file already closed,' it typically occurs during multiple splits and commitFile operations. Excluding a large number of nodes under such circumstances is clearly not in line with expectations.
![image](https://github.com/apache/incubator-celeborn/assets/46274164/816d21ad-1f79-45f0-bbe7-e93e15389edd)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Through existing UTs.

Closes #2323 from lyy-pineapple/CELEBORN-1282.

Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
2024-02-26 12:55:26 +08:00
SteNicholas
a1c9d01739 [CELEBORN-1056] Introduce Rest API of listing dynamic configuration
### What changes were proposed in this pull request?

Introduce Rest API of listing dynamic configuration `/listDynamicConfigs` to list the dynamic configs. The result of `/listDynamicConfigs` is as follows:

```
=========================== Dynamic Configuration ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold          100000
celeborn.worker.flusher.buffer.size                                           64k
=========================== SYSTEM ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold          200000
celeborn.worker.flusher.buffer.size                                           128k
=========================== TENANT ============================
=========================== Tenant: tenantId1 ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold          300000
celeborn.worker.flusher.buffer.size                                           256k
=========================== TENANT_USER ============================
=========================== Tenant: tenantId1, Name: user1 ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold          400000
celeborn.worker.flusher.buffer.size                                           512k
```

### Why are the changes needed?

Celeborn supports dynamic configuration with `ConfigService` at present. It's recommended to introduce a Rest API for dynamic configuration management.

### Does this PR introduce _any_ user-facing change?

- Introduce Rest API of listing dynamic configuration: `/listDynamicConfigs?level=[system|tenant|tenant_user]&tenant=tenantId1&name=user1`.
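
As a rough illustration of how the query string above could be assembled, here is a minimal Python sketch. The base address `http://celeborn-master.example:9098` and the helper name `build_list_dynamic_configs_url` are assumptions made for this example; only the `/listDynamicConfigs` path and the `level`, `tenant` and `name` parameters come from the description.

```python
from urllib.parse import urlencode

# Hypothetical base address; replace it with the HTTP endpoint of your deployment.
BASE_URL = "http://celeborn-master.example:9098"


def build_list_dynamic_configs_url(level=None, tenant=None, name=None, base_url=BASE_URL):
    """Build a /listDynamicConfigs URL, skipping parameters that are not set."""
    params = {"level": level, "tenant": tenant, "name": name}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base_url}/listDynamicConfigs" + (f"?{query}" if query else "")


# Example: list the configs of user1 under tenantId1.
print(build_list_dynamic_configs_url(level="tenant_user", tenant="tenantId1", name="user1"))
```

The resulting URL can then be fetched with any HTTP client.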

### How was this patch tested?

- `HttpUtilsSuite#CELEBORN-1056: Introduce Rest API of listing dynamic configuration`

Closes #2311 from SteNicholas/CELEBORN-1056.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-23 10:30:11 +08:00
SteNicholas
05fa11b3a0 [CELEBORN-1174] Introduce application dimension resource consumption metrics
### What changes were proposed in this pull request?

Introduce application dimension resource consumption metrics for `ResourceConsumptionSource`.

### Why are the changes needed?

`ResourceConsumption` namespace metrics are generated for each user and they are identified using a metric tag at present. It's recommended to introduce application dimension resource consumption metrics that expose application dimension resource consumption of Master and Worker. By monitoring resource consumption in the application dimension, you can obtain the actual situation of application resource consumption.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `WorkerInfoSuite#WorkerInfo toString output`
- `PbSerDeUtilsTest#fromAndToPbResourceConsumption`
- `MasterStateMachineSuiteJ#testObjSerde`

Closes #2161 from SteNicholas/CELEBORN-1174.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-01 15:24:29 +08:00
Shuang
e71d912d50 [CELEBORN-1245] Support Celeborn Master(Leader) to manage workers
### What changes were proposed in this pull request?
1. Support Celeborn Master (Leader) managing workers by sending events during heartbeat.
2. Add a worker status to the worker so that the status of each worker is known (for example, during decommission).
3. Add HTTP interfaces to the master for handleWorkerEvent/getWorkerEvent.

### Why are the changes needed?
Currently, worker status can only be managed on the worker side. This PR lets the master manage the status of all workers. By sending events such as Decommission/Graceful/Exit during heartbeat, workers can asynchronously execute commands from the master. Meanwhile, the status of a worker during decommission is currently unknown, so this PR adds a worker status that reports the exact status of the worker.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA

Closes #2255 from RexXiong/CELEBORN-1245.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-02-01 09:44:59 +08:00
Angerszhuuuu
3ffed66c40 [CELEBORN-1236][METRICS] Celeborn add metrics about thread pool
### What changes were proposed in this pull request?
Add metrics for the worker's thread pools to help admins observe their working status.

The thread pools are listed below:

1. celeborn-dispatcher
2. celeborn-netty-rpc-connection-executor
3. worker-disk-{mount_point}-cleaner
4. worker-device-checker
5. flusher-{mount_point}
6. worker-file-sorter-executor
7. worker-data-replicator
8. worker-files-committer
9. worker-expired-shuffle-cleaner

```
metrics_active_thread_count_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484
metrics_pending_task_count_Value{role="Worker",threadPool="celeborn-dispatcher"} 0 1706237338484
metrics_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484
metrics_core_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484
metrics_maximum_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484
metrics_largest_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484
metrics_active_thread_count_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 0 1706237338484
metrics_pending_task_count_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 0 1706237338484
metrics_pool_size_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 0 1706237338484
metrics_core_pool_size_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 64 1706237338484
metrics_maximum_pool_size_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 64 1706237338484
metrics_largest_pool_size_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 1 1706237338484
metrics_active_thread_count_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 0 1706237338484
metrics_pending_task_count_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 0 1706237338484
metrics_pool_size_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 0 1706237338484
metrics_core_pool_size_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 4 1706237338484
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 4 1706237338484
metrics_largest_pool_size_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 0 1706237338485
metrics_active_thread_count_Value{role="Worker",threadPool="worker-device-checker"} 0 1706237338485
metrics_pending_task_count_Value{role="Worker",threadPool="worker-device-checker"} 0 1706237338485
metrics_pool_size_Value{role="Worker",threadPool="worker-device-checker"} 2 1706237338485
metrics_core_pool_size_Value{role="Worker",threadPool="worker-device-checker"} 5 1706237338485
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-device-checker"} 5 1706237338485
metrics_largest_pool_size_Value{role="Worker",threadPool="worker-device-checker"} 2 1706237338485
metrics_thread_count_Value{role="Worker",threadPool="LocalFlusher1441328175-/"} 2 1706237338485
metrics_thread_is_terminated_count_Value{role="Worker",threadPool="LocalFlusher1441328175-/"} 0 1706237338485
metrics_thread_is_shutdown_count_Value{role="Worker",threadPool="LocalFlusher1441328175-/"} 0 1706237338485
metrics_active_thread_count_Value{role="Worker",threadPool="worker-file-sorter-executor"} 0 1706237338485
metrics_pending_task_count_Value{role="Worker",threadPool="worker-file-sorter-executor"} 0 1706237338485
metrics_pool_size_Value{role="Worker",threadPool="worker-file-sorter-executor"} 0 1706237338485
metrics_core_pool_size_Value{role="Worker",threadPool="worker-file-sorter-executor"} 24 1706237338485
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-file-sorter-executor"} 24 1706237338485
metrics_largest_pool_size_Value{role="Worker",threadPool="worker-file-sorter-executor"} 0 1706237338485
metrics_active_thread_count_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485
metrics_pending_task_count_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485
metrics_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485
metrics_core_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 64 1706237338485
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 64 1706237338485
metrics_largest_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485
metrics_active_thread_count_Value{role="Worker",threadPool="worker-files-committer"} 0 1706237338485
metrics_pending_task_count_Value{role="Worker",threadPool="worker-files-committer"} 0 1706237338485
metrics_pool_size_Value{role="Worker",threadPool="worker-files-committer"} 0 1706237338485
metrics_core_pool_size_Value{role="Worker",threadPool="worker-files-committer"} 32 1706237338485
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-files-committer"} 32 1706237338485
metrics_largest_pool_size_Value{role="Worker",threadPool="worker-files-committer"} 0 1706237338485
metrics_active_thread_count_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 0 1706237338485
metrics_pending_task_count_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 0 1706237338485
metrics_pool_size_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 2 1706237338485
metrics_core_pool_size_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 64 1706237338485
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 64 1706237338485
metrics_largest_pool_size_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 2 1706237338485
```
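
To show how the sample output above might be consumed, here is a minimal Python sketch that groups the samples by their `threadPool` label and derives a utilization figure. The function names `parse_samples` and `utilization`, and the choice of active threads divided by maximum pool size, are assumptions for illustration and are not part of this change.

```python
import re
from collections import defaultdict

# Matches one sample line: name{labels} value [timestamp]
LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+\d+)?$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Group sample values by threadPool label: {pool: {metric_name: value}}."""
    pools = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not a labelled sample
        labels = dict(LABEL_RE.findall(match.group("labels")))
        pool = labels.get("threadPool")
        if pool is not None:
            pools[pool][match.group("name")] = float(match.group("value"))
    return dict(pools)


def utilization(pool_metrics):
    """Active threads divided by maximum pool size, or None if either is unavailable."""
    active = pool_metrics.get("metrics_active_thread_count_Value")
    maximum = pool_metrics.get("metrics_maximum_pool_size_Value")
    if active is None or not maximum:
        return None
    return active / maximum


# A few lines copied from the sample output above.
sample = '''
metrics_active_thread_count_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484
metrics_maximum_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484
metrics_active_thread_count_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485
metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 64 1706237338485
'''

for pool, metrics in parse_samples(sample).items():
    print(pool, utilization(metrics))
```

The flusher pools expose a different set of metrics (`metrics_thread_count_Value` and so on), so `utilization` returns `None` for them.
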
### Why are the changes needed?
Help observe server status

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
MT

Closes #2239 from AngersZhuuuu/CLEBORN-1236.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2024-01-26 18:14:05 +08:00
Angerszhuuuu
4709251bb4
[CELEBORN-1246] Introduce OpenStreamSuccessCount, FetchChunkSuccessCount and WriteDataSuccessCount metric to expose the count of opening stream, fetching chunk and writing data successfully
### What changes were proposed in this pull request?

Introduce `OpenStreamSuccessCount`, `FetchChunkSuccessCount` and `WriteDataSuccessCount` metric to expose the count of opening stream, fetching chunk and writing data successfully in current worker.

### Why are the changes needed?

The failure ratio of opening streams, fetching chunks and writing data is an important indicator of cluster health for Celeborn, but the counts of successfully opening streams, fetching chunks and writing data are missing.
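
A minimal Python sketch of how such a ratio could be derived once both success and fail counts are available. The helper name `failure_ratio` and the sample numbers are made up for illustration; only the metric names `OpenStreamSuccessCount`, `OpenStreamFailCount`, `FetchChunkSuccessCount` and `FetchChunkFailCount` come from the commit messages.

```python
def failure_ratio(success_count, fail_count):
    """Return fail / (success + fail), or None when there were no operations."""
    total = success_count + fail_count
    if total == 0:
        return None
    return fail_count / total


# Hypothetical counter values scraped from one worker.
counters = {
    "OpenStreamSuccessCount": 980,
    "OpenStreamFailCount": 20,
    "FetchChunkSuccessCount": 0,
    "FetchChunkFailCount": 0,
}

for operation in ("OpenStream", "FetchChunk"):
    ratio = failure_ratio(
        counters[f"{operation}SuccessCount"],
        counters[f"{operation}FailCount"],
    )
    print(operation, ratio)
```

Returning `None` for an idle worker avoids a division by zero and keeps "no traffic" distinguishable from "no failures".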

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2252 from AngersZhuuuu/CELEBORN-1246.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-24 10:44:28 +08:00
xianminglei
b90fb1fdb2 [CELEBORN-1237][METRICS] Refactor metrics name
### What changes were proposed in this pull request?
Refactor metrics name.

### Why are the changes needed?
Easier to understand the meaning of metrics

### Does this PR introduce _any_ user-facing change?
METRICS.md
migration.md
monitoring.md

### How was this patch tested?
Existing UTs.

Closes #2240 from leixm/metrics_name.

Authored-by: xianminglei <xianming.lei@shopee.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-18 18:15:43 +08:00
SteNicholas
402d23d0ea
[CELEBORN-1223] Align master and worker metrics of document with MasterSource and WorkerSource
### What changes were proposed in this pull request?

Align master and worker metrics of document with `MasterSource` and `WorkerSource` in `METRICS.md` and `monitoring.md`.

### Why are the changes needed?

The master and worker metrics in the documentation are inconsistent with `MasterSource` and `WorkerSource` at present. It is recommended to align the documented master and worker metrics with `MasterSource` and `WorkerSource`:

- PushDataHandshakeFailCount
- RegionStartFailCount
- RegionFinishFailCount
- PrimaryPushDataHandshakeTime
- ReplicaPushDataHandshakeTime
- PrimaryRegionStartTime
- ReplicaRegionStartTime
- PrimaryRegionFinishTime
- ReplicaRegionFinishTime
- ActiveConnectionCount
- BufferStreamReadBuffer
- ReadBufferDispatcherRequestsLength
- ReadBufferAllocatedCount
- CreditStreamCount
- ActiveMapPartitionCount
- DeviceOSFreeBytes
- DeviceOSTotalBytes
- DeviceCelebornFreeBytes
- DeviceCelebornTotalBytes
- PotentialConsumeSpeed
- UserProduceSpeed
- WorkerConsumeSpeed

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2226 from SteNicholas/CELEBORN-1223.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-16 16:19:39 +08:00
SteNicholas
4b5e23db37
[CELEBORN-1215] Introduce PausePushDataAndReplicateTime metric to record time for a worker to stop receiving pushData from clients and other workers
### What changes were proposed in this pull request?

Introduce `PausePushDataAndReplicateTime` metric to record time for a worker to stop receiving pushData from clients and other workers.

### Why are the changes needed?

`PausePushData` is the number of times a worker stops receiving pushData from clients because of back pressure. Meanwhile, `PausePushDataAndReplicate` is the number of times a worker stops receiving pushData from clients and other workers because of back pressure. Therefore, `PausePushDataTime`, which records the time for a worker to stop receiving pushData from clients or other workers, has a definition that is confusing for users. It's recommended to introduce the `PausePushDataAndReplicateTime` metric, which means the time for a worker to stop receiving pushData from clients and other workers because of back pressure.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)
- `MemoryManagerSuite#[CELEBORN-882] Test MemoryManager check memory thread logic`

Closes #2221 from SteNicholas/CELEBORN-1215.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-10 19:55:04 +08:00
SteNicholas
0cd1291f6c
[CELEBORN-1214] Introduce WriteDataHardSplitCount metric to record HARD_SPLIT partitions of PushData and PushMergedData
### What changes were proposed in this pull request?

Introduce `WriteDataHardSplitCount` metric to record `HARD_SPLIT` partitions of PushData and PushMergedData.

### Why are the changes needed?

Because `PushDataHandler#handlePushData` and `PushDataHandler#handlePushMergedData` log at the DEBUG level, the `WriteDataHardSplitCount` metric should be introduced to record HARD_SPLIT partitions of PushData and PushMergedData for `PushDataHandler`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)

Closes #2217 from SteNicholas/CELEBORN-1214.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-09 21:54:53 +08:00
SteNicholas
29e930488b
[CELEBORN-1100] Introduce ChunkStreamCount, OpenStreamFailCount metrics about opening stream of FetchHandler
### What changes were proposed in this pull request?

Introduces `ChunkStreamCount`, `OpenStreamFailCount` metrics about opening stream of `FetchHandler`:

- `WorkerSource` adds `ChunkStreamCount`, `OpenStreamFailCount` metrics.
- Corrects the grafana dashboard of `celeborn-dashboard.json`. `celeborn-dashboard.json` has been verified via [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s). For example:
  1. `"expr": "metrics_RunningApplicationCount_Value"`
  2. Moves the panel position of `FetchChunkFailCount` to `FetchRelatives` instead of `PushRelatives`.
  3. Updates the `gridPos` of some panels.

### Why are the changes needed?

There are no metrics about opening stream of `FetchHandler` for Celeborn Worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)

Closes #2212 from SteNicholas/CELEBORN-1100.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-05 17:05:35 +08:00
SteNicholas
e7e39a51be
[CELEBORN-1189] Introduce RunningApplicationCount metric and /applications API to record running applications of worker
### What changes were proposed in this pull request?

Introduce `RunningApplicationCount` metric and `/applications` API to record running applications for Celeborn worker.

### Why are the changes needed?

The `RunningApplicationCount` metric only monitors the count of running applications in the cluster for master. Meanwhile, the `/listTopDiskUsedApps` API lists the top disk usage application ids for master and worker. Therefore, the `RunningApplicationCount` metric and `/applications` API could be introduced to record running applications of worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2172 from SteNicholas/CELEBORN-1189.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-27 09:51:16 +08:00
SteNicholas
277f7ced57
[CELEBORN-1187] Unify the size and file count of active shuffle metrics for master and worker
### What changes were proposed in this pull request?

Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.

### Why are the changes needed?

`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2171 from SteNicholas/CELEBORN-1187.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: 蒋晓峰 <jiangxiaofeng@bilibili.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-22 17:07:39 +08:00
SteNicholas
850d3199ef [CELEBORN-1164] Introduce FetchChunkFailCount metric to expose the count of fetching chunk failed in current worker
### What changes were proposed in this pull request?

Introduce `FetchChunkFailCount` metric to expose the count of fetching chunk failed in current worker.

### Why are the changes needed?

Metrics for the count of failed PushData or PushMergedData in the current worker are supported at present. It's better to also support the `FetchChunkFailCount` metric to expose the count of failed chunk fetches in the current worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal test.

Closes #2151 from SteNicholas/CELEBORN-1164.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 23:01:16 +08:00
SteNicholas
52eddc59f3
[CELEBORN-448] Support exclude worker manually
### What changes were proposed in this pull request?

Support excluding a worker manually by worker id. The given worker is manually added to the excluded workers.

### Why are the changes needed?

At present, Celeborn supports excluding workers on the shuffle client side when fetch or push fails. It's necessary to exclude workers manually for maintaining the Celeborn cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `HttpUtilsSuite`
- `DefaultMetaSystemSuiteJ#testHandleWorkerExclude`
- `RatisMasterStatusSystemSuiteJ#testHandleWorkerExclude`
- `MasterStateMachineSuiteJ#testObjSerde`

Closes #1997 from SteNicholas/CELEBORN-448.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-07 16:25:24 +08:00
SteNicholas
b45b63f9a5
[CELEBORN-247][FOLLOWUP] Add metrics for each user's quota usage of Celeborn Worker
### What changes were proposed in this pull request?

Add the metric `ResourceConsumption` to monitor each user's quota usage of Celeborn Worker.

### Why are the changes needed?

The metric `ResourceConsumption` supports monitoring each user's quota usage of Celeborn Master at present. The usage of Celeborn Worker also needs to be monitored.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2059 from SteNicholas/CELEBORN-247.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-01 15:48:31 +08:00
Fu Chen
349ee8b1cb Revert "[CELEBORN-255] Add counter of outstandingFetches, outstandingRpcs and outstandingPushes to metrics"

This reverts commit bfa341c32f.

### What changes were proposed in this pull request?

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1992#issuecomment-1776760369

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2032 from cfmcgrady/revert-pr-1992.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-24 17:18:54 +08:00