Commit Graph

98 Commits

Author SHA1 Message Date
xxx
a9490d6e24 [CELEBORN-2118] Introduce IsHighWorkload metric to monitor worker overload status
### What changes were proposed in this pull request?

Introduce `IsHighWorkload` metric to monitor worker overload status.

### Why are the changes needed?

There is currently no metric to monitor worker overload status.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Grafana test](https://xy2953396112.grafana.net/public-dashboards/22ab1750ef874a1bb39b5879b81a24cf).

Closes #3435 from xy2953396112/CELEBORN-2118.

Authored-by: xxx <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-25 20:46:17 +08:00
xxx
661a096b77 [CELEBORN-2112] Introduce PausePushDataStatus and PausePushDataAndReplicateStatus metric to record status of pause push data
### What changes were proposed in this pull request?

Add `PausePushDataStatus` and `PausePushDataAndReplicateStatus` metric.

### Why are the changes needed?

Introduce `PausePushDataStatus` and `PausePushDataAndReplicateStatus` metric to record status of pause push data.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test. [Grafana](https://xy2953396112.grafana.net/public-dashboards/21af8e2844234c438e74c741211f0032)

Closes #3426 from xy2953396112/CELEBORN-2112.

Authored-by: xxx <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-21 11:17:44 +08:00
dz
11b41f97ad [CELEBORN-2102] Introduce SorterCacheHitRate metric to monitor the hit rate of index cache for sorter
### What changes were proposed in this pull request?

Introduce `SorterCacheHitRate` metric to monitor the hit rate of the index cache for the sorter.

### Why are the changes needed?

Monitor the hit rate of `PartitionFilesSorter#indexCache`.
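
As a rough illustration of what a hit-rate metric measures, the sketch below derives the rate from hit and miss counters. `HitRateTracker` and its methods are hypothetical names and do not mirror how `PartitionFilesSorter` is implemented; reporting `1.0` before any lookup is also an assumption.

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch only: HitRateTracker is hypothetical, not Celeborn code.
final class HitRateTracker {
  private final LongAdder hits = new LongAdder();
  private final LongAdder misses = new LongAdder();

  void recordHit() {
    hits.increment();
  }

  void recordMiss() {
    misses.increment();
  }

  // Hit rate in [0, 1]. Returns 1.0 when no lookup has happened yet (an assumption).
  double hitRate() {
    long h = hits.sum();
    long total = h + misses.sum();
    return total == 0 ? 1.0 : (double) h / total;
  }
}
```

With three recorded hits and one miss, `hitRate()` returns `0.75`.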

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The verified grafana dashboard: https://xy2953396112.grafana.net/public-dashboards/5d1177ee0f784b53ad817fde919141b7

Closes #3416 from xy2953396112/CELEBORN_2102.

Authored-by: dz <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-20 10:47:38 +08:00
Wang, Fei
c587f33aaf [CELEBORN-1793] Add netty pinned memory metrics
### What changes were proposed in this pull request?
Add netty pinned memory metrics

### Why are the changes needed?
This gives a more accurate view of the memory actually allocated from `PoolArena`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #3019 from leixm/CELEBORN-1793.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-07-25 17:09:42 +08:00
Wang, Fei
b8253b0864 [CELEBORN-2078] Fix wrong grafana metrics units
### What changes were proposed in this pull request?

Fix the metrics units.

1. bytes -> decbytes, see https://github.com/apache/celeborn/pull/2896
```
metrics_FetchChunkTransferSize_Max
metrics_FetchChunkTransferSize_Mean
```

2. bytes -> none, followup https://github.com/apache/celeborn/pull/3362
```
metrics_LocalFlushSize_Count
metrics_HdfsFlushSize_Count
metrics_OssFlushSize_Count
metrics_S3FlushSize_Count
```

3. ms -> ns, followup https://github.com/apache/celeborn/pull/2990
```
metrics_RpcQueueTime_Max
metrics_RpcQueueTime_Mean
metrics_RpcProcessTime_Max
metrics_RpcProcessTime_Mean
```

4. add unit `decbytes` for `metrics_SortedFileSize_Value`, which was not set before
### Why are the changes needed?

Fix the metrics units.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Code review.

Closes #3381 from turboFei/fix_rpc_unit.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-07-23 15:59:32 +08:00
TheodoreLx
d09b424756 [CELEBORN-2061] Introduce metrics to count the amount of data flushed into different storage types
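### What changes were proposed in this pull request?

Add metrics for the amount of data written to different storage types, including Local, HDFS, OSS, and S3.

### Why are the changes needed?

Currently there are no metrics for the volume of data written to each storage type, so the size and speed of writes cannot be monitored.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster test.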

Closes #3362 from TheodoreLx/add-flush-count-metric.

Authored-by: TheodoreLx <1548069580@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-07-21 19:38:35 +08:00
Wang, Fei
b92820c635 [CELEBORN-2072] Add missing instance filter to grafana dashboard
### What changes were proposed in this pull request?

As title.
### Why are the changes needed?

To prevent the dashboard from crashing on large Celeborn clusters.

### Does this PR introduce _any_ user-facing change?

No.
### How was this patch tested?
Manually testing.

Closes #3373 from turboFei/metrics_instance.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-07-21 14:27:22 +08:00
Wang, Fei
979f2e2148 [CELEBORN-2073] Fix PartitionFileSizeBytes metrics
### What changes were proposed in this pull request?

This PR fixes two issues:

1. Follow-up to https://github.com/apache/celeborn/pull/3047: the positions of the `PartitionFileSizeBytes` metrics on the Grafana dashboard are wrong.
2. Follow-up to https://github.com/apache/celeborn/pull/3085: `PartitionFileSizeBytes` does not work.

### Why are the changes needed?

1. The metrics positions are not correct. They should be placed under the `Worker` row, but they are currently at the end.
<img width="1727" height="247" alt="image" src="https://github.com/user-attachments/assets/87a7eb1d-e296-4730-8986-efbf48aa35e6" />

2. The metrics do not work after 951b626a98 (diff-93aed69b393af59cefdfa6f5293f4dfb9cba96a9be23f3eec0bbe7d61f6d65be)
<img width="2072" height="282" alt="image" src="https://github.com/user-attachments/assets/28d0b404-914a-49e5-ac71-f399b3c3d44a" />

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

1. the metrics position looks good now.
<img width="1703" height="534" alt="image" src="https://github.com/user-attachments/assets/f5b78d37-9d84-4241-9285-e9a2ba0b12b2" />

2. UT

Closes #3374 from turboFei/fix_metrics_pos.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-07-21 14:25:06 +08:00
Sanskar Modi
2a2c6e4687 [CELEBORN-2024] Publish commit files fail count metrics
### What changes were proposed in this pull request?
Added a commit files request fail count metric.

### Why are the changes needed?
To monitor and tune the configurations around the commit files workflow.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Local setup

<img width="739" alt="Screenshot 2025-06-04 at 10 51 06 AM" src="https://github.com/user-attachments/assets/d6256028-d8b7-4a81-90b1-3dcbf61adeba" />

Closes #3307 from s0nskar/commit_metric.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-17 11:52:45 -07:00
Shuang
a0a4260013 [CELEBORN-1817][FOLLOWUP] Correct the problematic metrics
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
As title

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
PASS GA & grafana

Closes #3333 from RexXiong/CELEBORN-1817-FOLLOWUP.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-16 21:11:40 -07:00
SteNicholas
cfc3f1b13a [CELEBORN-1319][FOLLOWUP] Support celeborn optimize skew partitions patch for Spark v3.5.6 and v4.0.0
### What changes were proposed in this pull request?

Support celeborn optimize skew partitions patch for Spark v3.5.6 and v4.0.0.

### Why are the changes needed?

There is no patch of celeborn optimize skew partitions for Spark v4.0.0. Meanwhile, Spark v3.5.6 could not apply `Celeborn-Optimize-Skew-Partitions-spark3_5.patch` because of https://github.com/apache/spark/pull/50946.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
$ git checkout v3.5.6
Previous HEAD position was fa33ea000a0 Preparing Spark release v4.0.0-rc7
HEAD is now at 303c18c7466 Preparing Spark release v3.5.6-rc1
$ git apply --check /celeborn/assets/spark-patch/Celeborn-Optimize-Skew-Partitions-spark3_5_6.patch
$ git checkout v4.0.0
Previous HEAD position was 303c18c7466 Preparing Spark release v3.5.6-rc1
HEAD is now at fa33ea000a0 Preparing Spark release v4.0.0-rc7
$ git apply --check /celeborn/assets/spark-patch/Celeborn-Optimize-Skew-Partitions-spark4_0.patch
```

Closes #3329 from SteNicholas/CELEBORN-1319.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-12 11:04:17 -07:00
Sanskar Modi
80bdb46801 [CELEBORN-1892] Adding register with master fail count metric for worker
### What changes were proposed in this pull request?

Adding register with master fail count metric for worker

### Why are the changes needed?

This helps set up monitoring for cases where workers cannot register with the master, for example when wrong endpoints are passed or the master becomes unavailable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
Local setup

<img width="724" alt="Screenshot 2025-06-04 at 10 44 56 AM" src="https://github.com/user-attachments/assets/1f84557b-5df8-422f-b602-bb5316a72a0e" />

Closes #3308 from s0nskar/worker_register_metric.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-11 11:04:59 -07:00
Xianming Lei
edeeb4b30a [CELEBORN-1719][FOLLOWUP] Rename throwsFetchFailure to stageRerunEnabled
### What changes were proposed in this pull request?
Rename throwsFetchFailure to stageRerunEnabled

### Why are the changes needed?
Make the code cleaner.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
existing UTs.

Closes #3324 from leixm/CELEBORN-2035.

Authored-by: Xianming Lei <xianming.lei@shopee.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-06-11 19:33:19 +08:00
Wang, Fei
9a689b7482 [CELEBORN-2028] Setup GA for grafana dashboard
### What changes were proposed in this pull request?

Setup the GA for grafana dashboard.

1. Lint the dashboard with https://github.com/grafana/dashboard-linter
2. Check the duplicate id in dashboard json file

### Why are the changes needed?

It is helpful for grafana related PR review, for example: https://github.com/apache/celeborn/pull/3307#discussion_r2134799722

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA.

<img width="1407" alt="image" src="https://github.com/user-attachments/assets/35452633-ddff-4140-b929-3c44a943a2ab" />

Closes #3316 from turboFei/dashboard.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-06-10 16:14:49 +08:00
SteNicholas
d9984c9e0e [CELEBORN-1800] Introduce ApplicationTotalCount and ApplicationFallbackCount metric to record the total and fallback count of application
### What changes were proposed in this pull request?

Introduce `ApplicationTotalCount` and `ApplicationFallbackCount` metric to record the total and fallback count of application.

### Why are the changes needed?

There is no metric to record the total count of applications running with Celeborn shuffle and engine built-in shuffle, or the fallback count of applications. Meanwhile, the fallback of Flink shuffle is based on job granularity rather than shuffle granularity.

Follow-up to https://github.com/apache/celeborn/pull/3012#issuecomment-2553488532.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `DefaultMetaSystemSuiteJ#testShuffleAndApplicationCountWithFallback`
- `RatisMasterStatusSystemSuiteJ#testShuffleAndApplicationCountWithFallback`

Closes #3026 from SteNicholas/CELEBORN-1800.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-19 07:20:00 -07:00
Sanskar Modi
9ba54b39e2 [CELEBORN-1968] Publish metric for unreleased partition location count when worker was gracefully shutdown
### What changes were proposed in this pull request?

Add a worker metric that publishes the unreleased partition location count when a worker is gracefully shut down.

<img width="742" alt="Screenshot 2025-04-16 at 1 19 18 AM" src="https://github.com/user-attachments/assets/159f744a-cd76-45a2-9387-930f27dd72be" />

### Why are the changes needed?

Similar to https://github.com/apache/celeborn/pull/2711. Currently Celeborn does not publish the count of unreleased partition locations when a worker exits gracefully. This can be useful for monitoring and for configuring the gracefulShutdownTimeout.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
NA

Closes #3213 from s0nskar/unrelease_partition_location.

Lead-authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-12 04:34:44 -07:00
Wang, Fei
f92f9b84a0 [CELEBORN-1856][FOLLOWUP] Check isCelebornSkewedShuffle before registerCelebornSkewedShuffle for stage rollback
### What changes were proposed in this pull request?
Followup for https://github.com/apache/celeborn/pull/3118

Add a condition check (`isCelebornShuffleIndeterminate`) before `registerCelebornSkewedShuffle` for stage rollback.
### Why are the changes needed?

Fix the logic.
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Minor change.

Closes #3209 from turboFei/spark_celeborn_patch.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-04-11 11:45:44 -07:00
wangshengjie3
4bacd1f211 [CELEBORN-1856] Support stage-rerun when read partition by chunkOffsets when enable optimize skew partition read
### What changes were proposed in this pull request?
Support stage rerun when reading partitions by chunkOffsets with optimized skew partition read enabled.

### Why are the changes needed?
In [CELEBORN-1319](https://issues.apache.org/jira/browse/CELEBORN-1319), we already implemented the skew partition read optimization based on chunk offsets, but skew partition shuffle retry is not supported, so stage rerun needs to be supported.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster test

Closes #3118 from wangshengjie123/support-stage-rerun.

Lead-authored-by: wangshengjie3 <wangshengjie3@xiaomi.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-03-24 22:03:15 +08:00
wangshengjie
d659e06d45 [CELEBORN-1319] Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files
### What changes were proposed in this pull request?
Add logic to avoid sorting shuffle files in Reduce mode when optimizing skew partitions.

### Why are the changes needed?
The current logic needs to sort shuffle files when reading skew partition shuffle files in Reduce mode, and we found shuffle sorting timeouts and performance issues.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster test and uts

Closes #2373 from wangshengjie123/optimize-skew-partition.

Lead-authored-by: wangshengjie <wangshengjie3@xiaomi.com>
Co-authored-by: wangshengjie3 <wangshengjie3@xiaomi.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Shuang <lvshuang.tb@gmail.com>
Co-authored-by: wangshengjie3 <soldier.sj.wang@gmail.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-02-19 16:57:44 +08:00
zhengtao
ac0d335f40 [CELEBORN-1831] Add ratis commitIndex metrics
### What changes were proposed in this pull request?
Add two metrics (raft commitIndex of each master and maxCommitIndex - minCommitIndex value).

### Why are the changes needed?
To observe the metadata synchronization of the raft cluster.
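
The second metric is simply the spread between the largest and smallest commit index across masters. A minimal sketch of that arithmetic, assuming the per-master commit indexes have already been collected into a collection (`CommitIndexGap` is a hypothetical helper, not Celeborn code):

```java
import java.util.Collection;
import java.util.Collections;

// Illustrative sketch only: CommitIndexGap is a hypothetical helper.
final class CommitIndexGap {
  // Returns maxCommitIndex - minCommitIndex over the given masters, or 0 when empty.
  static long gap(Collection<Long> commitIndexes) {
    if (commitIndexes.isEmpty()) {
      return 0L;
    }
    return Collections.max(commitIndexes) - Collections.min(commitIndexes);
  }
}
```

A gap that stays at or near zero suggests the masters are in sync; a growing gap suggests at least one master is falling behind.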

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Cluster test.
![image](https://github.com/user-attachments/assets/f354a3cd-e3b3-4af0-98c2-fc13330b2d81)

Closes #3063 from zaynt4606/clb1831.

Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-17 10:58:06 +08:00
Nan
ca60613f2f [CELEBORN-1817] add committed file size metrics
### What changes were proposed in this pull request?

This PR adds file size metrics for workers.

### Why are the changes needed?

We observed that, likely due to delayed processing of split messages, some jobs write 40-50 GB files even though the split threshold is 10 GB (we use soft split).

We want this metric to monitor the severity of the issue.

### Does this PR introduce _any_ user-facing change?

Yes, one more metric.

### How was this patch tested?

(Ignore the dashboard title; it is a dummy one.)

![image](https://github.com/user-attachments/assets/d88c15e6-d740-4def-94d5-03666bbb38ca)

Closes #3047 from CodingCat/committed_file_size.

Authored-by: Nan <nzhu@pinterest.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-01-07 10:17:45 +08:00
wuziyi
f886751e80 [CELEBORN-1812] Distinguish sorting-file from sort-tasks waiting to be submitted
### What changes were proposed in this pull request?

The current implementation uses `shuffleSortTaskDeque.size()` as the current sorting file count. This value is more appropriately described as the number of sort tasks waiting to be submitted to `fileSorterExecutors`. The actual number of files currently being sorted (performing disk I/O and so on) should be obtained from `sortingShuffleFiles`.

### Why are the changes needed?

Add metrics to monitor the files currently being sorted, which are performing disk I/O operations.
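
The distinction between the two counts can be sketched as follows. `SortQueueGauges` is a hypothetical class; only the field names `shuffleSortTaskDeque` and `sortingShuffleFiles` are borrowed from the description, and the element types and methods are assumptions.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingDeque;

// Illustrative sketch only: SortQueueGauges is hypothetical, not the worker's sorter.
final class SortQueueGauges {
  // Sort tasks accepted but not yet handed to an executor.
  private final LinkedBlockingDeque<String> shuffleSortTaskDeque = new LinkedBlockingDeque<>();
  // Files whose sort is currently running.
  private final Set<String> sortingShuffleFiles = ConcurrentHashMap.newKeySet();

  void enqueue(String fileId) {
    shuffleSortTaskDeque.add(fileId);
  }

  // Moves the next pending task into the "sorting" set; returns null if none is pending.
  String startNext() {
    String fileId = shuffleSortTaskDeque.poll();
    if (fileId != null) {
      sortingShuffleFiles.add(fileId);
    }
    return fileId;
  }

  void finish(String fileId) {
    sortingShuffleFiles.remove(fileId);
  }

  // Gauge 1: tasks still waiting to be submitted.
  int pendingSortTaskCount() {
    return shuffleSortTaskDeque.size();
  }

  // Gauge 2: files actually being sorted right now.
  int sortingFileCount() {
    return sortingShuffleFiles.size();
  }
}
```

Reporting the two gauges separately makes it possible to tell a backlog of queued tasks apart from sorts that are slow on disk.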

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

![image](https://github.com/user-attachments/assets/6ffed37e-ad12-4d8d-a4aa-2b2695a92168)

Closes #3040 from Z1Wu/fix/sorting_file_metrics.

Authored-by: wuziyi <wuziyi02@corp.netease.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-04 10:27:53 +08:00
Wang, Fei
03656b5b1c [CELEBORN-1634][FOLLOWUP] Add rpc metrics into grafana dashboard
### What changes were proposed in this pull request?

1. Rename the RPC metrics from `${name}_${metric}` to `Rpc${metric}{name=$name}` so that they are easy to add to the Grafana dashboard.
2. Use the MASTER/WORKER/CLIENT role for the RPC env.
3. Add the RPC metrics to the Grafana dashboard.
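
To illustrate the rename in item 1, the sketch below builds both name shapes for the same endpoint. `RpcMetricNames` is a hypothetical helper, the endpoint name `SomeEndpoint` is made up, and quoting the label value is an assumption based on the usual Prometheus label syntax.

```java
// Illustrative sketch only: RpcMetricNames is hypothetical and only shows the two naming shapes.
final class RpcMetricNames {
  // Old shape: the endpoint name is baked into the metric name.
  static String oldName(String name, String metric) {
    return name + "_" + metric;
  }

  // New shape: a fixed "Rpc"-prefixed metric name with the endpoint name as a label.
  static String newName(String name, String metric) {
    return "Rpc" + metric + "{name=\"" + name + "\"}";
  }
}
```

With the new shape, one dashboard panel can query a single metric name such as `RpcQueueTime` and break it down by the `name` label, instead of needing one expression per endpoint.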

### Why are the changes needed?

For monitoring

### Does this PR introduce _any_ user-facing change?
No, it has not been released

### How was this patch tested?
UT for metrics source `instance`.

<img width="1456" alt="image" src="https://github.com/user-attachments/assets/90284390-54ad-49ef-a868-fa537d2301b8">

<img width="1880" alt="image" src="https://github.com/user-attachments/assets/e8101e47-d649-4c66-9978-1efb4faa047f">

Closes #2990 from turboFei/rpc_metrics.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-24 11:13:49 +08:00
mingji
c40f69b941 [CELEBORN-1766] Add detail metrics about fetch chunk
### What changes were proposed in this pull request?
1. Add histogram
2. Collect critical metrics about fetch chunk

### Why are the changes needed?
1. To find out IO pattern of fetch chunk
2. To have detail metrics about fetch chunk time

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

<img width="940" alt="截屏2024-12-09 15 42 50" src="https://github.com/user-attachments/assets/9f526103-c162-4607-a031-ba90f42ae83e">
<img width="962" alt="截屏2024-12-09 15 42 56" src="https://github.com/user-attachments/assets/c17822da-0433-4701-b0cc-0887ac970353">

Closes #2983 from FMX/b1766.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-16 16:17:14 +08:00
Wang, Fei
81a0d5113c [CELEBORN-1660] Cache available workers and only count the available workers device free capacity
### What changes were proposed in this pull request?
1. cache the available workers
2. Only count the available workers device free capacity.
3. place the metrics_AvailableWorkerCount_Value in overall and metrics_WorkerCount_Value in `Master` part

### Why are the changes needed?
Cache the available workers to avoid frequently looping over all workers.
Provide an accurate device capacity overview that does not include excluded workers, decommissioning workers, etc.
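
The caching idea can be sketched as keeping the available set up to date on every state change, so that readers never rescan all workers. `AvailableWorkerCache` and its methods are hypothetical and much simpler than the master's real bookkeeping; in particular, the sketch does not make updates across the three sets atomic.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: AvailableWorkerCache is hypothetical, not Celeborn code.
final class AvailableWorkerCache {
  private final Set<String> workers = ConcurrentHashMap.newKeySet();
  private final Set<String> excluded = ConcurrentHashMap.newKeySet();
  // Maintained incrementally so readers do not loop over all workers.
  private final Set<String> available = ConcurrentHashMap.newKeySet();

  void register(String worker) {
    workers.add(worker);
    if (!excluded.contains(worker)) {
      available.add(worker);
    }
  }

  void exclude(String worker) {
    excluded.add(worker);
    available.remove(worker);
  }

  void restore(String worker) {
    excluded.remove(worker);
    if (workers.contains(worker)) {
      available.add(worker);
    }
  }

  int workerCount() {
    return workers.size();
  }

  int availableWorkerCount() {
    return available.size();
  }
}
```

Capacity figures can then be summed over the cached available set only, which keeps excluded workers out of the overview.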

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
UT.

<img width="1705" alt="image" src="https://github.com/user-attachments/assets/bee17b4e-785d-4112-8410-dbb684270ec0">

Closes #2827 from turboFei/device_free.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-14 11:10:45 +08:00
Wang, Fei
def5254ec2 [CELEBORN-1706] Use bytes(IEC) unit instead of bytes(SI) for size related metrics in prometheus dashboard
### What changes were proposed in this pull request?
Use unit `bytes(IEC)`(`decbytes`, 1,024 bytes in a kibibyte ) for below 18 metrics(disk and memory related) instead of `bytes(SI)`(`bytes`, 1,000 bytes in a kilobyte).
- metrics_DeviceCelebornTotalBytes_Value
- metrics_DeviceCelebornFreeBytes_Value
- metrics_PartitionSize_Value
- metrics_ActiveShuffleSize_Value
- metrics_NettyMemory_Value
- metrics_DiskBuffer_Value
- metrics_push_usedHeapMemory_Value
- metrics_push_usedDirectMemory_Value
- metrics_fetch_usedHeapMemory_Value
- metrics_fetch_usedDirectMemory_Value
- metrics_replicate_usedHeapMemory_Value
- metrics_replicate_usedDirectMemory_Value
- metrics_BufferStreamReadBuffer_Value
- metrics_SortMemory_Value
- metrics_DeviceOSFreeBytes_Value
- metrics_DeviceCelebornFreeBytes_Value
- metrics_diskBytesWritten_Value
- metrics_hdfsBytesWritten_Value

Also apply for 6 jvm metrics
- metrics_jvm_memory_heap_init_Value
- metrics_jvm_memory_non_heap_init_Value
- metrics_jvm_memory_total_init_Value
- metrics_jvm_memory_pools_init_Value
- metrics_jvm_direct_capacity_Value
- metrics_jvm_mapped_capacity_Value

### Why are the changes needed?

Some size related metrics use `bytes(IEC)` and some use `bytes(SI)`.
<img width="1715" alt="image" src="https://github.com/user-attachments/assets/8dd1727b-4e16-487c-b2f9-f70116bc27d3">

<img width="1722" alt="image" src="https://github.com/user-attachments/assets/17ed933a-3f01-4a91-a170-aa7a042f4947">

The main difference between bytes in the International System of Units (SI) and the International Electrotechnical Commission (IEC) is the number of bytes in a kilobyte:
SI: 1,000 bytes in a kilobyte
IEC: 1,024 bytes in a kibibyte

FYI: https://www.drupal.org/project/drupal/issues/1114538#:~:text=According%20to%20the%20SI%20standard,e.g.%20a%20stick%20of%20RAM.
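
To make the difference concrete, the following sketch formats the same byte count with both bases. `ByteUnits` and its methods are hypothetical names used only for illustration; they are not part of Celeborn or Grafana.

```java
import java.util.Locale;

// Illustrative sketch only: ByteUnits is a hypothetical helper.
final class ByteUnits {
  private static final String[] SI = {"B", "kB", "MB", "GB", "TB"};
  private static final String[] IEC = {"B", "KiB", "MiB", "GiB", "TiB"};

  // Format with SI prefixes: each step is a factor of 1,000.
  static String formatSi(long bytes) {
    return format(bytes, 1000.0, SI);
  }

  // Format with IEC prefixes: each step is a factor of 1,024.
  static String formatIec(long bytes) {
    return format(bytes, 1024.0, IEC);
  }

  private static String format(long bytes, double base, String[] units) {
    double value = bytes;
    int i = 0;
    // Divide by the base until the value fits the current unit or units run out.
    while (value >= base && i < units.length - 1) {
      value /= base;
      i++;
    }
    return String.format(Locale.ROOT, "%.2f %s", value, units[i]);
  }
}
```

For example, `ByteUnits.formatSi(1_000_000)` gives `1.00 MB`, while `ByteUnits.formatIec(1_000_000)` gives `976.56 KiB`, so the same raw value reads differently depending on the unit chosen for a panel.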

4545cdc401/assets/grafana/celeborn-dashboard.json (L5636-L5699)

### Does this PR introduce _any_ user-facing change?

Yes, metrics unit changed.

### How was this patch tested?
Not needed, we already use `decbytes` in the dashboard json.

Closes #2896 from turboFei/unit_decbytes.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-11-14 10:40:57 +08:00
SteNicholas
169b6f6973 [CELEBORN-1685] ShuffleFallbackPolicy supports ShuffleFallbackCount metric
### What changes were proposed in this pull request?

1. `ShuffleFallbackPolicy` supports `ShuffleFallbackCount` metric to provide the shuffle fallback count of each fallback policy.
2. Introduce `ShuffleTotalCount` metric to record the total count of shuffle.
3. Fix the issue that Spark 2 does not increment the shuffle count via `LifecycleManager`.

### Why are the changes needed?

The implementations of `ShuffleFallbackPolicy` do not support the `ShuffleFallbackCount` metric at present. Meanwhile, Bilibili production practice needs the `ShuffleFallbackCount` of different `ShuffleFallbackPolicy` implementations.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster test.

Closes #2891 from SteNicholas/CELEBORN-1685.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-11 10:37:25 +08:00
Wang, Fei
f1bda46de4 [CELEBORN-1680] Introduce ShuffleFallbackCount metrics
### What changes were proposed in this pull request?

As title, introduce metrics_ShuffleFallbackCount_Value.

### Why are the changes needed?
To provide insight into how many shuffles fall back to the Spark built-in shuffle service. This helps us deprecate the ESS progressively.

Currently, we plan to set `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` so that shuffles with too many partitions, for example 50k, fall back.

In the future, we plan to limit the maximum acceptable number of shuffle partitions so that bad jobs are rejected and do not impact Celeborn master health.
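
A threshold-based fallback with a counter could look roughly like the sketch below. `FallbackCounter` is hypothetical and is not the client's actual fallback policy; whether the real comparison is strict (`>`) or inclusive (`>=`) is an assumption here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: FallbackCounter is hypothetical, not Celeborn code.
final class FallbackCounter {
  // Plays the role of celeborn.client.spark.shuffle.fallback.numPartitionsThreshold.
  private final int numPartitionsThreshold;
  private final AtomicLong shuffleFallbackCount = new AtomicLong();

  FallbackCounter(int numPartitionsThreshold) {
    this.numPartitionsThreshold = numPartitionsThreshold;
  }

  // Returns true and counts a fallback when the shuffle exceeds the threshold (strict ">" is assumed).
  boolean shouldFallback(int numPartitions) {
    boolean fallback = numPartitions > numPartitionsThreshold;
    if (fallback) {
      shuffleFallbackCount.incrementAndGet();
    }
    return fallback;
  }

  long shuffleFallbackCount() {
    return shuffleFallbackCount.get();
  }
}
```

With a threshold of 50,000, a shuffle with 60,000 partitions falls back and is counted, while one with 100 partitions does not.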

### Does this PR introduce _any_ user-facing change?
Yes, new metrics.

### How was this patch tested?
UT.
<img width="1188" alt="image" src="https://github.com/user-attachments/assets/8193c12c-5dc9-4783-b64b-6a8449a1bea4">

Closes #2866 from turboFei/record_fallback.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-07 11:42:17 +08:00
Sanskar Modi
2c996133b9 [CELEBORN-1444][FOLLOWUP] Add IsDecommissioningWorker to celeborn dashboard
### What changes were proposed in this pull request?

Adding IsDecommissioningWorker metric to celeborn dashboard

### Why are the changes needed?

Metric was missing from dashboard

### Does this PR introduce _any_ user-facing change?

NA

### How was this patch tested?

Tested in local grafana setup

<img width="755" alt="Screenshot 2024-10-21 at 5 19 55 PM" src="https://github.com/user-attachments/assets/7c0a2517-32a8-4565-81d8-a056d3708ac8">

Closes #2836 from s0nskar/decommision_metric.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-30 09:55:43 +08:00
Wang, Fei
ffc4980847 [CELEBORN-1627][FOLLOWUP] Fix typo for metrics_SlotsAllocated_increas_1h
### What changes were proposed in this pull request?
Fix typo in prometheus expr.

### Why are the changes needed?

Fix typo.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
<img width="1220" alt="image" src="https://github.com/user-attachments/assets/0b8649b6-163a-4868-9eb4-31a25a225d0e">

Closes #2825 from turboFei/fix_typo.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-21 11:33:54 +08:00
SteNicholas
497bfdf5d7 [CELEBORN-1640] NettyMemoryMetrics supports numHeapArenas, numDirectArenas, tinyCacheSize, smallCacheSize, normalCacheSize, numThreadLocalCaches and chunkSize
### What changes were proposed in this pull request?

`NettyMemoryMetrics` supports `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`. Meanwhile, remove `server_` prefix from metric name of netty memory metric in `monitoring.md`.

### Why are the changes needed?

`PooledByteBufAllocatorMetric` provides the following API to support netty memory metrics:

```
public int numHeapArenas() {
  return this.allocator.numHeapArenas();
}

public int numDirectArenas() {
  return this.allocator.numDirectArenas();
}

public List<PoolArenaMetric> heapArenas() {
  return this.allocator.heapArenas();
}

public List<PoolArenaMetric> directArenas() {
  return this.allocator.directArenas();
}

public int numThreadLocalCaches() {
  return this.allocator.numThreadLocalCaches();
}

public int tinyCacheSize() {
  return this.allocator.tinyCacheSize();
}

public int smallCacheSize() {
  return this.allocator.smallCacheSize();
}

public int normalCacheSize() {
  return this.allocator.normalCacheSize();
}

public int chunkSize() {
  return this.allocator.chunkSize();
}

public long usedHeapMemory() {
  return this.allocator.usedHeapMemory();
}

public long usedDirectMemory() {
  return this.allocator.usedDirectMemory();
}
```

`NettyMemoryMetrics` only supports `usedHeapMemory` and `usedDirectMemory`, which could support `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/a520ca36a33843a38bbde28387023f97)

Closes #2802 from SteNicholas/CELEBORN-1640.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-17 18:12:08 +08:00
Wang, Fei
c3d33daabc [CELEBORN-1627] Introduce instance variable for celeborn dashboard to filter metrics
### What changes were proposed in this pull request?

1. Add `instanceLabel` to the metrics source, preferring `FQDN:port` over `ip:port`; previously `ip:port` was used even with `celeborn.network.bind.preferIpAddress=false`.
2. Add the variable `instance` with `label_values(metrics_JVMCPUTime_Value, instance)`, same as `celeborn-jvm-dashboard.json`.
3. Add the filter `instance=~"${instance}"` to every metric.
4. Add the missing `legendFormat` for the memory file storage metrics expressions.

### Why are the changes needed?

There can be many Celeborn instances in production, so it is better to add an instance filter.

### Does this PR introduce _any_ user-facing change?
Yes, this introduces a new variable.

The default value of `instance` is `ALL`, which keeps the same behavior as before.

### How was this patch tested?

Config: `celeborn.network.bind.preferIpAddress=false`
<img width="1141" alt="image" src="https://github.com/user-attachments/assets/c3161069-790a-4cb2-8654-6d52cf8e5fb0">
<img width="944" alt="image" src="https://github.com/user-attachments/assets/293b8bd4-252a-459c-aa86-5f4aa75eb594">

<img width="939" alt="image" src="https://github.com/user-attachments/assets/1e1b28af-dd71-4c5b-8285-57473a6c9650">

For JVM metrics, before it was ip:port, and now it is FQDN:port.
<img width="947" alt="image" src="https://github.com/user-attachments/assets/fe00762f-605d-4b5e-b0a4-c586bdc0ec1a">

Closes #2777 from turboFei/legend_base.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-09 14:47:03 +08:00
Sanskar Modi
961144fdbd [CELEBORN-1582] Publish metric for unreleased shuffle count when worker was decommissioned
### What changes were proposed in this pull request?

Add a worker metric that publishes the unreleased shuffle count when a worker is decommissioned.

<img width="885" alt="Screenshot 2024-09-16 at 11 12 33 AM" src="https://github.com/user-attachments/assets/c81f36c1-cbed-44fe-814b-88f3ff29875d">

### Why are the changes needed?

Currently Celeborn does not publish the count of unreleased shuffle keys that are lost when a worker is decommissioned. This can be useful for monitoring and for configuring the `forceExitTimeout`.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?
NA

Closes #2711 from s0nskar/unrelease_shuffle_metric.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-08 17:02:25 +08:00
Weijie Guo
d8809793f3 [CELEBORN-1490][CIP-6] Impl worker write process for Flink Hybrid Shuffle
### What changes were proposed in this pull request?

Impl worker write process for Flink Hybrid Shuffle.

### Why are the changes needed?

We support the tiered producer writing data from Flink to the worker. In this PR, we enable the worker to write this kind of data to storage.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
no need.

Closes #2741 from reswqa/cip6-6-pr.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-09-25 10:27:55 +08:00
szt
59a39952dd [CELEBORN-1586] Add available workers Metrics
### What changes were proposed in this pull request?
Currently the master service has metrics for workers, excludedWorkers, and other metadata, but none for available workers. This PR adds the missing metric.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Local test
![image](https://github.com/user-attachments/assets/240c176c-4eef-4e3c-b34d-802291714702)

Closes #2723 from zaynt4606/availableWorker.

Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-09-05 13:34:52 +08:00
Wang, Fei
3b0abdee5b [CELEBORN-1491][FOLLOWUP] Using baseLegend for metrics_FlushWorkingQueueSize_Value
### What changes were proposed in this pull request?

Follow-up for https://issues.apache.org/jira/browse/CELEBORN-1491: use baseLegend for the newly introduced metrics.

### Why are the changes needed?

As title.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Before:
<img width="852" alt="image" src="https://github.com/user-attachments/assets/cf1cb852-9480-49ff-873c-62b535167fa3">

After:
<img width="346" alt="image" src="https://github.com/user-attachments/assets/cbd6ec82-4531-4056-b8ee-96bde813f899">

<img width="849" alt="image" src="https://github.com/user-attachments/assets/a787be53-4646-48d2-a24e-da9b714b7fca">

Closes #2712 from turboFei/grafana_dashboard.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-08-28 11:39:15 +08:00
Wang, Fei
0e05bc6cf9 [CELEBORN-1437][DOC] Merge METRICS.md into monitoring.md
### What changes were proposed in this pull request?

As title, merge these two similar user guides.

### Why are the changes needed?
To close CELEBORN-1437

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Preview https://github.com/turboFei/incubator-celeborn/blob/metrics_merge/docs/monitoring.md#setup-prometheus-dashboard

Closes #2623 from turboFei/metrics_merge.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-07-16 13:41:46 +08:00
mingji
cb6e2202ae [CELEBORN-1491] introduce flusher working queue size metric
### What changes were proposed in this pull request?
Add metrics about flusher working queue size.

### Why are the changes needed?
To show if there is an accumulation of flush tasks.
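
A minimal sketch of the idea, assuming the metric is a gauge that samples the length of a queue of pending flush tasks. The `Gauge` class and the queue below are hypothetical stand-ins, not Celeborn's actual implementation.

```python
import queue


class Gauge:
    """A gauge samples a value on demand via a callback (hypothetical helper)."""

    def __init__(self, name, supplier):
        self.name = name
        self._supplier = supplier

    def value(self):
        return self._supplier()


# Stand-in for a flusher's queue of pending flush tasks.
working_queue = queue.Queue()
flush_queue_size = Gauge("FlushWorkingQueueSize", working_queue.qsize)

for task_id in range(3):
    working_queue.put(f"flush-task-{task_id}")
backlog = flush_queue_size.value()  # three tasks are waiting

working_queue.get()  # the flusher picks up one task
after_one_flush = flush_queue_size.value()
```

A value that keeps growing between samples would indicate that flush tasks are arriving faster than they are processed.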

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA.

Closes #2598 from FMX/b1491.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-05 09:55:02 +08:00
SteNicholas
c7b1b8d61e [CELEBORN-1459] Introduce CleanTaskQueueSize and CleanExpiredShuffleKeysTime to record situation of cleaning up expired shuffle keys
### What changes were proposed in this pull request?

Introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record the status of cleaning up expired shuffle keys.

### Why are the changes needed?

In the production environment, the task queue for cleaning up shuffle data of expired shuffle keys can build up a backlog. It's recommended to introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record the progress of cleaning up expired shuffle keys.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/4b5a0b79a35e4ddbb18ddccfe2ec06d7)

Closes #2557 from SteNicholas/CELEBORN-1459.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-18 16:31:57 +08:00
SteNicholas
f63ff34ba7 [CELEBORN-1462] Fix layout of DeviceCelebornTotalBytes, DeviceCelebornFreeBytes, RunningApplicationCount and DecommissionWorkerCount in celeborn-dashboard.json
### What changes were proposed in this pull request?

Fix layout of `DeviceCelebornTotalBytes`, `DeviceCelebornFreeBytes`, `RunningApplicationCount` and `DecommissionWorkerCount` in `celeborn-dashboard.json`.

### Why are the changes needed?

The panels for `DeviceCelebornTotalBytes`, `DeviceCelebornFreeBytes`, `RunningApplicationCount` and `DecommissionWorkerCount` in `celeborn-dashboard.json` are positioned incorrectly, as follows:

![celeborn-dashboard](https://github.com/apache/celeborn/assets/10048174/adf82c15-ce31-4755-8c81-ffde9ceef822)

We should correct the positions of `DeviceCelebornTotalBytes`, `DeviceCelebornFreeBytes`, `RunningApplicationCount` and `DecommissionWorkerCount` in the layout.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test: [Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/822b08768a324dfe9fc526254bae5ae5).

Closes #2569 from SteNicholas/CELEBORN-1462.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-14 15:11:18 +08:00
Xianming Lei
999510b265 [CELEBORN-1444] Introduce worker decommission metrics and corresponding REST API
### What changes were proposed in this pull request?

Introduce worker decommission metrics and corresponding REST API.

### Why are the changes needed?

In a production environment, our script automatically decommissions a node when certain hardware or environmental problems occur. In that case, we need to distinguish between gracefully shut down nodes and decommissioned nodes.

Separate metrics for shutdown workers and decommissioned workers make operation and maintenance easier.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

- `DefaultMetaSystemSuiteJ#testHandleReportWorkerDecommission`
- `RatisMasterStatusSystemSuiteJ#testHandleReportWorkerDecommission`
- `ApiMasterResourceSuite#decommissionWorkers`
- `ApiWorkerResourceSuite#isDecommissioning`

Closes #2535 from leixm/issue_1444.

Lead-authored-by: Xianming Lei <jerrylei@apache.org>
Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-06-08 11:10:31 +08:00
SteNicholas
4fc42d7fef [CELEBORN-1389] Bump Dropwizard version from 3.2.6 to 4.2.25
### What changes were proposed in this pull request?

Bump Dropwizard version from 3.2.6 to 4.2.25. Meanwhile, introduce `metrics_jvm_thread_peak_count_Value` and `metrics_jvm_thread_total_started_count_Value` in `celeborn-jvm-dashboard.json`.

### Why are the changes needed?

Dropwizard Metrics has released v4.2.25, which contains bug fixes and improvements including:

* [JVM] Fix maximum/total memory calculation: https://github.com/dropwizard/metrics/pull/3125
* [Thread] Add peak and total started thread count to `ThreadStatesGaugeSet`: https://github.com/dropwizard/metrics/pull/1601

Meanwhile, the Ratis version has been upgraded to 3.0.1, which has no compatibility problem with Dropwizard 4.2.25.

Backport:

- https://github.com/apache/spark/pull/26332
- https://github.com/apache/spark/pull/29426
- https://github.com/apache/spark/pull/37372

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2540 from SteNicholas/CELEBORN-1389.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-04 19:26:20 +08:00
mingji
89d56c9bbc [CELEBORN-914] Support memory file storage
### What changes were proposed in this pull request?
To support memory file storage.

### Why are the changes needed?
To improve shuffle performance for small shuffle files.

Design doc: https://docs.google.com/document/d/1SM-oOM0JHEIoRHTYhE9PYH60_1D3NMxDR50LZIM7uW0/edit?usp=sharing

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA and manually test on a cluster.

Closes #2300 from FMX/B914.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-05-23 21:05:52 +08:00
Shuang
308eed28c9 [CELEBORN-1427] Add Capacity metrics for Celeborn
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
The Celeborn cluster does not currently provide metrics for 'TotalCapacity' and 'TotalFreeCapacity'.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

Closes #2521 from RexXiong/CELEBORN-1427.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-23 16:06:11 +08:00
CodingCat
c788c38025 [CELEBORN-1328] Introduce ActiveSlotsCount metric to monitor the number of active slots
### What changes were proposed in this pull request?

Introduce the `ActiveSlotsCount` metric to represent the current disk resource demand in the cluster.

### Why are the changes needed?

It's recommended to introduce the `ActiveSlotsCount` metric to represent the current disk resource demand in the cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Verified in our test cluster: the value of activeSlots increases and then returns to 0 after the application finishes, while slotsAllocated keeps increasing.

![image](https://github.com/apache/incubator-celeborn/assets/678008/c05aa763-11ad-4bbd-9ae0-dd6a9cb01ac5)
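
The difference between the two metrics observed in the test can be sketched as a gauge versus a counter. The `SlotMetrics` class below is a toy model with assumed logic; only the two metric names come from the PR text.

```python
class SlotMetrics:
    """Toy model of the two metrics; names mirror the PR text, logic is assumed."""

    def __init__(self):
        self.slots_allocated = 0  # counter: only ever increases
        self.active_slots = 0     # gauge: rises on allocation, falls on release

    def allocate(self, n):
        self.slots_allocated += n
        self.active_slots += n

    def release(self, n):
        # Releasing slots lowers the gauge but never the counter.
        self.active_slots = max(0, self.active_slots - n)


metrics = SlotMetrics()
metrics.allocate(100)  # an application requests slots
metrics.allocate(50)
metrics.release(150)   # the application finishes and its slots are released
```

After the release, the gauge is back at zero while the counter still reflects every slot ever allocated, which matches the behavior described in the test.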

Closes #2386 from CodingCat/slots_decrease.

Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-08 11:08:05 +08:00
SteNicholas
0054930ce7 [CELEBORN-1323] Introduce ShutdownWorkerCount metric to record the count of workers in shutdown list
### What changes were proposed in this pull request?

Introduce `ShutdownWorkerCount` metric to record the count of workers in shutdown list.

<img width="1432" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/bc84b281-30ca-40a1-92e4-fb9cf10b5aeb">

### Why are the changes needed?

`/shutdownWorkers` currently lists all shutdown workers of the master. It is therefore recommended to introduce the `ShutdownWorkerCount` metric to record the count of workers in the shutdown list.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/c44822917403401690edb15617ec9f08)

Closes #2379 from SteNicholas/CELEBORN-1323.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-12 16:01:22 +08:00
SteNicholas
dee4afc580 [CELEBORN-1322] Rename LostWorkers metric to LostWorkerCount to align the naming style
### What changes were proposed in this pull request?

Rename `LostWorkers` metric to `LostWorkerCount` to align the naming style of other worker count metrics.

### Why are the changes needed?

The `LostWorkers` metric is named differently from other `MasterSource` metrics such as `WorkerCount` and `ExcludedWorkerCount`; renaming it to `LostWorkerCount` aligns the naming style.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2378 from SteNicholas/CELEBORN-1322.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-11 20:41:22 +08:00
SteNicholas
4e64ae3214 [CELEBORN-1282][FOLLOWUP] Introduce ReplicateDataFailNonCriticalCauseCount metric in Grafana dashboard
### What changes were proposed in this pull request?

Introduce `ReplicateDataFailNonCriticalCauseCount` metric in Grafana dashboard. Follow up #2323.

### Why are the changes needed?

The `ReplicateDataFailNonCriticalCauseCount` metric should be supported in the Grafana dashboard via `celeborn-dashboard.json`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/6e50cc2c7af34692babcc2809066e147)

Closes #2332 from SteNicholas/CELEBORN-1282.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-02-27 15:32:28 +08:00
SteNicholas
4723c738b3 [CELEBORN-1246][FOLLOWUP] Introduce OpenStreamSuccessCount, FetchChunkSuccessCount and WriteDataSuccessCount metric in Grafana dashboard
### What changes were proposed in this pull request?

Introduce `OpenStreamSuccessCount`, `FetchChunkSuccessCount` and `WriteDataSuccessCount` metric in Grafana dashboard.

### Why are the changes needed?

The `OpenStreamSuccessCount`, `FetchChunkSuccessCount` and `WriteDataSuccessCount` metrics should be supported in the Grafana dashboard via `celeborn-dashboard.json`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)

Closes #2269 from SteNicholas/CELEBORN-1246.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-29 19:36:44 +08:00
xianminglei
b90fb1fdb2 [CELEBORN-1237][METRICS] Refactor metrics name
### What changes were proposed in this pull request?
Refactor metrics name.

### Why are the changes needed?
Easier to understand the meaning of metrics

### Does this PR introduce _any_ user-facing change?
METRICS.md
migration.md
monitoring.md

### How was this patch tested?
Existing UTs.

Closes #2240 from leixm/metrics_name.

Authored-by: xianminglei <xianming.lei@shopee.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-18 18:15:43 +08:00