### What changes were proposed in this pull request?
Introduce `IsHighWorkload` metric to monitor worker overload status.
### Why are the changes needed?
There is currently no metric to monitor worker overload status.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
[Grafana test](https://xy2953396112.grafana.net/public-dashboards/22ab1750ef874a1bb39b5879b81a24cf).
Closes #3435 from xy2953396112/CELEBORN-2118.
Authored-by: xxx <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Add `PausePushDataStatus` and `PausePushDataAndReplicateStatus` metrics.
### Why are the changes needed?
Introduce `PausePushDataStatus` and `PausePushDataAndReplicateStatus` metrics to record the pause status of push data.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test. [Grafana](https://xy2953396112.grafana.net/public-dashboards/21af8e2844234c438e74c741211f0032)
Closes #3426 from xy2953396112/CELEBORN-2112.
Authored-by: xxx <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Introduce `SorterCacheHitRate` metric to monitor the hit rate of the index cache for the sorter.
### Why are the changes needed?
Monitor the hit rate of `PartitionFilesSorter#indexCache`.
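A hit rate of this kind is usually computed as hits divided by total lookups. The following is a minimal sketch of that calculation only; the class and method names are hypothetical and it does not reflect how `PartitionFilesSorter#indexCache` actually reports its statistics.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical hit-rate tracker (illustrative only, not Celeborn's implementation). */
final class CacheHitRateSketch {
  private final LongAdder hits = new LongAdder();
  private final LongAdder misses = new LongAdder();

  void recordHit() {
    hits.increment();
  }

  void recordMiss() {
    misses.increment();
  }

  /** Returns hits / (hits + misses), or 0.0 when no lookup has been recorded yet. */
  double hitRate() {
    long h = hits.sum();
    long total = h + misses.sum();
    return total == 0 ? 0.0 : (double) h / total;
  }
}
```

With three hits and one miss recorded, `hitRate()` returns 0.75.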
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The verified Grafana dashboard: https://xy2953396112.grafana.net/public-dashboards/5d1177ee0f784b53ad817fde919141b7
Closes #3416 from xy2953396112/CELEBORN_2102.
Authored-by: dz <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Add Netty pinned memory metrics.
### Why are the changes needed?
Pinned memory metrics show more accurately how much memory is actually allocated from `PoolArena`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes #3019 from leixm/CELEBORN-1793.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Fix the metrics units.
1. bytes -> decbytes, see https://github.com/apache/celeborn/pull/2896
```
metrics_FetchChunkTransferSize_Max
metrics_FetchChunkTransferSize_Mean
```
2. bytes -> none, followup https://github.com/apache/celeborn/pull/3362
```
metrics_LocalFlushSize_Count
metrics_HdfsFlushSize_Count
metrics_OssFlushSize_Count
metrics_S3FlushSize_Count
```
3. ms -> ns, followup https://github.com/apache/celeborn/pull/2990
```
metrics_RpcQueueTime_Max
metrics_RpcQueueTime_Mean
metrics_RpcProcessTime_Max
metrics_RpcProcessTime_Mean
```
4. add unit `decbytes` for `metrics_SortedFileSize_Value`, which was not set before
### Why are the changes needed?
Fix the metrics units.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Code review.
Closes #3381 from turboFei/fix_rpc_unit.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
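### What changes were proposed in this pull request?
Added metrics for the amount of data written to different storage types, including Local, HDFS, OSS, and S3.
### Why are the changes needed?
Currently there are no metrics for the data volume written to each storage type, so the size and speed of writes cannot be monitored.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.
Closes #3362 from TheodoreLx/add-flush-count-metric.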
Authored-by: TheodoreLx <1548069580@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
To prevent the dashboard from crashing on large Celeborn clusters.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual testing.
Closes #3373 from turboFei/metrics_instance.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Added a commit files request fail count metric.
### Why are the changes needed?
To monitor and tune the configurations around the commit files workflow.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Local setup
<img width="739" alt="Screenshot 2025-06-04 at 10 51 06 AM" src="https://github.com/user-attachments/assets/d6256028-d8b7-4a81-90b1-3dcbf61adeba" />
Closes #3307 from s0nskar/commit_metric.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
PASS GA & grafana
Closes #3333 from RexXiong/CELEBORN-1817-FOLLOWUP.
Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Support celeborn optimize skew partitions patch for Spark v3.5.6 and v4.0.0.
### Why are the changes needed?
There is no patch of celeborn optimize skew partitions for Spark v4.0.0. Meanwhile, Spark v3.5.6 could not apply `Celeborn-Optimize-Skew-Partitions-spark3_5.patch` because of https://github.com/apache/spark/pull/50946.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
```
$ git checkout v3.5.6
Previous HEAD position was fa33ea000a0 Preparing Spark release v4.0.0-rc7
HEAD is now at 303c18c7466 Preparing Spark release v3.5.6-rc1
$ git apply --check /celeborn/assets/spark-patch/Celeborn-Optimize-Skew-Partitions-spark3_5_6.patch
$ git checkout v4.0.0
Previous HEAD position was 303c18c7466 Preparing Spark release v3.5.6-rc1
HEAD is now at fa33ea000a0 Preparing Spark release v4.0.0-rc7
$ git apply --check /celeborn/assets/spark-patch/Celeborn-Optimize-Skew-Partitions-spark4_0.patch
```
Closes #3329 from SteNicholas/CELEBORN-1319.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Add a metric for the count of failed worker registrations with the master.
### Why are the changes needed?
This helps with monitoring cases where workers cannot register with the master, for example when wrong endpoints are passed or the master becomes unavailable.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Local setup
<img width="724" alt="Screenshot 2025-06-04 at 10 44 56 AM" src="https://github.com/user-attachments/assets/1f84557b-5df8-422f-b602-bb5316a72a0e" />
Closes #3308 from s0nskar/worker_register_metric.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Rename throwsFetchFailure to stageRerunEnabled
### Why are the changes needed?
Make the code cleaner.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
existing UTs.
Closes #3324 from leixm/CELEBORN-2035.
Authored-by: Xianming Lei <xianming.lei@shopee.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Introduce `ApplicationTotalCount` and `ApplicationFallbackCount` metrics to record the total and fallback counts of applications.
### Why are the changes needed?
There is currently no metric recording the total count of applications running with Celeborn shuffle and engine built-in shuffle, nor the fallback count of applications. Meanwhile, the fallback of Flink shuffle is based on job granularity rather than shuffle granularity.
Follow up https://github.com/apache/celeborn/pull/3012#issuecomment-2553488532.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- `DefaultMetaSystemSuiteJ#testShuffleAndApplicationCountWithFallback`
- `RatisMasterStatusSystemSuiteJ#testShuffleAndApplicationCountWithFallback`
Closes #3026 from SteNicholas/CELEBORN-1800.
Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Add a worker metric that publishes the unreleased partition location count when a worker is gracefully shut down.
<img width="742" alt="Screenshot 2025-04-16 at 1 19 18 AM" src="https://github.com/user-attachments/assets/159f744a-cd76-45a2-9387-930f27dd72be" />
### Why are the changes needed?
Similar to https://github.com/apache/celeborn/pull/2711, Celeborn currently does not publish the count of unreleased partition locations when a worker exits gracefully. This can be useful for monitoring and for configuring the gracefulShutdownTimeout.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
NA
Closes #3213 from s0nskar/unrelease_partition_location.
Lead-authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Followup for https://github.com/apache/celeborn/pull/3118
Add a condition check(isCelebornShuffleIndeterminate) before `registerCelebornSkewedShuffle` for stage rollback.
### Why are the changes needed?
Fix the logic.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Minor change.
Closes #3209 from turboFei/spark_celeborn_patch.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Support stage rerun when reading partitions by chunkOffsets with the optimize skew partition read enabled.
### Why are the changes needed?
In [CELEBORN-1319](https://issues.apache.org/jira/browse/CELEBORN-1319), we implemented the skew partition read optimization based on chunk offsets, but skew partition shuffle retry is not supported, so stage rerun support is needed.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Cluster test
Closes #3118 from wangshengjie123/support-stage-rerun.
Lead-authored-by: wangshengjie3 <wangshengjie3@xiaomi.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Add logic to avoid sorting shuffle files for Reduce mode when optimizing skew partitions.
### Why are the changes needed?
The current logic needs to sort shuffle files when reading Reduce mode skew partition shuffle files, and we found shuffle sorting timeouts and performance issues.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Cluster test and UTs.
Closes #2373 from wangshengjie123/optimize-skew-partition.
Lead-authored-by: wangshengjie <wangshengjie3@xiaomi.com>
Co-authored-by: wangshengjie3 <wangshengjie3@xiaomi.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Shuang <lvshuang.tb@gmail.com>
Co-authored-by: wangshengjie3 <soldier.sj.wang@gmail.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Add two metrics (raft commitIndex of each master and maxCommitIndex - minCommitIndex value).
### Why are the changes needed?
To observe the metadata synchronization of the raft cluster.
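The second metric is the spread between the most and least advanced masters. A minimal sketch of that arithmetic follows, assuming the per-master commit indexes have already been collected into a map; the class name, method name, and map shape are hypothetical and not taken from the Celeborn code base.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical helper (illustrative only): gap between the largest and smallest commit index. */
final class CommitIndexGapSketch {
  /** Returns maxCommitIndex - minCommitIndex, or 0 when no master has reported yet. */
  static long commitIndexGap(Map<String, Long> commitIndexByMaster) {
    if (commitIndexByMaster.isEmpty()) {
      return 0L;
    }
    long max = Collections.max(commitIndexByMaster.values());
    long min = Collections.min(commitIndexByMaster.values());
    return max - min;
  }
}
```

A gap that stays near zero suggests the masters are in sync; a growing gap suggests at least one master is lagging behind.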
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.

Closes #3063 from zaynt4606/clb1831.
Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
This PR adds a file size metric for workers.
### Why are the changes needed?
We observed that, likely due to delayed processing of split messages, some jobs write 40-50g files even though the split threshold is 10g (we use soft split).
We want this metric to monitor the severity of the issue.
### Does this PR introduce _any_ user-facing change?
Yes, one more metric.
### How was this patch tested?
(ignore the dashboard title, it's a dummy one)

Closes #3047 from CodingCat/committed_file_size.
Authored-by: Nan <nzhu@pinterest.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
The current implementation uses `shuffleSortTaskDeque.size()` as the current sorting file count. This value is more appropriately described as the number of sort tasks waiting to be submitted to `fileSorterExecutors`. The actual number of files currently being sorted (performing disk I/O and so on) should be obtained from `sortingShuffleFiles`.
### Why are the changes needed?
Add a metric to monitor the files that are currently being sorted and performing disk I/O.
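The distinction between queued sort tasks and files actually being sorted can be sketched as two separate collections with two separate gauges. This is an illustrative model only; the class, field, and method names are hypothetical, and the real `PartitionFilesSorter` may differ.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingDeque;

/** Hypothetical model (illustrative only): pending sort tasks versus files being sorted. */
final class SortingCountSketch {
  // Tasks waiting to be handed to a sorter thread.
  private final LinkedBlockingDeque<String> pendingSortTasks = new LinkedBlockingDeque<>();
  // Files whose sort is in progress (doing disk I/O).
  private final Set<String> sortingFiles = ConcurrentHashMap.newKeySet();

  void submit(String fileId) {
    pendingSortTasks.addLast(fileId);
  }

  /** Moves the next pending task into the in-progress set; returns its id, or null if none. */
  String startNext() {
    String fileId = pendingSortTasks.pollFirst();
    if (fileId != null) {
      sortingFiles.add(fileId);
    }
    return fileId;
  }

  void finish(String fileId) {
    sortingFiles.remove(fileId);
  }

  /** Gauge for tasks still waiting in the queue. */
  int pendingCount() {
    return pendingSortTasks.size();
  }

  /** Gauge for files currently being sorted. */
  int sortingCount() {
    return sortingFiles.size();
  }
}
```

Reporting only the queue size would show zero while a long-running sort is still in progress, which is why the two counts are tracked separately here.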
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Closes #3040 from Z1Wu/fix/sorting_file_metrics.
Authored-by: wuziyi <wuziyi02@corp.netease.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Rename the RPC metrics from `${name}_${metric}` to `Rpc${metric}{name=$name}` so that they are easy to add to the Grafana dashboard.
2. Use the MASTER/WORKER/CLIENT role for the RPC env.
3. Add the RPC metrics to the Grafana dashboard.
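The rename in item 1 moves the endpoint name out of the metric name and into a label, so one dashboard query can cover every endpoint. A minimal sketch of the two naming templates, rendered literally as written above; the class name and the `ExampleEndpoint` value are hypothetical.

```java
/** Hypothetical illustration of the old and new RPC metric naming templates. */
final class RpcMetricNameSketch {
  /** Old template: ${name}_${metric}. */
  static String oldStyle(String name, String metric) {
    return name + "_" + metric;
  }

  /** New template: Rpc${metric}{name=$name}. */
  static String newStyle(String name, String metric) {
    return "Rpc" + metric + "{name=" + name + "}";
  }
}
```

For example, `oldStyle("ExampleEndpoint", "QueueTime")` yields `ExampleEndpoint_QueueTime`, while `newStyle` yields `RpcQueueTime{name=ExampleEndpoint}`.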
### Why are the changes needed?
For monitoring
### Does this PR introduce _any_ user-facing change?
No, it has not been released
### How was this patch tested?
UT for metrics source `instance`.
<img width="1456" alt="image" src="https://github.com/user-attachments/assets/90284390-54ad-49ef-a868-fa537d2301b8">
<img width="1880" alt="image" src="https://github.com/user-attachments/assets/e8101e47-d649-4c66-9978-1efb4faa047f">
Closes #2990 from turboFei/rpc_metrics.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Add histogram
2. Collect critical metrics about fetch chunk
### Why are the changes needed?
1. To find out the IO pattern of fetch chunk
2. To have detailed metrics about fetch chunk time
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA and cluster.
<img width="940" alt="Screenshot 2024-12-09 15 42 50" src="https://github.com/user-attachments/assets/9f526103-c162-4607-a031-ba90f42ae83e">
<img width="962" alt="Screenshot 2024-12-09 15 42 56" src="https://github.com/user-attachments/assets/c17822da-0433-4701-b0cc-0887ac970353">
Closes #2983 from FMX/b1766.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Cache the available workers.
2. Only count the device free capacity of available workers.
3. Place `metrics_AvailableWorkerCount_Value` in the overall part and `metrics_WorkerCount_Value` in the `Master` part.
### Why are the changes needed?
Cache the available workers to avoid frequently looping over all workers.
To have an accurate device capacity overview that does not include the excluded workers, decommissioning workers, etc.
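Counting capacity only for available workers amounts to filtering before summing. The following sketch assumes a simplified worker record with two status flags; the `Worker` type and its fields are hypothetical and are not Celeborn's `WorkerInfo`.

```java
import java.util.List;

/** Hypothetical illustration: sum free capacity over available workers only. */
final class AvailableCapacitySketch {
  /** Simplified stand-in for a worker record (illustrative only). */
  static final class Worker {
    final String id;
    final long freeBytes;
    final boolean excluded;
    final boolean decommissioning;

    Worker(String id, long freeBytes, boolean excluded, boolean decommissioning) {
      this.id = id;
      this.freeBytes = freeBytes;
      this.excluded = excluded;
      this.decommissioning = decommissioning;
    }

    boolean isAvailable() {
      return !excluded && !decommissioning;
    }
  }

  /** Returns the total free bytes across workers that are neither excluded nor decommissioning. */
  static long availableFreeBytes(List<Worker> workers) {
    long total = 0L;
    for (Worker w : workers) {
      if (w.isAvailable()) {
        total += w.freeBytes;
      }
    }
    return total;
  }
}
```

In a real master the set of available workers would be cached and updated on state changes, so that the gauge does not have to rescan every worker on each read.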
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
UT.
<img width="1705" alt="image" src="https://github.com/user-attachments/assets/bee17b4e-785d-4112-8410-dbb684270ec0">
Closes #2827 from turboFei/device_free.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Use unit `bytes(IEC)` (`decbytes`, 1,024 bytes in a kibibyte) for the 18 disk- and memory-related metrics below, instead of `bytes(SI)` (`bytes`, 1,000 bytes in a kilobyte).
- metrics_DeviceCelebornTotalBytes_Value
- metrics_DeviceCelebornFreeBytes_Value
- metrics_PartitionSize_Value
- metrics_ActiveShuffleSize_Value
- metrics_NettyMemory_Value
- metrics_DiskBuffer_Value
- metrics_push_usedHeapMemory_Value
- metrics_push_usedDirectMemory_Value
- metrics_fetch_usedHeapMemory_Value
- metrics_fetch_usedDirectMemory_Value
- metrics_replicate_usedHeapMemory_Value
- metrics_replicate_usedDirectMemory_Value
- metrics_BufferStreamReadBuffer_Value
- metrics_SortMemory_Value
- metrics_DeviceOSFreeBytes_Value
- metrics_DeviceCelebornFreeBytes_Value
- metrics_diskBytesWritten_Value
- metrics_hdfsBytesWritten_Value
Also apply to 6 JVM metrics:
- metrics_jvm_memory_heap_init_Value
- metrics_jvm_memory_non_heap_init_Value
- metrics_jvm_memory_total_init_Value
- metrics_jvm_memory_pools_init_Value
- metrics_jvm_direct_capacity_Value
- metrics_jvm_mapped_capacity_Value
### Why are the changes needed?
Some size related metrics use `bytes(IEC)` and some use `bytes(SI)`.
<img width="1715" alt="image" src="https://github.com/user-attachments/assets/8dd1727b-4e16-487c-b2f9-f70116bc27d3">
<img width="1722" alt="image" src="https://github.com/user-attachments/assets/17ed933a-3f01-4a91-a170-aa7a042f4947">
The main difference between bytes in the International System of Units (SI) and the International Electrotechnical Commission (IEC) is the number of bytes in a kilobyte:
SI: 1,000 bytes in a kilobyte
IEC: 1,024 bytes in a kibibyte
FYI: https://www.drupal.org/project/drupal/issues/1114538#:~:text=According%20to%20the%20SI%20standard,e.g.%20a%20stick%20of%20RAM.
4545cdc401/assets/grafana/celeborn-dashboard.json (L5636-L5699)
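The practical effect of the SI/IEC difference is easiest to see on a concrete value. A minimal sketch of the two conversions; the class and method names are hypothetical.

```java
/** Hypothetical illustration of SI versus IEC byte scaling. */
final class ByteUnitsSketch {
  /** SI: 1 kilobyte = 1,000 bytes. */
  static double toKilobytes(long bytes) {
    return bytes / 1000.0;
  }

  /** IEC: 1 kibibyte = 1,024 bytes. */
  static double toKibibytes(long bytes) {
    return bytes / 1024.0;
  }
}
```

For 1,048,576 bytes, the SI conversion gives 1,048.576 kilobytes while the IEC conversion gives exactly 1,024 kibibytes, so panels that mix the two units show different numbers for the same underlying value.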
### Does this PR introduce _any_ user-facing change?
Yes, metrics unit changed.
### How was this patch tested?
Not needed, we already use `decbytes` in the dashboard json.
Closes #2896 from turboFei/unit_decbytes.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
1. `ShuffleFallbackPolicy` supports `ShuffleFallbackCount` metric to provide the shuffle fallback count of each fallback policy.
2. Introduce `ShuffleTotalCount` metric to record the total count of shuffle.
3. Fix Spark 2 not incrementing the shuffle count via `LifecycleManager`.
### Why are the changes needed?
The implementations of `ShuffleFallbackPolicy` do not support the `ShuffleFallbackCount` metric at present. Meanwhile, Bilibili production practice needs the `ShuffleFallbackCount` of different `ShuffleFallbackPolicy` implementations.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.
Closes #2891 from SteNicholas/CELEBORN-1685.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
As title, introduce metrics_ShuffleFallbackCount_Value.
### Why are the changes needed?
To provide insight into how many shuffles fall back to the Spark built-in shuffle service. This helps us deprecate the ESS progressively.
Currently, we plan to set `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` so that shuffles with too many partitions, for example 50k, fall back.
In the future, we plan to limit the maximum acceptable shuffle partition number so that bad jobs are rejected and do not impact Celeborn master health.
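A threshold-based fallback with a counter can be sketched as below. This is an illustration under assumptions: the class and method names are hypothetical, and whether the real check is strictly greater-than or greater-than-or-equal is not stated in this description.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration: count shuffles that fall back because of a partition threshold. */
final class FallbackCountSketch {
  private final int numPartitionsThreshold;
  private final AtomicLong shuffleFallbackCount = new AtomicLong();

  FallbackCountSketch(int numPartitionsThreshold) {
    this.numPartitionsThreshold = numPartitionsThreshold;
  }

  /**
   * Returns true and increments the counter when the shuffle should fall back.
   * Assumption: a strict greater-than comparison against the threshold.
   */
  boolean shouldFallback(int numPartitions) {
    boolean fallback = numPartitions > numPartitionsThreshold;
    if (fallback) {
      shuffleFallbackCount.incrementAndGet();
    }
    return fallback;
  }

  /** Value that a gauge such as metrics_ShuffleFallbackCount_Value would expose. */
  long fallbackCount() {
    return shuffleFallbackCount.get();
  }
}
```

With a threshold of 50,000, a shuffle with 60,000 partitions falls back and is counted, while one with 1,000 partitions does not.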
### Does this PR introduce _any_ user-facing change?
Yes, new metrics.
### How was this patch tested?
UT.
<img width="1188" alt="image" src="https://github.com/user-attachments/assets/8193c12c-5dc9-4783-b64b-6a8449a1bea4">
Closes #2866 from turboFei/record_fallback.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Adding IsDecommissioningWorker metric to celeborn dashboard
### Why are the changes needed?
Metric was missing from dashboard
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Tested in local grafana setup
<img width="755" alt="Screenshot 2024-10-21 at 5 19 55 PM" src="https://github.com/user-attachments/assets/7c0a2517-32a8-4565-81d8-a056d3708ac8">
Closes #2836 from s0nskar/decommision_metric.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix typo in prometheus expr.
### Why are the changes needed?
Fix typo.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
<img width="1220" alt="image" src="https://github.com/user-attachments/assets/0b8649b6-163a-4868-9eb4-31a25a225d0e">
Closes #2825 from turboFei/fix_typo.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
`NettyMemoryMetrics` supports `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`. Meanwhile, remove `server_` prefix from metric name of netty memory metric in `monitoring.md`.
### Why are the changes needed?
`PooledByteBufAllocatorMetric` provides the following API to support netty memory metrics:
```
public int numHeapArenas() {
  return this.allocator.numHeapArenas();
}
public int numDirectArenas() {
  return this.allocator.numDirectArenas();
}
public List<PoolArenaMetric> heapArenas() {
  return this.allocator.heapArenas();
}
public List<PoolArenaMetric> directArenas() {
  return this.allocator.directArenas();
}
public int numThreadLocalCaches() {
  return this.allocator.numThreadLocalCaches();
}
public int tinyCacheSize() {
  return this.allocator.tinyCacheSize();
}
public int smallCacheSize() {
  return this.allocator.smallCacheSize();
}
public int normalCacheSize() {
  return this.allocator.normalCacheSize();
}
public int chunkSize() {
  return this.allocator.chunkSize();
}
public long usedHeapMemory() {
  return this.allocator.usedHeapMemory();
}
public long usedDirectMemory() {
  return this.allocator.usedDirectMemory();
}
```
`NettyMemoryMetrics` currently only supports `usedHeapMemory` and `usedDirectMemory`; it could also support `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/a520ca36a33843a38bbde28387023f97)
Closes #2802 from SteNicholas/CELEBORN-1640.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Add `instanceLabel` to the metrics source, preferring `FQDN:port` over `ip:port`; previously `ip:port` was used even with `celeborn.network.bind.preferIpAddress=false`.
2. Add variable `instance` with `label_values(metrics_JVMCPUTime_Value, instance)`, same as `celeborn-jvm-dashboard.json`.
3. Add filter `instance=~"${instance}"` for every metric.
4. Add missing `legendFormat` for memory file storage metrics expressions.
### Why are the changes needed?
Production deployments may have many Celeborn instances, so it is better to add a filter on instance.
### Does this PR introduce _any_ user-facing change?
Yes, it introduces a new variable.
The default value of `instance` is `ALL`, so the behavior is the same as before.
### How was this patch tested?
Config: `celeborn.network.bind.preferIpAddress=false`
<img width="1141" alt="image" src="https://github.com/user-attachments/assets/c3161069-790a-4cb2-8654-6d52cf8e5fb0">
<img width="944" alt="image" src="https://github.com/user-attachments/assets/293b8bd4-252a-459c-aa86-5f4aa75eb594">
<img width="939" alt="image" src="https://github.com/user-attachments/assets/1e1b28af-dd71-4c5b-8285-57473a6c9650">
For JVM metrics, before it was ip:port, and now it is FQDN:port.
<img width="947" alt="image" src="https://github.com/user-attachments/assets/fe00762f-605d-4b5e-b0a4-c586bdc0ec1a">
Closes #2777 from turboFei/legend_base.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Add a worker metric that publishes the unreleased shuffle count when a worker is decommissioned.
<img width="885" alt="Screenshot 2024-09-16 at 11 12 33 AM" src="https://github.com/user-attachments/assets/c81f36c1-cbed-44fe-814b-88f3ff29875d">
### Why are the changes needed?
Currently Celeborn does not publish the count of unreleased shuffle keys that are lost when a worker is decommissioned. This can be useful for monitoring and for configuring the `forceExitTimeout`.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
NA
Closes #2711 from s0nskar/unrelease_shuffle_metric.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Implement the worker write process for Flink Hybrid Shuffle.
### Why are the changes needed?
We support a tiered producer that writes data from Flink to the worker. This PR enables the worker to write this kind of data to storage.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
no need.
Closes #2741 from reswqa/cip6-6-pr.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Currently the master service metrics cover workers, excludedWorkers and other metadata, but not available workers. This PR adds a metric for available workers.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Local test

Closes #2723 from zaynt4606/availableWorker.
Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
As title, merge these two similar user guides.
### Why are the changes needed?
To close CELEBORN-1437
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Preview https://github.com/turboFei/incubator-celeborn/blob/metrics_merge/docs/monitoring.md#setup-prometheus-dashboard
Closes #2623 from turboFei/metrics_merge.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Add metrics about flusher working queue size.
### Why are the changes needed?
To show if there is an accumulation of flush tasks.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA.
Closes #2598 from FMX/b1491.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record situation of cleaning up expired shuffle keys.
### Why are the changes needed?
There is a backlog of task queue for cleaning up shuffle data of expired shuffle keys in the production environment. It's recommended to introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record the progress of cleaning up expired shuffle keys.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/4b5a0b79a35e4ddbb18ddccfe2ec06d7)
Closes #2557 from SteNicholas/CELEBORN-1459.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix layout of `DeviceCelebornTotalBytes`, `DeviceCelebornFreeBytes`, `RunningApplicationCount` and `DecommissionWorkerCount` in `celeborn-dashboard.json`.
### Why are the changes needed?
The layout of `DeviceCelebornTotalBytes`, `DeviceCelebornFreeBytes`, `RunningApplicationCount` and `DecommissionWorkerCount` in `celeborn-dashboard.json` has the wrong position, as follows:

We should correct the position of `DeviceCelebornTotalBytes`, `DeviceCelebornFreeBytes`, `RunningApplicationCount` and `DecommissionWorkerCount` in the layout.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test: [Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/822b08768a324dfe9fc526254bae5ae5).
Closes #2569 from SteNicholas/CELEBORN-1462.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce worker decommission metrics and corresponding REST API.
### Why are the changes needed?
In a production environment, due to certain hardware or environmental reasons, our script will automatically decommission the node. At this time, we need to distinguish between graceful shutdown nodes and decommissioned nodes.
If we distinguish shutdown worker and decommission worker metrics, we can achieve better operation and maintenance.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
- `DefaultMetaSystemSuiteJ#testHandleReportWorkerDecommission`
- `RatisMasterStatusSystemSuiteJ#testHandleReportWorkerDecommission`
- `ApiMasterResourceSuite#decommissionWorkers`
- `ApiWorkerResourceSuite#isDecommissioning`
Closes #2535 from leixm/issue_1444.
Lead-authored-by: Xianming Lei <jerrylei@apache.org>
Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Dropwizard version from 3.2.6 to 4.2.25. Meanwhile, introduce `metrics_jvm_thread_peak_count_Value` and `metrics_jvm_thread_total_started_count_Value` in `celeborn-jvm-dashboard.json`.
### Why are the changes needed?
Dropwizard metrics has released v4.2.25, which includes bug fixes and improvements such as:
* [JVM] Fix maximum/total memory calculation: https://github.com/dropwizard/metrics/pull/3125
* [Thread] Add peak and total started thread count to `ThreadStatesGaugeSet`: https://github.com/dropwizard/metrics/pull/1601
Meanwhile, Ratis version has upgraded to 3.0.1 which has no compatibility problem with Dropwizard 4.2.25.
Backport:
- https://github.com/apache/spark/pull/26332
- https://github.com/apache/spark/pull/29426
- https://github.com/apache/spark/pull/37372
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #2540 from SteNicholas/CELEBORN-1389.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
To support memory file storage.
### Why are the changes needed?
To improve shuffle performance for small shuffle files.
Design doc: https://docs.google.com/document/d/1SM-oOM0JHEIoRHTYhE9PYH60_1D3NMxDR50LZIM7uW0/edit?usp=sharing
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA and manually test on a cluster.
Closes #2300 from FMX/B914.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
The Celeborn cluster does not currently provide metrics for 'TotalCapacity' and 'TotalFreeCapacity'.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA
Closes #2521 from RexXiong/CELEBORN-1427.
Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster.
### Why are the changes needed?
It's recommended to introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
In our test cluster (we can see the value of activeSlots increases and then back to 0 after the application finished, and slotsAllocated is increasing all the way).
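The behavior described in the test (one value that returns to 0, another that only grows) is the difference between a gauge of outstanding slots and a cumulative counter. A minimal sketch, with hypothetical class and method names that do not mirror Celeborn's master code:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration: cumulative allocated slots versus currently active slots. */
final class SlotMetricsSketch {
  // Only ever increases: total slots handed out since startup.
  private final AtomicLong slotsAllocated = new AtomicLong();
  // Rises on allocation and falls on release: current demand.
  private final AtomicLong activeSlots = new AtomicLong();

  void allocate(int slots) {
    slotsAllocated.addAndGet(slots);
    activeSlots.addAndGet(slots);
  }

  void release(int slots) {
    activeSlots.addAndGet(-slots);
  }

  long slotsAllocated() {
    return slotsAllocated.get();
  }

  long activeSlots() {
    return activeSlots.get();
  }
}
```

After allocating 10 slots and releasing all 10, `activeSlots()` is back to 0 while `slotsAllocated()` remains 10.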

Closes #2386 from CodingCat/slots_decrease.
Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `ShutdownWorkerCount` metric to record the count of workers in shutdown list.
<img width="1432" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/bc84b281-30ca-40a1-92e4-fb9cf10b5aeb">
### Why are the changes needed?
`/shutdownWorkers` lists all shutdown workers of the master at present. Therefore it's recommended to introduce ShutdownWorkerCount metric to record the count of workers in shutdown list.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/c44822917403401690edb15617ec9f08)
Closes #2379 from SteNicholas/CELEBORN-1323.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Rename `LostWorkers` metric to `LostWorkerCount` to align the naming style of other worker count metrics.
### Why are the changes needed?
The naming of the `LostWorkers` metric differs from other metrics of `MasterSource` such as `WorkerCount` and `ExcludedWorkerCount`, so it is renamed to `LostWorkerCount` to align the naming style.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2378 from SteNicholas/CELEBORN-1322.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `ReplicateDataFailNonCriticalCauseCount` metric in Grafana dashboard. Follow up #2323.
### Why are the changes needed?
The `ReplicateDataFailNonCriticalCauseCount` metric should be supported in the Grafana dashboard via `celeborn-dashboard.json`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/6e50cc2c7af34692babcc2809066e147)
Closes #2332 from SteNicholas/CELEBORN-1282.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `OpenStreamSuccessCount`, `FetchChunkSuccessCount` and `WriteDataSuccessCount` metrics in the Grafana dashboard.
### Why are the changes needed?
The `OpenStreamSuccessCount`, `FetchChunkSuccessCount` and `WriteDataSuccessCount` metrics should be supported in the Grafana dashboard via `celeborn-dashboard.json`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)
Closes #2269 from SteNicholas/CELEBORN-1246.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Refactor metrics name.
### Why are the changes needed?
Easier to understand the meaning of metrics
### Does this PR introduce _any_ user-facing change?
METRICS.md
migration.md
monitoring.md
### How was this patch tested?
Existing UTs.
Closes #2240 from leixm/metrics_name.
Authored-by: xianminglei <xianming.lei@shopee.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>