### What changes were proposed in this pull request?
Introduce `IsHighWorkload` metric to monitor worker overload status.
### Why are the changes needed?
There is no any metric to monitor worker overload status at present.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
[Grafana test](https://xy2953396112.grafana.net/public-dashboards/22ab1750ef874a1bb39b5879b81a24cf).
Closes#3435 from xy2953396112/CELEBORN-2118.
Authored-by: xxx <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Add `PausePushDataStatus` and `PausePushDataAndReplicateStatus` metric.
### Why are the changes needed?
Introduce `PausePushDataStatus` and `PausePushDataAndReplicateStatus` metric to record status of pause push data.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test. [Grafana](https://xy2953396112.grafana.net/public-dashboards/21af8e2844234c438e74c741211f0032)
Closes#3426 from xy2953396112/CELEBORN-2112.
Authored-by: xxx <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Introduce `SorterCacheHitRate` metric to monitor the hit reate of index cache for sorter.
### Why are the changes needed?
Monitor the hit rate of `PartitionFilesSorter#indexCache`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The verified grafana dashboard: https://xy2953396112.grafana.net/public-dashboards/5d1177ee0f784b53ad817fde919141b7Closes#3416 from xy2953396112/CELEBORN_2102.
Authored-by: dz <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Document introduced metrics into `monitoring.md` including `FetchChunkTransferTime`, `FetchChunkTransferSize`, `FlushWorkingQueueSize`, `LocalFlushCount`, `LocalFlushSize`, `HdfsFlushCount`, `HdfsFlushSize`, `OssFlushCount`, `OssFlushSize`, `S3FlushCount`, `S3FlushSize`.
### Why are the changes needed?
Introduced metrics `FetchChunkTransferTime`, `FetchChunkTransferSize`, `FlushWorkingQueueSize`, `LocalFlushCount`, `LocalFlushSize`, `HdfsFlushCount`, `HdfsFlushSize`, `OssFlushCount`, `OssFlushSize`, `S3FlushCount`, `S3FlushSize` don't document in `monitoring.md`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#3398 from SteNicholas/document-monitoring.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Add a new sink and allow the user to store metrics to files.
2. Celeborn will scrape its metrics periodically to make sure that the metric data won't be too large to cause OOM.
### Why are the changes needed?
A long-running worker ran out of memory and found out that the metrics are huge in the heap dump.
As you can see below, the biggest object is the time metric queue, and I got 1.6 million records.
<img width="1516" alt="Screenshot 2025-06-24 at 09 59 30" src="https://github.com/user-attachments/assets/691c7bc2-b974-4cc0-8d5a-bf626ab903c0" />
<img width="1239" alt="Screenshot 2025-06-24 at 14 45 10" src="https://github.com/user-attachments/assets/ebdf5a4d-c941-4f1e-911f-647aa156b37a" />
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
Cluster.
Closes#3346 from FMX/b2045.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <ethanfeng@apache.org>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
<!--
Thanks for sending a pull request! Here are some tips for you:
- Make sure the PR title start w/ a JIRA ticket, e.g. '[CELEBORN-XXXX] Your PR title ...'.
- Be sure to keep the PR description updated to reflect all changes.
- Please write your PR title to summarize what this PR proposes.
- If possible, provide a concise example to reproduce the issue for a faster review.
-->
### What changes were proposed in this pull request?
Added a commit files request fail count metric.
### Why are the changes needed?
To monitor and tune the configurations around the commit files workflow.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Local setup
<img width="739" alt="Screenshot 2025-06-04 at 10 51 06 AM" src="https://github.com/user-attachments/assets/d6256028-d8b7-4a81-90b1-3dcbf61adeba" />
Closes#3307 from s0nskar/commit_metric.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Adding register with master fail count metric for worker
### Why are the changes needed?
This will help put monitoring around if workers are not able to register with master like wrong endpoints are passed or master becomes unavailable.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Local setup
<img width="724" alt="Screenshot 2025-06-04 at 10 44 56 AM" src="https://github.com/user-attachments/assets/1f84557b-5df8-422f-b602-bb5316a72a0e" />
Closes#3308 from s0nskar/worker_register_metric.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Introduce `ApplicationTotalCount` and `ApplicationFallbackCount` metric to record the total and fallback count of application.
### Why are the changes needed?
There is no any metric to record the total count of application running with celeborn shuffle and engine bulit-in shuffle and the fallback count of application. Meanwhile, the fallback of Flink shuffle is based on job granularity rather than shuffle granularity.
Follw up https://github.com/apache/celeborn/pull/3012#issuecomment-2553488532.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- `DefaultMetaSystemSuiteJ#testShuffleAndApplicationCountWithFallback`
- `RatisMasterStatusSystemSuiteJ#testShuffleAndApplicationCountWithFallback`
Closes#3026 from SteNicholas/CELEBORN-1800.
Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Adding a worker metrics for publish unreleased partition location count when worker was gracefully shutdown.
<img width="742" alt="Screenshot 2025-04-16 at 1 19 18 AM" src="https://github.com/user-attachments/assets/159f744a-cd76-45a2-9387-930f27dd72be" />
### Why are the changes needed?
Similar to https://github.com/apache/celeborn/pull/2711, Currently celeborn don't publish the count of unreleased partition location when worker is gracefully exit. This can be useful for monitoring and configuring the gracefulShutdownTimeout.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
NA
Closes#3213 from s0nskar/unrelease_partition_location.
Lead-authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Add two metrics (raft commitIndex of each master and maxCommitIndex - minCommitIndex value).
### Why are the changes needed?
To observe the metadata synchronization of the raft cluster.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.

Closes#3063 from zaynt4606/clb1831.
Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
this PR adds the file size metrics for workers
### Why are the changes needed?
the reason for us to add this metric is that we observed that, likely due to the delayed processing of split messages, we have jobs writing 40-50g files even the split threshold is 10g (we use soft split)
we want to have this metrics to monitor the severity of the issue
### Does this PR introduce _any_ user-facing change?
yes, one more metrics
### How was this patch tested?
(ignore the dashboard title, it's a dummy one)

Closes#3047 from CodingCat/committed_file_size.
Authored-by: Nan <nzhu@pinterest.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Current implementation uses `
shuffleSortTaskDeque.size()` as current sorting file count.This value might be more appropriately described as the sort tasks waiting to be submitted to `fileSorterExecutors`. And the actual current sorting file number ( doing some disk-io operation etc) should be get from `sortingShuffleFiles`.
### Why are the changes needed?
Add metrics to monitor current sorting files which is making disk-io operations.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Closes#3040 from Z1Wu/fix/sorting_file_metrics.
Authored-by: wuziyi <wuziyi02@corp.netease.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. cache the available workers
2. Only count the available workers device free capacity.
3. place the metrics_AvailableWorkerCount_Value in overall and metrics_WorkerCount_Value in `Master` part
### Why are the changes needed?
Cache the available workers to reduce the computation that need to loop the workers frequently.
To have an accurate device capacity overview that does not include the excluded workers, decommissioning workers, etc.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
UT.
<img width="1705" alt="image" src="https://github.com/user-attachments/assets/bee17b4e-785d-4112-8410-dbb684270ec0">
Closes#2827 from turboFei/device_free.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. `ShuffleFallbackPolicy` supports `ShuffleFallbackCount` metric to provide the shuffle fallback count of each fallback policy.
2. Introduce `ShuffleTotalCount` metric to record the total count of shuffle.
3. Fix Spark 2 does not increment shuffle count via `LifecycleManager`.
### Why are the changes needed?
The implementations of `ShuffleFallbackPolicy` does not support `ShuffleFallbackCount` metric at present. Meanwhile, Bilibili production practice needs `ShuffleFallbackCount` of different `ShuffleFallbackPolicy`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.
Closes#2891 from SteNicholas/CELEBORN-1685.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
As title, introduce metrics_ShuffleFallbackCount_Value.
### Why are the changes needed?
To provide the insights that how many shuffles fallback to spark built-in shuffle service. It is helpful for us to deprecate the ESS progressively.
Currently, we plan to set the `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` to fallback the shuffle with too large shuffle partitions number, for example: 50k.
In the future, we plan to limit the acceptable maximum shuffle partition number so that the bad job would be rejected and not impact the celeborn master health.
### Does this PR introduce _any_ user-facing change?
Yes, new metrics.
### How was this patch tested?
UT.
<img width="1188" alt="image" src="https://github.com/user-attachments/assets/8193c12c-5dc9-4783-b64b-6a8449a1bea4">
Closes#2866 from turboFei/record_fallback.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
`NettyMemoryMetrics` supports `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`. Meanwhile, remove `server_` prefix from metric name of netty memory metric in `monitoring.md`.
### Why are the changes needed?
`PooledByteBufAllocatorMetric` provides the following API to support netty memory metrics:
```
public int numHeapArenas() {
return this.allocator.numHeapArenas();
}
public int numDirectArenas() {
return this.allocator.numDirectArenas();
}
public List<PoolArenaMetric> heapArenas() {
return this.allocator.heapArenas();
}
public List<PoolArenaMetric> directArenas() {
return this.allocator.directArenas();
}
public int numThreadLocalCaches() {
return this.allocator.numThreadLocalCaches();
}
public int tinyCacheSize() {
return this.allocator.tinyCacheSize();
}
public int smallCacheSize() {
return this.allocator.smallCacheSize();
}
public int normalCacheSize() {
return this.allocator.normalCacheSize();
}
public int chunkSize() {
return this.allocator.chunkSize();
}
public long usedHeapMemory() {
return this.allocator.usedHeapMemory();
}
public long usedDirectMemory() {
return this.allocator.usedDirectMemory();
}
```
`NettyMemoryMetrics` only supports `usedHeapMemory` and `usedDirectMemory`, which could support `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/a520ca36a33843a38bbde28387023f97)
Closes#2802 from SteNicholas/CELEBORN-1640.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Add navigation for `REST API` document.
### Why are the changes needed?
`REST API` document does not have any navigation, which is better to add navigation to guide REST API.
<img width="1438" alt="image" src="https://github.com/user-attachments/assets/b5b3a14a-38d4-4769-bffb-3acd571d5dbb">
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#2775 from SteNicholas/navigate-rest-api.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Adding a worker metrics for publish unreleased shuffle count when worker was decommissioned.
<img width="885" alt="Screenshot 2024-09-16 at 11 12 33 AM" src="https://github.com/user-attachments/assets/c81f36c1-cbed-44fe-814b-88f3ff29875d">
### Why are the changes needed?
Currently celeborn don't publish the count of unreleased shuffle key which gets lost when a worker is decommissioned. This can be useful for monitoring and configuring the `forceExitTimeout`.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
NA
Closes#2711 from s0nskar/unrelease_shuffle_metric.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Impl worker write process for Flink Hybrid Shuffle.
### Why are the changes needed?
We supports tiered producer write data from flink to worker. In this PR, we enable the worker to write this kind of data to storage.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
no need.
Closes#2741 from reswqa/cip6-6-pr.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Currently metrics have workers and excludedWorkers and other metadata for master service but don't have metadata for available workers. This PR supplemented this part.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Local test

Closes#2723 from zaynt4606/availableWorker.
Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Adding documentation for missing memory file storage metrics.
### Why are the changes needed?
Few new metrics were added in https://github.com/apache/celeborn/pull/2300 but they were missing their documentation in monitoring.md
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
NA
Closes#2705 from s0nskar/memory_metrics.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
As title, merge these two similar user guides.
### Why are the changes needed?
To close CELEBORN-1437
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Preview https://github.com/turboFei/incubator-celeborn/blob/metrics_merge/docs/monitoring.md#setup-prometheus-dashboardCloses#2623 from turboFei/metrics_merge.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Move Rest API out from monitoring.md to webapi.md
### Why are the changes needed?
To close CELEBORN-1436
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Review https://github.com/turboFei/incubator-celeborn/blob/webapi_md/docs/webapi.mdCloses#2624 from turboFei/webapi_md.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record situation of cleaning up expired shuffle keys.
### Why are the changes needed?
There is a backlog of task queue for cleaning up shuffle data of expired shuffle keys in the production environment. It's recommended to introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record the progress of cleaning up expired shuffle keys.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/4b5a0b79a35e4ddbb18ddccfe2ec06d7)
Closes#2557 from SteNicholas/CELEBORN-1459.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce worker decommission metrics and corresponding REST API.
### Why are the changes needed?
In a production environment, due to certain hardware or environmental reasons, our script will automatically decommission the node. At this time, we need to distinguish between graceful shutdown nodes and decommissioned nodes.
If we distinguish shutdown worker and decommission worker metrics, we can achieve better operation and maintenance.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
- `DefaultMetaSystemSuiteJ#testHandleReportWorkerDecommission`
- `RatisMasterStatusSystemSuiteJ#testHandleReportWorkerDecommission`
- `ApiMasterResourceSuite#decommissionWorkers`
- `ApiWorkerResourceSuite#isDecommissioning`
Closes#2535 from leixm/issue_1444.
Lead-authored-by: Xianming Lei <jerrylei@apache.org>
Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Dropwizard version from 3.2.6 to 4.2.25. Meanwhile, introduce `metrics_jvm_thread_peak_count_Value` and `metrics_jvm_thread_total_started_count_Value` in `celeborn-jvm-dashboard.json`.
### Why are the changes needed?
Dropwizard metrics has released v4.2.25 including some bugfixes and improvements including:
* [JVM] Fix maximum/total memory calculation: https://github.com/dropwizard/metrics/pull/3125
* [Thread] Add peak and total started thread count to `ThreadStatesGaugeSet`: https://github.com/dropwizard/metrics/pull/1601
Meanwhile, Ratis version has upgraded to 3.0.1 which has no compatibility problem with Dropwizard 4.2.25.
Backport:
- https://github.com/apache/spark/pull/26332
- https://github.com/apache/spark/pull/29426
- https://github.com/apache/spark/pull/37372
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes#2540 from SteNicholas/CELEBORN-1389.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
The Celeborn cluster does not currently provide metrics for 'TotalCapacity' and 'TotalFreeCapacity
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA
Closes#2521 from RexXiong/CELEBORN-1427.
Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Improve parameters, description and document of Celeborn REST API, including:
1. The POST request uses `FormParam` instead of `QueryParam`.
2. The parameter name uses lowercase instead of uppercase.
3. The description of `/exclude` aligns with document in `monitoring.md`.
4. The document of `REST API` adds the `Method` and `Parameters` to document GET/POST method and corresponding interface.
### Why are the changes needed?
The parameters, description and document of REST API need to improve after http server refine.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2495 from SteNicholas/CELEBORN-1317.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
Handle worker event use wrong request.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`RatisMasterStatusSystemSuiteJ#testHandleWorkerEvent`
Closes#2493 from RexXiong/CELEBORN-1245-FOLLOW-UP.
Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Gauge `is_terminating`, `is_terminated` and `is_shutdown` should represent a single numerical value instead of boolean value.
### Why are the changes needed?
A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. The value type of `is_terminating`, `is_terminated` and `is_shutdown` should be numerical, otherwise `AbstractSource#addGauge` would warn the failed log as follows:
```
2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_terminating failed, the value type class java.lang.Boolean is not a number
2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_terminated failed, the value type class java.lang.Boolean is not a number
2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_shutdown failed, the value type class java.lang.Boolean is not a number
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes#2457 from SteNicholas/CELEBORN-1236.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster.
### Why are the changes needed?
It's recommended to introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
In our test cluster (we can see the value of activeSlots increases and then back to 0 after the application finished, and slotsAllocated is increasing all the way).

Closes#2386 from CodingCat/slots_decrease.
Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `ShutdownWorkerCount` metric to record the count of workers in shutdown list.
<img width="1432" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/bc84b281-30ca-40a1-92e4-fb9cf10b5aeb">
### Why are the changes needed?
`/shutdownWorkers` lists all shutdown workers of the master at present. Therefore it's recommended to introduce ShutdownWorkerCount metric to record the count of workers in shutdown list.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/c44822917403401690edb15617ec9f08)
Closes#2379 from SteNicholas/CELEBORN-1323.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Rename `LostWorkers` metric to `LostWorkerCount` to align the naming style of other worker count metrics.
### Why are the changes needed?
The naming of `LostWorkers` metric is different from other metric of `MasterSource` like `WorkerCount`, `ExcludedWorkerCount` etc, which could be renamed to `LostWorkerCount` to align the naming style.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#2378 from SteNicholas/CELEBORN-1322.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Optimize the handling of exceptions during the push of replica data, now only throwing PUSH_DATA_CONNECTION_EXCEPTION_REPLICA in specific scenarios.
### Why are the changes needed?
When handling exceptions related to pushing replica data in the worker, unmatched exceptions, such as 'file already closed,' are uniformly transformed into REPLICATE_DATA_CONNECTION_EXCEPTION_COUNT and returned to the client. The client then excludes the peer node based on this count, which may not be appropriate in certain scenarios. For instance, in the case of an exception like 'file already closed,' it typically occurs during multiple splits and commitFile operations. Excluding a large number of nodes under such circumstances is clearly not in line with expectations.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
through exist uts
Closes#2323 from lyy-pineapple/CELEBORN-1282.
Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce Rest API of listing dynamic configuration `/listDynamicConfigs` to list the dynamic configs. The result of `/listDynamicConfigs` is as follows:
```
=========================== Dynamic Configuration ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 100000
celeborn.worker.flusher.buffer.size 64k
=========================== SYSTEM ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 200000
celeborn.worker.flusher.buffer.size 128k
=========================== TENANT ============================
=========================== Tenant: tenantId1 ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 300000
celeborn.worker.flusher.buffer.size 256k
=========================== TENANT_USER ============================
=========================== Tenant: tenantId1, Name: user1 ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 400000
celeborn.worker.flusher.buffer.size 512k
```
### Why are the changes needed?
Celeborn supports dynamic configuration with `ConfigService` at present. It's recommend to introduce Rest API of dynamic configuration management.
### Does this PR introduce _any_ user-facing change?
- Introduce Rest API of listing dynamic configuration: `/listDynamicConfigs?level=[system|tenant|tenant_user]&tenant=tenantId1&name=user1`.
### How was this patch tested?
- `HttpUtilsSuite#CELEBORN-1056: Introduce Rest API of listing dynamic configuration`
Closes#2311 from SteNicholas/CELEBORN-1056.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Introduce application dimension resource consumption metrics for `ResourceConsumptionSource`.
### Why are the changes needed?
`ResourceConsumption` namespace metrics are generated for each user and they are identified using a metric tag at present. It's recommended to introduce application dimension resource consumption metrics that expose application dimension resource consumption of Master and Worker. By monitoring resource consumption in the application dimension, you can obtain the actual situation of application resource consumption.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- `WorkerInfoSuite#WorkerInfo toString output`
- `PbSerDeUtilsTest#fromAndToPbResourceConsumption`
- `MasterStateMachineSuitej#testObjSerde`
Closes#2161 from SteNicholas/CELEBORN-1174.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
1. Support Celeborn Master(Leader) to manage workers by sending event when heartbeat
2. Add Worker Status to Worker then we can know the status of the workers(such as during decommission...)
3. Add Http interface for master to handleWorkerEvent/getWorkerEvent
### Why are the changes needed?
Currently, we only support managing the status of workers on the worker side. This pr supports the master to manage the status of all workers. By sending events such as (Decommission/Graceful/Exit) when heartbeat, workers can be asynchronously execute the command from master. MeanWhile we can't know what the worker status during worker decommission so this pr add worker status to tell the exactly status of the worker.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes#2255 from RexXiong/CELEBORN-1245.
Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `OpenStreamSuccessCount`, `FetchChunkSuccessCount` and `WriteDataSuccessCount` metric to expose the count of opening stream, fetching chunk and writing data successfully in current worker.
### Why are the changes needed?
The ratio of opening stream, fetching chunk and writing data failed is important for Celeborn performance to balance the healty of cluster, which is lack of the count of opening stream, fetching chunk and writing data successfully.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#2252 from AngersZhuuuu/CELEBORN-1246.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Refactor metrics name.
### Why are the changes needed?
Easier to understand the meaning of metrics
### Does this PR introduce _any_ user-facing change?
METRICS.md
migration.md
monitoring.md
### How was this patch tested?
Existing UTs.
Closes#2240 from leixm/metrics_name.
Authored-by: xianminglei <xianming.lei@shopee.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Align master and worker metrics of document with `MasterSource` and `WorkerSource` in `METRICS.md` and `monitoring.md`.
### Why are the changes needed?
Metrics of master and worker is inconsistent with `MasterSource` and `WorkerSource` at present. It is recommended to align master and worker metrics of document with `MasterSource` and `WorkerSource`:
- PushDataHandshakeFailCount
- RegionStartFailCount
- RegionFinishFailCount
- PrimaryPushDataHandshakeTime
- ReplicaPushDataHandshakeTime
- PrimaryRegionStartTime
- ReplicaRegionStartTime
- PrimaryRegionFinishTime
- ReplicaRegionFinishTime
- ActiveConnectionCount
- BufferStreamReadBuffer
- ReadBufferDispatcherRequestsLength
- ReadBufferAllocatedCount
- CreditStreamCount
- ActiveMapPartitionCount
- DeviceOSFreeBytes
- DeviceOSTotalBytes
- DeviceCelebornFreeBytes
- DeviceCelebornTotalBytes
- PotentialConsumeSpeed
- UserProduceSpeed
- WorkerConsumeSpeed
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes#2226 from SteNicholas/CELEBORN-1223.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `PausePushDataAndReplicateTime` metric to record time for a worker to stop receiving pushData from clients and other workers.
### Why are the changes needed?
`PausePushData` means the count for a worker to stop receiving pushData from clients because of back pressure. Meanwhile, `PausePushDataAndReplicate` means the count for a worker to stop receiving pushData from clients and other workers because of back pressure. Therefore,`PausePushDataTime` records the time for a worker to stop receiving pushData from clients or other workers, of which definition is confusing for users. It's recommended that `PausePushDataAndReplicateTime` metric is introduced that means the time for a worker to stop receiving pushData from clients and other workers because of back pressure.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)
- `MemoryManagerSuite#[CELEBORN-882] Test MemoryManager check memory thread logic`
Closes#2221 from SteNicholas/CELEBORN-1215.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `WriteDataHardSplitCount` metric to record `HARD_SPLIT` partitions of PushData and PushMergedData.
### Why are the changes needed?
As the log level of `PushDataHandler#handlePushData` and `PushDataHandler#handlePushMergedData` use the DEBUG level, `WriteDataHardSplitCount` metric shoud be introduced to record HARD_SPLIT partitions of PushData and PushMergedData for `PushDataHandler`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
[Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)
Closes#2217 from SteNicholas/CELEBORN-1214.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduces `ChunkStreamCount`, `OpenStreamFailCount` metrics about opening stream of `FetchHandler`:
- `WorkerSource` adds `ChunkStreamCount`, `OpenStreamFailCount` metrics.
- Corrects the grafana dashboard of `celeborn-dashboard.json`. `celeborn-dashboard.json` has been verified via [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s). For example:
1. `"expr": "metrics_RunningApplicationCount_Value"`
2. Moves the panel positition of `FetchChunkFailCount` to `FetchRelatives` instead of `PushRelatives`.
3. Updates the `gridPos` of some panels.
### Why are the changes needed?
There are no any metrics about opening stream of `FetchHandler` for Celeborn Worker.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
[Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)
Closes#2212 from SteNicholas/CELEBORN-1100.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `RunningApplicationCount` metric and `/applications` API to record running applications for Celeborn worker.
### Why are the changes needed?
`RunningApplicationCount` metrics only monitors the count of running applications in the cluster for master. Meanwhile, `/listTopDiskUsedApps` API lists the top disk usage application ids for master and worker. Therefore `RunningApplicationCount` metric and `/applications` API could be introduced to record running applications of worker.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal tests.
Closes#2172 from SteNicholas/CELEBORN-1189.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.
### Why are the changes needed?
`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal tests.
Closes#2171 from SteNicholas/CELEBORN-1187.
Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: 蒋晓峰 <jiangxiaofeng@bilibili.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `FetchChunkFailCount` metric to expose the count of fetching chunk failed in current worker.
### Why are the changes needed?
The metrics about the count of PushData or PushMergedData failed in current worker is supported at present. It's better to support `FetchChunkFailCount` metric to expose the count of fetching chunk failed in current worker.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal test.
Closes#2151 from SteNicholas/CELEBORN-1164.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Support exclude worker manually given worker id. This worker is added into excluded workers manually.
### Why are the changes needed?
Celeborn supports to shuffle client-side fetch and push exclude workers on failure at present. It's necessary to exclude worker manually for maintaining the Celeborn cluster.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- `HttpUtilsSuite`
- `DefaultMetaSystemSuiteJ#testHandleWorkerExclude`
- `RatisMasterStatusSystemSuiteJ#testHandleWorkerExclude`
- `MasterStateMachineSuiteJ#testObjSerde`
Closes#1997 from SteNicholas/CELEBORN-448.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Add the metric `ResourceConsumption` to monitor each user's quota usage of Celeborn Worker.
### Why are the changes needed?
The metric `ResourceConsumption` supports to monitor each user's quota usage of Celeborn Master at present. The usage of Celeborn Worker also needs to monitor.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal tests.
Closes#2059 from SteNicholas/CELEBORN-247.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
…Rpcs and outstandingPushes to metrics"
This reverts commit bfa341c32f.
### What changes were proposed in this pull request?
### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/1992#issuecomment-1776760369
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#2032 from cfmcgrady/revert-pr-1992.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>