celeborn

Author	SHA1	Message	Date
xxx	a9490d6e24	[CELEBORN-2118] Introduce IsHighWorkload metric to monitor worker overload status ### What changes were proposed in this pull request? Introduce `IsHighWorkload` metric to monitor worker overload status. ### Why are the changes needed? There is currently no metric to monitor worker overload status. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? [Grafana test](https://xy2953396112.grafana.net/public-dashboards/22ab1750ef874a1bb39b5879b81a24cf). Closes #3435 from xy2953396112/CELEBORN-2118. Authored-by: xxx <953396112@qq.com> Signed-off-by: SteNicholas <programgeek@163.com>	2025-08-25 20:46:17 +08:00
xxx	661a096b77	[CELEBORN-2112] Introduce PausePushDataStatus and PausePushDataAndReplicateStatus metric to record status of pause push data ### What changes were proposed in this pull request? Add `PausePushDataStatus` and `PausePushDataAndReplicateStatus` metric. ### Why are the changes needed? Introduce `PausePushDataStatus` and `PausePushDataAndReplicateStatus` metric to record status of pause push data. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. [Grafana](https://xy2953396112.grafana.net/public-dashboards/21af8e2844234c438e74c741211f0032) Closes #3426 from xy2953396112/CELEBORN-2112. Authored-by: xxx <953396112@qq.com> Signed-off-by: SteNicholas <programgeek@163.com>	2025-08-21 11:17:44 +08:00
dz	11b41f97ad	[CELEBORN-2102] Introduce SorterCacheHitRate metric to monitor the hit rate of index cache for sorter ### What changes were proposed in this pull request? Introduce `SorterCacheHitRate` metric to monitor the hit rate of index cache for sorter. ### Why are the changes needed? Monitor the hit rate of `PartitionFilesSorter#indexCache`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The verified grafana dashboard: https://xy2953396112.grafana.net/public-dashboards/5d1177ee0f784b53ad817fde919141b7 Closes #3416 from xy2953396112/CELEBORN_2102. Authored-by: dz <953396112@qq.com> Signed-off-by: SteNicholas <programgeek@163.com>	2025-08-20 10:47:38 +08:00
SteNicholas	4540b5772b	[MINOR] Document introduced metrics into monitoring.md ### What changes were proposed in this pull request? Document introduced metrics into `monitoring.md` including `FetchChunkTransferTime`, `FetchChunkTransferSize`, `FlushWorkingQueueSize`, `LocalFlushCount`, `LocalFlushSize`, `HdfsFlushCount`, `HdfsFlushSize`, `OssFlushCount`, `OssFlushSize`, `S3FlushCount`, `S3FlushSize`. ### Why are the changes needed? Introduced metrics `FetchChunkTransferTime`, `FetchChunkTransferSize`, `FlushWorkingQueueSize`, `LocalFlushCount`, `LocalFlushSize`, `HdfsFlushCount`, `HdfsFlushSize`, `OssFlushCount`, `OssFlushSize`, `S3FlushCount`, `S3FlushSize` are not documented in `monitoring.md`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #3398 from SteNicholas/document-monitoring. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2025-07-29 14:33:46 +08:00
mingji	7a0eee332a	[CELEBORN-2045] Add logger sinks to allow persist metrics data and avoid possible worker OOM ### What changes were proposed in this pull request? 1. Add a new sink and allow the user to store metrics to files. 2. Celeborn will scrape its metrics periodically to make sure that the metric data won't be too large to cause OOM. ### Why are the changes needed? A long-running worker ran out of memory, and the heap dump showed that the metrics were huge. As you can see below, the biggest object is the time metric queue, which held 1.6 million records. <img width="1516" alt="Screenshot 2025-06-24 at 09 59 30" src="https://github.com/user-attachments/assets/691c7bc2-b974-4cc0-8d5a-bf626ab903c0" /> <img width="1239" alt="Screenshot 2025-06-24 at 14 45 10" src="https://github.com/user-attachments/assets/ebdf5a4d-c941-4f1e-911f-647aa156b37a" /> ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? Cluster. Closes #3346 from FMX/b2045. Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Ethan Feng <ethanfeng@apache.org> Co-authored-by: Fei Wang <fwang12@ebay.com> Signed-off-by: Wang, Fei <fwang12@ebay.com>	2025-06-26 18:42:20 -07:00
Sanskar Modi	2a2c6e4687	[CELEBORN-2024] Publish commit files fail count metrics ### What changes were proposed in this pull request? Added a commit files request fail count metric. ### Why are the changes needed? To monitor and tune the configurations around the commit files workflow. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Local setup <img width="739" alt="Screenshot 2025-06-04 at 10 51 06 AM" src="https://github.com/user-attachments/assets/d6256028-d8b7-4a81-90b1-3dcbf61adeba" /> Closes #3307 from s0nskar/commit_metric. Authored-by: Sanskar Modi <sanskarmodi97@gmail.com> Signed-off-by: Wang, Fei <fwang12@ebay.com>	2025-06-17 11:52:45 -07:00
Sanskar Modi	80bdb46801	[CELEBORN-1892] Adding register with master fail count metric for worker ### What changes were proposed in this pull request? Adding register with master fail count metric for worker ### Why are the changes needed? This will help put monitoring around if workers are not able to register with master like wrong endpoints are passed or master becomes unavailable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Local setup <img width="724" alt="Screenshot 2025-06-04 at 10 44 56 AM" src="https://github.com/user-attachments/assets/1f84557b-5df8-422f-b602-bb5316a72a0e" /> Closes #3308 from s0nskar/worker_register_metric. Authored-by: Sanskar Modi <sanskarmodi97@gmail.com> Signed-off-by: Wang, Fei <fwang12@ebay.com>	2025-06-11 11:04:59 -07:00
SteNicholas	d9984c9e0e	[CELEBORN-1800] Introduce ApplicationTotalCount and ApplicationFallbackCount metric to record the total and fallback count of application ### What changes were proposed in this pull request? Introduce `ApplicationTotalCount` and `ApplicationFallbackCount` metric to record the total and fallback count of application. ### Why are the changes needed? There is currently no metric to record the total count of applications running with celeborn shuffle and engine built-in shuffle, or the fallback count of applications. Meanwhile, the fallback of Flink shuffle is based on job granularity rather than shuffle granularity. Follow up https://github.com/apache/celeborn/pull/3012#issuecomment-2553488532. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - `DefaultMetaSystemSuiteJ#testShuffleAndApplicationCountWithFallback` - `RatisMasterStatusSystemSuiteJ#testShuffleAndApplicationCountWithFallback` Closes #3026 from SteNicholas/CELEBORN-1800. Lead-authored-by: SteNicholas <programgeek@163.com> Co-authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: Wang, Fei <fwang12@ebay.com>	2025-05-19 07:20:00 -07:00
Sanskar Modi	9ba54b39e2	[CELEBORN-1968] Publish metric for unreleased partition location count when worker was gracefully shutdown ### What changes were proposed in this pull request? Add a worker metric that publishes the unreleased partition location count when a worker is gracefully shut down. <img width="742" alt="Screenshot 2025-04-16 at 1 19 18 AM" src="https://github.com/user-attachments/assets/159f744a-cd76-45a2-9387-930f27dd72be" /> ### Why are the changes needed? Similar to https://github.com/apache/celeborn/pull/2711, Celeborn currently does not publish the count of unreleased partition locations when a worker exits gracefully. This can be useful for monitoring and configuring the gracefulShutdownTimeout. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? NA Closes #3213 from s0nskar/unrelease_partition_location. Lead-authored-by: Sanskar Modi <sanskarmodi97@gmail.com> Co-authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: Wang, Fei <fwang12@ebay.com>	2025-05-12 04:34:44 -07:00
zhengtao	ac0d335f40	[CELEBORN-1831] Add ratis commitIndex metrics ### What changes were proposed in this pull request? Add two metrics (raft commitIndex of each master and maxCommitIndex - minCommitIndex value). ### Why are the changes needed? To observe the metadata synchronization of the raft cluster. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Cluster test. ![image](https://github.com/user-attachments/assets/f354a3cd-e3b3-4af0-98c2-fc13330b2d81) Closes #3063 from zaynt4606/clb1831. Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2025-01-17 10:58:06 +08:00
Nan	ca60613f2f	[CELEBORN-1817] add committed file size metrics ### What changes were proposed in this pull request? this PR adds the file size metrics for workers ### Why are the changes needed? the reason for us to add this metric is that we observed that, likely due to the delayed processing of split messages, we have jobs writing 40-50g files even the split threshold is 10g (we use soft split) we want to have this metrics to monitor the severity of the issue ### Does this PR introduce _any_ user-facing change? yes, one more metrics ### How was this patch tested? (ignore the dashboard title, it's a dummy one) ![image](https://github.com/user-attachments/assets/d88c15e6-d740-4def-94d5-03666bbb38ca) Closes #3047 from CodingCat/committed_file_size. Authored-by: Nan <nzhu@pinterest.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2025-01-07 10:17:45 +08:00
wuziyi	f886751e80	[CELEBORN-1812] Distinguish sorting-file from sort-tasks waiting to be submitted ### What changes were proposed in this pull request? Current implementation uses ` shuffleSortTaskDeque.size()` as current sorting file count.This value might be more appropriately described as the sort tasks waiting to be submitted to `fileSorterExecutors`. And the actual current sorting file number ( doing some disk-io operation etc) should be get from `sortingShuffleFiles`. ### Why are the changes needed? Add metrics to monitor current sorting files which is making disk-io operations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ![image](https://github.com/user-attachments/assets/6ffed37e-ad12-4d8d-a4aa-2b2695a92168) Closes #3040 from Z1Wu/fix/sorting_file_metrics. Authored-by: wuziyi <wuziyi02@corp.netease.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2025-01-04 10:27:53 +08:00
Wang, Fei	81a0d5113c	[CELEBORN-1660] Cache available workers and only count the available workers device free capacity ### What changes were proposed in this pull request? 1. cache the available workers 2. Only count the available workers device free capacity. 3. place the metrics_AvailableWorkerCount_Value in overall and metrics_WorkerCount_Value in `Master` part ### Why are the changes needed? Cache the available workers to reduce the computation that need to loop the workers frequently. To have an accurate device capacity overview that does not include the excluded workers, decommissioning workers, etc. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? UT. <img width="1705" alt="image" src="https://github.com/user-attachments/assets/bee17b4e-785d-4112-8410-dbb684270ec0"> Closes #2827 from turboFei/device_free. Authored-by: Wang, Fei <fwang12@ebay.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-11-14 11:10:45 +08:00
SteNicholas	169b6f6973	[CELEBORN-1685] ShuffleFallbackPolicy supports ShuffleFallbackCount metric ### What changes were proposed in this pull request? 1. `ShuffleFallbackPolicy` supports `ShuffleFallbackCount` metric to provide the shuffle fallback count of each fallback policy. 2. Introduce `ShuffleTotalCount` metric to record the total count of shuffle. 3. Fix Spark 2 does not increment shuffle count via `LifecycleManager`. ### Why are the changes needed? The implementations of `ShuffleFallbackPolicy` does not support `ShuffleFallbackCount` metric at present. Meanwhile, Bilibili production practice needs `ShuffleFallbackCount` of different `ShuffleFallbackPolicy`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Cluster test. Closes #2891 from SteNicholas/CELEBORN-1685. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-11-11 10:37:25 +08:00
Wang, Fei	f1bda46de4	[CELEBORN-1680] Introduce ShuffleFallbackCount metrics ### What changes were proposed in this pull request? As title, introduce metrics_ShuffleFallbackCount_Value. ### Why are the changes needed? To provide the insights that how many shuffles fallback to spark built-in shuffle service. It is helpful for us to deprecate the ESS progressively. Currently, we plan to set the `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` to fallback the shuffle with too large shuffle partitions number, for example: 50k. In the future, we plan to limit the acceptable maximum shuffle partition number so that the bad job would be rejected and not impact the celeborn master health. ### Does this PR introduce _any_ user-facing change? Yes, new metrics. ### How was this patch tested? UT. <img width="1188" alt="image" src="https://github.com/user-attachments/assets/8193c12c-5dc9-4783-b64b-6a8449a1bea4"> Closes #2866 from turboFei/record_fallback. Lead-authored-by: Wang, Fei <fwang12@ebay.com> Co-authored-by: Fei Wang <cn.feiwang@gmail.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-11-07 11:42:17 +08:00
SteNicholas	497bfdf5d7	[CELEBORN-1640] NettyMemoryMetrics supports numHeapArenas, numDirectArenas, tinyCacheSize, smallCacheSize, normalCacheSize, numThreadLocalCaches and chunkSize ### What changes were proposed in this pull request? `NettyMemoryMetrics` supports `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`. Meanwhile, remove `server_` prefix from metric name of netty memory metric in `monitoring.md`. ### Why are the changes needed? `PooledByteBufAllocatorMetric` provides the following API to support netty memory metrics: ``` public int numHeapArenas() { return this.allocator.numHeapArenas(); } public int numDirectArenas() { return this.allocator.numDirectArenas(); } public List<PoolArenaMetric> heapArenas() { return this.allocator.heapArenas(); } public List<PoolArenaMetric> directArenas() { return this.allocator.directArenas(); } public int numThreadLocalCaches() { return this.allocator.numThreadLocalCaches(); } public int tinyCacheSize() { return this.allocator.tinyCacheSize(); } public int smallCacheSize() { return this.allocator.smallCacheSize(); } public int normalCacheSize() { return this.allocator.normalCacheSize(); } public int chunkSize() { return this.allocator.chunkSize(); } public long usedHeapMemory() { return this.allocator.usedHeapMemory(); } public long usedDirectMemory() { return this.allocator.usedDirectMemory(); } ``` `NettyMemoryMetrics` only supports `usedHeapMemory` and `usedDirectMemory`, which could support `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? [Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/a520ca36a33843a38bbde28387023f97) Closes #2802 from SteNicholas/CELEBORN-1640. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-10-17 18:12:08 +08:00
SteNicholas	8bd5ac0b99	[MINOR] Add navigation for REST API document ### What changes were proposed in this pull request? Add navigation for `REST API` document. ### Why are the changes needed? `REST API` document does not have any navigation, which is better to add navigation to guide REST API. <img width="1438" alt="image" src="https://github.com/user-attachments/assets/b5b3a14a-38d4-4769-bffb-3acd571d5dbb"> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #2775 from SteNicholas/navigate-rest-api. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-10-08 20:20:37 +08:00
Sanskar Modi	961144fdbd	[CELEBORN-1582] Publish metric for unreleased shuffle count when worker was decommissioned ### What changes were proposed in this pull request? Add a worker metric that publishes the unreleased shuffle count when a worker is decommissioned. <img width="885" alt="Screenshot 2024-09-16 at 11 12 33 AM" src="https://github.com/user-attachments/assets/c81f36c1-cbed-44fe-814b-88f3ff29875d"> ### Why are the changes needed? Celeborn currently does not publish the count of unreleased shuffle keys that get lost when a worker is decommissioned. This can be useful for monitoring and configuring the `forceExitTimeout`. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? NA Closes #2711 from s0nskar/unrelease_shuffle_metric. Authored-by: Sanskar Modi <sanskarmodi97@gmail.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-10-08 17:02:25 +08:00
Weijie Guo	d8809793f3	[CELEBORN-1490][CIP-6] Impl worker write process for Flink Hybrid Shuffle ### What changes were proposed in this pull request? Impl worker write process for Flink Hybrid Shuffle. ### Why are the changes needed? We supports tiered producer write data from flink to worker. In this PR, we enable the worker to write this kind of data to storage. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? no need. Closes #2741 from reswqa/cip6-6-pr. Authored-by: Weijie Guo <reswqa@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-09-25 10:27:55 +08:00
szt	59a39952dd	[CELEBORN-1586] Add available workers Metrics ### What changes were proposed in this pull request? Currently metrics have workers and excludedWorkers and other metadata for master service but don't have metadata for available workers. This PR supplemented this part. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Local test ![image](https://github.com/user-attachments/assets/240c176c-4eef-4e3c-b34d-802291714702) Closes #2723 from zaynt4606/availableWorker. Authored-by: szt <zaynt4606@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-09-05 13:34:52 +08:00
Sanskar Modi	b7027b6011	[CELEBORN-914][FOLLOWUP] Adding metrics for memory file storage in monitoring.md ### What changes were proposed in this pull request? Adding documentation for missing memory file storage metrics. ### Why are the changes needed? Few new metrics were added in https://github.com/apache/celeborn/pull/2300 but they were missing their documentation in monitoring.md ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? NA Closes #2705 from s0nskar/memory_metrics. Authored-by: Sanskar Modi <sanskarmodi97@gmail.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-08-26 16:05:35 +08:00
Wang, Fei	0e05bc6cf9	[CELEBORN-1437][DOC] Merge METRICS.md into monitoring.md ### What changes were proposed in this pull request? As title, merge these two similar user guides. ### Why are the changes needed? To close CELEBORN-1437 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Preview https://github.com/turboFei/incubator-celeborn/blob/metrics_merge/docs/monitoring.md#setup-prometheus-dashboard Closes #2623 from turboFei/metrics_merge. Lead-authored-by: Wang, Fei <fwang12@ebay.com> Co-authored-by: Fei Wang <cn.feiwang@gmail.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-07-16 13:41:46 +08:00
Wang, Fei	6b03dcd5c2	[CELEBORN-1436][DOC] Move Rest API out from monitoring.md to webapi.md ### What changes were proposed in this pull request? Move Rest API out from monitoring.md to webapi.md ### Why are the changes needed? To close CELEBORN-1436 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Review https://github.com/turboFei/incubator-celeborn/blob/webapi_md/docs/webapi.md Closes #2624 from turboFei/webapi_md. Lead-authored-by: Wang, Fei <fwang12@ebay.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2024-07-15 10:54:03 +08:00
SteNicholas	c7b1b8d61e	[CELEBORN-1459] Introduce CleanTaskQueueSize and CleanExpiredShuffleKeysTime to record situation of cleaning up expired shuffle keys ### What changes were proposed in this pull request? Introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record situation of cleaning up expired shuffle keys. ### Why are the changes needed? There is a backlog of task queue for cleaning up shuffle data of expired shuffle keys in the production environment. It's recommended to introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record the progress of cleaning up expired shuffle keys. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? [Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/4b5a0b79a35e4ddbb18ddccfe2ec06d7) Closes #2557 from SteNicholas/CELEBORN-1459. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-06-18 16:31:57 +08:00
Xianming Lei	999510b265	[CELEBORN-1444] Introduce worker decommission metrics and corresponding REST API ### What changes were proposed in this pull request? Introduce worker decommission metrics and corresponding REST API. ### Why are the changes needed? In a production environment, due to certain hardware or environmental reasons, our script will automatically decommission the node. At this time, we need to distinguish between graceful shutdown nodes and decommissioned nodes. If we distinguish shutdown worker and decommission worker metrics, we can achieve better operation and maintenance. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? - `DefaultMetaSystemSuiteJ#testHandleReportWorkerDecommission` - `RatisMasterStatusSystemSuiteJ#testHandleReportWorkerDecommission` - `ApiMasterResourceSuite#decommissionWorkers` - `ApiWorkerResourceSuite#isDecommissioning` Closes #2535 from leixm/issue_1444. Lead-authored-by: Xianming Lei <jerrylei@apache.org> Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2024-06-08 11:10:31 +08:00
SteNicholas	4fc42d7fef	[CELEBORN-1389] Bump Dropwizard version from 3.2.6 to 4.2.25 ### What changes were proposed in this pull request? Bump Dropwizard version from 3.2.6 to 4.2.25. Meanwhile, introduce `metrics_jvm_thread_peak_count_Value` and `metrics_jvm_thread_total_started_count_Value` in `celeborn-jvm-dashboard.json`. ### Why are the changes needed? Dropwizard metrics has released v4.2.25 including some bugfixes and improvements including: * [JVM] Fix maximum/total memory calculation: https://github.com/dropwizard/metrics/pull/3125 * [Thread] Add peak and total started thread count to `ThreadStatesGaugeSet`: https://github.com/dropwizard/metrics/pull/1601 Meanwhile, Ratis version has upgraded to 3.0.1 which has no compatibility problem with Dropwizard 4.2.25. Backport: - https://github.com/apache/spark/pull/26332 - https://github.com/apache/spark/pull/29426 - https://github.com/apache/spark/pull/37372 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #2540 from SteNicholas/CELEBORN-1389. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-06-04 19:26:20 +08:00
Shuang	308eed28c9	[CELEBORN-1427] Add Capacity metrics for Celeborn ### What changes were proposed in this pull request? As title ### Why are the changes needed? The Celeborn cluster does not currently provide metrics for 'TotalCapacity' and 'TotalFreeCapacity ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA Closes #2521 from RexXiong/CELEBORN-1427. Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-05-23 16:06:11 +08:00
SteNicholas	db163bd793	[CELEBORN-1317][FOLLOWUP] Improve parameters, description and document of REST API ### What changes were proposed in this pull request? Improve parameters, description and document of Celeborn REST API, including: 1. The POST request uses `FormParam` instead of `QueryParam`. 2. The parameter name uses lowercase instead of uppercase. 3. The description of `/exclude` aligns with document in `monitoring.md`. 4. The document of `REST API` adds the `Method` and `Parameters` to document GET/POST method and corresponding interface. ### Why are the changes needed? The parameters, description and document of REST API need to improve after http server refine. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA. Closes #2495 from SteNicholas/CELEBORN-1317. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-05-09 17:41:13 +08:00
Shuang	9a9abfe3bc	[CELEBORN-1245][FOLLOWUP] Fix SendWorkerEvent in HA mode ### What changes were proposed in this pull request? As title ### Why are the changes needed? Handle worker event use wrong request. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `RatisMasterStatusSystemSuiteJ#testHandleWorkerEvent` Closes #2493 from RexXiong/CELEBORN-1245-FOLLOW-UP. Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-05-07 15:16:47 +08:00
SteNicholas	3ac769e4fa	[CELEBORN-1236][FOLLOWUP] Gauge is_terminating, is_terminated and is_shutdown should represent a single numerical value ### What changes were proposed in this pull request? Gauge `is_terminating`, `is_terminated` and `is_shutdown` should represent a single numerical value instead of boolean value. ### Why are the changes needed? A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. The value type of `is_terminating`, `is_terminated` and `is_shutdown` should be numerical, otherwise `AbstractSource#addGauge` would warn the failed log as follows: ``` 2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_terminating failed, the value type class java.lang.Boolean is not a number 2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_terminated failed, the value type class java.lang.Boolean is not a number 2024-04-12 20:04:12,438 [WARN] [main] - org.apache.celeborn.common.metrics.source.ThreadPoolSource -Logging.scala(55) -Add gauge is_shutdown failed, the value type class java.lang.Boolean is not a number ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #2457 from SteNicholas/CELEBORN-1236. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-04-15 11:34:34 +08:00
CodingCat	c788c38025	[CELEBORN-1328] Introduce ActiveSlotsCount metric to monitor the number of active slots ### What changes were proposed in this pull request? Introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster. ### Why are the changes needed? It's recommended to introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? In our test cluster (we can see the value of activeSlots increases and then back to 0 after the application finished, and slotsAllocated is increasing all the way). ![image](https://github.com/apache/incubator-celeborn/assets/678008/c05aa763-11ad-4bbd-9ae0-dd6a9cb01ac5) Closes #2386 from CodingCat/slots_decrease. Lead-authored-by: CodingCat <zhunansjtu@gmail.com> Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com> Co-authored-by: Fei Wang <fwang12@ebay.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-04-08 11:08:05 +08:00
SteNicholas	0054930ce7	[CELEBORN-1323] Introduce ShutdownWorkerCount metric to record the count of workers in shutdown list ### What changes were proposed in this pull request? Introduce `ShutdownWorkerCount` metric to record the count of workers in shutdown list. <img width="1432" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/bc84b281-30ca-40a1-92e4-fb9cf10b5aeb"> ### Why are the changes needed? `/shutdownWorkers` lists all shutdown workers of the master at present. Therefore it's recommended to introduce ShutdownWorkerCount metric to record the count of workers in shutdown list. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/c44822917403401690edb15617ec9f08) Closes #2379 from SteNicholas/CELEBORN-1323. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-03-12 16:01:22 +08:00
SteNicholas	dee4afc580	[CELEBORN-1322] Rename LostWorkers metric to LostWorkerCount to align the naming style ### What changes were proposed in this pull request? Rename `LostWorkers` metric to `LostWorkerCount` to align the naming style of other worker count metrics. ### Why are the changes needed? The naming of `LostWorkers` metric is different from other metric of `MasterSource` like `WorkerCount`, `ExcludedWorkerCount` etc, which could be renamed to `LostWorkerCount` to align the naming style. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #2378 from SteNicholas/CELEBORN-1322. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-03-11 20:41:22 +08:00
liangyongyuan	4ddc91afda	[CELEBORN-1282] Optimize push data replica error message ### What changes were proposed in this pull request? Optimize the handling of exceptions during the push of replica data, now only throwing PUSH_DATA_CONNECTION_EXCEPTION_REPLICA in specific scenarios. ### Why are the changes needed? When handling exceptions related to pushing replica data in the worker, unmatched exceptions, such as 'file already closed,' are uniformly transformed into REPLICATE_DATA_CONNECTION_EXCEPTION_COUNT and returned to the client. The client then excludes the peer node based on this count, which may not be appropriate in certain scenarios. For instance, in the case of an exception like 'file already closed,' it typically occurs during multiple splits and commitFile operations. Excluding a large number of nodes under such circumstances is clearly not in line with expectations. ![image](https://github.com/apache/incubator-celeborn/assets/46274164/816d21ad-1f79-45f0-bbe7-e93e15389edd) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Through existing UTs. Closes #2323 from lyy-pineapple/CELEBORN-1282. Authored-by: liangyongyuan <liangyongyuan@xiaomi.com> Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>	2024-02-26 12:55:26 +08:00
SteNicholas	a1c9d01739	[CELEBORN-1056] Introduce Rest API of listing dynamic configuration ### What changes were proposed in this pull request? Introduce Rest API of listing dynamic configuration `/listDynamicConfigs` to list the dynamic configs. The result of `/listDynamicConfigs` is as follows: ``` =========================== Dynamic Configuration ============================ celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 100000 celeborn.worker.flusher.buffer.size 64k =========================== SYSTEM ============================ celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 200000 celeborn.worker.flusher.buffer.size 128k =========================== TENANT ============================ =========================== Tenant: tenantId1 ============================ celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 300000 celeborn.worker.flusher.buffer.size 256k =========================== TENANT_USER ============================ =========================== Tenant: tenantId1, Name: user1 ============================ celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 400000 celeborn.worker.flusher.buffer.size 512k ``` ### Why are the changes needed? Celeborn supports dynamic configuration with `ConfigService` at present. It's recommend to introduce Rest API of dynamic configuration management. ### Does this PR introduce _any_ user-facing change? - Introduce Rest API of listing dynamic configuration: `/listDynamicConfigs?level=[system\|tenant\|tenant_user]&tenant=tenantId1&name=user1`. ### How was this patch tested? - `HttpUtilsSuite#CELEBORN-1056: Introduce Rest API of listing dynamic configuration` Closes #2311 from SteNicholas/CELEBORN-1056. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-02-23 10:30:11 +08:00
SteNicholas	05fa11b3a0	[CELEBORN-1174] Introduce application dimension resource consumption metrics ### What changes were proposed in this pull request? Introduce application dimension resource consumption metrics for `ResourceConsumptionSource`. ### Why are the changes needed? `ResourceConsumption` namespace metrics are generated for each user and they are identified using a metric tag at present. It's recommended to introduce application dimension resource consumption metrics that expose application dimension resource consumption of Master and Worker. Monitoring resource consumption in the application dimension shows how much each application actually consumes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - `WorkerInfoSuite#WorkerInfo toString output` - `PbSerDeUtilsTest#fromAndToPbResourceConsumption` - `MasterStateMachineSuiteJ#testObjSerde` Closes #2161 from SteNicholas/CELEBORN-1174. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: SteNicholas <programgeek@163.com>	2024-02-01 15:24:29 +08:00
Shuang	e71d912d50	[CELEBORN-1245] Support Celeborn Master(Leader) to manage workers ### What changes were proposed in this pull request? 1. Support Celeborn Master(Leader) to manage workers by sending events with heartbeats 2. Add Worker Status to Worker so that we can know the status of the workers (such as during decommission...) 3. Add HTTP interface for master to handleWorkerEvent/getWorkerEvent ### Why are the changes needed? Currently, we only support managing the status of workers on the worker side. This PR lets the master manage the status of all workers. By sending events such as (Decommission/Graceful/Exit) with heartbeats, workers can asynchronously execute commands from the master. Meanwhile, we can't know the worker's status during decommission, so this PR adds a worker status to report the exact status of the worker. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA Closes #2255 from RexXiong/CELEBORN-1245. Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com> Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>	2024-02-01 09:44:59 +08:00
Angerszhuuuu	3ffed66c40	[CELEBORN-1236][METRICS] Celeborn add metrics about thread pool ### What changes were proposed in this pull request? Add metrics about worker's thread pool, help admin to observe the thread pool's work status. ThreadPool list as below: 1. celeborn-dispatcher 2. celeborn-netty-rpc-connection-executor 3. worker-disk-{mount_point}-cleaner 4. worker-device-checker 5. flusher-{mount_point} 6. worker-file-sorter-executor 7. worker-data-replicator 8. worker-files-committer 9. worker-expired-shuffle-cleaner ``` metrics_active_thread_count_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484 metrics_pending_task_count_Value{role="Worker",threadPool="celeborn-dispatcher"} 0 1706237338484 metrics_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484 metrics_core_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484 metrics_maximum_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484 metrics_largest_pool_size_Value{role="Worker",threadPool="celeborn-dispatcher"} 64 1706237338484 metrics_active_thread_count_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 0 1706237338484 metrics_pending_task_count_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 0 1706237338484 metrics_pool_size_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 0 1706237338484 metrics_core_pool_size_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 64 1706237338484 metrics_maximum_pool_size_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 64 1706237338484 metrics_largest_pool_size_Value{role="Worker",threadPool="celeborn-netty-rpc-connection-executor"} 1 1706237338484 metrics_active_thread_count_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 0 1706237338484 metrics_pending_task_count_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 0 1706237338484 
metrics_pool_size_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 0 1706237338484 metrics_core_pool_size_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 4 1706237338484 metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 4 1706237338484 metrics_largest_pool_size_Value{role="Worker",threadPool="worker-disk-/-cleaner"} 0 1706237338485 metrics_active_thread_count_Value{role="Worker",threadPool="worker-device-checker"} 0 1706237338485 metrics_pending_task_count_Value{role="Worker",threadPool="worker-device-checker"} 0 1706237338485 metrics_pool_size_Value{role="Worker",threadPool="worker-device-checker"} 2 1706237338485 metrics_core_pool_size_Value{role="Worker",threadPool="worker-device-checker"} 5 1706237338485 metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-device-checker"} 5 1706237338485 metrics_largest_pool_size_Value{role="Worker",threadPool="worker-device-checker"} 2 1706237338485 metrics_thread_count_Value{role="Worker",threadPool="LocalFlusher1441328175-/"} 2 1706237338485 metrics_thread_is_terminated_count_Value{role="Worker",threadPool="LocalFlusher1441328175-/"} 0 1706237338485 metrics_thread_is_shutdown_count_Value{role="Worker",threadPool="LocalFlusher1441328175-/"} 0 1706237338485 metrics_active_thread_count_Value{role="Worker",threadPool="worker-file-sorter-executor"} 0 1706237338485 metrics_pending_task_count_Value{role="Worker",threadPool="worker-file-sorter-executor"} 0 1706237338485 metrics_pool_size_Value{role="Worker",threadPool="worker-file-sorter-executor"} 0 1706237338485 metrics_core_pool_size_Value{role="Worker",threadPool="worker-file-sorter-executor"} 24 1706237338485 metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-file-sorter-executor"} 24 1706237338485 metrics_largest_pool_size_Value{role="Worker",threadPool="worker-file-sorter-executor"} 0 1706237338485 metrics_active_thread_count_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485 
metrics_pending_task_count_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485 metrics_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485 metrics_core_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 64 1706237338485 metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 64 1706237338485 metrics_largest_pool_size_Value{role="Worker",threadPool="worker-data-replicator"} 0 1706237338485 metrics_active_thread_count_Value{role="Worker",threadPool="worker-files-committer"} 0 1706237338485 metrics_pending_task_count_Value{role="Worker",threadPool="worker-files-committer"} 0 1706237338485 metrics_pool_size_Value{role="Worker",threadPool="worker-files-committer"} 0 1706237338485 metrics_core_pool_size_Value{role="Worker",threadPool="worker-files-committer"} 32 1706237338485 metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-files-committer"} 32 1706237338485 metrics_largest_pool_size_Value{role="Worker",threadPool="worker-files-committer"} 0 1706237338485 metrics_active_thread_count_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 0 1706237338485 metrics_pending_task_count_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 0 1706237338485 metrics_pool_size_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 2 1706237338485 metrics_core_pool_size_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 64 1706237338485 metrics_maximum_pool_size_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 64 1706237338485 metrics_largest_pool_size_Value{role="Worker",threadPool="worker-expired-shuffle-cleaner"} 2 1706237338485 ``` ### Why are the changes needed? Help observe server status ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? MT Closes #2239 from AngersZhuuuu/CLEBORN-1236. 
Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2024-01-26 18:14:05 +08:00
Angerszhuuuu	4709251bb4	[CELEBORN-1246] Introduce OpenStreamSuccessCount, FetchChunkSuccessCount and WriteDataSuccessCount metric to expose the count of opening stream, fetching chunk and writing data successfully ### What changes were proposed in this pull request? Introduce `OpenStreamSuccessCount`, `FetchChunkSuccessCount` and `WriteDataSuccessCount` metric to expose the count of opening stream, fetching chunk and writing data successfully in current worker. ### Why are the changes needed? The ratio of failed stream opens, chunk fetches and data writes is important for Celeborn performance and for balancing the health of the cluster, but the counts of successful stream opens, chunk fetches and data writes are currently missing. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #2252 from AngersZhuuuu/CELEBORN-1246. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-01-24 10:44:28 +08:00
xianminglei	b90fb1fdb2	[CELEBORN-1237][METRICS] Refactor metrics name ### What changes were proposed in this pull request? Refactor metrics name. ### Why are the changes needed? Easier to understand the meaning of metrics ### Does this PR introduce _any_ user-facing change? METRICS.md migration.md monitoring.md ### How was this patch tested? Existing UTs. Closes #2240 from leixm/metrics_name. Authored-by: xianminglei <xianming.lei@shopee.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2024-01-18 18:15:43 +08:00
SteNicholas	402d23d0ea	[CELEBORN-1223] Align master and worker metrics of document with MasterSource and WorkerSource ### What changes were proposed in this pull request? Align master and worker metrics of document with `MasterSource` and `WorkerSource` in `METRICS.md` and `monitoring.md`. ### Why are the changes needed? The documented master and worker metrics are inconsistent with `MasterSource` and `WorkerSource` at present. It is recommended to align master and worker metrics of document with `MasterSource` and `WorkerSource`: - PushDataHandshakeFailCount - RegionStartFailCount - RegionFinishFailCount - PrimaryPushDataHandshakeTime - ReplicaPushDataHandshakeTime - PrimaryRegionStartTime - ReplicaRegionStartTime - PrimaryRegionFinishTime - ReplicaRegionFinishTime - ActiveConnectionCount - BufferStreamReadBuffer - ReadBufferDispatcherRequestsLength - ReadBufferAllocatedCount - CreditStreamCount - ActiveMapPartitionCount - DeviceOSFreeBytes - DeviceOSTotalBytes - DeviceCelebornFreeBytes - DeviceCelebornTotalBytes - PotentialConsumeSpeed - UserProduceSpeed - WorkerConsumeSpeed ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. Closes #2226 from SteNicholas/CELEBORN-1223. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-01-16 16:19:39 +08:00
SteNicholas	4b5e23db37	[CELEBORN-1215] Introduce PausePushDataAndReplicateTime metric to record time for a worker to stop receiving pushData from clients and other workers ### What changes were proposed in this pull request? Introduce `PausePushDataAndReplicateTime` metric to record time for a worker to stop receiving pushData from clients and other workers. ### Why are the changes needed? `PausePushData` means the count for a worker to stop receiving pushData from clients because of back pressure. Meanwhile, `PausePushDataAndReplicate` means the count for a worker to stop receiving pushData from clients and other workers because of back pressure. However, `PausePushDataTime` records the time for a worker to stop receiving pushData from clients or other workers, a definition that is confusing for users. It's recommended to introduce a `PausePushDataAndReplicateTime` metric that means the time for a worker to stop receiving pushData from clients and other workers because of back pressure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s) - `MemoryManagerSuite#[CELEBORN-882] Test MemoryManager check memory thread logic` Closes #2221 from SteNicholas/CELEBORN-1215. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-01-10 19:55:04 +08:00
SteNicholas	0cd1291f6c	[CELEBORN-1214] Introduce WriteDataHardSplitCount metric to record HARD_SPLIT partitions of PushData and PushMergedData ### What changes were proposed in this pull request? Introduce `WriteDataHardSplitCount` metric to record `HARD_SPLIT` partitions of PushData and PushMergedData. ### Why are the changes needed? As `PushDataHandler#handlePushData` and `PushDataHandler#handlePushMergedData` log at the DEBUG level, the `WriteDataHardSplitCount` metric should be introduced to record HARD_SPLIT partitions of PushData and PushMergedData for `PushDataHandler`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s) Closes #2217 from SteNicholas/CELEBORN-1214. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-01-09 21:54:53 +08:00
SteNicholas	29e930488b	[CELEBORN-1100] Introduce ChunkStreamCount, OpenStreamFailCount metrics about opening stream of FetchHandler ### What changes were proposed in this pull request? Introduces `ChunkStreamCount`, `OpenStreamFailCount` metrics about opening stream of `FetchHandler`: - `WorkerSource` adds `ChunkStreamCount`, `OpenStreamFailCount` metrics. - Corrects the grafana dashboard of `celeborn-dashboard.json`. `celeborn-dashboard.json` has been verified via [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s). For example: 1. `"expr": "metrics_RunningApplicationCount_Value"` 2. Moves the panel position of `FetchChunkFailCount` to `FetchRelatives` instead of `PushRelatives`. 3. Updates the `gridPos` of some panels. ### Why are the changes needed? There are no metrics about opening stream of `FetchHandler` for Celeborn Worker. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s) Closes #2212 from SteNicholas/CELEBORN-1100. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-01-05 17:05:35 +08:00
SteNicholas	e7e39a51be	[CELEBORN-1189] Introduce RunningApplicationCount metric and /applications API to record running applications of worker ### What changes were proposed in this pull request? Introduce `RunningApplicationCount` metric and `/applications` API to record running applications for Celeborn worker. ### Why are the changes needed? The `RunningApplicationCount` metric only monitors the count of running applications in the cluster for master. Meanwhile, `/listTopDiskUsedApps` API lists the top disk usage application ids for master and worker. Therefore `RunningApplicationCount` metric and `/applications` API could be introduced to record running applications of worker. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Internal tests. Closes #2172 from SteNicholas/CELEBORN-1189. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-12-27 09:51:16 +08:00
SteNicholas	277f7ced57	[CELEBORN-1187] Unify the size and file count of active shuffle metrics for master and worker ### What changes were proposed in this pull request? Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`. ### Why are the changes needed? `MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Internal tests. Closes #2171 from SteNicholas/CELEBORN-1187. Lead-authored-by: SteNicholas <programgeek@163.com> Co-authored-by: 蒋晓峰 <jiangxiaofeng@bilibili.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-12-22 17:07:39 +08:00
SteNicholas	850d3199ef	[CELEBORN-1164] Introduce FetchChunkFailCount metric to expose the count of fetching chunk failed in current worker ### What changes were proposed in this pull request? Introduce `FetchChunkFailCount` metric to expose the count of fetching chunk failed in current worker. ### Why are the changes needed? Metrics for the count of failed PushData or PushMergedData in the current worker are supported at present. It's better to also support a `FetchChunkFailCount` metric to expose the count of failed chunk fetches in the current worker. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Internal test. Closes #2151 from SteNicholas/CELEBORN-1164. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-12-13 23:01:16 +08:00
SteNicholas	52eddc59f3	[CELEBORN-448] Support exclude worker manually ### What changes were proposed in this pull request? Support manually excluding a worker given its worker id. The worker is added to the manually excluded workers. ### Why are the changes needed? Celeborn currently supports excluding workers on the shuffle client side when fetch or push fails. It's also necessary to exclude workers manually to maintain the Celeborn cluster. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - `HttpUtilsSuite` - `DefaultMetaSystemSuiteJ#testHandleWorkerExclude` - `RatisMasterStatusSystemSuiteJ#testHandleWorkerExclude` - `MasterStateMachineSuiteJ#testObjSerde` Closes #1997 from SteNicholas/CELEBORN-448. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-11-07 16:25:24 +08:00
SteNicholas	b45b63f9a5	[CELEBORN-247][FOLLOWUP] Add metrics for each user's quota usage of Celeborn Worker ### What changes were proposed in this pull request? Add the metric `ResourceConsumption` to monitor each user's quota usage of Celeborn Worker. ### Why are the changes needed? The metric `ResourceConsumption` currently monitors each user's quota usage on the Celeborn Master. The usage on the Celeborn Worker also needs to be monitored. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Internal tests. Closes #2059 from SteNicholas/CELEBORN-247. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-11-01 15:48:31 +08:00
Fu Chen	349ee8b1cb	Revert "[CELEBORN-255] Add counter of outstandingFetches, outstanding… …Rpcs and outstandingPushes to metrics" This reverts commit `bfa341c32f`. ### What changes were proposed in this pull request? ### Why are the changes needed? https://github.com/apache/incubator-celeborn/pull/1992#issuecomment-1776760369 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #2032 from cfmcgrady/revert-pr-1992. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Fu Chen <cfmcgrady@gmail.com>	2023-10-24 17:18:54 +08:00

77 Commits