Commit Graph

81 Commits

Author SHA1 Message Date
wangshengjie3
4bacd1f211 [CELEBORN-1856] Support stage-rerun when read partition by chunkOffsets when enable optimize skew partition read
### What changes were proposed in this pull request?
Support stage rerun when reading partitions by chunkOffsets while the optimized skew partition read is enabled.

### Why are the changes needed?
In [CELEBORN-1319](https://issues.apache.org/jira/browse/CELEBORN-1319), we implemented the skew partition read optimization based on chunk offsets, but it does not support shuffle retry for skew partitions, so we need to support stage rerun.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster test

Closes #3118 from wangshengjie123/support-stage-rerun.

Lead-authored-by: wangshengjie3 <wangshengjie3@xiaomi.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-03-24 22:03:15 +08:00
wangshengjie
d659e06d45 [CELEBORN-1319] Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files
### What changes were proposed in this pull request?
Add logic to avoid sorting shuffle files in Reduce mode when optimizing skew partitions.

### Why are the changes needed?
The current logic needs to sort shuffle files when reading skew partition shuffle files in Reduce mode, and we observed shuffle sorting timeouts and performance issues.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster test and UTs.

Closes #2373 from wangshengjie123/optimize-skew-partition.

Lead-authored-by: wangshengjie <wangshengjie3@xiaomi.com>
Co-authored-by: wangshengjie3 <wangshengjie3@xiaomi.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Shuang <lvshuang.tb@gmail.com>
Co-authored-by: wangshengjie3 <soldier.sj.wang@gmail.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-02-19 16:57:44 +08:00
zhengtao
ac0d335f40 [CELEBORN-1831] Add ratis commitIndex metrics
### What changes were proposed in this pull request?
Add two metrics (raft commitIndex of each master and maxCommitIndex - minCommitIndex value).

### Why are the changes needed?
To observe the metadata synchronization of the raft cluster.
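As a rough illustration of the second metric, the gap can be computed from the commit index each master reports. This is only a sketch: `CommitIndexGap` and its `gap` method are hypothetical names, not Celeborn or Ratis APIs, and returning 0 for an empty input is an assumption.

```java
import java.util.Collection;

// Hypothetical helper, not Celeborn code: computes maxCommitIndex - minCommitIndex
// over the commit indexes reported by the masters of a raft group.
final class CommitIndexGap {
  static long gap(Collection<Long> commitIndexes) {
    if (commitIndexes.isEmpty()) {
      return 0L; // assumption: no reporting master means no observable gap
    }
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long index : commitIndexes) {
      min = Math.min(min, index);
      max = Math.max(max, index);
    }
    return max - min;
  }
}
```

A gap that stays near zero suggests the masters are in sync; a growing gap suggests at least one follower is lagging.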

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Cluster test.
![image](https://github.com/user-attachments/assets/f354a3cd-e3b3-4af0-98c2-fc13330b2d81)

Closes #3063 from zaynt4606/clb1831.

Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-17 10:58:06 +08:00
Nan
ca60613f2f [CELEBORN-1817] add committed file size metrics
### What changes were proposed in this pull request?

This PR adds the file size metrics for workers.

### Why are the changes needed?

We add this metric because we observed that, likely due to the delayed processing of split messages, some jobs write 40-50g files even though the split threshold is 10g (we use soft split).

We want this metric to monitor the severity of the issue.

### Does this PR introduce _any_ user-facing change?

Yes, one more metric.

### How was this patch tested?

(ignore the dashboard title, it's a dummy one)

![image](https://github.com/user-attachments/assets/d88c15e6-d740-4def-94d5-03666bbb38ca)

Closes #3047 from CodingCat/committed_file_size.

Authored-by: Nan <nzhu@pinterest.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-01-07 10:17:45 +08:00
wuziyi
f886751e80 [CELEBORN-1812] Distinguish sorting-file from sort-tasks waiting to be submitted
### What changes were proposed in this pull request?

The current implementation uses `shuffleSortTaskDeque.size()` as the current sorting file count. This value is more appropriately described as the number of sort tasks waiting to be submitted to `fileSorterExecutors`. The actual number of files currently being sorted (performing disk I/O, etc.) should be obtained from `sortingShuffleFiles`.

### Why are the changes needed?

Add metrics to monitor the files currently being sorted, which are performing disk I/O operations.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

![image](https://github.com/user-attachments/assets/6ffed37e-ad12-4d8d-a4aa-2b2695a92168)

Closes #3040 from Z1Wu/fix/sorting_file_metrics.

Authored-by: wuziyi <wuziyi02@corp.netease.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-04 10:27:53 +08:00
Wang, Fei
03656b5b1c [CELEBORN-1634][FOLLOWUP] Add rpc metrics into grafana dashboard
### What changes were proposed in this pull request?

1. Rename the RPC metrics from `${name}_${metric}` to `Rpc${metric}{name=$name}` so that they are easy to add to the Grafana dashboard.
2. Use the MASTER/WORKER/CLIENT role for the RPC env.
3. Add the RPC metrics to the Grafana dashboard.
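The renaming in item 1 can be pictured with a small sketch. The class `RpcMetricNames` and the sample values used with it are hypothetical; only the two naming patterns come from the description above.

```java
// Hypothetical illustration of the two naming patterns; not the Celeborn implementation.
final class RpcMetricNames {
  // Old pattern: ${name}_${metric}, which bakes the endpoint name into the metric name.
  static String oldStyle(String name, String metric) {
    return name + "_" + metric;
  }

  // New pattern: Rpc${metric}{name=$name}, one metric name with the endpoint as a label.
  static String newStyle(String name, String metric) {
    return "Rpc" + metric + "{name=" + name + "}";
  }
}
```

With the new pattern, a single dashboard panel can presumably select one `Rpc...` metric and break it down by the `name` label instead of listing one metric per endpoint.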

### Why are the changes needed?

For monitoring

### Does this PR introduce _any_ user-facing change?
No, it has not been released

### How was this patch tested?
UT for metrics source `instance`.

<img width="1456" alt="image" src="https://github.com/user-attachments/assets/90284390-54ad-49ef-a868-fa537d2301b8">

<img width="1880" alt="image" src="https://github.com/user-attachments/assets/e8101e47-d649-4c66-9978-1efb4faa047f">

Closes #2990 from turboFei/rpc_metrics.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-24 11:13:49 +08:00
mingji
c40f69b941 [CELEBORN-1766] Add detail metrics about fetch chunk
### What changes were proposed in this pull request?
1. Add histogram
2. Collect critical metrics about fetch chunk

### Why are the changes needed?
1. To find out the IO pattern of fetch chunk
2. To have detailed metrics about fetch chunk time

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

<img width="940" alt="Screenshot 2024-12-09 15 42 50" src="https://github.com/user-attachments/assets/9f526103-c162-4607-a031-ba90f42ae83e">
<img width="962" alt="Screenshot 2024-12-09 15 42 56" src="https://github.com/user-attachments/assets/c17822da-0433-4701-b0cc-0887ac970353">

Closes #2983 from FMX/b1766.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-16 16:17:14 +08:00
Wang, Fei
81a0d5113c [CELEBORN-1660] Cache available workers and only count the available workers device free capacity
### What changes were proposed in this pull request?
1. Cache the available workers.
2. Only count the device free capacity of the available workers.
3. Place `metrics_AvailableWorkerCount_Value` in the overall part and `metrics_WorkerCount_Value` in the `Master` part of the dashboard.

### Why are the changes needed?
Cache the available workers to avoid frequently looping over all workers to compute them.
To have an accurate device capacity overview that does not include the excluded workers, decommissioning workers, etc.
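The capacity rule can be sketched as follows. This is a simplified illustration, not the master's code: the `Worker` class, its fields, and the definition of "available" as neither excluded nor decommissioning are assumptions made for the example, and the caching part is not shown.

```java
import java.util.List;

// Hypothetical model: only workers that are neither excluded nor decommissioning
// contribute to the reported free capacity.
final class AvailableCapacity {
  static final class Worker {
    final String host;
    final boolean excluded;
    final boolean decommissioning;
    final long freeBytes;

    Worker(String host, boolean excluded, boolean decommissioning, long freeBytes) {
      this.host = host;
      this.excluded = excluded;
      this.decommissioning = decommissioning;
      this.freeBytes = freeBytes;
    }
  }

  static boolean isAvailable(Worker w) {
    return !w.excluded && !w.decommissioning;
  }

  // Sums free bytes over available workers only.
  static long availableFreeBytes(List<Worker> workers) {
    return workers.stream()
        .filter(AvailableCapacity::isAvailable)
        .mapToLong(w -> w.freeBytes)
        .sum();
  }
}
```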

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
UT.

<img width="1705" alt="image" src="https://github.com/user-attachments/assets/bee17b4e-785d-4112-8410-dbb684270ec0">

Closes #2827 from turboFei/device_free.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-14 11:10:45 +08:00
Wang, Fei
def5254ec2 [CELEBORN-1706] Use bytes(IEC) unit instead of bytes(SI) for size related metrics in prometheus dashboard
### What changes were proposed in this pull request?
Use unit `bytes(IEC)` (`decbytes`, 1,024 bytes in a kibibyte) for the 18 metrics below (disk and memory related) instead of `bytes(SI)` (`bytes`, 1,000 bytes in a kilobyte).
- metrics_DeviceCelebornTotalBytes_Value
- metrics_DeviceCelebornFreeBytes_Value
- metrics_PartitionSize_Value
- metrics_ActiveShuffleSize_Value
- metrics_NettyMemory_Value
- metrics_DiskBuffer_Value
- metrics_push_usedHeapMemory_Value
- metrics_push_usedDirectMemory_Value
- metrics_fetch_usedHeapMemory_Value
- metrics_fetch_usedDirectMemory_Value
- metrics_replicate_usedHeapMemory_Value
- metrics_replicate_usedDirectMemory_Value
- metrics_BufferStreamReadBuffer_Value
- metrics_SortMemory_Value
- metrics_DeviceOSFreeBytes_Value
- metrics_DeviceCelebornFreeBytes_Value
- metrics_diskBytesWritten_Value
- metrics_hdfsBytesWritten_Value

Also apply this to 6 JVM metrics:
- metrics_jvm_memory_heap_init_Value
- metrics_jvm_memory_non_heap_init_Value
- metrics_jvm_memory_total_init_Value
- metrics_jvm_memory_pools_init_Value
- metrics_jvm_direct_capacity_Value
- metrics_jvm_mapped_capacity_Value

### Why are the changes needed?

Some size related metrics use `bytes(IEC)` and some use `bytes(SI)`.
<img width="1715" alt="image" src="https://github.com/user-attachments/assets/8dd1727b-4e16-487c-b2f9-f70116bc27d3">

<img width="1722" alt="image" src="https://github.com/user-attachments/assets/17ed933a-3f01-4a91-a170-aa7a042f4947">

The main difference between bytes in the International System of Units (SI) and the International Electrotechnical Commission (IEC) is the number of bytes in a kilobyte:
SI: 1,000 bytes in a kilobyte
IEC: 1,024 bytes in a kibibyte
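
To make the difference concrete, here is a small worked sketch that formats the same byte count with both conventions. `ByteUnits` is a hypothetical helper written for this description, not part of Celeborn or Grafana, and it assumes non-negative input.

```java
import java.util.Locale;

// Hypothetical formatter contrasting SI (base 1000) and IEC (base 1024) byte units.
final class ByteUnits {
  private static final String[] SI = {"B", "kB", "MB", "GB", "TB"};
  private static final String[] IEC = {"B", "KiB", "MiB", "GiB", "TiB"};

  // Repeatedly divides by the base until the value fits the current unit.
  static String format(long bytes, int base, String[] units) {
    double value = bytes;
    int i = 0;
    while (value >= base && i < units.length - 1) {
      value /= base;
      i++;
    }
    return String.format(Locale.ROOT, "%.2f %s", value, units[i]);
  }

  static String si(long bytes) {
    return format(bytes, 1000, SI);
  }

  static String iec(long bytes) {
    return format(bytes, 1024, IEC);
  }
}
```

For example, `ByteUnits.si(1_000_000)` yields `1.00 MB`, while `ByteUnits.iec(1_000_000)` yields `976.56 KiB`, so panels that mix the two units show different numbers for the same value.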

FYI: https://www.drupal.org/project/drupal/issues/1114538#:~:text=According%20to%20the%20SI%20standard,e.g.%20a%20stick%20of%20RAM.

4545cdc401/assets/grafana/celeborn-dashboard.json (L5636-L5699)

### Does this PR introduce _any_ user-facing change?

Yes, metrics unit changed.

### How was this patch tested?
Not needed, we already use `decbytes` in the dashboard json.

Closes #2896 from turboFei/unit_decbytes.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-11-14 10:40:57 +08:00
SteNicholas
169b6f6973 [CELEBORN-1685] ShuffleFallbackPolicy supports ShuffleFallbackCount metric
### What changes were proposed in this pull request?

1. `ShuffleFallbackPolicy` supports `ShuffleFallbackCount` metric to provide the shuffle fallback count of each fallback policy.
2. Introduce `ShuffleTotalCount` metric to record the total count of shuffle.
3. Fix Spark 2 not incrementing the shuffle count via `LifecycleManager`.
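
The bookkeeping behind items 1 and 2 can be sketched as a total counter plus one counter per fallback policy. `ShuffleCounters` and the policy names passed to it are hypothetical; this is not the `ShuffleFallbackPolicy` or `LifecycleManager` code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical counters: every shuffle increments the total, and a shuffle that
// falls back also increments the counter of the policy that triggered it.
final class ShuffleCounters {
  private final LongAdder total = new LongAdder();
  private final Map<String, LongAdder> fallbackByPolicy = new ConcurrentHashMap<>();

  // fallbackPolicy is null when the shuffle did not fall back.
  void recordShuffle(String fallbackPolicy) {
    total.increment();
    if (fallbackPolicy != null) {
      fallbackByPolicy.computeIfAbsent(fallbackPolicy, k -> new LongAdder()).increment();
    }
  }

  long totalCount() {
    return total.sum();
  }

  long fallbackCount(String policy) {
    LongAdder counter = fallbackByPolicy.get(policy);
    return counter == null ? 0L : counter.sum();
  }
}
```

Keeping the total alongside the per-policy counts makes it possible to derive a fallback ratio per policy on the dashboard side.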

### Why are the changes needed?

The implementations of `ShuffleFallbackPolicy` do not support the `ShuffleFallbackCount` metric at present. Meanwhile, production practice at Bilibili needs the `ShuffleFallbackCount` of each `ShuffleFallbackPolicy`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster test.

Closes #2891 from SteNicholas/CELEBORN-1685.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-11 10:37:25 +08:00
Wang, Fei
f1bda46de4 [CELEBORN-1680] Introduce ShuffleFallbackCount metrics
### What changes were proposed in this pull request?

As title, introduce metrics_ShuffleFallbackCount_Value.

### Why are the changes needed?
To provide insight into how many shuffles fall back to the Spark built-in shuffle service. This helps us deprecate the ESS progressively.

Currently, we plan to set `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` so that shuffles with too many partitions fall back, for example: 50k.

In the future, we plan to limit the maximum acceptable shuffle partition number so that bad jobs are rejected and do not impact the health of the Celeborn master.
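
A minimal sketch of the threshold check described above. `PartitionThresholdFallback` is a hypothetical name, and treating the comparison as strictly greater-than is an assumption; the actual boundary behaviour of `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` is not stated here.

```java
// Hypothetical check: fall back to the built-in shuffle when a shuffle has more
// partitions than the configured threshold (strict comparison is an assumption).
final class PartitionThresholdFallback {
  static boolean shouldFallback(int numPartitions, int numPartitionsThreshold) {
    return numPartitions > numPartitionsThreshold;
  }
}
```

Each time such a check returns true, the new fallback counter would be incremented, which is what makes the metric useful for tuning the threshold.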

### Does this PR introduce _any_ user-facing change?
Yes, new metrics.

### How was this patch tested?
UT.
<img width="1188" alt="image" src="https://github.com/user-attachments/assets/8193c12c-5dc9-4783-b64b-6a8449a1bea4">

Closes #2866 from turboFei/record_fallback.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-07 11:42:17 +08:00
Sanskar Modi
2c996133b9 [CELEBORN-1444][FOLLOWUP] Add IsDecommissioningWorker to celeborn dashboard
### What changes were proposed in this pull request?

Adding IsDecommissioningWorker metric to celeborn dashboard

### Why are the changes needed?

Metric was missing from dashboard

### Does this PR introduce _any_ user-facing change?

NA

### How was this patch tested?

Tested in local grafana setup

<img width="755" alt="Screenshot 2024-10-21 at 5 19 55 PM" src="https://github.com/user-attachments/assets/7c0a2517-32a8-4565-81d8-a056d3708ac8">

Closes #2836 from s0nskar/decommision_metric.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-30 09:55:43 +08:00
Wang, Fei
ffc4980847 [CELEBORN-1627][FOLLOWUP] Fix typo for metrics_SlotsAllocated_increas_1h
### What changes were proposed in this pull request?
Fix typo in prometheus expr.

### Why are the changes needed?

Fix typo.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
<img width="1220" alt="image" src="https://github.com/user-attachments/assets/0b8649b6-163a-4868-9eb4-31a25a225d0e">

Closes #2825 from turboFei/fix_typo.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-21 11:33:54 +08:00
SteNicholas
497bfdf5d7 [CELEBORN-1640] NettyMemoryMetrics supports numHeapArenas, numDirectArenas, tinyCacheSize, smallCacheSize, normalCacheSize, numThreadLocalCaches and chunkSize
### What changes were proposed in this pull request?

`NettyMemoryMetrics` supports `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`. Meanwhile, remove `server_` prefix from metric name of netty memory metric in `monitoring.md`.

### Why are the changes needed?

`PooledByteBufAllocatorMetric` provides the following API to support netty memory metrics:

```
public int numHeapArenas() {
  return this.allocator.numHeapArenas();
}

public int numDirectArenas() {
  return this.allocator.numDirectArenas();
}

public List<PoolArenaMetric> heapArenas() {
  return this.allocator.heapArenas();
}

public List<PoolArenaMetric> directArenas() {
  return this.allocator.directArenas();
}

public int numThreadLocalCaches() {
  return this.allocator.numThreadLocalCaches();
}

public int tinyCacheSize() {
  return this.allocator.tinyCacheSize();
}

public int smallCacheSize() {
  return this.allocator.smallCacheSize();
}

public int normalCacheSize() {
  return this.allocator.normalCacheSize();
}

public int chunkSize() {
  return this.allocator.chunkSize();
}

public long usedHeapMemory() {
  return this.allocator.usedHeapMemory();
}

public long usedDirectMemory() {
  return this.allocator.usedDirectMemory();
}
```

`NettyMemoryMetrics` currently supports only `usedHeapMemory` and `usedDirectMemory`; it could also support `numHeapArenas`, `numDirectArenas`, `tinyCacheSize`, `smallCacheSize`, `normalCacheSize`, `numThreadLocalCaches` and `chunkSize` from `PooledByteBufAllocatorMetric`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/a520ca36a33843a38bbde28387023f97)

Closes #2802 from SteNicholas/CELEBORN-1640.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-17 18:12:08 +08:00
Wang, Fei
c3d33daabc [CELEBORN-1627] Introduce instance variable for celeborn dashboard to filter metrics
### What changes were proposed in this pull request?

1. Add `instanceLabel` to the metrics source, preferring `FQDN:port` over `ip:port`; previously `ip:port` was used even with `celeborn.network.bind.preferIpAddress=false`.
2. Add the variable `instance` with `label_values(metrics_JVMCPUTime_Value, instance)`, same as in `celeborn-jvm-dashboard.json`.
3. Add the filter `instance=~"${instance}"` to every metric.
4. Add the missing `legendFormat` for the memory file storage metrics expressions.
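
The effect of items 2 and 3 on a panel expression can be sketched as plain string building. `InstanceFilter` is a hypothetical helper; only the variable query and the `instance=~"${instance}"` matcher come from the list above, and the sketch assumes the original expression is a bare metric name without an existing label selector.

```java
// Hypothetical helper showing how a bare metric expression gains the instance filter.
final class InstanceFilter {
  // Query backing the dashboard variable `instance`.
  static String variableQuery() {
    return "label_values(metrics_JVMCPUTime_Value, instance)";
  }

  // Appends the regex matcher that Grafana expands from the selected instances.
  static String withInstanceFilter(String metricName) {
    return metricName + "{instance=~\"${instance}\"}";
  }
}
```

For example, `InstanceFilter.withInstanceFilter("metrics_WorkerCount_Value")` produces `metrics_WorkerCount_Value{instance=~"${instance}"}`.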

### Why are the changes needed?

There can be many Celeborn instances in production, so it is better to add a filter by instance.

### Does this PR introduce _any_ user-facing change?
Yes, it introduces a new variable.

However, the default value of `instance` is `ALL`, so the behavior is the same as before.

### How was this patch tested?

Config: `celeborn.network.bind.preferIpAddress=false`
<img width="1141" alt="image" src="https://github.com/user-attachments/assets/c3161069-790a-4cb2-8654-6d52cf8e5fb0">
<img width="944" alt="image" src="https://github.com/user-attachments/assets/293b8bd4-252a-459c-aa86-5f4aa75eb594">

<img width="939" alt="image" src="https://github.com/user-attachments/assets/1e1b28af-dd71-4c5b-8285-57473a6c9650">

For JVM metrics, before it was ip:port, and now it is FQDN:port.
<img width="947" alt="image" src="https://github.com/user-attachments/assets/fe00762f-605d-4b5e-b0a4-c586bdc0ec1a">

Closes #2777 from turboFei/legend_base.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-09 14:47:03 +08:00
Sanskar Modi
961144fdbd [CELEBORN-1582] Publish metric for unreleased shuffle count when worker was decommissioned
### What changes were proposed in this pull request?

Add a worker metric that publishes the unreleased shuffle count when a worker is decommissioned.

<img width="885" alt="Screenshot 2024-09-16 at 11 12 33 AM" src="https://github.com/user-attachments/assets/c81f36c1-cbed-44fe-814b-88f3ff29875d">

### Why are the changes needed?

Currently Celeborn does not publish the count of unreleased shuffle keys that are lost when a worker is decommissioned. This can be useful for monitoring and for configuring the `forceExitTimeout`.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?
NA

Closes #2711 from s0nskar/unrelease_shuffle_metric.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-08 17:02:25 +08:00
Weijie Guo
d8809793f3 [CELEBORN-1490][CIP-6] Impl worker write process for Flink Hybrid Shuffle
### What changes were proposed in this pull request?

Impl worker write process for Flink Hybrid Shuffle.

### Why are the changes needed?

We already support the tiered producer writing data from Flink to the worker. In this PR, we enable the worker to write this kind of data to storage.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
no need.

Closes #2741 from reswqa/cip6-6-pr.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-09-25 10:27:55 +08:00
szt
59a39952dd [CELEBORN-1586] Add available workers Metrics
### What changes were proposed in this pull request?
Currently the master service has metrics for workers, excluded workers and other metadata, but none for available workers. This PR adds the missing metric.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Local test
![image](https://github.com/user-attachments/assets/240c176c-4eef-4e3c-b34d-802291714702)

Closes #2723 from zaynt4606/availableWorker.

Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-09-05 13:34:52 +08:00
Wang, Fei
3b0abdee5b [CELEBORN-1491][FOLLOWUP] Using baseLegend for metrics_FlushWorkingQueueSize_Value
### What changes were proposed in this pull request?

Followup for https://issues.apache.org/jira/browse/CELEBORN-1491, use baseLegend for the newly introduced metric.

### Why are the changes needed?

As title.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Before:
<img width="852" alt="image" src="https://github.com/user-attachments/assets/cf1cb852-9480-49ff-873c-62b535167fa3">

After:
<img width="346" alt="image" src="https://github.com/user-attachments/assets/cbd6ec82-4531-4056-b8ee-96bde813f899">

<img width="849" alt="image" src="https://github.com/user-attachments/assets/a787be53-4646-48d2-a24e-da9b714b7fca">

Closes #2712 from turboFei/grafana_dashboard.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-08-28 11:39:15 +08:00
Wang, Fei
0e05bc6cf9 [CELEBORN-1437][DOC] Merge METRICS.md into monitoring.md
### What changes were proposed in this pull request?

As title, merge these two similar user guides.

### Why are the changes needed?
To close CELEBORN-1437

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Preview https://github.com/turboFei/incubator-celeborn/blob/metrics_merge/docs/monitoring.md#setup-prometheus-dashboard

Closes #2623 from turboFei/metrics_merge.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-07-16 13:41:46 +08:00
mingji
cb6e2202ae [CELEBORN-1491] introduce flusher working queue size metric
### What changes were proposed in this pull request?
Add metrics about flusher working queue size.

### Why are the changes needed?
To show if there is an accumulation of flush tasks.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA.

Closes #2598 from FMX/b1491.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-05 09:55:02 +08:00
SteNicholas
c7b1b8d61e [CELEBORN-1459] Introduce CleanTaskQueueSize and CleanExpiredShuffleKeysTime to record situation of cleaning up expired shuffle keys
### What changes were proposed in this pull request?

Introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record situation of cleaning up expired shuffle keys.

### Why are the changes needed?

There is a backlog of task queue for cleaning up shuffle data of expired shuffle keys in the production environment. It's recommended to introduce `CleanTaskQueueSize` and `CleanExpiredShuffleKeysTime` to record the progress of cleaning up expired shuffle keys.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/4b5a0b79a35e4ddbb18ddccfe2ec06d7)

Closes #2557 from SteNicholas/CELEBORN-1459.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-18 16:31:57 +08:00
SteNicholas
f63ff34ba7 [CELEBORN-1462] Fix layout of DeviceCelebornTotalBytes, DeviceCelebornFreeBytes, RunningApplicationCount and DecommissionWorkerCount in celeborn-dashboard.json
### What changes were proposed in this pull request?

Fix layout of `DeviceCelebornTotalBytes`, `DeviceCelebornFreeBytes`, `RunningApplicationCount` and `DecommissionWorkerCount` in `celeborn-dashboard.json`.

### Why are the changes needed?

The panels for `DeviceCelebornTotalBytes`, `DeviceCelebornFreeBytes`, `RunningApplicationCount` and `DecommissionWorkerCount` in `celeborn-dashboard.json` are in the wrong position, as follows:

![celeborn-dashboard](https://github.com/apache/celeborn/assets/10048174/adf82c15-ce31-4755-8c81-ffde9ceef822)

We should move `DeviceCelebornTotalBytes`, `DeviceCelebornFreeBytes`, `RunningApplicationCount` and `DecommissionWorkerCount` to the correct position in the layout.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test: [Celeborn Grafana Dashboard](https://stenicholas.grafana.net/public-dashboards/822b08768a324dfe9fc526254bae5ae5).

Closes #2569 from SteNicholas/CELEBORN-1462.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-14 15:11:18 +08:00
Xianming Lei
999510b265 [CELEBORN-1444] Introduce worker decommission metrics and corresponding REST API
### What changes were proposed in this pull request?

Introduce worker decommission metrics and corresponding REST API.

### Why are the changes needed?

In a production environment, due to certain hardware or environmental reasons, our script will automatically decommission the node. At this time, we need to distinguish between graceful shutdown nodes and decommissioned nodes.

If we distinguish shutdown worker and decommission worker metrics, we can achieve better operation and maintenance.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

- `DefaultMetaSystemSuiteJ#testHandleReportWorkerDecommission`
- `RatisMasterStatusSystemSuiteJ#testHandleReportWorkerDecommission`
- `ApiMasterResourceSuite#decommissionWorkers`
- `ApiWorkerResourceSuite#isDecommissioning`

Closes #2535 from leixm/issue_1444.

Lead-authored-by: Xianming Lei <jerrylei@apache.org>
Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-06-08 11:10:31 +08:00
SteNicholas
4fc42d7fef [CELEBORN-1389] Bump Dropwizard version from 3.2.6 to 4.2.25
### What changes were proposed in this pull request?

Bump Dropwizard version from 3.2.6 to 4.2.25. Meanwhile, introduce `metrics_jvm_thread_peak_count_Value` and `metrics_jvm_thread_total_started_count_Value` in `celeborn-jvm-dashboard.json`.

### Why are the changes needed?

Dropwizard metrics has released v4.2.25 with some bugfixes and improvements, including:

* [JVM] Fix maximum/total memory calculation: https://github.com/dropwizard/metrics/pull/3125
* [Thread] Add peak and total started thread count to `ThreadStatesGaugeSet`: https://github.com/dropwizard/metrics/pull/1601

Meanwhile, the Ratis version has been upgraded to 3.0.1, which has no compatibility problem with Dropwizard 4.2.25.

Backport:

- https://github.com/apache/spark/pull/26332
- https://github.com/apache/spark/pull/29426
- https://github.com/apache/spark/pull/37372

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2540 from SteNicholas/CELEBORN-1389.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-04 19:26:20 +08:00
mingji
89d56c9bbc [CELEBORN-914] Support memory file storage
### What changes were proposed in this pull request?
To support memory file storage.

### Why are the changes needed?
To improve shuffle performance for small shuffle files.

Design doc: https://docs.google.com/document/d/1SM-oOM0JHEIoRHTYhE9PYH60_1D3NMxDR50LZIM7uW0/edit?usp=sharing

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA and manually test on a cluster.

Closes #2300 from FMX/B914.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-05-23 21:05:52 +08:00
Shuang
308eed28c9 [CELEBORN-1427] Add Capacity metrics for Celeborn
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
The Celeborn cluster does not currently provide metrics for 'TotalCapacity' and 'TotalFreeCapacity'.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

Closes #2521 from RexXiong/CELEBORN-1427.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-23 16:06:11 +08:00
CodingCat
c788c38025 [CELEBORN-1328] Introduce ActiveSlotsCount metric to monitor the number of active slots
### What changes were proposed in this pull request?

Introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster.

### Why are the changes needed?

It's recommended to introduce `ActiveSlots` metric to represent the disk resource demand currently in the cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

In our test cluster (we can see the value of activeSlots increases and then back to 0 after the application finished, and slotsAllocated is increasing all the way).

![image](https://github.com/apache/incubator-celeborn/assets/678008/c05aa763-11ad-4bbd-9ae0-dd6a9cb01ac5)

Closes #2386 from CodingCat/slots_decrease.

Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-08 11:08:05 +08:00
SteNicholas
0054930ce7 [CELEBORN-1323] Introduce ShutdownWorkerCount metric to record the count of workers in shutdown list
### What changes were proposed in this pull request?

Introduce `ShutdownWorkerCount` metric to record the count of workers in shutdown list.

<img width="1432" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/bc84b281-30ca-40a1-92e4-fb9cf10b5aeb">

### Why are the changes needed?

`/shutdownWorkers` lists all shutdown workers of the master at present. Therefore it's recommended to introduce ShutdownWorkerCount metric to record the count of workers in shutdown list.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/c44822917403401690edb15617ec9f08)

Closes #2379 from SteNicholas/CELEBORN-1323.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-12 16:01:22 +08:00
SteNicholas
dee4afc580 [CELEBORN-1322] Rename LostWorkers metric to LostWorkerCount to align the naming style
### What changes were proposed in this pull request?

Rename `LostWorkers` metric to `LostWorkerCount` to align the naming style of other worker count metrics.

### Why are the changes needed?

The naming of the `LostWorkers` metric differs from other metrics of `MasterSource` such as `WorkerCount` and `ExcludedWorkerCount`, so it could be renamed to `LostWorkerCount` to align the naming style.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2378 from SteNicholas/CELEBORN-1322.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-11 20:41:22 +08:00
SteNicholas
4e64ae3214 [CELEBORN-1282][FOLLOWUP] Introduce ReplicateDataFailNonCriticalCauseCount metric in Grafana dashboard
### What changes were proposed in this pull request?

Introduce `ReplicateDataFailNonCriticalCauseCount` metric in Grafana dashboard. Follow up #2323.

### Why are the changes needed?

The `ReplicateDataFailNonCriticalCauseCount` metric should be supported in the Grafana dashboard via `celeborn-dashboard.json`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/6e50cc2c7af34692babcc2809066e147)

Closes #2332 from SteNicholas/CELEBORN-1282.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-02-27 15:32:28 +08:00
SteNicholas
4723c738b3 [CELEBORN-1246][FOLLOWUP] Introduce OpenStreamSuccessCount, FetchChunkSuccessCount and WriteDataSuccessCount metric in Grafana dashboard
### What changes were proposed in this pull request?

Introduce `OpenStreamSuccessCount`, `FetchChunkSuccessCount` and `WriteDataSuccessCount` metric in Grafana dashboard.

### Why are the changes needed?

The `OpenStreamSuccessCount`, `FetchChunkSuccessCount` and `WriteDataSuccessCount` metrics should be supported in the Grafana dashboard via `celeborn-dashboard.json`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)

Closes #2269 from SteNicholas/CELEBORN-1246.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-29 19:36:44 +08:00
xianminglei
b90fb1fdb2 [CELEBORN-1237][METRICS] Refactor metrics name
### What changes were proposed in this pull request?
Refactor metrics name.

### Why are the changes needed?
Easier to understand the meaning of metrics

### Does this PR introduce _any_ user-facing change?
METRICS.md
migration.md
monitoring.md

### How was this patch tested?
Existing UTs.

Closes #2240 from leixm/metrics_name.

Authored-by: xianminglei <xianming.lei@shopee.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-18 18:15:43 +08:00
SteNicholas
4b5e23db37
[CELEBORN-1215] Introduce PausePushDataAndReplicateTime metric to record time for a worker to stop receiving pushData from clients and other workers
### What changes were proposed in this pull request?

Introduce `PausePushDataAndReplicateTime` metric to record time for a worker to stop receiving pushData from clients and other workers.

### Why are the changes needed?

`PausePushData` is the number of times a worker stops receiving pushData from clients because of back pressure. Meanwhile, `PausePushDataAndReplicate` is the number of times a worker stops receiving pushData from clients and other workers because of back pressure. Therefore, `PausePushDataTime` records the time a worker stops receiving pushData from clients or other workers, a definition that is confusing for users. It's recommended to introduce a `PausePushDataAndReplicateTime` metric that records the time a worker stops receiving pushData from clients and other workers because of back pressure.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)
- `MemoryManagerSuite#[CELEBORN-882] Test MemoryManager check memory thread logic`

Closes #2221 from SteNicholas/CELEBORN-1215.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-10 19:55:04 +08:00
SteNicholas
0cd1291f6c
[CELEBORN-1214] Introduce WriteDataHardSplitCount metric to record HARD_SPLIT partitions of PushData and PushMergedData
### What changes were proposed in this pull request?

Introduce `WriteDataHardSplitCount` metric to record `HARD_SPLIT` partitions of PushData and PushMergedData.

### Why are the changes needed?

As `PushDataHandler#handlePushData` and `PushDataHandler#handlePushMergedData` log at the DEBUG level, the `WriteDataHardSplitCount` metric should be introduced to record `HARD_SPLIT` partitions of PushData and PushMergedData for `PushDataHandler`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)

Closes #2217 from SteNicholas/CELEBORN-1214.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-09 21:54:53 +08:00
SteNicholas
29e930488b
[CELEBORN-1100] Introduce ChunkStreamCount, OpenStreamFailCount metrics about opening stream of FetchHandler
### What changes were proposed in this pull request?

Introduces `ChunkStreamCount`, `OpenStreamFailCount` metrics about opening stream of `FetchHandler`:

- `WorkerSource` adds `ChunkStreamCount`, `OpenStreamFailCount` metrics.
- Corrects the grafana dashboard of `celeborn-dashboard.json`. `celeborn-dashboard.json` has been verified via [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s). For example:
  1. `"expr": "metrics_RunningApplicationCount_Value"`
  2. Moves the panel position of `FetchChunkFailCount` to `FetchRelatives` instead of `PushRelatives`.
  3. Updates the `gridPos` of some panels.

### Why are the changes needed?

There are no metrics about opening streams in `FetchHandler` for Celeborn Worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)

Closes #2212 from SteNicholas/CELEBORN-1100.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-05 17:05:35 +08:00
SteNicholas
276ab979a4
[CELEBORN-1187][FOLLOWUP] Unify the size and file count of active shuffle metrics for master and worker
### What changes were proposed in this pull request?

Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.

Follow up #2171.

### Why are the changes needed?

`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2186 from SteNicholas/CELEBORN-1187.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-22 18:09:39 +08:00
SteNicholas
277f7ced57
[CELEBORN-1187] Unify the size and file count of active shuffle metrics for master and worker
### What changes were proposed in this pull request?

Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.

### Why are the changes needed?

`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2171 from SteNicholas/CELEBORN-1187.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: 蒋晓峰 <jiangxiaofeng@bilibili.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-22 17:07:39 +08:00
SteNicholas
850d3199ef [CELEBORN-1164] Introduce FetchChunkFailCount metric to expose the count of fetching chunk failed in current worker
### What changes were proposed in this pull request?

Introduce `FetchChunkFailCount` metric to expose the count of fetching chunk failed in current worker.

### Why are the changes needed?

Metrics for the count of failed PushData or PushMergedData requests in the current worker are already supported. It's better to also support a `FetchChunkFailCount` metric to expose the count of failed chunk fetches in the current worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal test.

Closes #2151 from SteNicholas/CELEBORN-1164.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 23:01:16 +08:00
onebox-li
af6fd8a0e6 [CELEBORN-1127] Add JVM classloader metrics
### What changes were proposed in this pull request?
Add JVM classloader metrics for loaded and unloaded count.
![image](https://github.com/apache/incubator-celeborn/assets/19429353/c00eceb3-54e5-4f85-8df1-fe9a6adf6ad4)

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Add two classloader-related panels.

### How was this patch tested?
Cluster test.

Closes #2099 from onebox-li/add-classloader-metrics.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-07 09:47:23 +08:00
onebox-li
ae3bbc50f4 [CELEBORN-1114][FOLLOWUP] Make SlotsAllocated metrics panel to follow previous behavior
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
To avoid users being confused after upgrading.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #2087 from onebox-li/slots_allocated_metric_panel.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-10 16:32:48 +08:00
Luke Yan
c7c2f6a35a [CELEBORN-858] Generate patch to each Spark 3.x minor version
### What changes were proposed in this pull request?

Add the following patch files in directory `incubator-celeborn/tree/spark3-patch/assets/spark-patch` :

1. Celeborn_Dynamic_Allocation_spark3_0.patch
2. Celeborn_Dynamic_Allocation_spark3_1.patch
3. Celeborn_Dynamic_Allocation_spark3_2.patch
4. Celeborn_Dynamic_Allocation_spark3_3.patch

Delete a patch at the same time:

1. Celeborn_Dynamic_Allocation_spark3.patch

Modified `Support Spark Dynamic Allocation` in incubator-celeborn/README.md :

![image](https://github.com/apache/incubator-celeborn/assets/108530647/61e2e69b-d3f5-4d11-a20b-374622936443)

### Why are the changes needed?

Convenient for customers to apply patches in Spark 3.X for `Support Spark Dynamic Allocation`

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Yes. All patch files can be applied to the corresponding version of the Spark source code through `git apply` without any code conflicts.

Closes #2085 from lukeyan2023/spark3-patch.

Authored-by: Luke Yan <108530647+lukeyan2023@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-10 15:35:54 +08:00
onebox-li
b7e4dc4339 [CELEBORN-1114] Remove allocationBuckets from WorkerInfo and refactor SLOTS_ALLOCATED metrics
### What changes were proposed in this pull request?
Currently, `WorkerInfo` is used in many places, while `allocationBuckets` is only used when a worker collects its own `SLOTS_ALLOCATED` metric. If there are lots of workers in the RSS cluster, this can waste a certain amount of memory, because each `WorkerInfo` maintains an Array\[Int](61), so remove it from `WorkerInfo`.
Also refactor the `SLOTS_ALLOCATED` metric from a gauge to a counter. Originally, this metric approximated one hour's total only if tasks ran continuously. Refactoring it into a counter reduces the cost of maintaining time windows, including storage and timely expiration of data. It can also be transformed more flexibly according to user needs on the Prometheus side.
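
As a rough illustration of why a counter is easier to transform downstream, the sketch below derives a one-hour total from raw counter samples. It is a simplified stand-in, not Prometheus's actual `increase()` algorithm (which also extrapolates to the window edges); the function name and the sample values are hypothetical.

```python
from typing import List, Tuple

# (unix_timestamp_seconds, counter_value); hypothetical scrape samples.
Sample = Tuple[float, float]


def increase_over_window(samples: List[Sample], now: float, window_seconds: float) -> float:
    """Return how much a monotonically increasing counter grew inside the window.

    A drop in value is treated as a counter reset (for example a process
    restart), so the counter is assumed to have started again from zero.
    """
    in_window = [s for s in samples if now - window_seconds <= s[0] <= now]
    total = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # On a reset, everything counted since the restart is new growth.
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical samples taken every 30 minutes, with a restart before t=5400.
samples = [(0, 10), (1800, 25), (3600, 40), (5400, 5), (7200, 20)]
last_hour = increase_over_window(samples, now=7200, window_seconds=3600)
first_hour = increase_over_window(samples, now=3600, window_seconds=3600)
```

The same samples can be re-aggregated over any window length after the fact, which a gauge holding a fixed one-hour total does not allow.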

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Yes. The `metrics_SlotsAllocated_Count` metric changes from a gauge covering 1 hour to an increasing counter.

### How was this patch tested?
Cluster test.

Closes #2078 from onebox-li/improve-SlotsAllocated.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-08 19:45:47 +08:00
fwang12
32a6a31f84 [CELEBORN-1088] Define baseLegend variable for JVM Metrics dashboard
### What changes were proposed in this pull request?
Define baseLegend variable for jvm grafana dashboard.

BTW, refactor the `"legendFormat": "$baseLegend"` to `"legendFormat": "${baseLegend}"` in celeborn metrics dashboard json.
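
The `legendFormat` rewrite described above could be scripted roughly as follows. This is a minimal sketch, not the change made in the commit: the helper name and the sample dashboard fragment are hypothetical, and it assumes the dashboard has already been loaded into Python dicts and lists.

```python
import json
import re

# Matches a bare "$name" reference; "${name}" is left alone because
# "{" is not a word character.
BARE_VAR = re.compile(r"\$(\w+)")


def brace_legend_variables(node):
    """Recursively rewrite "$var" to "${var}" in every "legendFormat" value."""
    if isinstance(node, dict):
        return {
            key: BARE_VAR.sub(r"${\1}", value)
            if key == "legendFormat" and isinstance(value, str)
            else brace_legend_variables(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [brace_legend_variables(item) for item in node]
    return node


# Hypothetical dashboard fragment; a real dashboard JSON has many more fields.
dashboard = json.loads(
    '{"panels": [{"targets": [{"expr": "metrics_RunningApplicationCount_Value",'
    ' "legendFormat": "$baseLegend"}]}]}'
)
rewritten = brace_legend_variables(dashboard)
print(json.dumps(rewritten, indent=2))
```

Running the helper a second time leaves the result unchanged, so it is safe to apply to a dashboard that already uses the braced form.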
### Why are the changes needed?
So that customers can change the legend variable case by case.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local Test.

Closes #2038 from turboFei/jvm_legend.

Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-25 09:10:33 +08:00
fwang12
819df5f2c4 [CELEBORN-1086] Fix JVM metrics grafana expression issue
### What changes were proposed in this pull request?
Fix jvm metrics grafana expression issue.

### Why are the changes needed?
![image](https://github.com/apache/incubator-celeborn/assets/6757692/becedc53-da90-4cce-a494-497b1c55c7a4)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local Test.
<img width="867" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/a9720fc1-9699-47e8-847e-951947f57e01">

Closes #2036 from turboFei/fix_metrics.

Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-24 21:16:42 +08:00
Fu Chen
349ee8b1cb Revert "[CELEBORN-255] Add counter of outstandingFetches, outstandingRpcs and outstandingPushes to metrics"

This reverts commit bfa341c32f.

### What changes were proposed in this pull request?

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1992#issuecomment-1776760369

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2032 from cfmcgrady/revert-pr-1992.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-24 17:18:54 +08:00
fwang12
bd9cb2b1ce [CELEBORN-1077][METRICS] Support to apply base legend format for all grafana metrics
### What changes were proposed in this pull request?
Apply base legend format for all grafana metrics.

### Why are the changes needed?

Before, the metrics dashboard was not easily readable.
<img width="836" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/4647834a-fa5b-42ca-8a98-3dad37c2cb13">

### Does this PR introduce _any_ user-facing change?
Yes. A variable is introduced.

### How was this patch tested?
Local Test.

Now, I can modify the variable value to `{{pod}}_{{cluster}}` and have a better insight.
<img width="853" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/a5cca8d9-37c3-4a18-9819-5a9861744cb9">
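
To make the effect of a value such as `{{pod}}_{{cluster}}` concrete, the sketch below mimics how a legend template is filled in from a series' labels. It only approximates Grafana's behaviour for illustration; the function name and the label values are hypothetical.

```python
import re

# Matches "{{label}}" placeholders, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_legend(template: str, labels: dict) -> str:
    """Replace each {{label}} with its value; unknown labels become empty strings."""
    return PLACEHOLDER.sub(lambda m: labels.get(m.group(1), ""), template)


# Hypothetical labels attached to one time series.
labels = {"pod": "celeborn-worker-0", "cluster": "prod"}
legend = render_legend("{{pod}}_{{cluster}}", labels)
print(legend)
```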

Closes #2028 from turboFei/legend_format.

Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-24 13:37:08 +08:00
SteNicholas
11c90d8e72
[CELEBORN-916] Add new metric about active shuffle file count in worker
### What changes were proposed in this pull request?

Adds new metric `ActiveShuffleFileCount` about active shuffle file count of Celeborn Worker.

### Why are the changes needed?

The `ActiveShuffleSize` metric currently reports the active shuffle size of the peer worker. Therefore, it's better to introduce `ActiveShuffleFileCount` to report the active shuffle file count of Celeborn Worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2009 from SteNicholas/CELEBORN-916.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-23 11:15:18 +08:00
SteNicholas
7276dd024c
[CELEBORN-1035] Expose RunningApplicationCount, PartitionWritten and PartitionFileCount metric by Celeborn master
### What changes were proposed in this pull request?

Meta manager records `appHeartbeatTime`, `partitionTotalWritten` and `partitionTotalFileCount`, which are useful to monitor the application heartbeat and shuffle partition. `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics are exposed by Celeborn master to monitor the application and shuffle partition.

### Why are the changes needed?

`Master` exposes `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

Internal tests.

Closes #1976 from SteNicholas/CELEBORN-1035.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-19 22:07:17 +08:00
SteNicholas
bfa341c32f [CELEBORN-255] Add counter of outstandingFetches, outstandingRpcs and outstandingPushes to metrics
### What changes were proposed in this pull request?

Add counter of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` to metrics of Celeborn Worker.

### Why are the changes needed?

The counter of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` could be added to metrics to monitor `outstandingFetches`, `outstandingRpcs` and `outstandingPushes`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`TransportResponseHandlerSuiteJ`

Closes #1992 from SteNicholas/CELEBORN-255.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 21:16:57 +08:00