Commit Graph

98 Commits

Author SHA1 Message Date
SteNicholas
4b5e23db37
[CELEBORN-1215] Introduce PausePushDataAndReplicateTime metric to record time for a worker to stop receiving pushData from clients and other workers
### What changes were proposed in this pull request?

Introduce `PausePushDataAndReplicateTime` metric to record time for a worker to stop receiving pushData from clients and other workers.

### Why are the changes needed?

`PausePushData` counts how many times a worker stops receiving pushData from clients because of back pressure, while `PausePushDataAndReplicate` counts how many times a worker stops receiving pushData from both clients and other workers because of back pressure. However, `PausePushDataTime` records the time during which a worker stops receiving pushData from clients or other workers, a definition that is confusing for users. It's therefore recommended to introduce the `PausePushDataAndReplicateTime` metric, which records the time during which a worker stops receiving pushData from clients and other workers because of back pressure.
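
The distinction between the two pause states can be sketched as follows. This is a minimal illustration, not Celeborn's implementation: the `PauseTimer` class, its `switch` method and the injectable clock are hypothetical, and only the metric names come from the description above.

```python
import time


class PauseTimer:
    """Accumulates how long a worker spends in each back-pressure state.

    Hypothetical sketch: the state names mirror the metric names in this
    commit, but the class and its API are illustrative assumptions.
    """

    STATES = ("PausePushData", "PausePushDataAndReplicate")

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._state = None   # None means the worker is not paused
        self._since = None   # clock reading when the current state began
        self.totals = {f"{s}Time": 0.0 for s in self.STATES}

    def switch(self, new_state):
        """Enter new_state (or None to resume), charging the elapsed time
        to the state that is being left."""
        if new_state is not None and new_state not in self.STATES:
            raise ValueError(f"unknown state: {new_state}")
        now = self._clock()
        if self._state is not None:
            self.totals[f"{self._state}Time"] += now - self._since
        self._state, self._since = new_state, now


# A fake clock makes the example deterministic.
ticks = iter([0.0, 2.0, 5.0])
timer = PauseTimer(clock=lambda: next(ticks))
timer.switch("PausePushData")              # t = 0
timer.switch("PausePushDataAndReplicate")  # t = 2
timer.switch(None)                         # t = 5, worker resumes
print(timer.totals)
```

With separate totals, time spent rejecting only client pushes is no longer mixed with time spent rejecting both client pushes and replication traffic.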

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)
- `MemoryManagerSuite#[CELEBORN-882] Test MemoryManager check memory thread logic`

Closes #2221 from SteNicholas/CELEBORN-1215.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-10 19:55:04 +08:00
SteNicholas
0cd1291f6c
[CELEBORN-1214] Introduce WriteDataHardSplitCount metric to record HARD_SPLIT partitions of PushData and PushMergedData
### What changes were proposed in this pull request?

Introduce `WriteDataHardSplitCount` metric to record `HARD_SPLIT` partitions of PushData and PushMergedData.

### Why are the changes needed?

Because `PushDataHandler#handlePushData` and `PushDataHandler#handlePushMergedData` log at the DEBUG level, the `WriteDataHardSplitCount` metric should be introduced to record `HARD_SPLIT` partitions of PushData and PushMergedData for `PushDataHandler`.
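
The idea of counting `HARD_SPLIT` results instead of relying on DEBUG logs can be sketched as below. This is an illustration only: `record_write_result` and the `SUCCESS` status are hypothetical, while `HARD_SPLIT` and `WriteDataHardSplitCount` are the names used in this commit.

```python
from collections import Counter

HARD_SPLIT = "HARD_SPLIT"  # status name taken from the commit message


def record_write_result(metrics: Counter, status: str) -> None:
    """Increment WriteDataHardSplitCount whenever a push ends in HARD_SPLIT."""
    if status == HARD_SPLIT:
        metrics["WriteDataHardSplitCount"] += 1


metrics = Counter()
# "SUCCESS" is a placeholder for any non-split outcome (assumption).
for status in ["SUCCESS", "HARD_SPLIT", "SUCCESS", "HARD_SPLIT"]:
    record_write_result(metrics, status)
print(metrics["WriteDataHardSplitCount"])  # prints 2
```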

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)

Closes #2217 from SteNicholas/CELEBORN-1214.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-09 21:54:53 +08:00
SteNicholas
29e930488b
[CELEBORN-1100] Introduce ChunkStreamCount, OpenStreamFailCount metrics about opening stream of FetchHandler
### What changes were proposed in this pull request?

Introduces `ChunkStreamCount`, `OpenStreamFailCount` metrics about opening stream of `FetchHandler`:

- `WorkerSource` adds `ChunkStreamCount`, `OpenStreamFailCount` metrics.
- Corrects the grafana dashboard of `celeborn-dashboard.json`. `celeborn-dashboard.json` has been verified via [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s). For example:
  1. `"expr": "metrics_RunningApplicationCount_Value"`
  2. Moves the panel position of `FetchChunkFailCount` to `FetchRelatives` instead of `PushRelatives`.
  3. Updates the `gridPos` of some panels.

### Why are the changes needed?

There are no metrics about opening streams of `FetchHandler` for Celeborn Worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

[Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s)

Closes #2212 from SteNicholas/CELEBORN-1100.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-05 17:05:35 +08:00
SteNicholas
276ab979a4
[CELEBORN-1187][FOLLOWUP] Unify the size and file count of active shuffle metrics for master and worker
### What changes were proposed in this pull request?

Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.

Follow up #2171.

### Why are the changes needed?

`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2186 from SteNicholas/CELEBORN-1187.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-22 18:09:39 +08:00
SteNicholas
277f7ced57
[CELEBORN-1187] Unify the size and file count of active shuffle metrics for master and worker
### What changes were proposed in this pull request?

Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.

### Why are the changes needed?

`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2171 from SteNicholas/CELEBORN-1187.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: 蒋晓峰 <jiangxiaofeng@bilibili.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-22 17:07:39 +08:00
SteNicholas
850d3199ef [CELEBORN-1164] Introduce FetchChunkFailCount metric to expose the count of fetching chunk failed in current worker
### What changes were proposed in this pull request?

Introduce `FetchChunkFailCount` metric to expose the count of fetching chunk failed in current worker.

### Why are the changes needed?

Metrics for the count of failed PushData or PushMergedData requests in the current worker are already supported. It's better to also support the `FetchChunkFailCount` metric to expose the count of failed chunk fetches in the current worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal test.

Closes #2151 from SteNicholas/CELEBORN-1164.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 23:01:16 +08:00
onebox-li
af6fd8a0e6 [CELEBORN-1127] Add JVM classloader metrics
### What changes were proposed in this pull request?
Add JVM classloader metrics for loaded and unloaded count.
![image](https://github.com/apache/incubator-celeborn/assets/19429353/c00eceb3-54e5-4f85-8df1-fe9a6adf6ad4)

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Add two classloader-related panels.

### How was this patch tested?
Cluster test.

Closes #2099 from onebox-li/add-classloader-metrics.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-07 09:47:23 +08:00
onebox-li
ae3bbc50f4 [CELEBORN-1114][FOLLOWUP] Make SlotsAllocated metrics panel to follow previous behavior
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
To avoid users being confused after upgrading.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #2087 from onebox-li/slots_allocated_metric_panel.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-10 16:32:48 +08:00
Luke Yan
c7c2f6a35a [CELEBORN-858] Generate patch to each Spark 3.x minor version
### What changes were proposed in this pull request?

Add the following patch files in directory `incubator-celeborn/tree/spark3-patch/assets/spark-patch` :

1. Celeborn_Dynamic_Allocation_spark3_0.patch
2. Celeborn_Dynamic_Allocation_spark3_1.patch
3. Celeborn_Dynamic_Allocation_spark3_2.patch
4. Celeborn_Dynamic_Allocation_spark3_3.patch

Delete a patch at the same time:

1. Celeborn_Dynamic_Allocation_spark3.patch

Modified `Support Spark Dynamic Allocation` in incubator-celeborn/README.md :

![image](https://github.com/apache/incubator-celeborn/assets/108530647/61e2e69b-d3f5-4d11-a20b-374622936443)

### Why are the changes needed?

Makes it convenient for users to apply patches to Spark 3.x for `Support Spark Dynamic Allocation`.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Yes. All patch files can be applied to the corresponding version of the Spark source code through `git apply` without any code conflicts.
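
Choosing the right patch for a given Spark version could look like the sketch below. The helper `patch_for` is hypothetical and not part of the repository; the file names are the four listed in this commit.

```python
import re

# Patch file names as listed in this commit, one per Spark 3.x minor version.
PATCHES = {
    minor: f"Celeborn_Dynamic_Allocation_spark3_{minor}.patch"
    for minor in range(4)  # Spark 3.0 through 3.3
}


def patch_for(spark_version: str) -> str:
    """Return the patch file name matching a Spark 3.x version string."""
    match = re.fullmatch(r"3\.(\d+)(?:\.\d+)*", spark_version.strip())
    if match is None:
        raise ValueError(f"not a Spark 3.x version: {spark_version!r}")
    minor = int(match.group(1))
    if minor not in PATCHES:
        raise ValueError(f"no patch listed for Spark 3.{minor}")
    return PATCHES[minor]


print(patch_for("3.2.4"))
```

The returned file can then be dry-run with `git apply --check` from the root of the Spark source tree before it is actually applied.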

Closes #2085 from lukeyan2023/spark3-patch.

Authored-by: Luke Yan <108530647+lukeyan2023@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-10 15:35:54 +08:00
onebox-li
b7e4dc4339 [CELEBORN-1114] Remove allocationBuckets from WorkerInfo and refactor SLOTS_ALLOCATED metrics
### What changes were proposed in this pull request?
Currently, `WorkerInfo` is used in many places, but `allocationBuckets` is only used when a worker collects its own `SLOTS_ALLOCATED` metric. Because each `WorkerInfo` maintains an `Array[Int](61)`, an RSS cluster with many workers wastes a certain amount of memory, so remove it from `WorkerInfo`.
Also refactor the `SLOTS_ALLOCATED` metric from a gauge to a counter. Originally, this metric approximated the total for the last hour, and only if tasks ran continuously. Refactoring it into a counter reduces the cost of maintaining time windows, including storage and timely expiration of data. It can also be transformed more flexibly on the Prometheus side according to user needs.
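
Why a counter is more flexible than a fixed one-hour gauge can be illustrated with a small sketch: given periodic samples of a monotonically increasing counter, the growth over any window is a difference of two samples. The function and the sample data below are hypothetical, and the calculation is deliberately simpler than Prometheus's own `increase()` (no extrapolation, no counter-reset handling).

```python
from bisect import bisect_right


def increase(samples, end, window):
    """Growth of a monotonically increasing counter over (end - window, end].

    samples: list of (timestamp_seconds, counter_value), sorted by timestamp.
    Uses the last sample at or before each boundary.
    """
    times = [t for t, _ in samples]

    def value_at(ts):
        idx = bisect_right(times, ts)
        return samples[idx - 1][1] if idx else 0

    return value_at(end) - value_at(end - window)


# Hypothetical scrapes of a SlotsAllocated counter every 30 minutes.
scrapes = [(0, 0), (1800, 120), (3600, 200), (5400, 260), (7200, 500)]
print(increase(scrapes, end=7200, window=3600))  # slots allocated in the last hour
print(increase(scrapes, end=7200, window=1800))  # slots allocated in the last 30 minutes
```

The same stored series answers both questions, whereas the previous gauge fixed the window at one hour on the worker side.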

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Yes. The `metrics_SlotsAllocated_Count` metric changes from a gauge covering the last hour to an increasing counter.

### How was this patch tested?
Cluster test.

Closes #2078 from onebox-li/improve-SlotsAllocated.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-08 19:45:47 +08:00
fwang12
32a6a31f84 [CELEBORN-1088] Define baseLegend variable for JVM Metrics dashboard
### What changes were proposed in this pull request?
Define baseLegend variable for jvm grafana dashboard.

Also change `"legendFormat": "$baseLegend"` to `"legendFormat": "${baseLegend}"` in the Celeborn metrics dashboard JSON.
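
A rewrite of this kind can be scripted rather than done by hand. The sketch below is an assumption about how one might do it, not the script used for this commit: `brace_variables` is a hypothetical helper, and the sample dashboard is a tiny stand-in for the real JSON file.

```python
import json
import re


def brace_variables(node):
    """Recursively rewrite `$name` to `${name}` in every legendFormat string.

    Variables that are already written as `${name}` are left untouched.
    """
    if isinstance(node, dict):
        return {
            key: re.sub(r"\$(\w+)", r"${\1}", value)
            if key == "legendFormat" and isinstance(value, str)
            else brace_variables(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [brace_variables(item) for item in node]
    return node


# Minimal stand-in for a dashboard file (assumption, not the real dashboard).
dashboard = json.loads(
    '{"panels": [{"targets": [{"expr": "metrics_RunningApplicationCount_Value",'
    ' "legendFormat": "$baseLegend"}]}]}'
)
print(json.dumps(brace_variables(dashboard), indent=2))
```
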
### Why are the changes needed?
So that users can change the legend variable case by case.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local Test.

Closes #2038 from turboFei/jvm_legend.

Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-25 09:10:33 +08:00
fwang12
819df5f2c4 [CELEBORN-1086] Fix JVM metrics grafana expression issue
### What changes were proposed in this pull request?
Fix jvm metrics grafana expression issue.

### Why are the changes needed?
![image](https://github.com/apache/incubator-celeborn/assets/6757692/becedc53-da90-4cce-a494-497b1c55c7a4)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local Test.
<img width="867" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/a9720fc1-9699-47e8-847e-951947f57e01">

Closes #2036 from turboFei/fix_metrics.

Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-24 21:16:42 +08:00
Fu Chen
349ee8b1cb Revert "[CELEBORN-255] Add counter of outstandingFetches, outstandingRpcs and outstandingPushes to metrics"

This reverts commit bfa341c32f.

### What changes were proposed in this pull request?

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1992#issuecomment-1776760369

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2032 from cfmcgrady/revert-pr-1992.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-24 17:18:54 +08:00
fwang12
bd9cb2b1ce [CELEBORN-1077][METRICS] Support to apply base legend format for all grafana metrics
### What changes were proposed in this pull request?
Apply base legend format for all grafana metrics.

### Why are the changes needed?

Previously, the metrics dashboard was not easy to read.
<img width="836" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/4647834a-fa5b-42ca-8a98-3dad37c2cb13">

### Does this PR introduce _any_ user-facing change?
Yes. A variable is introduced.

### How was this patch tested?
Local Test.

Now, I can set the variable value to `{{pod}}_{{cluster}}` and get better insight.
<img width="853" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/a5cca8d9-37c3-4a18-9819-5a9861744cb9">

Closes #2028 from turboFei/legend_format.

Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-24 13:37:08 +08:00
SteNicholas
11c90d8e72
[CELEBORN-916] Add new metric about active shuffle file count in worker
### What changes were proposed in this pull request?

Adds new metric `ActiveShuffleFileCount` about active shuffle file count of Celeborn Worker.

### Why are the changes needed?

The `ActiveShuffleSize` metric currently reports the active shuffle size of the peer worker. Therefore, it's better to introduce `ActiveShuffleFileCount` to report the active shuffle file count of Celeborn Worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2009 from SteNicholas/CELEBORN-916.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-23 11:15:18 +08:00
SteNicholas
7276dd024c
[CELEBORN-1035] Expose RunningApplicationCount, PartitionWritten and PartitionFileCount metric by Celeborn master
### What changes were proposed in this pull request?

Meta manager records `appHeartbeatTime`, `partitionTotalWritten` and `partitionTotalFileCount`, which are useful to monitor the application heartbeat and shuffle partition. `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics are exposed by Celeborn master to monitor the application and shuffle partition.

### Why are the changes needed?

`Master` exposes `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

Internal tests.

Closes #1976 from SteNicholas/CELEBORN-1035.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-19 22:07:17 +08:00
SteNicholas
bfa341c32f [CELEBORN-255] Add counter of outstandingFetches, outstandingRpcs and outstandingPushes to metrics
### What changes were proposed in this pull request?

Add counter of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` to metrics of Celeborn Worker.

### Why are the changes needed?

The counter of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` could be added to metrics to monitor `outstandingFetches`, `outstandingRpcs` and `outstandingPushes`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`TransportResponseHandlerSuiteJ`

Closes #1992 from SteNicholas/CELEBORN-255.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 21:16:57 +08:00
onebox-li
2b79692585 [CELEBORN-688] Add JVM metrics grafana template
### What changes were proposed in this pull request?
Currently there is no JVM metrics Grafana template, nor is there one in Grafana Labs, so it is necessary to add one for easier use.
This follows the change in #1939.
This template uses two variables (`instance` and `pool`).
The layout is divided into 5 rows.
![image](https://github.com/apache/incubator-celeborn/assets/19429353/732cff90-463c-47b5-89b8-fa8dbbf33b1e)

The panels with g1 look like below:
![image](https://github.com/apache/incubator-celeborn/assets/19429353/919b7e9e-f86a-4341-a004-7f0394e1d8b2)

The JVM Memory Pools row uses repeated-panel mode, in which panels are automatically duplicated for each value of the `pool` variable.
![image](https://github.com/apache/incubator-celeborn/assets/19429353/3bdf7a3c-d4e0-42ea-bbe0-012da55a61d1)
![image](https://github.com/apache/incubator-celeborn/assets/19429353/8feaf9b7-156d-453e-8188-40a0399ea516)
![image](https://github.com/apache/incubator-celeborn/assets/19429353/cba4b61c-7d66-4893-9f07-6157c64869bd)
![image](https://github.com/apache/incubator-celeborn/assets/19429353/09b473ef-434c-4fd0-aa4b-084f7588a4f7)

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
Yes, this dashboard is based on changes in #1939

### How was this patch tested?
Cluster test

Closes #1964 from onebox-li/add-jvm-dashboard.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 11:54:49 +08:00
zwangsheng
03a39819b5 [CELEBORN-882][WORKER][METRICS] Add Pause Push Data Time Count Metrics & Dashboard Panel
### What changes were proposed in this pull request?
Add `PausePushDataTime` metrics.

### Why are the changes needed?
Count the pause time of each Celeborn worker.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster Test

Closes #1800 from zwangsheng/CELEBORN-882.

Lead-authored-by: zwangsheng <2213335496@qq.com>
Co-authored-by: zwangsheng <binjieyang@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-12 17:45:26 +08:00
mingji
442d59ab55 [CELEBORN-933] Add metrics about active shuffle data size
### What changes were proposed in this pull request?
Add metrics about active shuffle data size in every worker and update Grafana dashboard. The metric value will decrease when shuffle is expired.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.
<img width="733" alt="Screenshot 2023-08-30 17 00 11" src="https://github.com/apache/incubator-celeborn/assets/4150993/48e28c1c-2b49-45d7-b3ba-358674ff3f3d">

Closes #1867 from FMX/CELEBORN-933.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-30 18:04:57 +08:00
mingji
2b79c37381 [CELEBORN-852][FOLLOWUP] Add active connection count metrics to grafana dashboard
### What changes were proposed in this pull request?
Add active connections count metrics to grafana dashboard.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Yes, new metric chart in the grafana dashboard.

### How was this patch tested?
Cluster.

Closes #1783 from FMX/CELEBORN-852.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-02 21:24:57 +08:00
Angerszhuuuu
5c7ecb8302
[CELEBORN-754][IMPORTANT] Provide a new SparkShuffleManager to replace RssShuffleManager in the future
### What changes were proposed in this pull request?
Provide a new SparkShuffleManager to replace RssShuffleManager in the future

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1667 from AngersZhuuuu/CELEBORN-754.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-30 17:27:33 +08:00
Angerszhuuuu
6e35745736
[CELEBORN-753] Rename spark patch file name to make it more clear
### What changes were proposed in this pull request?
Rename spark patch file name to make it more clear

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1666 from AngersZhuuuu/CELEBORN-753.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-30 11:41:12 +08:00
mingji
742815f285
[CELEBORN-749] Update grafana dashboard to remove "RSS"
### What changes were proposed in this pull request?
Update Grafana dashboard and its setup demo to remove the old name "RSS"

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
No test needed.

Closes #1663 from FMX/CELEBORN-749.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 20:44:09 +08:00
Angerszhuuuu
bd7c2ea35a [CELEBORN-746][BUILD] Rename project files from rss-xx to celeborn-xx
### What changes were proposed in this pull request?
Rename project files from rss-xx to celeborn-xx

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1660 from AngersZhuuuu/CELEBORN-746.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-29 16:30:02 +08:00
Fu Chen
17c1e01874
[CELEBORN-726] Update data replication terminology from master/slave to primary/replica for configurations and metrics
### What changes were proposed in this pull request?

This PR is an integral part of #1639. It primarily updates configuration settings and metrics terminology, and keeps compatibility with older client versions by not introducing RPC-related changes.

### Why are the changes needed?

To distinguish it from the existing master/worker terminology, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests.

Closes #1650 from cfmcgrady/primary-replica-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 09:47:02 +08:00
Angerszhuuuu
3985a5cbd7 [CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment
### What changes were proposed in this pull request?
Unify all blacklist related code and comment

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 16:28:03 +08:00
onebox-li
0c869ac9a0
[CELEBORN-642] Improve metrics and update grafana
### What changes were proposed in this pull request?
Change in grafana

(ALL)
add:
JVMCPUTime
LastMinuteSystemLoad
AvailableProcessors
(For Master)
add:
LostWorkers
IsActiveMaster
PartitionSize
(For Worker)
add:
PushDataFailCount -> WriteDataFailCount
ReplicateDataFailCount
ReplicateDataWriteFailCount
ReplicateDataCreateConnectionFailCount
ReplicateDataConnectionExceptionCount
ReplicateDataTimeoutCount
SortedFileSize
PushDataHandshakeFailCount
RegionStartFailCount
RegionFinishFailCount
MasterPushDataHandshakeTime
SlavePushDataHandshakeTime
MasterRegionStartTime
SlaveRegionStartTime
MasterRegionFinishTime
SlaveRegionFinishTime
PotentialConsumeSpeed
UserProduceSpeed
WorkerConsumeSpeed
DeviceOSFreeBytes
DeviceCelebornFreeBytes
push usedHeapMemory/usedDirectMemory
fetch usedHeapMemory/usedDirectMemory
replicate usedHeapMemory/usedDirectMemory
remove:
dup ReserveSlotsTime

Change dashboard layout.

Fix support for multiple labels.

Modify some metrics docs.

### Why are the changes needed?
For better use of metrics.

### Does this PR introduce _any_ user-facing change?
The metrics below change name, and some values are extracted into labels.
DeviceOSFreeCapacity(B) -> DeviceOSFreeBytes
DeviceOSTotalCapacity(B) -> DeviceOSTotalBytes
DeviceCelebornFreeCapacity(B) -> DeviceCelebornFreeBytes
DeviceCelebornTotalCapacity(B) -> DeviceCelebornTotalBytes
push usedHeapMemory/usedDirectMemory
fetch usedHeapMemory/usedDirectMemory
replicate usedHeapMemory/usedDirectMemory
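
For dashboards or alerts that still refer to the old names, the four explicit renames above can be captured in a lookup table. This is a hypothetical migration helper, not something shipped with this commit; it covers only the renames spelled out in the list and leaves the memory metrics, whose new label layout is not detailed here, untouched.

```python
# Old name -> new name, exactly as listed in this commit.
RENAMED_METRICS = {
    "DeviceOSFreeCapacity(B)": "DeviceOSFreeBytes",
    "DeviceOSTotalCapacity(B)": "DeviceOSTotalBytes",
    "DeviceCelebornFreeCapacity(B)": "DeviceCelebornFreeBytes",
    "DeviceCelebornTotalCapacity(B)": "DeviceCelebornTotalBytes",
}


def migrate_metric_name(name: str) -> str:
    """Return the new metric name, or the input unchanged if it was not renamed."""
    return RENAMED_METRICS.get(name, name)


for old in ["DeviceOSFreeCapacity(B)", "SortedFileSize"]:
    print(old, "->", migrate_metric_name(old))
```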

### How was this patch tested?
Cluster test.

Closes #1557 from onebox-li/improve-metrics.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-08 18:10:06 +08:00
Ethan Feng
3bd232dda0
[CELEBORN-619][CORE][SHUFFLE][FOLLOWUP] Support enable DRA with Apache Celeborn
### What changes were proposed in this pull request?

Adapt Spark DRA patch for spark 3.4

### Why are the changes needed?

To support enabling DRA w/ Celeborn on Spark 3.4

### Does this PR introduce _any_ user-facing change?

Yes, this PR provides a DRA patch for Spark 3.4

### How was this patch tested?

Compiled with Spark 3.4

Closes #1546 from FMX/CELEBORN-619.

Lead-authored-by: Ethan Feng <ethanfeng@apache.org>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
2023-06-06 12:57:16 +08:00
Ethan Feng
5600728149
[CELEBORN-619][CORE][SHUFFLE] Support enable DRA with Apache Celeborn
### What changes were proposed in this pull request?

Adapt Spark DRA patch for spark 3.4

### Why are the changes needed?

To support enabling DRA w/ Celeborn on Spark 3.4

### Does this PR introduce _any_ user-facing change?

Yes, this PR provides a DRA patch for Spark 3.4

### How was this patch tested?

Compiled with Spark 3.4

Closes #1529 from FMX/CELEBORN-619.

Authored-by: Ethan Feng <ethanfeng@apache.org>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
2023-06-05 09:50:05 +08:00
ulysses
fa920ab0d5
Relax isRssEnabled condition (#1528)
Co-authored-by: youxiduo <youxiduo@corp.netease.com>
2023-05-31 15:26:05 +08:00
Ethan Feng
9cccfc9872
[CELEBORN-431][FLINK] Support dynamic buffer allocation in reading map partition. (#1407) 2023-04-13 10:37:47 +08:00
Keyong Zhou
f2fd8a5c15
[CELEBORN-373] Add sorted files into grafana dashboard (#1303) 2023-03-02 23:41:16 +08:00
Keyong Zhou
54cf2e18d8
[CELEBORN-252] Delete slides (#1186) 2023-01-31 16:35:23 +08:00
Keyong Zhou
dfa81c92df
[CELEBORN-224] Correct LICENSE and NOTICE. (#1164) (#1170) 2023-01-18 19:47:42 +08:00
Ethan Feng
01b7ea97c9
[CELEBORN-193] Reduce source package size. (#1140) 2023-01-03 19:28:03 +08:00
Keyong Zhou
a2d2379153
[DOC] Replace RSS with Celeborn in docs (#715) 2022-10-06 10:37:46 +08:00
Keyong Zhou
fe3b5988f2
[REFACTOR] Change package name to org.apache.celeborn (#710) 2022-10-02 18:10:29 +08:00
Binjie Yang
9f20aabb48
[IMPORVE] Fix grafana dashboard json metrics_OfferSlotsTime_Max & metrics_OfferSlotsTime_Mean target datasource (#655) 2022-09-22 17:45:38 +08:00
AngersZhuuuu
da7ac1721b
[ISSUE-565][REFACTOR] Unify RPC name HeartbeatXxxxx (#566) 2022-09-07 21:33:18 +08:00
Kerwin Zhang
46892c271c
[issue-517] Spark3 patch to support columnar shuffle (#528) 2022-09-05 11:34:53 +08:00
Keyong Zhou
41e8311d58
[ISSUE-436][REFACTOR] Refactor metrics (#437)
1. Fix metrics_RegisteredShuffleCount_Value inconsistent between master and worker
2. Delete OverloadWorkerCount
3. Change slotsUsed to SlotsAllocated in last hour
2022-08-23 18:26:47 +08:00
Ethan Feng
959c689285
[DOC] Add documentation about setting up prometheus cluster and node exporter (#393) 2022-08-19 21:47:49 +08:00
Ethan Feng
f3bcb7f6a8
[ISSUE-146]update slots distribution mechanism (#273) 2022-08-12 23:38:19 +08:00
Ethan Feng
86adc0d244
[Feature]Add metrics documentation and grafana dashboard. (#117) 2022-05-20 12:12:41 +08:00
Ethan Feng
e8e333a239
RSS support spark3 RDA. (#108)
* RSS support spark3 RDA.
2022-05-14 14:02:40 +08:00
Ethan Feng
2ea136fada
[Feature]Update spark2 patch (#89) 2022-04-08 21:46:30 +08:00
zky.zhoukeyong
ba5920acde Initial Commit for RSS 2021-12-28 20:57:35 +08:00