### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
To avoid users being confused after upgrading.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #2087 from onebox-li/slots_allocated_metric_panel.
Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Add the following patch files to the directory `incubator-celeborn/tree/spark3-patch/assets/spark-patch`:
1. Celeborn_Dynamic_Allocation_spark3_0.patch
2. Celeborn_Dynamic_Allocation_spark3_1.patch
3. Celeborn_Dynamic_Allocation_spark3_2.patch
4. Celeborn_Dynamic_Allocation_spark3_3.patch
Delete one patch at the same time:
1. Celeborn_Dynamic_Allocation_spark3.patch
Modify the `Support Spark Dynamic Allocation` section in incubator-celeborn/README.md.

### Why are the changes needed?
Makes it convenient for users to apply the `Support Spark Dynamic Allocation` patch that matches their Spark 3.x version.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Yes. Each patch file can be applied to the source code of the corresponding Spark version through `git apply` without any conflicts.
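As a rough illustration of the check described above, the sketch below picks the patch file for a given Spark version and builds a `git apply --check` dry-run command. The version-to-file mapping is inferred from the file names, and `patch_for`, `check_command`, and the patch directory argument are hypothetical helpers that are not part of this PR.

```python
# Patch file names as listed in this PR; the version keys are inferred
# from the file names (assumption).
PATCHES = {
    "3.0": "Celeborn_Dynamic_Allocation_spark3_0.patch",
    "3.1": "Celeborn_Dynamic_Allocation_spark3_1.patch",
    "3.2": "Celeborn_Dynamic_Allocation_spark3_2.patch",
    "3.3": "Celeborn_Dynamic_Allocation_spark3_3.patch",
}


def patch_for(spark_version: str) -> str:
    """Return the patch file name for a Spark version such as "3.2.4"."""
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in PATCHES:
        raise ValueError(f"no patch listed for Spark {spark_version}")
    return PATCHES[major_minor]


def check_command(spark_version: str, patch_dir: str) -> list:
    """Build (but do not run) a dry-run command for the Spark source tree."""
    return ["git", "apply", "--check", f"{patch_dir}/{patch_for(spark_version)}"]


# The directory below is a placeholder, not a real location.
print(check_command("3.2.4", "/path/to/spark-patch"))
```

Running the printed command inside a checkout of the matching Spark release reports conflicts without modifying the tree.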
Closes #2085 from lukeyan2023/spark3-patch.
Authored-by: Luke Yan <108530647+lukeyan2023@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Currently, `WorkerInfo` is used in many places, but `allocationBuckets` is only used when a worker collects its own `SLOTS_ALLOCATED` metric. Each `WorkerInfo` maintains an `Array[Int](61)`, so in an RSS cluster with many workers this wastes a noticeable amount of memory. This PR therefore removes `allocationBuckets` from `WorkerInfo`.
It also refactors the `SLOTS_ALLOCATED` metric from a gauge to a counter. Originally, this metric approximated the total for the last hour, and only when tasks ran continuously. As a counter, it no longer needs to maintain time windows, which saves the cost of storing samples and expiring them in time. It can also be transformed more flexibly on the Prometheus side according to user needs.
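To illustrate the trade-off, here is a minimal Python sketch, not the Celeborn implementation: `WindowedGauge`, `Counter`, and `increase` are hypothetical names, and the gauge is a simplified stand-in for the bucket array. It shows that a windowed gauge has to store and expire samples, while a counter keeps a single number and leaves windowing to the query side, similar in spirit to PromQL's `increase()`.

```python
from collections import deque


class WindowedGauge:
    """Sum over the last `window_s` seconds; must store and expire samples."""

    def __init__(self, window_s: int = 3600):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, slots) pairs kept in memory

    def record(self, now: float, slots: int) -> None:
        self.samples.append((now, slots))

    def value(self, now: float) -> int:
        # Drop samples that fell out of the window before summing.
        while self.samples and self.samples[0][0] <= now - self.window_s:
            self.samples.popleft()
        return sum(slots for _, slots in self.samples)


class Counter:
    """Monotonically increasing total; no per-sample state."""

    def __init__(self):
        self.count = 0

    def inc(self, slots: int) -> None:
        self.count += slots


def increase(earlier_count: int, later_count: int) -> int:
    """Growth between two scrapes of a counter, computed on the query side."""
    return later_count - earlier_count


gauge, counter = WindowedGauge(), Counter()
scrapes = {}
for t, slots in [(0, 10), (1800, 5), (4000, 7)]:
    gauge.record(t, slots)
    counter.inc(slots)
    scrapes[t] = counter.count  # counter value observed at time t

print(gauge.value(4000))                    # 12: the sample at t=0 has expired
print(counter.count)                        # 22: total since start
print(increase(scrapes[0], scrapes[4000]))  # 12: same window, derived later
```

The counter reproduces the windowed value from two scrapes, and the window length becomes a query-time choice instead of a fixed property of the worker.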
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
Yes. The `metrics_SlotsAllocated_Count` metric changes from a gauge covering the last hour to an increasing counter.
### How was this patch tested?
Cluster test.
Closes #2078 from onebox-li/improve-SlotsAllocated.
Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Define a `baseLegend` variable for the JVM Grafana dashboard.
Also change `"legendFormat": "$baseLegend"` to `"legendFormat": "${baseLegend}"` in the Celeborn metrics dashboard JSON.
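The `legendFormat` rewrite can be sketched as a small recursive pass over a dashboard JSON document. This is only an illustration under assumptions: `normalize_legend` is a hypothetical helper and the inline dashboard is a made-up minimal example, not the real Celeborn dashboard file.

```python
import json


def normalize_legend(node):
    """Recursively rewrite "$baseLegend" legend formats to "${baseLegend}"."""
    if isinstance(node, dict):
        return {
            key: "${baseLegend}"
            if key == "legendFormat" and value == "$baseLegend"
            else normalize_legend(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [normalize_legend(item) for item in node]
    return node  # strings, numbers, booleans, and null are left untouched


# Hypothetical minimal dashboard used only for this example.
dashboard = json.loads("""
{"panels": [{"title": "hypothetical panel",
             "targets": [{"expr": "up", "legendFormat": "$baseLegend"}]}]}
""")

fixed = normalize_legend(dashboard)
print(json.dumps(fixed["panels"][0]["targets"][0]))
```

Only exact `"$baseLegend"` values under a `legendFormat` key are changed, so other expressions and legends pass through as they are.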
### Why are the changes needed?
So that users can change the legend variable case by case.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Local Test.
Closes #2038 from turboFei/jvm_legend.
Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
…Rpcs and outstandingPushes to metrics"
This reverts commit bfa341c32f.
### What changes were proposed in this pull request?
### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/1992#issuecomment-1776760369
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #2032 from cfmcgrady/revert-pr-1992.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
Apply the base legend format to all Grafana metrics.
### Why are the changes needed?
Previously, the metrics dashboard was hard to read.
<img width="836" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/4647834a-fa5b-42ca-8a98-3dad37c2cb13">
### Does this PR introduce _any_ user-facing change?
Yes. A new variable is introduced.
### How was this patch tested?
Local Test.
Now I can set the variable value to `{{pod}}_{{cluster}}` and get better insight.
<img width="853" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/a5cca8d9-37c3-4a18-9819-5a9861744cb9">
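For readers unfamiliar with legend templates, the following sketch mimics how a `{{label}}` template such as `{{pod}}_{{cluster}}` expands into a series name. `render_legend` is a hypothetical helper, the label values are invented, and rendering unknown labels as empty strings is an assumption of this sketch rather than documented Grafana behavior.

```python
import re


def render_legend(template: str, labels: dict) -> str:
    """Replace each {{name}} in the template with the matching label value."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: labels.get(match.group(1), ""),  # unknown label -> ""
        template,
    )


# Invented label values for illustration only.
series_labels = {"pod": "worker-0", "cluster": "prod"}
print(render_legend("{{pod}}_{{cluster}}", series_labels))  # worker-0_prod
```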
Closes #2028 from turboFei/legend_format.
Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Add a new metric, `ActiveShuffleFileCount`, that reports the active shuffle file count of a Celeborn Worker.
### Why are the changes needed?
The `ActiveShuffleSize` metric currently reports the active shuffle size per worker. It is therefore useful to also introduce `ActiveShuffleFileCount` to report the active shuffle file count of a Celeborn Worker.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal tests.
Closes #2009 from SteNicholas/CELEBORN-916.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
The meta manager records `appHeartbeatTime`, `partitionTotalWritten` and `partitionTotalFileCount`, which are useful for monitoring application heartbeats and shuffle partitions. The Celeborn master exposes the `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics to monitor applications and shuffle partitions.
### Why are the changes needed?
`Master` exposes `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Internal tests.
Closes #1976 from SteNicholas/CELEBORN-1035.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Add counters for `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` to the metrics of the Celeborn Worker.
### Why are the changes needed?
Exposing the `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` counters of `TransportResponseHandler` as metrics makes it possible to monitor them.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`TransportResponseHandlerSuiteJ`
Closes #1992 from SteNicholas/CELEBORN-255.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Add the `PausePushDataTime` metric.
### Why are the changes needed?
To record the pause time of each Celeborn worker.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.
Closes #1800 from zwangsheng/CELEBORN-882.
Lead-authored-by: zwangsheng <2213335496@qq.com>
Co-authored-by: zwangsheng <binjieyang@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Add a metric for the active shuffle data size of every worker and update the Grafana dashboard. The metric value decreases when a shuffle expires.
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster.
<img width="733" alt="Screenshot 2023-08-30 17 00 11" src="https://github.com/apache/incubator-celeborn/assets/4150993/48e28c1c-2b49-45d7-b3ba-358674ff3f3d">
Closes #1867 from FMX/CELEBORN-933.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Add the active connection count metric to the Grafana dashboard.
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
Yes, a new metric chart in the Grafana dashboard.
### How was this patch tested?
Cluster.
Closes #1783 from FMX/CELEBORN-852.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Provide a new `SparkShuffleManager` that will replace `RssShuffleManager` in the future.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1667 from AngersZhuuuu/CELEBORN-754.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Rename the Spark patch file to make its name clearer.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1666 from AngersZhuuuu/CELEBORN-753.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Update the Grafana dashboard and its setup demo to remove the old name "RSS".
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No test needed.
Closes #1663 from FMX/CELEBORN-749.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Rename project files from rss-xx to celeborn-xx.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1660 from AngersZhuuuu/CELEBORN-746.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
This PR is part of #1639. It focuses on updating configuration settings and metrics terminology, and it stays compatible with older client versions by not introducing any RPC-related changes.
### Why are the changes needed?
To distinguish data replication roles from the existing master/worker roles, the data replication terminology is refactored to 'primary/replica', which improves clarity and inclusivity in the codebase.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #1650 from cfmcgrady/primary-replica-metrics.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Unify all blacklist-related code and comments.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Adapt the Spark DRA patch for Spark 3.4.
### Why are the changes needed?
To support enabling DRA with Celeborn on Spark 3.4.
### Does this PR introduce _any_ user-facing change?
Yes, this PR provides a DRA patch for Spark 3.4.
### How was this patch tested?
Compiled with Spark 3.4.
Closes #1546 from FMX/CELEBORN-619.
Lead-authored-by: Ethan Feng <ethanfeng@apache.org>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
### What changes were proposed in this pull request?
Adapt the Spark DRA patch for Spark 3.4.
### Why are the changes needed?
To support enabling DRA with Celeborn on Spark 3.4.
### Does this PR introduce _any_ user-facing change?
Yes, this PR provides a DRA patch for Spark 3.4.
### How was this patch tested?
Compiled with Spark 3.4.
Closes #1529 from FMX/CELEBORN-619.
Authored-by: Ethan Feng <ethanfeng@apache.org>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
1. Fix the inconsistency of `metrics_RegisteredShuffleCount_Value` between master and worker.
2. Delete `OverloadWorkerCount`.
3. Change `slotsUsed` to `SlotsAllocated` in the last hour.