celeborn

Author	SHA1	Message	Date
SteNicholas	4b5e23db37	[CELEBORN-1215] Introduce PausePushDataAndReplicateTime metric to record time for a worker to stop receiving pushData from clients and other workers ### What changes were proposed in this pull request? Introduce `PausePushDataAndReplicateTime` metric to record time for a worker to stop receiving pushData from clients and other workers. ### Why are the changes needed? `PausePushData` means the count for a worker to stop receiving pushData from clients because of back pressure. Meanwhile, `PausePushDataAndReplicate` means the count for a worker to stop receiving pushData from clients and other workers because of back pressure. Therefore, `PausePushDataTime` records the time for a worker to stop receiving pushData from clients or other workers, a definition that is confusing for users. It's recommended to introduce a `PausePushDataAndReplicateTime` metric that means the time for a worker to stop receiving pushData from clients and other workers because of back pressure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s) - `MemoryManagerSuite#[CELEBORN-882] Test MemoryManager check memory thread logic` Closes #2221 from SteNicholas/CELEBORN-1215. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-01-10 19:55:04 +08:00
SteNicholas	0cd1291f6c	[CELEBORN-1214] Introduce WriteDataHardSplitCount metric to record HARD_SPLIT partitions of PushData and PushMergedData ### What changes were proposed in this pull request? Introduce `WriteDataHardSplitCount` metric to record `HARD_SPLIT` partitions of PushData and PushMergedData. ### Why are the changes needed? As `PushDataHandler#handlePushData` and `PushDataHandler#handlePushMergedData` log at the DEBUG level, the `WriteDataHardSplitCount` metric should be introduced to record HARD_SPLIT partitions of PushData and PushMergedData for `PushDataHandler`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s) Closes #2217 from SteNicholas/CELEBORN-1214. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-01-09 21:54:53 +08:00
SteNicholas	29e930488b	[CELEBORN-1100] Introduce ChunkStreamCount, OpenStreamFailCount metrics about opening stream of FetchHandler ### What changes were proposed in this pull request? Introduces `ChunkStreamCount`, `OpenStreamFailCount` metrics about opening stream of `FetchHandler`: - `WorkerSource` adds `ChunkStreamCount`, `OpenStreamFailCount` metrics. - Corrects the grafana dashboard of `celeborn-dashboard.json`. `celeborn-dashboard.json` has been verified via [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s). For example: 1. `"expr": "metrics_RunningApplicationCount_Value"` 2. Moves the panel position of `FetchChunkFailCount` to `FetchRelatives` instead of `PushRelatives`. 3. Updates the `gridPos` of some panels. ### Why are the changes needed? There are no metrics about opening stream of `FetchHandler` for Celeborn Worker. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? [Celeborn Dashboard](https://stenicholas.grafana.net/d/U_qgru_7z/celeborn?orgId=1&refresh=5s) Closes #2212 from SteNicholas/CELEBORN-1100. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2024-01-05 17:05:35 +08:00
SteNicholas	276ab979a4	[CELEBORN-1187][FOLLOWUP] Unify the size and file count of active shuffle metrics for master and worker ### What changes were proposed in this pull request? Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`. Follow up #2171. ### Why are the changes needed? `MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Internal tests. Closes #2186 from SteNicholas/CELEBORN-1187. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-12-22 18:09:39 +08:00
SteNicholas	277f7ced57	[CELEBORN-1187] Unify the size and file count of active shuffle metrics for master and worker ### What changes were proposed in this pull request? Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`. ### Why are the changes needed? `MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Internal tests. Closes #2171 from SteNicholas/CELEBORN-1187. Lead-authored-by: SteNicholas <programgeek@163.com> Co-authored-by: 蒋晓峰 <jiangxiaofeng@bilibili.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-12-22 17:07:39 +08:00
SteNicholas	850d3199ef	[CELEBORN-1164] Introduce FetchChunkFailCount metric to expose the count of fetching chunk failed in current worker ### What changes were proposed in this pull request? Introduce `FetchChunkFailCount` metric to expose the count of fetching chunk failed in current worker. ### Why are the changes needed? The metrics about the count of PushData or PushMergedData failed in current worker is supported at present. It's better to support `FetchChunkFailCount` metric to expose the count of fetching chunk failed in current worker. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Internal test. Closes #2151 from SteNicholas/CELEBORN-1164. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-12-13 23:01:16 +08:00
onebox-li	af6fd8a0e6	[CELEBORN-1127] Add JVM classloader metrics ### What changes were proposed in this pull request? Add JVM classloader metrics for loaded and unloaded count. ![image](https://github.com/apache/incubator-celeborn/assets/19429353/c00eceb3-54e5-4f85-8df1-fe9a6adf6ad4) ### Why are the changes needed? Ditto. ### Does this PR introduce _any_ user-facing change? Add two classloader-related panels. ### How was this patch tested? Cluster test. Closes #2099 from onebox-li/add-classloader-metrics. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-12-07 09:47:23 +08:00
onebox-li	ae3bbc50f4	[CELEBORN-1114][FOLLOWUP] Make SlotsAllocated metrics panel to follow previous behavior ### What changes were proposed in this pull request? As title. ### Why are the changes needed? To avoid users being confused after upgrading. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #2087 from onebox-li/slots_allocated_metric_panel. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-11-10 16:32:48 +08:00
Luke Yan	c7c2f6a35a	[CELEBORN-858] Generate patch to each Spark 3.x minor version ### What changes were proposed in this pull request? Add the following patch files in directory `incubator-celeborn/tree/spark3-patch/assets/spark-patch` : 1. Celeborn_Dynamic_Allocation_spark3_0.patch 2. Celeborn_Dynamic_Allocation_spark3_1.patch 3. Celeborn_Dynamic_Allocation_spark3_2.patch 4. Celeborn_Dynamic_Allocation_spark3_3.patch Delete a patch at the same time： 1. Celeborn_Dynamic_Allocation_spark3.patch Modified `Support Spark Dynamic Allocation` in incubator-celeborn/README.md ： ![image](https://github.com/apache/incubator-celeborn/assets/108530647/61e2e69b-d3f5-4d11-a20b-374622936443) ### Why are the changes needed? Convenient for customers to apply patches in Spark 3.X for `Support Spark Dynamic Allocation` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? yes. All patch files can be applied to the corresponding version of spark source code through `git apply` without any code conflicts. Closes #2085 from lukeyan2023/spark3-patch. Authored-by: Luke Yan <108530647+lukeyan2023@users.noreply.github.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-11-10 15:35:54 +08:00
onebox-li	b7e4dc4339	[CELEBORN-1114] Remove allocationBuckets from WorkerInfo and refactor SLOTS_ALLOCATED metrics ### What changes were proposed in this pull request? Currently, `WorkerInfo` is used in many places, and allocationBuckets is only used when its own workers want to collect metrics `SLOTS_ALLOCATED`. If there are lots of workers in the RSS cluster, there may be a certain amount of memory waste, each `WorkerInfo` maintain a Array\[Int](61), so remove it from `WorkerInfo`. And refactor the metrics `SLOTS_ALLOCATED` from gauge to counter. Originally, this metrics is approximately one hour's total only if there are continuous tasks. Now refactoring it into a counter can reduce the cost of maintaining time windows, including storage and timely expiration data, etc. It can also be more flexibly transformed according to user needs on the prometheus side. ### Why are the changes needed? Ditto. ### Does this PR introduce _any_ user-facing change? Yes. metrics_SlotsAllocated_Count metrics change from gauge for 1 hour to a increasing counter. ### How was this patch tested? Cluster test. Closes #2078 from onebox-li/improve-SlotsAllocated. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-11-08 19:45:47 +08:00
fwang12	32a6a31f84	[CELEBORN-1088] Define `baseLegend` variable for JVM Metrics dashboard ### What changes were proposed in this pull request? Define baseLegend variable for jvm grafana dashboard. BTW, refactor the `"legendFormat": "$baseLegend"` to `"legendFormat": "${baseLegend}"` in celeborn metrics dashboard json. ### Why are the changes needed? so that customer can change the legend variable case by case. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Local Test. Closes #2038 from turboFei/jvm_legend. Authored-by: fwang12 <fwang12@ebay.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-10-25 09:10:33 +08:00
fwang12	819df5f2c4	[CELEBORN-1086] Fix JVM metrics grafana expression issue ### What changes were proposed in this pull request? Fix jvm metrics grafana expression issue. ### Why are the changes needed? ![image](https://github.com/apache/incubator-celeborn/assets/6757692/becedc53-da90-4cce-a494-497b1c55c7a4) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Local Test. <img width="867" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/a9720fc1-9699-47e8-847e-951947f57e01"> Closes #2036 from turboFei/fix_metrics. Authored-by: fwang12 <fwang12@ebay.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-10-24 21:16:42 +08:00
Fu Chen	349ee8b1cb	Revert "[CELEBORN-255] Add counter of outstandingFetches, outstanding… …Rpcs and outstandingPushes to metrics" This reverts commit `bfa341c32f`. ### What changes were proposed in this pull request? ### Why are the changes needed? https://github.com/apache/incubator-celeborn/pull/1992#issuecomment-1776760369 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #2032 from cfmcgrady/revert-pr-1992. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Fu Chen <cfmcgrady@gmail.com>	2023-10-24 17:18:54 +08:00
fwang12	bd9cb2b1ce	[CELEBORN-1077][METRICS] Support to apply base legend format for all grafana metrics ### What changes were proposed in this pull request? Apply base legend format for all grafana metrics. ### Why are the changes needed? Before, the metrics dashboard is not readable easily. <img width="836" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/4647834a-fa5b-42ca-8a98-3dad37c2cb13"> ### Does this PR introduce _any_ user-facing change? Yes. A variable introduced. ### How was this patch tested? Local Test. Now, I can modify the variable value to `{{pod}}_{{cluster}}` and have a better insight. <img width="853" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/a5cca8d9-37c3-4a18-9819-5a9861744cb9"> Closes #2028 from turboFei/legend_format. Authored-by: fwang12 <fwang12@ebay.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-10-24 13:37:08 +08:00
SteNicholas	11c90d8e72	[CELEBORN-916] Add new metric about active shuffle file count in worker ### What changes were proposed in this pull request? Adds new metric `ActiveShuffleFileCount` about active shuffle file count of Celeborn Worker. ### Why are the changes needed? `ActiveShuffleSize` metric report the active shuffle size of peer worker at present. Therefore, it's better to introduce `ActiveShuffleFileCount` to report the active shuffle file count of Celeborn Worker. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Internal tests. Closes #2009 from SteNicholas/CELEBORN-916. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-10-23 11:15:18 +08:00
SteNicholas	7276dd024c	[CELEBORN-1035] Expose RunningApplicationCount, PartitionWritten and PartitionFileCount metric by Celeborn master ### What changes were proposed in this pull request? Meta manager records `appHeartbeatTime`, `partitionTotalWritten` and `partitionTotalFileCount`, which are useful to monitor the application heartbeat and shuffle partition. `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics are exposed by Celeborn master to monitor the application and shuffle partition. ### Why are the changes needed? `Master` exposes `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Internal tests. Closes #1976 from SteNicholas/CELEBORN-1035. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-10-19 22:07:17 +08:00
SteNicholas	bfa341c32f	[CELEBORN-255] Add counter of outstandingFetches, outstandingRpcs and outstandingPushes to metrics ### What changes were proposed in this pull request? Add counter of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` to metrics of Celeborn Worker. ### Why are the changes needed? The counter of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` could be added to metrics to monitor `outstandingFetches`, `outstandingRpcs` and `outstandingPushes`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `TransportResponseHandlerSuiteJ` Closes #1992 from SteNicholas/CELEBORN-255. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-10-16 21:16:57 +08:00
onebox-li	2b79692585	[CELEBORN-688] Add JVM metrics grafana template ### What changes were proposed in this pull request? Currently there is no JVM metrics grafana template, nor in grafana labs. For better use, it is necessary to add one. According the change in #1939 This template uses two variables(instance, pool). The layout is divided into 5 rows. ![image](https://github.com/apache/incubator-celeborn/assets/19429353/732cff90-463c-47b5-89b8-fa8dbbf33b1e) The panels with g1 look like below： ![image](https://github.com/apache/incubator-celeborn/assets/19429353/919b7e9e-f86a-4341-a004-7f0394e1d8b2) JVM Memory Pools row uses replicated panel mode which panels are automatically deplicated by `pool` variables. ![image](https://github.com/apache/incubator-celeborn/assets/19429353/3bdf7a3c-d4e0-42ea-bbe0-012da55a61d1) ![image](https://github.com/apache/incubator-celeborn/assets/19429353/8feaf9b7-156d-453e-8188-40a0399ea516) ![image](https://github.com/apache/incubator-celeborn/assets/19429353/cba4b61c-7d66-4893-9f07-6157c64869bd) ![image](https://github.com/apache/incubator-celeborn/assets/19429353/09b473ef-434c-4fd0-aa4b-084f7588a4f7) ### Why are the changes needed? Ditto ### Does this PR introduce _any_ user-facing change? Yes, this dashboard is based on changes in #1939 ### How was this patch tested? Cluster test Closes #1964 from onebox-li/add-jvm-dashboard. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-10-13 11:54:49 +08:00
zwangsheng	03a39819b5	[CELEBORN-882][WORKER][METRICS] Add `Pause Push Data Time Count` Metrics & Dashboard Panel ### What changes were proposed in this pull request? Add `PausePushDataTime ` Metrics ### Why are the changes needed? Count each celeborn worker pause time. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Cluster Test Closes #1800 from zwangsheng/CELEBORN-882. Lead-authored-by: zwangsheng <2213335496@qq.com> Co-authored-by: zwangsheng <binjieyang@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-09-12 17:45:26 +08:00
mingji	442d59ab55	[CELEBORN-933] Add metrics about active shuffle data size ### What changes were proposed in this pull request? Add metrics about active shuffle data size in every worker and update Grafana dashboard. The metric value will decrease when shuffle is expired. ### Why are the changes needed? Ditto. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? Cluster. <img width="733" alt="截屏2023-08-30 17 00 11" src="https://github.com/apache/incubator-celeborn/assets/4150993/48e28c1c-2b49-45d7-b3ba-358674ff3f3d"> Closes #1867 from FMX/CELEBORN-933. Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-30 18:04:57 +08:00
mingji	2b79c37381	[CELEBORN-852][FOLLOWUP] Add active connection count metrics to grafana dashboard ### What changes were proposed in this pull request? Add active connections count metrics to grafana dashboard. ### Why are the changes needed? Ditto. ### Does this PR introduce _any_ user-facing change? Yes, new metric chart in the grafana dashboard. ### How was this patch tested? Cluster. Closes #1783 from FMX/CELEBORN-852. Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-02 21:24:57 +08:00
Angerszhuuuu	5c7ecb8302	[CELEBORN-754][IMPORTANT] Provide a new SparkShuffleManager to replace RssShuffleManager in the future ### What changes were proposed in this pull request? Provide a new SparkShuffleManager to replace RssShuffleManager in the future ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1667 from AngersZhuuuu/CELEBORN-754. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-30 17:27:33 +08:00
Angerszhuuuu	6e35745736	[CELEBORN-753] Rename spark patch file name to make it more clear ### What changes were proposed in this pull request? Rename spark patch file name to make it more clear ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1666 from AngersZhuuuu/CELEBORN-753. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-06-30 11:41:12 +08:00
mingji	742815f285	[CELEBORN-749] Update grafana dashboard to remove "RSS" ### What changes were proposed in this pull request? Update Grafana dashboard and its setup demo to remove the old name "RSS" ### Why are the changes needed? Ditto. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? No test needed. Closes #1663 from FMX/CELEBORN-749. Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 20:44:09 +08:00
Angerszhuuuu	bd7c2ea35a	[CELEBORN-746][BUILD] Rename project files from rss-xx to celeborn-xx ### What changes were proposed in this pull request? Rename project files from rss-xx to celeborn-xx ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1660 from AngersZhuuuu/CELEBORN-746. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-29 16:30:02 +08:00
Fu Chen	17c1e01874	[CELEBORN-726] Update data replication terminology from `master/slave` to `primary/replica` for configurations and metrics ### What changes were proposed in this pull request? This pull PR is an integral component of #1639 . It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC. ### Why are the changes needed? In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests. Closes #1650 from cfmcgrady/primary-replica-metrics. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 09:47:02 +08:00
Angerszhuuuu	3985a5cbd7	[CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment ### What changes were proposed in this pull request? Unify all blacklist related code and comment ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 16:28:03 +08:00
onebox-li	0c869ac9a0	[CELEBORN-642] Improve metrics and update grafana ### What changes were proposed in this pull request? Change in grafana （ALL） add: JVMCPUTime LastMinuteSystemLoad AvailableProcessors （For Master） add: LostWorkers IsActiveMaster PartitionSize （For Worker） add: PushDataFailCount -> WriteDataFailCount ReplicateDataFailCount ReplicateDataWriteFailCount ReplicateDataCreateConnectionFailCount ReplicateDataConnectionExceptionCount ReplicateDataTimeoutCount SortedFileSize PushDataHandshakeFailCount RegionStartFailCount RegionFinishFailCount MasterPushDataHandshakeTime SlavePushDataHandshakeTime MasterRegionStartTime SlaveRegionStartTime MasterRegionFinishTime SlaveRegionFinishTime PotentialConsumeSpeed UserProduceSpeed WorkerConsumeSpeed DeviceOSFreeBytes DeviceCelebornFreeBytes push usedHeapMemory/usedDirectMemory fetch usedHeapMemory/usedDirectMemory replicate usedHeapMemory/usedDirectMemory remove: dup ReserveSlotsTime Change dashboard layout. Fix support for multiple labels. Modify some metrics docs. ### Why are the changes needed? For better use of metrics. ### Does this PR introduce _any_ user-facing change? Below metrics change name, extract some value to the label. DeviceOSFreeCapacity(B) -> DeviceOSFreeBytes DeviceOSTotalCapacity(B) -> DeviceOSTotalBytes DeviceCelebornFreeCapacity(B) -> DeviceCelebornFreeBytes DeviceCelebornTotalCapacity(B) -> DeviceCelebornTotalBytes push usedHeapMemory/usedDirectMemory fetch usedHeapMemory/usedDirectMemory replicate usedHeapMemory/usedDirectMemory ### How was this patch tested? Cluster test. Closes #1557 from onebox-li/improve-metrics. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-08 18:10:06 +08:00
Ethan Feng	3bd232dda0	[CELEBORN-619][CORE][SHUFFLE][FOLLOWUP] Support enable DRA with Apache Celeborn ### What changes were proposed in this pull request? Adapt Spark DRA patch for spark 3.4 ### Why are the changes needed? To support enabling DRA w/ Celeborn on Spark 3.4 ### Does this PR introduce _any_ user-facing change? Yes, this PR provides a DRA patch for Spark 3.4 ### How was this patch tested? Compiled with Spark 3.4 Closes #1546 from FMX/CELEBORN-619. Lead-authored-by: Ethan Feng <ethanfeng@apache.org> Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: Ethan Feng <ethanfeng@apache.org>	2023-06-06 12:57:16 +08:00
Ethan Feng	5600728149	[CELEBORN-619][CORE][SHUFFLE] Support enable DRA with Apache Celeborn ### What changes were proposed in this pull request? Adapt Spark DRA patch for spark 3.4 ### Why are the changes needed? To support enabling DRA w/ Celeborn on Spark 3.4 ### Does this PR introduce _any_ user-facing change? Yes, this PR provides a DRA patch for Spark 3.4 ### How was this patch tested? Compiled with Spark 3.4 Closes #1529 from FMX/CELEBORN-619. Authored-by: Ethan Feng <ethanfeng@apache.org> Signed-off-by: Ethan Feng <ethanfeng@apache.org>	2023-06-05 09:50:05 +08:00
ulysses	fa920ab0d5	Relax isRssEnabled condition (#1528) Co-authored-by: youxiduo <youxiduo@corp.netease.com>	2023-05-31 15:26:05 +08:00
Ethan Feng	9cccfc9872	[CELEBORN-431][FLINK] Support dynamic buffer allocation in reading map partition. (#1407)	2023-04-13 10:37:47 +08:00
Keyong Zhou	f2fd8a5c15	[CELEBORN-373] Add sorted files into grafana dashboard (#1303)	2023-03-02 23:41:16 +08:00
Keyong Zhou	54cf2e18d8	[CELEBORN-252] Delete slides (#1186)	2023-01-31 16:35:23 +08:00
Keyong Zhou	dfa81c92df	[CELEBORN-224] Correct LICENSE and NOTICE. (#1164) (#1170)	2023-01-18 19:47:42 +08:00
Ethan Feng	01b7ea97c9	[CELEBORN-193] Reduce source package size. (#1140)	2023-01-03 19:28:03 +08:00
Keyong Zhou	a2d2379153	[DOC] Replace RSS with Celeborn in docs (#715)	2022-10-06 10:37:46 +08:00
Keyong Zhou	fe3b5988f2	[REFACTOR] Change package name to org.apache.celeborn (#710)	2022-10-02 18:10:29 +08:00
Binjie Yang	9f20aabb48	[IMPROVE] Fix grafana dashboard json `metrics_OfferSlotsTime_Max` & `metrics_OfferSlotsTime_Mean` target datasource (#655)	2022-09-22 17:45:38 +08:00
AngersZhuuuu	da7ac1721b	[ISSUE-565][REFACTOR] Unify RPC name HeartbeatXxxxx (#566)	2022-09-07 21:33:18 +08:00
Kerwin Zhang	46892c271c	[issue-517] Spark3 patch to support columnar shuffle (#528)	2022-09-05 11:34:53 +08:00
Keyong Zhou	41e8311d58	[ISSUE-436][REFACTOR] Refactor metrics (#437) 1. Fix metrics_RegisteredShuffleCount_Value inconsistent between master and worker 2. Delete OverloadWorkerCount 3. Change slotsUsed to SlotsAllocated in last hour	2022-08-23 18:26:47 +08:00
Ethan Feng	959c689285	[DOC] Add documentation about setting up prometheus cluster and node exporter (#393)	2022-08-19 21:47:49 +08:00
Ethan Feng	f3bcb7f6a8	[ISSUE-146]update slots distribution mechanism (#273)	2022-08-12 23:38:19 +08:00
Ethan Feng	86adc0d244	[Feature]Add metrics documentation and grafana dashboard. (#117)	2022-05-20 12:12:41 +08:00
Ethan Feng	e8e333a239	RSS support spark3 RDA. (#108) * RSS support spark3 RDA.	2022-05-14 14:02:40 +08:00
Ethan Feng	2ea136fada	[Feature]Update spark2 patch (#89)	2022-04-08 21:46:30 +08:00
zky.zhoukeyong	ba5920acde	Initial Commit for RSS	2021-12-28 20:57:35 +08:00


98 Commits