Commit Graph

27 Commits

Author SHA1 Message Date
Angerszhuuuu
5c7ecb8302
[CELEBORN-754][IMPORTANT] Provide a new SparkShuffleManager to replace RssShuffleManager in the future
### What changes were proposed in this pull request?
Provide a new SparkShuffleManager to replace RssShuffleManager in the future

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1667 from AngersZhuuuu/CELEBORN-754.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-30 17:27:33 +08:00
Angerszhuuuu
6e35745736
[CELEBORN-753] Rename spark patch file name to make it more clear
### What changes were proposed in this pull request?
Rename spark patch file name to make it more clear

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1666 from AngersZhuuuu/CELEBORN-753.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-30 11:41:12 +08:00
mingji
742815f285
[CELEBORN-749] Update grafana dashboard to remove "RSS"
### What changes were proposed in this pull request?
Update Grafana dashboard and its setup demo to remove the old name "RSS"

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
No test needed.

Closes #1663 from FMX/CELEBORN-749.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 20:44:09 +08:00
Angerszhuuuu
bd7c2ea35a [CELEBORN-746][BUILD] Rename project files from rss-xx to celeborn-xx
### What changes were proposed in this pull request?
Rename project files from rss-xx to celeborn-xx

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1660 from AngersZhuuuu/CELEBORN-746.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-29 16:30:02 +08:00
Fu Chen
17c1e01874
[CELEBORN-726] Update data replication terminology from master/slave to primary/replica for configurations and metrics
### What changes were proposed in this pull request?

This pull PR is an integral component of #1639 . It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC.

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests.

Closes #1650 from cfmcgrady/primary-replica-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 09:47:02 +08:00
Angerszhuuuu
3985a5cbd7 [CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment
### What changes were proposed in this pull request?
Unify all blacklist related code and comment

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 16:28:03 +08:00
onebox-li
0c869ac9a0
[CELEBORN-642] Improve metrics and update grafana
### What changes were proposed in this pull request?
Change in grafana

(ALL)
add:
JVMCPUTime
LastMinuteSystemLoad
AvailableProcessors
(For Master)
add:
LostWorkers
IsActiveMaster
PartitionSize
(For Worker)
add:
PushDataFailCount -> WriteDataFailCount
ReplicateDataFailCount
ReplicateDataWriteFailCount
ReplicateDataCreateConnectionFailCount
ReplicateDataConnectionExceptionCount
ReplicateDataTimeoutCount
SortedFileSize
PushDataHandshakeFailCount
RegionStartFailCount
RegionFinishFailCount
MasterPushDataHandshakeTime
SlavePushDataHandshakeTime
MasterRegionStartTime
SlaveRegionStartTime
MasterRegionFinishTime
SlaveRegionFinishTime
PotentialConsumeSpeed
UserProduceSpeed
WorkerConsumeSpeed
DeviceOSFreeBytes
DeviceCelebornFreeBytes
push usedHeapMemory/usedDirectMemory
fetch usedHeapMemory/usedDirectMemory
replicate usedHeapMemory/usedDirectMemory
remove:
dup ReserveSlotsTime

Change dashboard layout.

Fix support for multiple labels.

Modify some metrics docs.

### Why are the changes needed?
For better use of metrics.

### Does this PR introduce _any_ user-facing change?
Below metrics change name, extract some value to the label.
DeviceOSFreeCapacity(B) -> DeviceOSFreeBytes
DeviceOSTotalCapacity(B) -> DeviceOSTotalBytes
DeviceCelebornFreeCapacity(B) -> DeviceCelebornFreeBytes
DeviceCelebornTotalCapacity(B) -> DeviceCelebornTotalBytes
push usedHeapMemory/usedDirectMemory
fetch usedHeapMemory/usedDirectMemory
replicate usedHeapMemory/usedDirectMemory

### How was this patch tested?
Cluster test.

Closes #1557 from onebox-li/improve-metrics.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-08 18:10:06 +08:00
Ethan Feng
3bd232dda0
[CELEBORN-619][CORE][SHUFFLE][FOLLOWUP] Support enable DRA with Apache Celeborn
### What changes were proposed in this pull request?

Adapt Spark DRA patch for spark 3.4

### Why are the changes needed?

To support enabling DRA w/ Celeborn on Spark 3.4

### Does this PR introduce _any_ user-facing change?

Yes, this PR provides a DRA patch for Spark 3.4

### How was this patch tested?

Compiled with Spark 3.4

Closes #1546 from FMX/CELEBORN-619.

Lead-authored-by: Ethan Feng <ethanfeng@apache.org>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
2023-06-06 12:57:16 +08:00
Ethan Feng
5600728149
[CELEBORN-619][CORE][SHUFFLE] Support enable DRA with Apache Celeborn
### What changes were proposed in this pull request?

Adapt Spark DRA patch for spark 3.4

### Why are the changes needed?

To support enabling DRA w/ Celeborn on Spark 3.4

### Does this PR introduce _any_ user-facing change?

Yes, this PR provides a DRA patch for Spark 3.4

### How was this patch tested?

Compiled with Spark 3.4

Closes #1529 from FMX/CELEBORN-619.

Authored-by: Ethan Feng <ethanfeng@apache.org>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
2023-06-05 09:50:05 +08:00
ulysses
fa920ab0d5
Relax isRssEnabled condition (#1528)
Co-authored-by: youxiduo <youxiduo@corp.netease.com>
2023-05-31 15:26:05 +08:00
Ethan Feng
9cccfc9872
[CELEBORN-431][FLINK] Support dynamic buffer allocation in reading map partition. (#1407) 2023-04-13 10:37:47 +08:00
Keyong Zhou
f2fd8a5c15
[CELEBORN-373] Add sorted files into grafana dashboard (#1303) 2023-03-02 23:41:16 +08:00
Keyong Zhou
54cf2e18d8
[CELEBORN-252] Delete slides (#1186) 2023-01-31 16:35:23 +08:00
Keyong Zhou
dfa81c92df
[CELEBORN-224] Correct LICENSE and NOTICE. (#1164) (#1170) 2023-01-18 19:47:42 +08:00
Ethan Feng
01b7ea97c9
[CELEBORN-193] Reduce source package size. (#1140) 2023-01-03 19:28:03 +08:00
Keyong Zhou
a2d2379153
[DOC] Replace RSS with Celeborn in docs (#715) 2022-10-06 10:37:46 +08:00
Keyong Zhou
fe3b5988f2
[REFACTOR] Change package name to org.apache.celeborn (#710) 2022-10-02 18:10:29 +08:00
Binjie Yang
9f20aabb48
[IMPORVE] Fix grafana dashboard json metrics_OfferSlotsTime_Max & metrics_OfferSlotsTime_Mean target datasource (#655) 2022-09-22 17:45:38 +08:00
AngersZhuuuu
da7ac1721b
[ISSUE-565][REFACTOR] Unify RPC name HeartbeatXxxxx (#566) 2022-09-07 21:33:18 +08:00
Kerwin Zhang
46892c271c
[issue-517] Spark3 patch to support columnar shuffle (#528) 2022-09-05 11:34:53 +08:00
Keyong Zhou
41e8311d58
[ISSUE-436][REFACTOR] Refactor metrics (#437)
1. Fix metrics_RegisteredShuffleCount_Value inconsistent between master and worker
2. Delete OverloadWorkerCount
3.Change slotsUsed to SlotsAllocated in last hour
2022-08-23 18:26:47 +08:00
Ethan Feng
959c689285
[DOC] Add documentation about setting up prometheus cluster and node exporter (#393) 2022-08-19 21:47:49 +08:00
Ethan Feng
f3bcb7f6a8
[ISSUE-146]update slots distribution mechanism (#273) 2022-08-12 23:38:19 +08:00
Ethan Feng
86adc0d244
[Feature]Add metrics documentation and grafana dashboard. (#117) 2022-05-20 12:12:41 +08:00
Ethan Feng
e8e333a239
RSS support spark3 RDA. (#108)
* RSS support spark3 RDA.
2022-05-14 14:02:40 +08:00
Ethan Feng
2ea136fada
[Feature]Update spark2 patch (#89) 2022-04-08 21:46:30 +08:00
zky.zhoukeyong
ba5920acde Initial Commit for RSS 2021-12-28 20:57:35 +08:00