57 Commits

Author SHA1 Message Date
Ethan Feng
b4654d788c
[ISSUE-607]Add map ids info for each PartitionLocation to enable filtering for m… (#619) 2022-09-23 15:21:41 +08:00
AngersZhuuuu
a6b8af2b00
[ISSUE-637][FEATURE] Change CheckAlive to CheckAvailable and reply checkQuota result (#658) 2022-09-22 21:54:45 +08:00
AngersZhuuuu
0ae096e56f
[ISSUE-634][FEATURE] Add ResourceConsumption class to indicate resource usage (#641) 2022-09-21 15:51:14 +08:00
AngersZhuuuu
33ce3d3d33
[ISSUE-631][FEATURE] Add UserIdentifier proto class (#638) 2022-09-21 11:04:24 +08:00
Cheng Pan
b51abeed96
Improve code smell (#624) 2022-09-20 10:03:02 +08:00
Cheng Pan
adfc344b4d
Correct propagating heartbeat values (#623) 2022-09-19 23:55:23 +08:00
AngersZhuuuu
a6acaa11e0
[ISSUE-597][REFACTOR] Unify Enum type name and correct wrong UN_KOWN (#598) 2022-09-13 19:07:48 +08:00
AngersZhuuuu
5ad89a2661
[ISSUE-548][Refactor] Unify code about master/worker's startup and HttpServer (#550) 2022-09-07 23:28:28 +08:00
AngersZhuuuu
da7ac1721b
[ISSUE-565][REFACTOR] Unify RPC name HeartbeatXxxxx (#566) 2022-09-07 21:33:18 +08:00
AngersZhuuuu
4f24483cc9
[ISSUE-537][FEATURE] RPCSource implement Master RPC (#538) 2022-09-06 00:06:57 +08:00
Cheng Pan
4b42219595
Remove log4j1 (#501) 2022-09-05 19:30:15 +08:00
Cheng Pan
f00b5a39bc
Extract OpenByteArrayOutputStream (#507) 2022-09-02 21:01:58 +08:00
Cheng Pan
7bc99ff7b5
HTTP server netty thread should has prefix and allow to bind host (#499) 2022-09-02 00:48:26 +08:00
Cheng Pan
c88ce306be
Use Spotless to auto check and reformat Java/Scala code (#497) 2022-09-01 21:19:56 +08:00
AngersZhuuuu
74c1596d3a
[ISSUE-489][FEATURE] Implement more info for master and worker (#490) 2022-09-01 01:01:03 +08:00
Cheng Pan
3dddb65f31
Enable Apache Rat and fix license header (#492) 2022-08-31 23:53:33 +08:00
Ethan Feng
5548dcfac2
[ISSUE-476] refactor read apis to support read from hdfs (#477) 2022-08-30 11:03:30 +08:00
Cheng Pan
555c938d70
Fix NPE (#473) 2022-08-28 18:49:14 +08:00
Ethan Feng
7214b195b2
fix update partition size failed in HA mode. (#468) 2022-08-26 16:16:19 +08:00
Keyong Zhou
6d9b81957e
[ISSUE-453] [REFACTOR] Refine code in MasterUtils (#461) 2022-08-25 10:24:08 +08:00
Keyong Zhou
ebe8793ff7
[ISSUE-450] Fix performance regression (#452)
When comparing TPC-DS 3T between the main branch and branch-0.1, we found that round-robin allocation on main is slower than on branch-0.1, which is unexpected because the high-level allocation algorithms are basically the same.
```
main roundrobin:   5332s
branch-0.1:        5027s
```
After digging deeper, I found the cause: branch-0.1 first allocates master locations round-robin, then slave locations round-robin, whereas the main branch allocates (master, slave) pairs round-robin. As a result, if a worker has two disks, disk1 and disk2, all master partitions land on disk1 and all slave partitions on disk2. In branch-0.1, by contrast, disk1 and disk2 each hold the same number of master and slave partitions.
Experiments show that changing the main branch algorithm to mimic branch-0.1 restores the performance.
Time of q74:
```
branch-0.1:          58.749s
main before fix:     70.114s
main after fix:      58.987s
```
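The interleaving effect described above can be reproduced with a toy model. This is only a sketch under simplifying assumptions, not the project's allocator: the function names `allocate_pairs` and `allocate_separately` are hypothetical, a single cursor cycles over the disks of one worker, and the real constraint that a partition's master and slave live on different workers is ignored.

```python
from collections import Counter
from itertools import cycle


def allocate_pairs(disks, num_partitions):
    """Interleaved allocation: one cursor hands out a master slot,
    then a slave slot, for each partition in turn."""
    cursor = cycle(disks)
    masters, slaves = Counter(), Counter()
    for _ in range(num_partitions):
        masters[next(cursor)] += 1
        slaves[next(cursor)] += 1
    return masters, slaves


def allocate_separately(disks, num_partitions):
    """Two-pass allocation: all masters round-robin first,
    then all slaves round-robin with a fresh cursor."""
    masters, slaves = Counter(), Counter()
    cursor = cycle(disks)
    for _ in range(num_partitions):
        masters[next(cursor)] += 1
    cursor = cycle(disks)
    for _ in range(num_partitions):
        slaves[next(cursor)] += 1
    return masters, slaves


if __name__ == "__main__":
    disks = ["disk1", "disk2"]
    print("pairs:     ", allocate_pairs(disks, 4))
    print("separately:", allocate_separately(disks, 4))
```

With two disks, the interleaved version puts every master on disk1 and every slave on disk2, while the two-pass version gives each disk an equal share of both roles.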
2022-08-24 19:32:42 +08:00
Cheng Pan
1d4bb3616e
HAMasterMetaManager should log exception stacktrace (#447) 2022-08-24 14:53:41 +08:00
Keyong Zhou
ca3ee003d9
[ISSUE-441] Refactor cluster load check to cluster alive check (#442) 2022-08-23 23:02:23 +08:00
Keyong Zhou
743ac9cdf8
Replace all essConf with rssConf (#439) 2022-08-23 20:44:48 +08:00
Keyong Zhou
41e8311d58
[ISSUE-436][REFACTOR] Refactor metrics (#437)
1. Fix metrics_RegisteredShuffleCount_Value inconsistent between master and worker
2. Delete OverloadWorkerCount
3. Change slotsUsed to SlotsAllocated in last hour
2022-08-23 18:26:47 +08:00
AngersZhuuuu
809ea7fe9d
[ISSUE-418][BUG] Start master/worker should respect rpc port setting (#419) 2022-08-22 17:18:23 +08:00
Keyong Zhou
282da98e31
[ISSUE-403] Refactor StorageManager (#404) 2022-08-21 23:19:16 +08:00
AngersZhuuuu
f43dd9fc24
[ISSUE-367][FEATURE] Master support worker re-register with same port and worker miss heartbeat only delete expired data (#378) 2022-08-18 14:53:20 +08:00
nafiy
96b14e2205
[ISSUE-304][BUG]HA port being occupied makes master cannot normally launch (#317)
2022-08-16 20:37:01 +08:00
Keyong Zhou
937ac54e7c
[ISSUE-351] Trigger split when reaching disk space limitation (#356) 2022-08-15 00:24:25 +08:00
Keyong Zhou
c2672c2d9d
[ISSUE-273][FOLLOW-UP] 1.Heartbeat use workerInfo's diskInfos instead… (#352) 2022-08-14 16:54:08 +08:00
Keyong Zhou
20a3ba4e56
[ISSUE-273][FOLLOW-UP] Merge MountInfo with DiskInfo (#348) 2022-08-13 22:58:13 +08:00
Keyong Zhou
9516a63eb5
[ISSUE-273][FOLLOW-UP] Remove duplicate handleWorkerHeartBeat (#347) 2022-08-13 18:32:47 +08:00
Keyong Zhou
6d1a2db663
[ISSUE-273][FOLLOW-UP] Fix IndexOutOfBoundsException when release slots (#344)
```
java.lang.IndexOutOfBoundsException: Index: 2, Size: 1
        at java.util.ArrayList.rangeCheck(ArrayList.java:659)
        at java.util.ArrayList.get(ArrayList.java:435)
        at com.aliyun.emr.rss.service.deploy.master.clustermeta.AbstractMetaManager.updateReleaseSlotsMeta(AbstractMetaManager.java:104)
        at com.aliyun.emr.rss.service.deploy.master.clustermeta.SingleMasterMetaManager.handleReleaseSlots(SingleMasterMetaManager.java:53)
        at com.aliyun.emr.rss.service.deploy.master.Master.handleReleaseSlots(Master.scala:456)
        at com.aliyun.emr.rss.service.deploy.master.Master$$anonfun$receiveAndReply$1.$anonfun$applyOrElse$12(Master.scala:189)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at com.aliyun.emr.rss.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:156)
        at com.aliyun.emr.rss.service.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:189)
        at com.aliyun.emr.rss.common.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:110)
        at com.aliyun.emr.rss.common.rpc.netty.Inbox.safelyCall(Inbox.scala:214)
        at com.aliyun.emr.rss.common.rpc.netty.Inbox.process(Inbox.scala:107)
        at com.aliyun.emr.rss.common.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:222)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
```
2022-08-13 12:44:03 +08:00
Ethan Feng
f3bcb7f6a8
[ISSUE-146]update slots distribution mechanism (#273) 2022-08-12 23:38:19 +08:00
nafiy
eeda030599
Add metrics for marking active master (#307) 2022-08-07 18:00:49 +08:00
dxheming
8e3f48ec12
Refactor deprecated netty ConcurrentSet (#285) 2022-07-27 20:35:46 +08:00
Keyong Zhou
6442f38a33
[ISSUE-267] Extend API to support more partition types: MapPartition,… (#268) 2022-07-17 16:28:37 +08:00
AngersZhuuuu
36cc234dd4
[ISSUE-246][REFACTOR] Refactor LifecycleManager to make it's code more clear and more readable (#252) 2022-07-12 15:37:49 +08:00
AngersZhuuuu
d2a0ad480e
[ISSUE-222][BUG] RequestSlotResponse/RegisterShuffleResponse should handle null issue (#226) 2022-07-08 12:33:40 +08:00
nafiy
6f8fb8747f
Modify argument class and add config (#212) 2022-07-01 23:17:24 +08:00
Keyong Zhou
49f2a00943
[ISSUE-208] Refine log levels (#210) 2022-07-01 14:57:30 +08:00
AngersZhuuuu
909e8b2f53
[ISSUE-190][BUG] After WorkerLost, response to worker heartbeat RPC to, then worker can clean the data. (#192) 2022-06-29 22:25:29 +08:00
AngersZhuuuu
3079d0ac7a
[ISSUE-176][BUG] Handle RegisterWorker use wrong worker info when trigger lost event (#177) 2022-06-28 18:13:33 +08:00
Ethan Feng
f78451b93d
fix an ArithmeticException. (#167) 2022-06-27 17:01:55 +08:00
nafiy
491f89bbb5
[FEATURE]Add metrics source for JVM and CPU (#125)
* Add metrics source for JVM and CPU

* Fix scala style issue
2022-05-30 13:26:54 +08:00
AngersZhuuuu
730d0c4a97
[ISSUE-120] [BUG] Master's metrics of WorkerSlotsCount / WorkerSlotsUsed/ OverloadWorkerCount not update (#121)
2022-05-23 19:19:24 +08:00
Ethan Feng
ac645a464b
update netty and ratis version. (#115) 2022-05-19 11:25:55 +08:00
Ethan Feng
409da82964
[Bug]fix stuck under high memory pressure. (#90) 2022-04-14 18:53:39 +08:00
Ethan Feng
baa2836216
Add metrics: (#85)
1. shuffle fetch send data time.
2. open stream time.
3. memory critical count.
2022-04-02 15:05:27 +08:00