Commit Graph

1099 Commits

Author SHA1 Message Date
AngersZhuuuu
909ad7dc23
[ISSUE-482][REFACTOR] RetryingChunkClient should show clear error message (#483) 2022-08-30 21:03:21 +08:00
AngersZhuuuu
53ed1e930a
[ISSUE-387][DOC] Doc about worker status recover after restart (#459) 2022-08-30 16:37:14 +08:00
Ethan Feng
5548dcfac2
[ISSUE-476] refactor read apis to support read from hdfs (#477) 2022-08-30 11:03:30 +08:00
AngersZhuuuu
eee10032fc
[REFACTOR] Some minor changes in client module (#478) 2022-08-29 19:53:45 +08:00
nafiy
6d308eb4f2
[ISSUE-465][Bug] Common module scalatest style unit test don't actually run (#472) 2022-08-28 18:52:39 +08:00
Cheng Pan
555c938d70
Fix NPE (#473) 2022-08-28 18:49:14 +08:00
Cheng Pan
84d814d083
Improve MemoryTracker log format and change its thread to daemon (#471) 2022-08-27 22:07:39 +08:00
Ethan Feng
eeaa28d24f
[ISSUE-440]Clean expired hdfs files and keep one replication. (#466) 2022-08-26 22:03:43 +08:00
Cheng Pan
fc96034742
[BUILD] Flatten Jars for Master and Worker (#469) 2022-08-26 21:33:38 +08:00
nafiy
01a8d48b5a
[ISSUE-312][FEATURE] Support zstd compression (#451) 2022-08-26 18:07:53 +08:00
Ethan Feng
7214b195b2
fix update partition size failed in HA mode. (#468) 2022-08-26 16:16:19 +08:00
Ethan Feng
1207ce3f49
[issue-333] sort shuffle files to hdfs (#456) 2022-08-25 17:46:37 +08:00
AngersZhuuuu
045488a95e
[ISSUE-411][BUG] Fix disk buffer metric leak.(#464) 2022-08-25 16:50:23 +08:00
AngersZhuuuu
3a42f172fa
[ISSUE-388][SHELL] Add restart script for worker's quick recover (#460) 2022-08-25 15:18:25 +08:00
AngersZhuuuu
25418117ae
[ISSUE-332][FOLLOWUP] Add deps in worker's pom (#462) 2022-08-25 14:28:24 +08:00
Keyong Zhou
6d9b81957e
[ISSUE-453] [REFACTOR] Refine code in MasterUtils (#461) 2022-08-25 10:24:08 +08:00
AngersZhuuuu
b9488db16b
[ISSUE-457][BUG] After worker recover, we should make sure there is no partial sortedFile and indexFile (#458) 2022-08-24 22:05:04 +08:00
Binjie Yang
44f01a39b6
add worker svc (#455) 2022-08-24 21:56:25 +08:00
Keyong Zhou
ebe8793ff7
[ISSUE-450] Fix performance regression (#452)
When comparing TPC-DS 3T between the main branch and branch-0.1, we found that round-robin allocation on main is slower than on branch-0.1, which is unexpected because the high-level allocation algorithms are basically the same.
```
main roundrobin:   5332s
branch-0.1:        5027s
```
After digging deeper, I found the cause: branch-0.1 first allocates master locations round-robin, then slave locations round-robin, whereas the main branch allocates (master, slave) pairs round-robin. As a result, if a worker has two disks, disk1 and disk2, all master partitions are allocated on disk1 and all slave partitions on disk2. In branch-0.1, by contrast, disk1 and disk2 each hold the same number of master partitions and of slave partitions.
Experiments show that when we change the main branch algorithm to mimic branch-0.1, we get the performance back.
Time of q74:
```
branch-0.1:          58.749s
main before fix:     70.114s
main after fix:      58.987s
```
2022-08-24 19:32:42 +08:00
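The allocation difference described in the ISSUE-450 commit above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the project's actual allocator: the `allocate` function, its `pair_wise` flag, the worker names, and the per-worker disk cursor are all hypothetical, chosen to reproduce the skew the commit message describes.

```python
from collections import Counter

def allocate(num_partitions, workers, disks_per_worker, pair_wise):
    """Toy round-robin slot allocation (hypothetical model, not the real code).

    Each worker keeps its own disk cursor that advances on every slot it
    receives. Returns a Counter keyed by (worker, disk_index, role).
    """
    cursor = {w: 0 for w in workers}  # per-worker round-robin disk cursor
    placed = Counter()

    def place(worker, role):
        disk = cursor[worker] % disks_per_worker
        cursor[worker] += 1
        placed[(worker, disk, role)] += 1

    n = len(workers)
    if pair_wise:
        # Main branch before the fix: allocate (master, slave) pairs together.
        for p in range(num_partitions):
            place(workers[p % n], "master")
            place(workers[(p + 1) % n], "slave")
    else:
        # branch-0.1 style: all masters first, then all slaves.
        for p in range(num_partitions):
            place(workers[p % n], "master")
        for p in range(num_partitions):
            place(workers[(p + 1) % n], "slave")
    return placed

paired = allocate(4, ["A", "B"], 2, pair_wise=True)
separate = allocate(4, ["A", "B"], 2, pair_wise=False)
```

In this model, with two workers of two disks each, `paired` places both of worker A's masters on disk 0 and both of its slaves on disk 1, while `separate` gives each disk on A one master and one slave. That matches the imbalance the commit message reports, though the real allocator may differ in its details.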
Ethan Feng
11855f1667
[Feature]hdfs writer respects working dir configuration. (#446) 2022-08-24 15:25:03 +08:00
AngersZhuuuu
8ca97d92e4
[ISSUE-415][REFACTOR] Refactor Storage related class to separated scala file (#416) 2022-08-24 15:18:53 +08:00
Cheng Pan
1d4bb3616e
HAMasterMetaManager should log exception stacktrace (#447) 2022-08-24 14:53:41 +08:00
AngersZhuuuu
b7040ae366
[ISSUE-443][FEATURE] Support customized working dir path (#444) 2022-08-24 11:28:29 +08:00
Keyong Zhou
ca3ee003d9
[ISSUE-441] Refactor cluster load check to cluster alive check (#442) 2022-08-23 23:02:23 +08:00
Ethan Feng
a4bab91453
[issue-332] support flush disk buffer to hdfs (#430) 2022-08-23 21:04:45 +08:00
Keyong Zhou
743ac9cdf8
Replace all essConf with rssConf (#439) 2022-08-23 20:44:48 +08:00
Keyong Zhou
41e8311d58
[ISSUE-436][REFACTOR] Refactor metrics (#437)
1. Fix metrics_RegisteredShuffleCount_Value inconsistent between master and worker
2. Delete OverloadWorkerCount
3. Change slotsUsed to SlotsAllocated in last hour
2022-08-23 18:26:47 +08:00
Keyong Zhou
6c7b159493
[ISSUE-434] Refine log (#435) 2022-08-23 14:16:38 +08:00
liugs0213
24e0f2cdd4
[issue-424] device monitor checklist (#433) 2022-08-22 23:12:55 +08:00
Keyong Zhou
9526cfb997
[ISSUE-428]Should not check blacklist when reserveSlots to avoid ping-pang situation (#432)
```
22/08/22 20:03:39 INFO LifecycleManager: Try reserve slots for application_1660226621060_0180-549 for 1 times.
22/08/22 20:03:39 WARN LifecycleManager: [reserve buffer] failed due to blacklist:
Host: 192.168.15.9
RpcPort: 37761
PushPort: 37903
FetchPort: 37517
ReplicatePort: 38449
SlotsUsed: 0()
LastHeartBeat: 0
Disks: {}
WorkerRef: NettyRpcEndpointRef(rss://WorkerEndpoint@192.168.15.9:37761)

22/08/22 20:03:41 INFO LifecycleManager: Received Blacklist from Master, blacklist: [] unkown workers: []
22/08/22 20:03:50 INFO LifecycleManager: Report Worker Failure: Buffer(
Host: 192.168.15.9
RpcPort: 37761
PushPort: 37903
FetchPort: 37517
ReplicatePort: 38449
SlotsUsed: 0()
LastHeartBeat: 0
Disks: {}
WorkerRef: NettyRpcEndpointRef(rss://WorkerEndpoint@192.168.15.9:37761)
)
```
2022-08-22 21:00:41 +08:00
Keyong Zhou
ce96e99dd8
[ISSUE-429][BUG] blacklistPartition should add worker from workersSnapshot instead of PartitionLocation (#431)
* device monitor checklist
```
22/08/22 18:21:03 WARN LifecycleManager: Do Revive for shuffle application_1660226621060_0180-298, oldPartition: PartitionLocation[226-0 192.168.15.9:37761:37903:37517:38449 Mode: Master peer: 192.168.15.6:37533:37413:37387 storage hint:StorageHint{type=MEMORY, mountPoint='/mnt/disk1', finalResult=false}], cause: StatusCode{value=PushDataFailMain}
22/08/22 18:21:03 INFO LifecycleManager: Report Worker Failure: Buffer(
Host: 192.168.15.9
RpcPort: 37761
PushPort: 37903
FetchPort: 37517
ReplicatePort: 38449
SlotsUsed: 0()
LastHeartBeat: 0
Disks: {}
WorkerRef: null
)
```
2022-08-22 21:00:24 +08:00
Keyong Zhou
dd164e4a1f
[ISSUE-425][REFACTOR] Avoid trigger trim action if trim action is in process (#426)
2022-08-22 19:20:01 +08:00
AngersZhuuuu
809ea7fe9d
[ISSUE-418][BUG] Start master/worker should respect rpc port setting (#419) 2022-08-22 17:18:23 +08:00
Keyong Zhou
258f426592
[ISSUE-422] DeviceMonitor should not trigger IoHang when device stat file is unavailable (#423) 2022-08-22 17:13:34 +08:00
Keyong Zhou
11762a260b
[ISSUE-417] handleUnregisterShuffle and StageEnd trigger double handl… (#420)
1. Unregister shuffle triggers handleStageEnd
```
22/08/22 12:47:00 INFO LifecycleManager: Call StageEnd before Unregister Shuffle 60.
```
2. handleStageEnd succeeds, possibly triggered by handleUnregisterShuffle or StageEnd
```
22/08/22 12:47:51 INFO LifecycleManager: Succeed to handle stageEnd for 60.
```
3. Data loss is reported
```
22/08/22 12:48:28 ERROR LifecycleManager: For 60 partition 2185-0: data lost.
22/08/22 12:48:28 ERROR LifecycleManager: Failed to handle stageEnd for 60, lost file!
```
4. Unregister is reported as successful
```
22/08/22 12:48:28 INFO LifecycleManager: Unregister for 60 success.
```
2022-08-22 17:13:08 +08:00
lichaojacobs
652295f797
[ISSUE-413] Fix incorrect totalSpace in disk usage check 2022-08-22 14:53:27 +08:00
AngersZhuuuu
aab427fdde
[ISSUE-362][FEATURE] StorageManager recover FileInfo from LevelDB (#398) 2022-08-22 12:39:10 +08:00
Keyong Zhou
1c70fe446b
[ISSUE-407] Add synchronization to FileInfo and workingDirWriters to … (#408) 2022-08-22 12:04:55 +08:00
Keyong Zhou
90eabdedb5
[ISSUE-409] Device checker should not report error when check process timeout (#410) 2022-08-22 11:59:59 +08:00
Keyong Zhou
47850addb4
[ISSUE-403][FOLLOW-UP] Minor fix (#406) 2022-08-21 23:27:55 +08:00
Keyong Zhou
282da98e31
[ISSUE-403] Refactor StorageManager (#404) 2022-08-21 23:19:16 +08:00
Keyong Zhou
d5b41fbea5
[ISSUE-401] Remove DeviceObserver.reportError() (#402) 2022-08-20 16:08:53 +08:00
Keyong Zhou
3b661da013
[ISSUE-399] Duplicate code in notifyHealthy (#400) 2022-08-20 15:23:57 +08:00
Keyong Zhou
17655ab244
[ISSUE-396] Incorrectly report Slow Flusher (#397)
```
22/08/19 22:09:49,057 ERROR [local-storage-scheduler] LocalDeviceMonitor: Receive report exception, ArrayBuffer(/mnt/disk2/hadoop/rss-worker/shuffle_data, /mnt/disk2/hadoop/rss-worker/shuffle_data), java.io.IOException: Slow Flusher!
```
2022-08-19 23:11:57 +08:00
AngersZhuuuu
cb5522d7d5
[ISSUE-360][FEATURE] PartitionSorter graceful shutdown and recover sortedFiles from LevelDB (#361) 2022-08-19 23:11:27 +08:00
Ethan Feng
05ae9036d3
[Refactor] refactor apis for HDFS write. (#391) 2022-08-19 21:58:57 +08:00
Ethan Feng
959c689285
[DOC] Add documentation about setting up prometheus cluster and node exporter (#393) 2022-08-19 21:47:49 +08:00
AngersZhuuuu
50d5081922
[ISSUE-385][Feature] RetryingChunkClient openChunk don't wait for first request to each replicate (#389) 2022-08-18 17:28:15 +08:00
AngersZhuuuu
0628262634
[ISSUE-385][FEATURE] RetryingChunkClient openChunks failed should wait (#386) 2022-08-18 16:57:36 +08:00
AngersZhuuuu
2be279099f
[ISSUE-323][FEATURE] Extract FileMeta from FileWriter (#366)
2022-08-18 16:33:40 +08:00