Commit Graph

312 Commits

Author SHA1 Message Date
AngersZhuuuu
a773c8e6db
[ISSUE-820][Refactor] Rename RssConf to CelebornConf (#826) 2022-10-20 20:13:13 +08:00
AngersZhuuuu
8344479df1
[ISSUE-818][REFACTOR] Move existing RssConf.xxx conf method to RssConf class (#822)
Co-authored-by: Ethan Feng <ethan.aquarius.fmx@gmail.com>
2022-10-20 18:10:59 +08:00
Ethan Feng
5c761a8df3
[ISSUE-813][Refactor] Refactor flusher configurations. (#813)
* Refactor flusher configurations.

* Refactor flusher configurations.

* Update.

* remove brackets.

* update docs.

* rename.

* update.

* update docs.

* update.

* update.

* update.

* update.

* update.

* update.

* update.

* format.

* update.

* update.
2022-10-20 15:23:17 +08:00
nafiy
a75bce905e
[ISSUE-805][REFACTOR] Remove UserIdentifier out of ControlMessage (#808) 2022-10-19 15:32:53 +08:00
AngersZhuuuu
7fedaaeca1
[ISSUE-795][BUG] Batch handle change partition throw NPE (#796) 2022-10-19 10:54:08 +08:00
Ethan Feng
bff2a7065b
Keep one copy of roaringbitmap to reduce memory usage. (#790) 2022-10-18 13:26:49 +08:00
Cheng Pan
efad4abb5d
Migrate a bunch of configurations (#786) 2022-10-18 10:44:01 +08:00
Cheng Pan
ea67f4e060
Introduce categories to ConfigEntry and migrate configurations (#775) 2022-10-17 16:56:54 +08:00
Cheng Pan
96e969f46e
[BUILD] Extract project.version to Maven Property (#772) 2022-10-16 19:01:40 +08:00
AngersZhuuuu
c9b462dc02
[ISSUE-770][Refactor] Batch handle change partition should ignore empty batch and avoid print log of empty result (#771) 2022-10-14 21:49:37 +08:00
AngersZhuuuu
3bad403c8b
[ISSUE-768][REFACTOR] Shuffle data lost should show more clear about lost data in which worker (#769) 2022-10-14 11:41:15 +08:00
Cheng Pan
f01a696313
Migrate and refactor configuration for master endpoints (#752) 2022-10-11 21:33:21 +08:00
AngersZhuuuu
bbb4f8e225
[ISSUE-306][IMPROVEMENT] Handle change partition request in batch (#622) 2022-10-10 18:31:37 +08:00
AngersZhuuuu
f2a234f870
[ISSUE-739][REFACTOR] Use object wrap pb message method (#740) 2022-10-09 11:53:48 +08:00
AngersZhuuuu
ae4bb12d5e
[ISSUE-630][REFACTOR] Minor change of storage resource quota, include code style, comment unused code etc.. (#728) 2022-10-08 20:15:25 +08:00
Ethan Feng
96e550f81c
Fix a npe that stuck lifecycle manager when a worker is offline. (#733) 2022-10-08 20:11:42 +08:00
Ethan Feng
6deda248ac
[REFACTOR]move lifecycle manager to correct package. (#730) 2022-10-08 18:14:08 +08:00
Cheng Pan
ab16b4f101
[INFRA] Rename modules w/ celeborn prefix (#723) 2022-10-08 08:05:57 +08:00
Cheng Pan
abb4ce6405
Drop control message Scala wrapper - Revive/PartitionSplit/ChangeLocationResponse (#720) 2022-10-07 12:40:23 +08:00
Cheng Pan
a719709a17
Drop control message Scala wrapper - UnregisterShuffle/UnregisterShuffleResponse (#718) 2022-10-07 12:29:10 +08:00
Cheng Pan
cda133e11f
Drop control message Scala wrapper - RegisterShuffle/RegisterShuffleResponse (#716) 2022-10-06 23:37:36 +08:00
Keyong Zhou
a2d2379153
[DOC] Replace RSS with Celeborn in docs (#715) 2022-10-06 10:37:46 +08:00
Cheng Pan
4880d78d6a
Extract spark tests and improve pom (#711) 2022-10-04 10:23:26 +08:00
Keyong Zhou
fe3b5988f2
[REFACTOR] Change package name to org.apache.celeborn (#710) 2022-10-02 18:10:29 +08:00
nafiy
5d4533fb85
[ISSUE-632][FEATURE] LifecycleManager side ReserveSlots & RequestSlots RPC with UserIdentifier (#679) 2022-09-27 00:01:44 +08:00
zky.zhoukeyong
a2522745d2
Revert "Drop control message Scala wrapper - RemoveExpiredShuffle (#676)"
This reverts commit a160cd90cb.
2022-09-25 17:18:41 +08:00
Cheng Pan
a160cd90cb
Drop control message Scala wrapper - RemoveExpiredShuffle (#676) 2022-09-24 23:23:36 +08:00
Ethan Feng
30d4323cdb
[FEATURE] Add a configuration to enable a map id filter mechanism. #662 (#663) 2022-09-23 18:38:52 +08:00
Ethan Feng
4a7a7d42b5
[FEATURE] Add metrics about fetch chunk size, commit files time and get reducer file time (#661) 2022-09-23 16:05:28 +08:00
Ethan Feng
b4654d788c
[ISSUE-607]Add map ids info for each PartitionLocation to enable filtering for m… (#619) 2022-09-23 15:21:41 +08:00
AngersZhuuuu
a6b8af2b00
[ISSUE-637][FEATURE] Change CheckAlive to CheckAvailable and reply checkQuota result (#658) 2022-09-22 21:54:45 +08:00
AngersZhuuuu
df5ba55ea5
[ISSUE-633][FEATURE] Support providing user identity by customized class and keep LifecycleManager and ShuffleClient user identity consistent (#646) 2022-09-21 17:35:59 +08:00
Ethan Feng
3c917c577b
Fix worker replied ack at the wrong time when a soft split is triggered. (#645) 2022-09-21 15:07:21 +08:00
Cheng Pan
b51abeed96
Improve code smell (#624) 2022-09-20 10:03:02 +08:00
Keyong Zhou
30a5afb816
[ISSUE-625][BUG] Incorrect result when kill worker while pushMergedData (#627) 2022-09-20 00:05:15 +08:00
AngersZhuuuu
e48efb2e1c
[ISSUE-611][BUG] FetchHandler should handle PartitionFileSorter return null and we should enable retry for sorter exception (#615) 2022-09-19 14:51:46 +08:00
nafiy
75ca396e77
[ISSUE-600][Refactor] Translate Chinese comments to English (#605) 2022-09-15 22:24:39 +08:00
Keyong Zhou
0dc7e82006
improve revive log readability. (#603) (#604)
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
2022-09-14 23:25:49 +08:00
AngersZhuuuu
a6acaa11e0
[ISSUE-597][REFACTOR] Unify Enum type name and correct wrong UN_KOWN (#598) 2022-09-13 19:07:48 +08:00
nafiy
01d138bea4
[ISSUE-578][FEATURE] Add unit test for codec (#586) 2022-09-11 17:08:45 +08:00
Keyong Zhou
e0c4779fac
[ISSUE-591][BUG] Incorrect result when revive and split happen concur… (#592) 2022-09-10 23:30:39 +08:00
Keyong Zhou
1d7fec84da
[ISSUE-588][BUG] Fix memory leak in shuffle read (#589) 2022-09-10 22:07:13 +08:00
nafiy
0a60b21b56
[ISSUE-551][BUG] CompressionMethod and checksum are not consistent when zstd level is negative (#577) 2022-09-10 21:39:51 +08:00
Keyong Zhou
a2cd01b8ef
[ISSUE-567][FOLLOW-UP] remove entry from latestPartitionLocation in removeExpiredShuffle (#575) 2022-09-08 11:21:42 +08:00
AngersZhuuuu
da7ac1721b
[ISSUE-565][REFACTOR] Unify RPC name HeartbeatXxxxx (#566) 2022-09-07 21:33:18 +08:00
Keyong Zhou
f0b6346c9f
[ISSUE-567] Optimize LifecycleManager.getLatestPartition (#570) 2022-09-07 21:06:49 +08:00
nafiy
644471debb
[ISSUE-516][FEATURE] Worker should clean remaining directory when start before registering to Master (#540) 2022-09-06 23:37:47 +08:00
AngersZhuuuu
35d5b587ec
[Refactor] Modify package name of utils to keep consistence (#536) 2022-09-05 20:06:54 +08:00
AngersZhuuuu
f7211204f2
[ISSUE-534][REFACTOR] Refactor log when call handleGetReducerFileGroup (#535) 2022-09-05 19:48:57 +08:00
Cheng Pan
4b42219595
Remove log4j1 (#501) 2022-09-05 19:30:15 +08:00
Cheng Pan
5c2514a5c1
[WORKER] Cleanup StreamState when channel inactive (#527) 2022-09-05 11:31:03 +08:00
Cheng Pan
f00b5a39bc
Extract OpenByteArrayOutputStream (#507) 2022-09-02 21:01:58 +08:00
Cheng Pan
99e58e8e23
Improve logging for RetryingChunkClient (#470) 2022-09-02 00:44:26 +08:00
Cheng Pan
c88ce306be
Use Spotless to auto check and reformat Java/Scala code (#497) 2022-09-01 21:19:56 +08:00
AngersZhuuuu
87f529da35
[ISSUE-484][FEATURE] Add Worker related RPC metrics (#488) 2022-09-01 16:47:31 +08:00
Ethan Feng
1a1145a86f
[ISSUE-334] read shuffle from hdfs. (#481) 2022-08-31 14:51:07 +08:00
AngersZhuuuu
909ad7dc23
[ISSUE-482][REFACTOR] RetryingChunkClient should show clear error message (#483) 2022-08-30 21:03:21 +08:00
Ethan Feng
5548dcfac2
[ISSUE-476] refactor read apis to support read from hdfs (#477) 2022-08-30 11:03:30 +08:00
AngersZhuuuu
eee10032fc
[REFACTOR] Some minor changes in client module (#478) 2022-08-29 19:53:45 +08:00
Ethan Feng
eeaa28d24f
[ISSUE-440]Clean expired hdfs files and keep one replication. (#466) 2022-08-26 22:03:43 +08:00
nafiy
01a8d48b5a
[ISSUE-312][FEATURE] Support zstd compression (#451) 2022-08-26 18:07:53 +08:00
Keyong Zhou
ca3ee003d9
[ISSUE-441] Refactor cluster load check to cluster alive check (#442) 2022-08-23 23:02:23 +08:00
Keyong Zhou
6c7b159493
[ISSUE-434] Refine log (#435) 2022-08-23 14:16:38 +08:00
Keyong Zhou
9526cfb997
[ISSUE-428] Should not check blacklist when reserveSlots to avoid ping-pong situation (#432)
```
22/08/22 20:03:39 INFO LifecycleManager: Try reserve slots for application_1660226621060_0180-549 for 1 times.
22/08/22 20:03:39 WARN LifecycleManager: [reserve buffer] failed due to blacklist:
Host: 192.168.15.9
RpcPort: 37761
PushPort: 37903
FetchPort: 37517
ReplicatePort: 38449
SlotsUsed: 0()
LastHeartBeat: 0
Disks: {}
WorkerRef: NettyRpcEndpointRef(rss://WorkerEndpoint@192.168.15.9:37761)

22/08/22 20:03:41 INFO LifecycleManager: Received Blacklist from Master, blacklist: [] unkown workers: []
22/08/22 20:03:50 INFO LifecycleManager: Report Worker Failure: Buffer(
Host: 192.168.15.9
RpcPort: 37761
PushPort: 37903
FetchPort: 37517
ReplicatePort: 38449
SlotsUsed: 0()
LastHeartBeat: 0
Disks: {}
WorkerRef: NettyRpcEndpointRef(rss://WorkerEndpoint@192.168.15.9:37761)
)
```
2022-08-22 21:00:41 +08:00
Keyong Zhou
ce96e99dd8
[ISSUE-429][BUG] blacklistPartition should add worker from workersSnapshot instead of PartitionLocation (#431)
* device monitor checklist

* [ISSUE-429][BUG] blacklistPartition should add worker from workersSnapshot instead of PartitionLocation
```
22/08/22 18:21:03 WARN LifecycleManager: Do Revive for shuffle application_1660226621060_0180-298, oldPartition: PartitionLocation[226-0 192.168.15.9:37761:37903:37517:38449 Mode: Master peer: 192.168.15.6:37533:37413:37387 storage hint:StorageHint{type=MEMORY, mountPoint='/mnt/disk1', finalResult=false}], cause: StatusCode{value=PushDataFailMain}
22/08/22 18:21:03 INFO LifecycleManager: Report Worker Failure: Buffer(
Host: 192.168.15.9
RpcPort: 37761
PushPort: 37903
FetchPort: 37517
ReplicatePort: 38449
SlotsUsed: 0()
LastHeartBeat: 0
Disks: {}
WorkerRef: null
)
```
2022-08-22 21:00:24 +08:00
Keyong Zhou
11762a260b
[ISSUE-417] handleUnregisterShuffle and StageEnd trigger double handl… (#420)
1. Unregister shuffle triggers handleStageEnd
```
22/08/22 12:47:00 INFO LifecycleManager: Call StageEnd before Unregister Shuffle 60.
```
2. handleStageEnd success, maybe triggered by handleUnregisterShuffle or StageEnd
```
22/08/22 12:47:51 INFO LifecycleManager: Succeed to handle stageEnd for 60.
```
3. reports data lost
```
22/08/22 12:48:28 ERROR LifecycleManager: For 60 partition 2185-0: data lost.
22/08/22 12:48:28 ERROR LifecycleManager: Failed to handle stageEnd for 60, lost file!
```
4. report unregister success
```
22/08/22 12:48:28 INFO LifecycleManager: Unregister for 60 success.
```
2022-08-22 17:13:08 +08:00
AngersZhuuuu
50d5081922
[ISSUE-385][Feature] RetryingChunkClient openChunk don't wait for first request to each replicate (#389) 2022-08-18 17:28:15 +08:00
AngersZhuuuu
0628262634
[ISSUE-385][FEATURE] RetryingChunkClient openChunks failed should wait (#386) 2022-08-18 16:57:36 +08:00
Cheng Pan
f1f4b894af
Build: Enhance build system (#349) 2022-08-15 14:59:01 +08:00
AngersZhuuuu
ba41a2c2e8
[ISSUE-357][REFACTOR] Remove unused handleStageEnd (#358) 2022-08-15 12:26:15 +08:00
Ethan Feng
f3bcb7f6a8
[ISSUE-146]update slots distribution mechanism (#273) 2022-08-12 23:38:19 +08:00
Keyong Zhou
d166e042be
[ISSUE-329] Should not sleep if reserve slots successfully in reserveSlotsWithRetry (#330) 2022-08-12 12:27:27 +08:00
AngersZhuuuu
cf2b895afb
[ISSUE-293][REFACTOR] Init worker rpc endpoint and reserve slot in parallel to speed up register shuffle process (#294)
2022-08-03 20:00:30 +08:00
AngersZhuuuu
e57ad27887
[ISSUE-291][REFACTOR] When worker endpoint initializing failed, print clear warning log (#292) 2022-08-02 12:03:59 +08:00
dxheming
8e3f48ec12
Refactor deprecated netty ConcurrentSet (#285) 2022-07-27 20:35:46 +08:00
AngersZhuuuu
7a760466aa
[ISSUE-281][BUG] Use correct maxDestLength to check if buffer can satisfy compress result (#282) 2022-07-26 15:56:05 +08:00
AngersZhuuuu
9324b1e89a
[ISSUE-257][FEATURE] Reserve slots support customized retry times (#258) 2022-07-26 15:23:25 +08:00
AngersZhuuuu
fe17914942
Refactor pom import issue (#277) 2022-07-25 17:49:55 +08:00
Keyong Zhou
6442f38a33
[ISSUE-267] Extend API to support more partition types: MapPartition,… (#268) 2022-07-17 16:28:37 +08:00
Keyong Zhou
56a0b9072b
[ISSUE-261] Refine message class hierarchy (#266) 2022-07-16 17:00:09 +08:00
Keyong Zhou
7da8f64691
[ISSUE-262] Remove unused bootstrap (#263) 2022-07-16 11:01:44 +08:00
AngersZhuuuu
36cc234dd4
[ISSUE-246][REFACTOR] Refactor LifecycleManager to make its code clearer and more readable (#252) 2022-07-12 15:37:49 +08:00
Keyong Zhou
691beb7889
[ISSUE-247] Extract PushHandler, FetchHandler, RpcHandler from Worker… (#251) 2022-07-12 11:40:42 +08:00
Keyong Zhou
d8c5758124
[ISSUE-249] Fix OutOfBounds when shuffle has no data(q24b) (#250) 2022-07-10 18:03:54 +08:00
AngersZhuuuu
f80c86a675
[ISSUE-222] Destroy and DestroyResponse should remove null check (#238) 2022-07-09 15:44:17 +08:00
AngersZhuuuu
49caced462
[ISSUE-222][BUG] GetReduceFileGroups should remove code about return null value (#236) 2022-07-09 12:14:08 +08:00
AngersZhuuuu
c28eeb078c
[ISSUE-222] CommitFiles and CommitFilesResponse should remove null check (#237) 2022-07-08 22:32:54 +08:00
AngersZhuuuu
6e5c282229
[ISSUE-222] GetBlacklist/GetBlacklistResponse should replace null value with empty list (#235) 2022-07-08 14:49:09 +08:00
AngersZhuuuu
d2a0ad480e
[ISSUE-222][BUG] RequestSlotResponse/RegisterShuffleResponse should handle null issue (#226) 2022-07-08 12:33:40 +08:00
AngersZhuuuu
736a3e8814
[ISSUE-222][BUG] handleChangePartitionLocation should handle oldPartition == null (#224) 2022-07-07 22:48:19 +08:00
Ethan Feng
04148fef2b
[ISSUE-228]Fix unexpected closed exceptions occurred while committing files. (#232) 2022-07-07 22:15:16 +08:00
Keyong Zhou
49f2a00943
[ISSUE-208] Refine log levels (#210) 2022-07-01 14:57:30 +08:00
AngersZhuuuu
506cc0af9c
[ISSUE-171][BUG] LifeCycleManager throw scala.collection.immutable.HashMap$HashTrieMap cannot be cast to java.util.HashMap when handle destroyBuffersWithRetry (#172)
2022-06-28 10:45:16 +08:00
AngersZhuuuu
5c82b763eb
[ISSUE-169][FEATURE] Make app heartbeat interval can be customized (#170)
* Update LifecycleManager.scala
2022-06-27 20:58:00 +08:00
mingji
d4d8eb3838
update pom version. 2022-06-24 14:28:42 +08:00
AngersZhuuuu
73b41ac8c5
[ISSUE-160] [BUG] requestReserveSlot failed loss root cause (#161) 2022-06-23 16:33:41 +08:00
AngersZhuuuu
84a281ff89
[ISSUE-158][BUG] When revive meet reserve slot failed, will throw ArrayBoundOutOfIndex exception (#159)
* Update pom.xml
2022-06-23 16:15:38 +08:00
AngersZhuuuu
146f724a15
ISSUE-152. Show target host:port when push data callback onFailure (#153) 2022-06-17 22:09:17 +08:00
Ethan Feng
6811cc22fc
[issue-146] Add storage hint to indicate storage location. (#147) 2022-06-14 15:57:11 +08:00
AngersZhuuuu
b51a7626b2
[ISSUE-148][BUG] MapEnd but speculation task's inFlightBatch not cleaned (#149) 2022-06-13 15:44:06 +08:00
Ethan Feng
7d04dbab92
[BUG]Fix a null pointer exception. (#116)
* 1.Fix a null pointer exception.
2.Add partitionlocation to inflight batches to help resolve problems.
3.Reduce driver logs.
2022-05-19 11:23:34 +08:00
leoyy0316
f79e40b21d
modify CONTRIBUTING.md and move LifecycleManager to scala source (#112)
Leo Cheng <leocheng@synnex.com>
2022-05-16 19:03:40 +08:00
Ethan Feng
409da82964
[Bug]fix stuck under high memory pressure. (#90) 2022-04-14 18:53:39 +08:00
Ethan Feng
9ad8254b0a
AQE support. (#67) 2022-04-01 20:19:01 +08:00
AngersZhuuuu
86bbeea9b4
[BUG] Register shuffle with configurable retry times and retry wait time (#83) 2022-04-01 16:59:37 +08:00
AngersZhuuuu
4bd3a539a5
[ISSUE-80] When rss is in blacklist and failed for reserve, rpcRef could be null (#81) 2022-03-29 21:12:37 +08:00
Keyong Zhou
4f66849d6a
fix NPE in LifecycleManager.handleGetBlacklist (#59) 2022-02-16 12:17:41 +08:00
Ethan Feng
356a1952e4
Multi Client Support (#47)
Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2022-01-29 22:28:06 +08:00
Ethan Feng
bc1adac90e
[FEATURE]Worker-Wise Current-Limiting (#44) 2022-01-26 15:27:00 +08:00
Tony Doen
302891a1b9
[BUG] ClusterLoadFallbackPolicy is not strictness when a shuffle with big partitions to register (#30) 2022-01-26 15:16:01 +08:00
Keyong Zhou
31dc2cf7da
[BUG] Record failed worker in LifecycleManager instead of reporting to Master (#34) 2022-01-07 12:18:56 +08:00
zky.zhoukeyong
ba5920acde
Initial Commit for RSS 2021-12-28 20:57:35 +08:00