Commit Graph

312 Commits

Author SHA1 Message Date
zhongqiangczq
d3d40f730c
[CELEBORN-106] flink-plugin supports shufflewrite:OutputGate (#1051) 2022-12-08 11:24:37 +08:00
Shuang
e2196e9383
[CELEBORN-56] [ISSUE-945] handle map partition mapper end (#1003) 2022-12-07 21:09:02 +08:00
Shuang
f3f104870c
[CELEBORN-75] Initialize flink plugin module (#1027) 2022-12-07 15:53:00 +08:00
Angerszhuuuu
0d38bad78a
[CELEBORN-20][REFACTOR] Extract CommitManager from LifecycleManager (#1050) 2022-12-06 22:26:18 +08:00
Angerszhuuuu
1e4dec96b9
[CELEBORN-21][REFACTOR] Extract revive related logic from LifecycleManager (#1024)
* [CELEBORN-21][REFACTOR] Extract revive related logic from LifecycleManager
2022-12-05 17:05:17 +08:00
Angerszhuuuu
5eaad136a0
[CELEBORN-84][IMPROVEMENT] Blacklist critical reason should avoid being covered by normal reason (#1043)
* [CELEBORN-84][IMPROVEMENT] Blacklist critical reason should avoid being covered by normal reason
2022-12-05 14:02:33 +08:00
nafiy
8e384cda5a
[CELEBORN-88][REFACTOR] Revive/PartitionSplit should set separated timeout configuration (#1046) 2022-12-05 10:36:43 +08:00
nafiy
44d45c2a27
[CELEBORN-90][REFACTOR] GetReducerFileGroup should support separated timeout configuration (#1045) 2022-12-02 22:53:51 +08:00
Shuang
3a4c3c03a0
[CELEBORN-76][FOLLOWUP] fix inFlightCommitRequest counting problem (#1034)
* [CELEBORN-76][FOLLOWUP] fix inFlightCommitRequest counting problem
2022-12-02 16:25:59 +08:00
nafiy
13e1e24035
[CELEBORN-86][REFACTOR] Register shuffle should have separated timeout configuration (#1031)
* [CELEBORN-86][REFACTOR] Register shuffle should have separated timeout configuration
2022-12-01 18:39:56 +08:00
zhongqiangczq
898d1126a6
[CELEBORN-11] ShuffleClient supports MapPartition shuffle write: send handshake/regionstart/regionfinish (#1035) 2022-12-01 11:20:55 +08:00
RexAn
bb5a4d2180
[CELEBORN-63] Add CONGESTION related status codes (#1028)
* Increase push data return reason types such as CONGESTION etc.
2022-12-01 10:55:37 +08:00
Angerszhuuuu
7f8e66afbc
[CELEBORN-76][FOLLOWUP] Support batch commit hard split partition before stage end (#1030)
* [CELEBORN-76][FOLLOWUP] Support batch commit hard split partition before stage end
2022-11-30 19:42:04 +08:00
Ethan Feng
dd02070e4b
[CELEBORN-83] Fix various bugs when using HDFS as storage.
1. fix incompatibility between Hadoop 2 and Hadoop 3.
2. fix hdfs writer will never be called when there are no healthy disks.
3. fix an NPE when HDFS file writer close.
2022-11-30 19:33:18 +08:00
Angerszhuuuu
5ad4415c68
[CELEBORN-78][REFACTOR] Extract heartbeater from LifecycleManager (#1021)
* [CELEBORN-78][REFACTOR] Extract heartbeater from LifecycleManager
2022-11-29 19:14:55 +08:00
Angerszhuuuu
01dc9d4259
[CELEBORN-79][REFACTOR] Remove unused responseCheckerThread from LifecycleManager (#1022) 2022-11-29 15:25:37 +08:00
Angerszhuuuu
d26e73209b
[CELEBORN-76] Support batch commit hard split partition before stage end 2022-11-29 13:09:01 +08:00
Angerszhuuuu
13f4ce2be6
[CELEBORN-68][FOLLOWUP] Retry on same partition location should have a retry wait interval (#1017) 2022-11-28 20:17:08 +08:00
Keyong Zhou
d381df71f8
[CELEBORN-70] Add epoch for each commitFiles request (#1012) 2022-11-27 21:05:14 +08:00
nafiy
817eee969f
[CELEBORN-58][REFACTOR] Aggregate reserve failed logs together (#1005) 2022-11-26 20:56:39 +08:00
Keyong Zhou
f8bb2cd47d
[CELEBORN-12]Retry on CommitFile request (#1011) 2022-11-26 20:56:24 +08:00
Keyong Zhou
9214b82181
[CELEBORN-68] Client might fetch incorrect data chunk (#1010) 2022-11-26 18:06:06 +08:00
Ethan Feng
93dbf3f8b1
[CELEBORN-67] Revert "Fix fetch incorrect data chunk" related commits (#1006)
* Revert "[CELEBORN-50][FOLLOWUP] Channel inactive may cause new client use old stream id to fetch data (#999)"

This reverts commit 1e8f6dc5e8.

* Revert "[CELEBORN-50] Channel inActive may cause new client use old stream id to fetch data cause IllegalStateException. (#1000)"

This reverts commit f1c4d675d6.

* Revert "[CELEBORN-49] Deadlock when kill worker in shuffle read (#998)"

This reverts commit 0be4b3399c.

* Revert "[CELEBORN-47][IMPROVEMENT] Refine logs about tracking fetch chunk (#995)"

This reverts commit 2b05228871.

* Revert "[BUG] Fix fetch incorrect data chunk (#926)"

This reverts commit 6f043f8a

* Revert "[ISSUE-925][FOLLOWUP] Refactor class name of RetryingChunkReceiveCallback (#954)"

This reverts commit 64e8ebf1
2022-11-25 20:57:47 +08:00
nafiy
fe13e9e261
[CELEBORN-59][REFACTOR] Support send destroy slots request in parallel (#1004) 2022-11-25 18:26:05 +08:00
Angerszhuuuu
1e8f6dc5e8
[CELEBORN-50][FOLLOWUP] Channel inactive may cause new client use old stream id to fetch data (#999)
* [CELEBORN-48][BUG] Channel inactive may cause new client use old stream id to fetch data
2022-11-23 18:22:06 +08:00
Ethan Feng
f1c4d675d6
[CELEBORN-50] Channel inActive may cause new client use old stream id to fetch data cause IllegalStateException. (#1000) 2022-11-23 18:07:57 +08:00
Keyong Zhou
0be4b3399c
[CELEBORN-49] Deadlock when kill worker in shuffle read (#998) 2022-11-23 17:31:05 +08:00
Angerszhuuuu
2b05228871
[CELEBORN-47][IMPROVEMENT] Refine logs about tracking fetch chunk (#995) 2022-11-23 11:56:10 +08:00
Keyong Zhou
cfc1fa15bd
[CELEBORN-46] Refine log for RssInputStream.close() (#994) 2022-11-22 22:01:08 +08:00
Shuang
1656458788
[CELEBORN-14] [ISSUE-955] support register attempt map task (#984) 2022-11-22 15:23:20 +08:00
Angerszhuuuu
5ec278f99a
[ISSUE-987][FEATURE] During worker shutdown, return HARD_SPLIT for all existed partition (#988) 2022-11-22 14:29:55 +08:00
Shuang
fb6d1de108
[CELEBORN-8] [ISSUE-952][FEATURE] support register shuffle task in map partition mode (#973) 2022-11-16 21:46:19 +08:00
Angerszhuuuu
64e8ebf158
[ISSUE-925][FOLLOWUP] Refactor class name of RetryingChunkReceiveCallback (#954) 2022-11-11 14:00:47 +08:00
leesf
0b8376e2c7
Cleanup some code (#943) 2022-11-11 13:58:39 +08:00
Ethan Feng
6f043f8ae9
[BUG] Fix fetch incorrect data chunk (#926) 2022-11-09 22:31:39 +08:00
leesf
3699683a3b
Fix and migrate some configs (#927) 2022-11-07 09:41:38 +08:00
Angerszhuuuu
38e15d89e6
[ISSUE-902][IMPROVEMENT][FOLLOWUP] LifecycleManager should reserve blacklist with irrecoverable status (#914) 2022-11-04 15:54:45 +08:00
Angerszhuuuu
e68ca75a9e
[ISSUE-902][BUG] LifecycleManager should not reallocate slots in failed worker during retry (#906) 2022-11-02 21:07:28 +08:00
leesf
f1694f3d20
[MINOR][CLEANUP] clean up some code in LifecycleManager and ShuffleClientImpl (#896) 2022-11-01 11:40:19 +08:00
Angerszhuuuu
87fcfa767f
[ISSUE-887][REFACTOR] Configuration type convert to Enum (#888)
* [ISSUE-332][FOLLOWUP] Add deps in worker's pom

* [Refactor] Modify package name of utils to keep consistence

* [Refactor] Modify package name of utils to keep consistence

* [REFACTOR] Remove unused isRegistered in controller

* [ISSUE-887][REFACTOR] Configuration type convert to Enum

* update

* update

* Update RssShuffleManager.java
2022-10-29 13:41:06 +08:00
Cheng Pan
d7be6006e7
Migrate network related conf to structured conf system (#875)
* Migrate network related conf to structured conf system

* migrate

* fix

* fix

* worker

* fix

* nit

* review

* nit
2022-10-28 10:45:52 +08:00
Angerszhuuuu
f9ecde3b2b
[ISSUE-863][BUG]LifecycleManager should ignore change partition request when shuffle ended and not remove workersnapshot when commit success (#864) 2022-10-27 22:04:18 +08:00
Ethan Feng
8800fc4a8e
[Refactor] Refine rpc cache configs (#853)
* refine rpc cache configs.

* update.

* update.

* update.
2022-10-25 20:28:18 +08:00
Ethan Feng
45ef716737
[Feature] Cache GetReducerFileGroupResponse to avoid lifecycle manager oom. (#792) 2022-10-25 16:16:44 +08:00
AngersZhuuuu
2ebf873b3c
[ISSUE-845][REFACTOR] Migrate partition split related conf to Celeborn Configuration System (#846)
[ISSUE-845][REFACTOR] Migrate partition split related conf to Celeborn Configuration System
2022-10-25 10:54:45 +08:00
AngersZhuuuu
0bd0a3e9f4
[ISSUE-847][REFACTOR] Migrate codec conf to Celeborn Configuration System (#848)
* [ISSUE-847][REFACTOR] Migrate codec conf to Celeborn Configuration System

* Update CelebornConf.scala

* follow comments

* update

* update

* update

* Update client.md
2022-10-25 09:16:46 +08:00
AngersZhuuuu
0fdb19065a
[ISSUE-841][REFACTOR] Migrate shuffle client side conf to Celeborn Configuration System (#842) 2022-10-24 20:48:48 +08:00
Keyong Zhou
63752e7a37
[BUG] RegisterShuffle should not increase epoch (#833) 2022-10-23 23:40:32 +08:00
nafiy
d0058fb2c5
[ISSUE-780][REFACTOR] Refactor PartitionLocation's methods (#791) 2022-10-22 22:46:45 +08:00
AngersZhuuuu
f2610e3b6f
[ISSUE-829][REFACTOR] Unify name of PUSH_DATA_FAIL_MAIN (#830) 2022-10-21 19:06:33 +08:00
AngersZhuuuu
a773c8e6db
[ISSUE-820][Refactor] Rename RssConf to CelebornConf (#826) 2022-10-20 20:13:13 +08:00
AngersZhuuuu
8344479df1
[ISSUE-818][REFACTOR] Move existing RssConf.xxx conf method to RssConf class (#822)
* [ISSUE-818][REFACTOR] Move existing RssConf.xxx conf method to RssConf class


Co-authored-by: Ethan Feng <ethan.aquarius.fmx@gmail.com>
2022-10-20 18:10:59 +08:00
Ethan Feng
5c761a8df3
[ISSUE-813][Refactor] Refactor flusher configurations. (#813)
* Refactor flusher configurations.

* Refactor flusher configurations.

* Update.

* remove brackets.

* update docs.

* rename.

* update.

* update docs.

* update.

* update.

* update.

* update.

* update.

* update.

* update.

* format.

* update.

* update.
2022-10-20 15:23:17 +08:00
nafiy
a75bce905e
[ISSUE-805][REFACTOR] Remove UserIdentifier out of ControlMessage (#808) 2022-10-19 15:32:53 +08:00
AngersZhuuuu
7fedaaeca1
[ISSUE-795][BUG] Batch handle change partition throw NPE (#796) 2022-10-19 10:54:08 +08:00
Ethan Feng
bff2a7065b
Keep one copy of roaringbitmap to reduce memory usage. (#790) 2022-10-18 13:26:49 +08:00
Cheng Pan
efad4abb5d
Migrate a bunch of configurations (#786) 2022-10-18 10:44:01 +08:00
Cheng Pan
ea67f4e060
Introduce categories to ConfigEntry and migrate configurations (#775) 2022-10-17 16:56:54 +08:00
Cheng Pan
96e969f46e
[BUILD] Extract project.version to Maven Property (#772) 2022-10-16 19:01:40 +08:00
AngersZhuuuu
c9b462dc02
[ISSUE-770][Refactor] Batch handle change partition should ignore empty batch and avoid print log of empty result (#771) 2022-10-14 21:49:37 +08:00
AngersZhuuuu
3bad403c8b
[ISSUE-768][REFACTOR] Shuffle data lost should show more clear about lost data in which worker (#769) 2022-10-14 11:41:15 +08:00
Cheng Pan
f01a696313
Migrate and refactor configuration for master endpoints (#752) 2022-10-11 21:33:21 +08:00
AngersZhuuuu
bbb4f8e225
[ISSUE-306][IMPROVEMENT] Handle change partition request in batch (#622) 2022-10-10 18:31:37 +08:00
AngersZhuuuu
f2a234f870
[ISSUE-739][REFACTOR] Use object wrap pb message method (#740) 2022-10-09 11:53:48 +08:00
AngersZhuuuu
ae4bb12d5e
[ISSUE-630][REFACTOR] Minor change of storage resource quota, include code style, comment unused code etc.. (#728) 2022-10-08 20:15:25 +08:00
Ethan Feng
96e550f81c
Fix a npe that stuck lifecycle manager when a worker is offline. (#733) 2022-10-08 20:11:42 +08:00
Ethan Feng
6deda248ac
[REFACTOR]move lifecycle manager to correct package. (#730) 2022-10-08 18:14:08 +08:00
Cheng Pan
ab16b4f101
[INFRA] Rename modules w/ celeborn prefix (#723) 2022-10-08 08:05:57 +08:00
Cheng Pan
abb4ce6405
Drop control message Scala wrapper - Revive/PartitionSplit/ChangeLocationResponse (#720) 2022-10-07 12:40:23 +08:00
Cheng Pan
a719709a17
Drop control message Scala wrapper - UnregisterShuffle/UnregisterShuffleResponse (#718) 2022-10-07 12:29:10 +08:00
Cheng Pan
cda133e11f
Drop control message Scala wrapper - RegisterShuffle/RegisterShuffleResponse (#716) 2022-10-06 23:37:36 +08:00
Keyong Zhou
a2d2379153
[DOC] Replace RSS with Celeborn in docs (#715) 2022-10-06 10:37:46 +08:00
Cheng Pan
4880d78d6a
Extract spark tests and improve pom (#711) 2022-10-04 10:23:26 +08:00
Keyong Zhou
fe3b5988f2
[REFACTOR] Change package name to org.apache.celeborn (#710) 2022-10-02 18:10:29 +08:00
nafiy
5d4533fb85
[ISSUE-632][FEATURE] LifecycleManager side ReserveSlots & RequestSlots RPC with UserIdentifier (#679) 2022-09-27 00:01:44 +08:00
zky.zhoukeyong
a2522745d2
Revert "Drop control message Scala wrapper - RemoveExpiredShuffle (#676)"
This reverts commit a160cd90cb.
2022-09-25 17:18:41 +08:00
Cheng Pan
a160cd90cb
Drop control message Scala wrapper - RemoveExpiredShuffle (#676) 2022-09-24 23:23:36 +08:00
Ethan Feng
30d4323cdb
[FEATURE] Add a configuration to enable a map id filter mechanism. #662 (#663) 2022-09-23 18:38:52 +08:00
Ethan Feng
4a7a7d42b5
[FEATURE] Add metrics about fetch chunk size, commit files time and get reducer file time (#661) 2022-09-23 16:05:28 +08:00
Ethan Feng
b4654d788c
[ISSUE-607]Add map ids info for each PartitionLocation to enable filtering for m… (#619) 2022-09-23 15:21:41 +08:00
AngersZhuuuu
a6b8af2b00
[ISSUE-637][FEATURE] Change CheckAlive to CheckAvailable and reply checkQuota result (#658) 2022-09-22 21:54:45 +08:00
AngersZhuuuu
df5ba55ea5
[ISSUE-633][FEATURE] Support provider user identity by customized class and keep LifecycleManager and ShuffleClient user identity consistence (#646) 2022-09-21 17:35:59 +08:00
Ethan Feng
3c917c577b
Fix worker replied ack at the wrong time when a soft split is triggered. (#645) 2022-09-21 15:07:21 +08:00
Cheng Pan
b51abeed96
Improve code smell (#624) 2022-09-20 10:03:02 +08:00
Keyong Zhou
30a5afb816
[ISSUE-625][BUG] Incorrect result when kill worker while pushMergedData (#627) 2022-09-20 00:05:15 +08:00
AngersZhuuuu
e48efb2e1c
[ISSUE-611][BUG] FetchHandler should handle PartitionFileSorter return null and we should enable retry for sorter exception (#615) 2022-09-19 14:51:46 +08:00
nafiy
75ca396e77
[ISSUE-600][Refactor] Translate Chinese comments to English (#605) 2022-09-15 22:24:39 +08:00
Keyong Zhou
0dc7e82006
improve revive log readability. (#603) (#604)
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
2022-09-14 23:25:49 +08:00
AngersZhuuuu
a6acaa11e0
[ISSUE-597][REFACTOR] Unify Enum type name and correct wrong UN_KOWN (#598) 2022-09-13 19:07:48 +08:00
nafiy
01d138bea4
[ISSUE-578][FEATURE] Add unit test for codec (#586) 2022-09-11 17:08:45 +08:00
Keyong Zhou
e0c4779fac
[ISSUE-591][BUG] Incorrect result when revive and split happen concur… (#592) 2022-09-10 23:30:39 +08:00
Keyong Zhou
1d7fec84da
[ISSUE-588][BUG] Fix memory leak in shuffle read (#589) 2022-09-10 22:07:13 +08:00
nafiy
0a60b21b56
[ISSUE-551][BUG] CompressionMethod and checksum are not consistent when zstd level is negative (#577) 2022-09-10 21:39:51 +08:00
Keyong Zhou
a2cd01b8ef
[ISSUE-567][FOLLOW-UP] remove entry from latestPartitionLocation in removeExpiredShuffle (#575) 2022-09-08 11:21:42 +08:00
AngersZhuuuu
da7ac1721b
[ISSUE-565][REFACTOR] Unify RPC name HeartbeatXxxxx (#566) 2022-09-07 21:33:18 +08:00
Keyong Zhou
f0b6346c9f
[ISSUE-567] Optimize LifecycleManager.getLatestPartition (#570) 2022-09-07 21:06:49 +08:00
nafiy
644471debb
[ISSUE-516][FEATURE] Worker should clean remaining directory when start before registering to Master (#540) 2022-09-06 23:37:47 +08:00
AngersZhuuuu
35d5b587ec
[Refactor] Modify package name of utils to keep consistence (#536) 2022-09-05 20:06:54 +08:00
AngersZhuuuu
f7211204f2
[ISSUE-534][REFACTOR] Refactor log when call handleGetReducerFileGroup (#535) 2022-09-05 19:48:57 +08:00
Cheng Pan
4b42219595
Remove log4j1 (#501) 2022-09-05 19:30:15 +08:00
Cheng Pan
5c2514a5c1
[WORKER] Cleanup StreamState when channel inactive (#527) 2022-09-05 11:31:03 +08:00
Cheng Pan
f00b5a39bc
Extract OpenByteArrayOutputStream (#507) 2022-09-02 21:01:58 +08:00
Cheng Pan
99e58e8e23
Improve logging for RetryingChunkClient (#470) 2022-09-02 00:44:26 +08:00
Cheng Pan
c88ce306be
Use Spotless to auto check and reformat Java/Scala code (#497) 2022-09-01 21:19:56 +08:00
AngersZhuuuu
87f529da35
[ISSUE-484][FEATURE] Add Worker related RPC metrics (#488) 2022-09-01 16:47:31 +08:00
Ethan Feng
1a1145a86f
[ISSUE-334] read shuffle from hdfs. (#481) 2022-08-31 14:51:07 +08:00
AngersZhuuuu
909ad7dc23
[ISSUE-482][REFACTOR] RetryingChunkClient should show clear error message (#483) 2022-08-30 21:03:21 +08:00
Ethan Feng
5548dcfac2
[ISSUE-476] refactor read apis to support read from hdfs (#477) 2022-08-30 11:03:30 +08:00
AngersZhuuuu
eee10032fc
[REFACTOR] Some minor changes in client module (#478) 2022-08-29 19:53:45 +08:00
Ethan Feng
eeaa28d24f
[ISSUE-440]Clean expired hdfs files and keep one replication. (#466) 2022-08-26 22:03:43 +08:00
nafiy
01a8d48b5a
[ISSUE-312][FEATURE] Support zstd compression (#451) 2022-08-26 18:07:53 +08:00
Keyong Zhou
ca3ee003d9
[ISSUE-441] Refactor cluster load check to cluster alive check (#442) 2022-08-23 23:02:23 +08:00
Keyong Zhou
6c7b159493
[ISSUE-434] Refine log (#435) 2022-08-23 14:16:38 +08:00
Keyong Zhou
9526cfb997
[ISSUE-428]Should not check blacklist when reserveSlots to avoid ping-pang situation (#432)
```
22/08/22 20:03:39 INFO LifecycleManager: Try reserve slots for application_1660226621060_0180-549 for 1 times.
22/08/22 20:03:39 WARN LifecycleManager: [reserve buffer] failed due to blacklist:
Host: 192.168.15.9
RpcPort: 37761
PushPort: 37903
FetchPort: 37517
ReplicatePort: 38449
SlotsUsed: 0()
LastHeartBeat: 0
Disks: {}
WorkerRef: NettyRpcEndpointRef(rss://WorkerEndpoint@192.168.15.9:37761)

22/08/22 20:03:41 INFO LifecycleManager: Received Blacklist from Master, blacklist: [] unkown workers: []
22/08/22 20:03:50 INFO LifecycleManager: Report Worker Failure: Buffer(
Host: 192.168.15.9
RpcPort: 37761
PushPort: 37903
FetchPort: 37517
ReplicatePort: 38449
SlotsUsed: 0()
LastHeartBeat: 0
Disks: {}
WorkerRef: NettyRpcEndpointRef(rss://WorkerEndpoint@192.168.15.9:37761)
)
```
2022-08-22 21:00:41 +08:00
Keyong Zhou
ce96e99dd8
[ISSUE-429][BUG] blacklistPartition should add worker from workersSnapshot instead of PartitionLocation (#431)
* device monitor checklist

* [ISSUE-429][BUG] blacklistPartition should add worker from workersSnapshot instead of PartitionLocation
```
22/08/22 18:21:03 WARN LifecycleManager: Do Revive for shuffle application_1660226621060_0180-298, oldPartition: PartitionLocation[226-0 192.168.15.9:37761:37903:37517:38449 Mode: Master peer: 192.168.15.6:37533:37413:37387 storage hint:StorageHint{type=MEMORY, mountPoint='/mnt/disk1', finalResult=false}], cause: StatusCode{value=PushDataFailMain}
22/08/22 18:21:03 INFO LifecycleManager: Report Worker Failure: Buffer(
Host: 192.168.15.9
RpcPort: 37761
PushPort: 37903
FetchPort: 37517
ReplicatePort: 38449
SlotsUsed: 0()
LastHeartBeat: 0
Disks: {}
WorkerRef: null
)
```
2022-08-22 21:00:24 +08:00
Keyong Zhou
11762a260b
[ISSUE-417] handleUnregisterShuffle and StageEnd trigger double handl… (#420)
1. Unregister shuffle triggers handleStageEnd
```
22/08/22 12:47:00 INFO LifecycleManager: Call StageEnd before Unregister Shuffle 60.
```
2. handleStageEnd success, maybe triggered by handleUnregisterShuffle or StageEnd
```
22/08/22 12:47:51 INFO LifecycleManager: Succeed to handle stageEnd for 60.
```
3. reports data lost
```
22/08/22 12:48:28 ERROR LifecycleManager: For 60 partition 2185-0: data lost.
22/08/22 12:48:28 ERROR LifecycleManager: Failed to handle stageEnd for 60, lost file!
```
4. report unregister success
```
22/08/22 12:48:28 INFO LifecycleManager: Unregister for 60 success.
```
2022-08-22 17:13:08 +08:00
AngersZhuuuu
50d5081922
[ISSUE-385][Feature] RetryingChunkClient openChunk don't wait for first request to each replicate (#389) 2022-08-18 17:28:15 +08:00
AngersZhuuuu
0628262634
[ISSUE-385][FEATURE] RetryingChunkClient openChunks failed should wait (#386) 2022-08-18 16:57:36 +08:00
Cheng Pan
f1f4b894af
Build: Enhance build system (#349) 2022-08-15 14:59:01 +08:00
AngersZhuuuu
ba41a2c2e8
[ISSUE-357][REFACTOR] Remove unused handleStageEnd (#358) 2022-08-15 12:26:15 +08:00
Ethan Feng
f3bcb7f6a8
[ISSUE-146]update slots distribution mechanism (#273) 2022-08-12 23:38:19 +08:00
Keyong Zhou
d166e042be
[ISSUE-329] Should not sleep if reserve slots successfully in reserveSlotsWithRetry (#330) 2022-08-12 12:27:27 +08:00
AngersZhuuuu
cf2b895afb
[ISSUE-293][REFACTOR] Init worker rpc endpoint and reserve slot in parallel to speed up register shuffle process (#294)
2022-08-03 20:00:30 +08:00
AngersZhuuuu
e57ad27887
[ISSUE-291][REFACTOR] When worker endpoint initializing failed, print clear warning log (#292) 2022-08-02 12:03:59 +08:00
dxheming
8e3f48ec12
Refactor deprecated netty ConcurrentSet (#285) 2022-07-27 20:35:46 +08:00
AngersZhuuuu
7a760466aa
[ISSUE-281][BUG] Use correct maxDestLength to check if buffer can satisfy compress result (#282) 2022-07-26 15:56:05 +08:00
AngersZhuuuu
9324b1e89a
[ISSUE-257][FEATURE] Reserve slots support customized retry times (#258) 2022-07-26 15:23:25 +08:00
AngersZhuuuu
fe17914942
Refactor pom import issue (#277) 2022-07-25 17:49:55 +08:00
Keyong Zhou
6442f38a33
[ISSUE-267] Extend API to support more partition types: MapPartition,… (#268) 2022-07-17 16:28:37 +08:00
Keyong Zhou
56a0b9072b
[ISSUE-261] Refine message class hierarchy (#266) 2022-07-16 17:00:09 +08:00
Keyong Zhou
7da8f64691
[ISSUE-262] Remove unused bootstrap (#263) 2022-07-16 11:01:44 +08:00
AngersZhuuuu
36cc234dd4
[ISSUE-246][REFACTOR] Refactor LifecycleManager to make it's code more clear and more readable (#252) 2022-07-12 15:37:49 +08:00
Keyong Zhou
691beb7889
[ISSUE-247] Extract PushHandler, FetchHandler, RpcHandler from Worker… (#251) 2022-07-12 11:40:42 +08:00
Keyong Zhou
d8c5758124
[ISSUE-249] Fix OutOfBounds when shuffle has no data(q24b) (#250) 2022-07-10 18:03:54 +08:00
AngersZhuuuu
f80c86a675
[ISSUE-222] Destroy and DestroyResponse should remove null check (#238) 2022-07-09 15:44:17 +08:00
AngersZhuuuu
49caced462
[ISSUE-222][BUG] GetReduceFileGroups should remove code about return null value (#236) 2022-07-09 12:14:08 +08:00
AngersZhuuuu
c28eeb078c
[ISSUE-222] CommitFiles and CommitFilesResponse should remove null check (#237) 2022-07-08 22:32:54 +08:00
AngersZhuuuu
6e5c282229
[ISSUE-222] GetBlacklist/GetBlacklistResponse should replace null value with empty list (#235) 2022-07-08 14:49:09 +08:00
AngersZhuuuu
d2a0ad480e
[ISSUE-222][BUG] RequestSlotResponse/RegisterShuffleResponse should handle null issue (#226) 2022-07-08 12:33:40 +08:00
AngersZhuuuu
736a3e8814
[ISSUE-222][BUG] handleChangePartitionLocation should handle oldPartition == null (#224) 2022-07-07 22:48:19 +08:00
Ethan Feng
04148fef2b
[ISSUE-228]Fix unexpected closed exceptions occurred while committing files. (#232) 2022-07-07 22:15:16 +08:00
Keyong Zhou
49f2a00943
[ISSUE-208] Refine log levels (#210) 2022-07-01 14:57:30 +08:00
AngersZhuuuu
506cc0af9c
[ISSUE-171][BUG] LifeCycleManager throw scala.collection.immutable.HashMap$HashTrieMap cannot be cast to java.util.HashMap when handle destroyBuffersWithRetry (#172)
* [ISSUE-171][BUG] LifeCycleManager throw scala.collection.immutable.HashMap$HashTrieMap cannot be cast to java.util.HashMap when handle destroyBuffersWithRetry
2022-06-28 10:45:16 +08:00
AngersZhuuuu
5c82b763eb
[ISSUE-169][FEATURE] Make app heartbeat interval can be customized (#170)
* [ISSUE-169][FEATURE] Make app heartbeat interval can be customized

* Update LifecycleManager.scala
2022-06-27 20:58:00 +08:00
mingji
d4d8eb3838
update pom version. 2022-06-24 14:28:42 +08:00
AngersZhuuuu
73b41ac8c5
[ISSUE-160] [BUG] requestReserveSlot failed loss root cause (#161) 2022-06-23 16:33:41 +08:00
AngersZhuuuu
84a281ff89
[ISSUE-158][BUG] When revive meet reserve slot failed, will throw ArrayBoundOutOfIndex exception (#159)
* [ISSUE-158][BUG] When revive meet reserve slot failed, will throw ArrayBoundOutOfIndex exception

* Update pom.xml
2022-06-23 16:15:38 +08:00
AngersZhuuuu
146f724a15
ISSUE-152. Show target host:port when push data callback onFailure (#153) 2022-06-17 22:09:17 +08:00
Ethan Feng
6811cc22fc
[issue-146] Add storage hint to indicate storage location. (#147) 2022-06-14 15:57:11 +08:00
AngersZhuuuu
b51a7626b2
[ISSUE-148][BUG] MapEnd but speculation task's inFlightBatch not cleaned (#149) 2022-06-13 15:44:06 +08:00
Ethan Feng
7d04dbab92
[BUG]Fix a null pointer exception. (#116)
* 1.Fix a null pointer exception.
2.Add partitionlocation to inflight batches to help resolve problems.
3.Reduce driver logs.
2022-05-19 11:23:34 +08:00
leoyy0316
f79e40b21d
modify CONTRIBUTING.md and move LifecycleManager to scala source (#112)
Leo Cheng <leocheng@synnex.com>
2022-05-16 19:03:40 +08:00
Ethan Feng
409da82964
[Bug]fix stuck under high memory pressure. (#90) 2022-04-14 18:53:39 +08:00
Ethan Feng
9ad8254b0a
AQE support. (#67) 2022-04-01 20:19:01 +08:00
AngersZhuuuu
86bbeea9b4
[BUG] Register shuffle with configurable retry times and retry wait time (#83) 2022-04-01 16:59:37 +08:00
AngersZhuuuu
4bd3a539a5
[ISSUE-80] When rss is in blacklist and failed for reserve, rpcRef could be null (#81) 2022-03-29 21:12:37 +08:00
Keyong Zhou
4f66849d6a
fix NPE in LifecycleManager.handleGetBlacklist (#59) 2022-02-16 12:17:41 +08:00
Ethan Feng
356a1952e4
Multi Client Support (#47)
Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2022-01-29 22:28:06 +08:00
Ethan Feng
bc1adac90e
[FEATURE]Worker-Wise Current-Limiting (#44) 2022-01-26 15:27:00 +08:00
Tony Doen
302891a1b9
[BUG] ClusterLoadFallbackPolicy is not strictness when a shuffle with big partitions to register (#30) 2022-01-26 15:16:01 +08:00
Keyong Zhou
31dc2cf7da
[BUG] Record failed worker in LifecycleManager instead of reporting to Master (#34) 2022-01-07 12:18:56 +08:00
zky.zhoukeyong
ba5920acde
Initial Commit for RSS 2021-12-28 20:57:35 +08:00