Angerszhuuuu
|
4f85d80687
|
[CELEBORN-606] Refine CommitHandler's noisy log (#1511)
|
2023-05-24 15:25:10 +08:00 |
|
Angerszhuuuu
|
811e192bbd
|
[CELEBORN-446] Support rack aware during assign slots for ROUNDROBIN (#1370)
|
2023-05-18 13:58:51 +08:00 |
|
Angerszhuuuu
|
a22c61e479
|
[CELEBORN-582] Celeborn should handle InterruptedException during kill task properly (#1486)
|
2023-05-17 18:17:41 +08:00 |
|
zhongqiangchen
|
5769c3fdc7
|
[CELEBORN-552] Add HeartBeat between the client and worker to keep alive (#1457)
|
2023-05-10 19:35:51 +08:00 |
|
Shuang
|
fb753fd48e
|
[CELEBORN-573] Guarantee resource/app/worker change persistent to raft in Ha Mode. (#1477)
|
2023-05-10 14:28:52 +08:00 |
|
Angerszhuuuu
|
778b5440bc
|
[CELEBORN-556][BUG] ReserveSlot should not use default RPC time out since register shuffle max timeout is network timeout (#1461)
|
2023-05-10 12:29:06 +08:00 |
|
Angerszhuuuu
|
c0a9578d9f
|
[CELEBORN-563] Remove unnecessary code (#1469)
|
2023-05-06 11:25:31 +08:00 |
|
Angerszhuuuu
|
783d4e5dc5
|
[CELEBORN-551] Remove unnecessary ShuffleClient.get() (#1456)
|
2023-05-04 20:47:45 +08:00 |
|
Angerszhuuuu
|
a108d6f837
|
[CELEBORN-559][IMPROVEMENT] createReader should also wait for retry when change to same peer (#1465)
|
2023-05-04 10:51:15 +08:00 |
|
Angerszhuuuu
|
ef4c12e0fe
|
[CELEBORN-565] FETCH_MAX_RETRIES should double when enable replicates (#1471)
|
2023-04-28 14:27:35 +08:00 |
|
Angerszhuuuu
|
8d933691ae
|
[CELEBORN-479][FOLLOWUP] Add push task should check if loc is null (#1404)
|
2023-04-28 11:19:35 +08:00 |
|
Angerszhuuuu
|
bfce6052d7
|
[CELEBORN-560][FOLLOWUP] Follow the original design for handling rerun & speculative task after handleStageEnd (#1468)
|
2023-04-28 11:18:42 +08:00 |
|
Angerszhuuuu
|
7a4f2ebd8a
|
[CELEBORN-547] Refactor request related API (#1452)
|
2023-04-27 16:25:41 +08:00 |
|
Angerszhuuuu
|
ce21a738a9
|
[CELEBORN-560][BUG] Rerun task in spark later then RSS stageEnd cause NPE then job failed (#1466)
|
2023-04-27 14:16:32 +08:00 |
|
Angerszhuuuu
|
be84e8ba0d
|
[CELEBORN-562][REFACTOR] Rename Destroy and DestroyResponse to make it more clear (#1467)
|
2023-04-27 12:31:32 +08:00 |
|
Shuang
|
64a4f7274c
|
[CELEBORN-554][Tuning] Improve For LM to avoid reserve/commit empty worker resources (#1459)
|
2023-04-26 18:04:50 +08:00 |
|
Angerszhuuuu
|
4bbc8aec4f
|
[CELEBORN-555][REFACTOR] Avoid prin noisy blacklist info when record blacklist (#1460)
|
2023-04-26 16:45:44 +08:00 |
|
Shuang
|
343f1e62d2
|
[CELEBORN-537][FOLLOWUP] Fix blacklist potentially lost failure workers (#1449)
|
2023-04-23 10:16:21 +08:00 |
|
Angerszhuuuu
|
17ae0cd9b1
|
[CELEBORN-541][FOLLOWUP] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled (#1448)
|
2023-04-23 10:15:41 +08:00 |
|
Shuang
|
d68deecaaa
|
[CELEBORN-546][FLINK] Use autoIncrement partitionId replace encode(mapId, attemptId) for generating partitionId (#1447)
|
2023-04-22 16:33:22 +08:00 |
|
Angerszhuuuu
|
e3ae2f0e17
|
[CELEBORN-541][FOLLOWUP] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled (#1445)
|
2023-04-21 17:26:52 +08:00 |
|
Angerszhuuuu
|
16d193071f
|
[CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled (#1444)
|
2023-04-21 17:04:52 +08:00 |
|
Shuang
|
62d60de8c5
|
[CELEBORN-537] Improve blacklist compute & minor fix for Flink (#1441)
|
2023-04-20 18:30:10 +08:00 |
|
Ethan Feng
|
6378a386d0
|
[CELEBORN-530][REFACTOR] Move stream manager and memory manager to worker module. (#1439)
|
2023-04-20 10:17:26 +08:00 |
|
Angerszhuuuu
|
d53cf40728
|
[CELEBRON-528][REFACTOR] RegisterShuffle 's log should show clear belongs to which shuffle (#1434)
|
2023-04-17 16:19:29 +08:00 |
|
Shuang
|
412d10b7dc
|
[CELEBORN-479][FLINK] support stopTrackingAndReleasePartitions when worker is not available (#1405)
|
2023-04-17 14:44:24 +08:00 |
|
cxzl25
|
13f772e0c0
|
[CELEBORN-525] Fix wrong parameter celeborn.push.buffer.size
|
2023-04-14 20:45:25 +08:00 |
|
Angerszhuuuu
|
e5722126e9
|
[CELEBORN-502][REFACTOR] Merge GetBlacklistResponse to HeartbeatFromApplication (#1408)
|
2023-04-12 14:59:32 +08:00 |
|
Shuang
|
a892640353
|
[CELEBORN-503][FLINK] fix attempt task may use wrong partitionId. (#1409)
|
2023-04-04 15:46:35 +08:00 |
|
Angerszhuuuu
|
015788dd28
|
[CELEBORN-484][FOLLOWUP] Return shutting worker is empty also need to retain LifecycleManager's shutting workers (#1403)
|
2023-04-03 16:37:46 +08:00 |
|
Angerszhuuuu
|
bf46336d54
|
[CELEBORN-487][PERF] ShuffleClientSide support blacklist to avoid client side timeout in same worker multiple times (#1399)
|
2023-04-03 11:50:04 +08:00 |
|
Angerszhuuuu
|
b4f8ab19bd
|
[CELEBORN-484][PERF] Master trigger LifecycleManager commit shutdown worker's partition location. (#1395)
|
2023-04-02 09:18:12 +08:00 |
|
Aravind Patnam
|
2c3005ad5b
|
[CELEBORN-491] Improve exception logging in RssInputStream (#1398)
|
2023-03-30 10:21:07 +08:00 |
|
Angerszhuuuu
|
9d9a2d4ea8
|
[CELEBORN-479][FOLLOWUP] Return empty tasks instead of null to avoid NPE (#1388)
|
2023-03-27 17:03:06 +08:00 |
|
Keyong Zhou
|
cb19ed1c66
|
[CELEBORN-479][PERF] Refactor DataPushQueue.takePushTask to avoid busy wait (#1386)
|
2023-03-27 16:18:55 +08:00 |
|
Fei Wang
|
b40c573069
|
[CELEBORN-474][FOLLOWUP] Using inner static ConcurrentHashMap class and only apply for JDK8 (#1384)
|
2023-03-27 16:16:23 +08:00 |
|
Fei Wang
|
0d1695abd8
|
[CELEBORN-473] Enable file system cache for viewfs in ShuffleClient as well
|
2023-03-26 09:44:04 +08:00 |
|
Fei Wang
|
7c444cb0c5
|
[CELEBORN-474] Speed up ConcurrentHashMap#computeIfAbsent (#1383)
|
2023-03-26 09:41:59 +08:00 |
|
Fei Wang
|
c609c0ebaa
|
[MINOR] Fix typo and remove unused code (#1381)
* fix typo
* remove unused
|
2023-03-25 23:20:33 +08:00 |
|
Shuang
|
89b3f3887d
|
[CELEBORN-356] [FLINK] Support release single partition resource (#1314)
|
2023-03-24 17:15:28 +08:00 |
|
cxzl25
|
2adbce942a
|
[CELEBORN-471] Fix String.format wrong type in ShuffleClientImpl (#1378)
|
2023-03-24 16:05:48 +08:00 |
|
Keyong Zhou
|
107868d4f1
|
[CELEBORN-441][FLINK] Move ShuffleTaskInfo to Flink Plugin (#1361)
|
2023-03-20 13:31:53 +08:00 |
|
Keyong Zhou
|
9401db2bc8
|
[CELEBORN-443] Code refine for client and common (#1362)
|
2023-03-20 10:37:43 +08:00 |
|
Keyong Zhou
|
21bdfdb21b
|
[CELEBORN-390][FLINK] Refine synchronization in FlinkShuffleClientImpl#updateFileGroup (#1320)
|
2023-03-09 16:49:18 +08:00 |
|
zhongqiangchen
|
9dc1bc2b1c
|
[CELEBORN-367] [FLINK] Move pushdata functions used by mappartition from ShuffleClientImpl to FlinkShuffleClientImpl (#1295)
|
2023-03-02 18:50:38 +08:00 |
|
Angerszhuuuu
|
786fcd6744
|
[CELEBORN-336] Revive Failed should use keep the corresponding StatusCode (#1283)
|
2023-03-01 18:57:51 +08:00 |
|
Shuang
|
bc7da3154f
|
[CELEBORN-354][Flink] fix succeedPartitionIds may contain new added partitionIds (#1289)
|
2023-03-01 15:45:24 +08:00 |
|
Angerszhuuuu
|
eda21ead24
|
[CELEBORN-344] Change PUSH_DATA_FAIL_MASTER/SALVE to PUSH_DATA_WRITE_FAIL_MASTER/SALVE (#1281)
|
2023-02-28 11:29:40 +08:00 |
|
Keyong Zhou
|
7adf1fca41
|
[CELEBORN-295] Optimize data push (#1232)
* [CELEBORN-295] Add double buffer for sort pusher
|
2023-02-28 10:35:55 +08:00 |
|
Angerszhuuuu
|
24f5478adc
|
[CELEBORN-338] Clean duplicated exception message of handling push data (#1274)
|
2023-02-28 10:35:18 +08:00 |
|
Shuang
|
935806f036
|
[CELEBORN-341][Flink] cache file group for map partition in Flink plugin (#1277)
|
2023-02-26 20:31:20 +08:00 |
|
Angerszhuuuu
|
a7587c3fe7
|
[CELEBORN-337] Remove unnecessary StatusCode.message (#1272)
|
2023-02-24 15:11:07 +08:00 |
|
Angerszhuuuu
|
81f7ffd767
|
[CELEBORN-332] Unify the log of ShuffleClientImpl (#1267)
|
2023-02-24 14:07:25 +08:00 |
|
Angerszhuuuu
|
3067efcfd3
|
[CELEBORN-331] submitRetryPushData should throw PUSH_DATA_CREATE_CONNECTION_FAIL_MASTER too (#1266)
|
2023-02-23 14:57:11 +08:00 |
|
Angerszhuuuu
|
f7948190cf
|
[CELEBORN-316][FOLLOWUP] Should not wrap CelebornIOException with CelebornIOException (#1264)
|
2023-02-23 11:48:46 +08:00 |
|
Angerszhuuuu
|
1132cc25ab
|
[CELEBORN-328][MPROVEMENT] Too much noisy log when reserve slot failed (#1262)
|
2023-02-22 17:19:52 +08:00 |
|
Angerszhuuuu
|
322f0d2b41
|
[CELEBORN-316] Wrap Celeborn exception with CelebornIOException (#1253)
|
2023-02-22 16:10:11 +08:00 |
|
Shuang
|
3da615972e
|
[CELEBORN-326)] [Flink] lifecycleManager supports flink-yarn-session mode to handle multiple Flink jobs. (#1260)
|
2023-02-22 15:37:24 +08:00 |
|
Angerszhuuuu
|
251b923b5b
|
[CELEBORN-321] When register shuffle failed, DataPushQueue should directly take the task queue to avoid NPE (#1258)
|
2023-02-21 17:02:37 +08:00 |
|
Shuang
|
61065230bd
|
[CELEBORN-311] not retry when register for map partition occurs exception (#1246)
|
2023-02-21 10:16:10 +08:00 |
|
Ethan Feng
|
bfb39632d9
|
[CELEBORN-235] Implement flink plugin. (#1244)
|
2023-02-17 19:31:12 +08:00 |
|
zhongqiangchen
|
b5dc106af8
|
[CELEBORN-291] optimize shuffleclientimpl creating client and pushdata for mappartition (#1224)
|
2023-02-17 19:07:19 +08:00 |
|
Shuang
|
b7ef9cf216
|
[CELEBORN-297] don't cache file groups for map partition shuffle type (#1237)
|
2023-02-17 11:28:47 +08:00 |
|
Angerszhuuuu
|
57f775a7e9
|
[CELEBORN-273] Move push data timeout checker into TransportResponseHandler to keep callback status consistence (#1208)
|
2023-02-16 18:27:37 +08:00 |
|
jiaoqingbo
|
318157e3e9
|
[CELEBORN-305] Change the parameter passed in the registerShuffle method to numPartitions instead of numMappers (#1240)
|
2023-02-15 17:35:43 +08:00 |
|
jiaoqingbo
|
bd9e0ddc1f
|
[CELEBORN-304] Missing setIfMissing celeborn.$module.io.serverThreads (#1238)
|
2023-02-15 15:49:08 +08:00 |
|
Shuang
|
75c83093f2
|
[CELEBORN-296] fix map partition commit using wrong partitionId and result (#1233)
|
2023-02-14 20:54:06 +08:00 |
|
Rex(Hui) An
|
bff6e91e0b
|
[CELEBORN-227] Support different push strategies to control the push speed (#1167)
|
2023-02-07 14:24:30 +08:00 |
|
Angerszhuuuu
|
ff683ffc91
|
[CELEBORN-238][IMPROVEMENT] Revive caused by PUSH_DATA_TIMEOUT_MASTER and PUSH_DATA_TIMEOUT_SLAVE should add corresponding worker into blacklist (#1180)
|
2023-02-03 17:47:24 +08:00 |
|
Angerszhuuuu
|
4b6f7e4593
|
[CELEBORN-239][IMPROVEMENT] Worker replicate should enable push data timeout too (#1185)
|
2023-02-03 11:53:15 +08:00 |
|
Rex(Hui) An
|
021004714b
|
[CELEBORN-264] InFlight requests should not be expired if it's not pushed yet (#1196)
|
2023-02-01 22:16:55 +08:00 |
|
Shuang
|
7162be2fae
|
[CELEBORN-201] Separate partitionLocationInfo in LifecycleManager and worker (#1149)
|
2023-01-31 18:53:36 +08:00 |
|
Angerszhuuuu
|
1311fb53d1
|
[CELEBORN-243][CELEBORN-245][IMPROVEMENT] Create push client failed and connection failed cause push failed should have their own ERROR type (#1181)
* [CELEBORN-243][IMPROVEMENT] Create push client failed should have a ERROR type
|
2023-01-30 17:47:22 +08:00 |
|
Angerszhuuuu
|
8611a64400
|
[CELEBORN-237][IMPROVEMENT] push failed error message should show partition info (#1178)
|
2023-01-28 18:41:54 +08:00 |
|
Keyong Zhou
|
e47f1e33b0
|
[CELEBORN-55][FOLLOWUP] Code refine (#1175)
|
2023-01-20 16:22:47 +08:00 |
|
zy.jordan
|
c5be79ee3d
|
[CELEBORN-55][FEATURE] Split maxReqsInFlight limitation into level of target worker (#1102)
|
2023-01-20 10:18:45 +08:00 |
|
zhongqiangczq
|
1836fe187b
|
[CELEBORN-197] in mappartition, check transportClient whether changed while sending messages (#1145)
|
2023-01-13 16:45:26 +08:00 |
|
Shuang
|
810a8d01e0
|
[CELEBORN-212] refresh client if current client is inactive. (#1159)
|
2023-01-11 11:54:50 +08:00 |
|
Shuang
|
1332362bff
|
[CELEBORN-213] Add configuration for whether to close idle connections in client side (#1157)
|
2023-01-10 19:13:33 +08:00 |
|
Angerszhuuuu
|
e155ec122a
|
[CELEBORN-190] doPushMergedData should also support revive multiple times, not only twice (#1136)
|
2023-01-10 11:39:40 +08:00 |
|
Shuang
|
2ec06472fe
|
[CELEBORN-203] fix NPE when removeExpiredShuffle in LifecycleManager. (#1151)
|
2023-01-06 18:32:17 +08:00 |
|
Angerszhuuuu
|
0d5809ff0c
|
[CELEBORN-192][IMPROVEMENT] Change FAILED status to REQUEST_FAILED since it's all used when RPC request failed. (#1139)
|
2023-01-06 16:53:04 +08:00 |
|
Shuang
|
3b2be25a50
|
[CELEBORN-173] refactor minicluster and fix ut (#1147)
|
2023-01-05 20:39:19 +08:00 |
|
Angerszhuuuu
|
415452d9c4
|
[CELEBORN-189][IMPROVEMENT] PushDataFailedSlave should add slave worker to blacklist (#1135)
|
2023-01-05 20:12:07 +08:00 |
|
Angerszhuuuu
|
fe8dfb05f3
|
[CELEBORN-196][REFACTOR] Rename batchHandleRequestPartitions to handleRequestPartitions (#1144)
|
2023-01-05 14:37:10 +08:00 |
|
Angerszhuuuu
|
2315f2f988
|
[CELEBORN-191][BUG] ShuffleClient registerShuffle return RESERVE_SLOTS_FAILED should also been print out (#1138)
|
2023-01-03 17:13:31 +08:00 |
|
Shuang
|
5cba307189
|
[CELEBORN-146] refactor ShuffleMapperAttempts & GetReducerFileGroup (#1116)
|
2022-12-30 18:15:23 +08:00 |
|
Cheng Pan
|
b8758a7cb6
|
[CELEBORN-181][TEST] Rename RssFunSuite to CelebornFunSuite (#1125)
|
2022-12-29 18:10:14 +08:00 |
|
RexAn
|
6432a129be
|
[CELEBORN-61][CELEBORN-62][FOLLOW_UP] Fix some issues for slow start (#1119)
|
2022-12-29 12:07:20 +08:00 |
|
Binjie Yang
|
63943cd5cc
|
[CELEBORN-147][IT]Extraction of common integration test cases (#1092)
|
2022-12-29 12:03:09 +08:00 |
|
Keyong Zhou
|
2f0682265e
|
[CELEBORN-119] Add timeout for pushdata (#1097)
|
2022-12-20 20:40:42 +08:00 |
|
Keyong Zhou
|
a2dd72f20c
|
[CELEBORN-155] Wrong TimeUnit for registerShuffleRetryWait in Shuffle… (#1099)
|
2022-12-19 17:32:18 +08:00 |
|
Shuang
|
13769f0f0a
|
[CELEBORN-121] Refactor batchHandleCommitPartition (#1089)
|
2022-12-19 12:39:39 +08:00 |
|
Ethan Feng
|
39394526a8
|
[CELEBORN-142]Keep committed partition locations semantic consistent when commit files on HDFS. (#1091)
|
2022-12-16 19:02:02 +08:00 |
|
nafiy
|
ddab27a1d7
|
[CELEBORN-145][REFACTOR] Add reason in CheckQuotaResponse (#1093)
|
2022-12-15 18:16:34 +08:00 |
|
Ethan Feng
|
65cb36c002
|
[CELEBORN-83][FOLLOWUP] Fix various bugs when using HDFS as storage. (#1065)
|
2022-12-15 15:20:29 +08:00 |
|
Shuang
|
e3576e4e7a
|
[CELEBORN-117] refactor CommitManager, implements M/R Partition Commi… (#1060)
|
2022-12-15 11:09:59 +08:00 |
|
Cheng Pan
|
ec371c0026
|
[CELEBORN-132] ShuffleClient should not implement Cloneable (#1077)
|
2022-12-14 10:04:39 +08:00 |
|
Angerszhuuuu
|
c924a4ff0d
|
[CELEBORN-61][CELEBORN-62][FEATURE] Shuffle client support slow start, congestion avoidance and congestion control (#1052)
|
2022-12-08 12:41:34 +08:00 |
|
zhongqiangczq
|
60f6f87832
|
[CELEBORN-11] ShuffleClient supports MapPartition shuffle write:pushdata (#1036)
|
2022-12-08 12:31:47 +08:00 |
|