312 Commits

Author SHA1 Message Date
Angerszhuuuu
4f85d80687
[CELEBORN-606] Refine CommitHandler's noisy log (#1511) 2023-05-24 15:25:10 +08:00
Angerszhuuuu
811e192bbd
[CELEBORN-446] Support rack aware during assign slots for ROUNDROBIN (#1370) 2023-05-18 13:58:51 +08:00
Angerszhuuuu
a22c61e479
[CELEBORN-582] Celeborn should handle InterruptedException during kill task properly (#1486) 2023-05-17 18:17:41 +08:00
zhongqiangchen
5769c3fdc7
[CELEBORN-552] Add HeartBeat between the client and worker to keep alive (#1457) 2023-05-10 19:35:51 +08:00
Shuang
fb753fd48e
[CELEBORN-573] Guarantee resource/app/worker change persistent to raft in Ha Mode. (#1477) 2023-05-10 14:28:52 +08:00
Angerszhuuuu
778b5440bc
[CELEBORN-556][BUG] ReserveSlot should not use default RPC time out since register shuffle max timeout is network timeout (#1461) 2023-05-10 12:29:06 +08:00
Angerszhuuuu
c0a9578d9f
[CELEBORN-563] Remove unnecessary code (#1469) 2023-05-06 11:25:31 +08:00
Angerszhuuuu
783d4e5dc5
[CELEBORN-551] Remove unnecessary ShuffleClient.get() (#1456) 2023-05-04 20:47:45 +08:00
Angerszhuuuu
a108d6f837
[CELEBORN-559][IMPROVEMENT] createReader should also wait for retry when change to same peer (#1465) 2023-05-04 10:51:15 +08:00
Angerszhuuuu
ef4c12e0fe
[CELEBORN-565] FETCH_MAX_RETRIES should double when enable replicates (#1471) 2023-04-28 14:27:35 +08:00
Angerszhuuuu
8d933691ae
[CELEBORN-479][FOLLOWUP] Add push task should check if loc is null (#1404) 2023-04-28 11:19:35 +08:00
Angerszhuuuu
bfce6052d7
[CELEBORN-560][FOLLOWUP] Follow the original design for handling rerun & speculative task after handleStageEnd (#1468) 2023-04-28 11:18:42 +08:00
Angerszhuuuu
7a4f2ebd8a
[CELEBORN-547] Refactor request related API (#1452) 2023-04-27 16:25:41 +08:00
Angerszhuuuu
ce21a738a9
[CELEBORN-560][BUG] Rerun task in spark later then RSS stageEnd cause NPE then job failed (#1466) 2023-04-27 14:16:32 +08:00
Angerszhuuuu
be84e8ba0d
[CELEBORN-562][REFACTOR] Rename Destroy and DestroyResponse to make it more clear (#1467) 2023-04-27 12:31:32 +08:00
Shuang
64a4f7274c
[CELEBORN-554][Tuning] Improve For LM to avoid reserve/commit empty worker resources (#1459) 2023-04-26 18:04:50 +08:00
Angerszhuuuu
4bbc8aec4f
[CELEBORN-555][REFACTOR] Avoid prin noisy blacklist info when record blacklist (#1460)
2023-04-26 16:45:44 +08:00
Shuang
343f1e62d2
[CELEBORN-537][FOLLOWUP] Fix blacklist potentially lost failure workers (#1449) 2023-04-23 10:16:21 +08:00
Angerszhuuuu
17ae0cd9b1
[CELEBORN-541][FOLLOWUP] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled (#1448)
* [CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled
2023-04-23 10:15:41 +08:00
Shuang
d68deecaaa
[CELEBORN-546][FLINK] Use autoIncrement partitionId replace encode(mapId, attemptId) for generating partitionId (#1447) 2023-04-22 16:33:22 +08:00
Angerszhuuuu
e3ae2f0e17
[CELEBORN-541][FOLLOWUP] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled (#1445)
2023-04-21 17:26:52 +08:00
Angerszhuuuu
16d193071f
[CELEBORN-541][PERF] handleGetReducerFileGroup occupy too much RPC thread cause other RPC can't been handled (#1444)
2023-04-21 17:04:52 +08:00
Shuang
62d60de8c5
[CELEBORN-537] Improve blacklist compute & minor fix for Flink (#1441)
2023-04-20 18:30:10 +08:00
Ethan Feng
6378a386d0
[CELEBORN-530][REFACTOR] Move stream manager and memory manager to worker module. (#1439) 2023-04-20 10:17:26 +08:00
Angerszhuuuu
d53cf40728
[CELEBRON-528][REFACTOR] RegisterShuffle 's log should show clear belongs to which shuffle (#1434) 2023-04-17 16:19:29 +08:00
Shuang
412d10b7dc
[CELEBORN-479][FLINK] support stopTrackingAndReleasePartitions when worker is not available (#1405) 2023-04-17 14:44:24 +08:00
cxzl25
13f772e0c0
[CELEBORN-525] Fix wrong parameter celeborn.push.buffer.size 2023-04-14 20:45:25 +08:00
Angerszhuuuu
e5722126e9
[CELEBORN-502][REFACTOR] Merge GetBlacklistResponse to HeartbeatFromApplication (#1408)
2023-04-12 14:59:32 +08:00
Shuang
a892640353
[CELEBORN-503][FLINK] fix attempt task may use wrong partitionId. (#1409) 2023-04-04 15:46:35 +08:00
Angerszhuuuu
015788dd28
[CELEBORN-484][FOLLOWUP] Return shutting worker is empty also need to retain LifecycleManager's shutting workers (#1403) 2023-04-03 16:37:46 +08:00
Angerszhuuuu
bf46336d54
[CELEBORN-487][PERF] ShuffleClientSide support blacklist to avoid client side timeout in same worker multiple times (#1399) 2023-04-03 11:50:04 +08:00
Angerszhuuuu
b4f8ab19bd
[CELEBORN-484][PERF] Master trigger LifecycleManager commit shutdown worker's partition location. (#1395)
2023-04-02 09:18:12 +08:00
Aravind Patnam
2c3005ad5b
[CELEBORN-491] Improve exception logging in RssInputStream (#1398) 2023-03-30 10:21:07 +08:00
Angerszhuuuu
9d9a2d4ea8
[CELEBORN-479][FOLLOWUP] Return empty tasks instead of null to avoid NPE (#1388) 2023-03-27 17:03:06 +08:00
Keyong Zhou
cb19ed1c66
[CELEBORN-479][PERF] Refactor DataPushQueue.takePushTask to avoid busy wait (#1386) 2023-03-27 16:18:55 +08:00
Fei Wang
b40c573069
[CELEBORN-474][FOLLOWUP] Using inner static ConcurrentHashMap class and only apply for JDK8 (#1384) 2023-03-27 16:16:23 +08:00
Fei Wang
0d1695abd8
[CELEBORN-473] Enable file system cache for viewfs in ShuffleClient as well 2023-03-26 09:44:04 +08:00
Fei Wang
7c444cb0c5
[CELEBORN-474] Speed up ConcurrentHashMap#computeIfAbsent (#1383) 2023-03-26 09:41:59 +08:00
Fei Wang
c609c0ebaa
[MINOR] Fix typo and remove unused code (#1381)
* fix typo

* remove unused
2023-03-25 23:20:33 +08:00
Shuang
89b3f3887d
[CELEBORN-356] [FLINK] Support release single partition resource (#1314) 2023-03-24 17:15:28 +08:00
cxzl25
2adbce942a
[CELEBORN-471] Fix String.format wrong type in ShuffleClientImpl (#1378) 2023-03-24 16:05:48 +08:00
Keyong Zhou
107868d4f1
[CELEBORN-441][FLINK] Move ShuffleTaskInfo to Flink Plugin (#1361) 2023-03-20 13:31:53 +08:00
Keyong Zhou
9401db2bc8
[CELEBORN-443] Code refine for client and common (#1362) 2023-03-20 10:37:43 +08:00
Keyong Zhou
21bdfdb21b
[CELEBORN-390][FLINK] Refine synchronization in FlinkShuffleClientImpl#updateFileGroup (#1320) 2023-03-09 16:49:18 +08:00
zhongqiangchen
9dc1bc2b1c
[CELEBORN-367] [FLINK] Move pushdata functions used by mappartition from ShuffleClientImpl to FlinkShuffleClientImpl (#1295) 2023-03-02 18:50:38 +08:00
Angerszhuuuu
786fcd6744
[CELEBORN-336] Revive Failed should use keep the corresponding StatusCode (#1283)
2023-03-01 18:57:51 +08:00
Shuang
bc7da3154f
[CELEBORN-354][Flink] fix succeedPartitionIds may contain new added partitionIds (#1289) 2023-03-01 15:45:24 +08:00
Angerszhuuuu
eda21ead24
[CELEBORN-344] Change PUSH_DATA_FAIL_MASTER/SALVE to PUSH_DATA_WRITE_FAIL_MASTER/SALVE (#1281) 2023-02-28 11:29:40 +08:00
Keyong Zhou
7adf1fca41
[CELEBORN-295] Optimize data push (#1232)
* [CELEBORN-295] Add double buffer for sort pusher
2023-02-28 10:35:55 +08:00
Angerszhuuuu
24f5478adc
[CELEBORN-338] Clean duplicated exception message of handling push data (#1274) 2023-02-28 10:35:18 +08:00
Shuang
935806f036
[CELEBORN-341][Flink] cache file group for map partition in Flink plugin (#1277) 2023-02-26 20:31:20 +08:00
Angerszhuuuu
a7587c3fe7
[CELEBORN-337] Remove unnecessary StatusCode.message (#1272)
2023-02-24 15:11:07 +08:00
Angerszhuuuu
81f7ffd767
[CELEBORN-332] Unify the log of ShuffleClientImpl (#1267)
2023-02-24 14:07:25 +08:00
Angerszhuuuu
3067efcfd3
[CELEBORN-331] submitRetryPushData should throw PUSH_DATA_CREATE_CONNECTION_FAIL_MASTER too (#1266)
2023-02-23 14:57:11 +08:00
Angerszhuuuu
f7948190cf
[CELEBORN-316][FOLLOWUP] Should not wrap CelebornIOException with CelebornIOException (#1264) 2023-02-23 11:48:46 +08:00
Angerszhuuuu
1132cc25ab
[CELEBORN-328][MPROVEMENT] Too much noisy log when reserve slot failed (#1262) 2023-02-22 17:19:52 +08:00
Angerszhuuuu
322f0d2b41
[CELEBORN-316] Wrap Celeborn exception with CelebornIOException (#1253) 2023-02-22 16:10:11 +08:00
Shuang
3da615972e
[CELEBORN-326)] [Flink] lifecycleManager supports flink-yarn-session mode to handle multiple Flink jobs. (#1260) 2023-02-22 15:37:24 +08:00
Angerszhuuuu
251b923b5b
[CELEBORN-321] When register shuffle failed, DataPushQueue should directly take the task queue to avoid NPE (#1258) 2023-02-21 17:02:37 +08:00
Shuang
61065230bd
[CELEBORN-311] not retry when register for map partition occurs exception (#1246) 2023-02-21 10:16:10 +08:00
Ethan Feng
bfb39632d9
[CELEBORN-235] Implement flink plugin. (#1244) 2023-02-17 19:31:12 +08:00
zhongqiangchen
b5dc106af8
[CELEBORN-291] optimize shuffleclientimpl creating client and pushdata for mappartition (#1224) 2023-02-17 19:07:19 +08:00
Shuang
b7ef9cf216
[CELEBORN-297] don't cache file groups for map partition shuffle type (#1237) 2023-02-17 11:28:47 +08:00
Angerszhuuuu
57f775a7e9
[CELEBORN-273] Move push data timeout checker into TransportResponseHandler to keep callback status consistence (#1208) 2023-02-16 18:27:37 +08:00
jiaoqingbo
318157e3e9
[CELEBORN-305] Change the parameter passed in the registerShuffle method to numPartitions instead of numMappers (#1240) 2023-02-15 17:35:43 +08:00
jiaoqingbo
bd9e0ddc1f
[CELEBORN-304] Missing setIfMissing celeborn.$module.io.serverThreads (#1238) 2023-02-15 15:49:08 +08:00
Shuang
75c83093f2
[CELEBORN-296] fix map partition commit using wrong partitionId and result (#1233) 2023-02-14 20:54:06 +08:00
Rex(Hui) An
bff6e91e0b
[CELEBORN-227] Support different push strategies to control the push speed (#1167) 2023-02-07 14:24:30 +08:00
Angerszhuuuu
ff683ffc91
[CELEBORN-238][IMPROVEMENT] Revive caused by PUSH_DATA_TIMEOUT_MASTER and PUSH_DATA_TIMEOUT_SLAVE should add corresponding worker into blacklist (#1180) 2023-02-03 17:47:24 +08:00
Angerszhuuuu
4b6f7e4593
[CELEBORN-239][IMPROVEMENT] Worker replicate should enable push data timeout too (#1185) 2023-02-03 11:53:15 +08:00
Rex(Hui) An
021004714b
[CELEBORN-264] InFlight requests should not be expired if it's not pushed yet (#1196) 2023-02-01 22:16:55 +08:00
Shuang
7162be2fae
[CELEBORN-201] Separate partitionLocationInfo in LifecycleManager and worker (#1149) 2023-01-31 18:53:36 +08:00
Angerszhuuuu
1311fb53d1
[CELEBORN-243][CELEBORN-245][IMPROVEMENT] Create push client failed and connection failed cause push failed should have their own ERROR type (#1181)
* [CELEBORN-243][IMPROVEMENT] Create push client failed should have a ERROR type
2023-01-30 17:47:22 +08:00
Angerszhuuuu
8611a64400
[CELEBORN-237][IMPROVEMENT] push failed error message should show partition info (#1178)
2023-01-28 18:41:54 +08:00
Keyong Zhou
e47f1e33b0
[CELEBORN-55][FOLLOWUP] Code refine (#1175) 2023-01-20 16:22:47 +08:00
zy.jordan
c5be79ee3d
[CELEBORN-55][FEATURE] Split maxReqsInFlight limitation into level of target worker (#1102) 2023-01-20 10:18:45 +08:00
zhongqiangczq
1836fe187b
[CELEBORN-197] in mappartition, check transportClient whether changed while sending messages (#1145) 2023-01-13 16:45:26 +08:00
Shuang
810a8d01e0
[CELEBORN-212] refresh client if current client is inactive. (#1159) 2023-01-11 11:54:50 +08:00
Shuang
1332362bff
[CELEBORN-213] Add configuration for whether to close idle connections in client side (#1157) 2023-01-10 19:13:33 +08:00
Angerszhuuuu
e155ec122a
[CELEBORN-190] doPushMergedData should also support revive multiple times, not only twice (#1136) 2023-01-10 11:39:40 +08:00
Shuang
2ec06472fe
[CELEBORN-203] fix NPE when removeExpiredShuffle in LifecycleManager. (#1151) 2023-01-06 18:32:17 +08:00
Angerszhuuuu
0d5809ff0c
[CELEBORN-192][IMPROVEMENT] Change FAILED status to REQUEST_FAILED since it's all used when RPC request failed. (#1139) 2023-01-06 16:53:04 +08:00
Shuang
3b2be25a50
[CELEBORN-173] refactor minicluster and fix ut (#1147) 2023-01-05 20:39:19 +08:00
Angerszhuuuu
415452d9c4
[CELEBORN-189][IMPROVEMENT] PushDataFailedSlave should add slave worker to blacklist (#1135) 2023-01-05 20:12:07 +08:00
Angerszhuuuu
fe8dfb05f3
[CELEBORN-196][REFACTOR] Rename batchHandleRequestPartitions to handleRequestPartitions (#1144) 2023-01-05 14:37:10 +08:00
Angerszhuuuu
2315f2f988
[CELEBORN-191][BUG] ShuffleClient registerShuffle return RESERVE_SLOTS_FAILED should also been print out (#1138) 2023-01-03 17:13:31 +08:00
Shuang
5cba307189
[CELEBORN-146] refactor ShuffleMapperAttempts & GetReducerFileGroup (#1116) 2022-12-30 18:15:23 +08:00
Cheng Pan
b8758a7cb6
[CELEBORN-181][TEST] Rename RssFunSuite to CelebornFunSuite (#1125) 2022-12-29 18:10:14 +08:00
RexAn
6432a129be
[CELEBORN-61][CELEBORN-62][FOLLOW_UP] Fix some issues for slow start (#1119) 2022-12-29 12:07:20 +08:00
Binjie Yang
63943cd5cc
[CELEBORN-147][IT]Extraction of common integration test cases (#1092) 2022-12-29 12:03:09 +08:00
Keyong Zhou
2f0682265e
[CELEBORN-119] Add timeout for pushdata (#1097) 2022-12-20 20:40:42 +08:00
Keyong Zhou
a2dd72f20c
[CELEBORN-155] Wrong TimeUnit for registerShuffleRetryWait in Shuffle… (#1099) 2022-12-19 17:32:18 +08:00
Shuang
13769f0f0a
[CELEBORN-121] Refactor batchHandleCommitPartition (#1089) 2022-12-19 12:39:39 +08:00
Ethan Feng
39394526a8
[CELEBORN-142]Keep committed partition locations semantic consistent when commit files on HDFS. (#1091) 2022-12-16 19:02:02 +08:00
nafiy
ddab27a1d7
[CELEBORN-145][REFACTOR] Add reason in CheckQuotaResponse (#1093)
2022-12-15 18:16:34 +08:00
Ethan Feng
65cb36c002
[CELEBORN-83][FOLLOWUP] Fix various bugs when using HDFS as storage. (#1065) 2022-12-15 15:20:29 +08:00
Shuang
e3576e4e7a
[CELEBORN-117] refactor CommitManager, implements M/R Partition Commi… (#1060) 2022-12-15 11:09:59 +08:00
Cheng Pan
ec371c0026
[CELEBORN-132] ShuffleClient should not implement Cloneable (#1077) 2022-12-14 10:04:39 +08:00
Angerszhuuuu
c924a4ff0d
[CELEBORN-61][CELEBORN-62][FEATURE] Shuffle client support slow start, congestion avoidance and congestion control (#1052) 2022-12-08 12:41:34 +08:00
zhongqiangczq
60f6f87832
[CELEBORN-11] ShuffleClient supports MapPartition shuffle write:pushdata (#1036) 2022-12-08 12:31:47 +08:00