127 Commits

Author SHA1 Message Date
jiaoqingbo
84795bc63b
[CELEBORN-382] Call checkDiskFullAndSplit in the handlePushData method to avoid repeated definitions (#1313) 2023-03-07 18:55:46 +08:00
Ethan Feng
675a7da393
[CELEBORN-368][FLINK] Pass exceptions in buffer stream. (#1304) 2023-03-03 15:43:30 +08:00
Keyong Zhou
dcedf7b0a9
[CELEBORN-348] Support fetchTime in load-aware slots assignment strategy (#1287) 2023-03-02 18:31:50 +08:00
Angerszhuuuu
eda21ead24
[CELEBORN-344] Change PUSH_DATA_FAIL_MASTER/SALVE to PUSH_DATA_WRITE_FAIL_MASTER/SALVE (#1281) 2023-02-28 11:29:40 +08:00
Keyong Zhou
7adf1fca41
[CELEBORN-295] Optimize data push (#1232)
* [CELEBORN-295] Add double buffer for sort pusher
2023-02-28 10:35:55 +08:00
Angerszhuuuu
24f5478adc
[CELEBORN-338] Clean duplicated exception message of handling push data (#1274) 2023-02-28 10:35:18 +08:00
Rex(Hui) An
798ff90bb7
[CELEBORN-342] Fix the wrong avg produce bytes in Congestion control (#1279) 2023-02-27 16:29:37 +08:00
Keyong Zhou
3c8c58e09d
[CELEBORN-301] Refactor PartitionLocationInfo to use ConcurrentHashMap (#1278) 2023-02-26 16:46:30 +08:00
Angerszhuuuu
a7587c3fe7
[CELEBORN-337] Remove unnecessary StatusCode.message (#1272)
2023-02-24 15:11:07 +08:00
Shuang
9754616d79
[CELEBORN-330] fix deadlock when use the same netty channel to receive data while other thread wait the response (#1265) 2023-02-23 17:57:43 +08:00
Angerszhuuuu
fc8540a2e6
[CELEBORN-325] After worker restart, throw NPE when receive not found partition (#1259)
2023-02-22 15:19:34 +08:00
Ethan Feng
0df08fbdf3
[CELEBORN-320][FLINK] fix handle wrong message type in FetchHandler. (#1254) 2023-02-21 11:51:01 +08:00
Ethan Feng
26a3bb5e72
[CELEBORN-308] Fix flusher will exit unexpectedly if flush task write failed. (#1249) 2023-02-20 21:45:37 +08:00
Ethan Feng
0c8bb83114
[CELEBORN-234] Implement buffer stream. (#1221) 2023-02-17 17:38:36 +08:00
zhongqiangchen
5236df68af
[CELEBORN-292] optimize mappartitionfilewriter flushing index and reading data header (#1225) 2023-02-17 13:42:28 +08:00
zhongqiangchen
79096d60d0
[CELEBORN-293] WorkerSource registers timer for mappartition message metrics (#1226) 2023-02-17 11:29:54 +08:00
Ethan Feng
1dcfdb0c8f
[CELEBORN-281] Add metrics about buffer stream read buffer. (#1216) 2023-02-17 11:20:07 +08:00
Angerszhuuuu
57f775a7e9
[CELEBORN-273] Move push data timeout checker into TransportResponseHandler to keep callback status consistence (#1208) 2023-02-16 18:27:37 +08:00
Ethan Feng
534853bf8a
[CELEBORN-278] Add openStreamWithCredit RPC. (#1214) 2023-02-16 14:07:13 +08:00
zhongqiangchen
2c508dae0f
[CELEBORN-307] fix ArrayComparisonFailure while running lz4 ut (#1241) 2023-02-16 13:41:17 +08:00
Rex(Hui) An
2068e6ae37
[CELEBORN-279] Add user level push data speed metric (#1213) 2023-02-13 12:04:44 +08:00
Rex(Hui) An
adb6592d31
[CELEBORN-277] PushDataHandle callback could miss soft split status (#1212) 2023-02-09 14:57:18 +08:00
Rex(Hui) An
f88f5fcf55
[CELEBORN-207][FOLLOW_UP] Master could miss the congestion status if enable push.data.replicate 2023-02-07 22:57:39 +08:00
Rex(Hui) An
cfe81969c9
[CELEBORN-275] WrappedCallback should only handle response from replica (#1209) 2023-02-07 18:18:13 +08:00
Rex(Hui) An
bb113ec9be
[CELEBORN-207] Support network congestion control (#1066) 2023-02-07 12:06:18 +08:00
Angerszhuuuu
c4020100db
[CELEBORN-271][BUG] PushState in PushDataHandler should use peer's location 2023-02-06 11:31:57 +08:00
Angerszhuuuu
ecc3a0e52f
[CELEBORN-272][BUG] Don't do replication should directly use callback not wrappedCallback (#1205) 2023-02-06 11:28:12 +08:00
zhongqiangchen
8e903840af
[CELEBORN-243][REWORK]fix bug that os's disk usage is low but celeborn thinks that it's high_disk_usage (#1202) 2023-02-04 14:27:44 +08:00
Angerszhuuuu
2e68912812
[CELEBORN-269][BUG] Disable replication throw NPE when removeBatch in pushDataHandler (#1203) 2023-02-03 20:06:59 +08:00
Shuang
2634476758
[CELEBORN-267] reuse stream when client channel reconnected (#1200) 2023-02-03 15:12:45 +08:00
Angerszhuuuu
4b6f7e4593
[CELEBORN-239][IMPROVEMENT] Worker replicate should enable push data timeout too (#1185) 2023-02-03 11:53:15 +08:00
zhongqiangczq
ff17a61ec5
[CELEBORN-243] fix bug that os's disk usage is low but celeborn thinks that it's high_disk_usage (#1184) 2023-02-02 10:41:11 +08:00
Shuang
7162be2fae
[CELEBORN-201] Separate partitionLocationInfo in LifecycleManager and worker (#1149) 2023-01-31 18:53:36 +08:00
Angerszhuuuu
1311fb53d1
[CELEBORN-243][CELEBORN-245][IMPROVEMENT] Create push client failed and connection failed cause push failed should have their own ERROR type (#1181)
* [CELEBORN-243][IMPROVEMENT] Create push client failed should have a ERROR type
2023-01-30 17:47:22 +08:00
Angerszhuuuu
8611a64400
[CELEBORN-237][IMPROVEMENT] push failed error message should show partition info (#1178)
2023-01-28 18:41:54 +08:00
Ethan Feng
a239f9f284
[CELEBORN-228]Refactor PartitionFileSorter to avoid specific JDK dependency. (#1168) 2023-01-16 20:06:47 +08:00
zy.jordan
bb96700415
[CELEBORN-223] The default rpc thread num of pushServer/replicateServer/fetchServer should be the number of total of Flusher's thread (#1163) 2023-01-16 12:03:46 +08:00
zhongqiangczq
3661222d98
[CELEBORN-195] add implementation to MapPartitionFileWriter (#1141) 2023-01-13 16:41:11 +08:00
zy.jordan
19197b9190
[CELEBORN-214] Push/Replicate/Fetch io threads default value is 16 (#1158) 2023-01-10 17:46:56 +08:00
nafiy
9635725480
[CELEBORN-204][IMPROVEMENT]Collect disk usage metrics in byte unit by default (#1153) 2023-01-09 17:36:18 +08:00
Ethan Feng
5595f2f4b3
[CELEBORN-124]Add buffer stream. (#1069) 2023-01-06 15:54:52 +08:00
Shuang
3b2be25a50
[CELEBORN-173] refactor minicluster and fix ut (#1147) 2023-01-05 20:39:19 +08:00
Angerszhuuuu
5edb21d210
[CELEBORN-168][FOLLOWUP] Device metrics should use long value and add size unit in metric name (#1143)
2023-01-05 11:45:19 +08:00
nafiy
3e80cf2b87
[CELEBORN-168][FEATURE] Add disk usage related metrics for Worker (#1127) 2023-01-05 10:35:51 +08:00
Angerszhuuuu
425e31797c
[CELEBORN-182][BUG] StorageManager should not delete shuffle file when enable graceful shutdown (#1126) 2022-12-30 18:13:36 +08:00
Angerszhuuuu
7d7192af14
[CELEBORN-179][BUG] Repeat remove expired shuffle throw NPE (#1124) 2022-12-29 15:47:05 +08:00
Angerszhuuuu
6411fe71b1
[CELEBORN-178][BUG] Default registered flag should be false, not null (#1123) 2022-12-29 15:24:09 +08:00
nafiy
77cb7a0477
[CELEBORN-169][REFACTOR] Extract ObservedDevice out from LocalDeviceMonitor (#1113)
2022-12-28 14:29:00 +08:00
Ethan Feng
5aa959a335
[CELEBORN-157] Change prefix of configurations to celeborn. (#1104) 2022-12-21 15:17:28 +08:00
nafiy
f13dfb7421
[CELEBORN-113][FEATURE] Add metrics to monitor non-critical error number on local device (#1100) 2022-12-20 22:30:55 +08:00