celeborn

Author	SHA1	Message	Date
sychen	4cb4701ede	[CELEBORN-689] Fix the incorrect part of PushDataHandler message type converted to status code ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1601 from cxzl25/CELEBORN-689. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-19 11:25:08 +08:00
Angerszhuuuu	0aa13832b5	[CELEBORN-676] Celeborn fetch chunk also should support check timeout ### What changes were proposed in this pull request? Celeborn fetch chunk also should support check timeout #### Test case ``` executor instance 20 SQL: SELECT count(1) from (select /+ REPARTITION(100) / * from spark_auxiliary.t50g) tmp; --conf spark.celeborn.client.spark.shuffle.writer=sort \ --conf spark.celeborn.client.fetch.excludeWorkerOnFailure.enabled=true \ --conf spark.celeborn.client.push.timeout=10s \ --conf spark.celeborn.client.push.replicate.enabled=true \ --conf spark.celeborn.client.push.revive.maxRetries=10 \ --conf spark.celeborn.client.reserveSlots.maxRetries=10 \ --conf spark.celeborn.client.registerShuffle.maxRetries=3 \ --conf spark.celeborn.client.push.blacklist.enabled=true \ --conf spark.celeborn.client.blacklistSlave.enabled=true \ --conf spark.celeborn.client.fetch.timeout=30s \ --conf spark.celeborn.client.push.data.timeout=30s \ --conf spark.celeborn.client.push.limit.inFlight.timeout=600s \ --conf spark.celeborn.client.push.maxReqsInFlight=32 \ --conf spark.celeborn.client.shuffle.compression.codec=ZSTD \ --conf spark.celeborn.rpc.askTimeout=30s \ --conf spark.celeborn.client.rpc.reserveSlots.askTimeout=30s \ --conf spark.celeborn.client.shuffle.batchHandleChangePartition.enabled=true \ --conf spark.celeborn.client.shuffle.batchHandleCommitPartition.enabled=true \ --conf spark.celeborn.client.shuffle.batchHandleReleasePartition.enabled=true ``` Test with 3 worker and add a `Thread.sleep(100s)` before worker handle `ChunkFetchRequest` Before patch <img width="1783" alt="截屏2023-06-14 上午11 20 55" src="https://github.com/apache/incubator-celeborn/assets/46485123/182dff7d-a057-4077-8368-d1552104d206"> After patch <img width="1792" alt="image" src="https://github.com/apache/incubator-celeborn/assets/46485123/3c8b7933-8ace-426d-8e9f-04e0aabfac8e"> The log shows the fetch timeout checker workers ``` 23/06/14 11:14:54 ERROR WorkerPartitionReader: Fetch 
chunk 0 failed. org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147) at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 23/06/14 11:14:54 WARN RssInputStream: Fetch chunk failed 1/6 times for location PartitionLocation[ id-epoch:35-0 host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.203-9092-9094-9093-9095 mode:MASTER peer:(host-rpcPort-pushPort-fetchPort-replicatePort:10.169.48.202-9092-9094-9093-9095) storage hint:StorageInfo{type=HDD, mountPoint='/mnt/ssd/0', finalResult=true, filePath=} mapIdBitMap:null], change to peer org.apache.celeborn.common.exception.CelebornIOException: Fetch chunk 0 failed. 
at org.apache.celeborn.client.read.WorkerPartitionReader$1.onFailure(WorkerPartitionReader.java:98) at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:146) at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$1(TransportResponseHandler.java:103) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.celeborn.common.exception.CelebornIOException: FETCH_DATA_TIMEOUT at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredFetchRequest(TransportResponseHandler.java:147) ... 8 more 23/06/14 11:14:54 INFO SortBasedShuffleWriter: Memory used 72.0 MB ``` ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1587 from AngersZhuuuu/CELEBORN-676. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-15 13:54:09 +08:00
Shuang	da85347330	[CELEBORN-675] Fix decode heartbeat message ### What changes were proposed in this pull request? Give the Heartbeat message a one-byte body and skip this byte when decoding. ### Why are the changes needed? A Heartbeat message may be split across two Netty buffers. The resulting empty buffer (which carries no data but must be kept) is then wrongly removed, and decodeNext throws an NPE. See ``` java while (headerBuf.readableBytes() < HEADER_SIZE) { ByteBuf next = buffers.getFirst(); int toRead = Math.min(next.readableBytes(), HEADER_SIZE - headerBuf.readableBytes()); headerBuf.writeBytes(next, toRead); if (!next.isReadable()) { buffers.removeFirst().release(); } } ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT & MANUAL Closes #1589 from RexXiong/CELEBORN-675. Authored-by: Shuang <lvshuang.tb@gmail.com> Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>	2023-06-14 14:37:13 +08:00
Angerszhuuuu	f2357bf75c	[CELEBORN-671] Add hasPeer method to PartitionLocation ### What changes were proposed in this pull request? Add hasPeer method to PartitionLocation ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1583 from AngersZhuuuu/CELEBORN-671. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-14 10:29:16 +08:00
zky.zhoukeyong	76831e805d	[CELEBORN-668] Report WorkerLost instead of WorkerUnavailable if graceful is disabled ### What changes were proposed in this pull request? The worker should report WorkerLost instead of WorkerUnavailable in its shutdown hook if graceful shutdown is disabled. ### Why are the changes needed? To avoid unnecessary commit file requests from the lifecycle manager, since this is not a graceful shutdown. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Closes #1580 from waitinfuture/668. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-13 11:30:59 +08:00
zky.zhoukeyong	dab865b68b	[CELEBORN-662] Report worker unavailable regardless graceful shutdown ### What changes were proposed in this pull request? In this PR, the worker always reports the node as unavailable, regardless of whether graceful shutdown is turned on or off. ### Why are the changes needed? To inform the master about the shutting-down worker as soon as possible. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #1575 from waitinfuture/662. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-10 18:36:25 +08:00
Cheng Pan	76533d7324	[CELEBORN-650][TEST] Upgrade scalatest and unify mockito version ### What changes were proposed in this pull request? This PR upgrades - `mockito` from 1.10.19 and 3.6.0 to 4.11.0 - `scalatest` from 3.2.3 to 3.2.16 - `mockito-scalatest` from 1.16.37 to 1.17.14 ### Why are the changes needed? Housekeeping, making test dependencies up-to-date and unified. ### Does this PR introduce _any_ user-facing change? No, it only affects test. ### How was this patch tested? Pass GA. Closes #1562 from pan3793/CELEBORN-650. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-09 10:04:14 +08:00
onebox-li	0c869ac9a0	[CELEBORN-642] Improve metrics and update grafana ### What changes were proposed in this pull request? Change in grafana （ALL） add: JVMCPUTime LastMinuteSystemLoad AvailableProcessors （For Master） add: LostWorkers IsActiveMaster PartitionSize （For Worker） add: PushDataFailCount -> WriteDataFailCount ReplicateDataFailCount ReplicateDataWriteFailCount ReplicateDataCreateConnectionFailCount ReplicateDataConnectionExceptionCount ReplicateDataTimeoutCount SortedFileSize PushDataHandshakeFailCount RegionStartFailCount RegionFinishFailCount MasterPushDataHandshakeTime SlavePushDataHandshakeTime MasterRegionStartTime SlaveRegionStartTime MasterRegionFinishTime SlaveRegionFinishTime PotentialConsumeSpeed UserProduceSpeed WorkerConsumeSpeed DeviceOSFreeBytes DeviceCelebornFreeBytes push usedHeapMemory/usedDirectMemory fetch usedHeapMemory/usedDirectMemory replicate usedHeapMemory/usedDirectMemory remove: dup ReserveSlotsTime Change dashboard layout. Fix support for multiple labels. Modify some metrics docs. ### Why are the changes needed? For better use of metrics. ### Does this PR introduce _any_ user-facing change? Below metrics change name, extract some value to the label. DeviceOSFreeCapacity(B) -> DeviceOSFreeBytes DeviceOSTotalCapacity(B) -> DeviceOSTotalBytes DeviceCelebornFreeCapacity(B) -> DeviceCelebornFreeBytes DeviceCelebornTotalCapacity(B) -> DeviceCelebornTotalBytes push usedHeapMemory/usedDirectMemory fetch usedHeapMemory/usedDirectMemory replicate usedHeapMemory/usedDirectMemory ### How was this patch tested? Cluster test. Closes #1557 from onebox-li/improve-metrics. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-08 18:10:06 +08:00
zhongqiang.czq	586785c88d	[CELEBORN-617][FLINK] MapPartitionFileWriter updates flushing file length ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1519 from zhongqiangczq/mapfilelength. Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-08 10:47:36 +08:00
Angerszhuuuu	d4cb6dd8ab	[CELEBORN-645][REFACTOR] Refine logic about handle HeartbeatFromWorkerResponse ### What changes were proposed in this pull request? Refine the logic here to make it easier understand. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1555 from AngersZhuuuu/CELEBORN-645. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-07 16:34:44 +08:00
Cheng Pan	3c7d179e05	[CELEBORN-636] Replace SimpleDateFormat with FastDateFormat ### What changes were proposed in this pull request? `SimpleDateFormat` is not thread-safe, replace it with a thread-safe `FastDateFormat` ### Why are the changes needed? `FastDateFormat` is a fast and thread-safe version of `java.text.SimpleDateFormat`. ### Does this PR introduce _any_ user-facing change? Yes, it's a bug fix. ### How was this patch tested? Manually review. Closes #1545 from pan3793/CELEBORN-636. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Ethan Feng <ethanfeng@apache.org>	2023-06-06 12:59:32 +08:00
Shuang	a1de16a80f	[CELEBORN-626] Fix potential deadlock in filewriter ### What changes were proposed in this pull request? Lock the flushBuffer field and the flush method to ensure thread-safe access. ### Why are the changes needed? When a stage ends, the worker commits files and closes the file writers, but a speculative task may still push data to a file writer. If the push task increments numPendingWrites, the commit thread, which holds the file writer's object lock, must wait for the pending writes to drop to 0. The push thread, however, needs the same object lock to decrement numPendingWrites, which causes a deadlock. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #1534 from RexXiong/CELEBORN-626. Authored-by: Shuang <lvshuang.tb@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-02 17:47:39 +08:00
Angerszhuuuu	e18a5ea769	[CELEBORN-624] StorageManager should only remove expired app dirs (#1531 )	2023-06-02 11:33:33 +08:00
Angerszhuuuu	cf308aa057	[CELEBORN-595] Refine code frame of CelebornConf (#1525 )	2023-06-01 10:37:58 +08:00
Angerszhuuuu	62681ba85d	[CELEBORN-595] Rename and refactor the configuration doc. (#1501 )	2023-05-30 15:14:12 +08:00
zhongqiangchen	f117cff776	[CELEBORN-618] [FLINK] worker side adds partition split configuration options (#1520 )	2023-05-30 14:13:31 +08:00
Angerszhuuuu	c4bff654b0	[CELEBORN-614] Simplify StorageManager's flushFileWriters to avoid too much cost on collection operation (#1517 )	2023-05-30 11:38:05 +08:00
Angerszhuuuu	6619015a63	[CELEBORN-596] Worker don't need to update disk max slots (#1502 )	2023-05-23 10:30:35 +08:00
Ethan Feng	7015d2463a	[CELEBORN-583] Merge pooled memory allocators. (#1490 )	2023-05-18 10:37:30 +08:00
Leo Li	65cdb3eba4	[CELEBORN-585] Create if not exists worker recoverPath when graceful shutdown is enabled (#1487 )	2023-05-17 11:29:09 +08:00
Angerszhuuuu	64a3534f71	[CELEBORN-584] Worker side should expose push/replicate/fetch Netty allocator metrics (#1489 )	2023-05-16 17:51:33 +08:00
Angerszhuuuu	d657f8268a	[CELEBORN-586] Add SystemMiscSource to indicate system running status (#1488 )	2023-05-16 14:03:07 +08:00
zhongqiangchen	5769c3fdc7	[CELEBORN-552] Add HeartBeat between the client and worker to keep alive (#1457 )	2023-05-10 19:35:51 +08:00
Shuang	fb753fd48e	[CELEBORN-573] Guarantee resource/app/worker change persistent to raft in Ha Mode. (#1477 )	2023-05-10 14:28:52 +08:00
Angerszhuuuu	5f7e1ce8e2	[CELEBORN-578][REFACTOR] Refine commit file's log to indicate more clear about empty partitions (#1481 )	2023-05-08 18:21:46 +08:00
Angerszhuuuu	7a4f2ebd8a	[CELEBORN-547] Refactor request related API (#1452 )	2023-04-27 16:25:41 +08:00
Angerszhuuuu	be84e8ba0d	[CELEBORN-562][REFACTOR] Rename Destroy and DestroyResponse to make it more clear (#1467 )	2023-04-27 12:31:32 +08:00
Shuang	0b2e4877bd	[CELEBORN-553] Improve IO (#1458 )	2023-04-25 21:14:06 +08:00
Ethan Feng	01d8d1079c	[CELEBORN-550][FLINK] Fix bufferQueue release and poll concurrent problem. (#1455 )	2023-04-25 15:06:06 +08:00
Shuang	d68deecaaa	[CELEBORN-546][FLINK] Use autoIncrement partitionId replace encode(mapId, attemptId) for generating partitionId (#1447 )	2023-04-22 16:33:22 +08:00
Angerszhuuuu	0c2d3e647d	[CELEBORN-532][METRICS] Refine push-related failure metrics (#1442 )	2023-04-21 17:05:43 +08:00
Angerszhuuuu	181c1bfcd6	[CELEBORN-524][PERF] CongestionControl call too much ChannelsLimiter onTrim cause CPU stuck or occupy too much CPU cause no cpu for handlePushData (#1428 )	2023-04-21 15:44:56 +08:00
Ethan Feng	6378a386d0	[CELEBORN-530][REFACTOR] Move stream manager and memory manager to worker module. (#1439 )	2023-04-20 10:17:26 +08:00
Angerszhuuuu	932ccd0841	[CELEBORN-523][REFACTOR] Remove unnecessary code in WorkerPartitionLocationInfo (#1427 )	2023-04-15 22:36:48 +08:00
Shuang	a22c6ca749	[CELEBORN-521] correct exception and unify unRetryableException (#1425 )	2023-04-15 22:27:28 +08:00
cxzl25	13f772e0c0	[CELEBORN-525] Fix wrong parameter celeborn.push.buffer.size	2023-04-14 20:45:25 +08:00
Rex(Hui) An	0b402b5903	[CELEBORN-522] Add worker consume speed metric	2023-04-14 13:38:49 +08:00
Angerszhuuuu	3a21362265	[CELEBORN-511][IMPROVE] Move onTrim tag to StorageManager to avoid frequent trim action (#1415 )	2023-04-14 10:35:51 +08:00
Ethan Feng	9cccfc9872	[CELEBORN-431][FLINK] Support dynamic buffer allocation in reading map partition. (#1407 )	2023-04-13 10:37:47 +08:00
Angerszhuuuu	da98ed9bea	[CELEBORN-516][PERF] Remove RPCSource since it cost too much CPU (#1420 )	2023-04-12 18:47:06 +08:00
Angerszhuuuu	f574a4dafa	[CELEBORN-512][IMPROVEMENT] Sort timestamp and show in date format (#1416 )	2023-04-11 19:56:48 +08:00
Angerszhuuuu	cad2836e85	[CELEBORN-505] Fix typo of SHUFFLE_CHUCK_SIZE (#1411 )	2023-04-04 19:15:30 +08:00
Fei Wang	b40c573069	[CELEBORN-474][FOLLOWUP] Using inner static ConcurrentHashMap class and only apply for JDK8 (#1384 )	2023-03-27 16:16:23 +08:00
Fei Wang	7c444cb0c5	[CELEBORN-474] Speed up ConcurrentHashMap#computeIfAbsent (#1383 )	2023-03-26 09:41:59 +08:00
Fei Wang	c609c0ebaa	[MINOR] Fix typo and remove unused code (#1381 ) * fix typo * remove unused	2023-03-25 23:20:33 +08:00
Shuang	89b3f3887d	[CELEBORN-356] [FLINK] Support release single partition resource (#1314 )	2023-03-24 17:15:28 +08:00
Keyong Zhou	3d6fba553b	[CELEBORN-454] Code refine for worker (#1371 )	2023-03-22 10:39:14 +08:00
Keyong Zhou	13c610838b	[CELEBORN-455] Use 4 bytes instead of 16 to read mapId in FileWriter.… (#1369 )	2023-03-21 19:21:50 +08:00
Angerszhuuuu	56d796638f	[CELEBORN-438] Move ServletPath to MetricsSytsem (#1364 )	2023-03-20 18:22:40 +08:00
乐活优格	0b78c6d325	[CELEBORN-442]Support hdfs compatible file system (#1360 )	2023-03-18 11:47:46 +08:00
Angerszhuuuu	e61130d397	[CELEBORN-423][FOLLOWUP] Format http request (#1353 )	2023-03-15 16:30:23 +08:00
Angerszhuuuu	1f56a5e5d1	[CELEBORN-423] Format http request result (#1349 )	2023-03-15 10:32:01 +08:00
Angerszhuuuu	3907d70212	[CELEBORN-421] Add shutdown and registered to http request (#1346 )	2023-03-14 18:23:21 +08:00
Angerszhuuuu	7d7279a9bc	[CELEBORN-420] Add unavailablePeers to http request (#1345 )	2023-03-14 17:23:45 +08:00
Angerszhuuuu	3600ccc4e3	[CELEBORN-409] Add PartitionLocationInfo to worker's http request (#1335 )	2023-03-13 17:02:28 +08:00
Angerszhuuuu	6f1ab70403	[CELEBORN-406] Add blacklist to http request to indicate blacklisted worker (#1334 )	2023-03-13 16:44:46 +08:00
Angerszhuuuu	144a8cdb3f	[CELEBORN-408] Add lost worker infos to http request (#1333 )	2023-03-13 15:27:41 +08:00
Ethan Feng	bb8401e401	[CELEBORN-403][FLINK] Add metrics about buffer dispatcher request queue length. (#1329 )	2023-03-13 11:15:00 +08:00
Angerszhuuuu	a336f12cc8	[CELEBORN-400] Add RPC metrics for OpenStream (#1326 )	2023-03-10 21:22:05 +08:00
Angerszhuuuu	4b334df7a6	[CELEBORN-399] Make fileSorterExecutors thread num can be customized (#1325 )	2023-03-10 21:10:43 +08:00
jiaoqingbo	84795bc63b	[CELEBORN-382] Call checkDiskFullAndSplit in the handlePushData method to avoid repeated definitions (#1313 )	2023-03-07 18:55:46 +08:00
Ethan Feng	675a7da393	[CELEBORN-368][FLINK] Pass exceptions in buffer stream. (#1304 )	2023-03-03 15:43:30 +08:00
Keyong Zhou	dcedf7b0a9	[CELEBORN-348] Support fetchTime in load-aware slots assignment strategy (#1287 )	2023-03-02 18:31:50 +08:00
Angerszhuuuu	eda21ead24	[CELEBORN-344] Change PUSH_DATA_FAIL_MASTER/SALVE to PUSH_DATA_WRITE_FAIL_MASTER/SALVE (#1281 )	2023-02-28 11:29:40 +08:00
Keyong Zhou	7adf1fca41	[CELEBORN-295] Optimize data push (#1232 ) * [CELEBORN-295] Add double buffer for sort pusher	2023-02-28 10:35:55 +08:00
Angerszhuuuu	24f5478adc	[CELEBORN-338] Clean duplicated exception message of handling push data (#1274 )	2023-02-28 10:35:18 +08:00
Rex(Hui) An	798ff90bb7	[CELEBORN-342] Fix the wrong avg produce bytes in Congestion control (#1279 )	2023-02-27 16:29:37 +08:00
Keyong Zhou	3c8c58e09d	[CELEBORN-301] Refactor PartitionLocationInfo to use ConcurrentHashMap (#1278 )	2023-02-26 16:46:30 +08:00
Angerszhuuuu	a7587c3fe7	[CELEBORN-337] Remove unnecessary StatusCode.message (#1272 )	2023-02-24 15:11:07 +08:00
Shuang	9754616d79	[CELEBORN-330] fix deadlock when use the same netty channel to receive data while other thread wait the response (#1265 )	2023-02-23 17:57:43 +08:00
Angerszhuuuu	fc8540a2e6	[CELEBORN-325] After worker restart, throw NPE when receive not found partition (#1259 )	2023-02-22 15:19:34 +08:00
Ethan Feng	0df08fbdf3	[CELEBORN-320][FLINK] fix handle wrong message type in FetchHandler. (#1254 )	2023-02-21 11:51:01 +08:00
Ethan Feng	26a3bb5e72	[CELEBORN-308] Fix flusher will exit unexpectedly if flush task write failed. (#1249 )	2023-02-20 21:45:37 +08:00
Ethan Feng	0c8bb83114	[CELEBORN-234] Implement buffer stream. (#1221 )	2023-02-17 17:38:36 +08:00
zhongqiangchen	5236df68af	[CELEBORN-292] optimize mappartitionfilewriter flushing index and reading data header (#1225 )	2023-02-17 13:42:28 +08:00
zhongqiangchen	79096d60d0	[CELEBORN-293] WorkerSource registers timer for mappartition message metrics (#1226 )	2023-02-17 11:29:54 +08:00
Ethan Feng	1dcfdb0c8f	[CELEBORN-281] Add metrics about buffer stream read buffer. (#1216 )	2023-02-17 11:20:07 +08:00
Angerszhuuuu	57f775a7e9	[CELEBORN-273] Move push data timeout checker into TransportResponseHandler to keep callback status consistence (#1208 )	2023-02-16 18:27:37 +08:00
Ethan Feng	534853bf8a	[CELEBORN-278] Add openStreamWithCredit RPC. (#1214 )	2023-02-16 14:07:13 +08:00
zhongqiangchen	2c508dae0f	[CELEBORN-307] fix ArrayComparisonFailure while running lz4 ut (#1241 )	2023-02-16 13:41:17 +08:00
Rex(Hui) An	2068e6ae37	[CELEBORN-279] Add user level push data speed metric (#1213 )	2023-02-13 12:04:44 +08:00
Rex(Hui) An	adb6592d31	[CELEBORN-277] PushDataHandle callback could miss soft split status (#1212 )	2023-02-09 14:57:18 +08:00
Rex(Hui) An	f88f5fcf55	[CELEBORN-207][FOLLOW_UP] Master could miss the congestion status if enable push.data.replicate	2023-02-07 22:57:39 +08:00
Rex(Hui) An	cfe81969c9	[CELEBORN-275] WrappedCallback should only handle response from replica (#1209 )	2023-02-07 18:18:13 +08:00
Rex(Hui) An	bb113ec9be	[CELEBORN-207] Support network congestion control (#1066 )	2023-02-07 12:06:18 +08:00
Angerszhuuuu	c4020100db	[CELEBORN-271][BUG] PushState in PushDataHandler should use peer's location	2023-02-06 11:31:57 +08:00
Angerszhuuuu	ecc3a0e52f	[CELEBORN-272][BUG] Don't do replication should directly use callback not wrappedCallback (#1205 )	2023-02-06 11:28:12 +08:00
zhongqiangchen	8e903840af	[CELEBORN-243][REWORK]fix bug that os's disk usage is low but celeborn thinks that it's high_disk_usage (#1202 )	2023-02-04 14:27:44 +08:00
Angerszhuuuu	2e68912812	[CELEBORN-269][BUG] Disable replication throw NPE when removeBatch in pushDataHandler (#1203 )	2023-02-03 20:06:59 +08:00
Shuang	2634476758	[CELEBORN-267] reuse stream when client channel reconnected (#1200 )	2023-02-03 15:12:45 +08:00
Angerszhuuuu	4b6f7e4593	[CELEBORN-239][IMPROVEMENT] Worker replicate should enable push data timeout too (#1185 )	2023-02-03 11:53:15 +08:00
zhongqiangczq	ff17a61ec5	[CELEBORN-243] fix bug that os's disk usage is low but celeborn thinks that it's high_disk_usage (#1184 )	2023-02-02 10:41:11 +08:00
Shuang	7162be2fae	[CELEBORN-201] Separate partitionLocationInfo in LifecycleManager and worker (#1149 )	2023-01-31 18:53:36 +08:00
Angerszhuuuu	1311fb53d1	[CELEBORN-243][CELEBORN-245][IMPROVEMENT] Create push client failed and connection failed cause push failed should have their own ERROR type (#1181 ) * [CELEBORN-243][IMPROVEMENT] Create push client failed should have a ERROR type	2023-01-30 17:47:22 +08:00
Angerszhuuuu	8611a64400	[CELEBORN-237][IMPROVEMENT] push failed error message should show partition info (#1178 )	2023-01-28 18:41:54 +08:00
Ethan Feng	a239f9f284	[CELEBORN-228]Refactor PartitionFileSorter to avoid specific JDK dependency. (#1168 )	2023-01-16 20:06:47 +08:00
zy.jordan	bb96700415	[CELEBORN-223] The default rpc thread num of pushServer/replicateServer/fetchServer should be the number of total of Flusher's thread (#1163 )	2023-01-16 12:03:46 +08:00
zhongqiangczq	3661222d98	[CELEBORN-195] add implementation to MapPartitionFileWriter (#1141 )	2023-01-13 16:41:11 +08:00
zy.jordan	19197b9190	[CELEBORN-214] Push/Replicate/Fetch io threads default value is 16 (#1158 )	2023-01-10 17:46:56 +08:00
nafiy	9635725480	[CELEBORN-204][IMPROVEMENT]Collect disk usage metrics in byte unit by default (#1153 )	2023-01-09 17:36:18 +08:00


238 Commits